

Differentially Private Federated Learning:
Server Trustworthiness, Estimation, and Statistical Inference

Zhe Zhang (Rutgers University; zzres0131@gmail.com), Ryumei Nakada (Rutgers University; rn375@stat.rutgers.edu), Linjun Zhang (Rutgers University; lz412@stat.rutgers.edu)
Abstract

Differentially private federated learning is crucial for maintaining privacy in distributed environments. This paper investigates the challenges of high-dimensional estimation and inference under the constraints of differential privacy. First, we study scenarios involving an untrusted central server, demonstrating the inherent difficulties of accurate estimation in high-dimensional problems. Our findings indicate that the tight minimax rate depends on the dimensionality of the data even under sparsity assumptions. Second, we consider a scenario with a trusted central server and introduce a novel federated estimation algorithm tailored for linear regression models. This algorithm effectively handles slight variations among the models distributed across different machines. We also propose methods for statistical inference, including coordinate-wise confidence intervals for individual parameters and strategies for simultaneous inference. Extensive simulation experiments support our theoretical advances, underscoring the efficacy and reliability of our approaches.

1 Introduction

1.1 Overview

Federated learning is an efficient approach for training machine learning models on distributed networks, such as smartphones and wearable devices, without moving data to a central server (konevcny2016federated, ; kairouz2021advances, ; li2020federated, ). Since its proposal in (mcmahan2017communication, ), federated learning has gained significant attention in both practical and theoretical machine learning communities. One of the key attractions of federated learning is its ability to provide a certain level of data privacy by keeping raw data on local machines. However, without specific design choices, there are no formal privacy guarantees. To fully exploit the benefits of federated learning, researchers have introduced the concept of differential privacy (abadi2016deep, ; dwork2018privacy, ; dwork2006calibrating, ; dwork2014algorithmic, ; dwork2017exposed, ) to quantify the exact privacy level in federated learning. A series of research papers have focused on federated learning with differential privacy, applying various algorithms and methods (hu2020personalized, ; truex2020ldp, ; wei2020federated, ). Despite these efforts, there remains a significant gap between practical usage and statistical guarantees, particularly in the high-dimensional setting with sparsity assumptions, where theoretical results for the optimal rate of convergence and statistical inference results are largely missing.

In this paper, we focus on studying the estimation and inference problems in the federated learning setting under differential privacy, particularly in the high-dimensional regime. In federated learning, there are several local machines containing data sets from different sources and a central server that coordinates all local machines to train learning models collaboratively. We present our key results in two major settings for privacy and federated learning. In the first setting, we consider an untrusted central server (lowy2021private, ; wei2020federated, ; hu2020personalized, ), where each machine sends only privatized information to the central server. A typical example is smartphone applications, where users may not fully trust the server and do not want their personal information to be uploaded directly to the remote central server. In the second setting, we consider a trusted central server, where each machine sends raw information without privatization (mcmahan2022federated, ; geyer2017differentially, ; mcmahan2017learning, ). For example, different hospitals may not share patient data with one another to protect patient privacy, but they can all report their data to a central server, such as a non-profit organization or an institute, to gain more information and publish statistics on certain diseases.

In the first part of our paper, we demonstrate that under the assumption that the central server is untrusted, the optimal rate of convergence for mean estimation is $O(sd/(mn\epsilon^{2}))$, where $m$ is the number of local machines, each containing $n$ data points, $d$ is the dimension of the parameter of interest, $s$ is the sparsity level, and $\epsilon$ is the privacy parameter. In high-dimensional settings, where the dimension is commonly assumed to be comparable to or even larger than the number of data points, this optimality result shows the incompatibility of the untrusted central server setting with high-dimensional statistics. As a result, in the high-dimensional regime, we can only hope to obtain accurate estimation under the trusted central server setting.

In the second part of the paper, we consider the case of a trusted central server and design algorithms that allow for accurate estimation, achieving a near-optimal rate of convergence up to logarithmic factors. We also present statistical inference results, including the construction of coordinate-wise confidence intervals with privacy guarantees and a procedure for conducting simultaneous inference privately. This facilitates hypothesis testing and the simultaneous construction of confidence intervals for a given subset of coordinates in high-dimensional settings. We emphasize that our algorithms for estimation and inference are suited for practical purposes, given their capacity to (1) leverage data from multiple devices to improve machine learning models and (2) draw accurate conclusions about a population from a sample while preserving individual privacy. For instance, in healthcare, we could combine patient data from multiple hospitals to develop more accurate models for disease diagnosis and treatment, while ensuring that patient privacy is protected. We summarize our major contributions as follows:

  • For the untrusted central server setting, we provably show that federated learning is not suited for high-dimensional mean estimation problems by establishing the optimal rate of convergence under the untrusted central server constraint. This motivates the trusted central server setting for applying federated learning to such problems.

  • For the trusted central server setting, we design novel algorithms to achieve private estimation with federated learning. We first consider estimation in the homogeneous federated learning setting and then extend it to the more complicated heterogeneous federated learning setting. We also provide sharp rates of convergence for our algorithms in both settings.

  • In addition, we consider statistical inference problems in both the homogeneous and heterogeneous federated learning settings. We provide algorithms for coordinate-wise and simultaneous confidence intervals, which are two common inference problems in high-dimensional statistics. It is worth mentioning that our proposed methods for high-dimensional differentially private inference are novel; such results have not been developed even in the single-source, non-federated setting. Theoretical results show that our proposed confidence intervals are asymptotically valid, and simulations support these findings.

1.2 Related Work

In the literature, several works have focused on designing private algorithms in federated or distributed learning based on variants of stochastic gradient descent. (agarwal2018cpsgd, ) proposed CP-SGD, a communication-efficient algorithm for learning models with local differential privacy (LDP). (erlingsson2020encode, ) proposed a distributed LDP gradient descent algorithm by applying LDP to gradients within the ESA framework (bittau2017prochlo, ). (girgis2021shuffled, ) extended work on the LDP approach for federated learning, proposing a distributed, communication-efficient LDP stochastic gradient descent algorithm through the shuffled model and analyzing an upper bound on the convergence rate. However, the trade-off between statistical accuracy and the privacy cost has not been considered in these works.

In distributed settings, the trade-off between statistical accuracy and information constraints has been discussed in various papers. Two common types of information constraints are communication constraints and privacy constraints. We refer to (zhang2013information, ; braverman2016communication, ; han2018geometric, ; barnes2020lower, ; garg2014communication, ) for more discussion of communication constraints, which consider situations where the number of bits exchanged during communication is limited.

A series of works discusses the trade-off between accuracy and privacy in high-dimensional, non-federated learning problems, including top-$k$ selection (steinke2017tight, ), sparse mean estimation (cai2019cost, ), linear regression (cai2019cost, ; talwar2015nearly, ), generalized linear models (cai2020cost, ), and latent variable models (zhang2021high, ). However, the discussion of privacy constraints in distributed settings is still largely lacking. Among the existing works, most focus on the local differential privacy (LDP) constraint. In (barnes2020fisher, ), mean estimation under $\ell_{2}$ loss for Gaussian and sparse Bernoulli distributions is discussed. (duchi2019lower, ) discussed lower bounds under LDP constraints in the blackboard communication model for mean estimation of product Bernoulli distributions and sparse Gaussian distributions. (acharya2020general, ) proposed a more general approach that combines both communication and privacy constraints. Compared with previous works, we focus on the problem where there are $n$ data points on each machine. Our interest lies in $(\epsilon,\delta)$-DP instead of LDP, which is a weaker constraint covering broader settings. We further note that, compared with the blackboard communication model (braverman2016communication, ; garg2014communication, ), in the federated learning setting we assume the existence of a central server and that each local machine is only allowed to communicate with the central server. This setting further enhances privacy.

While finalizing this paper (an initial draft of which was published as a Ph.D. dissertation in 2023 (zhethesis, )), we became aware of an independent and concurrent work (li2024federated, ), which also considers differentially private federated transfer learning under a high-dimensional sparse linear regression model. Namely, they propose a notion of federated differential privacy that allows multiple rounds of $(\epsilon,\delta)$-differentially private transmissions between local machines and the central server, and they provide algorithms to filter out irrelevant sources and exploit information from relevant sources to improve the estimation of target parameters. We differentiate our research from their paper as follows: (1) While they consider differentially private federated learning under the untrusted server setting, we deal with both trusted and untrusted server settings. We also highlight a fundamental difficulty of pure $\epsilon$-differentially private estimation under the untrusted central server setting in federated learning by establishing a tight minimax lower bound, and we resort to the trusted server setting for estimation and inference problems. (2) While their investigation centers on differentially private estimation within a federated transfer learning framework, specifically focusing on parameter estimation for a target distribution using similar source data, our work focuses on private estimation and inference for parameters that are either common across all participating machines or vary across different machines.

We also cite papers that inspired the design of our proposed algorithms and methods. (javanmard2014confidence, ) introduces a de-biasing procedure for statistical inference problems. (li2020transfer, ) considers the transfer learning problem in high-dimensional settings, which enables us to combine information from other sources to benefit estimation. Such an idea can be adopted in heterogeneous federated learning problems. For simultaneous inference, we refer to (zhang2017simultaneous, ; yu2022distributed, ), which discuss how to conduct simultaneous inference for high-dimensional problems.

Notation. We introduce several notations used throughout the paper. Let $\bm{v}=(v_{1},v_{2},\dots,v_{d})^{\top}\in\mathbb{R}^{d}$ be a vector. Given a set of indices $\mathcal{S}$, $\bm{v}_{\mathcal{S}}$ refers to the components of $\bm{v}$ corresponding to the indices in $\mathcal{S}$. The $\ell_{q}$ norm of $\bm{v}$, for $1\leq q\leq\infty$, is denoted by $\|\bm{v}\|_{q}$, whereas $\|\bm{v}\|_{0}$ represents the number of non-zero elements in $\bm{v}$, also called its sparsity level.

We use $m$ to denote the number of machines, $n$ the number of samples per machine, $d$ the dimensionality of vectors, and $s$ their sparsity level. The total number of samples across all machines is denoted by $n_{0}=mn$. Additionally, we define the truncation function $\Pi_{T}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}$, which projects a vector onto the $\ell_{\infty}$ ball of radius $T$ centered at the origin.

For a matrix $\Sigma$, the largest and smallest $s$-restricted eigenvalues of $\Sigma$ are defined as $\mu_{s}(\Sigma)=\max_{\|\bm{v}\|_{2}=1,\|\bm{v}\|_{0}\leq s}\bm{v}^{\top}\Sigma\bm{v}$ and $\nu_{s}(\Sigma)=\min_{\|\bm{v}\|_{2}=1,\|\bm{v}\|_{0}\leq s}\bm{v}^{\top}\Sigma\bm{v}$, respectively.

For sequences $a_{n}$ and $b_{n}$, $a_{n}=o(b_{n})$ means that $a_{n}/b_{n}\rightarrow 0$ as $n$ grows, $a_{n}=O(b_{n})$ means that $a_{n}$ is upper bounded by a constant multiple of $b_{n}$, and $a_{n}=\Omega(b_{n})$ means that $a_{n}$ is lower bounded by a constant multiple of $b_{n}$, where the constants are independent of $n$. The notation $a_{n}\asymp b_{n}$ means that $a_{n}$ is both upper and lower bounded by constant multiples of $b_{n}$.

In this work, we often use the symbols $c_{0},c_{1},m_{0},m_{1},C,C^{\prime},K,K^{\prime}$ to represent universal constants. Their specific values may vary depending on the context, but they are independent of other tunable parameters.

2 Preliminaries

2.1 Differential Privacy

We start from the basic concepts and properties of differential privacy (dwork2006calibrating, ). The intuition behind differential privacy is that a randomized algorithm produces similar outputs even when an individual's information in the dataset is changed or removed, thereby preserving the privacy of individual data. The formal definition of differential privacy is given below.

Definition 2.1 (Differential Privacy (dwork2006calibrating, ))

Let $\mathcal{X}$ be the sample space of an individual datum. A randomized algorithm $M:\mathcal{X}^{n}\rightarrow\mathbb{R}$ is $(\epsilon,\delta)$-differentially private if and only if for every pair of adjacent data sets $\bm{X},\bm{X}^{\prime}\in\mathcal{X}^{n}$ and for any $S\subseteq\mathbb{R}$, the inequality below holds:

$$\mathbb{P}\left(M(\bm{X})\in S\right)\leq e^{\epsilon}\cdot\mathbb{P}\left(M(\bm{X}^{\prime})\in S\right)+\delta,$$

where we say that two data sets $\bm{X}=\{\bm{x}_{i}\}_{i=1}^{n}$ and $\bm{X}^{\prime}=\{\bm{x}_{i}^{\prime}\}_{i=1}^{n}$ are adjacent if and only if they differ by one individual datum.

In the above definition, the two parameters $\epsilon$ and $\delta$ control the privacy level. With smaller $\epsilon$ and $\delta$, the output distributions given adjacent $\bm{X}$ and $\bm{X}^{\prime}$ become closer, making it harder for an adversary to distinguish whether the original dataset is $\bm{X}$ or $\bm{X}^{\prime}$; that is, the privacy constraint becomes more stringent. Furthermore, when $\delta=0$, we write $\epsilon$-differentially private as an abbreviation of $(\epsilon,0)$-differentially private.

In the rest of this section, we introduce several useful properties of differential privacy and describe how to construct a differentially private algorithm from a non-private counterpart. One common strategy is noise injection, where the scale of the noise is characterized by the sensitivity of the algorithm:

Definition 2.2

For any algorithm $f:\mathcal{X}^{n}\rightarrow\mathbb{R}^{d}$ and two adjacent data sets $\bm{X}$ and $\bm{X}^{\prime}$, the $\ell_{p}$-sensitivity of $f$ is defined as:

$$\Delta_{p}(f)=\sup_{\bm{X},\bm{X}^{\prime}\in\mathcal{X}^{n}\text{ adjacent}}\|f(\bm{X})-f(\bm{X}^{\prime})\|_{p}.$$

We then introduce two mechanisms. For algorithms with finite $\ell_{1}$-sensitivity, we add Laplace noise to achieve differential privacy, while for finite $\ell_{2}$-sensitivity, we inject Gaussian noise.

Proposition 2.3 (The Laplace Mechanism (dwork2006calibrating, ; dwork2014algorithmic, ))

Let $f:\mathcal{X}^{n}\to\mathbb{R}^{d}$ be a deterministic algorithm with $\Delta_{1}(f)<\infty$. Let $\bm{w}\in\mathbb{R}^{d}$ have coordinates $w_{1},w_{2},\dots,w_{d}$ drawn i.i.d. from Laplace$(\Delta_{1}(f)/\epsilon)$. Then $f(\bm{X})+\bm{w}$ is $(\epsilon,0)$-differentially private.

Proposition 2.4 (The Gaussian Mechanism (dwork2006calibrating, ; dwork2014algorithmic, ))

Let $f:\mathcal{X}^{n}\to\mathbb{R}^{d}$ be a deterministic algorithm with $\Delta_{2}(f)<\infty$. Let $\bm{w}\in\mathbb{R}^{d}$ have coordinates $w_{1},w_{2},\dots,w_{d}$ drawn i.i.d. from $N(0,2(\Delta_{2}(f)/\epsilon)^{2}\log(1.25/\delta))$. Then $f(\bm{X})+\bm{w}$ is $(\epsilon,\delta)$-differentially private.
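To make the two mechanisms concrete, the following is a minimal Python/NumPy sketch; the clipped-mean example, the helper names, and the choices of $T$, $\epsilon$, and $\delta$ are our own illustration and not part of the propositions above. Adjacency here replaces one of the $n$ points, so the clipped mean has $\ell_1$-sensitivity $2Td/n$ and $\ell_2$-sensitivity $2T\sqrt{d}/n$.

import numpy as np

def laplace_mechanism(value, l1_sensitivity, eps, rng):
    # Laplace mechanism (Proposition 2.3): add i.i.d. Laplace(Delta_1(f)/eps) noise.
    return value + rng.laplace(scale=l1_sensitivity / eps, size=np.shape(value))

def gaussian_mechanism(value, l2_sensitivity, eps, delta, rng):
    # Gaussian mechanism (Proposition 2.4): noise variance 2*(Delta_2(f)/eps)^2*log(1.25/delta).
    sd = np.sqrt(2.0 * np.log(1.25 / delta)) * l2_sensitivity / eps
    return value + rng.normal(scale=sd, size=np.shape(value))

# Example: privatize the coordinate-wise mean of n points clipped to [-T, T]^d.
rng = np.random.default_rng(0)
T, n, d = 1.0, 1000, 20
X = np.clip(rng.normal(size=(n, d)), -T, T)
mean = X.mean(axis=0)
mean_laplace = laplace_mechanism(mean, 2 * T * d / n, eps=1.0, rng=rng)
mean_gaussian = gaussian_mechanism(mean, 2 * T * np.sqrt(d) / n, eps=1.0, delta=1e-5, rng=rng)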

The post-processing and composition properties are two key properties of differential privacy, which enable us to design complex differentially private algorithms by combining simpler ones. Such properties are pivotal in the design of the algorithms in later sections.

Proposition 2.5 (Post-processing Property (dwork2006calibrating, ))

Let $M$ be an $(\epsilon,\delta)$-differentially private algorithm and $g$ be an arbitrary function that takes $M(\bm{X})$ as input. Then $g(M(\bm{X}))$ is also $(\epsilon,\delta)$-differentially private.

Proposition 2.6 (Composition property (dwork2006calibrating, ))

For $i=1,2$, let $M_{i}$ be an $(\epsilon_{i},\delta_{i})$-differentially private algorithm. Then $(M_{1},M_{2})$ is an $(\epsilon_{1}+\epsilon_{2},\delta_{1}+\delta_{2})$-differentially private algorithm.

We also mention the NoisyHT algorithm (Algorithm 1), introduced by (dwork2018differentially, ), which stands for the noisy hard-thresholding algorithm. The algorithm pursues both sparsity of the output and privacy at the same time.

Input: vector-valued function $\bm{v}=\bm{v}(\bm{X})\in\mathbb{R}^{d}$ with data $\bm{X}$, sparsity $s$, privacy parameters $\epsilon,\delta$, sensitivity $\lambda$.
Initialization: $S=\emptyset$.
For $i$ in $1$ to $s$:
  Generate $\bm{w}_{i}\in\mathbb{R}^{d}$ with $w_{i1},w_{i2},\dots,w_{id}\stackrel{\text{i.i.d.}}{\sim}\text{Laplace}\left(\lambda\cdot\frac{2\sqrt{3s\log(1/\delta)}}{\epsilon}\right)$.
  Append $j^{*}=\mathop{\rm argmax}_{j\in[d]\setminus S}(|v_{j}|+w_{ij})$ to $S$.
End For
Generate $\tilde{\bm{w}}$ with $\tilde{w}_{1},\dots,\tilde{w}_{d}\stackrel{\text{i.i.d.}}{\sim}\text{Laplace}\left(\lambda\cdot\frac{2\sqrt{3s\log(1/\delta)}}{\epsilon}\right)$.
Output: $P_{S}(\bm{v}+\tilde{\bm{w}})$.
Algorithm 1 Noisy Hard Thresholding Algorithm (NoisyHT($\bm{v},s,\lambda,\epsilon,\delta$)) (dwork2018differentially, )

In the last step, $P_{S}(\bm{u})$ denotes the operator that sets $\bm{u}_{S^{c}}=0$ while preserving $\bm{u}_{S}$. This algorithm can be seen as a private top-$k$ selection algorithm, which helps build our proposed algorithms in later sections.

Specifically, when the sparsity $s$ is chosen to be $1$, the algorithm outputs the maximum element after a single iteration in a private manner. We refer to this special case as the Private Max algorithm, which is used in Algorithm 7 for simultaneous inference.
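For concreteness, here is a minimal NumPy sketch of Algorithm 1 (the function and variable names are ours); calling it with $s=1$ recovers the Private Max selection mentioned above.

import numpy as np

def noisy_hard_threshold(v, s, lam, eps, delta, rng=None):
    # Noisy hard thresholding (Algorithm 1): privately select s coordinates of v
    # and return v restricted to that support, with calibrated Laplace perturbation.
    if rng is None:
        rng = np.random.default_rng()
    v = np.asarray(v, dtype=float)
    d = v.shape[0]
    scale = lam * 2.0 * np.sqrt(3.0 * s * np.log(1.0 / delta)) / eps
    support = []
    for _ in range(s):
        noisy = np.abs(v) + rng.laplace(scale=scale, size=d)
        if support:
            noisy[support] = -np.inf          # exclude indices already in S
        support.append(int(np.argmax(noisy)))
    out = np.zeros(d)
    out[support] = v[support] + rng.laplace(scale=scale, size=s)  # P_S(v + w_tilde)
    return out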

2.2 Federated Learning

Federated learning, introduced in (mcmahan2017communication, ), is a technique designed to train a machine learning model across multiple devices, without exchanging data samples. A central server coordinates the process, with each local machine sending model updates to be aggregated centrally. Figure 1 illustrates the basic concept of federated learning.

Figure 1: Federated Learning

One characteristic of federated learning is that the training of machine learning models occurs locally, and only parameters and updates are transferred to the central server and shared by each node. Specifically, communication between local machines and the server is bidirectional: machines send updates to the central server, and in return, they receive aggregated information after processing. Communication among local machines is prohibited to prevent privacy leakage. Intuitively, federated learning inherently provides a certain level of privacy.

Although rigorous definitions are lacking, there are two main branches of central server settings in federated learning: the untrusted central server setting and the trusted central server setting (lowy2021private, ; wei2020federated, ; mcmahan2022federated, ; geyer2017differentially, ). In the first setting, where the central server is untrusted, every piece of information sent from a machine to the central server must be differentially private. In the second setting, we assume that a trusted central server exists. In this scenario, it is safe to send raw information from the machines to the central server without additional privacy measures. However, to prevent information leakage among local machines, the information sent back from the server must be differentially private.

Another key aspect of federated learning is that the datasets on each local machine are commonly not independent and identically distributed (i.i.d.). This allows federated learning to train on heterogeneous datasets, aligning with practical scenarios where the datasets on different machines are typically diverse and their sizes may also vary. We will demonstrate that federated learning can efficiently estimate the local model when models on different local machines differ but share some similarities, a concept we refer to as heterogeneous federated learning in Section 5.

2.3 Problem Formulation

In this paper, we assume that there exist a central server and $m$ local machines. We denote the data on these machines by $\bm{X}_{1},\bm{X}_{2},\dots,\bm{X}_{m}$, respectively, with $\bm{X}_{i}\in\mathbb{R}^{n_{i}\times d}$. On each machine $i=1,2,\dots,m$, there are $n_{i}$ data points, $\bm{X}_{i}=[\bm{X}_{i1},\bm{X}_{i2},\dots,\bm{X}_{in_{i}}]$. For simplicity, we assume that each machine holds an equal number of data points, $n=n_{1}=n_{2}=\dots=n_{m}$. We note that the results can be easily generalized to cases where the sample sizes on the machines differ.

We consider both untrusted and trusted central server settings. For the untrusted setting, we require that the information sent from the local machines to the server be private. In this scenario, we show that in the high-dimensional setting, even with sparsity assumptions, it is impossible to achieve small estimation error when the central server is untrusted. In the trusted setting, we consider the high-dimensional linear regression problem $\bm{Y}=\bm{X}\bm{\beta}+\bm{W}$ with $s$-sparse $\bm{\beta}$. We first study the case where all machines share the same $\bm{\beta}$ (referred to as homogeneous federated learning), and then study a more general case where the models on different machines are not equal but share certain similarities (referred to as heterogeneous federated learning). We show that our algorithm can adapt to such similarity: with greater similarity, the algorithm achieves a faster rate of convergence.

3 An Impossibility Result in the Untrusted Central Server Setting

In this section, we study the untrusted server setting where the local machines need to send privatized information to the central server to ensure privacy. We show an impossibility result that in high-dimensional settings where the data dimension is comparable to or greater than the sample size, accurate estimation is not feasible even if we consider a simple sparse mean estimation problem.

As mentioned in Section 2.3, we consider a federated learning setting with $m$ machines, where each machine $i\in[m]$ handles $n$ data points $\bm{X}_{i}:=[\bm{X}_{i1},\bm{X}_{i2},\dots,\bm{X}_{in}]\in\mathbb{R}^{n\times d}$. Let $D_{\textnormal{all}}=\{\bm{X}_{i}\}_{i=1}^{m}$. We assume that each data point $\bm{X}_{ij}\in\mathbb{R}^{d}$ follows a Gaussian distribution $N(\bm{\mu},\bm{I}_{d})$, where $\bm{\mu}$ is a sparse $d$-dimensional vector with sparsity $s$. The goal is to estimate $\bm{\mu}$ in the federated learning setting when the central server is untrusted. In this section, we provide the optimal rate of convergence for this problem and show that the untrusted central server setting is not suited for high-dimensional problems.

We begin by deriving the minimax lower bound, which characterizes the fundamental difficulty of this estimation problem. In the untrusted server setting, we additionally assume that each piece of information sent from a local machine to the central server satisfies $\epsilon$-differential privacy. To formalize this, we introduce the privacy channel $\mathcal{W}^{\epsilon\textnormal{-priv}}:\mathcal{X}^{n}\to\mathcal{Z}$, a function responsible for privatizing the information transmitted from the local machines. Given the input $X\in\mathcal{X}^{n}$ and the privacy channel $\mathcal{W}^{\epsilon\textnormal{-priv}}$, $Z\in\mathcal{Z}$ represents all the information (from multiple rounds) transmitted to the central server. More precisely, we require that for any two adjacent datasets $X,X^{\prime}\in\mathcal{X}^{n}$, differing by only one data point on any local machine, and for any output $Z\in\mathcal{Z}$ representing the information sent from the local machine to the central server, the differential privacy guarantee ${\mathbb{P}}(\mathcal{W}^{\epsilon\textnormal{-priv}}(X)=Z)\leq e^{\epsilon}\cdot{\mathbb{P}}(\mathcal{W}^{\epsilon\textnormal{-priv}}(X^{\prime})=Z)$ holds.

We consider any mechanism $M$ in the federated learning setting with $m$ local machines and one central server, operating on the dataset $D_{\textnormal{all}}$. $M$ serves as a procedure to estimate $\bm{\mu}$, where each local machine collaborates exclusively with the central server, without direct interaction with the others. On each machine $i$, the mechanism $M$ uses the privacy channel $\mathcal{W}_{i}^{\epsilon\textnormal{-priv}}$ and the data sample $\bm{X}_{i}$ to generate $\bm{Z}_{i}$, which is then transmitted to the central server. The central server receives the information from all machines. After multiple rounds of collaboration between the local machines and the central server, we obtain a sparse and private estimator $\hat{\bm{\mu}}\in\mathbb{R}^{d}$. We denote the class of all mechanisms that satisfy the above constraints by $\mathcal{M}^{\textnormal{untrust}}_{m,\epsilon}(D_{\textnormal{all}})$. Under this setting, we establish a lower bound for the estimation error of the mean in Theorem 1.

Theorem 1

Suppose $D_{\textnormal{all}}$ is generated as above. Let $\bm{\mu}$ be an $s$-sparse $d$-dimensional mean of a Gaussian distribution satisfying $\|\bm{\mu}\|_{\infty}\leq 1$. We consider the estimation of the mean vector $\bm{\mu}$ under the untrusted central server federated learning setting with $m$ local machines and $n$ data points on each machine. Then, there exists a constant $c>0$ such that

$$\inf_{M\in\mathcal{M}^{\textnormal{untrust}}_{m,\epsilon}}\sup_{\bm{\mu}\in\mathbb{R}^{d},\|\bm{\mu}\|_{\infty}\leq 1}\|\bm{\mu}-M(D_{\textnormal{all}})\|_{2}^{2}\geq c\cdot\min\left(\frac{s}{n},\frac{sd}{mn\epsilon^{2}}\right).$$

The lower bound contains two terms. The first term, of order $s/n$, represents the minimax risk of mean estimation using only the samples from a single local machine. The second term, of order $sd/(mn\epsilon^{2})$, accounts for the error of federated learning across multiple machines under privacy constraints. Theorem 1 suggests that we cannot perform better than either estimating the mean using only a local machine or adopting the federated learning approach and combining information from different machines. In the latter approach, however, we must incur a rate of at least $O(d/(mn\epsilon^{2}))$, which is linearly proportional to the dimension $d$. This result suggests that privacy constraints significantly impact the efficiency of federated learning in high-dimensional settings. Furthermore, as the number of machines $m$ increases, we can possibly attain better performance, highlighting the merit of federated learning.

We also show the tightness of the lower bound in Theorem 1 by providing a matching upper bound.

Theorem 2

Suppose that the conditions in Theorem 1 hold. Then, in the untrusted central server federated learning setting, there exists an $\epsilon$-differentially private algorithm producing an estimator $\hat{\bm{\mu}}$ of $\bm{\mu}$ such that

$$\|\bm{\mu}-\hat{\bm{\mu}}\|_{2}^{2}\leq c\cdot\frac{sd}{mn\epsilon^{2}},$$

where $c>0$ is some constant.

The proof follows by constructing an algorithm that transforms the Gaussian mean into a Bernoulli mean according to the sign of the Gaussian mean, motivated by Algorithm 2 of (acharya2020general, ), where the authors discuss an $l$-bit protocol for estimating the mean of a product Bernoulli family. More details of the algorithm are deferred to Section A.2. Based on the results in Theorems 1 and 2, we obtain the optimal rate of convergence for sparse mean estimation under the differentially private federated learning setting. Consequently, when the central server is untrusted, accurate estimation is impossible in the high-dimensional regime. This highlights the necessity of the trusted server setting for statistical estimation and inference in high-dimensional federated learning scenarios. In the following sections, we develop estimation and inference procedures under the trusted server setting.

4 Homogeneous Federated Learning Setting

4.1 Algorithms for Estimation Problems

In this section, we consider the setting of a trusted central server, where local machines fully trust the central server and send unprivatized information to it without implementing privacy measures. However, when the central server sends information back to the local machines, it must ensure that this information is privatized to avoid any privacy leakage across local machines.

In this subsection, we first focus on statistical estimation problems in this setting and then develop inference results in the next subsection. More specifically, our primary focus is on the linear regression problem in a high-dimensional setting, where the ground truth, denoted by $\bm{\beta}$, is a sparse $d$-dimensional vector. We initially study the simpler case where the underlying generative models on all local machines are identical, which we refer to as the homogeneous federated learning setting. The more complicated heterogeneous setting will be discussed in the following section. Specifically, we consider the following high-dimensional linear regression model:

$$\bm{Y}=\bm{X}\bm{\beta}+\bm{W},$$

where $\bm{W}$ is the error term whose coordinates are independent and follow a sub-Gaussian distribution with variance proxy $\sigma^{2}$, denoted by $W_{i}\sim\text{subG}(\sigma^{2})$, and $\bm{X}$ is a random matrix whose rows follow a sub-Gaussian distribution with covariance matrix $\bm{\Sigma}$.

We first introduce the parameter estimation algorithm under differentially private federated settings with a trusted central server.

Input: Dataset $D_{\textnormal{all}}=\{(\bm{X}_{i},\bm{Y}_{i})\}_{i\in[m]}$, number of machines $m$, number of samples on each machine $n$, step size $\eta^{0}$, privacy parameters $\epsilon,\delta$, noise scale $B_{0}$, number of iterations $T$, truncation level $R$, feasibility parameter $C_{0}$, sparsity $s$, initial value $\bm{\beta}^{0}$.
for $t$ from $0$ to $T-1$ do
  Step 1: On each local machine $i=1,2,\dots,m$, calculate the local gradient
  $$\bm{g}_{i}=\frac{1}{n}\sum_{j=1}^{n}(\bm{X}_{ij}^{\top}\bm{\beta}^{t}-\Pi_{R}(y_{ij}))\bm{X}_{ij}.$$
  Send the gradient $\bm{g}_{i}$ to the central server.
  Step 2: Compute $\bm{\beta}^{t+0.5}=\bm{\beta}^{t}-(\eta^{0}/m)\sum_{i=1}^{m}\bm{g}_{i}$ at the central server;
  Compute $\bm{\beta}^{t+1}=\Pi_{C_{0}}\left(\text{NoisyHT}(\bm{\beta}^{t+0.5},s,\frac{\epsilon}{T},\frac{\delta}{T},\frac{\eta^{0}B_{0}}{mn})\right)$ at the central server.
  Step 3: Send the output $\bm{\beta}^{t+1}$ back to each local machine from the server.
end for
Output: Return $\bm{\beta}^{T}$.
Algorithm 2 Differentially Private Sparse Linear Regression under Federated Learning

In Step 1 of Algorithm 2, the information computed on each local machine is transmitted to the central server. The second step involves calculations performed at the central server. Prior to sending the information back to the local machines, it undergoes privacy preservation through the application of the NoisyHT algorithm, as introduced in Algorithm 1. Subsequently, the local machine updates its estimation based on the information received from the central server.

We compare Algorithm 2 with Algorithm 4.2 in (cai2019cost, ), which addresses private estimation in linear regression under non-federated learning settings. Unlike the latter, our algorithm does not transmit all data points to the central server. Instead, we calculate the gradient updates locally on each machine and send only these local gradients to the server. This design enhances privacy protection, as the original data remain visible only on the local machine and are not exposed externally. Furthermore, this approach of gradient updates also reduces communication costs by transmitting only a $d$-dimensional vector from each local machine for the gradient update. Previous research has also considered non-private distributed methods for linear regression problems, such as (lee2017communication, ; zhang2012communication, ). Our algorithm, however, ensures differential privacy. In practice, the sparsity level $s$ can be determined using a private version of cross-validation, while other parameters may be pre-chosen based on our theoretical analysis.
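To illustrate the flow of Algorithm 2, here is a minimal NumPy sketch; it is our own simplification, reusing the noisy_hard_threshold helper sketched after Algorithm 1 and reading $\Pi_{C_0}$ as the coordinate-wise truncation from the notation section.

import numpy as np

def dp_federated_sparse_regression(X_list, y_list, s, eps, delta, T, eta, B0, R, C0, rng=None):
    # Sketch of Algorithm 2: T rounds of federated gradient descent in which the
    # trusted server privatizes the iterate with NoisyHT before broadcasting it.
    if rng is None:
        rng = np.random.default_rng()
    m = len(X_list)
    n, d = X_list[0].shape
    beta = np.zeros(d)                                  # initialization beta^0 = 0
    for _ in range(T):
        # Step 1: local gradients on truncated responses, sent to the server.
        grads = [Xi.T @ (Xi @ beta - np.clip(yi, -R, R)) / n
                 for Xi, yi in zip(X_list, y_list)]
        # Step 2: aggregate, take a gradient step, privatize, and truncate.
        beta_half = beta - (eta / m) * np.sum(grads, axis=0)
        beta = noisy_hard_threshold(beta_half, s, eta * B0 / (m * n), eps / T, delta / T, rng)
        beta = np.clip(beta, -C0, C0)                   # feasibility projection Pi_{C0}
        # Step 3: the privatized iterate is broadcast back to the machines.
    return beta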

4.2 Algorithms for Inference Problems

In this subsection, we focus on statistical inference problems in the homogeneous federated learning setting, such as constructing coordinate-wise confidence intervals for parameters and performing simultaneous inference. To begin, we develop a method for constructing coordinate-wise confidence intervals, for example, for the $k$-th coordinate $\beta_{k}$ of $\bm{\beta}$. It is important to note that the output of Algorithm 2 is biased due to hard thresholding. To overcome this bias, we employ a de-biasing procedure, a common technique in high-dimensional statistics (javanmard2014confidence, ). This procedure involves approximating the $k$-th column of the precision matrix $\bm{\Theta}=\bm{\Sigma}^{-1}$ to construct a confidence interval for each $\beta_{k}$. We therefore first obtain an estimate of the precision matrix in a private manner.

Input: Number of machines $m$, number of data points on each machine $n$, dataset $\bm{X}_{i}=(\bm{x}_{i1},\dots,\bm{x}_{in})$ for $i=1,\dots,m$, step size $\eta^{1}$, privacy parameters $\epsilon,\delta$, noise scale $B_{1}$, number of iterations $T$, feasibility parameter $C_{1}$, sparsity $s$, initial value $\bm{\Theta}_{k}^{0}$.
for $t$ from $0$ to $T-1$ do
  Step 1: On each local machine $i=1,2,\dots,m$, calculate the local gradient $\bm{g}_{i}=\frac{1}{n}\sum_{j=1}^{n}\bm{X}_{ij}\bm{X}_{ij}^{\top}\bm{\Theta}_{k}^{t}-\bm{e}_{k}$. Send the gradients $(\bm{g}_{1},\bm{g}_{2},\dots,\bm{g}_{m})$ to the central server.
  Step 2: Compute $\bm{\Theta}_{k}^{t+0.5}=\bm{\Theta}_{k}^{t}-(\eta^{1}/m)\sum_{i=1}^{m}\bm{g}_{i}$ at the central server;
  Compute $\bm{\Theta}_{k}^{t+1}=\Pi_{C_{1}}\left(\text{NoisyHT}(\bm{\Theta}_{k}^{t+0.5},s,\frac{\epsilon}{T},\frac{\delta}{T},\frac{\eta^{1}B_{1}}{mn})\right)$ at the central server.
  Step 3: Send $\bm{\Theta}_{k}^{t+1}$ back to each local machine from the server.
end for
Output: Return $\bm{\Theta}_{k}^{T}$.
Algorithm 3 Differentially Private Precision Matrix Estimation in Federated Learning

The structure of Algorithm 3 is similar to that of Algorithm 2: both adopt iterative communication between the central server and the local machines. The information is first transmitted from the local machines to the server; the central server then performs calculations and uses the NoisyHT algorithm (Algorithm 1) to ensure the privacy of the information. Subsequently, each local machine updates the gradient and proceeds to the next iteration. The primary distinction between the two algorithms lies in the computation of the gradient on each machine.

Denote the output of Algorithm 2 as 𝜷^\hat{\bm{\beta}} and the output of Algorithm 3 as 𝚯^k\hat{\bm{\Theta}}_{k}. Then the de-biased differentially private estimator of βk\beta_{k} is given by

$$\hat{\beta}^{\text{u}}_{k}=\hat{\beta}_{k}-\frac{1}{m}\sum_{i=1}^{m}\hat{\bm{\Theta}}_{k}^{\top}\bm{g}_{i}+E_{k},\qquad(4.1)$$

where $\bm{g}_{i}=(1/n)\sum_{j=1}^{n}(\bm{X}_{ij}^{\top}\hat{\bm{\beta}}-\Pi_{R}(y_{ij}))\bm{X}_{ij}$, and $E_{k}$ is random noise injected to ensure privacy, following a Gaussian distribution $N(0,8\Delta_{1}^{2}\log(1.25/\delta)/(n^{2}m^{2}\epsilon^{2}))$, where $\Delta_{1}=\sqrt{s}c_{1}c_{x}R+sc_{0}c_{1}c_{x}^{2}$ with constants $c_{0},c_{1},c_{x}$ defined later.
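A minimal NumPy sketch of the de-biasing step in (4.1) is given below; it is our own illustration, with beta_hat and theta_k_hat standing for the outputs of Algorithms 2 and 3 and Delta1 for the sensitivity constant defined above.

import numpy as np

def debiased_coordinate(X_list, y_list, beta_hat, theta_k_hat, k, R, Delta1, eps, delta, rng=None):
    # De-biased private estimator of beta_k as in (4.1).
    if rng is None:
        rng = np.random.default_rng()
    m = len(X_list)
    n = X_list[0].shape[0]
    grads = [Xi.T @ (Xi @ beta_hat - np.clip(yi, -R, R)) / n
             for Xi, yi in zip(X_list, y_list)]
    correction = np.mean([theta_k_hat @ g for g in grads])   # (1/m) sum_i Theta_k^T g_i
    noise_sd = np.sqrt(8.0 * np.log(1.25 / delta)) * Delta1 / (n * m * eps)  # std of E_k
    return beta_hat[k] - correction + rng.normal(scale=noise_sd)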

The de-biased estimator in (4.1) enables us to construct differentially private confidence intervals. Although the variance $\sigma^{2}$ of the error term $\bm{W}$ in the linear regression model is usually unknown, we can estimate it from the data in a private manner. The estimation is based on the residuals between the response $\bm{Y}$ and the fitted value $\bm{X}\hat{\bm{\beta}}$. We summarize the method for estimating $\sigma^{2}$ in the private federated learning setting in Algorithm 4.

Input: Dataset $(\bm{X}_{i},\bm{Y}_{i})_{i=1,2,\dots,m}$, privacy parameters $\epsilon,\delta$, noise scale $B_{2}$, truncation level $R$, estimated parameter $\hat{\bm{\beta}}$ from Algorithm 2.
Step 1: On each machine $i=1,2,\dots,m$, compute $\hat{W}_{i}=\|\Pi_{R}(\bm{Y}_{i})-\bm{X}_{i}\hat{\bm{\beta}}\|_{2}^{2}/n$ and send $\hat{W}_{i}$ to the central server.
Step 2: Generate a random variable $E^{\textnormal{var}}\sim N(0,2B_{2}^{2}\log(1.25/\delta)/\epsilon^{2})$.
Step 3: Compute $\hat{\sigma}^{2}=\sum_{i=1}^{m}\hat{W}_{i}/m+E^{\textnormal{var}}$ at the central server.
Output: Estimated variance $\hat{\sigma}^{2}$.
Algorithm 4 Differentially Private Variance Estimation in Federated Learning
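A minimal NumPy sketch of Algorithm 4 (our own illustration, with inputs mirroring those listed above):

import numpy as np

def dp_variance_estimate(X_list, y_list, beta_hat, R, B2, eps, delta, rng=None):
    # Sketch of Algorithm 4: private estimate of the error variance sigma^2.
    if rng is None:
        rng = np.random.default_rng()
    n = X_list[0].shape[0]
    # Step 1: each machine reports its truncated residual sum of squares divided by n.
    W_hat = [np.sum((np.clip(yi, -R, R) - Xi @ beta_hat) ** 2) / n
             for Xi, yi in zip(X_list, y_list)]
    # Steps 2-3: the server averages and adds Gaussian noise with variance 2*B2^2*log(1.25/delta)/eps^2.
    noise_sd = np.sqrt(2.0 * np.log(1.25 / delta)) * B2 / eps
    return float(np.mean(W_hat) + rng.normal(scale=noise_sd))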

When examining the convergence rates of $\hat{\bm{\beta}}$ and $\hat{\bm{\Theta}}_{k}$ in Theorem 3, we observe the crucial roles of the largest and smallest restricted eigenvalues of $\bm{\Sigma}$. Since these eigenvalues directly influence the construction of confidence intervals and cannot be obtained directly from the data, their private estimation becomes essential. Below, we outline an algorithm to estimate the largest restricted eigenvalue, $\mu_{s}(\Sigma)$. To estimate the smallest restricted eigenvalue, $\nu_{s}(\Sigma)$, the same algorithm can be used by changing the "argmax" in Step 4 to "argmin".

Input: Number of machines $m$, dataset $(\bm{X}_{i})_{i=1,\dots,m}$, number of data points on each machine $n$, privacy parameter $\epsilon$, noise scale $B_{3}$, number of vectors $n_{1}$.
Step 1: Sample $n_{1}$ $d$-dimensional, $s$-sparse unit vectors $\bm{v}_{1},\bm{v}_{2},\dots,\bm{v}_{n_{1}}$.
Step 2: On each machine, compute $t_{i,k}=(\bm{v}_{k}^{\top}\bm{X}_{i}^{\top}\bm{X}_{i}\bm{v}_{k})/n$ for $k=1,2,\dots,n_{1}$ and send them to the central server.
Step 3: Sample $\xi_{1},\dots,\xi_{n_{1}}\sim\text{Laplace}(2B_{3}/\epsilon)$.
Step 4: Compute $k_{\max}=\mathop{\rm argmax}_{k}\sum_{i=1}^{m}t_{i,k}/m+\xi_{k}$.
Output: $\mu_{s}(\hat{\bm{\Sigma}})=\sum_{i=1}^{m}t_{i,k_{\max}}/m+\xi$, where $\xi\sim\text{Laplace}(2B_{3}/\epsilon)$ independently.
Algorithm 5 Differentially Private Restricted Eigenvalue Estimation in Federated Learning
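A minimal NumPy sketch of Algorithm 5 is given below (our own illustration; setting largest=False switches the arg-max in Step 4 to an arg-min and returns the smallest restricted eigenvalue instead).

import numpy as np

def dp_restricted_eigenvalue(X_list, s, n1, B3, eps, largest=True, rng=None):
    # Sketch of Algorithm 5: private estimate of the largest (or smallest)
    # s-restricted eigenvalue of the pooled empirical covariance matrix.
    if rng is None:
        rng = np.random.default_rng()
    n, d = X_list[0].shape
    # Step 1: sample n1 random s-sparse unit vectors.
    V = np.zeros((n1, d))
    for k in range(n1):
        support = rng.choice(d, size=s, replace=False)
        V[k, support] = rng.normal(size=s)
        V[k] /= np.linalg.norm(V[k])
    # Step 2: each machine evaluates the quadratic forms t_{i,k} = v_k^T (X_i^T X_i / n) v_k.
    t = np.stack([np.sum((Xi @ V.T) ** 2, axis=0) / n for Xi in X_list])  # shape (m, n1)
    # Steps 3-4: the server selects the noisy arg-max (or arg-min) via the Laplace mechanism.
    avg = t.mean(axis=0)
    noisy = avg + rng.laplace(scale=2.0 * B3 / eps, size=n1)
    k_star = int(np.argmax(noisy)) if largest else int(np.argmin(noisy))
    # Output: the selected average plus one more independent Laplace perturbation.
    return float(avg[k_star] + rng.laplace(scale=2.0 * B3 / eps))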

Based on Algorithms 4 and 5, we provide a construction of coordinate-wise confidence intervals in Algorithm 6.

Input: Number of machines $m$, dataset $(\bm{X}_{i},\bm{Y}_{i})_{i=1,\dots,m}$, number of data points on each machine $n$, privacy parameters $\epsilon,\delta$, truncation level $R$, sparsity $s$, estimators $\hat{\bm{\beta}}$, $\hat{\bm{\Theta}}_{k}$ from Algorithms 2 and 3, constants $\Delta_{1},\gamma$.
Step 1: On each local machine $i=1,2,\dots,m$, calculate the local gradient $\bm{g}_{i}=\frac{1}{n}\sum_{j=1}^{n}(\bm{X}_{ij}^{\top}\hat{\bm{\beta}}-\Pi_{R}(y_{ij}))\bm{X}_{ij}$. Send the gradients $(\bm{g}_{1},\bm{g}_{2},\dots,\bm{g}_{m})$ to the central server.
Step 2: Generate a random variable $E$ from the Gaussian distribution $N(0,8\Delta_{1}^{2}\log(1.25/\delta)/(n^{2}m^{2}\epsilon^{2}))$.
Step 3: Calculate the de-biased estimate $\hat{\beta}^{\text{u}}_{k}=\hat{\beta}_{k}-\frac{1}{m}\sum_{i=1}^{m}\hat{\bm{\Theta}}_{k}^{\top}\bm{g}_{i}+E$.
Step 4: Estimate $\hat{\sigma}$ from Algorithm 4 and $\hat{\mu}_{s},\hat{\nu}_{s}$ from Algorithm 5.
Step 5: Calculate the confidence interval
$$J_{k}(\alpha)=\biggl[\hat{\beta}_{k}^{\text{u}}-\frac{\gamma\hat{\mu}_{s}^{2}}{\hat{\nu}_{s}^{2}}\frac{s^{2}\log^{2}d\log(1/\delta)\log^{3}mn}{m^{2}n^{2}\epsilon^{2}}-\Phi^{-1}(1-\alpha/2)\frac{\hat{\sigma}}{\sqrt{mn}}\sqrt{\hat{\bm{\Theta}}_{k}^{\top}\hat{\bm{\Sigma}}\hat{\bm{\Theta}}_{k}+\frac{8\Delta_{1}^{2}\log(1/\delta)}{mn\epsilon^{2}}},\ \hat{\beta}_{k}^{\text{u}}+\frac{\gamma\hat{\mu}_{s}^{2}}{\hat{\nu}_{s}^{2}}\frac{s^{2}\log^{2}d\log(1/\delta)\log^{3}mn}{m^{2}n^{2}\epsilon^{2}}+\Phi^{-1}(1-\alpha/2)\frac{\hat{\sigma}}{\sqrt{mn}}\sqrt{\hat{\bm{\Theta}}_{k}^{\top}\hat{\bm{\Sigma}}\hat{\bm{\Theta}}_{k}+\frac{8\Delta_{1}^{2}\log(1/\delta)}{mn\epsilon^{2}}}\biggr].$$
Output: Return the final result $J_{k}(\alpha)$.
Algorithm 6 Differentially Private Coordinate-wise Confidence Interval for $\beta_{k}$ in Federated Learning
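For reference, a minimal sketch of assembling $J_{k}(\alpha)$ from the quantities produced by Algorithms 2 through 5 (our own illustration; SciPy is used only for the standard normal quantile):

import numpy as np
from scipy.stats import norm

def dp_coordinate_ci(beta_u_k, sigma_hat, theta_k_hat, Sigma_hat, mu_s_hat, nu_s_hat,
                     s, d, m, n, eps, delta, Delta1, gamma, alpha=0.05):
    # Confidence interval J_k(alpha) as in Step 5 of Algorithm 6.
    # Bias allowance accounting for the privacy noise injected during estimation.
    bias = (gamma * mu_s_hat**2 / nu_s_hat**2) * (s**2 * np.log(d)**2 * np.log(1.0 / delta)
            * np.log(m * n)**3) / (m**2 * n**2 * eps**2)
    # Standard error: sampling variance plus the variance of the injected noise E.
    var = theta_k_hat @ Sigma_hat @ theta_k_hat + 8.0 * Delta1**2 * np.log(1.0 / delta) / (m * n * eps**2)
    half = norm.ppf(1 - alpha / 2) * sigma_hat / np.sqrt(m * n) * np.sqrt(var)
    return beta_u_k - bias - half, beta_u_k + bias + half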

So far, we have focused on constructing confidence intervals for individual coordinates of the parameter vector $\bm{\beta}$. However, in high-dimensional settings, we are often interested in group inference problems, where we test hypotheses involving multiple coordinates simultaneously. Specifically, we consider the problem of testing the null hypothesis

$$H_{0}:\hat{\beta}_{k}=\beta_{k},\text{ for all }k\in G$$

against the alternative hypothesis,

$$H_{1}:\hat{\beta}_{k}\neq\beta_{k},$$

for at least one $k\in G$, where $G$ is a subset of all coordinates $\{1,2,\dots,d\}$ and we allow $|G|$ to be of the same order as $d$. Additionally, we construct simultaneous confidence intervals for all coordinates in $G$. Note that the problems discussed above are common in high-dimensional data analysis, with applications such as multi-factor analysis of variance (hothorn2008simultaneous, ) and additive modeling (wiesenfarth2012direct, ). Previous research has discussed similar problems in the non-private setting, including (chernozhukov2013gaussian, ; zhang2017simultaneous, ; yu2022distributed, ).

To address the problem, simultaneous inference can be conducted using a test statistic

$$\max_{k\in G}|\hat{\beta}_{k}^{\text{u}}-\beta_{k}|.$$

Major challenges of simultaneous inference in a private federated learning setting include: (1) minimizing the communication cost from local machines to the server while retaining all data on the local machines, and (2) ensuring the privacy of the procedure, which necessitates a tailored privacy-preserving mechanism at each step of the algorithm.

In our framework, we propose an algorithm based on the bootstrap method. As previously mentioned, to build confidence intervals, our interest lies in the statistic computed as the maximum coordinate of $\hat{\beta}_{k}^{\text{u}}-\beta_{k}$ over $G$. By decomposing this statistic, we obtain a term $\frac{\sigma}{\sqrt{mn}}\sum_{i=1}^{m}\sum_{j=1}^{n}\hat{\bm{\Theta}}\bm{X}_{ij}(y_{ij}-\bm{X}_{ij}^{\top}\bm{\beta})$. To determine the distribution of this term, we bootstrap the residuals $y_{ij}-\bm{X}_{ij}^{\top}\bm{\beta}$.

We outline the algorithm as follows: we first estimate $\hat{\bm{\beta}}$ and $\hat{\bm{\Theta}}_{k}$ using Algorithms 2 and 3, respectively. Then, by stacking $\hat{\bm{\Theta}}_{k}$ over all $k$, we obtain an estimator $\hat{\bm{\Theta}}$ of the precision matrix. The details are provided in Algorithm 7.

Input: Number of machines $m$, dataset $(\bm{X}_{i},\bm{Y}_{i})_{i=1,2,\dots,m}$, number of data points on each machine $n$, privacy parameters $\epsilon,\delta$, estimators $\hat{\bm{\beta}}$, $\hat{\bm{\Theta}}$ from Algorithms 2 and 3, number of bootstrap iterations $q$, quantile level $\alpha$, noise level $B_{4}$, subset of coordinates $G$.
for $t$ from $1$ to $q$ do
  Step 1: On each local machine $i=1,\dots,m$, generate $n$ independent standard Gaussian random variables $e_{i1},\dots,e_{in}$. Calculate $\bm{u}_{i}=\frac{1}{\sqrt{n}}\sum_{j=1}^{n}\hat{\bm{\Theta}}\bm{X}_{ij}e_{ij}$.
  Step 2: Send $(\bm{u}_{i})_{i=1,2,\dots,m}$ from the local machines to the central server.
  Step 3: Calculate $U_{t}=\text{PrivateMax}([(1/\sqrt{m})\sum_{i=1}^{m}\bm{u}_{i}]_{G},\epsilon,\delta,B_{4})$ at the central server.
end for
Output: Compute the $\alpha$-quantile $C_{U}(\alpha)$ of $(|U_{1}|,|U_{2}|,\dots,|U_{q}|)$ for $\alpha\in(0,1)$.
Algorithm 7 Private Bootstrap Method for Simultaneous Inference in Federated Learning

In Step 3 of Algorithm 7, we employ the Private Max algorithm, mentioned earlier as the special case of the NoisyHT algorithm (Algorithm 1) with $s=1$, to obtain the maximum element of a vector in a private manner. It is also important to note that the Private Max algorithm is applied only to the coordinates in the subset $G$. We then define the statistic $M$ as:

$$M(\hat{\bm{\beta}})=\max_{k\in G}|\sqrt{mn}(\hat{\beta}_{k}^{\text{u}}-\beta_{k})|.$$

$M$ is used as the test statistic for the inference problems discussed later.

As previously mentioned, we can easily construct a simultaneous confidence interval for each kGk\in G by:

$$\left[\hat{\beta}_{k}^{\text{u}}-\frac{\hat{\sigma}}{\sqrt{mn}}C_{U}(\alpha),\ \ \hat{\beta}_{k}^{\text{u}}+\frac{\hat{\sigma}}{\sqrt{mn}}C_{U}(\alpha)\right],$$

where $C_{U}(\alpha)$ is obtained from our algorithm with prespecified $\alpha$. We can similarly perform hypothesis testing: we first calculate the test statistic and obtain $C_{U}(\alpha)$ from our algorithm with prespecified $\alpha$, and then reject the null hypothesis if the statistic lies in the rejection region.
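As an illustration of how the bootstrap output is used, a minimal sketch follows; it is our own, with beta_u denoting the vector of de-biased estimates, C_U_alpha the quantile returned by Algorithm 7, and beta_null holding the hypothesized values under the null.

import numpy as np

def simultaneous_ci(beta_u, sigma_hat, C_U_alpha, m, n, G):
    # Simultaneous confidence intervals over the index set G.
    half = sigma_hat / np.sqrt(m * n) * C_U_alpha
    return {k: (beta_u[k] - half, beta_u[k] + half) for k in G}

def simultaneous_test(beta_u, beta_null, C_U_alpha, m, n, G):
    # Reject the null hypothesis when the max statistic exceeds the bootstrap quantile.
    stat = max(abs(np.sqrt(m * n) * (beta_u[k] - beta_null[k])) for k in G)
    return stat > C_U_alpha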

4.3 Theoretical Results

In this subsection, we provide theoretical guarantees for the algorithms and methods discussed in the previous subsections. Before proceeding, we outline key assumptions concerning the design matrix $\bm{X}$, the precision matrix $\bm{\Theta}$, and the true parameter $\bm{\beta}$ of the linear regression model, which are essential for our subsequent analyses.

  • (P1)

    Parameter Sparsity: The true parameter vector $\bm{\beta}$ satisfies $\|\bm{\beta}\|_{2}<c_{0}$ for some constant $0<c_{0}<\infty$ and $\|\bm{\beta}\|_{0}\leq s_{0}^{*}=o(n)$.

  • (P2)

    Precision Matrix Sparsity: Each column $\bm{\Theta}_{k}$, $k=1,2,\dots,d$, of the precision matrix satisfies $\|\bm{\Theta}_{k}\|_{2}<c_{1}$ for some constant $0<c_{1}<\infty$ and $\|\bm{\Theta}_{k}\|_{0}\leq s_{1}^{*}=o(n)$.

  • (D1)

    Design Matrix: For each row $\bm{x}$ of the design matrix $\bm{X}$, $\bm{\Sigma}^{-1/2}\bm{x}$ is sub-Gaussian with sub-Gaussian norm $\kappa:=\|\bm{\Sigma}^{-1/2}\bm{x}\|_{\psi_{2}}$.

  • (D2)

    Bounded Eigenvalues of the Covariance Matrix: For the covariance matrix $\bm{\Sigma}=\mathbb{E}[\bm{x}\bm{x}^{\top}]$, there exists a constant $0<L<\infty$ such that $0<1/L<\lambda_{\min}(\bm{\Sigma})\leq\lambda_{\max}(\bm{\Sigma})<L$.

Assumptions (P1) and (P2) bound the $\ell_{2}$ and $\ell_{0}$ norms of the parameters $\bm{\beta}$ and $\bm{\Theta}_{k}$, assumption (D1) guarantees that each row of $\bm{X}$ follows a sub-Gaussian distribution, and assumption (D2) requires the covariance matrix to have bounded eigenvalues. These assumptions are commonly used in the theoretical analysis of differentially private algorithms and debiased estimators (cai2020cost, ; cai2019cost, ; javanmard2014confidence, ).

Under assumptions (P1)-(D2), we analyze the algorithms presented above. We begin with the estimation problem and provide rates of convergence for $\hat{\bm{\beta}}$ and $\hat{\bm{\Theta}}_{k}$.

Theorem 3

Let $\{(y_{ij},\bm{X}_{ij})\}_{i\in[m],j\in[n]}$ be i.i.d. samples from the high-dimensional linear model. Suppose that assumptions (P1), (P2), (D1), and (D2) are satisfied. Additionally, suppose that

  • we choose the parameters as follows: let $s^{*}=\max(s_{0}^{*},s_{1}^{*})$, $R=\sigma\sqrt{2\log mn}$, $C_{0}=c_{0}$, $C_{1}=c_{1}$, $c_{x}=3\sqrt{2L\kappa^{2}\log d}$, $B_{0}=2(R+\sqrt{s}c_{0}c_{x})c_{x}$, $B_{1}=2\sqrt{s}c_{1}c_{x}^{2}$, $\Delta_{1}=\sqrt{s}c_{1}c_{x}R+sc_{0}c_{1}c_{x}^{2}$, and $\gamma=\max(\mu_{s}(9\mu_{s}+1/4),17/16\mu_{s}+1/96)$, where $\mu_{s},\nu_{s}$ are the largest and smallest $s$-restricted eigenvalues of $\hat{\Sigma}$;

  • we set $\bm{\beta}^{0}=\bm{0}$ and $\bm{\Theta}_{k}^{0}=\bm{0}$ as the initializations in Algorithm 2 and Algorithm 3.

Then there exists some absolute constant $\rho>0$ such that, if $s=\rho L^{4}s^{*}$, $\eta^{0}=\eta^{1}=s/6L$, $T=\rho L^{2}\log(8c_{0}^{2}Ln)$, and $n\geq KR(s^{*})^{3/2}\log d\sqrt{\log(1/\delta)}\log n/\epsilon$ for a sufficiently large constant $K>0$, then the outputs of Algorithm 2 and Algorithm 3 satisfy

$$\|\hat{\bm{\beta}}-\bm{\beta}\|_{2}^{2}\leq\sigma^{2}\left(k_{0}\cdot\frac{s\log d}{mn}+\frac{6\gamma\mu_{s}}{\nu_{s}^{2}}\cdot\frac{s^{2}\log^{2}d\log(1/\delta)\log^{3}mn}{m^{2}n^{2}\epsilon^{2}}\right),\qquad(4.2)$$

and

$$\|\hat{\bm{\Theta}}_{k}-\bm{\Theta}_{k}\|_{2}^{2}\leq\sigma^{2}\left(k_{1}\cdot\frac{s\log d}{mn}+\frac{6\gamma\mu_{s}}{\nu_{s}^{2}}\frac{s^{2}\log^{2}d\log(1/\delta)\log^{3}mn}{m^{2}n^{2}\epsilon^{2}}\right),\qquad(4.3)$$

hold with probability $1-\exp(-\Omega(\log(d/s\log d)+\log n))$.

The upper bounds in (4.2) and (4.3) can be interpreted as follows. The first term represents the statistical error, while the second term accounts for the privacy cost. Furthermore, the result is comparable to that of Theorem 4.4 in (cai2019cost, ), which addresses private linear regression in a non-federated setting. This comparison suggests that the federated learning approach does not adversely affect the convergence rate; instead, it allows us to leverage the benefits of federated learning. The advantages of federated learning will be further explored in the heterogeneous federated learning setting discussed in the next section.

The remainder of this subsection presents the theoretical results for the inference problems. We begin with the construction of coordinate-wise confidence intervals. As mentioned before, $\sigma$ is usually unknown, and we estimate it in a private manner, as presented in Algorithm 4. Lemma 4.1 states the statistical guarantee of this algorithm.

Lemma 4.1

Let $\hat{\sigma}^{2}$ be the output of Algorithm 4 with $R=O(\sqrt{2\log mn})$, $B_{2}=\frac{4}{mn}(R^{2}+s^{2}c_{0}^{2}c_{x}^{2})$, and $\hat{\bm{\beta}}$ the output of Algorithm 2. Then, Algorithm 4 is $(\epsilon,\delta)$-differentially private, and it follows that

$$|\sigma^{2}-\hat{\sigma}^{2}|\leq c\cdot\left(\frac{1}{\sqrt{mn}}+\frac{s\log d}{mn}+\frac{s^{2}\log^{2}d\log(1/\delta)\log^{3}mn}{m^{2}n^{2}\epsilon^{2}}\right),$$

where $c>0$ is a universal constant.

Next, we consider a simplified version of the confidence interval, in which the privacy cost is dominated by the statistical error. In this scenario, the privacy constraints are relatively loose, meaning that the privacy parameters $\epsilon$ and $\delta$ are relatively large, so that privacy is attained at nearly no cost in estimation. We present our result in the following theorem.

Theorem 4

Suppose that the conditions in Theorem 3 hold. Assume that $\frac{s^{*}\log d}{\sqrt{mn}}=o(1)$ and $\frac{{s^{*}}^{2}\log^{2}d\log(1.25/\delta)\log^{3}mn}{mn\epsilon^{2}}=o(1)$. Also assume that the privacy cost is dominated by the statistical error, i.e., there exists a constant $k_{0}$ such that $\frac{{s^{*}}^{2}\log^{2}d\log(1/\delta)\log^{3}mn}{m^{2}n^{2}\epsilon^{2}}\leq k_{0}\cdot\frac{s^{*}\log d}{mn}$. Then, given the de-biased estimator $\hat{\beta}_{k}^{\text{u}}$ defined in (4.1), the confidence interval is asymptotically valid:

$$\lim_{mn\to\infty}\mathbb{P}(\beta_{k}\in J_{k}(\alpha))=1-\alpha,$$

where

$$J_{k}(\alpha)=\left[\hat{\beta}_{k}^{\text{u}}-\Phi^{-1}(1-\alpha/2)\frac{\hat{\sigma}}{\sqrt{mn}}\sqrt{\hat{\bm{\Theta}}_{k}^{\top}\hat{\bm{\Sigma}}\hat{\bm{\Theta}}_{k}},\ \ \hat{\beta}_{k}^{\text{u}}+\Phi^{-1}(1-\alpha/2)\frac{\hat{\sigma}}{\sqrt{mn}}\sqrt{\hat{\bm{\Theta}}_{k}^{\top}\hat{\bm{\Sigma}}\hat{\bm{\Theta}}_{k}}\right].$$

Also, the confidence interval $J_{k}(\alpha)$ is $(\epsilon,\delta)$-differentially private.

Theorem 4 assumes that the privacy cost is dominated by the statistical error. However, when the privacy constraint is more stringent, with small privacy parameters $\epsilon$ and $\delta$, the privacy cost may be larger than the statistical error. In this scenario, we generalize Theorem 4 to analyze Algorithm 6. We note that the largest and smallest restricted eigenvalues of $\hat{\bm{\Sigma}}$ also need to be estimated, which is done by Algorithm 5. Lemma 4.2 quantifies the estimation error of the largest restricted eigenvalue of $\hat{\bm{\Sigma}}$.

Lemma 4.2

If n1=cdsn_{1}=cd^{s} and B3=2scx2/nB_{3}=2sc_{x}^{2}/n for some constant c>0c>0, then the output from Algorithm 5 is (ϵ,0)(\epsilon,0)-differentially private. Moreover, (1/9)λsλ^sλs(1/9)\lambda_{s}\leq\hat{\lambda}_{s}\leq\lambda_{s} holds where λs\lambda_{s} is the largest restricted eigenvalue of 𝚺^\hat{\bm{\Sigma}}.

We then present a theoretical result for the confidence interval in a more general case in Theorem 5.

Theorem 5

Assume the conditions in Theorem 3 hold. Suppose that slogdmn=o(1)\frac{s^{*}\log d}{\sqrt{mn}}=o(1) and s2log2dlog(1.25/δ)log3mnm2n2ϵ2=o(1)\frac{s^{*2}\log^{2}d\log(1.25/\delta)\log^{3}mn}{m^{2}n^{2}\epsilon^{2}}=o(1). Then, given the de-biased estimator β^ku\hat{\beta}_{k}^{\text{u}} defined in (4.1) and the estimated restricted eigenvalues μ^s\hat{\mu}_{s} and ν^s\hat{\nu}_{s} from Algorithm 5, the confidence interval constructed by Algorithm 6 is asymptotically valid:

limmn(βkJk(α))=1α,\lim_{mn\to\infty}\mathbb{P}(\beta_{k}\in J_{k}(\alpha))=1-\alpha,

where

Jk(α)=[β^kuγμ^s2ν^s2s2log2dlog(1/δ)log3mnm2n2ϵ2Φ1(1α/2)σ^mn𝚯^k𝚺^𝚯^k+8Δ12log(1/δ)mnϵ2,\displaystyle J_{k}(\alpha)=\biggl{[}\hat{\beta}_{k}^{\text{u}}-\frac{\gamma\hat{\mu}_{s}^{2}}{\hat{\nu}_{s}^{2}}\frac{s^{2}\log^{2}d\log(1/\delta)\log^{3}mn}{m^{2}n^{2}\epsilon^{2}}-\Phi^{-1}(1-\alpha/2)\frac{\hat{\sigma}}{\sqrt{mn}}\sqrt{\hat{\bm{\Theta}}_{k}^{\top}\hat{\bm{\Sigma}}\hat{\bm{\Theta}}_{k}+\frac{8\Delta_{1}^{2}\log(1/\delta)}{mn\epsilon^{2}}},
β^ku+γμ^s2ν^s2s2log2dlog(1/δ)log3mnm2n2ϵ2+Φ1(1α/2)σ^mn𝚯^k𝚺^𝚯^k+8Δ12log(1/δ)mnϵ2].\displaystyle\hat{\beta}_{k}^{\text{u}}+\frac{\gamma\hat{\mu}_{s}^{2}}{\hat{\nu}_{s}^{2}}\frac{s^{2}\log^{2}d\log(1/\delta)\log^{3}mn}{m^{2}n^{2}\epsilon^{2}}+\Phi^{-1}(1-\alpha/2)\frac{\hat{\sigma}}{\sqrt{mn}}\sqrt{\hat{\bm{\Theta}}_{k}^{\top}\hat{\bm{\Sigma}}\hat{\bm{\Theta}}_{k}+\frac{8\Delta_{1}^{2}\log(1/\delta)}{mn\epsilon^{2}}}\biggr{]}.

Also, Jk(α)J_{k}(\alpha) is (ϵ,δ)(\epsilon,\delta)-differentially private.

Compared to the non-private counterpart in (javanmard2014confidence, ), our confidence interval has a similar form, with additional noise injected to ensure privacy. When the noise level is low, the confidence interval closely approximates its non-private counterpart, so privacy comes at almost no additional cost. When the privacy requirement is stringent, the confidence interval must be wider to attain the same confidence level.
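To make the construction concrete, the following minimal Python sketch computes the interval of Theorem 5 from its ingredients. All function and variable names are ours, and the inputs (the de-biased estimate, \hat{\sigma} from Algorithm 4, the plug-in variance term, the sensitivity constant \Delta_{1}, and the deterministic widening term) are assumed to be supplied by the corresponding algorithms; this is an illustration, not the paper's implementation.

```python
from statistics import NormalDist
import math

def private_ci(beta_debiased, sigma_hat, var_k, mn, eps, delta,
               delta1, bias=0.0, alpha=0.05):
    """(1 - alpha) confidence interval of the form in Theorem 5.

    var_k  : plug-in variance term Theta_k' Sigma_hat Theta_k
    delta1 : sensitivity constant Delta_1 of the de-biasing term
    bias   : deterministic widening term gamma * mu_s^2 / nu_s^2 * ...
             (set to 0 when the privacy cost is dominated by the
             statistical error, recovering the interval of Theorem 4)
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sigma_hat / math.sqrt(mn) * math.sqrt(
        var_k + 8 * delta1 ** 2 * math.log(1 / delta) / (mn * eps ** 2))
    return beta_debiased - bias - half_width, beta_debiased + bias + half_width
```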

Finally, for the simultaneous inference problem, we demonstrate that the α\alpha-quantile of the statistic MM in (4.2) is close to the α\alpha-quantile of UU computed by the bootstrap method in Algorithm 7, for each α(0,1)\alpha\in(0,1). The next theorem states the statistical properties of Algorithm 7.

Theorem 6

Assume the conditions in Theorem 4 hold. Additionally, we assume that slogdmn=o(1)\frac{s^{*}\log d}{\sqrt{mn}}=o(1) and the privacy cost is dominated by the statistical error, i.e., there exists a constant c>0c>0 such that s2log2dlog(1/δ)log3mnm2n2ϵ2cslogdmn\frac{s^{*2}\log^{2}d\log(1/\delta)\log^{3}mn}{m^{2}n^{2}\epsilon^{2}}\leq c\cdot\frac{s^{*}\log d}{mn}. We also assume that there exists a constant k0k_{0} such that log7(dmn)/mn1(mn)k0\log^{7}(dmn)/mn\leq\frac{1}{(mn)^{k_{0}}}, and that q=o(mn)q=o(mn), where qq is the number of bootstrap iterations. The noise level is chosen as B4=4Llogmcx(R+c0cxs)/mnB_{4}={4L\sqrt{\log m}}c_{x}(R+c_{0}c_{x}\sqrt{s^{*}})/{\sqrt{mn}}. Then, CU(α)C_{U}(\alpha) computed in Algorithm 7 satisfies

supα(0,1)|(MCU(α))α|=o(1).\sup_{\alpha\in(0,1)}|{\mathbb{P}}(M\leq C_{U}(\alpha))-\alpha|=o(1).

Theorem 6 has useful applications: we can obtain a good estimate of the α\alpha-quantile of UU via the bootstrap method and then use it to construct simultaneous confidence intervals or perform hypothesis testing. Numerical results supporting these claims are presented in Section 6.

5 Heterogeneous Federated Learning Setting

5.1 Methods and Algorithms

In this section, we consider a more general setting where the parameters of interest on each machine are not identical, but they share some similarities. Specifically, we consider the scenario where, on each machine i=1,2,,mi=1,2,\dots,m, we assume a linear regression model:

\bm{Y}_{i}=\bm{X}_{i}{\bm{\beta}}^{(i)}+\bm{W}_{i},

where 𝜷(i){\bm{\beta}}^{(i)} represents the true parameter on machine ii. We assume that the coordinates of each noise vector 𝑾i\bm{W}_{i} are i.i.d. sub-Gaussian, W_{ik}\sim subG(σ2)(\sigma^{2}) for k=1,2,\dots,n. We also assume that the rows of \bm{X}_{i} are i.i.d. sub-Gaussian with mean zero and covariance matrix 𝚺\bm{\Sigma}. We further quantify the similarity of the 𝜷(i){\bm{\beta}}^{(i)} by assuming that there exists a subset \mathcal{S}\subseteq\{1,2,\dots,d\} with |𝒮|=s0|\mathcal{S}|=s_{0} satisfying 𝜷𝒮(i1)=𝜷𝒮(i2){\bm{\beta}}^{(i_{1})}_{\mathcal{S}}={\bm{\beta}}^{(i_{2})}_{\mathcal{S}} for any i1,i2{1,2,,m}i_{1},i_{2}\in\{1,2,\dots,m\}.

A naive approach would be to estimate each {\bm{\beta}}^{(i)} separately on its own machine. However, in the context of federated learning, we can improve the estimation and obtain a sharper rate of convergence by exploiting the similarity of the models across machines. To achieve this, we decompose 𝜷(i){\bm{\beta}}^{(i)} into the sum of two vectors, 𝜷(i)=𝒖+𝒗i{\bm{\beta}}^{(i)}=\bm{u}+\bm{v}_{i}, where 𝒖\bm{u} captures the signals common to all 𝜷(i){\bm{\beta}}^{(i)}, and 𝒗i\bm{v}_{i} captures the signals unique to each machine.

We employ a two-stage procedure to estimate each 𝜷(i){\bm{\beta}}^{(i)}: in the first stage, we estimate 𝒖\bm{u} using Algorithm 2 with sparsity level \|\bm{u}\|_{0}=s_{0}, the number of shared signals. In the second stage, we estimate 𝒗i\bm{v}_{i} on each individual machine. Our final estimate of 𝜷(i){\bm{\beta}}^{(i)} is \hat{{\bm{\beta}}}^{(i)}=\hat{\bm{v}}_{i}+\hat{\bm{u}}. The procedure is summarized in Algorithm 8.

Input : Number of machines mm, dataset (𝒚i,𝑿i)[i=1,,m](\bm{y}_{i},\bm{X}_{i})_{[i=1,\dots,m]}, number of data on each machine nn, step size η0\eta^{0}, privacy parameters ε,δ\varepsilon,\delta, noise scale B5B_{5}, number of iterations TT, truncation level RR, feasibility parameter C0C_{0}, initial value 𝒗i0\bm{v}_{i}^{0}, sparsity level of similar vector s0s_{0}, sparsity level ss.
1
2Step 1: Estimate a s0s_{0} sparse vector 𝒖\bm{u} using Algorithm 2.
3 Step 2: Estimate an s1:=ss0s_{1}:=s-s_{0}-sparse vector 𝒗i\bm{v}_{i} with samples (𝒚i,𝑿i)(\bm{y}_{i},\bm{X}_{i}) on machine ii via the iterations in the for-loop below.
4 for tt from 0 to T1T-1 do
5         Compute \bm{v}_{i}^{t+0.5}=\bm{v}_{i}^{t}-(\eta^{0}/n)\sum_{j=1}^{n}(\bm{x}_{ij}^{\top}\bm{v}_{i}^{t}-\Pi_{R}(y_{ij}-\bm{x}_{ij}^{\top}\hat{\bm{u}}))\bm{x}_{ij};
6         𝒗it+1=ΠC0(NoisyHT(𝒗it+0.5,(𝒚i,𝑿i),s,ε/T,δ/T,η0B5/n))\bm{v}_{i}^{t+1}=\Pi_{C_{0}}\left(\text{NoisyHT}(\bm{v}_{i}^{t+0.5},(\bm{y}_{i},\bm{X}_{i}),s,\varepsilon/T,\delta/T,\eta^{0}B_{5}/n)\right).   
7 end for
8Step 3: Estimate 𝜷(i){\bm{\beta}}^{(i)} by 𝜷^(i):=𝒗^i+𝒖^\hat{{\bm{\beta}}}^{(i)}:=\hat{\bm{v}}_{i}+\hat{\bm{u}}.
Output : 𝜷^(i)\hat{{\bm{\beta}}}^{(i)}.
Algorithm 8 Differentially Private Sparse Linear Regression in Heterogeneous Federated Learning Setting
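As a rough illustration of Step 2, the following Python sketch runs the local iterations on a single machine. The paper's NoisyHT mechanism is stood in for here by a simplified "perturb, then keep the s1 largest entries" step, and the noise scale noise_sd is left as an input rather than calibrated to (ε, δ); this sketch therefore does not by itself carry a formal privacy guarantee, and all names are ours.

```python
import numpy as np

def estimate_v_local(X, y, u_hat, s1, T, eta, R, C0, noise_sd, rng=None):
    """Sketch of Step 2 of Algorithm 8 on one machine.

    Gradient steps are taken on the truncated residual Pi_R(y - X u_hat);
    the NoisyHT peeling step is replaced by adding Gaussian noise and
    keeping the s1 largest entries, followed by projecting the l2 norm
    onto the C0-ball (the feasibility projection Pi_{C0}).
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    v = np.zeros(d)
    target = np.clip(y - X @ u_hat, -R, R)            # Pi_R(y_j - x_j' u_hat)
    for _ in range(T):
        grad = X.T @ (X @ v - target) / n             # least-squares gradient
        v_half = v - eta * grad
        noisy = v_half + noise_sd * rng.standard_normal(d)
        keep = np.argsort(np.abs(noisy))[-s1:]        # hard threshold to s1 entries
        v = np.zeros(d)
        v[keep] = noisy[keep]
        nrm = np.linalg.norm(v)
        if nrm > C0:                                  # feasibility projection
            v *= C0 / nrm
    return v
```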

Similar to the previous section, we next address inference problems. Our algorithms consist of two parts: the construction of coordinate-wise confidence intervals and simultaneous inference. We begin by describing the algorithm for coordinate-wise confidence intervals in Algorithm 9.

Input : Number of machines mm, dataset (𝑿i,𝒀i)[i=1,,m](\bm{X}_{i},\bm{Y}_{i})_{[i=1,\dots,m]}, number of data points in each machine nn, privacy parameters ε,δ\varepsilon,\delta, truncation level RR, sparsity ss, estimated parameters 𝜷^(i)\hat{{\bm{\beta}}}^{(i)}, 𝚯^k\hat{\bm{\Theta}}_{k} from Algorithms 8 and 3, and estimated eigenvalues μ^s\hat{\mu}_{s}, ν^s\hat{\nu}_{s} from Algorithm 5, constants Δ1,γ\Delta_{1},\gamma.
1 Step 1:  Generate a random variable E3E_{3} from a Gaussian distribution N(0,8Δ12log(1.25/δ)/(n2ϵ2))N(0,8\Delta_{1}^{2}\log(1.25/\delta)/(n^{2}\epsilon^{2})).
2Step 2:  Calculate the de-biased estimate \hat{\beta}^{(i,\text{u})}_{k}=\hat{\beta}^{(i)}_{k}+\frac{1}{n}\sum_{j=1}^{n}(\hat{\bm{\Theta}}_{k}^{\top}\bm{X}_{ij}\Pi_{R}(y_{ij})-\hat{\bm{\Theta}}_{k}^{\top}\bm{X}_{ij}\bm{X}_{ij}^{\top}\hat{{\bm{\beta}}}^{(i)})+E_{3}.
3Step 3:  Calculate the confidence interval Jk(α)J_{k}(\alpha).
Jk(α)=\displaystyle J_{k}(\alpha)= [β^k(i,u)aσΦ1(1α/2)n𝚯^k𝚺^𝚯^k+8Δ12log(1/δ)nϵ2,\displaystyle\biggl{[}\hat{\beta}^{(i,\text{u})}_{k}-a-\frac{\sigma\Phi^{-1}(1-\alpha/2)}{\sqrt{n}}\sqrt{\hat{\bm{\Theta}}_{k}^{\top}\hat{\bm{\Sigma}}\hat{\bm{\Theta}}_{k}+\frac{8\Delta_{1}^{2}\log(1/\delta)}{n\epsilon^{2}}},
β^k(i,u)+a+σΦ1(1α/2)n𝚯^k𝚺^𝚯^k+8Δ12log(1/δ)nϵ2],\displaystyle\hat{\beta}^{(i,\text{u})}_{k}+a+\frac{\sigma\Phi^{-1}(1-\alpha/2)}{\sqrt{n}}\sqrt{\hat{\bm{\Theta}}_{k}^{\top}\hat{\bm{\Sigma}}\hat{\bm{\Theta}}_{k}+\frac{8\Delta_{1}^{2}\log(1/\delta)}{n\epsilon^{2}}}\biggr{]},
where aa is defined in (5.1).
4Output: Return the final result Jk(α)J_{k}(\alpha).
Algorithm 9 Differentially Private Coordinate-wise Confidence interval for βk\beta_{k} in Heterogeneous Federated Learning

In Algorithm 9, \hat{\bm{\Theta}}_{k} is the (ϵ,δ)(\epsilon,\delta)-differentially private estimate of the kk-th row of the precision matrix corresponding to the covariance matrix Σ^=1/(mn)i=1mj=1n𝑿ij𝑿ij\hat{\Sigma}=1/(mn)\sum_{i=1}^{m}\sum_{j=1}^{n}\bm{X}_{ij}\bm{X}_{ij}^{\top}. We define the variable aa in Step 3 by

a:=2γμ^s2ν^s2s12log2dlog(1/δ)log3mnm2n2ϵ2+2γμ^s2ν^s2s02log2dlog(1/δ)log3nn2ϵ2.\displaystyle a:=\frac{2\gamma\hat{\mu}_{s}^{2}}{\hat{\nu}_{s}^{2}}\frac{s_{1}^{2}\log^{2}d\log(1/\delta)\log^{3}mn}{m^{2}n^{2}\epsilon^{2}}+\frac{2\gamma\hat{\mu}_{s}^{2}}{\hat{\nu}_{s}^{2}}\frac{s_{0}^{2}\log^{2}d\log(1/\delta)\log^{3}n}{n^{2}\epsilon^{2}}. (5.1)
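For concreteness, a minimal sketch of Step 2 of Algorithm 9 is given below. Variable names are ours, and the noise standard deviation noise_sd is assumed to be pre-computed from Δ1, ε, and δ as in Step 1; this is an illustration rather than the paper's implementation.

```python
import numpy as np

def debias_coordinate(k, beta_hat, Theta_k, X, y, R, noise_sd, rng=None):
    """De-biased k-th coordinate on one machine:
    beta_hat_k + (1/n) * sum_j Theta_k' X_j (Pi_R(y_j) - X_j' beta_hat) + E_3.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]
    resid = np.clip(y, -R, R) - X @ beta_hat          # Pi_R(y_j) - X_j' beta_hat
    correction = (X @ Theta_k) @ resid / n            # average de-biasing term
    E3 = noise_sd * rng.standard_normal()             # Gaussian privacy noise
    return beta_hat[k] + correction + E3
```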

We then provide Algorithm 10 for the simultaneous inference problem. Similar to the previous chapter, we can perform simultaneous inference for each 𝜷(i){\bm{\beta}}^{(i)} to build simultaneous confidence interval and hypothesis testing.

Input : Dataset (𝒚i,𝑿i)(\bm{y}_{i},\bm{X}_{i}), number of data nn, privacy parameters ε,δ\varepsilon,\delta, noise standard deviation \sigma, estimators of parameters 𝜷^(i)\hat{{\bm{\beta}}}^{(i)}, 𝚯^\hat{\bm{\Theta}} from Algorithms 8 and 3, number of bootstrap iterations qq, quantile α\alpha, noise level B6B_{6}.
1 for tt from 0 to qq do
2       Generate nn independent standard Gaussian random variables e1,,ene_{1},\dots,e_{n}.
3       Calculate Ut=Privatemax([σnj=1nΘ^𝑿ijej]G,ϵ,δ,B6)U_{t}=\|\text{Privatemax}([\frac{\sigma}{\sqrt{n}}\sum_{j=1}^{n}\hat{\Theta}\bm{X}_{ij}e_{j}]_{G},\epsilon,\delta,B_{6})\|_{\infty}
4 end for
5
6Output: Compute the α\alpha-quantile CU(α)C_{U}(\alpha) of (|U1|,|U2|,,|Uq|)(|U_{1}|,|U_{2}|,\dots,|U_{q}|) for α(0,1)\alpha\in(0,1).
Algorithm 10 Private Bootstrap Method for Simultaneous Inference in Heterogeneous Federated Learning for Machine i{1,,m}i\in\{1,\dots,m\}

Compared with Algorithm 7, introduced for simultaneous inference in the homogeneous federated learning setting, the bootstrap in Algorithm 10 runs entirely on the local machine of interest. Using the output from Algorithm 10, we build a simultaneous confidence interval for each βk(i)\beta_{k}^{(i)} (kGk\in G) using CU(α)C_{U}(\alpha) by

[β^k(i,u)1nCU(α),β^k(i,u)+1nCU(α)].\quantity[\hat{\beta}_{k}^{(i,\text{u})}-\frac{1}{\sqrt{n}}C_{U}(\alpha),\ \ \hat{\beta}_{k}^{(i,\text{u})}+\frac{1}{\sqrt{n}}C_{U}(\alpha)].
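A minimal sketch of this local multiplier bootstrap is given below. It is hedged: the paper's Privatemax mechanism is stood in for by adding Laplace noise of scale B6 to the maximum, and all names are ours; the sketch only illustrates the structure of Algorithm 10 and the resulting interval.

```python
import numpy as np

def bootstrap_quantile(Theta_G, X, sigma, alpha, q=500, B6=0.0, rng=None):
    """Multiplier bootstrap for the simultaneous statistic on one machine.

    Theta_G : |G| x d block of the estimated precision matrix
    X       : n x d local design matrix
    Returns the alpha-quantile C_U(alpha) of the noisy bootstrap maxima.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]
    U = np.empty(q)
    for t in range(q):
        e = rng.standard_normal(n)                        # multiplier weights
        stat = sigma / np.sqrt(n) * Theta_G @ (X.T @ e)   # |G|-dimensional
        U[t] = np.max(np.abs(stat)) + B6 * rng.laplace()  # noisy max (stand-in)
    return np.quantile(np.abs(U), alpha)

# The simultaneous interval for beta_k^(i), k in G, is then
# [debiased_k - C_U(alpha) / sqrt(n), debiased_k + C_U(alpha) / sqrt(n)].
```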

5.2 Theoretical Results

In this subsection, we provide theoretical analysis for the algorithms in heterogeneous federated learning settings. We begin our theoretical analysis with the estimation problem, which resembles Theorem 3.

Intuitively, when 𝜷(i){\bm{\beta}}^{(i)} are similar but not identical, federated learning can be used to estimate their common elements and the remaining parameters can be estimated individually on each machine. This results in a sharper rate of convergence as the estimation of the common component 𝒖\bm{u} can exploit the information from more data points. We summarize the result in Theorem 7.

Theorem 7

Assume that the conditions in Theorem 5 hold. Further assume that for Algorithm 8, 𝐯i0=s1=ss0\|\bm{v}_{i}\|_{0}=s_{1}=s-s_{0} for all i=1,,mi=1,\dots,m, 𝐮0=s0\|\bm{u}\|_{0}=s_{0}, 𝐮2c0/2\|\bm{u}\|_{2}\leq c_{0}/2, and 𝐯i2c0/2\|\bm{v}_{i}\|_{2}\leq c_{0}/2. Let B5=cx(2R+s1c0cx)B_{5}=c_{x}(2R+\sqrt{s_{1}}c_{0}c_{x}). Then, for the output 𝛃^(i)\hat{{\bm{\beta}}}^{(i)} from Algorithm 8, we have

\displaystyle\|\hat{{\bm{\beta}}}^{(i)}-{\bm{\beta}}^{(i)}\|_{2}^{2}\leq c_{0}\frac{s_{0}\log d}{mn}+c_{1}\frac{{s_{0}}^{2}\log^{2}d\log(1/\delta)\log^{3}mn}{m^{2}n^{2}\epsilon^{2}}+c_{2}\frac{s_{1}\log d}{n}+c_{3}\frac{{s_{1}}^{2}\log^{2}d\log(1/\delta)\log^{3}n}{n^{2}\epsilon^{2}}, (5.2)

where c0,c1,c2,c3>0c_{0},c_{1},c_{2},c_{3}>0 are some constants.

In the case where s_{0}\ll s_{1}, i.e., the models are largely different across machines, the third and fourth terms on the right-hand side of (5.2) dominate the estimation error, and the estimation accuracy of 𝜷(i){\bm{\beta}}^{(i)} via federated learning becomes close to that obtained with a single machine (m=1m=1). At a high level, this is because information from other machines is not helpful when the models differ substantially across machines. In contrast, when s_{0}\gg s_{1}, federated learning can leverage the similarity of the models to improve estimation accuracy, and the rate in (5.2) approaches the rate in (4.2) for the homogeneous federated learning setting as s_{1}/s_{0}\to 0.

We next present our results for the inference problems. To start, we verify that the output from Algorithm 9 is an asymptotically valid 1α1-\alpha confidence interval for βk(i)\beta^{(i)}_{k}.

Theorem 8

Assume the conditions in Theorem 3 hold, and assume that slogdn=o(1)\frac{s^{*}\log d}{\sqrt{n}}=o(1) and max(2γμ^s2ν^s2s12log2dlog(1/δ)log3mnm2n2ϵ2,2γμ^s2ν^s2s02log2dlog(1/δ)log3nn2ϵ2)=o(1)\max(\frac{2\gamma\hat{\mu}_{s}^{2}}{\hat{\nu}_{s}^{2}}\frac{s_{1}^{2}\log^{2}d\log(1/\delta)\log^{3}mn}{m^{2}n^{2}\epsilon^{2}},\frac{2\gamma\hat{\mu}_{s}^{2}}{\hat{\nu}_{s}^{2}}\frac{s_{0}^{2}\log^{2}d\log(1/\delta)\log^{3}n}{n^{2}\epsilon^{2}})=o(1). Let aa be the variable defined in (5.1). Then, for the de-biased estimator β^k(i,u)\hat{\beta}_{k}^{(i,\text{u})} defined in (4.1), the constructed confidence interval is asymptotically valid:

limn(βk(i)Jk(α))=1α,\lim_{n\to\infty}\mathbb{P}(\beta_{k}^{(i)}\in J_{k}(\alpha))=1-\alpha,

where

Jk(α)=\displaystyle J_{k}(\alpha)= [β^k(i,u)aσ^Φ1(1α/2)n𝚯^k𝚺^𝚯^k+8Δ12log(1/δ)nϵ2,\displaystyle\biggl{[}\hat{\beta}_{k}^{(i,\text{u})}-a-\frac{\hat{\sigma}\Phi^{-1}(1-\alpha/2)}{\sqrt{n}}\sqrt{\hat{\bm{\Theta}}_{k}^{\top}\hat{\bm{\Sigma}}\hat{\bm{\Theta}}_{k}+\frac{8\Delta_{1}^{2}\log(1/\delta)}{n\epsilon^{2}}},
β^k(i,u)+a+σ^Φ1(1α/2)n𝚯^k𝚺^𝚯^k+8Δ12log(1/δ)nϵ2]\displaystyle\hat{\beta}_{k}^{(i,\text{u})}+a+\frac{\hat{\sigma}\Phi^{-1}(1-\alpha/2)}{\sqrt{n}}\sqrt{\hat{\bm{\Theta}}_{k}^{\top}\hat{\bm{\Sigma}}\hat{\bm{\Theta}}_{k}+\frac{8\Delta_{1}^{2}\log(1/\delta)}{n\epsilon^{2}}}\biggr{]}

Also, Jk(α)J_{k}(\alpha) is (ϵ,δ)(\epsilon,\delta)-differentially private.

Finally, we provide a statistical guarantee for Algorithm 10. Similar to the previous section, we define MM as:

M=M(𝜷^(i,u))=maxkG|n(β^k(i,u)βk(i))|.M=M(\hat{{\bm{\beta}}}^{(i,\text{u})})=\max_{k\in G}|\sqrt{n}(\hat{\beta}_{k}^{(i,\text{u})}-\beta_{k}^{(i)})|.
Theorem 9

Assume that the conditions in Theorem 4 hold. We additionally assume that slogdn=o(1)\frac{s^{*}\log d}{\sqrt{n}}=o(1) and the privacy cost is dominated by the statistical error, i.e., there exists a constant cc such that s2log2dlog(1/δ)log3mnm2n2ϵ2cslogdmn\frac{s^{*2}\log^{2}d\log(1/\delta)\log^{3}mn}{m^{2}n^{2}\epsilon^{2}}\leq c\cdot\frac{s^{*}\log d}{mn} and s2log2dlog(1/δ)log3nn2ϵ2cslogdn\frac{s^{*2}\log^{2}d\log(1/\delta)\log^{3}n}{n^{2}\epsilon^{2}}\leq c\cdot\frac{s^{*}\log d}{n}. We also assume that there exists a constant k0k_{0} such that log7(dn)/n1nk0\log^{7}(dn)/n\leq\frac{1}{n^{k_{0}}}. The noise level is chosen as B6=2slognncxc1B_{6}=2\sqrt{\frac{s\log n}{n}}c_{x}c_{1}. Then,

supα(0,1)|(MCU(α))α|=o(1).\sup_{\alpha\in(0,1)}|{\mathbb{P}}(M\leq C_{U}(\alpha))-\alpha|=o(1).

Theorem 9 states that the α\alpha-quantile of MM is asymptotically close to CU(α)C_{U}(\alpha), which validates the 1α1-\alpha simultaneous confidence intervals based on CU(α)C_{U}(\alpha) obtained by the bootstrap method. This result allows us to perform simultaneous inference, such as constructing simultaneous confidence intervals and conducting hypothesis tests based on CU(α)C_{U}(\alpha).

6 Simulations

In this section, we conduct simulations to investigate the performance of the algorithms proposed in the preceding sections. Specifically, we explore the more challenging heterogeneous federated learning setting, where each machine operates under a different but similar model. Our simulations consist of three main parts: estimation, coordinate-wise confidence intervals, and simultaneous inference.

In Section 6.1, we present the simulation results for the coordinate-wise estimation problem within a private federated setting, discussing the differences between the estimated 𝜷^\hat{{\bm{\beta}}} and the true 𝜷{\bm{\beta}}^{*} across various scenarios. We also examine the coverage of our proposed confidence intervals. Section 6.2 extends the settings to simultaneous inference.

We generate simulation datasets as follows. First, we sample the data 𝑿i\bm{X}_{i}, for i=1,2,,mi=1,2,\ldots,m, where the rows of each \bm{X}_{i} are i.i.d. Gaussian with mean zero and covariance matrix 𝚺\bm{\Sigma}. We set 𝚺\bm{\Sigma} such that for each j,j{1,2,,d}j,j^{\prime}\in\{1,2,\ldots,d\}, Σj,j=0.5|jj|\Sigma_{j,j^{\prime}}=0.5^{|j-j^{\prime}|}. On each machine, we assume an ss^{*}-sparse unit vector 𝜷(i){\bm{\beta}}^{(i)} with s=s0+s1s^{*}=s_{0}+s_{1}, where s0s_{0} is the number of non-zero shared signals. For each 𝜷(i){\bm{\beta}}^{(i)}, we set the first s0s_{0} shared elements to 1/s1/\sqrt{s^{*}} and additionally set s1s_{1} machine-specific entries, selected from the remaining ds0d-s_{0} indices, to 1/s1/\sqrt{s^{*}}. We then compute 𝒀i=𝑿i𝜷(i)+𝑾i\bm{Y}_{i}=\bm{X}_{i}{\bm{\beta}}^{(i)}+\bm{W}_{i}, where each 𝑾i\bm{W}_{i} follows a Gaussian distribution N(𝟎,σ2𝑰)N(\bm{0},\sigma^{2}\bm{I}) with σ=0.5\sigma=0.5.
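For concreteness, the data-generating process just described can be sketched as follows (a minimal Python sketch; function and variable names are ours):

```python
import numpy as np

def generate_data(m, n, d, s_star, s0, sigma=0.5, seed=0):
    """Heterogeneous federated data: AR(1)-type covariance 0.5^{|j-k|},
    shared support on the first s0 coordinates, and s1 = s_star - s0
    machine-specific coordinates drawn from the remaining d - s0 indices;
    all non-zero entries equal 1/sqrt(s_star)."""
    rng = np.random.default_rng(seed)
    Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(d), np.arange(d)))
    L = np.linalg.cholesky(Sigma)
    data, betas = [], []
    for _ in range(m):
        beta = np.zeros(d)
        beta[:s0] = 1 / np.sqrt(s_star)                        # shared signal
        extra = rng.choice(np.arange(s0, d), size=s_star - s0, replace=False)
        beta[extra] = 1 / np.sqrt(s_star)                      # machine-specific
        X = rng.standard_normal((n, d)) @ L.T                  # rows ~ N(0, Sigma)
        Y = X @ beta + sigma * rng.standard_normal(n)
        data.append((X, Y))
        betas.append(beta)
    return data, betas
```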

6.1 Estimation and Confidence Interval

In this subsection, we investigate the estimation accuracy and confidence interval coverage of our algorithm for coordinate-wise inference. Namely, we consider the following scenarios:

  • Fix the number of machines m=15, privacy parameters \epsilon=0.8 and \delta=1/(2mn), dimension d=800, sparsity s^{*}=15, and shared sparsity s_{0}=8. Set the number of samples on each machine to n=3000, 4000, 5000, respectively.

  • Fix the number of samples on each machine n=4000, \epsilon=0.8, \delta=1/(2mn), d=800, s^{*}=15, and s_{0}=8. Set the number of machines to m=10, 15, 20, respectively.

  • Fix m=15, n=4000, \epsilon=0.8, \delta=1/(2mn), d=800, and s_{0}=8. Set s^{*}=10, 15, 20, respectively.

  • Fix m=15, n=4000, \epsilon=0.8, \delta=1/(2mn), d=800, and s^{*}=15. Set s_{0}=4, 8, 12, respectively.

  • Fix m=15, n=4000, \delta=1/(2mn), d=800, s^{*}=15, and s_{0}=8. Set \epsilon=0.3, 0.5, 0.8, respectively.

  • Fix m=15, n=4000, \epsilon=0.8, \delta=1/(2mn), s^{*}=15, and s_{0}=8. Set d=600, 800, 1000, respectively.

For each setting, we report the average estimation error 𝜷^𝜷22\|\hat{{\bm{\beta}}}-{\bm{\beta}}^{*}\|_{2}^{2} over 50 replications. Also, in each setting, we calculate a 95% confidence interval (1-\alpha=0.95) for each coordinate of 𝜷{\bm{\beta}}^{*} using our proposed algorithm. To evaluate the quality of the confidence intervals, we define cov{\rm cov} as the coverage of the confidence interval:

cov:=d1i=1d[βiJi(α)].{\rm cov}:=d^{-1}\sum_{i=1}^{d}{\mathbb{P}}[\beta^{*}_{i}\in J_{i}(\alpha)].

We also define the coverage for non-zero and zero entries of 𝜷\bm{\beta}^{*} by cov𝒮{\rm cov}_{\mathcal{S}} and cov𝒮c{\rm cov}_{\mathcal{S}^{c}}, respectively, where 𝒮\mathcal{S} is the set of non-zero indices in 𝜷{\bm{\beta}}^{*}.

cov𝒮=|𝒮|1i𝒮[βiJi(α)],cov𝒮c=|𝒮c|1i𝒮c[βiJi(α)].{\rm cov}_{\mathcal{S}}=|\mathcal{S}|^{-1}\sum_{i\in\mathcal{S}}{\mathbb{P}}[\beta^{*}_{i}\in J_{i}(\alpha)]\quad,\quad{\rm cov}_{\mathcal{S}^{c}}=|\mathcal{S}^{c}|^{-1}\sum_{i\in\mathcal{S}^{c}}{\mathbb{P}}[\beta^{*}_{i}\in J_{i}(\alpha)].
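In practice, these coverage measures are estimated empirically over the replications; a minimal sketch of the computation (names ours) is given below.

```python
import numpy as np

def empirical_coverage(ci_list, beta_true, support):
    """cov, cov_S, cov_{S^c} estimated over replications.

    ci_list : list of (lower, upper) arrays of length d, one per replication
    support : index set S of non-zero coordinates of beta_true
    """
    hits = np.mean([(lo <= beta_true) & (beta_true <= hi) for lo, hi in ci_list],
                   axis=0)                     # per-coordinate coverage frequency
    in_S = np.zeros(len(beta_true), dtype=bool)
    in_S[list(support)] = True
    return hits.mean(), hits[in_S].mean(), hits[~in_S].mean()
```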

We report the estimation error, the coverage of the true parameters, and the average length of the confidence intervals for each configuration listed above in Table 2:

Simulation Results
(n,m,d,s,s0,ϵ)(n,m,d,s^{*},s_{0},\epsilon) Estimation Error (SD) cov{\rm cov} cov𝒮{\rm cov}_{\mathcal{S}} cov𝒮c{\rm cov}_{\mathcal{S}^{c}} CI Length
(3000,15,800,15,8,0.8) 0.0213 (0.0028) 0.940 0.929 0.940 0.0532
(4000,15,800,15,8,0.8) 0.0170 (0.0032) 0.945 0.960 0.944 0.0437
(5000,15,800,15,8,0.8) 0.0141 (0.0021) 0.940 0.945 0.940 0.0378
(4000,10,800,15,8,0.8) 0.0218 (0.0047) 0.945 0.945 0.945 0.0437
(4000,20,800,15,8,0.8) 0.0126 (0.0025) 0.944 0.941 0.944 0.0437
(4000,15,600,15,8,0.8) 0.0162 (0.0031) 0.946 0.933 0.946 0.0436
(4000,15,1000,15,8,0.8) 0.0191 (0.0027) 0.940 0.933 0.940 0.0439
(4000,15,800,15,4,0.8) 0.0188 (0.0032) 0.952 0.945 0.953 0.0420
(4000,15,800,15,12,0.8) 0.0137 (0.0016) 0.944 0.937 0.944 0.0462
(4000,15,800,10,8,0.8) 0.0105 (0.0017) 0.946 0.947 0.946 0.0389
(4000,15,800,20,8,0.8) 0.0243 (0.0036) 0.941 0.932 0.941 0.0497
(4000,15,800,15,8,0.5) 0.0240 (0.0038) 0.940 0.949 0.940 0.0550
(4000,15,800,15,8,0.3) 0.0943 (0.0281) 0.928 0.941 0.928 0.0792
Table 2: Simulation results for private federated linear regression.

From Table 2, we observe results consistent with our theory. For the estimation error, the error decreases as ϵ\epsilon increases, i.e., as the privacy requirement is relaxed. Also, more data points on each machine, more machines, and a smaller sparsity level all lead to better estimation accuracy. For the confidence intervals, the coverage is close to 0.950.95 for cov{\rm cov}, cov𝒮{\rm cov}_{\mathcal{S}}, and cov𝒮c{\rm cov}_{\mathcal{S}^{c}}, and is stable across settings. To further illustrate this, we pick the setting (n,m,d,s,s0,ϵ)=(4000,15,800,15,8,0.8)(n,m,d,s,s_{0},\epsilon)=(4000,15,800,15,8,0.8) and plot the confidence intervals against the true values over 50 replications in Figure 3, for 60 coordinates randomly selected out of the 800800.

Fig 3: Confidence intervals for βk\beta_{k} for 60 coordinates kk randomly selected from the 800800 coordinates. The vertical axis shows the value of βk\beta_{k}; red points indicate the true βk\beta_{k} and black points the estimated βk\beta_{k}. Results are averaged over 50 replications.

We also summarize our results in Figure 4, where we plot the estimation error against the number of samples, the sparsity, and the number of machines. For the left figure, we fixed m=15m=15, d=800d=800, s=16s^{*}=16, s0=8s_{0}=8; for the middle figure, we fixed n=4000n=4000, m=15m=15, d=800d=800, ϵ=0.5\epsilon=0.5; and for the right figure, we fixed d=800d=800, s=16s^{*}=16, s0=8s_{0}=8, ϵ=0.5\epsilon=0.5. The error is averaged over 200200 replications.

Fig 4: Estimation results. Left: log estimation error for different numbers of samples nn. Middle: log estimation error for different sparsity levels s^{*}. Right: log estimation error for different numbers of machines mm.

From the left panel of Figure 4, we observe that the error decreases as nn increases, and that a larger privacy parameter ϵ\epsilon yields a smaller estimation error. From the middle panel, we observe that the estimation error increases as the sparsity level s^{*} grows, while for a fixed s^{*} a larger shared sparsity s0s_{0} yields a smaller estimation error, consistent with Theorem 7. In the right panel, we observe a consistent decrease in the error as the number of machines increases. All these observations support Theorem 7.

6.2 Simultaneous Inference

In this subsection, we investigate our proposed algorithm for the simultaneous inference problem. We aim to build simultaneous confidence intervals at level 1α1-\alpha with α=0.05\alpha=0.05 under three settings: G={1,2,,d}G=\{1,2,\dots,d\}, G=SG=S, and G=ScG=S^{c}. For each setting, we run 50 replications and report the coverage and length of the confidence intervals. The results are shown in Table 5.

Simulation Results for Simultaneous Inference
(n,m,d,s,s0,ϵ)(n,m,d,s^{*},s_{0},\epsilon) cov{\rm cov} cov𝒮{\rm cov}_{\mathcal{S}} cov𝒮c{\rm cov}_{\mathcal{S}^{c}} len(cov{\rm cov}) len(cov𝒮{\rm cov}_{\mathcal{S}}) len(cov𝒮c{\rm cov}_{\mathcal{S}^{c}})
(3000,15,800,15,8,0.8) 0.981 0.883 0.983 0.091 0.066 0.091
(4000,15,800,15,8,0.8) 0.985 0.910 0.987 0.079 0.057 0.079
(5000,15,800,15,8,0.8) 0.987 0.875 0.990 0.071 0.051 0.071
(4000,10,800,15,8,0.8) 0.989 0.894 0.991 0.079 0.057 0.079
(4000,20,800,15,8,0.8) 0.983 0.898 0.986 0.079 0.057 0.079
(4000,15,600,15,8,0.8) 0.993 0.878 0.995 0.077 0.057 0.077
(4000,15,1000,15,8,0.8) 0.994 0.878 0.997 0.080 0.057 0.080
(4000,15,800,15,4,0.8) 0.983 0.772 0.993 0.079 0.057 0.079
(4000,15,800,15,12,0.8) 0.975 0.957 0.974 0.079 0.058 0.079
(4000,15,800,10,8,0.8) 0.986 0.976 0.985 0.079 0.055 0.079
(4000,15,800,20,8,0.8) 0.974 0.850 0.982 0.078 0.059 0.078
(4000,15,800,15,8,0.5) 0.940 0.882 0.940 0.103 0.083 0.102
(4000,15,800,15,8,0.3) 0.953 0.789 0.975 0.127 0.097 0.126
Table 5: Simulation results of the private simultaneous inference in different settings.

From the simulation results, we observe that our proposed simultaneous confidence intervals mostly exhibit over-coverage for G=𝒮cG=\mathcal{S}^{c} and under-coverage for G=𝒮G=\mathcal{S}. This pattern has also been observed in previous works on simultaneous inference (van2014asymptotically, ; zhang2017simultaneous, ), so it is likely attributable to the inherent nature of simultaneous inference rather than to our algorithm.

7 Discussions and Future Work

In this paper, we study high-dimensional estimation and inference problems in the context of federated learning. In scenarios involving an untrusted central server, our findings reveal that accurate estimation is infeasible, as the estimation error scales with the dimension dd. Conversely, in the trusted central server setting, we developed algorithms that achieve an optimal rate of convergence. We also explored inference problems, detailing methodologies for both coordinate-wise confidence intervals and simultaneous inference.

There are several directions for further research. Currently, our models presume that each machine operates under a linear regression framework; we could expand our algorithms to accommodate more complex models, such as generalized linear models, classification models, or broader machine learning models. Moreover, an interesting extension would be to refine our notion of model similarity across machines. Although Section 5 bases model similarity on the L0L_{0} norm, reflecting shared sparsity patterns, future studies could explore LpL_{p}-norm-based similarities, particularly the L1L_{1} and L2L_{2} norms, to further enrich the heterogeneous federated learning setting.

References

  • [1] Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, pages 308–318, 2016.
  • [2] Jayadev Acharya, Clément L Canonne, and Himanshu Tyagi. General lower bounds for interactive high-dimensional estimation under information constraints. arXiv preprint arXiv:2010.06562, 2020.
  • [3] Naman Agarwal, Ananda Theertha Suresh, Felix Yu, Sanjiv Kumar, and H Brendan Mcmahan. cpsgd: Communication-efficient and differentially-private distributed sgd. arXiv preprint arXiv:1805.10559, 2018.
  • [4] Leighton Pate Barnes, Wei-Ning Chen, and Ayfer Özgür. Fisher information under local differential privacy. IEEE Journal on Selected Areas in Information Theory, 1(3):645–659, 2020.
  • [5] Leighton Pate Barnes, Yanjun Han, and Ayfer Ozgur. Lower bounds for learning distributions under communication constraints via fisher information. Journal of Machine Learning Research, 21(236):1–30, 2020.
  • [6] Andrea Bittau, Úlfar Erlingsson, Petros Maniatis, Ilya Mironov, Ananth Raghunathan, David Lie, Mitch Rudominer, Ushasree Kode, Julien Tinnes, and Bernhard Seefeld. Prochlo: Strong privacy for analytics in the crowd. In Proceedings of the 26th symposium on operating systems principles, pages 441–459, 2017.
  • [7] Mark Braverman, Ankit Garg, Tengyu Ma, Huy L Nguyen, and David P Woodruff. Communication lower bounds for statistical estimation problems via a distributed data processing inequality. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 1011–1020, 2016.
  • [8] T Tony Cai, Yichen Wang, and Linjun Zhang. The cost of privacy: Optimal rates of convergence for parameter estimation with differential privacy. arXiv preprint arXiv:1902.04495, 2019.
  • [9] T Tony Cai, Yichen Wang, and Linjun Zhang. The cost of privacy in generalized linear models: Algorithms and minimax lower bounds. arXiv preprint arXiv:2011.03900, 2020.
  • [10] Victor Chernozhukov, Denis Chetverikov, and Kengo Kato. Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors. The Annals of Statistics, 41(6):2786–2819, 2013.
  • [11] John Duchi and Ryan Rogers. Lower bounds for locally private estimation via communication complexity. In Conference on Learning Theory, pages 1161–1191. PMLR, 2019.
  • [12] Cynthia Dwork and Vitaly Feldman. Privacy-preserving prediction. In Conference On Learning Theory, pages 1693–1702. PMLR, 2018.
  • [13] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of cryptography conference, pages 265–284. Springer, 2006.
  • [14] Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211–407, 2014.
  • [15] Cynthia Dwork, Adam Smith, Thomas Steinke, and Jonathan Ullman. Exposed! a survey of attacks on private data. Annual Review of Statistics and Its Application, 4:61–84, 2017.
  • [16] Cynthia Dwork, Weijie J Su, and Li Zhang. Differentially private false discovery rate control. arXiv preprint arXiv:1807.04209, 2018.
  • [17] Úlfar Erlingsson, Vitaly Feldman, Ilya Mironov, Ananth Raghunathan, Shuang Song, Kunal Talwar, and Abhradeep Thakurta. Encode, shuffle, analyze privacy revisited: Formalizations and empirical evaluation. arXiv preprint arXiv:2001.03618, 2020.
  • [18] Ankit Garg, Tengyu Ma, and Huy Nguyen. On communication cost of distributed statistical estimation and dimensionality. Advances in Neural Information Processing Systems, 27:2726–2734, 2014.
  • [19] Robin C Geyer, Tassilo Klein, and Moin Nabi. Differentially private federated learning: A client level perspective. arXiv preprint arXiv:1712.07557, 2017.
  • [20] Antonious Girgis, Deepesh Data, Suhas Diggavi, Peter Kairouz, and Ananda Theertha Suresh. Shuffled model of differential privacy in federated learning. In International Conference on Artificial Intelligence and Statistics, pages 2521–2529. PMLR, 2021.
  • [21] Yanjun Han, Ayfer Özgür, and Tsachy Weissman. Geometric lower bounds for distributed parameter estimation under communication constraints. In Conference On Learning Theory, pages 3163–3188. PMLR, 2018.
  • [22] Torsten Hothorn, Frank Bretz, and Peter Westfall. Simultaneous inference in general parametric models. Biometrical Journal: Journal of Mathematical Methods in Biosciences, 50(3):346–363, 2008.
  • [23] Rui Hu, Yuanxiong Guo, Hongning Li, Qingqi Pei, and Yanmin Gong. Personalized federated learning with differential privacy. IEEE Internet of Things Journal, 7(10):9530–9539, 2020.
  • [24] Adel Javanmard and Andrea Montanari. Confidence intervals and hypothesis testing for high-dimensional regression. The Journal of Machine Learning Research, 15(1):2869–2909, 2014.
  • [25] Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. Foundations and Trends® in Machine Learning, 14(1–2):1–210, 2021.
  • [26] Jakub Konečný, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.
  • [27] Jason D Lee, Qiang Liu, Yuekai Sun, and Jonathan E Taylor. Communication-efficient sparse regression. The Journal of Machine Learning Research, 18(1):115–144, 2017.
  • [28] Mengchu Li, Ye Tian, Yang Feng, and Yi Yu. Federated transfer learning with differential privacy. arXiv preprint arXiv:2403.11343, 2024.
  • [29] Sai Li, T Tony Cai, and Hongzhe Li. Transfer learning for high-dimensional linear regression: Prediction, estimation, and minimax optimality. arXiv preprint arXiv:2006.10593, 2020.
  • [30] Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. Federated learning: Challenges, methods, and future directions. IEEE signal processing magazine, 37(3):50–60, 2020.
  • [31] Andrew Lowy and Meisam Razaviyayn. Private federated learning without a trusted server: Optimal algorithms for convex losses. arXiv preprint arXiv:2106.09779, 2021.
  • [32] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pages 1273–1282. PMLR, 2017.
  • [33] Brendan McMahan and Abhradeep Thakurta. Federated learning with formal differential privacy guarantees. Google AI Blog, 2022.
  • [34] H Brendan McMahan, Daniel Ramage, Kunal Talwar, and Li Zhang. Learning differentially private recurrent language models. arXiv preprint arXiv:1710.06963, 2017.
  • [35] Thomas Steinke and Jonathan Ullman. Tight lower bounds for differentially private selection. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 552–563. IEEE, 2017.
  • [36] Kunal Talwar, Abhradeep Thakurta, and Li Zhang. Nearly-optimal private lasso. In Proceedings of the 28th International Conference on Neural Information Processing Systems-Volume 2, pages 3025–3033, 2015.
  • [37] Stacey Truex, Ling Liu, Ka-Ho Chow, Mehmet Emre Gursoy, and Wenqi Wei. Ldp-fed: Federated learning with local differential privacy. In Proceedings of the Third ACM International Workshop on Edge Systems, Analytics and Networking, pages 61–66, 2020.
  • [38] Sara Van de Geer, Peter Bühlmann, Ya’acov Ritov, and Ruben Dezeure. On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3):1166–1202, 2014.
  • [39] Kang Wei, Jun Li, Ming Ding, Chuan Ma, Howard H Yang, Farhad Farokhi, Shi Jin, Tony QS Quek, and H Vincent Poor. Federated learning with differential privacy: Algorithms and performance analysis. IEEE Transactions on Information Forensics and Security, 15:3454–3469, 2020.
  • [40] Manuel Wiesenfarth, Tatyana Krivobokova, Stephan Klasen, and Stefan Sperlich. Direct simultaneous inference in additive models and its application to model undernutrition. Journal of the American Statistical Association, 107(500):1286–1296, 2012.
  • [41] Yang Yu, Shih-Kang Chao, and Guang Cheng. Distributed bootstrap for simultaneous inference under high dimensionality. Journal of Machine Learning Research, 23(195):1–77, 2022.
  • [42] Xianyang Zhang and Guang Cheng. Simultaneous inference for high-dimensional linear models. Journal of the American Statistical Association, 112(518):757–768, 2017.
  • [43] Yuchen Zhang, John C Duchi, Michael I Jordan, and Martin J Wainwright. Information-theoretic lower bounds for distributed statistical estimation with communication constraints. In NIPS, pages 2328–2336. Citeseer, 2013.
  • [44] Yuchen Zhang, Martin J Wainwright, and John C Duchi. Communication-efficient algorithms for statistical optimization. Advances in neural information processing systems, 25, 2012.
  • [45] Zhe Zhang. Differential privacy in statistical learning. ProQuest Dissertations and Theses, page 156, 2023.
  • [46] Zhe Zhang and Linjun Zhang. High-dimensional differentially-private em algorithm: Methods and near-optimal statistical guarantees. arXiv preprint arXiv:2104.00245, 2021.

Appendix A Proof of main results

A.1 Proof of Theorem 1

We now prove the lower bound for estimation. The main idea of the proof is as follows: we first consider the general case in which each data point on each machine follows a general distribution pθp_{\theta}, then impose several conditions on this distribution and prove that the lower bound for mean estimation holds under these conditions. Finally, we show that when the data points follow a normal distribution, these conditions are satisfied, which completes the proof.

To start the proof, we introduce the perturbation space 𝒜={1,1}k\mathcal{A}=\{-1,1\}^{k}, where kk is a pre-chosen constant, associate each parameter θ\theta with an element a𝒜a\in\mathcal{A}, and write the distribution pθp_{\theta} as pap_{a}. We characterize the distance between two parameters θ\theta and θ\theta^{\prime} by the Hamming distance between the corresponding aa and aa^{\prime}; this approach is compatible with Assouad's method, as shown later in the proof. We note that a smaller Hamming distance between aa and aa^{\prime} indicates that θ\theta and θ\theta^{\prime} are closer. Also, for each a𝒜a\in\mathcal{A}, we denote by ai𝒜a^{\oplus i}\in\mathcal{A} the vector obtained by flipping the sign of the ii-th coordinate of aa. We then state the following conditions:

Condition 1

For every a𝒜a\in\mathcal{A} and i[k]i\in[k], it holds that 𝐩ai𝐩a\bm{p}_{a^{\oplus i}}\ll\bm{p}_{a}. Further, there exist constants qa,iq_{a,i} with |qa,i|α|q_{a,i}|\leq\alpha and measurable functions ϕa,i:𝒳\phi_{a,i}\colon{\mathcal{X}}\to\mathbb{R} such that:

d𝒑aid𝒑a=1+qa,iϕa,i.\frac{d\bm{p}_{a^{\oplus i}}}{d\bm{p}_{a}}=1+q_{a,i}\phi_{a,i}.
Condition 2

For all a𝒜a\in\mathcal{A} and i,j[k]i,j\in[k], 𝔼𝐩a[ϕa,iϕa,j]=𝟏i=j\mathbb{E}_{\bm{p}_{a}}[\phi_{a,i}\phi_{a,j}]=\mathbf{1}_{i=j}.

Condition 3

There exists some σ0\sigma\geq 0 such that, for all a𝒜a\in\mathcal{A}, the random vector ϕa(X)=(ϕa,i(X))i[k]k\phi_{a}(X)=(\phi_{a,i}(X))_{i\in[k]}\in\mathbb{R}^{k} is σ2\sigma^{2}-sub-Gaussian for X𝐩aX\sim\bm{p}_{a} with independent coordinates.

The above conditions characterize the distribution 𝒑a\bm{p}_{a}; we verify later in the proof that the Gaussian distribution satisfies them. We now state our first claim.

Corollary 1

For each coordinate of AA, i.e., for any i=1,2,,ki=1,2,\dots,k, fix τ=(Ai=1)(0,1/2]\tau=\mathbb{P}(A_{i}=1)\in(0,1/2]. Let X1,,XmX_{1},\ldots,X_{m} be the inputs on the local machines, i.i.d. with common distribution 𝐩An\bm{p}_{A}^{\otimes n}. Let ZmZ^{m} be the information sent from all local machines to the central server, generated through the channel 𝒲\mathcal{W}. Then, if Condition 1 holds, there exists a constant cc such that:

(1ki=1kdTV(𝒑+iZm,𝒑iZm))2ckq2mn2maxa𝒜i=1k𝒴𝔼𝒑an[ϕa,i(X)𝒲(z|X)]2𝔼𝒑an[𝒲(z|X)]𝑑μ,\displaystyle\quantity(\frac{1}{k}\sum_{i=1}^{k}d_{\text{TV}}({\bm{p}_{+i}^{Z^{m}}},{\bm{p}_{-i}^{Z^{m}}}))^{2}\leq\frac{c}{k}q^{2}mn^{2}\max_{a\in\mathcal{A}}\sum_{i=1}^{k}\int_{{\mathcal{Y}}}\frac{\mathbb{E}_{\bm{p}_{a}^{\otimes n}}[{\phi_{a,i}(X)\mathcal{W}(z|X)}]^{2}}{\mathbb{E}_{\bm{p}_{a}^{\otimes n}}[{\mathcal{W}(z|X)}]}d{\mu},

where 𝐩+iZm=𝔼[𝐩AZm|Ai=+1]\bm{p}_{+i}^{Z^{m}}=\mathbb{E}[\bm{p}_{A}^{Z^{m}}|A_{i}=+1], 𝐩iZm=𝔼[𝐩AZm|Ai=1]\bm{p}_{-i}^{Z^{m}}=\mathbb{E}[\bm{p}_{A}^{Z^{m}}|A_{i}=-1].

The proof of the above corollary is in Appendix B.1. The corollary bounds the total variation distance between 𝒑+iZm\bm{p}_{+i}^{Z^{m}} and 𝒑iZm\bm{p}_{-i}^{Z^{m}}, that is, how much the distribution of the transmitted messages ZmZ^{m} depends on each coordinate of AA; this quantifies the information that the messages carry about the underlying parameter.

The previous corollary assumed a general channel 𝒲\mathcal{W}. In the following corollary, we specialize it to an ϵ\epsilon-differentially private channel 𝒲priv\mathcal{W}^{priv}, which further simplifies the upper bound in Corollary 1.

Corollary 2

Let WprivW^{priv} be an ϵ\epsilon-differentially private channel and let {𝐩a,a{1,1}k}\{\bm{p}_{a},a\in\{-1,1\}^{k}\} be any family of distributions satisfying Conditions 1 and 2. With the same notation as in Corollary 1, we have:

(1ki=1kdTV(𝒑+iZm,𝒑iZm))27kmn2q2(enϵ21)\displaystyle\quantity(\frac{1}{k}\sum_{i=1}^{k}d_{TV}({\bm{p}_{+i}^{Z^{m}}},{\bm{p}_{-i}^{Z^{m}}}))^{2}\leq\frac{7}{k}mn^{2}q^{2}(e^{n\epsilon^{2}}-1)

The proof of the above corollary can be found in Appendix B.2. This corollary provides an upper bound on 1ki=1kdTV(𝒑+iZm,𝒑iZm)\frac{1}{k}\sum_{i=1}^{k}d_{TV}({\bm{p}_{+i}^{Z^{m}}},{\bm{p}_{-i}^{Z^{m}}}); in the next corollary, we derive a lower bound of Assouad type. We first introduce another condition:

Condition 4

Fix p[1,)p\in[1,\infty) and let ρ>0\rho>0 be the target accuracy under the p\ell_{p} loss. Then, for every a,a𝒜={1,+1}ka,a^{\prime}\in\mathcal{A}=\{-1,+1\}^{k}, the following inequality holds:

lp(θa,θa)4ρ(dHam(a,a)τk)1/p,\displaystyle l_{p}(\theta_{a},\theta_{a^{\prime}})\geq 4\rho\quantity(\frac{d_{Ham}(a,a^{\prime})}{\tau k})^{1/p},

where dHam(a,a)d_{Ham}(a,a^{\prime}) denotes the Hamming distance with definition dHam(a,a)=i=1k𝟏(aiai)d_{Ham}(a,a^{\prime})=\sum_{i=1}^{k}\mathbf{1}{(a_{i}\neq a_{i}^{\prime})}, and τ=𝐏(ai=1)(0,1/2]\tau={\mathbf{P}}(a_{i}=1)\in(0,1/2] for each coordinate aia_{i}.

The above condition characterizes the connection between θ\theta and the perturbation space. Under this condition, we can further obtain a lower bound on 1ki=1kdTV(𝒑+iZm,𝒑iZm)\frac{1}{k}\sum_{i=1}^{k}d_{TV}({\bm{p}_{+i}^{Z^{m}}},{\bm{p}_{-i}^{Z^{m}}}):

Corollary 3

Let p1p\geq 1 and assume that {𝐩a,a𝒜}\{\bm{p}_{a},a\in\mathcal{A}\} and τ[0,1/2]\tau\in[0,1/2] satisfy Condition 4. Let AA be a random variable on {1,1}k\{-1,1\}^{k} with distribution Rad(τ)kRad(\tau)^{\otimes k}. Suppose that θ^\hat{\theta} constitutes an (n,ρ)(n,\rho)-estimator of the true parameter θ\theta^{*} under the p\ell_{p} loss and that [𝐩A𝒫Θ]1τ/4\mathbb{P}[\bm{p}_{A}\in\mathcal{P}_{\Theta}]\geq 1-\tau/4. Then the following inequality holds:

1ki=1kdTV(𝒑+iZm,𝒑iZm)n4,\displaystyle\frac{1}{k}\sum_{i=1}^{k}d_{TV}(\bm{p}_{+i}^{Z^{m}},\bm{p}_{-i}^{Z^{m}})\geq\frac{n}{4},

where 𝐩+iZm=𝔼[𝐩AZm|Ai=+1]\bm{p}_{+i}^{Z^{m}}=\mathbb{E}[\bm{p}_{A}^{Z^{m}}|{A_{i}=+1}], 𝐩iZm=𝔼[𝐩AZm|Ai=1]\bm{p}_{-i}^{Z^{m}}=\mathbb{E}[{\bm{p}_{A}^{Z^{m}}}|{A_{i}=-1}].

The proof of the above corollary can be found in Appendix B.3. In the remainder of the proof, we verify that the Gaussian distribution satisfies all of the above conditions, so that the results in Corollaries 2 and 3 hold. Combining these two corollaries then yields the lower bound for mean estimation in the high-dimensional federated learning setting.

For the parameters, we fix p=2p=2, k=dk=d, and 𝒜={1,+1}d\mathcal{A}=\{-1,+1\}^{d}. For the probability τ=(ai=1)\tau={\mathbb{P}}(a_{i}=1), we set τ=s2d\tau=\frac{s}{2d}. Let φ\varphi denote the probability density function of the standard Gaussian distribution N(𝟎,𝑰)N(\mathbf{0},\bm{I}). Suppose that, for some ρ(0,1/8]\rho\in(0,1/8], there exists an (n,ρ)(n,\rho)-estimator for the true mean μ\mu under the p\ell_{p} loss. If ρ2s/n\rho^{2}\geq s/n, the proof is complete. Otherwise, we fix a parameter γ=4ρs/2(0,1/2]\gamma=\frac{4\rho}{\sqrt{s/2}}\in(0,1/2], which is possible by the choice of the sparsity level ss. We relate the Gaussian mean μ\mu to AA by the formula μa=γ(a+𝟏d)\mu_{a}=\gamma(a+\bm{1}_{d}), where a𝒜a\in\mathcal{A}. One can verify that [μa02τd]1τ/4{\mathbb{P}}[\|\mu_{a}\|_{0}\leq 2\tau d]\geq 1-\tau/4, where μa0=i=1d𝟏ai=1=a+\norm{\mu_{a}}_{0}=\sum_{i=1}^{d}\bm{1}_{a_{i}=1}=\|a\|_{+}. From the definition of the Gaussian density, for a𝒜a\in\mathcal{A}, we have:

\bm{p}_{a}(x)=e^{-\|\mu_{a}\|_{2}^{2}/2}\cdot e^{\gamma\langle{x},{a+\bm{1}_{d}}\rangle}\cdot\varphi(x).

Therefore, for a𝒜a\in\mathcal{A} and i[d]i\in[d], we have

𝒑ai(x)=e2γxiaie2γ2ai𝒑a(x)=(1+qϕa,i(x))𝒑a(x),\bm{p}_{a^{\oplus i}}(x)=e^{-2\gamma x_{i}a_{i}}e^{2\gamma^{2}a_{i}}\cdot\bm{p}_{a}(x)=({1+q\cdot\phi_{a,i}(x)})\cdot\bm{p}_{a}(x),

where q=e4γ21q=\sqrt{e^{4\gamma^{2}}-1} and ϕa,i(x)=1e2γxiaie2γ2aie4γ21\phi_{a,i}(x)=\frac{1-e^{-2\gamma x_{i}a_{i}}e^{2\gamma^{2}a_{i}}}{\sqrt{e^{4\gamma^{2}}-1}}. By using the Gaussian moment-generating function, we could verify that, for iji\neq j,

𝔼𝒑a[ϕa,i(X)]=0,𝔼𝒑a[ϕa,i(X)2]=1, and 𝔼𝒑a[ϕa,i(X)ϕa,j(X)]=0,\mathbb{E}_{\bm{p}_{a}}[{\phi_{a,i}(X)}]=0,\quad\mathbb{E}_{\bm{p}_{a}}[{\phi_{a,i}(X)^{2}}]=1,\text{ and }\mathbb{E}_{\bm{p}_{a}}[{\phi_{a,i}(X)\phi_{a,j}(X)}]=0,

so that Conditions 1 and 2 are satisfied. Here, notice that the proof of Corollary 1 requires |q\cdot\phi_{a,i}(x)|\leq C/n for a constant CC. Since ρcs/n\rho\leq c\cdot\sqrt{s/n}, the definition of γ\gamma gives \gamma\leq c/\sqrt{n}, and hence |q\cdot\phi_{a,i}(x)|\leq c_{0}|\gamma^{2}-\gamma x_{i}|\leq c_{0}/n, which verifies the condition required for Corollary 1. Also, by the choice of γ\gamma and ρ\rho, it is easy to verify that Condition 4 holds with:

2(μ(𝒑a),μ(𝒑a))=4ρdham(a,a)τd.\ell_{2}(\mu(\bm{p}_{a}),\mu(\bm{p}_{a^{\prime}}))={4\rho}\cdot\sqrt{\frac{d_{ham}(a,a^{\prime})}{\tau d}}.

Thus, all the conditions above have been verified, and we can complete the proof of the lower bound. Combining Corollaries 2 and 3, we obtain:

n2dcmn2q2(enϵ21),n^{2}d\leq cmn^{2}q^{2}(e^{n\epsilon^{2}}-1),

where cc is a constant. Since γ1/2\gamma\leq 1/2, we have q2=e4γ218γ2q^{2}=e^{4\gamma^{2}}-1\leq 8\gamma^{2}. Combining this with e^{n\epsilon^{2}}-1\leq c^{\prime}n\epsilon^{2}, which holds when n\epsilon^{2} is bounded, it follows that for some constant c0c_{0},

ρ2c0sdmnϵ2,\rho^{2}\geq c_{0}\cdot\frac{s\cdot d}{mn\epsilon^{2}},

From the choice of ρ\rho, we conclude that ρΩ(sdmnϵ21)\rho\geq\Omega(\sqrt{\frac{sd}{mn\epsilon^{2}}}\wedge 1), which yields the desired lower bound and finishes the proof. \square

A.2 Proof of Theorem 2

In the proof of Theorem 2, we design a mechanism that produces an estimate of the parameter and then bound 𝝁𝝁^22\|\bm{\mu}-\hat{\bm{\mu}}\|_{2}^{2} from above. The mechanism works as follows: we first compute the mean of the nn data points on each machine. Then, we transform the Gaussian mean into a Bernoulli mean according to the sign of each coordinate, motivated by Algorithm 2 of [2], an \ell-bit protocol for estimating the mean of a product of Bernoulli distributions.

We then use an ϵ\epsilon-locally differentially private mechanism to estimate the mean of the product Bernoulli family in the federated learning setting. After obtaining this estimate, we convert the estimated Bernoulli mean back into an estimate of the Gaussian mean.

First, each data point on a machine follows the distribution N(𝝁,𝑰d)N(\bm{\mu},\bm{I}_{d}), so the sample mean 𝑿i¯\bar{\bm{X}_{i}} on the ii-th machine follows N(𝝁,1/n𝑰)N(\bm{\mu},1/n\bm{I}). On each machine, we convert the mean into a Bernoulli vector 𝐙\bf Z, where Z_{j}=1 when \bar{X}_{ij}>0 and Z_{j}=-1 when \bar{X}_{ij}\leq 0, for j=1,\dots,d. The mean of 𝒁\bm{Z}, denoted by 𝒗\bm{v}, then satisfies

v_{j}=2{\mathbb{P}}(\bar{X}_{ij}>0)-1=\operatorname{Erf}\quantity(\frac{\sqrt{n}\mu_{j}}{\sqrt{2}}),

for each coordinate jj of 𝒗\bm{v}. Let 𝒗^\hat{\bm{v}} denote the estimate of 𝒗\bm{v} and define the estimate 𝝁^\hat{\bm{\mu}} by \hat{\mu}_{j}=\frac{\sqrt{2}}{\sqrt{n}}\operatorname{Erf}^{-1}(\hat{v}_{j}). We then have the relationship:

\displaystyle\|\bm{\mu}-\hat{\bm{\mu}}\|_{2}^{2}=\sum_{j=1}^{d}|\mu_{j}-\hat{\mu}_{j}|^{2}=\frac{2}{n}\cdot\sum_{j=1}^{d}|\text{Erf}^{-1}(v_{j})-\text{Erf}^{-1}(\hat{v}_{j})|^{2}\leq c\cdot\frac{1}{n}\cdot\sum_{j=1}^{d}|v_{j}-\hat{v}_{j}|^{2}, (A.1)

where cc is a constant. The last inequality follows from the Lipschitz property of the inverse error function Erf1\text{Erf}^{-1} on the relevant bounded range. We can then obtain an upper bound for the Bernoulli mean estimation directly from Theorem 3 in [2]:

𝒗𝒗^22cdsmϵ2,\|\bm{v}-\hat{\bm{v}}\|_{2}^{2}\leq c\cdot\frac{d\cdot s}{m\epsilon^{2}},

where ϵ\epsilon is the privacy parameter and mm is the number of machines. Combining this bound with (A.1), we obtain the upper bound for the Gaussian mean estimation:

𝝁𝝁^22csdmnϵ2,\|\bm{\mu}-\hat{\bm{\mu}}\|_{2}^{2}\leq c\cdot\frac{s\cdot d}{mn\epsilon^{2}},

which finishes the proof. \square
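The sign-and-erf conversion used in this proof can be sanity-checked numerically. The sketch below (using scipy.special.erf and erfinv; all names are ours) maps a Gaussian mean to the corresponding Bernoulli mean and back.

```python
import numpy as np
from scipy.special import erf, erfinv

def gaussian_to_bernoulli_mean(mu, n):
    """v_j = 2 P(Xbar_j > 0) - 1 = erf(sqrt(n) mu_j / sqrt(2)) when Xbar_j ~ N(mu_j, 1/n)."""
    return erf(np.sqrt(n) * mu / np.sqrt(2))

def bernoulli_to_gaussian_mean(v, n):
    """Inverse map mu_j = sqrt(2 / n) * erfinv(v_j)."""
    return np.sqrt(2 / n) * erfinv(v)

mu = np.array([0.3, -0.1, 0.0])
roundtrip = bernoulli_to_gaussian_mean(gaussian_to_bernoulli_mean(mu, 100), 100)
assert np.allclose(roundtrip, mu)   # the two maps are exact inverses
```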

A.3 Proof of Theorem 3

It is not difficult to observe that the convergence rate is the same as in the non-federated setting. We denote by LnL_{n} the sample loss function and by LL its population counterpart. For the estimation of 𝜷{\bm{\beta}}, L(𝜷)=𝒀𝑿𝜷22L({\bm{\beta}})=\|\bm{Y}-\bm{X}{\bm{\beta}}\|_{2}^{2} and LnL_{n} is its sample version. For the estimation of 𝚯k\bm{\Theta}_{k}, L(\bm{\Theta}_{k})=\frac{1}{2}\bm{\Theta}_{k}^{\top}\Sigma\bm{\Theta}_{k}-\langle\bm{e}_{k},\bm{\Theta}_{k}\rangle. We start with the estimation of 𝜷{\bm{\beta}}; the argument for 𝚯\bm{\Theta} is analogous. In this proof, we use n0n_{0} to denote the total number of samples, n0=mnn_{0}=m\cdot n. Then, it holds that:

Lemma A.1

Under the assumptions of Theorem 5, it holds that:

8νs𝜷t𝜷^22n(𝜷t)n(𝜷^),𝜷t𝜷^8μs𝜷t𝜷^22.\displaystyle{8\nu_{s}}\|{\bm{\beta}}^{t}-\hat{\bm{\beta}}\|_{2}^{2}\leq\langle\nabla{\mathcal{L}}_{n}({\bm{\beta}}^{t})-\nabla{\mathcal{L}}_{n}(\hat{\bm{\beta}}),{\bm{\beta}}^{t}-\hat{\bm{\beta}}\rangle\leq{8\mu_{s}}\|{\bm{\beta}}^{t}-\hat{\bm{\beta}}\|_{2}^{2}. (A.2)

Proof: By direct calculation, we obtain:

n(𝜷t)n(𝜷^),𝜷t𝜷^=2(𝜷t𝜷^)T𝚺^(𝜷t𝜷^)2μs+s𝜷t𝜷^222μ2s𝜷t𝜷^22\langle\nabla{\mathcal{L}}_{n}({\bm{\beta}}^{t})-\nabla{\mathcal{L}}_{n}(\hat{\bm{\beta}}),{\bm{\beta}}^{t}-\hat{\bm{\beta}}\rangle=2({\bm{\beta}}^{t}-\hat{\bm{\beta}})^{T}\hat{\bm{\Sigma}}({\bm{\beta}}^{t}-\hat{\bm{\beta}})\leq 2\mu_{s+s^{*}}\|{\bm{\beta}}^{t}-\hat{\bm{\beta}}\|_{2}^{2}\leq 2\mu_{2s}\|{\bm{\beta}}^{t}-\hat{\bm{\beta}}\|_{2}^{2}

The last inequality follows from the choice of ss such that sss^{*}\leq s. Since μ2s4μs\mu_{2s}\leq 4\mu_{s}, we obtain the right-hand side of (A.2); the left-hand side follows by a similar argument.

Lemma A.2

Under the assumptions of Theorem 5, there exists an absolute constant ρ\rho such that

n(𝜷t+1)n(𝜷^)(1νs24μs)(n(𝜷t)n(𝜷^))+c3(i[s]𝒘it2+𝒘~St+1t22),\displaystyle{\mathcal{L}}_{n}({\bm{\beta}}^{t+1})-{\mathcal{L}}_{n}(\hat{\bm{\beta}})\leq\left(1-\frac{\nu_{s}}{24\mu_{s}}\right)\left({\mathcal{L}}_{n}({\bm{\beta}}^{t})-{\mathcal{L}}_{n}(\hat{\bm{\beta}})\right)+c_{3}\left(\sum_{i\in[s]}\|\bm{w}^{t}_{i}\|_{\infty}^{2}+\|\tilde{\bm{w}}^{t}_{S^{t+1}}\|_{2}^{2}\right), (A.3)

where c3c_{3} is a constant number such that c3=max(μs(728μs+13),68μs+2/3)c_{3}=\max(\mu_{s}(72\cdot 8\mu_{s}+13),68\mu_{s}+2/3)

Notice that the noise terms 𝒘it\bm{w}^{t}_{i} and 𝒘~t\tilde{\bm{w}}^{t} are injected by the NoisyHT algorithm. The proof of the above lemma follows from Lemma 8.3 in [8]. We now iterate (A.3) over tt. Denoting 𝑾t=c3(i[s]𝒘it2+𝒘~St+1t22)\bm{W}_{t}=c_{3}\left(\sum_{i\in[s]}\|\bm{w}^{t}_{i}\|^{2}_{\infty}+\|\tilde{\bm{w}}^{t}_{S^{t+1}}\|_{2}^{2}\right), we obtain

n0(𝜷T)n0(𝜷^)\displaystyle{\mathcal{L}}_{n_{0}}({\bm{\beta}}^{T})-{\mathcal{L}}_{n_{0}}(\hat{\bm{\beta}}) (1νs24μs)T(n0(𝜷0)n0(𝜷^))+k=0T1(11ρL2)Tk1𝑾k\displaystyle\leq\left(1-\frac{\nu_{s}}{24\mu_{s}}\right)^{T}\left({\mathcal{L}}_{n_{0}}({\bm{\beta}}^{0})-{\mathcal{L}}_{n_{0}}(\hat{\bm{\beta}})\right)+\sum_{k=0}^{T-1}\left(1-\frac{1}{\rho L^{2}}\right)^{T-k-1}\bm{W}_{k}
(1νs24μs)T4μc02+k=0T1(1νs24μs)Tk1𝑾k.\displaystyle\leq\left(1-\frac{\nu_{s}}{24\mu_{s}}\right)^{T}4\mu c_{0}^{2}+\sum_{k=0}^{T-1}\left(1-\frac{\nu_{s}}{24\mu_{s}}\right)^{T-k-1}\bm{W}_{k}. (A.4)

The second inequality is a consequence of the upper inequality in (A.2) and the 2\ell_{2} bounds of 𝜷0{\bm{\beta}}^{0} and 𝜷^\hat{\bm{\beta}}. We can also bound n0(𝜷T)n0(𝜷^){\mathcal{L}}_{n_{0}}({\bm{\beta}}^{T})-{\mathcal{L}}_{n_{0}}(\hat{\bm{\beta}}) from below by the lower inequality in (A.2):

n0(𝜷T)n0(𝜷^)n0(𝜷T)n0(𝜷)4νs𝜷T𝜷22n0(𝜷),𝜷𝜷T.\displaystyle{\mathcal{L}}_{n_{0}}({\bm{\beta}}^{T})-{\mathcal{L}}_{n_{0}}(\hat{\bm{\beta}})\geq{\mathcal{L}}_{n_{0}}({\bm{\beta}}^{T})-{\mathcal{L}}_{n_{0}}({\bm{\beta}}^{*})\geq 4\nu_{s}\|{\bm{\beta}}^{T}-{\bm{\beta}}^{*}\|_{2}^{2}-\langle\nabla{\mathcal{L}}_{n_{0}}({\bm{\beta}}^{*}),{\bm{\beta}}^{*}-{\bm{\beta}}^{T}\rangle. (A.5)

Now (A.4) and (A.5) imply that, with T=(ρL2)log(8c02Ln0)T=(\rho L^{2})\log(8c_{0}^{2}Ln_{0}),

4νs𝜷T𝜷22\displaystyle 4\nu_{s}\|{\bm{\beta}}^{T}-{\bm{\beta}}^{*}\|_{2}^{2} n0(𝜷)s+s𝜷𝜷T2+1n0+k=0T1(1νs24μs)Tk1𝑾k.\displaystyle\leq\|\nabla{\mathcal{L}}_{n_{0}}({\bm{\beta}}^{*})\|_{\infty}\sqrt{s+s^{*}}\|{\bm{\beta}}^{*}-{\bm{\beta}}^{T}\|_{2}+\frac{1}{n_{0}}+\sum_{k=0}^{T-1}\left(1-\frac{\nu_{s}}{24\mu_{s}}\right)^{T-k-1}\bm{W}_{k}. (A.6)
n0(𝜷)s+s𝜷𝜷T2+1n0+24μsνsmaxk𝑾k.\displaystyle\leq\|\nabla{\mathcal{L}}_{n_{0}}({\bm{\beta}}^{*})\|_{\infty}\sqrt{s+s^{*}}\|{\bm{\beta}}^{*}-{\bm{\beta}}^{T}\|_{2}+\frac{1}{n_{0}}+\frac{24\mu_{s}}{\nu_{s}}\max_{k}\bm{W}_{k}. (A.7)

Thus,

𝜷T𝜷22kslogdn0+μsνs2maxk𝑾k.\|{\bm{\beta}}^{T}-{\bm{\beta}}^{*}\|_{2}^{2}\leq k\cdot\frac{s^{*}\log d}{n_{0}}+\frac{\mu_{s}}{\nu_{s}^{2}}\max_{k}\bm{W}_{k}.

In the above inequality, kk is a constant. We next bound 𝐖k{\mathbf{W}}_{k}. From tail bounds for Laplace random variables, with high probability, {\mathbf{W}}_{k}\leq c_{4}\,s^{2}\log^{2}d\log(1/\delta)\log^{3}n_{0}/(n_{0}^{2}\epsilon^{2}), where c4=max(μs(9μs+1/4),17/16μs+1/96)c_{4}=\max(\mu_{s}(9\mu_{s}+1/4),17/16\mu_{s}+1/96). Then, with high probability:

\|{\bm{\beta}}^{T}-{\bm{\beta}}^{*}\|_{2}^{2}\leq k\cdot\frac{s\log d}{n_{0}}+\frac{6c_{4}\mu_{s}}{\nu_{s}^{2}}\cdot\frac{s^{2}\log^{2}d\log(1/\delta)\log^{3}n_{0}}{n_{0}^{2}\epsilon^{2}}.

Similarly, we could obtain the same result for the estimation of 𝚯^k\hat{\bm{\Theta}}_{k}, which finishes the proof.

A.4 Proof of Theorem 4

The proof consists of three parts. The first part shows that our algorithm provides an (ϵ,δ)(\epsilon,\delta)-differentially private confidence interval. The second part shows that βk^\hat{\beta_{k}} is a consistent and asymptotically unbiased estimator of the true βk\beta_{k}. The last part shows that the (1α)(1-\alpha) confidence interval is asymptotically valid. Before starting the first part, we analyze cxc_{x}:
According to the assumptions of the theorem, for each row 𝒙\bm{x} of 𝑿\bm{X}, 𝒙𝚺1/2\bm{x}\bm{\Sigma}^{-1/2} is sub-Gaussian with κ=𝚺1/2𝒙ψ2\kappa=\|\bm{\Sigma}^{-1/2}\bm{x}\|_{\psi_{2}}. Then, by the properties of sub-Gaussian random variables, 𝒙𝚺1/232κ2logd\|\bm{x}\bm{\Sigma}^{-1/2}\|_{\infty}\leq 3\sqrt{2\kappa^{2}\log d} with probability 1d21-d^{-2}. For each element xix_{i} of 𝒙\bm{x}, i=1,2,,di=1,2,\dots,d, we have:

xi=𝒆j𝒙=𝒆j𝚺1/2𝚺1/2𝒙x_{i}=\bm{e}_{j}^{\top}\bm{x}=\bm{e}_{j}^{\top}\bm{\Sigma}^{1/2}\bm{\Sigma}^{-1/2}\bm{x}

Thus,

xi𝒆j𝚺1/21𝚺1/2𝒙𝚺1/22𝚺1/2𝒙\displaystyle x_{i}\leq\|\bm{e}_{j}^{\top}\bm{\Sigma}^{1/2}\|_{1}\|\bm{\Sigma}^{-1/2}\bm{x}\|_{\infty}\leq\|\bm{\Sigma}^{1/2}\|_{2}\|\bm{\Sigma}^{-1/2}\bm{x}\|_{\infty}

Then, with probability 1-d^{-2}, we have |x_{i}|\leq 3\sqrt{2L\kappa^{2}\log d}. By a union bound, with probability 1-d^{-1}, \|\bm{x}\|_{\infty}\leq 3\sqrt{2L\kappa^{2}\log d}. By the choice of c_{x} in the theorem, \|\bm{x}\|_{\infty}\leq c_{x} with high probability.
We now verify that the confidence interval is (\epsilon,\delta)-differentially private. From [8], the output \hat{{\bm{\beta}}^{\text{u}}} is (\epsilon,\delta)-DP, and in a similar manner the output \hat{\bm{\Theta}}_{k} is also (\epsilon,\delta)-DP. Thus, for two adjacent data sets (\bm{X},\bm{Y}) and (\bm{X}^{\prime},\bm{Y}^{\prime}) that differ in a single record, (\bm{x}_{ij},y_{ij}) versus (\bm{x}_{ij}^{\prime},y_{ij}^{\prime}), we have:

|1n0(𝚯^k𝒙ijΠR(yij)𝚯^k𝒙ij𝒙ij𝜷u^)|\displaystyle|\frac{1}{n_{0}}(\hat{\bm{\Theta}}_{k}^{\top}\bm{x}_{ij}\Pi_{R}(y_{ij})-\hat{\bm{\Theta}}_{k}^{\top}\bm{x}_{ij}\bm{x}_{ij}^{\top}\hat{{\bm{\beta}}^{\text{u}}})| 1n0|𝚯^k𝒙ijΠR(yij)|+1n0|𝚯^k𝒙ij𝒙ij𝜷u^|\displaystyle\leq\frac{1}{n_{0}}|\hat{\bm{\Theta}}_{k}^{\top}\bm{x}_{ij}\Pi_{R}(y_{ij})|+\frac{1}{n_{0}}|\hat{\bm{\Theta}}_{k}^{\top}\bm{x}_{ij}\bm{x}_{ij}^{\top}\hat{{\bm{\beta}}^{\text{u}}}|
1n0|𝚯^k𝒙ij||ΠR(yij)|+1n0|𝚯^k𝒙ij||𝒙ij𝜷u^|\displaystyle\leq\frac{1}{n_{0}}|\hat{\bm{\Theta}}_{k}^{\top}\bm{x}_{ij}||\Pi_{R}(y_{ij})|+\frac{1}{n_{0}}|\hat{\bm{\Theta}}_{k}^{\top}\bm{x}_{ij}||\bm{x}_{ij}^{\top}\hat{{\bm{\beta}}^{\text{u}}}|
1n0sc1cxR+1n0sc0c1cx2\displaystyle\leq\frac{1}{n_{0}}\sqrt{s}c_{1}c_{x}R+\frac{1}{n_{0}}sc_{0}c_{1}c_{x}^{2}

Thus,

|1n0(𝚯^k𝒙ijΠR(yij)𝚯^k𝒙ij𝒙ij𝜷u^)1n0(𝚯^k𝒙ijΠR(yij)𝚯^k𝒙ij𝒙ij𝜷u^)|\displaystyle|\frac{1}{n_{0}}(\hat{\bm{\Theta}}_{k}^{\top}\bm{x}_{ij}\Pi_{R}(y_{ij})-\hat{\bm{\Theta}}_{k}^{\top}\bm{x}_{ij}\bm{x}_{ij}^{\top}\hat{{\bm{\beta}}^{\text{u}}})-\frac{1}{n_{0}}(\hat{\bm{\Theta}}_{k}^{\top}\bm{x}_{ij}^{\prime}\Pi_{R}(y_{ij}^{\prime})-\hat{\bm{\Theta}}_{k}^{\top}\bm{x}_{ij}^{\prime}\bm{x}_{ij}^{\prime\top}\hat{{\bm{\beta}}^{\text{u}}})|
\displaystyle\leq|\frac{1}{n_{0}}(\hat{\bm{\Theta}}_{k}^{\top}\bm{x}_{ij}\Pi_{R}(y_{ij})-\hat{\bm{\Theta}}_{k}^{\top}\bm{x}_{ij}\bm{x}_{ij}^{\top}\hat{{\bm{\beta}}^{\text{u}}})|+|\frac{1}{n_{0}}(\hat{\bm{\Theta}}_{k}^{\top}\bm{x}_{ij}^{\prime}\Pi_{R}(y_{ij}^{\prime})-\hat{\bm{\Theta}}_{k}^{\top}\bm{x}_{ij}^{\prime}\bm{x}_{ij}^{\prime\top}\hat{{\bm{\beta}}^{\text{u}}})|
\displaystyle\leq\frac{2}{n_{0}}\sqrt{s}c_{1}c_{x}R+\frac{2}{n_{0}}sc_{0}c_{1}c_{x}^{2}

Denote \Delta_{1}=\sqrt{s}c_{1}c_{x}R+sc_{0}c_{1}c_{x}^{2}. Thus, if E_{k} follows N(0,8\Delta_{1}^{2}\log(1.25/\delta)/(n_{0}^{2}\epsilon^{2})), then \hat{\beta_{k}} is (\epsilon,\delta)-DP. For the term \hat{\bm{\Theta}}_{k}^{\top}\hat{\bm{\Sigma}}\hat{\bm{\Theta}}_{k}, we obtain:

\displaystyle\hat{\bm{\Theta}}_{k}^{\top}\hat{\bm{\Sigma}}\hat{\bm{\Theta}}_{k}=\frac{1}{n_{0}}\sum_{i=1}^{m}\sum_{j=1}^{n}\hat{\bm{\Theta}}_{k}^{\top}\bm{x}_{ij}\bm{x}_{ij}^{\top}\hat{\bm{\Theta}}_{k}=\frac{1}{n_{0}}\sum_{i=1}^{m}\sum_{j=1}^{n}(\hat{\bm{\Theta}}_{k}^{\top}\bm{x}_{ij})^{2}

Thus, for two adjacent data sets \bm{X} and \bm{X}^{\prime} that differ in a single record, \bm{x}_{ij} versus \bm{x}_{ij}^{\prime}, we have:

|𝚯^k𝚺^𝚯^k𝚯^k𝚺^𝚯^k|\displaystyle|\hat{\bm{\Theta}}_{k}^{\top}\hat{\bm{\Sigma}}\hat{\bm{\Theta}}_{k}-\hat{\bm{\Theta}}_{k}^{\top}\hat{\bm{\Sigma}^{\prime}}\hat{\bm{\Theta}}_{k}| 1n0(𝚯^k𝒙ij)2+1n0(𝚯^k𝒙ij)2\displaystyle\leq\frac{1}{n_{0}}(\hat{\bm{\Theta}}_{k}^{\top}\bm{x}_{ij})^{2}+\frac{1}{n_{0}}(\hat{\bm{\Theta}}_{k}^{\top}\bm{x}_{ij}^{\prime})^{2}

By Hölder's inequality and the Cauchy–Schwarz inequality, |\hat{\bm{\Theta}}_{k}^{\top}\bm{x}_{ij}|\leq\sqrt{s}c_{1}c_{x}, and thus:

|𝚯^k𝚺^𝚯^k𝚯^k𝚺^𝚯^k|\displaystyle|\hat{\bm{\Theta}}_{k}^{\top}\hat{\bm{\Sigma}}\hat{\bm{\Theta}}_{k}-\hat{\bm{\Theta}}_{k}^{\top}\hat{\bm{\Sigma}^{\prime}}\hat{\bm{\Theta}}_{k}| 2n0(sc1cx)2=2n0sc12cx2\displaystyle\leq\frac{2}{n_{0}}(\sqrt{s}c_{1}c_{x})^{2}=\frac{2}{n_{0}}sc_{1}^{2}c_{x}^{2}

Denote \Delta_{2}=sc_{1}^{2}c_{x}^{2} and let E^{\prime} follow the Gaussian distribution N(0,8\Delta_{2}^{2}\log(1.25/\delta)/(m^{2}n^{2}\epsilon^{2})). Then \hat{\bm{\Theta}}_{k}^{\top}\hat{\bm{\Sigma}}\hat{\bm{\Theta}}_{k}+E^{\prime} is (\epsilon,\delta)-differentially private.
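For concreteness, the following is a minimal numerical sketch (our own illustration, not part of the algorithm) of the Gaussian-mechanism calibration used above: with per-record sensitivity 2\Delta/n_{0}, the standard calibration \sigma=\text{sensitivity}\cdot\sqrt{2\log(1.25/\delta)}/\epsilon reproduces the noise variances 8\Delta_{1}^{2}\log(1.25/\delta)/(n_{0}^{2}\epsilon^{2}) and 8\Delta_{2}^{2}\log(1.25/\delta)/(n_{0}^{2}\epsilon^{2}). All constants below are placeholder values.

import numpy as np

def gaussian_mechanism(value, sensitivity, eps, delta, rng):
    # Standard Gaussian mechanism: sigma = sensitivity * sqrt(2 log(1.25/delta)) / eps.
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    return value + rng.normal(0.0, sigma)

rng = np.random.default_rng(0)
n0, s, c0, c1, cx, R = 10_000, 5, 1.0, 1.0, 2.0, 4.0   # placeholder constants
eps, delta = 1.0, 1e-5

delta1 = np.sqrt(s) * c1 * cx * R + s * c0 * c1 * cx**2   # Delta_1 above
delta2 = s * c1**2 * cx**2                                # Delta_2 above

correction = 0.123   # stands in for (1/n0) * Theta_k^T X^T (Pi_R(Y) - X beta_u)
quad_form = 1.456    # stands in for Theta_k^T Sigma_hat Theta_k
private_correction = gaussian_mechanism(correction, 2 * delta1 / n0, eps, delta, rng)
private_quad_form = gaussian_mechanism(quad_form, 2 * delta2 / n0, eps, delta, rng)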
We now turn to the second part of the proof. First, with probability 1-k_{0}\exp(-k_{1}n_{0}), we have \Pi_{R}(y_{i})=y_{i} for every observation, so \hat{\beta_{k}} can be decomposed as follows:

βk^\displaystyle\hat{\beta_{k}} =βku^+1n0𝚯^k𝑿(𝑿𝜷+𝑾𝑿𝜷u^)+Ek\displaystyle=\hat{\beta_{k}^{\text{u}}}+\frac{1}{n_{0}}\hat{\bm{\Theta}}_{k}^{\top}\bm{X}^{\top}(\bm{X}{\bm{\beta}}+\bm{W}-\bm{X}\hat{{\bm{\beta}}^{\text{u}}})+E_{k}
=βku^+1n0𝚯^k𝑿𝑿(𝜷𝜷u^)+1n0𝚯^k𝑿𝑾+Ek\displaystyle=\hat{\beta_{k}^{\text{u}}}+\frac{1}{n_{0}}\hat{\bm{\Theta}}_{k}^{\top}\bm{X}^{\top}\bm{X}({\bm{\beta}}-\hat{{\bm{\beta}}^{\text{u}}})+\frac{1}{n_{0}}\hat{\bm{\Theta}}_{k}^{\top}\bm{X}^{\top}\bm{W}+E_{k}
=βk+(𝚯^k𝚺^𝒆k)(𝜷𝜷u^)+1n0𝚯^k𝑿𝑾+Ek\displaystyle={\beta_{k}}+(\hat{\bm{\Theta}}_{k}^{\top}\hat{\bm{\Sigma}}-\bm{e}_{k})({\bm{\beta}}-\hat{{\bm{\beta}}^{\text{u}}})+\frac{1}{n_{0}}\hat{\bm{\Theta}}_{k}^{\top}\bm{X}^{\top}\bm{W}+E_{k}

Thus, we have:

n0(βj^βj)=n0(𝚯^k𝚺^𝒆k)(𝜷𝜷u^)A.8.1+1n0𝚯^k𝑿𝑾A.8.2+n0EkA.8.3\sqrt{n_{0}}(\hat{\beta_{j}}-\beta_{j})=\underbrace{\sqrt{n_{0}}(\hat{\bm{\Theta}}_{k}^{\top}\hat{\bm{\Sigma}}-\bm{e}_{k}^{\top})({\bm{\beta}}-\hat{{\bm{\beta}}^{\text{u}}})}_{\ref{eq:eq1}.1}+\underbrace{\frac{1}{\sqrt{n_{0}}}\hat{\bm{\Theta}}_{k}^{\top}\bm{X}^{\top}\bm{W}}_{\ref{eq:eq1}.2}+\underbrace{\sqrt{n_{0}}E_{k}}_{\ref{eq:eq1}.3} (A.8)

We will analyze the three terms in (A.8) one by one. For the first term, we could further decompose this term as:

n0(𝚯^k𝚺^𝒆k)(𝜷𝜷u^)\displaystyle\sqrt{n_{0}}(\hat{\bm{\Theta}}_{k}^{\top}\hat{\bm{\Sigma}}-\bm{e}_{k}^{\top})({\bm{\beta}}-\hat{{\bm{\beta}}^{\text{u}}}) =n0(𝚯^k𝚺^𝚯k𝚺^+𝚯k𝚺^𝒆k)(𝜷𝜷u^)\displaystyle=\sqrt{n_{0}}(\hat{\bm{\Theta}}_{k}^{\top}\hat{\bm{\Sigma}}-{\bm{\Theta}}_{k}^{\top}\hat{\bm{\Sigma}}+{\bm{\Theta}}_{k}^{\top}\hat{\bm{\Sigma}}-\bm{e}_{k}^{\top})({\bm{\beta}}-\hat{{\bm{\beta}}^{\text{u}}})
=n0(𝚯^k𝚺^𝚯k𝚺^)(𝜷𝜷u^)+n0(𝚯k𝚺^𝒆k)(𝜷𝜷u^)\displaystyle=\sqrt{n_{0}}(\hat{\bm{\Theta}}_{k}^{\top}\hat{\bm{\Sigma}}-{\bm{\Theta}}_{k}^{\top}\hat{\bm{\Sigma}})({\bm{\beta}}-\hat{{\bm{\beta}}^{\text{u}}})+\sqrt{n_{0}}({\bm{\Theta}}_{k}^{\top}\hat{\bm{\Sigma}}-\bm{e}_{k}^{\top})({\bm{\beta}}-\hat{{\bm{\beta}}^{\text{u}}}) (A.9)

For the first term in (A.9), we can further decompose it using \hat{\bm{\Sigma}}=\frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}\bm{x}_{ij}\bm{x}_{ij}^{\top}:

n0(𝚯^k𝚺^𝚯k𝚺^)(𝜷𝜷u^)\displaystyle\sqrt{n_{0}}(\hat{\bm{\Theta}}_{k}^{\top}\hat{\bm{\Sigma}}-{\bm{\Theta}}_{k}^{\top}\hat{\bm{\Sigma}})({\bm{\beta}}-\hat{{\bm{\beta}}^{\text{u}}}) =n0(𝚯^k𝚯k)Σ^(𝜷𝜷u^)\displaystyle={\sqrt{n_{0}}}(\hat{\bm{\Theta}}_{k}^{\top}-{\bm{\Theta}}_{k}^{\top})\hat{\Sigma}({\bm{\beta}}-\hat{{\bm{\beta}}^{\text{u}}})
n0λs(Σ^)𝚯^k𝚯k2|𝜷𝜷u^2\displaystyle\leq{\sqrt{n_{0}}}\lambda_{s}(\hat{\Sigma})\|\hat{\bm{\Theta}}_{k}-{\bm{\Theta}}_{k}\|_{2}|{\bm{\beta}}-\hat{{\bm{\beta}}^{\text{u}}}\|_{2} (A.10)

In the last inequality, we use λs\lambda_{s} to denote the largest s-restricted eigenvalue of the covariance matrix Σ^\hat{\Sigma}. From Theorem 3, we could obtain that there exists a constant cc such that:

𝜷𝜷u^22cσ2(slogdn0+(slogd)2log(1/δ)log3n0n02ε2)\|{\bm{\beta}}-\hat{{\bm{\beta}}^{\text{u}}}\|_{2}^{2}\leq c\cdot\sigma^{2}\left(\frac{s^{*}\log d}{n_{0}}+\frac{(s^{*}\log d)^{2}\log(1/\delta)\log^{3}n_{0}}{n_{0}^{2}\varepsilon^{2}}\right)

Also, for the output 𝚯^k\hat{\bm{\Theta}}_{k}, we could have the similar result:

𝚯^k𝚯k22cσ2(slogdn0+(slogd)2log(1/δ)log3n0n02ε2)\|\hat{\bm{\Theta}}_{k}-\bm{\Theta}_{k}\|_{2}^{2}\leq c\cdot\sigma^{2}\left(\frac{s^{*}\log d}{n_{0}}+\frac{(s^{*}\log d)^{2}\log(1/\delta)\log^{3}n_{0}}{n_{0}^{2}\varepsilon^{2}}\right)

Combining (A.10) with the two bounds above, we obtain

\sqrt{n_{0}}(\hat{\bm{\Theta}}_{k}^{\top}\hat{\bm{\Sigma}}-{\bm{\Theta}}_{k}^{\top}\hat{\bm{\Sigma}})({\bm{\beta}}-\hat{{\bm{\beta}}^{\text{u}}})=O\quantity(\frac{s^{*}\log d}{\sqrt{n_{0}}})=o(1)

We now turn to the second term of (A.9). We first introduce the following lemma:

Lemma A.3

(Lemma 6.2 in [24]) Consider the vector {\bm{\Theta}}_{k}^{\top}\hat{\bm{\Sigma}}-\bm{e}_{k}. Denote \kappa=\|\Sigma^{-1/2}\bm{X}_{1}\|_{\psi_{2}}. Then, with probability 1-2d^{1-a^{2}/24e^{2}\kappa^{4}L^{2}}, we have:

𝚯k𝚺^𝒆kalogdn0\|{\bm{\Theta}}_{k}^{\top}\hat{\bm{\Sigma}}-\bm{e}_{k}\|_{\infty}\leq a\sqrt{\frac{\log d}{n_{0}}}
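As an informal sanity check of this bound (not used anywhere in the proof), one can verify in simulation that \|{\bm{\Theta}}_{k}^{\top}\hat{\bm{\Sigma}}-\bm{e}_{k}\|_{\infty} is indeed of order \sqrt{\log d/n_{0}}; the covariance model and dimensions below are arbitrary choices of ours.

import numpy as np

rng = np.random.default_rng(1)
n0, d, k = 2000, 50, 0
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(d), np.arange(d)))  # AR(1) toy covariance
Theta = np.linalg.inv(Sigma)                                          # Theta_k = k-th row of Sigma^{-1}

X = rng.multivariate_normal(np.zeros(d), Sigma, size=n0)
Sigma_hat = X.T @ X / n0

e_k = np.zeros(d); e_k[k] = 1.0
deviation = np.max(np.abs(Theta[k] @ Sigma_hat - e_k))
print(deviation, np.sqrt(np.log(d) / n0))   # the two numbers should be of comparable order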

Thus, for the second term of (A.9), we have:

n0(𝚯k𝚺^𝒆j)(𝜷𝜷u^)\displaystyle\sqrt{n_{0}}({\bm{\Theta}}_{k}^{\top}\hat{\bm{\Sigma}}-\bm{e}_{j}^{\top})({\bm{\beta}}-\hat{{\bm{\beta}}^{\text{u}}}) n0𝚯k𝚺^𝒆k𝜷𝜷u^1\displaystyle\leq\sqrt{n_{0}}\|{\bm{\Theta}}_{k}^{\top}\hat{\bm{\Sigma}}-\bm{e}_{k}^{\top}\|_{\infty}\|{\bm{\beta}}-\hat{{\bm{\beta}}^{\text{u}}}\|_{1}
kn0logdn0s𝜷𝜷u^2\displaystyle\leq k\sqrt{n_{0}}\sqrt{\frac{\log d}{n_{0}}}\sqrt{s^{*}}\|{\bm{\beta}}-\hat{{\bm{\beta}}^{\text{u}}}\|_{2}
kslogdslogdn0=o(1)\displaystyle\leq k\cdot\sqrt{{s^{*}\log d}}\cdot\sqrt{\frac{s^{*}\log d}{n_{0}}}=o(1) (A.11)

Combining (A.9), (A.10), and (A.11), we obtain that the first term of (A.8) is o(1). For the third term of (A.8), \sqrt{n_{0}}\bm{E}_{k}\sim N(0,8\Delta_{1}^{2}\log(1.25/\delta)/(n_{0}\epsilon^{2})). By the definition of \Delta_{1}, we have 8\Delta_{1}^{2}\log(1.25/\delta)/(n_{0}\epsilon^{2})\asymp\frac{{s^{*}}^{2}\log^{2}d\log(1.25/\delta)}{n_{0}\epsilon^{2}}=o(1) by assumption. Also, note that \bm{E}^{\prime}\sim N(0,c\cdot\frac{{s^{*}}^{2}\log^{2}d\log(1.25/\delta)}{n_{0}^{2}\epsilon^{2}}); by Gaussian concentration, \bm{E}^{\prime}=o(1) as well.

Finally, we analyze the term 1n0𝚯^j𝑿𝑾\frac{1}{\sqrt{n_{0}}}\hat{\bm{\Theta}}_{j}^{\top}\bm{X}^{\top}\bm{W}. From our definition, WW is sub-Gaussian random noise. Then, from the central limit theorem, we could conclude that:

\frac{1}{\sqrt{n_{0}}}\hat{\bm{\Theta}}_{j}^{\top}\bm{X}^{\top}\bm{W}\to N(0,\sigma^{2}\hat{\bm{\Theta}}_{j}^{\top}\hat{\bm{\Sigma}}\hat{\bm{\Theta}}_{j})

Thus, \sqrt{n_{0}}(\hat{\beta_{j}}-\beta_{j})=\frac{1}{\sqrt{n_{0}}}\hat{\bm{\Theta}}_{j}^{\top}\bm{X}^{\top}\bm{W}+\sqrt{n_{0}}\bm{E}_{k}\sim N(0,\sigma^{2}\hat{\bm{\Theta}}_{j}^{\top}\hat{\bm{\Sigma}}\hat{\bm{\Theta}}_{j}+o(1)). Also, by Lemma 4.1, under our assumptions \hat{\sigma}^{2}=\sigma^{2}+o(1). It follows that, with high probability, \frac{\sqrt{n_{0}}(\hat{\beta_{j}}-\beta_{j})}{\hat{\sigma}\sqrt{\hat{\bm{\Theta}}_{j}^{\top}\hat{\bm{\Sigma}}\hat{\bm{\Theta}}_{j}}}\to N(0,1).

Therefore, [\hat{\beta_{j}}-\Phi^{-1}(1-\alpha/2)\frac{\hat{\sigma}}{\sqrt{n_{0}}}\sqrt{\hat{\bm{\Theta}}_{j}^{\top}\hat{\bm{\Sigma}}\hat{\bm{\Theta}}_{j}},\hat{\beta_{j}}+\Phi^{-1}(1-\alpha/2)\frac{\hat{\sigma}}{\sqrt{n_{0}}}\sqrt{\hat{\bm{\Theta}}_{j}^{\top}\hat{\bm{\Sigma}}\hat{\bm{\Theta}}_{j}}] is an asymptotically valid (1-\alpha) confidence interval for \beta_{j}, which finishes the proof of the theorem. \square
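For reference, a minimal sketch (with hypothetical inputs and our own function names, not the authors' implementation) of how this coordinate-wise interval is assembled from the privatized quantities above:

import numpy as np
from scipy.stats import norm

def coordinate_ci(beta_debiased_j, sigma_hat, quad_form_j, n0, alpha=0.05):
    # beta_debiased_j: privatized debiased estimate (beta_j^u + correction + E_k above)
    # quad_form_j:     privatized Theta_j^T Sigma_hat Theta_j (+ E' above)
    half_width = norm.ppf(1 - alpha / 2) * sigma_hat * np.sqrt(quad_form_j) / np.sqrt(n0)
    return beta_debiased_j - half_width, beta_debiased_j + half_width

print(coordinate_ci(beta_debiased_j=0.8, sigma_hat=1.1, quad_form_j=1.4, n0=10_000))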

A.5 Proof of Theorem 5

The proof is similar to that of Theorem 4; the difference is that we now consider the case where the privacy cost is not dominated by the statistical error. We follow the proof of Theorem 4 up to (A.8). The analysis of the second and third terms of (A.8) stays the same, so we focus on the first term of (A.8), which we decompose in the same manner as before:

n0(𝚯^k𝚺^𝒆k)(𝜷𝜷u^)\displaystyle\sqrt{n_{0}}(\hat{\bm{\Theta}}_{k}^{\top}\hat{\bm{\Sigma}}-\bm{e}_{k}^{\top})({\bm{\beta}}-\hat{{\bm{\beta}}^{\text{u}}}) =n0(𝚯^k𝚺^𝚯k𝚺^+𝚯k𝚺^𝒆k)(𝜷𝜷u^)\displaystyle=\sqrt{n_{0}}(\hat{\bm{\Theta}}_{k}^{\top}\hat{\bm{\Sigma}}-{\bm{\Theta}}_{k}^{\top}\hat{\bm{\Sigma}}+{\bm{\Theta}}_{k}^{\top}\hat{\bm{\Sigma}}-\bm{e}_{k}^{\top})({\bm{\beta}}-\hat{{\bm{\beta}}^{\text{u}}})
=n0(𝚯^k𝚺^𝚯k𝚺^)(𝜷𝜷u^)+n(𝚯k𝚺^𝒆k)(𝜷𝜷u^)\displaystyle=\sqrt{n_{0}}(\hat{\bm{\Theta}}_{k}^{\top}\hat{\bm{\Sigma}}-{\bm{\Theta}}_{k}^{\top}\hat{\bm{\Sigma}})({\bm{\beta}}-\hat{{\bm{\beta}}^{\text{u}}})+\sqrt{n}({\bm{\Theta}}_{k}^{\top}\hat{\bm{\Sigma}}-\bm{e}_{k}^{\top})({\bm{\beta}}-\hat{{\bm{\beta}}^{\text{u}}}) (A.12)

For the first term in (A.12), we can further decompose it using \hat{\bm{\Sigma}}=\frac{1}{n}\sum_{i=1}^{n}\bm{x}_{i}\bm{x}_{i}^{\top}:

n0(𝚯^k𝚺^𝚯k𝚺^)(𝜷𝜷u^)\displaystyle\sqrt{n_{0}}(\hat{\bm{\Theta}}_{k}^{\top}\hat{\bm{\Sigma}}-{\bm{\Theta}}_{k}^{\top}\hat{\bm{\Sigma}})({\bm{\beta}}-\hat{{\bm{\beta}}^{\text{u}}}) =n0(𝚯^k𝚯k)Σ^(𝜷𝜷u^)\displaystyle={\sqrt{n_{0}}}(\hat{\bm{\Theta}}_{k}^{\top}-{\bm{\Theta}}_{k}^{\top})\hat{\Sigma}({\bm{\beta}}-\hat{{\bm{\beta}}^{\text{u}}})
n0λs(Σ^)𝚯^k𝚯k2|𝜷𝜷u^2\displaystyle\leq{\sqrt{n_{0}}}\lambda_{s}(\hat{\Sigma})\|\hat{\bm{\Theta}}_{k}-{\bm{\Theta}}_{k}\|_{2}|{\bm{\beta}}-\hat{{\bm{\beta}}^{\text{u}}}\|_{2}
kμs2νs2s2log2dlog(1/δ)log3n0/n03/2ϵ2\displaystyle\leq\frac{k\mu_{s}^{2}}{\nu_{s}^{2}}s^{2}\log^{2}d\log(1/\delta)\log^{3}n_{0}/n_{0}^{3/2}\epsilon^{2} (A.13)

Thus, for the second term of (A.12), by Lemma A.3, we have:

n0(𝚯k𝚺^𝒆k)(𝜷𝜷u^)\displaystyle\sqrt{n_{0}}({\bm{\Theta}}_{k}^{\top}\hat{\bm{\Sigma}}-\bm{e}_{k}^{\top})({\bm{\beta}}-\hat{{\bm{\beta}}^{\text{u}}}) n0𝚯k𝚺^𝒆k𝜷𝜷u^1\displaystyle\leq\sqrt{n_{0}}\|{\bm{\Theta}}_{k}^{\top}\hat{\bm{\Sigma}}-\bm{e}_{k}^{\top}\|_{\infty}\|{\bm{\beta}}-\hat{{\bm{\beta}}^{\text{u}}}\|_{1}
kn0logdn0s𝜷𝜷u^2\displaystyle\leq k\sqrt{n_{0}}\sqrt{\frac{\log d}{n_{0}}}\sqrt{s^{*}}\|{\bm{\beta}}-\hat{{\bm{\beta}}^{\text{u}}}\|_{2}
kslogd𝜷𝜷u^2\displaystyle\leq k\cdot\sqrt{{s^{*}\log d}}\cdot\|{\bm{\beta}}-\hat{{\bm{\beta}}^{\text{u}}}\|_{2} (A.14)

When the privacy cost is not dominated by the statistical error and s^{*}\log d/\sqrt{n_{0}}=o(1), the term in (A.14) is of smaller order than the one in (A.13). Then, combining (A.13) and (A.14), there exists a constant \gamma such that:

n(𝚯^k𝚺^𝒆j)(𝜷𝜷u^)γμs2νs2s2log2dlog(1/δ)log3n0n03/2ϵ2\displaystyle\sqrt{n}(\hat{\bm{\Theta}}_{k}^{\top}\hat{\bm{\Sigma}}-\bm{e}_{j}^{\top})({\bm{\beta}}-\hat{{\bm{\beta}}^{\text{u}}})\leq\frac{\gamma\mu_{s}^{2}}{\nu_{s}^{2}}\frac{s^{2}\log^{2}d\log(1/\delta)\log^{3}n_{0}}{n_{0}^{3/2}\epsilon^{2}}

Inserting this result into (A.8), we have:

\displaystyle\sqrt{n_{0}}(\hat{\beta_{j}}-\beta_{j})=O\quantity(\frac{\mu_{s}^{2}}{\nu_{s}^{2}}\frac{s^{2}\log^{2}d\log(1/\delta)\log^{3}n_{0}}{n_{0}^{3/2}\epsilon^{2}})+\frac{1}{\sqrt{n_{0}}}\hat{\bm{\Theta}}_{k}^{\top}\bm{X}^{\top}\bm{W}+\sqrt{n_{0}}E_{k} (A.15)

Note that for the first term on the right-hand side, the constant can be set to 1 because it comes from the tail bound of the Laplace random variable. From (A.15), the central limit theorem shows that the second term is asymptotically Gaussian; together, the second and third terms on the right-hand side asymptotically follow N(0,\sigma^{2}\hat{\bm{\Theta}}_{k}^{\top}\hat{\bm{\Sigma}}\hat{\bm{\Theta}}_{k}+\frac{8\Delta_{1}^{2}\log(1/\delta)}{n_{0}\epsilon^{2}}). Also, by Gaussian concentration, with high probability E^{\prime}\leq\frac{8\Delta_{1}^{2}\log(1/\delta)}{n_{0}\epsilon^{2}}, so the required conditions on the injected noise are satisfied. Therefore, we have:

n0[βj^βjn0(𝚯k𝚺^𝒆k)(𝜷𝜷u^)]/σ2𝚯^k𝚺^𝚯^k+8Δ12log(1/δ)n0ϵ2N(0,1)\sqrt{n_{0}}[\hat{\beta_{j}}-\beta_{j}-\sqrt{n_{0}}({\bm{\Theta}}_{k}^{\top}\hat{\bm{\Sigma}}-\bm{e}_{k}^{\top})({\bm{\beta}}-\hat{{\bm{\beta}}^{\text{u}}})]/\sqrt{\sigma^{2}\hat{\bm{\Theta}}_{k}^{\top}\hat{\bm{\Sigma}}\hat{\bm{\Theta}}_{k}+\frac{8\Delta_{1}^{2}\log(1/\delta)}{n_{0}\epsilon^{2}}}\sim N(0,1)

Then, by our assumptions and the result in Lemma 4.1, we have \hat{\sigma}^{2}=\sigma^{2}+o(1). Thus, the confidence interval is given by:

\displaystyle J_{j}(\alpha)=\biggl{[}\hat{\beta_{j}}-\frac{\gamma\hat{\mu_{s}}^{2}}{\hat{\nu_{s}}^{2}}\frac{s^{2}\log^{2}d\log(1/\delta)\log^{3}n_{0}}{n_{0}^{2}\epsilon^{2}}-\Phi^{-1}(1-\alpha/2)\frac{\sigma}{\sqrt{n_{0}}}\sqrt{\hat{\bm{\Theta}}_{k}^{\top}\hat{\bm{\Sigma}}\hat{\bm{\Theta}}_{k}+\frac{8\Delta_{1}^{2}\log(1/\delta)}{n_{0}\epsilon^{2}}},
\displaystyle\hat{\beta_{j}}+\frac{\gamma\hat{\mu_{s}}^{2}}{\hat{\nu_{s}}^{2}}\frac{s^{2}\log^{2}d\log(1/\delta)\log^{3}n_{0}}{n_{0}^{2}\epsilon^{2}}+\Phi^{-1}(1-\alpha/2)\frac{\sigma}{\sqrt{n_{0}}}\sqrt{\hat{\bm{\Theta}}_{k}^{\top}\hat{\bm{\Sigma}}\hat{\bm{\Theta}}_{k}+\frac{8\Delta_{1}^{2}\log(1/\delta)}{n_{0}\epsilon^{2}}}\biggr{]}

which finishes our proof.
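To see how this interval differs from the one in Theorem 4, the following sketch (our own illustration with placeholder constants; the leading factor \gamma\hat{\mu}_{s}^{2}/\hat{\nu}_{s}^{2} is set to one) computes the extra bias offset and the privacy-inflated standard error entering J_{j}(\alpha):

import numpy as np
from scipy.stats import norm

def theorem5_half_width(sigma, quad_form, delta1, n0, eps, delta, bias_offset, alpha=0.05):
    # Variance is inflated by the privacy noise 8*Delta_1^2*log(1/delta)/(n0*eps^2);
    # the interval is additionally widened by the (deterministic) bias offset.
    var = sigma**2 * quad_form + 8 * delta1**2 * np.log(1 / delta) / (n0 * eps**2)
    return bias_offset + norm.ppf(1 - alpha / 2) * np.sqrt(var) / np.sqrt(n0)

s, d, n0, eps, delta = 10, 500, 5_000, 0.5, 1e-5
bias = s**2 * np.log(d)**2 * np.log(1 / delta) * np.log(n0)**3 / (n0**2 * eps**2)
print(theorem5_half_width(sigma=1.0, quad_form=1.2, delta1=5.0,
                          n0=n0, eps=eps, delta=delta, bias_offset=bias))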

A.6 Proof of Theorem 6

Let us first show that our algorithm is (\epsilon,\delta)-differentially private. The key step is the choice of the noise level B_{3}. We decompose:

𝚯^1mi=1mein(𝒈i𝒈¯)=𝚯^1mi=1mein𝒈i𝚯^nm(i=1mei)𝒈¯\displaystyle\hat{\bm{\Theta}}\frac{1}{\sqrt{m}}\sum_{i=1}^{m}e_{i}\sqrt{n}(\bm{g}_{i}-\bar{\bm{g}})=\hat{\bm{\Theta}}\frac{1}{\sqrt{m}}\sum_{i=1}^{m}e_{i}\sqrt{n}\bm{g}_{i}-\hat{\bm{\Theta}}\frac{\sqrt{n}}{\sqrt{m}}\quantity(\sum_{i=1}^{m}e_{i})\bar{\bm{g}}

Suppose the adjacent data set differs in a single record, denoted (\bm{x}_{ij},y_{ij}) and (\bm{x}^{\prime}_{ij},y^{\prime}_{ij}). Then, we calculate:

(𝚯^1mi=1mein𝒈i𝚯^nm(i=1mei)𝒈¯)(𝚯^1mi=1mein𝒈i𝚯^nm(i=1mei)𝒈¯)\displaystyle\norm{(\hat{\bm{\Theta}}\frac{1}{\sqrt{m}}\sum_{i=1}^{m}e_{i}\sqrt{n}\bm{g}_{i}-\hat{\bm{\Theta}}\frac{\sqrt{n}}{\sqrt{m}}\quantity(\sum_{i=1}^{m}e_{i})\bar{\bm{g}})-(\hat{\bm{\Theta}}\frac{1}{\sqrt{m}}\sum_{i=1}^{m}e_{i}\sqrt{n}\bm{g}^{\prime}_{i}-\hat{\bm{\Theta}}\frac{\sqrt{n}}{\sqrt{m}}\quantity(\sum_{i=1}^{m}e_{i})\bar{\bm{g}^{\prime}})}_{\infty}
𝚯^nmei(𝒈i𝒈i)𝚯^nm(i=1mei)(𝒈¯𝒈¯)\displaystyle\leq\norm{\hat{\bm{\Theta}}\frac{\sqrt{n}}{\sqrt{m}}e_{i}(\bm{g}_{i}-\bm{g}^{\prime}_{i})-\hat{\bm{\Theta}}\frac{\sqrt{n}}{\sqrt{m}}\quantity(\sum_{i=1}^{m}e_{i})(\bar{\bm{g}}-\bar{\bm{g}^{\prime}})}_{\infty}
𝚯^maxnmei(𝒈i𝒈i)nm(i=1mei)(𝒈¯𝒈¯)\displaystyle\leq\norm{\hat{\bm{\Theta}}}_{\max}\norm{\frac{\sqrt{n}}{\sqrt{m}}e_{i}(\bm{g}_{i}-\bm{g}^{\prime}_{i})-\frac{\sqrt{n}}{\sqrt{m}}\quantity(\sum_{i=1}^{m}e_{i})(\bar{\bm{g}}-\bar{\bm{g}^{\prime}})}_{\infty}
(𝚯^𝚯max+𝚯max)nmei(𝒈i𝒈i)nm(i=1mei)(𝒈¯𝒈¯)\displaystyle\leq(\norm{\hat{\bm{\Theta}}-\bm{\Theta}}_{\max}+\norm{\bm{\Theta}}_{\max})\norm{\frac{\sqrt{n}}{\sqrt{m}}e_{i}(\bm{g}_{i}-\bm{g}^{\prime}_{i})-\frac{\sqrt{n}}{\sqrt{m}}\quantity(\sum_{i=1}^{m}e_{i})(\bar{\bm{g}}-\bar{\bm{g}^{\prime}})}_{\infty}
\displaystyle\leq(\norm{\hat{\bm{\Theta}}-\bm{\Theta}}_{1}+\norm{\bm{\Theta}}_{2})\quantity(\norm{\frac{\sqrt{n}}{\sqrt{m}}e_{i}(\bm{g}_{i}-\bm{g}^{\prime}_{i})}_{\infty}+\norm{\frac{\sqrt{n}}{\sqrt{m}}\quantity(\sum_{i=1}^{m}e_{i})(\bar{\bm{g}}-\bar{\bm{g}^{\prime}})}_{\infty})
\displaystyle\leq(o(1)+L)\quantity(\frac{\sqrt{n}}{\sqrt{m}}\sqrt{\log m}\norm{\bm{g}_{i}-\bm{g}^{\prime}_{i}}_{\infty}+\frac{\sqrt{n}}{\sqrt{m}}m\sqrt{\log m}\norm{\bar{\bm{g}}-\bar{\bm{g}^{\prime}}}_{\infty})
L4logmmn𝒙ij(πR(yij)𝒙ij𝜷^)\displaystyle\leq L\frac{4\sqrt{\log m}}{\sqrt{mn}}\norm{\bm{x}_{ij}(\pi_{R}(y_{ij})-\bm{x}_{ij}\hat{{\bm{\beta}}})}_{\infty}
L4logmmncx(R+c0cxs)\displaystyle\leq L\frac{4\sqrt{\log m}}{\sqrt{mn}}c_{x}(R+c_{0}c_{x}\sqrt{s^{*}})

Thus, privacy is guaranteed. We now turn to the proof of consistency. Throughout the proof, let n_{0}=m\cdot n and let (\bm{X},\bm{Y}) denote the full data set, where \bm{X}\in\mathbb{R}^{n_{0}\times d} and \bm{Y}\in\mathbb{R}^{n_{0}}. Define U^{\prime}=\max_{k\in G}\hat{\bm{\Theta}}_{k}^{\top}\frac{1}{\sqrt{m}}\sum_{i=1}^{m}e_{i}\sqrt{n}(\bm{g}_{i}-\bar{\bm{g}}). We also define another multiplier bootstrap statistic:

U^{*}=\max_{k\in G}\frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}\bm{\Theta}_{k}^{\top}\bm{x}_{ij}(y_{ij}-\bm{x}_{ij}^{\top}{\bm{\beta}})e_{ij},

where eije_{ij} are all standard Gaussian variables. At the same time, we also define:

M_{0}=\max_{k\in G}\frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}\bm{\Theta}_{k}^{\top}\bm{x}_{ij}(y_{ij}-\bm{x}_{ij}^{\top}{\bm{\beta}})

The proof consists of three major steps. In the first step, we bound \sup_{\alpha\in(0,1)}|{\mathbb{P}}(M_{0}\leq C_{U^{*}}(\alpha))-\alpha|. This step is straightforward: we apply Theorem 3.1 from [10], for which we need to verify Corollary 2.1 from [10]. Notice that for any k, \mathbb{E}[(\Theta_{k}^{T}\bm{x}_{ij}(y_{ij}-\bm{x}_{ij}^{\top}{\bm{\beta}}))^{2}]=\sigma^{2}\Theta_{k}^{T}\Sigma\Theta_{k}\geq\sigma^{2}/L. It is also not difficult to verify that \Theta_{k}^{T}\bm{x}_{ij}(y_{ij}-\bm{x}_{ij}^{\top}{\bm{\beta}}) is sub-exponential: by Assumption D1, \bm{x}_{ij} is sub-Gaussian, and from the linear model, y_{ij}-\bm{x}_{ij}^{\top}{\bm{\beta}} is also sub-Gaussian. Hence the condition is verified. Thus, applying Theorem 3.1, and under the condition that there exist constants k,k_{0},k_{1} such that \log^{7}(dmn)/(mn)\leq\frac{1}{(mn)^{k}}, we have:

\displaystyle\sup_{\alpha\in(0,1)}|{\mathbb{P}}(M_{0}\leq C_{U^{*}}(\alpha))-\alpha| \displaystyle\leq k_{0}\cdot\frac{1}{(mn)^{k_{1}}}+k_{2}v^{1/3}(\max(1,\log(d/v)))^{2/3}+P(\square>v)
\displaystyle\leq k_{2}v^{1/3}(\max(1,\log(d/v)))^{2/3}+P(\square>v)+o(1), (A.16)

where \square denotes the maximum entrywise difference between the two matrices \Omega_{1} and \Omega_{2}, that is, \square=\|\Omega_{1}-\Omega_{2}\|_{\max}, with \Omega_{1} and \Omega_{2} defined as:

[\Omega_{1}]_{k,l}=\frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}\Theta_{k}^{\top}\bm{x}_{ij}(y_{ij}-\bm{x}_{ij}^{T}{\bm{\beta}})^{2}\bm{x}_{ij}^{\top}\Theta_{l}

and

Ω2=σ2Θ\Omega_{2}=\sigma^{2}\Theta

Then, from Corollary 3.1 in [10] and Lemma E.2 in [41], we could verify that Ω1Ω2max=O(logdn0+log2(dn0)logdn0)\|\Omega_{1}-\Omega_{2}\|_{\max}=O(\sqrt{\frac{\log d}{n_{0}}}+\frac{\log^{2}(dn_{0})\log d}{n_{0}}). With a proper choice of vv, e.g, there exists a constant κ\kappa and let v=(logdn0+log2(dn0)logdn0)1κv=(\sqrt{\frac{\log d}{n_{0}}}+\frac{\log^{2}(dn_{0})\log d}{n_{0}})^{1-\kappa}, we have k2v1/3(max(1,log(d/v)))2/3+P(>v)=o(1)k_{2}v^{1/3}(\max(1,\log(d/v)))^{2/3}+P(\square>v)=o(1). Next, we would like to associate MM with M0M_{0}. Similarly, from Theorem 3.2 in [10] and (A.16), we could have:

supα(0,1)|(MCU(α))α|o(1)+v1max(1,log(d/v1))+(MM0>v1),\sup_{\alpha\in(0,1)}|{\mathbb{P}}(M\leq C_{U^{*}}(\alpha))-\alpha|\leq o(1)+v_{1}\sqrt{max(1,\log(d/v_{1}))}+{\mathbb{P}}(\|M-M_{0}\|>v_{1}),

From the definition of MM and M0M_{0}, we have:

n0(MM0)=max1kd1n0|(𝚯k^𝚯k)𝑿𝑾|\sqrt{n_{0}}(M-M_{0})=\max_{1\leq k\leq d}\frac{1}{\sqrt{n_{0}}}|(\hat{\bm{\Theta}_{k}}^{\top}-\bm{\Theta}_{k}^{\top})\bm{X}^{\top}\bm{W}|

Then, for any k in 1,\dots,d, by Hölder's inequality and the Cauchy–Schwarz inequality, we have:

\displaystyle\frac{1}{\sqrt{n_{0}}}|(\hat{\bm{\Theta}_{k}}^{\top}-\bm{\Theta}_{k}^{\top})\bm{X}^{\top}\bm{W}|\leq\|\hat{\bm{\Theta}_{k}}-\bm{\Theta}_{k}\|_{1}\|\frac{1}{\sqrt{n_{0}}}\bm{X}^{\top}\bm{W}\|_{\infty}\leq\sqrt{s^{*}}\|\hat{\bm{\Theta}_{k}}-\bm{\Theta}_{k}\|_{2}\|\frac{1}{\sqrt{n_{0}}}\bm{X}^{\top}\bm{W}\|_{\infty}

On one hand, from the previous proof, \|\hat{\bm{\Theta}_{k}}-\bm{\Theta}_{k}\|_{2}\leq c\cdot\sqrt{\frac{s^{*}\log d}{mn}} uniformly over k when the privacy cost is dominated by the statistical error. On the other hand, since \bm{\Sigma} has a bounded maximum eigenvalue and we work under the standard linear regression model, Bernstein's inequality yields \|\frac{1}{\sqrt{n_{0}}}\bm{X}^{\top}\bm{W}\|_{\infty}=O_{p}(\sqrt{\log d}). Combining these two results, there exists a constant k_{0} such that:

\frac{1}{\sqrt{n_{0}}}|(\hat{\bm{\Theta}_{k}}^{\top}-\bm{\Theta}_{k}^{\top})\bm{X}^{\top}\bm{W}|\leq k_{0}\cdot(s^{*}\log d/\sqrt{n_{0}})

uniformly over all k. We can then choose v_{1} properly so that \sup_{\alpha\in(0,1)}|{\mathbb{P}}(M\leq C_{U^{*}}(\alpha))-\alpha|=o(1). Finally, we need to relate U^{*} to U. Our goal is to show that C_{U}(\alpha) and C_{U^{*}}(\alpha) are close to each other for any \alpha\in(0,1). We first relate U to U^{\prime}. By the design of the private max algorithm and Lemma 3.4 in [8], let l_{1} be the element chosen for U^{\prime} and l_{2} the one chosen for U without noise injection, and let w denote the noise injected when the largest value is selected privately. Then, for any c>0:

l22l12(1+c)l22+4(1+1/c)w2l_{2}^{2}\leq l_{1}^{2}\leq(1+c)l_{2}^{2}+4(1+1/c)\|w\|_{\infty}^{2}

From Lemma A.1 in [8], there exist constants k_{0},k_{1} such that \|w\|_{\infty}^{2}\leq k_{0}\cdot\frac{s^{*}\log^{4}d\log m}{n_{0}}. Choosing c=o(1), e.g., c=k_{1}\frac{s^{*}\log d}{n_{0}}, the conditions imply l_{1}=l_{2}+o(1); since the scale of the injected noise is small, it is easy to verify that U=U^{\prime}+o(1). The remaining discussion is between U^{\prime} and U^{*}. Denoting by \ominus the symmetric difference, we have:

\displaystyle{\mathbb{P}}(\{M\leq C_{U}(\alpha)\}\ominus\{M\leq C_{U^{*}}(\alpha)\})
\displaystyle\leq 2{\mathbb{P}}(C_{U^{*}}(\alpha-\pi(u))<M\leq C_{U^{*}}(\alpha+\pi(u)))+{\mathbb{P}}(C_{U^{*}}(\alpha-\pi(u))>C_{U}(\alpha))+{\mathbb{P}}(C_{U^{*}}(\alpha+\pi(u))<C_{U}(\alpha)) (A.17)

For the first term in (A.17), define \pi(u)=u^{1/3}\max(1,\log(d/u))^{2/3}; then there exists a constant k such that:

(CU(απ(u))<MCU(α+π(u)))(MCU(α+π(u)))(MCU(απ(u)))kπ(u)+o(1){\mathbb{P}}(C_{U^{*}}(\alpha-\pi(u))<M\leq C_{U^{*}}(\alpha+\pi(u)))\leq{\mathbb{P}}(M\leq C_{U^{*}}(\alpha+\pi(u)))-{\mathbb{P}}(M\leq C_{U^{*}}(\alpha-\pi(u)))\leq k\cdot\pi(u)+o(1)

Then, for the second term and third term in (A.17), from Lemma 3.2 in [10], we have:

(CU(απ(u))>CU(α))+(CU(α+π(u))<CU(α))2(Ω1Ω3max>u),{\mathbb{P}}(C_{U^{*}}(\alpha-\pi(u))>C_{U}(\alpha))+{\mathbb{P}}(C_{U^{*}}(\alpha+\pi(u))<C_{U}(\alpha))\leq 2{\mathbb{P}}(\|\Omega_{1}-\Omega_{3}\|_{\max}>u),

where Ω3\Omega_{3} is defined as:

[Ω3]k,l=1mi=1mn𝚯k^(𝒈i𝒈¯)(𝒈i𝒈¯)𝚯l^,[\Omega_{3}]_{k,l}=\frac{1}{m}\sum_{i=1}^{m}n\hat{\bm{\Theta}_{k}}(\bm{g}_{i}-\bar{\bm{g}})(\bm{g}_{i}-\bar{\bm{g}})^{\top}\hat{\bm{\Theta}_{l}},

and \Omega_{1} is defined as before. Our main task is to analyze \|\Omega_{1}-\Omega_{3}\|_{\max}. By the triangle inequality, \|\Omega_{1}-\Omega_{3}\|_{\max}\leq\|\Omega_{1}-\Omega_{2}\|_{\max}+\|\Omega_{3}-\Omega_{2}\|_{\max}. Since \|\Omega_{1}-\Omega_{2}\|_{\max} has already been analyzed, we focus on \|\Omega_{3}-\Omega_{2}\|_{\max}.

Ω3Ω2max1mi=1mn𝚯^(𝒈i𝒈¯)(𝒈i𝒈¯)𝚯^σ2𝚯^Σ𝚯^max+σ2𝚯^Σ𝚯^σ2𝚯max\displaystyle\|\Omega_{3}-\Omega_{2}\|_{\max}\leq\norm{\frac{1}{m}\sum_{i=1}^{m}n\hat{\bm{\Theta}}(\bm{g}_{i}-\bar{\bm{g}})(\bm{g}_{i}-\bar{\bm{g}})^{\top}\hat{\bm{\Theta}}-\sigma^{2}\hat{\bm{\Theta}}\Sigma\hat{\bm{\Theta}}}_{\max}+\|\sigma^{2}\hat{\bm{\Theta}}\Sigma\hat{\bm{\Theta}}-\sigma^{2}\bm{\Theta}\|_{\max} (A.18)

We will analyze the two terms separately. We start from the second term in (A.18), we have:

𝚯^Σ𝚯^𝚯max\displaystyle\|\hat{\bm{\Theta}}\Sigma\hat{\bm{\Theta}}-\bm{\Theta}\|_{\max}
(𝚯^𝚯+𝚯)Σ(𝚯^𝚯+𝚯)𝚯max\displaystyle\leq\|(\hat{\bm{\Theta}}-\bm{\Theta}+\bm{\Theta})\Sigma(\hat{\bm{\Theta}}-\bm{\Theta}+\bm{\Theta})-\bm{\Theta}\|_{\max}
𝚯^𝚯12𝚺max+2𝚯^𝚯1\displaystyle\leq\|\hat{\bm{\Theta}}-\bm{\Theta}\|_{1}^{2}\|\bm{\Sigma}\|_{\max}+2\|\hat{\bm{\Theta}}-\bm{\Theta}\|_{1}
k0s2logdn0+k1slogdn0\displaystyle\leq k_{0}\frac{{s^{*}}^{2}\log d}{n_{0}}+k_{1}s^{*}\sqrt{\frac{\log d}{n_{0}}}

On the other hand, for the first term in (A.18), notice that:

1mi=1mn𝚯^(𝒈i𝒈¯)(𝒈i𝒈¯)𝚯^=1mi=1mn𝚯^𝒈i𝒈i𝚯^n𝚯^𝒈¯𝒈¯𝚯^\frac{1}{m}\sum_{i=1}^{m}n\hat{\bm{\Theta}}(\bm{g}_{i}-\bar{\bm{g}})(\bm{g}_{i}-\bar{\bm{g}})^{\top}\hat{\bm{\Theta}}=\frac{1}{m}\sum_{i=1}^{m}n\hat{\bm{\Theta}}\bm{g}_{i}\bm{g}_{i}^{\top}\hat{\bm{\Theta}}-n\hat{\bm{\Theta}}\bar{\bm{g}}\bar{\bm{g}}^{\top}\hat{\bm{\Theta}}^{\top} (A.19)

Denote the data set on the ii-th local machine as (𝑿i,𝒀i)(\bm{X}_{i},\bm{Y}_{i}) and in the linear model, the random noise as 𝑾i\bm{W}_{i}. Also, we can further decompose the first term by:

1mi=1mn𝒈i𝒈i\displaystyle\frac{1}{m}\sum_{i=1}^{m}n\bm{g}_{i}\bm{g}_{i}^{\top}
\displaystyle=\frac{1}{m}\sum_{i=1}^{m}n[\frac{\bm{X}_{i}^{\top}W_{i}+\bm{X}_{i}^{\top}\bm{X}_{i}({\bm{\beta}}-\hat{\bm{\beta}})}{n}][\frac{\bm{X}_{i}^{\top}W_{i}+\bm{X}_{i}^{\top}\bm{X}_{i}({\bm{\beta}}-\hat{\bm{\beta}})}{n}]^{\top}
\displaystyle=\frac{1}{m}\sum_{i=1}^{m}n[\frac{\bm{X}_{i}^{\top}W_{i}}{n}][\frac{\bm{X}_{i}^{\top}W_{i}}{n}]^{\top}+\frac{1}{m}\sum_{i=1}^{m}n[\frac{\bm{X}_{i}^{\top}\bm{X}_{i}({\bm{\beta}}-\hat{\bm{\beta}})}{n}][\frac{\bm{X}_{i}^{\top}\bm{X}_{i}({\bm{\beta}}-\hat{\bm{\beta}})}{n}]^{\top}+\frac{2}{m}\sum_{i=1}^{m}n[\frac{\bm{X}_{i}^{\top}\bm{X}_{i}({\bm{\beta}}-\hat{\bm{\beta}})}{n}][\frac{\bm{X}_{i}^{\top}W_{i}}{n}]^{\top} (A.20)

Then, for the equation (A.19), we have:

1mi=1mn𝚯^(𝒈i𝒈¯)(𝒈i𝒈¯)𝚯^σ2𝚯^Σ𝚯^max\displaystyle\|\frac{1}{m}\sum_{i=1}^{m}n\hat{\bm{\Theta}}(\bm{g}_{i}-\bar{\bm{g}})(\bm{g}_{i}-\bar{\bm{g}})^{\top}\hat{\bm{\Theta}}-\sigma^{2}\hat{\bm{\Theta}}\Sigma\hat{\bm{\Theta}}\|_{\max}
𝚯^max1mi=1mn(𝒈i𝒈¯)(𝒈i𝒈¯)𝚯^σ2Σ𝚯^max\displaystyle\leq\|\hat{\bm{\Theta}}\|_{\max}\|\frac{1}{m}\sum_{i=1}^{m}n(\bm{g}_{i}-\bar{\bm{g}})(\bm{g}_{i}-\bar{\bm{g}})^{\top}\hat{\bm{\Theta}}-\sigma^{2}\Sigma\hat{\bm{\Theta}}\|_{\max}
𝚯^max21mi=1mn(𝒈i𝒈¯)(𝒈i𝒈¯)σ2Σmax\displaystyle\leq\|\hat{\bm{\Theta}}\|_{\max}^{2}\|\frac{1}{m}\sum_{i=1}^{m}n(\bm{g}_{i}-\bar{\bm{g}})(\bm{g}_{i}-\bar{\bm{g}})^{\top}-\sigma^{2}\Sigma\|_{\max} (A.21)

And, we could insert (A.20) into (A.21),

1mi=1mn(𝒈i𝒈¯)(𝒈i𝒈¯)σ2Σmax\displaystyle\|\frac{1}{m}\sum_{i=1}^{m}n(\bm{g}_{i}-\bar{\bm{g}})(\bm{g}_{i}-\bar{\bm{g}})^{\top}-\sigma^{2}\Sigma\|_{\max}
\displaystyle\leq\|\frac{1}{m}\sum_{i=1}^{m}n[\frac{\bm{X}_{i}^{\top}W_{i}}{n}][\frac{\bm{X}_{i}^{\top}W_{i}}{n}]^{\top}-\sigma^{2}\Sigma\|_{\max}+n\|\bar{\bm{g}}\bar{\bm{g}}^{\top}\|_{\max}+\|\frac{1}{m}\sum_{i=1}^{m}n[\frac{\bm{X}_{i}^{\top}\bm{X}_{i}({\bm{\beta}}-\hat{\bm{\beta}})}{n}][\frac{\bm{X}_{i}^{\top}\bm{X}_{i}({\bm{\beta}}-\hat{\bm{\beta}})}{n}]^{\top}\|_{\max}
\displaystyle+\|\frac{2}{m}\sum_{i=1}^{m}n[\frac{\bm{X}_{i}^{\top}\bm{X}_{i}({\bm{\beta}}-\hat{\bm{\beta}})}{n}][\frac{\bm{X}_{i}^{\top}W_{i}}{n}]^{\top}\|_{\max} (A.22)

We analyze the four terms in (A.22) one by one. The first term is straightforward: from the proof of Lemma F.2 in [41], it is O_{p}(\sqrt{\frac{\log d}{m}}+\frac{\log^{2}(dm)\log d}{m}). For the second term, we have:

n𝒈¯𝒈¯maxn𝒈¯2=n1n0𝑿(𝒀𝑿𝜷^)2\displaystyle n\|\bar{\bm{g}}\bar{\bm{g}}^{\top}\|_{\max}\leq n\|\bar{\bm{g}}\|_{\infty}^{2}=n\|\frac{1}{n_{0}}\bm{X}^{\top}(\bm{Y}-\bm{X}\hat{{\bm{\beta}}})\|_{\infty}^{2}

Also, we have:

1n0𝑿(𝒀𝑿𝜷^)\displaystyle\|\frac{1}{n_{0}}\bm{X}^{\top}(\bm{Y}-\bm{X}\hat{{\bm{\beta}}})\|_{\infty}
1n0𝑿(𝒀𝑿𝜷)+1n0𝑿𝑿(𝜷^𝜷)\displaystyle\leq\|\frac{1}{n_{0}}\bm{X}^{\top}(\bm{Y}-\bm{X}{{\bm{\beta}}})\|_{\infty}+\|\frac{1}{n_{0}}\bm{X}^{\top}\bm{X}(\hat{{\bm{\beta}}}-{{\bm{\beta}}})\|_{\infty}
1n0𝑿𝑾+(Σ^Σ)(𝜷^𝜷)+Σmax𝜷^𝜷1\displaystyle\leq\|\frac{1}{n_{0}}\bm{X}^{\top}\bm{W}\|_{\infty}+\|(\hat{\Sigma}-\Sigma)(\hat{{\bm{\beta}}}-{{\bm{\beta}}})\|_{\infty}+\|\Sigma\|_{\max}\|\hat{{\bm{\beta}}}-{{\bm{\beta}}}\|_{1}
k0(logdn0)+k1(logdn0)𝜷^𝜷1+𝜷^𝜷1\displaystyle\leq k_{0}(\sqrt{\frac{\log d}{n_{0}}})+k_{1}(\sqrt{\frac{\log d}{n_{0}}})\|\hat{{\bm{\beta}}}-{{\bm{\beta}}}\|_{1}+\|\hat{{\bm{\beta}}}-{{\bm{\beta}}}\|_{1}
k0logdn0+k1slogdn0+k2slogdn0\displaystyle\leq k_{0}\sqrt{\frac{\log d}{n_{0}}}+k_{1}\frac{s^{*}\log d}{n_{0}}+k_{2}s^{*}\sqrt{\frac{\log d}{n_{0}}} (A.23)

Thus, for the second term, we can obtain that n𝒈¯𝒈¯maxk0s2logd/m+k1s2log2d/m2nn\|\bar{\bm{g}}\bar{\bm{g}}^{\top}\|_{\max}\leq k_{0}{s^{*}}^{2}\log d/m+k_{1}{s^{*}}^{2}\log^{2}d/m^{2}n. For the third term, we have:

1mi=1mn[𝑿i𝑿i(𝜷𝜷^)n][𝑿i𝑿i(𝜷𝜷^)n]max\displaystyle\|\frac{1}{m}\sum_{i=1}^{m}n[\frac{\bm{X}_{i}^{\top}\bm{X}_{i}({\bm{\beta}}-\hat{\bm{\beta}})}{n}][\frac{\bm{X}_{i}^{\top}\bm{X}_{i}({\bm{\beta}}-\hat{\bm{\beta}})}{n}]^{\top}\|_{\max}
1mi=1mn[𝑿i𝑿i(𝜷𝜷^)n][𝑿i𝑿i(𝜷𝜷^)n]max\displaystyle\leq\frac{1}{m}\sum_{i=1}^{m}n\|[\frac{\bm{X}_{i}^{\top}\bm{X}_{i}({\bm{\beta}}-\hat{\bm{\beta}})}{n}][\frac{\bm{X}_{i}^{\top}\bm{X}_{i}({\bm{\beta}}-\hat{\bm{\beta}})}{n}]^{\top}\|_{\max}
1mi=1mn[𝑿i𝑿i(𝜷𝜷^)n]2\displaystyle\leq\frac{1}{m}\sum_{i=1}^{m}n\|[\frac{\bm{X}_{i}^{\top}\bm{X}_{i}({\bm{\beta}}-\hat{\bm{\beta}})}{n}]\|_{\infty}^{2}
1mi=1mn(Σi^Σmax+Σmax)2𝜷𝜷^12\displaystyle\leq\frac{1}{m}\sum_{i=1}^{m}n(\|\hat{\Sigma_{i}}-\Sigma\|_{\max}+\|\Sigma\|_{\max})^{2}\|{\bm{\beta}}-\hat{\bm{\beta}}\|_{1}^{2}
1mi=1m2n(Σi^Σmax2+Σmax2)𝜷𝜷^12\displaystyle\leq\frac{1}{m}\sum_{i=1}^{m}2n(\|\hat{\Sigma_{i}}-\Sigma\|_{\max}^{2}+\|\Sigma\|_{\max}^{2})\|{\bm{\beta}}-\hat{\bm{\beta}}\|_{1}^{2}
1mi=1m2n(O(logdn)+O(1))𝜷𝜷^12\displaystyle\leq\frac{1}{m}\sum_{i=1}^{m}2n(O(\sqrt{\frac{\log d}{n}})+O(1))\|{\bm{\beta}}-\hat{\bm{\beta}}\|_{1}^{2}
k0s2logdm\displaystyle\leq k_{0}{s^{*}}^{2}\frac{\log d}{m} (A.24)

For the fourth term, applying the Cauchy–Schwarz inequality gives:

2mi=1mn[𝑿i𝑿i(𝜷𝜷^)n][𝑿iWin]max\displaystyle\|\frac{2}{m}\sum_{i=1}^{m}n[\frac{\bm{X}_{i}^{\top}\bm{X}_{i}({\bm{\beta}}-\hat{\bm{\beta}})}{n}][\frac{\bm{X}_{i}^{\top}W_{i}}{n}]^{\top}\|_{\max}
2mi=1mn[𝑿i𝑿i(𝜷𝜷^)n][𝑿iWin]max\displaystyle\leq\frac{2}{m}\sum_{i=1}^{m}n\|[\frac{\bm{X}_{i}^{\top}\bm{X}_{i}({\bm{\beta}}-\hat{\bm{\beta}})}{n}][\frac{\bm{X}_{i}^{\top}W_{i}}{n}]^{\top}\|_{\max}
2mi=1mn𝑿i𝑿i(𝜷𝜷^)n𝑿iWinmax\displaystyle\leq\frac{2}{m}\sum_{i=1}^{m}n\|\frac{\bm{X}_{i}^{\top}\bm{X}_{i}({\bm{\beta}}-\hat{\bm{\beta}})}{n}\|_{\infty}\|\frac{\bm{X}_{i}^{\top}W_{i}}{n}\|_{\max}
2mi=1mn𝑿i𝑿inmax𝜷𝜷^1𝑿iWin\displaystyle\leq\frac{2}{m}\sum_{i=1}^{m}n\|\frac{\bm{X}_{i}^{\top}\bm{X}_{i}}{n}\|_{\max}\|{\bm{\beta}}-\hat{{\bm{\beta}}}\|_{1}\|\frac{\bm{X}_{i}^{\top}W_{i}}{n}\|_{\infty}
k0nslogdn0\displaystyle\leq k_{0}n\cdot\frac{s^{*}\log d}{n_{0}}
k0slogdm\displaystyle\leq k_{0}\frac{s^{*}\log d}{m} (A.25)

Combining the results in (A.23), (A.24), and (A.25), and inserting them into (A.22) and then into (A.21), we obtain that the first term of (A.18) is of order O(\sqrt{\frac{\log d}{n_{0}}}+\frac{s^{*}\log d}{n_{0}}+\frac{{s^{*}}^{2}\log d}{m}). Inserting this result into (A.17) and choosing u properly, we can verify that \sup_{\alpha\in(0,1)}|{\mathbb{P}}(M\leq C_{U}(\alpha))-\alpha|=o(1), which finishes the proof.
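The bootstrap quantile C_{U}(\alpha) that appears throughout this argument can be approximated numerically as sketched below; this is a schematic of the (non-private) multiplier bootstrap step with synthetic inputs and our own variable names, not the exact private algorithm analyzed above.

import numpy as np

def multiplier_bootstrap_quantile(theta_hat, g, n, alpha=0.05, n_boot=2000, rng=None):
    # Approximates the (1-alpha)-quantile of
    #   max_k [ Theta_hat (1/sqrt(m)) sum_i e_i sqrt(n) (g_i - g_bar) ]_k
    # over Gaussian multipliers e_1, ..., e_m.
    rng = rng if rng is not None else np.random.default_rng(0)
    m, d = g.shape
    g_centered = g - g.mean(axis=0)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        e = rng.standard_normal(m)                              # Gaussian multipliers
        u = np.sqrt(n) * theta_hat @ (g_centered.T @ e) / np.sqrt(m)
        stats[b] = np.max(u)                                    # max over all coordinates (G = [d])
    return np.quantile(stats, 1 - alpha)

rng = np.random.default_rng(2)
m, d, n = 50, 20, 100
g = rng.standard_normal((m, d)) / np.sqrt(n)                    # synthetic per-machine gradients
print(multiplier_bootstrap_quantile(np.eye(d), g, n, rng=rng))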

A.7 Proof of Theorem 7

The proof of Theorem 7 is straightforward. We decompose the true {\bm{\beta}}_{i}=\bm{u}+\bm{v}_{i} and, correspondingly, \hat{{\bm{\beta}}}_{i}=\hat{\bm{u}}+\hat{\bm{v}_{i}}. Then, from the estimation results, we obtain:

\displaystyle\|\bm{u}-\hat{\bm{u}}\|_{2}^{2}\leq c_{0}\frac{s_{0}\log d}{mn}+c_{2}\frac{{s_{0}}^{2}\log^{2}d\log(1/\delta)\log^{3}mn}{m^{2}n^{2}\epsilon^{2}},

and

\displaystyle\|\hat{\bm{v}_{i}}-\bm{v}_{i}\|_{2}^{2}\leq c_{1}\frac{s_{1}\log d}{n}+c_{3}\frac{{s_{1}}^{2}\log^{2}d\log(1/\delta)\log^{3}n}{n^{2}\epsilon^{2}}

Combining the above two results with the inequality \|\hat{{\bm{\beta}}_{i}}-{\bm{\beta}}_{i}\|_{2}\leq\|\hat{\bm{v}_{i}}-\bm{v}_{i}\|_{2}+\|\bm{u}-\hat{\bm{u}}\|_{2} completes the proof of Theorem 7. \square

A.8 Proof of Theorem 8

The proof of Theorem 8 follows the proofs of Theorem 4 and Theorem 5. We follow the proof of Theorem 4 up to (A.8). The analysis of the second and third terms of (A.8) stays the same, so we focus on the first term, which we decompose in the same manner as before:

n(𝚯^j𝚺^𝒆j)(𝜷𝜷u^)\displaystyle\sqrt{n}(\hat{\bm{\Theta}}_{j}^{\top}\hat{\bm{\Sigma}}-\bm{e}_{j}^{\top})({\bm{\beta}}-\hat{{\bm{\beta}}^{\text{u}}}) =n(𝚯^j𝚺^𝚯j𝚺^+𝚯j𝚺^𝒆j)(𝜷𝜷u^)\displaystyle=\sqrt{n}(\hat{\bm{\Theta}}_{j}^{\top}\hat{\bm{\Sigma}}-{\bm{\Theta}}_{j}^{\top}\hat{\bm{\Sigma}}+{\bm{\Theta}}_{j}^{\top}\hat{\bm{\Sigma}}-\bm{e}_{j}^{\top})({\bm{\beta}}-\hat{{\bm{\beta}}^{\text{u}}})
=n(𝚯^j𝚺^𝚯j𝚺^)(𝜷𝜷u^)+n(𝚯j𝚺^𝒆j)(𝜷𝜷u^)\displaystyle=\sqrt{n}(\hat{\bm{\Theta}}_{j}^{\top}\hat{\bm{\Sigma}}-{\bm{\Theta}}_{j}^{\top}\hat{\bm{\Sigma}})({\bm{\beta}}-\hat{{\bm{\beta}}^{\text{u}}})+\sqrt{n}({\bm{\Theta}}_{j}^{\top}\hat{\bm{\Sigma}}-\bm{e}_{j}^{\top})({\bm{\beta}}-\hat{{\bm{\beta}}^{\text{u}}}) (A.26)

For the first term in (A.26), we can further decompose it using \hat{\bm{\Sigma}}=\frac{1}{n}\sum_{i=1}^{n}\bm{x}_{i}\bm{x}_{i}^{\top}:

n(𝚯^j𝚺^𝚯j𝚺^)(𝜷𝜷u^)\displaystyle\sqrt{n}(\hat{\bm{\Theta}}_{j}^{\top}\hat{\bm{\Sigma}}-{\bm{\Theta}}_{j}^{\top}\hat{\bm{\Sigma}})({\bm{\beta}}-\hat{{\bm{\beta}}^{\text{u}}}) =n(𝚯^j𝚯j)Σ^(𝜷𝜷u^)\displaystyle={\sqrt{n}}(\hat{\bm{\Theta}}_{j}^{\top}-{\bm{\Theta}}_{j}^{\top})\hat{\Sigma}({\bm{\beta}}-\hat{{\bm{\beta}}^{\text{u}}})
nλs(Σ^)𝚯^j𝚯j2|𝜷𝜷u^2\displaystyle\leq{\sqrt{n}}\lambda_{s}(\hat{\Sigma})\|\hat{\bm{\Theta}}_{j}-{\bm{\Theta}}_{j}\|_{2}|{\bm{\beta}}-\hat{{\bm{\beta}}^{\text{u}}}\|_{2}
o(1)+γμs2νs2s12log2dlog(1/δ)log3mnm2n3/2ϵ2+γμs2νs2s02log2dlog(1/δ)log3nn3/2ϵ2\displaystyle\leq o(1)+\frac{\gamma\mu_{s}^{2}}{\nu_{s}^{2}}\frac{s_{1}^{2}\log^{2}d\log(1/\delta)\log^{3}mn}{m^{2}n^{3/2}\epsilon^{2}}+\frac{\gamma\mu_{s}^{2}}{\nu_{s}^{2}}\frac{s_{0}^{2}\log^{2}d\log(1/\delta)\log^{3}n}{n^{3/2}\epsilon^{2}} (A.27)

Thus, for the second term of (A.26), by Lemma A.3, we have:

n(𝚯j𝚺^𝒆j)(𝜷𝜷u^)\displaystyle\sqrt{n}({\bm{\Theta}}_{j}^{\top}\hat{\bm{\Sigma}}-\bm{e}_{j}^{\top})({\bm{\beta}}-\hat{{\bm{\beta}}^{\text{u}}}) n𝚯j𝚺^𝒆j𝜷𝜷u^1\displaystyle\leq\sqrt{n}\|{\bm{\Theta}}_{j}^{\top}\hat{\bm{\Sigma}}-\bm{e}_{j}^{\top}\|_{\infty}\|{\bm{\beta}}-\hat{{\bm{\beta}}^{\text{u}}}\|_{1}
knlogdmns𝜷𝜷u^2\displaystyle\leq k\sqrt{n}\sqrt{\frac{\log d}{mn}}\sqrt{s}\|{\bm{\beta}}-\hat{{\bm{\beta}}^{\text{u}}}\|_{2}
knslogdmn𝜷𝜷u^2\displaystyle\leq k\cdot\sqrt{n}\sqrt{\frac{s\log d}{mn}}\cdot\|{\bm{\beta}}-\hat{{\bm{\beta}}^{\text{u}}}\|_{2}
o(1)+γμs2νs2s12log2dlog(1/δ)log3mnm2n3/2ϵ2+γμs2νs2s02log2dlog(1/δ)log3nn3/2ϵ2\displaystyle\leq o(1)+\frac{\gamma\mu_{s}^{2}}{\nu_{s}^{2}}\frac{s_{1}^{2}\log^{2}d\log(1/\delta)\log^{3}mn}{m^{2}n^{3/2}\epsilon^{2}}+\frac{\gamma\mu_{s}^{2}}{\nu_{s}^{2}}\frac{s_{0}^{2}\log^{2}d\log(1/\delta)\log^{3}n}{n^{3/2}\epsilon^{2}} (A.28)

Then, combining (A.27) and (A.28), we have:

n(𝚯^j𝚺^𝒆j)(𝜷𝜷u^)2γμs2νs2s12log2dlog(1/δ)log3mnm2n3/2ϵ2+2γμs2νs2s02log2dlog(1/δ)log3nn3/2ϵ2\displaystyle\sqrt{n}(\hat{\bm{\Theta}}_{j}^{\top}\hat{\bm{\Sigma}}-\bm{e}_{j}^{\top})({\bm{\beta}}-\hat{{\bm{\beta}}^{\text{u}}})\leq\frac{2\gamma\mu_{s}^{2}}{\nu_{s}^{2}}\frac{s_{1}^{2}\log^{2}d\log(1/\delta)\log^{3}mn}{m^{2}n^{3/2}\epsilon^{2}}+\frac{2\gamma\mu_{s}^{2}}{\nu_{s}^{2}}\frac{s_{0}^{2}\log^{2}d\log(1/\delta)\log^{3}n}{n^{3/2}\epsilon^{2}}

Inserting this result into (A.8), we have:

n(βj^βj2γμs2νs2s12log2dlog(1/δ)log3mnm2n2ϵ22γμs2νs2s02log2dlog(1/δ)log3nn2ϵ2)=1n𝚯^j𝑿𝑾+nE3\displaystyle\sqrt{n}\quantity(\hat{\beta_{j}}-\beta_{j}-\frac{2\gamma\mu_{s}^{2}}{\nu_{s}^{2}}\frac{s_{1}^{2}\log^{2}d\log(1/\delta)\log^{3}mn}{m^{2}n^{2}\epsilon^{2}}-\frac{2\gamma\mu_{s}^{2}}{\nu_{s}^{2}}\frac{s_{0}^{2}\log^{2}d\log(1/\delta)\log^{3}n}{n^{2}\epsilon^{2}})=\frac{1}{\sqrt{n}}\hat{\bm{\Theta}}_{j}^{\top}\bm{X}^{\top}\bm{W}+\sqrt{n}E_{3} (A.29)

From (A.29), notice that the right-hand side asymptotically follows N(0,\sigma^{2}\hat{\bm{\Theta}}_{j}^{\top}\hat{\bm{\Sigma}}\hat{\bm{\Theta}}_{j}+\frac{8\Delta_{1}^{2}\log(1/\delta)}{n\epsilon^{2}}). Also, by Gaussian concentration, with high probability E_{2}\leq\frac{8\Delta_{1}^{2}\log(1/\delta)}{n\epsilon^{2}}. Thus, we have:

n(βj^βj2γμs2νs2s12log2dlog(1/δ)m2n2ϵ22γμs2νs2s02log2dlog(1/δ)n2ϵ2)/σ2𝚯^j𝚺^𝚯^j+8Δ12log(1/δ)nϵ2N(0,1)\sqrt{n}\quantity(\hat{\beta_{j}}-\beta_{j}-\frac{2\gamma\mu_{s}^{2}}{\nu_{s}^{2}}\frac{s_{1}^{2}\log^{2}d\log(1/\delta)}{m^{2}n^{2}\epsilon^{2}}-\frac{2\gamma\mu_{s}^{2}}{\nu_{s}^{2}}\frac{s_{0}^{2}\log^{2}d\log(1/\delta)}{n^{2}\epsilon^{2}})/\sqrt{\sigma^{2}\hat{\bm{\Theta}}_{j}^{\top}\hat{\bm{\Sigma}}\hat{\bm{\Theta}}_{j}+\frac{8\Delta_{1}^{2}\log(1/\delta)}{n\epsilon^{2}}}\sim N(0,1)

We can then replace \mu_{s},\nu_{s} with the estimates \hat{\mu}_{s},\hat{\nu}_{s} introduced in Algorithm 5; the constant can be scaled to one given the tail bound of the Laplace random variable. Also, for the estimate of \sigma, by assumption we have \hat{\sigma}=\sigma+o(1). For simplicity, denote a=\frac{2k\hat{\mu}_{s}^{2}}{\hat{\nu}_{s}^{2}}\frac{s_{1}^{2}\log^{2}d\log(1/\delta)\log^{3}mn}{m^{2}n^{2}\epsilon^{2}}+\frac{2k\hat{\mu}_{s}^{2}}{\hat{\nu}_{s}^{2}}\frac{s_{0}^{2}\log^{2}d\log(1/\delta)\log^{3}n}{n^{2}\epsilon^{2}}; the confidence interval is then given by:

Jj(α)=\displaystyle J_{j}(\alpha)=
[βj^aσΦ1(1α/2)n𝚯^j𝚺^𝚯^j+8Δ12log(1/δ)nϵ2,βj^+a+σΦ1(1α/2)n𝚯^j𝚺^𝚯^j+8Δ12log(1/δ)nϵ2]\displaystyle[\hat{\beta_{j}}-a-\frac{\sigma\Phi^{-1}(1-\alpha/2)}{\sqrt{n}}\sqrt{\hat{\bm{\Theta}}_{j}^{\top}\hat{\bm{\Sigma}}\hat{\bm{\Theta}}_{j}+\frac{8\Delta_{1}^{2}\log(1/\delta)}{n\epsilon^{2}}},\hat{\beta_{j}}+a+\frac{\sigma\Phi^{-1}(1-\alpha/2)}{\sqrt{n}}\sqrt{\hat{\bm{\Theta}}_{j}^{\top}\hat{\bm{\Sigma}}\hat{\bm{\Theta}}_{j}+\frac{8\Delta_{1}^{2}\log(1/\delta)}{n\epsilon^{2}}}]

A.9 Proof of Theorem 9

In this proof, we first show that our algorithm is (\epsilon,\delta)-differentially private. Suppose two adjacent data sets differ in a single record, (\bm{x}_{ij},y_{ij}) versus (\bm{x}^{\prime}_{ij},y^{\prime}_{ij}). Then, we have:

1n𝚯^𝒙ijej1n𝚯^𝒙ijej\displaystyle\|\frac{1}{\sqrt{n}}\hat{\bm{\Theta}}\bm{x}_{ij}e_{j}-\frac{1}{\sqrt{n}}\hat{\bm{\Theta}}\bm{x}_{ij}^{\prime}e_{j}\|_{\infty} 2n𝚯^𝒙ijej\displaystyle\leq\frac{2}{\sqrt{n}}\|\hat{\bm{\Theta}}\bm{x}_{ij}e_{j}\|_{\infty}
2n𝚯^𝒙ijej\displaystyle\leq\frac{2}{\sqrt{n}}\|\hat{\bm{\Theta}}\bm{x}_{ij}\|_{\infty}\|e_{j}\|_{\infty}
2lognn𝚯^1𝒙ij\displaystyle\leq\frac{2\sqrt{\log n}}{\sqrt{n}}\|\hat{\bm{\Theta}}\|_{1}\|\bm{x}_{ij}\|_{\infty}
2lognncxsc1\displaystyle\leq 2\sqrt{\frac{\log n}{n}}c_{x}\sqrt{s}c_{1}

By the choice of B_{5}, privacy is guaranteed. We now prove consistency. In this proof specifically, we define U^{\prime}=\max_{k\in G}\hat{\bm{\Theta}}_{k}^{T}\frac{1}{\sqrt{n}}\sum_{j=1}^{n}\bm{x}_{ij}e_{j}, and we also define:

M0=maxkG1nj=1nξjk,M_{0}=\max_{k\in G}\frac{1}{\sqrt{n}}\sum_{j=1}^{n}\xi_{jk},

where each \xi_{jk} follows the Gaussian distribution N(0,\Theta_{k}^{\top}\Sigma\Theta_{k}\sigma^{2}). We also define the \alpha-quantile of M_{0} as U_{M_{0}}(\alpha). We can now start the proof.

We aim to prove that \sup_{\alpha\in(0,1)}|{\mathbb{P}}(M\leq C_{U}(\alpha))-\alpha|=o(1). First, we show that C_{U}(\alpha) and C_{U^{\prime}}(\alpha) are close to each other for any \alpha\in(0,1). By the design of the private max algorithm and Lemma 3.4 in [8], let l_{1} be the element chosen for U^{\prime} and l_{2} the one chosen for U without noise injection, and let w denote the noise injected when the largest value is selected privately. Then, for any c>0:

l22l12(1+c)l22+4(1+1/c)w2l_{2}^{2}\leq l_{1}^{2}\leq(1+c)l_{2}^{2}+4(1+1/c)\|w\|_{\infty}^{2}

From Lemma A.1 in [8], there exist constants k_{0},k_{1} such that \|w\|_{\infty}^{2}\leq k_{0}\cdot\frac{s^{*}\log^{4}d\log n}{n}. Choosing c=o(1), e.g., c=k_{1}\frac{s^{*}\log d}{n}, the conditions imply l_{1}=l_{2}+o(1); since the scale of the injected noise is small, it is easy to verify that U=U^{\prime}+o(1). The remaining discussion is between U^{\prime} and U_{M_{0}}.

Motivated by the proofs of Theorem 3.1 and Theorem 3.2 in [10], our proof is divided into two major parts: measuring the closeness between U^{\prime} and U_{M_{0}}, and measuring the closeness between M and M_{0}. We start with M and M_{0}. From the definition M(\hat{{\bm{\beta}}}^{(i)})=\max_{k\in G}\sqrt{n}(\hat{{\bm{\beta}}}_{k}^{(i)}-{\bm{\beta}}_{k}^{(i)}), we notice that for each k in 1,2,\dots,d, we have:

n(𝜷^(i)𝜷(i))=1nΘ^Xi𝑾i+(Θ^ΣI)(𝜷^𝜷)+nE\displaystyle\sqrt{n}(\hat{{\bm{\beta}}}^{(i)}-{\bm{\beta}}^{(i)})=\frac{1}{\sqrt{n}}\hat{\Theta}X_{i}^{\top}\bm{W}_{i}+(\hat{\Theta}\Sigma-I)(\hat{{\bm{\beta}}}-{\bm{\beta}})+\sqrt{n}E

Then,

|MM0|(1nΘ^Xi𝑾i1nj=1nξj)+(Θ^ΣI)(𝜷^𝜷)+nE\displaystyle|M-M_{0}|\leq(\|\frac{1}{\sqrt{n}}\hat{\Theta}X_{i}^{\top}\bm{W}_{i}\|_{\infty}-\|\frac{1}{\sqrt{n}}\sum_{j=1}^{n}\xi_{j}\|_{\infty})+\|(\hat{\Theta}\Sigma-I)(\hat{{\bm{\beta}}}-{\bm{\beta}})+\sqrt{n}E\|_{\infty} (A.30)

We analyze the two parts of (A.30) separately. For the first term in (A.30), from Lemma 1.1 in [42] we obtain that \sup_{z}|P(\|\frac{1}{\sqrt{n}}\bm{\Theta}\bm{X}_{i}^{\top}\bm{W}_{i}\|_{\infty}\leq z)-P(M_{0}\leq z)|\leq c_{0}\cdot\frac{1}{n^{c_{1}}}, where c_{0} and c_{1} are constants. Also,

1n𝚯^𝑿i𝑾i1n𝚯𝑿i𝑾i\displaystyle\|\frac{1}{\sqrt{n}}\hat{\bm{\Theta}}\bm{X}_{i}^{\top}\bm{W}_{i}\|_{\infty}-\|\frac{1}{\sqrt{n}}\bm{\Theta}\bm{X}_{i}^{\top}\bm{W}_{i}\|_{\infty} 1n𝚯^𝑿i𝑾i𝚯𝑿i𝑾i\displaystyle\leq\frac{1}{\sqrt{n}}\|\hat{\bm{\Theta}}\bm{X}_{i}^{\top}\bm{W}_{i}-\bm{\Theta}\bm{X}_{i}^{\top}\bm{W}_{i}\|_{\infty}
\displaystyle\leq\|\hat{\bm{\Theta}}-\bm{\Theta}\|_{1}\cdot\|\frac{1}{\sqrt{n}}\bm{X}_{i}^{\top}\bm{W}_{i}\|_{\infty}
cslogdnlogdcslogd1n=o(1)\displaystyle\leq c\cdot s^{*}\sqrt{\frac{\log d}{n}}\sqrt{\log d}\leq c\cdot s^{*}\log d\sqrt{\frac{1}{n}}=o(1)

On the other hand, for the second term in (A.30), following the proof of Theorem 4, this part is also o(1), which finishes the first part of the proof. For the second part, by the arguments in the proof of Theorem 3.2 in [10], we have, for any v:

\displaystyle\sup_{\alpha\in(0,1)}|{\mathbb{P}}(M\leq C_{U^{\prime}}(\alpha))-\alpha|\leq c_{0}\frac{1}{n^{c_{1}}}+c_{2}v^{1/3}(1\vee\log(d/v))^{2/3}+P(\square>v),

where \square=\max_{k,l}\hat{\bm{\Theta}}_{k}^{\top}\hat{\bm{\Sigma}}\hat{\bm{\Theta}}_{l}-\bm{\Theta}_{k}^{\top}\bm{\Sigma}\bm{\Theta}_{l}. Then, we have:

\displaystyle\|\hat{\bm{\Theta}}^{\top}\hat{\bm{\Sigma}}\hat{\bm{\Theta}}-\bm{\Theta}^{\top}\bm{\Sigma}\bm{\Theta}\|_{\max}
\displaystyle\leq\|\hat{\bm{\Theta}}^{\top}\hat{\bm{\Sigma}}\hat{\bm{\Theta}}-\hat{\bm{\Theta}}^{\top}{\bm{\Sigma}}\hat{\bm{\Theta}}\|_{\max}+\|\hat{\bm{\Theta}}\Sigma\hat{\bm{\Theta}}-\bm{\Theta}\|_{\max}
\displaystyle\leq\|\hat{\bm{\Theta}}\|_{\infty}\|\hat{\bm{\Sigma}}-{\bm{\Sigma}}\|_{\max}\|\hat{\bm{\Theta}}\|_{1}+\|(\hat{\bm{\Theta}}-\bm{\Theta}+\bm{\Theta})\Sigma(\hat{\bm{\Theta}}-\bm{\Theta}+\bm{\Theta})-\bm{\Theta}\|_{\max}
L2logdn0+𝚯^𝚯12𝚺max+2𝚯^𝚯1\displaystyle\leq L^{2}\sqrt{\frac{\log d}{n_{0}}}+\|\hat{\bm{\Theta}}-\bm{\Theta}\|_{1}^{2}\|\bm{\Sigma}\|_{\max}+2\|\hat{\bm{\Theta}}-\bm{\Theta}\|_{1}
k0s2logdn0+k1slogdn0,\displaystyle\leq k_{0}\frac{{s^{*}}^{2}\log d}{n_{0}}+k_{1}s^{*}\sqrt{\frac{\log d}{n_{0}}},

where k_{0},k_{1} are constants. Then, with a proper choice of v, we can claim that \sup_{\alpha\in(0,1)}|{\mathbb{P}}(M\leq C_{U^{\prime}}(\alpha))-\alpha|=o(1), which finishes the proof.

Appendix B Appendix

In this section, we give the proofs of the corollaries and of Lemma 4.1, one by one.

B.1 Proof of Corollary 1

Following the proof of Theorem 1 in [2], we can obtain the following upper bound on \sum_{i=1}^{k}d_{TV}({\bm{p}_{+i}^{Z^{m}}},{\bm{p}_{-i}^{Z^{m}}}) under central differential privacy:

1k(i=1kdTV(𝒑+iZm,𝒑iZm))2\displaystyle\frac{1}{k}\quantity(\sum_{i=1}^{k}d_{TV}({\bm{p}_{+i}^{Z^{m}}},{\bm{p}_{-i}^{Z^{m}}}))^{2} 7t=1m𝔼A[i=1k𝒵(𝔼𝒑an[𝒲(zX)]𝔼𝒑ain[𝒲(zX))]2𝔼𝒑a[𝒲(zX)]𝑑μ]\displaystyle\leq 7\sum_{t=1}^{m}\mathbb{E}_{A}\quantity[\sum_{i=1}^{k}\int_{\mathcal{Z}}\frac{(\mathbb{E}_{\bm{p}_{a}^{\otimes n}}[\mathcal{W}{(z\mid X)}]-\mathbb{E}_{\bm{p}_{a^{\oplus i}}^{\otimes n}}[{\mathcal{W}(z\mid X)})]^{2}}{\mathbb{E}_{\bm{p}_{a}}[{\mathcal{W}(z\mid X)}]}d{\mu}]

Also notice that:

𝔼𝒑ain[𝒲(zX))]=𝔼𝒑an[d𝒑aind𝒑an𝒲(zX))]=𝔼𝒑a(1+qa,iϕa,i(X))n𝒲(zX)\displaystyle\mathbb{E}_{\bm{p}_{a^{\oplus i}}^{\otimes n}}[{\mathcal{W}(z\mid X)})]=\mathbb{E}_{\bm{p}_{a}^{\otimes n}}\quantity[\frac{d\bm{p}_{a^{\oplus i}}^{\otimes n}}{d\bm{p}_{a}^{\otimes n}}{\mathcal{W}(z\mid X)})]=\mathbb{E}_{\bm{p}_{a}}(1+q_{a,i}\phi_{a,i}(X))^{n}\cdot\mathcal{W}(z\mid X)

The last equality follows from the definition of Condition 1. Also, there exist constants c_{0}, c_{1} such that, when x>0 and x\asymp 1/n, (1+x)^{n}\leq 1+c_{0}\cdot nx and (1-x)^{n}\leq 1-c_{1}\cdot nx. Hence, if |q_{a,i}\phi_{a,i}(X)|\asymp 1/n, we can find a constant c_{2} such that:

1k(i=1kdTV(𝒑+iZm,𝒑iZm))2c2q2n2t=1m𝔼A[i=1k𝒵𝔼𝒑An[ϕa,i(X)𝒲(zX)]2𝔼𝒑An[𝒲(zX)]𝑑μ]\displaystyle\frac{1}{k}(\sum_{i=1}^{k}d_{TV}\quantity({\bm{p}_{+i}^{Z^{m}}},{\bm{p}_{-i}^{Z^{m}}}))^{2}\leq c_{2}q^{2}n^{2}\sum_{t=1}^{m}\mathbb{E}_{A}\quantity[\sum_{i=1}^{k}\int_{\mathcal{Z}}\frac{\mathbb{E}_{\bm{p}_{A}^{\otimes n}}{[\phi_{a,i}(X)\mathcal{W}(z\mid X)}]^{2}}{\mathbb{E}_{\bm{p}_{A}^{\otimes n}}[{\mathcal{W}(z\mid X)}]}d{\mu}]

which finishes the proof of Corollary 1.

B.2 Proof of Corollary 2

We now prove Corollary 2. First, from Theorem 2 in [2], when Condition 2 is satisfied and all the conditions in Corollary 2 hold, we have:

(1ki=1kdTV(𝒑+iZm,𝒑iZm))27kq2n2t=1mmaxa𝒜𝒵Var𝒑a[𝒲(zX)]𝔼𝒑a[𝒲(zX)]𝑑μ\displaystyle\quantity(\frac{1}{k}\sum_{i=1}^{k}d_{TV}(\bm{p}_{+i}^{Z^{m}},\bm{p}_{-i}^{Z^{m}}))^{2}\leq\frac{7}{k}q^{2}n^{2}\sum_{t=1}^{m}\max_{a\in\mathcal{A}}\int_{\mathcal{Z}}\frac{{\rm Var}_{\bm{p}_{a}}[\mathcal{W}(z\mid X)]}{\mathbb{E}_{\bm{p}_{a}}[{\mathcal{W}(z\mid X)}]}d{\mu}

The focus of the proof of this corollary is the calculation of \int_{\mathcal{Z}}\frac{{\rm Var}_{\bm{p}_{a}}[\mathcal{W}(z\mid X)]}{\mathbb{E}_{\bm{p}_{a}}[{\mathcal{W}(z\mid X)}]}d{\mu} when the channel \mathcal{W} is a privacy-constrained channel \mathcal{W}^{priv}. For simplicity, denote L(\bm{z},\bm{X})=\log\mathcal{W}^{priv}(\bm{z}|\bm{X}), where \bm{z}\in\mathbb{R}^{d} and \bm{X}\in\mathbb{R}^{n\times d}. Since \mathcal{W}^{priv} satisfies the \epsilon-differential privacy constraint, for two adjacent data sets \bm{X} and \bm{X}^{\prime} we have:

|L(𝒛,𝑿)L(𝒛,𝑿)|ϵ|L(\bm{z},\bm{X})-L(\bm{z},\bm{X}^{\prime})|\leq\epsilon

By McDiarmid’s inequality, L is \sqrt{n}\epsilon-sub-Gaussian. Hence we can find a constant c satisfying:

𝔼[e2L]ce2𝔼[L]e2nϵ2\mathbb{E}[e^{2L}]\leq c\cdot e^{2\mathbb{E}[L]}\cdot e^{2n\epsilon^{2}}

Then, by Jensen inequality, we have:

\mathbb{E}[e^{2L}]\leq c\cdot(\mathbb{E}[e^{L}])^{2}\cdot e^{2n\epsilon^{2}}

Thus, we have:

Var[𝒲priv(𝒛|𝑿)]𝔼[𝒲priv(𝒛|𝑿)]2=𝔼[𝒲priv(𝒛|𝑿)2](𝔼[𝒲priv(𝒛|𝑿)])21=𝔼[e2L](𝔼[eL])21e2nϵ21\frac{{\rm Var}[\mathcal{W}^{priv}(\bm{z}|\bm{X})]}{\mathbb{E}[\mathcal{W}^{priv}(\bm{z}|\bm{X})]^{2}}=\frac{\mathbb{E}[\mathcal{W}^{priv}(\bm{z}|\bm{X})^{2}]}{(\mathbb{E}[\mathcal{W}^{priv}(\bm{z}|\bm{X})])^{2}}-1=\frac{\mathbb{E}[e^{2L}]}{(\mathbb{E}[e^{L}])^{2}}-1\leq e^{2n\epsilon^{2}}-1

Thus, we have:

(1ki=1kdTV(𝒑+iZm,𝒑iZm))2\displaystyle\quantity(\frac{1}{k}\sum_{i=1}^{k}d_{TV}(\bm{p}_{+i}^{Z^{m}},\bm{p}_{-i}^{Z^{m}}))^{2}\leq 7kα2n2t=1mmaxa𝒜𝒵Var𝒑a[𝒲(zX)]𝔼𝒑a[𝒲(zX)]2𝔼𝒑a[W(zX)]𝑑μ\displaystyle\frac{7}{k}\alpha^{2}n^{2}\sum_{t=1}^{m}\max_{a\in\mathcal{A}}\int_{\mathcal{Z}}\frac{{\rm Var}_{\bm{p}_{a}}[\mathcal{W}(z\mid X)]}{\mathbb{E}_{\bm{p}_{a}}[{\mathcal{W}(z\mid X)}]^{2}}\cdot\mathbb{E}_{\bm{p}_{a}}[{W(z\mid X)}]d{\mu}
\displaystyle\leq 7kα2n2(e2nϵ21)t=1mmaxa𝒜𝒵𝔼𝒑a[𝒲(zX)]𝑑μ\displaystyle\frac{7}{k}\alpha^{2}n^{2}(e^{2n\epsilon^{2}}-1)\sum_{t=1}^{m}\max_{a\in\mathcal{A}}\int_{\mathcal{Z}}\mathbb{E}_{\bm{p}_{a}}[{\mathcal{W}(z\mid X)}]d{\mu}
\displaystyle\leq 7kq2mn2(e2nϵ21)\displaystyle\frac{7}{k}q^{2}mn^{2}(e^{2n\epsilon^{2}}-1)

which finishes the proof of Corollary 2.
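As a quick numerical remark (our own, not part of the proof), the factor e^{2n\epsilon^{2}}-1 appearing in this bound behaves like 2n\epsilon^{2} whenever n\epsilon^{2} is bounded, which is the regime in which the bound is most informative:

import numpy as np

for n, eps in [(100, 0.01), (1000, 0.01), (1000, 0.03)]:
    exact = np.exp(2 * n * eps**2) - 1
    linearized = 2 * n * eps**2
    print(n, eps, exact, linearized)   # exact ~ linearized when n * eps^2 is small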

B.3 Proof of Corollary 3

For an (n,ρ)(n,\rho)-estimator θ^\hat{\theta} of the true parameter under p\ell_{p} loss, we define A^\hat{A} for AA as

A^=argmina𝒜θaθ^p.\hat{A}=\underset{a\in\mathcal{A}}{\arg\!\min}\|\theta_{a}-\hat{\theta}\|_{p}.

Then, by the triangle inequality, we have

θAθA^pθAθ^p+θA^θ^p2θ^θAp.\norm{\theta_{A}-\theta_{\hat{A}}}_{p}\leq\norm{\theta_{A}-\hat{\theta}}_{p}+\norm{\theta_{\hat{A}}-\hat{\theta}}_{p}\leq 2\norm{\hat{\theta}-\theta_{A}}_{p}.

Because θ^\hat{\theta} is an (n,ρ)(n,\rho)-estimator under p\ell_{p} loss, we have,

𝔼Z[𝔼𝒑Z[θZθZ^pp]]\displaystyle\mathbb{E}_{Z}[\mathbb{E}_{\bm{p}_{Z}}[\norm{\theta_{Z}-\theta_{\hat{Z}}}_{p}^{p}]] 2pρp[𝒑Z𝒫Θ]+maxzzθzθzpp[𝒑Z𝒫Θ]\displaystyle\leq 2^{p}{\rho^{p}}{\mathbb{P}}[\bm{p}_{Z}\in\mathcal{P}_{\Theta}]+\max_{z\neq z^{\prime}}\norm{\theta_{z}-\theta_{z^{\prime}}}_{p}^{p}{\mathbb{P}}[\bm{p}_{Z}\notin\mathcal{P}_{\Theta}] (B.1)
2pρp+4pρp1ττ4\displaystyle\leq 2^{p}{\rho^{p}}+4^{p}{\rho^{p}}\frac{1}{\tau}\cdot\frac{\tau}{4} (B.2)
344pϵp,\displaystyle\leq\frac{3}{4}{4^{p}\epsilon^{p}}, (B.3)

using the fact that {\mathbb{P}}[{\bm{p}_{A}\in\mathcal{P}_{\Theta}}]\geq 1-\tau/4 and Condition 4. Next, combining Condition 4 with (B.3), we have \frac{1}{\tau k}\sum_{i=1}^{k}{\mathbb{P}}[A_{i}\neq\hat{A}_{i}]\leq\frac{3}{4}. Also, since the Markov relation A_{i}-X^{m}-Z^{m}-\hat{A}_{i} holds for all i, by the standard relation between total variation distance and hypothesis testing, and since \tau is at most 1/2 by definition, we have:

[AiA^i]\displaystyle{\mathbb{P}}[{A_{i}\neq\hat{A}_{i}}] τ[A^i=1|Ai=1]+(1τ)[A^i=1|Ai=1]\displaystyle\geq\tau{\mathbb{P}}[{\hat{A}_{i}=-1}|{A_{i}=1}]+(1-\tau){\mathbb{P}}[{\hat{A}_{i}=1}|{A_{i}=-1}]
τ([A^i=1|Ai=1]+[A^i=1|Ai=1])\displaystyle\geq\tau({\mathbb{P}}[{\hat{A}_{i}=-1}|{A_{i}=1}]+{\mathbb{P}}[{\hat{A}_{i}=1}|{A_{i}=-1}])
τ(1dTV(𝒑+iXm,𝒑iXm))\displaystyle\geq\tau(1-d_{TV}({\bm{p}_{+i}^{X^{m}}},{\bm{p}_{-i}^{X^{m}}}))
τ(11/ndTV(𝒑+iZm,𝒑iZm))\displaystyle\geq\tau(1-1/n\cdot d_{TV}({\bm{p}_{+i}^{Z^{m}}},{\bm{p}_{-i}^{Z^{m}}}))

The last inequality uses the definition of total variation: because Z^{m} is generated from X^{m} through the privacy-constrained channel W^{priv}, for each data set X_{i}, i=1,2,\dots,m, on the i-th machine, let X_{ijk} be the data set obtained by swapping the order of X_{ij} and X_{ik}; then for any z\in\mathcal{Z}, W^{priv}(z|X_{i})=W^{priv}(z|X_{ijk}). Thus, by the definition of total variation, we can verify that d_{TV}({\bm{p}_{+i}^{X_{i}}},{\bm{p}_{-i}^{X_{i}}})=1/n\cdot d_{TV}({\bm{p}_{+i}^{Z_{i}}},{\bm{p}_{-i}^{Z_{i}}}). Summing over 1\leq i\leq k and combining this with the previous bound, we obtain

\frac{3}{4}\geq\frac{1}{\tau k}\sum_{i=1}^{k}{\mathbb{P}}[A_{i}\neq\hat{A}_{i}]\geq 1-\frac{1}{nk}\sum_{i=1}^{k}d_{TV}(\bm{p}_{+i}^{Z^{m}},\bm{p}_{-i}^{Z^{m}}),

which finishes the proof of Corollary 3.
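As a side note, the relation between total variation distance and testing error used above is standard. The following small numerical check (not part of the proof) illustrates it for two unit-variance Gaussian location hypotheses; the separation mu and the Monte Carlo size n_mc are arbitrary illustrative choices.

```python
# Numerical illustration (not part of the proof) of the bound
#   P[psi != A | A = +1] + P[psi != A | A = -1] >= 1 - d_TV(P_+, P_-),
# with equality attained by the likelihood-ratio test for two Gaussians.
from math import erf, sqrt
import numpy as np

rng = np.random.default_rng(0)
mu = 0.5          # illustrative separation between the two hypotheses
n_mc = 200_000    # illustrative Monte Carlo sample size

# Exact total variation distance between N(+mu, 1) and N(-mu, 1).
d_tv = erf(mu / sqrt(2.0))

# The likelihood-ratio test psi(z) = sign(z) minimizes the error sum.
z_plus = rng.normal(+mu, 1.0, n_mc)    # data under A = +1
z_minus = rng.normal(-mu, 1.0, n_mc)   # data under A = -1
err_sum = np.mean(z_plus < 0) + np.mean(z_minus > 0)

print(f"1 - d_TV          = {1.0 - d_tv:.4f}")
print(f"minimal error sum = {err_sum:.4f}")  # approximately equal
```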

B.4 Proof of Lemma 4.1

Proof of Lemma 4.1: First, we show that our algorithm is (\epsilon,\delta)-differentially private. For two adjacent data sets that differ in a single entry, say (x_{i},y_{i}) versus (x_{i}^{\prime},y_{i}^{\prime}), we have:

\frac{1}{mn}|(\pi_{R}(y_{i})-x_{i}^{T}\hat{{\bm{\beta}}})^{2}-(\pi_{R}({y_{i}^{\prime}})-{x_{i}^{\prime}}^{T}\hat{{\bm{\beta}}})^{2}|
\leq\frac{2}{mn}(\pi_{R}(y_{i})-x_{i}^{T}\hat{{\bm{\beta}}})^{2}
\leq\frac{4}{mn}(\pi_{R}(y_{i})^{2}+(x_{i}^{T}\hat{{\bm{\beta}}})^{2})\leq\frac{4}{mn}(R^{2}+sc_{0}^{2}c_{x}^{2}),

where, without loss of generality, (\pi_{R}(y_{i})-x_{i}^{T}\hat{{\bm{\beta}}})^{2} denotes the larger of the two squared residuals. From the definition of the Gaussian mechanism, our algorithm is therefore (\epsilon,\delta)-differentially private. Next, we analyze the convergence rate of \hat{\sigma}^{2}. With our choice of R, we have \pi_{R}(Y)=Y with high probability. Therefore, we have:

|\sigma^{2}-\hat{\sigma}^{2}|\leq|\frac{1}{mn}\|\bm{X}{\bm{\beta}}+\bm{W}-\bm{X}\hat{{\bm{\beta}}}\|_{2}^{2}-\sigma^{2}|+|E|
\leq|\frac{1}{mn}\bm{W}^{T}\bm{W}-\sigma^{2}|+({\bm{\beta}}-\hat{{\bm{\beta}}})^{T}\hat{\Sigma}({\bm{\beta}}-\hat{{\bm{\beta}}})+\frac{2}{mn}|({\bm{\beta}}-\hat{{\bm{\beta}}})^{T}\bm{X}^{T}\bm{W}|+|E|.

For the first term, standard concentration of the noise second moment gives |\frac{1}{mn}\bm{W}^{T}\bm{W}-\sigma^{2}|=O_{p}(\frac{1}{\sqrt{mn}}). For the second term, we have:

({\bm{\beta}}-\hat{{\bm{\beta}}})^{T}\hat{\Sigma}({\bm{\beta}}-\hat{{\bm{\beta}}})\leq\lambda_{s}(\Sigma)\|{\bm{\beta}}-\hat{{\bm{\beta}}}\|_{2}^{2}
\leq cL\quantity(\frac{s\log d}{mn}+\frac{s^{2}\log^{2}d\log(1/\delta)\log^{3}mn}{m^{2}n^{2}\epsilon^{2}}).

Next, by Bernstein's inequality, with high probability we have:

\|\frac{1}{mn}\bm{X}^{T}\bm{W}\|_{\infty}\leq c_{1}\sqrt{\frac{\log d}{mn}}.

Therefore, we obtain:

\frac{2}{mn}|({\bm{\beta}}-\hat{{\bm{\beta}}})^{T}\bm{X}^{T}\bm{W}|
\leq\frac{2}{mn}\|{\bm{\beta}}-\hat{{\bm{\beta}}}\|_{1}\|\bm{X}^{T}\bm{W}\|_{\infty}
\leq c_{2}\frac{\sqrt{s}}{mn}\|{\bm{\beta}}-\hat{{\bm{\beta}}}\|_{2}\|\bm{X}^{T}\bm{W}\|_{\infty}
\leq c_{3}\sqrt{s}\quantity(\sqrt{\frac{s\log d}{mn}}+\frac{s\log d\sqrt{\log(1/\delta)}\log^{3/2}mn}{mn\epsilon})\sqrt{\frac{\log d}{mn}}
=O_{p}\quantity(\frac{s\log d}{mn}+\frac{s\log d\sqrt{\log(1/\delta)}\log^{3/2}mn}{mn\epsilon}\cdot\sqrt{\frac{s\log d}{mn}})
=O_{p}\quantity(\frac{s\log d}{mn}+\frac{s^{2}\log^{2}d\log(1/\delta)\log^{3}mn}{m^{2}n^{2}\epsilon^{2}}).
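Here the last step uses the inequality ab\leq\frac{1}{2}(a^{2}+b^{2}) with a=\frac{s\log d\sqrt{\log(1/\delta)}\log^{3/2}mn}{mn\epsilon} and b=\sqrt{\frac{s\log d}{mn}}.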

Also, from our algorithm, we have E\sim N(0,2B_{2}^{2}\log(1.25/\delta)/\epsilon^{2}). Then,

|E|=\frac{2B_{2}^{2}\log(1/\delta)}{\epsilon^{2}}|N(0,1)|=c_{4}\frac{R^{4}+s^{2}c_{0}^{4}c_{x}^{4}\log(1/\delta)}{m^{2}n^{2}\epsilon^{2}}=c_{5}\cdot\frac{s^{2}\log^{2}d\log(1/\delta)\log^{3}mn}{m^{2}n^{2}\epsilon^{2}},

by observing that c_{x}=O(\sqrt{\log d}). Combining the above inequalities completes the proof of Lemma 4.1.
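To make the mechanism analyzed above concrete, here is a minimal sketch (not the paper's exact algorithm) of releasing the averaged clipped squared residuals through the Gaussian mechanism; the function name private_sigma2 and the parameter values in the toy call, including the sensitivity bound passed as B2, are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the paper's exact algorithm) of the
# Gaussian-mechanism step analyzed above: release the averaged squared
# residuals with the responses clipped at R, plus calibrated Gaussian noise.
import numpy as np

def private_sigma2(X, y, beta_hat, R, eps, delta, B2, rng=None):
    """Return (1/N) * sum_i (clip(y_i, R) - x_i^T beta_hat)^2 + E, where
    E ~ N(0, 2 * B2^2 * log(1.25/delta) / eps^2) and B2 upper-bounds the
    sensitivity of the averaged squared residuals on adjacent datasets."""
    rng = np.random.default_rng() if rng is None else rng
    resid = np.clip(y, -R, R) - X @ beta_hat     # clipped residuals
    noise_sd = np.sqrt(2.0 * np.log(1.25 / delta)) * B2 / eps
    return np.mean(resid ** 2) + rng.normal(0.0, noise_sd)

# Toy usage on synthetic data (purely illustrative parameter choices).
rng = np.random.default_rng(1)
N, d, s, sigma = 2000, 50, 5, 1.0
X = rng.normal(size=(N, d))
beta = np.zeros(d)
beta[:s] = 1.0
y = X @ beta + sigma * rng.normal(size=N)
B2 = 4.0 * (10.0 ** 2 + s) / N   # stand-in for (4/(mn)) * (R^2 + s c_0^2 c_x^2)
print(private_sigma2(X, y, beta, R=10.0, eps=1.0, delta=1e-5, B2=B2, rng=rng))
```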

B.5 Proof of Lemma 4.2

Proof of Lemma 4.2: It is not difficult to verify the privacy conditions. Next, by a standard covering-number argument, we can find n_{1} vectors \bm{v}_{1},\bm{v}_{2},\dots,\bm{v}_{n_{1}} such that every s-sparse unit vector \bm{v} satisfies \|\bm{v}-\bm{v}_{i}\|\leq 1/9 for some i. Letting \bm{v}^{*} denote the s-sparse unit vector attaining \lambda_{s} and \bm{v}_{i} a covering vector with \|\bm{v}^{*}-\bm{v}_{i}\|\leq 1/9, we have:

\lambda_{s}-\hat{\lambda}_{s}=\bm{v}^{*T}\hat{\bm{\Sigma}}\bm{v}^{*}-\bm{v}_{i}^{T}\hat{\bm{\Sigma}}\bm{v}_{i}
\leq\bm{v}^{*T}\hat{\bm{\Sigma}}(\bm{v}^{*}-\bm{v}_{i})+(\bm{v}^{*}-\bm{v}_{i})^{T}\hat{\bm{\Sigma}}\bm{v}_{i}
\leq\frac{2}{9}\lambda_{2s}
\leq\frac{8}{9}\lambda_{s}.

Note that the second-to-last inequality is a direct consequence of the Cauchy–Schwarz inequality, and the last inequality holds for the following reason: let \bm{v}^{*^{\prime}} be the eigenvector corresponding to \lambda_{2s}; we can decompose it into two s-sparse vectors \bm{v}_{1}^{\prime} and \bm{v}_{2}^{\prime} such that \bm{v}^{*^{\prime}}=\bm{v}_{1}^{\prime}+\bm{v}_{2}^{\prime}, so that \lambda_{2s}=(\bm{v}_{1}^{\prime}+\bm{v}_{2}^{\prime})^{T}\hat{\bm{\Sigma}}(\bm{v}_{1}^{\prime}+\bm{v}_{2}^{\prime})\leq 4\lambda_{s}.
Also, for the noise \xi, by the concentration of the Laplace distribution, there exists a constant c such that \xi\leq cs\log d/\sqrt{n}=o(1) with high probability. Combining this with the definition of \lambda_{s} concludes the proof. \square
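As an illustration of the quantity analyzed in this proof, the sketch below approximates the largest s-sparse eigenvalue of a sample covariance by maximizing \bm{v}^{T}\hat{\bm{\Sigma}}\bm{v} over a finite candidate set and adding Laplace noise \xi; the brute-force candidate construction and the noise scale are illustrative stand-ins for the covering set \bm{v}_{1},\dots,\bm{v}_{n_{1}} and the calibration used by the algorithm.

```python
# Illustrative sketch (not the paper's exact algorithm): approximate the
# largest s-sparse eigenvalue of Sigma_hat over a finite candidate set and
# add Laplace noise xi, mirroring the covering-set argument in the proof.
import itertools
import numpy as np

def sparse_top_eig_private(Sigma_hat, s, noise_scale, rng):
    d = Sigma_hat.shape[0]
    best = -np.inf
    # Candidates: the top eigenvector of every s x s principal submatrix,
    # a brute-force stand-in for the covering set v_1, ..., v_{n_1}.
    for support in itertools.combinations(range(d), s):
        idx = np.array(support)
        sub = Sigma_hat[np.ix_(idx, idx)]
        best = max(best, np.linalg.eigvalsh(sub)[-1])
    xi = rng.laplace(scale=noise_scale)   # privacy noise, as in the proof
    return best + xi

# Toy usage (illustrative parameter choices).
rng = np.random.default_rng(0)
n, d, s = 500, 10, 2
X = rng.normal(size=(n, d))
Sigma_hat = X.T @ X / n
print(sparse_top_eig_private(Sigma_hat, s, noise_scale=0.05, rng=rng))
```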