
Online differentially private inference in stochastic gradient descent

Jinhan Xie1, Enze Shi2, Bei Jiang2, Linglong Kong2, Xuming He3

The first two authors contributed equally to this work.
Abstract

We propose a general privacy-preserving optimization-based framework for real-time environments without requiring trusted data curators. In particular, we introduce a noisy stochastic gradient descent algorithm for online statistical inference with streaming data under local differential privacy constraints. Unlike existing methods that either disregard privacy protection or require full access to the entire dataset, our proposed algorithm provides rigorous local privacy guarantees for individual-level data. It operates as a one-pass algorithm without re-accessing the historical data, thereby significantly reducing both time and space complexity. We also introduce online private statistical inference by developing two procedures for constructing valid private confidence intervals. We formally establish the convergence rates of the proposed estimators and present a functional central limit theorem showing that the averaged solution path of these estimators weakly converges to a rescaled Brownian motion, providing a theoretical foundation for our online inference tool. Numerical simulation experiments demonstrate the finite-sample performance of our proposed procedure, underscoring its efficacy and reliability. Furthermore, we illustrate our method with an analysis of two datasets, the ride-sharing data and the US insurance data, showcasing its practical utility.

1Yunnan Key Laboratory of Statistical Modeling and Data Analysis, Yunnan University

2Department of Mathematical and Statistical Sciences, University of Alberta

3Department of Statistics and Data Science, Washington University in St. Louis

Keywords: Confidence interval; Local differential privacy; Privacy budget; Stochastic gradient descent.

1 Introduction

Online learning involves methods and algorithms designed for making timely inferences and predictions in a real-time environment, where data are collected sequentially rather than being static as in traditional batch learning. By not requiring the simultaneous use of the entire sample pool, online learning alleviates both storage and computation pressures. To date, various statistical methods and algorithms for online estimation and inference have been proposed, including aggregated estimating equations (Lin and Xi, 2011), cumulative estimating equations (Schifano et al., 2016), renewable estimators (Luo and Song, 2020), and stochastic gradient descent (SGD) (Robbins and Monro, 1951; Polyak and Juditsky, 1992; Chen et al., 2020). Alongside online learning, another crucial consideration is the preservation of privacy, especially with the growing availability of streaming data. For example, financial institutions such as banks and credit card companies often leverage customer-level data for fraud detection and personalized service offerings, and they must continuously update and refine their models in real time as new transaction data are collected. This aids not only in monitoring transactions to detect unusual activities but also in designing targeted financial products and promotions (e.g., offering personalized loan rates or credit card rewards based on individual spending patterns); see Maniar et al. (2021); Davidow et al. (2023). However, individual customer data are usually sensitive and irreplaceable, and it is of great importance to prevent the leak of personal information to maintain customer trust and confidence, as well as to help businesses avoid financial losses and reputational damage (Hu et al., 2022; Fainmesser et al., 2023).

In response to the increasing demand for data privacy protection, differential privacy (DP) has emerged as a widely adopted concept (Dwork et al., 2006). It has been successfully applied in numerous fields, including healthcare, financial services, retail, and e-commerce (Dankar and El Emam, 2013; Zhong, 2019; Wang and Tsai, 2022). DP procedures ensure that an adversary cannot determine with high probability whether a particular subject is included in the dataset, thereby providing robust protection for personal information. In the literature, two main models have emerged for DP: the central model, where a trusted third party collects and analyzes data before releasing privatized results, and the local model, where data are randomized before being shared. Several works have focused on designing private algorithms that guarantee DP under the trusted central server setting across various applications, including, but not limited to, machine learning, synthetic data generation, and transfer learning; see Karwa and Slavković (2016); Ponomareva et al. (2023); Li et al. (2024). A significant limitation of the standard central model is the assumption of a trusted data curator with access to the entire dataset. However, in scenarios such as smartphone usage, users may not fully trust the server and prefer not to have their personal information directly stored or updated on any remote central server. In contrast to central DP, user-level local differential privacy (LDP) does not rely on a trusted data collector, placing the privacy barrier closer to the users. It has seen adoption in practice by several companies that analyze sensitive user data, including Apple (Tang et al., 2017), Google (Erlingsson et al., 2014), and Microsoft (Ding et al., 2017). Recently, there has been growing interest in studying LDP from a statistical perspective, including density estimation (Sart, 2023), mean and median estimation (Duchi et al., 2018), nonparametric estimation problems (Rohde and Steinberger, 2020), and change-point detection (Berrett and Yu, 2021).

The aforementioned methods mainly address the inherent trade-off between privacy protection and efficient statistical inference, focusing on identifying what optimal privacy-preserving mechanisms may look like. The literature on private inference, particularly in quantifying the uncertainty of private estimators, is relatively limited even in offline settings and has recently gained attention in computer science (Karwa and Vadhan, 2018; Wang et al., 2019; Chadha et al., 2021). This issue has also been studied in the statistical literature. Sheffet (2017) provided finite-sample analyses of conservative inference for confidence intervals regarding the coefficients from the least-squares regression. Barrientos et al. (2019) presented algorithms for comparing the sign and significance level of privately computed regression coefficients that satisfy DP. Avella-Medina (2021) introduced the M-estimator approach to conduct DP statistical inference using smooth sensitivity. Alabi and Vadhan (2023) proposed DP hypothesis tests for the multivariate linear regression model. Avella-Medina et al. (2023) investigated optimization-based approaches for Gaussian differentially private M-estimators. Zhao et al. (2024) extended this to objective functions without local strong convexity and smoothness assumptions. Zhang et al. (2024) considered estimation and inference problems in the federated learning setting under DP. Despite these efforts, a significant gap remains in developing a statistically sound approach for conducting statistical inference, such as constructing confidence intervals based on private estimators, particularly in the online setting, where theoretical results for statistical inference are largely lacking. Recently, Liu et al. (2023) proposed a self-normalizing approach to obtain confidence intervals for population quantiles with privacy concerns. However, this method is specifically tailored to population quantiles.

In this paper, we introduce a rigorous LDP framework for online estimation and inference in optimization-based regression problems with streaming data, addressing the challenge of constructing valid confidence intervals for online private inference. Specifically, we propose a fully online, locally private framework for real-time estimation and inference without requiring trusted data collectors. To the best of our knowledge, this is the first time in the literature on online inference problems where the issue of privacy is systematically investigated. The main contributions of this paper are summarized as follows:

  1.

    To eliminate the need for trusted data collectors, we design a local differential privacy stochastic gradient descent algorithm (LDP-SGD) for streaming data that requires only one pass over the data. Unlike prior works, our algorithm provides rigorous individual-level data privacy protection in an online manner. Meanwhile, we propose two online private statistical inference procedures. One constructs a private plug-in estimator of the asymptotic covariance matrix. The other is a random scaling procedure that bypasses direct estimation of the asymptotic variance by constructing an asymptotically pivotal statistic for valid private inference in real time.

  2.

    From a theoretical standpoint, we establish the convergence rates of the proposed online private estimators. In addition, we prove a functional central limit theorem to capture the asymptotic behavior of the whole LDP-SGD path, thereby providing an asymptotically pivotal statistic for valid private inference. These results lay a solid foundation for real-time private decision-making, supported by theoretical guarantees.

  3.

    We conduct numerical experiments with simulated data to evaluate the performance of the proposed procedure in various settings, including varying cumulative sample sizes, different privacy budgets, and various models. Our findings indicate that the proposed procedure is computationally efficient, provides robust privacy guarantees, and significantly reduces data storage costs, all while maintaining high accuracy in estimation and inference. In addition, we apply the proposed method to the ride-sharing data and the US insurance data to demonstrate how the method can be useful in practice.

The rest of this paper is organized as follows. In Section 2.1, we introduce the key properties of DP. Section 2.2 outlines the problem formulation. In Section 3, we propose our LDP-SGD algorithm and provide its convergence rates. In Section 4, we establish a functional central limit theorem and introduce two online private inference methods, the private plug-in estimator and the random scaling method, to construct asymptotically valid private confidence intervals. We evaluate the performance of the proposed procedure through simulation studies in Section 5. In Section 6, we apply the proposed method to two real-world datasets: the Ride-Sharing Data and the US Insurance Data. The ride-sharing data provide fare-related information over time, enabling analysis of market pricing dynamics in the ride-sharing industry. The US Insurance Data include detailed personal and family information for one million clients, along with their associated insurance fees, offering valuable insights into insurance fee prediction. Discussions are provided in Section 7. Some basic concepts of DP, the detailed description of the private batch-means estimator, additional numerical results, and technical details are included in the Supplementary Material.

2 Preliminaries

In this section, we provide some preliminaries. To facilitate understanding, we first introduce the notation used throughout this paper. Let $\stackrel{D}{\rightarrow}$ denote convergence in distribution. For a $p$-vector $\bm{a}=(a_{1},\ldots,a_{p})^{\top}\in\mathbb{R}^{p}$, $\|\bm{a}\|_{r}=(\sum_{j=1}^{p}|a_{j}|^{r})^{1/r}$ for $r\geq 1$. Define $\Lambda_{1}(\bm{A}),\ldots,\Lambda_{p}(\bm{A})$ as the eigenvalues of a matrix $\bm{A}\in\mathbb{R}^{p\times p}$. In particular, the largest and smallest eigenvalues are denoted by $\Lambda_{\max}(\bm{A})$ and $\Lambda_{\min}(\bm{A})$, respectively. We use $\|\bm{A}\|$ to denote the matrix operator norm of $\bm{A}$ and $\|\bm{A}\|_{\infty}$ to represent the element-wise $\ell_{\infty}$-norm of $\bm{A}$. For a square positive semi-definite matrix $\bm{D}$, we denote its trace by ${\rm tr}(\bm{D})$. For any two symmetric matrices $\bm{A}$ and $\bm{B}$ of the same dimension, $\bm{A}\preceq\bm{B}$ means that $\bm{B}-\bm{A}$ is positive semi-definite, while $\bm{A}\succeq\bm{B}$ means that $\bm{A}-\bm{B}$ is positive semi-definite. Let $\{a_{n}\}$ and $\{b_{n}\}$ be two sequences of positive numbers. We write $a_{n}\lesssim b_{n}$ if there exists a positive constant $c_{0}$ such that $a_{n}/b_{n}\leq c_{0}$ for all $n\in\mathbb{N}^{+}$, and $a_{n}\asymp b_{n}$ if there exist positive constants $c_{1}$, $c_{2}$ such that $c_{1}\leq a_{n}/b_{n}\leq c_{2}$ for all $n\in\mathbb{N}^{+}$.

2.1 Some basic properties of differential privacy

In this subsection, we focus on some useful properties of DP, LDP, and Gaussian differential privacy (GDP); more details about GDP-related concepts are provided in the Supplementary Material. Notice that a variety of basic algorithms can be made private by simply adding properly scaled noise to the output. The scale of the noise is characterized by the sensitivity of the algorithm. The formal definition is as follows:

Definition 1 (Global sensitivity).

For any (non-private) statistic $h(\bm{X})\in\mathbb{R}^{p}$ of the data set $\bm{X}$, the global sensitivity of $h$ is the (possibly infinite) number

$${\rm sens}(h)=\sup_{\bm{X},\bm{X}^{\prime}}\|h(\bm{X})-h(\bm{X}^{\prime})\|_{2},$$

where the supremum is taken over all pairs of data sets $\bm{X}$ and $\bm{X}^{\prime}$ that differ by one entry or datum.

We then introduce the following mechanism for constructing GDP estimators. Although only the univariate case is stated in Theorem 1 of Dong et al. (2022), its extension to the general multivariate case is given in the following proposition.

Proposition 1 (Gaussian mechanism, Dong et al. (2022)).

Define the Gaussian mechanism that operates on a statistic $h$ as $\tilde{h}(\bm{X})=h(\bm{X})+{\rm sens}(h)\bm{Z}/\mu$, where $\bm{Z}$ is a standard normal $p$-dimensional random vector and $\mu>0$. Then, $\tilde{h}$ is $\mu$-GDP.
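The following is a minimal sketch of the Gaussian mechanism in Proposition 1, assuming a NumPy environment; the function name, arguments, and default random generator are illustrative choices rather than part of the paper.

```python
import numpy as np

def gaussian_mechanism(h_value, sensitivity, mu, rng=None):
    """Release a mu-GDP version of a statistic h(X) (sketch of Proposition 1).

    h_value     : np.ndarray, the non-private statistic h(X) in R^p
    sensitivity : float, an upper bound on the global L2 sensitivity of h
    mu          : float, GDP privacy parameter (smaller means more privacy)
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.standard_normal(np.shape(h_value))
    return h_value + sensitivity * noise / mu
```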

The post-processing and parallel composition properties are key aspects of DP, enabling the design of complex DP algorithms by combining simpler ones. Such properties are crucial in the algorithmic designs presented in later sections.

Proposition 2 (Post-processing property, Dong et al. (2022)).

Let $\mathcal{A}$ be a $\mu$-GDP algorithm, and let $h$ be an arbitrary randomized algorithm that takes $\mathcal{A}(\bm{X})$ as input; then $h(\mathcal{A}(\bm{X}))$ is also $\mu$-GDP.

Proposition 3 (Parallel composition, Smith et al. (2022)).

Let $\mathcal{A}_{l}$, $l=1,\ldots,k$, be a sequence of $k$ mechanisms, each $\mu_{l}$-GDP. Let $\{\bm{X}_{l}\}_{l=1}^{k}$ be $k$ disjoint subsets of the data. Define $\mathcal{O}_{1}=\mathcal{A}_{1}(\bm{X}_{1})$ as the output of the mechanism $\mathcal{A}_{1}$, and define the sequential output of the mechanism $\mathcal{A}_{j}$ as $\mathcal{O}_{j}=\mathcal{A}_{j}(\bm{X}_{j},\mathcal{O}_{j-1})$ for $j=2,\ldots,k$. Then the output of the last mechanism, $\mathcal{O}_{k}$, is $\max\{\mu_{1},\mu_{2},\ldots,\mu_{k}\}$-GDP.

Since we will use a noisy version of the matrix to construct an asymptotic covariance matrix in a later section, we introduce the Matrix Gaussian mechanism. An intuitive approach to generating a differentially private matrix is to add independent and identically distributed (i.i.d.) noise to each individual component, as described below.

Proposition 4 (Matrix Gaussian mechanism, Avella-Medina et al. (2023)).

Let $\bm{G}\in\mathbb{R}^{n\times m}$ be a data matrix such that each row vector $\bm{g}_{i}$ satisfies $\|\bm{g}_{i}\|_{2}\leq 1$. Define the matrix Gaussian mechanism that operates on a statistic $h$ as $\tilde{h}(\bm{G})=h(\bm{G})+\bm{W}$, where $h(\bm{G})=\bm{G}^{\top}\bm{G}/n$ and $\bm{W}$ is a symmetric random matrix whose upper-triangular elements, including the diagonal, are i.i.d. draws from $(2/(\mu n))\,\mathcal{N}(0,1)$. Then, $\tilde{h}$ is $\mu$-GDP.
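A minimal sketch of the matrix Gaussian mechanism of Proposition 4, assuming NumPy and that the rows of G have already been rescaled to satisfy the unit-norm condition; the function name is illustrative.

```python
import numpy as np

def matrix_gaussian_mechanism(G, mu, rng=None):
    """mu-GDP release of the Gram matrix G^T G / n (sketch of Proposition 4).

    Assumes every row g_i of G satisfies ||g_i||_2 <= 1.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, m = G.shape
    # symmetric noise: draw the upper triangle (including the diagonal) and mirror it
    upper = np.triu(rng.standard_normal((m, m)))
    W = (2.0 / (mu * n)) * (upper + np.triu(upper, 1).T)
    return G.T @ G / n + W
```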

2.2 Problem formulation

Consider $n\geq 2$ observations $\{\bm{z}_{i}\}_{i=1}^{n}$, with $\bm{z}_{i}=(\bm{x}_{i}^{\top},y_{i})^{\top}$, arriving sequentially. We assume these $n$ pairs of observations are i.i.d. copies of $(\bm{X},Y)$ with $\bm{X}\in\mathbb{R}^{p}$ and $Y\in\mathbb{R}$ from a common cumulative distribution $F$. Our goal is to estimate a population parameter of interest $\bm{\theta}_{0}\in\Theta\subseteq\mathbb{R}^{p}$, which can be formulated as the solution to the following optimization problem:

$$\bm{\theta}_{0}=\operatorname*{argmin}_{\bm{\theta}\in\Theta}\Big(L(\bm{\theta}):=E_{\mathcal{P}_{\bm{z}}}[\rho(\bm{\theta},\bm{z})]=\int\rho(\bm{\theta},\bm{z})\,\mathrm{d}\mathcal{P}_{\bm{z}}\Big), \qquad (2.1)$$

where $\rho(\bm{\theta},\bm{z})$ is a convex loss function with respect to $\bm{\theta}$, $\bm{z}$ is a random vector with distribution $\mathcal{P}_{\bm{z}}$, and $L(\bm{\theta})$ is the population loss function.

In the offline setting, where the entire dataset is readily accessible, traditional deterministic optimization methods are commonly used to estimate the parameter $\bm{\theta}_{0}$ under the empirical distribution induced by the observed data. However, for applications involving online data, where samples arrive sequentially, storing all the data can be costly and inefficient. To address these challenges, the SGD algorithm (Robbins and Monro, 1951) is often used, especially for online learning, as it requires only one pass over the data. Starting from an initial point $\tilde{\bm{\theta}}_{0}$, the SGD algorithm recursively updates the estimate upon the arrival of each data point $\bm{z}_{n}$, $n\geq 1$, with the following update rule:

$$\tilde{\bm{\theta}}_{n}=\tilde{\bm{\theta}}_{n-1}-\gamma_{n}\Psi(\tilde{\bm{\theta}}_{n-1},\bm{z}_{n}), \qquad (2.2)$$

where $\Psi(\bm{\theta},\bm{z})=\nabla\rho(\bm{\theta},\bm{z})$ denotes the stochastic gradient of $\rho(\bm{\theta},\bm{z})$ with respect to the first argument $\bm{\theta}$, and $\gamma_{n}$ is the step size at the $n$th step. As suggested by Ruppert (1988) and Polyak and Juditsky (1992), under some conditions the averaged estimate $\tilde{\bm{\theta}}^{\ast}_{n}=\sum_{i=1}^{n}\tilde{\bm{\theta}}_{i}/n$ satisfies the asymptotic normality

$$\sqrt{n}(\tilde{\bm{\theta}}^{\ast}_{n}-\bm{\theta}_{0})\stackrel{D}{\rightarrow}\mathcal{N}(\bm{0},\bm{\Sigma}),$$

where $\bm{\Sigma}=\bm{A}^{-1}\bm{S}\bm{A}^{-1}$, $\bm{A}=\nabla^{2}L(\bm{\theta}_{0})$ is the Hessian matrix of $L(\bm{\theta})$ at $\bm{\theta}=\bm{\theta}_{0}$, and $\bm{S}=E\{\nabla\rho(\bm{\theta}_{0},\bm{z})\nabla\rho(\bm{\theta}_{0},\bm{z})^{\top}\}$ is the covariance matrix of $\nabla\rho(\bm{\theta}_{0},\bm{z})$.

A significant limitation of traditional SGD is its vulnerability to privacy breaches, arising from the extensive data and gradient queries made at each iteration. In this paper, our work focuses primarily on settings where trusted data collectors are not required under the constraints of LDP. In such settings, each individual is required to locally apply a DP mechanism to their data before transmitting the perturbed data to the server. In particular, we introduce an LDP-SGD algorithm designed to fit the model in (2.1) and accommodate timely online private statistical inference for $\bm{\theta}_{0}$, thereby providing rigorous individual-level data privacy protection while reducing both time and space complexity.

3 Methodology

In this section, we introduce an LDP estimator using noisy SGD algorithms within the framework of streaming data. The heightened risk of privacy leakage arises from the numerous data and gradient queries inherent in the SGD algorithm. To facilitate our work, we impose a restriction on the class of estimators and enforce the following uniform boundedness condition.

Condition 1.

The gradient of the loss function $\rho$ satisfies

$$\sup_{\bm{\theta}\in\Theta,\,\bm{z}\in\mathcal{Z}}\|\Psi(\bm{\theta},\bm{z})\|_{2}\leq B_{0}$$

for some positive constant $B_{0}$.

Figure 1: Boxplots comparing estimators from Mallow's weights and gradient clipping for logistic regression with dimension $p=3$. The red line indicates the true parameter value, $\theta^{*}=1$, while the boxplots represent the distribution of estimators across 50 replications. The four colors correspond to four different parameters. Clipping results in a positive bias for logistic regression, which does not arise using Mallow's weights.

Condition 1 is satisfied by using either Mallow's weights (as discussed in Examples 1-3) or gradient clipping. However, as demonstrated in Song et al. (2021) and Avella-Medina et al. (2023), gradient clipping is not ideal from a statistical standpoint, as it may result in inconsistent estimators. In Figure 1, we compare the performance of gradient clipping and Mallow's weights using simulated logistic regression data. The results show that the parameter estimates from gradient clipping exhibit a bias that does not shrink toward zero with the sample size, further underscoring our preference for Mallow's weights, alongside the assumption of a bounded gradient. Meanwhile, Condition 1 implies that the loss function $\rho$ has a known bound on the gradient $\Psi(\bm{\theta},\bm{z})$, which ensures that $2B_{0}$ is an upper bound for the global sensitivity of $\Psi(\bm{\theta},\bm{z})$. At the $n$th iteration with observed data point $\bm{z}_{n}$, $n\geq 1$, we consider the following locally private SGD estimator, a noisy version of the iterates (2.2):

$$\hat{\bm{\theta}}_{n}=\hat{\bm{\theta}}_{n-1}-\gamma_{n}\Big\{\Psi(\hat{\bm{\theta}}_{n-1},\bm{z}_{n})+\frac{2B_{0}}{\mu_{n}}\bm{\xi}_{n}\Big\}, \qquad (2.3)$$

where the final estimate is the averaged estimate $\bar{\bm{\theta}}_{n}=\sum_{i=1}^{n}\hat{\bm{\theta}}_{i}/n$ and $\{\bm{\xi}_{n}\}_{n\geq 1}$ is a sequence of i.i.d. standard $p$-dimensional Gaussian random vectors. Unlike the standard SGD algorithm, the locally private SGD algorithm adds calibrated noise to the gradient computation at each step of the parameter update, ensuring rigorous privacy protection. In particular, taking $B_{0}=0$ in the iterates (2.3) recovers the standard SGD algorithm in (2.2). We refer to the iterates in (2.3) as the LDP-SGD algorithm. The following proposition shows that our proposed LDP-SGD algorithm is $\max\{\mu_{1},\ldots,\mu_{n}\}$-GDP.

Proposition 5.

Given an initial estimate $\hat{\bm{\theta}}_{0}\in\mathbb{R}^{p}$, consider the iterates $\{\hat{\bm{\theta}}_{n}\}_{n\geq 1}$ defined in (2.3). Then the final output $\bar{\bm{\theta}}_{n}$ is $\max\{\mu_{1},\ldots,\mu_{n}\}$-GDP.

From (2.3), each individual may have a different privacy budget $\mu_{n}$, $n\geq 1$. A special case arises when $\mu=\mu_{1}=\cdots=\mu_{n}$, in which the proposed estimator reduces to $\mu$-GDP and individuals no longer carry different privacy budgets. The techniques needed to handle the case of varying $\mu_{n}$ are the same as for the case of a common $\mu$. For simplicity, we assume that $\mu=\mu_{1}=\cdots=\mu_{n}$ for each individual.
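The following is a minimal one-pass sketch of the LDP-SGD iterates (2.3) with averaging, assuming NumPy; the function name, the form of the user-supplied gradient, and the default step-size constants are illustrative assumptions rather than part of the paper.

```python
import numpy as np

def ldp_sgd(stream, grad, p, B0, mu, gamma=0.5, alpha=0.51, theta0=None, rng=None):
    """One-pass LDP-SGD with averaging, following the iterates in (2.3).

    stream : iterable of data points z_n arriving sequentially
    grad   : function (theta, z) -> Psi(theta, z), assumed to satisfy ||Psi||_2 <= B0
    p      : parameter dimension
    B0, mu : gradient bound and per-individual privacy budget (mu-GDP)
    gamma, alpha : step size gamma_n = gamma * n**(-alpha), with 1/2 < alpha < 1
    """
    rng = np.random.default_rng() if rng is None else rng
    theta = np.zeros(p) if theta0 is None else np.asarray(theta0, dtype=float)
    theta_bar, n = np.zeros(p), 0
    for z in stream:
        n += 1
        step = gamma * n ** (-alpha)
        noise = (2.0 * B0 / mu) * rng.standard_normal(p)   # local Gaussian perturbation
        theta = theta - step * (grad(theta, z) + noise)     # private update (2.3)
        theta_bar += (theta - theta_bar) / n                # running average theta_bar_n
    return theta_bar, theta
```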

Condition 2.

Assume that the objective function $L(\bm{\theta})$ is differentiable, $\zeta$-smooth, and $\lambda$-strongly convex, in the sense that

$$L(\bm{\theta}_{1})-L(\bm{\theta}_{2})\leq\langle\nabla L(\bm{\theta}_{2}),\bm{\theta}_{1}-\bm{\theta}_{2}\rangle+\frac{\zeta}{2}\|\bm{\theta}_{1}-\bm{\theta}_{2}\|^{2},\quad\forall\bm{\theta}_{1},\bm{\theta}_{2}\in\Theta\subseteq\mathbb{R}^{p},$$
$$L(\bm{\theta}_{1})-L(\bm{\theta}_{2})\geq\langle\nabla L(\bm{\theta}_{2}),\bm{\theta}_{1}-\bm{\theta}_{2}\rangle+\frac{\lambda}{2}\|\bm{\theta}_{1}-\bm{\theta}_{2}\|^{2},\quad\forall\bm{\theta}_{1},\bm{\theta}_{2}\in\Theta\subseteq\mathbb{R}^{p}.$$

Strong convexity and smoothness are standard conditions in the convergence analysis of (stochastic) gradient optimization methods. Similar conditions can be found in Vaswani et al. (2022) and Zhu et al. (2023). Due to space limitations, several useful properties of strongly convex and smooth functions are provided in Lemma S1 of the Supplementary Material. As stated in Lemma S1, the strong convexity and smoothness of $L(\bm{\theta})$ imply that $\Lambda_{\min}\{\nabla^{2}L(\bm{\theta})\}\geq\lambda$ and $\Lambda_{\max}\{\nabla^{2}L(\bm{\theta})\}\leq\zeta$, respectively.

Condition 3.

Let $\bm{\eta}_{i}=\nabla L(\hat{\bm{\theta}}_{i-1})-\Psi(\hat{\bm{\theta}}_{i-1},\bm{z}_{i})$ denote the gradient noise. There exists some positive constant $C_{1}$ such that the conditional covariance of $\bm{\eta}_{n}$ has an expansion around $\bm{S}$ satisfying

$$\|E(\bm{\eta}_{n}\bm{\eta}_{n}^{\top}\,|\,\mathcal{F}_{n-1})-\bm{S}\|\leq C_{1}(\|\hat{\bm{\delta}}_{n-1}\|_{2}+\|\hat{\bm{\delta}}_{n-1}\|_{2}^{2}),$$
$$|{\rm tr}\{E(\bm{\eta}_{n}\bm{\eta}_{n}^{\top}\,|\,\mathcal{F}_{n-1})-\bm{S}\}|\leq C_{1}(\|\hat{\bm{\delta}}_{n-1}\|_{2}+\|\hat{\bm{\delta}}_{n-1}\|_{2}^{2}),$$

where $\mathcal{F}_{n-1}$ is the $\sigma$-algebra generated by $\{\bm{z}_{1},\ldots,\bm{z}_{n-1}\}$ and $\hat{\bm{\delta}}_{n}=\hat{\bm{\theta}}_{n}-\bm{\theta}_{0}$ is the error sequence. In addition, the fourth conditional moment of $\bm{\eta}_{n}$ is bounded by

$$E(\|\bm{\eta}_{n}\|_{2}^{4}\,|\,\mathcal{F}_{n-1})\leq C_{2}(1+\|\hat{\bm{\delta}}_{n-1}\|_{2}^{4})$$

for some positive constant $C_{2}$.

Condition 3 is a mild requirement on the loss function. As shown in Lemma 3.1 of Chen et al. (2020), Condition 3 holds if the Hessian matrix of $\rho(\bm{\theta},\bm{z})$ is bounded by some random function $H(\bm{z})$ with a bounded fourth moment. Examples that satisfy this condition include robust estimation methods for the commonly used linear and logistic regressions. With the aforementioned assumptions, we now present the following error bounds on the private SGD iterates.

Theorem 1.

Under Conditions 1-3, there exist a positive integer $n_{0}\in\mathbb{N}$ and a positive constant $c_{p}$ that depends on the dimension $p$, such that for $n\geq n_{0}$, $\hat{\bm{\delta}}_{n}=\hat{\bm{\theta}}_{n}-\bm{\theta}_{0}$ satisfies

$$E(\|\hat{\bm{\delta}}_{n}\|_{2}^{m})\lesssim n^{-m\alpha/2}\{(\gamma c_{p}B_{0}^{2}/\mu^{2})^{m/2}/\lambda+\|\hat{\bm{\delta}}_{0}\|_{2}^{m}\},\quad m=1,2,4,$$

when the step size is chosen to be $\gamma_{i}=\gamma i^{-\alpha}$ with $\gamma>0$ and $1/2<\alpha<1$.

Theorem 1 provides simplified bounds on the first, second, and fourth moments of $\hat{\bm{\delta}}_{n}$. These results characterize the convergence rate of the last-step estimator $\hat{\bm{\theta}}_{n}$. To illustrate the applicability of our proposed algorithm, we provide several examples below. Our framework, as it turns out, accommodates a wide range of important statistical models.

Example 1 (Linear Regression).

Let $\bm{z}_{n}=(\bm{x}_{n}^{\top},y_{n})^{\top}$, $n=1,2,\ldots$, be a sequence of i.i.d. copies of $(\bm{X}^{\top},Y)^{\top}$ with $Y\in\mathbb{R}$ and $\bm{X}\in\mathbb{R}^{p}$, satisfying $y_{n}=\bm{x}_{n}^{\top}\bm{\theta}_{0}+\varepsilon_{n}$, where $\varepsilon_{n}\in\mathbb{R}$ is the random noise. The loss function can be chosen as $\rho(\bm{\theta},\bm{z})=h_{c}(y-\bm{x}^{\top}\bm{\theta})\,\omega(\bm{x})$, where $h_{c}(\cdot)$ is the Huber loss with truncation parameter $c$ and $\omega(\bm{x})=\min(1,2/\|\bm{x}\|_{2}^{2})$ downweights outlying covariates. With this construction, it is straightforward to verify that the global sensitivity of $\Psi(\bm{\theta},\bm{z})$ is $2\sqrt{2}c$. Letting $B_{0}=\sqrt{2}c$, our proposed LDP-SGD update for $\bm{\theta}_{0}$, as defined in (2.3), is given by

$$\hat{\bm{\theta}}_{n}=\hat{\bm{\theta}}_{n-1}+\gamma_{n}\{I(|y_{n}-\bm{x}_{n}^{\top}\hat{\bm{\theta}}_{n-1}|<c)\,\bm{x}_{n}(y_{n}-\bm{x}_{n}^{\top}\hat{\bm{\theta}}_{n-1})\,\omega(\bm{x}_{n})\}+\frac{2\sqrt{2}c\gamma_{n}}{\mu_{n}}\bm{\xi}_{n},\quad n\geq 1,$$

where $I(\cdot)$ is the indicator function.
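A minimal sketch of the bounded gradient used in Example 1, assuming NumPy; the function name and the tuple form of z are illustrative. It can be plugged into the ldp_sgd sketch above with B0 = sqrt(2) * c.

```python
import numpy as np

def huber_mallows_grad(theta, z, c=1.345):
    """Gradient Psi(theta, z) for the Mallows-weighted Huber regression of Example 1."""
    x, y = z
    r = y - x @ theta                    # residual
    w = min(1.0, 2.0 / (x @ x))          # Mallows weight omega(x)
    # score is zero outside the truncation region |r| < c, so ||Psi||_2 <= sqrt(2) * c
    return -(abs(r) < c) * r * w * x
```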

Example 2 (Logistic Regression).

An important data science problem is logistic regression for binary classification. Specifically, let $\bm{z}_{n}=(\bm{x}_{n}^{\top},y_{n})^{\top}$, $n=1,2,\ldots$, be a sequence of i.i.d. copies of $(\bm{X}^{\top},Y)^{\top}$ with $Y\in\{0,1\}$ and $\bm{X}\in\mathbb{R}^{p}$, satisfying $y_{n}\sim{\rm Bernoulli}(\{1+\exp(-\bm{x}_{n}^{\top}\bm{\theta}_{0})\}^{-1})$. Take Mallow's weighted version of the usual cross-entropy loss, $\rho(\bm{\theta},\bm{z})=\{-y\log(1/\{1+\exp(-\bm{x}^{\top}\bm{\theta})\})-(1-y)\log(\exp(-\bm{x}^{\top}\bm{\theta})/\{1+\exp(-\bm{x}^{\top}\bm{\theta})\})\}\,\omega(\bm{x})$, where $\omega(\bm{x})$ is defined in the previous example. By this construction, we can easily verify that the global sensitivity of $\Psi(\bm{\theta},\bm{z})$ is $2\sqrt{2}$. Setting $B_{0}=\sqrt{2}$, our proposed LDP-SGD update for $\bm{\theta}_{0}$, as defined in (2.3), is given by

$$\hat{\bm{\theta}}_{n}=\hat{\bm{\theta}}_{n-1}+\gamma_{n}\bm{x}_{n}\big(y_{n}-\{1+\exp(-\bm{x}_{n}^{\top}\hat{\bm{\theta}}_{n-1})\}^{-1}\big)\,\omega(\bm{x}_{n})+\frac{2\sqrt{2}\gamma_{n}}{\mu_{n}}\bm{\xi}_{n},\quad n\geq 1.$$
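A minimal sketch of the bounded gradient used in Example 2, assuming NumPy and the $\{0,1\}$ coding of $y$; the function name is illustrative. It can be used with the ldp_sgd sketch above with B0 = sqrt(2).

```python
import numpy as np

def logistic_mallows_grad(theta, z):
    """Gradient Psi(theta, z) for the Mallows-weighted logistic regression of Example 2."""
    x, y = z
    p = 1.0 / (1.0 + np.exp(-(x @ theta)))   # predicted success probability
    w = min(1.0, 2.0 / (x @ x))              # Mallows weight omega(x)
    # |y - p| <= 1 and omega(x) * ||x||_2 <= sqrt(2), so ||Psi||_2 <= sqrt(2)
    return -(y - p) * w * x
```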
Example 3 (Robust Expectile Regression).

As an important extension of linear regression, robust expectile regression allows the regression coefficients $\bm{\theta}_{0}(\tau)$ to vary across different values of the location parameter $\tau$, and thereby offers insights into the entire conditional distribution of $Y$ given $\bm{X}$; see Man et al. (2024). Specifically, given a location parameter $\tau\in(0,1)$, let $\bm{z}_{n}=(\bm{x}_{n}^{\top},y_{n})^{\top}$, $n=1,2,\ldots$, be a sequence of i.i.d. copies of $(\bm{X}^{\top},Y)^{\top}$, satisfying $y_{n}=\bm{x}_{n}^{\top}\bm{\theta}_{0}(\tau)+\varepsilon_{n}(\tau)$. The loss function can be chosen as $\rho(\bm{\theta},\bm{z})=|\tau-I(y-\bm{x}^{\top}\bm{\theta}<0)|\cdot h_{c}(y-\bm{x}^{\top}\bm{\theta})\,\omega(\bm{x})/2$, where $h_{c}(\cdot)$ is the Huber loss with truncation parameter $c$. With this construction, it is straightforward to verify that the global sensitivity of $\Psi(\bm{\theta},\bm{z})$ is $2\sqrt{2}c\max(\tau,1-\tau)$. Taking $B_{0}=\sqrt{2}c\max(\tau,1-\tau)$, our proposed LDP-SGD update for $\bm{\theta}_{0}$, as defined in (2.3), is given by

$$\hat{\bm{\theta}}_{n}=\hat{\bm{\theta}}_{n-1}+\gamma_{n}|\tau-I(y_{n}-\bm{x}_{n}^{\top}\hat{\bm{\theta}}_{n-1}<0)|\cdot I(|y_{n}-\bm{x}_{n}^{\top}\hat{\bm{\theta}}_{n-1}|<c)\,\bm{x}_{n}(y_{n}-\bm{x}_{n}^{\top}\hat{\bm{\theta}}_{n-1})\,\omega(\bm{x}_{n})+\frac{2\sqrt{2}c\max(\tau,1-\tau)\gamma_{n}}{\mu_{n}}\bm{\xi}_{n},\quad n\geq 1.$$

4 Online private inference

In this section, we introduce two online inference methods, distinguished by the availability of second-order information from the loss function. One is the online private plug-in approach, which directly uses the Hessian information; the other is random scaling, which uses only information from the path of our proposed LDP-SGD iterates in (2.3). In addition to these two methods, we also consider a private batch-means estimator that uses the iterates generated by the LDP-SGD algorithm to estimate the asymptotic covariance. The detailed construction of this estimator, along with a comparative discussion of its theoretical properties relative to the above two methods, is provided in the Supplementary Material.

To elucidate these methods, we first examine the asymptotic behavior of the entire LDP-SGD path rather than its simple average $\bar{\bm{\theta}}_{n}$. Specifically, we present a functional central limit theorem, which shows that the standardized partial-sum process converges in distribution to a rescaled Brownian motion.

Theorem 2 (Functional CLT).

Consider the LDP-SGD iterates in (2.3) with step size $\gamma_{i}=\gamma i^{-\alpha}$ for some constant $\gamma>0$ and $1/2<\alpha<1$. Then, under Conditions 1-3, the following random function weakly converges to a rescaled Brownian motion:

$$\phi_{n}(r):=\frac{1}{\sqrt{n}}\sum_{i=1}^{\lfloor nr\rfloor}(\hat{\bm{\theta}}_{i}-\bm{\theta}_{0})\Rightarrow\bm{A}^{-1}\{\bm{S}+(4B_{0}^{2}/\mu^{2})\bm{I}\}^{1/2}\mathcal{W}_{p}(r)\quad{\rm for}~r\in(0,1]$$

as $n\rightarrow\infty$, where $\bm{A}=\nabla^{2}L(\bm{\theta}_{0})$, $\bm{S}=E\{\nabla\rho(\bm{\theta}_{0},\bm{z})\nabla\rho(\bm{\theta}_{0},\bm{z})^{\top}\}$, $\mathcal{W}_{p}$ is a $p$-dimensional vector of independent standard Wiener processes on $[0,1]$, and the symbol $\Rightarrow$ stands for weak convergence.

The result derived from this theorem is stronger than the asymptotic normality of the proposed private estimator $\bar{\bm{\theta}}_{n}$ under the stated assumptions. In particular, by applying the continuous mapping theorem to Theorem 2, we can directly establish the following corollary.

Corollary 1.

Under the conditions of Theorem 2, the averaged estimator $\bar{\bm{\theta}}_{n}=\sum_{i=1}^{n}\hat{\bm{\theta}}_{i}/n$ satisfies $\sqrt{n}(\bar{\bm{\theta}}_{n}-\bm{\theta}_{0})\stackrel{D}{\rightarrow}\mathcal{N}(\bm{0},\widetilde{\bm{\Sigma}})$, where $\widetilde{\bm{\Sigma}}=\bm{A}^{-1}\tilde{\bm{S}}\bm{A}^{-1}$ and $\tilde{\bm{S}}=\bm{S}+(4B_{0}^{2}/\mu^{2})\bm{I}$.

In contrast to the covariance matrix of the non-private estimator in (2.2), the additional term $(4B_{0}^{2}/\mu^{2})\bm{I}$ in the covariance matrix of Corollary 1 represents the “cost of privacy” of our proposed LDP-SGD algorithm.

4.1 Private plug-in estimator

We presented the asymptotic distribution of the proposed LDP-SGD estimator in Corollary 1. To conduct statistical inference on $\bm{\theta}_{0}$, a consistent estimator of the limiting covariance matrix $\widetilde{\bm{\Sigma}}$ from Corollary 1 is required. An intuitive way to construct confidence intervals for $\bm{\theta}_{0}$ is to estimate each component of the sandwich formula $\widetilde{\bm{\Sigma}}$ separately. Without privacy constraints, the estimators for $\bm{A}$ and $\tilde{\bm{S}}$ are given by

$$\bm{A}_{n}=\frac{1}{n}\sum_{i=1}^{n}\dot{\Psi}(\hat{\bm{\theta}}_{i-1},\bm{z}_{i})\quad{\rm and}\quad\tilde{\bm{S}}_{n}=\frac{1}{n}\sum_{i=1}^{n}\Psi(\hat{\bm{\theta}}_{i-1},\bm{z}_{i})\Psi(\hat{\bm{\theta}}_{i-1},\bm{z}_{i})^{\top}+\frac{4B_{0}^{2}}{\mu^{2}}\bm{I},$$

as long as the information in $\dot{\Psi}(\hat{\bm{\theta}}_{i-1},\bm{z}_{i})$ is available. Notice that each summand in both $\bm{A}_{n}$ and $\tilde{\bm{S}}_{n}$ involves a different $\hat{\bm{\theta}}_{i}$. This allows computations to be carried out in an online manner without the need to store the entire dataset. The consistency of the corresponding non-private plug-in estimator is also established in Lemma S8 of the Supplementary Material.

However, in the DP setting, direct application of this plug-in construction is not feasible, as neither $\bm{A}_{n}$ nor $\tilde{\bm{S}}_{n}$ is differentially private. To overcome this limitation, we propose constructing DP versions of these matrices using the matrix Gaussian mechanism, as outlined in Proposition 4. This requires the Hessian to have a specific factorization structure that facilitates the effective application of Proposition 4. Moreover, we assume that the spectral norm of the Hessian is uniformly bounded, as stated below.

Condition 4.

The Hessian $\nabla^{2}\rho(\bm{\theta},\bm{z})$ is positive definite for all $\bm{\theta}\in\mathbb{R}^{p}$ and has the form $\nabla^{2}\rho(\bm{\theta},\bm{z})=m(\bm{\theta},\bm{z})m(\bm{\theta},\bm{z})^{\top}$, where $m:\Theta\times\mathcal{Z}\rightarrow\mathbb{R}^{p}$ and $\sup_{\bm{\theta},\bm{z}}\|m(\bm{\theta},\bm{z})\|_{2}^{2}\leq B_{1}<\infty$.

We are now ready to present the DP counterparts of $\bm{A}_{n}$ and $\tilde{\bm{S}}_{n}$ that appear in the plug-in estimator. Consider the following private estimators:

$$\hat{\bm{A}}_{n}=\frac{1}{n}\sum_{i=1}^{n}\dot{\Psi}(\hat{\bm{\theta}}_{i-1},\bm{z}_{i})+\frac{2B_{1}}{n\mu}\bm{M}_{1}=\bm{A}_{n}+\frac{2B_{1}}{n\mu}\bm{M}_{1},$$
$$\hat{\bm{S}}_{n}=\frac{1}{n}\sum_{i=1}^{n}\Psi(\hat{\bm{\theta}}_{i-1},\bm{z}_{i})\Psi(\hat{\bm{\theta}}_{i-1},\bm{z}_{i})^{\top}+\frac{4B_{0}^{2}}{\mu^{2}}\bm{I}+\frac{2B_{0}^{2}}{n\mu}\bm{M}_{2}=\tilde{\bm{S}}_{n}+\frac{2B_{0}^{2}}{n\mu}\bm{M}_{2},$$

where $\bm{M}_{1}$ and $\bm{M}_{2}$ are i.i.d. symmetric random matrices whose upper-triangular elements, including the diagonals, are i.i.d. standard normal. However, the DP matrices $\hat{\bm{A}}_{n}$ and $\hat{\bm{S}}_{n}$ can potentially be negative definite. To address this concern, we apply a truncation procedure to the eigenvalues of these matrices. Specifically, we first perform the spectral decompositions $\hat{\bm{A}}_{n}=\bm{\Gamma}_{A}\bm{D}_{A}\bm{\Gamma}_{A}^{\top}$ and $\hat{\bm{S}}_{n}=\bm{\Gamma}_{S}\bm{D}_{S}\bm{\Gamma}_{S}^{\top}$, where $\bm{\Gamma}_{A}$ and $\bm{\Gamma}_{S}$ are orthogonal matrices. Given fixed threshold parameters $\kappa_{1}$ and $\kappa_{2}$, we define the truncated estimators as $\hat{\bm{A}}^{\ast}_{n}=\bm{\Gamma}_{A}\hat{\bm{D}}_{A}\bm{\Gamma}_{A}^{\top}$ and $\hat{\bm{S}}^{\ast}_{n}=\bm{\Gamma}_{S}\hat{\bm{D}}_{S}\bm{\Gamma}_{S}^{\top}$, where the diagonal elements are thresholded as $(\hat{\bm{D}}_{A})_{i,i}=\max\{\kappa_{1},(\bm{D}_{A})_{i,i}\}$ and $(\hat{\bm{D}}_{S})_{i,i}=\max\{\kappa_{2},(\bm{D}_{S})_{i,i}\}$ for $i=1,\ldots,p$. In other words, eigenvalues smaller than $\kappa_{1}$ or $\kappa_{2}$ are truncated to ensure numerical stability and positive semi-definiteness. The DP sandwich estimator can then be constructed as follows:

$$\widehat{\bm{\Sigma}}_{n}=\hat{\bm{A}}^{\ast-1}_{n}\hat{\bm{S}}^{\ast}_{n}\hat{\bm{A}}^{\ast-1}_{n}. \qquad (2.4)$$
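A minimal sketch of the eigenvalue truncation and the sandwich construction (2.4), assuming NumPy; the function names and the default thresholds kappa1, kappa2 are illustrative.

```python
import numpy as np

def private_plugin_sandwich(A_hat, S_hat, kappa1=1e-3, kappa2=1e-3):
    """Truncate eigenvalues of the noisy matrices and form the DP sandwich (2.4)."""
    def truncate(M, kappa):
        vals, vecs = np.linalg.eigh(M)                     # spectral decomposition
        return (vecs * np.maximum(vals, kappa)) @ vecs.T   # floor eigenvalues at kappa
    A_star = truncate(A_hat, kappa1)
    S_star = truncate(S_hat, kappa2)
    A_inv = np.linalg.inv(A_star)
    return A_inv @ S_star @ A_inv                          # \hat{Sigma}_n
```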

To establish the consistency of our proposed private plug-in estimator $\widehat{\bm{\Sigma}}_{n}$, we need the following additional smoothness assumption on the Hessian, similar to those found in the literature (Chen et al., 2020, 2024).

Condition 5.

Assume that for all $\bm{\theta}\in\mathbb{R}^{p}$, the following holds:

$$E\|\nabla^{2}\rho(\bm{\theta},\bm{z})-\nabla^{2}\rho(\bm{\theta}_{0},\bm{z})\|\leq K\|\bm{\theta}-\bm{\theta}_{0}\|_{2},$$
$$\|E\{\nabla^{2}\rho(\bm{\theta}_{0},\bm{z})\}^{2}-\bm{A}^{2}\|\leq C_{3},$$

where $K$ and $C_{3}$ are positive constants. Furthermore, assume that $\Lambda_{\min}(\bm{A})>\delta$.

Theorem 3 (Error rate of the private plug-in estimator).

Under the conditions of Corollary 1 and Conditions 4-5, $\widehat{\bm{\Sigma}}_{n}$ is $\sqrt{3}\mu$-GDP and converges to the asymptotic covariance matrix, satisfying

$$E\|\widehat{\bm{\Sigma}}_{n}-\bm{A}^{-1}\tilde{\bm{S}}\bm{A}^{-1}\|\lesssim\|\bm{S}\|\frac{\sqrt{\gamma c_{p}}B_{0}^{2}K}{\mu}\Big(1+\Lambda_{\min}(\bm{S})+\frac{4B_{0}^{2}}{\mu^{2}}\Big)\Big(n^{-\alpha/2}+\frac{\sqrt{\gamma c_{p}}K}{\mu}n^{-\alpha}\Big),$$

when the step size is chosen to be $\gamma_{i}=\gamma i^{-\alpha}$ with $\gamma>0$ and $\alpha\in(1/2,1)$.

Theorem 3 establishes the consistency of the proposed private plug-in estimator $\widehat{\bm{\Sigma}}_{n}$. Consequently, we can consider the following $\sqrt{3}\mu$-GDP $(1-\alpha_{0})$-confidence interval for $\theta_{0,j}$, $j=1,\ldots,p$:

$${\rm CI}_{1-\alpha_{0}}^{\rm plug}(\theta_{0,j})=\big[\bar{\theta}_{n,j}-\hat{\sigma}_{n,j}^{P}z_{1-\alpha_{0}/2}/\sqrt{n},\ \bar{\theta}_{n,j}+\hat{\sigma}_{n,j}^{P}z_{1-\alpha_{0}/2}/\sqrt{n}\big],$$

where $\hat{\sigma}_{n,j}^{P}=(\widehat{\bm{\Sigma}}_{n})_{j,j}^{1/2}$ and $z_{1-\alpha_{0}/2}$ is the $(1-\alpha_{0}/2)$-quantile of the standard normal distribution. In particular, we have the following corollary, which shows that ${\rm CI}_{1-\alpha_{0}}^{\rm plug}(\theta_{0,j})$ is an asymptotically exact confidence interval.

Corollary 2.

Under the conditions of Theorem 3, when $p$ is fixed and $n\rightarrow\infty$, we have

$${\rm Pr}\big(\bar{\theta}_{n,j}-z_{1-\alpha_{0}/2}\hat{\sigma}_{n,j}^{P}/\sqrt{n}\leq\theta_{0,j}\leq\bar{\theta}_{n,j}+z_{1-\alpha_{0}/2}\hat{\sigma}_{n,j}^{P}/\sqrt{n}\big)\rightarrow 1-\alpha_{0}.$$
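A minimal sketch of the component-wise plug-in interval ${\rm CI}_{1-\alpha_{0}}^{\rm plug}$, assuming NumPy and SciPy; the function name and argument layout are illustrative.

```python
import numpy as np
from scipy.stats import norm

def plugin_confidence_intervals(theta_bar, Sigma_hat, n, alpha0=0.05):
    """(1 - alpha0) plug-in confidence intervals for each coordinate of theta_0.

    theta_bar : averaged LDP-SGD estimate, shape (p,)
    Sigma_hat : private plug-in covariance estimate, shape (p, p)
    """
    z = norm.ppf(1.0 - alpha0 / 2.0)
    half_width = z * np.sqrt(np.diag(Sigma_hat) / n)
    return np.column_stack([theta_bar - half_width, theta_bar + half_width])
```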

4.2 Random scaling

Despite the simplicity of the private plug-in approach, the estimator $\widehat{\bm{\Sigma}}_{n}$ incurs additional computational and storage costs, primarily because it requires extra function-value queries to construct $\hat{\bm{A}}^{\ast}_{n}$. To mitigate this issue, we propose the random scaling inference procedure, which avoids direct estimation of the asymptotic variance. Instead, the procedure studentizes $\bar{\bm{\theta}}_{n}$ using a matrix derived from the iterates along the LDP-SGD path. Specifically, from Theorem 2 we observe that $\phi_{n}(1)=\sum_{i=1}^{n}(\hat{\bm{\theta}}_{i}-\bm{\theta}_{0})/\sqrt{n}=\sqrt{n}(\bar{\bm{\theta}}_{n}-\bm{\theta}_{0})$. Consequently, $\phi_{n}(r)-\lfloor nr\rfloor\phi_{n}(1)/n=\sum_{i=1}^{\lfloor nr\rfloor}(\hat{\bm{\theta}}_{i}-\bar{\bm{\theta}}_{n})/\sqrt{n}$ eliminates the dependence on $\bm{\theta}_{0}$. To remove the dependence on the unknown scale $\bm{A}^{-1}\{\bm{S}+(4B_{0}^{2}/\mu^{2})\bm{I}\}^{1/2}$, we studentize $\phi_{n}(1)$ via

$$\hat{\bm{V}}_{n}=\frac{1}{n}\sum_{b=1}^{n}\Big\{\phi_{n}(b/n)-\frac{b}{n}\phi_{n}(1)\Big\}\Big\{\phi_{n}(b/n)-\frac{b}{n}\phi_{n}(1)\Big\}^{\top}=\frac{1}{n}\sum_{b=1}^{n}\Big\{\frac{1}{\sqrt{n}}\sum_{i=1}^{b}(\hat{\bm{\theta}}_{i}-\bar{\bm{\theta}}_{n})\Big\}\Big\{\frac{1}{\sqrt{n}}\sum_{i=1}^{b}(\hat{\bm{\theta}}_{i}-\bar{\bm{\theta}}_{n})\Big\}^{\top}.$$

This results in the statistic $\phi_{n}^{\top}(1)\hat{\bm{V}}_{n}^{-1}\phi_{n}(1)$ being asymptotically pivotal, with a limiting distribution free of any unknown nuisance parameters. This conclusion is supported by the following corollary, which follows directly from Theorem 2 and the continuous mapping theorem; its proof is therefore omitted.

Corollary 3.

Under the conditions of Theorem 2, we have

$$\phi_{n}^{\top}(1)\hat{\bm{V}}_{n}^{-1}\phi_{n}(1)\stackrel{D}{\rightarrow}\mathcal{W}_{p}(1)^{\top}\Big\{\int_{0}^{1}\bar{\mathcal{W}}(r)\bar{\mathcal{W}}(r)^{\top}{\rm d}r\Big\}^{-1}\mathcal{W}_{p}(1),$$

where $\bar{\mathcal{W}}(r)=\mathcal{W}_{p}(r)-r\mathcal{W}_{p}(1)$.

This corollary implies that $\phi_{n}^{\top}(1)\hat{\bm{V}}_{n}^{-1}\phi_{n}(1)$ can be used to construct asymptotically valid confidence intervals. From Corollary 3, the $t$-statistic for $\theta_{0,j}$, $j=1,\ldots,p$, is asymptotically pivotal and converges to the following pivotal limiting distribution:

$$\frac{\sqrt{n}(\bar{\theta}_{n,j}-\theta_{0,j})}{\sigma^{R}_{n,j}}\stackrel{D}{\rightarrow}\frac{\mathcal{W}_{1}(1)}{\big[\int_{0}^{1}\{\mathcal{W}_{1}(r)-r\mathcal{W}_{1}(1)\}^{2}{\rm d}r\big]^{1/2}}, \qquad (2.5)$$

where $\sigma^{R}_{n,j}=(\hat{\bm{V}}_{n})_{j,j}^{1/2}$. Notice that the random matrix $\hat{\bm{V}}_{n}$ can be updated in an online fashion, thereby enabling the construction of confidence intervals. Specifically,

$$\hat{\bm{V}}_{n}=\frac{1}{n}\sum_{b=1}^{n}\Big\{\frac{1}{\sqrt{n}}\sum_{i=1}^{b}(\hat{\bm{\theta}}_{i}-\bar{\bm{\theta}}_{n})\Big\}\Big\{\frac{1}{\sqrt{n}}\sum_{i=1}^{b}(\hat{\bm{\theta}}_{i}-\bar{\bm{\theta}}_{n})\Big\}^{\top}=\frac{1}{n^{2}}\Big(\bm{U}_{n}-\bar{\bm{\theta}}_{n}\bm{v}_{n}^{\top}-\bm{v}_{n}\bar{\bm{\theta}}_{n}^{\top}+\bar{\bm{\theta}}_{n}\bar{\bm{\theta}}_{n}^{\top}\sum_{b=1}^{n}b^{2}\Big),$$

where $\bm{U}_{n}=\sum_{b=1}^{n}\big(\sum_{i=1}^{b}\hat{\bm{\theta}}_{i}\big)\big(\sum_{i=1}^{b}\hat{\bm{\theta}}_{i}\big)^{\top}=\bm{U}_{n-1}+n^{2}\bar{\bm{\theta}}_{n}\bar{\bm{\theta}}_{n}^{\top}$ and $\bm{v}_{n}=\sum_{b=1}^{n}b\sum_{i=1}^{b}\hat{\bm{\theta}}_{i}=\bm{v}_{n-1}+n^{2}\bar{\bm{\theta}}_{n}$ can both be updated recursively. Once $\bar{\bm{\theta}}_{n}$ and $\hat{\bm{V}}_{n}$ are obtained, the $(1-\alpha_{0})$ asymptotic confidence interval for the $j$th element $\theta_{0,j}$ of $\bm{\theta}_{0}$ can be constructed as follows:

Corollary 4.

Under the conditions of Corollary 3, when $p$ is fixed and $n\rightarrow\infty$, we have

$${\rm Pr}\big(\bar{\theta}_{n,j}-z^{R}_{1-\alpha_{0}/2}\hat{\sigma}^{R}_{n,j}/\sqrt{n}\leq\theta_{0,j}\leq\bar{\theta}_{n,j}+z^{R}_{1-\alpha_{0}/2}\hat{\sigma}^{R}_{n,j}/\sqrt{n}\big)\rightarrow 1-\alpha_{0},$$

where $z^{R}_{1-\alpha_{0}/2}$ is the $(1-\alpha_{0}/2)$-quantile of $\mathcal{W}_{1}(1)/\big[\int_{0}^{1}\{\mathcal{W}_{1}(r)-r\mathcal{W}_{1}(1)\}^{2}{\rm d}r\big]^{1/2}$.

The limiting distribution in (2.5) is mixed normal and symmetric around zero. In practice, the critical value $z^{R}_{1-\alpha_{0}/2}$ in Corollary 4 can be obtained through Monte Carlo simulation, as detailed in Table I of Abadir and Paruolo (1997).
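A minimal sketch of the online accumulation of $\bm{U}_{n}$, $\bm{v}_{n}$, and $\hat{\bm{V}}_{n}$ and the resulting random-scaling intervals, assuming NumPy; the class name is illustrative, and the default critical value 6.747 is an assumed value for the two-sided 95% level taken from the random-scaling literature rather than from this paper.

```python
import numpy as np

class RandomScalingCI:
    """Online random-scaling inference along the LDP-SGD path."""

    def __init__(self, p, crit=6.747):
        self.n = 0
        self.S = np.zeros(p)          # running sum S_n of the iterates
        self.U = np.zeros((p, p))     # U_n = sum_b S_b S_b^T
        self.v = np.zeros(p)          # v_n = sum_b b * S_b
        self.sum_b2 = 0.0             # sum_b b^2
        self.crit = crit              # critical value z^R_{1 - alpha0/2}

    def update(self, theta_hat):
        self.n += 1
        self.S += theta_hat
        self.U += np.outer(self.S, self.S)
        self.v += self.n * self.S
        self.sum_b2 += self.n ** 2

    def intervals(self):
        n, theta_bar = self.n, self.S / self.n
        V = (self.U - np.outer(theta_bar, self.v) - np.outer(self.v, theta_bar)
             + self.sum_b2 * np.outer(theta_bar, theta_bar)) / n ** 2
        half = self.crit * np.sqrt(np.diag(V) / n)
        return np.column_stack([theta_bar - half, theta_bar + half])
```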

5 Simulation studies

In this section, we perform simulations to examine the finite-sample performance of the proposed algorithms, focusing on the coverage probabilities and lengths of confidence intervals. The methods under evaluation include the private plug-in (PPI), private batch-means (PBM), and private random scaling (PRS) procedures. As benchmarks, we also compute their non-private counterparts: plug-in (PI), batch-means (BM), and random scaling (RS). Due to space limitations, we only present the empirical performance of the proposed algorithms under linear regression in the main manuscript. The results for logistic and robust expectile regressions are provided in the Supplementary Material.

In this simulation, we randomly generate a sequence of $n$ samples $\{(\bm{x}_{n}^{\top},y_{n})^{\top}\}_{n=1,2,\ldots}$ from the linear regression model $y_{n}=\bm{x}_{n}^{\top}\bm{\theta}_{0}+\varepsilon_{n}$, where the random noise $\varepsilon_{n}$ is i.i.d. $\mathcal{N}(0,0.5^{2})$ and $\bm{x}_{n}=(1,\bm{s}_{n}^{\top})^{\top}$ denotes the covariates, with $\bm{s}_{n}$ following a multivariate normal distribution $\mathcal{N}(\bm{0},\bm{\Sigma})$ with two types of $\bm{\Sigma}$ structure: identity covariance $\bm{\Sigma}=\bm{I}_{p}$ and correlated covariance $\bm{\Sigma}=\{0.5^{|j-k|}\}_{j,k=1,\ldots,p}$. We consider a total sample size of $200{,}000$ arriving sequentially. For this model, the true coefficient $\bm{\theta}_{0}=(\theta_{00},\theta_{01},\ldots,\theta_{0p})^{\top}=\bm{1}_{p+1}$ is a $(p+1)$-dimensional vector that includes the intercept term $\theta_{00}$. The corresponding loss function and its gradient sensitivity are outlined in Example 1, with a step-size decay rate of $\alpha=0.51$. For the identity covariance matrix, we vary the parameter dimension $p=3,5,10$ across different privacy budgets $\mu=1,2$; a smaller privacy budget corresponds to stronger privacy protection. In addition, we set the truncation parameter $c=1.345$ in the Huber loss and choose $M\asymp n^{0.25}$ for the batch-means estimators, as suggested by Chen et al. (2020). This choice aligns with the optimal order $M\asymp n^{(1-\alpha)/2}$ when $\alpha=1/2$. Specifically, we set $M=20$ in our implementation. Owing to space constraints, this section reports only the results for $p=3$; the corresponding results for $p=5$ and $p=10$ are available in the Supplementary Material. For the correlated covariance matrix, we only present results for $p=3$, as similar patterns are observed in higher dimensions, as in the identity covariance case.

The performance of each method is assessed using two criteria: the average empirical coverage probability of the 95% confidence interval (CP) and the average length of the 95% confidence interval (AL). Specifically, let ${\rm CI}_{j}$ denote a two-sided 95% confidence interval for $\theta_{0j}$, $j=1,\ldots,p$. We then calculate these two criteria across the $p$ coefficients as

$${\rm CP}=p^{-1}\sum_{j=1}^{p}\widehat{\rm Pr}\left(\theta_{0j}\in{\rm CI}_{j}\right),\quad{\rm AL}=p^{-1}\sum_{j=1}^{p}{\rm Length}\left({\rm CI}_{j}\right),$$

where $\widehat{\rm Pr}(\cdot)$ and ${\rm Length}(\cdot)$ denote the empirical coverage rate and the average length of the confidence intervals over all replications, respectively. All simulation results are based on 200 replications, with both averages and standard errors recorded.

Figure 2 displays the trajectories of each component of the proposed LDP-SGD estimator $\bar{\bm{\theta}}_{n}$ with $\mu=1$ and $p=3$, along with the three private confidence intervals for linear regression. As expected, the length of the confidence intervals decreases as $n$ increases. In the early iterations, both the estimators and confidence intervals are unstable, with noticeable fluctuations. The PRS method produces wider confidence intervals than PPI and PBM, indicating that PRS is more conservative than the other two methods. Additionally, the confidence interval widths for PPI and PBM are approximately equal. This observation aligns with findings from previous studies, such as those by Li et al. (2022) and Lee et al. (2022).

We also present the empirical performance of our proposed methods under varying total sample sizes, as detailed in Table 1. A key observation is that the performance of the proposed methods aligns more closely with their non-private counterparts as the privacy budget $\mu$ increases. The coverage probabilities of all three methods approach the nominal 95% level as the cumulative sample size or privacy budget increases, while the lengths of the confidence intervals decrease. Among these, the PRS method exhibits more robust performance across various settings and privacy budgets, albeit with slightly wider confidence intervals than the other two; this wider average length likely accounts for the improvement in average coverage. In contrast, the PBM method exhibits relatively lower coverage than the PPI and PRS methods, consistent with theoretical results indicating that PBM converges more slowly than PPI. Under privacy-preserving conditions, all three methods yield lower coverage probabilities and wider confidence intervals than their non-private versions. This effect becomes more pronounced with smaller privacy budgets, illustrating the trade-off between maintaining privacy and achieving statistical accuracy. The corresponding results for $p=5$ and $p=10$ are provided in the Supplementary Material and show similar patterns.

Figure 2: Trajectories of the proposed LDP-SGD estimator $\bar{\bm{\theta}}_{n}$ (black solid line) along with three corresponding private confidence intervals, for linear regression with $p=3$ and $\mu=1$, where $\theta^{(i)}$ represents the $i$th coordinate, plotted against cumulative sample size.
Table 1: Linear regression: Summary of averages and standard errors (in brackets) of the coverage probabilities (%) and lengths of the 95% confidence intervals ($\times 10^{-2}$) across various scenarios with $p=3$. CP: the coverage probability; AL: the average length of the 95% confidence interval.
 n=40000 n=80000 n=120000 n=160000 n=200000
    Non-private
PI: CP 93.75 (3.66) 92.50 (4.83) 93.50 (3.54) 92.75 (3.97) 94.13 (3.09)
PI: AL 1.08 (0.20) 0.76 (0.14) 0.62 (0.11) 0.54 (0.10) 0.48 (0.09)
BM: CP 88.88 (2.01) 90.75 (1.55) 93.25 (2.50) 93.13 (1.55) 92.75 (1.26)
BM: AL 1.01 (0.02) 0.74 (0.01) 0.62 (0.01) 0.54 (0.01) 0.48 (0.01)
RS: CP 95.13 (1.38) 94.00 (1.47) 95.75 (0.50) 95.13 (0.75) 95.50 (1.22)
RS: AL 1.47 (0.02) 1.03 (0.02) 0.84 (0.02) 0.73 (0.03) 0.64 (0.02)
    $\mu=1$
PPI: CP 91.38 (3.82) 93.50 (4.30) 92.13 (5.34) 93.75 (3.18) 93.25 (4.50)
PPI: AL 11.49 (0.38) 7.68 (0.19) 6.10 (0.13) 5.19 (0.10) 4.60 (0.07)
PBM: CP 85.38 (1.60) 92.00 (1.41) 91.63 (2.93) 93.25 (2.02) 92.88 (2.93)
PBM: AL 11.09 (2.28) 7.83 (1.57) 6.49 (1.29) 5.49 (1.05) 4.77 (0.90)
PRS: CP 94.38 (1.93) 95.75 (0.29) 94.88 (1.44) 95.38 (2.32) 95.50 (1.08)
PRS: AL 16.01 (3.31) 10.96 (2.21) 8.64 (1.78) 7.21 (1.28) 6.50 (1.15)
    μ=2\mu=2
PPI: CP 90.75 (6.25) 90.50 (4.88) 91.25 (4.29) 92.63 (3.42) 92.13 (4.27)
PPI: AL 4.99 (0.07) 3.45 (0.08) 2.81 (0.10) 2.42 (0.10) 2.16 (0.10)
PBM: CP 84.63 (2.78) 87.88 (2.63) 89.75 (1.31) 90.25 (1.00) 90.88 (2.10)
PBM: AL 4.70 (0.08) 3.42 (0.06) 2.87 (0.05) 2.47 (0.04) 2.15 (0.04)
PRS: CP 92.75 (3.50) 92.50 (1.78) 93.50 (1.58) 94.00 (1.62) 94.88 (1.61)
PRS: AL 6.77 (0.11) 4.82 (0.09) 3.94 (0.08) 3.30 (0.06) 2.93 (0.05)
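For completeness, the coverage probability (CP) and average length (AL) reported in Table 1 are the standard Monte Carlo summaries of confidence-interval endpoints. The following Python sketch is a minimal illustration of how such metrics can be computed; the replication count, the toy half-width, and the array shapes are our own illustrative assumptions and do not reflect the actual simulation design.

```python
import numpy as np

def coverage_and_length(lower, upper, theta_true):
    """Empirical coverage probability (in %) and average length of
    confidence intervals across Monte Carlo replications.

    lower, upper : (n_reps, p) arrays of interval endpoints
    theta_true   : (p,) array with the true parameter
    """
    covered = (lower <= theta_true) & (theta_true <= upper)
    cp = 100.0 * covered.mean(axis=0)      # coverage probability per coordinate
    al = (upper - lower).mean(axis=0)      # average interval length per coordinate
    return cp, al

# Toy usage with arbitrary numbers (not the settings behind Table 1).
rng = np.random.default_rng(0)
theta_true = np.zeros(3)
centers = rng.normal(0.0, 0.01, size=(200, 3))   # replication-level point estimates
half_width = 0.02                                 # assumed common half-width
cp, al = coverage_and_length(centers - half_width, centers + half_width, theta_true)
print(cp, al)
```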

To further evaluate the coverage properties of PPI and PRS, we vary the significance level α0\alpha_{0} and compute the corresponding 1α01-\alpha_{0} asymptotic confidence intervals, alongside the empirical coverage probabilities. Specifically, α0\alpha_{0} is varied from 0.010.01 to 0.10.1 in increments of 0.010.01. The analysis is carried out for a linear regression model with dimension p=5p=5 and sample size n=200000n=200000, under two different privacy budgets μ=1,2\mu=1,2. The results are presented in Figure 3. For μ=1\mu=1, the PRS method exhibits a more conservative coverage probability, slightly surpassing the oracle coverage level, indicating minor over-coverage. In contrast, PPI tends to under-cover with its narrower confidence intervals. As the privacy budget increases to μ=2\mu=2, both PRS and PPI coverage probabilities shift closer to the oracle curve.
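To illustrate the mechanics behind Figure 3, the sketch below traces an empirical coverage curve over a grid of nominal levels using generic plug-in intervals of the form estimate ± z(1-α0/2) × standard error. The replication-level estimates and standard errors are simulated placeholders, not the output of the PPI or PRS procedures.

```python
import numpy as np
from scipy.stats import norm

def coverage_curve(theta_hat, se_hat, theta_true, alphas):
    """Empirical coverage of (1 - alpha0) intervals theta_hat +/- z * se_hat,
    evaluated over a grid of nominal levels alpha0."""
    curve = []
    for a in alphas:
        z = norm.ppf(1.0 - a / 2.0)
        curve.append(np.mean(np.abs(theta_hat - theta_true) <= z * se_hat))
    return np.array(curve)

alphas = np.linspace(0.01, 0.10, 10)             # alpha0 from 0.01 to 0.10
rng = np.random.default_rng(1)
theta_hat = rng.normal(0.0, 0.05, size=500)      # placeholder replication estimates
se_hat = np.full(500, 0.05)                      # placeholder standard errors
print(coverage_curve(theta_hat, se_hat, 0.0, alphas))
```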

Refer to caption
Refer to caption
Figure 3: Coverage probability versus nominal level α0\alpha_{0} for PPI and PRS with dimension p=5p=5 and sample size n=200000n=200000 under different privacy budgets.

6 Applications

In this section, we illustrate the effectiveness of the proposed method by analyzing two real-world datasets: the Ride-sharing data and the US insurance data.

6.1 Ride-sharing Data

In this subsection, we apply the proposed LDP-SGD algorithm to the Ride-sharing dataset, which contains synthetic data from a ride-sharing platform. The dataset spans from January 2024 to October 2024 and includes detailed records of rides, users, drivers, vehicles, and ratings, offering valuable insights into driver performance and fare prediction over time. It is publicly available at https://www.kaggle.com/datasets/adnananam/ride-sharing-platform-data and comprises 50,000 ride records and 3,000 profiles of drivers and vehicles. For preprocessing, we concatenate driver and vehicle information with each corresponding ride. We then select the following features as potential predictors for fare estimation: ride distance, driver rating, total number of rides completed by the driver, vehicle production year, vehicle capacity, and ride duration. After concatenation, the dataset is standardized by subtracting the mean and dividing by the standard deviation for each column. The resulting data are then used as input to the LDP-SGD algorithm with the weighted Huber loss defined in Example 1, run with a privacy budget of μ=1\mu=1 and a step size decay rate of α=0.501\alpha=0.501.
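To make the fitting step concrete, the following Python sketch runs a generic one-pass noisy SGD, with per-record gradient clipping, Gaussian perturbation, and Polyak-Ruppert averaging, on a synthetic stand-in for the standardized ride-sharing features. The Huber cutoff, clipping threshold, noise calibration 2·clip/μ, and initial step size are illustrative assumptions, and the adaptive weighting in the Huber loss of Example 1 is omitted; this is a sketch of the idea rather than the exact algorithm analyzed in this paper.

```python
import numpy as np

def huber_grad(r, delta=1.345):
    """Derivative of the Huber loss with respect to the residual r."""
    return np.where(np.abs(r) <= delta, r, delta * np.sign(r))

def ldp_sgd(X, y, mu=1.0, alpha=0.501, clip=1.0, eta0=0.5, seed=0):
    """One-pass noisy SGD with Polyak-Ruppert averaging: each per-record
    gradient is clipped and perturbed with Gaussian noise before it is used,
    so only a privatized gradient leaves the individual record."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    theta = np.zeros(p)
    theta_bar = np.zeros(p)
    sigma = 2.0 * clip / mu                           # assumed noise calibration
    for t in range(n):
        g = -huber_grad(y[t] - X[t] @ theta) * X[t]   # per-sample Huber-loss gradient
        g /= max(1.0, np.linalg.norm(g) / clip)       # clip to bound the sensitivity
        g += rng.normal(0.0, sigma, size=p)           # privatize the gradient
        theta -= eta0 * (t + 1) ** (-alpha) * g       # step size decay rate alpha
        theta_bar += (theta - theta_bar) / (t + 1)    # running average of the iterates
    return theta_bar

# Synthetic stand-in for the standardized ride-sharing design matrix and fares.
rng = np.random.default_rng(1)
X = rng.normal(size=(50_000, 6))
beta = np.array([0.8, 0.0, 0.0, 0.0, 0.0, 0.5])       # only distance and time cost matter
y = X @ beta + rng.normal(scale=0.3, size=50_000)
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = (y - y.mean()) / y.std()
print(ldp_sgd(X, y, mu=1.0, alpha=0.501))
```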

Refer to caption
Figure 4: Trajectories of the LDP-SGD estimators (black solid line) in the Ride-sharing dataset predicting fare amount, accompanied by two private confidence intervals with μ=1\mu=1.

The results of this analysis are presented in Figure 4. As the sample size increases, the coefficients gradually stabilize. Specifically, the coefficients for ’distance’ and ’time cost’ converge to positive values, with both the PPI and PRS confidence intervals entirely above zero. This suggests a significant positive impact of these factors on fare amount, aligning with intuitive expectations, as longer distances and extended journey times are naturally associated with higher costs. In contrast, the coefficients for ’rating’ and ’total rides’ converge toward values near zero, with their confidence intervals including zero, indicating that driver-related attributes do not meaningfully affect fare pricing. Similarly, the coefficients for ’year’ and ’capacity’ also converge to values close to zero, with their confidence intervals encompassing zero, suggesting that vehicle-related factors have no significant influence on the fare amount. Overall, the analysis shows that only distance and time cost significantly impact fare pricing, while driver and vehicle characteristics have negligible effects. This is likely because ride fares are primarily driven by distance, duration, and demand fluctuations, which outweigh the influence of individual driver and vehicle attributes.

6.2 US Insurance Data

In this subsection, we conduct an analysis of the Insurance dataset, which aims to predict health insurance premiums in the United States. The dataset, based on synthetic data, captures a variety of factors that influence medical costs and premiums. It includes the following variables for insurance customers: age, gender, body mass index (BMI), number of children, smoking status, region, medical history, exercise frequency, occupation, and type of insurance plan. The full list of variables is presented in Table 2. This dataset was generated using a script that randomly sampled 1,000,000 records, ensuring a comprehensive representation of the insured population in the US. Publicly available at https://www.kaggle.com/datasets/sridharstreaks/insurance-data-for-machine-learning/data, it provides valuable insights into the factors affecting health insurance premiums.

Variable Name Type Range / Categories
age Numerical [18,65][18,65]
gender Categorical (Binary) {Female, Male}
bmi Numerical [18,50][18,50]
children Numerical (Integer) {0,1,2,3,4,5}\{0,1,2,3,4,5\}
smoker Categorical (Binary) {No, Yes}
medical_history Categorical (Ordinal) {None, High blood pressure, Diabetes, Heart disease}
family_medical_history Categorical (Ordinal) {None, High blood pressure, Diabetes, Heart disease}
exercise_frequency Categorical (Ordinal) {Never, Occasionally, Rarely, Frequently}
occupation Categorical (Ordinal) {Student, Unemployed, Blue collar, White collar}
coverage_level Categorical (Ordinal) {Basic, Standard, Premium}
charges Numerical [3445.01,32561.56][3445.01,32561.56]
Table 2: Summary of variables for the US insurance data with their types and ranges.

For the analysis, we first carried out extensive data preprocessing to ensure proper handling of both numerical and categorical variables. Numerical variables, such as age, BMI, and the number of children, were standardized to have a mean of zero and a standard deviation of one, ensuring consistency across features for the modeling process. Categorical variables were encoded with meaningful numeric values to reflect their underlying semantics. Specifically, health-related variables like medical_history and family_medical_history were assigned values based on the severity of conditions: “None” (0), “High blood pressure” (1), “Diabetes” (2), and “Heart disease” (3). The exercise_frequency variable was encoded as “Never” (0), “Occasionally” (1), “Rarely” (2), and “Frequently” (3), reflecting varying levels of activity. For occupation, a hierarchical encoding was used, ranging from “Student” (0) to “White collar” (3), capturing the occupational spectrum. Similarly, the coverage_level variable was encoded as “Basic” (0), “Standard” (1), and “Premium” (2) to reflect the varying levels of insurance coverage. Binary variables such as smoker and gender were encoded as 0 and 1, representing “No”/“Yes” and “Female”/“Male,” respectively. The processed dataset was randomly split into training and testing sets, consisting of 800,000 and 200,000 observations, respectively. The training set was then used as input to the LDP-SGD algorithm with a weighted Huber loss, using a privacy budget of μ=1\mu=1 and a step size decay rate of α=0.501\alpha=0.501. The corresponding results observed during training are presented in Figure 5.
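For reference, the preprocessing just described can be written compactly with pandas, as in the sketch below. The file name is hypothetical and the exact category strings in the public Kaggle file may differ slightly; the numeric encodings follow the mapping described above.

```python
import pandas as pd

# Hypothetical file name; column names follow Table 2.
df = pd.read_csv("insurance.csv")

severity = {"None": 0, "High blood pressure": 1, "Diabetes": 2, "Heart disease": 3}
ordinal_maps = {
    "medical_history": severity,
    "family_medical_history": severity,
    "exercise_frequency": {"Never": 0, "Occasionally": 1, "Rarely": 2, "Frequently": 3},
    "occupation": {"Student": 0, "Unemployed": 1, "Blue collar": 2, "White collar": 3},
    "coverage_level": {"Basic": 0, "Standard": 1, "Premium": 2},
    "smoker": {"No": 0, "Yes": 1},
    "gender": {"Female": 0, "Male": 1},
}
for col, mapping in ordinal_maps.items():         # encode categorical variables numerically
    df[col] = df[col].map(mapping)

for col in ["age", "bmi", "children"]:            # standardize numerical variables
    df[col] = (df[col] - df[col].mean()) / df[col].std()

train = df.sample(n=800_000, random_state=0)      # 800,000 / 200,000 train-test split
test = df.drop(train.index)
```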

Refer to caption
Figure 5: Trajectories of the LDP-SGD estimators (black solid line) for each variable in the US insurance dataset, along with two corresponding private confidence intervals with μ=1\mu=1.

The results demonstrate that as the sample size increases, all regression coefficients for predicting health insurance premiums converge to positive values, indicating their significant contributions to premium determination. A positive coefficient implies that higher values of a predictor correspond to higher premiums. For instance, males are charged higher premiums than females, smokers incur substantially higher costs due to increased health risks, and individuals with serious medical conditions or a family history of such conditions (e.g., diabetes, heart disease) face elevated premiums. Likewise, a higher BMI and the selection of more comprehensive insurance plans also lead to higher charges, reflecting the associated health risks or benefits. While the coefficients for age and exercise frequency are positive, their magnitudes are relatively small, suggesting a limited linear effect on premium pricing; their influence may be non-linear and thus not fully captured by a simple linear term. Overall, the analysis highlights that risk-related factors, such as smoking, BMI, and medical history, are the primary drivers of health insurance premiums, while demographic and lifestyle factors exert a lesser influence.

We apply the LDP-SGD estimator to perform linear regression for predicting insurance charges. The mean squared error (MSE) on the test set is 0.0722. For comparison, we also fit an offline ordinary least squares (OLS) estimator without privacy guarantees on the same training data, which achieves an MSE of 0.0607. Table 3 summarizes the comparison between the LDP-SGD and offline OLS estimators in terms of coefficient estimates, and a sketch of this evaluation procedure is given after the table. The results show that the LDP-SGD estimates closely align with the OLS estimates, indicating that the proposed algorithm effectively protects privacy while preserving utility and explanatory power.

Setting age gender bmi children smoker
OLS 0.0627 0.1131 0.1044 0.0774 0.5659
LDP-SGD 0.0679 0.1298 0.0664 0.0828 0.5483
Setting medical family medical exercise occupation coverage level
OLS 0.4052 0.4051 0.1389 0.1516 0.4625
LDP-SGD 0.4009 0.4264 0.1409 0.1767 0.4634
Table 3: Comparison of offline OLS and online LDP-SGD estimates.
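The utility comparison in Table 3 amounts to fitting both estimators on the training portion and scoring squared-error loss on the held-out portion. The sketch below illustrates that evaluation on synthetic stand-in data, reusing the ldp_sgd routine from the Section 6.1 sketch; the feature dimension, coefficients, and noise level are arbitrary, so the resulting numbers are not expected to reproduce the MSE values reported here.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared prediction error on a held-out set."""
    return float(np.mean((y_true - y_pred) ** 2))

# Synthetic stand-in for the standardized insurance features and charges;
# ldp_sgd is the noisy-SGD sketch from Section 6.1 (assumed defined in the same session).
rng = np.random.default_rng(2)
X = rng.normal(size=(100_000, 10))
beta = rng.uniform(0.05, 0.6, size=10)
y = X @ beta + rng.normal(scale=0.3, size=100_000)
X_tr, y_tr, X_te, y_te = X[:80_000], y[:80_000], X[80_000:], y[80_000:]

theta_ols, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)   # offline, non-private OLS
theta_ldp = ldp_sgd(X_tr, y_tr, mu=1.0, alpha=0.501)      # private one-pass estimate
print(mse(y_te, X_te @ theta_ols), mse(y_te, X_te @ theta_ldp))
```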

7 Discussion

This paper investigates online private estimation and inference for optimization-based problems with streaming data under local differential privacy constraints. To eliminate the dependence on trusted data collectors, we propose an LDP-SGD algorithm that processes data in a single pass, making it well-suited for streaming applications and minimizing storage costs. Additionally, we introduce and discuss three asymptotically valid online private inference methods, namely the private plug-in, private batch-means, and private random scaling procedures, for constructing private confidence intervals. We establish the consistency and asymptotic normality of the proposed estimators, providing a theoretical foundation for our online private inference approach.

Several directions for future research are worth exploring. First, while this work focuses on low-dimensional settings, extending the methodology to high-dimensional penalized estimators using noisy proximal methods would be valuable. Second, our models are based on parametric assumptions; adapting them to non-parametric models by developing a noisy functional SGD algorithm remains an interesting direction. Finally, this work assumes strong convexity and smoothness of the objective loss function. Extending the proposed framework to general nonsmooth convex losses, such as the ReLU and quantile losses, which are piecewise linear and lack strong convexity, calls for future research.

References

  • Abadir and Paruolo (1997) Abadir, K. M. and P. Paruolo (1997). Two mixed normal densities from cointegration analysis. Econometrica: Journal of the Econometric Society, 671–680.
  • Alabi and Vadhan (2023) Alabi, D. G. and S. P. Vadhan (2023). Differentially private hypothesis testing for linear regression. Journal of Machine Learning Research 24(361), 1–50.
  • Avella-Medina (2021) Avella-Medina, M. (2021). Privacy-preserving parametric inference: a case for robust statistics. Journal of the American Statistical Association 116(534), 969–983.
  • Avella-Medina et al. (2023) Avella-Medina, M., C. Bradshaw, and P.-L. Loh (2023). Differentially private inference via noisy optimization. The Annals of Statistics 51(5), 2067–2092.
  • Barrientos et al. (2019) Barrientos, A. F., J. P. Reiter, A. Machanavajjhala, and Y. Chen (2019). Differentially private significance tests for regression coefficients. Journal of Computational and Graphical Statistics 28(2), 440–453.
  • Berrett and Yu (2021) Berrett, T. and Y. Yu (2021). Locally private online change point detection. Advances in Neural Information Processing Systems 34, 3425–3437.
  • Chadha et al. (2021) Chadha, K., J. Duchi, and R. Kuditipudi (2021). Private confidence sets. In NeurIPS 2021 Workshop Privacy in Machine Learning.
  • Chen et al. (2024) Chen, X., Z. Lai, H. Li, and Y. Zhang (2024). Online statistical inference for stochastic optimization via Kiefer-Wolfowitz methods. Journal of the American Statistical Association, 1–24.
  • Chen et al. (2020) Chen, X., J. D. Lee, X. T. Tong, and Y. Zhang (2020). Statistical inference for model parameters in stochastic gradient descent. The Annals of Statistics 48(1), 251–273.
  • Dankar and El Emam (2013) Dankar, F. K. and K. El Emam (2013). Practicing differential privacy in health care: A review. Trans. Data Priv. 6(1), 35–67.
  • Davidow et al. (2023) Davidow, D. M., Y. Manevich, and E. Toch (2023). Privacy-preserving transactions with verifiable local differential privacy. In 5th Conference on Advances in Financial Technologies.
  • Ding et al. (2017) Ding, B., J. Kulkarni, and S. Yekhanin (2017). Collecting telemetry data privately. Advances in Neural Information Processing Systems 30.
  • Dong et al. (2022) Dong, J., A. Roth, and W. J. Su (2022). Gaussian differential privacy. Journal of the Royal Statistical Society Series B: Statistical Methodology 84(1), 3–37.
  • Duchi et al. (2018) Duchi, J. C., M. I. Jordan, and M. J. Wainwright (2018). Minimax optimal procedures for locally private estimation. Journal of the American Statistical Association 113(521), 182–201.
  • Dwork et al. (2006) Dwork, C., F. McSherry, K. Nissim, and A. Smith (2006). Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography: Third Theory of Cryptography Conference, TCC 2006, New York, NY, USA, March 4-7, 2006. Proceedings 3, pp.  265–284. Springer.
  • Erlingsson et al. (2014) Erlingsson, Ú., V. Pihur, and A. Korolova (2014). Rappor: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC conference on computer and communications security, pp.  1054–1067.
  • Fainmesser et al. (2023) Fainmesser, I. P., A. Galeotti, and R. Momot (2023). Digital privacy. Management Science 69(6), 3157–3173.
  • Hu et al. (2022) Hu, M., R. Momot, and J. Wang (2022). Privacy management in service systems. Manufacturing & Service Operations Management 24(5), 2761–2779.
  • Karwa and Slavković (2016) Karwa, V. and A. Slavković (2016). Inference using noisy degrees: Differentially private β\beta-model and synthetic graphs. The Annals of Statistics 44(1), 87–112.
  • Karwa and Vadhan (2018) Karwa, V. and S. Vadhan (2018). Finite sample differentially private confidence intervals. In 9th Innovations in Theoretical Computer Science Conference, Volume 94.
  • Lee et al. (2022) Lee, S., Y. Liao, M. H. Seo, and Y. Shin (2022). Fast and robust online inference with stochastic gradient descent via random scaling. In Proceedings of the AAAI Conference on Artificial Intelligence, Volume 36, pp.  7381–7389.
  • Li et al. (2024) Li, M., Y. Tian, Y. Feng, and Y. Yu (2024). Federated transfer learning with differential privacy. arXiv preprint arXiv:2403.11343.
  • Li et al. (2022) Li, X., J. Liang, X. Chang, and Z. Zhang (2022). Statistical estimation and online inference via local sgd. In Conference on Learning Theory, pp.  1613–1661. PMLR.
  • Lin and Xi (2011) Lin, N. and R. Xi (2011). Aggregated estimating equation estimation. Statistics and its Interface 4(1), 73–83.
  • Liu et al. (2023) Liu, Y., Q. Hu, L. Ding, and L. Kong (2023). Online local differential private quantile inference via self-normalization. In International Conference on Machine Learning, pp. 21698–21714. PMLR.
  • Luo and Song (2020) Luo, L. and P. X.-K. Song (2020). Renewable estimation and incremental inference in generalized linear models with streaming data sets. Journal of the Royal Statistical Society Series B: Statistical Methodology 82(1), 69–97.
  • Man et al. (2024) Man, R., K. M. Tan, Z. Wang, and W.-X. Zhou (2024). Retire: Robust expectile regression in high dimensions. Journal of Econometrics 239(2), 105459.
  • Maniar et al. (2021) Maniar, T., A. Akkinepally, and A. Sharma (2021). Differential privacy for credit risk model. arXiv preprint arXiv:2106.15343.
  • Polyak and Juditsky (1992) Polyak, B. T. and A. B. Juditsky (1992). Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization 30(4), 838–855.
  • Ponomareva et al. (2023) Ponomareva, N., H. Hazimeh, A. Kurakin, Z. Xu, C. Denison, H. B. McMahan, S. Vassilvitskii, S. Chien, and A. G. Thakurta (2023). How to dp-fy ml: A practical guide to machine learning with differential privacy. Journal of Artificial Intelligence Research 77, 1113–1201.
  • Robbins and Monro (1951) Robbins, H. and S. Monro (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 400–407.
  • Rohde and Steinberger (2020) Rohde, A. and L. Steinberger (2020). Geometrizing rates of convergence under local differential privacy constraints. The Annals of Statistics 48(5), 2646–2670.
  • Ruppert (1988) Ruppert, D. (1988). Efficient estimations from a slowly convergent Robbins-Monro process. Technical report, Cornell University Operations Research and Industrial Engineering.
  • Sart (2023) Sart, M. (2023). Density estimation under local differential privacy and hellinger loss. Bernoulli 29(3), 2318–2341.
  • Schifano et al. (2016) Schifano, E. D., J. Wu, C. Wang, J. Yan, and M.-H. Chen (2016). Online updating of statistical inference in the big data setting. Technometrics 58(3), 393–403.
  • Sheffet (2017) Sheffet, O. (2017). Differentially private ordinary least squares. In International Conference on Machine Learning, pp. 3105–3114. PMLR.
  • Smith et al. (2022) Smith, J., H. J. Asghar, G. Gioiosa, S. Mrabet, S. Gaspers, and P. Tyler (2022). Making the most of parallel composition in differential privacy. Proceedings on Privacy Enhancing Technologies, 253–273.
  • Song et al. (2021) Song, S., T. Steinke, O. Thakkar, and A. Thakurta (2021). Evading the curse of dimensionality in unconstrained private glms. In International Conference on Artificial Intelligence and Statistics, pp.  2638–2646. PMLR.
  • Tang et al. (2017) Tang, J., A. Korolova, X. Bai, X. Wang, and X. Wang (2017). Privacy loss in apple’s implementation of differential privacy on macos 10.12. arXiv preprint arXiv:1709.02753.
  • Vaswani et al. (2022) Vaswani, S., B. Dubois-Taine, and R. Babanezhad (2022). Towards noise-adaptive, problem-adaptive (accelerated) stochastic gradient descent. In International conference on machine learning, pp. 22015–22059. PMLR.
  • Wang et al. (2019) Wang, Y., D. Kifer, and J. Lee (2019). Differentially private confidence intervals for empirical risk minimization. Journal of Privacy and Confidentiality 9.
  • Wang and Tsai (2022) Wang, Y.-R. and Y.-C. Tsai (2022). The protection of data sharing for privacy in financial vision. Applied Sciences 12(15), 7408.
  • Zhang et al. (2024) Zhang, Z., R. Nakada, and L. Zhang (2024). Differentially private federated learning: Servers trustworthiness, estimation, and statistical inference. arXiv preprint arXiv:2404.16287.
  • Zhao et al. (2024) Zhao, T., W. Zhou, and L. Wang (2024). Private optimal inventory policy learning for feature-based newsvendor with unknown demand. Management Science, in press.
  • Zhong (2019) Zhong, G. (2019). E-commerce consumer privacy protection based on differential privacy. In Journal of Physics: Conference Series, Volume 1168, pp. 032084. IOP Publishing.
  • Zhu et al. (2023) Zhu, W., X. Chen, and W. B. Wu (2023). Online covariance matrix estimation in stochastic gradient descent. Journal of the American Statistical Association 118(541), 393–404.