
Uncertainty quantification by block bootstrap for differentially private stochastic gradient descent

Holger Dette
Chair of Stochastics
Ruhr University Bochum
Bochum, Germany
holger.dette@ruhr-uni-bochum.de

Carina Graw
Chair of Stochastics
Ruhr University Bochum
Bochum, Germany
carina.graw@ruhr-uni-bochum.de
Abstract

Stochastic Gradient Descent (SGD) is a widely used tool in machine learning. In the context of Differential Privacy (DP), SGD has been well studied in recent years, with the focus mainly on convergence rates and privacy guarantees. While in the non-private case uncertainty quantification (UQ) for SGD by bootstrap has been addressed by several authors, these procedures cannot be transferred to differential privacy because they require multiple queries to the private data. In this paper, we propose a novel block bootstrap for SGD under local differential privacy that is computationally tractable and does not require an adjustment of the privacy budget. The method can be easily implemented and is applicable to a broad class of estimation problems. We prove the validity of our approach and illustrate its finite sample properties by means of a simulation study. As a by-product, the new method also provides a simple alternative numerical tool for UQ for non-private SGD.

1 Introduction

In times where data is collected almost everywhere and at any time, privacy is an important issue that has to be addressed in any data analysis. In the past, successful de-anonymisation attacks have been performed (see, for example, Narayanan and Shmatikov, 2006; Gambs et al., 2014; Eshun and Palmieri, 2022), leaking private, possibly sensitive data of individuals. Differential Privacy (DP) is a framework that has been introduced by Dwork et al. (2006) to protect an individual's data against such attacks while still learning something about the population. Nowadays, Differential Privacy has become a state-of-the-art framework which has been implemented by the US Census Bureau and large companies like Apple, Microsoft and Google (see Ding et al., 2017; Abowd, 2018).
The key idea of Differential Privacy is to randomize the data or a data-dependent statistic. This additional randomness guarantees that exchanging one individual changes the distribution of the randomized output only slightly. In the last decade, numerous differentially private algorithms have been developed, see for example Wang et al. (2020); Xiong et al. (2020); Yang et al. (2023) and the references therein. In many cases differentially private algorithms are based on already known and well studied procedures such as empirical risk minimization, where either the objective function (objective perturbation) or the minimizer of that function (output perturbation) is privatized, or statistical testing, where the test statistic is privatized, see for example Chaudhuri et al. (2011); Vu and Slavkovic (2009). Two different concepts of differential privacy are distinguished in the literature: the first one, called (global) Differential Privacy (DP), assumes the presence of a trusted curator, who maintains and performs computations on the data and ensures that a published statistic satisfies a certain privacy guarantee. The second one, which is considered in this paper, does not make this assumption and is called local Differential Privacy (LDP). Here it is required that the data is already collected in a privacy preserving way.
Besides privacy there is a second issue which has to be addressed in the analysis of massive data, namely the problem of computational complexity. A common and widely used tool for dealing with large-scale and complex optimization problems in big data analysis is Stochastic Gradient Descent (SGD), which was already introduced in the early work of Robbins and Monro (1951). This procedure computes, iteratively and in an online fashion, a data-based estimate of the minimizer

\theta_{\star} = \underset{\theta}{\operatorname{argmin}}\; L(\theta)   (1)

of a function $L$. The convergence properties of adaptive algorithms such as SGD towards $\theta_{\star}$ are well studied, see for example the monograph by Benveniste et al. (1990) and the references therein. Mertikopoulos et al. (2020) investigate SGD in the case of non-convex functions and show convergence to a local minimizer.
There also exists a large amount of work on statistical inference based on SGD estimation, starting with the seminal papers of Ruppert (1988) and Polyak and Juditsky (1992), who establish asymptotic normality of the averaged SGD estimator for convex functions. The covariance matrix of the asymptotic distribution has a sandwich form, say $G^{-1}SG^{-1}$, and Chen et al. (2020) propose two different approaches to estimate this matrix. An online version of one of these methods is proposed in Zhu et al. (2021), while Yu et al. (2021) generalize these results to SGD with constant stepsize and non-convex loss functions. Recent literature on UQ for SGD includes Su and Zhu (2018), who propose iterative sample splitting of the path of SGD, Li et al. (2018), who consider SGD with constant stepsize splitting the whole path into segments, and Chee et al. (2023), who construct a simple confidence interval for the last iterate of SGD.
Song et al. (2013) investigate SGD under DP and demonstrate that mini-batch SGD has a much better accuracy than estimators obtained with batch size 1. However, mini-batch SGD with a batch size larger than 1 only guarantees global DP and therefore requires the presence of a trusted curator. Among other approaches for empirical risk minimization, Bassily et al. (2014) derive a utility bound in terms of the excess risk for an $(\epsilon,\delta)$-DP SGD, and Avella-Medina et al. (2023) investigate statistical inference for noisy gradient descent under Gaussian DP.
On the other hand, statistical inference for SGD under LDP is far less explored. Duchi et al. (2018) show the minimax optimality of the LDP SGD median but do not provide confidence intervals for inference. The method of Chen et al. (2020) is not always applicable (for example, for quantiles) and requires an additional private estimate of the matrix $G$ in the asymptotic covariance matrix of the SGD estimate. Recently, Liu et al. (2023) proposed a self-normalizing approach to obtain distribution-free locally differentially private confidence intervals for quantiles. However, this method is tailored to a specific model, and it is well known that the nice property of distribution-free inference by self-normalization comes at the price of a loss in statistical efficiency (see Dette and Gösmann, 2020, for an example in the context of testing).
A natural alternative is the application of the bootstrap, and bootstrap for SGD in the non-private case has been studied by Fang et al. (2018); Zhong et al. (2023). These authors propose a multiplier approach where SGD is executed $B$ times, multiplying the updates by non-negative random variables. However, to ensure $\epsilon$-LDP, the privacy budget must be split over all $B$ bootstrap versions, leading to large noise levels and high inaccuracy of the resulting estimators. There is also ongoing research on the parametric bootstrap under DP (see Ferrando et al., 2022; Fu et al., 2023; Wang and Awan, 2023), which assumes that the parameter of interest characterizes the distribution of the data; this is not necessarily the case for SGD. Finally, we mention Wang et al. (2022), who introduce a bootstrap procedure under Gaussian DP, where each bootstrap estimator is privately computed on a random subset.

Our Contribution: In this paper we propose a computationally tractable method for statistical inference using SGD under LDP. Our approach provides a universally applicable and statistically valid method for uncertainty quantification for SGD and LDP-SGD. It is based on an application of the block bootstrap to the mean of the iterates obtained by SGD. The (multiplier) block bootstrap is a common principle in time series analysis (see Lahiri, 2003). However, in contrast to this field, where the dependencies in the sequence of observations decrease quickly with increasing lag, the statistical analysis of such an approach for SGD is very challenging. More precisely, SGD produces a sequence of highly dependent iterates, which makes the application of classical concepts from time series analysis such as mixing (see Bradley, 2005) or physical dependence (see Wu, 2005) impossible. Instead, we use a different technique and show the consistency of the proposed multiplier block bootstrap for LDP-SGD under appropriate conditions on the block length and the learning rate. As a by-product, our results also provide a new method of UQ for SGD and for mini-batch SGD in the non-private case.

Notations: $\lVert\cdot\rVert$ denotes a norm on $\mathbb{R}^{d}$ if not further specified. $\overset{d}{\to}$ denotes weak convergence and $\overset{\mathbb{P}}{\to}$ denotes convergence in probability.

2 Preliminaries and Background

Stochastic Gradient Descent (SGD). Define $L(\theta)=\mathbb{E}[\ell(\theta,X)]$, where $X$ is an $\mathbb{X}$-valued random variable and $\ell:\mathbb{R}^{d}\times\mathbb{X}\to\mathbb{R}$ is a loss function. We consider the optimization problem (1). If $L$ is differentiable and convex, then this minimization problem is equivalent to solving

R(\theta) := \nabla L(\theta) = 0,

where $R=\nabla L$ is the gradient of $L$. SGD computes an estimator $\bar{\theta}$ for $\theta_{\star}$ using noisy observations $g(X_{i},\cdot)$ of the gradient $R(\cdot)$ in an iterative way. Note that $g$ does not need to be differentiable with respect to $\theta$. The iterations of SGD are defined by

\theta_{i} = \theta_{i-1} - \eta_{i}\, g(X_{i},\theta_{i-1}) = \theta_{i-1} - \eta_{i}\big(R(\theta_{i-1}) + \xi_{i}\big)   (2)

where $\eta_{i}=c\,i^{-\gamma}$ with parameters $c>0$ and $\gamma\in(0.5,1)$ is the learning rate, and the quantity $\xi_{i}=g(X_{i},\theta_{i-1})-R(\theta_{i-1})$ is called the measurement error in the $i$th iteration (note that we do not reflect the dependence on $\theta_{i-1}$ and $X_{i}$ in the definition of $\xi_{i}$). The estimator of $\theta_{\star}$ is finally obtained as the average of the iterates, that is

\bar{\theta} = \frac{1}{n}\sum_{i=1}^{n}\theta_{i}.   (3)

Note that the first representation in (2) can be used directly for implementation, while the second is helpful for proving probabilistic properties of SGD. For example, Polyak and Juditsky (1992) show that, under appropriate conditions, $\sqrt{n}(\bar{\theta}-\theta_{\star})$ is asymptotically normally distributed, where the covariance matrix $\Sigma$ of the limit distribution depends on the optimization problem and the variance of the measurement errors $\xi_{i}$.
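To fix ideas, the following R snippet (a minimal sketch, not taken from the paper's implementation) runs the recursion (2) with the running average (3) for the toy problem $L(\theta)=\frac{1}{2}\mathbb{E}[(\theta-X)^{2}]$, where the noisy gradient is $g(x,\theta)=\theta-x$; the function name averaged_sgd and the default values are illustrative choices.

# Averaged SGD (2)-(3) for estimating a mean; x is a vector of scalar observations
averaged_sgd <- function(x, theta0 = 0, c0 = 1, gamma = 0.51) {
  n <- length(x)
  theta <- theta0
  theta_bar <- 0
  for (i in 1:n) {
    eta <- c0 * i^(-gamma)                           # learning rate eta_i = c0 * i^(-gamma)
    theta <- theta - eta * (theta - x[i])            # SGD update (2) with g(x, theta) = theta - x
    theta_bar <- (i - 1) / i * theta_bar + theta / i # running average (3)
  }
  theta_bar
}

set.seed(1)
averaged_sgd(rnorm(1e5, mean = 2))                   # should be close to the true mean 2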

The use of only one observation $g(X_{i},\theta_{i-1})$ in SGD yields a relatively large measurement error in each iteration, which might result in an inaccurate estimate of the gradient $R(\theta_{i-1})$. Therefore, several authors have proposed a variant of SGD, called mini-batch SGD, where $s\geq 1$ observations $g(X_{i_{1}},\tilde{\theta}_{i-1}),\ldots,g(X_{i_{s}},\tilde{\theta}_{i-1})$ are used for the update. An estimator for $R(\tilde{\theta}_{i-1})$ is then given by the mean of these observations, yielding the recursion

\tilde{\theta}_{i} = \tilde{\theta}_{i-1} - \eta_{i}\,\frac{1}{s}\sum_{j=1}^{s}g(X_{i_{j}},\tilde{\theta}_{i-1}) = \tilde{\theta}_{i-1} - \eta_{i}\big(R(\tilde{\theta}_{i-1}) + \tilde{\xi}_{i}\big)   (4)

where the measurement error is given by $\tilde{\xi}_{i}=\frac{1}{s}\sum_{j=1}^{s}g(X_{i_{j}},\tilde{\theta}_{i-1})-R(\tilde{\theta}_{i-1})$, see for example Cotter et al. (2011); Khirirat et al. (2017).

Differential Privacy (DP). The idea of differential privacy is that a single individual should not change the outcome of an algorithm substantially, i.e. changing one data point should not alter the result too much. Let $(\mathbb{X},\mathcal{X})$ be a measurable space. For $n\in\mathbb{N}$, $x\in\mathbb{X}^{n}$ will be called a data base containing $n$ data points. Two data bases $x,x'\in\mathbb{X}^{n}$ are called neighbouring if they differ in only one data point, and they are called disjoint if all data points are different. How much a single data point is allowed to change the outcome is captured by a privacy parameter $\epsilon$. Let $(\mathbb{Y},\mathcal{Y})$ be a measurable space. A randomized algorithm $\mathcal{A}:\mathbb{X}^{n}\to\mathbb{Y}$ maps $x\in\mathbb{X}^{n}$ onto a random variable $\mathcal{A}(x)$ with values in $\mathbb{Y}$. Here we will consider either subspaces of $\mathbb{R}^{d}$ equipped with the Borel sets or finite sets equipped with their power set.

Definition 2.1.

Let $\epsilon>0$. A randomized algorithm $\mathcal{A}:\mathbb{X}^{n}\to\mathbb{Y}$ is $\epsilon$-differentially private (dp) if for all neighbouring $x,x'\in\mathbb{X}^{n}$ and all sets $S\in\mathcal{Y}$

P(\mathcal{A}(x)\in S) \leq e^{\epsilon}\, P(\mathcal{A}(x')\in S).   (5)

$\mathcal{A}$ is also called an $\epsilon$-dp mechanism and is said to preserve DP. If $\mathcal{A}$ is restricted to $n=1$, i.e. $x$ and $x'$ contain only one data point, and (5) holds, then $\mathcal{A}$ is $\epsilon$-local differentially private (ldp).

In this paper we work under the constraints of local differential privacy (LDP).

The following result is well known (see, for example, Dwork and Roth, 2014).

Theorem 2.1.

  (1) Post processing: Let $(\mathbb{Z},\mathcal{Z})$ be a measurable space, $\mathcal{A}:\mathbb{X}^{n}\to\mathbb{Y}$ be an $\epsilon$-dp mechanism and $f:\mathbb{Y}\to\mathbb{Z}$ be a measurable function. Then $f\circ\mathcal{A}$ is $\epsilon$-dp.

  (2) Sequential composition: Let $\mathcal{A}_{i}:\mathbb{X}^{n}\to\mathbb{Y}$, $i=1,\ldots,k$, be $\epsilon_{i}$-dp mechanisms. Then $\mathcal{A}:\mathbb{X}^{n}\to\mathbb{Y}^{k}$ that maps $x\mapsto(\mathcal{A}_{1}(x),\ldots,\mathcal{A}_{k}(x))$ is $\sum_{i=1}^{k}\epsilon_{i}$-dp.

  (3) Parallel composition: Let $\mathcal{A}_{i}:\mathbb{X}^{n}\to\mathbb{Y}$, $i=1,\ldots,k$, be $\epsilon_{i}$-dp mechanisms and $x_{1},\ldots,x_{k}\in\mathbb{X}^{n}$ be disjoint. Then $\mathcal{A}:\mathbb{X}^{n\times k}\to\mathbb{Y}^{k}$ that maps $(x_{1},\ldots,x_{k})\mapsto(\mathcal{A}_{1}(x_{1}),\ldots,\mathcal{A}_{k}(x_{k}))$ is $\max_{i}\epsilon_{i}$-dp.

Two well-known privacy mechanisms are the following:

Example 2.1 (Laplace Mechanism).

Let $f:\mathbb{X}\to\mathbb{Y}\subset\mathbb{R}$ be a function. Its sensitivity is defined as $\Delta(f)=\sup_{x,x'\in\mathbb{X}}\lVert f(x)-f(x')\rVert$. Assume that $\Delta(f)<\infty$ and let $L\sim\operatorname{Lap}(0,\Delta(f)/\epsilon)$, where $\operatorname{Lap}(0,b)$ denotes a centered Laplace distribution with density

p(x) = \frac{1}{2b}\exp\Big(-\frac{|x|}{b}\Big).

Then the randomized algorithm $\mathcal{A}_{Lap}:x\mapsto f(x)+L$ is $\epsilon$-dp.
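As an illustration, here is a minimal R sketch of the Laplace mechanism (the helper names rlaplace and laplace_mechanism are hypothetical; the inverse-CDF sampler is standard and the scale $b=\Delta(f)/\epsilon$ follows Example 2.1):

# Sample from a centered Laplace distribution Lap(0, b) via the inverse CDF
rlaplace <- function(n, b) {
  u <- runif(n, -0.5, 0.5)
  -b * sign(u) * log(1 - 2 * abs(u))
}

# Laplace mechanism: add Lap(0, sensitivity/epsilon) noise to f(x)
laplace_mechanism <- function(x, f, sensitivity, epsilon) {
  f(x) + rlaplace(length(f(x)), b = sensitivity / epsilon)
}

# Example: privatize f(x) = min(max(x, 0), 1), which has sensitivity 1
laplace_mechanism(0.3, function(x) pmin(pmax(x, 0), 1), sensitivity = 1, epsilon = 1)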

Example 2.2 (Randomized Response).

Let $f:\mathbb{X}\to\{0,1\}$ be a function. Denote $p=\frac{e^{\epsilon}}{e^{\epsilon}+1}$ and define for $x\in\mathbb{X}$ a random variable $\mathcal{A}_{RR}(x)$ on $\{0,1\}$ with conditional distribution given $x$ as

\mathcal{A}_{RR}(x) \sim \begin{cases} B(p) & \text{if } f(x)=1,\\ B(1-p) & \text{if } f(x)=0,\end{cases}

where $B(p)$ denotes a Bernoulli distribution with success probability $p$. Then $\mathcal{A}_{RR}(x)$ is $\epsilon$-dp.
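A corresponding R sketch of Randomized Response (hypothetical helper names; the de-biasing step anticipates the re-scaling used for the quantile example in Section 4):

# Randomized Response for a binary value: report the true bit with probability p
randomized_response <- function(bit, epsilon) {
  p <- exp(epsilon) / (exp(epsilon) + 1)
  keep <- rbinom(length(bit), size = 1, prob = p)
  ifelse(keep == 1, bit, 1 - bit)
}

# De-biasing: E[(A_RR - (1 - p)) / (2p - 1) | bit] = bit
debias_rr <- function(response, epsilon) {
  p <- exp(epsilon) / (exp(epsilon) + 1)
  (response - (1 - p)) / (2 * p - 1)
}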

Locally Differentially Private Stochastic Gradient Descent (LDP-SGD). By Theorem 2.1 it follows that SGD is $\epsilon$-ldp if the noisy observations $g(X_{i},\theta_{i-1})$ in (2) are computed in a way that preserves $\epsilon$-LDP. Let $\mathcal{A}$ be such an $\epsilon$-ldp mechanism for computing $g$. The LDP-SGD updates are then given by

\theta_{i}^{LDP} = \theta_{i-1}^{LDP} - \eta_{i}\,\mathcal{A}\big(g(X_{i},\theta^{LDP}_{i-1})\big) = \theta_{i-1}^{LDP} - \eta_{i}\big(R(\theta_{i-1}^{LDP}) + \xi_{i}^{SGD} + \xi_{i}^{LDP}\big)   (6)

where

\xi_{i}^{SGD} = g(X_{i},\theta_{i-1}^{LDP}) - R(\theta_{i-1}^{LDP})   (7)

captures the error due to measurement and

\xi_{i}^{LDP} = \mathcal{A}\big(g(X_{i},\theta_{i-1}^{LDP})\big) - g(X_{i},\theta_{i-1}^{LDP})   (8)

captures the error due to privacy. Analogously to (3), the final LDP-SGD estimate is defined by

\bar{\theta}^{LDP} = \frac{1}{n}\sum_{i=1}^{n}\theta_{i}^{LDP}.   (9)
Remark 2.1.

Song et al. (2013) also investigate mini-batch SGD under DP, where each iteration is updated as in (4), and demonstrate that mini-batch SGD with batch size 5 achieves higher accuracy than batch size 1. Their procedure reads

\tilde{\theta}_{i}^{DP} = \tilde{\theta}_{i-1}^{DP} - \eta_{i}\Big(\frac{1}{s}\sum_{j=1}^{s}g(X_{i_{j}},\tilde{\theta}_{i-1}^{DP}) + \frac{L_{i}}{s}\Big) = \tilde{\theta}_{i-1}^{DP} - \eta_{i}\big(R(\tilde{\theta}_{i-1}^{DP}) + \xi_{i}^{SGD} + \xi_{i}^{DP}\big),

where $L_{i}\sim\operatorname{Lap}(0,\Delta(g)/\epsilon)$, $\xi_{i}^{SGD}=\frac{1}{s}\sum_{j=1}^{s}g(X_{i_{j}},\tilde{\theta}_{i-1}^{DP})-R(\tilde{\theta}_{i-1}^{DP})$ and $\xi_{i}^{DP}=L_{i}/s$. Therefore, all results presented in this paper for LDP-SGD also hold for the DP mini-batch SGD proposed by Song et al. (2013). However, this mini-batch SGD guarantees DP, not LDP. To obtain an ldp mini-batch SGD, one would need to privatize each data point, leading to the following iteration:

\tilde{\theta}_{i}^{LDP} = \tilde{\theta}_{i-1}^{LDP} - \eta_{i}\,\frac{1}{s}\sum_{j=1}^{s}\big(g(X_{i_{j}},\tilde{\theta}_{i-1}^{LDP}) + L_{i_{j}}\big),

where $L_{i_{1}},\ldots,L_{i_{s}}\sim\operatorname{Lap}(0,\Delta(g)/\epsilon)$ are independent random variables. Our results are applicable to ldp mini-batch SGD as well.

Remark 2.2.

Note that the deconvolution approach used in Wang et al. (2022) for the construction of a DP bootstrap procedure is not applicable to SGD, because it requires an additive relation of the form $U=V+W$ between a DP-private estimator $U$ and a non-private estimator $V$, where the distribution of $W$ is known. For example, if the gradient is linear, that is $R(\theta)=G\theta$ for a matrix $G$, and SGD and LDP-SGD are started with the same initial value $\theta_{0}$, we obtain by standard calculations the representation

\bar{\theta}^{LDP} = \bar{\theta} + Z_{n}

where $Z_{n}=-\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{i}\eta_{j}\,\xi_{j}^{LDP}\prod_{k=j+1}^{i}(I-\eta_{k}G)$. However, although the distribution of the random variables $\xi_{j}^{LDP}$ is known, the distribution of $Z_{n}$ is not easily accessible and additionally depends on the unknown matrix $G$, which makes the application of deconvolution principles impossible.
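To make the "standard calculations" behind this representation explicit, the following sketch assumes, only for illustration, that both recursions are driven by the same measurement errors $\xi_{j}^{SGD}$. Setting $D_{i}:=\theta_{i}^{LDP}-\theta_{i}$ with $D_{0}=0$ and subtracting the two recursions gives

D_{i} = (I-\eta_{i}G)D_{i-1} - \eta_{i}\,\xi_{i}^{LDP} = -\sum_{j=1}^{i}\eta_{j}\,\xi_{j}^{LDP}\prod_{k=j+1}^{i}(I-\eta_{k}G),

and averaging over $i=1,\ldots,n$ yields $\bar{\theta}^{LDP}-\bar{\theta}=\frac{1}{n}\sum_{i=1}^{n}D_{i}=Z_{n}$.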

Algorithm 1 Local Differentially Private Stochastic Gradient Descent (LDP-SGD)
Input: $X_{1},\ldots,X_{n}$, noisy gradient $g(\cdot,\cdot)$, initial value $\theta_{0}$, learning rate $\eta_{i}=c\,i^{-\gamma}$ with parameters $c>0$ and $\gamma\in(0.5,1)$, $\epsilon$-ldp mechanism $\mathcal{A}$
Output: $\theta_{1}^{LDP},\ldots,\theta_{n}^{LDP}$, $\bar{\theta}^{LDP}$
$\theta_{0}^{LDP}\leftarrow\theta_{0}$, $\bar{\theta}^{LDP}\leftarrow 0$
for $i=1,\ldots,n$ do
  $\theta_{i}^{LDP}\leftarrow\theta_{i-1}^{LDP}-c\,i^{-\gamma}\,\mathcal{A}(g(X_{i},\theta_{i-1}^{LDP}))$
  $\bar{\theta}^{LDP}\leftarrow\frac{i-1}{i}\bar{\theta}^{LDP}+\frac{1}{i}\theta_{i}^{LDP}$
end for
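A minimal R sketch of Algorithm 1 for scalar data (function names are illustrative; grad and mech are user-supplied implementations of the noisy gradient $g$ and of an $\epsilon$-ldp mechanism $\mathcal{A}$, and the iterates are stored because the block bootstrap of Section 3 operates on them):

# LDP-SGD (Algorithm 1): x is a vector of scalar observations, theta0 a d-dimensional start value
ldp_sgd <- function(x, grad, mech, theta0, c0 = 1, gamma = 0.51) {
  n <- length(x)
  d <- length(theta0)
  theta <- theta0
  theta_bar <- rep(0, d)
  iterates <- matrix(0, nrow = n, ncol = d)          # keep the whole path for the block bootstrap
  for (i in 1:n) {
    eta <- c0 * i^(-gamma)
    theta <- theta - eta * mech(grad(x[i], theta))   # privatized update (6)
    theta_bar <- (i - 1) / i * theta_bar + theta / i # running average (9)
    iterates[i, ] <- theta
  }
  list(estimate = theta_bar, iterates = iterates)
}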

Block Bootstrap. The bootstrap is a widely used procedure to estimate the distribution of an estimator $\hat{\theta}=\theta(X_{1},\ldots,X_{n})$ calculated from data $X_{1},\ldots,X_{n}$ (see Efron and Tibshirani, 1994). In the simplest case, $X_{1}^{\star},\ldots,X_{n}^{\star}$ are drawn with replacement from $X_{1},\ldots,X_{n}$ and used to calculate $\hat{\theta}^{\star}=\theta(X_{1}^{\star},\ldots,X_{n}^{\star})$. This procedure is repeated several times to obtain an approximation of the distribution of $\hat{\theta}$. While it provides a valid approximation in many (but not all) cases if $X_{1},\ldots,X_{n}$ are independent and identically distributed, this bootstrap approximation is not correct in the case of dependent data. A common approach to address this problem is the multiplier bootstrap, which is tailored to estimators with a linear structure (Lahiri, 2003). To illustrate the principle, we consider the empirical mean $\hat{\theta}=\bar{X}_{n}=\frac{1}{n}\sum_{i=1}^{n}X_{i}$. Under suitable assumptions, the central limit theorem for stationary sequences shows that for large sample size $n$ the distribution of $\sqrt{n}(\bar{X}_{n}-\mu)$ can be approximated by a normal distribution, say ${\cal N}(0,\sigma^{2})$, with expectation $0$ and variance $\sigma^{2}$, where $\mu=\mathbb{E}[X_{1}]$ and $\sigma^{2}$ depends in a complicated manner on the dependence structure of the data.
The multiplier block bootstrap mimics this distribution by partitioning the sample into $m$ blocks of length $l$, say $\{X_{(i-1)l+1},\ldots,X_{il}\}$. For each block one calculates the mean $\bar{X}_{(i)}=\frac{1}{l}\sum_{j=(i-1)l+1}^{il}X_{j}$, which is then multiplied with a random weight $\epsilon_{i}$ with mean $0$ and variance $1$ to obtain the bootstrap statistic $\bar{X}^{\star}_{n}=\frac{1}{m}\sum_{i=1}^{m}\epsilon_{i}(\bar{X}_{(i)}-\bar{X}_{n})$, where $m=\lfloor\frac{n}{l}\rfloor$. If the dependencies between $X_{i}$ and $X_{j}$ become sufficiently small for increasing $|i-j|$, it can be shown that the conditional distribution of $\sqrt{n}\,\bar{X}_{n}^{\star}$ is a reasonable approximation of the distribution of $\sqrt{n}(\bar{X}_{n}-\mu)$ (for large $n$). Typical conditions on the dependence structure of the data guaranteeing these approximations are formulated in terms of mixing or physical dependence coefficients (see Bradley, 2005; Wu, 2005). We will not give details here, because none of these techniques is applicable for proving the validity of the multiplier block bootstrap developed in the following section.
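The following R sketch (illustrative only) applies this principle to the mean of a dependent AR(1) series; the multipliers are chosen as $\operatorname{Unif}(-\sqrt{3},\sqrt{3})$, the distribution also used in Section 4:

# Multiplier block bootstrap for the mean of a dependent series
block_bootstrap_mean <- function(x, l, B = 500) {
  n <- length(x)
  m <- floor(n / l)
  x <- x[1:(m * l)]                                  # discard the incomplete last block
  block_means <- colMeans(matrix(x, nrow = l))       # block means X_bar_(1), ..., X_bar_(m)
  centered <- block_means - mean(x)
  replicate(B, mean(runif(m, -sqrt(3), sqrt(3)) * centered))
}

set.seed(1)
x <- as.numeric(arima.sim(list(ar = 0.5), n = 1e5)) + 1   # AR(1) data with mean 1
boot <- block_bootstrap_mean(x, l = floor(length(x)^0.5))
sd(sqrt(length(x)) * boot)    # approximates the long-run standard deviation of sqrt(n) * X_bar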

3 Multiplier Block Bootstrap for LDP-SGD

In this section, we develop a multiplier block bootstrap approach to approximate the distribution of the LDP-SGD estimate $\bar{\theta}^{LDP}$ defined in (9) by resampling. We also prove that this method is statistically valid in the sense that the bootstrap distribution converges weakly (conditional on the data) to the limit distribution of the estimate $\bar{\theta}^{LDP}$, which is derived first.

Our first result is a direct consequence of Theorem 2 in Polyak and Juditsky (1992), which can be applied to LDP-SGD.

Theorem 3.1.

If Assumption A.1 in the supplement holds, then the LDP-SGD estimate $\bar{\theta}^{LDP}$ in (9) satisfies

\sqrt{n}\big(\bar{\theta}^{LDP}-\theta_{\star}\big) \overset{d}{\rightarrow} N(0,\Sigma),   (10)

where the matrix $\Sigma$ is given by $\Sigma=G^{-1}SG^{-1}$. Here $S$ is the asymptotic variance of the errors $\xi_{i}=\xi_{i}^{SGD}+\xi_{i}^{LDP}$ and $G$ is a linear approximation of $R(\theta)$ (see Assumption A.1 in the supplement for more details).

Remark 3.1.

(a) Assumption A.1 requires the sequence of errors $\{\xi_{i}\}_{i\geq 1}$ to be a martingale difference process with respect to a filtration $\{\mathcal{F}_{i}\}_{i\geq 0}$. This assumption is satisfied if $\{\xi_{i}^{SGD}\}_{i\geq 1}$ and $\{\xi^{LDP}_{i}\}_{i\geq 1}$ are martingale difference processes with respect to $\{\mathcal{F}_{i}\}_{i\geq 0}$. Note that the condition $\mathbb{E}[\xi_{i}^{LDP}|\mathcal{F}_{i-1}]=0$ is implied by $\mathbb{E}[\mathcal{A}(g(X,\theta))|X]=g(X,\theta)$, which is satisfied for the Laplace mechanism; Randomized Response can be adjusted (by re-scaling) to fulfill this requirement.
(b) If $\xi_{i}^{LDP}$ and $\xi_{i}^{SGD}$ are independent given $\mathcal{F}_{i-1}$ and their respective covariance matrices converge in probability to $S_{LDP}$ and $S_{SGD}$, then it holds that $S=S_{SGD}+S_{LDP}$.

In principle, ldp statistical inference based on the statistic $\bar{\theta}^{LDP}$ can be carried out using the asymptotic distribution in Theorem 3.1 together with an ldp estimator of the covariance matrix $\Sigma$ in (10). For this purpose the matrices $G$ and $S$ have to be estimated in an ldp way. While $S$ can be estimated directly from the ldp observations $\mathcal{A}(g(X_{i},\theta_{i-1}^{LDP}))$ in (6) by

\bar{S}^{LDP} = \frac{1}{n}\sum_{i=1}^{n}\mathcal{A}\big(g(X_{i},\theta_{i-1}^{LDP})\big)\big(\mathcal{A}(g(X_{i},\theta_{i-1}^{LDP}))\big)^{T},

the matrix $G$ has to be estimated separately. Therefore, the privacy budget has to be split between the estimation of $G$ and the iterations of SGD. As a consequence, this approach leads to high inaccuracies, since all components need to be estimated in an ldp way. Furthermore, common estimates of $G$ are based on the derivative of $g$ (with respect to $\theta$), so $g$ needs to be differentiable. This excludes important examples such as quantile regression. Additionally, the computation of the inverse of $G$ can become computationally costly in high dimensions.

As an alternative we will develop a multiplier block bootstrap, which avoids these problems. Before explaining our approach in detail, we note that the bootstrap method proposed in Fang et al. (2018) for non-private SGD is not applicable in the present context. These authors replace the recursion in (2) by $\theta_{i}^{\star}=\theta_{i-1}^{\star}-\eta_{i}w_{i}\,g(X_{i},\theta_{i-1}^{\star})$, where $w_{1},\ldots,w_{n}$ are independent identically distributed non-negative random variables with $\mathbb{E}[w_{i}]=\operatorname{Var}(w_{i})=1$. Applying SGD in this way $B$ times, they calculate SGD estimates $\bar{\theta}^{\star(1)},\ldots,\bar{\theta}^{\star(B)}$, which are used to estimate the distribution of $\bar{\theta}$. However, for the implementation of this approach under LDP, the privacy budget $\epsilon$ must be split over these $B$ versions. In other words, for the calculation of each bootstrap replication $\bar{\theta}^{\star(b)}$ there is only a privacy budget of $\epsilon/B$ left, resulting in highly inaccurate estimators.
To address these problems we propose to apply the multiplier block bootstrap principle to the iterates $\theta_{1}^{LDP},\ldots,\theta_{n}^{LDP}$ of LDP-SGD in order to mimic the strong dependencies in these data. To be precise, let $l=l(n)$ be the block length and $m=\lfloor\frac{n}{l}\rfloor$ the number of blocks. Let $\epsilon_{1},\ldots,\epsilon_{m}$ be bounded, real-valued, independent and identically distributed random variables with $\mathbb{E}[\epsilon_{i}]=0$ and $\operatorname{Var}(\epsilon_{i})=1$. For the iterates $\theta_{1}^{LDP},\ldots,\theta_{n}^{LDP}$ obtained by LDP-SGD in (6) we calculate a bootstrap analogue of $\bar{\theta}^{LDP}-\theta_{\star}$ as

\bar{\theta}^{\star} = \frac{1}{ml}\sum_{j=1}^{m}\epsilon_{j}\sum_{b=(j-1)l+1}^{jl}\big(\theta_{b}^{LDP}-\bar{\theta}^{LDP}\big),   (11)

where $\bar{\theta}^{LDP}$ is the LDP-SGD estimate defined in (9). We will show below that, under appropriate assumptions, the distribution of $\bar{\theta}^{\star}$ is a reasonable approximation of the distribution of $\bar{\theta}^{LDP}-\theta_{\star}$. In practice this distribution can be obtained by generating $B$ samples of the form (11). Details are given in Algorithm 2, where the multiplier block bootstrap is used to estimate an $\alpha$-quantile of the distribution of $\bar{\theta}^{LDP}-\theta_{\star}$, say $q_{\alpha}$. With these quantiles we obtain a $(1-2\alpha)$ confidence interval for $\theta_{\star}$ as

\hat{C} = \big[\bar{\theta}^{LDP}+q_{\alpha},\; \bar{\theta}^{LDP}+q_{1-\alpha}\big].   (12)
Algorithm 2 Block Bootstrap for Stochastic Gradient Descent
Input: $\theta_{1},\ldots,\theta_{n}$, $B$, $l$, $\alpha\in(0,1)$
Output: an $\alpha$-quantile of the distribution of $\bar{\theta}^{LDP}-\theta_{\star}$
Set $\bar{\theta}=\frac{1}{n}\sum_{i=1}^{n}\theta_{i}$ and $m=\lfloor\frac{n}{l}\rfloor$
for $i=1,\ldots,B$ do
  Draw $\epsilon_{1},\ldots,\epsilon_{m}$ independent and identically distributed with $\mathbb{E}[\epsilon_{i}]=0$ and $\operatorname{Var}(\epsilon_{i})=1$
  $\bar{\theta}^{\star(i)}\leftarrow\frac{1}{ml}\sum_{j=1}^{m}\epsilon_{j}\sum_{b=(j-1)l+1}^{jl}(\theta_{b}-\bar{\theta})$
end for
$q_{\alpha}\leftarrow$ empirical $\alpha$-quantile of $\{\bar{\theta}^{\star(1)},\ldots,\bar{\theta}^{\star(B)}\}$
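A minimal R sketch of Algorithm 2, operating on the matrix of iterates returned by the ldp_sgd sketch above (function names are illustrative; the multipliers are again $\operatorname{Unif}(-\sqrt{3},\sqrt{3})$), including the confidence interval (12):

# Multiplier block bootstrap on the SGD iterates; returns componentwise confidence bounds
block_bootstrap_sgd <- function(iterates, l, B = 500, alpha = 0.05) {
  n <- nrow(iterates)
  d <- ncol(iterates)
  m <- floor(n / l)
  theta_bar <- colMeans(iterates)
  centered <- sweep(iterates[1:(m * l), , drop = FALSE], 2, theta_bar)  # theta_b - theta_bar
  block_sums <- rowsum(centered, rep(1:m, each = l))                    # block sums of length l
  boot <- replicate(B, colSums(runif(m, -sqrt(3), sqrt(3)) * block_sums) / (m * l))
  boot <- matrix(boot, ncol = d, byrow = TRUE)                          # B x d bootstrap replications
  q <- apply(boot, 2, quantile, probs = c(alpha, 1 - alpha))            # quantiles of theta_bar - theta_star
  list(estimate = theta_bar, lower = theta_bar + q[1, ], upper = theta_bar + q[2, ])
}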
Theorem 3.2.

Let Assumptions A.1 and A.2 in the supplement be satisfied. Let $l=l(n)$ and $m=m(n)$ be such that $l,m\to\infty$ and $m^{\gamma}l^{\gamma-1}\to 0$ for $n\to\infty$. Let $\epsilon_{1},\ldots,\epsilon_{m}$ in (11) be independent identically distributed random variables, independent of $\theta_{1}^{LDP},\ldots,\theta_{n}^{LDP}$, with $\mathbb{E}[\epsilon_{1}]=0$ and $\operatorname{Var}(\epsilon_{1})=1$. Further assume that there exists a constant $C$ such that $|\epsilon_{i}|\leq C$ a.s. Then, conditionally on $\theta_{1}^{LDP},\ldots,\theta_{n}^{LDP}$,

\sqrt{n}\,\bar{\theta}^{\star} \overset{d}{\rightarrow} N(0,\Sigma)

in probability, with $\Sigma=G^{-1}SG^{-1}$. Here $S$ is the asymptotic variance of the errors $\xi_{i}=\xi_{i}^{SGD}+\xi_{i}^{LDP}$ and $G$ is a linear approximation of $R(\theta)$ (see Assumption A.1 in the supplement for more details).

Remark 3.2.

(a) If we choose $l=\lfloor n^{\beta}\rfloor$ (which yields $m=\lfloor n^{1-\beta}\rfloor$), then $l$ and $m$ satisfy the assumptions of Theorem 3.2 whenever $\gamma<\beta$. The parameter $\beta$ determines the number of blocks used in the multiplier bootstrap. On the one hand, we would like to have as many blocks as possible, since we expect better results if more samples are available; this suggests choosing $\beta$ close to $\gamma$. On the other hand, $\beta$ also determines the block length $l$, which needs to be large enough to capture the strong dependence structure of the iterates of LDP-SGD; this suggests choosing $\beta$ close to one.
(b) Once the blocks have been computed, the run time of the block bootstrap is of order $O(Bmd)$, which is dominated by the run time $O(nd)$ of (LDP-)SGD as long as $B=o(l)$.

4 Simulation

We consider LDP-SGD for the estimation of a quantile and of the parameters in a quantile regression model. We investigate these models because here the gradient is not differentiable, so the plug-in estimator from Chen et al. (2020) cannot be used. In each scenario, we consider samples of size $n=10^{6},10^{7},10^{8}$ and privacy parameter $\epsilon=1$. The step size of the SGD is chosen as $\eta_{i}=i^{-0.51}$. The 90% confidence intervals (12) for the quantities of interest are computed from $B=500$ bootstrap replications. The block length is chosen as $l=\lfloor n^{\beta}\rfloor$, where $\beta=0.75$. The distribution of the multipliers is chosen as $\epsilon_{i}\sim\operatorname{Unif}(-\sqrt{3},\sqrt{3})$. The empirical coverage probability and average length of the interval $\hat{C}$ are estimated from 500 simulation runs. All simulations are run in R (R Core Team, 2023) and executed on an internal cluster with an AMD EPYC 7763 64-core processor under Ubuntu 22.04.4.

Quantiles: Let $X_{1},\ldots,X_{n}$ be independent and identically distributed real-valued random variables with distribution function $F_{X}$ and density $p_{X}$. Denote by $x_{\tau}$ the $\tau$-th quantile of $X_{1}$, i.e. $F_{X}(x_{\tau})=\tau$. We assume that $p_{X}(x)>0$ in a neighborhood of $x_{\tau}$; then $x_{\tau}$ is the (unique) root of the equation $R(\theta)=-\tau+F_{X}(\theta)$. The noisy gradient is given by $g(X_{i},\theta)=-\tau+\mathbb{I}\{X_{i}\leq\theta\}$, which can be privatized using Randomized Response, that is

\mathcal{A}\big(g(X_{i},\theta)\big) = -\tau + \frac{\mathcal{A}_{RR}(\mathbb{I}\{X_{i}\leq\theta\})}{2p-1} - \frac{1-p}{2p-1},

where $\mathcal{A}_{RR}$ is defined in Example 2.2 and $p=\frac{e^{\epsilon}}{1+e^{\epsilon}}$. Note that the re-scaling ensures that $\mathbb{E}[\mathcal{A}(g(X_{i},\theta))|X_{i}]=g(X_{i},\theta)$ for all $\theta\in\mathbb{R}$. The measurement error and the error due to privacy in (7) and (8) are given by

\xi_{i}^{SGD} = \mathbb{I}\{X_{i}\leq\theta_{i-1}^{LDP}\} - F_{X}(\theta_{i-1}^{LDP}),
\xi_{i}^{LDP} = \frac{\mathcal{A}_{RR}(\mathbb{I}\{X_{i}\leq\theta_{i-1}^{LDP}\})}{2p-1} - \frac{1-p}{2p-1} - \mathbb{I}\{X_{i}\leq\theta_{i-1}^{LDP}\},

respectively, which indeed define martingale difference processes with respect to the filtration $\{\mathcal{F}_{i}\}_{i\geq 1}$, where $\mathcal{F}_{i}=\sigma(\theta_{1}^{LDP},\ldots,\theta_{i}^{LDP})$ and $\sigma(Y)$ denotes the sigma algebra generated by the random variable $Y$. The assumptions of Theorems 3.1 and 3.2 can be verified with $\Sigma=(S_{SGD}+S_{LDP})/(p_{X}(x_{\tau}))^{2}$, where

S_{SGD} = F_{X}(x_{\tau})\big(1-F_{X}(x_{\tau})\big) = \tau(1-\tau), \qquad S_{LDP} = \frac{e^{\epsilon}}{(e^{\epsilon}-1)^{2}}.
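Combining the sketches above (randomized_response, ldp_sgd and block_bootstrap_sgd), the following illustrative R code reproduces this construction for the median of a standard normal distribution; for speed it uses $n=10^{5}$ instead of the sample sizes of the simulation study:

set.seed(1)
tau <- 0.5; eps <- 1; n <- 1e5
p <- exp(eps) / (exp(eps) + 1)

grad_quantile <- function(x, theta) -tau + (x <= theta)          # noisy gradient g(X_i, theta)
rr_mechanism <- function(g) {                                    # privatize the indicator via RR
  bit <- g + tau                                                 # recover I{X_i <= theta}
  -tau + (randomized_response(bit, eps) - (1 - p)) / (2 * p - 1) # re-scaled as in the display above
}

fit <- ldp_sgd(rnorm(n), grad_quantile, rr_mechanism, theta0 = 0)
ci <- block_bootstrap_sgd(fit$iterates, l = floor(n^0.75), B = 500, alpha = 0.05)
c(lower = ci$lower, upper = ci$upper)                            # 90% confidence interval for the median (true value 0)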

In Table 1 we display the simulated coverage probabilities and lengths of the confidence intervals for the 50% and 90% quantile of a standard normal distribution calculated by the block bootstrap (BB). We also compare our approach with the batch mean (BM) method introduced in Chen et al. (2020) and the self-normalisation (SN) approach suggested in Liu et al. (2023). We observe that BB and BM behave very similarly with respect to the empirical coverage and the length of the computed confidence intervals, while the confidence intervals obtained by SN are slightly larger.
In Figure 1 we display the trajectories of the upper and lower boundaries of a 90% confidence interval for the 50% quantile of a standard normal distribution for the BB, BM and SN approaches. Again, we observe that the confidence intervals obtained by BB and BM are quite similar, while those obtained by SN are wider.

Figure 1: Trajectory of the LDP-SGD estimate $\bar{\theta}_{n}^{LDP}$ (av. SGD) and the upper and lower boundaries of confidence intervals obtained by block bootstrap (BB), batch mean (BM) and self-normalisation (SN) for the 50% quantile of a standard normal distribution. The left and right panel correspond to different sample sizes, left: $n=10$-$200000$; right: $n=200000$-$1000000$.
Table 1: Empirical coverage and length of a 90% confidence interval for the 50% and 90% quantile of a standard normal distribution. The confidence intervals are obtained by LDP-SGD with self-normalisation (SN), batch mean (BM) and the block bootstrap (BB) proposed in this paper. The numbers in brackets are the standard errors.

                              n = 10^6                n = 10^7                n = 10^8
τ = 0.5   BB   cover:         0.880 (0.015)           0.886 (0.014)           0.886 (0.014)
               length:        0.0085 (5.2×10^-5)      0.0027 (1.2×10^-5)      0.0009 (3.1×10^-6)
          BM   cover:         0.884 (0.014)           0.898 (0.014)           0.896 (0.014)
               length:        0.0086 (5.2×10^-5)      0.0027 (1.1×10^-5)      0.0009 (2.9×10^-6)
          SN   cover:         0.914 (0.013)           0.898 (0.014)           0.886 (0.014)
               length:        0.0106 (1.9×10^-4)      0.0034 (6×10^-5)        0.0011 (2.1×10^-5)
τ = 0.9   BB   cover:         0.828 (0.017)           0.856 (0.016)           0.848 (0.016)
               length:        0.0175 (1.1×10^-4)      0.0057 (2.7×10^-5)      0.0018 (6.4×10^-6)
          BM   cover:         0.830 (0.017)           0.868 (0.015)           0.840 (0.016)
               length:        0.0179 (1.1×10^-4)      0.0058 (2.5×10^-5)      0.0018 (5.8×10^-6)
          SN   cover:         0.862 (0.015)           0.880 (0.015)           0.878 (0.015)
               length:        0.0235 (4.5×10^-4)      0.0076 (1.4×10^-4)      0.0024 (4.3×10^-5)

Quantile Regression: Let $Z_{1},\ldots,Z_{n}$ be iid random vectors with $Z_{i}=(X_{i},y_{i})$, $X_{i}\in\mathbb{R}^{d}$, $\Sigma_{X}=\mathbb{E}[X_{i}X_{i}^{T}]$ and $y_{i}=X_{i}^{T}\beta_{\tau}+\varepsilon_{i}$, where $\varepsilon_{i}$ is independent of $X_{i}$ with distribution function $F_{\varepsilon}$. We further assume that $F_{\varepsilon}$ has a density $p_{\varepsilon}$ in a neighbourhood of $0$ with $p_{\varepsilon}(0)>0$, that $F_{\varepsilon}(0)=\tau$, and that there exists a constant $m>0$ such that $|X_{ij}|\leq m$ for all components $j=1,\ldots,d$. If $Q_{y|X}(\tau)$ denotes the conditional $\tau$-quantile of $y$ given $X$, it follows that $Q_{y|X}(\tau)=X^{T}\beta_{\tau}$ and $R(\beta_{\tau})=0$, where

R(\beta) = -\tau\,\mathbb{E}[X] + \mathbb{E}\big[\mathbb{I}\{y-X^{T}\beta<0\}X\big],

which can be linearly approximated by $G=\Sigma_{X}\,p_{\varepsilon}(0)$, since a Taylor expansion at $\beta_{\tau}$ yields

\mathbb{E}\big[\mathbb{I}\{y-X^{T}\beta<0\}\,\big|\,X\big] = \tau + p_{\varepsilon}(0)\,X^{T}(\beta-\beta_{\tau}) + O\big(\lVert\beta-\beta_{\tau}\rVert^{2}\big).

The noisy observations are given by $g(Z_{j},\beta)=(-\tau+\mathbb{I}\{y_{j}-X_{j}^{T}\beta\leq 0\})X_{j}$ and can be privatized by the Laplace mechanism (see Example 2.1),

\mathcal{A}_{Lap}\big(g(Z_{j},\beta)\big) = g(Z_{j},\beta) + \mathbf{L}_{j},

where $\mathbf{L}_{j}=(L_{1},\ldots,L_{d})^{T}$ is a vector of independent and identically distributed Laplace random variables, i.e. $L_{i}\sim\operatorname{Lap}(0,2\max(\tau,1-\tau)md/\epsilon)$.
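For illustration, a minimal R sketch of this privatized gradient (hypothetical function names; rlaplace is the helper from the Laplace-mechanism sketch above, and the scale matches $\operatorname{Lap}(0,2\max(\tau,1-\tau)md/\epsilon)$):

# Noisy quantile-regression gradient g(Z_j, beta) with z = c(y, x)
grad_qr <- function(z, beta, tau) {
  y <- z[1]; x <- z[-1]
  (-tau + (y - sum(x * beta) <= 0)) * x
}

# Laplace mechanism for this gradient: componentwise noise with scale Delta(g)/epsilon
laplace_qr_mechanism <- function(g, tau, m, eps) {
  d <- length(g)
  b <- 2 * max(tau, 1 - tau) * m * d / eps
  g + rlaplace(d, b)
}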
The measurement error and the error due to privacy in (7) and (8) are given by

\xi_{i}^{SGD} = -\tau\big(X_{i}-\mathbb{E}[X]\big) + \mathbb{I}\{y_{i}-X_{i}^{T}\beta_{i-1}^{LDP}<0\}X_{i} - \mathbb{E}\big[\mathbb{I}\{y-X^{T}\beta_{i-1}^{LDP}<0\}X\big],
\xi_{i}^{LDP} = (L_{1},\ldots,L_{d})^{T},

which are indeed martingale difference processes with respect to the filtration $\{\mathcal{F}_{i}\}_{i\geq 1}$, where $\mathcal{F}_{i}=\sigma(\beta_{1}^{LDP},\ldots,\beta_{i}^{LDP})$. The assumptions of Theorems 3.1 and 3.2 can be verified with $\Sigma=\Sigma_{X}^{-1}(S_{SGD}+S_{LDP})\Sigma_{X}^{-1}/p_{\varepsilon}(0)^{2}$ and

S_{SGD} = \tau(1-\tau)\,\Sigma_{X}, \qquad S_{LDP} = \frac{8\max\{\tau^{2},(1-\tau)^{2}\}m^{2}d^{2}}{\epsilon^{2}}\,\mathbb{I}_{d\times d}.

We compare the confidence intervals obtained by BB and BM, where $X=(1,X_{1},X_{2},X_{3})^{T}$, $(X_{1},X_{2},X_{3})$ follows a truncated standard normal distribution on the cube $[-1,1]^{3}$ and $\varepsilon_{i}$ is normally distributed with variance $1$. For the generation of the truncated normal and the Laplace distribution we use the packages of Botev and Belzile (2021) and Wu et al. (2023), respectively. The simulation results are displayed in Table 2. We observe that for $n=10^{6}$ BB yields too small and BM too large coverage probabilities, while the lengths of the confidence intervals from BB are smaller. For $n=10^{7}$ and $n=10^{8}$ the coverage probabilities of BB increase while those of BM decrease.

Table 2: Empirical coverage and length of a 90% confidence interval for the parameters of quantile regression with $\beta_{\tau}=(0,0,1,-1)$, where $\tau=0.5$. The confidence intervals are obtained by batch mean (BM) and the block bootstrap (BB) proposed in this paper. The numbers in brackets are the standard errors.

                              n = 10^6                n = 10^7                n = 10^8
BB   β0   cover:              0.860 (0.016)           0.882 (0.014)           0.904 (0.013)
          length:             0.07 (1.2×10^-3)        0.016 (1.2×10^-4)       0.005 (1.7×10^-5)
     β1   cover:              0.862 (0.015)           0.868 (0.015)           0.880 (0.015)
          length:             0.228 (6.3×10^-3)       0.055 (4.8×10^-4)       0.016 (6.2×10^-5)
     β2   cover:              0.850 (0.016)           0.870 (0.015)           0.882 (0.014)
          length:             0.241 (5.6×10^-3)       0.054 (5.3×10^-4)       0.016 (6.9×10^-5)
     β3   cover:              0.844 (0.016)           0.866 (0.015)           0.890 (0.014)
          length:             0.243 (5.6×10^-3)       0.054 (8.0×10^-4)       0.016 (6.7×10^-5)
BM   β0   cover:              0.952 (0.01)            0.932 (0.011)           0.918 (0.012)
          length:             0.103 (2.4×10^-3)       0.02 (3.6×10^-4)        0.005 (2.1×10^-5)
     β1   cover:              0.938 (0.012)           0.928 (0.012)           0.896 (0.014)
          length:             0.315 (9.4×10^-3)       0.076 (1.5×10^-3)       0.017 (1.3×10^-4)
     β2   cover:              0.948 (0.01)            0.926 (0.012)           0.902 (0.013)
          length:             0.335 (8.4×10^-3)       0.073 (1.7×10^-3)       0.017 (2.0×10^-4)
     β3   cover:              0.928 (0.012)           0.928 (0.012)           0.898 (0.014)
          length:             0.339 (8.5×10^-3)       0.073 (2.3×10^-3)       0.017 (1.5×10^-4)

Acknowledgments and Disclosure of Funding

Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy - EXC 2092 CASA - 390781972.

References

  • Abowd, (2018) Abowd, J. M. (2018). The us census bureau adopts differential privacy. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, pages 2867–2867.
  • Avella-Medina et al., (2023) Avella-Medina, M., Bradshaw, C., and Loh, P.-L. (2023). Differentially private inference via noisy optimization. The Annals of Statistics, 51(5):2067–2092.
  • Bassily et al., (2014) Bassily, R., Smith, A., and Thakurta, A. (2014). Private empirical risk minimization: Efficient algorithms and tight error bounds. In 2014 IEEE 55th annual symposium on foundations of computer science, pages 464–473. IEEE.
  • Benveniste et al., (1990) Benveniste, A., Métivier, M., and Priouret, P. (1990). Adaptive algorithms and stochastic approximations, volume 22. Springer Science & Business Media.
  • Botev and Belzile, (2021) Botev, Z. and Belzile, L. (2021). TruncatedNormal: Truncated Multivariate Normal and Student Distributions. R package version 2.2.2.
  • Bradley, (2005) Bradley, R. C. (2005). Basic Properties of Strong Mixing Conditions. A Survey and Some Open Questions. Probability Surveys, 2(none):107 – 144.
  • Chaudhuri et al., (2011) Chaudhuri, K., Monteleoni, C., and Sarwate, A. D. (2011). Differentially private empirical risk minimization. Journal of Machine Learning Research, 12(3).
  • Chee et al., (2023) Chee, J., Kim, H., and Toulis, P. (2023). “plus/minus the learning rate”: Easy and scalable statistical inference with sgd. In International Conference on Artificial Intelligence and Statistics, pages 2285–2309. PMLR.
  • Chen et al., (2020) Chen, X., Lee, J. D., Tong, X. T., and Zhang, Y. (2020). Statistical inference for model parameters in stochastic gradient descent. The Annals of Statistics, 48(1):251–273.
  • Chung, (1954) Chung, K. L. (1954). On a stochastic approximation method. The Annals of Mathematical Statistics, pages 463–483.
  • Cotter et al., (2011) Cotter, A., Shamir, O., Srebro, N., and Sridharan, K. (2011). Better mini-batch algorithms via accelerated gradient methods. In Shawe-Taylor, J., Zemel, R., Bartlett, P., Pereira, F., and Weinberger, K., editors, Advances in Neural Information Processing Systems, volume 24. Curran Associates, Inc.
  • Dette and Gösmann, (2020) Dette, H. and Gösmann, J. (2020). A likelihood ratio approach to sequential change point detection for a general class of parameters. Journal of the American Statistical Association, 115(531):1361–1377.
  • Ding et al., (2017) Ding, B., Kulkarni, J., and Yekhanin, S. (2017). Collecting telemetry data privately. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
  • Duchi et al., (2018) Duchi, J. C., Jordan, M. I., and Wainwright, M. J. (2018). Minimax optimal procedures for locally private estimation. Journal of the American Statistical Association, 113(521):182–201.
  • Dwork et al., (2006) Dwork, C., McSherry, F., Nissim, K., and Smith, A. (2006). Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography: Third Theory of Cryptography Conference, TCC 2006, New York, NY, USA, March 4-7, 2006. Proceedings 3, pages 265–284. Springer.
  • Dwork and Roth, (2014) Dwork, C. and Roth, A. (2014). The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3–4):211–407.
  • Efron and Tibshirani, (1994) Efron, B. and Tibshirani, R. (1994). An Introduction to the Bootstrap. Chapman & Hall/CRC Monographs on Statistics & Applied Probability. Taylor & Francis.
  • Eshun and Palmieri, (2022) Eshun, S. N. and Palmieri, P. (2022). Two de-anonymization attacks on real-world location data based on a hidden markov model. In 2022 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW), pages 01–09. IEEE.
  • Fang et al., (2018) Fang, Y., Xu, J., and Yang, L. (2018). Online bootstrap confidence intervals for the stochastic gradient descent estimator. The Journal of Machine Learning Research, 19(1):3053–3073.
  • Ferrando et al., (2022) Ferrando, C., Wang, S., and Sheldon, D. (2022). Parametric bootstrap for differentially private confidence intervals. In International Conference on Artificial Intelligence and Statistics, pages 1598–1618. PMLR.
  • Fu et al., (2023) Fu, X., Xiang, Y., and Guo, X. (2023). Differentially private confidence interval for extrema of parameters. arXiv preprint arXiv:2303.02892.
  • Gambs et al., (2014) Gambs, S., Killijian, M.-O., and del Prado Cortez, M. N. (2014). De-anonymization attack on geolocated data. Journal of Computer and System Sciences, 80(8):1597–1614.
  • Khirirat et al., (2017) Khirirat, S., Feyzmahdavian, H. R., and Johansson, M. (2017). Mini-batch gradient descent: Faster convergence under data sparsity. In 2017 IEEE 56th Annual Conference on Decision and Control (CDC), pages 2880–2887.
  • Lahiri, (2003) Lahiri, S. (2003). Resampling methods for dependent data. Springer Science & Business Media.
  • Li et al., (2018) Li, T., Liu, L., Kyrillidis, A., and Caramanis, C. (2018). Statistical inference using sgd. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.
  • Liu et al., (2023) Liu, Y., Hu, Q., Ding, L., and Kong, L. (2023). Online local differential private quantile inference via self-normalization. In International Conference on Machine Learning, pages 21698–21714. PMLR.
  • Mertikopoulos et al., (2020) Mertikopoulos, P., Hallak, N., Kavis, A., and Cevher, V. (2020). On the almost sure convergence of stochastic gradient descent in non-convex problems. Advances in Neural Information Processing Systems, 33:1117–1128.
  • Narayanan and Shmatikov, (2006) Narayanan, A. and Shmatikov, V. (2006). How to break anonymity of the netflix prize dataset. arXiv preprint cs/0610105.
  • Polyak and Juditsky, (1992) Polyak, B. T. and Juditsky, A. B. (1992). Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855.
  • R Core Team, (2023) R Core Team (2023). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
  • Robbins and Monro, (1951) Robbins, H. and Monro, S. (1951). A stochastic approximation method. The annals of mathematical statistics, pages 400–407.
  • Ruppert, (1988) Ruppert, D. (1988). Efficient estimations from a slowly convergent robbins-monro process. Technical report, Cornell University Operations Research and Industrial Engineering.
  • Song et al., (2013) Song, S., Chaudhuri, K., and Sarwate, A. D. (2013). Stochastic gradient descent with differentially private updates. In 2013 IEEE global conference on signal and information processing, pages 245–248. IEEE.
  • Su and Zhu, (2018) Su, W. J. and Zhu, Y. (2018). Uncertainty quantification for online learning and stochastic approximation via hierarchical incremental gradient descent. arXiv preprint arXiv:1802.04876.
  • Vu and Slavkovic, (2009) Vu, D. and Slavkovic, A. (2009). Differential privacy for clinical trial data: Preliminary evaluations. In 2009 IEEE International Conference on Data Mining Workshops, pages 138–143. IEEE.
  • Wang et al., (2020) Wang, T., Zhang, X., Feng, J., and Yang, X. (2020). A comprehensive survey on local differential privacy toward data statistics and analysis. Sensors, 20(24):7030.
  • Wang and Awan, (2023) Wang, Z. and Awan, J. (2023). Debiased parametric bootstrap inference on privatized data. TPDP 2023 - Theory and Practice of Differential Privacy.
  • Wang et al., (2022) Wang, Z., Cheng, G., and Awan, J. (2022). Differentially private bootstrap: New privacy analysis and inference strategies. arXiv preprint arXiv:2210.06140.
  • Wu et al., (2023) Wu, H., Godfrey, A. J. R., Govindaraju, K., and Pirikahu, S. (2023). ExtDist: Extending the Range of Functions for Probability Distributions. R package version 0.7-2.
  • Wu, (2005) Wu, W. B. (2005). Nonlinear system theory: Another look at dependence. Proceedings of the National Academy of Sciences, 102(40):14150–14154.
  • Xiong et al., (2020) Xiong, X., Liu, S., Li, D., Cai, Z., and Niu, X. (2020). A comprehensive survey on local differential privacy. Security and Communication Networks, 2020:1–29.
  • Yang et al., (2023) Yang, M., Guo, T., Zhu, T., Tjuawinata, I., Zhao, J., and Lam, K.-Y. (2023). Local differential privacy and its applications: A comprehensive survey. Computer Standards & Interfaces, page 103827.
  • Yu et al., (2021) Yu, L., Balasubramanian, K., Volgushev, S., and Erdogdu, M. A. (2021). An analysis of constant step size sgd in the non-convex regime: Asymptotic normality and bias. Advances in Neural Information Processing Systems, 34:4234–4248.
  • Zhong et al., (2023) Zhong, Y., Kuffner, T., and Lahiri, S. (2023). Online bootstrap inference with nonconvex stochastic gradient descent estimator. arXiv preprint arXiv:2306.02205.
  • Zhu et al., (2021) Zhu, W., Chen, X., and Wu, W. B. (2021). Online covariance matrix estimation in stochastic gradient descent. Journal of the American Statistical Association, pages 1–30.

Appendix A Technical assumptions

The theoretical results of this paper, in particular the consistency of the multiplier block bootstrap, are proved under the following assumptions.

Assumption A.1.

Denote $\xi_{i}=\xi_{i}^{LDP}+\xi_{i}^{SGD}$. Assume that the following conditions hold:

  (1) There exists a differentiable function $V:\mathbb{R}^{d}\to\mathbb{R}$ with gradient $\nabla V$ such that for some constants $\lambda>0$, $c'>0$, $L'>0$, $\varepsilon>0$

      V(\theta) \geq c'\lVert\theta\rVert^{2} \quad \text{for all } \theta\in\mathbb{R}^{d},
      \lVert\nabla V(\theta)-\nabla V(\theta')\rVert \leq L'\lVert\theta-\theta'\rVert \quad \text{for all } \theta,\theta'\in\mathbb{R}^{d},
      \nabla V(\theta-\theta_{\star})^{T}R(\theta) > 0 \quad \text{for all } \theta\neq\theta_{\star},
      \nabla V(\theta-\theta_{\star})^{T}R(\theta) \geq \lambda V(\theta-\theta_{\star}) \quad \text{for all } \theta\in\mathbb{R}^{d} \text{ with } \lVert\theta-\theta_{\star}\rVert\leq\varepsilon.

  (2) There exist a positive definite matrix $G\in\mathbb{R}^{d\times d}$ and constants $K_{1}<\infty$, $\varepsilon>0$, $0<\lambda\leq 1$ such that for all $\lVert\theta-\theta_{\star}\rVert\leq\varepsilon$

      \lVert R(\theta)-G(\theta-\theta_{\star})\rVert \leq K_{1}\lVert\theta-\theta_{\star}\rVert^{1+\lambda}.   (13)

  (3) $\{\xi_{i}\}_{i\geq 1}$ is a martingale difference process with respect to a filtration $\{\mathcal{F}_{i}\}_{i\geq 0}$, and there exists a constant $K_{2}>0$ such that for all $i\geq 1$

      \mathbb{E}\big[\lVert\xi_{i}\rVert^{2}\,\big|\,\mathcal{F}_{i-1}\big] + \lVert R(\theta_{i-1}^{LDP})\rVert^{2} \leq K_{2}\big(1+\lVert\theta_{i-1}^{LDP}\rVert^{2}\big).

  (4) For the errors $\{\xi_{i}\}_{i\geq 1}$ there exists a decomposition of the form $\xi_{i}=\xi_{i}(0)+\zeta_{i}(\theta_{i-1}^{LDP})$, a positive definite matrix $S$ and a function $\delta:\mathbb{R}^{d}\to\mathbb{R}$, which is continuous at the origin, such that

      \mathbb{E}[\xi_{i}(0)\,|\,\mathcal{F}_{i-1}] = 0 \quad \text{a.s.},
      \lim_{i\to\infty}\mathbb{E}[\xi_{i}(0)\xi_{i}(0)^{T}\,|\,\mathcal{F}_{i-1}] = S \quad \text{in probability},
      \lim_{C\to\infty}\sup_{i}\mathbb{E}\big[\lVert\xi_{i}(0)\rVert^{2}\,\mathbb{I}(\lVert\xi_{i}(0)\rVert>C)\,\big|\,\mathcal{F}_{i-1}\big] = 0 \quad \text{in probability},
      \mathbb{E}\big[\lVert\zeta_{i}(\theta_{i-1}^{LDP})\rVert^{2}\,\big|\,\mathcal{F}_{i-1}\big] \leq \delta(\theta_{i-1}^{LDP}) \quad \text{a.s.}

  (5) The learning rates are given by $\eta_{i}=c\,i^{-\gamma}$ with $c>0$ and $0.5<\gamma<1$.

Assumption A.2.

Denote $\xi_{i}=\xi_{i}^{SGD}+\xi_{i}^{LDP}$ and assume that $\{\xi_{i}\}_{i\geq 1}$ is a martingale difference process with respect to a filtration $\{\mathcal{F}_{i}\}_{i\geq 0}$. Assume that the following conditions hold:

  (1) There exists a function $s:\mathbb{R}^{d}\to\mathbb{R}^{d\times d}$ such that

      \mathbb{E}[\xi_{i}\xi_{i}^{T}\,|\,\mathcal{F}_{i-1}] = S + s(\theta_{i-1}^{LDP}-\theta_{\star}) \quad \text{a.s.},

      where $S\in\mathbb{R}^{d\times d}$ is the matrix in Assumption A.1, and $\sqrt{\operatorname{tr}\big(\mathbb{E}[s(\theta_{i}^{LDP}-\theta_{\star})]^{2}\big)}\leq s_{1}\sqrt{\eta_{i}}$ for a constant $s_{1}>0$.

  (2) There exists a constant $K<\infty$ such that $\operatorname{tr}\big(\mathbb{E}[(\xi_{i}\xi_{i}^{T})^{2}\,|\,\mathcal{F}_{i-1}]\big)\leq K$ a.s.

  (3) There exists a constant $\rho$ with $\frac{1}{m}\sum_{i=1}^{m}\big|\frac{1}{\sqrt{l}}\sum_{j=(i-1)l+1}^{il}\xi_{j}\big|^{3}\overset{\mathbb{P}}{\to}\rho$.

  (4) The sequence $\{\xi_{i}\xi_{i}^{T}\}_{i\geq 1}$ is uniformly integrable.

  (5) The learning rates are given by $\eta_{j}=c\,j^{-\gamma}$ and there exists a constant $b_{0}\in\mathbb{N}$ such that $b_{0}^{\gamma}\geq 2\lambda(G)c$ and

      \sum_{j=1}^{b_{0}}\lambda(G)^{2}\eta_{j}^{2}\prod_{k=j+1}^{b_{0}}\big(1-\lambda(G)\eta_{k}\big)^{2} \geq \lambda(G)\eta_{b_{0}}/2   (14)

      for every eigenvalue $\lambda(G)$ of the matrix $G$ in (13).

Remark A.1.

In many cases, tr(𝔼[s(θiLDPθ)]2)tr(\mathbb{E}[s(\theta_{i}^{LDP}-\theta_{\star})]^{2}) is approximately proportional to 𝔼[θiθ2]2\mathbb{E}[\lVert\theta_{i}-\theta_{\star}\rVert_{2}]^{2} (which follows by a Taylor expansion) and the convergence rate of 𝔼[θiθ2]\mathbb{E}[\lVert\theta_{i}-\theta_{\star}\rVert_{2}] is well studied, see for example Chung, (1954) and Chen et al., (2020). If the sequence {ξi}i1\{\xi_{i}\}_{i\geq 1} is bounded, Assumption˜A.2 (2)-(4) are satisfied.
Condition (14) is the base case of a proof by induction and can be numerically verified for different choices of the parameters cc and γ\gamma in the learning rate.
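For illustration, a possible numerical check of condition (14) could look as follows. This is a minimal sketch of our own (the function name, the search range and the parameter values are hypothetical and not taken from the paper); it searches for the smallest b0 with b0^γ ≥ 2λ(G)c for which (14) holds, given c, γ and an eigenvalue λ(G).

import numpy as np

def check_condition_14(c, gamma, lam, b0_max=2000):
    # Search for the smallest b0 with b0**gamma >= 2*lam*c such that the
    # left-hand side of (14) is at least lam*eta_{b0}/2, where eta_j = c*j**(-gamma).
    eta = c * np.arange(1, b0_max + 1, dtype=float) ** (-gamma)
    for b0 in range(1, b0_max + 1):
        if b0 ** gamma < 2 * lam * c:
            continue
        lhs, prod = 0.0, 1.0  # prod = prod_{k=j+1}^{b0} (1 - lam*eta_k)^2, built backwards
        for j in range(b0, 0, -1):
            lhs += lam ** 2 * eta[j - 1] ** 2 * prod
            prod *= (1.0 - lam * eta[j - 1]) ** 2
        if lhs >= lam * eta[b0 - 1] / 2.0:
            return b0
    return None

print(check_condition_14(c=1.0, gamma=0.75, lam=1.0))  # hypothetical parameter values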

Appendix B Proofs of the results in Section 3

We will denote ξi:=ξiSGD+ξiLDP\xi_{i}:=\xi_{i}^{SGD}+\xi_{i}^{LDP} in this section. Furthermore, we omit the superscript "LDP" of θLDP\theta^{LDP} for improved readability. In other words, we write θi\theta_{i} for θiLDP\theta_{i}^{LDP} and θ¯\bar{\theta} for θ¯LDP\bar{\theta}^{LDP}. Moreover, for the sake of simplicity, we assume that n=mln=ml.
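Before turning to the individual proofs, it may help to recall how the block bootstrap statistic in (11) is formed. The following minimal sketch is our own illustration (not the simulation code of the paper): it computes one draw of the statistic from a matrix of iterates, using Rademacher multipliers as one admissible choice of bounded multipliers with mean zero and variance one; all variable names and parameter values are hypothetical.

import numpy as np

def block_bootstrap_statistic(theta, l, rng):
    # One draw of theta_bar_star = (1/(m*l)) * sum_j eps_j * sum_{b in block j} (theta_b - theta_bar),
    # for an iterate matrix theta of shape (n, d) with n = m*l (cf. (11) and (15)).
    n, d = theta.shape
    m = n // l
    theta_bar = theta.mean(axis=0)                         # averaged LDP-SGD iterate
    block_sums = (theta[: m * l] - theta_bar).reshape(m, l, d).sum(axis=1)
    eps = rng.choice([-1.0, 1.0], size=m)                  # bounded multipliers, mean 0, variance 1
    return (eps[:, None] * block_sums).sum(axis=0) / (m * l)

rng = np.random.default_rng(0)
theta_iterates = rng.normal(size=(1200, 2))                # placeholder iterates, for illustration only
draws = np.array([block_bootstrap_statistic(theta_iterates, l=40, rng=rng) for _ in range(500)])
print(np.quantile(draws[:, 0], [0.025, 0.975]))            # bootstrap quantiles for the first coordinate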

B.1 Proof of Theorem 3.1

This follows from Theorem 2 in Polyak and Juditsky, (1992) by observing that ξi=ξiSGD+ξiLDP\xi_{i}=\xi_{i}^{SGD}+\xi_{i}^{LDP} satisfies the assumptions stated in this reference.

B.2 Proof of Theorem 3.2

We first prove the statement in the case where the gradient R(θ)R(\theta) is assumed to be linear. After that, Theorem˜3.2 is derived by showing that bootstrap LDP-SGD in the nonlinear case is asymptotically approximated by a linearized version of bootstrap LDP-SGD. The following result is shown in Section B.3.

Theorem B.1.

Assume that the conditions of Theorem˜3.2 hold and that R(θ)=G(θθ)R(\theta)=G(\theta-\theta_{\star}) for a positive definite matrix Gd×dG\in\mathbb{R}^{d\times d}. Then, conditionally on θ1LDP,,θnLDP\theta_{1}^{LDP},\ldots,\theta_{n}^{LDP}

nθ¯𝑑N(0,Σ)\sqrt{n}\bar{\theta}^{\star}\overset{d}{\rightarrow}N(0,\Sigma)

in probability, where Σ=G1SG1\Sigma=G^{-1}SG^{-1}.

From Theorem˜B.1 we conclude that Theorem˜3.2 holds. To be precise, let Gd×dG\in\mathbb{R}^{d\times d} be the matrix in the linear approximation of the gradient R(θ)R(\theta) in (13), let {ξi}i1\{\xi_{i}\}_{i\geq 1} be the martingale difference process capturing the error due to measurement and privacy as in (6) and let Sd×dS\in\mathbb{R}^{d\times d} be the matrix in Assumption˜A.1 (4). Next we define a sequence of iterates θi1\theta_{i}^{1} of LDP-SGD for a loss function with R(θ)=G(θθ)R(\theta)=G(\theta-\theta_{\star}), that is, θ01=θ0\theta_{0}^{1}=\theta_{0} and for i1i\geq 1

θi1\displaystyle\theta_{i}^{1} =θi11ηi(G(θi11θ)+ξi)\displaystyle=\theta_{i-1}^{1}-\eta_{i}(G(\theta_{i-1}^{1}-\theta_{\star})+\xi_{i})
θ¯n1\displaystyle\bar{\theta}_{n}^{1} =1ni=1nθi1.\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\theta_{i}^{1}.

We denote the bootstrap analogue defined as in (11) for this sequence by θ¯1,\bar{\theta}^{1,\star}, that is

θ¯1,=1mlj=1mϵjb=(j1)l+1jl(θb1θ¯n1).\displaystyle\bar{\theta}^{1,\star}=\frac{1}{ml}\sum_{j=1}^{m}\epsilon_{j}\sum_{b=(j-1)l+1}^{jl}(\theta_{b}^{1}-\bar{\theta}^{1}_{n}).

By Theorem˜B.1 it follows conditionally on θ11,,θn1\theta_{1}^{1},\ldots,\theta_{n}^{1},

nθ¯1,𝑑N(0,G1SG1)\sqrt{n}\bar{\theta}^{1,\star}\overset{d}{\rightarrow}N(0,G^{-1}SG^{-1})

in probability. Since there is a bijection from θ11,,θn1\theta_{1}^{1},\ldots,\theta_{n}^{1} to ξ1,,ξn\xi_{1},\ldots,\xi_{n} and from θ1,,θn\theta_{1},\ldots,\theta_{n} to ξ1,,ξn\xi_{1},\ldots,\xi_{n}, the corresponding sigma algebras coincide and therefore, conditionally on θ1,,θn\theta_{1},\ldots,\theta_{n},

nθ¯1,𝑑N(0,G1SG1)\sqrt{n}\bar{\theta}^{1,\star}\overset{d}{\rightarrow}N(0,G^{-1}SG^{-1})

in probability. The assertion of Theorem˜3.2 is now a consequence of

n(θ¯θ¯1,)0.\sqrt{n}(\bar{\theta}^{\star}-\bar{\theta}^{1,\star})\overset{\mathbb{P}}{\rightarrow}0.

For a proof of this statement, we define

kj=i𝕀{(i1)l+1jil}k_{j}=i\mathbb{I}\{(i-1)l+1\leq j\leq il\} (15)

(that is, k_{j}=\lceil j/l\rceil is the index of the block containing the j-th iterate) and note that

\displaystyle\sqrt{n}(\bar{\theta}^{\star}-\bar{\theta}^{1,\star}) =\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\epsilon_{k_{i}}(\theta_{i}-\bar{\theta}-\theta_{i}^{1}+\bar{\theta}^{1})
\displaystyle=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\epsilon_{k_{i}}(\theta_{i}-\theta_{i}^{1})+\sqrt{n}(\bar{\theta}^{1}-\theta_{\star})\frac{1}{m}\sum_{i=1}^{m}\epsilon_{i}-\sqrt{n}(\bar{\theta}-\theta_{\star})\frac{1}{m}\sum_{i=1}^{m}\epsilon_{i}.

Here the second and third terms converge to zero in probability, since 1mi=1mϵi\frac{1}{m}\sum_{i=1}^{m}\epsilon_{i} does and both n(θ¯1θ)\sqrt{n}(\bar{\theta}^{1}-\theta_{\star}) and n(θ¯θ)\sqrt{n}(\bar{\theta}-\theta_{\star}) converge weakly by Theorem˜3.1. For a corresponding statement regarding the first term, let II denote the dd-dimensional identity matrix and note that

θiθi1=\displaystyle\theta_{i}-\theta_{i}^{1}= θi1ηiR(θi1)ηiξi(θi11ηiG(θi11θ)ηiξi)\displaystyle\theta_{i-1}-\eta_{i}R(\theta_{i-1})-\eta_{i}\xi_{i}-(\theta_{i-1}^{1}-\eta_{i}G(\theta_{i-1}^{1}-\theta_{\star})-\eta_{i}\xi_{i})
=\displaystyle= (IηiG)(θi1θi11)ηi(R(θi1)G(θi1θ))\displaystyle(I-\eta_{i}G)(\theta_{i-1}-\theta_{i-1}^{1})-\eta_{i}(R(\theta_{i-1})-G(\theta_{i-1}-\theta_{\star}))
=\displaystyle= (IηiG)((Iηi1G)(θi2θi21)ηi1(R(θi2)G(θi2θ)))\displaystyle(I-\eta_{i}G)\big{(}(I-\eta_{i-1}G)(\theta_{i-2}-\theta_{i-2}^{1})-\eta_{i-1}(R(\theta_{i-2})-G(\theta_{i-2}-\theta_{\star}))\big{)}
ηi(R(θi1)G(θi1θ))\displaystyle-\eta_{i}(R(\theta_{i-1})-G(\theta_{i-1}-\theta_{\star}))
=\displaystyle= (IηiG)(Iηi1G)(θi2θi21)\displaystyle(I-\eta_{i}G)(I-\eta_{i-1}G)(\theta_{i-2}-\theta_{i-2}^{1})
(IηiG)ηi1(R(θi2)G(θi2θ))ηi(R(θi1)G(θi1θ))\displaystyle-(I-\eta_{i}G)\eta_{i-1}(R(\theta_{i-2})-G(\theta_{i-2}-\theta_{\star}))-\eta_{i}(R(\theta_{i-1})-G(\theta_{i-1}-\theta_{\star}))
\displaystyle\vdots
=\displaystyle= (k=1i(IηkG))(θ0θ01)j=1iηj(k=j+1i(IηkG))(R(θj1)G(θj1θ))\displaystyle\Big{(}\prod_{k=1}^{i}(I-\eta_{k}G)\Big{)}(\theta_{0}-\theta_{0}^{1})-\sum_{j=1}^{i}\eta_{j}\Big{(}\prod_{k=j+1}^{i}(I-\eta_{k}G)\Big{)}(R(\theta_{j-1})-G(\theta_{j-1}-\theta_{\star}))
=\displaystyle= j=1iηj(k=j+1i(IηkG))(R(θj1)G(θj1θ)),\displaystyle-\sum_{j=1}^{i}\eta_{j}\Big{(}\prod_{k=j+1}^{i}(I-\eta_{k}G)\Big{)}(R(\theta_{j-1})-G(\theta_{j-1}-\theta_{\star})),

since θ0=θ01\theta_{0}=\theta_{0}^{1}. Therefore, since max|ϵj|C\max|\epsilon_{j}|\leq C a.s. by assumption, we obtain

1ni=1nϵki(θiθi1)=\displaystyle\left\lVert\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\epsilon_{k_{i}}(\theta_{i}-\theta_{i}^{1})\right\rVert= 1ni=1nϵkij=1iηj(k=j+1i(IηkG))(R(θj1)G(θj1θ))\displaystyle\left\lVert\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\epsilon_{k_{i}}\sum_{j=1}^{i}\eta_{j}\Big{(}\prod_{k=j+1}^{i}(I-\eta_{k}G)\Big{)}(R(\theta_{j-1})-G(\theta_{j-1}-\theta_{\star}))\right\rVert
\displaystyle\leq Cni=1nj=1iηj(k=j+1i(IηkG))(R(θj1)G(θj1θ))\displaystyle\frac{C}{\sqrt{n}}\sum_{i=1}^{n}\left\lVert\sum_{j=1}^{i}\eta_{j}\Big{(}\prod_{k=j+1}^{i}(I-\eta_{k}G)\Big{)}(R(\theta_{j-1})-G(\theta_{j-1}-\theta_{\star}))\right\rVert
=\displaystyle= Cni=1nηi(j=ink=i+1j(IηkG))(R(θi1)G(θi1θ))\displaystyle\frac{C}{\sqrt{n}}\sum_{i=1}^{n}\eta_{i}\left\lVert\Big{(}\sum_{j=i}^{n}\prod_{k=i+1}^{j}(I-\eta_{k}G)\Big{)}(R(\theta_{i-1})-G(\theta_{i-1}-\theta_{\star}))\right\rVert
\displaystyle\leq Cni=1nηiR(θi1)G(θi1θ)j=ink=i+1j(IηkG)M,\displaystyle\frac{C}{\sqrt{n}}\sum_{i=1}^{n}\eta_{i}\lVert R(\theta_{i-1})-G(\theta_{i-1}-\theta_{\star})\rVert\left\lVert\sum_{j=i}^{n}\prod_{k=i+1}^{j}(I-\eta_{k}G)\right\rVert_{M},

where M\lVert\cdot\rVert_{M} denotes a matrix norm. Following the same arguments as in Polyak and Juditsky, (1992), part 4 of the proof of Theorem 2, it follows that the last term converges in probability to zero.

B.3 Proof of Theorem B.1

The proof is a consequence of the following three Lemmas:

Lemma B.1.

The bootstrap estimate θ¯\bar{\theta}^{\star} in (11) can be represented as

θ¯=\displaystyle\bar{\theta}^{\star}= 1nη1β1n(θ0θ)1nj=1nϵkjG1ξj1nj=1n(βjnϵkjG1)ξj(θ¯θ)1mj=1mϵj,\displaystyle\frac{1}{n\eta_{1}}\beta_{1}^{n}(\theta_{0}-\theta_{\star})-\frac{1}{n}\sum_{j=1}^{n}\epsilon_{k_{j}}G^{-1}\xi_{j}-\frac{1}{n}\sum_{j=1}^{n}\left(\beta_{j}^{n}-\epsilon_{k_{j}}G^{-1}\right)\xi_{j}-(\bar{\theta}-\theta_{\star})\frac{1}{m}\sum_{j=1}^{m}\epsilon_{j}, (16)

where the indices kjk_{j} are defined in (15) and the d×dd\times d matrices βjn\beta_{j}^{n} are given by

βjn\displaystyle\beta_{j}^{n} =ηji=jnϵkik=j+1i(IηkG).\displaystyle=\eta_{j}\sum_{i=j}^{n}\epsilon_{k_{i}}\prod_{k=j+1}^{i}(I-\eta_{k}G). (17)
Lemma B.2.

Conditionally on ξ1,,ξn\xi_{1},\ldots,\xi_{n} we have

1nj=1nϵkjG1ξj𝑑N(0,Σ)\frac{1}{\sqrt{n}}\sum_{j=1}^{n}\epsilon_{k_{j}}G^{-1}\xi_{j}\overset{d}{\rightarrow}N(0,\Sigma)

in probability, where Σ=G1SG1\Sigma=G^{-1}SG^{-1}.

Lemma B.3.

Let the assumptions from Theorem˜B.1 hold, βin\beta_{i}^{n} as in (17) and kjk_{j} as in (15). Then

1nj=1n(βjnϵkjG1)ξj0.\frac{1}{\sqrt{n}}\sum_{j=1}^{n}\left(\beta_{j}^{n}-\epsilon_{k_{j}}G^{-1}\right)\xi_{j}\overset{\mathbb{P}}{\rightarrow}0.

By the same arguments as in Part 1 of the proof of Theorem 1 in Polyak and Juditsky, (1992) we obtain for the first term in (16) that \frac{1}{\sqrt{n}\eta_{1}}\beta_{1}^{n}(\theta_{0}-\theta_{\star})\overset{\mathbb{P}}{\rightarrow}0 (note that by assumption the ϵi\epsilon_{i} are bounded a.s.). The fourth term is of order o(1/n)o_{\mathbb{P}}(1/\sqrt{n}) since 1mj=1mϵj\frac{1}{m}\sum_{j=1}^{m}\epsilon_{j} converges to zero in probability and n(θ¯θ)\sqrt{n}(\bar{\theta}-\theta_{\star}) converges in distribution by Theorem˜3.1. The third term in (16) is of order o(1/n)o_{\mathbb{P}}(1/\sqrt{n}) as well by Lemma˜B.3. Therefore

nθ¯=1nj=1nϵkjG1ξj+oP(1)𝑑N(0,Σ)\sqrt{n}\bar{\theta}^{\star}=-\frac{1}{\sqrt{n}}\sum_{j=1}^{n}\epsilon_{k_{j}}G^{-1}\xi_{j}+o_{P}(1)\overset{d}{\to}N(0,\Sigma)

in probability, given θ1,,θn\theta_{1},\ldots,\theta_{n} by Lemma˜B.2 (note that the sigma algebras generated by ξ1,,ξn\xi_{1},\ldots,\xi_{n} and θ1,,θn\theta_{1},\ldots,\theta_{n} coincide).

Appendix C Proofs of Lemma B.1 - Lemma B.3

C.1 Proof of Lemma B.1

Obviously,

θ¯=\displaystyle\bar{\theta}^{\star}= 1mj=1mϵj1lb=(j1)l+1jl(θbθ+θθ¯)=1mlj=1mϵjb=(j1)l+1jl(θbθ)(θ¯θ)1mj=1mϵj.\displaystyle\frac{1}{m}\sum_{j=1}^{m}\epsilon_{j}\frac{1}{l}{}\sum_{b=(j-1)l+1}^{jl}(\theta_{b}-\theta_{\star}+\theta_{\star}-\bar{\theta})=\frac{1}{ml}\sum_{j=1}^{m}\epsilon_{j}\sum_{b=(j-1)l+1}^{jl}(\theta_{b}-\theta_{\star})-(\bar{\theta}-\theta_{\star})\frac{1}{m}\sum_{j=1}^{m}\epsilon_{j}.

Now we use the following representation for the ii-th iterate

\theta_{i}-\theta_{\star}=\prod_{j=1}^{i}(I-\eta_{j}G)(\theta_{0}-\theta_{\star})-\sum_{j=1}^{i}\eta_{j}\prod_{k=j+1}^{i}(I-\eta_{k}G)\xi_{j}

to obtain

\displaystyle\bar{\theta}^{\star} =\frac{1}{n}\sum_{i=1}^{n}\epsilon_{k_{i}}\left(\prod_{j=1}^{i}(I-\eta_{j}G)(\theta_{0}-\theta_{\star})-\sum_{j=1}^{i}\eta_{j}\prod_{k=j+1}^{i}(I-\eta_{k}G)\xi_{j}\right)-(\bar{\theta}-\theta_{\star})\frac{1}{m}\sum_{j=1}^{m}\epsilon_{j}
\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\epsilon_{k_{i}}\prod_{j=1}^{i}(I-\eta_{j}G)(\theta_{0}-\theta_{\star})-\frac{1}{n}\sum_{j=1}^{n}\eta_{j}\left(\sum_{i=j}^{n}\epsilon_{k_{i}}\prod_{k=j+1}^{i}(I-\eta_{k}G)\right)\xi_{j}-(\bar{\theta}-\theta_{\star})\frac{1}{m}\sum_{j=1}^{m}\epsilon_{j}
\displaystyle=\frac{1}{n\eta_{1}}\beta_{1}^{n}(\theta_{0}-\theta_{\star})-\frac{1}{n}\sum_{j=1}^{n}\epsilon_{k_{j}}G^{-1}\xi_{j}-\frac{1}{n}\sum_{j=1}^{n}(\beta_{j}^{n}-\epsilon_{k_{j}}G^{-1})\xi_{j}-(\bar{\theta}-\theta_{\star})\frac{1}{m}\sum_{j=1}^{m}\epsilon_{j},

which proves the assertion of Lemma˜B.1.

C.2 Proof of Lemma B.2

We will give the proof for d=1d=1; the multivariate case follows by analogous arguments. Define

Yn,i=1lj=(i1)l+1ilξj\displaystyle Y_{n,i}=\frac{1}{\sqrt{l}}\sum_{j=(i-1)l+1}^{il}\xi_{j}

and note that

1nj=1nϵkjG1ξj=1mi=1mϵiG1Yn,i.\frac{1}{\sqrt{n}}\sum_{j=1}^{n}\epsilon_{k_{j}}G^{-1}\xi_{j}=\frac{1}{\sqrt{m}}\sum_{i=1}^{m}\epsilon_{i}G^{-1}Y_{n,i}.

As, conditional on ξ1,,ξn\xi_{1},\ldots,\xi_{n}, the random variables ϵ1Yn,1,,ϵmYn,m\epsilon_{1}Y_{n,1},\ldots,\epsilon_{m}Y_{n,m} are independent, we obtain by the Berry-Esseen theorem that

supx|(1mi=1mϵiG1Yn,ix|ξ1,,ξn)Φ(xσn)|1m1mi=1m|G1Yn,i|3𝔼[|ϵi|3]σn3\displaystyle\sup_{x\in\mathbb{R}}\Big{|}\mathbb{P}\Big{(}\frac{1}{\sqrt{m}}\sum_{i=1}^{m}\epsilon_{i}G^{-1}Y_{n,i}\leq x|\xi_{1},\ldots,\xi_{n}\Big{)}-\Phi\Big{(}\frac{x}{\sigma_{n}}\Big{)}\Big{|}\leq\frac{1}{\sqrt{m}}\frac{\frac{1}{m}\sum_{i=1}^{m}|G^{-1}Y_{n,i}|^{3}\mathbb{E}[|\epsilon_{i}|^{3}]}{\sigma_{n}^{3}} (18)

where

σn2=1mi=1m(G1Yn,i)2=G11mi=1mYn,i2G1\sigma_{n}^{2}=\frac{1}{m}\sum_{i=1}^{m}(G^{-1}Y_{n,i})^{2}=G^{-1}\frac{1}{m}\sum_{i=1}^{m}Y_{n,i}^{2}G^{-1}

is the variance of 1mi=1mϵiG1Yn,i\frac{1}{\sqrt{m}}\sum_{i=1}^{m}\epsilon_{i}G^{-1}Y_{n,i} conditional on ξ1,,ξn\xi_{1},\ldots,\xi_{n}. Note that, by Assumption˜A.2 (1)

𝔼[1mi=1mYn,i2]=1ni=1n𝔼[ξi2]=S+1ni=1n𝔼[s(θi1LDPθ)]=S+o(1).\mathbb{E}\Big{[}\frac{1}{m}\sum_{i=1}^{m}Y_{n,i}^{2}\Big{]}=\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[\xi_{i}^{2}]=S+\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[s(\theta_{i-1}^{LDP}-\theta_{\star})]=S+o(1).

Moreover, as

E[Yn,i4]=1l2j=(i1)l+1il𝔼[ξj4]+1l2kj=(i1)l+1il𝔼[ξjξk3]+1l2kj=(i1)l+1il𝔼[ξj2ξk2]E[Y_{n,i}^{4}]=\frac{1}{l^{2}}\sum_{j=(i-1)l+1}^{il}\mathbb{E}[\xi_{j}^{4}]+\frac{1}{l^{2}}\sum_{k\neq j=(i-1)l+1}^{il}\mathbb{E}[\xi_{j}\xi_{k}^{3}]+\frac{1}{l^{2}}\sum_{k\neq j=(i-1)l+1}^{il}\mathbb{E}[\xi_{j}^{2}\xi_{k}^{2}]

it follows that the E[Yn,i4]E[Y_{n,i}^{4}] are uniformly bounded (note that the fourth moment of ξi\xi_{i} is bounded by assumption). Therefore, by Chebyshev’s inequality, 1mi=1mYn,i2S\frac{1}{m}\sum_{i=1}^{m}Y_{n,i}^{2}\overset{\mathbb{P}}{\rightarrow}S, which yields

σn2σ2=G1SG1.\sigma_{n}^{2}\overset{\mathbb{P}}{\rightarrow}\sigma^{2}=G^{-1}SG^{-1}.

By assumption, 𝔼[|ϵi|3]\mathbb{E}[|\epsilon_{i}|^{3}] is bounded (since |ϵi||\epsilon_{i}| is bounded almost surely) and 1mi=1m|Yn,i|3\frac{1}{m}\sum_{i=1}^{m}|Y_{n,i}|^{3} converges in probability by Assumption˜A.2 (3). Therefore, the right-hand side of (18) is of order O(1/m)O_{\mathbb{P}}(1/\sqrt{m}) and the assertion of the lemma follows.
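As a small numerical illustration of Lemma˜B.2 (not part of the proof), the following sketch fixes one realization of synthetic scalar noise as a stand-in for the martingale differences, resamples the multipliers, and compares the conditional standard deviation of (1/\sqrt{m})\sum_{i}\epsilon_{i}G^{-1}Y_{n,i} with \sqrt{G^{-1}SG^{-1}}; all numerical values (g, s, m, l) are hypothetical.

import numpy as np

# Scalar case d = 1: G = g, S = s; i.i.d. Gaussian noise as a simple stand-in for xi_i.
rng = np.random.default_rng(1)
g, s = 2.0, 1.5          # hypothetical values of G and S
m, l = 200, 50           # number of blocks and block length, n = m*l
xi = rng.normal(scale=np.sqrt(s), size=m * l)
Y = xi.reshape(m, l).sum(axis=1) / np.sqrt(l)        # block sums Y_{n,i}
draws = np.empty(2000)
for r in range(draws.size):
    eps = rng.choice([-1.0, 1.0], size=m)            # bounded multipliers
    draws[r] = (eps * Y / g).sum() / np.sqrt(m)      # (1/sqrt(m)) * sum_i eps_i * G^{-1} * Y_{n,i}
print(draws.std(), np.sqrt(s) / g)                   # conditional std vs. sqrt(G^{-1} S G^{-1})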

C.3 Proof of Lemma B.3

We first derive an alternative representation for the expression in Lemma˜B.3. By interchanging the order of summation we obtain

1ni=1n(βinϵkiG1)ξi=\displaystyle\frac{1}{\sqrt{n}}\sum_{i=1}^{n}(\beta_{i}^{n}-\epsilon_{k_{i}}G^{-1})\xi_{i}= 1ni=1n(ηij=inϵkjk=i+1j(IηkG)ϵkiG1)ξi=1mi=1mϵiG1Vi\displaystyle\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\Big{(}\eta_{i}\sum_{j=i}^{n}\epsilon_{k_{j}}\prod_{k=i+1}^{j}(I-\eta_{k}G)-\epsilon_{k_{i}}G^{-1}\Big{)}\xi_{i}=\frac{1}{\sqrt{m}}\sum_{i=1}^{m}\epsilon_{i}G^{-1}V_{i}

where ViV_{i} are defined by

Vi=\displaystyle V_{i}= 1lb=(i1)l+1il(j=1bGηjk=j+1b(IηkG)ξjξb).\displaystyle\frac{1}{\sqrt{l}}\sum_{b=(i-1)l+1}^{il}\left(\sum_{j=1}^{b}G\eta_{j}\prod_{k=j+1}^{b}(I-\eta_{k}G)\xi_{j}-\xi_{b}\right).

With these notations the assertion of Lemma˜B.3 follows from the L2L_{2} convergence

𝔼[1ni=1n(βinϵkiG1)ξi2]G12𝔼[1mi=1mϵiVi2]0.\mathbb{E}\left[\left\lVert\frac{1}{\sqrt{n}}\sum_{i=1}^{n}(\beta_{i}^{n}-\epsilon_{k_{i}}G^{-1})\xi_{i}\right\rVert^{2}\right]\leq\lVert G^{-1}\rVert^{2}\mathbb{E}\left[\left\lVert\frac{1}{\sqrt{m}}\sum_{i=1}^{m}\epsilon_{i}V_{i}\right\rVert^{2}\right]\rightarrow 0.

The expression on the right hand side can be decomposed into different parts according to Lemma C.1.

Lemma C.1.

Define Δi=θiθ\Delta_{i}=\theta_{i}-\theta_{\star}, Σi=E[s(Δi1)]\Sigma_{i}=E[s(\Delta_{i-1})], where s:dd×ds:\mathbb{R}^{d}\to\mathbb{R}^{d\times d} is the function in Assumption˜A.2, and 𝔼[ξiξiT]=S+Σi\mathbb{E}[\xi_{i}\xi_{i}^{T}]=S+\Sigma_{i}. It holds that

𝔼[1mi=1mϵiVi2]=\displaystyle\mathbb{E}\left[\left\lVert\frac{1}{\sqrt{m}}\sum_{i=1}^{m}\epsilon_{i}V_{i}\right\rVert^{2}\right]= tr(S{IRl,mNl,m+NCl,m})+tr(IrestRl,mrestNl,mrest+NCl,mrest)\displaystyle tr\left(S\{I-R_{l,m}-N_{l,m}+NC_{l,m}\}\right)+tr(I^{rest}-R_{l,m}^{rest}-N_{l,m}^{rest}+NC_{l,m}^{rest})

where

Rl,m=\displaystyle R_{l,m}= 1lmi=1m{b=(i1)l+1ilj=1bKb,jTKb,j}\displaystyle\frac{1}{lm}\sum_{i=1}^{m}\left\{\sum_{b=(i-1)l+1}^{il}\sum_{j=1}^{b}K_{b,j}^{T}K_{b,j}\right\}
N_{l,m}=\displaystyle\frac{1}{lm}\sum_{i=1}^{m}\left\{\sum_{b=(i-1)l+1}^{il}\sum_{b^{\prime}=(i-1)l+1}^{b}K_{b,b^{\prime}}^{T}+\sum_{b^{\prime}=(i-1)l+1}^{il}\sum_{b=(i-1)l+1}^{b^{\prime}}K_{b^{\prime},b}\right\} (19)
NCl,m=\displaystyle NC_{l,m}= 1lmi=1m{b=(i1)l+1ilb=(i1)l+1bj=1bKb,jTKb,j+b=(i1)l+1ilb=(i1)l+1bj=1bKb,jTKb,j}\displaystyle\frac{1}{lm}\sum_{i=1}^{m}\left\{\sum_{b=(i-1)l+1}^{il}\sum_{b^{\prime}=(i-1)l+1}^{b}\sum_{j=1}^{b^{\prime}}K_{b,j}^{T}K_{b^{\prime},j}+\sum_{b^{\prime}=(i-1)l+1}^{il}\sum_{b=(i-1)l+1}^{b^{\prime}}\sum_{j=1}^{b}K_{b,j}^{T}K_{b^{\prime},j}\right\}
Irest=\displaystyle I^{rest}= 1ni=1nΣi\displaystyle\frac{1}{n}\sum_{i=1}^{n}\Sigma_{i}
Rl,mrest=\displaystyle R_{l,m}^{rest}= 1lmi=1mb=(i1)l+1ilj=1bΣjKb,jTKb,j\displaystyle\frac{1}{lm}\sum_{i=1}^{m}\sum_{b=(i-1)l+1}^{il}\sum_{j=1}^{b}\Sigma_{j}K_{b,j}^{T}K_{b,j}
N_{l,m}^{rest}=\displaystyle\frac{1}{lm}\sum_{i=1}^{m}\left\{\sum_{b=(i-1)l+1}^{il}\sum_{b^{\prime}=(i-1)l+1}^{b}\Sigma_{b^{\prime}}K_{b,b^{\prime}}^{T}+\sum_{b^{\prime}=(i-1)l+1}^{il}\sum_{b=(i-1)l+1}^{b^{\prime}}\Sigma_{b}K_{b^{\prime},b}\right\} (20)
NCl,mrest=\displaystyle NC_{l,m}^{rest}= 1lmi=1mb=(i1)l+1ilb=(i1)l+1b{j=1bΣjKb,jTKb,j}+b=(i1)l+1ilb=(i1)l+1b{j=1bΣjKb,jTKb,j}\displaystyle\frac{1}{lm}\sum_{i=1}^{m}\sum_{b=(i-1)l+1}^{il}\sum_{b^{\prime}=(i-1)l+1}^{b}\left\{\sum_{j=1}^{b^{\prime}}\Sigma_{j}K_{b,j}^{T}K_{b^{\prime},j}\right\}+\sum_{b^{\prime}=(i-1)l+1}^{il}\sum_{b=(i-1)l+1}^{b^{\prime}}\left\{\sum_{j=1}^{b}\Sigma_{j}K_{b,j}^{T}K_{b^{\prime},j}\right\}

and

Kb,j\displaystyle K_{b,j} :=ηjGk=j+1b(IηkG).\displaystyle:=\eta_{j}G\prod_{k=j+1}^{b}(I-\eta_{k}G). (21)

In order to estimate the expressions in Lemma˜C.1 we now argue as follows. For the first term we use the Cauchy-Schwarz inequality and obtain

{tr(S{IRl,mNl,m+NCl,m})}2tr(S2)tr({IRl,mNl,m+NCl,m}2).\{tr\left(S\{I-R_{l,m}-N_{l,m}+NC_{l,m}\}\right)\}^{2}\leq tr(S^{2})tr(\{I-R_{l,m}-N_{l,m}+NC_{l,m}\}^{2}).

Note that the matrix IRl,mNl,m+NCl,mI-R_{l,m}-N_{l,m}+NC_{l,m} is symmetric as it is a polynomial of the symmetric matrix GG. Consequently, the second term converges to 0 if we can show that all eigenvalues of the matrix IRl,mNl,m+NCl,mI-R_{l,m}-N_{l,m}+NC_{l,m} converge to zero, that is

λ(IRl,mNl,m+NCl,m)0,\lambda(I-R_{l,m}-N_{l,m}+NC_{l,m})\to 0,

where λ(A)\lambda(A) denotes the eigenvalue of the matrix AA. For this purpose we note again that the matrix is a polynomial of the symmetric matrix GG which implies that

λ(IRl,mNl,m+NCl,m)=1λ(Rl,m)λ(Nl,m)+λ(NCl,m)\lambda(I-R_{l,m}-N_{l,m}+NC_{l,m})=1-\lambda(R_{l,m})-\lambda(N_{l,m})+\lambda(NC_{l,m})

and investigate the eigenvalues of the different matrices separately. In particular, we show in Section D the following results.

Lemma C.2.

If n=lmn=lm\to\infty, it holds that

λ(Rl,m)0.\lambda(R_{l,m})\to 0.
Lemma C.3.

If mγl1γ0\frac{m^{\gamma}}{l^{1-\gamma}}\to 0 and ηi=ciγ\eta_{i}=ci^{-\gamma} for c>0c>0 and γ(0.5,1)\gamma\in(0.5,1), it holds that

λ(Nl,m)2.\lambda(N_{l,m})\rightarrow 2.
Lemma C.4.

If mγl1γ0\frac{m^{\gamma}}{l^{1-\gamma}}\to 0 and log(l)/m0\log(l)/m\to 0, it holds that

λ(NCl,m)1.\lambda(NC_{l,m})\rightarrow 1.

From these results we can conclude that

tr(S{IRl,mNl,m+NCl,m})0.tr(S\{I-R_{l,m}-N_{l,m}+NC_{l,m}\})\to 0. (22)
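The three limits behind (22) can also be inspected numerically. The following sketch is our own illustration for the scalar case d=1 with λ(G)=1 and hypothetical parameter values; it evaluates λ(R_{l,m}), λ(N_{l,m}) and λ(NC_{l,m}) via the recursions implicit in their definitions and should, for l and m in the regime of Lemmas C.2-C.4, return values approaching 0, 2 and 1, respectively.

import numpy as np

def eigenvalue_limits(c, gamma, l, m):
    # Scalar evaluation of lambda(R_{l,m}), lambda(N_{l,m}), lambda(NC_{l,m})
    # for the learning rate eta_k = c * k**(-gamma) and lambda(G) = 1.
    n = l * m
    eta = c * np.arange(1, n + 1, dtype=float) ** (-gamma)
    # S_i = sum_{j<=i} eta_j^2 prod_{k=j+1}^{i}(1-eta_k)^2 via S_i = (1-eta_i)^2 S_{i-1} + eta_i^2
    S = np.zeros(n)
    for i in range(n):
        S[i] = (1 - eta[i]) ** 2 * (S[i - 1] if i > 0 else 0.0) + eta[i] ** 2
    lam_R = S.mean()
    lam_N, lam_NC = 0.0, 0.0
    for i in range(m):                       # loop over blocks
        T, U = 0.0, 0.0                      # within-block recursions
        for b in range(i * l, (i + 1) * l):
            U = (1 - eta[b]) * (U + (S[b - 1] if b > i * l else 0.0))
            T = (1 - eta[b]) * T + eta[b]
            lam_N += 2 * T / (l * m)
            lam_NC += 2 * U / (l * m)
    return lam_R, lam_N, lam_NC

# hypothetical parameter choices, for illustration only
for l, m in [(100, 5), (400, 10), (1600, 20)]:
    print(l, m, eigenvalue_limits(c=0.5, gamma=0.75, l=l, m=m))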

In order to derive a similar result for the second term in Lemma˜C.1 we consider all four terms in the trace separately. Starting with IRestI^{Rest} we obtain by Assumption˜A.2 that

|tr(Irest)|1ni=1ntr(Σi2)1ni=1nηi0.|tr(I^{rest})|\leq\frac{1}{n}\sum_{i=1}^{n}\sqrt{tr(\Sigma_{i}^{2})}\leq\frac{1}{n}\sum_{i=1}^{n}\sqrt{\eta_{i}}\to 0.

Next we note, observing the definition of Rl,mrestR_{l,m}^{rest} that

|tr(Rl,mrest)|1lmi=1mb=(i1)l+1ilj=1btr(Σj2)tr(Kb,jTKb,j)Ctr(Rl,m),|tr(R_{l,m}^{rest})|\leq\frac{1}{lm}\sum_{i=1}^{m}\sum_{b=(i-1)l+1}^{il}\sum_{j=1}^{b}\sqrt{tr(\Sigma_{j}^{2})}tr(K_{b,j}^{T}K_{b,j})\leq Ctr(R_{l,m}),

since tr(Σj2)\sqrt{tr(\Sigma_{j}^{2})} can be bounded by a constant by Assumption˜A.2. By Lemma˜C.2 tr(Rl,m)tr(R_{l,m}) converges to zero which implies that

tr(Rl,mrest)0.tr(R_{l,m}^{rest})\to 0.

The two remaining terms tr(Nl,mrest)tr(N_{l,m}^{rest}) and tr(NCl,mrest)tr(NC_{l,m}^{rest}) are considered in the following Lemma, which will be proved in Section D.

Lemma C.5.

Under Assumption˜A.2 it holds that

tr(Nl,mrest)0tr(N_{l,m}^{rest})\rightarrow 0 (23)

and that

tr(NCl,mrest)0.tr(NC_{l,m}^{rest})\rightarrow 0. (24)

Combining these arguments yields

tr(IrestRl,mrestNl,mrest+NCl,mrest)0tr(I^{rest}-R_{l,m}^{rest}-N_{l,m}^{rest}+NC_{l,m}^{rest})\to 0

and the assertion of Lemma˜B.3 follows from Lemma˜C.1 and (22).

Appendix D Proofs of auxiliary results

D.1 Proof of Lemma˜C.1

Observing that the bootstrap multipliers ϵi\epsilon_{i} are independent of ξj\xi_{j} it follows that

𝔼[1mi=1mϵiVi2]\displaystyle\mathbb{E}\left[\left\lVert\frac{1}{\sqrt{m}}\sum_{i=1}^{m}\epsilon_{i}V_{i}\right\rVert^{2}\right] =1mi=1m𝔼[ϵi2ViTVi]+1mij=1m𝔼[ϵjϵiViTVj]=1mi=1m𝔼[ViTVi]\displaystyle=\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}[\epsilon_{i}^{2}V_{i}^{T}V_{i}]+\frac{1}{m}\sum_{i\neq j=1}^{m}\mathbb{E}[\epsilon_{j}\epsilon_{i}V_{i}^{T}V_{j}]=\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}[V_{i}^{T}V_{i}]

since 𝔼[ϵi]=0\mathbb{E}[\epsilon_{i}]=0 and Var(ϵi)=1Var(\epsilon_{i})=1. With the notation

Bb\displaystyle B_{b} :=j=1bηjGk=j+1b(IηkG)ξjξb=j=1bKb,jξjξb,\displaystyle:=\sum_{j=1}^{b}\eta_{j}G\prod_{k=j+1}^{b}(I-\eta_{k}G)\xi_{j}-\xi_{b}=\sum_{j=1}^{b}K_{b,j}\xi_{j}-\xi_{b},

we obtain the representation Vi=1lb=(i1)l+1ilBbV_{i}=\frac{1}{\sqrt{l}}\sum_{b=(i-1)l+1}^{il}B_{b} which yields

𝔼[ViTVi]=1lb=(i1)l+1il𝔼[BbTBb]+1lbb=(i1)l+1il𝔼[BbTBb]\displaystyle\mathbb{E}[V_{i}^{T}V_{i}]=\frac{1}{l}\sum_{b=(i-1)l+1}^{il}\mathbb{E}[B_{b}^{T}B_{b}]+\frac{1}{l}\sum_{b\neq b^{\prime}=(i-1)l+1}^{il}\mathbb{E}[B_{b}^{T}B_{b^{\prime}}]

where

𝔼[BbTBb]\displaystyle\mathbb{E}[B_{b}^{T}B_{b}] =𝔼[(j=1bKb,jξj)T(j=1bKb,jξj)]2j=1b𝔼[ξbTKb,jξj]+𝔼[ξbTξb]\displaystyle=\mathbb{E}\left[\left(\sum_{j=1}^{b}K_{b,j}\xi_{j}\right)^{T}\left(\sum_{j=1}^{b}K_{b,j}\xi_{j}\right)\right]-2\sum_{j=1}^{b}\mathbb{E}[\xi_{b}^{T}K_{b,j}\xi_{j}]+\mathbb{E}[\xi_{b}^{T}\xi_{b}]
=j=1b𝔼[ξjTKb,jTKb,jξj]2𝔼[ξbTKb,bξb]+𝔼[ξbTξb]\displaystyle=\sum_{j=1}^{b}\mathbb{E}[\xi_{j}^{T}K_{b,j}^{T}K_{b,j}\xi_{j}]-2\mathbb{E}[\xi_{b}^{T}K_{b,b}\xi_{b}]+\mathbb{E}[\xi_{b}^{T}\xi_{b}]
=j=1btr(𝔼[ξjξjT]Kb,jTKb,j)2tr(𝔼[ξbξbT]Kb,b)+tr(𝔼[ξbξbT]),\displaystyle=\sum_{j=1}^{b}tr(\mathbb{E}[\xi_{j}\xi_{j}^{T}]K_{b,j}^{T}K_{b,j})-2tr(\mathbb{E}[\xi_{b}\xi_{b}^{T}]K_{b,b})+tr(\mathbb{E}[\xi_{b}\xi_{b}^{T}]),

since 𝔼[ξiTAξj]=𝔼[𝔼[ξiTAξj|max{i,j}1]]=0\mathbb{E}[\xi_{i}^{T}A\xi_{j}]=\mathbb{E}[\mathbb{E}[\xi_{i}^{T}A\xi_{j}|\mathcal{F}_{max\{i,j\}-1}]]=0 for iji\neq j. For bbb\neq b^{\prime} a similar calculation shows that

\displaystyle\mathbb{E}[B_{b}^{T}B_{b^{\prime}}] =\mathbb{E}\left[\left(\sum_{j=1}^{b}K_{b,j}\xi_{j}-\xi_{b}\right)^{T}\left(\sum_{i=1}^{b^{\prime}}K_{b^{\prime},i}\xi_{i}-\xi_{b^{\prime}}\right)\right]
=j=1min(b,b)tr(𝔼[ξjξjT]Kb,jTKb,j)𝕀{b<b}tr(𝔼[ξbξbT]Kb,b)𝕀{b<b}tr(𝔼[ξbξbT]Kb,bT).\displaystyle=\sum_{j=1}^{min(b,b^{\prime})}tr(\mathbb{E}[\xi_{j}\xi_{j}^{T}]K_{b,j}^{T}K_{b^{\prime},j})-\mathbb{I}\{b<b^{\prime}\}tr(\mathbb{E}[\xi_{b}\xi_{b}^{T}]K_{b^{\prime},b})-\mathbb{I}\{b^{\prime}<b\}tr(\mathbb{E}[\xi_{b^{\prime}}\xi_{b^{\prime}}^{T}]K_{b,b^{\prime}}^{T}).

Inserting 𝔼[ξjξjT]=S+Σj\mathbb{E}[\xi_{j}\xi_{j}^{T}]=S+\Sigma_{j} and noting that Kb,b=ηbGK_{b,b}=\eta_{b}G it follows

𝔼[ViTVi]=\displaystyle\mathbb{E}[V_{i}^{T}V_{i}]= 1lb=(i1)l+1il{j=1btr((S+Σj)Kb,jTKb,j)2tr((S+Σb)Kb,b)+tr(S+Σb)}\displaystyle\frac{1}{l}\sum_{b=(i-1)l+1}^{il}\left\{\sum_{j=1}^{b}tr((S+\Sigma_{j})K_{b,j}^{T}K_{b,j})-2tr((S+\Sigma_{b})K_{b,b})+tr(S+\Sigma_{b})\right\}
+1lb=(i1)l+1ilbb=(i1)l+1il{j=1min(b,b)tr((S+Σj)Kb,jTKb,j)}\displaystyle+\frac{1}{l}\sum_{b=(i-1)l+1}^{il}\sum_{b\neq b^{\prime}=(i-1)l+1}^{il}\left\{\sum_{j=1}^{min(b,b^{\prime})}tr((S+\Sigma_{j})K_{b,j}^{T}K_{b^{\prime},j})\right\}
1lb=(i1)l+1ilbb=(i1)l+1il{𝕀{b<b}tr((S+Σb)Kb,b)+𝕀{b<b}tr((S+Σb)Kb,bT)}\displaystyle-\frac{1}{l}\sum_{b=(i-1)l+1}^{il}\sum_{b\neq b^{\prime}=(i-1)l+1}^{il}\left\{\mathbb{I}\{b<b^{\prime}\}tr((S+\Sigma_{b})K_{b^{\prime},b})+\mathbb{I}\{b^{\prime}<b\}tr((S+\Sigma_{b^{\prime}})K_{b,b^{\prime}}^{T})\right\}

The assertion of Lemma˜C.1 now follows by a straightforward calculation, adding and subtracting the terms with b=bb=b^{\prime} in the second and third sum.

D.2 Proof of Lemma C.2

As Rl,mR_{l,m} is a polynomial of the symmetric matrix GG we may assume without loss of generality that λ(G)=1\lambda(G)=1 (change the constant cc in the definition of ηk\eta_{k}) and obtain with the inequalities 1+xex1+x\leq e^{x} and i=1kηic1k+1xγ𝑑x\sum_{i=1}^{k}\eta_{i}\geq c\int_{1}^{k+1}x^{-\gamma}dx that

λ(Rl,m)=\displaystyle\lambda(R_{l,m})= 1ni=1n[j=1iηj2k=j+1i(1ηk)2]1ni=1n[j=1iηj2exp(2k=j+1iηk)]\displaystyle\frac{1}{n}\sum_{i=1}^{n}\left[\sum_{j=1}^{i}\eta_{j}^{2}\prod_{k=j+1}^{i}(1-\eta_{k})^{2}\right]\leq\frac{1}{n}\sum_{i=1}^{n}\left[\sum_{j=1}^{i}\eta_{j}^{2}\exp\left(-2\sum_{k=j+1}^{i}\eta_{k}\right)\right]
\displaystyle\leq 1ni=1n[j=1iηj2exp(2cj+1i+1xγ𝑑x)]\displaystyle\frac{1}{n}\sum_{i=1}^{n}\left[\sum_{j=1}^{i}\eta_{j}^{2}\exp\left(-2c\int_{j+1}^{i+1}x^{-\gamma}dx\right)\right]
=\displaystyle= 1ni=1nexp(2c1γ(i+1)1γ)[j=1iηj2exp(2c1γ(j+1)1γ)]\displaystyle\frac{1}{n}\sum_{i=1}^{n}\exp\left(-\tfrac{2c}{1-\gamma}(i+1)^{1-\gamma}\right)\left[\sum_{j=1}^{i}\eta_{j}^{2}\exp\left(\tfrac{2c}{1-\gamma}(j+1)^{1-\gamma}\right)\right]

Fixing a constant v(0.5,1)v\in(0.5,1), the inner sum can be split at iv\lfloor iv\rfloor, i.e.

λ(Rl,m)1ni=1nexp(2c1γ(i+1)1γ)[Fi+Li],\lambda(R_{l,m})\leq\frac{1}{n}\sum_{i=1}^{n}\exp\left(-\tfrac{2c}{1-\gamma}(i+1)^{1-\gamma}\right)\left[F_{i}+L_{i}\right],

where Fi=j=1viηj2exp(2c1γ(j+1)1γ)F_{i}=\sum_{j=1}^{\lfloor vi\rfloor}\eta_{j}^{2}\exp\left(\frac{2c}{1-\gamma}(j+1)^{1-\gamma}\right) and Li=j=vi+1iηj2exp(2c1γ(j+1)1γ)L_{i}=\sum_{j=\lfloor vi\rfloor+1}^{i}\eta_{j}^{2}\exp\left(\frac{2c}{1-\gamma}(j+1)^{1-\gamma}\right). It holds that

Fi\displaystyle F_{i} exp(2c1γ(vi+1)1γ)j=1viηj2c2exp(2c1γ(vi+1)1γ)(1+1vix2γ𝑑x)\displaystyle\leq\exp\left(\tfrac{2c}{1-\gamma}(\lfloor vi\rfloor+1)^{1-\gamma}\right)\sum_{j=1}^{\lfloor vi\rfloor}\eta_{j}^{2}\leq c^{2}\exp\left(\tfrac{2c}{1-\gamma}(\lfloor vi\rfloor+1)^{1-\gamma}\right)\left(1+\int_{1}^{\lfloor vi\rfloor}x^{-2\gamma}dx\right)
\displaystyle L_{i} \leq\exp\left(\tfrac{2c}{1-\gamma}(i+1)^{1-\gamma}\right)\sum_{j=\lfloor vi\rfloor+1}^{i}\eta_{j}^{2}\leq c^{2}\exp\left(\tfrac{2c}{1-\gamma}(i+1)^{1-\gamma}\right)\int_{\lfloor vi\rfloor}^{i}x^{-2\gamma}dx

which yields

λ(Rl,m)\displaystyle\lambda(R_{l,m})\leq c2ni=1nexp(2c1γ(i+1)1γ(1(vi+1i+1)1γ))(1+1vi12γ2γ1)\displaystyle\frac{c^{2}}{n}\sum_{i=1}^{n}\exp\left(-\tfrac{2c}{1-\gamma}(i+1)^{1-\gamma}(1-(\tfrac{\lfloor vi\rfloor+1}{i+1})^{1-\gamma})\right)(1+\tfrac{1-\lfloor vi\rfloor^{1-2\gamma}}{2\gamma-1})
+c2ni=1ni12γ12γ1((ivi)2γ11)\displaystyle+\frac{c^{2}}{n}\sum_{i=1}^{n}i^{1-2\gamma}\frac{1}{2\gamma-1}((\tfrac{i}{\lfloor vi\rfloor})^{2\gamma-1}-1)

It is easy to see that (vi+1)/(i+1)(v+1)/2(\lfloor vi\rfloor+1)/(i+1)\leq(v+1)/2 and ivi1v1/2\tfrac{i}{\lfloor vi\rfloor}\leq\tfrac{1}{v-1/2} for i2i\geq 2. Therefore and since vi12γ0\lfloor vi\rfloor^{1-2\gamma}\geq 0,

λ(Rl,m)\displaystyle\lambda(R_{l,m})\leq 2γc2(2γ1)ni=1nexp(2c1γ(i+1)1γ(1(v+12)1γ))+κc2n(2γ1)i=1ni12γ,\displaystyle\tfrac{2\gamma c^{2}}{(2\gamma-1)n}\sum_{i=1}^{n}\exp\left(-\tfrac{2c}{1-\gamma}(i+1)^{1-\gamma}\left(1-\left(\tfrac{v+1}{2}\right)^{1-\gamma}\right)\right)+\tfrac{\kappa c^{2}}{n(2\gamma-1)}\sum_{i=1}^{n}i^{1-2\gamma},

where κ=(1v1/2)2γ11>0\kappa=(\tfrac{1}{v-1/2})^{2\gamma-1}-1>0. Since 1ni=1nexp(K(i+1)1γ)0\frac{1}{n}\sum_{i=1}^{n}\exp(-K(i+1)^{1-\gamma})\rightarrow 0 for K>0K>0 and 1ni=1ni12γ0\frac{1}{n}\sum_{i=1}^{n}i^{1-2\gamma}\to 0, the assertion of the lemma follows.

D.3 Proof of Lemma C.3

Without loss of generality we assume that λ(G)=1\lambda(G)=1 (changing the constant cc in the learning rate). Because Nl,mN_{l,m} is a polynomial of the symmetric matrix G it follows that

λ(Nl,m)=2lmi=1mb=(i1)l+1ilb=(i1)l+1b{ηbk=b+1b(1ηk)}.\lambda(N_{l,m})=\frac{2}{lm}\sum_{i=1}^{m}\sum_{b=(i-1)l+1}^{il}\sum_{b^{\prime}=(i-1)l+1}^{b}\left\{\eta_{b^{\prime}}\prod_{k=b^{\prime}+1}^{b}(1-\eta_{k})\right\}.

Note that

j=1bηjk=j+1b(1ηk)=1k=1b(1ηk)\sum_{j=1}^{b}\eta_{j}\prod_{k=j+1}^{b}(1-\eta_{k})=1-\prod_{k=1}^{b}(1-\eta_{k}) (25)

which follows by a direct calculation using an induction argument. With this representation we obtain

b=(i1)l+1b{ηbk=b+1b(1ηk)}=\displaystyle\sum_{b^{\prime}=(i-1)l+1}^{b}\left\{\eta_{b^{\prime}}\prod_{k=b^{\prime}+1}^{b}(1-\eta_{k})\right\}= b=1b{ηbk=b+1b(1ηk)}b=1(i1)l{ηbk=b+1b(1ηk)}\displaystyle\sum_{b^{\prime}=1}^{b}\left\{\eta_{b^{\prime}}\prod_{k=b^{\prime}+1}^{b}(1-\eta_{k})\right\}-\sum_{b^{\prime}=1}^{(i-1)l}\left\{\eta_{b^{\prime}}\prod_{k=b^{\prime}+1}^{b}(1-\eta_{k})\right\}
=\displaystyle= 1k=1b(1ηk)(1k=1(i1)l(1ηk))k=(i1)l+1b(1ηk)\displaystyle 1-\prod_{k=1}^{b}(1-\eta_{k})-\left(1-\prod_{k=1}^{(i-1)l}(1-\eta_{k})\right)\prod_{k=(i-1)l+1}^{b}(1-\eta_{k})
=\displaystyle= 1k=(i1)l+1b(1ηk).\displaystyle 1-\prod_{k=(i-1)l+1}^{b}(1-\eta_{k}).

Therefore, it is left to show that

An:=2lmi=1mb=(i1)l+1ilk=(i1)l+1b(1ηk)0.A_{n}:=\frac{2}{lm}\sum_{i=1}^{m}\sum_{b=(i-1)l+1}^{il}\prod_{k=(i-1)l+1}^{b}(1-\eta_{k})\rightarrow 0.

For kbilk\leq b\leq il use (1ηk)(1ηil)(1-\eta_{k})\leq(1-\eta_{il}) which gives (using the definition ηk=ckγ\eta_{k}=ck^{-\gamma})

b=(i1)l+1ilk=(i1)l+1b(1ηk)\displaystyle\sum_{b=(i-1)l+1}^{il}\prod_{k=(i-1)l+1}^{b}(1-\eta_{k})\leq b=(i1)l+1il(1ηil)b(i1)l=(1ηil)1(1ηil)lηil\displaystyle\sum_{b=(i-1)l+1}^{il}(1-\eta_{il})^{b-(i-1)l}=(1-\eta_{il})\frac{1-(1-\eta_{il})^{l}}{\eta_{il}}
=\displaystyle= (iγlγc)1(1ciγlγ)lc1ciγlγ\displaystyle(i^{\gamma}l^{\gamma}-c)\frac{1-(1-ci^{-\gamma}l^{-\gamma})^{l}}{c}\leq\frac{1}{c}i^{\gamma}l^{\gamma}

if ll is sufficiently large. This gives

An\displaystyle A_{n}\leq 2lmci=1miγlγ2mγcl1γ,\displaystyle\frac{2}{lmc}\sum_{i=1}^{m}i^{\gamma}l^{\gamma}\leq\frac{2m^{\gamma}}{cl^{1-\gamma}},

which converges to zero since mγl1γ0\frac{m^{\gamma}}{l^{1-\gamma}}\to 0, by assumption.
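For completeness, here is a short verification (our own) of the identity (25) used in this proof, by induction on b. The base case b=1 reads

\eta_{1}=1-(1-\eta_{1}),

and, assuming (25) for b, the induction step follows from

\sum_{j=1}^{b+1}\eta_{j}\prod_{k=j+1}^{b+1}(1-\eta_{k}) =(1-\eta_{b+1})\sum_{j=1}^{b}\eta_{j}\prod_{k=j+1}^{b}(1-\eta_{k})+\eta_{b+1} =(1-\eta_{b+1})\Big(1-\prod_{k=1}^{b}(1-\eta_{k})\Big)+\eta_{b+1} =1-\prod_{k=1}^{b+1}(1-\eta_{k}).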

D.4 Proof of Lemma C.4

As in the proof of Lemma˜C.3 we assume without loss of generality that λ(G)=1\lambda(G)=1. Observing again that NCl,mNC_{l,m} is a polynomial of the symmetric matrix GG it follows that

λ(NCl,m)=2lmi=1mb=(i1)l+1ilb=(i1)l+1b1{j=1bηj2k=j+1b(1ηk)2}k=b+1b(1ηk).\lambda(NC_{l,m})=\frac{2}{lm}\sum_{i=1}^{m}\sum_{b=(i-1)l+1}^{il}\sum_{b^{\prime}=(i-1)l+1}^{b-1}\left\{\sum_{j=1}^{b^{\prime}}\eta_{j}^{2}\prod_{k=j+1}^{b^{\prime}}(1-\eta_{k})^{2}\right\}\prod_{k=b^{\prime}+1}^{b}(1-\eta_{k}).

We will show below that there exists a constant cc^{\prime} and a constant b0b_{0} (which depends on the parameters of the learning rate) such that for all bb0b\geq b_{0}

j=1bηj2k=j+1b(1ηk)2\displaystyle\sum_{j=1}^{b}\eta_{j}^{2}\prod_{k=j+1}^{b}(1-\eta_{k})^{2} ηb/2+cb1\displaystyle\leq\eta_{b}/2+c^{\prime}b^{-1} (26)
j=1bηj2k=j+1b(1ηk)2\displaystyle\sum_{j=1}^{b}\eta_{j}^{2}\prod_{k=j+1}^{b}(1-\eta_{k})^{2} ηb/2.\displaystyle\geq\eta_{b}/2. (27)

Recalling the definition of Nl,mN_{l,m} in (19) and using (26), (27) and Lemma˜C.3 it follows that

λ(NCl,m)\displaystyle\lambda(NC_{l,m}) λ(Nl,m)/2+c2lmi=1mb=(i1)l+1ilb=(i1)l+1b1b11,\displaystyle\leq\lambda(N_{l,m})/2+c^{\prime}\frac{2}{lm}\sum_{i=1}^{m}\sum_{b=(i-1)l+1}^{il}\sum_{b^{\prime}=(i-1)l+1}^{b-1}b^{\prime-1}\to 1, (28)
λ(NCl,m)\displaystyle\lambda(NC_{l,m}) λ(Nl,m)/21\displaystyle\geq\lambda(N_{l,m})/2\rightarrow 1

since the first term in (28) converges to 11 by Lemma˜C.3 while the second term vanishes asymptotically, which is a consequence of the following

2lmi=1mb=(i1)l+1ilb=(i1)l+1bb1\displaystyle\frac{2}{lm}\sum_{i=1}^{m}\sum_{b=(i-1)l+1}^{il}\sum_{b^{\prime}=(i-1)l+1}^{b}b^{\prime-1} =2lmi=1mb=(i1)l+1ilb=b+1ilb1\displaystyle=\frac{2}{lm}\sum_{i=1}^{m}\sum_{b^{\prime}=(i-1)l+1}^{il}\sum_{b=b^{\prime}+1}^{il}b^{\prime-1}
=2lmi=1mb=(i1)l+1ilb1(ilb)\displaystyle=\frac{2}{lm}\sum_{i=1}^{m}\sum_{b^{\prime}=(i-1)l+1}^{il}b^{\prime-1}(il-b^{\prime})
=2mi=1mib=(i1)l+1ilb12\displaystyle=\frac{2}{m}\sum_{i=1}^{m}i\sum_{b^{\prime}=(i-1)l+1}^{il}b^{\prime-1}-2
2mb=1lb1+2mi=2milog(ii1)2+o(1)\displaystyle\leq\frac{2}{m}\sum_{b^{\prime}=1}^{l}b^{\prime-1}+\frac{2}{m}\sum_{i=2}^{m}i\log\left(\frac{i}{i-1}\right)-2+o(1)
2log(l)m+2m2m+1xlog(xx1)𝑑x2+o(1).\displaystyle\leq\frac{2\log(l)}{m}+\frac{2}{m}\int_{2}^{m+1}x\log\left(\frac{x}{x-1}\right)dx-2+o(1).

The right hand side converges to 0 since log(l)m0\frac{\log(l)}{m}\to 0 and

2m2m+1xlog(xx1)𝑑x2.\frac{2}{m}\int_{2}^{m+1}x\log\left(\frac{x}{x-1}\right)dx\to 2.

Therefore it remains to prove the two inequalities (26) and (27).

Proof of (26).

We will show this result by induction over bb. Denote

B=B(c)=max{(3c24c1)12γ1,(32cc(4c1))11γ,(32cc4c1)1γ}.B=B(c^{\prime})=\max\left\{\left(3\frac{c^{2}}{4c^{\prime}-1}\right)^{\frac{1}{2\gamma-1}},\left(3\frac{2c^{\prime}}{c(4c^{\prime}-1)}\right)^{\frac{1}{1-\gamma}},\left(3\frac{2cc^{\prime}}{4c^{\prime}-1}\right)^{\frac{1}{\gamma}}\right\}.

At first we argue that there exist constants c1c^{\prime}\geq 1 and b0Bb_{0}\geq B such that (26) holds for b=b0b=b_{0} (this particular choice of BB is used in the induction step).
To see this note that, for c1c^{\prime}\geq 1, B(c)B(c^{\prime}) is bounded by a constant (depending on cc and γ\gamma). Let b0b_{0} be fixed and larger than this constant. In particular, the choice of b0b_{0} does not depend on cc^{\prime} and b0B(c)b_{0}\geq B(c^{\prime}) for all c1c^{\prime}\geq 1. Then, for cc^{\prime} sufficiently large, (26) holds for b=b0b=b_{0} since the left-hand side is finite for b0b_{0} fixed.
For the induction step we assume that (26) holds for some b1b0b-1\geq b_{0}; then we have to show:

j=1bηj2k=j+1b(1ηk)2=\displaystyle\sum_{j=1}^{b}\eta_{j}^{2}\prod_{k=j+1}^{b}(1-\eta_{k})^{2}= ηb2+(1ηb)2j=1b1ηj2k=j+1b1(1ηk)2ηb/2+c1b.\displaystyle\eta_{b}^{2}+(1-\eta_{b})^{2}\sum_{j=1}^{b-1}\eta_{j}^{2}\prod_{k=j+1}^{b-1}(1-\eta_{k})^{2}\leq\eta_{b}/2+c^{\prime}\frac{1}{b}. (29)

Since we assumed that (26) holds for b1b-1, (29) follows from

ηb2+(1ηb)2(ηb1/2+c1b1)ηb/2+c1b\eta_{b}^{2}+(1-\eta_{b})^{2}\left(\eta_{b-1}/2+c^{\prime}\frac{1}{b-1}\right)\leq\eta_{b}/2+c^{\prime}\frac{1}{b}

which is equivalent to (inserting the definition of ηb\eta_{b} and multiplying by bγb^{\gamma} )

\displaystyle\left(\left(\tfrac{b}{b-1}\right)^{\gamma}-1\right)\left(-c^{2}b^{-\gamma}+\tfrac{c}{2}\right)+\tfrac{1}{2}c^{3}b^{-2\gamma}\left(\tfrac{b}{b-1}\right)^{\gamma}+\tfrac{c^{\prime}c^{2}}{b-1}b^{-\gamma}+\tfrac{c^{\prime}b^{\gamma}}{b(b-1)}\leq\tfrac{2cc^{\prime}}{b-1}. (30)

Since (bb1)γbb1(\frac{b}{b-1})^{\gamma}\leq\frac{b}{b-1} it holds that

\displaystyle\left(\left(\tfrac{b}{b-1}\right)^{\gamma}-1\right)\left(-c^{2}b^{-\gamma}+\tfrac{1}{2}c\right)+\tfrac{1}{2}c^{3}b^{-2\gamma}\left(\tfrac{b}{b-1}\right)^{\gamma}+\tfrac{c^{\prime}c^{2}}{b-1}b^{-\gamma}+\tfrac{c^{\prime}b^{\gamma}}{b(b-1)}
\displaystyle\leq c2(bb11)+c32(b1)b12γ+cc2b1bγ+cbγb(b1)\displaystyle\tfrac{c}{2}\left(\tfrac{b}{b-1}-1\right)+\tfrac{c^{3}}{2(b-1)}b^{1-2\gamma}+\tfrac{c^{\prime}c^{2}}{b-1}b^{-\gamma}+\tfrac{c^{\prime}b^{\gamma}}{b(b-1)}
=\displaystyle= c+c3b12γ+2cc2bγ+2cbγ12(b1).\displaystyle\frac{c+c^{3}b^{1-2\gamma}+2c^{\prime}c^{2}b^{-\gamma}+2c^{\prime}b^{\gamma-1}}{2(b-1)}.

So (30), and therefore (29), follows from

c3b12γ+2cc2bγ+2cbγ14ccc,\displaystyle c^{3}b^{1-2\gamma}+2c^{\prime}c^{2}b^{-\gamma}+2c^{\prime}b^{\gamma-1}\leq 4cc^{\prime}-c,

which is implied if the following three inequalities hold

c3b12γ\displaystyle c^{3}b^{1-2\gamma} 13c(4c1)\displaystyle\leq\frac{1}{3}c(4c^{\prime}-1)
2cc2bγ\displaystyle 2c^{\prime}c^{2}b^{-\gamma} 13c(4c1)\displaystyle\leq\frac{1}{3}c(4c^{\prime}-1)
2cbγ1\displaystyle 2c^{\prime}b^{\gamma-1} 13c(4c1)\displaystyle\leq\frac{1}{3}c(4c^{\prime}-1)

which they do for bBb\geq B. ∎

Proof of (27).

We will show this result by induction over b. By Assumption˜A.2 (5) there exists a constant b_{0}\geq(2c)^{\frac{1}{\gamma}} such that (27) holds for b=b_{0} (this is condition (14) with \lambda(G)=1). For the induction step assume that (27) holds for some b-1\geq b_{0}. Therefore,

j=1bηj2k=j+1b(1ηk)2=\displaystyle\sum_{j=1}^{b}\eta_{j}^{2}\prod_{k=j+1}^{b}(1-\eta_{k})^{2}= ηb2+(1ηb)2j=1b1ηj2k=j+1b1(1ηk)2ηb2+(1ηb)2ηb12\displaystyle\eta_{b}^{2}+(1-\eta_{b})^{2}\sum_{j=1}^{b-1}\eta_{j}^{2}\prod_{k=j+1}^{b-1}(1-\eta_{k})^{2}\geq\eta_{b}^{2}+(1-\eta_{b})^{2}\frac{\eta_{b-1}}{2}

by the induction hypothesis. Showing that this is larger than ηb/2\eta_{b}/2 is equivalent to (inserting the definition of ηb\eta_{b}, multiplying by 2b2γ(b1)γ/c2b^{2\gamma}(b-1)^{\gamma}/c)

2c((b1)γbγ)+c2+b2γ\displaystyle 2c\left((b-1)^{\gamma}-b^{\gamma}\right)+c^{2}+b^{2\gamma} (b1)γbγ.\displaystyle\geq(b-1)^{\gamma}b^{\gamma}.

Since c2>0c^{2}>0, this is implied by

2c\displaystyle 2c bγ\displaystyle\leq b^{\gamma}

(note that (b1)γbγ0(b-1)^{\gamma}-b^{\gamma}\leq 0). By Assumption˜A.2 (5) we have bγb0γ2cb^{\gamma}\geq b_{0}^{\gamma}\geq 2c, which completes the proof of Lemma˜C.4. ∎

D.5 Proof of Lemma C.5

We start by showing (23). Note that by definition of Nl,mrestN_{l,m}^{rest} (20)

tr(Nl,mrest)=2lmi=1mb=(i1)l+1ilb=(i1)l+1btr(ΣbKb,b)tr(N_{l,m}^{rest})=\frac{2}{lm}\sum_{i=1}^{m}\sum_{b=(i-1)l+1}^{il}\sum_{b^{\prime}=(i-1)l+1}^{b}tr(\Sigma_{b^{\prime}}K_{b,b^{\prime}})

where the matrix Kb,bK_{b,b^{\prime}} is defined in (21) and Σb\Sigma_{b} as in Lemma˜C.1. From the Cauchy-Schwarz inequality it follows that |tr(ΣbKb,bT)|tr(Σb2)tr(Kb,b2)|tr(\Sigma_{b^{\prime}}K_{b,b^{\prime}}^{T})|\leq\sqrt{tr(\Sigma_{b^{\prime}}^{2})tr(K_{b,b^{\prime}}^{2})}. By Assumption˜A.2, tr(Σb2)\sqrt{tr(\Sigma_{b^{\prime}}^{2})} is (up to a constant) bounded by ηb\sqrt{\eta_{b^{\prime}}}. We will prove

Bn\displaystyle B_{n} :=2lmi=2mb=(i1)l+1ilb=(i1)l+1bηbλ(Kb,b2)0,\displaystyle:=\frac{2}{lm}\sum_{i=2}^{m}\sum_{b=(i-1)l+1}^{il}\sum_{b^{\prime}=(i-1)l+1}^{b}\sqrt{\eta_{b^{\prime}}\lambda(K_{b,b^{\prime}}^{2})}\to 0, (31)
Cn\displaystyle C_{n} :=2lmb=1lb=1bηbλ(Kb,b2)0\displaystyle:=\frac{2}{lm}\sum_{b=1}^{l}\sum_{b^{\prime}=1}^{b}\sqrt{\eta_{b^{\prime}}\lambda(K_{b,b^{\prime}}^{2})}\to 0 (32)

for every eigenvalue of Kb,bK_{b,b^{\prime}} from which (23) follows.
To see (31), note again that Kb,bK_{b,b^{\prime}} is a polynomial of the symmetric matrix GG and as in the proof of Lemma˜C.3 we assume without loss of generality that λ(G)=1\lambda(G)=1, which gives

λ(Kb,b2)=ηb2k=b+1b(1ηk)2.\lambda(K_{b,b^{\prime}}^{2})=\eta_{b^{\prime}}^{2}\prod_{k=b^{\prime}+1}^{b}(1-\eta_{k})^{2}.

If ll is sufficiently large, 1ηk01-\eta_{k}\geq 0 for kblk\geq b^{\prime}\geq l, and therefore

λ(Kb,b2)=ηbk=b+1b(1ηk).\displaystyle\sqrt{\lambda(K_{b,b^{\prime}}^{2})}=\eta_{b^{\prime}}\prod_{k=b^{\prime}+1}^{b}(1-\eta_{k}).

By ηbη(i1)l\sqrt{\eta_{b^{\prime}}}\leq\sqrt{\eta_{(i-1)l}} for b(i1)lb^{\prime}\geq(i-1)l and by applying (25), it follows that

Bn\displaystyle B_{n}\leq 2lmi=2mη(i1)lb=(i1)l+1ilb=(i1)l+1b{ηbk=b+1b(1ηk)}\displaystyle\frac{2}{lm}\sum_{i=2}^{m}\sqrt{\eta_{(i-1)l}}\sum_{b=(i-1)l+1}^{il}\sum_{b^{\prime}=(i-1)l+1}^{b}\left\{\eta_{b^{\prime}}\prod_{k=b^{\prime}+1}^{b}(1-\eta_{k})\right\}
=\displaystyle= 2lmi=2mη(i1)lb=(i1)l+1il(1k=(i1)l+1b(1ηk))\displaystyle\frac{2}{lm}\sum_{i=2}^{m}\sqrt{\eta_{(i-1)l}}\sum_{b=(i-1)l+1}^{il}(1-\prod_{k=(i-1)l+1}^{b}(1-\eta_{k}))
\displaystyle\leq 2mi=2mη(i1)l\displaystyle\frac{2}{m}\sum_{i=2}^{m}\sqrt{\eta_{(i-1)l}}
\displaystyle\leq 2lγ/2\displaystyle 2l^{-\gamma/2}

which converges to zero.
(32) follows by similar arguments and noting that (1ηk)<0(1-\eta_{k})<0 for only finitely many kk.

To show the second assertion (24) in Lemma˜C.5, note that by the definition of NCl,mrestNC_{l,m}^{rest} in Lemma˜C.1 and the same arguments as in the proof of (23), it follows that

tr(NC_{l,m}^{rest})\leq\frac{2}{lm}\sum_{i=1}^{m}\sum_{b=(i-1)l+1}^{il}\sum_{b^{\prime}=(i-1)l+1}^{b-1}\left\{\sum_{j=1}^{b^{\prime}}\sqrt{\eta_{j}\,tr((K_{b,j}^{T}K_{b^{\prime},j})^{2})}\right\}

where tr((Kb,jTKb,j)2)dλmax((Kb,jTKb,j)2)tr((K_{b,j}^{T}K_{b^{\prime},j})^{2})\leq d\lambda_{max}((K_{b,j}^{T}K_{b^{\prime},j})^{2}). Noting Kb,jK_{b,j} is a polynomial of the symmetric matrix GG and assuming without loss of generality that for the corresponding eigenvalue of GG it holds λ(G)=1\lambda(G)=1 it follows

\sqrt{\lambda((K_{b,j}^{T}K_{b^{\prime},j})^{2})}=\eta_{j}^{2}\prod_{k=j+1}^{b^{\prime}}(1-\eta_{k})^{2}\prod_{k=b^{\prime}+1}^{b}|1-\eta_{k}|.

Similar to (26) one can show by induction that (cc being the constant in the learning rate)

j=1bηj2ηjk=j+1b(1ηk)2cηbηb+cb1.\sum_{j=1}^{b^{\prime}}\eta_{j}^{2}\sqrt{\eta_{j}}\prod_{k=j+1}^{b^{\prime}}(1-\eta_{k})^{2}\leq c\eta_{b^{\prime}}\sqrt{\eta_{b^{\prime}}}+c^{\prime}b^{\prime-1}.

This implies

tr(NC_{l,m}^{rest})\leq\frac{2\sqrt{d}}{lm}\sum_{i=1}^{m}\sum_{b=(i-1)l+1}^{il}\sum_{b^{\prime}=(i-1)l+1}^{b-1}\left(c\,\eta_{b^{\prime}}\sqrt{\eta_{b^{\prime}}}\prod_{k=b^{\prime}+1}^{b}|1-\eta_{k}|+c^{\prime}b^{\prime-1}\right),

where the sum over the first term converges to zero by the same arguments used for the convergence of BnB_{n} and CnC_{n}, and the sum over the second term converges to zero by the same arguments as in the proof of Lemma˜C.4 (see the display following (28)). Therefore (24) follows.