The Role of Adaptive Optimizers
for Honest Private Hyperparameter Selection^†^†thanks: Authors SM and SS have equal contribution and are listed in alphabetical order. Authors XH, GK, OT have equal contribution and are listed in alphabetical order.

Shubhankar Mohapatra Cheriton School of Computer Science, University of Waterloo. s3mohapa@uwaterloo.ca. Sajin Sasy Cheriton School of Computer Science, University of Waterloo. ssasy@uwaterloo.ca. Supported by an Ontario Graduate Scholarship, and also grateful to the Royal Bank of Canada and NSERC grant CRDPJ-534381. Xi He Cheriton School of Computer Science, University of Waterloo. xi.he@uwaterloo.ca. Supported by an NSERC Discovery Grant, a University of Waterloo startup grant, and a Compute Canada RRG grant. Gautam Kamath Cheriton School of Computer Science, University of Waterloo. g@csail.mit.edu. Supported by an NSERC Discovery Grant, a University of Waterloo startup grant, and a Compute Canada RRG grant. Om Thakkar Google. omthkkr@google.com.

Abstract

Hyperparameter optimization is a ubiquitous challenge in machine learning, and the performance of a trained model depends crucially upon their effective selection. While a rich set of tools exist for this purpose, there are currently no practical hyperparameter selection methods under the constraint of differential privacy (DP). We study honest hyperparameter selection for differentially private machine learning, in which the process of hyperparameter tuning is accounted for in the overall privacy budget. To this end, we i) show that standard composition tools outperform more advanced techniques in many settings, ii) empirically and theoretically demonstrate an intrinsic connection between the learning rate and clipping norm hyperparameters, iii) show that adaptive optimizers like DPAdam enjoy a significant advantage in the process of honest hyperparameter tuning, and iv) draw upon novel limiting behaviour of Adam in the DP setting to design a new and more efficient optimizer.

1 Introduction

Over the last several decades, the field of machine learning has flourished. However, training machine learning models frequently involves personal data, which leaves data contributors susceptible to privacy attacks. This is not purely hypothetical: recent results have shown that models are vulnerable to membership inference [SSSS17, CLE⁺19, NSH19] and model inversion attacks [FJR15, SRS17]. The leading approaches for privacy-preserving machine learning are based on differential privacy (DP) [DMNS06]. Informally, DP rigorously limits and masks the contribution that an individual datapoint can have on an algorithm’s output. To address the aforementioned issues, DP training procedures have been developed [WM10, BST14, SCS13, ACG⁺16], which generally resemble non-private gradient-based methods, but with the incorporation of gradient clipping and noise injection.

In both settings, hyperparameter selection is instrumental to achieving high accuracy. The most common methods are grid search or random search, both of which incur a computational overhead scaling with the number of hyperparameters under consideration. In the private setting, this issue is often magnified as most private training procedures introduce new hyperparameters. Regardless, and more importantly, hyperparameter tuning on a sensitive dataset also costs in terms of privacy, naively incurring a multiplicative cost which scales as the square root of the number of candidates (based on composition properties of differential privacy [KOV15]).

Most prior works on private learning choose not to account for this cost [ACG⁺16, YLP⁺19, TB21], focusing instead on demonstrating the accuracy achievable by private learning under idealized conditions; if the best hyperparameters were somehow known ahead of time. Some works assume the presence of supplementary public data resembling the sensitive dataset [AGD⁺20, RTM⁺20], which may be freely used for hyperparameter tuning. Naturally, such public data may be scarce or nonexistent in settings where privacy is a concern, leaving practitioners with little guidance on how to choose hyperparameters in practice. As explored in our paper, poor hyperparameter selection with standard private optimizers can have catastrophic effects on model accuracy.

Hope is afforded by the success of adaptive optimizers in the non-private setting. The canonical example is Adam [KB14], which exploits moments of the gradients to adaptively and dynamically determine the learning rate. It works out of the box in many cases, providing accuracy comparable with tuned SGD. However, Adam has been overlooked in the context of private learning since previous works have show than fine-tuned DPSGD tends to perform better than DPAdam [PCS⁺20], which has lead to several subsequent works [YZC⁺21, TB21] to limit themselves to highlighting accuracy under ideal DPSGD hyperparameters. We navigate the different options available to a practitioner to solve the honest private hyperparameter tuning problem and ask, are there optimizers which provide strong privacy, require minimal hyperparameter tuning, and perform competitively with tuned counterparts?

1.1 Our Contributions

•

We investigate techniques for private tuning of hyperparameters. We perform the first empirical evaluation of the proposed theoretical method of [LT19] and demonstrate that it can be expensive; in certain cases, one can tune over sufficiently many hyperparameters using standard composition tools such as moments accountant [ACG⁺16].
•

We empirically and theoretically demonstrate that two hyperparameters, the learning rate and clipping threshold, are intrinsically coupled for non-adaptive optimizers. While other hyperparameters and the model architecture are restricted by the scope of the task, privacy and utility targets, and computational resources, the learning rate and clipping norm have no a priori bounds. Since the resulting hyperparameter grid adds up to the privacy cost while tuning to achieve the model with the best utility, we explore leveraging adaptive optimizers to reduce the hyperparameter space.
•

We empirically demonstrate the DPAdam optimizer (with default values for most hyperparameters), can match the performance of tuned non-adaptive optimizers on a variety of datasets, thus enabling private learning with honest hyperparameter selection. This finding complements a prior claim of [PCS⁺20], which suggests that a well-tuned DPSGD can outperform DPAdam. However, our findings show that this difference in performance is relatively insignificant. Furthermore, in the realistic setting where hyperparameter tuning must be accounted for in the privacy loss, we show that DPAdam is much more likely to produce non-catastrophic results.
•

We show that the adaptive learning rate of DPAdam converges to a static value. To leverage this, we introduce a new private optimizer, $\mathsf{DPAdamWOSM}$ that matches the performance of DPAdam without computing the second moments.

2 Preliminaries

Definition 2.1 (Differential Privacy).

[DKM⁺06, DMNS06] A randomized algorithm $\mathcal{M}$ achieves ( $\varepsilon,\delta$ )-DP if for all $\mathcal{S}\subseteq$ Range( $\mathcal{M}$ ) and for any two database instances $D,D^{\prime}\in\mathcal{D}$ that differ only in one tuple:

\Pr[\mathcal{M}(D)\in\mathcal{S}]\leq e^{\varepsilon}\Pr[\mathcal{M}(D^{\prime})\in\mathcal{S}]+\delta.

The privacy cost is measured by the parameters ( $\varepsilon,\delta$ ) also referred to as the privacy budget. Smaller values of $\varepsilon$ correspond to stricter privacy guarantees, and it is standard in literature to set $\delta\ll\frac{1}{n}$ , where $n$ is the size of the database. We set the $\delta_{f}$ in our work to $\frac{1}{n}$ scaled down to the nearest power of $10$ . Complex DP algorithms can be built from the basic algorithms following two important properties of differential privacy: 1) Post-processing states that for any function $g$ defined over the output of the mechanism $\mathcal{M}$ , if $\mathcal{M}$ satisfies ( $\varepsilon,\delta$ )-DP, so does $g(\mathcal{M})$ ; 2) Basic composition states that if for each $i\in[k]$ , mechanism $\mathcal{M}_{i}$ satisfies ( $\varepsilon_{i},\delta_{i}$ )-DP, then a mechanism sequentially applying $\mathcal{M}_{1}$ , $\mathcal{M}_{2},\ldots,\mathcal{M}_{k}$ satisfies ( $\sum_{i=1}^{k}\varepsilon_{i},\sum_{i=1}^{k}\delta_{i}$ )-DP.

Given a function $f:\mathcal{D}\rightarrow\mathbb{R}^{d}$ , the Gaussian mechanism adds noise drawn from a normal distribution $\mathcal{N}(0,S_{f}^{2}\sigma^{2})$ to each dimension of the output, where $S_{f}$ is the $\ell_{2}$ -sensitivity of $f$ , defined as $S_{f}=\max_{D,D^{\prime}\text{differ in a row}}\|f(D)-f(D^{\prime})\|_{2}$ . For $\varepsilon\in(0,1)$ , if $\sigma\geq\sqrt{2\ln(1.25/\delta)}/\varepsilon$ , then the Gaussian mechanism satisfies $(\varepsilon,\delta)$ -DP.

The Gaussian mechanism is used to privatize optimization algorithms. In contrast to non-private optimizers where batches are sliced from the training dataset, DP optimizers at each iteration work by sampling “lots” from the training with probability $L/n$ , where $L$ is the (expected) lot size and $n$ is the total data size. A set of queries are computed over those samples. These queries include gradient computation, updates to batch normalization or accuracy metric calculations. As there is not any a priori bound on these query outputs, the sensitivity $S_{f}$ is set by clipping the maximum $\ell_{2}$ norm of the gradient to a user-defined parameter $C$ . The gradient of each point is then noised and published. All DP optimizers follow the same framework in which they take steps on the computed noisy gradient as in its non-private counterpart [MAE⁺18]. The privacy cost of the whole training procedure is calculated by advanced composition techniques such as the Moments accountant [ACG⁺16].

2.1 DP Optimizers

DP-SGD: The most popular private optimizer is the differentially private stochastic gradient descent (DPSGD) [WM10, BST14, SCS13, ACG⁺16]. DPSGD takes individual steps for each point in the sampled lot just like in SGD. Due to these individual steps, SGD is more locally unstable and empirically generalizes better than other optimizers [ZFM⁺20]. However, SGD requires the learning rate to be properly tuned when changing architectures or datasets, without which SGD may show subpar performance. There are five main hyperparameters involved in DPSGD. We start with those also present in the non-private setting, highlighting any differences that arise due to privacy.

•

Training iterations ( $T$ ) - In the private setting, more iterations results in a larger privacy cost.
•

Lot size ( $L$ ) - Lot size factors into the privacy calculation, due to amplification by subsampling [BBG18].
•

Learning rate ( $\alpha$ ) - Learning rate has an important interplay with the clipping threshold $C$ , discussed in Section 5.

The following hyperparameters are new in the private setting.

•

Clipping threshold ( $C$ ) - To limit sensitivity, per-example gradients are clipped to have $\ell_{2}$ -norm bounded by $C$ .
•

Noise scale ( $\sigma$ ) - Scale of the noise added, as a multiple of $C$ . A larger value gives higher privacy but (typically) lower accuracy.

DPMomentum: The private counterpart of SGD-Momentum [RHW86, Qia99], which adds the momentum parameter to the update rule of DPSGD [GAYB17]. This optimizer adds an extra hyperparameter to tune as no default value for momentum is known.

DP-Adam: Adam [KB14] is an adaptive optimizer that combines the advantages from AdaGrad [DHS11] and RMSProp [HSS12]. At the core of Adam, exponentially averaged first and second moment estimates of the gradients are used to take a step. Converting Adam to its differentially private counterpart DPAdam can be done trivially by replacing the standard gradients with their clipped and noised counterparts. Adam adds two extra hyperparameters ( $\beta_{1},\beta_{2}$ ) to tune in the DP setting. However, default values of these parameters are known in the non-private setting. We will tune these parameters to the private setting in Section 5. The adaptivity of these optimizers imply they need not be tuned across learning rates, hence reducing a hyperparameter to tune.

ADADP: This DP adaptive optimizer finds the best learning rate at every alternate iteration [KH20]. It does so by leveraging the $\ell_{2}$ error of taking a full step and taking two half steps. If the error computed is greater than a threshold $\tau$ , the learning rate is updated using a closed form expression. As suggested by the authors, for all our experiments using ADADP, we use the threshold $\tau=\sqrt{\frac{d}{2T}}$ , where $d$ is the model dimension and $T$ is the total number of iterations.

2.2 Related Work

Hyperparameter tuning plays a vital role in machine learning practice. In the non-private setting, ML practitioners use grid search, random search, Bayesian optimization techniques [SSA13] or AutoML [HZC21] techniques to tune their models. However, there hasn’t been much research on private hyperparameter tuning procedures due to the significant associated privacy costs. Each set of hyperparameter configuration results in a privacy-utility tradeoff. This tradeoff for multiple configurations can be captured by Pareto frontiers using multivariate Bayesian optimization over parameter dimensions [AGD⁺20]. However, this method asks the model curator to query the dataset multiple times which requires non-private access to the dataset. There have been some end-to-end private tuning procedures [CMS11, CV13, KGGW15] which work for a selected number of hyperparameter sets. These results work either in restricted settings for few combinations of candidates under relaxations of approximate differential privacy. The most relevant work to ours is an approach for private selection from private candidates [LT19]. Their work provides two methods, one which outputs a candidate with accuracy greater than a given threshold, and another which randomly stops and outputs the best candidate seen so far. The first approach is of limited utility in practice as it requires a prior accuracy bound for the dataset. The second variant incurs a considerable overhead in privacy cost. We study the second approach and compare it with a naive approach based on Moments Accountant [ACG⁺16] which would scale as the square root of the number of candidates. Recent work shows generalized version of the approach for private selection on diverse with better bounds using Rényi DP [PS21].

3 Problem Setup and Overview

Consider a sensitive dataset $D$ which lies beyond a privacy firewall and has $n$ points of the form ${(x_{1},y_{1}),(x_{2},y_{2}),\dots,(x_{n},y_{n})}$ where $x_{i}\in\mathcal{X}$ is the feature vector of the $i$ th point and $y_{i}\in\mathcal{Y}$ is its desired output. Though our experiments are carried out in the supervised setting, all results can be translated to unsupervised setting as well. The dataset has been divided into two parts, the training set and the validation set. A trusted curator wants to train a machine learning model by making queries on the dataset with a total end-to-end training privacy budget of ( $\varepsilon_{f},\delta_{f}$ ) such that the model can perform with high accuracy on the validation set. The curator wants to try multiple hyperparameter candidates for the model to figure out which candidate gives the maximum accuracy. However, as the model is private, each candidate requires multiple queries made on the dataset and all of them need to be accounted in the total privacy budget of ( $\varepsilon_{f},\delta_{f}$ ).

Note that any validation accuracy must also be measured privately. Since this accuracy is a low-sensitivity query with a scalar output, and must only be computed once per choice of hyperparameters, the cost of this procedure is generally a lower order term versus the main training procedure. Thus for simplicity, we do not noise these validation accuracy queries. As we will see later, some optimizers require more candidates to tune and hence would also require more privacy budget than others.

To tackle private hyperparameter selection, we first compare the available private tuning procedures in Section 4. We show that the privacy cost for training a model depends on the hyperparameter grid size and standard composition theorems provide the best guarantees when the grid is small. In Section 5, we investigate different optimizers to see how many candidates are required to output a good solution. In Section 5, we provide theoretical and empirical evidence to demonstrate an intrinsic coupling between two hyperparameters – the learning rate and clipping norm in DPSGD. We show that this coupling makes DPSGD sensitive to these parameter choices, which can drastically affect the validation accuracy. In Section 5, we demonstrate that an adaptive optimizer, DPAdam, translates well from the non-private setting and obviates tuning of the learning rate. In Section 6, we empirically compare DPAdam with DPSGD and DPMomentum to show that DPAdam performs at par with less hyperparameter tuning. Finally, in Section 7, we establish that DPAdam converges to a static learning rate in restricted settings, and unveil a new optimizer $\mathsf{DPAdamWOSM}$ which can leverage this converged value without computing the second moments.

4 Privately Tuning DP Optimizers

Effective hyperparameter tuning is crucial in extracting good utility from an optimizer. Unlike the non-private setting, DP optimizers typically i) have more hyperparameters to tune; ii) require additional privacy budget for tuning. Existing work on DP optimizers acknowledge this problem (e.g., [ACG⁺16]), but do not address the privacy cost incurred during hyperparameter tuning [ACG⁺16, YLP⁺19, TB21]. There are two main prior general-purpose approaches for private hyperparameter selection. The first performs composition via Moments Accountant [ACG⁺16], and the second is the algorithm of [LT19] (LT). The latter is a theoretical result, and to the best of our knowledge, has not been previously evaluated in practice. We investigate the privacy cost of these two techniques in practice and discuss situations in which each method is preferred.

Tuning cost via LT

[LT19] propose a random stopping algorithm (LT) to output a ‘good’ hyperparameter candidate from a pool of $K$ candidates, { $x_{1},\ldots,x_{K}$ }. They assume sampling access to a randomized mechanism $Q(D)$ which samples $i\sim[K]$ , and returns the $i$ -th candidate $x_{i}$ , and a score $q_{i}$ for this candidate. It is a random stopping algorithm, in which at every iteration, a candidate is picked from $Q$ i.i.d. with replacement and a $\gamma$ -biased coin is flipped to randomly stop the algorithm. When the algorithm stops, the candidate with the maximum score seen so far is outputted. In the approximate DP version of this algorithm, an extra parameter $\Upsilon$ is set to limit the total of number of iterations. The pseudocode of this algorithm is deferred to Appendix A.

Theorem 4.1 ([LT19], Theorem 3.4).

Fix any $\gamma\in[0,1]$ , $\delta_{2}>0$ and let $\Upsilon=\frac{1}{\gamma}\log{\frac{1}{\delta_{2}}}$ . If $Q$ is ( $\varepsilon_{1},\delta_{1}$ )-DP, then the LT algorithm is ( $\varepsilon_{f},\delta_{f})$ -DP for $\varepsilon_{f}=3\varepsilon_{1}+3\sqrt{2\delta_{1}}$ and $\delta_{f}=\sqrt{2\delta_{1}}\Upsilon+\delta_{2}$ .

Theorem 4.1 expresses the privacy cost of the algorithm in terms of the privacy cost of individual learners, and parameters of the algorithm itself. The $\delta_{2}$ parameter does not significantly affect the final epsilon $\varepsilon_{f}$ of the algorithm and in practice, one can set it to a very small value ( $10^{-20}$ ). Though a small value of $\delta_{2}$ has little effect on $\delta_{f}$ , it increases the hard stopping time of the algorithm, $\Upsilon$ .

To understand the LT algorithm, we will compare the privacy costs of training a single hyperparameter candidate with a final $\varepsilon_{f},\delta_{f}$ budget via LT and compare it with the privacy cost $\varepsilon_{1},\delta_{1}$ of the underlying individual learner. This setting might seem unnatural for LT as it was designed to select from a pool of candidates but we choose this setting to show the minimum privacy cost overhead associated with LT and later show how the privacy cost changes for multiple candidates (varying $\gamma$ ). To use LT, one needs to figure out the $\varepsilon_{1},\delta_{1}$ via Theorem 4.1 using the final $\varepsilon_{f},\delta_{f}$ values and in this case, $\gamma=1$ (as we have just one candidate). The individual learner is then trained using $\varepsilon_{1},\delta_{1}$ budget.

Refer to caption — Figure 1: Comparing the privacy cost of LT versus Moments Accountant. The minimal privacy overhead incurred by LT is at least ~5x, and increases with the dataset size (left). However, as we allow LT to sample and test more candidate hyperparameters, the privacy cost barely increases (middle). Moments Accountant is able to test a significant number of candidates at the same cost as the minimal privacy overhead of LT (right).

Due to the delicate balance of $\delta_{f}$ in Theorem 4.1, one can see the $\delta_{1}$ comes out to be much smaller than $\delta_{f}$ . This change in $\delta_{1}$ results in a blowup of $\varepsilon_{1}$ and hence, the final privacy cost of the LT algorithm ( $3\varepsilon_{1}+3\sqrt{2\delta_{1}}$ ), is much larger than what it would have been for learning one candidate without LT. We call this increase the blowup of privacy. We measure this blowup in Figure 1(left), for the setting of $\sigma=4$ , $L=250$ , $T=10,000$ with varying dataset sizes ( $n$ ). It can be seen that for $n=5,000$ , the blowup is 4.8x whereas for for $n=950,000$ , the blowup is almost 7.3x (note the log scale on y-axis). Qualitatively similar trends persist for other choices of noise multiplier, lot size and iterations. We add more experiments to compare LT vs MA with varying candidate sizes in Appendix B.

Furthermore, we show that although LT entails a privacy blowup, decreasing $\gamma$ (corresponding to training more individual learners with $\varepsilon_{1},\delta_{1}$ ) doesn’t result in a significant difference in the final epsilon guaranteed by LT. In Figure 1(middle), we show the final epsilon cost for different dataset size and varying values of $\gamma\in[0.001,0.01,0.1,1]$ . It is interesting to note here that with smaller $\gamma$ values, one can train many candidates (in expectation, $\frac{1}{\gamma}$ ) for negligible additional privacy cost. The blowup to train $1$ candidate ( $\gamma=1$ ) versus $1,000$ candidates ( $\gamma=0.001$ ) increases from $33$ to $39$ for $n=5,000$ and increases from $0.49$ to $0.69$ , for $n=950,000$ . This increase is minimal in comparison to advanced composition, which grows proportional to $\mathcal{O}(\sqrt{k})$ . However, another resource at play is the total training time, proportional to $1/\gamma$ (i.e., the total number of candidates). In summary, the LT algorithm is effective if an analyst has the privacy budget to afford the initial blowup, as the privacy cost of testing additional hyperparameters is insignificant.

Tuning cost via MA

We learnt from the previous section that, LT permits selection from a large pool of hyperparameters (depending on the $\gamma$ value) but incurs a constant privacy blowup. We compare LT with tuning using Moments Accountant (MA); for tuning via MA each hyperparameter candidate is trained by adding necessary Gaussian noise at each iteration, and the best hyperparameter candidate is selected at the end, MA is used as the composition mechanism for arriving at the final privacy cost of this process. We notice that with the same initial privacy blowup of the LT algorithm, MA is able to compose a considerable number of hyperparameter candidates. In Figure 1(right), we show the number of candidates that can be composed using MA for the minimum privacy cost for running the LT algorithm ( $\gamma=1$ ), for the setting of $\sigma=4$ , $L=250$ , $T$ = $10,000$ and varying dataset size ( $n$ ) on the x-axis. As the $T$ and $L$ is set constant, bigger $n$ values in this graph correspond to fewer epochs of training and hence, worse utility. Depending on dataset size, MA can compose $14$ candidates for $n=5000$ and up to $175$ candidates when $n=100000$ . It is perhaps surprising how well a standard composition technique performs versus LT. This information can be highly valuable to a practitioner who has limited privacy budget. Qualitatively similar trends persist for other choices of batch size, noise multiplier, and iterations.

From our experiments for both these tuning procedures, we conclude that while tuning with LT entails an initial privacy blowup, the additional privacy cost for trying more candidates (smaller $\gamma$ ) is minimal. Even though this has an additional computation cost, it can be appealing when an analyst wants to try numerous hyperparameters. On the other hand, for the same overall privacy cost, MA can be used to compose a significant number of hyperparameter candidates. Additionally, MA allows access to all intermediate learners, whereas LT allows access to only the final output parameters. In the sequel, this conclusion will be useful in making the naive MA approach a more appealing tool for some settings (e.g., tighter privacy budgets).

5 Tuning DP Optimizers

We detail aspects of tuning both non-adaptive and adaptive optimizers. We start with tuning non-adaptive optimizers. We theoretically and empirically demonstrate a connection between the learning rate and clipping threshold. We also establish that non-adaptive optimizers inevitably require searching over a large LR-clip grid to extract performant models. Adaptive optimizers forego this problem as they do not need to tune the hyperparameter dimension of learning rate. However, they introduce other hyperparameters that have known good choices in the non-private setting, and later we empirically show that they are good candidates in the private setting.

Tuning DP non-adaptive optimizers

While many hyperparameters are restricted due to computational and privacy/utility targets, the learning rate $\alpha$ and the clipping threshold $C$ have no a priori bounds. In what follows, we show an interplay between these parameters by first theoretically analyzing the convergence of DPSGD. We then explore an illustrative experiment which demonstrates their entanglement. In the following theorem, we derive a bound on the expected excess risk of DPSGD and while doing so, show that the optimal learning rate, $\alpha_{opt}$ , is proportional to the inverse of $C$ . The proof appears in Appendix F.

Theorem 5.1.

Let $f$ be a convex and $\beta$ -smooth function, and let $x^{*}=\underset{x\in\mathcal{S}}{\operatorname*{arg\,min}}f(x)$ . Let $x_{0}$ be an arbitrary point in $\mathcal{S}$ , and $x_{t+1}=\Pi_{\mathcal{S}}(x_{t}-\alpha(g_{t}+z_{t}))$ , where $g_{t}=\min(1,\frac{C}{{\|\nabla f(x)\|}^{2}})\nabla f(x)$ and $z_{t}\sim\mathcal{N}(0,\sigma^{2}C^{2})$ is the noise due to privacy. After $T$ iterations, the optimal learning rate is $\alpha_{opt}=\frac{R}{CT\sqrt{1+\sigma^{2}}}$ , where $\mbox{\bf E}[f(\frac{1}{T}\sum_{i}^{T}x_{t})-f(x^{*})]\leq\frac{RC\sqrt{1+\sigma^{2}}}{\sqrt{T}}$ and $R=\mbox{\bf E}[\|x_{0}-x^{*}\|]$ .

Though Theorem 5.1 gives a closed-form expression for the optimum learning rate for the convergence bound to hold, it is a function of the parameter $R$ , which is unknown a priori to the analyst. Given constant $T$ and $\sigma$ , the optimal learning rate $\alpha_{opt}$ is inversely proportional to the clipping norm $C$ . This is crucial information in practice because these parameters vary among datasets and are unbounded. This unboundedness property thus requires us to search over very large ranges of $C$ and $\alpha$ when we have no prior knowledge of the dataset. It is natural to ask whether one can fix the clipping norm $C$ and search only over a wide range for the learning rate $\alpha$ (or vice versa). We explore this relationship experimentally via simulations on a synthetic dataset as well as on the ENRON dataset, showing that fixing one of these two hyperparameters may often but not always result in an optimal model.

We train a linear regression model on a $10$ -dimensional synthetic dataset of input-label pairs $(x,y)$ sampled from a distribution $\mathcal{D}$ as follows: $x_{1},\dots,x_{d}\sim\mathcal{U}(0,1),y=x\cdot w^{*},w^{*}=10\cdot\bm{1}^{d}$ . We use the initialization $w_{0}=\bm{0}^{d}$ and train for 5000 iterations. We set the initial weights of the model $w_{0}=\bm{0}^{d}$ , $d=10$ , $k=10$ and train this linear regression model for $100$ iterations. In the non-private setting, this model converges quickly with any reasonable learning rate, but in the private setting, we notice that the training loss depends heavily on the choice of $\alpha$ and $C$ . Figure 3 shows a heat map for the log training loss when trained on ( $\alpha$ , $C$ ) pairs taken from a large grid consisting of $[1,2,4,5,8]$ at scales of $[10^{-4},10^{-3},10^{-2},10^{-1},10^{0},10^{1}]$ . The best candidates (lowest 1 percentile of loss values) we demarcate with white pixels.

We observe two fundamental phenomena from this figure. First, to achieve the best accuracy, $\alpha$ and $C$ need to be tuned on a large grid spanning several orders of magnitude for each of these parameters. Second, multiple ( $\alpha$ , $C$ ) pairs achieve the best accuracy and all lie on the same diagonal, validating our theory for an inverse relation between learning rate and clipping norm. As mentioned earlier, one might hypothesize that by setting the clipping norm $C$ constant and tuning $\alpha$ (corresponding to a vertical line in Figure 5) or vice versa, one could eliminate tuning a hyperparameter. However, note that not all $C$ and $\alpha$ values correspond to the lowest loss. This phenomenon is evident by noticing that not all vertical or horizontal lines on this figure have white pixels. This happens, for example, at the extremes (e.g., at the top-right corner), but also for several intermediate and standard choices (e.g., $C=0.1$ or $0.2$ ). Again, the analyst has no way of knowing this a priori.

In Figure 3, we detail the results for the same simulation experiment with DPAdam (with default Adam hyperparameters) as the underlying optimizer. Here we notice that the inverse relation between $\alpha$ and $C$ no longer hold as we intuited earlier. In order to be able to compare these two figures, we mark all candidates with lower losses than that of the lowest 1 percentile of Figure 3 with white in Figure 3. Figure 3 suggests that there is a small range of ideal choices for the initial learning rate $\alpha$ , and within these good choices of the $\alpha$ , DPAdam is significantly robust to the choice of clip value, as we see white pixels for a large range of clip values that encompass clip values used in practice. We repeat the same experiment over the ENRON dataset instead of our synthetic dataset, and observe similar trends as seen in Figure 5 and Figure 5. We conclude that to privately tune non-adaptive optimizers, we require a larger grid of hyperparameter options than their adaptive counterparts.

Tuning DP adaptive optimizers

Adaptive optimizers automatically adapt over the learning rate $\alpha$ , requiring us to tune only over the clipping norm $C$ . But recall our key question: can we train models that perform competitively with the fine-tuned counterparts from DPSGD?

Adam [KB14], the canonical adaptive optimizer introduces two new hyperparameters, which are the first and second moment exponential decay parameters ( $\beta_{1}$ and $\beta_{2}$ ). In the non-private setting, these parameters are relatively insensitive, and default values of $\alpha=0.001,\beta_{1}=0.9$ , and $\beta_{2}=0.999$ are recommended based on empirical findings, requiring no additional tuning for this hyperparameter triple. Hence before we compare DPAdam and DPSGD, we first find and establish such recommended values for this hyperparameter triple in the DP setting next, and then show that DPAdam with a small hyperparameter space performs competitively with DPSGD in Section 6.

Table 1: Datasets used in experiment

Dataset	Type	#Samples	#Dims	#Classes
MNIST	Image	70000	784	10
Gisette	Image	6000	5000	2
Adult	Structured	45222	202	2
ENRON	Structured	5172	5512	2

To establish default choices of $\alpha$ , $\beta_{1}$ , and $\beta_{2}$ for DPAdam, we evaluate this private optimizer over four diverse datasets Table 1, and two learning models including logistic regression and a neural network with one 100 neurons hidden layer (TLNN). These selected datasets include both low-dimensional data (where the number of samples greatly outnumbers the dimensionality) and high-dimensional data (where the number of samples and dimensionality are at same scale). Since we still have a large hyperparameter space to tune over, for the rest of this work, we fix a constant lot size ( $L=250$ ), and consider tuning over three different noise levels, $\sigma\in[2,4,8]$ , so that we can study the effects of tuning the other hyperparameters more thoroughly. All experiments are repeated three times and averaged before reporting. Additionally, in this particular experiment since we focus on $\alpha,\beta_{1},$ and $\beta_{2}$ , we also fix the clipping threshold $C=0.5$ , and $T=2500$ iterations of training. For each dataset and model, we run DPAdam three times with hyperparameters $(\alpha,\beta_{1},\beta_{2})$ from the grids, $\alpha\in[0.001,0.05,0.01,0.2,0.5]$ , $\beta_{1},\beta_{2}\in[0.8,0.85,0.9,0.95,0.99.0.999]$ .

We show that the default hyperparameter choice ( $\alpha,\beta_{1},\beta_{2}$ ) of Adam in the non-private setting also works well for DPAdam. Figure 6 shows the boxplots of testing accuracies of DPAdam over the different hyperparameter choices. When $\alpha$ is $0.001$ (same as in the non-private setting), all the datasets and models have final testing accuracies (marked in black) close to the best possible (and in most cases it is in fact the best) accuracy. Furthermore, we also highlight the accuracy of the suggested default choice ( $\alpha=0.001,\beta_{1}=0.9,\beta_{2}=0.999$ ) using gold dots. Hence, for the ease of using DPAdam, we suggest the non-private default values for these parameters in the private setting as well and hence in all our subsequent experiments.

6 Advantages of tuning using DPAdam

In the non-private setting, adaptive optimizers like Adam enjoy a smaller hyperparameter tuning space than SGD. We ask two questions in this section. First, can DPAdam (with little tuning) achieve accuracy comparable to a well-tuned DPSGD? Second, what is the privacy-accuracy tradeoff one incurs when using either of the two methods we detail in Section 4 for hyperparameter selection.

Table 2: Parameter grid for comparing DPSGD and DPAdam

Optimizer	Parameter	Values
DPSGD	$\alpha$	0.001, 0.002, 0.005, 0.01,
	$\alpha$	0.02, 0.05, 0.1, 0.2, 0.5, 1
	$C$	0.1, 0.2, 0.5, 1
DPMomentum	$\alpha$	0.001, 0.002, 0.005, 0.01,
	$\alpha$	0.02, 0.05, 0.1, 0.2, 0.5, 1
	$C$	0.1, 0.2, 0.5, 1
	$m$	0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99
DPAdam	$C$	0.1, 0.2, 0.5, 1

To answer both questions, we compare DPAdam and DPSGD over the same set of datasets and models from the previous section. We report the accuracy of DPSGD with a range of learning rates and clipping values shown in Table 2, and the testing accuracy of DPAdam with default parameter choice from Section 5 ( $\alpha=0.001,\beta_{1}=0.9$ , $\beta_{2}=0.999$ ) and a range of clipping values $C$ in Table 2. In total, DPSGD has 40 candidates to tune over, and DPAdam has 4. This is because we have shown in Section 5 that DPSGD needs a wide grid to obtain the best accuracy when data distributions are unknown. Additionally, we also consider the DPMomentum optimizer. Similar to how we searched for default tuning choices for DPAdam in Section 5, we investigate if there exists a qualitatively good choice for the momentum hyperparameter, and unfortunately our results show that there is no such choice.

In order to show the comparison from both sides of the privacy-accuracy tradeoff, we compare the three optimizers through i) the privacy cost when extracting the best accuracy from these optimizers, and ii) the accuracy one would obtain from them under the tight privacy constraints.

Prioritizing Accuracy

For brevity, we show experiments for $\sigma=4$ in Figure 7, results for other values of $\sigma$ are displayed in Appendix G. For each dataset and model, we train three times for each hyperparameter candidate and report the max every 100 iterations, corresponding to the dark lines for each optimizer. We note that their maxima are extremely similar. However, Table 3 shows the final privacy costs incurred by each of these max accuracy lines, and reflects our claims from Section 4 that using fewer hyperparameter candidates and composing privacy via MA gives a much tighter privacy guarantee.

Dataset	DPSGD	DPMomentum	DPAdam
Adult	5.01	5.23	1.91
ENRON	30.86	32.31	12.80
Gisette	26.40	27.64	10.76
MNIST	3.01	3.14	1.14

Table 3: Final

\varepsilon

(at

\delta=10^{-6}

) for optimizers for the LR Models (Figure 7). DPSGD and DPMomentum use LT for privacy accounting; DPAdam uses MA.

Prioritizing Privacy

Additionally in Figure 7, DPSGD and DPMomentum have pastel dotted lines corresponding to their mean accuracy attained using the MA composition that provides the tightest privacy guarantees for DPAdam. These pastel lines are the mean accuracies (with 95% CI) from 100 repetitions of this experiment. Since DPAdam has only 4 hyperparameter candidates, for this experiment, we sample 4 of the candidates at random for DPSGD and DPMomentum so that they all incur the same privacy cost. Since the candidate pool is significantly larger for DPSGD and DPMomentum, we additionally scrutinize the parameter grid for them and prune the learning rates that perform poorly. Our pruning process (detailed in the supplement) is quite generous, and favours minimizing the hyperparameter space of DPSGD and DPMomentum as much possible.¹¹1Note, pruning itself is of course unfair; the intent was to design a DP optimizer that can be used on any data distributions that we have no prior knowledge of. To do so with DPSGD one would have to consider a significantly wide range of $(\alpha,C)$ pairs to cover ‘good’ candidates as we illustrated in Section 5 Despite the pruning advantage we see that these optimizers perform subpar than DPAdam when constrained with privacy.

7 $\mathsf{DPAdamWOSM}$

In addition to a decaying average of past gradient updates, DPAdam also maintains a decaying average of their second moments. In this section, we design $\mathsf{DPAdamWOSM}$ (DPAdam Without Second Moments), a new DP optimizer that operates only using a decaying average of past gradients, as well as eliminates the need to tune the learning rate parameter. We achieve this by analyzing the convergence behavior of the second-moment decaying average in DPAdam in regimes where the scale of noise added is much higher than the scale of the clipped gradients. Setting the effective step size (ESS) of DPAdam to the converged constant, and removing all computations related to the second-moment updates, results in $\mathsf{DPAdamWOSM}$ . We empirically demonstrate that $\mathsf{DPAdamWOSM}$ matches the utility of DPAdam, while requiring less computation than DPAdam.

Observe that removing the second-moment updates from DPAdam reduces it to DPMomentum with one additional feature: bias-correction to the first-moment decaying average, which DPAdam does to account for its initialization at the origin. While the resultant optimizer still requires tuning the learning rate (in addition to other hyperparameters like the clipping threshold), $\mathsf{DPAdamWOSM}$ can be viewed as self-tuning the learning rate by fixing it to the converged effective step size in DPAdam.

Effective step size (ESS) in DPAdam

DPAdam results have less variance than DPSGD due to its adaptive learning rate. To understand this phenomenon better, we inspect DPAdam’s update step. DPAdam being an adaptive optimizer picks per-parameter ESS as $\frac{\alpha}{\sqrt{\hat{v}_{t}}+\xi}$ , which is the base learning rate $\alpha$ scaled by the second moment of the individual parameter gradients. We notice that when $g\rightarrow 0$ , the ESS for DPAdam converges for the first moment gradient, which innately accounts for the clip bound one is training with. This may happen at later iterations, when the model is close to its minima and the gradients get close to zero.

Theorem 7.1.

The effective step size (ESS) for DPAdam with $g\rightarrow 0$ converges to $ESS^{*}=\frac{\alpha}{(\sigma C/L)+\xi}$ .

Proof.

Recall that the average noisy gradient over a lot is $\tilde{g}=g+\mathcal{N}(0,\sigma^{2}C^{2})/L$ . We now look at the effect of this noisy gradient on the effective step size (ESS) of DPAdam. As $g\rightarrow 0$ , the second moment of DPAdam converges to $\frac{\sigma^{2}C^{2}}{L^{2}}$ . This gives us the converging value for ESS:

ESS^{*}=\frac{\alpha}{\sqrt{\hat{v}_{t}}+\xi}=\frac{\alpha}{\sqrt{\frac{\sigma^{2}C^{2}}{L^{2}}}+\xi}=\frac{\alpha}{(\sigma C/L)+\xi}

Theorem 7.1 gives a closed form expression that ESS converges to. We can use this value in place of $\frac{\alpha}{\sqrt{\hat{v}_{t}}+\xi}$ in the update step from the inception of the learning process. Since the second-moment updates (e.g., $\hat{v}_{t}$ ) are not used anymore, removing them results in our new optimizer $\mathsf{DPAdamWOSM}$ . We provide a pseudo-code for $\mathsf{DPAdamWOSM}$ in the appendix.

Comparing Adaptive Optimizers

We evaluate $\mathsf{DPAdamWOSM}$ by running it alongside DPAdam and ADADP with the same hyperparameter grid in the appendix. For brevity, we show experiments on $\sigma=4$ and others appear in the supplement. In Figure 8, we show the maximum and median accuracy curves for all the optimizers. We display the median accuracy curves (shown in dotted), as an indicator of the quality of the entire pool of hyperparameter candidates for a given optimizer; which in this case is strictly over the choices of clip. The max lines for ADADP lies beneath DPAdam and $\mathsf{DPAdamWOSM}$ for all dataset except Adult. Also, the max accuracy line for $\mathsf{DPAdamWOSM}$ runs alongside DPAdam which means that it can perform as good as DPAdam throughout training. The median line for $\mathsf{DPAdamWOSM}$ also performs alongside DPAdam and in some cases is able to beat it (e.g, the median for $\mathsf{DPAdamWOSM}$ for MNIST-LR and MNIST-TLNN lies above the median line of DPAdam). This occurrence is seen because $\mathsf{DPAdamWOSM}$ uses the converged ESS from the first iteration of training.

8 Conclusion

We thoroughly investigated honest hyperparameter selection for DP optimizers. We compared two existing private methods (LT and MA) to search for hyperparameter candidates, and showed that the former incurs a significant privacy cost but can compose over many candidates, while the latter is effective when the number of candidates is small. Next, we explored connections between the clipping norm and the step size hyperparameter to show an inverse relationship between them. Additionally, we compared non-adaptive and adaptive optimizers, demonstrating that the latter typically achieves more consistent performance over a variety of hyperparameter settings. This can be vital for applications where public data is scarce, resulting in difficulties when tuning hyperparameters. Finally, we brought to light that DPAdam converges to a static learning rate when the noise dominates the gradients. This insight allowed us to derive a novel optimizer $\mathsf{DPAdamWOSM}$ , a variant of DPAdam which avoids the second-moment computation and enjoys better accuracy especially at earlier iterations. Future work remains to investigate further implications of these results to provide tuning-free end-to-end private ML optimizers.

Acknowledgements

Experiments for this work were run on Compute Canada hardware, supported by a Resources for Research Groups grant.

References

[ACG⁺16] Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS’16), 2016.
[AGD⁺20] Brendan Avent, Javier González, Tom Diethe, Andrei Paleyes, and Borja Balle. Automatic discovery of privacy–utility pareto fronts. Proceedings on Privacy Enhancing Technologies, 2020(4):5–23, 2020.
[BBG18] Borja Balle, Gilles Barthe, and Marco Gaboardi. Privacy amplification by subsampling: Tight analyses via couplings and divergences. In Advances in Neural Information Processing Systems 31, NeurIPS ’18, pages 6277–6287. Curran Associates, Inc., 2018.
[BST14] Raef Bassily, Adam D. Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In FOCS, pages 464–473, 2014.
[CLE⁺19] Nicholas Carlini, Chang Liu, Úlfar Erlingsson, Jernej Kos, and Dawn Song. The secret sharer: Evaluating and testing unintended memorization in neural networks. In 28th USENIX Security Symposium (USENIX Security 19), pages 267–284, Santa Clara, CA, August 2019. USENIX Association.
[CMS11] Kamalika Chaudhuri, Claire Monteleoni, and Anand D Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12(3), 2011.
[CV13] Kamalika Chaudhuri and Staal A Vinterbo. A stability-based validation procedure for differentially private machine learning. Advances in Neural Information Processing Systems, 26:2652–2660, 2013.
[DHS11] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, 12(7), 2011.
[DKM⁺06] Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. Our data, ourselves: Privacy via distributed noise generation. In EUROCRYPT, volume 4004, pages 486–503. Springer, 2006.
[DMNS06] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Proceedings of the 3rd Conference on Theory of Cryptography, TCC ’06, pages 265–284, 2006.
[FJR15] Matt Fredrikson, Somesh Jha, and Thomas Ristenpart. Model inversion attacks that exploit confidence information and basic countermeasures. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security (CCS’15), 2015.
[GAYB17] R. Gylberth, R. Adnan, S. Yazid, and T. Basaruddin. Differentially private optimization algorithms for deep neural networks. In 2017 International Conference on Advanced Computer Science and Information Systems (ICACSIS), pages 387–394, 2017.
[HSS12] Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky. Neural networks for machine learning lecture 6a overview of mini-batch gradient descent. Cited on, 14(8), 2012.
[HZC21] Xin He, Kaiyong Zhao, and Xiaowen Chu. Automl: A survey of the state-of-the-art. Knowledge-Based Systems, 212:106622, 2021.
[KB14] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[KGGW15] Matt Kusner, Jacob Gardner, Roman Garnett, and Kilian Weinberger. Differentially private bayesian optimization. In International conference on machine learning, pages 918–927. PMLR, 2015.
[KH20] Antti Koskela and Antti Honkela. Learning rate adaptation for differentially private learning. In International Conference on Artificial Intelligence and Statistics, pages 2465–2475. PMLR, 2020.
[KOV15] Peter Kairouz, Sewoong Oh, and Pramod Viswanath. The composition theorem for differential privacy. In International conference on machine learning, pages 1376–1385. PMLR, 2015.
[LT19] Jingcheng Liu and Kunal Talwar. Private selection from private candidates. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pages 298–309, 2019.
[MAE⁺18] H Brendan McMahan, Galen Andrew, Ulfar Erlingsson, Steve Chien, Ilya Mironov, Nicolas Papernot, and Peter Kairouz. A general approach to adding differential privacy to iterative training procedures. arXiv preprint arXiv:1812.06210, 2018.
[NSH19] M. Nasr, R. Shokri, and A. Houmansadr. Comprehensive privacy analysis of deep learning: Passive and active white-box inference attacks against centralized and federated learning. In 2019 IEEE Symposium on Security and Privacy (SP’19), 2019.
[PCS⁺20] Nicolas Papernot, Steve Chien, Shuang Song, Abhradeep Thakurta, and Ulfar Erlingsson. Making the shoe fit: Architectures, initializations, and tuning for learning with privacy, 2020.
[PS21] Nicolas Papernot and Thomas Steinke. Hyperparameter tuning with renyi differential privacy. arXiv preprint arXiv:2110.03620, 2021.
[Qia99] Ning Qian. On the momentum term in gradient descent learning algorithms. Neural Networks, 1999.
[RHW86] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. nature, 323(6088):533–536, 1986.
[RTM⁺20] Swaroop Ramaswamy, Om Thakkar, Rajiv Mathews, Galen Andrew, H Brendan McMahan, and Françoise Beaufays. Training production language models without memorizing user data. arXiv preprint arXiv:2009.10031, 2020.
[SCS13] Shuang Song, Kamalika Chaudhuri, and Anand D. Sarwate. Stochastic gradient descent with differentially private updates. In GlobalSIP, pages 245–248, 2013.
[SRS17] Congzheng Song, Thomas Ristenpart, and Vitaly Shmatikov. Machine learning models that remember too much. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (CCS’17), 2017.
[SSA13] Kevin Swersky, Jasper Snoek, and Ryan P Adams. Multi-task bayesian optimization. Advances in Neural Information Processing Systems, 26:2004–2012, 2013.
[SSSS17] R. Shokri, M. Stronati, C. Song, and V. Shmatikov. Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP), 2017.
[TB21] Florian Tramer and Dan Boneh. Differentially private learning needs better features (or much more data). In International Conference on Learning Representations, 2021.
[WM10] Oliver Williams and Frank McSherry. Probabilistic inference and differential privacy. In NIPS, pages 2451–2459, 2010.
[YLP⁺19] L. Yu, L. Liu, C. Pu, M. E. Gursoy, and S. Truex. Differentially Private Model Publishing for Deep Learning. In 2019 IEEE Symposium on Security and Privacy (SP’19), 2019.
[YZC⁺21] Da Yu, Huishuai Zhang, Wei Chen, Jian Yin, and Tie-Yan Liu. Large scale private learning via low-rank reparametrization. International Conference of Machine Learning (ICML) 2021, 2021.
[ZFM⁺20] Pan Zhou, Jiashi Feng, Chao Ma, Caiming Xiong, Steven HOI, et al. Towards theoretically understanding why sgd generalizes better than adam in deep learning. arXiv preprint arXiv:2010.05627, 2020.

Appendix A LT Algorithm

\gamma\leq 1

\delta_{2}>0

, and sampling access to

Q(D)

1: Initialize the list

S=\emptyset

2: Initialize

\Upsilon=\frac{1}{\gamma}\log{\frac{1}{\delta_{2}}}

3: for

j\in[1,\Upsilon]

4: Draw

(x,q)\sim Q(D)

S\leftarrow S\cup(x,q)

6: Flip a

\gamma

-biased coin, output highest scored candidate from

S

and halt;

7: end for

8: Output highest scored candidate from

S

Algorithm 1 Hard stopping private selection algorithm for

(\varepsilon,\delta)

-DP input algorithms

Appendix B LT vs MA with varying candidate size

Continuing from Section 4, in this section we show an additional experiment in which we compare the LT (Liu and Talwar) and MA (Moments Accountant) algorithms with varying number of hyperparameter candidates.

In Figure 9, we run the LT and MA algorithms for $T=10000$ with $\sigma=4$ and $L=250$ with varying candidate size and compare the final privacy costs. The $\gamma$ value for the LT algorithm is set to $1/k$ , where the $k$ is the number of candidates. It can be seen that the privacy cost of LT (blue) remains almost constant for with increasing number of candidates. Figure 9 also demonstrates the exact number of candidates when the cost of MA (orange) remains below the LT cost. This insight is valuable in practice to a practitioner to decide the which algorithm to choose for hyperparameter tuning with respect to the number of candidates.

Appendix C Pruning hyperparameter grid for SGD

Figure 10 demonstrates a heat map plot of the candidate hyperparameter pairs for DPSGD. Each point on this heatmap is assigned a score (totalling 2400) that reflects how many times that $(\alpha,C)$ pair has performed the best among all the candidates, and we score across all iterations (at a granularity of every 100 iterations) of training.

We justify this as a fair metric of ‘goodness’, for candidates as one could in practice stop training at any iteration. Furthermore this metric is quite critical of quality, in that it only awards a hyperparameter set a point, if it appeared as the best candidate at one of the intervals. Hence we deem this to be a generous pruning of the search space, which will imbue the best possible advantage to DPSGD with regards to a pruned hyperparameter search space.

Appendix D Implementation Details

The code for our paper is written in Python3.6 using the PyTorch library. The implementation of all private optimizers are done using the Pyvacy library²²2https://github.com/ChrisWaites/pyvacy. We run our code on ComputeCanada servers. Each allocation of the server includes 2 CPU cores, 8 GB RAM and 1 GPU from the list – P100, V100, K80. We report results from all our experiments after averaging over 3 runs. The code is attached with our supplementary material submission.

All datasets used in our experiments are publicly available. We split all datasets into 80% training and 20% validation sets. For our experiments, we assume that all our datasets start in its preprocessed state, i.e. the numerical features are scaled to the range [0,1], as is standard practice in machine learning. However when considering an end-to-end private algorithm, this preprocessing itself may need to be performed in a privacy-preserving fashion. In this work, we do not account for privacy in this step. Note that for our work this only effects the ENRON and Adult datasets, where scaling the values does require computing the maximum possible values of features in a differentially-private fashion, whereas the max values for image datasets (Gisette and MNIST) are known a priori due to max pixel value and does not involve any privacy cost.

Appendix E Omitted Pseudocode for $\mathsf{DPAdamWOSM}$

0: Training set

A:\{x_{1},...,x_{n}\}

, Loss function

\mathcal{L}(\theta)

, Parameters: Lot size

L

, Learning rate

\alpha

, Gradient norm bound

C

, Noise scale

\sigma

, Total number of iterations

T

, Exponential decay rate

\beta_{1}

1: Initialize model with

\theta_{0}

randomly

2: Initialize first moment vector

m_{0}=0

3: Set learning rate to ESS

\alpha=\frac{10^{-3}}{(\sigma C/L)+10^{-8}}

;

4: for

t\in[1,T]

5: Sample a random subset

L_{t}\subseteq A

, by independently including each element of

A

with probability

L/n

6: Compute gradient

\forall x_{i}\in L_{t}

g_{t}(x_{i})=\nabla_{\theta}\mathcal{L}(\theta_{t},x_{i})

7: Clip each gradient in

\ell_{2}

norm to

C

\bar{g}_{t}(x_{i})=g_{t}(x_{i})/\max(1,\frac{\|g_{t}(x_{i})\|_{2}}{C})

8: Add noise

\tilde{g}_{t}=\frac{1}{|L|}(\sum_{i}\bar{g}_{t}(x_{i})+\mathcal{N}(0,\sigma^{2}C^{2}I))

9: Exponentially average the first moment

m_{t}=\beta_{1}\cdot m_{t-1}+(1-\beta_{1})\cdot\tilde{g}_{t}

10: Perform bias correction

\hat{m}_{t}=\frac{m_{t}}{1-\beta_{1}^{t}}

11: Update model

\theta_{t}=\theta_{t-1}-\alpha\cdot\hat{m}_{t}

12: end for

13: Compute privacy cost using Moments Accountant.

Algorithm 2 Optimization using

\mathsf{DPAdamWOSM}

Appendix F Proof of Theorem 2

Theorem F.1.

Proof.

\begin{split}\mbox{\bf E}[\|x_{t+1}-x^{*}\|^{2}\Bigm{|}x_{t}]&=\mbox{\bf E}[\|x_{t}-\alpha(g_{t}+z_{t})-x^{*}\|^{2}\Bigm{|}x_{t}]\\ &=\mbox{\bf E}[\|x_{t}-x^{*}\|^{2}-2\alpha(g_{t}+z_{t})(x_{t}-x^{*})+\alpha^{2}\|(g_{t}+z_{t})\|^{2}\Bigm{|}x_{t}]\\ &=\|x_{t}-x^{*}\|^{2}-2\alpha\mbox{\bf E}[(g_{t}+z_{t})\Bigm{|}x_{t}]^{T}(x_{t}-x^{*})+\alpha^{2}\mbox{\bf E}[\|(g_{t}+z_{t})\|^{2}\Bigm{|}x_{t}]\\ &\leq\|x_{t}-x^{*}\|^{2}-2\alpha[f(x_{t})-f(x^{*})]+\alpha^{2}\mbox{\bf E}[\|(g_{t}+z_{t})\|^{2}\Bigm{|}x_{t}]\\ \end{split}

The inequality is due to convexity of the loss function and $E[(g_{t}+z_{t})]=g_{t}$ due to 0-mean noise. Taking expectation on both sides and reordering,

\begin{split}2\alpha[f(x_{t})-f(x^{*})]&\leq\mbox{\bf E}[\|x_{t+1}-x^{*}\|^{2}]-\mbox{\bf E}[\|x_{t}-x^{*}\|^{2}]+\alpha^{2}\mbox{\bf E}[\|(g_{t}+z_{t})\|^{2}\\ &\leq\mbox{\bf E}[\|x_{t+1}-x^{*}\|^{2}]-\mbox{\bf E}[\|x_{t}-x^{*}\|^{2}]+\alpha^{2}(C^{2}+C^{2}\sigma^{2})\\ \end{split}

Summing for T steps and dividing both sides by $2\alpha T$ ,

\mbox{\bf E}[f(\frac{1}{T}\sum_{i}^{T}x_{t})-f(x*)]\leq\frac{R^{2}}{2\alpha T}+\frac{\alpha C^{2}(1+\sigma^{2})}{2}

(1)

Taking derivative and finding best value of $\alpha$ ,

\alpha_{opt}=\frac{R}{C\sqrt{1+\sigma^{2}}T}

Plugging $\alpha_{opt}$ to Eq. 1,

\mbox{\bf E}[f(\frac{1}{T}\sum_{i}^{T}x_{t})-f(x*)]\leq\frac{RC\sqrt{1+\sigma^{2}}}{\sqrt{T}}

∎

Appendix G Additional experiment results for Section 6 and Section 7

In Figures 11 and 12, we display our results for the same experiments described in Section 6, with $\sigma=2$ , and $\sigma=8$ respectively. Similarly Figure 13 and 14 displays our results of the experiments detailed in Section 7 with $\sigma=2$ , and $\sigma=8$ .