
Adaptive Second Order Coresets for Data-efficient Machine Learning

Omead Pooladzandi    David Davini    Baharan Mirzasoleiman
Abstract

Training machine learning models on massive datasets incurs substantial computational costs. To alleviate such costs, there has been a sustained effort to develop data-efficient training methods that can carefully select subsets of the training examples that generalize on par with the full training data. However, existing methods are limited in providing theoretical guarantees for the quality of the models trained on the extracted subsets, and may perform poorly in practice. We propose AdaCore, a method that leverages the geometry of the data to extract subsets of the training examples for efficient machine learning. The key idea behind our method is to dynamically approximate the curvature of the loss function via an exponentially-averaged estimate of the Hessian to select weighted subsets (coresets) that provide a close approximation of the full gradient preconditioned with the Hessian. We prove rigorous guarantees for the convergence of various first and second-order methods applied to the subsets chosen by AdaCore. Our extensive experiments show that AdaCore extracts coresets with higher quality compared to baselines and speeds up training of convex and non-convex machine learning models, such as logistic regression and neural networks, by over 2.9x over the full data and 4.5x over random subsets. Code is available at https://github.com/opooladz/AdaCore.


1 Introduction

Large datasets have been crucial for the success of modern machine learning models. Learning from massive datasets, however, incurs substantial computational costs and becomes very challenging (Asi & Duchi, 2019; Strubell et al., 2019; Schwartz et al., 2019). Crucially, not all data points are equally important for learning (Birodkar et al., 2019; Katharopoulos & Fleuret, 2018; Toneva et al., 2018). While several examples can be excluded from training without harming the accuracy of the final model (Birodkar et al., 2019; Toneva et al., 2018), other points need to be trained on many times to be learned (Birodkar et al., 2019). To improve scalability of machine learning, it is essential to theoretically understand and quantify the value of different data points on training and optimization. This allows identifying examples that contribute the most to learning and safely excluding those that are redundant or non-informative.

To find essential data points, recent empirical studies used heuristics such as the fully trained or a smaller proxy model’s uncertainty (entropy of predicted class probabilities) (Coleman et al., 2020), or forgetting events (Toneva et al., 2018) to identify examples that frequently transition from being classified correctly to incorrectly. Others employ either the gradient norm (Alain et al., 2015; Katharopoulos & Fleuret, 2018) or the loss (Loshchilov & Hutter, 2015; Schaul et al., 2015) to sample important points that reduce variance of stochastic optimization methods. Such methods, however, do not provide any theoretical guarantee for the quality of the trained model on the extracted examples.

Quantifying the importance of different data points without training a model to convergence is very challenging. First, the value of each example cannot be measured without updating the model parameters and measuring the loss or accuracy. Second, as the effect of different data points changes throughout training, their value cannot be precisely measured before training converges. Third, to eliminate redundancies, one needs to look at the importance of individual data points as well as the higher-order interactions between data points. Finally, one needs to provide theoretical guarantees for the performance and convergence of the model trained on the extracted data points.

Here, we focus on finding data points that contribute the most to learning and automatically excluding redundancies while training a model. A practical and effective approach is to carefully select a small subset of training examples that closely approximate the full gradient, i.e., the sum of the gradients over all the training data points. This idea has been recently employed to find a subset of data points that guarantees convergence of first-order methods to a near-optimal solution for training convex models (Mirzasoleiman et al., 2020). However, modern machine learning models are high dimensional and non-convex in nature. In such scenarios, subsets selected based on gradient information only capture the gradient along the sharp dimensions, and lack diversity within groups of examples with similar training dynamics. Hence, they represent large groups of examples with a few data points carrying substantial weights. This introduces a large error in the gradient estimation and causes first-order coresets to perform poorly.

We propose ADAptive second-order COREsets (AdaCore), a method that incorporates the geometry of the data to iteratively select weighted subsets (coresets) of training examples that capture the gradient of the loss preconditioned with the Hessian, by maximizing a submodular function. Such subsets capture the curvature of the loss landscape along different dimensions, and provide convergence guarantees for first and second-order methods. As naively using the Hessian at every iteration is prohibitively expensive for overparameterized models, AdaCore relies on Hessian-free methods to extract coresets that capture the full gradient preconditioned by the Hessian diagonal. Furthermore, AdaCore exponentially averages first and second-order information in order to smooth the noise in the local gradient and curvature information.

We first provide a theoretical analysis of our method and prove its convergence for convex and non-convex functions. For a $\beta$-smooth and $\alpha$-strongly convex loss function and a subset $S$ selected by AdaCore that estimates the full preconditioned gradient with an error of at most $\epsilon$, we prove that Newton's method and AdaHessian applied to $S$ with constant stepsize $\eta=\alpha/\beta$ converge to a $\beta\epsilon/\alpha$ neighborhood of the optimal solution at an exponential rate. For non-convex overparameterized functions such as deep networks, we prove that for a $\beta$-smooth and $\mu$-PL loss function satisfying $\|\nabla\mathcal{L}(w)\|^{2}/2\geq\mu\mathcal{L}(w)$, (stochastic) gradient descent applied to subsets found by AdaCore has training dynamics similar to those of training on the full data, and converges at an exponential rate. In both cases, AdaCore leads to a speedup by training on smaller subsets.

Next, we empirically study the examples selected by AdaCore during training. We show that as training continues, AdaCore selects more uncertain or forgettable samples. Hence, AdaCore effectively determines the value of every learning example, i.e., when and how many times a sample needs to be trained on, and automatically excludes redundant and non-informative instances. Importantly, incorporating curvature in selecting coresets allows AdaCore to quantify the value of training examples more accurately, and find fewer but more diverse samples than existing methods.

We demonstrate the effectiveness of various first and second-order methods, namely SGD with momentum, Newton's method, and AdaHessian, applied to AdaCore for training models with a convex loss function (logistic regression) as well as models with non-convex loss functions, namely ResNet-20, ResNet-18, and ResNet-50, on MNIST, CIFAR10, (Imbalanced) CIFAR100, and BDD100k (Deng, 2012; Krizhevsky et al., 2009; Yu et al., 2020). Our experiments show that AdaCore can effectively extract crucial samples for machine learning, resulting in higher accuracy while achieving over 2.9x speedup over the full data and 4.5x over random subsets, for training models with convex and non-convex loss functions.

2 Related Work

Data-efficient methods have recently gained a lot of interest. However, existing methods often require training the original (Birodkar et al., 2019; Ghorbani & Zou, 2019; Toneva et al., 2018) or a proxy model (Coleman et al., 2020) to convergence, and use features or predictions of the trained model to find subsets of examples that contribute the most to learning. While these results empirically confirm the existence of notable semantic redundancies in large datasets (Birodkar et al., 2019), such methods cannot identify the crucial subsets before fully training the original or the proxy model on the entire dataset. Most importantly, such methods do not provide any theoretical guarantees for the model’s performance trained on the extracted subsets.

There have been recent efforts to take advantage of the difference in importance among various samples to reduce the variance and improve the convergence rate of stochastic optimization methods. Those that are applicable to overparameterized models employ either the gradient norm (Alain et al., 2015; Katharopoulos & Fleuret, 2018) or the loss (Loshchilov & Hutter, 2015; Schaul et al., 2015) to compute each sample's importance. However, these methods do not provide rigorous convergence guarantees and cannot provide a notable speedup. A recent study proposed a method, Craig, to find subsets of samples that closely approximate the full gradient, i.e., the sum of the gradients over all the training samples (Mirzasoleiman et al., 2020). Craig finds the subsets by maximizing a submodular function, and provides convergence guarantees to a neighborhood of the optimal solution for strongly-convex models. GradMatch (Killamsetty et al., 2021) addresses the same objective using orthogonal matching pursuit (OMP), and Glister (Killamsetty et al., 2020) aims at finding subsets that closely approximate the gradient of a held-out validation set. However, Glister requires a validation set, and GradMatch uses OMP, which may return subsets as small as 0.1% of the intended size; such subsets are then augmented with random samples. In contrast, our method finds subsets of higher quality by preconditioning the gradient with Hessian information.

3 Background and Problem Setting

Training machine learning models often reduces to minimizing an empirical risk function. Given a not-necessarily convex loss $\mathcal{L}$, one aims to find a model parameter vector $w_{*}$ in the parameter space $\mathcal{W}$ that minimizes the loss $\mathcal{L}$ over the training data:

$$w_{*}\in\arg\min_{w\in\mathcal{W}}\mathcal{L}(w),\qquad \mathcal{L}(w):=\sum_{i\in V}l_{i}(w),\quad l_{i}(w)=l(f(x_{i},w),y_{i}). \tag{1}$$

Here, $V=\{1,\dots,n\}$ is an index set of the training data, $w\in\mathbb{R}^{d}$ is the vector of parameters of the model $f$ being trained, and $l_{i}$ is the loss function associated with training example $i\in V$ with feature vector $x_{i}\in\mathbb{R}^{d}$ and label $y_{i}$. We denote the gradient of the loss w.r.t. the model parameters by $\mathbf{g}=\nabla\mathcal{L}(w)=\frac{1}{|V|}\sum_{i\in V}\frac{\partial l_{i}}{\partial w}$, and the corresponding second derivative (i.e., Hessian) by $\mathbf{H}=\nabla^{2}\mathcal{L}(w)=\frac{1}{|V|}\sum_{i\in V}\nabla^{2}l_{i}(w)$.

First-order gradient methods are popular for solving Problem (1). They start from an initial point $w_{0}$ and, at every iteration $t$, step in the negative direction of the gradient $\mathbf{g}_{t}$ multiplied by the learning rate $\eta_{t}$. The most popular first-order method is Stochastic Gradient Descent (SGD) (Robbins & Monro, 1951):

$$w_{t+1}=w_{t}-\eta_{t}\mathbf{v}_{t},\qquad \mathbf{v}_{t}=\mathbf{g}_{t}. \tag{2}$$

SGD is often used with momentum, i.e., $\mathbf{v}_{t}=\beta\mathbf{v}_{t-1}+(1-\beta)\mathbf{g}_{t}$ with $\beta\in[0,1]$, which accelerates it in dimensions whose gradients point in the same direction and dampens oscillations in dimensions whose gradients change direction (Qian, 1999). For larger datasets, mini-batch SGD is used, where $\mathbf{v}_{t}=\frac{1}{m}\sum_{j=1}^{m}\nabla l_{i_{t}^{(j)}}(w_{t})$ and $m$ is the size of the mini-batch whose indices $\{i_{t}^{(1)},\ldots,i_{t}^{(m)}\}$ are drawn uniformly with replacement from $V$ at each iteration $t$.

Second-order gradient methods rely on the geometry of the problem to automatically rotate and scale the gradient vectors using the curvature of the loss landscape. In doing so, second-order methods can choose a better descent direction and automatically adjust the learning rate for each parameter. Hence, second-order methods have superior convergence properties compared to first-order methods. Newton's method (Bertsekas, 1982) is a classical second-order method that preconditions the gradient vector with the inverse of the local Hessian $\mathbf{H}_{t}^{-1}$ at every iteration:

$$w_{t+1}=w_{t}-\eta_{t}\mathbf{H}_{t}^{-1}\mathbf{g}_{t}. \tag{3}$$

As inverting the Hessian matrix requires quadratic memory and cubic computational complexity, several methods approximate Hessian information to significantly reduce time and memory complexity (Nocedal, 1980; Schaul et al., 2013; Martens & Grosse, 2015; Xu et al., 2020). In particular, AdaHessian (Yao et al., 2020) directly approximates the diagonal of the Hessian and relies on exponential moving averaging and block diagonal averaging to smooth out and reduce the variation of the Hessian diagonal.
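To make these update rules concrete, the following is a minimal Python sketch (ours, not the paper's implementation) of the momentum update in Eq. (2) and a diagonally preconditioned step in the spirit of Eq. (3), where the full Hessian inverse is replaced by an approximate Hessian diagonal as in AdaHessian.

```python
import numpy as np

def sgd_momentum_step(w, v, grad, lr, beta=0.9):
    """SGD with momentum, Eq. (2): v_t = beta * v_{t-1} + (1 - beta) * g_t."""
    v = beta * v + (1 - beta) * grad
    return w - lr * v, v

def diag_newton_step(w, grad, hess_diag, lr, eps=1e-12):
    """Diagonal variant of the Newton update, Eq. (3): inverting a diagonal
    preconditioner reduces to elementwise division; the absolute value is a
    common safeguard when the loss is non-convex."""
    return w - lr * grad / (np.abs(hess_diag) + eps)
```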

4 AdaCore: Adaptive Second order Coresets

The key idea behind our proposed method is to leverage the geometry of the data, specifically the curvature of the loss landscape, to select subsets of the training examples that enable fast convergence. Here, we first discuss why coresets that only capture the full gradient perform poorly in various scenarios. Then, we show how to incorporate curvature information into subset selection for training convex and non-convex models with provable convergence guarantees, thereby ameliorating the problems of first-order coresets.

4.1 When First-order Coresets Fail

First-order coreset methods iteratively select weighted subsets of training data that closely approximate the full gradient at particular values of $w_{t}$, e.g., at the beginning of every epoch (Killamsetty et al., 2021, 2020; Mirzasoleiman et al., 2020):

$$S^{*}_{t}=\underset{S\subseteq V,\,\gamma_{t,j}\geq 0\ \forall j}{\arg\min}\,|S|\quad\textrm{s.t.}\quad\Big\|\mathbf{g}_{t}-\sum_{j\in S}\gamma_{t,j}\mathbf{g}_{t,j}\Big\|\leq\epsilon, \tag{4}$$

where $\mathbf{g}_{t,j}$ and $\gamma_{t,j}>0$ are the gradient and the weight of element $j$ in the coreset $S$. Such subsets often perform poorly for high-dimensional and non-convex functions, for the following reasons: (1) the scale of the gradient $\mathbf{g}\in\mathbb{R}^{d}$ often differs across dimensions. Hence, the selected subsets estimate the full gradient closely only along dimensions with a larger gradient scale. This can introduce a significant error in the optimization trajectory for both convex and non-convex loss functions; (2) the loss functions associated with different data points $l_{i}$ may have similar gradients but very different curvature properties at a particular $w_{t}$. Thus, for a small $\delta>0$, the gradients $\nabla l_{i}(w_{t}+\delta)$ at $w_{t}+\delta$ may be totally different from the gradients $\nabla l_{i}(w_{t})$ at $w_{t}$. Consequently, subsets that capture the gradient well at a particular point during training may not provide a close approximation of the full gradient after a few gradient updates, e.g., mini-batches. This often results in inferior performance, particularly when selecting larger subsets for non-convex loss functions; (3) subsets that only capture the gradient select one representative example with a large weight from data points with similar gradients at $w_{t}$. Such subsets lack diversity and cannot distinguish different subgroups of the data. Importantly, the large weights introduce a substantial error in estimating the full gradient and result in poor performance, as we show in Fig. 7 in the Appendix.

4.2 Adaptive Second-order Coresets

To address the above issues, our main idea is to select subsets of training examples that capture the full gradient preconditioned with the curvature of the loss landscape. In doing so, we normalize the gradient by multiplying it by the Hessian inverse, $\mathbf{H}^{-1}\mathbf{g}$, before selecting the subsets. This allows selecting subsets that (1) can capture the full gradient in all dimensions equally well; (2) contain a more diverse set of data points with similar gradients, but different curvature properties; and (3) allow adaptive first and second-order methods trained on the coresets to obtain similar training dynamics to that of training on the full data.

Formally, our goal in AdaCore is to adaptively find the smallest subset $S\subseteq V$ and corresponding per-element weights $\gamma_{j}>0$ that approximate the full gradient preconditioned with the Hessian matrix, with an error of at most $\epsilon>0$ at every iteration $t$, i.e.:

$$S^{*}_{t}=\underset{S\subseteq V,\,\gamma_{t,j}\geq 0\ \forall j}{\arg\min}\,|S|\quad\textrm{s.t.}\quad\Big\|\mathbf{H}^{-1}_{t}\mathbf{g}_{t}-\sum_{j\in S}\gamma_{t,j}\mathbf{H}_{t,j}^{-1}\mathbf{g}_{t,j}\Big\|\leq\epsilon, \tag{5}$$

where $\mathbf{H}^{-1}_{t}\mathbf{g}_{t}$ and $\sum_{j\in S}\gamma_{t,j}\mathbf{H}_{t,j}^{-1}\mathbf{g}_{t,j}$ are the preconditioned gradients of the full data and of the subset $S$, respectively.

4.3 Scaling up to Over-parameterized Models

Directly solving the optimization problem (5) requires explicit calculation and storage of the Hessian matrix and its inverse. This is infeasible for large models such as neural networks. In the following, we first address the issue of calculating the inverse Hessian at every iteration. Then, we discuss how to efficiently find a near-optimal subset that estimates the full preconditioned gradient by solving Eq. (5).

Approximating the Gradients

For neural networks, the derivative of the loss $\mathcal{L}$ w.r.t. the input to the last layer (Katharopoulos & Fleuret, 2018; Mirzasoleiman et al., 2020) or the penultimate layer (Killamsetty et al., 2021) captures the variation of the gradient norm well. We extend these results (Appendix B.2) to show that the norm of the difference between the preconditioned gradients of two data points can be efficiently upper-bounded by:

$$\|\mathbf{H}_{i}^{-1}\mathbf{g}_{i}-\mathbf{H}_{j}^{-1}\mathbf{g}_{j}\|\leq c_{1}\big\|\Sigma^{\prime}_{L}(z_{i}^{(L)})(\mathbf{H}_{i}^{-1}\mathbf{g}_{i})^{(L)}-\Sigma^{\prime}_{L}(z_{j}^{(L)})(\mathbf{H}_{j}^{-1}\mathbf{g}_{j})^{(L)}\big\|+c_{2}, \tag{6}$$

where $\Sigma^{\prime}_{L}(z_{i}^{(L)})(\mathbf{H}_{i}^{-1}\mathbf{g}_{i})^{(L)}$ is the gradient preconditioned by the inverse of the Hessian of the loss w.r.t. the input to the last layer for data point $i$, and $c_{1},c_{2}$ are constants. Since the upper bound depends on the weight parameters, we need to update the subset $S$ using AdaCore during training.

Calculating the last-layer gradient often requires only a forward pass, which is as expensive as calculating the loss, and does not require any extra storage. For example, with a softmax as the last layer, the gradient of the loss w.r.t. the $i^{th}$ input to the softmax is $p_{i}-y_{i}$, where $p_{i}$ is the $i^{th}$ output of the softmax and $y$ is the one-hot encoded label with the same dimensionality as the number of classes. Using this low-dimensional approximation $\hat{\mathbf{g}}_{i}$ of the gradient $\mathbf{g}_{i}$, we can efficiently calculate the preconditioned gradient for every data point. For non-convex functions, the local gradient information can be very noisy. To smooth out the local gradient information and obtain a better approximation of the global gradient, we apply an exponential moving average with parameter $0<\beta_{1}<1$ to the low-dimensional gradient approximations:

$$\overline{\mathbf{g}}_{t}=\frac{(1-\beta_{1})\sum_{i=1}^{t}\beta_{1}^{t-i}\hat{\mathbf{g}}_{i}}{1-\beta_{1}^{t}}. \tag{7}$$
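As an illustration, here is a minimal PyTorch-style sketch (ours; names and signatures are illustrative, not from the released code) of the last-layer gradient proxy described above and the bias-corrected exponential moving average of Eq. (7); `model` is assumed to return pre-softmax logits.

```python
import torch
import torch.nn.functional as F

def last_layer_grad_proxy(model, x, y, num_classes):
    """Low-dimensional gradient proxy: for a softmax/cross-entropy head, the
    gradient of the loss w.r.t. the logits of example i is softmax(z_i) - onehot(y_i)."""
    with torch.no_grad():
        logits = model(x)                                   # (n, num_classes)
        return F.softmax(logits, dim=1) - F.one_hot(y, num_classes).float()

class BiasCorrectedEMA:
    """Exponential moving average with bias correction, matching Eq. (7)."""
    def __init__(self, beta):
        self.beta, self.m, self.t = beta, None, 0

    def update(self, g_hat):
        self.t += 1
        if self.m is None:
            self.m = (1 - self.beta) * g_hat
        else:
            self.m = self.beta * self.m + (1 - self.beta) * g_hat
        return self.m / (1 - self.beta ** self.t)           # smoothed proxy g_bar_t
```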

Approximating the Hessian Preconditioner

Since it is infeasible to calculate, store, and invert the full Hessian matrix at every iteration, we use an inexact Newton method, where an approximate Hessian operator is used instead of the full Hessian. To efficiently calculate the Hessian diagonal, we first use the Hessian-free method (Yao et al., 2018) to compute the product of the Hessian $\mathbf{H}_{t}$ and a random vector $z$ with Rademacher distribution. To do so, we backpropagate through the low-dimensional gradient estimates multiplied by $z$ to get $\mathbf{H}_{t}z=\partial(\hat{\mathbf{g}}_{t}^{T}z)/\partial w_{t}$. We can then use Hutchinson's method to obtain a stochastic estimate of the diagonal of the Hessian matrix as follows:

$$\text{diag}(\mathbf{H}_{t})=\mathbb{E}[z\odot(\mathbf{H}_{t}z)], \tag{8}$$

without having to form the Hessian matrix explicitly (Bekas et al., 2007). The diagonal approximation has the same convergence rate as using the full Hessian for strongly convex and strictly smooth functions (proof in Appendix A.1). Nevertheless, our method can be applied to general machine learning problems, such as deep networks as well as regularized classical models (e.g., SVM, LASSO), which are strongly convex. To smooth out the noisy local curvature and get a better approximation of the global Hessian information, we apply an exponential moving average with parameter $0<\beta_{2}<1$ to the Hessian-diagonal estimate in Eq. (8):

$$\overline{\mathbf{H}}_{t}=\sqrt{\frac{(1-\beta_{2})\sum_{i=1}^{t}\beta_{2}^{t-i}\,\text{diag}(\mathbf{H}_{i})\odot\text{diag}(\mathbf{H}_{i})}{1-\beta_{2}^{t}}}. \tag{9}$$
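A sketch (ours, under the assumptions above) of how Eqs. (8) and (9) can be computed with standard autograd: a Hessian-free Hessian-vector product through the gradient, Hutchinson sampling with Rademacher vectors, and AdaHessian-style smoothing of the resulting diagonal.

```python
import torch

def hutchinson_hessian_diag(loss, params, n_samples=1):
    """Stochastic estimate of diag(H), Eq. (8): diag(H) ~ E[z * (Hz)] with
    Rademacher z, where Hz is obtained by differentiating g^T z (Hessian-free),
    without ever forming H."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    diag = [torch.zeros_like(p) for p in params]
    for _ in range(n_samples):
        zs = [2.0 * torch.randint_like(p, 2) - 1.0 for p in params]  # entries in {-1, +1}
        gz = sum((g * z).sum() for g, z in zip(grads, zs))
        hzs = torch.autograd.grad(gz, params, retain_graph=True)
        for d, z, hz in zip(diag, zs, hzs):
            d.add_(z * hz / n_samples)
    return diag

def smoothed_hessian_diag(ema_sq, diag, beta2, t):
    """AdaHessian-style smoothing, Eq. (9): square root of the bias-corrected EMA
    of the elementwise-squared Hessian diagonal. ema_sq starts as zeros; returns
    the updated EMA state and the smoothed diagonal H_bar_t."""
    ema_sq = beta2 * ema_sq + (1 - beta2) * diag * diag
    return ema_sq, torch.sqrt(ema_sq / (1 - beta2 ** t))
```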

Using the exponentially averaged gradient and Hessian approximations in Eqs. (7) and (9), the preconditioned gradients in Eq. (5) can be approximated as follows:

$$S^{*}_{t}=\underset{S\subseteq V,\,\gamma_{t,j}\geq 0\ \forall j}{\arg\min}\,|S|\quad\textrm{s.t.}\quad\Big\|\overline{\mathbf{H}}_{t}^{-1}\overline{\mathbf{g}}_{t}-\sum_{j\in S}\gamma_{t,j}\overline{\mathbf{H}}_{t,j}^{-1}\overline{\mathbf{g}}_{t,j}\Big\|\leq\epsilon. \tag{10}$$
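Putting Eqs. (7), (9), and (10) together, the per-example quantities that enter the selection step can be formed by elementwise division, since inverting a diagonal preconditioner amounts to dividing by its entries. A minimal sketch (the function name is ours):

```python
import numpy as np

def preconditioned_proxies(g_bar, h_bar, eps=1e-12):
    """Rows of g_bar are the smoothed last-layer gradient proxies of Eq. (7);
    h_bar is the smoothed Hessian-diagonal estimate of Eq. (9) restricted to the
    same block, either shared (shape (d,)) or per-example (shape (n, d)).
    Broadcasting handles both cases; each row approximates H_bar^{-1} g_bar."""
    return g_bar / (np.abs(h_bar) + eps)
```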

Next, we discuss how to efficiently find near-optimal weighted subsets that closely approximate the full preconditioned gradient by solving Eq. (5).

4.4 Extracting Second-order Coresets

The subset selection problem (5) is NP-hard (Natarajan, 1995). However, it can be considered a special case of the sparse vector approximation problem, which has been studied in the literature, including convex optimization formulations, e.g., basis pursuit (Chen et al., 2001), sparse projections (Pilanci et al., 2012; Kyrillidis et al., 2013), LASSO (Tibshirani, 1996), and compressed sensing (Donoho, 2006). These methods, however, are expensive to solve and often require tuning regularization coefficients and thresholding to ensure cardinality constraints. More recently, the connection between sparse modeling and submodular optimization has been demonstrated (Elenberg et al., 2018; Mirzasoleiman et al., 2020). (A set function $F:2^{V}\rightarrow\mathbb{R}^{+}$ is submodular if $F(S\cup\{e\})-F(S)\geq F(T\cup\{e\})-F(T)$ for any $S\subseteq T\subseteq V$ and $e\in V\setminus T$.) The advantage of submodular optimization is that a fast and simple greedy algorithm often provides a near-optimal solution. Next, we briefly discuss how submodularity can be used to find a near-optimal solution for Eq. (5). We build on the recent result of Mirzasoleiman et al. (2020), which showed that the error of estimating an expectation by a weighted sum of a subset of elements is upper-bounded by a submodular facility location function. In particular, via the above result, we get:

$$\min_{S\subseteq V}\Big\|\overline{\mathbf{H}}_{t}^{-1}\overline{\mathbf{g}}_{t}-\sum_{j\in S}\gamma_{t,j}\overline{\mathbf{H}}_{t,j}^{-1}\overline{\mathbf{g}}_{t,j}\Big\|\leq\sum_{i\in V}\min_{j\in S}\big\|\overline{\mathbf{H}}_{t,i}^{-1}\overline{\mathbf{g}}_{t,i}-\overline{\mathbf{H}}_{t,j}^{-1}\overline{\mathbf{g}}_{t,j}\big\|. \tag{11}$$

Setting the upper bound on the right-hand side of Eq. (11) to be at most $\epsilon$ yields the smallest weighted subset $S^{*}$ that approximates the full preconditioned gradient with an error of at most $\epsilon$ at iteration $t$. Formally, we wish to solve the following optimization problem:

$$S^{*}\in\arg\min_{S\subseteq V}|S|,\quad\text{s.t.}\quad L(S)=\sum_{i\in V}\min_{j\in S}\big\|\overline{\mathbf{H}}_{t,i}^{-1}\overline{\mathbf{g}}_{t,i}-\overline{\mathbf{H}}_{t,j}^{-1}\overline{\mathbf{g}}_{t,j}\big\|\leq\epsilon. \tag{12}$$

By introducing a phantom example $e$, we can turn the minimization problem (12) into the following submodular cover problem with a facility location objective $F(S)$:

$$S^{*}\in\underset{S\subseteq V}{\arg\min}\,|S|,\quad\text{s.t.}\quad F(S)=C_{1}-L(S\cup\{e\})\geq C_{1}-\epsilon, \tag{13}$$

where $C_{1}=L(\{e\})$ is a constant upper-bounding the value of $L(S)$. The subset $S^{*}$ obtained by solving the maximization problem (13) consists of the medoids of the preconditioned gradients, and the weight $\gamma_{j}$ of each $j\in S^{*}$ is the number of elements closest to medoid $j$, i.e., $\gamma_{j}=\sum_{i\in V}\mathbb{I}\big[j=\arg\min_{s\in S}\|\overline{\mathbf{H}}_{t,i}^{-1}\overline{\mathbf{g}}_{t,i}-\overline{\mathbf{H}}_{t,s}^{-1}\overline{\mathbf{g}}_{t,s}\|\big]$. For the above submodular cover problem, the classical greedy algorithm provides a logarithmic approximation guarantee, $|S|\leq\big(1+\ln(\max_{e}F(e|\emptyset))\big)|S^{*}|$ (Wolsey, 1982). The greedy algorithm starts with the empty set $S_{0}=\emptyset$ and, at each iteration $t$, chooses an element $e\in V$ that maximizes the marginal utility $F(e|S_{t})=F(S_{t}\cup\{e\})-F(S_{t})$. Formally, $S_{t}=S_{t-1}\cup\{\arg\max_{e\in V}F(e|S_{t-1})\}$. The computational complexity of the greedy algorithm is $\mathcal{O}(nk)$. However, its complexity can be reduced to $\mathcal{O}(|V|)$ using stochastic methods (Mirzasoleiman et al., 2015), and can be further improved using lazy evaluation (Minoux, 1978) and distributed implementations (Mirzasoleiman et al., 2013). The pseudocode can be found in Alg. 1 in Appendix A.3.
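For concreteness, a compact NumPy sketch (ours) of the classical greedy for the facility location objective in (13), operating on the per-example preconditioned proxies. The paper additionally relies on stochastic and lazy greedy variants for efficiency, which this sketch omits; in practice, selection is also run per class on low-dimensional proxies, which keeps the pairwise distance matrix small.

```python
import numpy as np

def greedy_facility_location(P, k):
    """Greedily select k medoids of the rows of P (per-example preconditioned
    gradient proxies) by maximizing F(S) = sum_i (C - min_{j in S} ||P_i - P_j||),
    and return their weights gamma_j = number of points assigned to medoid j."""
    n = P.shape[0]
    dists = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)  # (n, n) pairwise
    best = np.full(n, dists.max())        # plays the role of the phantom element e
    selected = []
    for _ in range(k):
        # marginal gain of candidate j: total reduction of the assignment cost
        gains = np.maximum(best[None, :] - dists, 0.0).sum(axis=1)
        j = int(np.argmax(gains))
        selected.append(j)
        best = np.minimum(best, dists[j])
    assign = dists[np.array(selected)].argmin(axis=0)   # nearest selected medoid per point
    weights = np.bincount(assign, minlength=k)          # gamma_j, summing to n
    return np.array(selected), weights
```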

One coreset for convex functions

For convex functions, normed gradient differences between data points can be efficiently upper-bounded by the normed difference between feature vectors (Allen-Zhu et al., 2016; Hofmann et al., 2015; Mirzasoleiman et al., 2020). We apply a similar idea to upper-bound the normed difference between preconditioned gradients. This allows us to find one subset before the training. See proof in Appendix B.1.

4.5 Convergence Analysis

Here, we analyze the convergence rate of first and second-order methods applied to the weighted subsets $S$ found by AdaCore. By solving Eq. (13) at every iteration $t$, AdaCore finds subsets that approximate the full preconditioned gradient with an error of at most $\epsilon$, i.e., $\|\mathbf{H}^{-1}_{t}\mathbf{g}_{t}-\sum_{j\in S}\gamma_{t,j}\mathbf{H}_{t,j}^{-1}\mathbf{g}_{t,j}\|\leq\epsilon$. This allows us to effectively analyze the reduction in the value of the loss function $\mathcal{L}$ at every iteration $t$. Below, we discuss the convergence of first and second-order gradient methods applied to subsets extracted by AdaCore.

Convergence for Newton's Method and AdaHessian We first provide the convergence analysis for the case where the function $\mathcal{L}$ in Problem (1) is strongly convex, i.e., there exists a constant $\alpha>0$ such that $\forall w,w^{\prime}\in\mathbb{R}^{d}$ we have $\mathcal{L}(w)\geq\mathcal{L}(w^{\prime})+\langle\nabla\mathcal{L}(w^{\prime}),w-w^{\prime}\rangle+\frac{\alpha}{2}\|w^{\prime}-w\|^{2}$, and each component function has a Lipschitz gradient, i.e., $\forall w,w^{\prime}\in\mathcal{W}$ we have $\|\nabla\mathcal{L}(w)-\nabla\mathcal{L}(w^{\prime})\|\leq\beta\|w-w^{\prime}\|$. We obtain the following results by applying Newton's method and AdaHessian to the weighted subsets $S$ extracted by AdaCore.

Theorem 4.1.

Assume that $\mathcal{L}$ is $\alpha$-strongly convex and $\beta$-smooth. Let $S$ be a weighted subset obtained by AdaCore that estimates the preconditioned gradient with an error of at most $\epsilon$ at every iteration $t$, i.e., $\|\mathbf{H}^{-1}_{t}\mathbf{g}_{t}-\sum_{j\in S}\gamma_{t,j}\mathbf{H}_{t,j}^{-1}\mathbf{g}_{t,j}\|\leq\epsilon$. Then, with learning rate $\alpha/\beta$, Newton's method with the update rule of Eq. (3) applied to the subsets has the following convergence behavior:

$$\mathcal{L}(w_{t+1})-\mathcal{L}(w_{t})\leq-\frac{\alpha^{3}}{2\beta^{4}}(\|\mathbf{g}_{t}\|-\beta\epsilon)^{2}. \tag{14}$$

In particular, the algorithm converges to a $\beta\epsilon/\alpha$-neighborhood of the optimal solution $w_{*}$: Eq. (14) guarantees a strict decrease of the loss whenever $\|\mathbf{g}_{t}\|>\beta\epsilon$, and once $\|\mathbf{g}_{t}\|\leq\beta\epsilon$, $\alpha$-strong convexity gives $\|w_{t}-w_{*}\|\leq\|\mathbf{g}_{t}\|/\alpha\leq\beta\epsilon/\alpha$.

Corollary 4.2.

For an $\alpha$-strongly convex and $\beta$-smooth loss $\mathcal{L}$, AdaHessian with Hessian power $k$, applied to subsets found by AdaCore, converges to a $\beta\epsilon/\alpha$-neighborhood of the optimal solution $w_{*}$ and satisfies:

$$\mathcal{L}(w_{t+1})-\mathcal{L}(w_{t})\leq-\frac{\alpha^{k+2}}{2\beta^{k+3}}(\|\mathbf{g}_{t}\|-\beta\epsilon)^{2}. \tag{15}$$

The proofs can be found in Appendix A.1.

Convergence for (S)GD in the Over-parameterized Case Next, we discuss the convergence behavior of gradient descent applied to the subsets found by AdaCore. In particular, we build upon the recent results of Liu et al. (2020), which guarantee convergence of first-order methods for a broad class of over-parameterized non-linear systems, including neural networks whose tangent kernel $\mathbf{J}^{T}\mathbf{J}$, where $\mathbf{J}=\partial f/\partial w$ is the Jacobian of the function $f$ with respect to the parameters $w$, is not close to constant but satisfies the Polyak-Lojasiewicz (PL) condition. A loss function $\mathcal{L}$ is $\mu$-PL on a set $\mathcal{W}$ if $\frac{1}{2}\|\nabla\mathcal{L}(w)\|^{2}\geq\mu\mathcal{L}(w)$ for all $w\in\mathcal{W}$.

Theorem 4.3.

Assume that the loss function $\mathcal{L}(w)$ is $\beta$-smooth and $\mu$-PL on a set $\mathcal{W}$, and that $S$ is a weighted subset obtained by AdaCore that estimates the preconditioned gradient with an error of at most $\epsilon$, i.e., $\|\mathbf{H}^{-1}_{t}\mathbf{g}_{t}-\sum_{j\in S}\gamma_{t,j}\mathbf{H}_{t,j}^{-1}\mathbf{g}_{t,j}\|\leq\epsilon$. Then, with learning rate $\eta$, gradient descent with the update rule of Eq. (2) applied to the subsets has the following convergence behavior at iteration $t$:

$$\mathcal{L}(w_{t})\leq\Big(1-\frac{\eta\mu\alpha^{2}}{\beta^{2}}\Big)^{t}\mathcal{L}(w_{0})-\frac{\eta\alpha^{2}}{2\beta^{2}}(\beta^{2}\epsilon^{2}-2\beta\epsilon\nabla_{\max}), \tag{16}$$

where $\alpha$ is the minimum eigenvalue of all Hessian matrices during training, and $\nabla_{\max}$ is an upper bound on the norm of the gradients.

Theorem 4.4.

Under the same assumptions as in Theorem 4.3, mini-batch SGD with mini-batch size $m\in\mathbb{N}$, the update rule of Eq. (2), and learning rate $\eta=\frac{m}{\beta(m-1)}$, applied to the subsets, has the following convergence behavior:

$$\mathbb{E}[\mathcal{L}(w_{t})]\leq\Big(1-\frac{\eta\mu\alpha^{2}}{2\beta}\Big)^{t}\mathbb{E}[\mathcal{L}(w_{0})]-\frac{\alpha^{2}\eta}{2\beta}(\beta\epsilon^{2}-2\epsilon\nabla_{\max}), \tag{17}$$

where $\alpha$ is the minimum eigenvalue of all Hessian matrices during training, $\nabla_{\max}$ is an upper bound on the norm of the gradients, and the expectation is taken w.r.t. the randomness in the choice of mini-batches.

The proofs can be found in Appendix A.2.

We showed exponential convergence for GD (Theorem 4.3) and SGD (Theorem 4.4) under the $\mu$-PL condition, as well as for second-order methods (Theorem 4.1 and Corollary 4.2) under $\alpha$-strong convexity and $\beta$-smoothness assumptions on the loss.

5 Experiments

In this section, we evaluate the effectiveness of AdaCore, by answering the following questions: (1) how does the performance of various first and second-order methods compare when applied to subsets found by AdaCore vs. the full data and baselines; (2) how effective is AdaCore for extracting crucial subsets for training convex and non-convex over-parameterized models with different optimizers; and (3) how does AdaCore perform in eliminating redundancies and enhancing diversity of the selected elements.

Baselines In the convex setting, we compare the performance of AdaCore with Craig (Mirzasoleiman et al., 2020), which extracts subsets that approximate the full gradient, as well as Random subsets. For non-convex experiments, we additionally compare AdaCore with GradMatch and Glister (Killamsetty et al., 2021, 2020). For AdaCore and Craig, we use the gradient w.r.t. the input to the last layer, and for Glister and GradMatch we use the gradient w.r.t. the penultimate layer, as specified by those methods. In all cases, we select subsets separately from each class, proportional to the class sizes, and train on the union of the subsets. We report average test accuracy across 3 trials in all experiments.
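A sketch of the per-class selection described above, reusing the `greedy_facility_location` sketch from Section 4.4; `frac` is the subset fraction (e.g., 0.01 for 1% coresets), and the function name is ours.

```python
import numpy as np

def select_per_class(P, labels, frac):
    """Select a coreset from each class, proportional to the class size, and
    return the union of selected indices together with their per-element weights."""
    idx_all, w_all = [], []
    for c in np.unique(labels):
        cls_idx = np.where(labels == c)[0]
        k = max(1, int(round(frac * len(cls_idx))))
        sel, w = greedy_facility_location(P[cls_idx], k)
        idx_all.append(cls_idx[sel])
        w_all.append(w)
    return np.concatenate(idx_all), np.concatenate(w_all)
```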

5.1 Convex Experiments

In our convex experiments, we apply AdaCore to select a coreset for classifying the Ijcnn1 dataset using L2-regularized logistic regression: $l_{i}(w)=\ln(1+\exp(-w^{T}x_{i}y_{i}))+0.5\mu w^{T}w$. Ijcnn1 includes 49,990 training and 91,701 test data points of 22 dimensions, from 2 classes with a 9-to-1 class imbalance ratio. In the convex setting, we only need to calculate the curvature once to find one AdaCore subset for the entire training. Hence, we utilize the complete Hessian information, computed analytically, as discussed in Appendix B.3. We apply an exponential-decay learning rate schedule $\alpha_{k}=\alpha_{0}b^{k}$ with parameters $\alpha_{0}$ and $b$. For each model and method (including the random baseline), we tuned the parameters via a search and report the best results.
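In this convex setting, the per-example gradients and Hessians have closed forms, so the preconditioned gradients can be computed once before training. Below is a minimal sketch for the loss above with labels in {-1, +1} (ours; it returns the Hessian diagonal, although the full per-example Hessian $s(1-s)\,x_{i}x_{i}^{T}+\mu I$ is also cheap in 22 dimensions).

```python
import numpy as np

def logreg_grad_hess_diag(w, X, y, mu):
    """Per-example gradient and Hessian diagonal of
    l_i(w) = ln(1 + exp(-y_i w^T x_i)) + 0.5 * mu * ||w||^2, with y_i in {-1, +1}."""
    margins = y * (X @ w)                        # y_i * w^T x_i
    s = 1.0 / (1.0 + np.exp(margins))            # sigma(-margin_i)
    grads = -(y * s)[:, None] * X + mu * w       # (n, d) per-example gradients
    hess_diag = (s * (1 - s))[:, None] * X**2 + mu   # diag of s(1-s) x_i x_i^T + mu I
    return grads, hess_diag
```

The resulting preconditioned per-example gradients can then be passed to the same greedy selection used in the non-convex case.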

AdaCore achieves smaller loss residual with a speedup Figure 1 compares the loss residual for SGD and Newton’s method applied to coresets of size 10% extracted by AdaCore (blue), Craig (orange), and random (green) with that of full dataset (red). We see that AdaCore effectively minimizes the training loss, achieving a better loss residual than Craig and random sampling. In particular, AdaCore matches the loss achieved on the full dataset with more than a 2.5x speedup for SGD and Newton’s methods. We note that training on random 10% subsets of the data cannot effectively minimize the training loss. We show the superior performance of training with SGD on subsets of size 10% to 90% found with AdaCore vs Craig in Appendix Fig. 6.

AdaCore better estimates the full gradient Fig. 2 shows the normalized gradient difference between gradient of the full data vs. weighted gradient of subsets of different sizes obtained by AdaCore vs Craig and Random, at the end of training by each method. We see that by considering curvature information, AdaCore obtains a better gradient estimate than Craig and Random subsets.

Figure 1 ((a) Ijcnn1 SGD; (b) Ijcnn1 Newton): Loss residual of SGD and Newton's method for training Logistic Regression on Ijcnn1, comparing AdaCore (blue), Craig (orange), and random subsets (green) of size 10% vs. the entire data (red dot). AdaCore achieves a 2.5x speedup for training with SGD and Newton's method.
Figure 2 ((a) Ijcnn1 SGD; (b) Ijcnn1 Newton): Normalized gradient difference between subsets of various sizes found by AdaCore (blue), Craig (orange), and Random (green) vs. the full data, when training Logistic Regression with SGD and Newton's method on Ijcnn1. AdaCore has a smaller gradient error at the end of training.
Figure 3 ((a) Test accuracy; (b) Distribution of selected points; (c) Forgetting vs. class ranking): Training ResNet-18 on subsets of size $S$=1% selected every $R$=1 epoch with AdaCore, Craig, Glister, GradMatch, and Random for 200 epochs vs. the full data for 15 epochs. (a) AdaCore outperforms the baselines, providing a 2x speedup over the full data and more than a 4.5x speedup over Random. (b) Histograms of the number of times a point is selected by AdaCore, Craig, and GradMatch. AdaCore selects a more diverse set of examples compared to Craig, and GradMatch augments its selection with several randomly chosen examples. (c) Forgetting scores for examples of a class sorted by AdaCore at the end of training. AdaCore prioritizes less forgettable examples compared to Craig.
Table 1: Training ResNet20 using AdaHessian and SGD+momentum on coresets of size 1% selected by different methods from CIFAR10. The percentage of the full data selected during the entire training is shown in parentheses. Using $b_{H}$=64, AdaCore achieves up to 16.8% higher accuracy while selecting a smaller fraction of data points. Exponential averaging of the gradient and Hessian, and a smaller $b_{H}$, help.

Method | AdaHessian | SGD+Momentum
Random | 59.1% ± 2.8 (87%) | 45.9% ± 2.5 (87%)
Craig | 59.5% ± 2.8 (74%) | 43.6% ± 1.6 (75%)
GradMatch | 57.5% ± 1.3 (74%) | 49.4% ± 1.6 (74%)
Glister | 37.5% ± 1.3 (74%) | 38.6% ± 1.6 (74%)
AdaCore (no avg) | 58.4% ± 0.2 (73%) | 51.5% ± 1.1 (74%)
AdaCore (avg g) | 59.8% ± 0.5 (73%) | 53.2% ± 1.1 (74%)
AdaCore (avg H) | 60.2% ± 0.5 (73%) | 54.4% ± 1.1 (74%)
AdaCore | 60.2% ± 0.5 (73%) | 55.4% ± 1.1 (74%)
AdaCore ($b_{H}$=512) | 57.2% ± 0.5 (73%) | 52.4% ± 1.1 (74%)

5.2 Non-Convex Experiments

Datasets We use CIFAR10 (60k points from 10 classes), a class-imbalanced version of CIFAR10 (32.5k points from 10 classes), CIFAR100 (32.5k points from 100 classes) (Krizhevsky et al., 2009), and BDD100k (100k points from 7 classes) (Yu et al., 2020). The results on MNIST (70k points from 10 classes) (Deng, 2012) can be found in Appendix C.6. Images are normalized to [0,1] by dividing by 255.

Models and Optimizers We train ResNet-20 and ResNet-18 (He et al., 2016), with convolution, average pooling, and dense layers with softmax outputs and a weight decay of $10^{-4}$. We use a batch size of 256 in all experiments (except Table 3 and Fig. 4a), and train using SGD with momentum of 0.9 (default) or AdaHessian. For training, we use the standard learning rate schedule for ResNet, starting at 0.1 and decaying by a factor of 0.1 at epochs 100 and 150. We use a linear learning rate warm-up for the first 20 epochs to prevent the weights from diverging when training with subsets. All experiments were run on a 2.4GHz CPU and an RTX 2080 Ti GPU.
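A small sketch of the schedule described above (the exact shape of the warm-up is our assumption):

```python
def learning_rate(epoch, base_lr=0.1, warmup_epochs=20, milestones=(100, 150), gamma=0.1):
    """Linear warm-up for the first 20 epochs, then step decay by 0.1 at epochs 100 and 150."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr
```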

Calculating the Curvature To calculate the Hessian diagonal using Eq. (8), we use a batch size of $b_{H}=64$ to calculate the expected Hessian diagonal over the training data. We observed that a smaller batch size provides a higher-quality Hessian estimate than larger batch sizes, as shown in Table 1.

Baseline Comparison and Ablation Study Table 1 shows the accuracy of training ResNet-20, using SGD with momentum of 0.9 and AdaHessian, for 200 epochs on $S$=1% subsets of CIFAR-10 chosen every $R$=1 epoch by different methods. For SGD+momentum, AdaCore outperforms Craig by 12%, Random by 10%, GradMatch by 6%, and Glister by 16.8%. Note that in total, AdaCore selects 74% of the dataset during the entire training process, whereas Random visits 87%. Thus, AdaCore effectively selects the subsets contributing the most to generalization. We see that the accuracy gap between the baselines and AdaCore shrinks when applying more powerful optimizers such as AdaHessian. Table 1 also shows the effect of exponential averaging of the gradients and Hessian diagonal, and of larger batch sizes $b_{H}$ for calculating the Hessian diagonal. We see that exponential averaging helps AdaCore achieve better performance, and that a smaller $b_{H}$ provides better results.

Fig. 3a compares the performance of ResNet-18 on 1% subsets selected from CIFAR-10 with different methods. We compare the performance of training on AdaCore, Craig, GradMatch, Glister, and Random subsets for 200 epochs, with training on full data for 15 epochs. This is the number of iterations required for training on the full data to achieve a comparable performance to that of AdaCore subsets. We see that training on AdaCore coresets achieves a better accuracy 2.5x faster than training on the full dataset, and more than 4.5x faster than the next best subset selection algorithm for this setting (c.f. Fig. 8b in Appendix for complete results).

Table 2: Test accuracy and percent of full data selected (in parentheses), when selecting $S$=1% coresets every $R$ epochs from Imbalanced CIFAR-10 to train ResNet18.

Method | $S$=1%, $R$=20 | $S$=1%, $R$=10 | $S$=1%, $R$=5
AdaCore | 57.3% (5%) | 57.12% (9.5%) | 60.2% (14.5%)
Craig | 48.6% (8%) | 55% (16%) | 53.05% (27.5%)
Random | 54.7% (8%) | 54.6% (18%) | 54.6% (33.2%)
GradMatch | 29.9% (8.2%) | 29.1% (14.7%) | 32.75% (23.2%)
Glister | 21.1% (8.6%) | 17.2% (16%) | 14.4% (22.2%)
Table 3: Training ResNet18 with $S$=1% subsets selected every $R$=1 epoch from CIFAR10 using batch sizes $b$ = 512, 256, 128. AdaCore can leverage a larger mini-batch size and obtain a larger accuracy gap to Craig and Random. For $b$=512, there is 1 mini-batch (GD). Std is reported in Appendix Table 9.

Setting | AdaCore | Craig | Random | Gap/Craig | Gap/Random
GD, $b$=512 | 58.32% | 56.32% | 49.14% | 1.69% | 8.91%
SGD, $b$=256 | 68.23% | 58.3% | 60.7% | 9.93% | 8.16%
SGD, $b$=128 | 66.89% | 58.17% | 65.46% | 8.81% | 1.52%
Table 4: Test accuracy and percent of full data selected (in parentheses), when selecting $S$% coresets every $R$ epochs from CIFAR-10 and Imbalanced CIFAR-10 to train ResNet20 and ResNet18.

Method | ResNet20, CIFAR10, $S$=30%, $R$=20 | ResNet20, CIFAR10, $S$=10%, $R$=20 | ResNet18, CIFAR10-IMB, $S$=30%, $R$=20 | ResNet18, CIFAR10-IMB, $S$=10%, $R$=20
AdaCore | 80.57% ± 0.11 (74.6%) | 70.6% ± 0.33 (44.8%) | 85.7% ± 0.1 (74%) | 76% ± 0.3 (43.8%)
Craig | 65.8% ± 0.41 (90.9%) | 58.5% ± 1.27 (60.75%) | 79.3% ± 1.6 (84.5%) | 71.6% ± 0.15 (56.4%)

Frequency and size of subset selection Tables 2 and 4 show the performance of different methods for selecting subsets of size $S$% of the data every $R$ epochs from CIFAR-10 and Imbalanced CIFAR-10. Table 2 shows that selecting subsets of size 1% every $R=5,10,20$ epochs with AdaCore achieves superior performance compared to the baselines. Table 4 shows that AdaCore can successfully select larger subsets of size $S=10\%,30\%$ and outperform Craig (std is reported in Appendix Table 5).

AdaCore speeds up training Fig. 4 compares the speedup of various methods when training ResNet50 on BDD100k and ResNet18 on CIFAR-100, using 10% subsets selected every $R=20$ epochs. All methods are trained to achieve a test accuracy between 72% and 74% on BDD100k, and between 50% and 57% on CIFAR-100. On BDD100k, AdaCore achieves 74% accuracy in 100 epochs, while training on the full data achieves a similar performance in 45 epochs. For CIFAR-100, AdaCore achieves 59% accuracy in 200 epochs, while training on the full data achieves a similar performance in 40 epochs. Complete results on the speedup and test accuracy of each method can be found in Appendix C.3, C.4. We see that AdaCore achieves a 2.5x speedup over training on the full data and a 1.7x speedup over training on random subsets on BDD100k. For CIFAR-100, AdaCore achieves a 4.2x speedup over training on random subsets and a 2.9x speedup over training on the full data. Compared to the baselines, AdaCore achieves the desired accuracy much faster.

Effect of batch size Table 3 compares the performance of training with different batch sizes on subsets found by various methods. We see that training with larger batch size on subsets selected by AdaCore can achieve a superior accuracy. As AdaCore selects more diverse subsets with smaller weights, one can train with larger mini-batches on the subsets without increasing the gradient estimate error. In contrast, Craig subsets have elements with larger weights and hence training with fewer larger mini-batches has larger gradient error and does not improve the performance.

In summary, we see that AdaCore consistently outperforms the baselines across various architectures, optimizers, subset sizes, selection frequencies, and batch sizes.

AdaCore selects more diverse subsets Fig. 3b shows the number of times each method selected a particular element during the entire training. We see that AdaCore successfully selects a more diverse set of examples compared to Craig. We note that GradMatch may not be able to select subsets of the desired size and instead augments the selected subset with randomly chosen examples; hence, it has a normal-shaped distribution. Fig. 3c shows the mean forgetting score of all examples within a class ranked by AdaCore at the end of training, over a sliding window of size 100. We see that AdaCore prioritizes selecting less forgettable examples. This shows that AdaCore is indeed able to better distinguish different groups of easier examples, and hence can prevent catastrophic forgetting by including their representatives in the coresets.

Figure 4 ((a) BDD100k, ResNet50; (b) CIFAR100, ResNet18): Speedup of various methods over training on random subsets and the full data, for training ResNet18 on CIFAR100 and ResNet50 on BDD100k with batch size 128.
Figure 5 ((a) Forgetting vs. class ranking; (b) Uncertainty vs. class ranking; (c) Most selected; (d) Not selected): Training ResNet20 on $S$=1% subsets of CIFAR-10 selected by AdaCore. (a) Forgetting scores and (b) uncertainty of examples in a class sorted by AdaCore at the end of training. AdaCore prioritizes selecting more forgettable and uncertain examples. (c) The six images selected most frequently by AdaCore (25 times) from the airplane class. (d) A subset of images never selected by AdaCore.

AdaCore vs Forgettability and Uncertainty Fig. 5a and 5b show the mean forgettability and uncertainty, in sliding windows of size 100 and 200, over examples sorted by AdaCore at the end of training. We see that AdaCore heavily biases its selections towards forgettable and uncertain points as training proceeds. Interestingly, Fig. 5a reveals that AdaCore avoids the most forgettable samples in favor of slightly more memorable ones, suggesting that AdaCore can better distinguish easier groups of examples. Fig. 5b shows a similar bias towards uncertain samples. Fig. 5c and 5d show the most and least selected images by AdaCore, respectively. We see redundancies in the never-selected images, whereas the images frequently selected by AdaCore are quite diverse in color, angle, occlusion, and airplane model. This confirms the effectiveness of AdaCore in extracting the most crucial subsets for learning and eliminating redundancies.

6 Conclusion

We proposed AdaCore, a method that leverages the topology of the dataset to extract salient subsets of large datasets for efficient machine learning. The key idea behind AdaCore is to dynamically incorporate the curvature and gradient of the loss function via an adaptive estimate of the Hessian to select weighted subsets (coresets) which closely approximate the preconditioned gradient of the full dataset. We proved exponential convergence rate for first and second-order optimization methods applied to AdaCore coresets, under certain assumptions. Our extensive experiments, using various optimizers e.g., SGD, AdaHessian, and Newton’s method, show that AdaCore can extract higher quality coresets compared to baselines, rejecting potentially redundant data points. This speeds up the training of various machine learning models, such as logistic regression and neural networks, by over 4.5x while selecting fewer but more diverse data points for training.

Acknowledgements

This research was supported in part by UCLA-Amazon Science Hub for Humanity and Artificial Intelligence.

References

  • Alain et al. (2015) Alain, G., Lamb, A., Sankar, C., Courville, A., and Bengio, Y. Variance reduction in sgd by distributed importance sampling. arXiv preprint arXiv:1511.06481, 2015.
  • Allen-Zhu et al. (2016) Allen-Zhu, Z., Yuan, Y., and Sridharan, K. Exploiting the structure: Stochastic gradient methods using raw clusters. In Advances in Neural Information Processing Systems, pp. 1642–1650, 2016.
  • Asi & Duchi (2019) Asi, H. and Duchi, J. C. The importance of better models in stochastic optimization. arXiv preprint arXiv:1903.08619, 2019.
  • Bekas et al. (2007) Bekas, C., Kokiopoulou, E., and Saad, Y. An estimator for the diagonal of a matrix. Applied Numerical Mathematics, 57(11):1214–1229, 2007. ISSN 0168-9274. doi: https://doi.org/10.1016/j.apnum.2007.01.003. URL https://www.sciencedirect.com/science/article/pii/S0168927407000244. Numerical Algorithms, Parallelism and Applications (2).
  • Bertsekas (1982) Bertsekas, D. P. Projected newton methods for optimization problems with simple constraints. SIAM Journal on control and Optimization, 20(2):221–246, 1982.
  • Birodkar et al. (2019) Birodkar, V., Mobahi, H., and Bengio, S. Semantic redundancies in image-classification datasets: The 10% you don’t need. arXiv preprint arXiv:1901.11409, 2019.
  • Boyd & Vandenberghe (2004) Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, USA, 2004. ISBN 0521833787.
  • Chen et al. (2001) Chen, S. S., Donoho, D. L., and Saunders, M. A. Atomic decomposition by basis pursuit. SIAM review, 43(1):129–159, 2001.
  • Coleman et al. (2020) Coleman, C., Yeh, C., Mussmann, S., Mirzasoleiman, B., Bailis, P., Liang, P., Leskovec, J., and Zaharia, M. Selection via proxy: Efficient data selection for deep learning. In International Conference on Learning Representations (ICLR), 2020.
  • Deng (2012) Deng, L. The mnist database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141–142, 2012.
  • Donoho (2006) Donoho, D. L. Compressed sensing. IEEE Transactions on information theory, 52(4):1289–1306, 2006.
  • Elenberg et al. (2018) Elenberg, E. R., Khanna, R., Dimakis, A. G., Negahban, S., et al. Restricted strong convexity implies weak submodularity. Annals of Statistics, 46(6B):3539–3568, 2018.
  • Ghorbani & Zou (2019) Ghorbani, A. and Zou, J. Data shapley: Equitable valuation of data for machine learning. In International Conference on Machine Learning, pp. 2242–2251. PMLR, 2019.
  • He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  770–778, 2016.
  • Hofmann et al. (2015) Hofmann, T., Lucchi, A., Lacoste-Julien, S., and McWilliams, B. Variance reduced stochastic gradient descent with neighbors. In Advances in Neural Information Processing Systems, pp. 2305–2313, 2015.
  • Katharopoulos & Fleuret (2018) Katharopoulos, A. and Fleuret, F. Not all samples are created equal: Deep learning with importance sampling. In International conference on machine learning, pp. 2525–2534. PMLR, 2018.
  • Killamsetty et al. (2020) Killamsetty, K., Sivasubramanian, D., Ramakrishnan, G., and Iyer, R. Glister: Generalization based data subset selection for efficient and robust learning. arXiv preprint arXiv:2012.10630, 2020.
  • Killamsetty et al. (2021) Killamsetty, K., Sivasubramanian, D., Mirzasoleiman, B., Ramakrishnan, G., De, A., and Iyer, R. Grad-match: A gradient matching based data subset selection for efficient learning. arXiv preprint arXiv:2103.00123, 2021.
  • Krizhevsky et al. (2009) Krizhevsky, A., Nair, V., and Hinton, G. Cifar-10 (canadian institute for advanced research). 2009. URL http://www.cs.toronto.edu/~kriz/cifar.html.
  • Kyrillidis et al. (2013) Kyrillidis, A., Becker, S., Cevher, V., and Koch, C. Sparse projections onto the simplex. In International Conference on Machine Learning, pp. 235–243. PMLR, 2013.
  • Liu et al. (2020) Liu, C., Zhu, L., and Belkin, M. Toward a theory of optimization for over-parameterized systems of non-linear equations: the lessons of deep learning. arXiv preprint arXiv:2003.00307, 2020.
  • Loshchilov & Hutter (2015) Loshchilov, I. and Hutter, F. Online batch selection for faster training of neural networks. arXiv preprint arXiv:1511.06343, 2015.
  • Martens & Grosse (2015) Martens, J. and Grosse, R. Optimizing neural networks with kronecker-factored approximate curvature. In International conference on machine learning, pp. 2408–2417. PMLR, 2015.
  • Minoux (1978) Minoux, M. Accelerated greedy algorithms for maximizing submodular set functions. In Optimization techniques, pp.  234–243. Springer, 1978.
  • Mirzasoleiman et al. (2013) Mirzasoleiman, B., Karbasi, A., Sarkar, R., and Krause, A. Distributed submodular maximization: Identifying representative elements in massive data. In Advances in Neural Information Processing Systems, pp. 2049–2057, 2013.
  • Mirzasoleiman et al. (2015) Mirzasoleiman, B., Badanidiyuru, A., Karbasi, A., Vondrák, J., and Krause, A. Lazier than lazy greedy. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
  • Mirzasoleiman et al. (2020) Mirzasoleiman, B., Bilmes, J., and Leskovec, J. Coresets for data-efficient training of machine learning models. In International Conference on Machine Learning, pp. 6950–6960. PMLR, 2020.
  • Natarajan (1995) Natarajan, B. K. Sparse approximate solutions to linear systems. SIAM journal on computing, 24(2):227–234, 1995.
  • Nocedal (1980) Nocedal, J. Updating quasi-newton matrices with limited storage. Mathematics of Computation, 35(151):773–782, 1980. ISSN 00255718, 10886842. URL http://www.jstor.org/stable/2006193.
  • Pilanci et al. (2012) Pilanci, M., El Ghaoui, L., and Chandrasekaran, V. Recovery of sparse probability measures via convex programming. 2012.
  • Qian (1999) Qian, N. On the momentum term in gradient descent learning algorithms. Neural networks, 12(1):145–151, 1999.
  • Robbins & Monro (1951) Robbins, H. and Monro, S. A stochastic approximation method. The annals of mathematical statistics, pp.  400–407, 1951.
  • Schaul et al. (2013) Schaul, T., Zhang, S., and LeCun, Y. No more pesky learning rates. In International Conference on Machine Learning, pp. 343–351. PMLR, 2013.
  • Schaul et al. (2015) Schaul, T., Quan, J., Antonoglou, I., and Silver, D. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.
  • Schwartz et al. (2019) Schwartz, R., Dodge, J., Smith, N. A., and Etzioni, O. Green ai. arXiv preprint arXiv:1907.10597, 2019.
  • Strubell et al. (2019) Strubell, E., Ganesh, A., and McCallum, A. Energy and policy considerations for deep learning in nlp. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp.  3645–3650, 2019.
  • Tibshirani (1996) Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
  • Toneva et al. (2018) Toneva, M., Sordoni, A., des Combes, R. T., Trischler, A., Bengio, Y., and Gordon, G. J. An empirical study of example forgetting during deep neural network learning. In International Conference on Learning Representations, 2018.
  • Wolsey (1982) Wolsey, L. A. An analysis of the greedy algorithm for the submodular set covering problem. Combinatorica, 2(4):385–393, 1982.
  • Xu et al. (2020) Xu, P., Roosta, F., and Mahoney, M. W. Second-order optimization for non-convex machine learning: An empirical study. In Proceedings of the 2020 SIAM International Conference on Data Mining, pp.  199–207. SIAM, 2020.
  • Yao et al. (2018) Yao, Z., Xu, P., Roosta-Khorasani, F., and Mahoney, M. W. Inexact non-convex newton-type methods. arXiv preprint arXiv:1802.06925, 2018.
  • Yao et al. (2020) Yao, Z., Gholami, A., Shen, S., Keutzer, K., and Mahoney, M. W. Adahessian: An adaptive second order optimizer for machine learning. arXiv preprint arXiv:2006.00719, 2020.
  • Yu et al. (2020) Yu, F., Chen, H., Wang, X., Xian, W., Chen, Y., Liu, F., Madhavan, V., and Darrell, T. Bdd100k: A diverse driving dataset for heterogeneous multitask learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.

Appendix A Proofs of Theorems

A.1 Proof of Theorem 4.1

Restatement of Theorem 4.1.

Proof.

We prove Theorem 4.1 (similarly to the proof of Newton's method in (Boyd & Vandenberghe, 2004)) for the following general update rule with $0 \leq k \leq 1$:

$$\Delta w_t = \mathbf{H}_t^{-k}\mathbf{g}_t \tag{18}$$
$$w_{t+1} = w_t - \eta\,\Delta w_t \tag{19}$$

For $k=1$, this corresponds to the update rule of Newton's method. Define $\lambda(w_t) = (\mathbf{g}_t^T\mathbf{H}_t^{-k}\mathbf{g}_t)^{1/2}$. Since $\mathcal{L}(w)$ is $\beta$-smooth, we have

$$\mathcal{L}(w_{t+1}) \leq \mathcal{L}(w_t) - \eta\,\mathbf{g}_t^T\Delta w_t + \frac{\eta^2\beta\|\Delta w_t\|^2}{2} \tag{20}$$
$$\leq \mathcal{L}(w_t) - \eta\,\lambda(w_t)^2 + \frac{\beta}{2\alpha^k}\eta^2\lambda(w_t)^2, \tag{21}$$

where in the last inequality we used $\|\Delta w_t\|^2 \leq \frac{1}{\alpha^k}\lambda(w_t)^2$, which follows from $\mathbf{H}_t^k \succeq \alpha^k I$ and

$$\lambda(w_t)^2 = \Delta w_t^T\,\mathbf{H}_t^k\,\Delta w_t. \tag{22}$$

Therefore, using the step size $\hat{\eta} = \frac{\alpha^k}{\beta}$ and the update $w_{t+1} = w_t - \hat{\eta}\,\Delta w_t$, we have

$$\mathcal{L}(w_{t+1}) \leq \mathcal{L}(w_t) - \frac{1}{2}\hat{\eta}\,\lambda(w_t)^2. \tag{23}$$

Since $\alpha I \preceq \mathbf{H}_t \preceq \beta I$, we have

$$\lambda(w_t)^2 = \mathbf{g}_t^T\mathbf{H}_t^{-k}\mathbf{g}_t \geq \frac{1}{\beta^k}\|\mathbf{g}_t\|^2, \tag{24}$$

and therefore $\mathcal{L}$ decreases as follows:

$$\mathcal{L}(w_{t+1}) - \mathcal{L}(w_t) \leq -\frac{1}{2\beta^k}\hat{\eta}\|\mathbf{g}_t\|^2 = -\frac{\alpha^k}{2\beta^{k+1}}\|\mathbf{g}_t\|^2. \tag{25}$$

Now for the subset, from Eq. (5) we have that $\|\mathbf{H}_t^{-1}\mathbf{g}_t - \sum_{j\in S}\gamma_{t,j}\mathbf{H}_{t,j}^{-1}\mathbf{g}_{t,j}\| \leq \epsilon$. Hence, via the reverse triangle inequality, $\|\mathbf{H}_t^{-1}\mathbf{g}_t\| \leq \|\sum_{j\in S}\gamma_{t,j}\mathbf{H}_{t,j}^{-1}\mathbf{g}_{t,j}\| + \epsilon$, and we get

$$\frac{\|\mathbf{g}_t\|}{\beta} \leq \|\mathbf{H}_t^{-1}\mathbf{g}_t\| \leq \|(\mathbf{H}_t^S)^{-1}\mathbf{g}_t^S\boldsymbol{\gamma}\| + \epsilon \leq \frac{\|\mathbf{g}_t^S\|}{\alpha} + \epsilon, \tag{26}$$

where $\mathbf{g}_t^S = \sum_{j\in S}\mathbf{g}_{t,j}$ and $\mathbf{H}_t^S = \sum_{j\in S}\mathbf{H}_{t,j}$ are the gradient and Hessian of the subset, respectively. In Eq. (26), the rightmost inequality follows from operator norms, and the leftmost inequality follows from the following lower bound on the norm of the product of two matrices:

$$\begin{aligned}\|AB\| &= \max_{\|x\|=1}\|x^TAB\| \\ &= \max_{\|x\|=1}\|x^TA\|\left\|\frac{x^TA}{\|x^TA\|}B\right\| \\ &\geq \max_{\|x\|=1}\sigma_{\min}(A)\left\|\frac{x^TA}{\|x^TA\|}B\right\| \\ &= \max_{\|y\|=1}\sigma_{\min}(A)\|y^TB\| \\ &= \sigma_{\min}(A)\|B\|. \end{aligned} \tag{27}$$

Hence,

$$\|\mathbf{g}_t^S\| \geq \frac{\alpha}{\beta}\left(\|\mathbf{g}_t\| - \beta\epsilon\right). \tag{28}$$

Therefore, on the subset we have

$$\mathcal{L}(w_{t+1}) - \mathcal{L}(w_t) \leq -\frac{\alpha^k}{2\beta^{k+1}}\|\mathbf{g}_t^S\|^2 \tag{29}$$
$$\leq -\frac{\alpha^k}{2\beta^{k+1}}\left(\frac{\alpha}{\beta}\right)^2\left(\|\mathbf{g}_t\| - \beta\epsilon\right)^2 \tag{30}$$
$$= -\frac{\alpha^{k+2}}{2\beta^{k+3}}\left(\|\mathbf{g}_t\| - \beta\epsilon\right)^2. \tag{31}$$

The algorithm stops descending when $\|\mathbf{g}_t\| = \beta\epsilon$. From strong convexity we know that

$$\|\mathbf{g}_t\| = \beta\epsilon \geq \alpha\|w - w_*\|. \tag{32}$$

Hence, we get

$$\|w - w_*\| \leq \beta\epsilon/\alpha. \tag{33}$$

This establishes Corollary 4.2, and setting $k=1$ yields the proof of Theorem 4.1. ∎
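As an illustration of the general update rule in Eqs. (18)-(19), the following is a toy numerical sketch of our own (not part of the proof): it runs the preconditioned update on a small strongly convex quadratic, where $k=0$ recovers gradient descent and $k=1$ recovers Newton's method, and uses the step size $\hat{\eta} = \alpha^k/\beta$ from the proof.

```python
import numpy as np

# Toy quadratic L(w) = 0.5 * w^T A w with Hessian A; here alpha = 1, beta = 10.
A = np.diag([1.0, 10.0])
alpha, beta = 1.0, 10.0
w0 = np.array([5.0, 5.0])

for k in (0.0, 0.5, 1.0):                    # k = 0: gradient descent, k = 1: Newton
    w = w0.copy()
    eta = alpha ** k / beta                  # step size eta_hat = alpha^k / beta
    H_inv_k = np.diag(np.diag(A) ** (-k))    # H^{-k}; exact here since A is diagonal
    for _ in range(50):
        g = A @ w                            # gradient of the quadratic
        w = w - eta * (H_inv_k @ g)          # Eqs. (18)-(19)
    print(f"k={k}: ||w_t - w*|| = {np.linalg.norm(w):.2e}")   # minimizer is w* = 0
```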

Descent property for Equation 5. For a strongly convex function $\mathcal{L}$, the diagonal elements of the Hessian are positive (Yao et al., 2020). This allows the diagonal to approximate the full Hessian while retaining good convergence properties.

Given a function $\mathcal{L}(w)$ that is strongly convex and strictly smooth, $\nabla^2\mathcal{L}(w)$ is upper and lower bounded by two constants $\beta$ and $\alpha$, so that $\alpha I \preceq \nabla^2\mathcal{L}(w) \preceq \beta I$ for all $w$. For a strongly convex function, the diagonal elements of $\text{diag}(\mathbf{H}_t)$ are all positive, and we have

$$\alpha \leq e_i^T\mathbf{H}_t e_i = e_i^T\text{diag}(\mathbf{H}_t)e_i = \text{diag}(\mathbf{H}_t)_{i,i} \leq \beta, \tag{34}$$

where $e_i$ denotes the $i$-th natural basis vector. Hence the diagonal entries of $\text{diag}(\mathbf{H}_t)$ lie in the range $[\alpha,\beta]$, and so does the average of any subset of them. As such, Eq. (10) has the same convergence rate as its full-matrix counterpart, which follows from the same argument as the proof of Theorem 4.1.
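The following small sketch (our own illustration, not the paper's code) shows this numerically: on a strongly convex quadratic with a non-diagonal Hessian, preconditioning with only $\text{diag}(\mathbf{H})$, whose entries lie in $[\alpha,\beta]$, still produces monotone descent.

```python
import numpy as np

H = np.array([[3.0, 1.0],
              [1.0, 2.0]])        # strongly convex quadratic L(w) = 0.5 * w^T H w
w = np.array([4.0, -3.0])
d_inv = 1.0 / np.diag(H)          # inverse of diag(H); entries lie in [1/beta, 1/alpha]

eta = 0.5
losses = [0.5 * w @ H @ w]
for _ in range(30):
    g = H @ w                     # gradient
    w = w - eta * d_inv * g       # diagonal-preconditioned update (Eq. (10)-style)
    losses.append(0.5 * w @ H @ w)

assert all(b <= a for a, b in zip(losses, losses[1:]))   # monotone descent
print(f"final loss: {losses[-1]:.2e}")
```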

A.2 Proofs of Theorems 4.3 and 4.4

A loss function $\mathcal{L}(w)$ is $\mu$-PL on a set $\mathcal{S}$ if the following holds:

$$\frac{1}{2}\|\mathbf{g}\|^2 \geq \mu\left(\mathcal{L}(w) - \mathcal{L}(w_*)\right), \quad \forall w \in \mathcal{S}, \tag{35}$$

where $w_*$ is a global minimizer. When additionally $\mathcal{L}(w_*) = 0$, the $\mu$-PL condition is equivalent to the $\mu$-PL$^*$ condition

$$\frac{1}{2}\|\mathbf{g}\|^2 \geq \mu\,\mathcal{L}(w), \quad \forall w \in \mathcal{S}. \tag{36}$$
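As a quick sanity check of this definition (our own addition, not part of the original analysis), an $\alpha$-strongly convex quadratic with zero minimum value satisfies the $\mu$-PL$^*$ condition with $\mu = \alpha$:

$$\mathcal{L}(w) = \frac{\alpha}{2}\|w - w_*\|^2, \qquad \mathbf{g} = \alpha(w - w_*), \qquad \frac{1}{2}\|\mathbf{g}\|^2 = \frac{\alpha^2}{2}\|w - w_*\|^2 = \alpha\,\mathcal{L}(w) \geq \mu\,\mathcal{L}(w) \ \text{ for } \mu \leq \alpha.$$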

Restatement of Theorem 4.3.

For Lipschitz continuous $\mathbf{g}$ and a loss satisfying the $\mu$-PL condition, gradient descent on the entire dataset yields

$$\mathcal{L}(w_{t+1}) - \mathcal{L}(w_t) \leq -\frac{\eta}{2}\|\mathbf{g}_t\|^2 \leq -\eta\mu\,\mathcal{L}(w_t), \tag{37}$$

and

$$\mathcal{L}(w_t) \leq (1 - \eta\mu)^t\,\mathcal{L}(w_0), \tag{38}$$

which was shown in (Liu et al., 2020).

We now extend this result to training on an AdaCore subset.

Proof.

From Eq. (26) we have that

$$\frac{\|\mathbf{g}_t\|}{\beta} \leq \|\mathbf{H}_t^{-1}\mathbf{g}_t\| \leq \|(\mathbf{H}_t^S)^{-1}\mathbf{g}_t^S\boldsymbol{\gamma}\| + \epsilon \leq \frac{\|\mathbf{g}_t^S\|}{\alpha} + \epsilon. \tag{39}$$

Hence, solving for $\|\mathbf{g}_t^S\|$, we have

$$\|\mathbf{g}_t^S\| \geq \frac{\alpha}{\beta}\left(\|\mathbf{g}_t\| - \beta\epsilon\right). \tag{40}$$

For the subset we have

$$\mathcal{L}(w_{t+1}) - \mathcal{L}(w_t) \leq -\frac{\eta}{2}\|\mathbf{g}_t^S\|^2. \tag{41}$$

Substituting Eq. (40), we have

$$\leq -\frac{\eta\alpha^2}{2\beta^2}\left(\|\mathbf{g}_t\| - \beta\epsilon\right)^2 \tag{42}$$
$$= -\frac{\eta\alpha^2}{2\beta^2}\left(\|\mathbf{g}_t\|^2 + \beta^2\epsilon^2 - 2\beta\epsilon\|\mathbf{g}_t\|\right) \tag{43}$$
$$\leq -\frac{\eta\alpha^2}{2\beta^2}\left(\|\mathbf{g}_t\|^2 + \beta^2\epsilon^2 - 2\beta\epsilon\nabla_{\max}\right) \tag{44}$$
$$\leq -\frac{\eta\alpha^2}{2\beta^2}\left(2\mu\mathcal{L}(w_t) + \beta^2\epsilon^2 - 2\beta\epsilon\nabla_{\max}\right), \tag{45}$$

where we upper bound the norm of $\mathbf{g}_t$ in Eq. (43) by a constant $\nabla_{\max}$ to obtain Eq. (44), and Eq. (45) follows from the $\mu$-PL condition in Eq. (35).

Hence,

$$\mathcal{L}(w_{t+1}) \leq \left(1 - \frac{\eta\mu\alpha^2}{\beta^2}\right)\mathcal{L}(w_t) - \frac{\eta\alpha^2}{2\beta^2}\left(\beta^2\epsilon^2 - 2\beta\epsilon\nabla_{\max}\right). \tag{46}$$

Since $\sum_{j=0}^{k}\left(1 - \frac{\eta\mu\alpha^2}{\beta^2}\right)^j \leq \frac{\beta^2}{\eta\mu\alpha^2}$, for a constant learning rate $\eta$ we get

$$\mathcal{L}(w_{t+1}) \leq \left(1 - \frac{\eta\mu\alpha^2}{\beta^2}\right)^{t+1}\mathcal{L}(w_0) - \frac{\eta\alpha^2}{2\beta^2}\left(\beta^2\epsilon^2 - 2\beta\epsilon\nabla_{\max}\right). \tag{47}$$

This completes the proof of Theorem 4.3. ∎

Restatement of Theorem 4.4.

For Lipschitz continuous $\mathbf{g}$ and a loss satisfying the $\mu$-PL condition, gradient descent on the entire dataset yields

$$\mathcal{L}(w_{t+1}) - \mathcal{L}(w_t) \leq -\frac{\eta}{2}\|\mathbf{g}_t\|^2 \leq -\eta\mu\,\mathcal{L}(w_t), \tag{48}$$

and

$$\mathcal{L}(w_t) \leq (1 - \eta\mu)^t\,\mathcal{L}(w_0), \tag{49}$$

which was shown in (Liu et al., 2020).

We now extend this result to training on an AdaCore subset.

Proof.

From Eq. (26) we have that

$$\frac{\|\mathbf{g}_t\|}{\beta} \leq \|\mathbf{H}_t^{-1}\mathbf{g}_t\| \leq \|(\mathbf{H}_t^S)^{-1}\mathbf{g}_t^S\boldsymbol{\gamma}\| + \epsilon \leq \frac{\|\mathbf{g}_t^S\|}{\alpha} + \epsilon. \tag{50}$$

Hence, solving for $\|\mathbf{g}_t^S\|$, we have

$$\|\mathbf{g}_t^S\| \geq \frac{\alpha}{\beta}\left(\|\mathbf{g}_t\| - \beta\epsilon\right). \tag{51}$$

For the subset we have

$$\mathcal{L}(w_{t+1}) - \mathcal{L}(w_t) \leq -\frac{\eta}{2}\|\mathbf{g}_t^S\|^2. \tag{52}$$

Fixing $w_t$ and taking the expectation with respect to the randomness in the choice of the mini-batch $i_t^{(1)},\dots,i_t^{(m)}$ (noting that these indices are i.i.d.), we have

$$\mathbb{E}_{i_t^{(1)}\dots i_t^{(m)}}\left[\mathcal{L}(w_{t+1}) - \mathcal{L}(w_t)\right] \leq -\eta\left(\alpha - \eta\frac{\beta}{m}\left(\alpha\frac{m-1}{2} + \beta\right)\right)\mathcal{L}(w_t) \tag{53}$$
$$\leq -\underbrace{\eta\left(1 - \frac{\eta\beta(m-1)}{2m}\right)}_{c_1}\|\mathbf{g}_t\|^2 + \underbrace{\frac{\eta^2\beta\lambda}{m}}_{c_2}\mathcal{L}(w_t) \tag{54}$$
$$\leq -c_1\frac{\alpha^2}{\beta^2}\left(\|\mathbf{g}_t\| - \beta\epsilon\right)^2 + c_2\,\mathcal{L}(w_t). \tag{55}$$

We upper bound the norm of $\mathbf{g}_t$ by the constant $\nabla_{\max}$, and Eq. (56) below follows from the $\mu$-PL condition in Eq. (35) and the assumption $\eta \leq \frac{2}{\beta}$:

$$\leq -c_1\frac{\alpha^2}{\beta^2}\left(\mu\mathcal{L}(w_t) - 2\nabla_{\max}\beta\epsilon + \beta^2\epsilon^2\right) + c_2\,\mathcal{L}(w_t) \tag{56}$$
$$\leq -\eta\left(1 - \frac{\eta\beta(m-1)}{2m}\right)\frac{\alpha^2}{\beta^2}\left(\mu\mathcal{L}(w_t) - 2\nabla_{\max}\beta\epsilon + \beta^2\epsilon^2\right) + \frac{\eta^2\beta\lambda}{m}\mathcal{L}(w_t) \tag{57}$$
$$= \eta\left(\mu\frac{\alpha^2}{\beta^2} - \eta\frac{\beta}{m}\left(\mu\frac{\alpha^2(m-1)}{2\beta^2} + \lambda\right)\right)\mathcal{L}(w_t) + \frac{\eta^2\beta\lambda}{m}\mathcal{L}(w_t) + c_1\frac{\alpha^2}{\beta^2}\left(2\nabla_{\max}\beta\epsilon - \beta^2\epsilon^2\right) \tag{58}$$
$$= \eta\mu\frac{\alpha^2}{\beta^2}\left(1 - \eta\beta\frac{m-1}{2m}\right)\mathbb{E}[\mathcal{L}(w_t)] + \eta\frac{\alpha^2}{\beta^2}\left(1 - \eta\beta\frac{m-1}{2m}\right)\left(2\nabla_{\max}\beta\epsilon - \beta^2\epsilon^2\right). \tag{59}$$

By optimizing the quadratic term in the upper bound with respect to $\eta$, we get $\eta^*(m) = \frac{m}{\beta(m-1)}$, and

$$\mathbb{E}[\mathcal{L}(w_{t+1})] \leq \left(1 - \frac{\mu\alpha^2 m}{2\beta^2(m-1)}\right)\mathbb{E}[\mathcal{L}(w_t)] + \frac{\alpha^2 m}{\beta^2}\cdot\frac{2\nabla_{\max}\beta\epsilon - \beta^2\epsilon^2}{2\beta(m-1)}. \tag{60}$$

Hence,

$$\mathbb{E}[\mathcal{L}(w_{t+1})] \leq \left(1 - \frac{\eta^*(m)\mu\alpha^2}{2\beta}\right)\mathbb{E}[\mathcal{L}(w_t)] + \frac{\alpha^2\eta^*(m)}{\beta}\left(\nabla_{\max}\epsilon - \beta\epsilon^2/2\right). \tag{61}$$

∎

A.3 Discussion on Greedy to Extract Near-optimal Coresets

As discussed in Section 4.4, a greedy algorithm can be applied to find near-optimal coresets that estimate the general descent direction with an error of at most $\epsilon$ by solving the submodular cover problem in Eq. (13). For completeness, we include the pseudocode of the greedy algorithm in Algorithm 1. The AdaCore algorithm is run per class.

0:  Input: Set of component functions $f_i$ for $i \in V = [n]$.
0:  Output: Subset $S \subseteq V$ with corresponding per-element stepsizes $\{\gamma_j\}_{j\in S}$.
1:  $S_0 \leftarrow \emptyset$, $s_0 = 0$, $i = 0$.
2:  while $F(S) < C_1 - \epsilon$ do
3:     $j \in \arg\max_{e\in V\setminus S_{i-1}} F(e \mid S_{i-1})$
4:     $S_i = S_{i-1}\cup\{j\}$
5:     $i = i+1$
6:  end while
7:  for $j=1$ to $|S|$ do
8:     $\gamma_j = \sum_{i\in V}\mathbb{I}\big[j = \arg\min_{s\in S}\max_{w\in\mathcal{W}}\|\mathbf{H}_t^{-1}\mathbf{g}_t - \sum_{j\in S}\gamma_{t,j}\mathbf{H}_{t,j}^{-1}\mathbf{g}_{t,j}\|\big]$
9:  end for
Algorithm 1 AdaCore (Adaptive Coresets for Accelerating first and second order optimization methods)
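For concreteness, the following is a minimal Python sketch of this greedy selection (our own illustration, not the released AdaCore implementation). It assumes the per-example preconditioned gradients $\mathbf{H}_{t,i}^{-1}\mathbf{g}_{t,i}$ have already been computed, uses a facility-location-style gain in place of $F(e \mid S_{i-1})$, stops at a fixed budget rather than the $C_1 - \epsilon$ threshold, and sets each weight $\gamma_j$ to the number of points whose closest selected element is $j$.

```python
import numpy as np

def greedy_coreset(precond_grads: np.ndarray, budget: int):
    """Greedily select `budget` rows of `precond_grads` (row i ~ H_i^{-1} g_i)
    and weight each selected point by the number of points it represents."""
    n = precond_grads.shape[0]
    # Pairwise Euclidean distances between preconditioned gradients.
    sq = np.sum(precond_grads ** 2, axis=1)
    dist = np.sqrt(np.maximum(sq[:, None] + sq[None, :]
                              - 2.0 * precond_grads @ precond_grads.T, 0.0))
    best = np.full(n, dist.max() + 1.0)   # distance of each point to current subset
    selected = []
    for _ in range(budget):
        # Marginal gain of adding e: total reduction in assignment distance.
        gains = np.sum(np.maximum(best[None, :] - dist, 0.0), axis=1)
        gains[selected] = -np.inf
        j = int(np.argmax(gains))
        selected.append(j)
        best = np.minimum(best, dist[j])
    centers = np.array(selected)
    assign = centers[np.argmin(dist[centers], axis=0)]   # closest center per point
    weights = np.array([int(np.sum(assign == c)) for c in centers])
    return centers, weights

# Example usage with random stand-ins for H_i^{-1} g_i.
rng = np.random.default_rng(0)
P = rng.normal(size=(200, 16))
idx, gamma = greedy_coreset(P, budget=20)
print(idx[:5], gamma.sum())   # the weights gamma sum to n = 200
```

In the full method, selection is run per class, and the exact pairwise distances can be replaced by the upper bounds on preconditioned-gradient differences derived in Appendix B.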

Appendix B Bounding the Norm of Difference Between Preconditioned Gradients

B.1 Convex Loss Functions

We bound the normed difference between preconditioned gradients for ridge regression. Similar results can be deduced for other loss functions such as the square loss (Allen-Zhu et al., 2016), logistic loss, smoothed hinge losses, etc.

For ridge regression, $f_i(w) = \frac{1}{2}(\langle x_i, w\rangle - y_i)^2 + \frac{\lambda}{2}\|w\|^2$, and we have $\nabla f_i(w) = x_i(\langle x_i, w\rangle - y_i) + \lambda w$. Furthermore, for invertible Hessians $H$, let $H_i^{-1} = A$ and $H_j^{-1} = B$. Therefore,

$$\|A\nabla f_i(w) - B\nabla f_j(w)\| = \|A(x_i\langle x_i,w\rangle - x_i y_i + \lambda w) - B(x_j\langle x_j,w\rangle - x_j y_j + \lambda w)\| \tag{62}$$
$$= \|Ax_i\langle x_i,w\rangle - Bx_j\langle x_j,w\rangle + Bx_j y_j - Ax_i y_i + \lambda(A-B)w\| \tag{63}$$
$$= \|Ax_i\langle x_i,w\rangle + Bx_j\langle x_i,w\rangle - Bx_j\langle x_i,w\rangle - Bx_j\langle x_j,w\rangle + Bx_j y_j - Ax_i y_i + Bx_j y_i - Bx_j y_i + \lambda(A-B)w\| \tag{64}$$
$$= \|\langle x_i,w\rangle(Ax_i - Bx_j) + \langle x_i - x_j,w\rangle Bx_j + (y_j - y_i)Bx_j + y_i(Bx_j - Ax_i) + \lambda(A-B)w\| \tag{65}$$
$$= \|(\langle x_i,w\rangle - y_i)(Ax_i - Bx_j) + (\langle x_i - x_j,w\rangle + y_j - y_i)Bx_j + \lambda(A-B)w\| \tag{66}$$
$$\leq |\langle x_i,w\rangle - y_i|\,\|Ax_i - Bx_j\| + |\langle x_i - x_j,w\rangle + y_j - y_i|\,\|Bx_j\| + \lambda\|(A-B)w\| \tag{67}$$
$$\leq |\langle x_i,w\rangle - y_i|\left(\|A-B\|\|x_i\| + \|B\|\|x_i - x_j\|\right) + |\langle x_i - x_j,w\rangle + y_j - y_i|\,\|Bx_j\| + \lambda\|(A-B)w\| \tag{68}$$
$$\leq O(\|w\|)\left(\|A-B\| + \|B\|\|x_i - x_j\|\right) + O(\|w\|)\|B\|\|x_j\|\|x_i - x_j\| \tag{69}$$
$$\leq O(\|w\|)\left(\|A\| + \|B\| + \|B\|\|x_i - x_j\|\right) + O(\|w\|)\|B\|\|x_j\|\|x_i - x_j\|. \tag{70}$$

In Eq. (70) we have the norm of the inverse of the Hessian matrix. Since $H$ is invertible, we have $\min_i\sigma_i > 0$, and

$$\min_i\sigma_i = \inf_{x\neq 0}\frac{\|Hx\|_2}{\|x\|_2} \Longleftrightarrow \frac{1}{\min_i\sigma_i} = \sup_{x\neq 0}\frac{\|x\|_2}{\|Hx\|_2}, \tag{71}$$
$$\frac{1}{\min_i\sigma_i} = \sup_{x\neq 0}\frac{\|x\|_2}{\|Hx\|_2} = \sup_{H^{-1}z\neq 0}\frac{\|H^{-1}z\|_2}{\|z\|_2} = \sup_{z\neq 0}\frac{\|H^{-1}z\|_2}{\|z\|_2} = \|H^{-1}\|_2, \tag{72}$$

where the substitution $Hx = z$ was made, and we used that $H^{-1}z = 0 \Longleftrightarrow z = 0$ since $H$ is invertible. Hence,

$$\leq O(\|w\|)\|B\|\|x_i - x_j\| \tag{73}$$
$$\leq O(\|w\|)\|x_i - x_j\|, \tag{74}$$

for $\|x_i\| \leq 1$ and $|y_i - y_j| \approx 0$.

Assuming that $\|w\|$ is bounded for all $w \in \mathcal{W}$, an upper bound on the Euclidean distance between preconditioned gradients can be precomputed.
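The following is a small numerical sketch of our own (not from the paper's code) of how these per-example preconditioned gradients and their pairwise distances can be precomputed for ridge regression, under the assumption of normalized features.

```python
import numpy as np

def precond_grad_distances(X, y, w, lam=1e-2):
    """Pairwise distances ||H_i^{-1} grad f_i(w) - H_j^{-1} grad f_j(w)|| for
    ridge regression f_i(w) = 0.5*(<x_i, w> - y_i)^2 + 0.5*lam*||w||^2."""
    n, d = X.shape
    P = np.zeros((n, d))
    for i in range(n):
        g_i = X[i] * (X[i] @ w - y[i]) + lam * w          # grad f_i(w)
        H_i = np.outer(X[i], X[i]) + lam * np.eye(d)      # per-example Hessian
        P[i] = np.linalg.solve(H_i, g_i)                  # H_i^{-1} g_i
    diff = P[:, None, :] - P[None, :, :]
    return np.linalg.norm(diff, axis=-1)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
X /= np.linalg.norm(X, axis=1, keepdims=True)             # enforce ||x_i|| <= 1
y, w = rng.normal(size=50), rng.normal(size=5)
print(precond_grad_distances(X, y, w).max())
```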

B.2 Neural Networks

We closely follow the proofs of (Katharopoulos & Fleuret, 2018) and (Mirzasoleiman et al., 2020) to show that, for arbitrary data points $i$ and $j$, the difference between the Hessian-inverse-preconditioned gradients of the entire neural network can be bounded, up to multiplicative and additive constants, by the difference between the Hessian-inverse-preconditioned gradients of its last layer.

Consider an $L$-layer perceptron, where $w^{(l)} \in \mathbb{R}^{M_l\times M_{l-1}}$ is the weight matrix for the $l^{th}$ layer with $M_l$ hidden units. Furthermore, assume $\sigma^{(l)}(\cdot)$ is a Lipschitz continuous activation function. Then we let

$$x_i^{(0)} = x_i, \tag{75}$$
$$z_i^{(l)} = w^{(l)}x_i^{(l-1)}, \tag{76}$$
$$x_i^{(l)} = \sigma^{(l)}\big(z_i^{(l)}\big). \tag{77}$$

With

$$\Sigma_l'\big(z_i^{(l)}\big) = \operatorname{diag}\Big(\sigma'^{(l)}\big(z_{i,1}^{(l)}\big), \cdots, \sigma'^{(l)}\big(z_{i,M_l}^{(l)}\big)\Big), \tag{78}$$
$$\Delta_i^{(l)} = \Sigma_l'\big(z_i^{(l)}\big)w_{l+1}^T \cdots \Sigma_l'\big(z_i^{(L-1)}\big)w_L^T, \tag{79}$$

we have

$$\|\mathbf{H}_i^{-1}\mathbf{g}_i - \mathbf{H}_j^{-1}\mathbf{g}_j\| \tag{80}$$
$$= \left\|\Big(\Delta_i^{(l)}\Sigma_L'\big(z_i^{(L)}\big)(\mathbf{H}_i^{-1})^{(L)}\mathbf{g}_i^{(L)}\Big)\big(x_i^{(l-1)}\big)^T - \Big(\Delta_j^{(l)}\Sigma_L'\big(z_j^{(L)}\big)(\mathbf{H}_j^{-1})^{(L)}\mathbf{g}_j^{(L)}\Big)\big(x_j^{(l-1)}\big)^T\right\| \tag{81}$$
$$\leq \big\|\Delta_i^{(l)}\big\|\cdot\big\|x_i^{(l-1)}\big\|\cdot\left\|\Sigma_L'\big(z_i^{(L)}\big)(\mathbf{H}_i^{-1})^{(L)}\mathbf{g}_i^{(L)} - \Sigma_L'\big(z_j^{(L)}\big)(\mathbf{H}_j^{-1})^{(L)}\mathbf{g}_j^{(L)}\right\| + \left\|\Sigma_L'\big(z_j^{(L)}\big)(\mathbf{H}_i^{-1})^{(L)}\mathbf{g}_i^{(L)}\right\|\cdot\left\|\Delta_i^{(l)}\big(x_i^{(l-1)}\big)^T - \Delta_j^{(l)}\big(x_j^{(l-1)}\big)^T\right\| \tag{82}$$
$$\leq \big\|\Delta_i^{(l)}\big\|\cdot\big\|x_i^{(l-1)}\big\|\cdot\left\|\Sigma_L'\big(z_i^{(L)}\big)(\mathbf{H}_i^{-1})^{(L)}\mathbf{g}_i^{(L)} - \Sigma_L'\big(z_j^{(L)}\big)(\mathbf{H}_j^{-1})^{(L)}\mathbf{g}_j^{(L)}\right\| + \left\|\Sigma_L'\big(z_j^{(L)}\big)(\mathbf{H}_i^{-1})^{(L)}\mathbf{g}_i^{(L)}\right\|\cdot\Big(\big\|\Delta_i^{(l)}\big\|\cdot\big\|x_i^{(l-1)}\big\| + \big\|\Delta_j^{(l)}\big\|\cdot\big\|x_j^{(l-1)}\big\|\Big) \tag{83}$$
$$\leq \underbrace{\max_l\Big(\big\|\Delta_i^{(l)}\big\|\cdot\big\|x_i^{(l-1)}\big\|\Big)}_{c_{l,i}}\cdot\left\|\Sigma_L'\big(z_i^{(L)}\big)(\mathbf{H}_i^{-1})^{(L)}\mathbf{g}_i^{(L)} - \Sigma_L'\big(z_j^{(L)}\big)(\mathbf{H}_j^{-1})^{(L)}\mathbf{g}_j^{(L)}\right\| + \underbrace{\left\|\Sigma_L'\big(z_i^{(L)}\big)(\mathbf{H}_i^{-1})^{(L)}\mathbf{g}_i^{(L)}\right\|\cdot\max_{l,i,j}\Big(\big\|\Delta_i^{(l)}\big\|\cdot\big\|x_i^{(l-1)}\big\| + \big\|\Delta_j^{(l)}\big\|\cdot\big\|x_j^{(l-1)}\big\|\Big)}_{c_2} \tag{84}$$

From (Katharopoulos & Fleuret, 2018) and (Mirzasoleiman et al., 2020), the variation of the gradient norm is mostly captured by the gradient of the loss with respect to the pre-activation outputs of the last layer of the network. We obtain a similar result here: the variation of the norm of the Hessian-inverse-preconditioned gradient is mostly captured by the preconditioned gradient of the loss with respect to the pre-activation outputs of the last layer. Assuming $\left\|\Sigma_L'\big(z_i^{(L)}\big)(\mathbf{H}_i^{-1})^{(L)}\mathbf{g}_i^{(L)}\right\|$ is bounded, we get

$$\|\mathbf{H}_i^{-1}\mathbf{g}_i - \mathbf{H}_j^{-1}\mathbf{g}_j\| \leq c_1\left\|\Sigma_L'\big(z_i^{(L)}\big)(\mathbf{H}_i^{-1})^{(L)}\mathbf{g}_i^{(L)} - \Sigma_L'\big(z_j^{(L)}\big)(\mathbf{H}_j^{-1})^{(L)}\mathbf{g}_j^{(L)}\right\| + c_2, \tag{85}$$

where $c_1, c_2$ are constants. The above holds for an affine operation followed by a slope-bounded non-linearity ($|\sigma'(w)| \leq K$).
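As a concrete sketch of the resulting last-layer selection proxy (our own illustration; the constants $c_1, c_2$ and the exact preconditioner used in the released code are not reproduced here), one can compute, per example, the gradient of the softmax cross-entropy loss with respect to the last-layer pre-activations and precondition it by the diagonal of the corresponding Hessian, whose entries are $p_k(1-p_k)$.

```python
import numpy as np

def last_layer_precond_grads(logits, labels, eps=1e-8):
    """logits: (n, C) pre-activation outputs; labels: (n,) integer class ids.
    Returns the per-example last-layer gradient dL/dz preconditioned by the
    diagonal of the last-layer Hessian of the cross-entropy loss."""
    z = logits - logits.max(axis=1, keepdims=True)        # numerically stable softmax
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)
    onehot = np.eye(logits.shape[1])[labels]
    g = p - onehot                                        # dL/dz per example
    h_diag = p * (1.0 - p)                                # diag of d^2L/dz^2
    return g / (h_diag + eps)                             # elementwise preconditioning

# Pairwise distances between these low-dimensional proxies bound (up to the
# constants in Eq. (85)) the distances between full-network preconditioned gradients.
rng = np.random.default_rng(0)
feats = last_layer_precond_grads(rng.normal(size=(8, 10)),
                                 rng.integers(0, 10, size=8))
print(feats.shape)
```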

B.3 Analytic Hessian for Logistic Regression

Here we provide the analytic formulation of the Hessian of the binary cross-entropy loss, per data point $i$, with respect to the weights $\mathbf{w}$ for logistic regression.

For binary logistic regression we have a loss function $\mathcal{L}(\mathbf{w})$ defined as

$$\mathcal{L}(\mathbf{w}) = -\sum_{i=1}^N l_i(\mathbf{w}) = -\sum_{i=1}^N\Big[y_i\ln(\hat{\sigma}_i) + (1-y_i)\ln(1-\hat{\sigma}_i)\Big], \text{ where } \hat{\sigma}_i = \sigma(\mathbf{w}^T\mathbf{x}_i + b), \tag{86}$$

and $\sigma$ is the sigmoid function.

We form a Hessian matrix for each data point $i$ based on the loss function $l_i(\mathbf{w})$ as follows:

$$H_i = \begin{pmatrix}\dfrac{\partial^2 l_i(\mathbf{w})}{\partial\mathbf{w}\,\partial\mathbf{w}^T} & \dfrac{\partial^2 l_i(\mathbf{w})}{\partial\mathbf{w}\,\partial b} \\[4pt] \dfrac{\partial^2 l_i(\mathbf{w})}{\partial b\,\partial\mathbf{w}} & \dfrac{\partial^2 l_i(\mathbf{w})}{\partial b^2}\end{pmatrix} = \begin{pmatrix}\hat{\sigma}_i(1-\hat{\sigma}_i)\,\mathbf{x}_i\mathbf{x}_i^T & \hat{\sigma}_i(1-\hat{\sigma}_i)\,\mathbf{x}_i \\ \big[\hat{\sigma}_i(1-\hat{\sigma}_i)\,\mathbf{x}_i\big]^T & \hat{\sigma}_i(1-\hat{\sigma}_i)\end{pmatrix}$$

This allows us to form the per-point Hessian information analytically, which is needed to precompute a single coreset that is then used throughout training of the convex regularized logistic regression problem.
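A short sketch of this computation in code (our own, not the released implementation) is given below; it builds the per-point Hessian block matrix above for a single example.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def per_point_hessian(x, w, b):
    """Per-example Hessian of the binary cross-entropy loss, including the
    bias block, matching the block matrix above."""
    s = sigmoid(w @ x + b)
    c = s * (1.0 - s)
    d = x.shape[0]
    H = np.empty((d + 1, d + 1))
    H[:d, :d] = c * np.outer(x, x)    # d^2 l / dw dw^T
    H[:d, d] = c * x                  # d^2 l / dw db
    H[d, :d] = c * x                  # d^2 l / db dw
    H[d, d] = c                       # d^2 l / db^2
    return H

x = np.array([0.5, -1.2, 0.3])
print(per_point_hessian(x, w=np.zeros(3), b=0.0))
```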

Appendix C Further Empirical Evidence

C.1 AdaCore estimates full gradient closely, reaching smaller loss

AdaCore obtains a better estimate of the preconditioned gradient than the state-of-the-art algorithm Craig and random subsets by considering both curvature and gradient information. This is quantified by calculating the difference between the weighted gradients of the coresets and the gradient of the complete dataset.

Fig. 6 compares the loss reached by AdaCore and Craig over different subset sizes. Coresets selected by AdaCore for classifying the Ijcnn1 dataset with logistic regression reach a lower loss than Craig across varying subset sizes.

Figure 6: Normalized loss for logistic regression over different subset sizes on the Ijcnn1 dataset using SGD. AdaCore coresets, which incorporate curvature information, consistently reach a lower loss than CRAIG, which only considers gradient information.

C.2 Class imbalance CIFAR-10

To provide further empirical evidence, we include results on a class-imbalanced version of the CIFAR-10 dataset for ResNet18. We skewed the class distribution linearly, keeping 90% of class 9, 80% of class 8, ..., 10% of class 1, and 0% of class 0, and trained for 200 epochs. Selecting a coreset every epoch can be computationally expensive; instead, one can compute a coreset once every $R$ epochs. Here we investigate AdaCore's performance for various values of $R$. As Table 5 shows, AdaCore withstands class imbalance much better than Craig and randomly selected subsets. When $R=20$, AdaCore achieves 57.3% final test accuracy: +8.7% above Craig, +2.6% above Random, +27.4% above GradMatch, and +36.2% above Glister.

Table 5: CIFAR-10 class imbalance, ResNet18. Final test accuracy and percent of the full data selected (in parentheses). Trained with SGD + momentum, selecting a coreset of $S$ percent of the full dataset every $R$ epochs. Note that AdaCore attains greater accuracy while seeing fewer data points.

Accuracy | S=1%, R=20 | S=1%, R=10 | S=1%, R=5
AdaCore | 57.3% ± 0.5 (5%) | 57.12% ± 0.96 (9.5%) | 60.2% ± 0.36 (14.5%)
Craig | 48.6% ± 0.8 (8%) | 55% ± 1 (16%) | 53.05% ± 0.24 (27.5%)
Random | 54.7% ± 0.3 (8%) | 54.6% ± 0.76 (18%) | 54.6% ± 0.74 (33.2%)
GradMatch | 29.9% ± 0.4 (8.2%) | 29.1% ± 0.8 (14.7%) | 32.75% ± 0.83 (23.2%)
Glister | 21.1% ± 0.42 (8.6%) | 17.2% ± 0.75 (16%) | 14.4% ± 0.83 (22.2%)

Not only is AdaCore able to outperform Craig, random, GradMatch, and Glister, but it can do so while selecting a smaller fraction of the data points during training, as shown under all settings in Table 5.

C.3 Class imbalance BDD100k

Additionally, we compared AdaCore to Craig and random selection on the BDD100k dataset, which has seven inherently imbalanced classes and 100k data points. We train ResNet50 with SGD + momentum for 100 epochs, choosing a subset of size S = 10% every R = 20 epochs, on the weather prediction task. As Table 6 shows, AdaCore outperforms Craig by 2% and random by 8.8%.

Table 6: AdaCore outperforms other baseline subset selection algorithms as well as training on the full dataset, reaching a better accuracy in less time. This provides up to a 2.3x speedup compared to the state of the art.
SGD + Momentum | Accuracy (S = 10%, R = 20)
AdaCore | 74.3%
Craig | 72.3%
Random | 65.5%

Additionally, Table 7 shows that AdaCore outperforms baseline methods on BDD100k, providing a 2.3x speedup over training on the entire dataset and a 1.8x speedup over random. Craig, GradMatch, and Glister do not reach the accuracy of AdaCore even when given more time and epochs. The epoch count is shown in parentheses next to the accuracy. These experiments were run with SGD + momentum.

Table 7: AdaCore outperforms other baseline subset selection algorithms as well as training on the full dataset, reaching a better accuracy in less time. This provides up to a 2.3x speedup compared to the state of the art.
BDD100k (S = 10%, R = 20) | Accuracy (epoch) | Time (s) | Speedup over Rand | Speedup over Full
AdaCore | 74.3% (100) | 7331 | 1.8 | 2.3
Craig | 73.1% (150) | 10996 | 1.3 | 1.6
Random | 73.3% (180) | 13050 | 1 | 1.2
GradMatch | 72% (200) | 14040 | 0.7 | 1.1
Glister | 73% (200) | 12665 | 1.03 | 1.2
Full Dataset | 74.3% (45) | 16093 | 0.8 | 1

C.4 CIFAR-100

Table 8 shows that AdaCore outperforms baseline methods on CIFAR100, providing a 2.8x speedup over training on the entire dataset and a 4.3x speedup over Random. Craig, GradMatch, and Glister do not reach the accuracy of AdaCore even when given more time and epochs. The epoch count is shown in parentheses next to the accuracy. These experiments were run with SGD + momentum.

Table 8: AdaCore outperforms other baseline subset selection algorithms as well as training on the full dataset, reaching a better accuracy in less time. This provides up to a 4.3x speedup compared to the state of the art.
CIFAR100 (S = 10%, R = 20) | Accuracy (epoch) | Time (s) | Speedup over Rand | Speedup over Full
AdaCore | 58.8% (200) | 341 | 4.3 | 2.8
Craig | 57.3% (250) | 426 | 3.5 | 2.2
Random | 58.1% (864) | 1470 | 1 | 0.65
GradMatch | 57% (200) | 980 | 1.5 | 0.97
Glister | 56% (300) | 1110 | 1.3 | 0.86
Full Dataset | 59% (40) | 960 | 1.5 | 1

C.5 When first order coresets fail, continued

By preconditioning with curvature information, AdaCore is able to magnify smaller gradient dimensions that would otherwise be ignored during coreset selection. Moreover, it allows AdaCore to include points with similar gradients but different curvature properties. Hence, AdaCore can select more diverse subsets than Craig and GradMatch. This allows AdaCore to outperform first-order coreset methods in many regimes, such as when the subset size is large (e.g., ≥ 10%) and for larger batch sizes (e.g., ≥ 128).

(a) AdaCore with gradients w.r.t. the penultimate layer, training with SGD + momentum
Figure 7: Classification accuracy of ResNet20 across training on the CIFAR10 dataset, selecting coresets with AdaCore, Craig, and GradMatch. Here, all coreset selection methods used the gradients of the model's last layer (dimension 64). Coresets were selected every R = 20 epochs with coreset size S = 10%. Note that Craig and GradMatch are vulnerable to catastrophic forgetting, but AdaCore is not.

In addition to the results shown in Figure 3a (reproduced here as Fig. 8a), where R = 1, AdaCore outperforms Craig and GradMatch when we increase the coreset selection period R. Fig. 7 shows that for larger R, first-order methods succumb to catastrophic forgetting each time a new subset is chosen, whereas AdaCore achieves a smooth rise in classification accuracy. This increased stability between coresets is another benefit of AdaCore's greater selection diversity. Interestingly, AdaCore achieves a higher final test accuracy while selecting a smaller fraction of data points to train on than Craig. Since AdaCore takes curvature into account while selecting coresets, it can select data points with similar gradients but different curvature properties, and thus extracts a more diverse set of data points than Craig. However, as the coresets found by AdaCore provide a close estimate of the full preconditioned gradient for several epochs during training, the number of distinct data points selected by AdaCore is smaller than for Craig.

(a) Accuracy vs. Epoch
(b) Accuracy vs. Time
Figure 8: (a) Test accuracy of AdaCore, CRAIG, Random, GradMatch, and GLISTER with ResNet-18, selecting subsets of size 1% each epoch, batch size 256. (b) Training ResNet-18 on subsets of size S = 1% selected every R = 1 epoch, with AdaCore, Craig, Glister, and GradMatch for 200 epochs vs. Random for 1000 epochs and full for 15 epochs. AdaCore outperforms baselines by providing a 2x speedup over full and more than a 4.5x speedup over Random.

For completeness we provide Fig. 8b, in which we allow random subset selection to train for 1000 epochs. It takes over 4.5x longer for Random to approach the accuracy of ResNet18 trained with AdaCore or the full dataset. We use the same experimental setup as in Fig. 8a.

C.6 MNIST

For our MNIST classifier, we use a fully-connected hidden layer of 100 nodes and ten softmax output nodes, with sigmoid activations, L2 regularization with $\mu = 10^{-4}$, and a mini-batch size of 32, on the MNIST dataset of handwritten digits containing 60,000 training and 10,000 test images, all normalized to [0, 1] by division by 255. We apply SGD with a momentum of 0.9 to subsets of size 40% of the dataset, chosen at the beginning of each epoch by AdaCore, CRAIG, and random selection. Fig. 9 compares the training loss and test accuracy of the network trained on coresets chosen by AdaCore, CRAIG, and random selection with those of the entire dataset. AdaCore benefits from the second-order information and effectively finds subsets that achieve superior performance to the baselines and the entire dataset, while achieving a 2.5x speedup over training on the entire dataset.

(a) MNIST
(b) MNIST
Figure 9: Test accuracy and training loss of SGD with momentum applied to subsets found by AdaCore vs. CRAIG and random subsets on MNIST with a 2-layer neural network. AdaCore achieves a 2.5x speedup and better test accuracy compared to training on the full dataset.

C.7 How batch size affects coreset performance

Table 9 shows that training with a larger batch size on subsets selected by AdaCore achieves superior accuracy. We reproduce Table 3 here with standard deviation values.

Table 9: Training ResNet18 with S = 1% subsets selected every R = 1 epoch from CIFAR10, using batch sizes b = 512, 256, 128. AdaCore can leverage a larger mini-batch size and obtain a larger accuracy gap to Craig and Random. For b = 512, we have 1 mini-batch (GD).
Batch size | AdaCore | Craig | Rand | Gap/Craig | Gap/Rand
GD, b=512 | 58.32% ± 0.45 | 56.32% ± 0.32 | 49.14% ± 1.19 | 1.69% | 8.91%
SGD, b=256 | 68.23% ± 0.2 | 58.3% ± 1.38 | 60.7% ± 1.04 | 9.93% | 8.16%
SGD, b=128 | 66.89% ± 0.73 | 58.17% ± 1.34 | 65.46% ± 0.93 | 8.81% | 1.52%

C.8 Potential Social Impacts

Regarding social impact, our coreset method outperforms other methods in accuracy while selecting fewer data points during training and providing over a 2.5x speedup. This enables a more efficient learning pipeline with a smaller environmental impact. Our method can significantly decrease the financial and environmental costs of learning from big data: the financial costs stem from expensive computational resources, and the environmental costs from substantial energy consumption and the resulting carbon footprint.