

Determinantal point processes based on orthogonal polynomials for sampling minibatches in SGD

Rémi Bardenet (Université de Lille, CNRS, Centrale Lille, UMR 9189–CRIStAL, F-59000 Lille, France; remi.bardenet@univ-lille.fr), Subhroshekhar Ghosh (corresponding author; National University of Singapore, Department of Mathematics, 10 Lower Kent Ridge Road, 119076, Singapore; subhrowork@gmail.com), Meixia Lin (corresponding author; National University of Singapore, Institute of Operations Research and Analytics, 10 Lower Kent Ridge Road, 119076, Singapore; lin_meixia@u.nus.edu). Authors are listed in alphabetical order.
(August 1, 2025)
Abstract

Stochastic gradient descent (SGD) is a cornerstone of machine learning. When the number $N$ of data items is large, SGD relies on constructing an unbiased estimator of the gradient of the empirical risk using a small subset of the original dataset, called a minibatch. Default minibatch construction involves uniformly sampling a subset of the desired size, but alternatives have been explored for variance reduction. In particular, experimental evidence suggests drawing minibatches from determinantal point processes (DPPs), tractable distributions over minibatches that favour diversity among selected items. However, like in recent work on DPPs for coresets, providing a systematic and principled understanding of how and why DPPs help has been difficult. In this work, we contribute an orthogonal polynomial-based determinantal point process paradigm for performing minibatch sampling in SGD. Our approach leverages the specific data distribution at hand, which endows it with greater sensitivity and power over existing data-agnostic methods. We substantiate our method via a detailed theoretical analysis of its convergence properties, interweaving between the discrete data set and the underlying continuous domain. In particular, we show how specific DPPs and a string of controlled approximations can lead to gradient estimators with a variance that decays faster with the batchsize than under uniform sampling. Coupled with existing finite-time guarantees for SGD on convex objectives, this entails that, for a large enough batchsize and a fixed budget of item-level gradients to evaluate, DPP minibatches lead to a smaller bound on the mean square approximation error than uniform minibatches. Moreover, our estimators are amenable to a recent algorithm that directly samples linear statistics of DPPs (i.e., the gradient estimator) without sampling the underlying DPP (i.e., the minibatch), thereby reducing computational overhead. We provide detailed synthetic as well as real data experiments to substantiate our theoretical claims.


1 Introduction

Consider minimizing an empirical loss

\min_{\theta\in\Theta}\ \frac{1}{N}\cdot\sum_{i=1}^{N}\mathcal{L}(\bm{z}_{i},\theta)+\lambda(\theta),   (1)

with some penalty $\lambda:\Theta\to\mathbb{R}_{+}$. Many learning tasks, such as regression and classification, are usually framed that way [1]. When $N\gg 1$, computing the gradient of the objective in (1) becomes a bottleneck, even if individual gradients $\nabla_{\theta}\mathcal{L}(\bm{z}_{i},\theta)$ are cheap to evaluate. For a fixed computational budget, it is thus tempting to replace vanilla gradient descent by more iterations, each using an approximate gradient obtained from only a few data points. Stochastic gradient descent (SGD; [2]) follows this template. In its simplest form, SGD builds an unbiased estimator of the gradient at each iteration, independently from past iterations, using a minibatch of random samples from the data set. Theory [3] and practice suggest that the variance of the gradient estimators in SGD should be kept as small as possible. It is thus natural that variance reduction for SGD has been a rich area of research; see for instance the detailed references in [4, Section 2].

In a related vein, determinantal point processes (DPPs) are probability distributions over subsets of a (typically large or infinite) ground set that are known to yield samples made of collectively diverse items, while being tractable both in terms of sampling and inference. Originally introduced in electronic optics [5], they have been turned into generic statistical models for repulsion in spatial statistics [6] and machine learning [7, 8]. In ML, DPPs have also been shown to be efficient sampling tools; see Section 2. Importantly for us, there is experimental evidence that minibatches in (1) drawn from DPPs and other repulsive point processes can yield gradient estimators with low variance for advanced learning tasks [4, 9], though a conclusive theoretical result has remained elusive. In particular, it is hard to see the right candidate DPP when, unlike linear regression, the objective function does not necessarily have a geometric interpretation.

Our contributions are as follows. We combine continuous DPPs based on orthogonal polynomials [10] and kernel density estimators built on the data to obtain two gradient estimators; see Section 3. We prove that their variance is $\mathcal{O}_{P}(p^{-(1+1/d)})$, where $p$ is the size of the minibatch and $d$ is the dimension of the data; see Section 4. This provides theoretical backing to the claim that DPPs yield variance reduction [4] over, say, uniform sampling without replacement. In passing, the combination of analytic tools (orthogonal polynomials) with an essentially discrete subsampling task (minibatch sampling) sheds light on new ways to build discrete DPPs for subsampling. Finally, we demonstrate our theoretical results on simulated data in Section 5.

A cornerstone of our approach is to utilise orthogonal polynomials to construct our sampling paradigm, interweaving between the discrete set of data points and the continuum in which the orthogonal polynomials reside. A few words are in order regarding the motivation for our choice of techniques. Roughly speaking, we would like to use a DPP that is tailored to the data distribution at hand. Orthogonal Polynomial Ensembles (OPEs) provide a natural way of associating a DPP to a given measure (in this case, the probability distribution of the data points), along with a substantive body of mathematical literature and tools that can be summoned as per necessity. This makes it a natural choice for our purposes.

Notation.

Let the data be denoted by $\mathfrak{D}:=\{\bm{z}_{1},\ldots,\bm{z}_{N}\}$, and assume that the $\bm{z}_{i}$'s are drawn i.i.d. from a distribution $\gamma$ on $\mathbb{R}^{d}$. Assume $\gamma$ is compactly supported, with support $\mathcal{D}\subset[-1,1]^{d}$ bounded away from the border of $[-1,1]^{d}$. Assume also that $\gamma$ is continuous with respect to the Lebesgue measure, and that its density is bounded away from zero on its support. While our assumptions exclude learning problems with discrete labels, such as classification, we later give experimental support that our estimators yield variance reduction in that case too. We define the empirical measure $\hat{\gamma}_{N}:=N^{-1}\sum_{i=1}^{N}\delta_{\bm{z}_{i}}$, where $\delta_{\bm{z}_{i}}$ is the delta measure at the point $\bm{z}_{i}\in\mathbb{R}^{d}$. Clearly, $\hat{\gamma}_{N}\to\gamma$ in $\mathcal{P}(\mathbb{R}^{d})$; under our operating assumption of compact support, this amounts to convergence in $\mathcal{P}(\mathcal{D})$.

For simplicity, we assume that no penalty is used in (1), but our results extend straightforwardly. We denote the gradient of the empirical loss by $\Xi_{N}=\Xi_{N}(\theta):=N^{-1}\sum_{i=1}^{N}\nabla_{\theta}\mathcal{L}(\bm{z}_{i},\theta)$. A minibatch is a (random) subset $A\subset[N]$ of size $|A|=p\ll N$ such that the random variable

\Xi_{A}=\Xi_{A}(\theta):=\sum_{i\in A}w_{i}\nabla_{\theta}\mathcal{L}(\bm{z}_{i},\theta),   (2)

for suitable weights $(w_{i})$, provides a good approximation of $\Xi_{N}$.

2 Background and relevant literature

Stochastic gradient descent.

Going back to [2], SGD has been a major workhorse for machine learning; see e.g. [1, Chapter 14]. The basic version of the algorithm, applied to the empirical risk minimization (1), is to repeatedly update

\theta_{t+1}\leftarrow\theta_{t}-\eta_{t}\Xi_{A}(\theta_{t}),\quad t\in\mathbb{N},   (3)

where $\eta_{t}$ is a (typically decreasing) stepsize, and $\Xi_{A}$ is a minibatch-based estimator (2) of the gradient of the empirical risk function (1), possibly depending on past iterations. Most theoretical analyses assume that for any $\theta$, $\Xi_{A}(\theta)$ in the $t$-th update (3) is unbiased, conditionally on the history of the Markov chain $(\theta_{t})$ so far. For simplicity, we henceforth make the assumption that $\Xi_{A}$ does not depend on the past of the chain. In particular, using such unbiased gradients, one can derive a nonasymptotic bound [3] on the mean square distance of $\theta_{t}$ to the minimizer of (1), for strongly convex and smooth loss functions like linear regression and logistic regression with $\ell_{2}$ penalization. More precisely, for $\eta_{t}\propto t^{-\alpha}$ and $0<\alpha<1$, [3, Theorem 1] yields

\mathbb{E}\|\theta_{t}-\theta_{\star}\|^{2}\leq f(t)\left(\mathbb{E}\|\theta_{0}-\theta_{\star}\|^{2}+\frac{\sigma^{2}}{L}\right)+C\frac{\sigma^{2}}{t^{\alpha}},   (4)

where $C,L>0$ are problem-dependent constants, $f(t)=\mathcal{O}(e^{-t^{\alpha}})$, and $\sigma^{2}=\mathbb{E}[\|\Xi_{A}(\theta_{\star})\|^{2}|\mathfrak{D}]$ is the trace of the covariance matrix of the gradient estimator, evaluated at the optimizer $\theta_{\star}$ of (1). The initialization bias is thus forgotten subexponentially fast, while the asymptotically leading term is proportional to $\sigma^{2}/t^{\alpha}$. Combined with the practical insight that variance reduction for gradients is key, theoretical results like (4) have motivated methodological research on efficient gradient estimators [4, Section 2], i.e., on constructing minibatches so as to minimize $\sigma^{2}$. In particular, repulsive point processes such as determinantal point processes have been empirically demonstrated to yield variance reduction and overall better performance on ML tasks [4, 9]. Our paper is a first step towards a theoretical analysis that supports these experimental claims.
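To fix ideas, here is a minimal sketch of the update (3), with the minibatch gradient estimator left as a pluggable callable. The quadratic toy problem and the step size $\eta_{t}=t^{-0.9}$ in the usage example are illustrative assumptions, not choices prescribed by the analysis above.

```python
import numpy as np

def sgd(theta0, grad_estimator, n_iter=1000, alpha=0.9):
    """Run SGD: theta_{t+1} = theta_t - eta_t * Xi_A(theta_t), eta_t = t^{-alpha}."""
    theta = theta0.copy()
    for t in range(1, n_iter + 1):
        eta_t = t ** (-alpha)                 # decreasing stepsize, 0 < alpha < 1
        theta = theta - eta_t * grad_estimator(theta)
    return theta

# Toy usage: ridge-regularized linear regression with the full gradient,
# i.e. the minibatch is the whole dataset (purely for illustration).
rng = np.random.default_rng(0)
X, y = rng.uniform(-1, 1, (1000, 3)), rng.uniform(-1, 1, 1000)
full_grad = lambda th: X.T @ (X @ th - y) / len(y) + 0.1 * th
theta_hat = sgd(np.zeros(3), full_grad)
```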

The Poissonian benchmark.

The default approach to sample a minibatch $A\subset[N]$ is to sample $p$ data points from $\mathfrak{D}$ uniformly, with or without replacement, and take $w_{i}=1/p$ constant in (2). Sampling either with or without replacement leads to an unbiased gradient estimator. A similar third approach is Poissonian random sampling, which simply consists in starting from $A=\emptyset$ and independently adding each element of $\mathfrak{D}$ to the minibatch $A$ with probability $p/N$. The Poisson estimator $\Xi_{A,\mathrm{Poi}}$ is then (2), with constant weights again equal to $1/p$. When $p\ll N$, which is the regime of SGD, the cardinality of $A$ is tightly concentrated around $p$, and $\Xi_{A,\mathrm{Poi}}$ has the same fluctuations as the two default estimators, while being easier to analyze. In particular, $\mathbb{E}[\Xi_{A,\mathrm{Poi}}|\mathfrak{D}]=\Xi_{N}$ and $\mathrm{Var}[\Xi_{A,\mathrm{Poi}}|\mathfrak{D}]=\mathcal{O}_{P}(p^{-1})$; see Appendix S1 for details.
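As an illustration, a minimal sketch of the Poissonian benchmark follows; `grad_i(i, theta)`, assumed to return $\nabla_{\theta}\mathcal{L}(\bm{z}_{i},\theta)$, is a hypothetical helper.

```python
import numpy as np

def poisson_minibatch_gradient(grad_i, theta, N, p, rng):
    """Unbiased estimator Xi_{A,Poi}: each item enters A independently w.p. p/N,
    and the estimator uses the constant weights w_i = 1/p."""
    A = np.flatnonzero(rng.random(N) < p / N)        # Bernoulli(p/N) inclusions
    if A.size == 0:                                  # empty batch contributes zero
        return np.zeros_like(theta)
    return sum(grad_i(i, theta) for i in A) / p      # only |A| ~ p gradients evaluated
```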

DPPs as (sub)sampling algorithms.

As distributions over subsets of a large ground set that favour diversity, DPPs are intuitively good candidates for subsampling tasks, and one of their primary applications in ML has been as summary extractors [7]. Since then, DPPs or mixtures thereof have been used, e.g., to generate experimental designs for linear regression, leading to strong theoretical guarantees [11, 12, 13]; see also [14] for a survey of DPPs in randomized numerical linear algebra, or [15] for feature selection in linear regression with DPPs.

When the objective function of the task has less structure than linear regression, it has been more difficult to prove that finite DPPs significantly improve over i.i.d. sampling. For coreset construction, for instance, [16] manages to prove that a projection DPP necessarily improves over i.i.d. sampling with the same marginals [16, Corollary 3.7], but the authors stress the disappointing fact that current concentration results for strongly Rayleigh measures (such as DPPs) do not yet allow one to prove that DPP coresets need significantly fewer points than their i.i.d. counterparts [16, Section 3.2]. Even closer to our motivation, DPPs for minibatch sampling have shown promising experimental performance [4], but reading between the lines of the proof of [4, Theorem 1], a bad choice of DPP can even yield a larger variance than i.i.d. sampling!

For such unstructured problems (compared to, say, linear regression) as coreset extraction or loss-agnostic minibatch sampling, we propose to draw inspiration from work on continuous DPPs, where faster-than-i.i.d. error decays have been proven in similar contexts. For instance, orthogonal polynomial theory motivated [10] to introduce a particular DPP, called a multivariate orthogonal polynomial ensemble, and to prove a faster-than-i.i.d. central limit theorem for Monte Carlo integration of smooth functions. While resorting to continuous tools to study a discrete problem such as minibatch sampling may be unexpected, we shall see that the assumption that the size of the dataset is large compared to the ambient dimension crucially allows us to transfer variance reduction arguments.

3 DPPs, and two gradient estimators

Since we shall need both discrete and continuous DPPs, we assume that $\mathcal{X}$ is either $\mathbb{R}^{d}$ or $\mathfrak{D}$, and follow [17] in introducing DPPs in an abstract way that encompasses both cases. After that, we propose two gradient estimators for SGD that build on a particular family of continuous DPPs introduced in [10] for Monte Carlo integration, called multivariate orthogonal polynomial ensembles.

3.1 DPPs: The kernel machine of point processes

A point process on $\mathcal{X}$ is a distribution over finite subsets $S$ of $\mathcal{X}$; see [18] for a general reference. Given a reference measure $\mu$ on $\mathcal{X}$, a point process is said to be determinantal (a DPP) if there exists a function $K:\mathcal{X}\times\mathcal{X}\rightarrow\mathbb{R}$, the kernel of the DPP, such that, for every $n\geq 1$,

\mathbb{E}\bigg[\sum_{\neq}f(x_{i_{1}},\ldots,x_{i_{n}})\bigg]=\int_{\mathcal{X}^{n}}f(x_{1},\ldots,x_{n})\,\mathrm{Det}\Big[K(x_{i},x_{\ell})\Big]_{i,\ell=1}^{n}\,\mathrm{d}\mu^{\otimes n}(x_{1},\ldots,x_{n})   (5)

for every bounded Borel function $f:\mathcal{X}^{n}\to\mathbb{R}$, where the sum on the left-hand side of (5) ranges over all pairwise distinct $n$-tuples of points of the random finite subset $S$. A few remarks are in order. First, satisfying (5) for every $n$ is a strong constraint on $K$, so that not every kernel yields a DPP. A set of sufficient conditions on $K$ is given by the Macchi-Soshnikov theorem [5, 19]. In words, if $K(x,y)=K(y,x)$, and if $K$ further corresponds to an integral operator

f\mapsto\int K(x,y)f(y)\,\mathrm{d}\mu(y),\quad f\in L^{2}(\mathcal{X},\mu),

that is trace-class with spectrum in $[0,1]$, then the corresponding DPP exists. Second, note in (5) that the kernel of a DPP encodes how the points in the random configurations interact. A strong point in favour of DPPs is that, unlike most interacting point processes, they remain tractable in terms of both sampling and inference [17, 6]. Third, (5) yields simple formulas for the mean and variance of linear statistics of a DPP.

Proposition 1 (See e.g. [20, 21]).

Let $S\sim\mathrm{DPP}(K,\mu)$ and $\Phi:\mathcal{X}\rightarrow\mathbb{R}$ be a bounded Borel function,

\mathbb{E}\left[\sum_{x\in S}\Phi(x)\right]=\int\Phi(x)K(x,x)\,\mathrm{d}\mu(x),   (6)

\mathrm{Var}\left[\sum_{x\in S}\Phi(x)\right]=\iint\left\|\Phi(x)-\Phi(y)\right\|_{2}^{2}|K(x,y)|^{2}\,\mathrm{d}\mu(x)\mathrm{d}\mu(y)+\int\left\|\Phi(x)\right\|_{2}^{2}\left(K(x,x)-\int|K(x,y)|^{2}\,\mathrm{d}\mu(y)\right)\mathrm{d}\mu(x).   (7)

Since the seminal paper [7], the case $\mathcal{X}=\mathfrak{D}$ has been the most common in machine learning. Taking $\mu$ to be $\hat{\gamma}_{N}$, any kernel is determined by its restriction to $\mathfrak{D}$, usually given as an $N\times N$ matrix $\mathbf{K}=K|_{\mathfrak{D}}$. Equation (5) with $n=p$, $A$ a subset of $\mathfrak{D}$ of size $p$, and $\Phi$ the indicator of $A$ yields $\mathbb{P}(A\subset S)=N^{-p}\,\mathrm{Det}\,\mathbf{K}_{A}$. This is the usual way finite DPPs are introduced [7], except maybe for the factor $N^{-p}$, which comes from using $\hat{\gamma}_{N}$ as the reference measure instead of $N\hat{\gamma}_{N}$. In this finite setting, a careful implementation of the general DPP sampler of [17] yields $m$ DPP samples of average cardinality $\mathrm{Tr}(\mathbf{K})$ in $\mathcal{O}(N^{\omega}+mN\,\mathrm{Tr}(\mathbf{K})^{2})$ operations [22].

We now go back to a general $\mathcal{X}$ and fix $p\in\mathbb{N}$. A canonical way to construct DPPs generating configurations of $p$ points almost surely, i.e. $S=\{x_{1},\ldots,x_{p}\}$, is the following. Consider $p$ orthonormal functions $\phi_{0},\ldots,\phi_{p-1}$ in $L^{2}(\mu)$, and take for kernel

K(x,y)=\sum_{k=0}^{p-1}\phi_{k}(x)\phi_{k}(y).   (8)

In this setting, the (permutation invariant) random variables $x_{1},\ldots,x_{p}$ with joint distribution

\frac{1}{p!}\mathrm{Det}\Big[K(x_{i},x_{\ell})\Big]_{i,\ell=1}^{p}\prod_{i=1}^{p}\mathrm{d}\mu(x_{i})   (9)

generate a DPP $\{x_{1},\dots,x_{p}\}$ with kernel $K(x,y)$, called a projection DPP. For further information on determinantal point processes, we refer the reader to [17, 7].
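For concreteness, the sketch below implements the standard sequential sampler for a projection DPP on a finite ground set, given a matrix of orthonormal vectors defining the kernel (8) with respect to the counting measure. It is a generic textbook-style routine in the spirit of [17, 22], not the paper's own implementation, which relies on the DPPy package.

```python
import numpy as np

def sample_projection_dpp(V, rng):
    """Sample from the projection DPP with kernel K = V V^T on {0,...,N-1},
    where V is an (N, p) matrix with orthonormal columns. Returns p indices."""
    V = V.copy()
    N, p = V.shape
    sample = []
    for _ in range(p):
        probs = np.sum(V ** 2, axis=1)          # current marginal intensities
        probs = probs / probs.sum()
        i = rng.choice(N, p=probs)
        sample.append(int(i))
        # Condition on i being selected: restrict the column space to vectors
        # whose i-th coordinate vanishes, then re-orthonormalize.
        j = np.argmax(np.abs(V[i, :]))          # a column with nonzero entry in row i
        Vj = V[:, j].copy()
        V = V - np.outer(Vj, V[i, :]) / V[i, j] # zero out row i (column j becomes 0)
        V = np.delete(V, j, axis=1)
        if V.shape[1] > 0:
            V, _ = np.linalg.qr(V)              # orthonormal basis of the new space
    return np.array(sample)
```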

3.2 Multivariate orthogonal polynomial ensembles

This section paraphrases [10] in their definition of a particular projection DPP on $\mathcal{X}=\mathbb{R}^{d}$, called a multivariate orthogonal polynomial ensemble (OPE). Fix some reference measure $q(x)\mathrm{d}x$ on $[-1,1]^{d}$, and assume that it puts positive mass on some open subset of $[-1,1]^{d}$. Now choose an ordering of the monomial functions $(x_{1},\ldots,x_{d})\mapsto x_{1}^{\alpha_{1}}\cdots x_{d}^{\alpha_{d}}$; in this work we use the graded lexical order. Then apply the Gram-Schmidt algorithm in $L^{2}(q(x)\mathrm{d}x)$ to these ordered monomials. This yields a sequence of orthonormal polynomial functions $(\phi_{k})_{k\in\mathbb{N}}$, the multivariate orthonormal polynomials w.r.t. $q$. Finally, plugging the first $p$ multivariate orthonormal polynomials $\phi_{0},\dots,\phi_{p-1}$ into the projection kernel (8), we obtain a projection DPP with kernel denoted by $K_{q}^{(p)}$, referred to as the multivariate OPE associated with the measure $q(x)\mathrm{d}x$.
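As an illustration of this construction, the sketch below evaluates $K_{q}^{(p)}$ in the special case where $q$ is the uniform density on $[-1,1]^{d}$, so that the univariate orthonormal polynomials are rescaled Legendre polynomials; the within-degree tie-breaking of the graded lexical order used here is one possible convention, chosen for simplicity.

```python
import itertools
import numpy as np
from numpy.polynomial import legendre

def graded_lex_multi_indices(p, d):
    """First p multi-indices ordered by total degree (one tie-break convention)."""
    idx, degree = [], 0
    while len(idx) < p:
        layer = [a for a in itertools.product(range(degree + 1), repeat=d)
                 if sum(a) == degree]
        idx.extend(sorted(layer, reverse=True))
        degree += 1
    return idx[:p]

def ope_kernel(x, y, p):
    """K_q^{(p)}(x, y) = sum_{k<p} phi_k(x) phi_k(y) for q uniform on [-1,1]^d."""
    d = len(x)
    val = 0.0
    for alpha in graded_lex_multi_indices(p, d):
        phi_x = phi_y = 1.0
        for j, deg in enumerate(alpha):
            # sqrt(2*deg+1) * P_deg is orthonormal w.r.t. the uniform density on [-1,1]
            c = np.zeros(deg + 1); c[deg] = np.sqrt(2 * deg + 1)
            phi_x *= legendre.legval(x[j], c)
            phi_y *= legendre.legval(y[j], c)
        val += phi_x * phi_y
    return val
```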

3.3 Our first estimator: reweight, restrict, and saturate an OPE kernel

Let $h>0$ and

\tilde{\gamma}(\bm{z})=\frac{1}{Nh^{d}}\sum_{i=1}^{N}k\left(\frac{\bm{z}-\bm{z}_{i}}{h}\right)   (10)

be a single-bandwidth kernel density estimator of the pdf of the data-generating distribution $\gamma$; see [23, Section 4.2]. In particular, $k$ is chosen so that $\int k(x)\mathrm{d}x=1$. Note that the approximation kernel $k$ is unrelated to any DPP kernel in this paper. Let now $q(x)=q_{1}(x_{1})\dots q_{d}(x_{d})$ be a separable pdf on $[-1,1]^{d}$, where each $q_{i}$ is Nevai-class (see [10, Section 4] for details; it suffices that each $q_{i}$ is positive on $[-1,1]$). Let $K_{q}^{(p)}$ be the multivariate OPE kernel defined in Section 3.2, and form a new kernel

K_{q,\tilde{\gamma}}^{(p)}(x,y):=\sqrt{\frac{q(x)}{\tilde{\gamma}(x)}}\,K_{q}^{(p)}(x,y)\,\sqrt{\frac{q(y)}{\tilde{\gamma}(y)}}.

The form of $K_{q,\tilde{\gamma}}^{(p)}$ is reminiscent of importance sampling [24], which is no accident. Indeed, while the positive semidefinite matrix

K_{q,\tilde{\gamma}}^{(p)}|_{\mathfrak{D}}:=\left(K_{q,\tilde{\gamma}}^{(p)}(\bm{z}_{i},\bm{z}_{j})\right)_{1\leq i,j\leq N}   (11)

is not necessarily the kernel matrix of a DPP on $(\{1,\dots,N\},\hat{\gamma}_{N})$, see Section 3.1, we built it to be close to a projection of rank $p$. More precisely,

\int K_{q,\tilde{\gamma}}^{(p)}(\bm{z}_{k},y)K_{q,\tilde{\gamma}}^{(p)}(y,\bm{z}_{\ell})\,\mathrm{d}\hat{\gamma}_{N}(y)=\sqrt{\frac{q(\bm{z}_{k})}{\tilde{\gamma}(\bm{z}_{k})}}\left[\frac{1}{N}\sum_{n=1}^{N}K_{q}^{(p)}(\bm{z}_{k},\bm{z}_{n})K_{q}^{(p)}(\bm{z}_{n},\bm{z}_{\ell})\frac{q(\bm{z}_{n})}{\tilde{\gamma}(\bm{z}_{n})}\right]\sqrt{\frac{q(\bm{z}_{\ell})}{\tilde{\gamma}(\bm{z}_{\ell})}}.

If $N$ is large compared to $p$ and $d$, so that in particular $\tilde{\gamma}\approx\gamma$, the term within brackets will be close to $\int K_{q}^{(p)}(\bm{z}_{k},\bm{z})K_{q}^{(p)}(\bm{z},\bm{z}_{\ell})q(\bm{z})\mathrm{d}\bm{z}=K_{q}^{(p)}(\bm{z}_{k},\bm{z}_{\ell})$, so that $K_{q,\tilde{\gamma}}^{(p)}|_{\mathfrak{D}}$ is almost a projection in $L^{2}(\hat{\gamma}_{N})$.

Let us now consider the orthogonal projection matrix $\widetilde{\mathbf{K}}$ with the same eigenvectors as $K_{q,\tilde{\gamma}}^{(p)}|_{\mathfrak{D}}$, but with the $p$ largest eigenvalues replaced by $1$ and the rest of the spectrum set to $0$. By the Macchi-Soshnikov theorem, $\widetilde{\mathbf{K}}$ is the kernel matrix of a DPP; see Section 3.1. We thus consider a minibatch $A\sim\mathrm{DPP}(\widetilde{\mathbf{K}})$. Since $A$ comes from a projection DPP, $|A|=p$ almost surely, and we define the gradient estimator

\Xi_{A,\mathrm{DPP}}:=\sum_{i\in A}\frac{\nabla_{\theta}\mathcal{L}(\bm{z}_{i},\theta)}{\widetilde{\mathbf{K}}_{ii}}.   (12)

In Section 4, we shall prove that $\Xi_{A,\mathrm{DPP}}$ is unbiased, and examine under what assumptions its variance decreases faster than $1/p$.
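Putting the pieces together, here is a naive end-to-end sketch of $\Xi_{A,\mathrm{DPP}}$, reusing `ope_kernel` and `sample_projection_dpp` from the sketches above. The Gaussian smoothing kernel, the uniform base density $q$, and the `grad_i` helper are illustrative assumptions, and the quadratic-cost kernel matrix construction is written for readability rather than efficiency.

```python
import numpy as np

def dpp_minibatch_gradient(Z, grad_i, theta, p, h, rng):
    """Sketch of Xi_{A,DPP}: reweight the OPE kernel by sqrt(q/gamma_tilde),
    restrict it to the data, saturate its top-p eigenvalues, sample, reweight."""
    N, d = Z.shape
    # Kernel density estimate gamma_tilde at the data points (Gaussian kernel).
    sq_dists = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    gamma_tilde = np.exp(-sq_dists / (2 * h**2)).sum(1) / (N * (2 * np.pi * h**2) ** (d / 2))
    q = np.full(N, 0.5 ** d)                       # uniform base density on [-1,1]^d
    w = np.sqrt(q / gamma_tilde)
    # Reweighted OPE kernel restricted to the data, as in (11).
    M = np.array([[w[i] * ope_kernel(Z[i], Z[j], p) * w[j] for j in range(N)]
                  for i in range(N)])
    # Saturate: keep the eigenvectors, set the p largest eigenvalues of M/N to 1,
    # so that the counting-measure marginal kernel is an exact rank-p projection.
    eigval, eigvec = np.linalg.eigh(M / N)         # ascending eigenvalues
    U = eigvec[:, -p:]                             # top-p eigenvectors
    A = sample_projection_dpp(U, rng)              # minibatch of size p
    K_diag = N * (U ** 2).sum(axis=1)              # diagonal of K~ w.r.t. gamma_hat_N
    return sum(grad_i(i, theta) / K_diag[i] for i in A)
```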

On the computational cost of $\Xi_{A,\mathrm{DPP}}$.

The bottleneck is computing the $p$ largest eigenvalues of the matrix (11), along with the corresponding eigenvectors. This can be done once before running SGD, as a preprocessing step. Note that storing the kernel in diagonalized form only requires $\mathcal{O}(Np)$ storage. Each iteration of SGD then only requires sampling a rank-$p$ projection DPP with diagonalized kernel, which takes $\mathcal{O}(Np^{2})$ elementary operations [22]. In practice, as the complexity of the model underlying $(\bm{z},\theta)\mapsto\nabla\mathcal{L}(\bm{z},\theta)$ increases, the cost of computing $p$ individual gradients will outweigh this $\mathcal{O}(Np^{2})$ overhead. For instance, learning the parameters of a structured model like a conditional random field leads to arbitrarily costly individual gradients as the underlying graph gets denser [25]. Alternatively, (12) can be sampled directly, without sampling the underlying DPP. Indeed, the Laplace transform of (12) is a Fredholm determinant, and it is shown in [26] that Nyström-type approximations of that determinant, followed by Laplace inversion, yield an accurate inverse-CDF sampler.


Finally, we stress the unusual way in which our finite DPP kernel $\widetilde{\mathbf{K}}$ is constructed: through a reweighted continuous OPE kernel, restricted to the actual dataset. This construction is interesting per se, as it is key to leveraging analytic techniques from the continuous case in Section 4.

3.4 Our second estimator: sample the OPE, but smooth the gradient

In Section 3.3, we smoothed the empirical distribution of the data and restricted a continuous kernel to the dataset $\mathfrak{D}$, to make sure that the drawn minibatch would be a subset of $\mathfrak{D}$. But one could actually define another gradient estimator directly from an OPE sample $A=\{\bm{w}_{1},\dots,\bm{w}_{p}\}\sim\mathrm{DPP}(K_{q}^{(p)},q)$. Note that in that case, the “generalized minibatch” $A\subset[-1,1]^{d}$ is not necessarily a subset of the dataset $\mathfrak{D}$. Defining a kernel density estimator of the gradient,

\widehat{\nabla_{\theta}\mathcal{L}}(\bm{z},\theta):=\frac{1}{Nh^{d}}\sum_{i=1}^{N}\nabla_{\theta}\mathcal{L}(\bm{z}_{i},\theta)\,k\left(\frac{\bm{z}-\bm{z}_{i}}{h}\right),

we consider the estimator

\Xi_{A,\mathrm{s}}=\sum_{\bm{w}_{j}\in A}\frac{\widehat{\nabla_{\theta}\mathcal{L}}(\bm{w}_{j},\theta)}{q(\bm{w}_{j})K_{q}^{(p)}(\bm{w}_{j},\bm{w}_{j})}.

On the computational cost of $\Xi_{A,\mathrm{s}}$.

Since each evaluation of this estimator is at least as costly as evaluating the actual gradient $\Xi_{N}(\theta)$, its use is mostly theoretical: the analysis of the fluctuations of $\Xi_{A,\mathrm{s}}$ is easier than that of $\Xi_{A,\mathrm{DPP}}$, while requiring the same key steps. Moreover, the computation of all pairwise distances in $\Xi_{A,\mathrm{s}}$ could be efficiently approximated, possibly using random projection arguments [27], so that the limited scope of $\Xi_{A,\mathrm{s}}$ might be overcome in future work. Note also that, as for $\Xi_{A,\mathrm{DPP}}$, inverse Laplace sampling [26] applies to $\Xi_{A,\mathrm{s}}$.
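For completeness, a sketch of how $\Xi_{A,\mathrm{s}}$ could be evaluated is given below, assuming that the OPE sample `W` has been drawn beforehand (e.g., with the DPPy package) and reusing `ope_kernel` from the sketch of Section 3.2; the Gaussian smoothing kernel and the uniform $q$ are again illustrative assumptions.

```python
import numpy as np

def smoothed_dpp_gradient(W, Z, grad_all, theta, p, h):
    """Xi_{A,s} = sum_j grad_hat(w_j, theta) / (q(w_j) K_q^{(p)}(w_j, w_j)),
    where W is a (p, d) array of OPE points and grad_all(theta) is the
    (N, d_theta) array of per-item gradients (an assumed helper)."""
    N, d = Z.shape
    G = grad_all(theta)
    est = np.zeros(G.shape[1])
    for w in W:
        # Kernel-smoothed gradient at w, as defined in Section 3.4.
        weights = np.exp(-((w - Z) ** 2).sum(1) / (2 * h**2)) / (N * (2 * np.pi * h**2) ** (d / 2))
        grad_hat = weights @ G
        est += grad_hat / (0.5 ** d * ope_kernel(w, w, p))   # q uniform on [-1,1]^d
    return est
```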

4 Analysis of determinantal sampling of SGD gradients

We first analyze the bias, and then the fluctuations, of the gradient estimators introduced in Section 3. By Proposition 1 with $\Phi=\nabla_{\theta}\mathcal{L}(\cdot,\theta)$, we obtain

\mathbb{E}[\Xi_{A,\mathrm{DPP}}|\mathfrak{D}]=\int_{D}\nabla_{\theta}\mathcal{L}(\bm{z},\theta)\,\cancel{K_{q,\tilde{\gamma}}^{(p)}(\bm{z},\bm{z})^{-1}}\cdot\cancel{K_{q,\tilde{\gamma}}^{(p)}(\bm{z},\bm{z})}\,\mathrm{d}\hat{\gamma}_{N}(\bm{z})=\frac{1}{N}\sum_{i=1}^{N}\nabla_{\theta}\mathcal{L}(\bm{z}_{i},\theta),

so that we immediately get the following result.

Proposition 2.

$\mathbb{E}[\Xi_{A,\mathrm{DPP}}|\mathfrak{D}]=\Xi_{N}$.

Thus, $\Xi_{A,\mathrm{DPP}}$ is unbiased, like the classical Poissonian benchmark in Section 2. However, the smoothed estimator $\Xi_{A,\mathrm{s}}$ from Section 3.4 is slightly biased. Note that while the results of [3], like (4), do not apply to biased estimators, SGD can be analyzed in the small-bias setting [28].

Proposition 3.

Assume that $k$ in (10) has compact support and $q$ is bounded on $D$. Then $\mathbb{E}[\Xi_{A,\mathrm{s}}|\mathfrak{D}]=\Xi_{N}+\mathcal{O}_{P}(ph/N)$.

Proof.

Using the first part of Proposition 1 again, we obtain

\mathbb{E}[\Xi_{A,\mathrm{s}}|\mathfrak{D}]=\mathbb{E}_{A\sim\mathrm{DPP}(K_{q}^{(p)},q)}\left[\sum_{\bm{w}_{j}\in A}\frac{\widehat{\nabla_{\theta}\mathcal{L}}(\bm{w}_{j},\theta)}{q(\bm{w}_{j})K_{q}^{(p)}(\bm{w}_{j},\bm{w}_{j})}\right]=\int_{D}\frac{\widehat{\nabla_{\theta}\mathcal{L}}(\bm{w},\theta)}{\cancel{q(\bm{w})K_{q}^{(p)}(\bm{w},\bm{w})}}\cdot\cancel{K_{q}^{(p)}(\bm{w},\bm{w})q(\bm{w})}\,\mathrm{d}\bm{w}=\int_{D}\left(\frac{1}{Nh^{d}}\sum_{i=1}^{N}\nabla_{\theta}\mathcal{L}(\bm{z}_{i},\theta)\,k\left(\frac{\bm{w}-\bm{z}_{i}}{h}\right)\right)\mathrm{d}\bm{w}=\frac{1}{N}\sum_{i=1}^{N}\left(\int_{D}\frac{1}{h^{d}}\,k\left(\frac{\bm{w}-\bm{z}_{i}}{h}\right)\mathrm{d}\bm{w}\right)\nabla_{\theta}\mathcal{L}(\bm{z}_{i},\theta)=\frac{1}{N}\sum_{i=1}^{N}\nabla_{\theta}\mathcal{L}(\bm{z}_{i},\theta)+\mathcal{O}_{P}(ph/N),   (13)

where, in the last step, we have used the fact that $h^{-d}\,k\left(h^{-1}(\bm{w}-\bm{z}_{i})\right)$ is a probability density. The $\mathcal{O}_{P}(ph/N)$ error term arises because we integrate that pdf over $D$ rather than over $\mathrm{Supp}(k)$. Since $k$ is a compactly supported kernel, for points $\bm{z}_{i}$ that are within a distance $\mathcal{O}(h)$ from the boundary of $D$, we incur an error of $\mathcal{O}_{P}(1)$. By Proposition 1, the expected number of such points $\bm{z}_{i}$ is $\int_{D_{h}}K_{q}^{(p)}(\bm{w},\bm{w})q(\bm{w})\mathrm{d}\bm{w}$, where $D_{h}$ is the $h$-neighbourhood of the boundary of $D$. By a classical asymptotic result of Totik, see [10, Theorem 4.8], $\bm{w}\mapsto K_{q}^{(p)}(\bm{w},\bm{w})$ is $\mathcal{O}(p)$ on $D_{h}$, whereas $q$ is a bounded density, implying that $\int_{D_{h}}q(\bm{w})\mathrm{d}\bm{w}=\mathcal{O}(\mathrm{Vol}(D_{h}))=\mathcal{O}(h)$. Putting all of this together, we obtain the $\mathcal{O}_{P}(ph/N)$ error term in (13). ∎

The rest of this section is devoted to the more involved task of analyzing the fluctuations of $\Xi_{A,\mathrm{DPP}}$ and $\Xi_{A,\mathrm{s}}$. For the purposes of our analysis, we discuss certain desirable regularity properties of our kernels and loss functions in Section 4.1. Then, we tackle the main ideas behind the study of the fluctuations of our estimators in Section 4.2. Details can be found in Appendix S3.

4.1 Some regularity phenomena

Assume that $(\frac{1}{p}K_{q}^{(p)}(\bm{z},\bm{z}))^{-1}\nabla_{\theta}\mathcal{L}(\bm{z},\theta)$ is bounded on the domain $D$ uniformly in $N$ (note that $p$ possibly depends on $N$). Furthermore, assume that $\bm{z}\mapsto(p^{-1}K_{q}^{(p)}(\bm{z},\bm{z})q(\bm{z}))^{-1}\nabla_{\theta}\mathcal{L}(\bm{z},\theta)\,\tilde{\gamma}(\bm{z})$ is Lipschitz, with a Lipschitz constant that is $\mathcal{O}_{P}(1)$ and bounded in $\theta$.

Such properties are natural in the context of various known asymptotic phenomena, in particular asymptotic results on OPE kernels and the convergence of the KDE $\tilde{\gamma}$ to the distribution $\gamma$. The detailed discussion is deferred to Appendix S2; we record here in passing that the above asymptotic phenomena imply that

\frac{\nabla_{\theta}\mathcal{L}(\bm{z},\theta)}{\frac{1}{p}K_{q}^{(p)}(\bm{z},\bm{z})q(\bm{z})}\,\tilde{\gamma}(\bm{z})\longrightarrow\nabla_{\theta}\mathcal{L}(\bm{z},\theta)\,\gamma(\bm{z})\prod_{j=1}^{d}\sqrt{1-\bm{z}[j]^{2}}.   (14)

Our desired regularity properties may therefore be understood in terms of closeness to this limit, which itself has similar regularity. At the price of these assumptions, we can use analytic tools to derive fluctuations for $\Xi_{A,\mathrm{DPP}}$ by working on the limit in (14). For the fluctuation analysis of the smoothed estimator $\Xi_{A,\mathrm{s}}$, we similarly assume that $(q(\bm{w})\frac{1}{p}K_{q}^{(p)}(\bm{w},\bm{w}))^{-1}\widehat{\nabla_{\theta}\mathcal{L}}(\bm{w},\theta)$ is $\mathcal{O}_{P}(1)$ and Lipschitz with an $\mathcal{O}_{P}(1)$ Lipschitz constant that is bounded in $\theta$. These assumptions are once again motivated by the convergence of $(q(\bm{w})\frac{1}{p}K_{q}^{(p)}(\bm{w},\bm{w}))^{-1}\widehat{\nabla_{\theta}\mathcal{L}}(\bm{w},\theta)$ to $\nabla_{\theta}\mathcal{L}(\bm{w},\theta)\prod_{j=1}^{d}\sqrt{1-\bm{w}[j]^{2}}$.

4.2 Reduced fluctuations for determinantal samplers

For succinctness of presentation, we focus on the theoretical analysis of the smoothed estimator $\Xi_{A,\mathrm{s}}$, which already captures the main ideas of our approach, and leave the analysis of $\Xi_{A,\mathrm{DPP}}$ to Appendix S3.

Proposition 4.

Under the assumptions in Section 4.1, $\mathrm{Var}[\Xi_{A,\mathrm{s}}|\mathfrak{D}]=\mathcal{O}_{P}(p^{-(1+1/d)})$.

Proof Sketch.

Invoking (7) in the case of a projection kernel,

\mathrm{Var}[\Xi_{A,\mathrm{s}}|\mathfrak{D}]=\iint\left\|\frac{\widehat{\nabla_{\theta}\mathcal{L}}(\bm{z},\theta)}{q(\bm{z})K_{q}^{(p)}(\bm{z},\bm{z})}-\frac{\widehat{\nabla_{\theta}\mathcal{L}}(\bm{w},\theta)}{q(\bm{w})K_{q}^{(p)}(\bm{w},\bm{w})}\right\|_{2}^{2}|K_{q}^{(p)}(\bm{z},\bm{w})|^{2}\,\mathrm{d}q(\bm{z})\mathrm{d}q(\bm{w})=\frac{1}{p^{2}}\iint\left\|\frac{\widehat{\nabla_{\theta}\mathcal{L}}(\bm{z},\theta)}{q(\bm{z})\frac{1}{p}K_{q}^{(p)}(\bm{z},\bm{z})}-\frac{\widehat{\nabla_{\theta}\mathcal{L}}(\bm{w},\theta)}{q(\bm{w})\frac{1}{p}K_{q}^{(p)}(\bm{w},\bm{w})}\right\|_{2}^{2}|K_{q}^{(p)}(\bm{z},\bm{w})|^{2}\,\mathrm{d}q(\bm{z})\mathrm{d}q(\bm{w})\lesssim\mathcal{M}_{\theta}\cdot\frac{1}{p^{2}}\iint\left\|\bm{z}-\bm{w}\right\|_{2}^{2}|K_{q}^{(p)}(\bm{z},\bm{w})|^{2}\,\mathrm{d}q(\bm{z})\mathrm{d}q(\bm{w}),   (15)

where we used the Lipschitz property of $\bm{z}\mapsto\widehat{\nabla_{\theta}\mathcal{L}}(\bm{z},\theta)/(q(\bm{z})\frac{1}{p}K_{q}^{(p)}(\bm{z},\bm{z}))$, with $\mathcal{M}_{\theta}=\mathcal{O}_{P}(1)$ the Lipschitz constant.

We control the integral in (15) by invoking the renowned Christoffel-Darboux formula for the OPE kernel $K_{q}^{(p)}$ [29]. For clarity of presentation, we outline here the main ideas for $d=1$; the details for general dimension are available in Appendix S3. Broadly speaking, since $K_{q}^{(p)}$ is an orthogonal projection of rank $p$ in $L^{2}(q)$, we may observe that $\iint|K_{q}^{(p)}(\bm{z},\bm{w})|^{2}\mathrm{d}q(\bm{z})\mathrm{d}q(\bm{w})=p$; without the $\left\|\bm{z}-\bm{w}\right\|_{2}^{2}$ term in (15), we would thus have an $\mathcal{O}_{P}(p^{-1})$ behaviour in total, similar to the Poissonian estimator. However, the leading-order contribution to $\iint\left\|\bm{z}-\bm{w}\right\|_{2}^{2}|K_{q}^{(p)}(\bm{z},\bm{w})|^{2}\mathrm{d}q(\bm{z})\mathrm{d}q(\bm{w})$ comes from near the diagonal $\bm{z}=\bm{w}$, and it is neutralised in an extremely precise manner by the $\left\|\bm{z}-\bm{w}\right\|_{2}^{2}$ factor, which vanishes on the diagonal.

This neutralisation is systematically captured by the Christoffel-Darboux formula [29], which implies

K_{q}^{(p)}(x,y)=(x-y)^{-1}A_{p}\left(\phi_{p}(x)\phi_{p-1}(y)-\phi_{p}(y)\phi_{p-1}(x)\right),

where $A_{p}=\mathcal{O}(1)$ and $\phi_{p},\phi_{p-1}$ are two orthonormal functions in $L^{2}(q)$. Substituting this into (15), a simple computation shows that $\iint\left\|\bm{z}-\bm{w}\right\|_{2}^{2}|K_{q}^{(p)}(\bm{z},\bm{w})|^{2}\mathrm{d}q(\bm{z})\mathrm{d}q(\bm{w})=\mathcal{O}(1)$. This leads to the variance bound $\mathrm{Var}[\Xi_{A,\mathrm{s}}|\mathfrak{D}]=\mathcal{O}_{P}(p^{-2})$, which is the desired rate for $d=1$. For general $d$, we use the fact that $q=\bigotimes_{i=1}^{d}q_{i}$ is a product measure, and apply the Christoffel-Darboux formula for each $q_{i}$, leading to a variance bound of $\mathrm{Var}[\Xi_{A,\mathrm{s}}|\mathfrak{D}]=\mathcal{O}_{P}(p^{-(1+1/d)})$, as desired. ∎
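The cancellation described in this proof is easy to check numerically. The snippet below, written under the assumption that $q$ is uniform on $[-1,1]$ (so the orthonormal polynomials are rescaled Legendre polynomials), evaluates both double integrals by Gauss-Legendre quadrature, which is exact here since the integrands are polynomial: the first grows like $p$, while the second stays bounded.

```python
import numpy as np
from numpy.polynomial import legendre

for p in [5, 10, 20, 40]:
    nodes, weights = legendre.leggauss(2 * p + 2)   # exact up to degree 4p+3
    # Orthonormal polynomials w.r.t. the uniform density dx/2 on [-1,1].
    Phi = np.column_stack([np.sqrt(2 * k + 1) * legendre.legval(nodes, np.eye(k + 1)[k])
                           for k in range(p)])
    K = Phi @ Phi.T                                 # K_q^{(p)} evaluated on the grid
    w2 = np.outer(weights, weights) / 4             # q(x) q(y) dx dy
    diff2 = (nodes[:, None] - nodes[None, :]) ** 2
    # First value is ~p, second stays O(1) thanks to the Christoffel-Darboux cancellation.
    print(p, (w2 * K**2).sum(), (w2 * diff2 * K**2).sum())
```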

The theoretical analysis of $\mathrm{Var}[\Xi_{A,\mathrm{DPP}}|\mathfrak{D}]$ follows the broad contours of the argument for $\mathrm{Var}[\Xi_{A,\mathrm{s}}|\mathfrak{D}]$ above, with additional difficulties introduced by the spectral truncation from $K_{q,\tilde{\gamma}}^{(p)}$ to $\widetilde{\mathbf{K}}$; see Section 3. This is addressed by an elaborate spectral approximation analysis in Appendix S3. In combination with the ideas expounded in the proof of Proposition 4, our analysis in Appendix S3 indicates a fluctuation bound of $\mathrm{Var}[\Xi_{A,\mathrm{DPP}}|\mathfrak{D}]=\mathcal{O}_{P}(p^{-(1+1/d)})$.

5 Experiments

In this section, we compare the empirical performance and the variance decay of our gradient estimator $\Xi_{A,\mathrm{DPP}}$ to the default $\Xi_{A,\mathrm{Poi}}$. We do not include $\Xi_{A,\mathrm{s}}$, whose interest is mostly theoretical; see Section 4. Moreover, while we focus on simulated data to illustrate our theoretical analysis, we provide experimental results on a real dataset in Appendix S4. Throughout this section, the pdf $q$ introduced in Section 3.2 is taken to be $q(\bm{w})\propto\prod_{j=1}^{d}(1+\bm{w}[j])^{\alpha_{j}}(1-\bm{w}[j])^{\beta_{j}}$, with $\alpha_{j},\beta_{j}\in[-1/2,1/2]$ tuned to match the first two moments of the $j$th marginal of $\hat{\gamma}_{N}$. All DPPs are sampled using the Python package DPPy [GBPV19], which is under MIT licence.

Experimental setup.

We consider the ERM setting (1) for linear regression, $\mathcal{L}_{\text{lin}}((x,y),\theta)=0.5(\langle x,\theta\rangle-y)^{2}$, and logistic regression, $\mathcal{L}_{\text{log}}((x,y),\theta)=\log[1+\exp(-y\langle x,\theta\rangle)]$, both with an additional $\ell_{2}$ penalty $\lambda(\theta)=(\lambda_{0}/2)\|\theta\|_{2}^{2}$. Here the features are $x\in[-1,1]^{d-1}$, and the labels are, respectively, $y\in[-1,1]$ and $y\in\{-1,1\}$. Note that our proofs currently assume that the law $\gamma$ of $z=(x,y)$ is continuous w.r.t. Lebesgue, which in all rigour excludes logistic regression. However, we demonstrate below that if we draw a minibatch using our DPP on the features only, and then deterministically associate each label to the drawn features, we observe the same gains for logistic regression as for linear regression, where the DPP kernel takes into account both features and labels.

For each experiment, we generate $N=1000$ data points $x_{i}$ from either the uniform distribution or a mixture of $2$ well-separated Gaussian distributions on $[-1,1]^{d-1}$. The variable $y_{i}\in\mathbb{R}$ is generated as $\langle x_{i},\theta_{\text{true}}\rangle+\varepsilon_{i}$, where $\theta_{\text{true}}\in\mathbb{R}^{d-1}$ is given, and $\varepsilon\in\mathbb{R}^{N}$ is a white Gaussian noise vector. In the linear regression case, we scale $y_{i}$ by $\|y\|_{\infty}$ to make sure that $y_{i}\in[-1,1]$. In the logistic regression case, we replace each $y_{i}$ by its sign. The regularization parameter $\lambda_{0}$ is manually set to $0.1$ and the stepsize for the $t$-th iteration is set to $1/t^{0.9}$, so that (4) applies. In each experiment, performance metrics are averaged over $1000$ independent runs of each SGD variant.
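For reference, a sketch of this data-generating process (uniform features, linear regression case) could look as follows; the noise scale $0.1$ is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 1000, 3
X = rng.uniform(-1, 1, size=(N, d - 1))          # features in [-1,1]^{d-1}
theta_true = rng.normal(size=d - 1)
y = X @ theta_true + 0.1 * rng.normal(size=N)    # linear model plus white noise
y = y / np.abs(y).max()                          # rescale so that y_i lies in [-1,1]
Z = np.column_stack([X, y])                      # data points z_i = (x_i, y_i)
# For logistic regression, replace y by its sign: y = np.sign(y)
```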

Performance evaluation of sampling strategies in SGD.

Figure 1 summarizes the experimental results for $\Xi_{A,\mathrm{Poi}}$ and $\Xi_{A,\mathrm{DPP}}$, with $p=5$ and $p=10$. The top row shows how the norm of the complete gradient $\|\Xi_{N}(\theta_{t})\|$ decreases with the number of individual gradient evaluations $t\times p$, called here the budget. The bottom row shows the decrease of $\|\theta_{t}-\theta_{\star}\|$. Note that using $t\times p$ on all $x$-axes allows comparing different batchsizes. Error bars on the bottom row are $\pm$ one standard deviation of the mean. In all experiments, using a DPP consistently improves on Poisson minibatches of the same size, and the DPP with batchsize $5$ sometimes even outperforms Poisson sampling with batchsize $10$, showing that smaller but more diverse batches can be a better way of spending a fixed number of gradient evaluations. This is particularly true for mixture data (middle column), where forcing diversity with our DPP brings the biggest improvement. Maybe less intuitively, the gains for logistic regression in $d=11$ (last column) are also significant, even though the case of discrete labels is not yet covered by our theoretical analysis, and the moderately large dimension makes the improvement in the decay rate of the variance minor. This indicates that there is variance reduction beyond the change of rate.

Figure 1: Summary of the performance of two sampling strategies in SGD. Panels: (a) $d=3$, $\mathcal{L}=\mathcal{L}_{\text{lin}}$, uniform; (b) $d=3$, $\mathcal{L}=\mathcal{L}_{\text{lin}}$, mixture; (c) $d=11$, $\mathcal{L}=\mathcal{L}_{\text{log}}$, uniform; (d) $d=3$, $\mathcal{L}=\mathcal{L}_{\text{lin}}$, uniform; (e) $d=3$, $\mathcal{L}=\mathcal{L}_{\text{lin}}$, mixture; (f) $d=11$, $\mathcal{L}=\mathcal{L}_{\text{log}}$, uniform.
Variance decay.

For a given dimension $d$, we want to estimate the rate of decay of the variance $\sigma^{2}=\mathbb{E}[\|\Xi_{A}(\theta_{\star})\|^{2}|\mathfrak{D}]$, to confirm the $\mathcal{O}_{P}(p^{-(1+1/d)})$ rate discussed in Section 4. We take $\mathcal{L}=\mathcal{L}_{\text{lin}}$ as an example, with $N=1000$ i.i.d. samples $\mathfrak{D}$ from the uniform distribution on $[-1,1]^{d}$. For $d=1,2,3$, we show in Figure 2 the sample variance, over $1000$ realizations, of $\|\Xi_{A,\mathrm{Poi}}(\theta_{\star})\|^{2}$ (white dots) and $\|\Xi_{A,\mathrm{DPP}}(\theta_{\star})\|^{2}$ (black dots), conditionally on $\mathfrak{D}$. Blue and red dots indicate standard $95\%$ marginal confidence intervals, for indication only. The slope found by maximum likelihood in a linear regression in log-log scale is indicated in the legend. The experiment confirms that $\sigma^{2}$ is smaller for the DPP, decays as $p^{-1-1/d}$, and that this decay starts at a batchsize $p$ that increases with the dimension.

Figure 2: Summary of the variance decay results. Panels: (a) $d=1$; (b) $d=2$; (c) $d=3$.
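The slope estimation behind the legends of Figure 2 can be sketched as follows; `variance_of_estimator(p)`, returning the Monte Carlo estimate of $\sigma^{2}$ at batchsize $p$ described above, is an assumed helper.

```python
import numpy as np

def decay_exponent(batchsizes, variance_of_estimator):
    """Least-squares fit of log(variance) against log(batchsize)."""
    logs_p = np.log(np.asarray(batchsizes, dtype=float))
    logs_v = np.log([variance_of_estimator(p) for p in batchsizes])
    slope, _ = np.polyfit(logs_p, logs_v, deg=1)   # variance ~ p^{slope}
    return slope                                    # expect about -(1 + 1/d) for the DPP
```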

6 Discussion

In this work, we introduced an orthogonal polynomial-based DPP paradigm for sampling minibatches in SGD that entails variance reduction in the resulting gradient estimator. We substantiated our proposal by a detailed theoretical analysis and numerical experiments. Our work raises natural questions and leaves avenues for improvement in several directions. These include the smoothed estimator $\Xi_{A,\mathrm{s}}$, which calls for further investigation in order to be deployed as a computationally attractive procedure; improving the dimension dependence of the fluctuation exponent when the gradients are smooth enough, as [31, 32] did for [10]; and sharpening the regularity hypotheses of our theoretical investigation to obtain a more streamlined analysis. While our estimators were motivated by a continuous underlying data distribution, our experiments suggest notably good performance in situations like logistic regression, where the data is at least partially discrete. Extensions that account for discrete settings in a principled manner, via discrete OPEs or otherwise, are a natural topic for future research. Another natural problem is to compare our approach with other, non-i.i.d., approaches to minibatch sampling. A case in point is importance sampling, where the independence across data points suggests that the variance should still be $\mathcal{O}_{P}(1/p)$, as in uniform sampling. More generally, incorporating ingredients from other sampling paradigms to further enhance the variance-reducing capacity of our approach would be of considerable interest. Finally, while our results already partly apply to more sophisticated gradient estimators like Polyak-Ruppert averaged gradients [3, Section 4.2], it would be interesting to introduce repulsiveness across consecutive SGD iterations to further reduce the variance of averaged estimators. In summary, we believe that the ideas put forward in the present work will motivate a new perspective on improved minibatch sampling for SGD, and more generally on estimators based on linear statistics (e.g., in coreset sampling), and beyond.

Acknowledgements

RB acknowledges support from ERC grant Blackjack (ERC-2019-STG-851866) and ANR AI chair Baccarat (ANR-20-CHIA-0002). S.G. was supported in part by the MOE grants R-146-000-250-133 and R-146-000-312-114.

Appendices

For ease of reference, sections, propositions and equations that belong to this supplementary material are prefixed with an ‘S’.

Appendix S1 The Poissonian Benchmark

Our benchmark for comparing the efficacy of our estimators is a subset $A\subset[N]$ obtained via Poissonian random sampling, which is characterised by independent random choices of the elements of $A$ from the ground set $[N]$, with each element of $[N]$ being selected independently with probability $p/N$. The estimator $\Xi_{A,\mathrm{Poi}}$ for Poissonian sampling is simply an analogue of the empirical average; in the context of (2), this is the choice $w_{i}=1/p$ for all $i$. While the true empirical average would be realised with $w_{i}=1/|A|$, we exploit here the fact that, for large $p\ll N$, the empirical cardinality $|A|$ is tightly concentrated around its expectation $p$.

Denoting by $\chi_{A}(\cdot)$ the indicator function of inclusion in the set $A$, the variables $\chi_{A}(\bm{z}_{i})$ are, under Poissonian sampling, i.i.d. Bernoulli random variables with parameter $p/N$. It may then be easily verified that $\mathbb{E}[\Xi_{A,\mathrm{Poi}}|\mathfrak{D}]=\Xi_{N}$, whereas

\mathrm{Var}[\Xi_{A,\mathrm{Poi}}|\mathfrak{D}]=\frac{1}{p^{2}}\cdot\frac{p}{N}\left(1-\frac{p}{N}\right)\left(\sum_{i=1}^{N}\|\nabla_{\theta}\mathcal{L}(\bm{z}_{i},\theta)\|^{2}\right)=\frac{1}{p}\left(\frac{1}{N}\sum_{i=1}^{N}\|\nabla_{\theta}\mathcal{L}(\bm{z}_{i},\theta)\|^{2}\right)\left(1+\mathcal{O}(1/N)\right).   (S1)

Now, by a uniform central limit theorem [33], we obtain

\frac{1}{N}\sum_{i=1}^{N}\|\nabla_{\theta}\mathcal{L}(\bm{z}_{i},\theta)\|^{2}=\int\|\nabla_{\theta}\mathcal{L}(\bm{z},\theta)\|^{2}\mathrm{d}\gamma(\bm{z})+\mathcal{O}_{P}(N^{-1/2}).   (S2)

Thus, the conditional variance satisfies

\mathrm{Var}[\Xi_{A,\mathrm{Poi}}|\mathfrak{D}]=\left(\frac{1}{p}\int\|\nabla_{\theta}\mathcal{L}(\bm{z},\theta)\|^{2}\mathrm{d}\gamma(\bm{z})\right)+\mathcal{O}_{P}(N^{-1/2}),   (S3)

where $\gamma\in\mathcal{P}(\mathbb{R}^{d})$, the space of probability measures on $\mathbb{R}^{d}$, is the (compactly supported) distribution of the data $\bm{z}$; see the paragraph on notation in Section 1. Equation (S3) provides us with the theoretical benchmark against which to compare any new approach to minibatch sampling for stochastic gradient descent.
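The exact variance formula (S1) is straightforward to check by simulation; the snippet below uses toy scalar per-item gradients as a stand-in for $\nabla_{\theta}\mathcal{L}(\bm{z}_{i},\theta)$.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, n_rep = 1000, 10, 50_000
g = rng.normal(size=N)                                   # stand-in per-item gradients
estimates = np.array([(g[rng.random(N) < p / N]).sum() / p for _ in range(n_rep)])
empirical = estimates.var()
theoretical = (p / N) * (1 - p / N) * (g**2).sum() / p**2   # first expression in (S1)
print(empirical, theoretical)                             # the two values should agree closely
```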

Appendix S2 Regularity Phenomena

For the purposes of our analysis, we envisage certain regularity behaviour for our kernels and loss functions, and discuss the natural basis for such assumptions. The discussion here complements the discussion on this topic undertaken in Section 4 of the main text.

S2.1 Regularity phenomena and uniform CLTs

In addition to the OPE asymptotics discussed in the main text, another relevant class of asymptotic results is furnished by Glivenko-Cantelli-type convergence phenomena for empirical measures and kernel density estimators. These results are naturally motivated by convergence issues arising from the law of large numbers. To elaborate, if $\{X_{n}\}_{n\geq 1}$ is a sequence of i.i.d. random variables with the same distribution as the random variable $X$, and $f$ is a function such that $\mathbb{E}[|f(X)|]<\infty$, then the law of large numbers tells us that the empirical average $N^{-1}\sum_{i=1}^{N}f(X_{i})$ converges almost surely to $\mathbb{E}[f(X)]$. As is also well known, the classical central limit theorem gives the distributional order of the difference $N^{-1}\sum_{i=1}^{N}f(X_{i})-\mathbb{E}[f(X)]$ as $N^{-1/2}$.

But the classical central limit theorem provides only distributional information, and is not sufficient for tracking the order of such approximations along a sequence of data $\{X_{i}\}_{i=1}^{N}$ as $N$ grows. Such results are rather obtained from uniform central limit theorems, which provide bounds of order $\mathcal{O}_{P}(N^{-1/2})$ on this error that hold uniformly over large classes of functions under very mild assumptions. We refer the interested reader to the extensive literature on uniform CLTs [33].

On a related note, we are also interested in the approximation of a measure $\nu$ with a density by a kernel smoothing $\tilde{\nu}$ obtained on the basis of a dataset $\{X_{i}\}_{i=1}^{N}$ [23]. Analogues of the uniform CLT for such kernel density approximations are available, also providing an $\mathcal{O}_{P}(N^{-1/2})$ approximation [34].

In our context, the uniform CLT is applicable to situations where the measure $\gamma$ is approximated by the empirical measure $\hat{\gamma}_{N}$, as well as by $\tilde{\gamma}$, which is a kernel smoothing of $\hat{\gamma}_{N}$.

Appendix S3 Fluctuation analysis for determinantal samplers

In this section, we provide the detailed proof of Proposition 4 in the main text on the fluctuations of the estimator $\Xi_{A,\mathrm{s}}$, and a theoretical analysis of the fluctuations of the estimator $\Xi_{A,\mathrm{DPP}}$. Sections S3.1.1 and S3.1.2 elaborate on one of the central OPE-based ideas that enable us to obtain reduced fluctuations.

S3.1 Detailed proof of Proposition  4

To begin with, we recall the fundamental integral expression controlling the variance of $\Xi_{A,\mathrm{s}}$, exploiting (7) in the case of a projection kernel:

\mathrm{Var}[\Xi_{A,\mathrm{s}}|\mathfrak{D}]=\iint\left\|\frac{\widehat{\nabla_{\theta}\mathcal{L}}(\bm{z},\theta)}{q(\bm{z})K_{q}^{(p)}(\bm{z},\bm{z})}-\frac{\widehat{\nabla_{\theta}\mathcal{L}}(\bm{w},\theta)}{q(\bm{w})K_{q}^{(p)}(\bm{w},\bm{w})}\right\|_{2}^{2}|K_{q}^{(p)}(\bm{z},\bm{w})|^{2}\,\mathrm{d}q(\bm{z})\mathrm{d}q(\bm{w})=\frac{1}{p^{2}}\iint\left\|\frac{\widehat{\nabla_{\theta}\mathcal{L}}(\bm{z},\theta)}{q(\bm{z})\frac{1}{p}K_{q}^{(p)}(\bm{z},\bm{z})}-\frac{\widehat{\nabla_{\theta}\mathcal{L}}(\bm{w},\theta)}{q(\bm{w})\frac{1}{p}K_{q}^{(p)}(\bm{w},\bm{w})}\right\|_{2}^{2}|K_{q}^{(p)}(\bm{z},\bm{w})|^{2}\,\mathrm{d}q(\bm{z})\mathrm{d}q(\bm{w})\lesssim\mathcal{M}_{\theta}\cdot\frac{1}{p^{2}}\iint\left\|\bm{z}-\bm{w}\right\|_{2}^{2}|K_{q}^{(p)}(\bm{z},\bm{w})|^{2}\,\mathrm{d}q(\bm{z})\mathrm{d}q(\bm{w}),   (S4)

where we used the Lipschitz property of $\bm{z}\mapsto\widehat{\nabla_{\theta}\mathcal{L}}(\bm{z},\theta)/(q(\bm{z})\frac{1}{p}K_{q}^{(p)}(\bm{z},\bm{z}))$, with $\mathcal{M}_{\theta}=\mathcal{O}_{P}(1)$ the Lipschitz constant.

We control the integral in (S4) by invoking the renowned Christoffel-Darboux formula for the OPE kernel $K_{q}^{(p)}$ [29]. As outlined in the main text, a fundamental idea that enables us to obtain reduced fluctuations for our determinantal samplers is that, in the context of (S4), the $\|\bm{z}-\bm{w}\|_{2}$ term crucially suppresses fluctuations near the diagonal $\bm{z}=\bm{w}$, whereas far from the diagonal, the fluctuations are suppressed by the decay of the OPE kernel $K_{q}^{(p)}$.

In Sections S3.1.1 and S3.1.2 below, we provide the details of how this idea is implemented by exploiting the Christoffel-Darboux formula, first detailing the simpler one-dimensional case and subsequently examining the case of general dimension $d$. In (S6), we finally establish the desired $\mathcal{O}_{P}(p^{-(1+1/d)})$ order of the fluctuations of $\Xi_{A,\mathrm{s}}$ given the dataset $\mathfrak{D}$, thereby completing the proof.

S3.1.1 Reduced fluctuations in one dimension

We first demonstrate how to control (S4) for $d=1$. The Christoffel-Darboux formula reads

K_{q}^{(p)}(x,y)=a_{p}\cdot\frac{\phi_{p}(x)\phi_{p-1}(y)-\phi_{p}(y)\phi_{p-1}(x)}{x-y},   (S5)

where $a_{p}$ is the so-called first recurrence coefficient of $q$; see [29, Sections 1 to 3]. But we assumed in Section 3.3 that $q$ is Nevai-class, which actually implies $a_{p}\rightarrow 1/2$ by definition; see [10, Section 4]. This implies that, in $d=1$, we have

\mathrm{Var}[\Xi_{A,\mathrm{s}}|\mathfrak{D}]\lesssim\mathcal{M}_{\theta}\cdot\frac{1}{p^{2}}\iint(x-y)^{2}\,\frac{\left(\phi_{p}(x)\phi_{p-1}(y)-\phi_{p}(y)\phi_{p-1}(x)\right)^{2}}{(x-y)^{2}}\,\mathrm{d}q(x)\mathrm{d}q(y)\leq\mathcal{M}_{\theta}\cdot\frac{1}{p^{2}}\iint\left(\phi_{p}(x)\phi_{p-1}(y)-\phi_{p}(y)\phi_{p-1}(x)\right)^{2}\mathrm{d}q(x)\mathrm{d}q(y)=\mathcal{M}_{\theta}\cdot\frac{1}{p^{2}}\left(2\|\phi_{p}\|_{L_{2}(q)}^{2}\|\phi_{p-1}\|_{L_{2}(q)}^{2}-2\langle\phi_{p},\phi_{p-1}\rangle^{2}\right)=2\mathcal{M}_{\theta}\cdot\frac{1}{p^{2}}.

S3.1.2 Reduced fluctuations in general dimension

For $d>1$, following the main text, we consider that the measure $q$ splits as a product of $d$ Nevai-class measures $q_{i}$, i.e. $q=\otimes_{i=1}^{d}q_{i}$. Assume that $p=m^{d}$ for some $m\in\mathbb{N}$ for simplicity; we discuss at the end of the section how to treat the case $p^{1/d}\notin\mathbb{N}$.

Because of the graded lexicographic ordering that we chose for multivariate monomials, $K_{q}^{(p)}$ writes as a tensor product of $d$ coordinate-wise OPE kernels of degree $m$, namely $K_{q}^{(p)}(\cdot,\cdot)=\prod_{j=1}^{d}K_{q_{j}}^{(m)}(\cdot,\cdot)$. Now, observe that for any $j=1,\dots,d$,

\iint|K_{q_{j}}^{(m)}(x,y)|^{2}\,\mathrm{d}q_{j}(y)\mathrm{d}q_{j}(x)=\int K_{q_{j}}^{(m)}(x,x)\,\mathrm{d}q_{j}(x)=m.

As a result, for $d>1$, setting $\bm{z}=(x_{1},x_{2},\ldots,x_{d})$ and $\bm{w}=(y_{1},y_{2},\ldots,y_{d})$, we obtain

\mathrm{Var}[\Xi_{A,\mathrm{s}}|\mathfrak{D}]\lesssim\mathcal{M}_{\theta}\cdot\frac{1}{p^{2}}\iint\left(\sum_{i=1}^{d}(x_{i}-y_{i})^{2}\right)|K_{q}^{(p)}(\bm{z},\bm{w})|^{2}\,\mathrm{d}q(\bm{z})\mathrm{d}q(\bm{w})=\mathcal{M}_{\theta}\cdot\frac{1}{p^{2}}\sum_{i=1}^{d}\left(\iint(x_{i}-y_{i})^{2}\,|K_{q}^{(p)}(\bm{z},\bm{w})|^{2}\,\mathrm{d}q(\bm{z})\mathrm{d}q(\bm{w})\right)=\mathcal{M}_{\theta}\cdot\frac{1}{p^{2}}\sum_{i=1}^{d}\left(\iint(x_{i}-y_{i})^{2}\prod_{j=1}^{d}|K_{q_{j}}^{(m)}(x_{j},y_{j})|^{2}\prod_{j=1}^{d}\mathrm{d}q_{j}(x_{j})\mathrm{d}q_{j}(y_{j})\right),

leading to

\mathrm{Var}[\Xi_{A,\mathrm{s}}|\mathfrak{D}]=\mathcal{M}_{\theta}\cdot\frac{1}{p^{2}}\sum_{i=1}^{d}\left[\prod_{j\neq i}\left(\iint|K_{q_{j}}^{(m)}(x_{j},y_{j})|^{2}\mathrm{d}q_{j}(x_{j})\mathrm{d}q_{j}(y_{j})\right)\cdot\left(\iint(x_{i}-y_{i})^{2}|K_{q_{i}}^{(m)}(x_{i},y_{i})|^{2}\mathrm{d}q_{i}(x_{i})\mathrm{d}q_{i}(y_{i})\right)\right]=\mathcal{M}_{\theta}\cdot\frac{1}{p^{2}}\sum_{i=1}^{d}\left[(p^{1/d})^{d-1}\left(\iint(x_{i}-y_{i})^{2}|K_{q_{i}}^{(m)}(x_{i},y_{i})|^{2}\mathrm{d}q_{i}(x_{i})\mathrm{d}q_{i}(y_{i})\right)\right]\lesssim\mathcal{M}_{\theta}\cdot\frac{1}{p^{2}}\sum_{i=1}^{d}\left[(p^{1/d})^{d-1}\cdot 2\right]\lesssim\mathcal{M}_{\theta}\cdot\frac{d}{p^{1+1/d}},   (S6)

where we have used our earlier analyses of Section S3.1.1.

When $p^{1/d}\notin\mathbb{N}$, the kernel $K_{q}^{(p)}$ does not necessarily decompose as a product of $d$ one-dimensional OPE kernels. However, the graded lexicographic order that we borrowed from [10] ensures that $K_{q}^{(p)}$ departs from such a product only up to $\mathcal{O}(\lfloor p^{1/d}\rfloor)$ additional terms, whose influence on the variance can be controlled by a counting argument; see e.g. [10, Lemma 5.3].

S3.2 Fluctuation analysis for ΞA,DPP\Xi_{A,\mathrm{DPP}}

Again using the formula (7) for the variance of linear statistics of DPPs, and remembering that 𝐊~\widetilde{\mathbf{K}} is a projection matrix, we may write

Var[ΞA,DPP|𝔇]=1N2i=1Nj=1Nθ(𝒛i,θ)𝐊~ii1θ(𝒛j,θ)𝐊~jj122|𝐊~ij|2.\displaystyle\mathrm{Var}[\Xi_{A,\mathrm{DPP}}|\mathfrak{D}]=\frac{1}{N^{2}}\cdot\sum_{i=1}^{N}\sum_{j=1}^{N}\left\|\nabla_{\theta}\mathcal{L}(\bm{z}_{i},\theta)\widetilde{\mathbf{K}}_{ii}^{-1}-\nabla_{\theta}\mathcal{L}(\bm{z}_{j},\theta)\widetilde{\mathbf{K}}_{jj}^{-1}\right\|_{2}^{2}|\widetilde{\mathbf{K}}_{ij}|^{2}. (S7)
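To fix ideas, the right-hand side of (S7) can be evaluated directly once the matrix K~ and the item-level gradients are available. The following Python sketch does so; the toy inputs (a random rank-p projection matrix and random gradients) are hypothetical and only meant to illustrate the formula.

import numpy as np

def rhs_of_S7(K_tilde, grads):
    # K_tilde: N x N projection kernel matrix of the DPP on the data set;
    # grads: N x dim_theta array whose i-th row is grad_theta L(z_i, theta).
    # Returns the right-hand side of (S7).
    N = K_tilde.shape[0]
    weighted = grads / np.diag(K_tilde)[:, None]          # grad_i / K_ii
    sq_norms = np.sum(weighted ** 2, axis=1)
    pairwise = sq_norms[:, None] + sq_norms[None, :] - 2 * weighted @ weighted.T
    return np.sum(pairwise * np.abs(K_tilde) ** 2) / N ** 2

rng = np.random.default_rng(0)
N, p, dim_theta = 50, 10, 3
U, _ = np.linalg.qr(rng.standard_normal((N, p)))          # orthonormal columns
K_tilde = U @ U.T                                         # rank-p projection matrix
grads = rng.standard_normal((N, dim_theta))
print(rhs_of_S7(K_tilde, grads))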

At this point, we set τ\tau to be the exponent in the order of spectral approximation 𝒪P(Nτ)\mathcal{O}_{P}(N^{-\tau}) obtained in Section S3.2.1 (our analysis in Section S3.2.1 yields the choice τ=1/2\tau=1/2; nonetheless, we present the analysis here in terms of the parameter τ\tau, so as to leave open the possibility of simple updates to our bound should the analysis of the spectral approximation be refined). We use the integral and spectral approximations (S15) and (S11), together with the inequality |ab|22|a|2+2|b|2|a-b|^{2}\leq 2|a|^{2}+2|b|^{2}, to continue from (S7) as

θ(𝒛i,θ)𝐊~(𝒛i,𝒛i)1θ(𝒛j,θ)𝐊~(𝒛j,𝒛j)122|𝐊~(𝒛i,𝒛j)|2\displaystyle\left\|\nabla_{\theta}\mathcal{L}(\bm{z}_{i},\theta)\widetilde{\mathbf{K}}(\bm{z}_{i},\bm{z}_{i})^{-1}-\nabla_{\theta}\mathcal{L}(\bm{z}_{j},\theta)\widetilde{\mathbf{K}}(\bm{z}_{j},\bm{z}_{j})^{-1}\right\|_{2}^{2}|\widetilde{\mathbf{K}}(\bm{z}_{i},\bm{z}_{j})|^{2}
(2θ(𝒛i,θ)Kq,γ~(p)(𝒛i,𝒛i)1θ(𝒛j,θ)Kq,γ~(p)(𝒛j,𝒛j)122\displaystyle\leq\Bigg{(}2\left\|\nabla_{\theta}\mathcal{L}(\bm{z}_{i},\theta)K_{q,\tilde{\gamma}}^{(p)}(\bm{z}_{i},\bm{z}_{i})^{-1}-\nabla_{\theta}\mathcal{L}(\bm{z}_{j},\theta)K_{q,\tilde{\gamma}}^{(p)}(\bm{z}_{j},\bm{z}_{j})^{-1}\right\|_{2}^{2}
+(Kq,γ~(p)(𝒛i,𝒛i)2+Kq,γ~(p)(𝒛j,𝒛j)2)𝒪P(N2τ))×(2|Kq,γ~(p)(𝒛i,𝒛j)|2+𝒪P(N2τ)).\displaystyle\ \ +(K_{q,\tilde{\gamma}}^{(p)}(\bm{z}_{i},\bm{z}_{i})^{-2}+K_{q,\tilde{\gamma}}^{(p)}(\bm{z}_{j},\bm{z}_{j})^{-2})\mathcal{O}_{P}(N^{-2\tau})\Bigg{)}\times\left(2|K_{q,\tilde{\gamma}}^{(p)}(\bm{z}_{i},\bm{z}_{j})|^{2}+\mathcal{O}_{P}(N^{-2\tau})\right). (S8)

We may combine (S7) and (S8) to obtain

Var\displaystyle\mathrm{Var} [ΞA,DPP|𝔇]\displaystyle[\Xi_{A,\mathrm{DPP}}|\mathfrak{D}]
1N2i=1Nj=1Nθ(𝒛i,θ)Kq,γ~(p)(𝒛i,𝒛i)1θ(𝒛j,θ)Kq,γ~(p)(𝒛j,𝒛j)122|Kq,γ~(p)(𝒛i,𝒛j)|2\displaystyle\lesssim\frac{1}{N^{2}}\cdot\sum_{i=1}^{N}\sum_{j=1}^{N}\left\|\nabla_{\theta}\mathcal{L}(\bm{z}_{i},\theta)K_{q,\tilde{\gamma}}^{(p)}(\bm{z}_{i},\bm{z}_{i})^{-1}-\nabla_{\theta}\mathcal{L}(\bm{z}_{j},\theta)K_{q,\tilde{\gamma}}^{(p)}(\bm{z}_{j},\bm{z}_{j})^{-1}\right\|_{2}^{2}|K_{q,\tilde{\gamma}}^{(p)}(\bm{z}_{i},\bm{z}_{j})|^{2}
+1N2i=1Nj=1N𝒪P(N2τ)(1Kq,γ~(p)(𝒛i,𝒛i)2+1Kq,γ~(p)(𝒛j,𝒛j)2)|Kq,γ~(p)(𝒛i,𝒛j)|2\displaystyle+\frac{1}{N^{2}}\cdot\sum_{i=1}^{N}\sum_{j=1}^{N}\mathcal{O}_{P}(N^{-2\tau})\cdot\left(\frac{1}{K_{q,\tilde{\gamma}}^{(p)}(\bm{z}_{i},\bm{z}_{i})^{2}}+\frac{1}{K_{q,\tilde{\gamma}}^{(p)}(\bm{z}_{j},\bm{z}_{j})^{2}}\right)\cdot|K_{q,\tilde{\gamma}}^{(p)}(\bm{z}_{i},\bm{z}_{j})|^{2}
+1N2i=1Nj=1Nθ(𝒛i,θ)Kq,γ~(p)(𝒛i,𝒛i)1θ(𝒛j,θ)Kq,γ~(p)(𝒛j,𝒛j)122𝒪P(N2τ)\displaystyle+\frac{1}{N^{2}}\cdot\sum_{i=1}^{N}\sum_{j=1}^{N}\left\|\nabla_{\theta}\mathcal{L}(\bm{z}_{i},\theta)K_{q,\tilde{\gamma}}^{(p)}(\bm{z}_{i},\bm{z}_{i})^{-1}-\nabla_{\theta}\mathcal{L}(\bm{z}_{j},\theta)K_{q,\tilde{\gamma}}^{(p)}(\bm{z}_{j},\bm{z}_{j})^{-1}\right\|_{2}^{2}\mathcal{O}_{P}(N^{-2\tau})
+𝒪P(N2τ).\displaystyle+\mathcal{O}_{P}(N^{-2\tau}). (S9)

This is where we need more assumptions. We recall our assumption that θ(𝒛,θ)(1pKq(p)(𝒛,𝒛))1\nabla_{\theta}\mathcal{L}(\bm{z},\theta)(\frac{1}{p}K_{q}^{(p)}(\bm{z},\bm{z}))^{-1} is uniformly bounded in 𝒛D\bm{z}\in D. This assumption is justified, for the kernel part, by OPE asymptotics for Nevai-class measures; see Totik’s classical result [10, Theorem 4.8]. For the gradient part, it is enough to assume that θ(𝒛,θ)\nabla_{\theta}\mathcal{L}(\bm{z},\theta) is uniformly bounded in 𝒛D\bm{z}\in D and θ\theta (with γ(𝒛)\gamma(\bm{z}) being bounded away from 0 and \infty). Coupled with the hypothesis that q(𝒛)q(\bm{z}) and the density γ(𝒛)\gamma(\bm{z}) of γ\gamma are uniformly bounded away from 0 and \infty on DD, and the uniform CLT for the convergence γ~(z)=γ(𝒛)+𝒪P(N1/2)\tilde{\gamma}(z)=\gamma(\bm{z})+\mathcal{O}_{P}(N^{-1/2}), we may deduce that

θ\displaystyle\nabla_{\theta} (𝒛,θ)(1pKq,γ~(p)(𝒛,𝒛))1\displaystyle\mathcal{L}(\bm{z},\theta)\left(\frac{1}{p}K_{q,\tilde{\gamma}}^{(p)}(\bm{z},\bm{z})\right)^{-1}
=θ(𝒛,θ)(1pKq(p)(𝒛,𝒛))1γ~(𝒛)q(𝒛)\displaystyle=\nabla_{\theta}\mathcal{L}(\bm{z},\theta)\left(\frac{1}{p}K_{q}^{(p)}(\bm{z},\bm{z})\right)^{-1}\cdot\frac{\tilde{\gamma}(\bm{z})}{q(\bm{z})}
=θ(𝒛,θ)(1pKq(p)(𝒛,𝒛))1γ(𝒛)q(𝒛)+θ(𝒛,θ)(1pKq(p)(𝒛,𝒛))11q(𝒛)𝒪P(N1/2)\displaystyle=\nabla_{\theta}\mathcal{L}(\bm{z},\theta)\left(\frac{1}{p}K_{q}^{(p)}(\bm{z},\bm{z})\right)^{-1}\cdot\frac{\gamma(\bm{z})}{q(\bm{z})}+\nabla_{\theta}\mathcal{L}(\bm{z},\theta)(\frac{1}{p}K_{q}^{(p)}(\bm{z},\bm{z}))^{-1}\cdot\frac{1}{q(\bm{z})}\cdot\mathcal{O}_{P}(N^{-1/2})
=𝒪P(1)+𝒪P(N1/2)\displaystyle=\mathcal{O}_{P}(1)+\mathcal{O}_{P}(N^{-1/2})
=𝒪P(1).\displaystyle=\mathcal{O}_{P}(1). (S10)

We now demonstrate how the spectral approximations lead to approximations for the integrals appearing in the above fluctuation analysis for ΞA,DPP\Xi_{A,\mathrm{DPP}}. To give the general structure of the argument, which will be invoked several times in what follows, we note that

1N2i,j=1N𝒪P(a(N))|1pKq,γ~(p)(𝒛i,𝒛j)|2=𝒪P(a(N)),\frac{1}{N^{2}}\sum_{i,j=1}^{N}\mathcal{O}_{P}(a(N))\left|\frac{1}{p}\cdot K_{q,\tilde{\gamma}}^{(p)}(\bm{z}_{i},\bm{z}_{j})\right|^{2}=\mathcal{O}_{P}(a(N)), (S11)

using the fact that 1p|Kq,γ~(p)(𝒛i,𝒛j)|1pKq,γ~(p)(𝒛i,𝒛i)1pKq,γ~(p)(𝒛j,𝒛j)\frac{1}{p}\cdot|K_{q,\tilde{\gamma}}^{(p)}(\bm{z}_{i},\bm{z}_{j})|\leq\sqrt{\frac{1}{p}\cdot K_{q,\tilde{\gamma}}^{(p)}(\bm{z}_{i},\bm{z}_{i})}\sqrt{\frac{1}{p}\cdot K_{q,\tilde{\gamma}}^{(p)}(\bm{z}_{j},\bm{z}_{j})} via the Cauchy-Schwarz inequality, and that 1pKq,γ~(p)(𝒛i,𝒛i)\frac{1}{p}\cdot K_{q,\tilde{\gamma}}^{(p)}(\bm{z}_{i},\bm{z}_{i}) is 𝒪P(1)\mathcal{O}_{P}(1) by OPE asymptotics.

We may combine (S9) and (S10) to deduce that

Var\displaystyle\mathrm{Var} [ΞA,DPP|𝔇]\displaystyle[\Xi_{A,\mathrm{DPP}}|\mathfrak{D}]
1N2i=1Nj=1Nθ(𝒛i,θ)Kq,γ~(p)(𝒛i,𝒛i)1θ(𝒛j,θ)Kq,γ~(p)(𝒛j,𝒛j)122|Kq,γ~(p)(𝒛i,𝒛j)|2\displaystyle\lesssim\frac{1}{N^{2}}\cdot\sum_{i=1}^{N}\sum_{j=1}^{N}\left\|\nabla_{\theta}\mathcal{L}(\bm{z}_{i},\theta)K_{q,\tilde{\gamma}}^{(p)}(\bm{z}_{i},\bm{z}_{i})^{-1}-\nabla_{\theta}\mathcal{L}(\bm{z}_{j},\theta)K_{q,\tilde{\gamma}}^{(p)}(\bm{z}_{j},\bm{z}_{j})^{-1}\right\|_{2}^{2}|K_{q,\tilde{\gamma}}^{(p)}(\bm{z}_{i},\bm{z}_{j})|^{2}
+𝒪P(N2τ).\displaystyle+\mathcal{O}_{P}(N^{-2\tau}). (S12)

We now proceed from (S12) as follows:

Var[ΞA,DPP|𝔇]\displaystyle\mathrm{Var}[\Xi_{A,\mathrm{DPP}}|\mathfrak{D}]
1N2i,j=1Nθ(𝒛i,θ)(1pKq,γ~(p)(𝒛i,𝒛i))1θ(𝒛j,θ)(1pKq,γ~(p)(𝒛j,𝒛j))122|1pKq,γ~(p)(𝒛i,𝒛j)|2\displaystyle\!\lesssim\!\!\frac{1}{N^{2}}\!\sum_{i,j=1}^{N}\!\!\left\|\nabla_{\theta}\mathcal{L}(\bm{z}_{i},\theta)\left(\frac{1}{p}K_{q,\tilde{\gamma}}^{(p)}(\bm{z}_{i},\bm{z}_{i})\right)^{-1}\!\!\!\!-\!\nabla_{\theta}\mathcal{L}(\bm{z}_{j},\theta)\left(\frac{1}{p}K_{q,\tilde{\gamma}}^{(p)}(\bm{z}_{j},\bm{z}_{j})\right)^{-1}\!\right\|_{2}^{2}\left|\frac{1}{p}K_{q,\tilde{\gamma}}^{(p)}(\bm{z}_{i},\bm{z}_{j})\right|^{2}
+𝒪P(N2τ)\displaystyle\qquad+\mathcal{O}_{P}(N^{-2\tau})
θ(𝒛,θ)(1pKq,γ~(p)(𝒛,𝒛))1θ(𝒘,θ)(1pKq,γ~(p)(𝒘,𝒘))122|1pKq,γ~(p)(𝒛,𝒘)|2dγ(𝒛)dγ(𝒘)\displaystyle\!\lesssim\!\!\!\iint\!\!\left\|\nabla_{\theta}\mathcal{L}(\bm{z},\theta)\!\!\left(\frac{1}{p}K_{q,\tilde{\gamma}}^{(p)}(\bm{z},\bm{z})\right)^{-1}\!\!\!\!\!\!-\!\nabla_{\theta}\mathcal{L}(\bm{w},\theta)\!\!\left(\frac{1}{p}K_{q,\tilde{\gamma}}^{(p)}(\bm{w},\bm{w})\right)^{-1}\!\right\|_{2}^{2}\left|\frac{1}{p}K_{q,\tilde{\gamma}}^{(p)}(\bm{z},\bm{w})\right|^{2}\!\!\!\mathrm{d}\gamma(\bm{z})\mathrm{d}\gamma(\bm{w})
+𝒪P(N1/2)+𝒪P(N2τ)\displaystyle\qquad+\mathcal{O}_{P}(N^{-1/2})+\mathcal{O}_{P}(N^{-2\tau})
θ(𝒛,θ)(1pKq,γ~(p)(𝒛,𝒛))1θ(𝒘,θ)(1pKq,γ~(p)(𝒘,𝒘))122|1pKq,γ~(p)(𝒛,𝒘)|2dγ~(𝒛)dγ~(𝒘)\displaystyle\!\lesssim\!\!\!\iint\!\!\left\|\nabla_{\theta}\mathcal{L}(\bm{z},\theta)\!\!\left(\frac{1}{p}K_{q,\tilde{\gamma}}^{(p)}(\bm{z},\bm{z})\right)^{-1}\!\!\!\!\!\!-\!\nabla_{\theta}\mathcal{L}(\bm{w},\theta)\!\!\left(\frac{1}{p}K_{q,\tilde{\gamma}}^{(p)}(\bm{w},\bm{w})\right)^{-1}\!\right\|_{2}^{2}\left|\frac{1}{p}K_{q,\tilde{\gamma}}^{(p)}(\bm{z},\bm{w})\right|^{2}\!\!\!\mathrm{d}\tilde{\gamma}(\bm{z})\mathrm{d}\tilde{\gamma}(\bm{w})
+𝒪P(N1/2)+𝒪P(N2τ)\displaystyle\qquad+\mathcal{O}_{P}(N^{-1/2})+\mathcal{O}_{P}(N^{-2\tau})
1p2θ(𝒛,θ)(1pKq,γ~(p)(𝒛,𝒛))1θ(𝒘,θ)(1pKq,γ~(p)(𝒘,𝒘))122|Kq(p)(𝒛,𝒘)|2dq(𝒛)dq(𝒘)\displaystyle\!\lesssim\!\!\!\frac{1}{p^{2}}\!\!\iint\!\!\left\|\nabla_{\theta}\mathcal{L}(\bm{z},\theta)\!\!\left(\frac{1}{p}K_{q,\tilde{\gamma}}^{(p)}(\bm{z},\bm{z})\right)^{-1}\!\!\!\!\!\!-\!\nabla_{\theta}\mathcal{L}(\bm{w},\theta)\!\!\left(\frac{1}{p}K_{q,\tilde{\gamma}}^{(p)}(\bm{w},\bm{w})\right)^{-1}\!\right\|_{2}^{2}\!\left|K_{q}^{(p)}(\bm{z},\bm{w})\right|^{2}\!\!\!\mathrm{d}q(\bm{z})\mathrm{d}q(\bm{w})
+𝒪P(N1/2)+𝒪P(N2τ),\displaystyle\qquad+\mathcal{O}_{P}(N^{-1/2})+\mathcal{O}_{P}(N^{-2\tau}), (S13)

where, in the last two steps, we have used the assumption that |1pKq,γ~(p)(𝒛,𝒘)|2\left|\frac{1}{p}K_{q,\tilde{\gamma}}^{(p)}(\bm{z},\bm{w})\right|^{2} and θ(𝒛,θ)(1pKq,γ~(p)(𝒛,𝒛))1\nabla_{\theta}\mathcal{L}(\bm{z},\theta)\left(\frac{1}{p}K_{q,\tilde{\gamma}}^{(p)}(\bm{z},\bm{z})\right)^{-1} are 𝒪P(1)\mathcal{O}_{P}(1), and used the rate of convergence of γ^N\hat{\gamma}_{N} to γ\gamma to pass from the sum to the integral with respect to γ\gamma, and from there to the integral with respect to γ~\tilde{\gamma} using the rate estimates for kernel density estimation, incurring an additive cost of 𝒪P(N1/2)\mathcal{O}_{P}(N^{-1/2}) in each step.

Using the hypothesis that θ(𝒛,θ)(1pKq,γ~(p)(𝒛,𝒛))1\nabla_{\theta}\mathcal{L}(\bm{z},\theta)\left(\frac{1}{p}K_{q,\tilde{\gamma}}^{(p)}(\bm{z},\bm{z})\right)^{-1} is Lipschitz with a Lipschitz constant that is 𝒪P(1)\mathcal{O}_{P}(1), we may proceed from (S13) as

Var[ΞA,DPP|𝔇]\displaystyle\mathrm{Var}[\Xi_{A,\mathrm{DPP}}|\mathfrak{D}] θ1p2𝒛𝒘22|Kq(p)(𝒛,𝒘)|2dq(𝒛)dq(𝒘)+𝒪P(N1/2)+𝒪P(N2τ),\displaystyle\lesssim\mathcal{M}_{\theta}\frac{1}{p^{2}}\iint\left\|\bm{z}-\bm{w}\right\|_{2}^{2}\left|K_{q}^{(p)}(\bm{z},\bm{w})\right|^{2}\mathrm{d}q(\bm{z})\mathrm{d}q(\bm{w})\!+\!\mathcal{O}_{P}(N^{-1/2})\!+\!\mathcal{O}_{P}(N^{-2\tau}), (S14)

where θ1/2\mathcal{M}_{\theta}^{1/2} is the aforementioned Lipschitz constant, which is 𝒪P(1)\mathcal{O}_{P}(1). We are back to analyzing the same variance term as in Section S3.1, and the rest of the proof follows along the same lines.

S3.2.1 Spectral approximations

In this section, we analyse the approximation error incurred when we replace Kq,γ~(p)K_{q,\tilde{\gamma}}^{(p)} by 𝐊~\widetilde{\mathbf{K}} in Equation (S8). To this end, we study the difference between the entries of 𝐊~\widetilde{\mathbf{K}} and those of Kq,γ~(p)(,)K_{q,\tilde{\gamma}}^{(p)}(\cdot,\cdot) when restricted to the data set 𝔇\mathfrak{D}. We recall that 𝐊~\widetilde{\mathbf{K}} is viewed as a kernel on the space L2(γ^N)L_{2}(\hat{\gamma}_{N}).

The idea is that, since NN is large, the kernel 𝐊~\widetilde{\mathbf{K}} acting on L2(γ^N)L_{2}(\hat{\gamma}_{N}), which is obtained by spectrally rounding off the kernel Kq,γ~(p)K_{q,\tilde{\gamma}}^{(p)} acting on L2(γ^N)L_{2}(\hat{\gamma}_{N}), is well approximated by the kernel Kq,γ~(p)K_{q,\tilde{\gamma}}^{(p)} acting on L2(γ~)L_{2}(\tilde{\gamma}). Since, by definition, Kq,γ~(p)(𝒛,𝒘)=q(𝒛)γ~(𝒛)Kq(p)(𝒛,𝒘)q(𝒘)γ~(𝒘)K_{q,\tilde{\gamma}}^{(p)}(\bm{z},\bm{w})=\sqrt{\frac{q(\bm{z})}{\tilde{\gamma}(\bm{z})}}K_{q}^{(p)}(\bm{z},\bm{w})\sqrt{\frac{q(\bm{w})}{\tilde{\gamma}(\bm{w})}}, we may deduce that Kq,γ~(p)(,)dγ~()dγ~()=Kq(p)(,)dq()dq()K_{q,\tilde{\gamma}}^{(p)}(\cdot,\cdot)\mathrm{d}\tilde{\gamma}(\cdot)\mathrm{d}\tilde{\gamma}(\cdot)=K_{q}^{(p)}(\cdot,\cdot)\mathrm{d}q(\cdot)\mathrm{d}q(\cdot). Now, the kernel Kq(p)(,)K_{q}^{(p)}(\cdot,\cdot) is a projection on L2(q)L_{2}(q). As such, the spectrum of (Kq,γ~(p),dγ~)(K_{q,\tilde{\gamma}}^{(p)},\mathrm{d}\tilde{\gamma}) is also close to that of a projection. Since 𝐊~\widetilde{\mathbf{K}} is obtained by rounding off the spectrum of Kq,γ~(p)K_{q,\tilde{\gamma}}^{(p)} to {0,1}\{0,1\}, the quantities |𝐊~(𝒛i,𝒛j)Kq,γ~(p)(𝒛i,𝒛j)||\widetilde{\mathbf{K}}(\bm{z}_{i},\bm{z}_{j})-K_{q,\tilde{\gamma}}^{(p)}(\bm{z}_{i},\bm{z}_{j})| will ultimately be controlled by how close the kernel (Kq,γ~(p)|𝔇,γ^N)(K_{q,\tilde{\gamma}}^{(p)}|_{\mathfrak{D}},\hat{\gamma}_{N}) is to a true projection.

To analyse this, we consider the operator [Kq,γ~(p)|𝔇]2[K_{q,\tilde{\gamma}}^{(p)}|_{\mathfrak{D}}]^{2} on L2(γ^N)L_{2}(\hat{\gamma}_{N}), which is an integral operator given by the convolution kernel

Kq,γ~(p)|𝔇Kq,γ~(p)|𝔇(𝒛i,𝒛k)\displaystyle K_{q,\tilde{\gamma}}^{(p)}|_{\mathfrak{D}}\star K_{q,\tilde{\gamma}}^{(p)}|_{\mathfrak{D}}(\bm{z}_{i},\bm{z}_{k}) =Kq,γ~(p)(𝒛i,𝒛j)Kq,γ~(p)(𝒛j,𝒛k)dγ^N(𝒛j)\displaystyle=\int K_{q,\tilde{\gamma}}^{(p)}(\bm{z}_{i},\bm{z}_{j})K_{q,\tilde{\gamma}}^{(p)}(\bm{z}_{j},\bm{z}_{k})\mathrm{d}\hat{\gamma}_{N}(\bm{z}_{j})
=1Nj=1NKq,γ~(p)(𝒛i,𝒛j)Kq,γ~(p)(𝒛j,𝒛k)\displaystyle=\frac{1}{N}\sum_{j=1}^{N}K_{q,\tilde{\gamma}}^{(p)}(\bm{z}_{i},\bm{z}_{j})K_{q,\tilde{\gamma}}^{(p)}(\bm{z}_{j},\bm{z}_{k})
=Kq,γ~(p)(𝒛i,𝒘)Kq,γ~(p)(𝒘,𝒛k)dγ~(𝒘)+𝒪P(N1/2)\displaystyle=\int K_{q,\tilde{\gamma}}^{(p)}(\bm{z}_{i},\bm{w})K_{q,\tilde{\gamma}}^{(p)}(\bm{w},\bm{z}_{k})\mathrm{d}\tilde{\gamma}(\bm{w})+\mathcal{O}_{P}(N^{-1/2})
=Kq,γ~(p)(𝒛i,𝒛k)+𝒪P(N1/2)\displaystyle=K_{q,\tilde{\gamma}}^{(p)}(\bm{z}_{i},\bm{z}_{k})+\mathcal{O}_{P}(N^{-1/2})
=Kq,γ~(p)|𝔇(𝒛i,𝒛k)+𝒪P(N1/2),\displaystyle=K_{q,\tilde{\gamma}}^{(p)}|_{\mathfrak{D}}(\bm{z}_{i},\bm{z}_{k})+\mathcal{O}_{P}(N^{-1/2}),

where we have used the convergence of the kernel density estimator γ~\tilde{\gamma} as well as the empirical measure γ^N\hat{\gamma}_{N} to the underlying measure γ\gamma, at the rate 𝒪P(N1/2)\mathcal{O}_{P}(N^{-1/2}) described e.g. by the uniform CLT.

We may summarize the above by observing that [Kq,γ~(p)|𝔇]2=Kq,γ~(p)|𝔇+𝒪P(N1/2)[K_{q,\tilde{\gamma}}^{(p)}|_{\mathfrak{D}}]^{2}=K_{q,\tilde{\gamma}}^{(p)}|_{\mathfrak{D}}+\mathcal{O}_{P}(N^{-1/2}) on L2(γ^N)L_{2}(\hat{\gamma}_{N}), i.e. Kq,γ~(p)|𝔇K_{q,\tilde{\gamma}}^{(p)}|_{\mathfrak{D}} is a projection up to an error of 𝒪P(N1/2)\mathcal{O}_{P}(N^{-1/2}), which indicates that |𝐊~(𝒛i,𝒛j)Kq,γ~(p)|𝔇(𝒛i,𝒛j)|=𝒪P(N1/2)|\widetilde{\mathbf{K}}(\bm{z}_{i},\bm{z}_{j})-K_{q,\tilde{\gamma}}^{(p)}|_{\mathfrak{D}}(\bm{z}_{i},\bm{z}_{j})|=\mathcal{O}_{P}(N^{-1/2}).
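For concreteness, the spectral round-off producing K~ from K_{q,γ~}^{(p)} restricted to the data can be sketched as follows, under the assumption that the kernel matrix acts on L_2(γ^_N) as 1/N times the matrix, and that its eigenvalues are rounded to the nearest of {0,1}; the exact normalization convention may differ slightly from the main text.

import numpy as np

def spectral_round_off(K_matrix, N):
    # K_matrix: symmetric kernel matrix (K_{q,gamma~}^{(p)}(z_i, z_j))_{i,j},
    # viewed as an operator on L_2(gamma^_N), i.e. acting as (1/N) * K_matrix.
    A = K_matrix / N
    eigvals, eigvecs = np.linalg.eigh(A)
    rounded = (eigvals > 0.5).astype(float)   # round each eigenvalue to the nearest of {0, 1}
    A_proj = (eigvecs * rounded) @ eigvecs.T  # projection operator
    return N * A_proj                         # back to the kernel normalization

# toy usage on an almost-projection kernel (hypothetical inputs)
rng = np.random.default_rng(1)
N, p = 200, 10
U, _ = np.linalg.qr(rng.standard_normal((N, p)))
E = rng.standard_normal((N, N)) / np.sqrt(N)
K_near = N * (U @ U.T) + (E + E.T) / 2        # projection kernel plus a small symmetric perturbation
K_tilde = spectral_round_off(K_near, N)
print(np.max(np.abs(K_tilde - K_near)))       # largest entrywise change caused by the round-off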

To understand the estimator ΞA,DPP\Xi_{A,\mathrm{DPP}}, we also need to control 𝐊~(𝒛i,𝒛i)1\widetilde{\mathbf{K}}(\bm{z}_{i},\bm{z}_{i})^{-1}, which we deduce from the above discussion. To this end, we observe that

|𝐊~(𝒛i,𝒛i)1Kq,γ~(p)(𝒛i,𝒛i)1|\displaystyle|\widetilde{\mathbf{K}}(\bm{z}_{i},\bm{z}_{i})^{-1}-K_{q,\tilde{\gamma}}^{(p)}(\bm{z}_{i},\bm{z}_{i})^{-1}| =|(Kq,γ~(p)(𝒛i,𝒛i)+𝒪P(N1/2))1Kq,γ~(p)(𝒛i,𝒛i)1|\displaystyle=|(K_{q,\tilde{\gamma}}^{(p)}(\bm{z}_{i},\bm{z}_{i})+\mathcal{O}_{P}(N^{-1/2}))^{-1}-K_{q,\tilde{\gamma}}^{(p)}(\bm{z}_{i},\bm{z}_{i})^{-1}|
=Kq,γ~(p)(𝒛i,𝒛i)1𝒪P(N1/2).\displaystyle=K_{q,\tilde{\gamma}}^{(p)}(\bm{z}_{i},\bm{z}_{i})^{-1}\cdot\mathcal{O}_{P}(N^{-1/2}). (S15)

In drawing the above conclusion, we require that Kq,γ~(p)(𝒛i,𝒛i)K_{q,\tilde{\gamma}}^{(p)}(\bm{z}_{i},\bm{z}_{i}) stays bounded away from 0, which we justify as follows. We recall that Kq,γ~(p)(𝒛i,𝒛i)=Kq(p)(𝒛i,𝒛i)q(𝒛i)γ~(𝒛i)K_{q,\tilde{\gamma}}^{(p)}(\bm{z}_{i},\bm{z}_{i})=K_{q}^{(p)}(\bm{z}_{i},\bm{z}_{i})\cdot\frac{q(\bm{z}_{i})}{\tilde{\gamma}(\bm{z}_{i})}. From OPE asymptotics, Kq(p)(𝒛i,𝒛i)K_{q}^{(p)}(\bm{z}_{i},\bm{z}_{i}) is of order pp, whereas γ~(𝒛i)=γ(𝒛i)+𝒪P(N1/2)\tilde{\gamma}(\bm{z}_{i})=\gamma(\bm{z}_{i})+\mathcal{O}_{P}(N^{-1/2}) from kernel density approximation and uniform CLT asymptotics. Moreover, by hypothesis, the densities q,γq,\gamma are bounded away from 0 and \infty on 𝔇\mathfrak{D}. Putting all of the above together, we deduce that Kq,γ~(p)(𝒛i,𝒛i)K_{q,\tilde{\gamma}}^{(p)}(\bm{z}_{i},\bm{z}_{i}) is of order pp; in particular, it is bounded away from 0, as desired.
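Spelling out the elementary expansion behind (S15), and using the fact just established that the kernel diagonal is bounded away from 0,

\left(K_{q,\tilde{\gamma}}^{(p)}(\bm{z}_{i},\bm{z}_{i})+\mathcal{O}_{P}(N^{-1/2})\right)^{-1}=K_{q,\tilde{\gamma}}^{(p)}(\bm{z}_{i},\bm{z}_{i})^{-1}\left(1+\frac{\mathcal{O}_{P}(N^{-1/2})}{K_{q,\tilde{\gamma}}^{(p)}(\bm{z}_{i},\bm{z}_{i})}\right)^{-1}=K_{q,\tilde{\gamma}}^{(p)}(\bm{z}_{i},\bm{z}_{i})^{-1}\left(1+\mathcal{O}_{P}(N^{-1/2})\right),

which is (S15).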

S3.3 Order of pp vs NN and future work

In this section, we discuss the relative order of pp and NN, especially in the context of the estimator ΞA,s\Xi_{A,\mathrm{s}}, for which we undertake a classical bias-variance trade-off argument. To this end, we recall that the bias is 𝒪P(ph/N)\mathcal{O}_{P}(ph/N) (with hh the window size for kernel smoothing), while the variance is 𝒪P(p(1+1/d))\mathcal{O}_{P}(p^{-(1+1/d)}). Substituting one of the canonical choices for the window size, namely h=N1/dh=N^{-1/d} for dimension dd and data size NN, and setting the bias and the standard deviation to be roughly of the same order, we obtain the choice p=N2d+13d+1p=N^{\frac{2d+1}{3d+1}}. A similar analysis may be undertaken for the estimator ΞA,DPP\Xi_{A,\mathrm{DPP}}. However, while its finite-sample bias is 0, its variance term is more complicated, in particular because of the contributions from the spectral approximation 𝒪P(Nτ)\mathcal{O}_{P}(N^{-\tau}). We believe that the present analysis of the spectral approximation can be further tightened and made rigorous, so as to yield sharper values of τ\tau that more closely mimic experimental performance. Further avenues of improvement include better ways to handle boundary effects (to control the asymptotic intensity of the OPE kernels, which behave in a complicated manner at the boundary of the background measure); methods to bypass the spectral round-off step in constructing the estimator ΞA,DPP\Xi_{A,\mathrm{DPP}}; and an analysis of the errors incurred when switching between discrete measures and continuous densities that is tailored to our setting (thereby raising the possibility of sharper error bounds).
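As a small companion to the discussion above, the following helper simply implements the stated choice p = N^{(2d+1)/(3d+1)}; the constants hidden in the O_P terms are ignored, so the output is only indicative.

def suggested_p(N, d):
    # batch-related parameter p suggested by the bias-variance trade-off discussed above
    return round(N ** ((2 * d + 1) / (3 * d + 1)))

for N, d in [(15000, 16), (10 ** 5, 2), (10 ** 6, 3)]:
    print(N, d, suggested_p(N, d))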

Appendix S4 Experiments on a real dataset

To extensively compare the performance of our gradient estimator ΞA,DPP\Xi_{A,\mathrm{DPP}} to the default ΞA,Poi\Xi_{A,\mathrm{Poi}}, we run the same experiment as in Section 5 of the main paper, but on a benchmark dataset from LIBSVM (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). We download the letter dataset, which consists of 1500015000 training samples and 50005000 testing samples, where each sample contains 1616 features. We turn the 2626-class classification problem into a binary classification problem where the goal is to separate the classes 11-1313 from the other 1313 classes. We denote the preprocessed dataset by letter.binary. We consider =lin\mathcal{L}=\mathcal{L}_{\text{lin}}, λ0=0.001\lambda_{0}=0.001 and p=10p=10. Figure 3 summarizes the experimental results on letter.binary, where the performance metrics are averaged over 10001000 independent runs of each SGD variant. The left panel shows the decrease of the objective function value, the middle panel shows how the norm of the complete gradient ΞN(θt)\|\Xi_{N}(\theta_{t})\| decreases with the budget, and the right panel shows the test error. Error bars in the last panel are ±\pm one standard deviation of the mean. We can see that using a DPP improves over Poisson minibatches of the same size, both in terms of minimizing the empirical loss and in terms of reaching a small test error for a given budget. Compared with the experimental results on the simulated data in the main paper, although the 𝒪P(p(1+1/d))\mathcal{O}_{P}(p^{-(1+1/d)}) rate discussed in Section 4 becomes slower as dd grows, our DPP-based minibatches still give better performance on this real dataset with d=16d=16 than Poisson minibatches of the same size, which again demonstrates the significance of variance reduction in SGD.
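For reference, the preprocessing of the letter dataset described above can be reproduced along the following lines; this is a sketch and not necessarily the exact script we used, and it assumes that the LIBSVM files letter.scale and letter.scale.t have been downloaded from the page above.

import numpy as np
from sklearn.datasets import load_svmlight_file

# letter.scale: training split; letter.scale.t: testing split (16 features, labels 1..26)
X_train, y_train = load_svmlight_file("letter.scale", n_features=16)
X_test, y_test = load_svmlight_file("letter.scale.t", n_features=16)
# letter.binary: classes 1-13 become +1, classes 14-26 become -1
y_train_bin = np.where(y_train <= 13, 1.0, -1.0)
y_test_bin = np.where(y_test <= 13, 1.0, -1.0)
print(X_train.shape, X_test.shape)  # expected: (15000, 16) and (5000, 16)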

Figure 3: Logistic regression on the letter.binary dataset.

References

  • [1] S. Shalev-Shwartz and S. Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.
  • [2] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
  • [3] É. Moulines and F. Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems (NeurIPS), volume 24, pages 451–459, 2011.
  • [4] C. Zhang, H. Kjellstrom, and S. Mandt. Determinantal point processes for mini-batch diversification. In Uncertainty in Artificial Intelligence (UAI), 2017.
  • [5] O. Macchi. The coincidence approach to stochastic point processes. Advances in Applied Probability, 7:83–122, 1975.
  • [6] F. Lavancier, J. Møller, and E. Rubak. Determinantal point process models and statistical inference. Journal of the Royal Statistical Society, Series B, 2014.
  • [7] A. Kulesza and B. Taskar. Determinantal Point Processes for Machine Learning, volume 5 of Foundations and Trends in Machine Learning. Now Publishers Inc., 2012.
  • [8] S. Ghosh and P. Rigollet. Gaussian determinantal processes: A new model for directionality in data. Proceedings of the National Academy of Sciences, 117(24):13207–13213, 2020.
  • [9] C. Zhang, H. Kjellstrom, and S. Mandt. Active mini-batch sampling using repulsive point processes. In Proceedings of the AAAI conference on Artificial Intelligence, 2019.
  • [10] R. Bardenet and A. Hardy. Monte Carlo with determinantal point processes. Annals of Applied Probability, 2020.
  • [11] A. Nikolov, M. Singh, and U. T. Tantipongpipat. Proportional volume sampling and approximation algorithms for A-optimal design. arXiv preprint arXiv:1802.08318, 2018.
  • [12] M. Derezinski and M. K. Warmuth. Unbiased estimates for linear regression via volume sampling. In Advances in Neural Information Processing Systems (NeurIPS). Curran Associates, Inc., 2017.
  • [13] A. Poinas and R. Bardenet. On proportional volume sampling for experimental design in general spaces. arXiv preprint arXiv:2011.04562, 2020.
  • [14] M. Derezinski and M. W. Mahoney. Determinantal point processes in randomized numerical linear algebra. Notices of the American Mathematical Society, 68(1), 2021.
  • [15] A. Belhadji, R. Bardenet, and P. Chainais. A determinantal point process for column subset selection. Journal of Machine Learning Research (JMLR), 2020.
  • [16] N. Tremblay, S. Barthelmé, and P.-O. Amblard. Determinantal point processes for coresets. Journal of Machine Learning Research, 20(168):1–70, 2019.
  • [17] J. B. Hough, M. Krishnapur, Y. Peres, and B. Virág. Determinantal processes and independence. Probability surveys, 2006.
  • [18] D. J. Daley and D. Vere-Jones. An Introduction to the Theory of Point Processes, volume 1. Springer, 2nd edition, 2003.
  • [19] A. Soshnikov. Determinantal random point fields. Russian Mathematical Surveys, 55:923–975, 2000.
  • [20] A. Soshnikov. Gaussian limit for determinantal random point fields. Annals of Probability, 30(1):171–187, 2002.
  • [21] S. Ghosh. Determinantal processes and completeness of random exponentials: the critical case. Probability Theory and Related Fields, 163(3):643–665, 2015.
  • [22] J. A. Gillenwater. Approximate inference for determinantal point processes. PhD thesis, University of Pennsylvania, 2014.
  • [23] M. P. Wand and M. C. Jones. Kernel smoothing. CRC press, 1994.
  • [24] C. P. Robert and G. Casella. Monte Carlo statistical methods. Springer, 2004.
  • [25] S. V. N. Vishwanathan, N. N. Schraudolph, M. W. Schmidt, and K. P. Murphy. Accelerated training of conditional random fields with stochastic gradient methods. In Proceedings of the International Conference on Machine Learning (ICML), pages 969–976, 2006.
  • [26] R. Bardenet and S. Ghosh. Learning from DPPs via Sampling: Beyond HKPV and symmetry. arXiv preprint arXiv:2007.04287, 2020.
  • [27] S. S. Vempala. The random projection method, volume 65. American Mathematical Society, 2005.
  • [28] A. Benveniste, M. Métivier, and P. Priouret. Adaptive algorithms and stochastic approximations. Springer Science & Business Media, 1990.
  • [29] B. Simon. The Christoffel–Darboux kernel. In Perspectives in PDE, Harmonic Analysis and Applications, volume 79 of Proceedings of Symposia in Pure Mathematics, pages 295–335, 2008.
  • [30] G. Gautier, R. Bardenet, G. Polito, and M. Valko. DPPy: Sampling determinantal point processes with Python. Journal of Machine Learning Research; Open Source Software (JMLR MLOSS), 2019.
  • [31] A. Belhadji, R. Bardenet, and P. Chainais. Kernel quadrature with determinantal point processes. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
  • [32] A. Belhadji, R. Bardenet, and P. Chainais. Kernel interpolation with continuous volume sampling. In International Conference on Machine Learning (ICML), 2020.
  • [33] R. M. Dudley. Uniform central limit theorems, volume 142. Cambridge university press, 2014.
  • [34] E. Giné and R. Nickl. Uniform central limit theorems for kernel density estimators. Probability Theory and Related Fields, 141(3):333–387, 2008.