
Generalization despite overfitting in quantum machine learning models

Evan Peters (e6peters@uwaterloo.ca), Department of Physics, University of Waterloo, Waterloo, ON, N2L 3G1, Canada; Institute for Quantum Computing, Waterloo, ON, N2L 3G1, Canada; Perimeter Institute for Theoretical Physics, Waterloo, Ontario, N2L 2Y5, Canada. Maria Schuld, Xanadu, Toronto, ON, M5G 2C8, Canada.
Abstract

The widespread success of deep neural networks has revealed a surprise in classical machine learning: very complex models often generalize well while simultaneously overfitting training data. This phenomenon of benign overfitting has been studied for a variety of classical models with the goal of better understanding the mechanisms behind deep learning. Characterizing the phenomenon in the context of quantum machine learning might similarly improve our understanding of the relationship between overfitting, overparameterization, and generalization. In this work, we provide a characterization of benign overfitting in quantum models. To do this, we derive the behavior of a classical interpolating Fourier features model for regression on noisy signals, and show how a class of quantum models exhibits analogous features, thereby linking the structure of quantum circuits (such as data-encoding and state preparation operations) to overparameterization and overfitting in quantum models. We intuitively explain these features according to the ability of the quantum model to interpolate noisy data with locally “spiky” behavior and provide a concrete demonstration of benign overfitting.

1 Introduction

A long-standing paradigm in machine learning is the trade-off between the complexity of a model family and the model’s ability to generalize: more expressive model classes contain better candidates to fit complex trends in data, but are also prone to overfitting noise [1, 2]. Interpolation, defined for our purposes as choosing a model with zero training error, was hence long considered bad practice [3]. The success of deep learning – machine learning in a specific regime of extremely complex model families with vast amounts of tunable parameters – seems to contradict this notion; here, consistent evidence shows that among some interpolating models, more complexity tends not to harm the generalisation performance (that interpolation does not always harm generalization is in fact well known for models like boosting and nearest neighbours, which are both interpolating models), a phenomenon described as “benign overfitting” [4].

In recent years, a surge of theoretical studies has reproduced benign overfitting in simplified settings with the hope of isolating the essential ingredients of the phenomenon [4, 5]. For example, Ref. [6] showed how interpolating linear models in a high complexity regime (more dimensions than datapoints) could generalize just as well as their lower-complexity counterparts on new data, and analyzed the properties of the data that lead to the “absorption” of noise by the interpolating model without harming the model’s predictions. Ref. [7] showed that there are model classes of simple functions that change quickly in the vicinity of the noisy training data, but recover a smooth trend elsewhere in data space (see Figure 1). Such functions have also been used to train nearest neighbor models that perfectly overfit training data while generalizing well, thereby directly linking “spiking models” to benign overfitting [8]. Recent works try to recover the basic mechanism of such spiking models using the language of Fourier analysis [9, 10, 11].

In parallel to these exciting developments in the theory of deep learning, quantum computing researchers have proposed families of parametrised quantum algorithms as model classes for machine learning (e.g. Ref. [12]). These quantum models can be optimised similarly to neural networks [13, 14] and have interesting parallels to kernel methods [15, 16] and generative models [17, 18]. Although researchers have taken some first steps to study the expressivity [19, 20, 21, 22], trainability [23, 24] and generalisation [25, 26, 27, 28] of quantum models, we still know relatively little about their behaviour. In particular, the interplay of overparametrisation, interpolation, and generalisation that seems so important for deep learning remains largely unexplored.

In this paper we develop a simplified framework in which questions of overfitting in quantum machine learning can be investigated. Essentially, we exploit the observation that quantum models can often be described in terms of Fourier series where well-defined components of the quantum circuit influence the selection of Fourier modes and their respective Fourier coefficients [29, 30, 31]. We link this description to the analysis of spiking models and benign overfitting by building on prior works analyzing these phenomena using Fourier methods. In this approach, the complexity of a model is related to the number of Fourier modes that its Fourier series representation consists of, and overparametrised model classes have more modes than needed to interpolate the training data (i.e., to have zero training error). After deriving the generalization error for such model classes, we show that these “superfluous” modes lead to spiking models, which oscillate strongly around the training data while keeping a smooth trend everywhere else. However, large numbers of modes can also harm the recovery of an underlying signal, and we therefore balance this trade-off to produce an explicit example of benign overfitting in a quantum machine learning model.

The mathematical link described above allows us to probe the impact of important design choices for a simplified class of quantum models on this trade-off. For example, we find that a measure of redundancy in the spectrum of the Hamiltonian defining standard data-encoding strategies strongly influences this balance, in fact to an extent that is difficult to counterbalance by other design choices of the circuit.

The remainder of the paper proceeds as follows. We will first review the classical Fourier framework for the study of interpolating models and develop explicit formulae for the error in these models to produce a basic example of benign overfitting (Sec. 2). We will then construct a quantum model with analogous components to the classical model, and demonstrate how each of these components is related to the structure of the corresponding quantum circuit and measurement (Sec. 3). We then analyze specific cases that give rise to “spikiness” and benign overfitting in these quantum models (Sec. 3.2).

Figure 1: Intuition behind the phenomenon of benign overfitting with spiking models. a Training a model typically involves a trade-off between fitting noisy training data well and recovering an underlying target function. b Traditional learning theory associates interpolating models (reaching zero training error by fitting every data point) with low generalization capability (or high test error), since they fail to recover a simple target function. c “Spiking models” that change quickly in the vicinity of training data but otherwise exhibit simple behavior can explain how both kinds of errors may be kept low.

2 Interpolating models in the Fourier framework

In this section we will provide the essential tools to probe the phenomenon of overparametrized models that exhibit the spiking behaviour from Figure 1 using the language of Fourier series. We will review and formalize the problem setting and several examples from Refs. [9, 10, 11], before extending their framework by incorporating standard results from linear regression to derive closed-form error behavior and examples of benign overfitting.

2.1 Setting up the learning problem

We are interested in functions $g$ defined on a finite interval that may be written in terms of a linear combination of Fourier basis functions or modes $e^{i2\pi kx}$ (each describing a complex sinusoid with integer-valued frequency $k$) weighted by their corresponding Fourier coefficients $\hat{g}_{k}$:

g(x)=\sum_{k=-\infty}^{\infty}\hat{g}_{k}e^{i2\pi kx}.  (1)

We restrict our attention to well-behaved functions that are sufficiently smooth and continuous to be expressed in this form. We will now set up a simple learning problem whose basic components – the model and target function to be learned – can be expressed as Fourier series with only few non-zero coefficients, and define concepts such as overparametrization and interpolation.

Consider a machine learning problem in which data is generated by a target function of the form

g(x)=\sum_{k\in\Omega_{n_{0}}}\hat{g}_{k}e^{i2\pi kx}  (2)

which only contains frequencies in the discrete, integer-valued spectrum

\Omega_{n_{0}}=\Bigl\{-\frac{n_{0}-1}{2},\,-\frac{n_{0}-1}{2}+1,\,\dots,\,0,\,\dots,\,\frac{n_{0}-1}{2}\Bigr\}  (3)

for some odd integer $n_{0}$. We call functions of this form band-limited with bandwidth $n_{0}/2$ (that is, $|k|<n_{0}/2$ for all frequencies $k\in\Omega_{n_{0}}$). The bandwidth limits the complexity of a function, and will therefore be important for exploring scenarios where a complex model is used to overfit a less complex target function. We are provided with $n$ noisy samples $y_{j}=g(x_{j})+\epsilon_{j}$ of the target function $g$ evaluated at points $x_{j}=j/n$ spaced uniformly on the interval $[0,1]$ (we assume without loss of generality that the input data has been rescaled to $[0,1]$), where $n>n_{0}$ and we require $n$ to be odd for simplicity.

The model class we consider likewise contains band-limited Fourier series, and since we are interested in interpolating models, we always assume that they have enough modes to interpolate the noisy training data, namely:

f(x)=\sum_{k\in\Omega_{d}}\alpha_{k}\sqrt{\nu_{k}}\,e^{i2\pi kx},  (4)
\text{such that}\quad f(x_{j})=y_{j}\text{ for all }j\in[n].

Similarly to Eq. (3) we define the spectrum

\Omega_{d}=\Bigl\{-\frac{d-1}{2},\,-\frac{d-1}{2}+1,\,\dots,\,0,\,\dots,\,\frac{d-1}{2}\Bigr\}.  (5)

Following Ref. [9], this model class has two components: the set of weighted Fourier basis functions $\sqrt{\nu_{k}}\,e^{i2\pi kx}$ describes a feature map applied to $x$ for some set of feature weights $\nu_{k}\in\mathbb{R}^{+}$, while the trainable weights $\alpha_{k}\in\mathbb{C}$ are optimized to ensure that $f$ interpolates the data. The theory of trigonometric polynomial interpolation [32] ensures that $f$ can always interpolate the training data for some choice of trainable weights $\alpha_{k}$ under these conditions. In the following, we will therefore usually consider the $\alpha_{k}$ as being determined by the data and the interpolation condition, while the $\nu_{k}$ serve as our “turning knobs” to create different settings in which to study spiking properties and benign overfitting. We call the model class described in Eq. (4) overparameterized when the degree of the Fourier series is much larger than the degree of the target function, $d\gg n_{0}$, in which case the model has many more frequencies available for fitting than are present in the underlying signal.

Note that one can rewrite the Fourier series of Eq. (4) in a linear form

f(x)=\langle\boldsymbol{\alpha},\phi(x)\rangle  (6)

where

\boldsymbol{\alpha}=(\alpha_{-(d-1)/2},\dots,\alpha_{k},\dots,\alpha_{(d-1)/2})\in\mathbb{C}^{d},  (7)
\phi(x)=\left(\sqrt{\nu_{-(d-1)/2}}\,e^{-i\pi(d-1)x},\,\dots,\,\sqrt{\nu_{k}}\,e^{i2\pi kx},\,\dots,\,\sqrt{\nu_{(d-1)/2}}\,e^{i\pi(d-1)x}\right)\in\mathbb{C}^{d}.  (8)

From this perspective, optimizing $f$ amounts to learning the trainable weights $\boldsymbol{\alpha}$ by performing regression on observations $\phi(x)$ sampled from a random Fourier features model [33], for which $\nu_{k}$ and $\phi(x)_{k}$ are precisely the eigenvalues and eigenfunctions of a shift-invariant kernel [34].

To complete the problem setup, we have to impose one more constraint. Consider that $\exp(i2\pi kx_{j})$ with frequency $k<n/2$ evaluated at the uniformly spaced points $x_{j}$ is equal to $\exp(i2\pi k^{\prime}x_{j})$ for any alias frequency $k^{\prime}\equiv k\ (\mathrm{mod}\ n)$. The presence of these aliases means that the model class described in Eq. (6) contains many interpolating solutions in the overparameterized regime. Motivated by prior work exploring benign overfitting for linear features [6], Fourier features [9, 35], and other nonlinear features [36, 37], we will study the minimum-$\ell_{2}$-norm interpolating solution,

\boldsymbol{\alpha}^{opt}=\operatorname*{arg\,min}_{\boldsymbol{\alpha}:\,f(x_{j})=y_{j}\,\forall\,j}\lVert\boldsymbol{\alpha}\rVert_{2}.  (9)

Minimizing the $\ell_{2}$ norm is a typical choice for imposing a penalty on complex functions (regularization) in the underparameterized regime, though we will see that this intuition does not carry over to the overparameterized regime. The remainder of this section will explore how this learning problem results in a trade-off in interpolating Fourier models: overparameterization introduces alias frequencies that increase the error in fitting simple target functions, but can also reduce error by absorbing noise into high-frequency modes with spiking behavior.
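To make this setup concrete, the following minimal NumPy sketch (all parameter values and the target $g(x)=\sin(2\pi\cdot 2x)$ are illustrative choices, not prescriptions from this paper) builds the weighted Fourier feature map of Eq. (8), samples noisy data, and computes the minimum-$\ell_{2}$-norm interpolating weights of Eq. (9) via the Moore–Penrose pseudoinverse:

    import numpy as np

    rng = np.random.default_rng(0)
    n, n0, d, sigma = 31, 5, 155, 0.3       # samples, target degree, model degree, noise level
    x_train = np.arange(n) / n              # uniformly spaced points x_j = j/n
    freqs = np.arange(-(d - 1) // 2, (d - 1) // 2 + 1)      # spectrum Omega_d
    nu = np.ones(d) / d                     # feature weights nu_k (uniform, for illustration)

    g = lambda x: np.sin(2 * np.pi * 2 * x)                 # band-limited target function
    y = g(x_train) + sigma * rng.normal(size=n)             # noisy samples y_j = g(x_j) + eps_j

    # rows of Phi are the feature vectors phi(x_j)_k = sqrt(nu_k) exp(i 2 pi k x_j), Eq. (8)
    Phi = np.sqrt(nu) * np.exp(2j * np.pi * np.outer(x_train, freqs))

    # minimum-l2-norm interpolating weights, Eq. (9), via the pseudoinverse
    alpha_opt = np.linalg.pinv(Phi) @ y

    f_opt = lambda x: np.real(np.sqrt(nu) * np.exp(2j * np.pi * np.outer(x, freqs)) @ alpha_opt)
    print(np.max(np.abs(f_opt(x_train) - y)))               # ~1e-13: the model interpolates the data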

2.2 Two extreme cases to understand generalization

To better understand the trade-off that overparametrization – or in our case, a much larger number of Fourier modes than needed to interpolate the data – introduces between fitting noise and generalization error, we revisit two extreme cases explored in Ref. [9], involving a pure-noise signal and a noiseless signal.

Case 1: Noise only

The first case demonstrates how alias modes can help to fit noise without disturbing the (here trivial) signal. We set $g=0$ and consider $n$ observations $y_{j}=\epsilon_{j}$ of zero-mean noise with known variance $\mathbb{E}[\epsilon^{2}]=\sigma^{2}$. After making $n$ uniformly spaced observations, we compute the discrete Fourier transform of the observations as the sequence of values $\hat{\epsilon}_{j}$ satisfying

\hat{\epsilon}_{j}=\frac{1}{n}\sum_{k\in\Omega_{n}}\epsilon_{k}e^{-i2\pi jk/n},  (10)

which characterizes the frequency content of the noisy signal that can be captured and learned using only $n$ evenly spaced samples. Suppose that the degree of the model (controlling the model complexity) is given by $d=n(m+1)$ for some even integer $m$, and that $\nu_{k}=1$ for every mode, so that there are exactly $m$ equally-weighted aliases for each frequency in the spectrum of the Fourier series for $g$. Then the optimal (i.e., the minimum-$\ell_{2}$-norm, interpolating) trainable weight vector $\boldsymbol{\alpha}^{opt}$ has entries

\alpha^{opt}_{k+n\ell}=\frac{\hat{\epsilon}_{k}}{n(m+1)}  (11)

for $\ell=-m/2,\dots,m/2$, with all other entries being zero (see Appendix A.2). From Eq. (11), the minimum-$\lVert\boldsymbol{\alpha}\rVert_{2}$ solution distributes the noise Fourier coefficients $\hat{\epsilon}_{k}$ evenly over the many alias frequencies $k+n\ell$, while enforcing that the trainable weights $\alpha_{k+n\ell}$ of these aliases sum to $\hat{\epsilon}_{k}$ to guarantee interpolation. As shown in Figure 2, the higher-frequency aliases suppress the optimal model $f^{opt}(x)=\langle\boldsymbol{\alpha}^{opt},\phi(x)\rangle$ to near-zero at every point away from the interpolated points, resulting in a test error of $\mathcal{O}(\sigma^{2}/m)$ that decreases monotonically with the complexity of the model. As $m$ increases, the optimal model $f^{opt}$ remains close to the true signal $y=0$ while becoming “spiky” near the noisy samples. By conforming to the true signal everywhere except in the vicinity of noise, this behavior embodies the mechanism by which overparameterized models absorb noise into high-frequency modes. In this case the generalization error, measuring how close the model is to the target function on average, decreases with increasing complexity of the model class.
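The even spreading of noise over aliases can be checked numerically. A small sketch under the same assumptions as above (NumPy, with illustrative values of $n$ and $m$) compares the pseudoinverse solution with the description above: for any $k$, the weights on the aliases of $k$ are all equal and sum to $\hat{\epsilon}_{k}$:

    import numpy as np

    rng = np.random.default_rng(1)
    n, m = 7, 4                             # samples and (even) number of extra aliases
    d = n * (m + 1)
    x = np.arange(n) / n
    eps = rng.normal(size=n)                # pure-noise observations y_j = eps_j

    freqs = np.arange(-(d - 1) // 2, (d - 1) // 2 + 1)
    Phi = np.exp(2j * np.pi * np.outer(x, freqs))           # nu_k = 1 for every mode
    alpha = np.linalg.pinv(Phi) @ eps                       # minimum-l2-norm interpolator

    k = 2                                                   # any frequency in Omega_n
    eps_hat_k = np.mean(eps * np.exp(-2j * np.pi * k * np.arange(n) / n))   # DFT as in Eq. (10)
    alias_weights = alpha[(freqs - k) % n == 0]             # weights on k, k±n, k±2n, ...
    print(np.allclose(alias_weights, alias_weights[0]))     # True: spread evenly over aliases
    print(np.isclose(alias_weights.sum(), eps_hat_k))       # True: aliases sum to eps_hat_k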

Figure 2: Comparison of overparametrized and simple models that interpolate data from different target functions. a Noise only: the overparameterized model ($n=7$, $d=35$, blue) is more effective at recovering the target function $g(x)=0$ when provided with noisy data, while a band-limited model (red) cannot do so effectively. Inset: the trainable weights of the minimum-$\lVert\boldsymbol{\alpha}\rVert_{2}$ model are distributed among many aliases, suppressing the error of the optimized model (proportional to $\sum_{j}|\hat{f}_{j}|^{2}$) in this case. b Signal only: the opposite occurs in the case of noiseless input, for which a band-limited model ($n_{0}=5$) perfectly recovers $g(x)=\sin\left(2\pi(2x)\right)$ while the overparameterized model fails to capture the behavior of the input signal. Inset: minimizing the $\ell_{2}$ norm of $\boldsymbol{\alpha}$ distributes the Fourier coefficient $\hat{g}_{2}$ among aliases $\hat{f}_{2+\ell n}$ (and $\hat{g}_{-2}$ among aliases $\hat{f}_{-2+\ell n}$); this bleeding of the signal $g$ into higher frequencies results in higher error in the overparameterized model [9, 11].

Case 2: Signal only

While the above case shows how overparametrization can help to absorb noise and reduce error without harming the signal, the second case illustrates how alias frequencies in the overparameterized model can harm the model’s ability to learn the target function. To demonstrate this, we now consider a noiseless, single-mode signal $g(x):=\hat{g}_{p}e^{i2\pi px}$ of frequency $p\leq n_{0}/2$. The data is hence of the form

y_{j}=g(x_{j}):=\hat{g}_{p}e^{i2\pi px_{j}}.  (12)

Once again we choose $d=n(m+1)$ and for simplicity we assume an unweighted model, $\nu_{k}=1$ for $k\in\Omega_{d}$. By orthonormality of the Fourier basis functions, the interpolation condition only involves the modes of the model $f$ at alias frequencies of $p$, and the interpolation constraint can be rewritten as

\sum_{\ell=-m/2}^{m/2}\alpha_{p+n\ell}=\hat{g}_{p}.  (13)

The choice of trainable weights $\alpha_{k}$ that satisfies Eq. (13) while minimizing the $\ell_{2}$-norm is

\alpha^{opt}_{p+n\ell}=\frac{\hat{g}_{p}}{m+1}  (14)

for $k=p+n\ell$ and $\alpha_{k}=0$ otherwise (see Appendix A.2). Eq. (14) distributes the Fourier coefficient $\hat{g}_{p}$ among the trainable weights $\alpha_{p+n\ell}$ corresponding to frequencies $p+n\ell$. Therefore, minimizing $\lVert\boldsymbol{\alpha}\rVert_{2}$ in this case “bleeds” the target function into higher-frequency aliases and results in the opposite effect compared to fitting a noisy signal (see Fig. 2b): the generalization error of the overparameterized model now increases with the number of aliases $m$, and the complexity of the model harms its ability to fit a noiseless target function.
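Eq. (14) and the resulting “bleeding” can be verified in the same way (again a NumPy sketch with illustrative parameters):

    import numpy as np

    n, m, p = 7, 4, 2                       # samples, extra aliases, signal frequency
    d = n * (m + 1)
    x = np.arange(n) / n
    g_hat_p = 1.0
    y = g_hat_p * np.exp(2j * np.pi * p * x)                # noiseless single-mode signal, Eq. (12)

    freqs = np.arange(-(d - 1) // 2, (d - 1) // 2 + 1)
    Phi = np.exp(2j * np.pi * np.outer(x, freqs))
    alpha = np.linalg.pinv(Phi) @ y

    # Eq. (14): the coefficient g_hat_p is split evenly among the aliases p + n*l
    print(np.allclose(alpha[(freqs - p) % n == 0], g_hat_p / (m + 1)))      # True

    # off the training grid, the overparameterized model deviates from g ("bleeding")
    x_test = np.linspace(0, 1, 1000, endpoint=False)
    f_test = np.exp(2j * np.pi * np.outer(x_test, freqs)) @ alpha
    print(np.mean(np.abs(f_test - g_hat_p * np.exp(2j * np.pi * p * x_test)) ** 2))  # > 0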

In order to recover a trade-off in generalization error for more general cases, we will need to consider more interesting distributions of feature weights $\nu_{k}$ (instead of $\nu_{k}=1$) that provide finer control over fitting the target function with low-frequency modes while spiking in the vicinity of noise with high-frequency aliases.

2.3 Generalization trade-offs and benign overfitting

The opposing effects of higher-frequency modes in overparameterized models in the cases discussed above hint at a trade-off in model performance that depends on the underlying signal and the feature weights of the Fourier feature map. Returning to the more general case of input samples $y_{j}=g(x_{j})+\epsilon_{j}$, in Appendix A we show that the task of fitting uniformly spaced samples using weighted Fourier features may be transformed into a linear regression problem, thereby generalizing the results of Ref. [9] to derive the following general solution to the minimum-$\ell_{2}$ interpolation problem of Eq. (9):

\alpha^{opt}_{k+\ell n}=\hat{y}_{k}\frac{\sqrt{\nu_{k+\ell n}}}{\sum_{j\in S(k)}\nu_{j}},  (15)

where $\hat{y}_{k}$ is the discrete Fourier transform of the samples $y_{j}$ evaluated at frequency $k\in\Omega_{n_{0}}$, and where

S(k)=\{k+n\ell\in\Omega_{d}:\ell\in\mathbb{Z}\}  (16)

denotes the set of alias frequencies of $k$ appearing in the overparameterized model with spectrum $\Omega_{d}$. The optimal model is then expressed as

f^{opt}(x)=\sum_{k\in\Omega_{n_{0}}}\hat{y}_{k}\sum_{\ell\in S(k)}e^{i2\pi\ell x}\frac{\nu_{\ell}}{\sum_{j\in S(k)}\nu_{j}}.  (17)
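As a sanity check of Eq. (15) (a NumPy sketch with arbitrary samples and randomly drawn feature weights; this is not part of the derivation in Appendix A), the closed form can be compared against a numerical minimum-norm fit:

    import numpy as np

    rng = np.random.default_rng(2)
    n, d = 7, 35
    x = np.arange(n) / n
    y = rng.normal(size=n)                                  # arbitrary (noisy) samples

    freqs = np.arange(-(d - 1) // 2, (d - 1) // 2 + 1)
    nu = rng.uniform(0.1, 1.0, size=d)                      # generic feature weights nu_k > 0
    Phi = np.sqrt(nu) * np.exp(2j * np.pi * np.outer(x, freqs))
    alpha_pinv = np.linalg.pinv(Phi) @ y                    # numerical minimum-norm solution

    # closed form, Eq. (15): alpha_{k+ln} = y_hat_k sqrt(nu_{k+ln}) / sum_{j in S(k)} nu_j
    alpha_formula = np.zeros(d, dtype=complex)
    for k in range(-(n - 1) // 2, (n - 1) // 2 + 1):
        y_hat_k = np.mean(y * np.exp(-2j * np.pi * k * np.arange(n) / n))
        S_k = (freqs - k) % n == 0                          # alias frequencies of k in Omega_d
        alpha_formula[S_k] = y_hat_k * np.sqrt(nu[S_k]) / nu[S_k].sum()

    print(np.allclose(alpha_pinv, alpha_formula))           # True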

Recalling that our model $f$ is trained on $n$ noisy samples $(y_{0},\dots,y_{n-1})$ of the target function $g$, we are interested in the squared error of the model $f$ averaged over (noisy) samples over the input domain,

L(f):=\mathbb{E}_{x,\mathbf{y}}(f(x)-g(x))^{2},  (18)

and we call $L$ the generalization error of $f$, as it captures the behavior of $f$ with respect to $g$ over the entire input domain $x\in[0,1]$ instead of just the uniformly spaced training points $x_{j}$. In Appendix A we derive

L(f^{opt})=\underbrace{\frac{\sigma^{2}}{n}\sum_{k\in\Omega_{n}}\frac{\sum_{\ell\in S(k)}\nu_{\ell}^{2}}{\left(\sum_{j\in S(k)}\nu_{j}\right)^{2}}}_{\textsc{var}}+\underbrace{\sum_{k\in\Omega_{n_{0}}}|\hat{g}_{k}|^{2}\left[\left(\frac{\sum_{\ell\in S(k)\backslash k}\nu_{\ell}}{\sum_{j\in S(k)}\nu_{j}}\right)^{2}+\frac{\sum_{\ell\in S(k)\backslash k}\nu_{\ell}^{2}}{\left(\sum_{j\in S(k)}\nu_{j}\right)^{2}}\right]}_{\textsc{bias}^{2}}.  (19)

We now use this generalization error to explore two interesting behaviors of the interpolating model in our setting: the tradeoff between noise absorption and signal recovery exemplified by the cases in Sec. 2.2, and the ability of an overparameterized Fourier features model to benignly overfit the training data.

The first behavior involves a trade-off in the generalization error $L(f^{opt})$ between absorbing noise (reducing var) and capturing the target function signal (reducing $\textsc{bias}^{2}$), which recovers and generalizes the behavior of the two cases in Sec. 2.2. This trade-off is controlled by three components: the noise variance $\sigma^{2}$, the Fourier coefficients $\hat{g}_{k}$ of the input signal, and the distribution of feature weights $\sqrt{\nu_{k}}$. As described in the two cases above, when $\sigma^{2}\rightarrow 0$ (signal only) the variance term var vanishes and the model is biased for any choice of nonzero $\nu_{k}$ with $k>n$. Conversely, when $\hat{g}\rightarrow 0$ (noise only) the bias term $\textsc{bias}^{2}$ vanishes, and the variance term is minimized by choosing uniform $\nu_{k}$ for all $k\in\Omega_{d}$.

The second interesting behavior occurs when the generalization error of the overparameterized model decreases at a nearly optimal rate as the number of samples $n$ increases, known as benign overfitting. Prior work on benign overfitting in linear regression studied scenarios where the distribution of input data varied with the dimensionality of the data and the size of the training set in such a way that the excess generalization error of the overparameterized model (compared to a simple model) vanished [6]. However, since the dimensionality of the input data for our model is fixed, we instead consider sequences of feature weights that vary with respect to the problem parameters ($n_{0}$, $n$, $d$) in a way that results in $\textsc{bias}^{2}$ and var vanishing as $n\rightarrow\infty$. In this case, by fitting an increasing number of samples $n$ using such a sequence of feature weights, the overparameterized model both perfectly fits the training data and generalizes well to unseen test points, and therefore exhibits benign overfitting.

These behaviors are exemplified by a straightforward choice of feature weights that incorporates some prior knowledge of the spectrum $\Omega_{n_{0}}$ available to the target function $g$. For all $k\in\Omega_{n_{0}}$, let $\nu_{k}=c/n_{0}$ for some positive $c$, and normalize the feature weights so that $\sum_{k\in\Omega_{d}}\nu_{k}=1$. We show in Appendix A.3.1 that the error terms of $L(f^{opt})$ scale as

\textsc{var}=\mathcal{O}\left(\frac{1}{n}+\frac{n}{d}\right),  (20)
\textsc{bias}^{2}=\mathcal{O}\left(\frac{1}{n^{2}}\right).  (21)

Thus, as long as the dimension of the overparameterized Fourier features model grows strictly faster than $n$ (i.e., $d=\omega(n)$), the model exhibits benign overfitting. In Appendix A.3.2 we demonstrate how this simple example actually captures the mechanisms of benign overfitting for much more general choices of feature weights. Fig. 3 summarizes this behavior and provides an example of the bias-variance tradeoff that occurs for overparameterized models. In particular, Fig. 3a exemplifies the setting in which benign overfitting occurs, wherein the feature weights of the Fourier features model are strongly concentrated over frequencies in $\Omega_{n_{0}}$ but extend over a large range of alias frequencies for each $k\in\Omega_{n_{0}}$.
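The scalings of Eqs. (20)-(21) can also be explored numerically. The sketch below (NumPy; the split of weight between $\Omega_{n_{0}}$ and the remaining frequencies is an ad hoc illustrative choice in the spirit of Fig. 3a, not the construction of Appendix A.3.1) evaluates the two terms of Eq. (19) for growing $n$ with $d=n^{2}$, so that $d=\omega(n)$; both printed terms decrease with $n$:

    import numpy as np

    def generalization_error(nu, freqs, g_hat, g_freqs, n, sigma2):
        """Evaluate the VAR and BIAS^2 terms of Eq. (19) for feature weights nu over freqs."""
        var, bias2 = 0.0, 0.0
        for k in range(-(n - 1) // 2, (n - 1) // 2 + 1):
            S_k = (freqs - k) % n == 0                      # aliases of k within Omega_d
            var += sigma2 / n * (nu[S_k] ** 2).sum() / nu[S_k].sum() ** 2
        for k, gk in zip(g_freqs, g_hat):
            S_k = (freqs - k) % n == 0
            not_k = S_k & (freqs != k)                      # S(k) \ k
            bias2 += abs(gk) ** 2 * ((nu[not_k].sum() / nu[S_k].sum()) ** 2
                                     + (nu[not_k] ** 2).sum() / nu[S_k].sum() ** 2)
        return var, bias2

    n0, sigma2 = 5, 0.25
    g_freqs = np.arange(-(n0 - 1) // 2, (n0 - 1) // 2 + 1)
    g_hat = np.ones(n0) / n0                                # some band-limited target coefficients

    for n in [31, 127, 511]:                                # growing sample size, d = n^2 (odd)
        d = n ** 2
        freqs = np.arange(-(d - 1) // 2, (d - 1) // 2 + 1)
        nu = np.full(d, 0.01 / (d - n0))                    # small weight spread over aliases
        nu[np.isin(freqs, g_freqs)] = 0.99 / n0             # most weight concentrated on Omega_{n0}
        print(n, generalization_error(nu, freqs, g_hat, g_freqs, n, sigma2))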

Figure 3: The feature weights $\nu_{k}$ may be engineered to exhibit a bias-variance tradeoff and benign overfitting. a A demonstration of feature weights $\nu_{k}$ which account for prior knowledge of the target function $g$, with large $\nu_{k}$ for $k\in\Omega_{n_{0}}$ and small $\nu_{k}$ for $k\in\Omega_{d}\backslash\Omega_{n_{0}}$. b After sampling a band-limited target function ($n_{0}=15$) at a fixed number of uniformly spaced points ($n=31$), the generalization error of the minimum-$\ell_{2}$-norm model $f^{opt}$ consists of a tradeoff between decreasing var and increasing $\textsc{bias}^{2}$ for larger $d$. Note the peaking behavior of var at $d/n=1$. c This choice of feature weights $\nu_{k}$ benignly overfits the input signal such that $\lim_{n,d\rightarrow\infty}L(f^{opt})=0$ for any scaling of $d=n^{\alpha}$ with $\alpha>1$, and fails to generalize on the interval $[0,1]$ when $\alpha\leq 1$.

The generalization behavior described here is fundamentally different from many generalization guarantees typically found in statistical learning theory. While prior work has derived guarantees for the generalization of quantum models by constructing bounds on the complexity of the model class [25], Eqs. (20)-(21) demonstrate that generalization may occur even as the complexity (i.e., the dimension) of a model grows arbitrarily large.

So far, we have reviewed the Fourier perspective on fitting periodic functions in a classical setting and extended the analysis to characterize benign overfitting. However, if we can link the basic components of quantum models to the terms appearing in the error of Eq. (19), then we will be able to study a similar trade-off in the error of overparameterized quantum models and the conditions necessary for benign overfitting. The remainder of this work is devoted to showing that analogous mechanisms exist in certain quantum machine learning models, and to studying the choices of feature weights for which quantum models can exhibit tradeoffs in generalization error and benign overfitting.

3 Benign overfitting in single-layer quantum models

In the previous section we have seen that the feature weights $\sqrt{\nu_{k}}$ balance the trade-off between absorbing the noise and hurting the signal in overparametrized models. To understand how different design choices of quantum models impact this balance, we need to link their mathematical structure to the model class defined in Eq. (4), and in particular to the feature weights, which is what we do now.

The type of quantum models we consider here are simplified versions of parametrized quantum classifiers (also known as quantum neural networks) that have been heavily investigated in recent years [13, 38, 39]. They are represented by quantum circuits that consist of two steps: first, we encode a datapoint $x\in[0,1]$ into the state of the quantum system by applying a $d$-dimensional unitary operator $V(x)$, and then we measure the expectation value of some $d$-dimensional (Hermitian) observable $M$. This gives rise to a general class of quantum models of the form

f(x)=\langle 0|V^{\dagger}(x)MV(x)|0\rangle.  (22)

To simplify the analysis, we will consider a quantum circuit $V(x)$ that consists of a data-independent unitary $U$ and a diagonal data-encoding unitary generated by a $d$-dimensional Hermitian operator $H$,

S(x)=\exp(i2\pi\cdot\operatorname{diag}(\lambda_{0},\dots,\lambda_{d-1})\,x)=\exp(i2\pi Hx),  (23)

which includes a large class of commonly studied quantum models but excludes schemes involving data re-uploading [40, 41]. Defining $U|0\rangle=|\Gamma\rangle=\sum_{j=1}^{d}\gamma_{j}|j\rangle$, the output of this quantum model becomes

f(x)=\langle\Gamma|S^{\dagger}(x)MS(x)|\Gamma\rangle,  (24)

where $|\Gamma\rangle$ can be treated as an arbitrary input quantum state. We call the corresponding quantum circuit for this model single-layer in the sense that it contains a single diagonal data-encoding layer in which all data-dependent circuit operations could theoretically be executed simultaneously (though in general the operation $U$ and the measurement $M$ may require significant depth to implement).

Applying insights from Refs. [29, 30], quantum models of this form can be expressed in terms of a Fourier series

f(x)=\sum_{k\in\Omega}e^{i2\pi kx}\sum_{\ell,m\in R(k)}\gamma_{\ell}\gamma_{m}^{*}M_{m\ell},  (25)

where the spectrum $\Omega$ as well as the partitions $R(k)$ depend on the eigenspectrum $\lambda(H)$ of the data-encoding Hamiltonian $H$:

\Omega=\{(\lambda_{\ell}-\lambda_{m}):\lambda_{\ell},\lambda_{m}\in\lambda(H)\},  (26)
R(k)=\{(\ell,m):\lambda_{\ell}-\lambda_{m}=k;\,\,\lambda_{\ell},\lambda_{m}\in\lambda(H)\}.  (27)
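For illustration, the brute-force construction of $\Omega$ and $R(k)$ and the equivalence between Eq. (24) and Eq. (25) can be checked directly (a NumPy sketch; the spectrum, state, and observable below are arbitrary toy choices):

    import numpy as np

    rng = np.random.default_rng(5)
    d = 6
    lam = np.array([0, 1, 2, 4, 7, 11])                     # eigenvalues of the encoding Hamiltonian H
    Gamma = rng.normal(size=d) + 1j * rng.normal(size=d)
    Gamma /= np.linalg.norm(Gamma)                          # input state |Gamma> = U|0>
    A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    M = (A + A.conj().T) / 2                                # a Hermitian observable

    def f_direct(x):
        """f(x) = <Gamma| S(x)^dag M S(x) |Gamma> with diagonal S(x) = exp(i 2 pi H x), Eq. (24)."""
        psi = np.exp(2j * np.pi * lam * x) * Gamma
        return (psi.conj() @ M @ psi).real

    # Eqs. (26)-(27): spectrum Omega and index pairs R(k) with lambda_l - lambda_m = k
    diffs = lam[:, None] - lam[None, :]
    Omega = np.unique(diffs)
    R = {k: list(zip(*np.nonzero(diffs == k))) for k in Omega}

    def f_fourier(x):
        """The same model written as the Fourier series of Eq. (25)."""
        return sum(np.exp(2j * np.pi * k * x)
                   * sum(Gamma[i] * Gamma[j].conj() * M[j, i] for (i, j) in R[k])
                   for k in Omega).real

    print(np.isclose(f_direct(0.37), f_fourier(0.37)))      # True
    print(len(Omega), {int(k): len(R[k]) for k in Omega})   # |Omega| and degeneracies |R(k)|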

Comparing Eq. (25) to Eq. (4), we see that the quantum model may be expressed as a linear combination of weighted Fourier modes, but it is not yet clear how the input state amplitudes $\gamma_{j}$ and the trainable observable $M$ of the quantum model correspond to feature weights $\nu_{k}$ for each Fourier mode. To reveal this correspondence, we will first need to find the minimum-norm interpolating observable that solves the optimization problem

M^{opt}=\operatorname*{arg\,min}_{M=M^{\dagger}:\,f(x_{j})=y_{j}\,\forall\,j}\lVert M\rVert_{F},  (28)

where $\lVert M\rVert_{F}=\sqrt{\operatorname{Tr}\left(M^{\dagger}M\right)}$ denotes the Frobenius norm of $M$. Solving Eq. (28) is analogous to minimizing the $\ell_{2}$ norm of $\boldsymbol{\alpha}$ in the classical optimization problem of Eq. (9), and serves a role similar to the regularization commonly applied to quantum models by introducing a penalty term proportional to $\lVert M\rVert_{F}^{2}$ [26, 42, 43]. In Appendix B we prove that, subject to the condition $\gamma_{i}>0$, the minimum-$\lVert\cdot\rVert_{F}$ interpolating observable that solves Eq. (28) is given by

(M^{opt})_{m\ell}=\hat{y}_{k}\frac{\gamma_{\ell}^{*}\gamma_{m}}{\sum_{i,j\in R(S_{k})}|\gamma_{i}|^{2}|\gamma_{j}|^{2}},  (29)

for all $\ell,m\in R(S_{k})$, and the corresponding optimal quantum model is

f^{opt}(x)=\sum_{k\in\Omega_{n_{0}}}\hat{y}_{k}\,\sum_{\ell\in S(k)}e^{i2\pi\ell x}\frac{\nu^{opt}_{\ell}}{\sum_{j\in S(k)}\nu^{opt}_{j}},  (30)

where $S(k)$ denotes the set of aliases of $k$ appearing in $\Omega$, as in Eq. (16). By comparison with the optimal classical model of Eq. (17), we identify the feature weights of the optimized quantum model as

\nu_{k}^{opt}:=\sum_{i,j\in R(k)}|\gamma_{i}|^{2}|\gamma_{j}|^{2}.  (31)

Interestingly, while there was initially no clear way to separate the building blocks of the quantum model in Eq. (25) into trainable weights $\alpha_{k}$ and feature weights $\nu_{k}$, this separation has now appeared after solving for the optimal observable $M^{opt}$. Furthermore, the optimal quantum model depends only on the magnitudes $|\gamma_{i}|$ and is independent of the phases of the amplitudes $\gamma_{i}$ (an effect that stems from using only a single data-encoding layer $S(x)$ in the quantum model).
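A short computation (NumPy sketch; the spectrum and the uniform state are illustrative choices) evaluates Eq. (31) and makes the dependence on the degeneracies explicit; for a uniform input state the feature weights reduce to $|R(k)|/d^{2}$, the correspondence used in the simplified model of Sec. 3.1 below:

    import numpy as np

    d = 8
    lam = np.arange(d)                      # encoding spectrum 0, 1, ..., d-1 (cf. the Binary strategy)
    gamma = np.ones(d) / np.sqrt(d)         # uniform input state amplitudes

    p = np.abs(gamma) ** 2
    diffs = lam[:, None] - lam[None, :]
    # Eq. (31): nu_k^opt is a sum of |gamma_i|^2 |gamma_j|^2 over the index pairs in R(k)
    nu_opt = {int(k): float(p[np.nonzero(diffs == k)[0]] @ p[np.nonzero(diffs == k)[1]])
              for k in np.unique(diffs)}
    print(nu_opt)                           # equals |R(k)| / d^2 = (d - |k|) / d^2 for this state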

From Eq. (31) it is clear that the partitions $R(k)$ of Eq. (27), which arise from the choice of data-encoding unitary $S(x)$, have a strong relationship with the feature weights $\nu_{k}$ of the quantum model. We will now consider a simplified quantum model to highlight this relationship, thereby identifying a tradeoff between noise absorption and target signal recovery and the possibility of observing benign overfitting in quantum models.

3.1 Simplified quantum model

To explicitly highlight the role of $R(k)$ in controlling the feature weights $\nu_{k}^{opt}$ of the optimized quantum model, we will now simplify the model by using an equal-superposition input state $|\Gamma\rangle=\frac{1}{\sqrt{d}}\sum_{j=0}^{d-1}|j\rangle$ and by restricting the set of observables considered during optimization. If we fix every entry of the observable with respect to elements in a partition $R(k)$ to be proportional to some complex constant $M(k)$:

M_{m\ell}=\frac{M(k)}{\sqrt{|R(k)|}}\qquad\text{for all }\ell,m\in R(k),  (32)

then we can simplify the quantum model of Eq. (25) to

f(x)=\sum_{k\in\Omega}M(k)\frac{\sqrt{|R(k)|}}{d}e^{i2\pi kx}.  (33)

Comparing Eq. (33) to Eq. (4), we identify a direct correspondence between the trainable weights $\alpha_{k}$ in the classical model and $M(k)$, as well as a correspondence between the feature weights $\sqrt{\nu_{k}}$ and the degeneracy $|R(k)|$ of the quantum model. Making the substitutions $\alpha_{k}\rightarrow M(k)$ and $\sqrt{\nu_{k}}\rightarrow\sqrt{|R(k)|}/d$, one can verify that $\sum_{k\in\Omega}|M(k)|^{2}=\lVert M\rVert_{F}^{2}$ for this restricted choice of $M$, and so the solution to the optimization of Eq. (28) is essentially the same as that of the classical problem in Eq. (15).

The crucial property of the simplified model is that the degeneracy $|R(k)|$ – and hence the combinatorial structure introduced by the eigenvalues of the data-encoding Hamiltonian – completely controls the trade-off in the generalization error (Eq. (19)). We can hence study different types of partitions $R(k)$ to show a direct effect of the data-encoding unitary $S(x)$ on the fitting and generalization error behaviors of this simplified, overparameterized quantum model.

To study these behaviors we will now consider specific families of $H$, which we call encoding strategies since the choice of $H$ completely determines how the data is encoded into the quantum model. While $R(k)$ and $\Omega$ may be computed for an arbitrary $S(x)$ using brute-force combinatorics, some encoding strategies lead to particularly simple solutions. We have derived a few such examples of simple degeneracies and spectra for different encoding strategies in Appendix C and present the results in Table 1. These choices highlight the extreme variation in $\Omega$ resulting from minor changes to $S(x)$, for example $|\Omega|\propto n_{q}$ for the “Hamming” encoding strategy compared to $|\Omega|\propto 3^{n_{q}}$ for the “Ternary” encoding strategy. These examples also highlight the limitations in constructing Hamiltonians with specific properties such as uniform $|R(k)|$ or evenly-spaced frequencies in $\Omega$.

Encoding strategy | Hamiltonian example | Degeneracy $|R(k)|$ | Spectrum $\Omega$ | $|\Omega|$
Hamming | $\frac{1}{2}(Z_{0}+Z_{1}+Z_{2}+\dots)$ | $\binom{2n_{q}}{n_{q}-|k|}$ | $\{-n_{q},1-n_{q},\dots,n_{q}\}$ | $2n_{q}+1$
Binary | $\frac{1}{2}(Z_{0}+2Z_{1}+4Z_{2}+\dots)\sim\operatorname{diag}\left(0,1,2,\dots,2^{n_{q}}-1\right)$ | $2^{n_{q}}-|k|$ | $\{-2^{n_{q}}+1,\dots,2^{n_{q}}-1\}$ | $2^{n_{q}+1}-1$
Ternary | $\frac{1}{2}(Z_{0}+3Z_{1}+9Z_{2}+\dots)$ | $2^{n_{q}-\lVert T(k)\rVert_{1}}$ | $\bigl\{-\bigl\lfloor\frac{3^{n_{q}}}{2}\bigr\rfloor,\dots,\bigl\lfloor\frac{3^{n_{q}}}{2}\bigr\rfloor\bigr\}$ | $3^{n_{q}}$
Golomb | $\operatorname{diag}\left(0,1,4,6\right)$ | $d$ for $k=0$; $1$ for $k\neq 0$ | varies | $2\binom{d}{2}+1$
Table 1: Spectra $\Omega$ and degeneracies $R(k)$ computed for various data-encoding Hamiltonians $H$ defined for either $d$ dimensions or $n_{q}$ qubits. The Hamming, Binary, and Ternary data-encoding strategies are realized on $n_{q}$ qubits using a separable Hamiltonian consisting of Pauli-$Z$ operators with different prefactor schemes. The Ternary encoding strategy results in the largest $|\Omega|$ possible for a separable data-encoding Hamiltonian (see also Ref. [44]), while the Golomb encoding strategy (named in reference to Golomb rulers, e.g. [45]) results in the largest $|\Omega|$ possible for any choice of $d$-dimensional data-encoding Hamiltonian. Note that the spectrum is preserved under permutations and additive shifts of the diagonal of the Hamiltonian, and so we use “$\sim$” to denote equivalence up to these operations. The function $T$ converts an integer to a signed ternary string, as defined in Eq. (234) (see Appendix C).
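The $|\Omega|$ column of Table 1 can be reproduced by brute force from the diagonal of each encoding Hamiltonian (a NumPy sketch for $n_{q}=3$; the diagonals are written up to the irrelevant shifts and permutations noted in the caption):

    import numpy as np

    def spectrum_size(diag):
        """Number of distinct frequencies |Omega| generated by a diagonal encoding Hamiltonian."""
        diag = np.asarray(diag)
        return len(np.unique(diag[:, None] - diag[None, :]))

    nq = 3
    hamming = np.array([nq / 2 - bin(i).count("1") for i in range(2 ** nq)])     # (Z_0+Z_1+Z_2)/2
    binary = np.arange(2 ** nq)                                                  # ~ diag(0,...,2^nq - 1)
    ternary = np.array([sum(((i >> j) & 1) * 3 ** j for j in range(nq))
                        for i in range(2 ** nq)])                                # prefactors 1, 3, 9, ...
    golomb = np.array([0, 1, 4, 6])                                              # d = 4 example from Table 1

    print(spectrum_size(hamming))   # 2*nq + 1 = 7
    print(spectrum_size(binary))    # 2^(nq+1) - 1 = 15
    print(spectrum_size(ternary))   # 3^nq = 27
    print(spectrum_size(golomb))    # 2*C(4,2) + 1 = 13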

Since the feature weights $\nu_{k}$ of the Fourier modes are fixed by the choice of the data-encoding unitary, we can understand a choice of $S(x)$ as providing a structural bias of a quantum model towards different overfitting behaviors; conversely, the choices of feature weights available to quantum models are limited and are particular to the structure of the associated quantum circuit. Figure 4 shows distributions of feature weights arising from the example encoding strategies presented in Table 1, and demonstrates a broad connection between the degeneracies $|R(k)|$ of the model (giving rise to feature weights $\nu_{k}^{opt}$) and the generalization error $L(f^{opt})$.

Figure 4: The degeneracies $|R(k)|$ for select quantum models directly influence the tendency of the models to overfit noise or bleed the underlying signal. a The degeneracies $|R(k)|$ for each frequency $k$ for the four models introduced in Table 1 induce significantly different feature weights in the simplified quantum model. b Two models exemplify the extreme cases of fitting behaviors from Sec. 2.2: the Golomb encoding strategy (magenta) has significant weight on feature weights $\nu_{k}$ for large frequencies $k$, resulting in a “spiky” overfitting behavior that interpolates each noisy sample $y_{j}$ but sacrifices recovery of the underlying signal $g(x)$; the Hamming encoding strategy (green) fits noisy samples smoothly, and the error of the model is sensitive to the noise in the sampled data. c-e The $\textsc{bias}^{2}$ and var of the optimized Hamming encoding strategy (c), Binary encoding strategy (d), and Ternary encoding strategy (e) demonstrate the general effect of $|R(k)|$ and $\nu_{k}$ on the generalization error. The concentration of feature weights $\nu_{k}$ on $k\in\Omega_{n_{0}}$ in the Hamming encoding strategy reduces the bias (c), while $\nu_{k}$ having broad support over $k\in\Omega$ (e.g. Ternary, e) reduces var, as the resulting $f$ is close to its average (with respect to noise) except for becoming “spiky” in the vicinity of samples $y_{j}$, but increases the bias by introducing alias frequencies into the optimized model $f^{opt}$.

3.2 Trainable unitaries reweight the general quantum model

We now return to the general quantum model of Eq. (25) to understand how different choices of the state-preparation unitary $U$ (giving rise to the input state $|\Gamma\rangle$) affect the feature weights $\nu_{k}^{opt}$ of the general quantum model, thereby influencing benign overfitting of the target function. While Eq. (31) differs from the correspondence $\nu_{k}=|R(k)|/d^{2}$ that we observed for the simplified quantum model, we see that the feature weights of the optimal quantum model still depend heavily on the degeneracies $|R(k)|$. For instance, the average $\nu_{k}^{opt}$ with respect to Haar-random $d$-dimensional input states $|\Gamma\rangle$ is proportional to $|R(k)|$ whenever $k\neq 0$:

\underset{U\sim\operatorname{U}\left(d\right)}{\mathbb{E}}\,\nu_{k}^{opt}=\frac{|R(k)|+d\delta_{0}^{k}}{d(d+1)},  (34)

where $\delta_{i}^{j}$ denotes the Kronecker delta. Furthermore, in Appendix B we observe that the variance of $\nu_{k}^{opt}$ around its average tends to be small for the encoding strategies considered in this work, for instance scaling like $\mathcal{O}\left(d^{-3}\right)$ for the Binary encoding strategy and $\mathcal{O}\left(d^{-4}\right)$ for the Golomb encoding strategy. This demonstrates that the feature weights of generic quantum models (i.e., ones for which $U$ is randomly sampled) will be dominated by the degeneracy $|R(k)|$ introduced by the data-encoding operation $S(x)$.

Despite the behavior of $\nu_{k}$ being dominated by $|R(k)|$ in an average sense, there are specific choices of $U$ for which the feature weights deviate significantly from this average. We will now use one such choice of $U$ to provide a concrete example of an interpolating quantum model that exhibits benign overfitting. Suppose we choose $U$ such that the elements of $|\Gamma\rangle$ are given by

\gamma_{j}=\begin{cases}\sqrt{a},&j\in[c_{1},c_{2}),\\ \sqrt{b},&\text{otherwise}\end{cases}  (35)

for $j\in[d]$, some integers $0<c_{1}<c_{2}<d$, and amplitudes $a,b$ subject to normalization. Given a band-limited target function $g$ with access to the spectrum $\Omega_{n_{0}}$, there is a specific choice of $c_{1},c_{2}$, dependent on $d$ and $n_{0}$, for which the interpolating quantum model $f^{opt}$ of Eq. (30) also benefits from vanishing generalization error in the limit of many samples; specifically, we show in Appendix D that

L(f^{opt})=\mathcal{O}\left(\frac{1}{n}+\frac{n}{d}\right).  (36)

Thus, by perfectly fitting the training data and exhibiting good generalization in the $n\rightarrow\infty$ limit, the quantum model exhibits benign overfitting. This behavior is outlined in Figure 5, which highlights the role that $|\Gamma\rangle$ plays in concentrating the feature weights $\nu_{k}$ within the spectrum $\Omega_{n_{0}}$ of $g$ while preserving a long tail that provides the model with low-variance “spiky” behavior in the vicinity of noisy samples. In contrast, the feature weights for the Binary encoding strategy with a uniform input state have little support on $\Omega_{n_{0}}$, resulting in a large bias error.
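The concentration effect of a step-profile state can be illustrated numerically. In the sketch below (NumPy; the values of $c_{1}$, $c_{2}$ and the two amplitudes are ad hoc choices made for illustration only, not the Appendix D prescription), the feature weights of Eq. (31) computed for $H\sim\operatorname{diag}(0,1,\dots,d-1)$ place most of their total weight inside $\Omega_{n_{0}}$ while keeping a long flat tail outside it:

    import numpy as np

    d, n0 = 128, 15
    lam = np.arange(d)                                      # H ~ diag(0, 1, ..., d-1)
    c1, c2 = 0, (n0 + 1) // 2                               # step of width (n0+1)/2, cf. Eq. (35)
    p = np.full(d, 0.1 / (d - (c2 - c1)))                   # |gamma_j|^2 = b outside the step
    p[c1:c2] = 0.9 / (c2 - c1)                              # |gamma_j|^2 = a inside the step

    diffs = lam[:, None] - lam[None, :]
    ks = np.arange(-(d - 1), d)
    nu = np.array([p[np.nonzero(diffs == k)[0]] @ p[np.nonzero(diffs == k)[1]] for k in ks])

    in_band = np.abs(ks) <= (n0 - 1) // 2                   # frequencies inside Omega_{n0}
    print(nu[in_band].sum(), nu[~in_band].sum())            # ~0.82 vs ~0.18: weight concentrates in-band
    print(nu[~in_band].max())                               # the out-of-band tail is small but nonzero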

Figure 5: Benign overfitting for the general quantum model. a For the Hamiltonian $H\sim\operatorname{diag}\left(0,1,\dots,d-1\right)$ we construct a simple input state $|\Gamma_{1}\rangle$ (Eq. (35)) that is dependent on $n_{0}$ and $d$ (see Appendix D), compared to the uniform state $|\Gamma_{0}\rangle=\frac{1}{\sqrt{d}}\sum_{k}|k\rangle$. b $|\Gamma_{1}\rangle$ induces feature weights $\nu_{k}^{opt}$ in the optimized model that are heavily concentrated in the target function spectrum $\Omega_{n_{0}}$ (gray band) and decay like $1/d$ and $1/d^{2}$ elsewhere for $k\in\Omega_{d}\backslash\Omega_{n_{0}}$, compared to the feature weights induced by $|\Gamma_{0}\rangle$, which have similar magnitude over all of $\Omega_{d}$. c Plotting the $\lVert M\rVert_{F}$-minimizing quantum model $f_{1}^{opt}$ with input state $|\Gamma_{1}\rangle$ for a sample signal ($d=128$, $n=31$) highlights how $|\Gamma_{1}\rangle$ and $H$ work together to interpolate $y_{j}$ via high-frequency “spiky” behavior near the sampled points, while otherwise closely matching the underlying signal $g(x)$. d The resulting quantum model benignly overfits the band-limited target function: as long as $d=\omega(n)$, the generalization error $L(f_{1}^{opt})$ of the minimum-$\lVert M\rVert_{F}$ interpolating quantum model vanishes for large enough sample size $n$.

The above discussion shows how the input state amplitudes $\gamma_{i}$ provide additional degrees of freedom with which the feature weights $\nu_{k}^{opt}$ can be tuned in order to modify the generalization behavior of the interpolating quantum model, and to exhibit benign overfitting in particular. It is therefore worthwhile to consider what other kinds of feature weights $\nu_{k}$ might be prepared by some choice of input state $|\Gamma\rangle$. We may use simple counting arguments to demonstrate the restrictions in designing particular distributions of feature weights. Suppose we define $\Omega_{+}=\{k:k\in\Omega,\,k>0\}$ containing the positive frequencies of a quantum model. Then the introduction of an arbitrary input state $|\Gamma\rangle$ provides us with $2^{n_{q}}-1$ free parameters with which to tune $|\Omega_{+}|$-many terms in the distribution of $\nu_{k}^{opt}$ (subject to $\nu_{k}=\nu_{-k}$ and $\sum_{k}\nu_{k}=1$). Clearly, there are distributions of feature weights $\nu_{k}^{opt}$ that cannot be achieved for models where $|\Omega_{+}|\geq 2^{n_{q}}$.

Conversely, the condition $|\Omega_{+}|<2^{n_{q}}$ does not necessarily mean that we can thoroughly explore the space of possible feature weights by modifying the input state $|\Gamma\rangle$. For example, consider the Hamming encoding strategy, for which the number of free parameters controlling the distribution of feature weights $\nu_{k}^{opt}$ is $|\Omega_{+}|=n_{q}$, which is exponentially smaller than the number of parameters in $|\Gamma\rangle$. While this might suggest significant freedom in adjusting $\nu_{k}^{opt}$, the opposite is true: for any choice of input state $|\Gamma\rangle$, there is another state of the form

|\Gamma^{\prime}\rangle=\sum_{i=0}^{n_{q}}\phi_{i}|\Phi_{i}\rangle,  (37)

that achieves exactly the same distribution of feature weights $\nu_{k}$. In Eq. (37), $|\Phi_{i}\rangle$ describes a uniform superposition over all computational basis state bitstrings with Hamming weight $i$, and so the distribution of $\nu_{k}^{opt}$ actually only depends on the $n_{q}+1$ real parameters $\phi_{i}$, $i=0,1,\dots,n_{q}$; moreover, the feature weights are invariant under any operations in $U$ that preserve the $|\Phi_{i}\rangle$ (see Appendix B). An example of such operations are the particle-conserving unitaries well known in quantum chemistry, which preserve the number of set bits (i.e., the Hamming weight) when each bit represents a mode occupied by a particle in second quantization [46, 47]. This example demonstrates how symmetry in the data-encoding Hamiltonian (e.g. Refs. [48, 49]) can have a profound influence on the ability to prepare specific distributions of feature weights $\nu_{k}^{opt}$, and consequently affect the generalization and overparameterization behavior of the associated quantum models.
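This invariance is easy to check numerically for the Hamming encoding strategy (a NumPy sketch; the random state is an arbitrary example): collapsing an arbitrary $|\Gamma\rangle$ onto the form of Eq. (37), by spreading the probability of each Hamming-weight class uniformly over that class, leaves the feature weights of Eq. (31) unchanged:

    import numpy as np

    rng = np.random.default_rng(4)
    nq = 4
    dim = 2 ** nq
    weights = np.array([bin(i).count("1") for i in range(dim)])     # Hamming weight of each basis state
    lam = nq / 2 - weights                   # eigenvalues of H = (Z_0 + ... + Z_{nq-1}) / 2

    def feature_weights(gamma):
        """nu_k^opt of Eq. (31) for the Hamming encoding strategy."""
        p = np.abs(gamma) ** 2
        diffs = lam[:, None] - lam[None, :]
        return np.array([p[np.nonzero(diffs == k)[0]] @ p[np.nonzero(diffs == k)[1]]
                         for k in range(-nq, nq + 1)])

    gamma = rng.normal(size=dim) + 1j * rng.normal(size=dim)
    gamma /= np.linalg.norm(gamma)

    # |Gamma'> of Eq. (37): weight-class probabilities q_w spread uniformly within each class
    q = np.array([np.sum(np.abs(gamma[weights == w]) ** 2) for w in range(nq + 1)])
    counts = np.bincount(weights, minlength=nq + 1)          # number of bitstrings per weight class
    gamma_prime = np.sqrt(q[weights] / counts[weights])

    print(np.allclose(feature_weights(gamma), feature_weights(gamma_prime)))   # True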

4 Conclusion

In this work we have taken a first step towards characterizing the phenomenon of benign overfitting in a quantum machine learning setting. We derived the error for an overparameterized Fourier features model that interpolates the (noisy) input signal with minimum-$\ell_{2}$-norm trainable weights and connected the feature weights associated with each Fourier mode to a trade-off in the generalization error of the model. We then demonstrated an analogous simplified quantum model for which the feature weights are induced by the choice of data-encoding unitary $S(x)$. Finally, we discussed how introducing an arbitrary state-preparation unitary $U$ gives rise to effective feature weights in the optimized general quantum model, presenting the possibility of connecting $U$ and $S(x)$ to benign overfitting in more general quantum models.

Our discussion of interpolating quantum models presents an interpretation of overparameterization (i.e., the size of the model spectrum $\Omega$) that departs from other measures of quantum circuit complexity discussed in the literature [19, 50, 51], as even the simplified quantum models studied here are able to interpolate training data using a fixed circuit $U$ and optimized measurements. We also reemphasize that – unlike much of the quantum machine learning literature – we do not consider a setting where the model is optimized with respect to a trainable circuit, as the model of Eq. (30) is constructed to exhibit zero training error (and can therefore not be improved via optimization). Finding the input state $|\Gamma\rangle$ that results in a specific distribution of feature weights $\nu_{k}^{opt}$ generally requires solving a $|\Omega_{+}|$-dimensional system of equations that is second order in the $2^{n_{q}}$ real parameters $|\gamma_{i}|^{2}$ (i.e., inverting a map of the form $\mathbb{R}^{2^{n_{q}}}\rightarrow\mathbb{R}^{|\Omega_{+}|}$ in Eq. (31)), or otherwise performing a variational optimization that will likely fail due to the familiar phenomenon of barren plateaus [23, 24, 52, 53].

While we have shown an example of benign overfitting by a quantum model in a relatively restricted context, future work may lead to more general characterizations of this phenomenon. Similar behavior likely exists for quantum kernel methods and may complement existing studies on these methods’ generalization power [54]. An exciting possibility would be to demonstrate benign overfitting in quantum models trained on distributions of quantum states which are hard to learn classically [55, 56], thereby extending the growing body of statistical learning theory for quantum learning algorithms [27, 28, 57].

5 Code availability

Code to reproduce the figures and analysis is available at the following repository: https://github.com/peterse/benign-overfitting-quantum.

6 Acknowledgements

The authors thank Nathan Killoran, Achim Kempf, Angus Lowe, and Joseph Bowles for helpful feedback. This work was supported by Mitacs through the Mitacs Accelerate program. Research at Perimeter Institute is supported in part by the Government of Canada through the Department of Innovation, Science and Economic Development and by the Province of Ontario through the Ministry of Colleges and Universities. Circuit simulations were performed in PennyLane [58].

References

  • [1] Michael A Nielsen. “Neural networks and deep learning”. Determination Press.  (2015). url: http://neuralnetworksanddeeplearning.com/.
  • [2] Stuart Geman, Elie Bienenstock, and René Doursat. “Neural networks and the bias/variance dilemma”. Neural Comput. 4, 1–58 (1992).
  • [3] Trevor Hastie, Robert Tibshirani, Jerome H Friedman, and Jerome H Friedman. “The elements of statistical learning: data mining, inference, and prediction”. Volume 2. Springer.  (2009).
  • [4] Peter L. Bartlett, Andrea Montanari, and Alexander Rakhlin. “Deep learning: a statistical viewpoint”. Acta Numerica 30, 87–201 (2021).
  • [5] Mikhail Belkin. “Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation”. Acta Numerica 30, 203–248 (2021).
  • [6] Peter L. Bartlett, Philip M. Long, Gábor Lugosi, and Alexander Tsigler. “Benign overfitting in linear regression”. Proc. Natl. Acad. Sci. 117, 30063–30070 (2020).
  • [7] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. “Reconciling modern machine-learning practice and the classical bias-variance trade-off”. Proc. Natl. Acad. Sci. 116, 15849–15854 (2019).
  • [8] Mikhail Belkin, Alexander Rakhlin, and Alexandre B. Tsybakov. “Does data interpolation contradict statistical optimality?”. In Proceedings of Machine Learning Research. Volume 89, pages 1611–1619. PMLR (2019). url: https://proceedings.mlr.press/v89/belkin19a.html.
  • [9] Vidya Muthukumar, Kailas Vodrahalli, Vignesh Subramanian, and Anant Sahai. “Harmless interpolation of noisy data in regression”. IEEE Journal on Selected Areas in Information Theory 1, 67–83 (2020).
  • [10] Vidya Muthukumar, Adhyyan Narang, Vignesh Subramanian, Mikhail Belkin, Daniel Hsu, and Anant Sahai. “Classification vs regression in overparameterized regimes: Does the loss function matter?”. J. Mach. Learn. Res. 22, 1–69 (2021). url: http://jmlr.org/papers/v22/20-603.html.
  • [11] Yehuda Dar, Vidya Muthukumar, and Richard G. Baraniuk. “A farewell to the bias-variance tradeoff? an overview of the theory of overparameterized machine learning” (2021). arXiv:2109.02355.
  • [12] Marcello Benedetti, Erika Lloyd, Stefan Sack, and Mattia Fiorentini. “Parameterized quantum circuits as machine learning models”. Quantum Sci. Technol. 4, 043001 (2019).
  • [13] K. Mitarai, M. Negoro, M. Kitagawa, and K. Fujii. “Quantum circuit learning”. Phys. Rev. A 98, 032309 (2018).
  • [14] Maria Schuld, Ville Bergholm, Christian Gogolin, Josh Izaac, and Nathan Killoran. “Evaluating analytic gradients on quantum hardware”. Phys. Rev. A 99, 032331 (2019).
  • [15] Maria Schuld and Nathan Killoran. “Quantum machine learning in feature hilbert spaces”. Phys. Rev. Lett. 122, 040504 (2019).
  • [16] Vojtěch Havlíček, Antonio D. Córcoles, Kristan Temme, Aram W. Harrow, Abhinav Kandala, Jerry M. Chow, and Jay M. Gambetta. “Supervised learning with quantum-enhanced feature spaces”. Nature 567, 209–212 (2019).
  • [17] Seth Lloyd and Christian Weedbrook. “Quantum generative adversarial learning”. Phys. Rev. Lett. 121, 040502 (2018).
  • [18] Pierre-Luc Dallaire-Demers and Nathan Killoran. “Quantum generative adversarial networks”. Phys. Rev. A 98, 012324 (2018).
  • [19] Amira Abbas, David Sutter, Christa Zoufal, Aurelien Lucchi, Alessio Figalli, and Stefan Woerner. “The power of quantum neural networks”. Nat. Comput. Sci. 1, 403–409 (2021).
  • [20] Logan G. Wright and Peter L. McMahon. “The capacity of quantum neural networks”. In 2020 Conference on Lasers and Electro-Optics (CLEO). Pages 1–2.  (2020). url: https://ieeexplore.ieee.org/document/9193529.
  • [21] Sukin Sim, Peter D. Johnson, and Alán Aspuru-Guzik. “Expressibility and entangling capability of parameterized quantum circuits for hybrid quantum-classical algorithms”. Adv. Quantum Technol. 2, 1900070 (2019).
  • [22] Thomas Hubregtsen, Josef Pichlmeier, Patrick Stecher, and Koen Bertels. “Evaluation of parameterized quantum circuits: on the relation between classification accuracy, expressibility and entangling capability”. Quantum Mach. Intell. 3, 1 (2021).
  • [23] Jarrod R McClean, Sergio Boixo, Vadim N Smelyanskiy, Ryan Babbush, and Hartmut Neven. “Barren plateaus in quantum neural network training landscapes”. Nat. Commun. 9, 4812 (2018).
  • [24] Marco Cerezo, Akira Sone, Tyler Volkoff, Lukasz Cincio, and Patrick J Coles. “Cost function dependent barren plateaus in shallow parametrized quantum circuits”. Nat. Commun. 12, 1791 (2021).
  • [25] Matthias C. Caro, Elies Gil-Fuster, Johannes Jakob Meyer, Jens Eisert, and Ryan Sweke. “Encoding-dependent generalization bounds for parametrized quantum circuits”. Quantum 5, 582 (2021).
  • [26] Hsin-Yuan Huang, Michael Broughton, Masoud Mohseni, Ryan Babbush, Sergio Boixo, Hartmut Neven, and Jarrod R McClean. “Power of data in quantum machine learning”. Nat. Commun. 12, 2631 (2021).
  • [27] Matthias C. Caro, Hsin-Yuan Huang, M. Cerezo, Kunal Sharma, Andrew Sornborger, Lukasz Cincio, and Patrick J. Coles. “Generalization in quantum machine learning from few training data”. Nat. Commun. 13, 4919 (2022).
  • [28] Leonardo Banchi, Jason Pereira, and Stefano Pirandola. “Generalization in quantum machine learning: A quantum information standpoint”. PRX Quantum 2, 040321 (2021).
  • [29] Francisco Javier Gil Vidal and Dirk Oliver Theis. “Input redundancy for parameterized quantum circuits”. Front. Phys. 8, 297 (2020).
  • [30] Maria Schuld, Ryan Sweke, and Johannes Jakob Meyer. “Effect of data encoding on the expressive power of variational quantum-machine-learning models”. Phys. Rev. A 103, 032430 (2021).
  • [31] David Wierichs, Josh Izaac, Cody Wang, and Cedric Yen-Yu Lin. “General parameter-shift rules for quantum gradients”. Quantum 6, 677 (2022).
  • [32] Kendall E Atkinson. “An introduction to numerical analysis”. John Wiley & Sons.  (2008).
  • [33] Ali Rahimi and Benjamin Recht. “Random features for large-scale kernel machines”. In Advances in Neural Information Processing Systems. Volume 20.  (2007). url: https://papers.nips.cc/paper_files/paper/2007/hash/013a006f03dbc5392effeb8f18fda755-Abstract.html.
  • [34] Walter Rudin. “The basic theorems of fourier analysis”. John Wiley & Sons, Ltd.  (1990).
  • [35] Song Mei and Andrea Montanari. “The generalization error of random features regression: Precise asymptotics and the double descent curve”. Commun. Pure Appl. Math. 75, 667–766 (2022).
  • [36] Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J. Tibshirani. “Surprises in high-dimensional ridgeless least squares interpolation”. Ann. Stat. 50, 949 – 986 (2022).
  • [37] Tengyuan Liang, Alexander Rakhlin, and Xiyu Zhai. “On the multiple descent of minimum-norm interpolants and restricted lower isometry of kernels”. In Proceedings of Machine Learning Research. Volume 125, pages 1–29. PMLR (2020). url: http://proceedings.mlr.press/v125/liang20a.html.
  • [38] Edward Farhi and Hartmut Neven. “Classification with quantum neural networks on near term processors” (2018). arXiv:1802.06002.
  • [39] Maria Schuld, Alex Bocharov, Krysta M. Svore, and Nathan Wiebe. “Circuit-centric quantum classifiers”. Phys. Rev. A 101, 032308 (2020).
  • [40] Adrián Pérez-Salinas, Alba Cervera-Lierta, Elies Gil-Fuster, and José I. Latorre. “Data re-uploading for a universal quantum classifier”. Quantum 4, 226 (2020).
  • [41] Sofiene Jerbi, Lukas J Fiderer, Hendrik Poulsen Nautrup, Jonas M Kübler, Hans J Briegel, and Vedran Dunjko. “Quantum machine learning beyond kernel methods”. Nat. Commun. 14, 517 (2023).
  • [42] Casper Gyurik, Dyon Vreumingen, van, and Vedran Dunjko. “Structural risk minimization for quantum linear classifiers”. Quantum 7, 893 (2023).
  • [43] Maria Schuld. “Supervised quantum machine learning models are kernel methods” (2021). arXiv:2101.11020.
  • [44] S. Shin, Y. S. Teo, and H. Jeong. “Exponential data encoding for quantum supervised learning”. Phys. Rev. A 107, 012422 (2023).
  • [45] Sophie Piccard. “Sur les ensembles de distances des ensembles de points d’un espace euclidien.”. Memoires de l’Universite de Neuchatel. Secretariat de l’Universite.  (1939).
  • [46] Dave Wecker, Matthew B. Hastings, Nathan Wiebe, Bryan K. Clark, Chetan Nayak, and Matthias Troyer. “Solving strongly correlated electron models on a quantum computer”. Phys. Rev. A 92, 062318 (2015).
  • [47] Ian D. Kivlichan, Jarrod McClean, Nathan Wiebe, Craig Gidney, Alán Aspuru-Guzik, Garnet Kin-Lic Chan, and Ryan Babbush. “Quantum simulation of electronic structure with linear depth and connectivity”. Phys. Rev. Lett. 120, 110501 (2018).
  • [48] Martín Larocca, Frédéric Sauvage, Faris M. Sbahi, Guillaume Verdon, Patrick J. Coles, and M. Cerezo. “Group-invariant quantum machine learning”. PRX Quantum 3, 030341 (2022).
  • [49] Johannes Jakob Meyer, Marian Mularski, Elies Gil-Fuster, Antonio Anna Mele, Francesco Arzani, Alissa Wilms, and Jens Eisert. “Exploiting symmetry in variational quantum machine learning”. PRX Quantum 4, 010328 (2023).
  • [50] Martin Larocca, Nathan Ju, Diego García-Martín, Patrick J Coles, and Marco Cerezo. “Theory of overparametrization in quantum neural networks”. Nat. Comput. Sci. 3, 542–551 (2023).
  • [51] Yuxuan Du, Min-Hsiu Hsieh, Tongliang Liu, and Dacheng Tao. “Expressive power of parametrized quantum circuits”. Phys. Rev. Res. 2, 033125 (2020).
  • [52] Zoë Holmes, Kunal Sharma, M. Cerezo, and Patrick J. Coles. “Connecting ansatz expressibility to gradient magnitudes and barren plateaus”. PRX Quantum 3, 010313 (2022).
  • [53] Samson Wang, Enrico Fontana, Marco Cerezo, Kunal Sharma, Akira Sone, Lukasz Cincio, and Patrick J Coles. “Noise-induced barren plateaus in variational quantum algorithms”. Nat. Commun. 12, 6961 (2021).
  • [54] Abdulkadir Canatar, Evan Peters, Cengiz Pehlevan, Stefan M. Wild, and Ruslan Shaydulin. “Bandwidth enables generalization in quantum kernel models”. Transactions on Machine Learning Research (2023). url: https://openreview.net/forum?id=A1N2qp4yAq.
  • [55] Hsin-Yuan Huang, Michael Broughton, Jordan Cotler, Sitan Chen, Jerry Li, Masoud Mohseni, Hartmut Neven, Ryan Babbush, Richard Kueng, John Preskill, and Jarrod R. McClean. “Quantum advantage in learning from experiments”. Science 376, 1182–1186 (2022).
  • [56] Sitan Chen, Jordan Cotler, Hsin-Yuan Huang, and Jerry Li. “Exponential separations between learning with and without quantum memory”. In 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS). Pages 574–585.  (2022).
  • [57] Hsin-Yuan Huang, Richard Kueng, and John Preskill. “Information-theoretic bounds on quantum advantage in machine learning”. Phys. Rev. Lett. 126, 190505 (2021).
  • [58] Ville Bergholm, Josh Izaac, Maria Schuld, Christian Gogolin, M. Sohaib Alam, Shahnawaz Ahmed, Juan Miguel Arrazola, Carsten Blank, Alain Delgado, Soran Jahangiri, Keri McKiernan, Johannes Jakob Meyer, Zeyue Niu, Antal Száva, and Nathan Killoran. “Pennylane: Automatic differentiation of hybrid quantum-classical computations” (2018). arXiv:1811.04968.
  • [59] Peter L. Bartlett, Philip M. Long, Gábor Lugosi, and Alexander Tsigler. “Benign overfitting in linear regression”. Proc. Natl. Acad. Sci. 117, 30063–30070 (2020).
  • [60] Vladimir Koltchinskii and Karim Lounici. “Concentration inequalities and moment bounds for sample covariance operators”. Bernoulli 23, 110 – 133 (2017).
  • [61] Zbigniew Puchała and Jarosław Adam Miszczak. “Symbolic integration with respect to the haar measure on the unitary group”. Bull. Pol. Acad. Sci. 65, 21–27 (2017).
  • [62] Daniel A. Roberts and Beni Yoshida. “Chaos and complexity by design”. J. High Energy Phys. 2017, 121 (2017).
  • [63] Wallace C. Babcock. “Intermodulation interference in radio systems frequency of occurrence and control by channel selection”. Bell Syst. tech. j. 32, 63–73 (1953).
  • [64] M. Atkinson, N. Santoro, and J. Urrutia. “Integer sets with distinct sums and differences and carrier frequency assignments for nonlinear repeaters”. IEEE Trans. Commun. 34, 614–617 (1986).
  • [65] J. Robinson and A. Bernstein. “A class of binary recurrent codes with limited error propagation”. IEEE Trans. Inf. 13, 106–113 (1967).
  • [66] R. J. F. Fang and W. A. Sandrin. “Carrier frequency assignment for nonlinear repeaters”. COMSAT Technical Review 7, 227–245 (1977).

Appendix A Solution for the classical overparameterized model

In this section we derive the optimal solution and the generalization error for the classical overparameterized weighted Fourier features model. We then discuss the conditions under which benign overfitting may be observed and construct examples of the phenomenon.

A.1 Linearization of overparameterized Fourier model

We first show that the classical overparameterized Fourier features model may be cast as a linear model under an appropriate orthogonal transformation. We are interested in learning a target function of the form

g(x)=kΩn0g^kei2πkx,g(x)=\sum_{k\in\Omega_{n_{0}}}\hat{g}_{k}e^{i2\pi kx}, (38)

with the additional constraint that the Fourier coefficients satisfy g^k=g^k\hat{g}_{k}=\hat{g}_{-k}^{*} such that gg is real. The spectrum of Eq. (38) for odd n0n_{0} only contains integer frequencies,

Ωn0={n012,,0,,n012},\Omega_{n_{0}}=\Biggl{\{}-\frac{n_{0}-1}{2},\dots,0,\dots,\frac{n_{0}-1}{2}\Biggr{\}}, (39)

and we accordingly call g a bandlimited function with bandlimit (n_{0}-1)/2, the largest frequency in \Omega_{n_{0}}. To learn g, we first sample n equally spaced datapoints on the interval [0,1],

xj=jn,j[n],x_{j}=\frac{j}{n},\qquad j\in[n], (40)

where [n]={0,1,2,,n1}[n]=\{0,1,2,\dots,n-1\} and we assume nn is odd, and we then evaluate g(xj)g(x_{j}) with additive error. This noisy sampling process yields nn observations of the form yj=g(xj)+ϵy_{j}=g(x_{j})+\epsilon with 𝔼[ϵ2]=σ2\mathbb{E}[\epsilon^{2}]=\sigma^{2}. We will fit the observations yjy_{j} using overparameterized Fourier features models of the form

f(x)=ϕ(x),𝜶=kΩdαkνkei2πkx,f(x)=\langle\phi(x),\boldsymbol{\alpha}\rangle=\sum_{k\in\Omega_{d}}\alpha_{k}\sqrt{\nu_{k}}e^{i2\pi kx}, (41)

with 𝜶d\boldsymbol{\alpha}\in\mathbb{C}^{d}, and we have introduced weighted Fourier features ϕ:d\phi:\mathbb{R}\rightarrow\mathbb{C}^{d} defined elementwise as

ϕ(x)k=νkei2πkx.\phi(x)_{k}=\sqrt{\nu_{k}}e^{i2\pi kx}. (42)

In Eq. (41), Ωd\Omega_{d} describes the set of frequencies available to the model for any choice of dnn0d\geq n\geq n_{0}. We are interested in the case where ff interpolates the observations yjy_{j}, i.e., f(xj)=yjf(x_{j})=y_{j} for all j=0,,n1j=0,\dots,n-1. To this end, we define a n×dn\times d feature matrix Φ\Phi whose rows are given by ϕ(xj)\phi(x_{j})^{\dagger}:

Φjk=[ϕ(xj)]k=νkei2πjk/n.\displaystyle\Phi_{jk}=[\phi(x_{j})]_{k}^{*}=\sqrt{\nu_{k}}e^{-i2\pi jk/n}. (43)

The interpolation condition may then be stated in matrix form as

Φ𝜶=𝐲,\Phi\boldsymbol{\alpha}=\mathbf{y}, (44)

where (\mathbf{y})_{j}=y_{j} is the vector of noisy observations. \Omega_{d} contains alias frequencies of \Omega_{n_{0}}, and so the choice of \boldsymbol{\alpha} that satisfies Eq. (44) is not unique. Here we will focus on the minimum-\ell_{2}-norm interpolating solution,

𝜶opt=argminΦ𝜶=𝐲𝜶2.\boldsymbol{\alpha}^{opt}=\operatorname*{arg\,min}_{\Phi\boldsymbol{\alpha}=\mathbf{y}}\lVert\boldsymbol{\alpha}\rVert_{2}. (45)
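To make this setup concrete, the following numpy sketch (our own illustration, not part of the derivation; the sizes n, d, n_0, the noise level, and the uniform choice of feature weights \nu_k are arbitrary assumptions) builds the weighted feature matrix of Eq. (43) for a randomly drawn bandlimited target and computes the minimum-\ell_2-norm interpolator of Eq. (45) via a pseudoinverse.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, n0, sigma = 15, 45, 5, 0.3               # n odd, d an odd multiple of n (assumptions)
ks = np.arange(d) - (d - 1) // 2               # Omega_d: symmetric integer frequencies
nu = np.ones(d) / d                            # example feature weights nu_k (arbitrary)

# bandlimited target g with random coefficients supported on Omega_{n0}
g_hat = np.zeros(d, dtype=complex)
low = np.abs(ks) <= (n0 - 1) // 2
g_hat[low] = rng.normal(size=low.sum()) + 1j * rng.normal(size=low.sum())
g_hat[low] = (g_hat[low] + np.conj(g_hat[low][::-1])) / 2   # enforce g_hat[-k] = conj(g_hat[k])

def g(x):
    """Evaluate the real-valued bandlimited target of Eq. (38) on an array of inputs."""
    return np.real(np.exp(2j * np.pi * np.outer(x, ks)) @ g_hat)

xj = np.arange(n) / n                          # uniformly spaced inputs, Eq. (40)
y = g(xj) + sigma * rng.normal(size=n)         # noisy observations y_j

Phi = np.sqrt(nu) * np.exp(-2j * np.pi * np.outer(xj, ks))   # feature matrix, Eq. (43)
alpha = np.linalg.pinv(Phi) @ y                # minimum-l2-norm interpolator, Eq. (45)
print("max interpolation residual:", np.max(np.abs(Phi @ alpha - y)))
```

The printed residual should be at machine precision, confirming that the underdetermined system of Eq. (44) is interpolated exactly.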

A.1.1 Fourier transform into the linear model

We will now show that Eq. (45) with uniformly spaced datapoints x_{j} can be solved using methods from ordinary linear regression under a suitable transformation. Defining the n-th root of unity as \omega=e^{i2\pi/n}, the j-th row of the LHS of Eq. (44) is equivalent to

kΩdαkνkωkj\displaystyle\sum_{k\in\Omega_{d}}\alpha_{k}\sqrt{\nu_{k}}\omega^{-kj} =kΩnS(k)αk+nνk+nω(k+n)j\displaystyle=\sum_{k\in\Omega_{n}}\sum_{\ell\in S(k)}\alpha_{k+n\ell}\sqrt{\nu_{k+n\ell}}\omega^{-(k+n\ell)j} (46)
=kΩnωkjS(k)αk+nνk+n\displaystyle=\sum_{k\in\Omega_{n}}\omega^{-kj}\sum_{\ell\in S(k)}\alpha_{k+n\ell}\sqrt{\nu_{k+n\ell}} (47)
:=kΩnωkjPS(k)𝜶,PS(k)𝐰.\displaystyle:=\sum_{k\in\Omega_{n}}\omega^{-kj}\langle P_{S(k)}\boldsymbol{\alpha},P_{S(k)}\mathbf{w}\rangle. (48)

where S(k)={j:jΩd,j mod n=k}S(k)=\{j:j\in\Omega_{d},j\text{ mod }n=k\} is the set of alias frequencies of kk appearing in Ωd\Omega_{d}, i.e. the set of frequencies k+nk+\ell n with \ell\in\mathbb{Z} that obey

ei2πkxj=ei2πkj/n=ei2π(k+n)j/n=ei2π(k+n)xj,e^{i2\pi kx_{j}}=e^{i2\pi kj/n}=e^{i2\pi(k+\ell n)j/n}=e^{i2\pi(k+\ell n)x_{j}}, (49)

which follows from e^{i2\pi\ell j}=1 since \ell j\in\mathbb{Z}. The operator P_{S(k)}:\mathbb{C}^{d}\rightarrow\mathbb{C}^{d} is the projector onto the set of standard basis vectors in \mathbb{C}^{d} with indices in S(k), and (\mathbf{w})_{k}:=\sqrt{\nu_{k}}. The complex exponentials built from the roots of unity are orthogonal since

k=0n1ωk(pq)=nδpq\displaystyle\sum_{k=0}^{n-1}\omega^{k(p-q)}=n\delta_{p}^{q} (50)

for p,q[n]p,q\in[n]. This implies

kΩnωk(pq)=ωmin(Ωn)(pq)nδpq=nδpq,\displaystyle\sum_{k\in\Omega_{n}}\omega^{k(p-q)}=\omega^{\min(\Omega_{n})(p-q)}n\delta_{p}^{q}=n\delta_{p}^{q}, (51)

for p,qΩnp,q\in\Omega_{n}. Defining the discrete Fourier Transform y^k\hat{y}_{k} of 𝐲\mathbf{y} according to

y^j=1nkΩnykωjk,\hat{y}_{j}=\frac{1}{n}\sum_{k\in\Omega_{n}}y_{k}\omega^{jk}, (52)

then using the identity of Eq. (51), we evaluate the jj-th row of Eq. (44) as

kΩnωkjPS(k)𝜶,PS(k)𝐰\displaystyle\sum_{k\in\Omega_{n}}\omega^{-kj}\langle P_{S(k)}\boldsymbol{\alpha},P_{S(k)}\mathbf{w}\rangle =yj,\displaystyle=y_{j}, (53)
jΩnωpjkΩnωkjPS(k)𝜶,PS(k)𝐰\displaystyle\sum_{j\in\Omega_{n}}\omega^{pj}\sum_{k\in\Omega_{n}}\omega^{-kj}\langle P_{S(k)}\boldsymbol{\alpha},P_{S(k)}\mathbf{w}\rangle =jΩnωpj(kΩny^kωkj),\displaystyle=\sum_{j\in\Omega_{n}}\omega^{pj}\left(\sum_{k\in\Omega_{n}}\hat{y}_{k}\omega^{-kj}\right), (54)
kΩnnδkpPS(k)𝜶,PS(k)𝐰\displaystyle\sum_{k\in\Omega_{n}}n\delta_{k}^{p}\langle P_{S(k)}\boldsymbol{\alpha},P_{S(k)}\mathbf{w}\rangle =kΩnnδkpy^k,\displaystyle=\sum_{k\in\Omega_{n}}n\delta_{k}^{p}\hat{y}_{k}, (55)
PS(p)𝜶,PS(p)𝐰\displaystyle\langle P_{S(p)}\boldsymbol{\alpha},P_{S(p)}\mathbf{w}\rangle =y^p.\displaystyle=\hat{y}_{p}. (56)

Inspecting this final line yields a new matrix equation: let X be the n\times d matrix with elements

Xjk=(PS(j)𝐰)k=νk𝟙{kS(j)},\displaystyle X_{jk}=(P_{S(j)}\mathbf{w})_{k}=\sqrt{\nu_{k}}\mathbbm{1}\{k\in S(j)\}, (57)

where the conditional operator 𝟙{Z}\mathbbm{1}\{Z\} evaluates to 11 if the predicate ZZ is true, and 0 otherwise. Then we may express Eq. (56) for all pΩnp\in\Omega_{n} as a matrix equation

X𝜶=𝐲^,X\boldsymbol{\alpha}=\hat{\mathbf{y}}, (58)

where (𝐲^)j=y^j(\hat{\mathbf{y}})_{j}=\hat{y}_{j}. We have shown that Eq. (44) is exactly equivalent to Eq. (58) for uniformly spaced inputs, and as 𝜶\boldsymbol{\alpha} is unchanged between these two representations this implies that the solution to Eq. (45) is also given by

𝜶opt=argminX𝜶=𝐲^𝜶2.\boldsymbol{\alpha}^{opt}=\operatorname*{arg\,min}_{X\boldsymbol{\alpha}=\hat{\mathbf{y}}}\lVert\boldsymbol{\alpha}\rVert_{2}. (59)

Therefore, the minimum-\ell_{2}-norm weighted Fourier features model that interpolates the samples y_{j} is exactly the minimum-\ell_{2}-norm solution of an equivalent linear regression problem with design matrix X and targets \hat{\mathbf{y}}. Furthermore, this linear regression problem is related to the original problem via a Fourier transform. Let F be the (nonunitary) discrete Fourier transform defined on \mathbb{C}^{n} elementwise as

Fjk\displaystyle F_{jk} =ωjk,\displaystyle=\omega^{jk}, (60)
FF\displaystyle F^{\dagger}F =n𝕀n.\displaystyle=n\mathbb{I}_{n}. (61)

Then for kΩdk\in\Omega_{d}, jΩnj\in\Omega_{n},

1n(FΦ)jk=1nΩnω(kj)νk=1n(nδjk mod n)νk=νk𝟙{kS(j)}=Xjk,\displaystyle\frac{1}{n}(F\Phi)_{jk}=\frac{1}{n}\sum_{\ell\in\Omega_{n}}\omega^{(k-j)\ell}\sqrt{\nu_{k}}=\frac{1}{n}\left(n\delta_{j}^{k\text{ mod }n}\right)\sqrt{\nu_{k}}=\sqrt{\nu_{k}}\mathbbm{1}\{k\in S(j)\}=X_{jk}, (62)

implying

1nFΦ\displaystyle\frac{1}{n}F\Phi =X.\displaystyle=X. (63)

We may similarly recover 1nF𝐲=𝐲^\frac{1}{n}F\mathbf{y}=\hat{\mathbf{y}} to show that the coefficients y^k\hat{y}_{k} are given by a discrete Fourier transform of yjy_{j}.
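Continuing the sketch above (it reuses Phi, y, nu, ks, alpha, and n; the particular ordering of the rows and columns of F below is an assumption of this illustration rather than something fixed by the text), Eqs. (57), (58), (59) and (63) can be checked numerically:

```python
omega = np.exp(2j * np.pi / n)
js = np.arange(n) - (n - 1) // 2                   # base frequencies Omega_n
F = omega ** np.outer(js, np.arange(n))            # DFT matrix acting on the n samples

X = F @ Phi / n                                    # Eq. (63)
y_hat = F @ y / n                                  # hat{y}, the DFT of the observations

# Eq. (57): X_{jk} = sqrt(nu_k) exactly when k is an alias of j, and 0 otherwise
X_expected = np.sqrt(nu) * (np.mod(ks[None, :] - js[:, None], n) == 0)
print("X matches Eq. (57):      ", np.allclose(X, X_expected))
print("X alpha = y_hat (Eq. 58):", np.allclose(X @ alpha, y_hat))
print("same min-norm solution:  ", np.allclose(np.linalg.pinv(X) @ y_hat, alpha))
```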

A.1.2 Error analysis of the linear model

Having shown that the discrete Fourier transform F relates the original system of trigonometric equations \Phi\boldsymbol{\alpha}=\mathbf{y} to the system of linear equations X\boldsymbol{\alpha}=\hat{\mathbf{y}} (an ordinary least squares problem in which we treat the rows of X as observations in \mathbb{R}^{d}), we now derive the error of the model in this Fourier representation. The standard treatment of ordinary least squares (OLS) gives the minimum-\ell_{2}-norm solution to Eq. (59) as

𝜶opt=XT(XXT)1𝐲^\boldsymbol{\alpha}^{opt}=X^{T}(XX^{T})^{-1}\hat{\mathbf{y}} (64)

in which case the optimal interpolating overparameterized Fourier model is

fopt(x)\displaystyle f^{opt}(x) =ϕ(x),𝜶opt\displaystyle=\langle\phi(x),\boldsymbol{\alpha}^{opt}\rangle (65)
=kΩny^kjS(k)νjS(k)νei2πx.\displaystyle=\sum_{k\in\Omega_{n}}\frac{\hat{y}_{k}}{\sum_{j\in S(k)}\nu_{j}}\,\sum_{\ell\in S(k)}\nu_{\ell}e^{i2\pi\ell x}. (66)

Once we have trained a model on noisy samples 𝐲\mathbf{y} of gg using uniformly spaced values of xx, we would like to evaluate how well the model performs for arbitrary xx in [0,1][0,1]. Given some function f=ϕ(x),𝜶f=\langle\phi(x),\boldsymbol{\alpha}\rangle the mean squared error 𝔼x(f(x)g(x))2\mathbb{E}_{x}(f(x)-g(x))^{2} may be evaluated with respect to the interval [0,1][0,1]. We define the generalization error of the model as the expected mean squared error evaluated with respect to 𝐲\mathbf{y}:

L(f):=𝔼x,𝐲(f(x)g(x))2.\displaystyle L(f):=\mathbb{E}_{x,\mathbf{y}}(f(x)-g(x))^{2}. (67)

We decompose the generalization error of the model as

𝔼x,𝐲(f(x)g(x))2\displaystyle\mathbb{E}_{x,\mathbf{y}}(f(x)-g(x))^{2} =𝔼x,𝐲(f(x)𝔼𝐲f(x)+𝔼𝐲f(x)g(x))2\displaystyle=\mathbb{E}_{x,\mathbf{y}}(f(x)-\mathbb{E}_{\mathbf{y}}f(x)+\mathbb{E}_{\mathbf{y}}f(x)-g(x))^{2} (68)
=𝔼x,𝐲(f(x)𝔼𝐲f(x))2+𝔼x,𝐲(𝔼𝐲f(x)g(x))2\displaystyle=\mathbb{E}_{x,\mathbf{y}}(f(x)-\mathbb{E}_{\mathbf{y}}f(x))^{2}+\mathbb{E}_{x,\mathbf{y}}(\mathbb{E}_{\mathbf{y}}f(x)-g(x))^{2} (69)
+2𝔼x,𝐲[(f(x)𝔼𝐲f(x))(𝔼𝐲f(x)g(x))].\displaystyle\qquad\qquad+2\mathbb{E}_{x,\mathbf{y}}\left[(f(x)-\mathbb{E}_{\mathbf{y}}f(x))(\mathbb{E}_{\mathbf{y}}f(x)-g(x))\right].

Taking the expectation over \mathbf{y} first, the cross-term vanishes (since \mathbb{E}_{\mathbf{y}}[f(x)-\mathbb{E}_{\mathbf{y}}f(x)]=0 while \mathbb{E}_{\mathbf{y}}f(x)-g(x) does not depend on \mathbf{y}), resulting in a decomposition of the generalization error of the optimal model into the standard bias and variance terms:

L(fopt)\displaystyle L(f^{opt}) =𝔼x,𝐲(fopt(x)𝔼𝐲fopt(x))2var+𝔼x,𝐲(𝔼𝐲fopt(x)g(x))2bias2.\displaystyle=\underbrace{\mathbb{E}_{x,\mathbf{y}}(f^{opt}(x)-\mathbb{E}_{\mathbf{y}}f^{opt}(x))^{2}}_{\textsc{var}}+\underbrace{\mathbb{E}_{x,\mathbf{y}}(\mathbb{E}_{\mathbf{y}}f^{opt}(x)-g(x))^{2}}_{\textsc{bias}^{2}}. (70)

We now evaluate var and bias2\textsc{bias}^{2} using the linear representation developed in the previous section. Beginning with the variance, conditional on constructing Φ\Phi from the set of uniformly spaced points xjx_{j} we apply the discrete Fourier transformation to yield XX and compute 𝜶opt\boldsymbol{\alpha}^{opt} using Eq. (64):

var =𝔼x,𝐲(fopt(x)𝔼𝐲fopt(x))2\displaystyle=\mathbb{E}_{x,\mathbf{y}}(f^{opt}(x)-\mathbb{E}_{\mathbf{y}}f^{opt}(x))^{2} (71)
=𝔼x,𝐲|ϕ(x),𝜶opt𝔼𝐲𝜶opt|2\displaystyle=\mathbb{E}_{x,\mathbf{y}}|\langle\phi(x),\boldsymbol{\alpha}^{opt}-\mathbb{E}_{\mathbf{y}}\boldsymbol{\alpha}^{opt}\rangle|^{2} (72)
=𝔼x,𝐲|ϕ(x),XT(XXT)1(𝐲^𝔼𝐲𝐲^)|2\displaystyle=\mathbb{E}_{x,\mathbf{y}}|\langle\phi(x),X^{T}(XX^{T})^{-1}(\hat{\mathbf{y}}-\mathbb{E}_{\mathbf{y}}\hat{\mathbf{y}})\rangle|^{2} (73)
=Tr(𝔼𝐲[(𝐲^𝔼𝐲𝐲^)(𝐲^𝔼𝐲𝐲^)](XXT)1X𝔼x[ϕ(x)ϕ(x)]XT(XXT)1)\displaystyle=\operatorname{Tr}\left(\mathbb{E}_{\mathbf{y}}\left[(\hat{\mathbf{y}}-\mathbb{E}_{\mathbf{y}}\hat{\mathbf{y}})(\hat{\mathbf{y}}-\mathbb{E}_{\mathbf{y}}\hat{\mathbf{y}})^{\dagger}\right](XX^{T})^{-1}X\mathbb{E}_{x}\left[\phi(x)\phi(x)^{\dagger}\right]X^{T}(XX^{T})^{-1}\right) (74)
=σ2nTr((XXT)2XΣϕXT).\displaystyle=\frac{\sigma^{2}}{n}\operatorname{Tr}\left((XX^{T})^{-2}X\Sigma_{\phi}X^{T}\right). (75)

Letting \boldsymbol{\epsilon}:=(\mathbf{y}-\mathbb{E}_{\mathbf{y}}\mathbf{y}), we have simplified the above using the following (note that in Eq. (75), \mathbb{E}[X^{T}X]=\Sigma_{\phi}, so that \Sigma_{\phi} differs from the covariance matrix of the rows of X by a factor of 1/n):

𝔼𝐲[(𝐲^𝔼𝐲𝐲^)(𝐲^𝔼𝐲𝐲^)]jk\displaystyle\mathbb{E}_{\mathbf{y}}\left[(\hat{\mathbf{y}}-\mathbb{E}_{\mathbf{y}}\hat{\mathbf{y}})(\hat{\mathbf{y}}-\mathbb{E}_{\mathbf{y}}\hat{\mathbf{y}})^{\dagger}\right]_{jk} =1n2𝔼𝐲(Fϵ)j(Fϵ)k\displaystyle=\frac{1}{n^{2}}\mathbb{E}_{\mathbf{y}}(F\boldsymbol{\epsilon})_{j}(F\boldsymbol{\epsilon})_{k}^{*} (76)
=1n2𝔼𝐲(pΩnωjpϵp)(qΩnωqkϵq)\displaystyle=\frac{1}{n^{2}}\mathbb{E}_{\mathbf{y}}\left(\sum_{p\in\Omega_{n}}\omega^{-jp}\epsilon_{p}\right)\left(\sum_{q\in\Omega_{n}}\omega^{qk}\epsilon_{q}\right) (77)
=1n2p,qΩn𝔼𝐲[ϵpϵq]ωjp+kq\displaystyle=\frac{1}{n^{2}}\sum_{p,q\in\Omega_{n}}\mathbb{E}_{\mathbf{y}}[\epsilon_{p}\epsilon_{q}]\omega^{-jp+kq} (78)
=σ2nδjk,\displaystyle=\frac{\sigma^{2}}{n}\delta_{j}^{k}, (79)

where we have used the fact that the errors ϵ\epsilon are independent and zero mean, 𝔼𝐲[ϵpϵq]=𝔼𝐲[ϵp2]δpq\mathbb{E}_{\mathbf{y}}[\epsilon_{p}\epsilon_{q}]=\mathbb{E}_{\mathbf{y}}[\epsilon_{p}^{2}]\delta_{p}^{q}. We have defined the feature covariance matrix as Σϕ=𝔼x[ϕ(x)ϕ(x)]\Sigma_{\phi}=\mathbb{E}_{x}[\phi(x)\phi(x)^{\dagger}], which may be computed elementwise using the orthonormality of Fourier features on [0,1][0,1]:

(Σϕ)jk=𝔼x[ϕ(x)jϕ(x)k]=νjνkei2πjx,ei2πkxL2[0,1]=δjkνj.\displaystyle(\Sigma_{\phi})_{jk}=\mathbb{E}_{x}[\phi(x)_{j}\phi(x)_{k}^{*}]=\sqrt{\nu_{j}\nu_{k}}\langle e^{i2\pi jx},e^{i2\pi kx}\rangle_{L_{2}^{[0,1]}}=\delta_{j}^{k}\nu_{j}. (80)

The following may be computed directly:

(XXT)jk\displaystyle(XX^{T})_{jk} =δjkS(j)ν\displaystyle=\delta_{j}^{k}\sum_{\ell\in S(j)}\nu_{\ell} (81)
(X\Sigma_{\phi}X^{T})_{jk} =\sum_{\ell\in\Omega_{d}}X_{j\ell}\,\nu_{\ell}\,X_{k\ell} (82)
=\sum_{\ell\in\Omega_{d}}\nu_{\ell}^{2}\,\mathbbm{1}\{\ell\in S(j)\}\mathbbm{1}\{\ell\in S(k)\} (83)
=\delta_{j}^{k}\sum_{\ell\in S(k)}\nu_{\ell}^{2}. (84)

In line (84) we have used the identity

𝟙{S(j)}𝟙{S(k)}=𝟙{S(j)}δkj,\mathbbm{1}\{\ell\in S(j)\}\mathbbm{1}\{\ell\in S(k)\}=\mathbbm{1}\{\ell\in S(j)\}\delta_{k}^{j}, (85)

since S(k)k= mod n\ell\in S(k)\Rightarrow k=\ell\text{ mod }n and therefore S(j)j= mod n=k\ell\in S(j)\Rightarrow j=\ell\text{ mod }n=k. We now compute the variance as

var =\frac{\sigma^{2}}{n}\operatorname{Tr}\left((XX^{T})^{-2}X\Sigma_{\phi}X^{T}\right) (86)
=σ2nj,kΩn(δkjS(k)ν)2(δjkS(k)ν2)\displaystyle=\frac{\sigma^{2}}{n}\sum_{j,k\in\Omega_{n}}\left(\delta_{k}^{j}\sum_{\ell\in S(k)}\nu_{\ell}\right)^{-2}\left(\delta_{j}^{k}\sum_{\ell\in S(k)}\nu_{\ell}^{2}\right) (87)
=σ2nkΩnS(k)ν2(S(k)ν)2.\displaystyle=\frac{\sigma^{2}}{n}\sum_{k\in\Omega_{n}}\frac{\sum_{\ell\in S(k)}\nu_{\ell}^{2}}{\left(\sum_{\ell\in S(k)}\nu_{\ell}\right)^{2}}. (88)

To evaluate the bias2\textsc{bias}^{2}, we will first rewrite g(x)g(x) in terms of its Fourier representation,

g(x)\displaystyle g(x) =ϕ(x),𝜶0,\displaystyle=\langle\phi(x),\boldsymbol{\alpha}_{0}\rangle, (89)
(𝜶0)k\displaystyle(\boldsymbol{\alpha}_{0})_{k} =g^kνk𝟙{kΩn0}.\displaystyle=\frac{\hat{g}_{k}}{\sqrt{\nu_{k}}}\mathbbm{1}\{k\in\Omega_{n_{0}}\}. (90)

Noting that (X𝜶0)k=g^k(X\boldsymbol{\alpha}_{0})_{k}=\hat{g}_{k} implies X𝜶0=𝔼𝐲𝐲^X\boldsymbol{\alpha}_{0}=\mathbb{E}_{\mathbf{y}}\hat{\mathbf{y}} we evaluate

bias2\displaystyle\textsc{bias}^{2} =𝔼x,𝐲(𝔼𝐲fopt(x)g(x))2\displaystyle=\mathbb{E}_{x,\mathbf{y}}(\mathbb{E}_{\mathbf{y}}f^{opt}(x)-g(x))^{2} (91)
=𝔼x,𝐲|ϕ(x),𝔼𝐲𝜶opt𝜶0|2\displaystyle=\mathbb{E}_{x,\mathbf{y}}|\langle\phi(x),\mathbb{E}_{\mathbf{y}}\boldsymbol{\alpha}^{opt}-\boldsymbol{\alpha}_{0}\rangle|^{2} (92)
=Tr(𝔼x[ϕ(x)ϕ(x)](𝔼𝐲𝜶opt𝜶0)(𝔼𝐲𝜶opt𝜶0))\displaystyle=\operatorname{Tr}\left(\mathbb{E}_{x}\left[\phi(x)\phi(x)^{\dagger}\right](\mathbb{E}_{\mathbf{y}}\boldsymbol{\alpha}^{opt}-\boldsymbol{\alpha}_{0})(\mathbb{E}_{\mathbf{y}}\boldsymbol{\alpha}^{opt}-\boldsymbol{\alpha}_{0})^{\dagger}\right) (93)
=Tr(Σϕ(XT(XXT)1X𝕀d)𝜶0𝜶0(XT(XXT)1X𝕀d))\displaystyle=\operatorname{Tr}\left(\Sigma_{\phi}(X^{T}(XX^{T})^{-1}X-\mathbb{I}_{d})\boldsymbol{\alpha}_{0}\boldsymbol{\alpha}_{0}^{\dagger}(X^{T}(XX^{T})^{-1}X-\mathbb{I}_{d})^{\dagger}\right) (94)
=Σϕ1/2(𝕀dXT(XXT)1X)𝜶022.\displaystyle=\lVert\Sigma_{\phi}^{1/2}(\mathbb{I}_{d}-X^{T}(XX^{T})^{-1}X)\boldsymbol{\alpha}_{0}\rVert_{2}^{2}. (95)

While Eq. (95) already completely characterizes the bias in terms of the choice of feature weights and the input data, it may be greatly simplified by taking advantage of the sparsity of X. We have

(XT(XXT)1X)jk\displaystyle(X^{T}(XX^{T})^{-1}X)_{jk} =Ωn(mS()νm)1XjXk\displaystyle=\sum_{\ell\in\Omega_{n}}\left(\sum_{m\in S(\ell)}\nu_{m}\right)^{-1}X_{\ell j}X_{\ell k} (96)
=(mS(j mod n)νm)1νkνj𝟙{kS(j mod n)},\displaystyle=\left(\sum_{m\in S(j\text{ mod }n)}\nu_{m}\right)^{-1}\sqrt{\nu_{k}}\sqrt{\nu_{j}}\mathbbm{1}\{k\in S(j\text{ mod }n)\}, (97)

where in line (97) we have used

Ωn𝟙{kS()}𝟙{jS()}=𝟙{kS(j mod n)}\sum_{\ell\in\Omega_{n}}\mathbbm{1}\{k\in S(\ell)\}\mathbbm{1}\{j\in S(\ell)\}=\mathbbm{1}\{k\in S(j\text{ mod }n)\} (98)

for k,j\in\Omega_{d} and \ell\in\Omega_{n}, which follows from reasoning similar to that used for Eq. (84). And so, writing

(Σϕ1/2\displaystyle(\Sigma_{\phi}^{1/2} (𝕀dXT(XXT)1X)𝜶0)\displaystyle(\mathbb{I}_{d}-X^{T}(XX^{T})^{-1}X)\boldsymbol{\alpha}_{0})_{\ell} (99)
=jΩd[δjg^j𝟙{jΩn0}\displaystyle=\sum_{j\in\Omega_{d}}\Bigg{[}\delta_{\ell}^{j}\hat{g}_{j}\mathbbm{1}\{j\in\Omega_{n_{0}}\}
kΩdνδj(mS(j mod n)νm)1g^kνj𝟙{kS(j mod n)}𝟙{kΩn0}]\displaystyle\qquad-\sum_{k\in\Omega_{d}}\sqrt{\nu_{\ell}}\delta_{j}^{\ell}\left(\sum_{m\in S(j\text{ mod }n)}\nu_{m}\right)^{-1}\hat{g}_{k}\sqrt{\nu_{j}}\mathbbm{1}\{k\in S(j\text{ mod }n)\}\mathbbm{1}\{k\in\Omega_{n_{0}}\}\Bigg{]} (100)
=(jΩn0δjg^jkΩn0ν(mS( mod n)νm)1g^𝟙{kS( mod n)})\displaystyle=\left(\sum_{j\in\Omega_{n_{0}}}\delta_{\ell}^{j}\hat{g}_{j}-\sum_{k\in\Omega_{n_{0}}}\nu_{\ell}\left(\sum_{m\in S(\ell\text{ mod }n)}\nu_{m}\right)^{-1}\hat{g}_{\ell}\mathbbm{1}\{k\in S(\ell\text{ mod }n)\}\right) (101)
=\left(\mathbbm{1}\{\ell\in\Omega_{n_{0}}\}\hat{g}_{\ell}-\sum_{k\in\Omega_{n_{0}}}\nu_{\ell}\left(\sum_{m\in S(k)}\nu_{m}\right)^{-1}\hat{g}_{k}\mathbbm{1}\{\ell\in S(k)\}\right). (102)

Letting Qk:=mS(k)νmQ_{k}:=\sum_{m\in S(k)}\nu_{m} the bias term of the error evaluates to

\textsc{bias}^{2} =\sum_{\ell\in\Omega_{d}}|(\Sigma_{\phi}^{1/2}(\mathbb{I}_{d}-X^{T}(XX^{T})^{-1}X)\boldsymbol{\alpha}_{0})_{\ell}|^{2} (103)
=Ωd|𝟙{Ωn0}g^kΩn0νQk1g^k𝟙{S(k)}|2\displaystyle=\sum_{\ell\in\Omega_{d}}\left|\mathbbm{1}\{\ell\in\Omega_{n_{0}}\}\hat{g}_{\ell}-\sum_{k\in\Omega_{n_{0}}}\nu_{\ell}Q_{k}^{-1}\hat{g}_{k}\mathbbm{1}\{\ell\in S(k)\}\right|^{2} (104)
=Ωd(𝟙{Ωn0}|g^|2kΩn0𝟙{Ωn0}g^g^kνQk1𝟙{S(k)}\displaystyle=\sum_{\ell\in\Omega_{d}}\left(\mathbbm{1}\{\ell\in\Omega_{n_{0}}\}|\hat{g}_{\ell}|^{2}-\sum_{k\in\Omega_{n_{0}}}\mathbbm{1}\{\ell\in\Omega_{n_{0}}\}\hat{g}_{\ell}\hat{g}_{k}^{*}\nu_{\ell}Q_{k}^{-1}\mathbbm{1}\{\ell\in S(k)\}\right.
kΩn0𝟙{Ωn0}g^g^kνQk1𝟙{S(k)}\displaystyle\qquad\qquad\left.-\sum_{k\in\Omega_{n_{0}}}\mathbbm{1}\{\ell\in\Omega_{n_{0}}\}\hat{g}_{\ell}^{*}\hat{g}_{k}\nu_{\ell}Q_{k}^{-1}\mathbbm{1}\{\ell\in S(k)\}\right.
+j,kΩn0ν2Qj1Qk1g^jg^k𝟙{S(j)}𝟙{S(k)})\displaystyle\qquad\qquad\left.+\sum_{j,k\in\Omega_{n_{0}}}\nu_{\ell}^{2}Q_{j}^{-1}Q_{k}^{-1}\hat{g}_{j}^{*}\hat{g}_{k}\mathbbm{1}\{\ell\in S(j)\}\mathbbm{1}\{\ell\in S(k)\}\right) (105)
=(kΩn|g^k|22kΩn0νk|g^k|2Qk1+kΩn0S(k)ν2Qk2|g^k|2)\displaystyle=\left(\sum_{k\in\Omega_{n}}|\hat{g}_{k}|^{2}-2\sum_{k\in\Omega_{n_{0}}}\nu_{k}|\hat{g}_{k}|^{2}Q_{k}^{-1}+\sum_{k\in\Omega_{n_{0}}}\sum_{\ell\in S(k)}\nu_{\ell}^{2}Q_{k}^{-2}|\hat{g}_{k}|^{2}\right) (106)
=kΩn0|g^k|2Qk2((Qkνk)2+S(k)\kν2)\displaystyle=\sum_{k\in\Omega_{n_{0}}}\frac{|\hat{g}_{k}|^{2}}{Q_{k}^{2}}\left((Q_{k}-\nu_{k})^{2}+\sum_{\ell\in S(k)\backslash k}\nu_{\ell}^{2}\right) (107)
=kΩn0|g^k|2(S(k)\kν)2+S(k)\kν2(S(k)ν)2.\displaystyle=\sum_{k\in\Omega_{n_{0}}}|\hat{g}_{k}|^{2}\frac{\left(\sum_{\ell\in S(k)\backslash k}\nu_{\ell}\right)^{2}+\sum_{\ell\in S(k)\backslash k}\nu_{\ell}^{2}}{\left(\sum_{\ell\in S(k)}\nu_{\ell}\right)^{2}}. (108)

With Eqs. (88) and (108) we have recovered a closed-form expression for the generalization error in terms of the feature weights \nu_{k}, the target function Fourier coefficients \hat{g}_{k}, and the noise variance \sigma^{2}. This was possible in part because of the orthonormality of the Fourier features \phi(x)_{k} and the choice to sample x_{j} uniformly on [0,1], which make \Sigma_{\phi} and XX^{T} diagonal, respectively. This simplicity makes it easier to study scenarios in which benign overfitting can arise than in prior works [6, 59]. We will now analyze choices of feature weights \nu_{k} that may give rise to benign overfitting for overparameterized weighted Fourier features models.

We remark that the results of this section may also be derived using the methods of Ref. [9], though we have opted here to use a language more reminiscent of linear regression to highlight connections between the analysis of weighted Fourier features models and ordinary least squares errors.
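As a sanity check of the closed-form error, the sketch below (again illustrative, reusing the quantities defined in the earlier snippets) evaluates Eqs. (88) and (108) for the uniform weights chosen there and compares their sum against a Monte Carlo estimate of the generalization error of the minimum-norm interpolator; the two values should agree up to sampling fluctuations.

```python
def closed_form_error(nu, g_hat, ks, n, sigma):
    """Evaluate var (Eq. 88) and bias^2 (Eq. 108) for given feature weights."""
    var, bias2 = 0.0, 0.0
    for k in range(-(n - 1) // 2, (n - 1) // 2 + 1):
        S = np.mod(ks - k, n) == 0                 # alias cohort S(k) inside Omega_d
        Q = nu[S].sum()
        var += (nu[S] ** 2).sum() / Q ** 2
        if np.any((ks == k) & (g_hat != 0)):       # k in the support of the target
            g2 = np.abs(g_hat[ks == k][0]) ** 2
            not_k = S & (ks != k)
            bias2 += g2 * (nu[not_k].sum() ** 2 + (nu[not_k] ** 2).sum()) / Q ** 2
    return sigma ** 2 * var / n, bias2

var_cf, bias2_cf = closed_form_error(nu, g_hat, ks, n, sigma)

xs = np.linspace(0, 1, 2000, endpoint=False)
Phix = np.sqrt(nu) * np.exp(-2j * np.pi * np.outer(xs, ks))   # features on a fine grid
Phi_pinv = np.linalg.pinv(Phi)
errs = [np.mean(np.abs(Phix @ (Phi_pinv @ (g(xj) + sigma * rng.normal(size=n))) - g(xs)) ** 2)
        for _ in range(200)]
print("closed form (var + bias^2):", var_cf + bias2_cf)
print("Monte Carlo estimate:      ", np.mean(errs))
```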

A.2 Solutions for special cases

A.2.1 Noise-only minimum norm estimator

We now derive the cases considered in Sec. 2.2 of the main text. We first consider the case in which the target function is g=0 and we attempt to fit a pure noise signal

yj=ϵj,y_{j}=\epsilon_{j}, (109)

with 𝔼[ϵ2]=σ2\mathbb{E}[\epsilon^{2}]=\sigma^{2}. We can recover an unaliased, “simple” fit to this pure noise signal by reconstructing ϵ\boldsymbol{\epsilon} from the DFT using Eq. (52):

ϵj=kΩnϵ^kei2πjk/n.\epsilon_{j}=\sum_{k\in\Omega_{n}}\hat{\epsilon}_{k}e^{i2\pi jk/n}. (110)

By setting all weights νk=1\nu_{k}=1, an immediate choice for an interpolating ff with access to frequencies in Ωd\Omega_{d} is found by setting

αk=ϵ^k mod nn(m+1),\alpha_{k}=\frac{\hat{\epsilon}_{k\text{ mod }n}}{n(m+1)}, (111)

where we have assumed d=n(m+1). Eq. (111) is the minimum-\ell_{2}-norm estimator and evenly distributes the weight of each Fourier coefficient \hat{\epsilon}_{k} over the alias frequencies in S(k). The effect of the higher-frequency aliases is to reduce f to near zero everywhere except at the interpolated training data. We can directly compute the generalization error of the interpolating estimator as

𝔼x,𝐲|f(x)g(x)|2\displaystyle\mathbb{E}_{x,\mathbf{y}}|f(x)-g(x)|^{2} =𝔼x|f(x)|2\displaystyle=\mathbb{E}_{x}\left|f(x)\right|^{2} (112)
=𝔼𝐲|k=0d1αkei2πkx|2\displaystyle=\mathbb{E}_{\mathbf{y}}\left|\sum_{k=0}^{d-1}\alpha_{k}e^{-i2\pi kx}\right|^{2} (113)
=𝔼𝐲k=0d1|ϵ^k mod nn(m+1)|2\displaystyle=\mathbb{E}_{\mathbf{y}}\sum_{k=0}^{d-1}\left|\frac{\hat{\epsilon}_{k\text{ mod }n}}{n(m+1)}\right|^{2} (114)
=𝔼𝐲n(m+1)ϵ^22n2(m+1)2\displaystyle=\mathbb{E}_{\mathbf{y}}\frac{n(m+1)\lVert\hat{\boldsymbol{\epsilon}}\rVert_{2}^{2}}{n^{2}(m+1)^{2}} (115)
=σ2n(m+1).\displaystyle=\frac{\sigma^{2}}{n(m+1)}. (116)

Using d=n(m+1) as the dimensionality of the feature space, we have recovered the lower-bound scaling for overparameterized models derived in Ref. [9]. In line (113) we have used the independence of \epsilon from x_{k} and y_{k}, in line (114) we have used the orthonormality of the Fourier basis functions on [0,1], and line (116) uses Parseval's relation for the Fourier coefficients, namely:

kΩn|ϵk|2=nkΩn|ϵ^k|2,\sum_{k\in\Omega_{n}}|\epsilon_{k}|^{2}=n\sum_{k\in\Omega_{n}}|\hat{\epsilon}_{k}|^{2}, (117)

which implies \mathbb{E}\lVert\hat{\boldsymbol{\epsilon}}\rVert_{2}^{2}=n^{-1}\mathbb{E}\lVert\boldsymbol{\epsilon}\rVert_{2}^{2}=\sigma^{2}. Figure 2a of the main text shows the effect of the number of alias cohorts m on the behavior of f, which interpolates a pure noise signal with very little bias. As the number of cohorts increases, the function deviates very little from the true “signal” y=0, and becomes very “spiky” in the vicinity of the noisy training points.
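The spiky interpolation of pure noise is easy to reproduce numerically. The sketch below (illustrative; it reuses numpy and the random generator from the earlier snippets, and the sizes are arbitrary) fits a pure-noise signal with \nu_k = 1 and an increasing number of alias cohorts, showing that the interpolant's mean-square value away from the data shrinks while the training residual stays at machine precision.

```python
n_pts, sigma0 = 15, 1.0
x_train = np.arange(n_pts) / n_pts
eps = sigma0 * rng.normal(size=n_pts)              # pure-noise labels, Eq. (109)
xs_fine = np.linspace(0, 1, 4000, endpoint=False)

for m in [0, 2, 20]:
    d_m = n_pts * (m + 1)
    k_m = np.arange(d_m) - (d_m - 1) // 2          # Omega_d with d = n(m+1)
    Phi_m = np.exp(-2j * np.pi * np.outer(x_train, k_m))   # nu_k = 1
    a_m = np.linalg.pinv(Phi_m) @ eps              # minimum-norm interpolator of the noise
    f_fine = np.exp(-2j * np.pi * np.outer(xs_fine, k_m)) @ a_m
    print(f"m = {m:2d}: mean |f|^2 = {np.mean(np.abs(f_fine) ** 2):.4f}, "
          f"interp. residual = {np.max(np.abs(Phi_m @ a_m - eps)):.1e}")
```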

A.2.2 Signal-only minimum norm estimator

Now we study the opposite situation, in which the target is a noiseless pure tone and aliases in the spectrum of f interfere with its predictions. In this case, we set \sigma=0 and interpolate the target labels

yj\displaystyle y_{j} =g^pei2πpxj,\displaystyle=\hat{g}_{p}e^{i2\pi px_{j}}, (118)

with n/2<p<n/2-n/2<p<n/2. When we set d=n(m+1)d=n(m+1) and predict on yy, there are exactly |S(p)|1=m|S(p)|-1=m aliases for the target function with frequency pp. We again assume all feature weights are equal, νk=1\nu_{k}=1, and by orthonormality of Fourier basis functions, only the components of ff with frequency in S(p)S(p) are retained:

f(x)=\alpha_{p}e^{i2\pi px}+\sum_{k\in S(p)\backslash p}\alpha_{k}e^{i2\pi kx}, (119)

which will interpolate the training points for any choice of 𝜶\boldsymbol{\alpha} satisfying

kS(p)αk=g^p.\sum_{k\in S(p)}\alpha_{k}=\hat{g}_{p}. (120)

The choice of trainable weights αk\alpha_{k} that satisfy Eq. (120) while minimizing 2\ell_{2}-norm is

αk={g^pm+1,kS(p),0, otherwise.\alpha_{k}=\begin{cases}\frac{\hat{g}_{p}}{m+1},&k\in S(p),\\ 0,&\text{ otherwise}.\end{cases} (121)

The problem with minimizing the 2\ell_{2} norm in this case is that it spreads the true signal into higher frequency aliases: The generalization error of this model is

𝔼x,𝐲|f(x)g(x)|2=|g^p(mm+1)|2+m|g^p|2(m+1)2=𝒪(1).\displaystyle\mathbb{E}_{x,\mathbf{y}}|f(x)-g(x)|^{2}=\left|\hat{g}_{p}\left(\frac{m}{m+1}\right)\right|^{2}+m\frac{|\hat{g}_{p}|^{2}}{(m+1)^{2}}=\mathcal{O}(1). (122)

We see that this model generally fails to generalize. This poor generalization was described as “signal bleeding” by Ref. [9]: Using nn samples, there is no way to distinguish the signal due to an alias of pp from the signal due to the true frequency pp, so the coefficients αk\alpha_{k} become evenly distributed over aliases with very little weight allocated to the true Fourier coefficient g^p\hat{g}_{p} in the model ff. Fig. 2b in the main text shows the effect of “signal bleed” for learning a pure tone in the absence of noise.
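The same construction reproduces signal bleeding (continuing the previous snippet; the tone frequency and amplitude are arbitrary choices): for a noiseless pure tone the measured generalization error approaches |\hat{g}_p|^2 as more aliases become available, matching the m/(m+1) behavior implied by Eq. (122).

```python
p, amp = 2, 1.0                                    # pure tone frequency and amplitude
y_tone = amp * np.exp(2j * np.pi * p * x_train)    # noiseless labels, Eq. (118)
g_fine = amp * np.exp(2j * np.pi * p * xs_fine)

for m in [0, 2, 20]:
    d_m = n_pts * (m + 1)
    k_m = np.arange(d_m) - (d_m - 1) // 2
    Phi_m = np.exp(-2j * np.pi * np.outer(x_train, k_m))
    a_m = np.linalg.pinv(Phi_m) @ y_tone           # min-norm fit spreads over m+1 aliases
    f_fine = np.exp(-2j * np.pi * np.outer(xs_fine, k_m)) @ a_m
    err = np.mean(np.abs(f_fine - g_fine) ** 2)
    print(f"m = {m:2d}: generalization error = {err:.4f}  (m/(m+1) = {m / (m + 1):.4f})")
```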

A.3 Conditions for Benign overfitting

The behavior of the error of the overparameterized Fourier model (Eq. (19) of the main text) depends on an interplay between the noise variance \sigma^{2}, the signal Fourier coefficients \hat{g}_{k}, the feature weights \nu_{k}, and the size of the model d. A desirable property of the feature weights \nu_{k} is that they result in a model that both interpolates the sampled data and achieves good generalization error in the limit n\rightarrow\infty. For our purposes we will consider cases where \lim_{n\rightarrow\infty}L(f^{opt})=0, though this condition could be relaxed to allow for more interesting or natural choices of \nu_{k}. We now analyze the error arising from a simple weighting scheme to demonstrate benign overfitting using the overparameterized Fourier models discussed in this work.

A.3.1 A simple demonstration of benign overfitting

Here we demonstrate a simple example of benign overfitting when the feature weights νk\nu_{k} are chosen with direct knowledge of the spectrum Ωn0\Omega_{n_{0}} of gg. For some n0<n<dn_{0}<n<d, fix c(0,1)c\in(0,1) and use the feature weights given as

νk={cn0,kΩn0,1cdn0,otherwise,\displaystyle\nu_{k}=\begin{cases}\frac{c}{n_{0}},&k\in\Omega_{n_{0}},\\ \frac{1-c}{d-n_{0}},&\text{otherwise},\end{cases} (123)

for all kΩdk\in\Omega_{d}. For simplicity, suppose d=n(m+1)d=n(m+1) such that |S(k)|=m+1|S(k)|=m+1 for all kΩnk\in\Omega_{n}. Defining the signal power as

P:=kΩn0|g^k|2,P:=\sum_{k\in\Omega_{n_{0}}}|\hat{g}_{k}|^{2}, (124)

we can directly evaluate var of Eq. (88) and bias2\textsc{bias}^{2} of Eq. (108):

var σ2n(n0(cn0)2+m(1cdn0)2(cn0+m1cdn0)2+nn0m+1).\displaystyle\rightarrow\frac{\sigma^{2}}{n}\left(n_{0}\frac{\left(\frac{c}{n_{0}}\right)^{2}+m\left(\frac{1-c}{d-n_{0}}\right)^{2}}{\left(\frac{c}{n_{0}}+m\frac{1-c}{d-n_{0}}\right)^{2}}+\frac{n-n_{0}}{m+1}\right). (125)
bias2\displaystyle\textsc{bias}^{2} Pn02(m+1)m(c1c(dn0)+mn0)2.\displaystyle\rightarrow\frac{Pn_{0}^{2}(m+1)m}{\left(\frac{c}{1-c}(d-n_{0})+mn_{0}\right)^{2}}. (126)

Fixing n0n_{0}, we can bound generalization error in the asymptotic limit nn\rightarrow\infty as:

var =𝒪(1n+nd)\displaystyle=\mathcal{O}\left(\frac{1}{n}+\frac{n}{d}\right) (127)
bias2\displaystyle\textsc{bias}^{2} =𝒪(1n2).\displaystyle=\mathcal{O}\left(\frac{1}{n^{2}}\right). (128)

Therefore, by setting d=\omega(n) the model perfectly interpolates the training data and also achieves vanishing generalization error in the limit of a large number of data points. A similar example appears in Ref. [9], though a rigorous error analysis (and its relationship to benign overfitting) was not pursued there. This benign overfitting behavior is entirely due to the feature weights of Eq. (123): as d,n\rightarrow\infty with d=\omega(n), the feature weights remain concentrated on \Omega_{n_{0}} (suppressing bias) while becoming increasingly small and evenly distributed over all aliases of \Omega_{n_{0}} (suppressing var).
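The sketch below (illustrative; it reuses the closed_form_error helper defined earlier, and the values of n_0, c, \sigma and the target coefficients are arbitrary) evaluates the closed-form var and bias^2 under the weighting of Eq. (123) with d = n^2, so that d = \omega(n), and shows both contributions shrinking as n grows.

```python
n0_b, c_b, sig_b = 5, 0.5, 0.5
g_spec = rng.normal(size=n0_b) + 1j * rng.normal(size=n0_b)   # fixed target spectrum

for n_b in [15, 45, 135]:
    d_b = n_b ** 2                                            # d = n^2, so d = omega(n)
    ks_b = np.arange(d_b) - (d_b - 1) // 2
    low_b = np.abs(ks_b) <= (n0_b - 1) // 2
    nu_b = np.where(low_b, c_b / n0_b, (1 - c_b) / (d_b - n0_b))   # weights of Eq. (123)
    gh_b = np.zeros(d_b, dtype=complex)
    gh_b[low_b] = g_spec
    v, b2 = closed_form_error(nu_b, gh_b, ks_b, n_b, sig_b)
    print(f"n = {n_b:4d}, d = {d_b:6d}:  var = {v:.5f},  bias^2 = {b2:.7f}")
```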

A.3.2 More general conditions for benign overfitting

We have derived closed-form solutions for the bias2\textsc{bias}^{2} and var terms that determine the total generalization error of an interpolating model ff and in the previous section provided a concrete example of a model that achieves benign overfitting in this setting. We now discuss conditions under which a more general choice of model can exhibit benign overfitting.

We begin by showing that the variance of Eq. (88) splits naturally into an error due to a (simple) prediction component and a (spiky, complex) interpolating component. Following Refs. [4, 6], we will split the variance into components corresponding to eigenspaces of \Sigma_{\phi} with large and small eigenvalues respectively. Let S^{\leq p} denote the set of indices for the largest p eigenvalues of \Sigma_{\phi} (i.e., the largest p values of \nu_{k}), and let S^{>p}=[d]\backslash S^{\leq p} be its complement. Define P^{\leq p}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} as the projector onto the subspace of \mathbb{R}^{d} spanned by basis vectors labelled by indices in S^{\leq p} (and P^{>p}=\mathbb{I}_{d}-P^{\leq p} is defined analogously). Then, letting (\boldsymbol{\nu})_{k}=\nu_{k} be the vector of feature weights and assuming p\leq n, we may rewrite the variance of Eq. (88) as

var =σ2nkΩnS(k)ν2(S(k)ν)2\displaystyle=\frac{\sigma^{2}}{n}\sum_{k\in\Omega_{n}}\frac{\sum_{\ell\in S(k)}\nu_{\ell}^{2}}{\left(\sum_{\ell\in S(k)}\nu_{\ell}\right)^{2}} (129)
=σ2nkΩnPS(k)𝝂22PS(k)𝝂12\displaystyle=\frac{\sigma^{2}}{n}\sum_{k\in\Omega_{n}}\frac{\lVert P_{S(k)}\boldsymbol{\nu}\rVert_{2}^{2}}{\lVert P_{S(k)}\boldsymbol{\nu}\rVert_{1}^{2}} (130)
=σ2nkΩnPpPS(k)𝝂+P>pPS(k)𝝂22PS(k)𝝂12\displaystyle=\frac{\sigma^{2}}{n}\sum_{k\in\Omega_{n}}\frac{\lVert P^{\leq p}P_{S(k)}\boldsymbol{\nu}+P^{>p}P_{S(k)}\boldsymbol{\nu}\rVert_{2}^{2}}{\lVert P_{S(k)}\boldsymbol{\nu}\rVert_{1}^{2}} (131)
σ2n(kΩnPpPS(k)𝝂22PpPS(k)𝝂12+kΩnP>pPS(k)𝝂22P>pPS(k)𝝂12)\displaystyle\leq\frac{\sigma^{2}}{n}\left(\sum_{k\in\Omega_{n}}\frac{\lVert P^{\leq p}P_{S(k)}\boldsymbol{\nu}\rVert_{2}^{2}}{\lVert P^{\leq p}P_{S(k)}\boldsymbol{\nu}\rVert_{1}^{2}}+\sum_{k\in\Omega_{n}}\frac{\lVert P^{>p}P_{S(k)}\boldsymbol{\nu}\rVert_{2}^{2}}{\lVert P^{>p}P_{S(k)}\boldsymbol{\nu}\rVert_{1}^{2}}\right) (132)
σ2[pn+1nkΩn1Rp(k)(Σϕ)],\displaystyle\leq\sigma^{2}\left[\frac{p}{n}+\frac{1}{n}\sum_{k\in\Omega_{n}}\frac{1}{R_{p}^{(k)}(\Sigma_{\phi})}\right], (133)

where we have used PpPS(k)e^j=δjk𝟙{jp}P^{\leq p}P_{S(k)}\hat{e}_{j}=\delta_{j}^{k}\mathbbm{1}\{j\leq p\} for pnp\leq n and we have introduced an effective rank for the alias cohort of kk,

Rp(k)(Σϕ)=(S(k)S>pν)2S(k)S>pν2.\displaystyle R_{p}^{(k)}(\Sigma_{\phi})=\frac{\left(\sum_{\ell\in S(k)\cap S^{>p}}\nu_{\ell}\right)^{2}}{\sum_{\ell\in S(k)\cap S^{>p}}\nu_{\ell}^{2}}. (134)

Since p=n0p=n_{0} is a relevant choice for our problem setup, we define R(k):=Rn0(k)(Σϕ)R^{(k)}:=R_{n_{0}}^{(k)}(\Sigma_{\phi}) and focus on the bound

var σ2[n0n+1nkΩn1R(k)],\displaystyle\leq\sigma^{2}\left[\frac{n_{0}}{n}+\frac{1}{n}\sum_{k\in\Omega_{n}}\frac{1}{R^{(k)}}\right], (135)

The first term of the decomposition of Eq. (135) corresponds to the variance of a (non-interpolating) model with access to n_{0} Fourier modes, while the second term corresponds to the excess error introduced by the high-frequency components of f. Given a sequence of experiments with increasing n (while g and \sigma^{2} remain fixed), we would like to understand the choices of feature weights \nu_{k} for which var vanishes as n\rightarrow\infty. Given that |\{k\in\Omega_{n}\}|=n, a sufficient condition for a sequence of feature weight distributions to be benign is that R^{(k)}=\omega(1) for all k, while a necessary condition is that there is no k for which R^{(k)}=\mathcal{O}(1/n). These conditions are not difficult to satisfy: intuitively they require only that \Sigma_{\phi} changes with increasing n in such a way that the values of \nu_{k} in \Omega_{d}\backslash\Omega_{n_{0}} continue to “flatten out” as n increases. This is precisely the behavior engineered in the example of Sec. A.3.1.
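These conditions can be checked directly for a candidate weighting scheme by computing the effective ranks numerically. The helper below (an illustrative implementation that assumes the n_0 largest weights are exactly those on \Omega_{n_0}, as in the example of Sec. A.3.1, and that needs only numpy) evaluates R^{(k)} of Eq. (134) for the weights of Eq. (123); the minimum over k is of order d/n, confirming R^{(k)} = \omega(1).

```python
def effective_rank(nu, ks, n, n0):
    """Compute R^{(k)} of Eq. (134) with p = n0 for every base frequency k in Omega_n."""
    keep = np.ones(len(nu), dtype=bool)
    keep[np.argsort(nu)[-n0:]] = False             # drop the n0 largest feature weights
    Rs = []
    for k in range(-(n - 1) // 2, (n - 1) // 2 + 1):
        S = (np.mod(ks - k, n) == 0) & keep        # S(k) intersected with S^{>n0}
        Rs.append(nu[S].sum() ** 2 / (nu[S] ** 2).sum())
    return np.array(Rs)

n_r, d_r, n0_r, c_r = 135, 135 ** 2, 5, 0.5
ks_r = np.arange(d_r) - (d_r - 1) // 2
nu_r = np.where(np.abs(ks_r) <= (n0_r - 1) // 2, c_r / n0_r, (1 - c_r) / (d_r - n0_r))
print("min effective rank:", effective_rank(nu_r, ks_r, n_r, n0_r).min())   # about d/n
```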

We now proceed to bound the bias term of Eq. (95). Observe that P:=(𝕀dXT(XXT)1X)P^{\perp}:=(\mathbb{I}_{d}-X^{T}(XX^{T})^{-1}X) is a projector onto the subspace of d\mathbb{R}^{d} orthogonal to the rows of XX, therefore satisfying P1\lVert P^{\perp}\rVert\leq 1 and PXTX=0P^{\perp}X^{T}X=0. Then the bias2\textsc{bias}^{2} is bounded as

bias2\displaystyle\textsc{bias}^{2} =Σϕ1/2P𝜶022\displaystyle=\lVert\Sigma_{\phi}^{1/2}P^{\perp}\boldsymbol{\alpha}_{0}\rVert_{2}^{2} (136)
=(ΣϕXTX)1/2P𝜶022\displaystyle=\lVert(\Sigma_{\phi}-X^{T}X)^{1/2}P^{\perp}\boldsymbol{\alpha}_{0}\rVert_{2}^{2} (137)
ΣϕXTX𝜶022.\displaystyle\leq\lVert\Sigma_{\phi}-X^{T}X\rVert_{\infty}\lVert\boldsymbol{\alpha}_{0}\rVert_{2}^{2}. (138)

The term XTXX^{T}X can be interpreted as a finite-sample estimator for Σϕ\Sigma_{\phi} in the sense that XTX=n1ΦTΦ=ΣϕX^{T}X=n^{-1}\Phi^{T}\Phi=\Sigma_{\phi} for n=dn=d. However, we cannot apply standard results on the convergence of sample covariance matrices (e.g. Ref. [60]) since the uniform spacing requirement for training data xx violates the standard assumption of i.i.d. input data. To proceed, we will make a number of simplifying assumptions about the feature weights. First, to control for the possibility that large feature weights νk\nu_{k} concentrate within a specific S(k)S(k) we will assume that feature weights corresponding to any set of alias frequencies of Ωn\Omega_{n} are close to their average. Letting d=(m+1)nd=(m+1)n, we define

η=1nkΩnνk+n,\displaystyle\eta_{\ell}=\frac{1}{n}\sum_{k\in\Omega_{n}}\nu_{k+n\ell}, (139)

for =0,1,,m\ell=0,1,\dots,m. We will impose that |νk+nη|t|\nu_{k+n\ell}-\eta_{\ell}|\leq t for all kΩnk\in\Omega_{n}, =1,,m\ell=1,\dots,m, for some positive number tt. We further assume normalization, kΩdνk=1=n=0mη\sum_{k\in\Omega_{d}}\nu_{k}=1=n\sum_{\ell=0}^{m}\eta_{\ell}. Under these assumptions we can bound the first term of (138) as

ΣϕXTX\displaystyle\lVert\Sigma_{\phi}-X^{T}X\rVert_{\infty} maxjνj1/2(kS(j mod n)\jνk1/2)\displaystyle\leq\max_{j}\nu_{j}^{1/2}\left(\sum_{k\in S(j\text{ mod }n)\backslash j}\nu_{k}^{1/2}\right) (140)
(m1)1/2maxjνj1/2(kS(j mod n)\jνk)1/2\displaystyle\leq(m-1)^{1/2}\max_{j}\nu_{j}^{1/2}\left(\sum_{k\in S(j\text{ mod }n)\backslash j}\nu_{k}\right)^{1/2} (141)
m1/2maxjνj1/2(=1m(η+t))1/2\displaystyle\leq m^{1/2}\max_{j}\nu_{j}^{1/2}\left(\sum_{\ell=1}^{m}(\eta_{\ell}+t)\right)^{1/2} (142)
(mn+m2t)1/2,\displaystyle\leq\left(\frac{m}{n}+m^{2}t\right)^{1/2}, (143)

where in line (140) we have used the Gershgorin circle theorem. Meanwhile, defining \zeta:=\sum_{k\in\Omega_{n_{0}}}\nu_{k}, by the Cauchy-Schwarz inequality we have that

𝜶02Pζ2,\lVert\boldsymbol{\alpha}_{0}\rVert^{2}\leq P\zeta^{-2}, (144)

where P is the signal power of Eq. (124). A necessary condition for producing benign overfitting in overparameterized Fourier models is that \zeta remains relatively large as n,d\rightarrow\infty. If this is accomplished, then a small enough t guarantees that all feature weights associated with frequencies k\in\Omega_{d}\backslash\Omega_{n_{0}} are uniformly suppressed. For instance, if \zeta is lower bounded by a constant while t=0, then combining Eqs. (143) and (144) yields a bound of

bias2=𝒪(d1/2n).\textsc{bias}^{2}=\mathcal{O}\left(\frac{d^{1/2}}{n}\right). (145)

Although the analysis of Sec. A.3.1 yields a significantly tighter bound, this demonstrates that the mechanisms behind that simple demonstration of benign overfitting are somewhat generic. In particular, normalization together with a lower-bounded weight on \Omega_{n_{0}} is almost sufficient to control the bias term of the generalization error.

Appendix B Solution for the quantum overparameterized model

We now derive the solution to the minimum-norm interpolating quantum model,

Mopt\displaystyle M^{opt} =argminM=Mf(xj)=yjMF,\displaystyle=\operatorname*{arg\,min}_{\begin{subarray}{c}M=M^{\dagger}\\ f(x_{j})=y_{j}\end{subarray}}\lVert M\rVert_{F}, (146)
f(x)\displaystyle f(x) =kΩei2πkx,mR(k)γMmγm.\displaystyle=\sum_{k\in\Omega}e^{i2\pi kx}\sum_{\ell,m\in R(k)}\gamma_{\ell}M_{m\ell}\gamma_{m}^{*}. (147)

We will use the following notation and definitions:

ak\displaystyle a_{k} =k+|Ω|n2n,\displaystyle=-\Biggl{\lfloor}\frac{k+\frac{|\Omega|-n}{2}}{n}\Biggl{\rfloor}, (148)
bk\displaystyle b_{k} =k+|Ω|+n21n,\displaystyle=\Biggr{\lfloor}\frac{-k+\frac{|\Omega|+n}{2}-1}{n}\Biggr{\rfloor}, (149)
Sk\displaystyle S_{k} ={ka(k)n,,k,k+n,,k+b(k)n}\displaystyle=\{k-a(k)n,\dots,k,k+n,\dots,k+b(k)n\} (150)
={k+pn,p=a(k),,b(k)},\displaystyle\qquad\qquad=\{k+pn,\quad p=-a(k),\dots,b(k)\}, (151)
R(Sk)\displaystyle R(S_{k}) :=jSkR(j).\displaystyle:=\bigcup_{j\in S_{k}}R(j). (152)

Here, a_{k} and b_{k} characterize the number of positive and negative frequency aliases of k appearing in \Omega (i.e., k+n\ell\in\Omega for all a_{k}\leq\ell\leq b_{k}), assuming that |\Omega| is odd, a requirement for any quantum model. Let L(d) be the space of linear operators on \mathbb{C}^{d}, i.e., d\times d matrices. Define the linear operator P_{k}:L(d)\rightarrow L(d) for k\in[n] as

Pk(X)=,mR(k)|X|m|m|P_{k}(X)=\sum_{\ell,m\in R(k)}\langle\ell|X|m\rangle|\ell\rangle\langle m| (153)

for any XL(d)X\in L(d). Importantly, PkP_{k} is not necessarily Hermitian preserving. Denoting Γ:=|ΓΓ|\Gamma:=|\Gamma\rangle\langle\Gamma| for brevity, we may rewrite the Fourier coefficients of f(x)f(x) as

Pk(M),Pk(Γ)\displaystyle\langle P_{k}(M),P_{k}(\Gamma)\rangle =Tr(Pk(M)Pk(Γ))\displaystyle=\operatorname{Tr}\left(P_{k}(M)^{\dagger}P_{k}(\Gamma)\right) (154)
=Tr(,mR(k)Mm|m|i,jR(k)γiγj|ij|)\displaystyle=\operatorname{Tr}\left(\sum_{\ell,m\in R(k)}M_{\ell m}^{*}|m\rangle\langle\ell|\sum_{i,j\in R(k)}\gamma_{i}\gamma_{j}^{*}|i\rangle\langle j|\right) (155)
=\sum_{\ell,m\in R(k)}\gamma_{\ell}M_{m\ell}\gamma_{m}^{*}, (156)

where in the last line we have used hermiticity of MM. Applying f(xj)=yjf(x_{j})=y_{j} j[n]\forall\,j\in[n] and substituting into Eq. (147) we find the interpolation condition

\hat{y}_{p} =\sum_{q=a_{p}}^{b_{p}}\langle P_{p+nq}(M),P_{p+nq}(\Gamma)\rangle (157)
=PSp(M),PSp(Γ),p[n],\displaystyle=\langle P_{S_{p}}(M),P_{S_{p}}(\Gamma)\rangle,\qquad\forall\,p\in[n], (158)

where P_{S_{p}}:=\sum_{q=a_{p}}^{b_{p}}P_{p+nq}. The equality follows from

Pj(X),Pk(Y)=Pj(X),Pj(Y)δjk\langle P_{j}(X),P_{k}(Y)\rangle=\langle P_{j}(X),P_{j}(Y)\rangle\delta_{j}^{k} (159)

due to the fact that R(j)R(j) and R(k)R(k) are disjoint sets for any jkj\neq k. Following the technique of Ref. [9] we apply the Cauchy-Schwarz inequality to find

PSp(M)F|PSp(M),PSp(Γ)|PSp(Γ)F,\lVert P_{S_{p}}(M)\rVert_{F}\geq\frac{|\langle P_{S_{p}}(M),P_{S_{p}}(\Gamma)\rangle|}{\lVert P_{S_{p}}(\Gamma)\rVert_{F}}, (160)

with equality if and only if P_{S_{p}}(M) is proportional to P_{S_{p}}(\Gamma). Saturating this lower bound by setting M_{\ell m}=c\gamma_{\ell}\gamma_{m}^{*} and solving for the proportionality constant c via the interpolation condition of Eq. (158), we find

y^p=cPSp(Γ),PSp(Γ)=,mR(Sp)c|γ|2|γm|2.\displaystyle\hat{y}_{p}=\langle cP_{S_{p}}(\Gamma),P_{S_{p}}(\Gamma)\rangle=\sum_{\ell,m\in R(S_{p})}c^{*}|\gamma_{\ell}|^{2}|\gamma_{m}|^{2}. (161)

This indicates an additional requirement for interpolation, namely that \gamma_{\ell},\gamma_{m}>0 for some pair (\ell,m)\in R(S_{k}) whenever \hat{y}_{k}\neq 0, and so for simplicity we will require that \gamma_{\ell}>0 for all \ell=1,\dots,d. Within each set of indices R(S_{k}), the elements of the optimal observable are defined piecewise with respect to that partition:

(,m)R(Sk)Mm=y^kγγmi,jR(Sk)|γi|2|γj|2.(\ell,m)\in R(S_{k})\Rightarrow M_{\ell m}=\hat{y}_{k}^{*}\frac{\gamma_{\ell}\gamma_{m}^{*}}{\sum_{i,j\in R(S_{k})}|\gamma_{i}|^{2}|\gamma_{j}|^{2}}. (162)

Minimization of \lVert M\rVert_{F} subject to the interpolation constraint is equivalent to minimization of \lVert P_{k+nq}(M)\rVert_{F} for all q\in[a_{k},b_{k}],\,k\in\Omega_{n}, and so solving the constrained optimization over all distinct subspaces in \bigcup_{k=0}^{n-1}R(S_{k})=\{0,1\}^{n_{q}}\times\{0,1\}^{n_{q}} we recover the optimal observable

Mopt=argminf(xj)=yjjMF=kΩny^k,mR(Sk)γγmi,jR(Sk)|γi|2|γj|2|m|.M_{opt}=\operatorname*{arg\,min}_{f(x_{j})=y_{j}\,\forall\,j}\lVert M\rVert_{F}=\sum_{k\in\Omega_{n}}\hat{y}_{k}^{*}\sum_{\ell,m\in R(S_{k})}\frac{\gamma_{\ell}\gamma_{m}^{*}}{\sum_{i,j\in R(S_{k})}|\gamma_{i}|^{2}|\gamma_{j}|^{2}}|\ell\rangle\langle m|. (163)

We now verify that this matrix is Hermitian and therefore a valid observable. We will use the following:

y^k\displaystyle\hat{y}_{k} =y^k,\displaystyle=\hat{y}_{-k}^{*}, (164)
R(k)\displaystyle R(-k) ={(m,) for all (,m)R(k)}:=R(k)T.\displaystyle=\{(m,\ell)\text{ for all }(\ell,m)\in R(k)\}:=R(k)^{T}. (165)

Eq. (164) follows from our assumption that yjj[n]y_{j}\in\mathbb{R}\,\forall\,j\in[n]. And so

Mm\displaystyle M_{\ell m} =y^kγγmi,jR(Sk)|γi|2|γj|2(,m)R(Sk)\displaystyle=\hat{y}_{k}^{*}\frac{\gamma_{\ell}\gamma_{m}^{*}}{\sum_{i,j\in R(S_{k})}|\gamma_{i}|^{2}|\gamma_{j}|^{2}}\quad\forall\,(\ell,m)\in R(S_{k}) (166)
=y^kγγmj,iR(Sk)T|γi|2|γj|2(,m)R(Sk)T\displaystyle=\hat{y}_{-k}\frac{\gamma_{\ell}\gamma_{m}^{*}}{\sum_{j,i\in R(S_{-k})^{T}}|\gamma_{i}|^{2}|\gamma_{j}|^{2}}\quad\forall\,(\ell,m)\in R(S_{-k})^{T} (167)
=(y^kγmγi,jR(Sk)|γi|2|γj|2)(m,)R(Sk)\displaystyle=\left(\hat{y}_{-k}^{*}\frac{\gamma_{m}\gamma_{\ell}^{*}}{\sum_{i,j\in R(S_{-k})}|\gamma_{i}|^{2}|\gamma_{j}|^{2}}\right)^{*}\quad\forall\,(m,\ell)\in R(S_{-k}) (168)
=Mm.\displaystyle=M_{m\ell}^{*}. (169)

In line (167) we have used Eqs. (164) and (165), while in line (169) we have used the fact that Eq. (162) holds with respect to a fixed partition; M_{\ell m} and M_{m\ell} must therefore be computed with respect to distinct partitions. The optimal model may now be rewritten in terms of base frequencies as

fopt(x)\displaystyle f^{opt}(x) =kΩny^k=akbkei2π(k+n)xi,jR(k+n)|γi|2|γj|2r=akbkc,dR(k+nr)|γc|2|γd|2.\displaystyle=\sum_{k\in\Omega_{n}}\hat{y}_{k}\sum_{\ell=a_{k}}^{b_{k}}e^{i2\pi(k+n\ell)x}\frac{\sum_{i,j\in R(k+n\ell)}|\gamma_{i}|^{2}|\gamma_{j}|^{2}}{\sum_{r=a_{k}}^{b_{k}}\sum_{c,d\in R(k+nr)}|\gamma_{c}|^{2}|\gamma_{d}|^{2}}. (170)

Recall that the optimal classical model derived in Eqs. (65) and (66) is given by

f(x)=kΩny^k=akbkei2π(k+n)xνk+nr=akbkνk+nr.f(x)=\sum_{k\in\Omega_{n}}\hat{y}_{k}\,\sum_{\ell=a_{k}}^{b_{k}}e^{i2\pi(k+\ell n)x}\frac{\nu_{k+\ell n}}{{\sum_{r=a_{k}}^{b_{k}}\nu_{k+nr}}}. (171)

Then despite Eq. (147) not having a clear decomposition into scalar feature weights and trainable weights, we can identify the feature weights of the optimized quantum model as

νkopt:=i,jR(k)|γi|2|γj|2,\nu_{k}^{opt}:=\sum_{i,j\in R(k)}|\gamma_{i}|^{2}|\gamma_{j}|^{2}, (172)

which recovers the same form as the optimal classical model of Eq. (66). This means that the behavior of the feature weights of the (optimal) general quantum model is strongly controlled by the degeneracy sets R(k). The generalization error of the quantum model is also described by Eq. (19) of the main text under the identification of Eq. (172), and therefore exhibits a tradeoff that is predominantly controlled by the degeneracies R(k) of the data-encoding Hamiltonian.
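The identification of Eq. (172) can also be verified numerically without invoking the closed form of Eq. (163). The sketch below (our own illustration; the spectrum, which is an 8-mark Golomb ruler, the random amplitudes, the labels, and the sign convention of the time evolution are all assumptions of this example) computes the minimum-Frobenius-norm interpolating observable by solving the Gram system over the span of the encoded states, checks that it is Hermitian and interpolates, and confirms that within a single alias cohort the Fourier coefficients of f^{opt} are proportional to the weights \nu_k^{opt} of Eq. (172): the printed ratios should all coincide.

```python
rng_q = np.random.default_rng(7)
lams = np.array([0, 1, 4, 9, 15, 22, 32, 34])      # spectrum of H: an 8-mark Golomb ruler
dq = len(lams)
gamma = rng_q.normal(size=dq) + 1j * rng_q.normal(size=dq)
gamma /= np.linalg.norm(gamma)                     # amplitudes of |Gamma>

n_data = 11
x_q = np.arange(n_data) / n_data
y_q = rng_q.normal(size=n_data)                    # arbitrary real labels to interpolate

def rho(x):
    u = np.exp(-2j * np.pi * lams * x) * gamma     # e^{-i 2 pi H x} |Gamma>
    return np.outer(u, u.conj())

# f(x) = Tr(M rho(x)) is linear in M, so the minimum-Frobenius-norm interpolant lies in
# span{rho(x_j)}; solve the (real, symmetric) Gram system for its coefficients.
G = np.array([[np.trace(rho(a) @ rho(b)).real for b in x_q] for a in x_q])
coef = np.linalg.solve(G, y_q)
M = sum(cj * rho(xj) for cj, xj in zip(coef, x_q))
print("Hermitian:   ", np.allclose(M, M.conj().T))
print("interpolates:", np.allclose([np.trace(M @ rho(xq)).real for xq in x_q], y_q))

# Fourier coefficient of f at frequency q versus nu_q^{opt} of Eq. (172): within one
# alias cohort (q = 1 mod n_data below) the ratio should be a single constant.
probs = np.abs(gamma) ** 2
diffs = lams[:, None] - lams[None, :]
A = M * np.outer(gamma.conj(), gamma)
for q in [q for q in np.unique(diffs) if q % n_data == 1]:
    ratio = A[diffs == q].sum() / np.outer(probs, probs)[diffs == q].sum()
    print(int(q), np.round(ratio, 8))
```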

We can now substitute γ=1/d\gamma_{\ell}=1/\sqrt{d} to recover the optimal observable for the simplified model derived in Sec 3.1 of the main text using other means, namely

Mmopt\displaystyle M_{m\ell}^{opt} =y^kdr=akbk|R(k+nr)|,\displaystyle=\hat{y}_{k}\frac{d}{\sum_{r=a_{k}}^{b_{k}}|R(k+nr)|}, (173)
f(x)\displaystyle f(x) =kΩny^kq=akbkei2π(k+nq)x|R(k+nq)|r=akbk|R(k+nr)|,\displaystyle=\sum_{k\in\Omega_{n}}\hat{y}_{k}\sum_{q=a_{k}}^{b_{k}}e^{i2\pi(k+nq)x}\frac{|R(k+nq)|}{\sum_{r=a_{k}}^{b_{k}}|R(k+nr)|}, (174)
νk\displaystyle\nu_{k} =|R(k)|d2.\displaystyle=\frac{|R(k)|}{d^{2}}. (175)
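For concreteness, the feature weights of Eq. (175) can be tabulated directly from a Hamiltonian spectrum by counting pairwise eigenvalue differences. In the sketch below (illustrative only; the three spectra are our own assumptions chosen to match the names used in the main text: consecutive integers for the Binary strategy, bitstring Hamming weights for the Hamming strategy, and an 8-mark Golomb ruler for the Golomb strategy), a Golomb-ruler spectrum yields |R(k)| = 1 for every nonzero k, i.e. uniform weights away from k = 0.

```python
def feature_weights_from_spectrum(lams):
    """nu_k = |R(k)| / d^2 (Eq. 175), with R(k) the pairs (i, j) with lam_i - lam_j = k."""
    lams = np.asarray(lams)
    dloc = len(lams)
    diffs = lams[:, None] - lams[None, :]
    freqs, counts = np.unique(diffs, return_counts=True)
    return {int(k): cnt / dloc ** 2 for k, cnt in zip(freqs, counts)}

nq = 3
binary = np.arange(2 ** nq)                                    # 0, 1, ..., 2^nq - 1
hamming = np.array([bin(i).count("1") for i in range(2 ** nq)])
golomb = np.array([0, 1, 4, 9, 15, 22, 32, 34])                # 8 marks, all differences distinct

for name, spec in [("Binary", binary), ("Hamming", hamming), ("Golomb", golomb)]:
    nu_spec = feature_weights_from_spectrum(spec)
    print(name, {k: round(v, 4) for k, v in sorted(nu_spec.items()) if k >= 0})
```

The flat tail produced by the Golomb-ruler spectrum mirrors the flat feature weights used in the classical example of Sec. A.3.1.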

B.1 Computing feature weights of typical quantum models

In Sec. 3.1 we introduced a simple model with a uniform-amplitude state |\Gamma\rangle as input and demonstrated that the feature weights of this simple quantum model are completely determined by the sets R(k) induced by the encoding strategy. We now wish to extend the intuition that the behavior of the optimized general quantum model is strongly influenced by the distribution of the degeneracies |R(k)|. We do so by evaluating the optimal quantum models with respect to an “average” state preparation unitary U. We can compute the average value of |\gamma_{i}|^{2}|\gamma_{j}|^{2} for U sampled uniformly from the Haar distribution using standard results from the literature [61] (note that since \nu_{k}^{opt} is invariant with respect to \gamma_{i}\rightarrow e^{i\phi}\gamma_{i}, a spherical measure would suffice here):

𝔼UU(d)|γi|2|γj|2=𝔼UU(d)Ui0Ui0Uj0Uj0=δij+1d(d+1).\displaystyle\underset{U\sim\operatorname{U}\left(d\right)}{\mathbb{E}}|\gamma_{i}|^{2}|\gamma_{j}|^{2}=\underset{U\sim\operatorname{U}\left(d\right)}{\mathbb{E}}U_{i0}U_{i0}^{*}U_{j0}U_{j0}^{*}=\frac{\delta_{i}^{j}+1}{d(d+1)}. (176)

Since (i,i)\in R(0), we can then compute the feature weights of the optimal model (it is implied that we compute the optimal \nu with respect to each U sampled independently and uniformly from the Haar measure; without optimizing M with respect to each U we would find the trivial result \underset{U\sim\operatorname{U}\left(d\right)}{\mathbb{E}}f(x)=\underset{U\sim\operatorname{U}\left(d\right)}{\mathbb{E}}\operatorname{Tr}\left(M\rho(x)\right)=d^{-1}\operatorname{Tr}\left(M\right)) according to

𝔼UU(d)νkopt=𝔼UU(d)i,jR(k)|γi|2|γj|2=|R(k)|d(d+1)+δ0k1d+1.\displaystyle\underset{U\sim\operatorname{U}\left(d\right)}{\mathbb{E}}\nu_{k}^{opt}=\underset{U\sim\operatorname{U}\left(d\right)}{\mathbb{E}}\sum_{i,j\in R(k)}|\gamma_{i}|^{2}|\gamma_{j}|^{2}=\frac{|R(k)|}{d(d+1)}+\delta_{0}^{k}\frac{1}{d+1}. (177)

From Eq. (177) we see that the feature weights of a quantum model optimized with respect to random UU are completely determined by the degeneracies R(k)R(k). This expected value is useful but does not fully characterize the behavior of an encoding strategy. To demonstrate that this average behavior is meaningful, we would further like to verify that the feature weights corresponding to a random UU concentrate around the mean of Eq. (177). We characterize this by computing the variance

Var(νk)=𝔼UU(d)(νk2)(𝔼UU(d)νk)2,\text{Var}(\nu_{k})=\underset{U\sim\operatorname{U}\left(d\right)}{\mathbb{E}}\left(\nu_{k}^{2}\right)-\left(\underset{U\sim\operatorname{U}\left(d\right)}{\mathbb{E}}\nu_{k}\right)^{2}, (178)

where we have dropped the superscript on \nu_{k}^{opt} for brevity. This computation requires significantly more involved counting arguments that depend on the structure of R(k). When k\neq 0 we have i\neq j for every (i,j)\in R(k), and we split the sum according to which indices coincide:

νk2\displaystyle\nu_{k}^{2} =i,jR(k),mR(k)|γi|2|γj|2|γ|2|γm|2\displaystyle=\sum_{i,j\in R(k)}\sum_{\ell,m\in R(k)}|\gamma_{i}|^{2}|\gamma_{j}|^{2}|\gamma_{\ell}|^{2}|\gamma_{m}|^{2} (179)
=i,jR(k)(|γi|4|γj|4+,mR(k)j,m=i|γi|4|γj|2|γ|2\displaystyle=\sum_{i,j\in R(k)}\left(|\gamma_{i}|^{4}|\gamma_{j}|^{4}+\sum_{\begin{subarray}{c}\ell,m\in R(k)\\ \ell\neq j,m=i\end{subarray}}|\gamma_{i}|^{4}|\gamma_{j}|^{2}|\gamma_{\ell}|^{2}\right.
+m,R(k)=j,mi|γi|2|γj|4|γm|2+m,R(k)j,mi|γi|2|γj|2|γ|2|γm|2).\displaystyle\qquad\qquad\qquad\qquad\left.+\sum_{\begin{subarray}{c}m,\ell\in R(k)\\ \ell=j,m\neq i\end{subarray}}|\gamma_{i}|^{2}|\gamma_{j}|^{4}|\gamma_{m}|^{2}+\sum_{\begin{subarray}{c}m,\ell\in R(k)\\ \ell\neq j,m\neq i\end{subarray}}|\gamma_{i}|^{2}|\gamma_{j}|^{2}|\gamma_{\ell}|^{2}|\gamma_{m}|^{2}\right). (180)

The expected values of these terms are evaluated using the observation that the vector (|\gamma_{1}|^{2},\dots,|\gamma_{d}|^{2}) is distributed uniformly on the probability simplex in \mathbb{R}^{d}, leading to simple expressions for the following expected values [61]:

𝔼UU(d)|γp|4|γq|4\displaystyle\underset{U\sim\operatorname{U}\left(d\right)}{\mathbb{E}}|\gamma_{p}|^{4}|\gamma_{q}|^{4} =4/D,\displaystyle=4/D, (181)
𝔼UU(d)|γp|4|γq|2|γr|2\displaystyle\underset{U\sim\operatorname{U}\left(d\right)}{\mathbb{E}}|\gamma_{p}|^{4}|\gamma_{q}|^{2}|\gamma_{r}|^{2} =2/D,\displaystyle=2/D, (182)
𝔼UU(d)|γp|2|γq|2|γr|2|γs|2\displaystyle\underset{U\sim\operatorname{U}\left(d\right)}{\mathbb{E}}|\gamma_{p}|^{2}|\gamma_{q}|^{2}|\gamma_{r}|^{2}|\gamma_{s}|^{2} =1/D,\displaystyle=1/D, (183)

where p,q,r,s = 1,…,d are pairwise distinct and D := (d+3)(d+2)(d+1)d. In computing the expected value of Eq. (180), linearity implies that every term in a given sum contributes the same constant, so only the number of terms in each sum is relevant. The first sum of Eq. (180) contains |R(k)| terms and the total number of terms is |R(k)|^2, so we only need to compute the number of elements in the two middle sums of Eq. (180) (which contain an equal number of terms due to the symmetry R(k) = R(-k)^T). These computations may be carried out by brute-force combinatorics and are summarized in Table 2 for a few of the models studied in this work.

Encoding strategy | |{(i,j,ℓ,m) : m ≠ i, ℓ = j, (i,j),(ℓ,m) ∈ R(k)}| | Var(ν_k^{opt})
Hamming | \sum_{p=0}^{n_q-2k} \binom{n_q}{p}\binom{n_q}{p+|k|}\binom{n_q}{p+2|k|} | –
Binary | \max(2^{n_q}-2|k|, 0) | O(2^{-3n_q})
Golomb | 0 | O(d^{-4})
Table 2: The size of the subset of R(k)×R(k) with a single repeated index, computed for different encoding strategies, and the corresponding scaling of the variance of the feature weights (all entries assume k ≠ 0). The Hamming encoding strategy computation works as follows: each bitstring i with weight p (of which there are n_q-choose-p) is paired with n_q-choose-(p+k) many bitstrings j with weight p+k. Then, taking ℓ = j, there are n_q-choose-(p+2k) many bitstrings m with weight (p+k)+k = p+2k. Summing over all such p with p+2k ≤ n_q yields the desired result; this computation does not admit a clear closed-form expression, so we have omitted the corresponding scaling of Var(ν_k^{opt}) for the Hamming encoding strategy.

For the Binary encoding strategy we compute

𝔼(νk2)={5d7k+(dk)2D,0<2kd3(dk)+(dk)2D,2k>d,\displaystyle\mathbb{E}(\nu_{k}^{2})=\begin{cases}\frac{5d-7k+(d-k)^{2}}{D},&0<2k\leq d\\ \frac{3(d-k)+(d-k)^{2}}{D},&2k>d,\end{cases} (184)

and so

Var(νk)={(5d7k)d(d+1)(dk)2(4d+6)Dd(d+1) if 0<2kd3(dk)d(d+1)(dk)2(4d+6)Dd(d+1) if 2k>d}𝒪(d3).\displaystyle\text{Var}(\nu_{k})=\left\{\!\begin{aligned} &\frac{(5d-7k)d(d+1)-(d-k)^{2}(4d+6)}{Dd(d+1)}\text{ if }0<2k\leq d\\ &\frac{3(d-k)d(d+1)-(d-k)^{2}(4d+6)}{Dd(d+1)}\text{ if }2k>d\end{aligned}\right\}\leq\mathcal{O}(d^{-3}). (185)

Importantly, 𝔼[ν_k] = Ω(d^{-2}) for the Binary encoding strategy. Taking d = 2^{n_q} for n_q qubits, the mean therefore decays exponentially in the number of qubits n_q, while the variance decays exponentially faster. For the Golomb encoding strategy the calculation is comparatively straightforward, yielding for k ≠ 0

Var(νk)=3d2d6Dd(d+1)=𝒪(d4),\text{Var}(\nu_{k})=\frac{3d^{2}-d-6}{Dd(d+1)}=\mathcal{O}(d^{-4}), (186)

with the variance again decaying significantly faster in d than the mean. Figure 6 shows the average and variance of ν_k^{opt} for the general quantum model with U sampled uniformly with respect to the Haar measure. We find tight concentration of ν_k^{opt} around its average in each of these cases.

Figure 6: Plotting 𝔼[νkopt]\mathbb{E}\left[\nu_{k}^{opt}\right] and Var[νkopt]1/2\text{Var}\left[\nu_{k}^{opt}\right]^{1/2} demonstrates tight concentration of the distribution of νkopt\nu_{k}^{opt} around its mean with respect to Haar-random UU in the general quantum model. For the Binary and Golomb encoding strategies we plot Eqs. (185) and (186) respectively. For all models we also include averages and variances computed by sampling states uniformly from the Haar measure numerically.
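As a sanity check on Eq. (177) and the concentration displayed in Figure 6, the following minimal numpy sketch (our own illustration, not part of the paper's numerics; the sample size and the chosen frequencies are arbitrary) samples Haar-random input states for the Binary encoding strategy on n_q = 4 qubits and compares the empirical mean of ν_k^{opt} with the prediction |R(k)|/(d(d+1)) for k ≠ 0.

```python
import numpy as np
from collections import defaultdict

def haar_state(d, rng):
    # First column of a Haar-random unitary: a normalized complex Gaussian vector.
    v = rng.normal(size=d) + 1j * rng.normal(size=d)
    return v / np.linalg.norm(v)

def feature_weights(gamma, eigvals):
    # nu_k = sum over (i, j) with eigvals[j] - eigvals[i] = k of |gamma_i|^2 |gamma_j|^2.
    p = np.abs(gamma) ** 2
    nu = defaultdict(float)
    for i, li in enumerate(eigvals):
        for j, lj in enumerate(eigvals):
            nu[lj - li] += p[i] * p[j]
    return nu

n_q = 4
d = 2 ** n_q
eigvals = np.arange(d)              # Binary encoding: unit-spaced eigenvalues, so k = j - i
rng = np.random.default_rng(0)

samples = [feature_weights(haar_state(d, rng), eigvals) for _ in range(2000)]
for k in [1, d // 2, d - 1]:
    vals = np.array([s[k] for s in samples])
    pred_mean = (d - abs(k)) / (d * (d + 1))     # Eq. (177) with |R(k)| = d - |k|
    print(k, vals.mean(), pred_mean, vals.std())  # empirical mean matches Eq. (177); std sets the concentration scale
```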

Why is it that the feature weights ν_k^{opt} of the quantum model given by Eq. (172) only appear after deriving the optimal observable? One explanation is that while the components of the classical Fourier features vector ϕ(x) are mutually orthonormal on [0,1], the corresponding components of the quantum feature map are not. Consider the operator

Σ:=𝒳Vec(ρ(x))Vec(ρ(x))p(x)dx=𝒳ρ(x)ρT(x)p(x)dx,\Sigma:=\int_{\mathcal{X}}\operatorname{Vec}\left(\rho(x)\right)\operatorname{Vec}\left(\rho(x)\right)^{\dagger}p(x)dx=\int_{\mathcal{X}}\rho(x)\otimes\rho^{T}(x)p(x)dx, (187)

where p:𝒳p:\mathcal{X}\rightarrow\mathbb{R} is the probability density function describing the distribution of data in 𝒳\mathcal{X} and Vec:d×dd2\text{Vec}:\mathbb{C}^{d\times d}\rightarrow\mathbb{C}^{d^{2}} is the vectorization map that acts by stacking transposed rows of a square matrix into a column vector. Σ\Sigma is analogous to a classical covariance matrix, and here determines the correlation between components of ρ(x)\rho(x) (with the second equality holding when ρ(x)\rho(x) is a pure state for all xx). From Eq. (8) describing the classical Fourier features it is straightforward to compute 𝔼x[0,1][ϕϕ]=𝕀d\mathbb{E}_{x\sim[0,1]}[\phi\phi^{\dagger}]=\mathbb{I}_{d}, demonstrating that the classical Fourier features are indeed orthonormal. However, under the identification of the feature vector ϕ(x)Vec(ρ(x))\phi(x)\rightarrow\operatorname{Vec}\left(\rho(x)\right) in the quantum case, the same is not true for the quantum model:

\displaystyle\Sigma=\int_{0}^{1}\left(\sum_{k\in\Omega}e^{i2\pi kx}\sum_{i,j\in R(k)}\gamma_{i}\gamma_{j}^{*}|ij\rangle\right)\left(\sum_{k^{\prime}\in\Omega}e^{-i2\pi k^{\prime}x}\sum_{\ell,m\in R(k^{\prime})}\gamma_{\ell}^{*}\gamma_{m}\langle\ell m|\right)dx   (188)
=ki,jR(k),mR(k)γiγjγγm|ijm|.\displaystyle=\sum_{k}\sum_{\begin{subarray}{c}i,j\in R(k)\\ \ell,m\in R(k)\end{subarray}}\gamma_{i}\gamma_{j}^{*}\gamma_{\ell}^{*}\gamma_{m}|ij\rangle\langle\ell m|. (189)

Thus Σ is not diagonal in general, as many components of S(x)|Γ⟩ each contribute to a single frequency k. Nor is it possible in general to construct, using unitary operations alone, a quantum feature map that acts as an orthonormal Fourier features vector. As Σ is positive semidefinite by construction, there exists a spectral decomposition Σ = WDW^† with diagonal D and d^2 × d^2 unitary W. However, the linear map Φ acting on d-dimensional states according to Vec(Φ(ρ)) = W^† Vec(ρ(x)) will not be unitary in general, and thus it may not be possible to prepare ρ in such a way that the elements of Vec(ρ(x)) form orthonormal Fourier features.
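The lack of orthogonality can also be seen numerically. The sketch below (an illustration under our own assumptions: numpy, two qubits, the Hamming encoding, and a Haar-random |Γ⟩) approximates Σ of Eq. (187) by averaging Vec(ρ(x))Vec(ρ(x))^† over a uniform grid of x values, which is exact here because all frequencies are integers, and reports the largest off-diagonal entry.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2
d = 2 ** n
lam = np.array([bin(j).count("1") for j in range(d)])   # Hamming encoding: lambda_j = w(j)

# Haar-random input state |Gamma>
g = rng.normal(size=d) + 1j * rng.normal(size=d)
g /= np.linalg.norm(g)

# Average Vec(rho(x)) Vec(rho(x))^dagger over a grid; M > max|k - k'| integrates
# integer frequency differences exactly.
M = 64
Sigma = np.zeros((d * d, d * d), dtype=complex)
for x in np.arange(M) / M:
    psi = np.exp(-2j * np.pi * lam * x) * g      # S(x)|Gamma>
    rho = np.outer(psi, psi.conj())
    v = rho.reshape(-1, 1)                       # row-stacking vectorization
    Sigma += (v @ v.conj().T) / M

off_diag = Sigma - np.diag(np.diag(Sigma))
print("largest off-diagonal magnitude of Sigma:", np.abs(off_diag).max())   # nonzero
```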

B.2 The effect of state preparation on feature weights in the general quantum model

We now discuss how the state preparation unitary UU may affect the feature weights νkopt\nu_{k}^{opt} in a quantum model, and what choices one can make to construct a UU that gives rise to a specific distribution of feature weights νkopt\nu_{k}^{opt}.

As an example, we consider the general quantum model using the Hamming encoding strategy and an input state |Γ⟩. For this strategy the feature weights ν_k^{opt} depend on the amplitudes γ_i and γ_j whose indices i, j have Hamming weights differing by k. We have

νkopt=i,jR(k)|γi|2|γj|2=i,j:w(j)w(i)=k|γi|2|γj|2.\displaystyle\nu_{k}^{opt}=\sum_{i,j\in R(k)}|\gamma_{i}|^{2}|\gamma_{j}|^{2}=\sum_{\begin{subarray}{c}i,j:\\ w(j)-w(i)=k\end{subarray}}|\gamma_{i}|^{2}|\gamma_{j}|^{2}. (190)

We will now show that the feature weight ν_k^{opt} is unchanged when computed with respect to a rebalanced input state, which distributes each amplitude γ_i uniformly among all computational basis states whose index has weight w(i). We define this rebalanced state on n qubits as

|Γ\displaystyle|\Gamma^{\prime}\rangle =i{0,1}nγi|i=i=0njW(i)ϕi|j,\displaystyle=\sum_{i\in\{0,1\}^{n}}\gamma_{i}^{\prime}|i\rangle=\sum_{i=0}^{n}\sum_{j\in W(i)}\phi_{i}|j\rangle, (191)
ϕi\displaystyle\phi_{i} =1|W(i)|(jW(i)|γj|2)1/2.\displaystyle=\frac{1}{\sqrt{|W(i)|}}\left(\sum_{j\in W(i)}|\gamma_{j}|^{2}\right)^{1/2}. (192)

where W(i)={j:j{0,1}n,w(j)=i}W(i)=\{j:j\in\{0,1\}^{n},w(j)=i\} is the set of weight-ii indices and |W(i)|=(ni)|W(i)|=\binom{n}{i}. Observing that the amplitudes of |Γ|\Gamma^{\prime}\rangle satisfy |γi|2=ϕw(i)2|\gamma_{i}^{\prime}|^{2}=\phi_{w(i)}^{2}, we can derive for k0k\geq 0

νkopt\displaystyle\nu_{k}^{opt} =i,j:w(j)w(i)=k|γi|2|γj|2\displaystyle=\sum_{\begin{subarray}{c}i,j:\\ w(j)-w(i)=k\end{subarray}}|\gamma_{i}|^{2}|\gamma_{j}|^{2} (193)
==0nk(i:w(i)=|γi|2)(j:w(j)=+k|γj|2)\displaystyle=\sum_{\ell=0}^{n-k}\left(\sum_{i:w(i)=\ell}|\gamma_{i}|^{2}\right)\left(\sum_{j:w(j)=\ell+k}|\gamma_{j}|^{2}\right) (194)
==0nk(|W()|i:w(i)=|γi|2|W()|)(|W(+k)|j:w(j)=+k|γj|2|W(+k)|)\displaystyle=\sum_{\ell=0}^{n-k}\left(|W(\ell)|\frac{\sum_{i:w(i)=\ell}|\gamma_{i}|^{2}}{|W(\ell)|}\right)\left(|W(\ell+k)|\frac{\sum_{j:w(j)=\ell+k}|\gamma_{j}|^{2}}{|W(\ell+k)|}\right) (195)
==0nk|W()|ϕ2|W(+k)|ϕ+k2\displaystyle=\sum_{\ell=0}^{n-k}|W(\ell)|\phi_{\ell}^{2}|W(\ell+k)|\phi_{\ell+k}^{2} (196)
==0nk(i:w(i)=ϕw(i)2)(j:w(j)=+kϕw(j)2)\displaystyle=\sum_{\ell=0}^{n-k}\left(\sum_{i:w(i)=\ell}\phi_{w(i)}^{2}\right)\left(\sum_{j:w(j)=\ell+k}\phi_{w(j)}^{2}\right) (197)
=i,j:w(j)w(i)=k|γi|2|γj|2.\displaystyle=\sum_{\begin{subarray}{c}i,j:\\ w(j)-w(i)=k\end{subarray}}|\gamma_{i}^{\prime}|^{2}|\gamma_{j}^{\prime}|^{2}. (198)

Therefore, the feature weights computed from |Γ=U|0|\Gamma\rangle=U|0\rangle and the rebalanced state |Γ|\Gamma^{\prime}\rangle are identical. We can emphasize the significance of this observation by rewriting |Γ|\Gamma^{\prime}\rangle of Eq. (191) as

|Γ\displaystyle|\Gamma^{\prime}\rangle =i=0nϕi|Φi,\displaystyle=\sum_{i=0}^{n}\phi_{i}|\Phi_{i}\rangle, (199)

where |Φ_i⟩ = Σ_{j∈W(i)} |j⟩ is an (unnormalized) uniform superposition of bitstrings with weight i. The state |Γ'⟩ of Eq. (199) has only n+1 real parameters ϕ_i, and the feature weights ν_k^{opt} are unchanged by any operation on the input state that is restricted to act within a fixed-weight subspace span{|j⟩ : j ∈ W(i)}. This invariance greatly reduces the class of unitaries U that affect the distribution of feature weights ν_k^{opt} and enables some degree of tuning for these parameters.
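A quick numerical check of Eqs. (193)–(198), assuming numpy and a Haar-random |Γ⟩ on four qubits (our own toy parameters): redistributing the probability of each fixed-weight sector uniformly over that sector leaves every feature weight of the Hamming encoding unchanged.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
d = 2 ** n
w = np.array([bin(j).count("1") for j in range(d)])   # Hamming weights of basis indices

g = rng.normal(size=d) + 1j * rng.normal(size=d)
g /= np.linalg.norm(g)
p = np.abs(g) ** 2

# Rebalanced probabilities: spread the total probability of each weight sector
# uniformly over that sector (Eqs. (191)-(192)).
p_rebal = np.array([p[w == w[j]].sum() / (w == w[j]).sum() for j in range(d)])

def nu(probs, k):
    # nu_k for the Hamming encoding: sum over pairs with w(j) - w(i) = k.
    return sum(probs[i] * probs[j] for i in range(d) for j in range(d) if w[j] - w[i] == k)

for k in range(n + 1):
    print(k, nu(p, k), nu(p_rebal, k))   # the two columns agree
```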

B.2.1 Vanishing gradients in preparing feature weights

We briefly remark on the possibility of training the feature weights ν_k^{opt} of the general quantum model variationally. For example, one could define a desired distribution of feature weights ϕ_k ∈ ℝ^{|Ω|} and then attempt to tune the parameters of U in order to minimize a cost function such as

L=kΩ|νkoptϕk|2.L=\sum_{k\in\Omega}|\nu_{k}^{opt}-\phi_{k}|^{2}. (200)

Since ν_k represents a probability distribution, other distance measures such as the relative entropy or the earth mover's distance may be more appropriate; using an alternative cost function would not, however, affect the main arguments here. We will now demonstrate that for a sufficiently expressive class of state preparation unitaries U, such an optimization problem is difficult on average. For simplicity, we follow the original formulation of barren plateaus presented in Ref. [23]: let U be defined with respect to a set of parameters 𝜽 ∈ ℝ^L as

U==1LU(θ)WU=\prod_{\ell=1}^{L}U(\theta_{\ell})W_{\ell} (201)

with U(θ_ℓ) = exp(iθ_ℓ V_ℓ), where V_ℓ is a d-dimensional Hermitian operator and W_ℓ is a d-dimensional unitary operator. Pick a parameter θ_m and define U_+ = ∏_{ℓ=m+1}^{L} U(θ_ℓ)W_ℓ and U_- = ∏_{ℓ=1}^{m} U(θ_ℓ)W_ℓ, so that U = U_+U_-, and observe that

θm|γj|2=i0|U[Vm,U+|jj|U+]U|0.\partial_{\theta_{m}}|\gamma_{j}|^{2}=i\langle 0|U_{-}^{\dagger}[V_{m},U_{+}^{\dagger}|j\rangle\langle j|U_{+}]U_{-}|0\rangle. (202)

We can compute the derivative of νk\nu_{k} with respect to θm\theta_{m} using the chain rule:

νkθm\displaystyle\frac{\partial\nu_{k}}{\partial\theta_{m}} =j=1dνk|γj|2|γj|2θm\displaystyle=\sum_{j=1}^{d}\frac{\partial\nu_{k}}{\partial|\gamma_{j}|^{2}}\frac{\partial|\gamma_{j}|^{2}}{\partial\theta_{m}} (203)
=j=1d(a,bR(k)(|γa|2δbj+δaj|γb|2))i0|U[Vm,U+|jj|U+]U|0\displaystyle=\sum_{j=1}^{d}\left(\sum_{a,b\in R(k)}\left(|\gamma_{a}|^{2}\delta_{b}^{j}+\delta_{a}^{j}|\gamma_{b}|^{2}\right)\right)i\langle 0|U_{-}^{\dagger}[V_{m},U_{+}^{\dagger}|j\rangle\langle j|U_{+}]U_{-}|0\rangle (204)
=ia,bR(k)[Tr(ρHa)Tr(ρ[Vm,Hb])+Tr(ρ[Vm,Ha])Tr(ρHb)]\displaystyle=i\sum_{a,b\in R(k)}\left[\operatorname{Tr}\left(\rho_{-}H_{a}\right)\operatorname{Tr}\left(\rho_{-}[V_{m},H_{b}]\right)+\operatorname{Tr}\left(\rho_{-}[V_{m},H_{a}]\right)\operatorname{Tr}\left(\rho_{-}H_{b}\right)\right] (205)

where ρ=U|00|U\rho_{-}=U_{-}|0\rangle\langle 0|U_{-}^{\dagger} and Hx=U+|xx|U+H_{x}=U_{+}^{\dagger}|x\rangle\langle x|U_{+}, and we used the equality

|γa|2=0|UU+|aa|U+U|0=Tr(ρHa).\displaystyle|\gamma_{a}|^{2}=\langle 0|U_{-}^{\dagger}U_{+}^{\dagger}|a\rangle\langle a|U_{+}U_{-}|0\rangle=\operatorname{Tr}\left(\rho_{-}H_{a}\right). (206)

We now show that each term in this sum vanishes by following Ref. [23] in letting UU_{-} be sampled from a distribution that forms a unitary 2-design:

𝔼UU(d)Tr(ρHa)\displaystyle\mathbb{E}_{U_{-}\sim U(d)}\operatorname{Tr}\left(\rho_{-}H_{a}\right) Tr(ρ[Vm,Hb])\displaystyle\operatorname{Tr}\left(\rho_{-}[V_{m},H_{b}]\right) (207)
=Tr(𝔼UU(d)[ρρ]Ha[Vm,Hb])\displaystyle=\operatorname{Tr}\left(\mathbb{E}_{U_{-}\sim U(d)}\left[\rho_{-}\otimes\rho_{-}\right]H_{a}\otimes[V_{m},H_{b}]\right) (208)
=1d(d+1)Tr((𝕀d+i,j=1d|ijji|)Ha[Vm,Hb])\displaystyle=\frac{1}{d(d+1)}\operatorname{Tr}\left(\left(\mathbb{I}_{d}+\sum_{i,j=1}^{d}|ij\rangle\langle ji|\right)H_{a}\otimes[V_{m},H_{b}]\right) (209)
=Tr([Vm,Hb]Ha)d(d+1)\displaystyle=\frac{\operatorname{Tr}\left([V_{m},H_{b}]H_{a}\right)}{d(d+1)} (210)
\displaystyle=\frac{\operatorname{Tr}\left(V_{m}U_{+}^{\dagger}|b\rangle\langle b|U_{+}U_{+}^{\dagger}|a\rangle\langle a|U_{+}-U_{+}^{\dagger}|b\rangle\langle b|U_{+}V_{m}U_{+}^{\dagger}|a\rangle\langle a|U_{+}\right)}{d(d+1)}   (211)
=δaba|U+VmU+|bδabb|U+VmU+|ad(d+1)\displaystyle=\frac{\delta_{a}^{b}\langle a|U_{+}V_{m}U_{+}^{\dagger}|b\rangle-\delta_{a}^{b}\langle b|U_{+}V_{m}U_{+}^{\dagger}|a\rangle}{d(d+1)} (212)
=0,\displaystyle=0, (213)

where in line (209) we have used a common expression for 𝔼UU(d)[ρρ]\mathbb{E}_{U\sim U(d)}[\rho\otimes\rho] in terms of the projector onto the symmetric subspace (e.g. [62]). Substituting this result into Eq. (205) we find

𝔼UU(d)(νkθm)=0\mathbb{E}_{U\sim U(d)}\left(\frac{\partial\nu_{k}}{\partial\theta_{m}}\right)=0 (214)

By extension, the gradient of a loss function of the form of Eq. (200) vanishes on average for sufficiently expressive state-preparation unitaries U, suggesting that variationally solving for a choice of U that induces a specific distribution of feature weights ν_k^{opt} will be infeasible in practice.
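The vanishing Haar average in Eq. (213) is straightforward to check numerically. The sketch below (our own illustration; the dimension, the random Hermitian V_m, the fixed U_+, and the basis indices a, b are arbitrary choices) samples Haar-random U_- and estimates 𝔼[Tr(ρ_-H_a)Tr(ρ_-[V_m,H_b])], which concentrates near zero.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 8

def haar_unitary(d, rng):
    # QR decomposition of a complex Ginibre matrix, with a phase fix, is Haar-distributed.
    z = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    q, r = np.linalg.qr(z)
    return q * (np.diag(r) / np.abs(np.diag(r)))

# Fixed U_+, a random Hermitian generator V_m, and the operators H_a, H_b of Eq. (213).
U_plus = haar_unitary(d, rng)
V = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
V = V + V.conj().T
a, b = 1, 3
H_a = U_plus.conj().T @ np.outer(np.eye(d)[a], np.eye(d)[a]) @ U_plus
H_b = U_plus.conj().T @ np.outer(np.eye(d)[b], np.eye(d)[b]) @ U_plus
comm = V @ H_b - H_b @ V                                 # [V_m, H_b]

vals = []
for _ in range(5000):
    U_minus = haar_unitary(d, rng)
    rho = np.outer(U_minus[:, 0], U_minus[:, 0].conj())  # U_-|0><0|U_-^dagger
    vals.append(np.trace(rho @ H_a) * np.trace(rho @ comm))

print("Haar average of the term in Eq. (213):", np.mean(vals))   # close to 0
```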

Appendix C Determining the degeneracy of quantum models

Here we will develop a theoretical framework for manipulating the degeneracy (and therefore the feature weights) of quantum models. We begin by choosing the data-encoding operator S(x) = exp(-iHx) to be separable, generated by a Hamiltonian of the form

H=𝐫𝐙=j=0n1rjZ(j),H=\mathbf{r}\cdot\mathbf{Z}=\sum_{j=0}^{n-1}r_{j}Z^{(j)}, (215)

where Z^{(j)} denotes the operator that applies a Pauli Z to the j-th register and acts trivially elsewhere, and 𝐫 ∈ ℝ^n. One can show that the diagonal elements of H are given by

λj=Hjj=2(𝐫𝐣)𝐫1,\lambda_{j}=H_{jj}=2(\mathbf{r}\cdot\mathbf{j})-\lVert\mathbf{r}\rVert_{1}, (216)

where here and elsewhere we use interchangeably a bitstring j = j_0 j_1 ⋯ j_{n-1}, its decimal value j = Σ_k j_k 2^{n-1-k} ∈ [0, 2^n - 1], and its vector representation 𝐣 ∈ {0,1}^n. The Fourier spectrum and degeneracy of the encoding strategy are then computed as

Ω\displaystyle\Omega ={λjλi:i,j=0,,2n1}\displaystyle=\{\lambda_{j}-\lambda_{i}:i,j=0,\dots,2^{n}-1\} (217)
={2(𝐫(𝐣𝐢)):𝐢,𝐣{0,1}n},\displaystyle=\{2(\mathbf{r}\cdot(\mathbf{j}-\mathbf{i})):\mathbf{i},\mathbf{j}\in\{0,1\}^{n}\}, (218)
\displaystyle R(\omega)=\{(i,j):2\mathbf{r}\cdot(\mathbf{j}-\mathbf{i})=\omega\}.   (219)

Note that the subtraction in the definition of Ω is not performed modulo 2, i.e. (𝐣−𝐢) ∈ {−1,0,1}^n. From Eq. (217) we see that the largest possible size of |Ω| is 3^n, the number of distinct values of (𝐣−𝐢). Any encoding strategy of the form of Eq. (215) therefore introduces a combinatorial degeneracy, since the map (𝐢,𝐣) ↦ (𝐣−𝐢) on {0,1}^n × {0,1}^n is not injective, with each value (𝐣−𝐢) occurring with multiplicity

#(𝐣𝐢)=2n(𝐣𝐢)1.\#(\mathbf{j}-\mathbf{i})=2^{n-\lVert(\mathbf{j}-\mathbf{i})\rVert_{1}}. (220)

To prove this, construct the set {(𝐣−𝐢) : 𝐢,𝐣 ∈ {0,1}^n} by reflecting the hypercube {0,1}^n over 2^n distinct axes and counting the number of vertices shared among the reflected images – this gives the desired degeneracy. Many choices of 𝐫 will reduce |Ω| by shrinking the image of the map (𝐢,𝐣) → 𝐫⋅(𝐣−𝐢). Conversely, the spectrum Ω saturates |Ω| = 3^n only under particular conditions on 𝐫. For example, with n = 2, |Ω| = 9 is achieved only if 𝐫 satisfies all of the following conditions:

r00,r0±12r1,r0±r1,r0±2r1,\displaystyle r_{0}\neq 0,\qquad r_{0}\neq\pm\frac{1}{2}r_{1},\qquad r_{0}\neq\pm r_{1},\qquad r_{0}\neq\pm 2r_{1},\qquad (221)

i.e., the dot product 𝐫⋅𝐝 can be inverted with respect to the input 𝐝 if the elements of 𝐫 satisfy r_j ≠ r_i mod 2^p for some integer p; a more general, recursive statement of this property is provided in Ref. [44]. Requirements of this form may be proven using the frame analysis operator T_n ∈ ℤ^{3^n × n}, the rows of which contain each element of {−1,0,1}^n. To demonstrate Eq. (221) above, let 𝐤 ∈ ℝ^{3^n} be a vector containing the (repeated) frequencies in Ω; then T_2 𝐫 = 𝐤 ∈ ℝ^9 has unique entries only under specific conditions on 𝐫. Viewing 𝐤 as a frame expansion of 𝐫 also provides a direct way to tune the elements of Ω: given a length-3^n vector 𝐤 containing the (not necessarily unique) desired elements of Ω, one can compute H by solving T_n 𝐫 = 𝐤 for 𝐫 (for instance via the pseudoinverse of T_n).
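The frame-operator viewpoint is easy to explore numerically. The snippet below (our own sketch; the choices of 𝐫 are illustrative and not taken from Ref. [44]) builds T_2 and counts the distinct frequencies 2𝐫⋅(𝐣−𝐢) for several vectors 𝐫, showing that |Ω| = 9 is reached only when the conditions of Eq. (221) hold.

```python
import numpy as np
from itertools import product

def frame_operator(n):
    # Rows of T_n enumerate every element of {-1, 0, 1}^n.
    return np.array(list(product([-1, 0, 1], repeat=n)))

T2 = frame_operator(2)                       # shape (9, 2)

for r in [np.array([1.0, 1.0]),              # violates r0 != r1: only 5 frequencies
          np.array([2.0, 1.0]),              # violates r0 != 2 r1: only 7 frequencies
          np.array([np.pi, 1.0])]:           # generic choice: saturates |Omega| = 9
    k = 2 * T2 @ r                           # candidate frequencies 2 r . (j - i)
    print(r, "->", len(np.unique(np.round(k, 12))), "distinct frequencies")
```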

C.1 Hamming encoding strategy

We instantiate the Hamiltonian of Eq. (215) for

𝐫=12(1,1,,1)n.\mathbf{r}=\frac{1}{2}(1,1,\dots,1)\in\mathbb{R}^{n}.

This introduces many degeneracies into the spectrum; using Tn𝐫=𝐤T_{n}\mathbf{r}=\mathbf{k} we immediately recover the result of Ref. [30] that the unique elements of 𝐤\mathbf{k} are given by

Ω={n,(n1),,0,,n1,n},\Omega=\{-n,-(n-1),\dots,0,\dots,n-1,n\}, (222)

with |Ω|=2n+1|\Omega|=2n+1. To recover the degeneracy of each frequency we compute

\displaystyle\lambda_{j}-\lambda_{i}=\frac{1}{2}\sum_{t=1}^{n}(2j_{t}-1)-\frac{1}{2}\sum_{t=1}^{n}(2i_{t}-1)=w(j)-w(i),   (223)

where w : {0,1}^n → {0,1,…,n} is the weight of a bitstring b = b_{n-1} ⋯ b_1 b_0 ∈ {0,1}^n, defined as

w(b)=\sum_{k=0}^{n-1}b_{k},   (224)

and when bb is written as an integer it should be interpreted according to its binary value. It will be useful to consider subsets of bitstrings having a constant weight. For each k[n]k\in[n] (results for k<0k<0 follow by symmetry), we now define the kk-weight set W(k)W(k) as

W(k)={j{0,1}n:w(j)=k}.W(k)=\{j\in\{0,1\}^{n}:w(j)=k\}. (225)

Then the degeneracy set is exactly the set of index pairs whose weights differ by k:

R(k)\displaystyle R(k) ={(i,j){0,1}n×{0,1}n:w(j)w(i)=k}\displaystyle=\{(i,j)\in\{0,1\}^{n}\times\{0,1\}^{n}:w(j)-w(i)=k\} (226)
\displaystyle=\bigcup_{p=0}^{n-k}W(p)\times W(p+k),   (227)

where × denotes the Cartesian product (this motivates the name "Hamming encoding strategy", as the frequencies and degeneracies of this model arise from pairs of indices whose Hamming weights differ by a fixed amount). From counting arguments, the degeneracy of this model is then

|R(k)|=\sum_{j=0}^{n-k}\binom{n}{j}\binom{n}{k+j}=\binom{2n}{n-k}.   (228)
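Eq. (228) is easy to verify by brute force; the short sketch below (our own check, with n = 5 chosen arbitrarily) counts pairs of bitstrings whose weights differ by k and compares against the closed form.

```python
from itertools import product
from math import comb

n = 5
bitstrings = list(product([0, 1], repeat=n))

for k in range(n + 1):
    brute = sum(1 for i in bitstrings for j in bitstrings if sum(j) - sum(i) == k)
    print(k, brute, comb(2 * n, n - k))   # both columns agree: |R(k)| = C(2n, n-k)
```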

C.2 Binary encoding strategy

We instantiate the Hamiltonian of Eq. (215) for

𝐫=12(2n1,,4,2,1)n,\mathbf{r}=\frac{1}{2}(2^{n-1},\dots,4,2,1)\in\mathbb{R}^{n},

such that 2𝐫⋅𝐣 is the decimal value of the bit vector 𝐣. We know from relations like those given in Eq. (221) that |Ω| ≪ 3^n, and indeed using Eq. (216) we find that λ_j = (2j − (2^n − 1))/2, resulting in a frequency spectrum of

\Omega=\{-2^{n}+1,-2^{n}+2,\dots,0,1,\dots,2^{n}-1\},   (229)

such that |Ω| = 2^{n+1} − 1, and the degenerate index set of the kernel for frequency k is equivalent (up to permutations) to the indices of the nonzero elements of an elementary Toeplitz matrix with 1's on the k-th diagonal:

R(k)=\{(i,j):i=j-k,\;\;i,j\in[2^{n}]\}.   (230)
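A minimal check of Eqs. (229)–(230) under the eigenvalues λ_j = (2j − (2^n − 1))/2 derived above (our own verification script, with n = 3 chosen arbitrarily): the index pairs at frequency k populate the k-th diagonal, with |R(k)| = 2^n − |k|.

```python
import numpy as np

n = 3
d = 2 ** n
lam = (2 * np.arange(d) - (d - 1)) / 2       # lambda_j = (2j - (2^n - 1)) / 2

diffs = lam[None, :] - lam[:, None]          # entry (i, j) holds lambda_j - lambda_i
for k in range(-(d - 1), d):
    R_k = np.argwhere(np.isclose(diffs, k))  # index pairs (i, j) at frequency k
    assert len(R_k) == d - abs(k)            # |R(k)| = d - |k|
    assert all(j - i == k for i, j in R_k)   # i = j - k, Eq. (230)
print("Binary encoding: |Omega| =", 2 * d - 1, "with |R(k)| = d - |k| verified")
```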

C.3 Ternary encoding strategy

We instantiate the Hamiltonian of Eq. (215) for

𝐫=12(3n1,,9,3,1)n,\mathbf{r}=\frac{1}{2}(3^{n-1},\dots,9,3,1)\in\mathbb{R}^{n},

resulting in the spectrum

Ω={12(3n1),12(3n1)+1,,0,,12(3n1)}.\Omega=\Bigl{\{}-\frac{1}{2}(3^{n}-1),-\frac{1}{2}(3^{n}-1)+1,\dots,0,\dots,\frac{1}{2}(3^{n}-1)\Bigr{\}}. (231)

This spectrum (also studied in Ref. [44]) is interesting because it is the unique choice giving rise to a Fourier spectrum with frequencies spaced by 1 that saturates the |Ω| = 3^n limit for spectra generated by separable Hamiltonians. Given k = 2𝐫⋅𝐝, the difference vector 𝐝 = 𝐣 − 𝐢 may be uniquely recovered from k. Suppose k has a ternary representation

k=k_{n-1}\dots k_{1}k_{0}=\sum_{j=0}^{n-1}k_{j}3^{j},   (232)

with digits k_j ∈ {0,1,2}. Defining the operation t : ℤ → {0,1,2}^n that converts a decimal in [0, 3^n − 1] to its vector of ternary digits according to t(k) = (k_0, k_1, …, k_{n-1}), the degeneracy of this encoding strategy follows directly from the combinatorial degeneracy given in Eq. (220):

|R(k)|=2nT(k)1,\displaystyle|R(k)|=2^{n-\lVert T(k)\rVert_{1}}, (233)

where the shifted ternary operation is defined for k ∈ Ω as

T(k)=t(k+𝐫1)𝟏,\displaystyle T(k)=t\left(k+\lVert\mathbf{r}\rVert_{1}\right)-\mathbf{1}, (234)

and 𝟏 = (1,1,…,1) ∈ ℤ^n is the array of all ones. Intuitively, this computation recovers the signed ternary digits of k, i.e. the unique expansion k = Σ_j T(k)_j 3^j with T(k)_j ∈ {−1,0,1}, from the unsigned ternary representation of Eq. (232).
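The shifted-ternary formula of Eqs. (233)–(234) can be checked against brute-force counting; the sketch below (our own, with n = 3 and a few sample frequencies chosen arbitrarily) computes the signed ternary digits of k and compares 2^{n − ∥T(k)∥_1} with a direct count over index pairs.

```python
from itertools import product

n = 3
r = [3 ** (n - 1 - t) / 2 for t in range(n)]        # r = (3^{n-1}, ..., 3, 1) / 2
shift = int(sum(r))                                  # ||r||_1 = (3^n - 1) / 2

def signed_digits(k):
    # T(k): ternary digits of k + ||r||_1, each shifted down by 1, i.e. digits in {-1, 0, 1}.
    m = k + shift
    digits = []
    for _ in range(n):
        digits.append(m % 3 - 1)
        m //= 3
    return digits

for k in range(0, 3 ** n // 2 + 1, 4):               # a few nonnegative frequencies
    brute = sum(1 for i in product([0, 1], repeat=n) for j in product([0, 1], repeat=n)
                if 2 * sum(rt * (jt - it) for rt, it, jt in zip(r, i, j)) == k)
    formula = 2 ** (n - sum(abs(dd) for dd in signed_digits(k)))
    print(k, brute, formula)   # both columns agree
```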

C.4 Nonseparable models

Given a data-encoding Hamiltonian H = diag(Λ) with Λ ∈ ℝ^d, a key feature of the associated spectrum is that it is preserved under permutations of Λ. This means that many models whose S(x) is generated by a separable Hamiltonian (a sum of single-qubit terms as in Eq. (215)) produce the same spectrum as models whose S(x) is generated by a nonseparable Hamiltonian. For instance, letting ∼ denote equivalence up to permutation and additive shift, it holds that I⊗Z ∼ Z⊗Z, so that the data-encoding unitaries

S1(x)\displaystyle S_{1}(x) =exp(i2πr(IZ)x),\displaystyle=\exp(i2\pi r(I\otimes Z)x), (235)
S2(x)\displaystyle S_{2}(x) =exp(i2πr(ZZ)x),\displaystyle=\exp(i2\pi r(Z\otimes Z)x), (236)

have identical spectrum and degeneracy properties (differing only in their corresponding degeneracy index sets RR). Understanding the classes of nonseparable data-encoding Hamiltonians that are equivalent to a separable encoding strategy up to permutation and translation may be an interesting problem for future work.

We therefore limit our discussion of nonseparable models to a particular choice of Hamiltonian which is provably inequivalent to any separable Hamiltonian, as it saturates the largest possible |Ω||\Omega| (and smallest values of |R(k)||R(k)|) available to any single-layer quantum model. This can be achieved by setting the diagonal of the data-encoding Hamiltonian as the elements of a Golomb ruler [45, 63]. The resulting spectrum is nondegenerate for all nonzero frequencies, though it is straightforward to prove that one cannot achieve uniform spacing in the spectrum (i.e., a Perfect Golomb ruler) for d5d\geq 5. As a result, the corresponding spectrum of this model generally exhibits gaps between frequencies. Further exploration of the connections to concepts from radio engineering [64, 65] and classical coding theory [66] may enrich investigations into the spectral properties of quantum models.
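As a concrete illustration (our own sketch; the particular ruler {0, 1, 4, 9, 11} is a standard order-5 Golomb ruler rather than a choice made above), placing a Golomb ruler on the diagonal of H yields a spectrum in which every nonzero frequency is nondegenerate, at the cost of gaps between frequencies.

```python
import numpy as np
from collections import Counter

ruler = [0, 1, 4, 9, 11]                    # an optimal Golomb ruler of order 5
lam = np.array(ruler)

diffs = Counter(int(lj - li) for li in lam for lj in lam if lj != li)
assert all(count == 1 for count in diffs.values())    # every nonzero frequency is nondegenerate

omega = sorted(set(diffs) | {0})
print("Omega =", omega)                      # gaps appear, e.g. +/-6 is missing for this ruler
print("|Omega| =", len(omega), "= d(d-1) + 1 =", len(ruler) * (len(ruler) - 1) + 1)
```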

Appendix D Demonstration of benign overfitting in the general quantum model

In this section we demonstrate an example of benign overfitting in the general quantum model by explicitly constructing a sequence of state preparation unitaries U for a particular choice of data-encoding unitary S(x). We consider a d-dimensional version of the Binary encoding strategy constructed from a data-encoding Hamiltonian H = diag(0, 1, 2, …, d−1) that gives rise to

\displaystyle\Omega=\{-(d-1),\dots,0,\dots,d-1\},   (237)
|R(k)|\displaystyle|R(k)| =d|k|,\displaystyle=d-|k|, (238)

for k ∈ Ω, and we will require d to be even for simplicity. Suppose the target function has a band-limited spectrum Ω_{n_0} of size n_0, with n_0 < n < d for some integer n, and that (n_0 + 1) mod 4 = 0. Define the constants

c1\displaystyle c_{1} =d2n0+14,\displaystyle=\frac{d}{2}-\frac{n_{0}+1}{4}, (239)
c2\displaystyle c_{2} =d2+n0+14.\displaystyle=\frac{d}{2}+\frac{n_{0}+1}{4}. (240)

We then choose a constant a[0,2/(n0+1)]a\in[0,2/(n_{0}+1)] and prepare |Γ|\Gamma\rangle with elements given by

γj={a,j[c1,c2),b,j[0,c1)[c2,d),\displaystyle\gamma_{j}=\begin{cases}\sqrt{a},&j\in[c_{1},c_{2}),\\ \sqrt{b},&j\in[0,c_{1})\cup[c_{2},d),\end{cases} (241)

where normalization requires that

b=1n0+12adn0+12.b=\frac{1-\frac{n_{0}+1}{2}a}{d-\frac{n_{0}+1}{2}}. (242)

For this encoding strategy, the optimal feature weights of the quantum model corresponding to positive frequencies kΩ+k\in\Omega_{+} are given by

νk=j=0dk1|γj|2|γj+k|2.\nu_{k}=\sum_{j=0}^{d-k-1}|\gamma_{j}|^{2}|\gamma_{j+k}|^{2}. (243)

Counting arguments and algebraic simplification lead to

\nu_{k}=\begin{cases}(d-B_{0}-2k)b^{2}+2kab+(B_{0}-k)a^{2},&0\leq k<c_{2}-c_{1},\\ (d-2B_{0}-k)b^{2}+2B_{0}ab,&c_{2}-c_{1}\leq k<c_{1},\\ (k-B_{0})b^{2}+(d+B_{0}-2k)ab,&c_{1}\leq k<c_{2},\\ (d-k)b^{2},&c_{2}\leq k<d,\end{cases}   (244)

where B0=(n0+1)/2B_{0}=(n_{0}+1)/2 and νk+1νk\nu_{k+1}\leq\nu_{k} for all k>0k>0. Noting that b=𝒪(d1)b=\mathcal{O}(d^{-1}) and aa is constant, we find that νk\nu_{k} has three distinct scaling regimes: νk=Ω(1)\nu_{k}=\Omega(1) for kΩn0k\in\Omega_{n_{0}}, νk=Θ(d2)\nu_{k}=\Theta\left(d^{-2}\right) when k[c2,d)k\in[c_{2},d), and νk=𝒪(d1)\nu_{k}=\mathcal{O}(d^{-1}) otherwise. We can use this behavior to bound the generalization error of the quantum model in a manner analogous to Sec. A.3.1. Defining m=|S(k)|1m=|S(k)|-1, we bound the bias according to

bias2\displaystyle\textsc{bias}^{2} =kΩn0|g^k|2(S(k)\kν)2+S(k)\kν2(S(k)ν)2\displaystyle=\sum_{k\in\Omega_{n_{0}}}|\hat{g}_{k}|^{2}\frac{\left(\sum_{\ell\in S(k)\backslash k}\nu_{\ell}\right)^{2}+\sum_{\ell\in S(k)\backslash k}\nu_{\ell}^{2}}{\left(\sum_{\ell\in S(k)}\nu_{\ell}\right)^{2}} (245)
kΩn0|g^k|2m(m+1)(db2+2B0ab)2(a2+mb2)2\displaystyle\leq\sum_{k\in\Omega_{n_{0}}}|\hat{g}_{k}|^{2}\frac{m(m+1)(db^{2}+2B_{0}ab)^{2}}{(a^{2}+mb^{2})^{2}} (246)
[m(m+1)b2](db+2B0aa2+mb2)2P\displaystyle\leq\left[m(m+1)b^{2}\right]\left(\frac{db+2B_{0}a}{a^{2}+mb^{2}}\right)^{2}P (247)
=𝒪(1n2),\displaystyle=\mathcal{O}\left(\frac{1}{n^{2}}\right), (248)

where we have defined P:=kΩn0|g^k|2P:=\sum_{k\in\Omega_{n_{0}}}|\hat{g}_{k}|^{2}. In line (246) we have used the monotonicity of νk\nu_{k} to bound the components of S(k)\kS(k)\backslash k in the numerator according to the upper bound for the interval k[c2c1,c1)k\in[c_{2}-c_{1},c_{1}) and applied a lower bound for elements of S(k)\kS(k)\backslash k in the denominator using the interval k[c2,d)k\in[c_{2},d). The scaling of Eq. (248) follows from m=𝒪(d/n)m=\mathcal{O}(d/n), which implies that m(m+1)b2=𝒪(n2)m(m+1)b^{2}=\mathcal{O}(n^{-2}), while the remaining terms in line (247) scale as no greater than a constant. To bound the variance, we observe

kΩn0S(k)ν2(S(k)ν)2\displaystyle\sum_{k\in\Omega_{n_{0}}}\frac{\sum_{\ell\in S(k)}\nu_{\ell}^{2}}{\left(\sum_{\ell\in S(k)}\nu_{\ell}\right)^{2}} (249)
kΩn0[a2(B0k)+2kab+(dB02k)b2]2+[(d2B0k)b2+2B0ab]2(a2+mb2)2\displaystyle\leq\sum_{k\in\Omega_{n_{0}}}\frac{\left[a^{2}(B_{0}-k)+2kab+(d-B_{0}-2k)b^{2}\right]^{2}+\left[(d-2B_{0}-k)b^{2}+2B_{0}ab\right]^{2}}{(a^{2}+mb^{2})^{2}} (250)
n0(a2B0+n0ab+db2)2+m(db2+2B0ab)2(a2+mb2)2\displaystyle\leq n_{0}\frac{(a^{2}B_{0}+n_{0}ab+db^{2})^{2}+m(db^{2}+2B_{0}ab)^{2}}{(a^{2}+mb^{2})^{2}} (251)
=𝒪(1),\displaystyle=\mathcal{O}(1), (252)

which follows identically to the arguments used to reach Eq. (248), together with the observation that 2k ≤ n_0 for k ∈ Ω_{n_0}. Furthermore, defining m′ = ⌊(c_1 − n_0)/n⌋ as a lower bound on the number of aliases ℓ ∈ S(k) with ℓ < c_1 for any k ∈ Ω_n∖Ω_{n_0}, we have

kΩn\Ωn0S(k)ν2(S(k)ν)2\displaystyle\sum_{k\in\Omega_{n}\backslash\Omega_{n_{0}}}\frac{\sum_{\ell\in S(k)}\nu_{\ell}^{2}}{\left(\sum_{\ell\in S(k)}\nu_{\ell}\right)^{2}} (m+1)(db2+2B0ab)2(m[(d2B0k)b2+2B0ab]+(m+1m)b2)2\displaystyle\leq\frac{(m+1)(db^{2}+2B_{0}ab)^{2}}{\left(m^{\prime}\left[(d-2B_{0}-k)b^{2}+2B_{0}ab\right]+(m+1-m^{\prime})b^{2}\right)^{2}} (253)
nn0(m+1)(db+2B0a)2(mm+1[(d2B0c1)b+2B0a]+(1mm+1)b)2\displaystyle\leq\frac{n-n_{0}}{(m+1)}\frac{(db+2B_{0}a)^{2}}{\left(\frac{m^{\prime}}{m+1}\left[(d-2B_{0}-c_{1})b+2B_{0}a\right]+\left(1-\frac{m^{\prime}}{m+1}\right)b\right)^{2}} (254)
=𝒪(n2d),\displaystyle=\mathcal{O}\left(\frac{n^{2}}{d}\right), (255)

where in line (254) we have split the denominator into frequencies n_0/2 ≤ k < c_1 and c_1 ≤ k < d, bounding ν_k for the latter by b^2. Line (255) follows from observing that m′ = Θ(d/n), which implies m′/m = Θ(1), so that the second factor in line (254) is bounded as 𝒪(1).

var=𝒪(1n+nd).\textsc{var}=\mathcal{O}\left(\frac{1}{n}+\frac{n}{d}\right). (256)

It follows that the minimum-‖⋅‖_F interpolating quantum model with state preparation unitary U satisfying Eq. (241) (up to permutations) and data encoded using the Binary encoding strategy will exhibit benign overfitting as long as the dimensionality of the model scales as d = ω(n) (e.g., if the number of qubits satisfies n_q = ω(log n)), in which case Eqs. (248) and (256) characterizing the generalization error of the model both vanish in the large-n limit.
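To make the construction concrete, the sketch below (our own numerical check; d = 256, n_0 = 7, and a = 1/(n_0 + 1) are arbitrary choices satisfying the stated constraints) builds |Γ⟩ according to Eqs. (241)–(242), evaluates ν_k directly from Eq. (243), and confirms the normalization, the monotone decay of ν_k, and the separation between the Ω(1) and Θ(d^{-2}) regimes.

```python
import numpy as np

d, n0 = 256, 7                       # (n0 + 1) % 4 == 0
B0 = (n0 + 1) // 2
c1, c2 = d // 2 - (n0 + 1) // 4, d // 2 + (n0 + 1) // 4
a = 1.0 / (n0 + 1)                   # any a in [0, 2 / (n0 + 1)]
b = (1 - (n0 + 1) / 2 * a) / (d - (n0 + 1) / 2)      # Eq. (242)

p = np.full(d, b)
p[c1:c2] = a                         # |gamma_j|^2 per Eq. (241)
assert np.isclose(p.sum(), 1.0)      # normalization

nu = np.array([sum(p[j] * p[j + k] for j in range(d - k)) for k in range(1, d)])  # Eq. (243)
assert np.all(np.diff(nu) <= 1e-15)                  # nu_{k+1} <= nu_k
print("nu_1 / nu_{d-1} =", nu[0] / nu[-1])            # separates the Omega(1) and Theta(d^-2) regimes
```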

One convenience of this demonstration of benign overfitting in a quantum model is that the state |Γ⟩ of Eq. (241) incorporates knowledge of the target function (namely the band-limited spectrum Ω_{n_0}). This could be considered a limitation of the approach, as it imposes an inductive bias on the resulting interpolating model f^{opt}, in contrast to other examples of benign overfitting that are more agnostic to the underlying distribution [6, 59]. Future work could reveal choices of |Γ⟩ that are more data-independent (with no explicit dependence on n_0) but give rise to feature weights ν_k^{opt} with the same desirable properties as Eq. (244). As described in Appendix A.3.2, these desirable properties include placing more weight on all ν_k with k ∈ Ω_{n_0} together with a long, thin tail of feature weights for all other k ∈ Ω_d∖Ω_{n_0}.