
Generalization despite overfitting in quantum machine learning models

Evan Peters (e6peters@uwaterloo.ca), Department of Physics, University of Waterloo, Waterloo, ON, N2L 3G1, Canada; Institute for Quantum Computing, Waterloo, ON, N2L 3G1, Canada; Perimeter Institute for Theoretical Physics, Waterloo, Ontario, N2L 2Y5, Canada. Maria Schuld, Xanadu, Toronto, ON, M5G 2C8, Canada.
Abstract

The widespread success of deep neural networks has revealed a surprise in classical machine learning: very complex models often generalize well while simultaneously overfitting training data. This phenomenon of benign overfitting has been studied for a variety of classical models with the goal of better understanding the mechanisms behind deep learning. Characterizing the phenomenon in the context of quantum machine learning might similarly improve our understanding of the relationship between overfitting, overparameterization, and generalization. In this work, we provide a characterization of benign overfitting in quantum models. To do this, we derive the behavior of a classical interpolating Fourier features model for regression on noisy signals, and show how a class of quantum models exhibits analogous features, thereby linking the structure of quantum circuits (such as data-encoding and state preparation operations) to overparameterization and overfitting in quantum models. We intuitively explain these features according to the ability of the quantum model to interpolate noisy data with locally “spiky” behavior and provide a concrete demonstration of benign overfitting.

1 Introduction

A long-standing paradigm in machine learning is the trade-off between the complexity of a model family and the model’s ability to generalize: more expressive model classes contain better candidates to fit complex trends in data, but are also prone to overfitting noise [1, 2]. Interpolation, defined for our purposes as choosing a model with zero training error, was hence long considered bad practice [3]. The success of deep learning – machine learning in a specific regime of extremely complex model families with vast amounts of tunable parameters – seems to contradict this notion; here, consistent evidence shows that among some interpolating models, more complexity tends not to harm the generalisation performance (that interpolation does not always harm generalization is in fact well known for models like boosting and nearest neighbours, which are both interpolating models), a phenomenon described as “benign overfitting” [4].

In recent years, a surge of theoretical studies has reproduced benign overfitting in simplified settings with the hope of isolating the essential ingredients of the phenomenon [4, 5]. For example, Ref. [6] showed how interpolating linear models in a high complexity regime (more dimensions than datapoints) could generalize just as well as their lower-complexity counterparts on new data, and analyzed the properties of the data that lead to the “absorption” of noise by the interpolating model without harming the model’s predictions. Ref. [7] showed that there are model classes of simple functions that change quickly in the vicinity of the noisy training data, but recover a smooth trend elsewhere in data space (see Figure 1). Such functions have also been used to train nearest neighbor models that perfectly overfit training data while generalizing well, thereby directly linking “spiking models” to benign overfitting [8]. Recent works try to recover the basic mechanism of such spiking models using the language of Fourier analysis [9, 10, 11].

In parallel to these exciting developments in the theory of deep learning, quantum computing researchers have proposed families of parametrised quantum algorithms as model classes for machine learning (e.g. Ref. [12]). These quantum models can be optimised similarly to neural networks [13, 14] and have interesting parallels to kernel methods [15, 16] and generative models [17, 18]. Although researchers have taken some first steps to study the expressivity [19, 20, 21, 22], trainability [23, 24] and generalisation [25, 26, 27, 28] of quantum models, we still know relatively little about their behaviour. In particular, the interplay of overparametrisation, interpolation, and generalisation that seems so important for deep learning remains largely unexplored.

In this paper we develop a simplified framework in which questions of overfitting in quantum machine learning can be investigated. Essentially, we exploit the observation that quantum models can often be described in terms of Fourier series where well-defined components of the quantum circuit influence the selection of Fourier modes and their respective Fourier coefficients [29, 30, 31]. We link this description to the analysis of spiking models and benign overfitting by building on prior works analyzing these phenomena using Fourier methods. In this approach, the complexity of a model is related to the number of Fourier modes that its Fourier series representation consists of, and overparametrised model classes have more modes than needed to interpolate the training data (i.e., to have zero training error). After deriving the generalization error for such model classes, we show that these “superfluous” modes lead to spiking models, which oscillate strongly around the training data while keeping a smooth trend everywhere else. However, large numbers of modes can also harm the recovery of an underlying signal, and we therefore balance this trade-off to produce an explicit example of benign overfitting in a quantum machine learning model.

The mathematical link described above allows us to probe the impact of important design choices for a simplified class of quantum models on this trade-off. For example, we find that a measure of redundancy in the spectrum of the Hamiltonian defining standard data-encoding strategies strongly influences this balance, in fact to an extent that is difficult to counterbalance by other design choices of the circuit.

The remainder of the paper proceeds as follows. We will first review the classical Fourier framework for the study of interpolating models and develop explicit formulae for the error in these models to produce a basic example of benign overfitting (Sec. 2). We will then construct a quantum model with analogous components to the classical model, and demonstrate how each of these components is related to the structure of the corresponding quantum circuit and measurement (Sec. 3). We then analyze specific cases that give rise to “spikiness” and benign overfitting in these quantum models (Sec. 3.2).

Figure 1: Intuition behind the phenomenon of benign overfitting with spiking models. a Training a model typically involves a trade-off between fitting noisy training data well and recovering an underlying target function. b Traditional learning theory associates interpolating models (reaching zero training error by fitting every data point) with low generalization capability (or high test error), since they fail to recover a simple target function. c “Spiking models” that change quickly in the vicinity of training data but otherwise exhibit simple behavior can explain how both kinds of errors may be kept low.

2 Interpolating models in the Fourier framework

In this section we will provide the essential tools to probe the phenomenon of overparametrized models that exhibit the spiking behaviour from Figure 1 using the language of Fourier series. We will review and formalize the problem setting and several examples from Refs. [9, 10, 11], before extending their framework by incorporating standard results from linear regression to derive closed-form error behavior and examples of benign overfitting.

2.1 Setting up the learning problem

We are interested in functions $g$ defined on a finite interval that may be written in terms of a linear combination of Fourier basis functions or modes $e^{i2\pi kx}$ (each describing a complex sinusoid with integer-valued frequency $k$) weighted by their corresponding Fourier coefficients $\hat{g}_{k}$:

g(x)=\sum_{k=-\infty}^{\infty}\hat{g}_{k}e^{i2\pi kx}.  (1)

We restrict our attention to well-behaved functions that are sufficiently smooth and continuous to be expressed in this form. We will now set up a simple learning problem whose basic components – the model and target function to be learned – can be expressed as Fourier series with only few non-zero coefficients, and define concepts such as overparametrization and interpolation.

Consider a machine learning problem in which data is generated by a target function of the form

g(x)=\sum_{k\in\Omega_{n_{0}}}\hat{g}_{k}e^{i2\pi kx}  (2)

which only contains frequencies in the discrete, integer-valued spectrum

\Omega_{n_{0}}=\Bigl\{-\frac{n_{0}-1}{2},\,-\frac{n_{0}-1}{2}+1,\,\dots,\,0,\,\dots,\,\frac{n_{0}-1}{2}\Bigr\}  (3)

for some odd integer $n_{0}$. We call functions of this form band-limited with bandwidth $n_{0}/2$ (that is, $|k|<n_{0}/2$ for all frequencies $k\in\Omega_{n_{0}}$). The bandwidth limits the complexity of a function, and will therefore be important for exploring scenarios where a complex model is used to overfit a less complex target function. We are provided with $n$ noisy samples $y_{j}=g(x_{j})+\epsilon_{j}$ of the target function $g$ evaluated at points $x_{j}=j/n$ spaced uniformly on the interval $[0,1]$ (we assume without loss of generality that the input data has been rescaled to $[0,1]$), where $n>n_{0}$ and we require $n$ to be odd for simplicity.

The model class we consider likewise contains band-limited Fourier series, and since we are interested in interpolating models, we always assume that they have enough modes to interpolate the noisy training data, namely:

f(x)=\sum_{k\in\Omega_{d}}\alpha_{k}\sqrt{\nu_{k}}\,e^{i2\pi kx},  (4)
\text{such that}\quad f(x_{j})=y_{j}\text{ for all }j\in[n].

Similarly to Eq. (3) we define the spectrum

\Omega_{d}=\Bigl\{-\frac{d-1}{2},\,-\frac{d-1}{2}+1,\,\dots,\,0,\,\dots,\,\frac{d-1}{2}\Bigr\}.  (5)

Following Ref. [9], this model class has two components: the set of weighted Fourier basis functions $\sqrt{\nu_{k}}\,e^{i2\pi kx}$ describes a feature map applied to $x$ for some set of feature weights $\nu_{k}\in\mathbb{R}^{+}$, while the trainable weights $\alpha_{k}\in\mathbb{C}$ are optimized to ensure that $f$ interpolates the data. The theory of trigonometric polynomial interpolation [32] ensures that $f$ can always interpolate the training data for some choice of trainable weights $\alpha_{k}$ under these conditions. In the following, we will therefore usually consider the $\alpha_{k}$ as being determined by the data and the interpolation condition, while the $\nu_{k}$ serve as our “turning knobs” to create different settings in which to study spiking properties and benign overfitting. We call the model class described in Eq. (4) overparameterized when the degree of the Fourier series is much larger than the degree of the target function, $d\gg n_{0}$, in which case the model has many more frequencies available for fitting than are present in the underlying signal.

Note that one can rewrite the Fourier series of Eq. (4) in a linear form

f(x)=\langle\boldsymbol{\alpha},\phi(x)\rangle  (6)

where

\boldsymbol{\alpha}=(\alpha_{-(d-1)/2},\dots,\alpha_{k},\dots,\alpha_{(d-1)/2})\in\mathbb{C}^{d},  (7)
\phi(x)=\left(\sqrt{\nu_{-(d-1)/2}}\,e^{-i\pi(d-1)x},\,\dots,\,\sqrt{\nu_{k}}\,e^{i2\pi kx},\,\dots,\,\sqrt{\nu_{(d-1)/2}}\,e^{i\pi(d-1)x}\right)\in\mathbb{C}^{d}.  (8)

From this perspective, optimizing $f$ amounts to learning the trainable weights $\boldsymbol{\alpha}$ by performing regression on observations $\phi(x)$ sampled from a random Fourier features model [33], for which $\nu_{k}$ and $\phi(x)_{k}$ are precisely the eigenvalues and eigenfunctions of a shift-invariant kernel [34].

To complete the problem setup, we have to impose one more constraint. Consider that $\exp(i2\pi kx_{j})$ with frequency $k<n/2$ evaluated at the uniformly spaced points $x_{j}$ is equal to $\exp(i2\pi k^{\prime}x_{j})$ for any alias frequency $k^{\prime}\equiv k\ (\mathrm{mod}\ n)$. The presence of these aliases means that the model class described in Eq. (6) contains many interpolating solutions in the overparameterized regime. Motivated by prior work exploring benign overfitting for linear features [6], Fourier features [9, 35], and other nonlinear features [36, 37], we will study the minimum-$\ell_{2}$-norm interpolating solution,

\boldsymbol{\alpha}^{opt}=\operatorname*{arg\,min}_{\boldsymbol{\alpha}:\,f(x_{j})=y_{j}\,\forall\,j}\lVert\boldsymbol{\alpha}\rVert_{2}.  (9)

Minimizing the $\ell_{2}$ norm is a typical choice for imposing a penalty on complex functions (regularization) in the underparameterized regime, though we will see that this intuition does not carry over to the overparameterized regime. The remainder of this section will explore how this learning problem results in a trade-off in interpolating Fourier models: overparameterization introduces alias frequencies that increase the error in fitting simple target functions, but can also reduce error by absorbing noise into high-frequency modes with spiking behavior.
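To make this setup concrete, the following minimal NumPy sketch (all parameter values and the target $g(x)=\sin(2\pi\cdot 2x)$ are illustrative choices, not prescriptions from this paper) builds the weighted Fourier feature map of Eq. (8), samples noisy data, and computes the minimum-$\ell_{2}$-norm interpolating weights of Eq. (9) via the Moore–Penrose pseudoinverse:

    import numpy as np

    rng = np.random.default_rng(0)
    n, n0, d, sigma = 31, 5, 155, 0.3       # samples, target degree, model degree, noise level
    x_train = np.arange(n) / n              # uniformly spaced points x_j = j/n
    freqs = np.arange(-(d - 1) // 2, (d - 1) // 2 + 1)      # spectrum Omega_d
    nu = np.ones(d) / d                     # feature weights nu_k (uniform, for illustration)

    g = lambda x: np.sin(2 * np.pi * 2 * x)                 # band-limited target function
    y = g(x_train) + sigma * rng.normal(size=n)             # noisy samples y_j = g(x_j) + eps_j

    # rows of Phi are the feature vectors phi(x_j)_k = sqrt(nu_k) exp(i 2 pi k x_j), Eq. (8)
    Phi = np.sqrt(nu) * np.exp(2j * np.pi * np.outer(x_train, freqs))

    # minimum-l2-norm interpolating weights, Eq. (9), via the pseudoinverse
    alpha_opt = np.linalg.pinv(Phi) @ y

    f_opt = lambda x: np.real(np.sqrt(nu) * np.exp(2j * np.pi * np.outer(x, freqs)) @ alpha_opt)
    print(np.max(np.abs(f_opt(x_train) - y)))               # ~1e-13: the model interpolates the data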

2.2 Two extreme cases to understand generalization

To better understand the trade-off that overparametrization – or in our case, a much larger number of Fourier modes than needed to interpolate the data – introduces between fitting noise and generalization error, we revisit two extreme cases explored in Ref. [9], involving a pure-noise signal and a noiseless signal.

Case 1: Noise only

The first case demonstrates how alias modes can help to fit noise without disturbing the (here trivial) signal. We set $g=0$ and consider $n$ observations $y_{j}=\epsilon_{j}$ of zero-mean noise with known variance $\mathbb{E}[\epsilon^{2}]=\sigma^{2}$. After making $n$ uniformly spaced observations, we compute the discrete Fourier transform of the observations as the sequence of values $\hat{\epsilon}_{j}$ satisfying

\hat{\epsilon}_{j}=\frac{1}{n}\sum_{k\in\Omega_{n}}\epsilon_{k}e^{-i2\pi jk/n},  (10)

which characterizes the frequency content of the noisy signal that can be captured and learned using only $n$ evenly spaced samples. Suppose that the degree of the model (controlling the model complexity) is given by $d=n(m+1)$ for some even integer $m$, and that $\nu_{k}=1$ for every mode, so that there are exactly $m$ equally-weighted aliases for each frequency in the spectrum of the Fourier series for $g$. Then the optimal (i.e., the minimum-$\ell_{2}$-norm, interpolating) trainable weight vector $\boldsymbol{\alpha}^{opt}$ has entries

\alpha^{opt}_{k+n\ell}=\frac{\hat{\epsilon}_{k}}{n(m+1)}  (11)

for $\ell=-m/2,\dots,m/2$, with all other entries being zero (see Appendix A.2). From Eq. (11), the minimum-$\lVert\boldsymbol{\alpha}\rVert_{2}$ solution distributes the noise Fourier coefficients $\hat{\epsilon}_{k}$ evenly over the many alias frequencies $k+n\ell$, while enforcing that the trainable weights $\alpha_{k+n\ell}$ of these aliases sum to $\hat{\epsilon}_{k}$ to guarantee interpolation. As shown in Figure 2, the higher-frequency aliases suppress the optimal model $f^{opt}(x)=\langle\boldsymbol{\alpha}^{opt},\phi(x)\rangle$ to near-zero at every point away from the interpolated points, resulting in a test error of $\mathcal{O}(\sigma^{2}/m)$ that decreases monotonically with the complexity of the model. As $m$ increases, the optimal model $f^{opt}$ remains close to the true signal $y=0$ while becoming “spiky” near the noisy samples. By conforming to the true signal everywhere except in the vicinity of noise, this behavior embodies the mechanism by which overparameterized models absorb noise into high-frequency modes. In this case the generalization error, measuring how close the model is to the target function on average, decreases with increasing complexity of the model class.
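The even spreading of noise over aliases can be checked numerically. A small sketch under the same assumptions as above (NumPy, with illustrative values of $n$ and $m$) compares the pseudoinverse solution with the description above: for any $k$, the weights on the aliases of $k$ are all equal and sum to $\hat{\epsilon}_{k}$:

    import numpy as np

    rng = np.random.default_rng(1)
    n, m = 7, 4                             # samples and (even) number of extra aliases
    d = n * (m + 1)
    x = np.arange(n) / n
    eps = rng.normal(size=n)                # pure-noise observations y_j = eps_j

    freqs = np.arange(-(d - 1) // 2, (d - 1) // 2 + 1)
    Phi = np.exp(2j * np.pi * np.outer(x, freqs))           # nu_k = 1 for every mode
    alpha = np.linalg.pinv(Phi) @ eps                       # minimum-l2-norm interpolator

    k = 2                                                   # any frequency in Omega_n
    eps_hat_k = np.mean(eps * np.exp(-2j * np.pi * k * np.arange(n) / n))   # DFT as in Eq. (10)
    alias_weights = alpha[(freqs - k) % n == 0]             # weights on k, k±n, k±2n, ...
    print(np.allclose(alias_weights, alias_weights[0]))     # True: spread evenly over aliases
    print(np.isclose(alias_weights.sum(), eps_hat_k))       # True: aliases sum to eps_hat_k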

Figure 2: Comparison of overparametrized and simple models that interpolate data from different target functions. a Noise only: the overparameterized model ($n=7$, $d=35$, blue) is more effective at recovering the target function $g(x)=0$ when provided with noisy data, while a band-limited model (red) cannot do so effectively. Inset: the trainable weights of the minimum-$\lVert\boldsymbol{\alpha}\rVert_{2}$ model are distributed among many aliases, suppressing the error of the optimized model (proportional to $\sum_{j}|\hat{f}_{j}|^{2}$) in this case. b Signal only: the opposite occurs in the case of noiseless input, for which a band-limited model ($n_{0}=5$) perfectly recovers $g(x)=\sin\left(2\pi(2x)\right)$ while the overparameterized model fails to capture the behavior of the input signal. Inset: minimizing the $\ell_{2}$ norm of $\boldsymbol{\alpha}$ distributes the Fourier coefficient $\hat{g}_{2}$ among aliases $\hat{f}_{2+\ell n}$ (and $\hat{g}_{-2}$ among aliases $\hat{f}_{-2+\ell n}$); this bleeding of the signal $g$ into higher frequencies results in higher error in the overparameterized model [9, 11].

Case 2: Signal only

While the above case shows how overparametrization can help to absorb noise and reduce error without harming the signal, the second case illustrates how alias frequencies in the overparameterized model can harm the model’s ability to learn the target function. To demonstrate this, we now consider a noiseless, single-mode signal $g(x):=\hat{g}_{p}e^{i2\pi px}$ of frequency $p\leq n_{0}/2$. The data is hence of the form

y_{j}=g(x_{j}):=\hat{g}_{p}e^{i2\pi px_{j}}.  (12)

Once again we choose $d=n(m+1)$ and for simplicity we assume an unweighted model, $\nu_{k}=1$ for $k\in\Omega_{d}$. By orthonormality of the Fourier basis functions, the interpolation condition only involves the modes of the model $f$ at alias frequencies of $p$, and the interpolation constraint can be rewritten as

\sum_{\ell=-m/2}^{m/2}\alpha_{p+n\ell}=\hat{g}_{p}.  (13)

The choice of trainable weights $\alpha_{k}$ that satisfies Eq. (13) while minimizing the $\ell_{2}$-norm is

\alpha^{opt}_{p+n\ell}=\frac{\hat{g}_{p}}{m+1}  (14)

for $k=p+n\ell$ and $\alpha_{k}=0$ otherwise (see Appendix A.2). Eq. (14) distributes the Fourier coefficient $\hat{g}_{p}$ among the trainable weights $\alpha_{p+n\ell}$ corresponding to frequencies $p+n\ell$. Therefore, minimizing $\lVert\boldsymbol{\alpha}\rVert_{2}$ in this case “bleeds” the target function into higher-frequency aliases and results in the opposite effect compared to fitting a noisy signal (see Fig. 2b): the generalization error of the overparameterized model now increases with the number of aliases $m$, and the complexity of the model harms its ability to fit a noiseless target function.
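Eq. (14) and the resulting “bleeding” can be verified in the same way (again a NumPy sketch with illustrative parameters):

    import numpy as np

    n, m, p = 7, 4, 2                       # samples, extra aliases, signal frequency
    d = n * (m + 1)
    x = np.arange(n) / n
    g_hat_p = 1.0
    y = g_hat_p * np.exp(2j * np.pi * p * x)                # noiseless single-mode signal, Eq. (12)

    freqs = np.arange(-(d - 1) // 2, (d - 1) // 2 + 1)
    Phi = np.exp(2j * np.pi * np.outer(x, freqs))
    alpha = np.linalg.pinv(Phi) @ y

    # Eq. (14): the coefficient g_hat_p is split evenly among the aliases p + n*l
    print(np.allclose(alpha[(freqs - p) % n == 0], g_hat_p / (m + 1)))      # True

    # off the training grid, the overparameterized model deviates from g ("bleeding")
    x_test = np.linspace(0, 1, 1000, endpoint=False)
    f_test = np.exp(2j * np.pi * np.outer(x_test, freqs)) @ alpha
    print(np.mean(np.abs(f_test - g_hat_p * np.exp(2j * np.pi * p * x_test)) ** 2))  # > 0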

In order to recover a trade-off in generalization error for more general cases, we will need to consider more interesting distributions of feature weights $\nu_{k}$ (instead of $\nu_{k}=1$) that provide finer control over fitting the target function with low-frequency modes while spiking in the vicinity of noise with high-frequency aliases.

2.3 Generalization trade-offs and benign overfitting

The opposing effects of higher-frequency modes in overparameterized models in the cases discussed above hint at a trade-off in model performance that depends on the underlying signal and the feature weights of the Fourier feature map. Returning to the more general case of input samples $y_{j}=g(x_{j})+\epsilon_{j}$, in Appendix A we show that the task of fitting uniformly spaced samples using weighted Fourier features may be transformed into a linear regression problem, thereby generalizing the results of Ref. [9] to derive the following general solution to the minimum-$\ell_{2}$ interpolation problem of Eq. (9):

\alpha^{opt}_{k+\ell n}=\hat{y}_{k}\frac{\sqrt{\nu_{k+\ell n}}}{\sum_{j\in S(k)}\nu_{j}},  (15)

where $\hat{y}_{k}$ is the discrete Fourier transform of the samples $y_{j}$ evaluated at frequency $k\in\Omega_{n_{0}}$, and where

S(k)=\{k+n\ell\in\Omega_{d}:\ell\in\mathbb{Z}\}  (16)

denotes the set of alias frequencies of $k$ appearing in the overparameterized model with spectrum $\Omega_{d}$. The optimal model is then expressed as

f^{opt}(x)=\sum_{k\in\Omega_{n_{0}}}\hat{y}_{k}\sum_{\ell\in S(k)}e^{i2\pi\ell x}\frac{\nu_{\ell}}{\sum_{j\in S(k)}\nu_{j}}.  (17)
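As a sanity check of Eq. (15) (a NumPy sketch with arbitrary samples and randomly drawn feature weights; this is not part of the derivation in Appendix A), the closed form can be compared against a numerical minimum-norm fit:

    import numpy as np

    rng = np.random.default_rng(2)
    n, d = 7, 35
    x = np.arange(n) / n
    y = rng.normal(size=n)                                  # arbitrary (noisy) samples

    freqs = np.arange(-(d - 1) // 2, (d - 1) // 2 + 1)
    nu = rng.uniform(0.1, 1.0, size=d)                      # generic feature weights nu_k > 0
    Phi = np.sqrt(nu) * np.exp(2j * np.pi * np.outer(x, freqs))
    alpha_pinv = np.linalg.pinv(Phi) @ y                    # numerical minimum-norm solution

    # closed form, Eq. (15): alpha_{k+ln} = y_hat_k sqrt(nu_{k+ln}) / sum_{j in S(k)} nu_j
    alpha_formula = np.zeros(d, dtype=complex)
    for k in range(-(n - 1) // 2, (n - 1) // 2 + 1):
        y_hat_k = np.mean(y * np.exp(-2j * np.pi * k * np.arange(n) / n))
        S_k = (freqs - k) % n == 0                          # alias frequencies of k in Omega_d
        alpha_formula[S_k] = y_hat_k * np.sqrt(nu[S_k]) / nu[S_k].sum()

    print(np.allclose(alpha_pinv, alpha_formula))           # True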

Recalling that our model $f$ is trained on $n$ noisy samples $(y_{0},\dots,y_{n-1})$ of the target function $g$, we are interested in the squared error of the model $f$ averaged over (noisy) samples over the input domain,

L(f):=\mathbb{E}_{x,\mathbf{y}}(f(x)-g(x))^{2},  (18)

and we call $L$ the generalization error of $f$, as it captures the behavior of $f$ with respect to $g$ over the entire input domain $x\in[0,1]$ instead of just the uniformly spaced training points $x_{j}$. In Appendix A we derive

L(f^{opt})=\underbrace{\frac{\sigma^{2}}{n}\sum_{k\in\Omega_{n}}\frac{\sum_{\ell\in S(k)}\nu_{\ell}^{2}}{\left(\sum_{j\in S(k)}\nu_{j}\right)^{2}}}_{\textsc{var}}+\underbrace{\sum_{k\in\Omega_{n_{0}}}|\hat{g}_{k}|^{2}\left[\left(\frac{\sum_{\ell\in S(k)\backslash k}\nu_{\ell}}{\sum_{j\in S(k)}\nu_{j}}\right)^{2}+\frac{\sum_{\ell\in S(k)\backslash k}\nu_{\ell}^{2}}{\left(\sum_{j\in S(k)}\nu_{j}\right)^{2}}\right]}_{\textsc{bias}^{2}}.  (19)

We now use this generalization error to explore two interesting behaviors of the interpolating model in our setting: the tradeoff between noise absorption and signal recovery exemplified by the cases in Sec. 2.2, and the ability of an overparameterized Fourier features model to benignly overfit the training data.

The first behavior involves a trade-off in the generalization error $L(f^{opt})$ between absorbing noise (reducing var) and capturing the target function signal (reducing $\textsc{bias}^{2}$), which recovers and generalizes the behavior of the two cases in Sec. 2.2. This trade-off is controlled by three components: the noise variance $\sigma^{2}$, the Fourier coefficients $\hat{g}_{k}$ of the input signal, and the distribution of feature weights $\sqrt{\nu_{k}}$. As described in the two cases above, when $\sigma^{2}\rightarrow 0$ (signal only) the variance term var vanishes and the model is biased for any choice of nonzero $\nu_{k}$ with $k>n$. Conversely, when $\hat{g}\rightarrow 0$ (noise only) the bias term $\textsc{bias}^{2}$ vanishes, and the variance term is minimized by choosing uniform $\nu_{k}$ for all $k\in\Omega_{d}$.

The second interesting behavior occurs when the generalization error of the overparameterized model decreases at a nearly optimal rate as the number of samples $n$ increases, known as benign overfitting. Prior work on benign overfitting in linear regression studied scenarios where the distribution of input data varied with the dimensionality of the data and the size of the training set in such a way that the excess generalization error of the overparameterized model (compared to a simple model) vanished [6]. However, since the dimensionality of the input data for our model is fixed, we instead consider sequences of feature weights that vary with respect to the problem parameters ($n_{0}$, $n$, $d$) in a way that results in $\textsc{bias}^{2}$ and var vanishing as $n\rightarrow\infty$. In this case, by fitting an increasing number of samples $n$ using such a sequence of feature weights, the overparameterized model both perfectly fits the training data and generalizes well to unseen test points, and therefore exhibits benign overfitting.

These behaviors are exemplified by a straightforward choice of feature weights that incorporates some prior knowledge of the spectrum $\Omega_{n_{0}}$ available to the target function $g$. For all $k\in\Omega_{n_{0}}$, let $\nu_{k}=c/n_{0}$ for some positive $c$, and normalize the feature weights so that $\sum_{k\in\Omega_{d}}\nu_{k}=1$. We show in Appendix A.3.1 that the error terms of $L(f^{opt})$ scale as

\textsc{var}=\mathcal{O}\left(\frac{1}{n}+\frac{n}{d}\right),  (20)
\textsc{bias}^{2}=\mathcal{O}\left(\frac{1}{n^{2}}\right).  (21)

Thus, as long as the dimension of the overparameterized Fourier features model grows strictly faster than $n$ (i.e., $d=\omega(n)$), the model exhibits benign overfitting. In Appendix A.3.2 we demonstrate how this simple example actually captures the mechanisms of benign overfitting for much more general choices of feature weights. Fig. 3 summarizes this behavior and provides an example of the bias-variance tradeoff that occurs for overparameterized models. In particular, Fig. 3a exemplifies the setting in which benign overfitting occurs, wherein the feature weights of the Fourier features model are strongly concentrated over frequencies in $\Omega_{n_{0}}$ but extend over a large range of alias frequencies for each $k\in\Omega_{n_{0}}$.
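The scalings of Eqs. (20)-(21) can also be explored numerically. The sketch below (NumPy; the split of weight between $\Omega_{n_{0}}$ and the remaining frequencies is an ad hoc illustrative choice in the spirit of Fig. 3a, not the construction of Appendix A.3.1) evaluates the two terms of Eq. (19) for growing $n$ with $d=n^{2}$, so that $d=\omega(n)$; both printed terms decrease with $n$:

    import numpy as np

    def generalization_error(nu, freqs, g_hat, g_freqs, n, sigma2):
        """Evaluate the VAR and BIAS^2 terms of Eq. (19) for feature weights nu over freqs."""
        var, bias2 = 0.0, 0.0
        for k in range(-(n - 1) // 2, (n - 1) // 2 + 1):
            S_k = (freqs - k) % n == 0                      # aliases of k within Omega_d
            var += sigma2 / n * (nu[S_k] ** 2).sum() / nu[S_k].sum() ** 2
        for k, gk in zip(g_freqs, g_hat):
            S_k = (freqs - k) % n == 0
            not_k = S_k & (freqs != k)                      # S(k) \ k
            bias2 += abs(gk) ** 2 * ((nu[not_k].sum() / nu[S_k].sum()) ** 2
                                     + (nu[not_k] ** 2).sum() / nu[S_k].sum() ** 2)
        return var, bias2

    n0, sigma2 = 5, 0.25
    g_freqs = np.arange(-(n0 - 1) // 2, (n0 - 1) // 2 + 1)
    g_hat = np.ones(n0) / n0                                # some band-limited target coefficients

    for n in [31, 127, 511]:                                # growing sample size, d = n^2 (odd)
        d = n ** 2
        freqs = np.arange(-(d - 1) // 2, (d - 1) // 2 + 1)
        nu = np.full(d, 0.01 / (d - n0))                    # small weight spread over aliases
        nu[np.isin(freqs, g_freqs)] = 0.99 / n0             # most weight concentrated on Omega_{n0}
        print(n, generalization_error(nu, freqs, g_hat, g_freqs, n, sigma2))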

Figure 3: The feature weights $\nu_{k}$ may be engineered to exhibit a bias-variance tradeoff and benign overfitting. a A demonstration of feature weights $\nu_{k}$ which account for prior knowledge of the target function $g$, with large $\nu_{k}$ for $k\in\Omega_{n_{0}}$ and small $\nu_{k}$ for $k\in\Omega_{d}\backslash\Omega_{n_{0}}$. b After sampling a band-limited target function ($n_{0}=15$) at a fixed number of uniformly spaced points ($n=31$), the generalization error of the minimum-$\ell_{2}$-norm model $f^{opt}$ consists of a tradeoff between decreasing var and increasing $\textsc{bias}^{2}$ for larger $d$. Note the peaking behavior of var at $d/n=1$. c This choice of feature weights $\nu_{k}$ benignly overfits the input signal such that $\lim_{n,d\rightarrow\infty}L(f^{opt})=0$ for any scaling of $d=n^{\alpha}$ with $\alpha>1$, and fails to generalize on the interval $[0,1]$ when $\alpha\leq 1$.

The generalization behavior described here is fundamentally different from many generalization guarantees typically found in statistical learning theory. While prior work has derived guarantees for the generalization of quantum models by constructing bounds on the complexity of the model class [25], Eqs. (20)-(21) demonstrate that generalization may occur even as the complexity (i.e., the dimension) of a model grows arbitrarily large.

So far, we have reviewed the Fourier perspective on fitting periodic functions in a classical setting and extended the analysis to characterize benign overfitting. However, if we can link the basic components of quantum models to the terms appearing in the error of Eq. (19), then we will be able to study a similar trade-off in the error of overparameterized quantum models and the conditions necessary for benign overfitting. The remainder of this work is devoted to showing that analogous mechanisms exist in certain quantum machine learning models, and to studying the choices of feature weights for which quantum models can exhibit tradeoffs in generalization error and benign overfitting.

3 Benign overfitting in single-layer quantum models

In the previous section we have seen that the feature weights $\sqrt{\nu_{k}}$ balance the trade-off between absorbing the noise and hurting the signal in overparametrized models. To understand how different design choices of quantum models impact this balance, we need to link their mathematical structure to the model class defined in Eq. (4), and in particular to the feature weights, which is what we do now.

The type of quantum models we consider here are simplified versions of parametrized quantum classifiers (also known as quantum neural networks) that have been heavily investigated in recent years [13, 38, 39]. They are represented by quantum circuits that consist of two steps: first, we encode a datapoint $x\in[0,1]$ into the state of the quantum system by applying a $d$-dimensional unitary operator $V(x)$, and then we measure the expectation value of some $d$-dimensional (Hermitian) observable $M$. This gives rise to a general class of quantum models of the form

f(x)=\langle 0|V^{\dagger}(x)MV(x)|0\rangle.  (22)

To simplify the analysis, we will consider a quantum circuit $V(x)$ that consists of a data-independent unitary $U$ and a diagonal data-encoding unitary generated by a $d$-dimensional Hermitian operator $H$,

S(x)=\exp(i2\pi\cdot\operatorname{diag}(\lambda_{0},\dots,\lambda_{d-1})\,x)=\exp(i2\pi Hx),  (23)

which includes a large class of commonly studied quantum models but excludes schemes involving data re-uploading [40, 41]. Defining $U|0\rangle=|\Gamma\rangle=\sum_{j=1}^{d}\gamma_{j}|j\rangle$, the output of this quantum model becomes

f(x)=\langle\Gamma|S^{\dagger}(x)MS(x)|\Gamma\rangle,  (24)

where $|\Gamma\rangle$ can be treated as an arbitrary input quantum state. We call the corresponding quantum circuit for this model single-layer in the sense that it contains a single diagonal data-encoding layer in which all data-dependent circuit operations could theoretically be executed simultaneously (though in general the operation $U$ and the measurement $M$ may require significant depth to implement).

Applying insights from Refs. [29, 30], quantum models of this form can be expressed in terms of a Fourier series

f(x)=\sum_{k\in\Omega}e^{i2\pi kx}\sum_{\ell,m\in R(k)}\gamma_{\ell}\gamma_{m}^{*}M_{m\ell},  (25)

where the spectrum $\Omega$ as well as the partitions $R(k)$ depend on the eigenspectrum $\lambda(H)$ of the data-encoding Hamiltonian $H$:

\Omega=\{(\lambda_{\ell}-\lambda_{m}):\lambda_{\ell},\lambda_{m}\in\lambda(H)\},  (26)
R(k)=\{(\ell,m):\lambda_{\ell}-\lambda_{m}=k;\,\,\lambda_{\ell},\lambda_{m}\in\lambda(H)\}.  (27)
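For illustration, the brute-force construction of $\Omega$ and $R(k)$ and the equivalence between Eq. (24) and Eq. (25) can be checked directly (a NumPy sketch; the spectrum, state, and observable below are arbitrary toy choices):

    import numpy as np

    rng = np.random.default_rng(5)
    d = 6
    lam = np.array([0, 1, 2, 4, 7, 11])                     # eigenvalues of the encoding Hamiltonian H
    Gamma = rng.normal(size=d) + 1j * rng.normal(size=d)
    Gamma /= np.linalg.norm(Gamma)                          # input state |Gamma> = U|0>
    A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    M = (A + A.conj().T) / 2                                # a Hermitian observable

    def f_direct(x):
        """f(x) = <Gamma| S(x)^dag M S(x) |Gamma> with diagonal S(x) = exp(i 2 pi H x), Eq. (24)."""
        psi = np.exp(2j * np.pi * lam * x) * Gamma
        return (psi.conj() @ M @ psi).real

    # Eqs. (26)-(27): spectrum Omega and index pairs R(k) with lambda_l - lambda_m = k
    diffs = lam[:, None] - lam[None, :]
    Omega = np.unique(diffs)
    R = {k: list(zip(*np.nonzero(diffs == k))) for k in Omega}

    def f_fourier(x):
        """The same model written as the Fourier series of Eq. (25)."""
        return sum(np.exp(2j * np.pi * k * x)
                   * sum(Gamma[i] * Gamma[j].conj() * M[j, i] for (i, j) in R[k])
                   for k in Omega).real

    print(np.isclose(f_direct(0.37), f_fourier(0.37)))      # True
    print(len(Omega), {int(k): len(R[k]) for k in Omega})   # |Omega| and degeneracies |R(k)|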

Comparing Eq. (25) to Eq. (4), we see that the quantum model may be expressed as a linear combination of weighted Fourier modes, but it is not yet clear how the input state amplitudes $\gamma_{j}$ and the trainable observable $M$ of the quantum model correspond to feature weights $\nu_{k}$ for each Fourier mode. To reveal this correspondence, we will first need to find the minimum-norm interpolating observable that solves the optimization problem

M^{opt}=\operatorname*{arg\,min}_{M=M^{\dagger}:\,f(x_{j})=y_{j}\,\forall\,j}\lVert M\rVert_{F},  (28)

where $\lVert M\rVert_{F}=\sqrt{\operatorname{Tr}\left(M^{\dagger}M\right)}$ denotes the Frobenius norm of $M$. Solving Eq. (28) is analogous to minimizing the $\ell_{2}$ norm of $\boldsymbol{\alpha}$ in the classical optimization problem of Eq. (9), and serves a role similar to the regularization commonly applied to quantum models by introducing a penalty term proportional to $\lVert M\rVert_{F}^{2}$ [26, 42, 43]. In Appendix B we prove that, subject to the condition $\gamma_{i}>0$, the minimum-$\lVert\cdot\rVert_{F}$ interpolating observable that solves Eq. (28) is given by

(M^{opt})_{m\ell}=\hat{y}_{k}\frac{\gamma_{\ell}^{*}\gamma_{m}}{\sum_{i,j\in R(S_{k})}|\gamma_{i}|^{2}|\gamma_{j}|^{2}},  (29)

for all $\ell,m\in R(S_{k})$, and the corresponding optimal quantum model is

f^{opt}(x)=\sum_{k\in\Omega_{n_{0}}}\hat{y}_{k}\,\sum_{\ell\in S(k)}e^{i2\pi\ell x}\frac{\nu^{opt}_{\ell}}{\sum_{j\in S(k)}\nu^{opt}_{j}},  (30)

where $S(k)$ denotes the set of aliases of $k$ appearing in $\Omega$, as in Eq. (16). By comparison with the optimal classical model of Eq. (17), we identify the feature weights of the optimized quantum model as

\nu_{k}^{opt}:=\sum_{i,j\in R(k)}|\gamma_{i}|^{2}|\gamma_{j}|^{2}.  (31)

Interestingly, while there was initially no clear way to separate the building blocks of the quantum model in Eq. (25) into trainable weights $\alpha_{k}$ and feature weights $\nu_{k}$, this separation has now appeared after solving for the optimal observable $M^{opt}$. Furthermore, the optimal quantum model depends only on the magnitudes $|\gamma_{i}|$ and is independent of the phases of the amplitudes $\gamma_{i}$ (an effect that stems from using only a single data-encoding layer $S(x)$ in the quantum model).
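A short computation (NumPy sketch; the spectrum and the uniform state are illustrative choices) evaluates Eq. (31) and makes the dependence on the degeneracies explicit; for a uniform input state the feature weights reduce to $|R(k)|/d^{2}$, the correspondence used in the simplified model of Sec. 3.1 below:

    import numpy as np

    d = 8
    lam = np.arange(d)                      # encoding spectrum 0, 1, ..., d-1 (cf. the Binary strategy)
    gamma = np.ones(d) / np.sqrt(d)         # uniform input state amplitudes

    p = np.abs(gamma) ** 2
    diffs = lam[:, None] - lam[None, :]
    # Eq. (31): nu_k^opt is a sum of |gamma_i|^2 |gamma_j|^2 over the index pairs in R(k)
    nu_opt = {int(k): float(p[np.nonzero(diffs == k)[0]] @ p[np.nonzero(diffs == k)[1]])
              for k in np.unique(diffs)}
    print(nu_opt)                           # equals |R(k)| / d^2 = (d - |k|) / d^2 for this state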

From Eq. (31) it is clear that the partitions $R(k)$ of Eq. (27), which arise from the choice of data-encoding unitary $S(x)$, have a strong relationship with the feature weights $\nu_{k}$ of the quantum model. We will now consider a simplified quantum model to highlight this relationship, thereby identifying a tradeoff between noise absorption and target signal recovery and the possibility of observing benign overfitting in quantum models.

3.1 Simplified quantum model

To explicitly highlight the role of $R(k)$ in controlling the feature weights $\nu_{k}^{opt}$ of the optimized quantum model, we will now simplify the model by using an equal-superposition input state $|\Gamma\rangle=\frac{1}{\sqrt{d}}\sum_{j=0}^{d-1}|j\rangle$ and by restricting the set of observables considered during optimization. If we fix every entry of the observable with respect to elements in a partition $R(k)$ to be proportional to some complex constant $M(k)$:

M_{m\ell}=\frac{M(k)}{\sqrt{|R(k)|}}\qquad\text{for all }\ell,m\in R(k),  (32)

then we can simplify the quantum model of Eq. (25) to

f(x)=\sum_{k\in\Omega}M(k)\frac{\sqrt{|R(k)|}}{d}e^{i2\pi kx}.  (33)

Comparing Eq. (33) to Eq. (4), we identify a direct correspondence between the trainable weights $\alpha_{k}$ in the classical model and $M(k)$, as well as a correspondence between the feature weights $\sqrt{\nu_{k}}$ and the degeneracy $|R(k)|$ of the quantum model. Making the substitutions $\alpha_{k}\rightarrow M(k)$ and $\sqrt{\nu_{k}}\rightarrow\sqrt{|R(k)|}/d$, one can verify that $\sum_{k\in\Omega}|M(k)|^{2}=\lVert M\rVert_{F}^{2}$ for this restricted choice of $M$, and so the solution to the optimization of Eq. (28) is essentially the same as that of the classical problem in Eq. (15).

The crucial property of the simplified model is that the degeneracy $|R(k)|$ – and hence the combinatorial structure introduced by the eigenvalues of the data-encoding Hamiltonian – completely controls the trade-off in the generalization error (Eq. (19)). We can hence study different types of partitions $R(k)$ to show a direct effect of the data-encoding unitary $S(x)$ on the fitting and generalization error behaviors of this simplified, overparameterized quantum model.

To study these behaviors we will now consider specific families of $H$, which we call encoding strategies since the choice of $H$ completely determines how the data is encoded into the quantum model. While $R(k)$ and $\Omega$ may be computed for an arbitrary $S(x)$ using brute-force combinatorics, some encoding strategies lead to particularly simple solutions. We have derived a few such examples of simple degeneracies and spectra for different encoding strategies in Appendix C and present the results in Table 1. These choices highlight the extreme variation in $\Omega$ resulting from minor changes to $S(x)$, for example $|\Omega|\propto n_{q}$ for the “Hamming” encoding strategy compared to $|\Omega|\propto 3^{n_{q}}$ for the “Ternary” encoding strategy. These examples also highlight the limitations in constructing Hamiltonians with specific properties such as uniform $|R(k)|$ or evenly-spaced frequencies in $\Omega$.

Encoding strategy | Hamiltonian example | Degeneracy $|R(k)|$ | Spectrum $\Omega$ | $|\Omega|$
Hamming | $\frac{1}{2}(Z_{0}+Z_{1}+Z_{2}+\dots)$ | $\binom{2n_{q}}{n_{q}-|k|}$ | $\{-n_{q},1-n_{q},\dots,n_{q}\}$ | $2n_{q}+1$
Binary | $\frac{1}{2}(Z_{0}+2Z_{1}+4Z_{2}+\dots)\sim\operatorname{diag}\left(0,1,2,\dots,2^{n_{q}}-1\right)$ | $2^{n_{q}}-|k|$ | $\{-2^{n_{q}}+1,\dots,2^{n_{q}}-1\}$ | $2^{n_{q}+1}-1$
Ternary | $\frac{1}{2}(Z_{0}+3Z_{1}+9Z_{2}+\dots)$ | $2^{n_{q}-\lVert T(k)\rVert_{1}}$ | $\bigl\{-\bigl\lfloor\frac{3^{n_{q}}}{2}\bigr\rfloor,\dots,\bigl\lfloor\frac{3^{n_{q}}}{2}\bigr\rfloor\bigr\}$ | $3^{n_{q}}$
Golomb | $\operatorname{diag}\left(0,1,4,6\right)$ | $d$ for $k=0$; $1$ for $k\neq 0$ | varies | $2\binom{d}{2}+1$
Table 1: Spectra $\Omega$ and degeneracies $R(k)$ computed for various data-encoding Hamiltonians $H$ defined for either $d$ dimensions or $n_{q}$ qubits. The Hamming, Binary, and Ternary data-encoding strategies are realized on $n_{q}$ qubits using a separable Hamiltonian consisting of Pauli-$Z$ operators with different prefactor schemes. The Ternary encoding strategy results in the largest $|\Omega|$ possible for a separable data-encoding Hamiltonian (see also Ref. [44]), while the Golomb encoding strategy (named in reference to Golomb rulers, e.g. [45]) results in the largest $|\Omega|$ possible for any choice of $d$-dimensional data-encoding Hamiltonian. Note that the spectrum is preserved under permutations and additive shifts of the diagonal of the Hamiltonian, and so we use “$\sim$” to denote equivalence up to these operations. The function $T$ converts an integer to a signed ternary string, as defined in Eq. (234) (see Appendix C).
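The $|\Omega|$ column of Table 1 can be reproduced by brute force from the diagonal of each encoding Hamiltonian (a NumPy sketch for $n_{q}=3$; the diagonals are written up to the irrelevant shifts and permutations noted in the caption):

    import numpy as np

    def spectrum_size(diag):
        """Number of distinct frequencies |Omega| generated by a diagonal encoding Hamiltonian."""
        diag = np.asarray(diag)
        return len(np.unique(diag[:, None] - diag[None, :]))

    nq = 3
    hamming = np.array([nq / 2 - bin(i).count("1") for i in range(2 ** nq)])     # (Z_0+Z_1+Z_2)/2
    binary = np.arange(2 ** nq)                                                  # ~ diag(0,...,2^nq - 1)
    ternary = np.array([sum(((i >> j) & 1) * 3 ** j for j in range(nq))
                        for i in range(2 ** nq)])                                # prefactors 1, 3, 9, ...
    golomb = np.array([0, 1, 4, 6])                                              # d = 4 example from Table 1

    print(spectrum_size(hamming))   # 2*nq + 1 = 7
    print(spectrum_size(binary))    # 2^(nq+1) - 1 = 15
    print(spectrum_size(ternary))   # 3^nq = 27
    print(spectrum_size(golomb))    # 2*C(4,2) + 1 = 13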

Since the feature weights $\nu_{k}$ of the Fourier modes are fixed by the choice of the data-encoding unitary, we can understand a choice of $S(x)$ as providing a structural bias of a quantum model towards different overfitting behaviors; conversely, the choices of feature weights available to quantum models are limited and are particular to the structure of the associated quantum circuit. Figure 4 shows distributions of feature weights arising from the example encoding strategies presented in Table 1, and demonstrates a broad connection between the degeneracies $|R(k)|$ of the model (giving rise to feature weights $\nu_{k}^{opt}$) and the generalization error $L(f^{opt})$.

Figure 4: The degeneracies $|R(k)|$ for select quantum models directly influence the tendency of the models to overfit noise or bleed the underlying signal. a The degeneracies $|R(k)|$ for each frequency $k$ for the four models introduced in Table 1 induce significantly different feature weights in the simplified quantum model. b Two models exemplify the extreme cases of fitting behaviors from Sec. 2.2: the Golomb encoding strategy (magenta) has significant weight on feature weights $\nu_{k}$ for large frequencies $k$, resulting in a “spiky” overfitting behavior that interpolates each noisy sample $y_{j}$ but sacrifices recovery of the underlying signal $g(x)$; the Hamming encoding strategy (green) fits noisy samples smoothly, and the error of the model is sensitive to the noise in the sampled data. c-e The $\textsc{bias}^{2}$ and var of the optimized Hamming encoding strategy (c), Binary encoding strategy (d), and Ternary encoding strategy (e) demonstrate the general effect of $|R(k)|$ and $\nu_{k}$ on the generalization error. The concentration of feature weights $\nu_{k}$ on $k\in\Omega_{n_{0}}$ in the Hamming encoding strategy reduces the bias (c), while $\nu_{k}$ having broad support over $k\in\Omega$ (e.g. Ternary, e) reduces var, as the resulting $f$ is close to its average (with respect to noise) except for becoming “spiky” in the vicinity of samples $y_{j}$, but increases the bias by introducing alias frequencies into the optimized model $f^{opt}$.

3.2 Trainable unitaries reweight the general quantum model

We now return to the general quantum model of Eq. (25) to understand how different choices of the state-preparation unitary $U$ (giving rise to the input state $|\Gamma\rangle$) affect the feature weights $\nu_{k}^{opt}$ of the general quantum model, thereby influencing benign overfitting of the target function. While Eq. (31) differs from the correspondence $\nu_{k}=|R(k)|/d^{2}$ that we observed for the simplified quantum model, we see that the feature weights of the optimal quantum model still depend heavily on the degeneracies $|R(k)|$. For instance, the average $\nu_{k}^{opt}$ with respect to Haar-random $d$-dimensional input states $|\Gamma\rangle$ is proportional to $|R(k)|$ whenever $k\neq 0$:

\underset{U\sim\operatorname{U}\left(d\right)}{\mathbb{E}}\,\nu_{k}^{opt}=\frac{|R(k)|+d\delta_{0}^{k}}{d(d+1)},  (34)

where $\delta_{i}^{j}$ denotes the Kronecker delta. Furthermore, in Appendix B we observe that the variance of $\nu_{k}^{opt}$ around its average tends to be small for the encoding strategies considered in this work, for instance scaling like $\mathcal{O}\left(d^{-3}\right)$ for the Binary encoding strategy and $\mathcal{O}\left(d^{-4}\right)$ for the Golomb encoding strategy. This demonstrates that the feature weights of generic quantum models (i.e., ones for which $U$ is randomly sampled) will be dominated by the degeneracy $|R(k)|$ introduced by the data-encoding operation $S(x)$.

Despite the behavior of $\nu_{k}$ being dominated by $|R(k)|$ in an average sense, there are specific choices of $U$ for which the feature weights deviate significantly from this average. We will now use one such choice of $U$ to provide a concrete example of an interpolating quantum model that exhibits benign overfitting. Suppose we choose $U$ such that the elements of $|\Gamma\rangle$ are given by

\gamma_{j}=\begin{cases}\sqrt{a},&j\in[c_{1},c_{2}),\\ \sqrt{b},&\text{otherwise}\end{cases}  (35)

for $j\in[d]$, some integers $0<c_{1}<c_{2}<d$, and amplitudes $a,b$ subject to normalization. Given a band-limited target function $g$ with access to the spectrum $\Omega_{n_{0}}$, there is a specific choice of $c_{1},c_{2}$, dependent on $d$ and $n_{0}$, for which the interpolating quantum model $f^{opt}$ of Eq. (30) also benefits from vanishing generalization error in the limit of many samples; specifically, we show in Appendix D that

L(f^{opt})=\mathcal{O}\left(\frac{1}{n}+\frac{n}{d}\right).  (36)

Thus, by perfectly fitting the training data and exhibiting good generalization in the $n\rightarrow\infty$ limit, the quantum model exhibits benign overfitting. This behavior is outlined in Figure 5, which highlights the role that $|\Gamma\rangle$ plays in concentrating the feature weights $\nu_{k}$ within the spectrum $\Omega_{n_{0}}$ of $g$ while preserving a long tail that provides the model with low-variance “spiky” behavior in the vicinity of noisy samples. In contrast, the feature weights for the Binary encoding strategy with a uniform input state have little support on $\Omega_{n_{0}}$, resulting in a large bias error.
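The concentration effect of a step-profile state can be illustrated numerically. In the sketch below (NumPy; the values of $c_{1}$, $c_{2}$ and the two amplitudes are ad hoc choices made for illustration only, not the Appendix D prescription), the feature weights of Eq. (31) computed for $H\sim\operatorname{diag}(0,1,\dots,d-1)$ place most of their total weight inside $\Omega_{n_{0}}$ while keeping a long flat tail outside it:

    import numpy as np

    d, n0 = 128, 15
    lam = np.arange(d)                                      # H ~ diag(0, 1, ..., d-1)
    c1, c2 = 0, (n0 + 1) // 2                               # step of width (n0+1)/2, cf. Eq. (35)
    p = np.full(d, 0.1 / (d - (c2 - c1)))                   # |gamma_j|^2 = b outside the step
    p[c1:c2] = 0.9 / (c2 - c1)                              # |gamma_j|^2 = a inside the step

    diffs = lam[:, None] - lam[None, :]
    ks = np.arange(-(d - 1), d)
    nu = np.array([p[np.nonzero(diffs == k)[0]] @ p[np.nonzero(diffs == k)[1]] for k in ks])

    in_band = np.abs(ks) <= (n0 - 1) // 2                   # frequencies inside Omega_{n0}
    print(nu[in_band].sum(), nu[~in_band].sum())            # ~0.82 vs ~0.18: weight concentrates in-band
    print(nu[~in_band].max())                               # the out-of-band tail is small but nonzero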

Figure 5: Benign overfitting for the general quantum model. a For the Hamiltonian $H\sim\operatorname{diag}\left(0,1,\dots,d-1\right)$ we construct a simple input state $|\Gamma_{1}\rangle$ (Eq. (35)) that is dependent on $n_{0}$ and $d$ (see Appendix D), compared to the uniform state $|\Gamma_{0}\rangle=\frac{1}{\sqrt{d}}\sum_{k}|k\rangle$. b $|\Gamma_{1}\rangle$ induces feature weights $\nu_{k}^{opt}$ in the optimized model that are heavily concentrated in the target function spectrum $\Omega_{n_{0}}$ (gray band) and decay like $1/d$ and $1/d^{2}$ elsewhere for $k\in\Omega_{d}\backslash\Omega_{n_{0}}$, compared to the feature weights induced by $|\Gamma_{0}\rangle$, which have similar magnitude over all of $\Omega_{d}$. c Plotting the $\lVert M\rVert_{F}$-minimizing quantum model $f_{1}^{opt}$ with input state $|\Gamma_{1}\rangle$ for a sample signal ($d=128$, $n=31$) highlights how $|\Gamma_{1}\rangle$ and $H$ work together to interpolate $y_{j}$ via high-frequency “spiky” behavior near the sampled points, while otherwise closely matching the underlying signal $g(x)$. d The resulting quantum model benignly overfits the band-limited target function: as long as $d=\omega(n)$, the generalization error $L(f_{1}^{opt})$ of the minimum-$\lVert M\rVert_{F}$ interpolating quantum model vanishes for large enough sample size $n$.

The above discussion shows how the input state amplitudes $\gamma_{i}$ provide additional degrees of freedom with which the feature weights $\nu_{k}^{opt}$ can be tuned in order to modify the generalization behavior of the interpolating quantum model, and to exhibit benign overfitting in particular. It is therefore worthwhile to consider what other kinds of feature weights $\nu_{k}$ might be prepared by some choice of input state $|\Gamma\rangle$. We may use simple counting arguments to demonstrate the restrictions in designing particular distributions of feature weights. Suppose we define $\Omega_{+}=\{k:k\in\Omega,\,k>0\}$ containing the positive frequencies of a quantum model. Then the introduction of an arbitrary input state $|\Gamma\rangle$ provides us with $2^{n_{q}}-1$ free parameters with which to tune $|\Omega_{+}|$-many terms in the distribution of $\nu_{k}^{opt}$ (subject to $\nu_{k}=\nu_{-k}$ and $\sum_{k}\nu_{k}=1$). Clearly, there are distributions of feature weights $\nu_{k}^{opt}$ that cannot be achieved for models where $|\Omega_{+}|\geq 2^{n_{q}}$.

Conversely, the condition $|\Omega_{+}|<2^{n_{q}}$ does not necessarily mean that we can thoroughly explore the space of possible feature weights by modifying the input state $|\Gamma\rangle$. For example, consider the Hamming encoding strategy, for which the number of free parameters controlling the distribution of feature weights $\nu_{k}^{opt}$ is $|\Omega_{+}|=n_{q}$, which is exponentially smaller than the number of parameters in $|\Gamma\rangle$. While this might suggest significant freedom in adjusting $\nu_{k}^{opt}$, the opposite is true: for any choice of input state $|\Gamma\rangle$, there is another state of the form

|\Gamma^{\prime}\rangle=\sum_{i=0}^{n_{q}}\phi_{i}|\Phi_{i}\rangle,  (37)

that achieves exactly the same distribution of feature weights $\nu_{k}$. In Eq. (37), $|\Phi_{i}\rangle$ describes a uniform superposition over all computational basis state bitstrings with Hamming weight $i$, and so the distribution of $\nu_{k}^{opt}$ actually only depends on the $n_{q}+1$ real parameters $\phi_{i}$, $i=0,1,\dots,n_{q}$; moreover, the feature weights are invariant under any operations in $U$ that preserve the $|\Phi_{i}\rangle$ (see Appendix B). An example of such operations are the particle-conserving unitaries well known in quantum chemistry, which preserve the number of set bits (i.e., the Hamming weight) when each bit represents a mode occupied by a particle in second quantization [46, 47]. This example demonstrates how symmetry in the data-encoding Hamiltonian (e.g. Refs. [48, 49]) can have a profound influence on the ability to prepare specific distributions of feature weights $\nu_{k}^{opt}$, and consequently affect the generalization and overparameterization behavior of the associated quantum models.
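This invariance is easy to check numerically for the Hamming encoding strategy (a NumPy sketch; the random state is an arbitrary example): collapsing an arbitrary $|\Gamma\rangle$ onto the form of Eq. (37), by spreading the probability of each Hamming-weight class uniformly over that class, leaves the feature weights of Eq. (31) unchanged:

    import numpy as np

    rng = np.random.default_rng(4)
    nq = 4
    dim = 2 ** nq
    weights = np.array([bin(i).count("1") for i in range(dim)])     # Hamming weight of each basis state
    lam = nq / 2 - weights                   # eigenvalues of H = (Z_0 + ... + Z_{nq-1}) / 2

    def feature_weights(gamma):
        """nu_k^opt of Eq. (31) for the Hamming encoding strategy."""
        p = np.abs(gamma) ** 2
        diffs = lam[:, None] - lam[None, :]
        return np.array([p[np.nonzero(diffs == k)[0]] @ p[np.nonzero(diffs == k)[1]]
                         for k in range(-nq, nq + 1)])

    gamma = rng.normal(size=dim) + 1j * rng.normal(size=dim)
    gamma /= np.linalg.norm(gamma)

    # |Gamma'> of Eq. (37): weight-class probabilities q_w spread uniformly within each class
    q = np.array([np.sum(np.abs(gamma[weights == w]) ** 2) for w in range(nq + 1)])
    counts = np.bincount(weights, minlength=nq + 1)          # number of bitstrings per weight class
    gamma_prime = np.sqrt(q[weights] / counts[weights])

    print(np.allclose(feature_weights(gamma), feature_weights(gamma_prime)))   # True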

4 Conclusion

In this work we have taken a first step towards characterizing the phenomenon of benign overfitting in a quantum machine learning setting. We derived the error for an overparameterized Fourier features model that interpolates the (noisy) input signal with minimum-$\ell_{2}$-norm trainable weights and connected the feature weights associated with each Fourier mode to a trade-off in the generalization error of the model. We then demonstrated an analogous simplified quantum model for which the feature weights are induced by the choice of data-encoding unitary $S(x)$. Finally, we discussed how introducing an arbitrary state-preparation unitary $U$ gives rise to effective feature weights in the optimized general quantum model, presenting the possibility of connecting $U$ and $S(x)$ to benign overfitting in more general quantum models.

Our discussion of interpolating quantum models presents an interpretation of overparameterization (i.e., the size of the model spectrum $\Omega$) that departs from other measures of quantum circuit complexity discussed in the literature [19, 50, 51], as even the simplified quantum models studied here are able to interpolate training data using a fixed circuit $U$ and optimized measurements. We also reemphasize that – unlike much of the quantum machine learning literature – we do not consider a setting where the model is optimized with respect to a trainable circuit, as the model of Eq. (30) is constructed to exhibit zero training error (and can therefore not be improved via optimization). Finding the input state $|\Gamma\rangle$ that results in a specific distribution of feature weights $\nu_{k}^{opt}$ generally requires solving a $|\Omega_{+}|$-dimensional system of equations that is second order in the $2^{n_{q}}$ real parameters $|\gamma_{i}|^{2}$ (i.e., inverting a map of the form $\mathbb{R}^{2^{n_{q}}}\rightarrow\mathbb{R}^{|\Omega_{+}|}$ in Eq. (31)), or otherwise performing a variational optimization that will likely fail due to the familiar phenomenon of barren plateaus [23, 24, 52, 53].

While we have shown an example of benign overfitting by a quantum model in a relatively restricted context, future work may lead to more general characterizations of this phenomenon. Similar behavior likely exists for quantum kernel methods and may complement existing studies on these methods’ generalization power [54]. An exciting possibility would be to demonstrate benign overfitting in quantum models trained on distributions of quantum states which are hard to learn classically [55, 56], thereby extending the growing body of statistical learning theory for quantum learning algorithms [27, 28, 57].

5 Code availability

Code to reproduce the figures and analysis is available at the following repository: https://github.com/peterse/benign-overfitting-quantum.

6 Acknowledgements

The authors thank Nathan Killoran, Achim Kempf, Angus Lowe, and Joseph Bowles for helpful feedback. This work was supported by Mitacs through the Mitacs Accelerate program. Research at Perimeter Institute is supported in part by the Government of Canada through the Department of Innovation, Science and Economic Development and by the Province of Ontario through the Ministry of Colleges and Universities. Circuit simulations were performed in PennyLane [58].

References

  • [1] Michael A Nielsen. “Neural networks and deep learning”. Determination Press.  (2015). url: http://neuralnetworksanddeeplearning.com/.
  • [2] Stuart Geman, Elie Bienenstock, and René Doursat. “Neural networks and the bias/variance dilemma”. Neural Comput. 4, 1–58 (1992).
  • [3] Trevor Hastie, Robert Tibshirani, Jerome H Friedman, and Jerome H Friedman. “The elements of statistical learning: data mining, inference, and prediction”. Volume 2. Springer.  (2009).
  • [4] Peter L. Bartlett, Andrea Montanari, and Alexander Rakhlin. “Deep learning: a statistical viewpoint”. Acta Numerica 30, 87–201 (2021).
  • [5] Mikhail Belkin. “Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation”. Acta Numerica 30, 203–248 (2021).
  • [6] Peter L. Bartlett, Philip M. Long, Gábor Lugosi, and Alexander Tsigler. “Benign overfitting in linear regression”. Proc. Natl. Acad. Sci. 117, 30063–30070 (2020).
  • [7] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. “Reconciling modern machine-learning practice and the classical bias-variance trade-off”. Proc. Natl. Acad. Sci. 116, 15849–15854 (2019).
  • [8] Mikhail Belkin, Alexander Rakhlin, and Alexandre B. Tsybakov. “Does data interpolation contradict statistical optimality?”. In Proceedings of Machine Learning Research. Volume 89, pages 1611–1619. PMLR (2019). url: https://proceedings.mlr.press/v89/belkin19a.html.
  • [9] Vidya Muthukumar, Kailas Vodrahalli, Vignesh Subramanian, and Anant Sahai. “Harmless interpolation of noisy data in regression”. IEEE Journal on Selected Areas in Information Theory 1, 67–83 (2020).
  • [10] Vidya Muthukumar, Adhyyan Narang, Vignesh Subramanian, Mikhail Belkin, Daniel Hsu, and Anant Sahai. “Classification vs regression in overparameterized regimes: Does the loss function matter?”. J. Mach. Learn. Res. 22, 1–69 (2021). url: http://jmlr.org/papers/v22/20-603.html.
  • [11] Yehuda Dar, Vidya Muthukumar, and Richard G. Baraniuk. “A farewell to the bias-variance tradeoff? an overview of the theory of overparameterized machine learning” (2021). arXiv:2109.02355.
  • [12] Marcello Benedetti, Erika Lloyd, Stefan Sack, and Mattia Fiorentini. “Parameterized quantum circuits as machine learning models”. Quantum Sci. Technol. 4, 043001 (2019).
  • [13] K. Mitarai, M. Negoro, M. Kitagawa, and K. Fujii. “Quantum circuit learning”. Phys. Rev. A 98, 032309 (2018).
  • [14] Maria Schuld, Ville Bergholm, Christian Gogolin, Josh Izaac, and Nathan Killoran. “Evaluating analytic gradients on quantum hardware”. Phys. Rev. A 99, 032331 (2019).
  • [15] Maria Schuld and Nathan Killoran. “Quantum machine learning in feature hilbert spaces”. Phys. Rev. Lett. 122, 040504 (2019).
  • [16] Vojtěch Havlíček, Antonio D. Córcoles, Kristan Temme, Aram W. Harrow, Abhinav Kandala, Jerry M. Chow, and Jay M. Gambetta. “Supervised learning with quantum-enhanced feature spaces”. Nature 567, 209–212 (2019).
  • [17] Seth Lloyd and Christian Weedbrook. “Quantum generative adversarial learning”. Phys. Rev. Lett. 121, 040502 (2018).
  • [18] Pierre-Luc Dallaire-Demers and Nathan Killoran. “Quantum generative adversarial networks”. Phys. Rev. A 98, 012324 (2018).
  • [19] Amira Abbas, David Sutter, Christa Zoufal, Aurelien Lucchi, Alessio Figalli, and Stefan Woerner. “The power of quantum neural networks”. Nat. Comput. Sci. 1, 403–409 (2021).
  • [20] Logan G. Wright and Peter L. McMahon. “The capacity of quantum neural networks”. In 2020 Conference on Lasers and Electro-Optics (CLEO). Pages 1–2.  (2020). url: https://ieeexplore.ieee.org/document/9193529.
  • [21] Sukin Sim, Peter D. Johnson, and Alán Aspuru-Guzik. “Expressibility and entangling capability of parameterized quantum circuits for hybrid quantum-classical algorithms”. Adv. Quantum Technol. 2, 1900070 (2019).
  • [22] Thomas Hubregtsen, Josef Pichlmeier, Patrick Stecher, and Koen Bertels. “Evaluation of parameterized quantum circuits: on the relation between classification accuracy, expressibility and entangling capability”. Quantum Mach. Intell. 3, 1 (2021).
  • [23] Jarrod R McClean, Sergio Boixo, Vadim N Smelyanskiy, Ryan Babbush, and Hartmut Neven. “Barren plateaus in quantum neural network training landscapes”. Nat. Commun. 9, 4812 (2018).
  • [24] Marco Cerezo, Akira Sone, Tyler Volkoff, Lukasz Cincio, and Patrick J Coles. “Cost function dependent barren plateaus in shallow parametrized quantum circuits”. Nat. Commun. 12, 1791 (2021).
  • [25] Matthias C. Caro, Elies Gil-Fuster, Johannes Jakob Meyer, Jens Eisert, and Ryan Sweke. “Encoding-dependent generalization bounds for parametrized quantum circuits”. Quantum 5, 582 (2021).
  • [26] Hsin-Yuan Huang, Michael Broughton, Masoud Mohseni, Ryan Babbush, Sergio Boixo, Hartmut Neven, and Jarrod R McClean. “Power of data in quantum machine learning”. Nat. Commun. 12, 2631 (2021).
  • [27] Matthias C. Caro, Hsin-Yuan Huang, M. Cerezo, Kunal Sharma, Andrew Sornborger, Lukasz Cincio, and Patrick J. Coles. “Generalization in quantum machine learning from few training data”. Nat. Commun. 13, 4919 (2022).
  • [28] Leonardo Banchi, Jason Pereira, and Stefano Pirandola. “Generalization in quantum machine learning: A quantum information standpoint”. PRX Quantum 2, 040321 (2021).
  • [29] Francisco Javier Gil Vidal and Dirk Oliver Theis. “Input redundancy for parameterized quantum circuits”. Front. Phys. 8, 297 (2020).
  • [30] Maria Schuld, Ryan Sweke, and Johannes Jakob Meyer. “Effect of data encoding on the expressive power of variational quantum-machine-learning models”. Phys. Rev. A 103, 032430 (2021).
  • [31] David Wierichs, Josh Izaac, Cody Wang, and Cedric Yen-Yu Lin. “General parameter-shift rules for quantum gradients”. Quantum 6, 677 (2022).
  • [32] Kendall E Atkinson. “An introduction to numerical analysis”. John Wiley & Sons.  (2008).
  • [33] Ali Rahimi and Benjamin Recht. “Random features for large-scale kernel machines”. In Advances in Neural Information Processing Systems. Volume 20.  (2007). url: https://papers.nips.cc/paper_files/paper/2007/hash/013a006f03dbc5392effeb8f18fda755-Abstract.html.
  • [34] Walter Rudin. “The basic theorems of fourier analysis”. John Wiley & Sons, Ltd.  (1990).
  • [35] Song Mei and Andrea Montanari. “The generalization error of random features regression: Precise asymptotics and the double descent curve”. Commun. Pure Appl. Math. 75, 667–766 (2022).
  • [36] Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J. Tibshirani. “Surprises in high-dimensional ridgeless least squares interpolation”. Ann. Stat. 50, 949 – 986 (2022).
  • [37] Tengyuan Liang, Alexander Rakhlin, and Xiyu Zhai. “On the multiple descent of minimum-norm interpolants and restricted lower isometry of kernels”. In Proceedings of Machine Learning Research. Volume 125, pages 1–29. PMLR (2020). url: http://proceedings.mlr.press/v125/liang20a.html.
  • [38] Edward Farhi and Hartmut Neven. “Classification with quantum neural networks on near term processors” (2018). arXiv:1802.06002.
  • [39] Maria Schuld, Alex Bocharov, Krysta M. Svore, and Nathan Wiebe. “Circuit-centric quantum classifiers”. Phys. Rev. A 101, 032308 (2020).
  • [40] Adrián Pérez-Salinas, Alba Cervera-Lierta, Elies Gil-Fuster, and José I. Latorre. “Data re-uploading for a universal quantum classifier”. Quantum 4, 226 (2020).
  • [41] Sofiene Jerbi, Lukas J Fiderer, Hendrik Poulsen Nautrup, Jonas M Kübler, Hans J Briegel, and Vedran Dunjko. “Quantum machine learning beyond kernel methods”. Nat. Commun. 14, 517 (2023).
  • [42] Casper Gyurik, Dyon Vreumingen, van, and Vedran Dunjko. “Structural risk minimization for quantum linear classifiers”. Quantum 7, 893 (2023).
  • [43] Maria Schuld. “Supervised quantum machine learning models are kernel methods” (2021). arXiv:2101.11020.
  • [44] S. Shin, Y. S. Teo, and H. Jeong. “Exponential data encoding for quantum supervised learning”. Phys. Rev. A 107, 012422 (2023).
  • [45] Sophie Piccard. “Sur les ensembles de distances des ensembles de points d’un espace euclidien.”. Memoires de l’Universite de Neuchatel. Secretariat de l’Universite.  (1939).
  • [46] Dave Wecker, Matthew B. Hastings, Nathan Wiebe, Bryan K. Clark, Chetan Nayak, and Matthias Troyer. “Solving strongly correlated electron models on a quantum computer”. Phys. Rev. A 92, 062318 (2015).
  • [47] Ian D. Kivlichan, Jarrod McClean, Nathan Wiebe, Craig Gidney, Alán Aspuru-Guzik, Garnet Kin-Lic Chan, and Ryan Babbush. “Quantum simulation of electronic structure with linear depth and connectivity”. Phys. Rev. Lett. 120, 110501 (2018).
  • [48] Martín Larocca, Frédéric Sauvage, Faris M. Sbahi, Guillaume Verdon, Patrick J. Coles, and M. Cerezo. “Group-invariant quantum machine learning”. PRX Quantum 3, 030341 (2022).
  • [49] Johannes Jakob Meyer, Marian Mularski, Elies Gil-Fuster, Antonio Anna Mele, Francesco Arzani, Alissa Wilms, and Jens Eisert. “Exploiting symmetry in variational quantum machine learning”. PRX Quantum 4, 010328 (2023).
  • [50] Martin Larocca, Nathan Ju, Diego García-Martín, Patrick J Coles, and Marco Cerezo. “Theory of overparametrization in quantum neural networks”. Nat. Comput. Sci. 3, 542–551 (2023).
  • [51] Yuxuan Du, Min-Hsiu Hsieh, Tongliang Liu, and Dacheng Tao. “Expressive power of parametrized quantum circuits”. Phys. Rev. Res. 2, 033125 (2020).
  • [52] Zoë Holmes, Kunal Sharma, M. Cerezo, and Patrick J. Coles. “Connecting ansatz expressibility to gradient magnitudes and barren plateaus”. PRX Quantum 3, 010313 (2022).
  • [53] Samson Wang, Enrico Fontana, Marco Cerezo, Kunal Sharma, Akira Sone, Lukasz Cincio, and Patrick J Coles. “Noise-induced barren plateaus in variational quantum algorithms”. Nat. Commun. 12, 6961 (2021).
  • [54] Abdulkadir Canatar, Evan Peters, Cengiz Pehlevan, Stefan M. Wild, and Ruslan Shaydulin. “Bandwidth enables generalization in quantum kernel models”. Transactions on Machine Learning Research (2023). url: https://openreview.net/forum?id=A1N2qp4yAq.
  • [55] Hsin-Yuan Huang, Michael Broughton, Jordan Cotler, Sitan Chen, Jerry Li, Masoud Mohseni, Hartmut Neven, Ryan Babbush, Richard Kueng, John Preskill, and Jarrod R. McClean. “Quantum advantage in learning from experiments”. Science 376, 1182–1186 (2022).
  • [56] Sitan Chen, Jordan Cotler, Hsin-Yuan Huang, and Jerry Li. “Exponential separations between learning with and without quantum memory”. In 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS). Pages 574–585.  (2022).
  • [57] Hsin-Yuan Huang, Richard Kueng, and John Preskill. “Information-theoretic bounds on quantum advantage in machine learning”. Phys. Rev. Lett. 126, 190505 (2021).
  • [58] Ville Bergholm, Josh Izaac, Maria Schuld, Christian Gogolin, M. Sohaib Alam, Shahnawaz Ahmed, Juan Miguel Arrazola, Carsten Blank, Alain Delgado, Soran Jahangiri, Keri McKiernan, Johannes Jakob Meyer, Zeyue Niu, Antal Száva, and Nathan Killoran. “Pennylane: Automatic differentiation of hybrid quantum-classical computations” (2018). arXiv:1811.04968.
  • [59] Peter L. Bartlett, Philip M. Long, Gábor Lugosi, and Alexander Tsigler. “Benign overfitting in linear regression”. Proc. Natl. Acad. Sci. 117, 30063–30070 (2020).
  • [60] Vladimir Koltchinskii and Karim Lounici. “Concentration inequalities and moment bounds for sample covariance operators”. Bernoulli 23, 110 – 133 (2017).
  • [61] Zbigniew Puchała and Jarosław Adam Miszczak. “Symbolic integration with respect to the haar measure on the unitary group”. Bull. Pol. Acad. Sci. 65, 21–27 (2017).
  • [62] Daniel A. Roberts and Beni Yoshida. “Chaos and complexity by design”. J. High Energy Phys. 2017, 121 (2017).
  • [63] Wallace C. Babcock. “Intermodulation interference in radio systems frequency of occurrence and control by channel selection”. Bell Syst. tech. j. 32, 63–73 (1953).
  • [64] M. Atkinson, N. Santoro, and J. Urrutia. “Integer sets with distinct sums and differences and carrier frequency assignments for nonlinear repeaters”. IEEE Trans. Commun. 34, 614–617 (1986).
  • [65] J. Robinson and A. Bernstein. “A class of binary recurrent codes with limited error propagation”. IEEE Trans. Inf. 13, 106–113 (1967).
  • [66] R. J. F. Fang and W. A. Sandrin. “Carrier frequency assignment for nonlinear repeaters”. COMSAT Technical Review 7, 227–245 (1977).

Appendix A Solution for the classical overparameterized model

In this section we derive the optimal solution and the generalization error for the classical overparameterized weighted Fourier features model. We then discuss the conditions under which benign overfitting may be observed and construct examples of the phenomenon.

A.1 Linearization of overparameterized Fourier model

We first show that the classical overparameterized Fourier features model may be cast as a linear model under an appropriate orthogonal transformation. We are interested in learning a target function of the form

g(x)=kΩn0g^kei2πkx,g(x)=\sum_{k\in\Omega_{n_{0}}}\hat{g}_{k}e^{i2\pi kx}, (38)

with the additional constraint that the Fourier coefficients satisfy g^k=g^k\hat{g}_{k}=\hat{g}_{-k}^{*} such that gg is real. The spectrum of Eq. (38) for odd n0n_{0} only contains integer frequencies,

Ωn0={n012,,0,,n012},\Omega_{n_{0}}=\Biggl{\{}-\frac{n_{0}-1}{2},\dots,0,\dots,\frac{n_{0}-1}{2}\Biggr{\}}, (39)

and we accordingly call g a bandlimited function with bandlimit (n_{0}-1)/2, the largest frequency in \Omega_{n_{0}}. To learn g, we first sample n equally spaced datapoints on the interval [0,1],

xj=jn,j[n],x_{j}=\frac{j}{n},\qquad j\in[n], (40)

where [n]={0,1,2,,n1}[n]=\{0,1,2,\dots,n-1\} and we assume nn is odd, and we then evaluate g(xj)g(x_{j}) with additive error. This noisy sampling process yields nn observations of the form yj=g(xj)+ϵy_{j}=g(x_{j})+\epsilon with 𝔼[ϵ2]=σ2\mathbb{E}[\epsilon^{2}]=\sigma^{2}. We will fit the observations yjy_{j} using overparameterized Fourier features models of the form

f(x)=ϕ(x),𝜶=kΩdαkνkei2πkx,f(x)=\langle\phi(x),\boldsymbol{\alpha}\rangle=\sum_{k\in\Omega_{d}}\alpha_{k}\sqrt{\nu_{k}}e^{i2\pi kx}, (41)

with 𝜶d\boldsymbol{\alpha}\in\mathbb{C}^{d}, and we have introduced weighted Fourier features ϕ:d\phi:\mathbb{R}\rightarrow\mathbb{C}^{d} defined elementwise as

ϕ(x)k=νkei2πkx.\phi(x)_{k}=\sqrt{\nu_{k}}e^{i2\pi kx}. (42)

In Eq. (41), Ωd\Omega_{d} describes the set of frequencies available to the model for any choice of dnn0d\geq n\geq n_{0}. We are interested in the case where ff interpolates the observations yjy_{j}, i.e., f(xj)=yjf(x_{j})=y_{j} for all j=0,,n1j=0,\dots,n-1. To this end, we define a n×dn\times d feature matrix Φ\Phi whose rows are given by ϕ(xj)\phi(x_{j})^{\dagger}:

Φjk=[ϕ(xj)]k=νkei2πjk/n.\displaystyle\Phi_{jk}=[\phi(x_{j})]_{k}^{*}=\sqrt{\nu_{k}}e^{-i2\pi jk/n}. (43)

The interpolation condition may then be stated in matrix form as

Φ𝜶=𝐲,\Phi\boldsymbol{\alpha}=\mathbf{y}, (44)

where (\mathbf{y})_{j}=y_{j} is the vector of noisy observations. \Omega_{d} contains alias frequencies of \Omega_{n_{0}}, and so the choice of \boldsymbol{\alpha} that satisfies Eq. (44) is not unique. Here we will focus on the minimum-\ell_{2}-norm interpolating solution,

𝜶opt=argminΦ𝜶=𝐲𝜶2.\boldsymbol{\alpha}^{opt}=\operatorname*{arg\,min}_{\Phi\boldsymbol{\alpha}=\mathbf{y}}\lVert\boldsymbol{\alpha}\rVert_{2}. (45)
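To make this setup concrete, the following numpy sketch (our own illustration, not part of the derivation; the sizes n, d, n_0, the noise level, and the uniform choice of feature weights \nu_k are arbitrary assumptions) builds the weighted feature matrix of Eq. (43) for a randomly drawn bandlimited target and computes the minimum-\ell_2-norm interpolator of Eq. (45) via a pseudoinverse.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, n0, sigma = 15, 45, 5, 0.3               # n odd, d an odd multiple of n (assumptions)
ks = np.arange(d) - (d - 1) // 2               # Omega_d: symmetric integer frequencies
nu = np.ones(d) / d                            # example feature weights nu_k (arbitrary)

# bandlimited target g with random coefficients supported on Omega_{n0}
g_hat = np.zeros(d, dtype=complex)
low = np.abs(ks) <= (n0 - 1) // 2
g_hat[low] = rng.normal(size=low.sum()) + 1j * rng.normal(size=low.sum())
g_hat[low] = (g_hat[low] + np.conj(g_hat[low][::-1])) / 2   # enforce g_hat[-k] = conj(g_hat[k])

def g(x):
    """Evaluate the real-valued bandlimited target of Eq. (38) on an array of inputs."""
    return np.real(np.exp(2j * np.pi * np.outer(x, ks)) @ g_hat)

xj = np.arange(n) / n                          # uniformly spaced inputs, Eq. (40)
y = g(xj) + sigma * rng.normal(size=n)         # noisy observations y_j

Phi = np.sqrt(nu) * np.exp(-2j * np.pi * np.outer(xj, ks))   # feature matrix, Eq. (43)
alpha = np.linalg.pinv(Phi) @ y                # minimum-l2-norm interpolator, Eq. (45)
print("max interpolation residual:", np.max(np.abs(Phi @ alpha - y)))
```

The printed residual should be at machine precision, confirming that the underdetermined system of Eq. (44) is interpolated exactly.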

A.1.1 Fourier transform into the linear model

We will now show that Eq. (45) with uniformly spaced datapoints x_{j} can be solved using methods from ordinary linear regression under a suitable transformation. Defining the n-th root of unity as \omega=e^{i2\pi/n}, the j-th row of the LHS of Eq. (44) is equivalent to

kΩdαkνkωkj\displaystyle\sum_{k\in\Omega_{d}}\alpha_{k}\sqrt{\nu_{k}}\omega^{-kj} =kΩnS(k)αk+nνk+nω(k+n)j\displaystyle=\sum_{k\in\Omega_{n}}\sum_{\ell\in S(k)}\alpha_{k+n\ell}\sqrt{\nu_{k+n\ell}}\omega^{-(k+n\ell)j} (46)
=kΩnωkjS(k)αk+nνk+n\displaystyle=\sum_{k\in\Omega_{n}}\omega^{-kj}\sum_{\ell\in S(k)}\alpha_{k+n\ell}\sqrt{\nu_{k+n\ell}} (47)
:=kΩnωkjPS(k)𝜶,PS(k)𝐰.\displaystyle:=\sum_{k\in\Omega_{n}}\omega^{-kj}\langle P_{S(k)}\boldsymbol{\alpha},P_{S(k)}\mathbf{w}\rangle. (48)

where S(k)={j:jΩd,j mod n=k}S(k)=\{j:j\in\Omega_{d},j\text{ mod }n=k\} is the set of alias frequencies of kk appearing in Ωd\Omega_{d}, i.e. the set of frequencies k+nk+\ell n with \ell\in\mathbb{Z} that obey

ei2πkxj=ei2πkj/n=ei2π(k+n)j/n=ei2π(k+n)xj,e^{i2\pi kx_{j}}=e^{i2\pi kj/n}=e^{i2\pi(k+\ell n)j/n}=e^{i2\pi(k+\ell n)x_{j}}, (49)

which follows from e^{i2\pi\ell j}=1 since \ell j\in\mathbb{Z}. The operator P_{S(k)}:\mathbb{C}^{d}\rightarrow\mathbb{C}^{d} is the projector onto the set of standard basis vectors in \mathbb{C}^{d} with indices in S(k), and (\mathbf{w})_{k}:=\sqrt{\nu_{k}}. The complex exponentials built from the roots of unity are orthogonal since

k=0n1ωk(pq)=nδpq\displaystyle\sum_{k=0}^{n-1}\omega^{k(p-q)}=n\delta_{p}^{q} (50)

for p,q[n]p,q\in[n]. This implies

kΩnωk(pq)=ωmin(Ωn)(pq)nδpq=nδpq,\displaystyle\sum_{k\in\Omega_{n}}\omega^{k(p-q)}=\omega^{\min(\Omega_{n})(p-q)}n\delta_{p}^{q}=n\delta_{p}^{q}, (51)

for p,qΩnp,q\in\Omega_{n}. Defining the discrete Fourier Transform y^k\hat{y}_{k} of 𝐲\mathbf{y} according to

y^j=1nkΩnykωjk,\hat{y}_{j}=\frac{1}{n}\sum_{k\in\Omega_{n}}y_{k}\omega^{jk}, (52)

then using the identity of Eq. (51), we evaluate the jj-th row of Eq. (44) as

kΩnωkjPS(k)𝜶,PS(k)𝐰\displaystyle\sum_{k\in\Omega_{n}}\omega^{-kj}\langle P_{S(k)}\boldsymbol{\alpha},P_{S(k)}\mathbf{w}\rangle =yj,\displaystyle=y_{j}, (53)
jΩnωpjkΩnωkjPS(k)𝜶,PS(k)𝐰\displaystyle\sum_{j\in\Omega_{n}}\omega^{pj}\sum_{k\in\Omega_{n}}\omega^{-kj}\langle P_{S(k)}\boldsymbol{\alpha},P_{S(k)}\mathbf{w}\rangle =jΩnωpj(kΩny^kωkj),\displaystyle=\sum_{j\in\Omega_{n}}\omega^{pj}\left(\sum_{k\in\Omega_{n}}\hat{y}_{k}\omega^{-kj}\right), (54)
kΩnnδkpPS(k)𝜶,PS(k)𝐰\displaystyle\sum_{k\in\Omega_{n}}n\delta_{k}^{p}\langle P_{S(k)}\boldsymbol{\alpha},P_{S(k)}\mathbf{w}\rangle =kΩnnδkpy^k,\displaystyle=\sum_{k\in\Omega_{n}}n\delta_{k}^{p}\hat{y}_{k}, (55)
PS(p)𝜶,PS(p)𝐰\displaystyle\langle P_{S(p)}\boldsymbol{\alpha},P_{S(p)}\mathbf{w}\rangle =y^p.\displaystyle=\hat{y}_{p}. (56)

Inspecting this final line yields a new matrix equation: let X be the n\times d matrix with elements

Xjk=(PS(j)𝐰)k=νk𝟙{kS(j)},\displaystyle X_{jk}=(P_{S(j)}\mathbf{w})_{k}=\sqrt{\nu_{k}}\mathbbm{1}\{k\in S(j)\}, (57)

where the conditional operator 𝟙{Z}\mathbbm{1}\{Z\} evaluates to 11 if the predicate ZZ is true, and 0 otherwise. Then we may express Eq. (56) for all pΩnp\in\Omega_{n} as a matrix equation

X𝜶=𝐲^,X\boldsymbol{\alpha}=\hat{\mathbf{y}}, (58)

where (𝐲^)j=y^j(\hat{\mathbf{y}})_{j}=\hat{y}_{j}. We have shown that Eq. (44) is exactly equivalent to Eq. (58) for uniformly spaced inputs, and as 𝜶\boldsymbol{\alpha} is unchanged between these two representations this implies that the solution to Eq. (45) is also given by

𝜶opt=argminX𝜶=𝐲^𝜶2.\boldsymbol{\alpha}^{opt}=\operatorname*{arg\,min}_{X\boldsymbol{\alpha}=\hat{\mathbf{y}}}\lVert\boldsymbol{\alpha}\rVert_{2}. (59)

Therefore, the minimum-\ell_{2}-norm weighted Fourier features model that interpolates the samples y_{j} is exactly the minimum-\ell_{2}-norm solution of an equivalent linear regression problem with design matrix X and targets \hat{\mathbf{y}}. Furthermore, this linear regression problem is related to the original problem via a Fourier transform. Let F be the (nonunitary) discrete Fourier transform defined on \mathbb{C}^{n} elementwise as

Fjk\displaystyle F_{jk} =ωjk,\displaystyle=\omega^{jk}, (60)
FF\displaystyle F^{\dagger}F =n𝕀n.\displaystyle=n\mathbb{I}_{n}. (61)

Then for kΩdk\in\Omega_{d}, jΩnj\in\Omega_{n},

1n(FΦ)jk=1nΩnω(kj)νk=1n(nδjk mod n)νk=νk𝟙{kS(j)}=Xjk,\displaystyle\frac{1}{n}(F\Phi)_{jk}=\frac{1}{n}\sum_{\ell\in\Omega_{n}}\omega^{(k-j)\ell}\sqrt{\nu_{k}}=\frac{1}{n}\left(n\delta_{j}^{k\text{ mod }n}\right)\sqrt{\nu_{k}}=\sqrt{\nu_{k}}\mathbbm{1}\{k\in S(j)\}=X_{jk}, (62)

implying

1nFΦ\displaystyle\frac{1}{n}F\Phi =X.\displaystyle=X. (63)

We may similarly recover 1nF𝐲=𝐲^\frac{1}{n}F\mathbf{y}=\hat{\mathbf{y}} to show that the coefficients y^k\hat{y}_{k} are given by a discrete Fourier transform of yjy_{j}.
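Continuing the sketch above (it reuses Phi, y, nu, ks, alpha, and n; the particular ordering of the rows and columns of F below is an assumption of this illustration rather than something fixed by the text), Eqs. (57), (58), (59) and (63) can be checked numerically:

```python
omega = np.exp(2j * np.pi / n)
js = np.arange(n) - (n - 1) // 2                   # base frequencies Omega_n
F = omega ** np.outer(js, np.arange(n))            # DFT matrix acting on the n samples

X = F @ Phi / n                                    # Eq. (63)
y_hat = F @ y / n                                  # hat{y}, the DFT of the observations

# Eq. (57): X_{jk} = sqrt(nu_k) exactly when k is an alias of j, and 0 otherwise
X_expected = np.sqrt(nu) * (np.mod(ks[None, :] - js[:, None], n) == 0)
print("X matches Eq. (57):      ", np.allclose(X, X_expected))
print("X alpha = y_hat (Eq. 58):", np.allclose(X @ alpha, y_hat))
print("same min-norm solution:  ", np.allclose(np.linalg.pinv(X) @ y_hat, alpha))
```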

A.1.2 Error analysis of the linear model

Having shown that the discrete Fourier transform F relates the original system of trigonometric equations \Phi\boldsymbol{\alpha}=\mathbf{y} to the system of linear equations X\boldsymbol{\alpha}=\hat{\mathbf{y}} (an ordinary least squares problem in which we treat the rows of X as observations in \mathbb{R}^{d}), we now derive the error of the model in this Fourier representation. The standard treatment of ordinary least squares (OLS) gives the minimum-\ell_{2}-norm solution to Eq. (59) as

𝜶opt=XT(XXT)1𝐲^\boldsymbol{\alpha}^{opt}=X^{T}(XX^{T})^{-1}\hat{\mathbf{y}} (64)

in which case the optimal interpolating overparameterized Fourier model is

fopt(x)\displaystyle f^{opt}(x) =ϕ(x),𝜶opt\displaystyle=\langle\phi(x),\boldsymbol{\alpha}^{opt}\rangle (65)
=kΩny^kjS(k)νjS(k)νei2πx.\displaystyle=\sum_{k\in\Omega_{n}}\frac{\hat{y}_{k}}{\sum_{j\in S(k)}\nu_{j}}\,\sum_{\ell\in S(k)}\nu_{\ell}e^{i2\pi\ell x}. (66)

Once we have trained a model on noisy samples 𝐲\mathbf{y} of gg using uniformly spaced values of xx, we would like to evaluate how well the model performs for arbitrary xx in [0,1][0,1]. Given some function f=ϕ(x),𝜶f=\langle\phi(x),\boldsymbol{\alpha}\rangle the mean squared error 𝔼x(f(x)g(x))2\mathbb{E}_{x}(f(x)-g(x))^{2} may be evaluated with respect to the interval [0,1][0,1]. We define the generalization error of the model as the expected mean squared error evaluated with respect to 𝐲\mathbf{y}:

L(f):=𝔼x,𝐲(f(x)g(x))2.\displaystyle L(f):=\mathbb{E}_{x,\mathbf{y}}(f(x)-g(x))^{2}. (67)

We decompose the generalization error of the model as

𝔼x,𝐲(f(x)g(x))2\displaystyle\mathbb{E}_{x,\mathbf{y}}(f(x)-g(x))^{2} =𝔼x,𝐲(f(x)𝔼𝐲f(x)+𝔼𝐲f(x)g(x))2\displaystyle=\mathbb{E}_{x,\mathbf{y}}(f(x)-\mathbb{E}_{\mathbf{y}}f(x)+\mathbb{E}_{\mathbf{y}}f(x)-g(x))^{2} (68)
=𝔼x,𝐲(f(x)𝔼𝐲f(x))2+𝔼x,𝐲(𝔼𝐲f(x)g(x))2\displaystyle=\mathbb{E}_{x,\mathbf{y}}(f(x)-\mathbb{E}_{\mathbf{y}}f(x))^{2}+\mathbb{E}_{x,\mathbf{y}}(\mathbb{E}_{\mathbf{y}}f(x)-g(x))^{2} (69)
+2𝔼x,𝐲[(f(x)𝔼𝐲f(x))(𝔼𝐲f(x)g(x))].\displaystyle\qquad\qquad+2\mathbb{E}_{x,\mathbf{y}}\left[(f(x)-\mathbb{E}_{\mathbf{y}}f(x))(\mathbb{E}_{\mathbf{y}}f(x)-g(x))\right].

Taking the expectation over \mathbf{y} first, the cross-term vanishes (since \mathbb{E}_{\mathbf{y}}[f(x)-\mathbb{E}_{\mathbf{y}}f(x)]=0 while \mathbb{E}_{\mathbf{y}}f(x)-g(x) does not depend on \mathbf{y}), resulting in a decomposition of the generalization error of the optimal model into the standard bias and variance terms:

L(fopt)\displaystyle L(f^{opt}) =𝔼x,𝐲(fopt(x)𝔼𝐲fopt(x))2var+𝔼x,𝐲(𝔼𝐲fopt(x)g(x))2bias2.\displaystyle=\underbrace{\mathbb{E}_{x,\mathbf{y}}(f^{opt}(x)-\mathbb{E}_{\mathbf{y}}f^{opt}(x))^{2}}_{\textsc{var}}+\underbrace{\mathbb{E}_{x,\mathbf{y}}(\mathbb{E}_{\mathbf{y}}f^{opt}(x)-g(x))^{2}}_{\textsc{bias}^{2}}. (70)

We now evaluate var and bias2\textsc{bias}^{2} using the linear representation developed in the previous section. Beginning with the variance, conditional on constructing Φ\Phi from the set of uniformly spaced points xjx_{j} we apply the discrete Fourier transformation to yield XX and compute 𝜶opt\boldsymbol{\alpha}^{opt} using Eq. (64):

var =𝔼x,𝐲(fopt(x)𝔼𝐲fopt(x))2\displaystyle=\mathbb{E}_{x,\mathbf{y}}(f^{opt}(x)-\mathbb{E}_{\mathbf{y}}f^{opt}(x))^{2} (71)
=𝔼x,𝐲|ϕ(x),𝜶opt𝔼𝐲𝜶opt|2\displaystyle=\mathbb{E}_{x,\mathbf{y}}|\langle\phi(x),\boldsymbol{\alpha}^{opt}-\mathbb{E}_{\mathbf{y}}\boldsymbol{\alpha}^{opt}\rangle|^{2} (72)
=𝔼x,𝐲|ϕ(x),XT(XXT)1(𝐲^𝔼𝐲𝐲^)|2\displaystyle=\mathbb{E}_{x,\mathbf{y}}|\langle\phi(x),X^{T}(XX^{T})^{-1}(\hat{\mathbf{y}}-\mathbb{E}_{\mathbf{y}}\hat{\mathbf{y}})\rangle|^{2} (73)
=Tr(𝔼𝐲[(𝐲^𝔼𝐲𝐲^)(𝐲^𝔼𝐲𝐲^)](XXT)1X𝔼x[ϕ(x)ϕ(x)]XT(XXT)1)\displaystyle=\operatorname{Tr}\left(\mathbb{E}_{\mathbf{y}}\left[(\hat{\mathbf{y}}-\mathbb{E}_{\mathbf{y}}\hat{\mathbf{y}})(\hat{\mathbf{y}}-\mathbb{E}_{\mathbf{y}}\hat{\mathbf{y}})^{\dagger}\right](XX^{T})^{-1}X\mathbb{E}_{x}\left[\phi(x)\phi(x)^{\dagger}\right]X^{T}(XX^{T})^{-1}\right) (74)
=σ2nTr((XXT)2XΣϕXT).\displaystyle=\frac{\sigma^{2}}{n}\operatorname{Tr}\left((XX^{T})^{-2}X\Sigma_{\phi}X^{T}\right). (75)

Letting \boldsymbol{\epsilon}:=(\mathbf{y}-\mathbb{E}_{\mathbf{y}}\mathbf{y}), we have simplified the above using the following (note that in Eq. (75), \mathbb{E}[X^{T}X]=\Sigma_{\phi}, so that \Sigma_{\phi} differs from the covariance matrix of the rows of X by a factor of 1/n):

𝔼𝐲[(𝐲^𝔼𝐲𝐲^)(𝐲^𝔼𝐲𝐲^)]jk\displaystyle\mathbb{E}_{\mathbf{y}}\left[(\hat{\mathbf{y}}-\mathbb{E}_{\mathbf{y}}\hat{\mathbf{y}})(\hat{\mathbf{y}}-\mathbb{E}_{\mathbf{y}}\hat{\mathbf{y}})^{\dagger}\right]_{jk} =1n2𝔼𝐲(Fϵ)j(Fϵ)k\displaystyle=\frac{1}{n^{2}}\mathbb{E}_{\mathbf{y}}(F\boldsymbol{\epsilon})_{j}(F\boldsymbol{\epsilon})_{k}^{*} (76)
=1n2𝔼𝐲(pΩnωjpϵp)(qΩnωqkϵq)\displaystyle=\frac{1}{n^{2}}\mathbb{E}_{\mathbf{y}}\left(\sum_{p\in\Omega_{n}}\omega^{-jp}\epsilon_{p}\right)\left(\sum_{q\in\Omega_{n}}\omega^{qk}\epsilon_{q}\right) (77)
=1n2p,qΩn𝔼𝐲[ϵpϵq]ωjp+kq\displaystyle=\frac{1}{n^{2}}\sum_{p,q\in\Omega_{n}}\mathbb{E}_{\mathbf{y}}[\epsilon_{p}\epsilon_{q}]\omega^{-jp+kq} (78)
=σ2nδjk,\displaystyle=\frac{\sigma^{2}}{n}\delta_{j}^{k}, (79)

where we have used the fact that the errors ϵ\epsilon are independent and zero mean, 𝔼𝐲[ϵpϵq]=𝔼𝐲[ϵp2]δpq\mathbb{E}_{\mathbf{y}}[\epsilon_{p}\epsilon_{q}]=\mathbb{E}_{\mathbf{y}}[\epsilon_{p}^{2}]\delta_{p}^{q}. We have defined the feature covariance matrix as Σϕ=𝔼x[ϕ(x)ϕ(x)]\Sigma_{\phi}=\mathbb{E}_{x}[\phi(x)\phi(x)^{\dagger}], which may be computed elementwise using the orthonormality of Fourier features on [0,1][0,1]:

(Σϕ)jk=𝔼x[ϕ(x)jϕ(x)k]=νjνkei2πjx,ei2πkxL2[0,1]=δjkνj.\displaystyle(\Sigma_{\phi})_{jk}=\mathbb{E}_{x}[\phi(x)_{j}\phi(x)_{k}^{*}]=\sqrt{\nu_{j}\nu_{k}}\langle e^{i2\pi jx},e^{i2\pi kx}\rangle_{L_{2}^{[0,1]}}=\delta_{j}^{k}\nu_{j}. (80)

The following may be computed directly:

(XXT)jk\displaystyle(XX^{T})_{jk} =δjkS(j)ν\displaystyle=\delta_{j}^{k}\sum_{\ell\in S(j)}\nu_{\ell} (81)
(X\Sigma_{\phi}X^{T})_{jk} =\sum_{\ell\in\Omega_{d}}X_{j\ell}\,\nu_{\ell}\,X_{k\ell} (82)
=\sum_{\ell\in\Omega_{d}}\nu_{\ell}^{2}\,\mathbbm{1}\{\ell\in S(j)\}\mathbbm{1}\{\ell\in S(k)\} (83)
=\delta_{j}^{k}\sum_{\ell\in S(k)}\nu_{\ell}^{2}. (84)

In line (84) we have used the identity

𝟙{S(j)}𝟙{S(k)}=𝟙{S(j)}δkj,\mathbbm{1}\{\ell\in S(j)\}\mathbbm{1}\{\ell\in S(k)\}=\mathbbm{1}\{\ell\in S(j)\}\delta_{k}^{j}, (85)

since S(k)k= mod n\ell\in S(k)\Rightarrow k=\ell\text{ mod }n and therefore S(j)j= mod n=k\ell\in S(j)\Rightarrow j=\ell\text{ mod }n=k. We now compute the variance as

var =\frac{\sigma^{2}}{n}\operatorname{Tr}\left((XX^{T})^{-2}X\Sigma_{\phi}X^{T}\right) (86)
=σ2nj,kΩn(δkjS(k)ν)2(δjkS(k)ν2)\displaystyle=\frac{\sigma^{2}}{n}\sum_{j,k\in\Omega_{n}}\left(\delta_{k}^{j}\sum_{\ell\in S(k)}\nu_{\ell}\right)^{-2}\left(\delta_{j}^{k}\sum_{\ell\in S(k)}\nu_{\ell}^{2}\right) (87)
=σ2nkΩnS(k)ν2(S(k)ν)2.\displaystyle=\frac{\sigma^{2}}{n}\sum_{k\in\Omega_{n}}\frac{\sum_{\ell\in S(k)}\nu_{\ell}^{2}}{\left(\sum_{\ell\in S(k)}\nu_{\ell}\right)^{2}}. (88)

To evaluate the bias2\textsc{bias}^{2}, we will first rewrite g(x)g(x) in terms of its Fourier representation,

g(x)\displaystyle g(x) =ϕ(x),𝜶0,\displaystyle=\langle\phi(x),\boldsymbol{\alpha}_{0}\rangle, (89)
(𝜶0)k\displaystyle(\boldsymbol{\alpha}_{0})_{k} =g^kνk𝟙{kΩn0}.\displaystyle=\frac{\hat{g}_{k}}{\sqrt{\nu_{k}}}\mathbbm{1}\{k\in\Omega_{n_{0}}\}. (90)

Noting that (X𝜶0)k=g^k(X\boldsymbol{\alpha}_{0})_{k}=\hat{g}_{k} implies X𝜶0=𝔼𝐲𝐲^X\boldsymbol{\alpha}_{0}=\mathbb{E}_{\mathbf{y}}\hat{\mathbf{y}} we evaluate

bias2\displaystyle\textsc{bias}^{2} =𝔼x,𝐲(𝔼𝐲fopt(x)g(x))2\displaystyle=\mathbb{E}_{x,\mathbf{y}}(\mathbb{E}_{\mathbf{y}}f^{opt}(x)-g(x))^{2} (91)
=𝔼x,𝐲|ϕ(x),𝔼𝐲𝜶opt𝜶0|2\displaystyle=\mathbb{E}_{x,\mathbf{y}}|\langle\phi(x),\mathbb{E}_{\mathbf{y}}\boldsymbol{\alpha}^{opt}-\boldsymbol{\alpha}_{0}\rangle|^{2} (92)
=Tr(𝔼x[ϕ(x)ϕ(x)](𝔼𝐲𝜶opt𝜶0)(𝔼𝐲𝜶opt𝜶0))\displaystyle=\operatorname{Tr}\left(\mathbb{E}_{x}\left[\phi(x)\phi(x)^{\dagger}\right](\mathbb{E}_{\mathbf{y}}\boldsymbol{\alpha}^{opt}-\boldsymbol{\alpha}_{0})(\mathbb{E}_{\mathbf{y}}\boldsymbol{\alpha}^{opt}-\boldsymbol{\alpha}_{0})^{\dagger}\right) (93)
=Tr(Σϕ(XT(XXT)1X𝕀d)𝜶0𝜶0(XT(XXT)1X𝕀d))\displaystyle=\operatorname{Tr}\left(\Sigma_{\phi}(X^{T}(XX^{T})^{-1}X-\mathbb{I}_{d})\boldsymbol{\alpha}_{0}\boldsymbol{\alpha}_{0}^{\dagger}(X^{T}(XX^{T})^{-1}X-\mathbb{I}_{d})^{\dagger}\right) (94)
=Σϕ1/2(𝕀dXT(XXT)1X)𝜶022.\displaystyle=\lVert\Sigma_{\phi}^{1/2}(\mathbb{I}_{d}-X^{T}(XX^{T})^{-1}X)\boldsymbol{\alpha}_{0}\rVert_{2}^{2}. (95)

While Eq. (95) already completely characterizes the bias in terms of the choice of feature weights and the input data, it may be greatly simplified by taking advantage of the sparsity of X. We have

(XT(XXT)1X)jk\displaystyle(X^{T}(XX^{T})^{-1}X)_{jk} =Ωn(mS()νm)1XjXk\displaystyle=\sum_{\ell\in\Omega_{n}}\left(\sum_{m\in S(\ell)}\nu_{m}\right)^{-1}X_{\ell j}X_{\ell k} (96)
=(mS(j mod n)νm)1νkνj𝟙{kS(j mod n)},\displaystyle=\left(\sum_{m\in S(j\text{ mod }n)}\nu_{m}\right)^{-1}\sqrt{\nu_{k}}\sqrt{\nu_{j}}\mathbbm{1}\{k\in S(j\text{ mod }n)\}, (97)

where in line (97) we have used

Ωn𝟙{kS()}𝟙{jS()}=𝟙{kS(j mod n)}\sum_{\ell\in\Omega_{n}}\mathbbm{1}\{k\in S(\ell)\}\mathbbm{1}\{j\in S(\ell)\}=\mathbbm{1}\{k\in S(j\text{ mod }n)\} (98)

for k,j\in\Omega_{d} and \ell\in\Omega_{n}, which follows from reasoning similar to that used for Eq. (84). And so, writing

(Σϕ1/2\displaystyle(\Sigma_{\phi}^{1/2} (𝕀dXT(XXT)1X)𝜶0)\displaystyle(\mathbb{I}_{d}-X^{T}(XX^{T})^{-1}X)\boldsymbol{\alpha}_{0})_{\ell} (99)
=jΩd[δjg^j𝟙{jΩn0}\displaystyle=\sum_{j\in\Omega_{d}}\Bigg{[}\delta_{\ell}^{j}\hat{g}_{j}\mathbbm{1}\{j\in\Omega_{n_{0}}\}
kΩdνδj(mS(j mod n)νm)1g^kνj𝟙{kS(j mod n)}𝟙{kΩn0}]\displaystyle\qquad-\sum_{k\in\Omega_{d}}\sqrt{\nu_{\ell}}\delta_{j}^{\ell}\left(\sum_{m\in S(j\text{ mod }n)}\nu_{m}\right)^{-1}\hat{g}_{k}\sqrt{\nu_{j}}\mathbbm{1}\{k\in S(j\text{ mod }n)\}\mathbbm{1}\{k\in\Omega_{n_{0}}\}\Bigg{]} (100)
=(jΩn0δjg^jkΩn0ν(mS( mod n)νm)1g^𝟙{kS( mod n)})\displaystyle=\left(\sum_{j\in\Omega_{n_{0}}}\delta_{\ell}^{j}\hat{g}_{j}-\sum_{k\in\Omega_{n_{0}}}\nu_{\ell}\left(\sum_{m\in S(\ell\text{ mod }n)}\nu_{m}\right)^{-1}\hat{g}_{\ell}\mathbbm{1}\{k\in S(\ell\text{ mod }n)\}\right) (101)
=\left(\mathbbm{1}\{\ell\in\Omega_{n_{0}}\}\hat{g}_{\ell}-\sum_{k\in\Omega_{n_{0}}}\nu_{\ell}\left(\sum_{m\in S(k)}\nu_{m}\right)^{-1}\hat{g}_{k}\mathbbm{1}\{\ell\in S(k)\}\right). (102)

Letting Qk:=mS(k)νmQ_{k}:=\sum_{m\in S(k)}\nu_{m} the bias term of the error evaluates to

\textsc{bias}^{2} =\sum_{\ell\in\Omega_{d}}|(\Sigma_{\phi}^{1/2}(\mathbb{I}_{d}-X^{T}(XX^{T})^{-1}X)\boldsymbol{\alpha}_{0})_{\ell}|^{2} (103)
=Ωd|𝟙{Ωn0}g^kΩn0νQk1g^k𝟙{S(k)}|2\displaystyle=\sum_{\ell\in\Omega_{d}}\left|\mathbbm{1}\{\ell\in\Omega_{n_{0}}\}\hat{g}_{\ell}-\sum_{k\in\Omega_{n_{0}}}\nu_{\ell}Q_{k}^{-1}\hat{g}_{k}\mathbbm{1}\{\ell\in S(k)\}\right|^{2} (104)
=Ωd(𝟙{Ωn0}|g^|2kΩn0𝟙{Ωn0}g^g^kνQk1𝟙{S(k)}\displaystyle=\sum_{\ell\in\Omega_{d}}\left(\mathbbm{1}\{\ell\in\Omega_{n_{0}}\}|\hat{g}_{\ell}|^{2}-\sum_{k\in\Omega_{n_{0}}}\mathbbm{1}\{\ell\in\Omega_{n_{0}}\}\hat{g}_{\ell}\hat{g}_{k}^{*}\nu_{\ell}Q_{k}^{-1}\mathbbm{1}\{\ell\in S(k)\}\right.
kΩn0𝟙{Ωn0}g^g^kνQk1𝟙{S(k)}\displaystyle\qquad\qquad\left.-\sum_{k\in\Omega_{n_{0}}}\mathbbm{1}\{\ell\in\Omega_{n_{0}}\}\hat{g}_{\ell}^{*}\hat{g}_{k}\nu_{\ell}Q_{k}^{-1}\mathbbm{1}\{\ell\in S(k)\}\right.
+j,kΩn0ν2Qj1Qk1g^jg^k𝟙{S(j)}𝟙{S(k)})\displaystyle\qquad\qquad\left.+\sum_{j,k\in\Omega_{n_{0}}}\nu_{\ell}^{2}Q_{j}^{-1}Q_{k}^{-1}\hat{g}_{j}^{*}\hat{g}_{k}\mathbbm{1}\{\ell\in S(j)\}\mathbbm{1}\{\ell\in S(k)\}\right) (105)
=(kΩn|g^k|22kΩn0νk|g^k|2Qk1+kΩn0S(k)ν2Qk2|g^k|2)\displaystyle=\left(\sum_{k\in\Omega_{n}}|\hat{g}_{k}|^{2}-2\sum_{k\in\Omega_{n_{0}}}\nu_{k}|\hat{g}_{k}|^{2}Q_{k}^{-1}+\sum_{k\in\Omega_{n_{0}}}\sum_{\ell\in S(k)}\nu_{\ell}^{2}Q_{k}^{-2}|\hat{g}_{k}|^{2}\right) (106)
=kΩn0|g^k|2Qk2((Qkνk)2+S(k)\kν2)\displaystyle=\sum_{k\in\Omega_{n_{0}}}\frac{|\hat{g}_{k}|^{2}}{Q_{k}^{2}}\left((Q_{k}-\nu_{k})^{2}+\sum_{\ell\in S(k)\backslash k}\nu_{\ell}^{2}\right) (107)
=kΩn0|g^k|2(S(k)\kν)2+S(k)\kν2(S(k)ν)2.\displaystyle=\sum_{k\in\Omega_{n_{0}}}|\hat{g}_{k}|^{2}\frac{\left(\sum_{\ell\in S(k)\backslash k}\nu_{\ell}\right)^{2}+\sum_{\ell\in S(k)\backslash k}\nu_{\ell}^{2}}{\left(\sum_{\ell\in S(k)}\nu_{\ell}\right)^{2}}. (108)

With Eqs. (88) and (108) we have recovered a closed-form expression for the generalization error in terms of the feature weights \nu_{k}, the target function Fourier coefficients \hat{g}_{k}, and the noise variance \sigma^{2}. This was possible in part because of the orthonormality of the Fourier features \phi(x)_{k} and the choice to sample x_{j} uniformly on [0,1], which make \Sigma_{\phi} and XX^{T} diagonal, respectively. This simplicity makes it easier to study scenarios in which benign overfitting can arise than in prior works [6, 59]. We will now analyze choices of feature weights \nu_{k} that may give rise to benign overfitting for overparameterized weighted Fourier features models.

We remark that the results of this section may also be derived using the methods of Ref. [9], though we have opted here to use a language more reminiscent of linear regression to highlight connections between the analysis of weighted Fourier features models and ordinary least squares errors.
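As a sanity check of the closed-form error, the sketch below (again illustrative, reusing the quantities defined in the earlier snippets) evaluates Eqs. (88) and (108) for the uniform weights chosen there and compares their sum against a Monte Carlo estimate of the generalization error of the minimum-norm interpolator; the two values should agree up to sampling fluctuations.

```python
def closed_form_error(nu, g_hat, ks, n, sigma):
    """Evaluate var (Eq. 88) and bias^2 (Eq. 108) for given feature weights."""
    var, bias2 = 0.0, 0.0
    for k in range(-(n - 1) // 2, (n - 1) // 2 + 1):
        S = np.mod(ks - k, n) == 0                 # alias cohort S(k) inside Omega_d
        Q = nu[S].sum()
        var += (nu[S] ** 2).sum() / Q ** 2
        if np.any((ks == k) & (g_hat != 0)):       # k in the support of the target
            g2 = np.abs(g_hat[ks == k][0]) ** 2
            not_k = S & (ks != k)
            bias2 += g2 * (nu[not_k].sum() ** 2 + (nu[not_k] ** 2).sum()) / Q ** 2
    return sigma ** 2 * var / n, bias2

var_cf, bias2_cf = closed_form_error(nu, g_hat, ks, n, sigma)

xs = np.linspace(0, 1, 2000, endpoint=False)
Phix = np.sqrt(nu) * np.exp(-2j * np.pi * np.outer(xs, ks))   # features on a fine grid
Phi_pinv = np.linalg.pinv(Phi)
errs = [np.mean(np.abs(Phix @ (Phi_pinv @ (g(xj) + sigma * rng.normal(size=n))) - g(xs)) ** 2)
        for _ in range(200)]
print("closed form (var + bias^2):", var_cf + bias2_cf)
print("Monte Carlo estimate:      ", np.mean(errs))
```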

A.2 Solutions for special cases

A.2.1 Noise-only minimum norm estimator

We now derive the cases considered in Sec. 2.2 of the main text. We first consider the case in which the target function is g=0 and we attempt to fit a pure noise signal

yj=ϵj,y_{j}=\epsilon_{j}, (109)

with 𝔼[ϵ2]=σ2\mathbb{E}[\epsilon^{2}]=\sigma^{2}. We can recover an unaliased, “simple” fit to this pure noise signal by reconstructing ϵ\boldsymbol{\epsilon} from the DFT using Eq. (52):

ϵj=kΩnϵ^kei2πjk/n.\epsilon_{j}=\sum_{k\in\Omega_{n}}\hat{\epsilon}_{k}e^{i2\pi jk/n}. (110)

By setting all weights νk=1\nu_{k}=1, an immediate choice for an interpolating ff with access to frequencies in Ωd\Omega_{d} is found by setting

αk=ϵ^k mod nn(m+1),\alpha_{k}=\frac{\hat{\epsilon}_{k\text{ mod }n}}{n(m+1)}, (111)

where we have assumed d=n(m+1). Eq. (111) is the minimum-\ell_{2}-norm estimator and evenly distributes the weight of each Fourier coefficient \hat{\epsilon}_{k} over the alias frequencies in S(k). The effect of the higher-frequency aliases is to reduce f to near zero everywhere except at the interpolated training data. We can directly compute the generalization error of the interpolating estimator as

𝔼x,𝐲|f(x)g(x)|2\displaystyle\mathbb{E}_{x,\mathbf{y}}|f(x)-g(x)|^{2} =𝔼x|f(x)|2\displaystyle=\mathbb{E}_{x}\left|f(x)\right|^{2} (112)
=𝔼𝐲|k=0d1αkei2πkx|2\displaystyle=\mathbb{E}_{\mathbf{y}}\left|\sum_{k=0}^{d-1}\alpha_{k}e^{-i2\pi kx}\right|^{2} (113)
=𝔼𝐲k=0d1|ϵ^k mod nn(m+1)|2\displaystyle=\mathbb{E}_{\mathbf{y}}\sum_{k=0}^{d-1}\left|\frac{\hat{\epsilon}_{k\text{ mod }n}}{n(m+1)}\right|^{2} (114)
=𝔼𝐲n(m+1)ϵ^22n2(m+1)2\displaystyle=\mathbb{E}_{\mathbf{y}}\frac{n(m+1)\lVert\hat{\boldsymbol{\epsilon}}\rVert_{2}^{2}}{n^{2}(m+1)^{2}} (115)
=σ2n(m+1).\displaystyle=\frac{\sigma^{2}}{n(m+1)}. (116)

Using d=n(m+1) as the dimensionality of the feature space, we have recovered the lower-bound scaling for overparameterized models derived in Ref. [9]. In line (113) we have used the independence of \epsilon from x_{k} and y_{k}, in line (114) we have used the orthonormality of the Fourier basis functions on [0,1], and line (116) uses Parseval's relation for the Fourier coefficients, namely:

kΩn|ϵk|2=nkΩn|ϵ^k|2,\sum_{k\in\Omega_{n}}|\epsilon_{k}|^{2}=n\sum_{k\in\Omega_{n}}|\hat{\epsilon}_{k}|^{2}, (117)

which implies \mathbb{E}\lVert\hat{\boldsymbol{\epsilon}}\rVert_{2}^{2}=n^{-1}\mathbb{E}\lVert\boldsymbol{\epsilon}\rVert_{2}^{2}=\sigma^{2}. Figure 2a of the main text shows the effect of the number of alias cohorts m on the behavior of f, which interpolates a pure noise signal with very little bias. As the number of cohorts increases, the function deviates very little from the true “signal” y=0, and becomes very “spiky” in the vicinity of the noisy training points.
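The spiky interpolation of pure noise is easy to reproduce numerically. The sketch below (illustrative; it reuses numpy and the random generator from the earlier snippets, and the sizes are arbitrary) fits a pure-noise signal with \nu_k = 1 and an increasing number of alias cohorts, showing that the interpolant's mean-square value away from the data shrinks while the training residual stays at machine precision.

```python
n_pts, sigma0 = 15, 1.0
x_train = np.arange(n_pts) / n_pts
eps = sigma0 * rng.normal(size=n_pts)              # pure-noise labels, Eq. (109)
xs_fine = np.linspace(0, 1, 4000, endpoint=False)

for m in [0, 2, 20]:
    d_m = n_pts * (m + 1)
    k_m = np.arange(d_m) - (d_m - 1) // 2          # Omega_d with d = n(m+1)
    Phi_m = np.exp(-2j * np.pi * np.outer(x_train, k_m))   # nu_k = 1
    a_m = np.linalg.pinv(Phi_m) @ eps              # minimum-norm interpolator of the noise
    f_fine = np.exp(-2j * np.pi * np.outer(xs_fine, k_m)) @ a_m
    print(f"m = {m:2d}: mean |f|^2 = {np.mean(np.abs(f_fine) ** 2):.4f}, "
          f"interp. residual = {np.max(np.abs(Phi_m @ a_m - eps)):.1e}")
```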

A.2.2 Signal-only minimum norm estimator

Now we study the opposite situation, in which the target is a noiseless pure tone and aliases in the spectrum of f interfere with its predictions. In this case, we set \sigma=0 and interpolate the target labels

yj\displaystyle y_{j} =g^pei2πpxj,\displaystyle=\hat{g}_{p}e^{i2\pi px_{j}}, (118)

with n/2<p<n/2-n/2<p<n/2. When we set d=n(m+1)d=n(m+1) and predict on yy, there are exactly |S(p)|1=m|S(p)|-1=m aliases for the target function with frequency pp. We again assume all feature weights are equal, νk=1\nu_{k}=1, and by orthonormality of Fourier basis functions, only the components of ff with frequency in S(p)S(p) are retained:

f(x)=\alpha_{p}e^{i2\pi px}+\sum_{k\in S(p)\backslash p}\alpha_{k}e^{i2\pi kx}, (119)

which will interpolate the training points for any choice of 𝜶\boldsymbol{\alpha} satisfying

kS(p)αk=g^p.\sum_{k\in S(p)}\alpha_{k}=\hat{g}_{p}. (120)

The choice of trainable weights αk\alpha_{k} that satisfy Eq. (120) while minimizing 2\ell_{2}-norm is

αk={g^pm+1,kS(p),0, otherwise.\alpha_{k}=\begin{cases}\frac{\hat{g}_{p}}{m+1},&k\in S(p),\\ 0,&\text{ otherwise}.\end{cases} (121)

The problem with minimizing the 2\ell_{2} norm in this case is that it spreads the true signal into higher frequency aliases: The generalization error of this model is

𝔼x,𝐲|f(x)g(x)|2=|g^p(mm+1)|2+m|g^p|2(m+1)2=𝒪(1).\displaystyle\mathbb{E}_{x,\mathbf{y}}|f(x)-g(x)|^{2}=\left|\hat{g}_{p}\left(\frac{m}{m+1}\right)\right|^{2}+m\frac{|\hat{g}_{p}|^{2}}{(m+1)^{2}}=\mathcal{O}(1). (122)

We see that this model generally fails to generalize. This poor generalization was described as “signal bleeding” by Ref. [9]: Using nn samples, there is no way to distinguish the signal due to an alias of pp from the signal due to the true frequency pp, so the coefficients αk\alpha_{k} become evenly distributed over aliases with very little weight allocated to the true Fourier coefficient g^p\hat{g}_{p} in the model ff. Fig. 2b in the main text shows the effect of “signal bleed” for learning a pure tone in the absence of noise.
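The same construction reproduces signal bleeding (continuing the previous snippet; the tone frequency and amplitude are arbitrary choices): for a noiseless pure tone the measured generalization error approaches |\hat{g}_p|^2 as more aliases become available, matching the m/(m+1) behavior implied by Eq. (122).

```python
p, amp = 2, 1.0                                    # pure tone frequency and amplitude
y_tone = amp * np.exp(2j * np.pi * p * x_train)    # noiseless labels, Eq. (118)
g_fine = amp * np.exp(2j * np.pi * p * xs_fine)

for m in [0, 2, 20]:
    d_m = n_pts * (m + 1)
    k_m = np.arange(d_m) - (d_m - 1) // 2
    Phi_m = np.exp(-2j * np.pi * np.outer(x_train, k_m))
    a_m = np.linalg.pinv(Phi_m) @ y_tone           # min-norm fit spreads over m+1 aliases
    f_fine = np.exp(-2j * np.pi * np.outer(xs_fine, k_m)) @ a_m
    err = np.mean(np.abs(f_fine - g_fine) ** 2)
    print(f"m = {m:2d}: generalization error = {err:.4f}  (m/(m+1) = {m / (m + 1):.4f})")
```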

A.3 Conditions for Benign overfitting

The behavior of the error of the overparameterized Fourier model (Eq. (19) of the main text) depends on an interplay between the noise variance \sigma^{2}, the signal Fourier coefficients \hat{g}_{k}, the feature weights \nu_{k}, and the size of the model d. A desirable property of the feature weights \nu_{k} is that they result in a model that both interpolates the sampled data and achieves good generalization error in the limit n\rightarrow\infty. For our purposes we will consider cases where \lim_{n\rightarrow\infty}L(f^{opt})=0, though this condition could be relaxed to allow for more interesting or natural choices of \nu_{k}. We now analyze the error arising from a simple weighting scheme to demonstrate benign overfitting using the overparameterized Fourier models discussed in this work.

A.3.1 A simple demonstration of benign overfitting

Here we demonstrate a simple example of benign overfitting when the feature weights νk\nu_{k} are chosen with direct knowledge of the spectrum Ωn0\Omega_{n_{0}} of gg. For some n0<n<dn_{0}<n<d, fix c(0,1)c\in(0,1) and use the feature weights given as

νk={cn0,kΩn0,1cdn0,otherwise,\displaystyle\nu_{k}=\begin{cases}\frac{c}{n_{0}},&k\in\Omega_{n_{0}},\\ \frac{1-c}{d-n_{0}},&\text{otherwise},\end{cases} (123)

for all kΩdk\in\Omega_{d}. For simplicity, suppose d=n(m+1)d=n(m+1) such that |S(k)|=m+1|S(k)|=m+1 for all kΩnk\in\Omega_{n}. Defining the signal power as

P:=kΩn0|g^k|2,P:=\sum_{k\in\Omega_{n_{0}}}|\hat{g}_{k}|^{2}, (124)

we can directly evaluate var of Eq. (88) and bias2\textsc{bias}^{2} of Eq. (108):

var σ2n(n0(cn0)2+m(1cdn0)2(cn0+m1cdn0)2+nn0m+1).\displaystyle\rightarrow\frac{\sigma^{2}}{n}\left(n_{0}\frac{\left(\frac{c}{n_{0}}\right)^{2}+m\left(\frac{1-c}{d-n_{0}}\right)^{2}}{\left(\frac{c}{n_{0}}+m\frac{1-c}{d-n_{0}}\right)^{2}}+\frac{n-n_{0}}{m+1}\right). (125)
bias2\displaystyle\textsc{bias}^{2} Pn02(m+1)m(c1c(dn0)+mn0)2.\displaystyle\rightarrow\frac{Pn_{0}^{2}(m+1)m}{\left(\frac{c}{1-c}(d-n_{0})+mn_{0}\right)^{2}}. (126)

Fixing n0n_{0}, we can bound generalization error in the asymptotic limit nn\rightarrow\infty as:

var =𝒪(1n+nd)\displaystyle=\mathcal{O}\left(\frac{1}{n}+\frac{n}{d}\right) (127)
bias2\displaystyle\textsc{bias}^{2} =𝒪(1n2).\displaystyle=\mathcal{O}\left(\frac{1}{n^{2}}\right). (128)

Therefore, by setting d=\omega(n) the model perfectly interpolates the training data and also achieves vanishing generalization error in the limit of a large number of data points. A similar example appears in Ref. [9], though a rigorous error analysis (and its relationship to benign overfitting) was not pursued there. This benign overfitting behavior is entirely due to the feature weights of Eq. (123): as d,n\rightarrow\infty with d=\omega(n), the feature weights remain concentrated on \Omega_{n_{0}} (suppressing bias) while becoming increasingly small and evenly distributed over all aliases of \Omega_{n_{0}} (suppressing var).
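The sketch below (illustrative; it reuses the closed_form_error helper defined earlier, and the values of n_0, c, \sigma and the target coefficients are arbitrary) evaluates the closed-form var and bias^2 under the weighting of Eq. (123) with d = n^2, so that d = \omega(n), and shows both contributions shrinking as n grows.

```python
n0_b, c_b, sig_b = 5, 0.5, 0.5
g_spec = rng.normal(size=n0_b) + 1j * rng.normal(size=n0_b)   # fixed target spectrum

for n_b in [15, 45, 135]:
    d_b = n_b ** 2                                            # d = n^2, so d = omega(n)
    ks_b = np.arange(d_b) - (d_b - 1) // 2
    low_b = np.abs(ks_b) <= (n0_b - 1) // 2
    nu_b = np.where(low_b, c_b / n0_b, (1 - c_b) / (d_b - n0_b))   # weights of Eq. (123)
    gh_b = np.zeros(d_b, dtype=complex)
    gh_b[low_b] = g_spec
    v, b2 = closed_form_error(nu_b, gh_b, ks_b, n_b, sig_b)
    print(f"n = {n_b:4d}, d = {d_b:6d}:  var = {v:.5f},  bias^2 = {b2:.7f}")
```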

A.3.2 More general conditions for benign overfitting

We have derived closed-form solutions for the bias2\textsc{bias}^{2} and var terms that determine the total generalization error of an interpolating model ff and in the previous section provided a concrete example of a model that achieves benign overfitting in this setting. We now discuss conditions under which a more general choice of model can exhibit benign overfitting.

We begin by showing that the variance of Eq. (88) splits naturally into an error due to a (simple) prediction component and a (spiky, complex) interpolating component. Following Refs. [4, 6], we will split the variance into components corresponding to eigenspaces of \Sigma_{\phi} with large and small eigenvalues respectively. Let S^{\leq p} denote the set of indices for the largest p eigenvalues of \Sigma_{\phi} (i.e., the largest p values of \nu_{k}), and let S^{>p}=[d]\backslash S^{\leq p} be its complement. Define P^{\leq p}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} as the projector onto the subspace of \mathbb{R}^{d} spanned by basis vectors labelled by indices in S^{\leq p} (and P^{>p}=\mathbb{I}_{d}-P^{\leq p} is defined analogously). Then, letting (\boldsymbol{\nu})_{k}=\nu_{k} be the vector of feature weights and assuming p\leq n, we may rewrite the variance of Eq. (88) as

var =σ2nkΩnS(k)ν2(S(k)ν)2\displaystyle=\frac{\sigma^{2}}{n}\sum_{k\in\Omega_{n}}\frac{\sum_{\ell\in S(k)}\nu_{\ell}^{2}}{\left(\sum_{\ell\in S(k)}\nu_{\ell}\right)^{2}} (129)
=σ2nkΩnPS(k)𝝂22PS(k)𝝂12\displaystyle=\frac{\sigma^{2}}{n}\sum_{k\in\Omega_{n}}\frac{\lVert P_{S(k)}\boldsymbol{\nu}\rVert_{2}^{2}}{\lVert P_{S(k)}\boldsymbol{\nu}\rVert_{1}^{2}} (130)
=σ2nkΩnPpPS(k)𝝂+P>pPS(k)𝝂22PS(k)𝝂12\displaystyle=\frac{\sigma^{2}}{n}\sum_{k\in\Omega_{n}}\frac{\lVert P^{\leq p}P_{S(k)}\boldsymbol{\nu}+P^{>p}P_{S(k)}\boldsymbol{\nu}\rVert_{2}^{2}}{\lVert P_{S(k)}\boldsymbol{\nu}\rVert_{1}^{2}} (131)
σ2n(kΩnPpPS(k)𝝂22PpPS(k)𝝂12+kΩnP>pPS(k)𝝂22P>pPS(k)𝝂12)\displaystyle\leq\frac{\sigma^{2}}{n}\left(\sum_{k\in\Omega_{n}}\frac{\lVert P^{\leq p}P_{S(k)}\boldsymbol{\nu}\rVert_{2}^{2}}{\lVert P^{\leq p}P_{S(k)}\boldsymbol{\nu}\rVert_{1}^{2}}+\sum_{k\in\Omega_{n}}\frac{\lVert P^{>p}P_{S(k)}\boldsymbol{\nu}\rVert_{2}^{2}}{\lVert P^{>p}P_{S(k)}\boldsymbol{\nu}\rVert_{1}^{2}}\right) (132)
σ2[pn+1nkΩn1Rp(k)(Σϕ)],\displaystyle\leq\sigma^{2}\left[\frac{p}{n}+\frac{1}{n}\sum_{k\in\Omega_{n}}\frac{1}{R_{p}^{(k)}(\Sigma_{\phi})}\right], (133)

where we have used PpPS(k)e^j=δjk𝟙{jp}P^{\leq p}P_{S(k)}\hat{e}_{j}=\delta_{j}^{k}\mathbbm{1}\{j\leq p\} for pnp\leq n and we have introduced an effective rank for the alias cohort of kk,

Rp(k)(Σϕ)=(S(k)S>pν)2S(k)S>pν2.\displaystyle R_{p}^{(k)}(\Sigma_{\phi})=\frac{\left(\sum_{\ell\in S(k)\cap S^{>p}}\nu_{\ell}\right)^{2}}{\sum_{\ell\in S(k)\cap S^{>p}}\nu_{\ell}^{2}}. (134)

Since p=n0p=n_{0} is a relevant choice for our problem setup, we define R(k):=Rn0(k)(Σϕ)R^{(k)}:=R_{n_{0}}^{(k)}(\Sigma_{\phi}) and focus on the bound

var σ2[n0n+1nkΩn1R(k)],\displaystyle\leq\sigma^{2}\left[\frac{n_{0}}{n}+\frac{1}{n}\sum_{k\in\Omega_{n}}\frac{1}{R^{(k)}}\right], (135)

The first term of the decomposition of Eq. (135) corresponds to the variance of a (non-interpolating) model with access to n_{0} Fourier modes, while the second term corresponds to the excess error introduced by the high-frequency components of f. Given a sequence of experiments with increasing n (while g and \sigma^{2} remain fixed), we would like to understand the choices of feature weights \nu_{k} for which var vanishes as n\rightarrow\infty. Given that |\{k\in\Omega_{n}\}|=n, a sufficient condition for a sequence of feature weight distributions to be benign is that R^{(k)}=\omega(1) for all k, while a necessary condition is that there is no k for which R^{(k)}=\mathcal{O}(1/n). These conditions are not difficult to satisfy: intuitively they require only that \Sigma_{\phi} changes with increasing n in such a way that the values of \nu_{k} in \Omega_{d}\backslash\Omega_{n_{0}} continue to “flatten out” as n increases. This is precisely the behavior engineered in the example of Sec. A.3.1.
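These conditions can be checked directly for a candidate weighting scheme by computing the effective ranks numerically. The helper below (an illustrative implementation that assumes the n_0 largest weights are exactly those on \Omega_{n_0}, as in the example of Sec. A.3.1, and that needs only numpy) evaluates R^{(k)} of Eq. (134) for the weights of Eq. (123); the minimum over k is of order d/n, confirming R^{(k)} = \omega(1).

```python
def effective_rank(nu, ks, n, n0):
    """Compute R^{(k)} of Eq. (134) with p = n0 for every base frequency k in Omega_n."""
    keep = np.ones(len(nu), dtype=bool)
    keep[np.argsort(nu)[-n0:]] = False             # drop the n0 largest feature weights
    Rs = []
    for k in range(-(n - 1) // 2, (n - 1) // 2 + 1):
        S = (np.mod(ks - k, n) == 0) & keep        # S(k) intersected with S^{>n0}
        Rs.append(nu[S].sum() ** 2 / (nu[S] ** 2).sum())
    return np.array(Rs)

n_r, d_r, n0_r, c_r = 135, 135 ** 2, 5, 0.5
ks_r = np.arange(d_r) - (d_r - 1) // 2
nu_r = np.where(np.abs(ks_r) <= (n0_r - 1) // 2, c_r / n0_r, (1 - c_r) / (d_r - n0_r))
print("min effective rank:", effective_rank(nu_r, ks_r, n_r, n0_r).min())   # about d/n
```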

We now proceed to bound the bias term of Eq. (95). Observe that P:=(𝕀dXT(XXT)1X)P^{\perp}:=(\mathbb{I}_{d}-X^{T}(XX^{T})^{-1}X) is a projector onto the subspace of d\mathbb{R}^{d} orthogonal to the rows of XX, therefore satisfying P1\lVert P^{\perp}\rVert\leq 1 and PXTX=0P^{\perp}X^{T}X=0. Then the bias2\textsc{bias}^{2} is bounded as

bias2\displaystyle\textsc{bias}^{2} =Σϕ1/2P𝜶022\displaystyle=\lVert\Sigma_{\phi}^{1/2}P^{\perp}\boldsymbol{\alpha}_{0}\rVert_{2}^{2} (136)
=(ΣϕXTX)1/2P𝜶022\displaystyle=\lVert(\Sigma_{\phi}-X^{T}X)^{1/2}P^{\perp}\boldsymbol{\alpha}_{0}\rVert_{2}^{2} (137)
ΣϕXTX𝜶022.\displaystyle\leq\lVert\Sigma_{\phi}-X^{T}X\rVert_{\infty}\lVert\boldsymbol{\alpha}_{0}\rVert_{2}^{2}. (138)

The term XTXX^{T}X can be interpreted as a finite-sample estimator for Σϕ\Sigma_{\phi} in the sense that XTX=n1ΦTΦ=ΣϕX^{T}X=n^{-1}\Phi^{T}\Phi=\Sigma_{\phi} for n=dn=d. However, we cannot apply standard results on the convergence of sample covariance matrices (e.g. Ref. [60]) since the uniform spacing requirement for training data xx violates the standard assumption of i.i.d. input data. To proceed, we will make a number of simplifying assumptions about the feature weights. First, to control for the possibility that large feature weights νk\nu_{k} concentrate within a specific S(k)S(k) we will assume that feature weights corresponding to any set of alias frequencies of Ωn\Omega_{n} are close to their average. Letting d=(m+1)nd=(m+1)n, we define

η=1nkΩnνk+n,\displaystyle\eta_{\ell}=\frac{1}{n}\sum_{k\in\Omega_{n}}\nu_{k+n\ell}, (139)

for =0,1,,m\ell=0,1,\dots,m. We will impose that |νk+nη|t|\nu_{k+n\ell}-\eta_{\ell}|\leq t for all kΩnk\in\Omega_{n}, =1,,m\ell=1,\dots,m, for some positive number tt. We further assume normalization, kΩdνk=1=n=0mη\sum_{k\in\Omega_{d}}\nu_{k}=1=n\sum_{\ell=0}^{m}\eta_{\ell}. Under these assumptions we can bound the first term of (138) as

ΣϕXTX\displaystyle\lVert\Sigma_{\phi}-X^{T}X\rVert_{\infty} maxjνj1/2(kS(j mod n)\jνk1/2)\displaystyle\leq\max_{j}\nu_{j}^{1/2}\left(\sum_{k\in S(j\text{ mod }n)\backslash j}\nu_{k}^{1/2}\right) (140)
(m1)1/2maxjνj1/2(kS(j mod n)\jνk)1/2\displaystyle\leq(m-1)^{1/2}\max_{j}\nu_{j}^{1/2}\left(\sum_{k\in S(j\text{ mod }n)\backslash j}\nu_{k}\right)^{1/2} (141)
m1/2maxjνj1/2(=1m(η+t))1/2\displaystyle\leq m^{1/2}\max_{j}\nu_{j}^{1/2}\left(\sum_{\ell=1}^{m}(\eta_{\ell}+t)\right)^{1/2} (142)
(mn+m2t)1/2,\displaystyle\leq\left(\frac{m}{n}+m^{2}t\right)^{1/2}, (143)

where in line (140) we have used the Gershgorin circle theorem. Meanwhile, defining \zeta:=\sum_{k\in\Omega_{n_{0}}}\nu_{k}, by the Cauchy-Schwarz inequality we have that

𝜶02Pζ2,\lVert\boldsymbol{\alpha}_{0}\rVert^{2}\leq P\zeta^{-2}, (144)

where P is the signal power of Eq. (124). A necessary condition for producing benign overfitting in overparameterized Fourier models is that \zeta remains relatively large as n,d\rightarrow\infty. If this is accomplished, then a small enough t guarantees that all feature weights associated with frequencies k\in\Omega_{d}\backslash\Omega_{n_{0}} are uniformly suppressed. For instance, if \zeta is lower bounded by a constant while t=0, then combining Eqs. (143) and (144) yields a bound of

bias2=𝒪(d1/2n).\textsc{bias}^{2}=\mathcal{O}\left(\frac{d^{1/2}}{n}\right). (145)

Although the analysis of Sec. A.3.1 yields a significantly tighter bound, this demonstrates that the mechanisms behind that simple demonstration of benign overfitting are somewhat generic. In particular, normalization together with a lower-bounded weight on \Omega_{n_{0}} is almost sufficient to control the bias term of the generalization error.

Appendix B Solution for the quantum overparameterized model

We now derive the solution to the minimum-norm interpolating quantum model,

Mopt\displaystyle M^{opt} =argminM=Mf(xj)=yjMF,\displaystyle=\operatorname*{arg\,min}_{\begin{subarray}{c}M=M^{\dagger}\\ f(x_{j})=y_{j}\end{subarray}}\lVert M\rVert_{F}, (146)
f(x)\displaystyle f(x) =kΩei2πkx,mR(k)γMmγm.\displaystyle=\sum_{k\in\Omega}e^{i2\pi kx}\sum_{\ell,m\in R(k)}\gamma_{\ell}M_{m\ell}\gamma_{m}^{*}. (147)

We will use the following notation and definitions:

ak\displaystyle a_{k} =k+|Ω|n2n,\displaystyle=-\Biggl{\lfloor}\frac{k+\frac{|\Omega|-n}{2}}{n}\Biggl{\rfloor}, (148)
bk\displaystyle b_{k} =k+|Ω|+n21n,\displaystyle=\Biggr{\lfloor}\frac{-k+\frac{|\Omega|+n}{2}-1}{n}\Biggr{\rfloor}, (149)
Sk\displaystyle S_{k} ={ka(k)n,,k,k+n,,k+b(k)n}\displaystyle=\{k-a(k)n,\dots,k,k+n,\dots,k+b(k)n\} (150)
={k+pn,p=a(k),,b(k)},\displaystyle\qquad\qquad=\{k+pn,\quad p=-a(k),\dots,b(k)\}, (151)
R(Sk)\displaystyle R(S_{k}) :=jSkR(j).\displaystyle:=\bigcup_{j\in S_{k}}R(j). (152)

Here, a_{k} and b_{k} characterize the number of positive and negative frequency aliases of k appearing in \Omega (i.e., k+n\ell\in\Omega for all a_{k}\leq\ell\leq b_{k}), assuming that |\Omega| is odd, a requirement for any quantum model. Let L(d) be the space of linear operators on \mathbb{C}^{d}, i.e., d\times d matrices. Define the linear operator P_{k}:L(d)\rightarrow L(d) for k\in[n] as

Pk(X)=,mR(k)|X|m|m|P_{k}(X)=\sum_{\ell,m\in R(k)}\langle\ell|X|m\rangle|\ell\rangle\langle m| (153)

for any XL(d)X\in L(d). Importantly, PkP_{k} is not necessarily Hermitian preserving. Denoting Γ:=|ΓΓ|\Gamma:=|\Gamma\rangle\langle\Gamma| for brevity, we may rewrite the Fourier coefficients of f(x)f(x) as

Pk(M),Pk(Γ)\displaystyle\langle P_{k}(M),P_{k}(\Gamma)\rangle =Tr(Pk(M)Pk(Γ))\displaystyle=\operatorname{Tr}\left(P_{k}(M)^{\dagger}P_{k}(\Gamma)\right) (154)
=Tr(,mR(k)Mm|m|i,jR(k)γiγj|ij|)\displaystyle=\operatorname{Tr}\left(\sum_{\ell,m\in R(k)}M_{\ell m}^{*}|m\rangle\langle\ell|\sum_{i,j\in R(k)}\gamma_{i}\gamma_{j}^{*}|i\rangle\langle j|\right) (155)
=\sum_{\ell,m\in R(k)}\gamma_{\ell}M_{m\ell}\gamma_{m}^{*}, (156)

where in the last line we have used hermiticity of MM. Applying f(xj)=yjf(x_{j})=y_{j} j[n]\forall\,j\in[n] and substituting into Eq. (147) we find the interpolation condition

\hat{y}_{p} =\sum_{q=a_{p}}^{b_{p}}\langle P_{p+nq}(M),P_{p+nq}(\Gamma)\rangle (157)
=PSp(M),PSp(Γ),p[n],\displaystyle=\langle P_{S_{p}}(M),P_{S_{p}}(\Gamma)\rangle,\qquad\forall\,p\in[n], (158)

where P_{S_{p}}:=\sum_{q=a_{p}}^{b_{p}}P_{p+nq}. The equality follows from

Pj(X),Pk(Y)=Pj(X),Pj(Y)δjk\langle P_{j}(X),P_{k}(Y)\rangle=\langle P_{j}(X),P_{j}(Y)\rangle\delta_{j}^{k} (159)

due to the fact that R(j)R(j) and R(k)R(k) are disjoint sets for any jkj\neq k. Following the technique of Ref. [9] we apply the Cauchy-Schwarz inequality to find

PSp(M)F|PSp(M),PSp(Γ)|PSp(Γ)F,\lVert P_{S_{p}}(M)\rVert_{F}\geq\frac{|\langle P_{S_{p}}(M),P_{S_{p}}(\Gamma)\rangle|}{\lVert P_{S_{p}}(\Gamma)\rVert_{F}}, (160)

with equality if and only if P_{S_{p}}(M) is proportional to P_{S_{p}}(\Gamma). Saturating this lower bound by setting M_{\ell m}=c\gamma_{\ell}\gamma_{m}^{*} and solving for the proportionality constant c via the interpolation condition of Eq. (158), we find

y^p=cPSp(Γ),PSp(Γ)=,mR(Sp)c|γ|2|γm|2.\displaystyle\hat{y}_{p}=\langle cP_{S_{p}}(\Gamma),P_{S_{p}}(\Gamma)\rangle=\sum_{\ell,m\in R(S_{p})}c^{*}|\gamma_{\ell}|^{2}|\gamma_{m}|^{2}. (161)

This indicates an additional requirement for interpolation, namely that \gamma_{\ell},\gamma_{m}>0 for some pair (\ell,m)\in R(S_{k}) whenever \hat{y}_{k}\neq 0, and so for simplicity we will require that \gamma_{\ell}>0 for all \ell=1,\dots,d. Within each set of indices R(S_{k}), the elements of the optimal observable are defined piecewise with respect to that partition:

(,m)R(Sk)Mm=y^kγγmi,jR(Sk)|γi|2|γj|2.(\ell,m)\in R(S_{k})\Rightarrow M_{\ell m}=\hat{y}_{k}^{*}\frac{\gamma_{\ell}\gamma_{m}^{*}}{\sum_{i,j\in R(S_{k})}|\gamma_{i}|^{2}|\gamma_{j}|^{2}}. (162)

Minimization of \lVert M\rVert_{F} subject to the interpolation constraint is equivalent to minimization of \lVert P_{k+nq}(M)\rVert_{F} for all q\in[a_{k},b_{k}],\,k\in\Omega_{n}, and so solving the constrained optimization over all distinct subspaces in \bigcup_{k=0}^{n-1}R(S_{k})=\{0,1\}^{n_{q}}\times\{0,1\}^{n_{q}} we recover the optimal observable

Mopt=argminf(xj)=yjjMF=kΩny^k,mR(Sk)γγmi,jR(Sk)|γi|2|γj|2|m|.M_{opt}=\operatorname*{arg\,min}_{f(x_{j})=y_{j}\,\forall\,j}\lVert M\rVert_{F}=\sum_{k\in\Omega_{n}}\hat{y}_{k}^{*}\sum_{\ell,m\in R(S_{k})}\frac{\gamma_{\ell}\gamma_{m}^{*}}{\sum_{i,j\in R(S_{k})}|\gamma_{i}|^{2}|\gamma_{j}|^{2}}|\ell\rangle\langle m|. (163)

We now verify that this matrix is Hermitian and therefore a valid observable. We will use the following:

y^k\displaystyle\hat{y}_{k} =y^k,\displaystyle=\hat{y}_{-k}^{*}, (164)
R(k)\displaystyle R(-k) ={(m,) for all (,m)R(k)}:=R(k)T.\displaystyle=\{(m,\ell)\text{ for all }(\ell,m)\in R(k)\}:=R(k)^{T}. (165)

Eq. (164) follows from our assumption that yjj[n]y_{j}\in\mathbb{R}\,\forall\,j\in[n]. And so

Mm\displaystyle M_{\ell m} =y^kγγmi,jR(Sk)|γi|2|γj|2(,m)R(Sk)\displaystyle=\hat{y}_{k}^{*}\frac{\gamma_{\ell}\gamma_{m}^{*}}{\sum_{i,j\in R(S_{k})}|\gamma_{i}|^{2}|\gamma_{j}|^{2}}\quad\forall\,(\ell,m)\in R(S_{k}) (166)
=y^kγγmj,iR(Sk)T|γi|2|γj|2(,m)R(Sk)T\displaystyle=\hat{y}_{-k}\frac{\gamma_{\ell}\gamma_{m}^{*}}{\sum_{j,i\in R(S_{-k})^{T}}|\gamma_{i}|^{2}|\gamma_{j}|^{2}}\quad\forall\,(\ell,m)\in R(S_{-k})^{T} (167)
=(y^kγmγi,jR(Sk)|γi|2|γj|2)(m,)R(Sk)\displaystyle=\left(\hat{y}_{-k}^{*}\frac{\gamma_{m}\gamma_{\ell}^{*}}{\sum_{i,j\in R(S_{-k})}|\gamma_{i}|^{2}|\gamma_{j}|^{2}}\right)^{*}\quad\forall\,(m,\ell)\in R(S_{-k}) (168)
=Mm.\displaystyle=M_{m\ell}^{*}. (169)

In line (167) we have used Eqs. (164) and (165), while in line (169) we have used the fact that Eq. (162) holds with respect to a fixed partition; M_{\ell m} and M_{m\ell} must therefore be computed with respect to distinct partitions. The optimal model may now be rewritten in terms of base frequencies as

fopt(x)\displaystyle f^{opt}(x) =kΩny^k=akbkei2π(k+n)xi,jR(k+n)|γi|2|γj|2r=akbkc,dR(k+nr)|γc|2|γd|2.\displaystyle=\sum_{k\in\Omega_{n}}\hat{y}_{k}\sum_{\ell=a_{k}}^{b_{k}}e^{i2\pi(k+n\ell)x}\frac{\sum_{i,j\in R(k+n\ell)}|\gamma_{i}|^{2}|\gamma_{j}|^{2}}{\sum_{r=a_{k}}^{b_{k}}\sum_{c,d\in R(k+nr)}|\gamma_{c}|^{2}|\gamma_{d}|^{2}}. (170)

Recall that the optimal classical model derived in Eqs. (65) and (66) is given by

f(x)=kΩny^k=akbkei2π(k+n)xνk+nr=akbkνk+nr.f(x)=\sum_{k\in\Omega_{n}}\hat{y}_{k}\,\sum_{\ell=a_{k}}^{b_{k}}e^{i2\pi(k+\ell n)x}\frac{\nu_{k+\ell n}}{{\sum_{r=a_{k}}^{b_{k}}\nu_{k+nr}}}. (171)

Then despite Eq. (147) not having a clear decomposition into scalar feature weights and trainable weights, we can identify the feature weights of the optimized quantum model as

νkopt:=i,jR(k)|γi|2|γj|2,\nu_{k}^{opt}:=\sum_{i,j\in R(k)}|\gamma_{i}|^{2}|\gamma_{j}|^{2}, (172)

which recovers the same form as the optimal classical model of Eq. (66). This means that the behavior of the feature weights of the (optimal) general quantum model is strongly controlled by the degeneracy sets R(k). The generalization error of the quantum model is also described by Eq. (19) of the main text under the identification of Eq. (172), and therefore exhibits a tradeoff that is predominantly controlled by the degeneracies R(k) of the data-encoding Hamiltonian.
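The identification of Eq. (172) can also be verified numerically without invoking the closed form of Eq. (163). The sketch below (our own illustration; the spectrum, which is an 8-mark Golomb ruler, the random amplitudes, the labels, and the sign convention of the time evolution are all assumptions of this example) computes the minimum-Frobenius-norm interpolating observable by solving the Gram system over the span of the encoded states, checks that it is Hermitian and interpolates, and confirms that within a single alias cohort the Fourier coefficients of f^{opt} are proportional to the weights \nu_k^{opt} of Eq. (172): the printed ratios should all coincide.

```python
rng_q = np.random.default_rng(7)
lams = np.array([0, 1, 4, 9, 15, 22, 32, 34])      # spectrum of H: an 8-mark Golomb ruler
dq = len(lams)
gamma = rng_q.normal(size=dq) + 1j * rng_q.normal(size=dq)
gamma /= np.linalg.norm(gamma)                     # amplitudes of |Gamma>

n_data = 11
x_q = np.arange(n_data) / n_data
y_q = rng_q.normal(size=n_data)                    # arbitrary real labels to interpolate

def rho(x):
    u = np.exp(-2j * np.pi * lams * x) * gamma     # e^{-i 2 pi H x} |Gamma>
    return np.outer(u, u.conj())

# f(x) = Tr(M rho(x)) is linear in M, so the minimum-Frobenius-norm interpolant lies in
# span{rho(x_j)}; solve the (real, symmetric) Gram system for its coefficients.
G = np.array([[np.trace(rho(a) @ rho(b)).real for b in x_q] for a in x_q])
coef = np.linalg.solve(G, y_q)
M = sum(cj * rho(xj) for cj, xj in zip(coef, x_q))
print("Hermitian:   ", np.allclose(M, M.conj().T))
print("interpolates:", np.allclose([np.trace(M @ rho(xq)).real for xq in x_q], y_q))

# Fourier coefficient of f at frequency q versus nu_q^{opt} of Eq. (172): within one
# alias cohort (q = 1 mod n_data below) the ratio should be a single constant.
probs = np.abs(gamma) ** 2
diffs = lams[:, None] - lams[None, :]
A = M * np.outer(gamma.conj(), gamma)
for q in [q for q in np.unique(diffs) if q % n_data == 1]:
    ratio = A[diffs == q].sum() / np.outer(probs, probs)[diffs == q].sum()
    print(int(q), np.round(ratio, 8))
```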

We can now substitute γ=1/d\gamma_{\ell}=1/\sqrt{d} to recover the optimal observable for the simplified model derived in Sec 3.1 of the main text using other means, namely

Mmopt\displaystyle M_{m\ell}^{opt} =y^kdr=akbk|R(k+nr)|,\displaystyle=\hat{y}_{k}\frac{d}{\sum_{r=a_{k}}^{b_{k}}|R(k+nr)|}, (173)
f(x)\displaystyle f(x) =kΩny^kq=akbkei2π(k+nq)x|R(k+nq)|r=akbk|R(k+nr)|,\displaystyle=\sum_{k\in\Omega_{n}}\hat{y}_{k}\sum_{q=a_{k}}^{b_{k}}e^{i2\pi(k+nq)x}\frac{|R(k+nq)|}{\sum_{r=a_{k}}^{b_{k}}|R(k+nr)|}, (174)
νk\displaystyle\nu_{k} =|R(k)|d2.\displaystyle=\frac{|R(k)|}{d^{2}}. (175)
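For concreteness, the feature weights of Eq. (175) can be tabulated directly from a Hamiltonian spectrum by counting pairwise eigenvalue differences. In the sketch below (illustrative only; the three spectra are our own assumptions chosen to match the names used in the main text: consecutive integers for the Binary strategy, bitstring Hamming weights for the Hamming strategy, and an 8-mark Golomb ruler for the Golomb strategy), a Golomb-ruler spectrum yields |R(k)| = 1 for every nonzero k, i.e. uniform weights away from k = 0.

```python
def feature_weights_from_spectrum(lams):
    """nu_k = |R(k)| / d^2 (Eq. 175), with R(k) the pairs (i, j) with lam_i - lam_j = k."""
    lams = np.asarray(lams)
    dloc = len(lams)
    diffs = lams[:, None] - lams[None, :]
    freqs, counts = np.unique(diffs, return_counts=True)
    return {int(k): cnt / dloc ** 2 for k, cnt in zip(freqs, counts)}

nq = 3
binary = np.arange(2 ** nq)                                    # 0, 1, ..., 2^nq - 1
hamming = np.array([bin(i).count("1") for i in range(2 ** nq)])
golomb = np.array([0, 1, 4, 9, 15, 22, 32, 34])                # 8 marks, all differences distinct

for name, spec in [("Binary", binary), ("Hamming", hamming), ("Golomb", golomb)]:
    nu_spec = feature_weights_from_spectrum(spec)
    print(name, {k: round(v, 4) for k, v in sorted(nu_spec.items()) if k >= 0})
```

The flat tail produced by the Golomb-ruler spectrum mirrors the flat feature weights used in the classical example of Sec. A.3.1.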

B.1 Computing feature weights of typical quantum models

In Sec. 3.1 we introduced a simple model with a uniform-amplitude state |\Gamma\rangle as input and demonstrated that the feature weights of this simple quantum model are completely determined by the sets R(k) induced by the encoding strategy. We now wish to extend the intuition that the behavior of the optimized general quantum model is strongly influenced by the distribution of the degeneracies |R(k)|. We do so by evaluating the optimal quantum models with respect to an “average” state preparation unitary U. We can compute the average value of |\gamma_{i}|^{2}|\gamma_{j}|^{2} for U sampled uniformly from the Haar distribution using standard results from the literature [61] (note that since \nu_{k}^{opt} is invariant with respect to \gamma_{i}\rightarrow e^{i\phi}\gamma_{i}, a spherical measure would suffice here):

𝔼UU(d)|γi|2|γj|2=𝔼UU(d)Ui0Ui0Uj0Uj0=δij+1d(d+1).\displaystyle\underset{U\sim\operatorname{U}\left(d\right)}{\mathbb{E}}|\gamma_{i}|^{2}|\gamma_{j}|^{2}=\underset{U\sim\operatorname{U}\left(d\right)}{\mathbb{E}}U_{i0}U_{i0}^{*}U_{j0}U_{j0}^{*}=\frac{\delta_{i}^{j}+1}{d(d+1)}. (176)

Since (i,i)\in R(0), we can then compute the feature weights of the optimal model (it is implied that we compute the optimal \nu with respect to each U sampled independently and uniformly from the Haar measure; without optimizing M with respect to each U we would find the trivial result \underset{U\sim\operatorname{U}\left(d\right)}{\mathbb{E}}f(x)=\underset{U\sim\operatorname{U}\left(d\right)}{\mathbb{E}}\operatorname{Tr}\left(M\rho(x)\right)=d^{-1}\operatorname{Tr}\left(M\right)) according to

𝔼UU(d)νkopt=𝔼UU(d)i,jR(k)|γi|2|γj|2=|R(k)|d(d+1)+δ0k1d+1.\displaystyle\underset{U\sim\operatorname{U}\left(d\right)}{\mathbb{E}}\nu_{k}^{opt}=\underset{U\sim\operatorname{U}\left(d\right)}{\mathbb{E}}\sum_{i,j\in R(k)}|\gamma_{i}|^{2}|\gamma_{j}|^{2}=\frac{|R(k)|}{d(d+1)}+\delta_{0}^{k}\frac{1}{d+1}. (177)

From Eq. (177) we see that the feature weights of a quantum model optimized with respect to random UU are completely determined by the degeneracies R(k)R(k). This expected value is useful but does not fully characterize the behavior of an encoding strategy. To demonstrate that this average behavior is meaningful, we would further like to verify that the feature weights corresponding to a random UU concentrate around the mean of Eq. (177). We characterize this by computing the variance

Var(νk)=𝔼UU(d)(νk2)(𝔼UU(d)νk)2,\text{Var}(\nu_{k})=\underset{U\sim\operatorname{U}\left(d\right)}{\mathbb{E}}\left(\nu_{k}^{2}\right)-\left(\underset{U\sim\operatorname{U}\left(d\right)}{\mathbb{E}}\nu_{k}\right)^{2}, (178)

where we have dropped the superscript on \nu_{k}^{opt} for brevity. This computation requires significantly more involved counting arguments that depend on the structure of R(k). When k\neq 0 we have i\neq j for every (i,j)\in R(k), and we split the sum according to which indices coincide:

νk2\displaystyle\nu_{k}^{2} =i,jR(k),mR(k)|γi|2|γj|2|γ|2|γm|2\displaystyle=\sum_{i,j\in R(k)}\sum_{\ell,m\in R(k)}|\gamma_{i}|^{2}|\gamma_{j}|^{2}|\gamma_{\ell}|^{2}|\gamma_{m}|^{2} (179)
=i,jR(k)(|γi|4|γj|4+,mR(k)j,m=i|γi|4|γj|2|γ|2\displaystyle=\sum_{i,j\in R(k)}\left(|\gamma_{i}|^{4}|\gamma_{j}|^{4}+\sum_{\begin{subarray}{c}\ell,m\in R(k)\\ \ell\neq j,m=i\end{subarray}}|\gamma_{i}|^{4}|\gamma_{j}|^{2}|\gamma_{\ell}|^{2}\right.
+m,R(k)=j,mi|γi|2|γj|4|γm|2+m,R(k)j,mi|γi|2|γj|2|γ|2|γm|2).\displaystyle\qquad\qquad\qquad\qquad\left.+\sum_{\begin{subarray}{c}m,\ell\in R(k)\\ \ell=j,m\neq i\end{subarray}}|\gamma_{i}|^{2}|\gamma_{j}|^{4}|\gamma_{m}|^{2}+\sum_{\begin{subarray}{c}m,\ell\in R(k)\\ \ell\neq j,m\neq i\end{subarray}}|\gamma_{i}|^{2}|\gamma_{j}|^{2}|\gamma_{\ell}|^{2}|\gamma_{m}|^{2}\right). (180)

The expected values of these terms are evaluated using the observation that the vector (|\gamma_{1}|^{2},\dots,|\gamma_{d}|^{2}) is distributed uniformly on the probability simplex in \mathbb{R}^{d}, leading to simple expressions for the following expected values [61]:

𝔼UU(d)|γp|4|γq|4\displaystyle\underset{U\sim\operatorname{U}\left(d\right)}{\mathbb{E}}|\gamma_{p}|^{4}|\gamma_{q}|^{4} =4/D,\displaystyle=4/D, (181)
𝔼UU(d)|γp|4|γq|2|γr|2\displaystyle\underset{U\sim\operatorname{U}\left(d\right)}{\mathbb{E}}|\gamma_{p}|^{4}|\gamma_{q}|^{2}|\gamma_{r}|^{2} =2/D,\displaystyle=2/D, (182)
𝔼UU(d)|γp|2|γq|2|γr|2|γs|2\displaystyle\underset{U\sim\operatorname{U}\left(d\right)}{\mathbb{E}}|\gamma_{p}|^{2}|\gamma_{q}|^{2}|\gamma_{r}|^{2}|\gamma_{s}|^{2} =1/D,\displaystyle=1/D, (183)

where p,q,r,s = 1,…,d are pairwise distinct and D := (d+3)(d+2)(d+1)d. In computing the expected value of Eq. (180), linearity implies that every term in a given sum contributes the same constant, so only the number of terms in each sum is relevant. The first sum of Eq. (180) contains |R(k)| terms and the total number of terms is |R(k)|^2, so we only need to compute the number of elements in the two middle sums of Eq. (180) (which contain an equal number of terms due to the symmetry R(k) = R(-k)^T). These computations may be carried out by brute-force combinatorics and are summarized in Table 2 for a few of the models studied in this work.

Encoding strategy | |{(i,j,ℓ,m) : m ≠ i, ℓ = j, (i,j),(ℓ,m) ∈ R(k)}| | Var(ν_k^{opt})
Hamming | \sum_{p=0}^{n_q-2k} \binom{n_q}{p}\binom{n_q}{p+|k|}\binom{n_q}{p+2|k|} | –
Binary | \max(2^{n_q}-2|k|, 0) | O(2^{-3n_q})
Golomb | 0 | O(d^{-4})
Table 2: The size of the subset of R(k)×R(k) with a single repeated index, computed for different encoding strategies, and the corresponding scaling of the variance of the feature weights (all entries assume k ≠ 0). The Hamming encoding strategy computation works as follows: each bitstring i with weight p (of which there are n_q-choose-p) is paired with n_q-choose-(p+k) many bitstrings j with weight p+k. Then, taking ℓ = j, there are n_q-choose-(p+2k) many bitstrings m with weight (p+k)+k = p+2k. Summing over all such p with p+2k ≤ n_q yields the desired result; this computation does not admit a clear closed-form expression, so we have omitted the corresponding scaling of Var(ν_k^{opt}) for the Hamming encoding strategy.

For the Binary encoding strategy we compute

𝔼(νk2)={5d7k+(dk)2D,0<2kd3(dk)+(dk)2D,2k>d,\displaystyle\mathbb{E}(\nu_{k}^{2})=\begin{cases}\frac{5d-7k+(d-k)^{2}}{D},&0<2k\leq d\\ \frac{3(d-k)+(d-k)^{2}}{D},&2k>d,\end{cases} (184)

and so

Var(νk)={(5d7k)d(d+1)(dk)2(4d+6)Dd(d+1) if 0<2kd3(dk)d(d+1)(dk)2(4d+6)Dd(d+1) if 2k>d}𝒪(d3).\displaystyle\text{Var}(\nu_{k})=\left\{\!\begin{aligned} &\frac{(5d-7k)d(d+1)-(d-k)^{2}(4d+6)}{Dd(d+1)}\text{ if }0<2k\leq d\\ &\frac{3(d-k)d(d+1)-(d-k)^{2}(4d+6)}{Dd(d+1)}\text{ if }2k>d\end{aligned}\right\}\leq\mathcal{O}(d^{-3}). (185)

Importantly, 𝔼[ν_k] = Ω(d^{-2}) for the Binary encoding strategy. Taking d = 2^{n_q} for n_q qubits, the mean therefore decays exponentially in the number of qubits n_q, while the variance decays exponentially faster. For the Golomb encoding strategy the calculation is comparatively straightforward, yielding for k ≠ 0

Var(νk)=3d2d6Dd(d+1)=𝒪(d4),\text{Var}(\nu_{k})=\frac{3d^{2}-d-6}{Dd(d+1)}=\mathcal{O}(d^{-4}), (186)

with the variance again decaying significantly faster in d than the mean. Figure 6 shows the average and variance of ν_k^{opt} for the general quantum model with U sampled uniformly with respect to the Haar measure. We find tight concentration of ν_k^{opt} around its average in each of these cases.

Figure 6: Plotting 𝔼[νkopt]\mathbb{E}\left[\nu_{k}^{opt}\right] and Var[νkopt]1/2\text{Var}\left[\nu_{k}^{opt}\right]^{1/2} demonstrates tight concentration of the distribution of νkopt\nu_{k}^{opt} around its mean with respect to Haar-random UU in the general quantum model. For the Binary and Golomb encoding strategies we plot Eqs. (185) and (186) respectively. For all models we also include averages and variances computed by sampling states uniformly from the Haar measure numerically.
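As a sanity check on Eq. (177) and the concentration displayed in Figure 6, the following minimal numpy sketch (our own illustration, not part of the paper's numerics; the sample size and the chosen frequencies are arbitrary) samples Haar-random input states for the Binary encoding strategy on n_q = 4 qubits and compares the empirical mean of ν_k^{opt} with the prediction |R(k)|/(d(d+1)) for k ≠ 0.

```python
import numpy as np
from collections import defaultdict

def haar_state(d, rng):
    # First column of a Haar-random unitary: a normalized complex Gaussian vector.
    v = rng.normal(size=d) + 1j * rng.normal(size=d)
    return v / np.linalg.norm(v)

def feature_weights(gamma, eigvals):
    # nu_k = sum over (i, j) with eigvals[j] - eigvals[i] = k of |gamma_i|^2 |gamma_j|^2.
    p = np.abs(gamma) ** 2
    nu = defaultdict(float)
    for i, li in enumerate(eigvals):
        for j, lj in enumerate(eigvals):
            nu[lj - li] += p[i] * p[j]
    return nu

n_q = 4
d = 2 ** n_q
eigvals = np.arange(d)              # Binary encoding: unit-spaced eigenvalues, so k = j - i
rng = np.random.default_rng(0)

samples = [feature_weights(haar_state(d, rng), eigvals) for _ in range(2000)]
for k in [1, d // 2, d - 1]:
    vals = np.array([s[k] for s in samples])
    pred_mean = (d - abs(k)) / (d * (d + 1))     # Eq. (177) with |R(k)| = d - |k|
    print(k, vals.mean(), pred_mean, vals.std())  # empirical mean matches Eq. (177); std sets the concentration scale
```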

Why is it that the feature weights ν_k^{opt} of the quantum model given by Eq. (172) only appear after deriving the optimal observable? One explanation is that while the components of the classical Fourier features vector ϕ(x) are mutually orthonormal on [0,1], the corresponding components of the quantum feature map are not. Consider the operator

Σ:=𝒳Vec(ρ(x))Vec(ρ(x))p(x)dx=𝒳ρ(x)ρT(x)p(x)dx,\Sigma:=\int_{\mathcal{X}}\operatorname{Vec}\left(\rho(x)\right)\operatorname{Vec}\left(\rho(x)\right)^{\dagger}p(x)dx=\int_{\mathcal{X}}\rho(x)\otimes\rho^{T}(x)p(x)dx, (187)

where p:𝒳p:\mathcal{X}\rightarrow\mathbb{R} is the probability density function describing the distribution of data in 𝒳\mathcal{X} and Vec:d×dd2\text{Vec}:\mathbb{C}^{d\times d}\rightarrow\mathbb{C}^{d^{2}} is the vectorization map that acts by stacking transposed rows of a square matrix into a column vector. Σ\Sigma is analogous to a classical covariance matrix, and here determines the correlation between components of ρ(x)\rho(x) (with the second equality holding when ρ(x)\rho(x) is a pure state for all xx). From Eq. (8) describing the classical Fourier features it is straightforward to compute 𝔼x[0,1][ϕϕ]=𝕀d\mathbb{E}_{x\sim[0,1]}[\phi\phi^{\dagger}]=\mathbb{I}_{d}, demonstrating that the classical Fourier features are indeed orthonormal. However, under the identification of the feature vector ϕ(x)Vec(ρ(x))\phi(x)\rightarrow\operatorname{Vec}\left(\rho(x)\right) in the quantum case, the same is not true for the quantum model:

\displaystyle\Sigma=\int_{0}^{1}\left(\sum_{k\in\Omega}e^{i2\pi kx}\sum_{i,j\in R(k)}\gamma_{i}\gamma_{j}^{*}|ij\rangle\right)\left(\sum_{k^{\prime}\in\Omega}e^{-i2\pi k^{\prime}x}\sum_{\ell,m\in R(k^{\prime})}\gamma_{\ell}^{*}\gamma_{m}\langle\ell m|\right)dx   (188)
=ki,jR(k),mR(k)γiγjγγm|ijm|.\displaystyle=\sum_{k}\sum_{\begin{subarray}{c}i,j\in R(k)\\ \ell,m\in R(k)\end{subarray}}\gamma_{i}\gamma_{j}^{*}\gamma_{\ell}^{*}\gamma_{m}|ij\rangle\langle\ell m|. (189)

Thus Σ is not diagonal in general, as many components of S(x)|Γ⟩ each contribute to a single frequency k. Nor is it possible in general to construct, using unitary operations alone, a quantum feature map that acts as an orthonormal Fourier features vector. As Σ is positive semidefinite by construction, there exists a spectral decomposition Σ = WDW^† with diagonal D and d^2 × d^2 unitary W. However, the linear map Φ acting on d-dimensional states according to Vec(Φ(ρ)) = W^† Vec(ρ(x)) will not be unitary in general, and thus it may not be possible to prepare ρ in such a way that the elements of Vec(ρ(x)) form orthonormal Fourier features.
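The lack of orthogonality can also be seen numerically. The sketch below (an illustration under our own assumptions: numpy, two qubits, the Hamming encoding, and a Haar-random |Γ⟩) approximates Σ of Eq. (187) by averaging Vec(ρ(x))Vec(ρ(x))^† over a uniform grid of x values, which is exact here because all frequencies are integers, and reports the largest off-diagonal entry.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2
d = 2 ** n
lam = np.array([bin(j).count("1") for j in range(d)])   # Hamming encoding: lambda_j = w(j)

# Haar-random input state |Gamma>
g = rng.normal(size=d) + 1j * rng.normal(size=d)
g /= np.linalg.norm(g)

# Average Vec(rho(x)) Vec(rho(x))^dagger over a grid; M > max|k - k'| integrates
# integer frequency differences exactly.
M = 64
Sigma = np.zeros((d * d, d * d), dtype=complex)
for x in np.arange(M) / M:
    psi = np.exp(-2j * np.pi * lam * x) * g      # S(x)|Gamma>
    rho = np.outer(psi, psi.conj())
    v = rho.reshape(-1, 1)                       # row-stacking vectorization
    Sigma += (v @ v.conj().T) / M

off_diag = Sigma - np.diag(np.diag(Sigma))
print("largest off-diagonal magnitude of Sigma:", np.abs(off_diag).max())   # nonzero
```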

B.2 The effect of state preparation on feature weights in the general quantum model

We now discuss how the state preparation unitary UU may affect the feature weights νkopt\nu_{k}^{opt} in a quantum model, and what choices one can make to construct a UU that gives rise to a specific distribution of feature weights νkopt\nu_{k}^{opt}.

As an example, we consider the general quantum model using the Hamming encoding strategy and an input state |Γ⟩. For this strategy the feature weights ν_k^{opt} depend on the amplitudes γ_i and γ_j whose indices i, j have Hamming weights differing by k. We have

νkopt=i,jR(k)|γi|2|γj|2=i,j:w(j)w(i)=k|γi|2|γj|2.\displaystyle\nu_{k}^{opt}=\sum_{i,j\in R(k)}|\gamma_{i}|^{2}|\gamma_{j}|^{2}=\sum_{\begin{subarray}{c}i,j:\\ w(j)-w(i)=k\end{subarray}}|\gamma_{i}|^{2}|\gamma_{j}|^{2}. (190)

We will now show that the feature weight ν_k^{opt} is unchanged when computed with respect to a rebalanced input state, which distributes each amplitude γ_i uniformly among all computational basis states whose index has weight w(i). We define this rebalanced state on n qubits as

|Γ\displaystyle|\Gamma^{\prime}\rangle =i{0,1}nγi|i=i=0njW(i)ϕi|j,\displaystyle=\sum_{i\in\{0,1\}^{n}}\gamma_{i}^{\prime}|i\rangle=\sum_{i=0}^{n}\sum_{j\in W(i)}\phi_{i}|j\rangle, (191)
ϕi\displaystyle\phi_{i} =1|W(i)|(jW(i)|γj|2)1/2.\displaystyle=\frac{1}{\sqrt{|W(i)|}}\left(\sum_{j\in W(i)}|\gamma_{j}|^{2}\right)^{1/2}. (192)

where W(i)={j:j{0,1}n,w(j)=i}W(i)=\{j:j\in\{0,1\}^{n},w(j)=i\} is the set of weight-ii indices and |W(i)|=(ni)|W(i)|=\binom{n}{i}. Observing that the amplitudes of |Γ|\Gamma^{\prime}\rangle satisfy |γi|2=ϕw(i)2|\gamma_{i}^{\prime}|^{2}=\phi_{w(i)}^{2}, we can derive for k0k\geq 0

νkopt\displaystyle\nu_{k}^{opt} =i,j:w(j)w(i)=k|γi|2|γj|2\displaystyle=\sum_{\begin{subarray}{c}i,j:\\ w(j)-w(i)=k\end{subarray}}|\gamma_{i}|^{2}|\gamma_{j}|^{2} (193)
==0nk(i:w(i)=|γi|2)(j:w(j)=+k|γj|2)\displaystyle=\sum_{\ell=0}^{n-k}\left(\sum_{i:w(i)=\ell}|\gamma_{i}|^{2}\right)\left(\sum_{j:w(j)=\ell+k}|\gamma_{j}|^{2}\right) (194)
==0nk(|W()|i:w(i)=|γi|2|W()|)(|W(+k)|j:w(j)=+k|γj|2|W(+k)|)\displaystyle=\sum_{\ell=0}^{n-k}\left(|W(\ell)|\frac{\sum_{i:w(i)=\ell}|\gamma_{i}|^{2}}{|W(\ell)|}\right)\left(|W(\ell+k)|\frac{\sum_{j:w(j)=\ell+k}|\gamma_{j}|^{2}}{|W(\ell+k)|}\right) (195)
==0nk|W()|ϕ2|W(+k)|ϕ+k2\displaystyle=\sum_{\ell=0}^{n-k}|W(\ell)|\phi_{\ell}^{2}|W(\ell+k)|\phi_{\ell+k}^{2} (196)
==0nk(i:w(i)=ϕw(i)2)(j:w(j)=+kϕw(j)2)\displaystyle=\sum_{\ell=0}^{n-k}\left(\sum_{i:w(i)=\ell}\phi_{w(i)}^{2}\right)\left(\sum_{j:w(j)=\ell+k}\phi_{w(j)}^{2}\right) (197)
=i,j:w(j)w(i)=k|γi|2|γj|2.\displaystyle=\sum_{\begin{subarray}{c}i,j:\\ w(j)-w(i)=k\end{subarray}}|\gamma_{i}^{\prime}|^{2}|\gamma_{j}^{\prime}|^{2}. (198)

Therefore, the feature weights computed from |Γ=U|0|\Gamma\rangle=U|0\rangle and the rebalanced state |Γ|\Gamma^{\prime}\rangle are identical. We can emphasize the significance of this observation by rewriting |Γ|\Gamma^{\prime}\rangle of Eq. (191) as

|Γ\displaystyle|\Gamma^{\prime}\rangle =i=0nϕi|Φi,\displaystyle=\sum_{i=0}^{n}\phi_{i}|\Phi_{i}\rangle, (199)

where |Φ_i⟩ = Σ_{j∈W(i)} |j⟩ is an (unnormalized) uniform superposition of bitstrings with weight i. The state |Γ'⟩ of Eq. (199) has only n+1 real parameters ϕ_i, and the feature weights ν_k^{opt} are unchanged by any operation on the input state that is restricted to act within a fixed-weight subspace span{|j⟩ : j ∈ W(i)}. This invariance greatly reduces the class of unitaries U that affect the distribution of feature weights ν_k^{opt} and enables some degree of tuning for these parameters.
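A quick numerical check of Eqs. (193)–(198), assuming numpy and a Haar-random |Γ⟩ on four qubits (our own toy parameters): redistributing the probability of each fixed-weight sector uniformly over that sector leaves every feature weight of the Hamming encoding unchanged.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
d = 2 ** n
w = np.array([bin(j).count("1") for j in range(d)])   # Hamming weights of basis indices

g = rng.normal(size=d) + 1j * rng.normal(size=d)
g /= np.linalg.norm(g)
p = np.abs(g) ** 2

# Rebalanced probabilities: spread the total probability of each weight sector
# uniformly over that sector (Eqs. (191)-(192)).
p_rebal = np.array([p[w == w[j]].sum() / (w == w[j]).sum() for j in range(d)])

def nu(probs, k):
    # nu_k for the Hamming encoding: sum over pairs with w(j) - w(i) = k.
    return sum(probs[i] * probs[j] for i in range(d) for j in range(d) if w[j] - w[i] == k)

for k in range(n + 1):
    print(k, nu(p, k), nu(p_rebal, k))   # the two columns agree
```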

B.2.1 Vanishing gradients in preparing feature weights

We briefly remark on the possibility of training the feature weights ν_k^{opt} of the general quantum model variationally. For example, one could define a desired distribution of feature weights ϕ_k ∈ ℝ^{|Ω|} and then attempt to tune the parameters of U in order to minimize a cost function such as

L=kΩ|νkoptϕk|2.L=\sum_{k\in\Omega}|\nu_{k}^{opt}-\phi_{k}|^{2}. (200)

Since ν_k represents a probability distribution, other distance measures such as the relative entropy or the earth mover's distance may be more appropriate; using an alternative cost function would not, however, affect the main arguments here. We will now demonstrate that for a sufficiently expressive class of state preparation unitaries U, such an optimization problem is difficult on average. For simplicity, we follow the original formulation of barren plateaus presented in Ref. [23]: let U be defined with respect to a set of parameters 𝜽 ∈ ℝ^L as

U==1LU(θ)WU=\prod_{\ell=1}^{L}U(\theta_{\ell})W_{\ell} (201)

with U(θ_ℓ) = exp(iθ_ℓ V_ℓ), where V_ℓ is a d-dimensional Hermitian operator and W_ℓ is a d-dimensional unitary operator. Pick a parameter θ_m and define U_+ = ∏_{ℓ=m+1}^{L} U(θ_ℓ)W_ℓ and U_- = ∏_{ℓ=1}^{m} U(θ_ℓ)W_ℓ, so that U = U_+U_-, and observe that

θm|γj|2=i0|U[Vm,U+|jj|U+]U|0.\partial_{\theta_{m}}|\gamma_{j}|^{2}=i\langle 0|U_{-}^{\dagger}[V_{m},U_{+}^{\dagger}|j\rangle\langle j|U_{+}]U_{-}|0\rangle. (202)

We can compute the derivative of νk\nu_{k} with respect to θm\theta_{m} using the chain rule:

νkθm\displaystyle\frac{\partial\nu_{k}}{\partial\theta_{m}} =j=1dνk|γj|2|γj|2θm\displaystyle=\sum_{j=1}^{d}\frac{\partial\nu_{k}}{\partial|\gamma_{j}|^{2}}\frac{\partial|\gamma_{j}|^{2}}{\partial\theta_{m}} (203)
=j=1d(a,bR(k)(|γa|2δbj+δaj|γb|2))i0|U[Vm,U+|jj|U+]U|0\displaystyle=\sum_{j=1}^{d}\left(\sum_{a,b\in R(k)}\left(|\gamma_{a}|^{2}\delta_{b}^{j}+\delta_{a}^{j}|\gamma_{b}|^{2}\right)\right)i\langle 0|U_{-}^{\dagger}[V_{m},U_{+}^{\dagger}|j\rangle\langle j|U_{+}]U_{-}|0\rangle (204)
=ia,bR(k)[Tr(ρHa)Tr(ρ[Vm,Hb])+Tr(ρ[Vm,Ha])Tr(ρHb)]\displaystyle=i\sum_{a,b\in R(k)}\left[\operatorname{Tr}\left(\rho_{-}H_{a}\right)\operatorname{Tr}\left(\rho_{-}[V_{m},H_{b}]\right)+\operatorname{Tr}\left(\rho_{-}[V_{m},H_{a}]\right)\operatorname{Tr}\left(\rho_{-}H_{b}\right)\right] (205)

where ρ=U|00|U\rho_{-}=U_{-}|0\rangle\langle 0|U_{-}^{\dagger} and Hx=U+|xx|U+H_{x}=U_{+}^{\dagger}|x\rangle\langle x|U_{+}, and we used the equality

|γa|2=0|UU+|aa|U+U|0=Tr(ρHa).\displaystyle|\gamma_{a}|^{2}=\langle 0|U_{-}^{\dagger}U_{+}^{\dagger}|a\rangle\langle a|U_{+}U_{-}|0\rangle=\operatorname{Tr}\left(\rho_{-}H_{a}\right). (206)

We now show that each term in this sum vanishes by following Ref. [23] in letting UU_{-} be sampled from a distribution that forms a unitary 2-design:

𝔼UU(d)Tr(ρHa)\displaystyle\mathbb{E}_{U_{-}\sim U(d)}\operatorname{Tr}\left(\rho_{-}H_{a}\right) Tr(ρ[Vm,Hb])\displaystyle\operatorname{Tr}\left(\rho_{-}[V_{m},H_{b}]\right) (207)
=Tr(𝔼UU(d)[ρρ]Ha[Vm,Hb])\displaystyle=\operatorname{Tr}\left(\mathbb{E}_{U_{-}\sim U(d)}\left[\rho_{-}\otimes\rho_{-}\right]H_{a}\otimes[V_{m},H_{b}]\right) (208)
=1d(d+1)Tr((𝕀d+i,j=1d|ijji|)Ha[Vm,Hb])\displaystyle=\frac{1}{d(d+1)}\operatorname{Tr}\left(\left(\mathbb{I}_{d}+\sum_{i,j=1}^{d}|ij\rangle\langle ji|\right)H_{a}\otimes[V_{m},H_{b}]\right) (209)
=Tr([Vm,Hb]Ha)d(d+1)\displaystyle=\frac{\operatorname{Tr}\left([V_{m},H_{b}]H_{a}\right)}{d(d+1)} (210)
\displaystyle=\frac{\operatorname{Tr}\left(V_{m}U_{+}^{\dagger}|b\rangle\langle b|U_{+}U_{+}^{\dagger}|a\rangle\langle a|U_{+}-U_{+}^{\dagger}|b\rangle\langle b|U_{+}V_{m}U_{+}^{\dagger}|a\rangle\langle a|U_{+}\right)}{d(d+1)}   (211)
=δaba|U+VmU+|bδabb|U+VmU+|ad(d+1)\displaystyle=\frac{\delta_{a}^{b}\langle a|U_{+}V_{m}U_{+}^{\dagger}|b\rangle-\delta_{a}^{b}\langle b|U_{+}V_{m}U_{+}^{\dagger}|a\rangle}{d(d+1)} (212)
=0,\displaystyle=0, (213)

where in line (209) we have used a common expression for 𝔼UU(d)[ρρ]\mathbb{E}_{U\sim U(d)}[\rho\otimes\rho] in terms of the projector onto the symmetric subspace (e.g. [62]). Substituting this result into Eq. (205) we find

𝔼UU(d)(νkθm)=0\mathbb{E}_{U\sim U(d)}\left(\frac{\partial\nu_{k}}{\partial\theta_{m}}\right)=0 (214)

By extension, the gradient of a loss function of the form of Eq. (200) vanishes on average for sufficiently expressive state-preparation unitaries U, suggesting that variationally solving for a choice of U that induces a specific distribution of feature weights ν_k^{opt} will be infeasible in practice.
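The vanishing Haar average in Eq. (213) is straightforward to check numerically. The sketch below (our own illustration; the dimension, the random Hermitian V_m, the fixed U_+, and the basis indices a, b are arbitrary choices) samples Haar-random U_- and estimates 𝔼[Tr(ρ_-H_a)Tr(ρ_-[V_m,H_b])], which concentrates near zero.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 8

def haar_unitary(d, rng):
    # QR decomposition of a complex Ginibre matrix, with a phase fix, is Haar-distributed.
    z = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    q, r = np.linalg.qr(z)
    return q * (np.diag(r) / np.abs(np.diag(r)))

# Fixed U_+, a random Hermitian generator V_m, and the operators H_a, H_b of Eq. (213).
U_plus = haar_unitary(d, rng)
V = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
V = V + V.conj().T
a, b = 1, 3
H_a = U_plus.conj().T @ np.outer(np.eye(d)[a], np.eye(d)[a]) @ U_plus
H_b = U_plus.conj().T @ np.outer(np.eye(d)[b], np.eye(d)[b]) @ U_plus
comm = V @ H_b - H_b @ V                                 # [V_m, H_b]

vals = []
for _ in range(5000):
    U_minus = haar_unitary(d, rng)
    rho = np.outer(U_minus[:, 0], U_minus[:, 0].conj())  # U_-|0><0|U_-^dagger
    vals.append(np.trace(rho @ H_a) * np.trace(rho @ comm))

print("Haar average of the term in Eq. (213):", np.mean(vals))   # close to 0
```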

Appendix C Determining the degeneracy of quantum models

Here we will develop a theoretical framework for manipulating the degeneracy (and therefore the feature weights) of quantum models. We begin by choosing the data-encoding operator S(x) = exp(-iHx) to be separable, generated by a Hamiltonian of the form

H=𝐫𝐙=j=0n1rjZ(j),H=\mathbf{r}\cdot\mathbf{Z}=\sum_{j=0}^{n-1}r_{j}Z^{(j)}, (215)

where Z^{(j)} denotes the operator that applies a Pauli Z to the j-th register and acts trivially elsewhere, and 𝐫 ∈ ℝ^n. One can show that the diagonal elements of H are given by

λj=Hjj=2(𝐫𝐣)𝐫1,\lambda_{j}=H_{jj}=2(\mathbf{r}\cdot\mathbf{j})-\lVert\mathbf{r}\rVert_{1}, (216)

where here and elsewhere we use interchangeably a bitstring j = j_0 j_1 ⋯ j_{n-1}, its decimal value j = Σ_k j_k 2^{n-1-k} ∈ [0, 2^n - 1], and its vector representation 𝐣 ∈ {0,1}^n. The Fourier spectrum and degeneracy of the encoding strategy are then computed as

Ω\displaystyle\Omega ={λjλi:i,j=0,,2n1}\displaystyle=\{\lambda_{j}-\lambda_{i}:i,j=0,\dots,2^{n}-1\} (217)
={2(𝐫(𝐣𝐢)):𝐢,𝐣{0,1}n},\displaystyle=\{2(\mathbf{r}\cdot(\mathbf{j}-\mathbf{i})):\mathbf{i},\mathbf{j}\in\{0,1\}^{n}\}, (218)
\displaystyle R(\omega)=\{(i,j):2\mathbf{r}\cdot(\mathbf{j}-\mathbf{i})=\omega\}.   (219)

Note that the subtraction in the definition of Ω is not performed modulo 2, i.e. (𝐣−𝐢) ∈ {−1,0,1}^n. From Eq. (217) we see that the largest possible size of |Ω| is 3^n, the number of distinct values of (𝐣−𝐢). Any encoding strategy of the form of Eq. (215) therefore introduces a combinatorial degeneracy, since the map (𝐢,𝐣) ↦ (𝐣−𝐢) on {0,1}^n × {0,1}^n is not injective, with each value (𝐣−𝐢) occurring with multiplicity

#(𝐣𝐢)=2n(𝐣𝐢)1.\#(\mathbf{j}-\mathbf{i})=2^{n-\lVert(\mathbf{j}-\mathbf{i})\rVert_{1}}. (220)

To prove this, construct the set {(𝐣−𝐢) : 𝐢,𝐣 ∈ {0,1}^n} by reflecting the hypercube {0,1}^n over 2^n distinct axes and counting the number of vertices shared among the reflected images – this gives the desired degeneracy. Many choices of 𝐫 will reduce |Ω| by shrinking the image of the map (𝐢,𝐣) → 𝐫⋅(𝐣−𝐢). Conversely, the spectrum Ω saturates |Ω| = 3^n only under particular conditions on 𝐫. For example, with n = 2, |Ω| = 9 is achieved only if 𝐫 satisfies all of the following conditions:

r00,r0±12r1,r0±r1,r0±2r1,\displaystyle r_{0}\neq 0,\qquad r_{0}\neq\pm\frac{1}{2}r_{1},\qquad r_{0}\neq\pm r_{1},\qquad r_{0}\neq\pm 2r_{1},\qquad (221)

i.e., the dot product 𝐫⋅𝐝 can be inverted with respect to the input 𝐝 if the elements of 𝐫 satisfy r_j ≠ r_i mod 2^p for some integer p; a more general, recursive statement of this property is provided in Ref. [44]. Requirements of this form may be proven using the frame analysis operator T_n ∈ ℤ^{3^n × n}, the rows of which contain each element of {−1,0,1}^n. To demonstrate Eq. (221) above, let 𝐤 ∈ ℝ^{3^n} be a vector containing the (repeated) frequencies in Ω; then T_2 𝐫 = 𝐤 ∈ ℝ^9 has unique entries only under specific conditions on 𝐫. Viewing 𝐤 as a frame expansion of 𝐫 also provides a direct way to tune the elements of Ω: given a length-3^n vector 𝐤 containing the (not necessarily unique) desired elements of Ω, one can compute H by solving T_n 𝐫 = 𝐤 for 𝐫 (for instance via the pseudoinverse of T_n).
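The frame-operator viewpoint is easy to explore numerically. The snippet below (our own sketch; the choices of 𝐫 are illustrative and not taken from Ref. [44]) builds T_2 and counts the distinct frequencies 2𝐫⋅(𝐣−𝐢) for several vectors 𝐫, showing that |Ω| = 9 is reached only when the conditions of Eq. (221) hold.

```python
import numpy as np
from itertools import product

def frame_operator(n):
    # Rows of T_n enumerate every element of {-1, 0, 1}^n.
    return np.array(list(product([-1, 0, 1], repeat=n)))

T2 = frame_operator(2)                       # shape (9, 2)

for r in [np.array([1.0, 1.0]),              # violates r0 != r1: only 5 frequencies
          np.array([2.0, 1.0]),              # violates r0 != 2 r1: only 7 frequencies
          np.array([np.pi, 1.0])]:           # generic choice: saturates |Omega| = 9
    k = 2 * T2 @ r                           # candidate frequencies 2 r . (j - i)
    print(r, "->", len(np.unique(np.round(k, 12))), "distinct frequencies")
```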

C.1 Hamming encoding strategy

We instantiate the Hamiltonian of Eq. (215) for

𝐫=12(1,1,,1)n.\mathbf{r}=\frac{1}{2}(1,1,\dots,1)\in\mathbb{R}^{n}.

This introduces many degeneracies into the spectrum; using Tn𝐫=𝐤T_{n}\mathbf{r}=\mathbf{k} we immediately recover the result of Ref. [30] that the unique elements of 𝐤\mathbf{k} are given by

Ω={n,(n1),,0,,n1,n},\Omega=\{-n,-(n-1),\dots,0,\dots,n-1,n\}, (222)

with |Ω|=2n+1|\Omega|=2n+1. To recover the degeneracy of each frequency we compute

\displaystyle\lambda_{j}-\lambda_{i}=\frac{1}{2}\sum_{t=1}^{n}(2j_{t}-1)-\frac{1}{2}\sum_{t=1}^{n}(2i_{t}-1)=w(j)-w(i),   (223)

where w : {0,1}^n → {0,1,…,n} is the weight of a bitstring b = b_{n-1} ⋯ b_1 b_0 ∈ {0,1}^n, defined as

w(b)=\sum_{k=0}^{n-1}b_{k},   (224)

and when bb is written as an integer it should be interpreted according to its binary value. It will be useful to consider subsets of bitstrings having a constant weight. For each k[n]k\in[n] (results for k<0k<0 follow by symmetry), we now define the kk-weight set W(k)W(k) as

W(k)={j{0,1}n:w(j)=k}.W(k)=\{j\in\{0,1\}^{n}:w(j)=k\}. (225)

Then the degeneracy set is exactly the set of index pairs whose weights differ by k:

R(k)\displaystyle R(k) ={(i,j){0,1}n×{0,1}n:w(j)w(i)=k}\displaystyle=\{(i,j)\in\{0,1\}^{n}\times\{0,1\}^{n}:w(j)-w(i)=k\} (226)
\displaystyle=\bigcup_{p=0}^{n-k}W(p)\times W(p+k),   (227)

where × denotes the Cartesian product (this motivates the name "Hamming encoding strategy", as the frequencies and degeneracies of this model arise from pairs of indices whose Hamming weights differ by a fixed amount). From counting arguments, the degeneracy of this model is then

|R(k)|=\sum_{j=0}^{n-k}\binom{n}{j}\binom{n}{k+j}=\binom{2n}{n-k}.   (228)
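Eq. (228) is easy to verify by brute force; the short sketch below (our own check, with n = 5 chosen arbitrarily) counts pairs of bitstrings whose weights differ by k and compares against the closed form.

```python
from itertools import product
from math import comb

n = 5
bitstrings = list(product([0, 1], repeat=n))

for k in range(n + 1):
    brute = sum(1 for i in bitstrings for j in bitstrings if sum(j) - sum(i) == k)
    print(k, brute, comb(2 * n, n - k))   # both columns agree: |R(k)| = C(2n, n-k)
```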

C.2 Binary encoding strategy

We instantiate the Hamiltonian of Eq. (215) for

𝐫=12(2n1,,4,2,1)n,\mathbf{r}=\frac{1}{2}(2^{n-1},\dots,4,2,1)\in\mathbb{R}^{n},

such that 2𝐫⋅𝐣 is the decimal value of the bit vector 𝐣. We know from relations like those given in Eq. (221) that |Ω| ≪ 3^n, and indeed using Eq. (216) we find that λ_j = (2j − (2^n − 1))/2, resulting in a frequency spectrum of

\Omega=\{-2^{n}+1,-2^{n}+2,\dots,0,1,\dots,2^{n}-1\},   (229)

such that |Ω| = 2^{n+1} − 1, and the degenerate index set of the kernel for frequency k is equivalent (up to permutations) to the indices of the nonzero elements of an elementary Toeplitz matrix with 1's on the k-th diagonal:

R(k)=\{(i,j):i=j-k,\;\;i,j\in[2^{n}]\}.   (230)
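A minimal check of Eqs. (229)–(230) under the eigenvalues λ_j = (2j − (2^n − 1))/2 derived above (our own verification script, with n = 3 chosen arbitrarily): the index pairs at frequency k populate the k-th diagonal, with |R(k)| = 2^n − |k|.

```python
import numpy as np

n = 3
d = 2 ** n
lam = (2 * np.arange(d) - (d - 1)) / 2       # lambda_j = (2j - (2^n - 1)) / 2

diffs = lam[None, :] - lam[:, None]          # entry (i, j) holds lambda_j - lambda_i
for k in range(-(d - 1), d):
    R_k = np.argwhere(np.isclose(diffs, k))  # index pairs (i, j) at frequency k
    assert len(R_k) == d - abs(k)            # |R(k)| = d - |k|
    assert all(j - i == k for i, j in R_k)   # i = j - k, Eq. (230)
print("Binary encoding: |Omega| =", 2 * d - 1, "with |R(k)| = d - |k| verified")
```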

C.3 Ternary encoding strategy

We instantiate the Hamiltonian of Eq. (215) for

𝐫=12(3n1,,9,3,1)n,\mathbf{r}=\frac{1}{2}(3^{n-1},\dots,9,3,1)\in\mathbb{R}^{n},

resulting in the spectrum

Ω={12(3n1),12(3n1)+1,,0,,12(3n1)}.\Omega=\Bigl{\{}-\frac{1}{2}(3^{n}-1),-\frac{1}{2}(3^{n}-1)+1,\dots,0,\dots,\frac{1}{2}(3^{n}-1)\Bigr{\}}. (231)

This spectrum (also studied in Ref. [44]) is interesting because it is the unique choice giving rise to a Fourier spectrum with frequencies spaced by 1 that saturates the |Ω| = 3^n limit for spectra generated by separable Hamiltonians. Given k = 2𝐫⋅𝐝, the difference vector 𝐝 = 𝐣 − 𝐢 may be uniquely recovered from k. Suppose k has a ternary representation

k=k_{n-1}\dots k_{1}k_{0}=\sum_{j=0}^{n-1}k_{j}3^{j},   (232)

with digits k_j ∈ {0,1,2}. Defining the operation t : ℤ → {0,1,2}^n that converts a decimal in [0, 3^n − 1] to its vector of ternary digits according to t(k) = (k_0, k_1, …, k_{n-1}), the degeneracy of this encoding strategy follows directly from the combinatorial degeneracy given in Eq. (220):

|R(k)|=2nT(k)1,\displaystyle|R(k)|=2^{n-\lVert T(k)\rVert_{1}}, (233)

where the shifted ternary operation is defined for k ∈ Ω as

T(k)=t(k+𝐫1)𝟏,\displaystyle T(k)=t\left(k+\lVert\mathbf{r}\rVert_{1}\right)-\mathbf{1}, (234)

and 𝟏 = (1,1,…,1) ∈ ℤ^n is the array of all ones. Intuitively, this computation recovers the signed ternary digits of k, i.e. the unique expansion k = Σ_j T(k)_j 3^j with T(k)_j ∈ {−1,0,1}, from the unsigned ternary representation of Eq. (232).
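The shifted-ternary formula of Eqs. (233)–(234) can be checked against brute-force counting; the sketch below (our own, with n = 3 and a few sample frequencies chosen arbitrarily) computes the signed ternary digits of k and compares 2^{n − ∥T(k)∥_1} with a direct count over index pairs.

```python
from itertools import product

n = 3
r = [3 ** (n - 1 - t) / 2 for t in range(n)]        # r = (3^{n-1}, ..., 3, 1) / 2
shift = int(sum(r))                                  # ||r||_1 = (3^n - 1) / 2

def signed_digits(k):
    # T(k): ternary digits of k + ||r||_1, each shifted down by 1, i.e. digits in {-1, 0, 1}.
    m = k + shift
    digits = []
    for _ in range(n):
        digits.append(m % 3 - 1)
        m //= 3
    return digits

for k in range(0, 3 ** n // 2 + 1, 4):               # a few nonnegative frequencies
    brute = sum(1 for i in product([0, 1], repeat=n) for j in product([0, 1], repeat=n)
                if 2 * sum(rt * (jt - it) for rt, it, jt in zip(r, i, j)) == k)
    formula = 2 ** (n - sum(abs(dd) for dd in signed_digits(k)))
    print(k, brute, formula)   # both columns agree
```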

C.4 Nonseparable models

Given a data-encoding Hamiltonian H = diag(Λ) with Λ ∈ ℝ^d, a key feature of the associated spectrum is that it is preserved under permutations of Λ. This means that many models whose S(x) is generated by a separable Hamiltonian (a sum of single-qubit terms as in Eq. (215)) produce the same spectrum as models whose S(x) is generated by a nonseparable Hamiltonian. For instance, letting ∼ denote equivalence up to permutation and additive shift, it holds that I⊗Z ∼ Z⊗Z, so that the data-encoding unitaries

S1(x)\displaystyle S_{1}(x) =exp(i2πr(IZ)x),\displaystyle=\exp(i2\pi r(I\otimes Z)x), (235)
S2(x)\displaystyle S_{2}(x) =exp(i2πr(ZZ)x),\displaystyle=\exp(i2\pi r(Z\otimes Z)x), (236)

have identical spectrum and degeneracy properties (differing only in their corresponding degeneracy index sets RR). Understanding the classes of nonseparable data-encoding Hamiltonians that are equivalent to a separable encoding strategy up to permutation and translation may be an interesting problem for future work.

We therefore limit our discussion of nonseparable models to a particular choice of Hamiltonian which is provably inequivalent to any separable Hamiltonian, as it saturates the largest possible |Ω||\Omega| (and smallest values of |R(k)||R(k)|) available to any single-layer quantum model. This can be achieved by setting the diagonal of the data-encoding Hamiltonian as the elements of a Golomb ruler [45, 63]. The resulting spectrum is nondegenerate for all nonzero frequencies, though it is straightforward to prove that one cannot achieve uniform spacing in the spectrum (i.e., a Perfect Golomb ruler) for d5d\geq 5. As a result, the corresponding spectrum of this model generally exhibits gaps between frequencies. Further exploration of the connections to concepts from radio engineering [64, 65] and classical coding theory [66] may enrich investigations into the spectral properties of quantum models.
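As a concrete illustration (our own sketch; the particular ruler {0, 1, 4, 9, 11} is a standard order-5 Golomb ruler rather than a choice made above), placing a Golomb ruler on the diagonal of H yields a spectrum in which every nonzero frequency is nondegenerate, at the cost of gaps between frequencies.

```python
import numpy as np
from collections import Counter

ruler = [0, 1, 4, 9, 11]                    # an optimal Golomb ruler of order 5
lam = np.array(ruler)

diffs = Counter(int(lj - li) for li in lam for lj in lam if lj != li)
assert all(count == 1 for count in diffs.values())    # every nonzero frequency is nondegenerate

omega = sorted(set(diffs) | {0})
print("Omega =", omega)                      # gaps appear, e.g. +/-6 is missing for this ruler
print("|Omega| =", len(omega), "= d(d-1) + 1 =", len(ruler) * (len(ruler) - 1) + 1)
```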

Appendix D Demonstration of benign overfitting in the general quantum model

In this section we demonstrate an example of benign overfitting in the general quantum model by explicitly constructing a sequence of state preparation unitaries U for a particular choice of data-encoding unitary S(x). We consider a d-dimensional version of the Binary encoding strategy constructed from a data-encoding Hamiltonian H = diag(0, 1, 2, …, d−1) that gives rise to

\displaystyle\Omega=\{-(d-1),\dots,0,\dots,d-1\},   (237)
|R(k)|\displaystyle|R(k)| =d|k|,\displaystyle=d-|k|, (238)

for k ∈ Ω, and we will require d to be even for simplicity. Suppose the target function has a band-limited spectrum Ω_{n_0} of size n_0, with n_0 < n < d for some integer n, and that (n_0 + 1) mod 4 = 0. Define the constants

c1\displaystyle c_{1} =d2n0+14,\displaystyle=\frac{d}{2}-\frac{n_{0}+1}{4}, (239)
c2\displaystyle c_{2} =d2+n0+14.\displaystyle=\frac{d}{2}+\frac{n_{0}+1}{4}. (240)

We then choose a constant a[0,2/(n0+1)]a\in[0,2/(n_{0}+1)] and prepare |Γ|\Gamma\rangle with elements given by

γj={a,j[c1,c2),b,j[0,c1)[c2,d),\displaystyle\gamma_{j}=\begin{cases}\sqrt{a},&j\in[c_{1},c_{2}),\\ \sqrt{b},&j\in[0,c_{1})\cup[c_{2},d),\end{cases} (241)

where normalization requires that

b=1n0+12adn0+12.b=\frac{1-\frac{n_{0}+1}{2}a}{d-\frac{n_{0}+1}{2}}. (242)

For this encoding strategy, the optimal feature weights of the quantum model corresponding to positive frequencies kΩ+k\in\Omega_{+} are given by

νk=j=0dk1|γj|2|γj+k|2.\nu_{k}=\sum_{j=0}^{d-k-1}|\gamma_{j}|^{2}|\gamma_{j+k}|^{2}. (243)

Counting arguments and algebraic simplification lead to

\nu_{k}=\begin{cases}(d-B_{0}-2k)b^{2}+2kab+(B_{0}-k)a^{2},&0\leq k<c_{2}-c_{1},\\ (d-2B_{0}-k)b^{2}+2B_{0}ab,&c_{2}-c_{1}\leq k<c_{1},\\ (k-B_{0})b^{2}+(d+B_{0}-2k)ab,&c_{1}\leq k<c_{2},\\ (d-k)b^{2},&c_{2}\leq k<d,\end{cases}   (244)

where B0=(n0+1)/2B_{0}=(n_{0}+1)/2 and νk+1νk\nu_{k+1}\leq\nu_{k} for all k>0k>0. Noting that b=𝒪(d1)b=\mathcal{O}(d^{-1}) and aa is constant, we find that νk\nu_{k} has three distinct scaling regimes: νk=Ω(1)\nu_{k}=\Omega(1) for kΩn0k\in\Omega_{n_{0}}, νk=Θ(d2)\nu_{k}=\Theta\left(d^{-2}\right) when k[c2,d)k\in[c_{2},d), and νk=𝒪(d1)\nu_{k}=\mathcal{O}(d^{-1}) otherwise. We can use this behavior to bound the generalization error of the quantum model in a manner analogous to Sec. A.3.1. Defining m=|S(k)|1m=|S(k)|-1, we bound the bias according to

bias2\displaystyle\textsc{bias}^{2} =kΩn0|g^k|2(S(k)\kν)2+S(k)\kν2(S(k)ν)2\displaystyle=\sum_{k\in\Omega_{n_{0}}}|\hat{g}_{k}|^{2}\frac{\left(\sum_{\ell\in S(k)\backslash k}\nu_{\ell}\right)^{2}+\sum_{\ell\in S(k)\backslash k}\nu_{\ell}^{2}}{\left(\sum_{\ell\in S(k)}\nu_{\ell}\right)^{2}} (245)
kΩn0|g^k|2m(m+1)(db2+2B0ab)2(a2+mb2)2\displaystyle\leq\sum_{k\in\Omega_{n_{0}}}|\hat{g}_{k}|^{2}\frac{m(m+1)(db^{2}+2B_{0}ab)^{2}}{(a^{2}+mb^{2})^{2}} (246)
[m(m+1)b2](db+2B0aa2+mb2)2P\displaystyle\leq\left[m(m+1)b^{2}\right]\left(\frac{db+2B_{0}a}{a^{2}+mb^{2}}\right)^{2}P (247)
=𝒪(1n2),\displaystyle=\mathcal{O}\left(\frac{1}{n^{2}}\right), (248)

where we have defined P:=kΩn0|g^k|2P:=\sum_{k\in\Omega_{n_{0}}}|\hat{g}_{k}|^{2}. In line (246) we have used the monotonicity of νk\nu_{k} to bound the components of S(k)\kS(k)\backslash k in the numerator according to the upper bound for the interval k[c2c1,c1)k\in[c_{2}-c_{1},c_{1}) and applied a lower bound for elements of S(k)\kS(k)\backslash k in the denominator using the interval k[c2,d)k\in[c_{2},d). The scaling of Eq. (248) follows from m=𝒪(d/n)m=\mathcal{O}(d/n), which implies that m(m+1)b2=𝒪(n2)m(m+1)b^{2}=\mathcal{O}(n^{-2}), while the remaining terms in line (247) scale as no greater than a constant. To bound the variance, we observe

kΩn0S(k)ν2(S(k)ν)2\displaystyle\sum_{k\in\Omega_{n_{0}}}\frac{\sum_{\ell\in S(k)}\nu_{\ell}^{2}}{\left(\sum_{\ell\in S(k)}\nu_{\ell}\right)^{2}} (249)
kΩn0[a2(B0k)+2kab+(dB02k)b2]2+[(d2B0k)b2+2B0ab]2(a2+mb2)2\displaystyle\leq\sum_{k\in\Omega_{n_{0}}}\frac{\left[a^{2}(B_{0}-k)+2kab+(d-B_{0}-2k)b^{2}\right]^{2}+\left[(d-2B_{0}-k)b^{2}+2B_{0}ab\right]^{2}}{(a^{2}+mb^{2})^{2}} (250)
n0(a2B0+n0ab+db2)2+m(db2+2B0ab)2(a2+mb2)2\displaystyle\leq n_{0}\frac{(a^{2}B_{0}+n_{0}ab+db^{2})^{2}+m(db^{2}+2B_{0}ab)^{2}}{(a^{2}+mb^{2})^{2}} (251)
=𝒪(1),\displaystyle=\mathcal{O}(1), (252)

which follows identically to the arguments used to reach Eq. (248), together with the observation that 2k ≤ n_0 for k ∈ Ω_{n_0}. Furthermore, defining m′ = ⌊(c_1 − n_0)/n⌋ as a lower bound on the number of aliases ℓ ∈ S(k) with ℓ < c_1 for any k ∈ Ω_n∖Ω_{n_0}, we have

kΩn\Ωn0S(k)ν2(S(k)ν)2\displaystyle\sum_{k\in\Omega_{n}\backslash\Omega_{n_{0}}}\frac{\sum_{\ell\in S(k)}\nu_{\ell}^{2}}{\left(\sum_{\ell\in S(k)}\nu_{\ell}\right)^{2}} (m+1)(db2+2B0ab)2(m[(d2B0k)b2+2B0ab]+(m+1m)b2)2\displaystyle\leq\frac{(m+1)(db^{2}+2B_{0}ab)^{2}}{\left(m^{\prime}\left[(d-2B_{0}-k)b^{2}+2B_{0}ab\right]+(m+1-m^{\prime})b^{2}\right)^{2}} (253)
nn0(m+1)(db+2B0a)2(mm+1[(d2B0c1)b+2B0a]+(1mm+1)b)2\displaystyle\leq\frac{n-n_{0}}{(m+1)}\frac{(db+2B_{0}a)^{2}}{\left(\frac{m^{\prime}}{m+1}\left[(d-2B_{0}-c_{1})b+2B_{0}a\right]+\left(1-\frac{m^{\prime}}{m+1}\right)b\right)^{2}} (254)
=𝒪(n2d),\displaystyle=\mathcal{O}\left(\frac{n^{2}}{d}\right), (255)

where in line (254) we have split the denominator into frequencies n_0/2 ≤ k < c_1 and c_1 ≤ k < d, bounding ν_k for the latter by b^2. Line (255) follows from observing that m′ = Θ(d/n), which implies m′/m = Θ(1), so that the second factor in line (254) is bounded as 𝒪(1).

var=𝒪(1n+nd).\textsc{var}=\mathcal{O}\left(\frac{1}{n}+\frac{n}{d}\right). (256)

It follows that the minimum-‖⋅‖_F interpolating quantum model with state preparation unitary U satisfying Eq. (241) (up to permutations) and data encoded using the Binary encoding strategy will exhibit benign overfitting as long as the dimensionality of the model scales as d = ω(n) (e.g., if the number of qubits satisfies n_q = ω(log n)), in which case Eqs. (248) and (256) characterizing the generalization error of the model both vanish in the large-n limit.
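To make the construction concrete, the sketch below (our own numerical check; d = 256, n_0 = 7, and a = 1/(n_0 + 1) are arbitrary choices satisfying the stated constraints) builds |Γ⟩ according to Eqs. (241)–(242), evaluates ν_k directly from Eq. (243), and confirms the normalization, the monotone decay of ν_k, and the separation between the Ω(1) and Θ(d^{-2}) regimes.

```python
import numpy as np

d, n0 = 256, 7                       # (n0 + 1) % 4 == 0
B0 = (n0 + 1) // 2
c1, c2 = d // 2 - (n0 + 1) // 4, d // 2 + (n0 + 1) // 4
a = 1.0 / (n0 + 1)                   # any a in [0, 2 / (n0 + 1)]
b = (1 - (n0 + 1) / 2 * a) / (d - (n0 + 1) / 2)      # Eq. (242)

p = np.full(d, b)
p[c1:c2] = a                         # |gamma_j|^2 per Eq. (241)
assert np.isclose(p.sum(), 1.0)      # normalization

nu = np.array([sum(p[j] * p[j + k] for j in range(d - k)) for k in range(1, d)])  # Eq. (243)
assert np.all(np.diff(nu) <= 1e-15)                  # nu_{k+1} <= nu_k
print("nu_1 / nu_{d-1} =", nu[0] / nu[-1])            # separates the Omega(1) and Theta(d^-2) regimes
```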

One convenience of this demonstration of benign overfitting in a quantum model is that the state |Γ⟩ of Eq. (241) incorporates knowledge of the target function (namely the band-limited spectrum Ω_{n_0}). This could be considered a limitation of the approach, as it imposes an inductive bias on the resulting interpolating model f^{opt}, in contrast to other examples of benign overfitting that are more agnostic to the underlying distribution [6, 59]. Future work could reveal choices of |Γ⟩ that are more data-independent (with no explicit dependence on n_0) but give rise to feature weights ν_k^{opt} with the same desirable properties as Eq. (244). As described in Appendix A.3.2, these desirable properties include placing more weight on all ν_k with k ∈ Ω_{n_0} together with a long, thin tail of feature weights for all other k ∈ Ω_d∖Ω_{n_0}.