
Tensor Sketch:
Fast and Scalable Polynomial Kernel Approximation

Ninh Pham (ninh.pham@auckland.ac.nz)
School of Computer Science, University of Auckland, Auckland, New Zealand

Rasmus Pagh (pagh@di.ku.dk)
Computer Science Department, University of Copenhagen, Copenhagen, Denmark
Abstract

Approximation of non-linear kernels using random feature maps has become a powerful technique for scaling kernel methods to large datasets. We propose Tensor Sketch, an efficient random feature map for approximating polynomial kernels. Given n training samples in \mathbb{R}^{d}, Tensor Sketch computes low-dimensional embeddings in \mathbb{R}^{D} in time \mathcal{O}\left(n(d+D\log{D})\right), making it well-suited for high-dimensional and large-scale settings. We provide theoretical guarantees on the approximation error, ensuring the fidelity of the resulting kernel function estimates. We also discuss extensions and highlight applications where Tensor Sketch serves as a central computational tool.

Keywords: Polynomial kernel, SVM, Tensor Product, Count Sketches, FFT

1 Introduction

Kernel machines such as Support Vector Machines (SVMs) (Schölkopf and Smola, 2002), are powerful tools for a wide range of machine learning and data mining tasks. A key strength of kernel methods lies in their ability to capture non-linear structure in data through the use of kernel functions. These functions implicitly map data from the original space to a high-dimensional feature space, where each coordinate corresponds to a feature of the input vectors. In this kernel space, many standard learning algorithms operate using only pairwise inner products, without explicitly computing the mapped coordinates. This implicit approach not only reduces computational overhead but also enables kernel methods to handle diverse data types, including both numeric and symbolic inputs, in a unified framework.

While kernel methods have achieved considerable success across a wide range of data analysis tasks (Shawe-Taylor and Cristianini, 2004), their scalability remains a significant limitation. Kernel-based learning algorithms typically suffer from high computational and memory costs, with direct methods often requiring cubic time and quadratic space in the number of training samples (Schölkopf and Smola, 2002). This issue becomes increasingly problematic in modern machine learning applications that rely on large-scale datasets. In high-dimensional domains such as text, where data is often represented using sparse bag-of-words vectors, linear models – particularly linear SVMs – have demonstrated strong empirical performance (Joachims, 1998; Lewis et al., 2004). Consequently, there has been substantial research on designing training algorithms for linear SVMs that scale linearly with the number of training examples (Joachims, 2006; Shalev-Shwartz et al., 2011; Fan et al., 2008; Bubeck, 2015; Allen-Zhu, 2017; d’Aspremont et al., 2021).

Since non-linear SVMs with kernel functions can be interpreted as linear SVMs operating in a high-dimensional feature space, Rahimi and Recht (2007) first proposed random feature maps for approximating shift-invariant kernels to combine the advantages of linear and non-linear SVM approaches. Their approach approximates kernels by a randomized feature map from data space into a relatively low-dimensional feature space. In this randomized feature space, the kernel function of any two vectors is well approximated by their inner product with high probability. In other words, the randomized feature map can be seen as a specific dimensionality reduction from the feature space, computed directly from vectors in data space. This enables the use of fast linear learning algorithms to approximate the performance of non-linear kernel methods, significantly reducing training time while maintaining competitive generalization performance.

Given any two vectors \mathbf{x}=\left(x_{1},\ldots,x_{d}\right),\mathbf{y}=\left(y_{1},\ldots,y_{d}\right)\in\mathbb{R}^{d}, we denote their inner product by \left\langle\mathbf{x},\mathbf{y}\right\rangle=\sum_{i=1}^{d}x_{i}y_{i}. For an implicit feature space map \phi:\mathbb{R}^{d}\mapsto\mathcal{F}, the inner product between vectors in the feature space \mathcal{F} can be quickly computed as \left\langle\phi(\mathbf{x}),\phi(\mathbf{y})\right\rangle=\kappa(\mathbf{x},\mathbf{y}), where \kappa(\cdot) is an easily computable kernel function. A random feature map f:\mathbb{R}^{d}\mapsto\mathbb{R}^{D}, where D is an integer parameter, can be used to approximate a kernel \kappa(\cdot) if it satisfies

\mathbb{E}\left[\left\langle f(\mathbf{x}),f(\mathbf{y})\right\rangle\right]=\left\langle\phi(\mathbf{x}),\phi(\mathbf{y})\right\rangle=\kappa(\mathbf{x},\mathbf{y}),

with good concentration around the expected value. Using such feature maps we can transform data from the original data space into a D-dimensional randomized feature space to efficiently approximate the solutions of learning algorithms on high-dimensional feature spaces. This line of work enables kernel methods to handle massive datasets on many standard statistical learning tasks, including kernel ridge regression (Avron et al., 2017a, b), support vector machines (Rahimi and Recht, 2007; Lu et al., 2016), clustering (Wu et al., 2018; Xu and Pham, 2024) and dimensionality reduction (Lopez-Paz et al., 2014; Sriperumbudur and Sterge, 2022).

Randomized techniques for kernel approximation fall into two broad categories: data-dependent and data-independent methods. A prominent example of data-dependent approaches is the Nyström method (Williams and Seeger, 2000; Yang et al., 2012; Gittens and Mahoney, 2016), which approximates the kernel matrix using a subset of the training data. More recent variants leverage statistical properties such as leverage scores of the kernel matrix to select informative samples for improved approximation (Avron et al., 2017b; Li et al., 2019; Liu et al., 2020).

In contrast, data-independent random feature maps (Rahimi and Recht, 2007) approximate the entire kernel function – not just the matrix – using features drawn from fixed distributions. This makes them particularly well-suited for online learning and streaming settings, where access to the full training set may be limited. Understanding the theoretical and practical properties of such random features has been the focus of extensive research over the past decade (Sriperumbudur and Szabó, 2015; Bach, 2017; Rudi and Rosasco, 2017; Sun et al., 2018; Li et al., 2019). These efforts have extended to downstream machine learning applications (Muandet et al., 2017; Liu et al., 2022), as well as to scaling deep learning architectures and analyzing over-parameterized neural networks  (Yehudai and Shamir, 2019; Zandieh et al., 2021; Choromanski et al., 2021; Fu et al., 2023).

Our contribution. This paper investigates algorithmic aspects of data-independent random features, with a focus on polynomial kernels. Although many modern learning algorithms can be trained in time linear in the number of samples (Bubeck, 2015), the cost of computing random features often becomes a bottleneck. Specifically, many existing kernel approximation methods require time and space proportional to the product of the data dimension d and the number of random features D. For instance, Rahimi and Recht (2007) maintained D random vectors \mathbf{w}_{1},\ldots,\mathbf{w}_{D}\in\mathbb{R}^{d}, requiring {\mathcal{O}}\left(dD\right) time and space to compute D features. When D={\mathcal{O}}\left(d\right), this leads to quadratic costs, which may exceed the time spent in the actual learning or prediction phase. As a result, the cost of the random mapping itself can dominate the overall runtime of kernel-based methods.

We study a near-linear time random feature mapping for approximating the standard polynomial kernel \kappa(\mathbf{x},\mathbf{y})=\left(c+\left\langle\mathbf{x},\mathbf{y}\right\rangle\right)^{p} for an integer p\geq 1 and a real c\geq 0. The polynomial kernel is a widely used example of a non-stationary kernel, and it serves as a building block for a broader class of kernels that can be well approximated by polynomial expansions, including Gaussian kernels, general dot product kernels, arc-cosine and sigmoid kernels (Scetbon and Harchaoui, 2021). Unlike shift-invariant kernels, polynomial kernels in \mathbb{R}^{d} do not admit representations via spherical harmonics, and thus require different techniques for constructing random feature maps (Schoenberg, 1942; Bochner, 2005).

In this paper we present Tensor Sketch, a scalable random feature map for polynomial kernels that combines the explicit feature mapping via tensor products with an efficient sketching technique to approximate these products. Given a dataset of n points, Tensor Sketch computes D-dimensional feature embeddings in time {\mathcal{O}}\left(n(d+D\log{D})\right) and requires {\mathcal{O}}\left(1\right) extra space to store the randomness of the sketch. The core technical insight is a connection between tensor products and the fast convolution structure of Count Sketch (Charikar et al., 2002; Pagh, 2013), which allows for substantial reductions in both computational and memory costs. Empirical evaluations demonstrate that Tensor Sketch achieves high approximation accuracy while outperforming prior methods by orders of magnitude in runtime on large-scale datasets. A preliminary version of this work appeared in the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Pham and Pagh, 2013).

Later developments. Since its initial publication, Tensor Sketch has seen widespread adoption across multiple domains. It has been used to accelerate machine learning algorithms (Cui et al., 2017; Dai et al., 2017) and standard statistical learning tasks (Avron et al., 2014; Wang et al., 2015; Diao et al., 2018; Draief et al., 2018), as well as to improve computational efficiency in various computer vision applications (Gao et al., 2016; Fukui et al., 2016). Notably, Tensor Sketch has been integrated into widely used libraries such as scikit-learn (Pedregosa et al., 2011), where it is available as PolynomialCountSketch (https://scikit-learn.org/stable/modules/generated/sklearn.kernel_approximation.PolynomialCountSketch.html).

In this extended version, we introduce several important additions and refinements:

  • A survey of state-of-the-art techniques for approximating polynomial kernels, along with an overview of recent applications that utilize Tensor Sketch as a core component.

  • A revised theoretical analysis of the approximation error. (Lemma 6 in the conference version contained an error; we correct the dependence on p.) In particular, our main result (Theorem 9) establishes a variance bound on the inner product of two Tensor Sketch vectors, providing formal guarantees on the reliability and accuracy of the approximation.

The organization of the paper is as follows. In Section 2, we briefly review related work on kernel approximation. Section 3 describes background and preliminaries. Tensor Sketch is presented and analyzed in Section 4. Section 5 surveys recent applications that use Tensor Sketch as a core algorithmic component, and Section 6 concludes the paper.

2 Related Work

Decomposition methods. Traditional techniques for training non-linear SVMs on large-scale datasets include decomposition methods (Osuna et al., 1997; Schölkopf and Smola, 2002; Chang and Lin, 2011). These algorithms partition the training data into a working set and a fixed set, and iteratively optimize the dual objective with respect to the working set while holding the fixed set constant. In effect, they perform coordinate ascent over subsets of the dual variables, updating only a small number of coefficients per iteration until the Karush-Kuhn-Tucker (KKT) conditions are satisfied within a specified tolerance. While decomposition methods avoid the memory overhead of constructing the full kernel matrix, they rely on repeated numerical optimization steps, which can be computationally expensive for large datasets.

Data-dependent feature maps. To scale kernel methods to large datasets, various techniques have been developed to efficiently approximate the kernel matrix, notably Nyström methods (Williams and Seeger, 2000; Kumar et al., 2012; Gittens and Mahoney, 2016; Avron et al., 2017b; Li et al., 2019; Liu et al., 2020). These data-dependent approximation methods aim to reduce the time and space complexity while maintaining approximation quality. Nyström methods sample a subset of columns (or data points) from the kernel matrix, typically using a distribution informed by the data, and compute a rank-k approximation in time {\mathcal{O}}\left(nk^{2}+k^{3}\right). Leverage-score-based methods (Li et al., 2019) use a leverage-weighted sampling scheme to construct refined random feature maps, achieving kernel approximation in {\mathcal{O}}\left(nD^{2}+D^{3}\right) time for D features.
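To make the construction concrete, here is a minimal NumPy sketch of the Nyström idea for a Gaussian kernel. The uniform landmark sampling, the rbf_kernel helper, and all parameter names are illustrative assumptions rather than the specific variants cited above; leverage-score sampling would replace the uniform choice of landmarks.

    import numpy as np

    def rbf_kernel(A, B, gamma=1.0):
        # Gaussian kernel matrix exp(-gamma * ||a - b||^2) between rows of A and B.
        sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
        return np.exp(-gamma * sq)

    def nystrom_features(X, k, gamma=1.0, seed=0):
        # Sample k landmarks uniformly and build features Z with K approx= Z Z^T.
        rng = np.random.default_rng(seed)
        idx = rng.choice(X.shape[0], size=k, replace=False)
        C = rbf_kernel(X, X[idx], gamma)            # n x k cross-kernel block
        W = rbf_kernel(X[idx], X[idx], gamma)       # k x k landmark block
        vals, vecs = np.linalg.eigh(W)              # O(k^3) eigendecomposition
        vals = np.maximum(vals, 1e-12)              # guard against numerically negative eigenvalues
        return (C @ vecs) / np.sqrt(vals)           # n x k features, O(n k^2) overall

    X = np.random.default_rng(1).normal(size=(500, 20))
    Z = nystrom_features(X, k=50)                   # Z @ Z.T approximates the kernel matrix of X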

Data-independent feature maps. Instead of approximating the kernel matrix, Rahimi and Recht (2007) introduced Random Fourier Features (RFF) to approximate shift-invariant kernel functions, including Gaussian, Laplacian, and Cauchy kernels. Since this method approximates the entire kernel function, not just the kernel matrix, it is especially relevant for out-of-sample prediction and online learning scenarios. The method is based on Bochner’s theorem (Bochner, 2005), which relates any continuous shift-invariant positive-definite kernel to the Fourier transform of a probability distribution. RFF uses Monte Carlo sampling to approximate the associated integral, mapping input data to a randomized low-dimensional feature space where inner products approximate the kernel values.
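As a point of comparison with the polynomial-kernel maps studied later, the following minimal sketch shows the standard RFF construction for the Gaussian kernel; the function name and parameters are our own choices, while the i.i.d. Gaussian frequencies and uniform phases follow the recipe of Rahimi and Recht (2007).

    import numpy as np

    def rff_gaussian(X, D, gamma=1.0, seed=0):
        # Random Fourier features for the Gaussian kernel exp(-gamma * ||x - y||^2):
        # draw frequencies w ~ N(0, 2*gamma*I) and phases b ~ Uniform[0, 2*pi).
        rng = np.random.default_rng(seed)
        d = X.shape[1]
        W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, D))
        b = rng.uniform(0.0, 2.0 * np.pi, size=D)
        return np.sqrt(2.0 / D) * np.cos(X @ W + b)   # O(ndD) time, O(dD) memory for W

    X = np.random.default_rng(1).normal(size=(5, 10))
    Z = rff_gaussian(X, D=2000)
    # Z @ Z.T approximates the Gaussian kernel matrix of X.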

This seminal work has opened a new direction for kernel approximation techniques and has been widely adopted in machine learning applications. Since its introduction, hundreds of subsequent research papers have built on RFF, advancing both its algorithmic performance and theoretical guarantees. On the algorithmic side, researchers have proposed more efficient constructions of RFF and studied their variance reduction properties. On the theoretical side, work has focused on the approximation error of the kernel matrix, the expected risk of learning algorithms, and their generalization properties when using RFF. We refer readers to the survey by Liu et al. (2022) and references therein for a comprehensive overview.

Table 1: A summary of recent works on approximating polynomial kernels (c+⟨𝐱,𝐲⟩)^p, where c ≥ 0 and p = 𝒪(1), with their construction time and the extra memory needed to store the randomness.

Technique           | Reference                                     | Time             | Memory
Sketching           | Kar and Karnick (2012); Meister et al. (2019) | 𝒪(dD)            | 𝒪(D)
Sketching           | Hamid et al. (2014)                           | 𝒪(dD + D log D)  | 𝒪(D)
Monte Carlo method  | Pennington et al. (2015); Liu et al. (2021)   | 𝒪(dD)            | 𝒪(dD)
Sketching           | Tensor Sketch                                 | 𝒪(D log D)       | 𝒪(1)

Polynomial kernels. Our work focuses on developing data-independent random feature maps for polynomial kernels, a class of non-stationary kernels of the form \kappa(\mathbf{x},\mathbf{y})=\left(c+\left\langle\mathbf{x},\mathbf{y}\right\rangle\right)^{p}, for an integer p\geq 1 and a constant c\geq 0. Unlike shift-invariant kernels, polynomial kernels defined over \mathbb{R}^{d} do not admit spectral representations via spherical harmonics, making standard random feature techniques inapplicable. As a result, constructing efficient random feature maps for polynomial kernels typically requires tools from sketching and dimensionality reduction in linear algebra (Woodruff, 2014). We provide a detailed review of recent methods for approximating polynomial kernels, highlighting the advantages and limitations of each approach. Table 1 summarizes these methods along with their time and space complexities.

Kar and Karnick (2012) proposed random feature maps based on the Maclaurin series expansion of inhomogeneous polynomial kernels, i.e., \kappa(\mathbf{x},\mathbf{y})=\left(c+\left\langle\mathbf{x},\mathbf{y}\right\rangle\right)^{p}. Their method approximates the homogeneous polynomial kernel \kappa(\mathbf{x},\mathbf{y})=\left\langle\mathbf{x},\mathbf{y}\right\rangle^{p} by \left(\prod_{i=1}^{p}\left\langle\mathbf{w}_{i},\mathbf{x}\right\rangle\right)\left(\prod_{i=1}^{p}\left\langle\mathbf{w}_{i},\mathbf{y}\right\rangle\right), where each \mathbf{w}_{i}\in\left\{-1,1\right\}^{d} is a Rademacher vector. To construct one random feature for inhomogeneous kernels, the method randomly selects a degree-t term \left\langle\mathbf{x},\mathbf{y}\right\rangle^{t} with probability 1/2^{t+1}, corresponding to the t-th order term in the Maclaurin expansion, and uses t Rademacher vectors to construct the feature. The total runtime is {\mathcal{O}}\left(dD\right), where D is the number of random features. This approach was later referred to as Tensorized Random Projection by Meister et al. (2019), who also provided improved high-probability error bounds (see their Theorem 2.2).
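As an illustration, the following minimal NumPy sketch implements the homogeneous case, i.e., D features of the form \prod_{i=1}^{p}\left\langle\mathbf{w}_{i},\mathbf{x}\right\rangle with fresh Rademacher vectors per factor; the degree sampling for the inhomogeneous kernel is omitted, and the function name and scaling are our own choices.

    import numpy as np

    def rademacher_features(X, D, p, seed=0):
        # Each feature is prod_{i=1}^p <w_i, x> with independent Rademacher vectors w_i,
        # so that <f(x), f(y)> is an unbiased estimate of <x, y>^p.
        rng = np.random.default_rng(seed)
        n, d = X.shape
        Z = np.ones((n, D))
        for _ in range(p):
            W = rng.choice([-1.0, 1.0], size=(d, D))   # fresh Rademacher vectors per factor
            Z *= X @ W                                  # costs O(ndD) per factor
        return Z / np.sqrt(D)                           # scale so <f(x), f(y)> averages the D estimates

    X = np.random.default_rng(1).normal(size=(4, 10))
    Z = rademacher_features(X, D=5000, p=2)
    # Z @ Z.T approximates (X @ X.T) ** 2 entrywise.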

Hamid et al. (2014) proposed CRAFTMaps, which build on the Maclaurin-based method by incorporating fast Johnson-Lindenstrauss transforms (Ailon and Chazelle, 2009; Tropp, 2011). The method first constructs D^{\prime}={\mathcal{O}}\left(D\right) random features using the Maclaurin-based method (Kar and Karnick, 2012), and then applies the Subsampled Randomized Hadamard Transform (SRHT) to reduce the dimensionality from D^{\prime} to D features in {\mathcal{O}}\left(D\log{D}\right) time. The overall runtime is {\mathcal{O}}\left(dD+D\log{D}\right), where the first term accounts for feature construction and the second for the projection.

Pennington et al. (2015) proposed random feature maps for a specific class of polynomial kernels defined on the unit sphere. The method employs mixtures of Gaussian distributions to generate D spherical random Fourier features, which approximate the non-positive-definite kernel \kappa(\mathbf{x},\mathbf{y})=\left(1-\frac{1}{c}+\frac{1}{c}\left\langle\mathbf{x},\mathbf{y}\right\rangle\right)^{p}, where c\geq 2 and \|\mathbf{x}\|_{2}=\|\mathbf{y}\|_{2}=1. Like standard random Fourier methods, this construction requires {\mathcal{O}}\left(dD\right) time and storage. Although this method introduces bias in the kernel approximation, Liu et al. (2021) later demonstrated how to construct unbiased estimators for a broader class of non-positive-definite kernels, also with {\mathcal{O}}\left(dD\right) computational complexity.

Building on our work on Tensor Sketch, Ahle et al. (2020) introduced a method that improves the dependence on the polynomial degree pp. Specifically, their approach requires an embedding dimension DD that scales polynomially with pp to achieve a target error, in contrast to the exponential dependence in the original Tensor Sketch. While Tensor Sketch can be followed by a separate dimension reduction step,  Ahle et al. (2020) show that it is more efficient to interleave sketching and projection through a tree-structured composition of tensoring operations. This recursive strategy maintains computational efficiency and allows the overall approximation error to be tightly bounded.

3 Background and Preliminaries

This section introduces tensor powers, the explicit feature map for polynomial kernels, which expands input vectors into a space of exponentially growing dimension. To address the computational challenges arising from this expansion, we review key sketching techniques including AMS Sketches and Count Sketches, which serve as efficient random projection methods for approximating inner products in the high-dimensional kernel space.

3.1 Notation

For an integer d we use [d] to denote the set \{1,\dots,d\}. Consider a vector \mathbf{x}=\left(x_{1},\ldots,x_{d}\right)\in\mathbb{R}^{d}. For q>0 the \ell_{q} norm of \mathbf{x} is defined as \|\mathbf{x}\|_{q}=\left(\sum_{i=1}^{d}|x_{i}|^{q}\right)^{1/q}. Given \mathbf{y}=\left(y_{1},\ldots,y_{d}\right)\in\mathbb{R}^{d}, we define \left\langle\mathbf{x},\mathbf{y}\right\rangle=\sum_{i=1}^{d}x_{i}y_{i}. The 2nd tensor power of \mathbf{x} (i.e. the outer product \mathbf{x}\otimes\mathbf{x}), denoted by \mathbf{x}^{(2)}, is defined as the vector with entries

\mathbf{x}^{(2)}=\mathbf{x}\otimes\mathbf{x}=\begin{bmatrix}x_{1}x_{1}&x_{1}x_{2}&\cdots&x_{1}x_{d}\\ x_{2}x_{1}&x_{2}x_{2}&\cdots&x_{2}x_{d}\\ \vdots&\vdots&\ddots&\vdots\\ x_{d}x_{1}&x_{d}x_{2}&\cdots&x_{d}x_{d}\end{bmatrix}\enspace.

Though it is depicted in matrix form, we will think of \mathbf{x}^{(2)} as a vector in \mathbb{R}^{d^{2}} (with any fixed ordering of the entries). More generally, given an integer p>1, the p-th tensor power \mathbf{x}^{(p)}=\underbrace{\mathbf{x}\otimes\cdots\otimes\mathbf{x}}_{p\text{ times}} is indexed by vectors in [d]^{p}, such that

\mathbf{x}^{(p)}_{(i_{1},\ldots,i_{p})}=\prod_{j=1}^{p}x_{i_{j}}\enspace.

3.2 Tensor Powers as Feature Maps

Schölkopf and Smola (2002, Proposition 2.1) show that the tensor power is an explicit feature map for the homogeneous polynomial kernel.

Lemma 1

Given any pair of vectors \mathbf{x},\mathbf{y}\in\mathbb{R}^{d} and an integer p\geq 1, we have:

\left\langle\mathbf{x}^{(p)},\mathbf{y}^{(p)}\right\rangle=\left\langle\mathbf{x},\mathbf{y}\right\rangle^{p}\enspace.

Since the tensor power map requires d^{p} dimensions, it is not feasible to explicitly compute data coordinates in kernel space for high-dimensional vectors.
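For tiny d and p, the identity in Lemma 1 can be checked numerically by materializing the tensor power with np.kron; this brute-force construction (our own illustrative code) is exactly what becomes infeasible as d and p grow.

    import numpy as np
    from functools import reduce

    def tensor_power(x, p):
        # Explicit p-th tensor power of x, flattened into a vector of length d**p.
        return reduce(np.kron, [x] * p)

    rng = np.random.default_rng(0)
    x, y = rng.normal(size=6), rng.normal(size=6)
    p = 3
    lhs = tensor_power(x, p) @ tensor_power(y, p)   # <x^(p), y^(p)>, from vectors of length 6**3
    rhs = (x @ y) ** p                              # <x, y>^p
    print(np.allclose(lhs, rhs))                    # True, illustrating Lemma 1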

3.3 AMS Sketches

Most sketching approaches require k-wise independent hash families, defined as follows.

Definition 2

A family \mathcal{H} of functions f:[u]\rightarrow[r] is k-wise independent if for every set I\subseteq[u] of |I|=k elements and a random f\in\mathcal{H}, the vector of hash values (f(i))_{i\in I} is uniform in [r]^{k}.

Alon et al. (1999) described and analyzed a sketching approach, referred to as the AMS Sketch, to estimate the second frequency moment of a high-dimensional vector. The AMS Sketch samples random functions from a 4-wise independent family. Such samples can be generated efficiently by storing {\mathcal{O}}\left(1\right) random integers, and the sampled functions can be evaluated in {\mathcal{O}}\left(1\right) time (Carter and Wegman, 1979).

Definition 3

Given s:[d]\mapsto\left\{-1,1\right\} sampled from a 4-wise independent family, an AMS Sketch Z(\mathbf{x}) of a vector \mathbf{x}=\left(x_{1},\ldots,x_{d}\right)\in\mathbb{R}^{d} is computed as Z(\mathbf{x})=Z_{s}(\mathbf{x})=\sum_{i=1}^{d}s(i)\,x_{i}.

We use the subscript s to indicate the hash function underlying the AMS Sketch, and omit it when the context is clear. Alon et al. (1999) analyze the bias and variance of the AMS Sketch.

Lemma 4

(Alon et al. (1999, Theorem 2.2)) Consider an AMS Sketch Z(\mathbf{x}) for \mathbf{x}\in\mathbb{R}^{d}. We have \mathbb{E}\left[Z(\mathbf{x})^{2}\right]=\|\mathbf{x}\|_{2}^{2} and \mathrm{Var}\left[Z(\mathbf{x})^{2}\right]\leq 2\,\|\mathbf{x}\|_{2}^{4}.
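A minimal sketch of Definition 3 follows, assuming i.i.d. random signs in place of a 4-wise independent family (sufficient for illustration, though the formal guarantees are stated for the hash-based construction).

    import numpy as np

    def ams_sketch(x, seed=0):
        # Z(x) = sum_i s(i) x_i with random signs s(i) in {-1, +1}.
        rng = np.random.default_rng(seed)
        s = rng.choice([-1.0, 1.0], size=x.shape[0])
        return s @ x

    x = np.random.default_rng(1).normal(size=50)
    # Averaging Z(x)^2 over independent sketches concentrates around ||x||_2^2 (Lemma 4).
    est = np.mean([ams_sketch(x, seed=t) ** 2 for t in range(2000)])
    print(est, np.sum(x ** 2))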

3.4 Count Sketch

Charikar et al. (2002) described and analyzed a sketching approach, called Count Sketch, to estimate the frequency of items in a data stream.

Definition 5

Given h:[d]\mapsto[D] sampled from a 2-wise independent family, and s:[d]\mapsto\left\{-1,1\right\} sampled from a 4-wise independent family, a Count Sketch \mathbf{Cx} of a vector \mathbf{x}=\left(x_{1},\ldots,x_{d}\right)\in\mathbb{R}^{d} is computed as \mathbf{Cx}=\left((Cx)_{1},\ldots,(Cx)_{D}\right)\in\mathbb{R}^{D} where (Cx)_{k}=\sum_{i:h(i)=k}s(i)\,x_{i}.

The Count Sketch definition above is a slight modification of the original definition proposed by Charikar et al. (2002), in that we require the hash function s to be sampled from a 4-wise independent family. Weinberger et al. (2009) introduced a variant of Count Sketch as a feature hashing method for large-scale multi-task learning, leveraging its ability to approximately preserve inner products. Inspired by this property, our work applies Count Sketch in a similar fashion, but instead of operating in the input space, we use it to project implicitly into the feature space of polynomial kernels, without explicitly constructing the high-dimensional feature vectors.

Count Sketch can be seen as a random projection technique because it computes linear projections of \mathbf{x} with random vectors implicitly defined by the hash functions h and s. The following lemma provides the bias and variance of the inner product of Count Sketches; we first illustrate the construction with a small code sketch.
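In the minimal NumPy sketch below, i.i.d. hash values and signs stand in for the 2-wise and 4-wise independent families of Definition 5, and the same (h, s) pair is reused across vectors so that the sketches are comparable, previewing the inner-product preservation of Lemma 6.

    import numpy as np

    def count_sketch(x, D, seed=0):
        # (Cx)_k = sum_{i: h(i)=k} s(i) x_i for random h: [d] -> [D] and signs s: [d] -> {-1, +1}.
        rng = np.random.default_rng(seed)
        d = x.shape[0]
        h = rng.integers(0, D, size=d)          # bucket of each coordinate
        s = rng.choice([-1.0, 1.0], size=d)     # sign of each coordinate
        Cx = np.zeros(D)
        np.add.at(Cx, h, s * x)                 # scatter-add signed coordinates into buckets
        return Cx

    rng = np.random.default_rng(1)
    x, y = rng.normal(size=100), rng.normal(size=100)
    Cx, Cy = count_sketch(x, 256, seed=7), count_sketch(y, 256, seed=7)   # shared hash functions
    # <Cx, Cy> is an unbiased (but single-shot, hence noisy) estimate of <x, y>.
    print(Cx @ Cy, x @ y)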

Lemma 6

Given vectors \mathbf{x},\mathbf{y}\in\mathbb{R}^{d}, denote by \mathbf{Cx},\mathbf{Cy}\in\mathbb{R}^{D} the respective Count Sketches of \mathbf{x},\mathbf{y} based on the hash functions h,s. Then we have:

\mathbb{E}\left[\left\langle\mathbf{Cx},\mathbf{Cy}\right\rangle\right]=\left\langle\mathbf{x},\mathbf{y}\right\rangle,
\mathrm{Var}\left[\left\langle\mathbf{Cx},\mathbf{Cy}\right\rangle\right]=\frac{1}{D}\left(\sum_{i\neq j}{x_{i}^{2}y_{j}^{2}}+\sum_{i\neq j}{x_{i}y_{i}x_{j}y_{j}}\right)\leq\frac{2}{D}\,\|\mathbf{x}\|_{2}^{2}\,\|\mathbf{y}\|_{2}^{2}\enspace.

Proof  Define the indicator variable ξij=𝕀[h(i)=h(j)]\xi_{ij}=\mathbb{I}[h(i)=h(j)]. Then we can write:

𝐂𝐱,𝐂𝐲=i,jxiyjs(i)s(j)ξij=𝐱,𝐲+ijxiyjs(i)s(j)ξij.\langle{\mathbf{Cx}},\mathbf{Cy}\rangle=\sum_{i,j}x_{i}y_{j}s(i)s(j)\xi_{ij}=\langle\mathbf{x},\mathbf{y}\rangle+\sum_{i\neq j}x_{i}y_{j}s(i)s(j)\xi_{ij}\,.

Taking expectation and using independence:

𝔼[𝐂𝐱,𝐂𝐲]=𝐱,𝐲+ijxiyj𝔼[s(i)]𝔼[s(j)]𝔼[ξij]=𝐱,𝐲,\mathbb{E}[\langle{\mathbf{Cx}},\mathbf{Cy}\rangle]=\langle\mathbf{x},\mathbf{y}\rangle+\sum_{i\neq j}x_{i}y_{j}\,\mathbb{E}[s(i)]\,\mathbb{E}[s(j)]\,\mathbb{E}[\xi_{ij}]=\langle\mathbf{x},\mathbf{y}\rangle,

since 𝔼[s(i)]=0\mathbb{E}[s(i)]=0 and 𝔼[ξij]=Pr[h(i)=h(j)]=1D\mathbb{E}[\xi_{ij}]=\textrm{\bf Pr}\left[h(i)=h(j)\right]=\frac{1}{D}. For the variance, compute 𝔼[𝐂𝐱,𝐂𝐲2]\mathbb{E}[\langle{\mathbf{Cx}},\mathbf{Cy}\rangle^{2}]:

𝐂𝐱,𝐂𝐲2\displaystyle\langle{\mathbf{Cx}},\mathbf{Cy}\rangle^{2} =(𝐱,𝐲+ijxiyjs(i)s(j)ξij)2\displaystyle=\left(\langle\mathbf{x},\mathbf{y}\rangle+\sum_{i\neq j}x_{i}y_{j}s(i)s(j)\xi_{ij}\right)^{2}
=𝐱,𝐲2+2𝐱,𝐲ijxiyjs(i)s(j)ξij+(ijxiyjs(i)s(j)ξij)2.\displaystyle=\langle\mathbf{x},\mathbf{y}\rangle^{2}+2\langle\mathbf{x},\mathbf{y}\rangle\sum_{i\neq j}x_{i}y_{j}s(i)s(j)\xi_{ij}+\left(\sum_{i\neq j}x_{i}y_{j}s(i)s(j)\xi_{ij}\right)^{2}.

The middle term has zero expectation due to independence and zero-mean of s(i)s(i). Expanding the last term, and keeping only terms whose expectation survives due to 4-wise independence of ss, we get:

𝔼[(ijxiyjs(i)s(j)ξij)2]=1D(ijxi2yj2+ijxiyixjyj).\mathbb{E}\left[\left(\sum_{i\neq j}x_{i}y_{j}s(i)s(j)\xi_{ij}\right)^{2}\right]=\frac{1}{D}\left(\sum_{i\neq j}x_{i}^{2}y_{j}^{2}+\sum_{i\neq j}x_{i}y_{i}x_{j}y_{j}\right).

Hence, we have

Var[𝐂𝐱,𝐂𝐲]=𝔼[𝐂𝐱,𝐂𝐲2]𝐱,𝐲2=1D(ijxi2yj2+ijxiyixjyj).\mathrm{Var}[\langle{\mathbf{Cx}},\mathbf{Cy}\rangle]=\mathbb{E}[\langle{\mathbf{Cx}},\mathbf{Cy}\rangle^{2}]-\langle\mathbf{x},\mathbf{y}\rangle^{2}=\frac{1}{D}\left(\sum_{i\neq j}x_{i}^{2}y_{j}^{2}+\sum_{i\neq j}x_{i}y_{i}x_{j}y_{j}\right).

Finally, using Cauchy-Schwarz and basic norm identities:

ijxi2yj2𝐱22𝐲22,ijxiyixjyj𝐱22𝐲22,\sum_{i\neq j}x_{i}^{2}y_{j}^{2}\leq\|\mathbf{x}\|_{2}^{2}\|\mathbf{y}\|_{2}^{2},\quad\sum_{i\neq j}x_{i}y_{i}x_{j}y_{j}\leq\|\mathbf{x}\|_{2}^{2}\|\mathbf{y}\|_{2}^{2}\,,

we conclude:

Var[𝐂𝐱,𝐂𝐲]2D𝐱22𝐲22.\mathrm{Var}[\langle{\mathbf{Cx}},\mathbf{Cy}\rangle]\leq\frac{2}{D}\|\mathbf{x}\|_{2}^{2}\|\mathbf{y}\|_{2}^{2}\enspace.

 

3.5 AMS Sketch on the Tensor Domain

Similar to the Maclaurin-based approach of Kar and Karnick (2012), AMS Sketches can be employed as random features to approximate polynomial kernels. Below, we focus on the construction of a single random feature; computing D features is straightforward by independently repeating this process D times. Indyk and McGregor (2008), followed by Braverman et al. (2010), analyzed the use of products of AMS Sketches with independently chosen hash functions, and established the following result:

Lemma 7

(Braverman et al., 2010, Lemma 4.1) Consider \mathbf{x}\in\mathbb{R}^{d}, an integer p>1, and p independently sampled functions s_{1},\ldots,s_{p}:[d]\mapsto\{-1,1\} from a 4-wise independent family. Define Z=\prod_{j=1}^{p}Z_{s_{j}}(\mathbf{x}). Then \mathbb{E}\left[Z^{2}\right]=\|\mathbf{x}^{(p)}\|_{2}^{2}=\|\mathbf{x}\|_{2}^{2p} and \mathrm{Var}\left[Z^{2}\right]\leq(3^{p}-1)\|\mathbf{x}^{(p)}\|_{2}^{4}=(3^{p}-1)\|\mathbf{x}\|_{2}^{4p}.

Since ZZ is defined as the product of independent AMS sketches, its expectation 𝔼[Z2]\mathbb{E}[Z^{2}] can be directly derived from Lemma 4. However, analyzing the variance Var(Z2)\mathrm{Var}(Z^{2}) is significantly more challenging and constitutes the main contribution of Braverman et al. (2010).

To proceed, consider the composite hash function S(i_{1},\ldots,i_{p})=\prod_{j=1}^{p}s_{j}(i_{j}), which assigns a sign in \{-1,1\} to the coordinate x_{i_{1}}x_{i_{2}}\cdots x_{i_{p}} of \mathbf{x}^{(p)}, indexed by (i_{1},\ldots,i_{p})\in[d]^{p}. Under this definition, Z can be interpreted as an AMS Sketch applied to the p-th tensor power \mathbf{x}^{(p)}:

Z=Z_{S}(\mathbf{x}^{(p)})=\prod_{j=1}^{p}Z_{s_{j}}(\mathbf{x})\enspace.

However, the hash function S:[d]^{p}\rightarrow\{-1,1\} is not drawn from a 4-wise independent family, and thus standard AMS analysis techniques are insufficient to bound the variance. Braverman et al. (2010) address this by analyzing the combinatorial structure of correlations between S(\mathbf{u}) and S(\mathbf{v}) for distinct index vectors \mathbf{u},\mathbf{v}\in[d]^{p}.

Based on this approach, it is natural to generalize Lemma 7 to handle a pair of tensorized vectors 𝐱(p)\mathbf{x}^{(p)} and 𝐲(p)\mathbf{y}^{(p)}, as follows:

Lemma 8

Consider \mathbf{x},\mathbf{y}\in\mathbb{R}^{d}, an integer p>1, and p independently sampled functions s_{1},\ldots,s_{p}:[d]\mapsto\{-1,1\} from a 4-wise independent family. Define Z=\prod_{j=1}^{p}Z_{s_{j}}(\mathbf{x})\,Z_{s_{j}}(\mathbf{y}). Then \mathbb{E}\left[Z\right]=\left\langle\mathbf{x},\mathbf{y}\right\rangle^{p} and \mathrm{Var}\left[Z\right]\leq(3^{p}-1)\,\|\mathbf{x}\|_{2}^{2p}\,\|\mathbf{y}\|_{2}^{2p}.

Proof  Following the approach of Braverman et al. (2010), we compute the expectation and variance of ZZ. First, we consider the expectation, and for each jj, we note that

𝔼[Zsj(𝐱)Zsj(𝐲)]=𝔼[(i=1dxisj(i))(k=1dyksj(k))]=𝐱,𝐲,\mathbb{E}\bigl{[}Z_{s_{j}}(\mathbf{x})\,Z_{s_{j}}(\mathbf{y})\bigr{]}=\mathbb{E}\left[\left(\sum_{i=1}^{d}x_{i}\,s_{j}(i)\right)\left(\sum_{k=1}^{d}y_{k}\,s_{j}(k)\right)\right]=\langle\mathbf{x},\mathbf{y}\rangle,

where the last equality follows from the 4-wise independence and unbiasedness of the sjs_{j}. Since the functions sjs_{j} are independent for different jj, we have

𝔼[Z]=j=1p𝔼[Zsj(𝐱)Zsj(𝐲)]=𝐱,𝐲p.\mathbb{E}[Z]=\prod_{j=1}^{p}\mathbb{E}\bigl{[}Z_{s_{j}}(\mathbf{x})\,Z_{s_{j}}(\mathbf{y})\bigr{]}=\langle\mathbf{x},\mathbf{y}\rangle^{p}\,.

Next, we bound the variance Var(Z)=𝔼[Z2](𝔼[Z])2\operatorname{Var}(Z)=\mathbb{E}[Z^{2}]-(\mathbb{E}[Z])^{2}. Because the hash functions are independent across different jj, we may write

𝔼[Z2]=j=1p𝔼[(Zsj(𝐱)Zsj(𝐲))2].\mathbb{E}[Z^{2}]=\prod_{j=1}^{p}\mathbb{E}\Bigl{[}\bigl{(}Z_{s_{j}}(\mathbf{x})\,Z_{s_{j}}(\mathbf{y})\bigr{)}^{2}\Bigr{]}\,. (1)

For each jj, expanding the square gives

𝔼[(Zsj(𝐱)Zsj(𝐲))2]\displaystyle\mathbb{E}\Bigl{[}\bigl{(}Z_{s_{j}}(\mathbf{x})\,Z_{s_{j}}(\mathbf{y})\bigr{)}^{2}\Bigr{]} =𝔼[(i=1dxisj(i))2(k=1dyksj(k))2]\displaystyle=\mathbb{E}\left[\left(\sum_{i=1}^{d}x_{i}\,s_{j}(i)\right)^{2}\left(\sum_{k=1}^{d}y_{k}\,s_{j}(k)\right)^{2}\right]
=i=1dk=1di=1dk=1dxiykxiyk𝔼[sj(i)sj(k)sj(i)sj(k)].\displaystyle=\sum_{i=1}^{d}\sum_{k=1}^{d}\sum_{i^{\prime}=1}^{d}\sum_{k^{\prime}=1}^{d}x_{i}\,y_{k}\,x_{i^{\prime}}\,y_{k^{\prime}}\,\mathbb{E}\left[s_{j}(i)\,s_{j}(k)\,s_{j}(i^{\prime})\,s_{j}(k^{\prime})\right]\,.

Observing that 𝔼[sj(i)sj(k)sj(i)sj(k)]\mathbb{E}\left[s_{j}(i)s_{j}(k)s_{j}(i^{\prime})s_{j}(k^{\prime})\right] is nonzero only when the indices form pairs (including the possibility that all four are identical), we have

𝔼[sj(i)sj(k)sj(i)sj(k)]={1if i=k=i=k,1if i=ki=k,1if i=ik=k,1if i=kk=i,0otherwise.\mathbb{E}\left[s_{j}(i)s_{j}(k)s_{j}(i^{\prime})s_{j}(k^{\prime})\right]=\begin{cases}1&\text{if }i=k=i^{\prime}=k^{\prime},\\ 1&\text{if }i=k\neq i^{\prime}=k^{\prime},\\ 1&\text{if }i=i^{\prime}\neq k=k^{\prime},\\ 1&\text{if }i=k^{\prime}\neq k=i^{\prime},\\ 0&\text{otherwise}.\end{cases}

The contribution from terms with i=k=i=ki=k=i^{\prime}=k^{\prime} is i=1dxi2yi2\sum_{i=1}^{d}x_{i}^{2}y_{i}^{2}. Terms with i=ki=ki=k\neq i^{\prime}=k^{\prime} contribute

iixiyixiyi=(i=1dxiyi)2i=1dxi2yi2=𝐱,𝐲2i=1dxi2yi2.\sum_{i\neq i^{\prime}}x_{i}\,y_{i}\,x_{i^{\prime}}\,y_{i^{\prime}}=\left(\sum_{i=1}^{d}x_{i}\,y_{i}\right)^{2}-\sum_{i=1}^{d}x_{i}^{2}y_{i}^{2}=\langle\mathbf{x},\mathbf{y}\rangle^{2}-\sum_{i=1}^{d}x_{i}^{2}y_{i}^{2}\,.

The case i=kk=ii=k^{\prime}\neq k=i^{\prime} is symmetric and yields the same contribution. Finally, for i=ik=ki=i^{\prime}\neq k=k^{\prime} we obtain

ikxi2yk2=𝐱22𝐲22i=1dxi2yi2.\sum_{i\neq k}x_{i}^{2}\,y_{k}^{2}=\|\mathbf{x}\|_{2}^{2}\,\|\mathbf{y}\|_{2}^{2}-\sum_{i=1}^{d}x_{i}^{2}y_{i}^{2}\,.

Thus, summing these contributions, we have

𝔼[(Zsj(𝐱)Zsj(𝐲))2]\displaystyle\mathbb{E}\Bigl{[}\bigl{(}Z_{s_{j}}(\mathbf{x})\,Z_{s_{j}}(\mathbf{y})\bigr{)}^{2}\Bigr{]} =i=1dxi2yi2+2(𝐱,𝐲2i=1dxi2yi2)+(𝐱22𝐲22i=1dxi2yi2)\displaystyle=\sum_{i=1}^{d}x_{i}^{2}y_{i}^{2}+2\Bigl{(}\langle\mathbf{x},\mathbf{y}\rangle^{2}-\sum_{i=1}^{d}x_{i}^{2}y_{i}^{2}\Bigr{)}+\Bigl{(}\|\mathbf{x}\|_{2}^{2}\,\|\mathbf{y}\|_{2}^{2}-\sum_{i=1}^{d}x_{i}^{2}y_{i}^{2}\Bigr{)}
=2𝐱,𝐲2+𝐱22𝐲222i=1dxi2yi2.\displaystyle=2\langle\mathbf{x},\mathbf{y}\rangle^{2}+\|\mathbf{x}\|_{2}^{2}\,\|\mathbf{y}\|_{2}^{2}-2\sum_{i=1}^{d}x_{i}^{2}y_{i}^{2}\,.

Using the Cauchy–Schwarz inequality, 𝐱,𝐲2𝐱22𝐲22\langle\mathbf{x},\mathbf{y}\rangle^{2}\leq\|\mathbf{x}\|_{2}^{2}\,\|\mathbf{y}\|_{2}^{2}, and noting that i=1dxi2yi20\sum_{i=1}^{d}x_{i}^{2}y_{i}^{2}\geq 0, it follows that

𝔼[(Zsj(𝐱)Zsj(𝐲))2]3𝐱22𝐲22.\mathbb{E}\Bigl{[}\bigl{(}Z_{s_{j}}(\mathbf{x})\,Z_{s_{j}}(\mathbf{y})\bigr{)}^{2}\Bigr{]}\leq 3\,\|\mathbf{x}\|_{2}^{2}\,\|\mathbf{y}\|_{2}^{2}\,.

Since the p factors in (1) are identical, substituting the exact expression above into (1) gives

\mathbb{E}[Z^{2}]=b^{p},\qquad\text{where }b:=2\langle\mathbf{x},\mathbf{y}\rangle^{2}+\|\mathbf{x}\|_{2}^{2}\,\|\mathbf{y}\|_{2}^{2}-2\sum_{i=1}^{d}x_{i}^{2}y_{i}^{2}\leq 3\,\|\mathbf{x}\|_{2}^{2}\,\|\mathbf{y}\|_{2}^{2}\,.

Writing a:=\langle\mathbf{x},\mathbf{y}\rangle^{2}, so that (\mathbb{E}[Z])^{2}=a^{p}, the Cauchy–Schwarz inequality gives a\leq\|\mathbf{x}\|_{2}^{2}\,\|\mathbf{y}\|_{2}^{2}, and b-a=\langle\mathbf{x},\mathbf{y}\rangle^{2}+\|\mathbf{x}\|_{2}^{2}\,\|\mathbf{y}\|_{2}^{2}-2\sum_{i=1}^{d}x_{i}^{2}y_{i}^{2}\leq 2\,\|\mathbf{x}\|_{2}^{2}\,\|\mathbf{y}\|_{2}^{2}. Hence

\operatorname{Var}(Z)=\mathbb{E}[Z^{2}]-\langle\mathbf{x},\mathbf{y}\rangle^{2p}=b^{p}-a^{p}=(b-a)\sum_{t=0}^{p-1}b^{t}a^{p-1-t}\leq 2\,\|\mathbf{x}\|_{2}^{2}\,\|\mathbf{y}\|_{2}^{2}\cdot\frac{3^{p}-1}{2}\left(\|\mathbf{x}\|_{2}^{2}\,\|\mathbf{y}\|_{2}^{2}\right)^{p-1}=(3^{p}-1)\,\|\mathbf{x}\|_{2}^{2p}\,\|\mathbf{y}\|_{2}^{2p}\,,

which completes the proof.

 

Lemma 8 shows that the AMS sketch can be interpreted as a variant of the Maclaurin-based approach of Kar and Karnick (2012), where the Rademacher vectors 𝐰i\mathbf{w}_{i} are replaced by hash functions ss drawn from a 4-wise independent family. This substitution enables a provable variance bound for the resulting random feature maps.
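A minimal sketch of the estimator in Lemma 8 follows, again using i.i.d. random signs as a stand-in for 4-wise independent hash functions; averaging many independent copies of Z illustrates that it is an unbiased but high-variance estimator of \left\langle\mathbf{x},\mathbf{y}\right\rangle^{p}.

    import numpy as np

    def tensor_ams_estimate(x, y, p, seed=0):
        # Z = prod_{j=1}^p Z_{s_j}(x) * Z_{s_j}(y), with a fresh sign function s_j per factor.
        rng = np.random.default_rng(seed)
        Z = 1.0
        for _ in range(p):
            s = rng.choice([-1.0, 1.0], size=x.shape[0])
            Z *= (s @ x) * (s @ y)
        return Z

    rng = np.random.default_rng(1)
    x = rng.normal(size=30)
    y = x + 0.1 * rng.normal(size=30)   # correlated pair so <x, y>^2 dominates the noise
    est = np.mean([tensor_ams_estimate(x, y, p=2, seed=t) for t in range(20000)])
    print(est, (x @ y) ** 2)            # the average approaches <x, y>^2 as more copies are used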

4 Tensor Sketch

We now describe how to compute the Count Sketch of a tensor power \mathbf{x}^{(p)}, which serves as a random feature map for approximating polynomial kernels \kappa(\mathbf{x},\mathbf{y})=\left\langle\mathbf{x},\mathbf{y}\right\rangle^{p}, for an integer p>0. For the kernel \kappa(\mathbf{x},\mathbf{y})=\left(c+\left\langle\mathbf{x},\mathbf{y}\right\rangle\right)^{p}, we can avoid the constant c>0 by adding an extra dimension of value \sqrt{c} to all data vectors. For a fixed polynomial degree p>1, our method maps an input vector \mathbf{x}\in\mathbb{R}^{d} to a low-dimensional feature vector in \mathbb{R}^{D} in time \mathcal{O}(d+D\log D). This runtime represents a significant improvement over the AMS Sketch-based approach described in Section 3.5, both in terms of computational efficiency and practical scalability.

4.1 Convolution of Count Sketches

Pagh (2013) introduced a fast algorithm for computing the Count Sketch of the outer product of two vectors. As our method builds on this technique, we briefly review it here. Rather than explicitly forming the outer product, the key idea is to first compute the Count Sketches of the input vectors and then derive the sketch of their outer product directly from these compressed representations. As we will show, this process reduces to polynomial multiplication, which can be efficiently implemented using the Fast Fourier Transform (FFT); for background on the FFT, see for example Kleinberg and Tardos (2005, Section 5.6). This yields an algorithm to compute the Count Sketch of an outer product in time near-linear in the size of the sketches.

More precisely, given a vector \mathbf{x}\in\mathbb{R}^{d}, we denote by \mathbf{C_{1}x},\mathbf{C_{2}x}\in\mathbb{R}^{D} two different Count Sketches using hash functions h_{1},h_{2}:[d]\mapsto[D] and s_{1},s_{2}:[d]\mapsto\left\{-1,1\right\}, all independently sampled from 2-wise independent families. The aim is to compute a Count Sketch of the outer product \mathbf{x}\otimes\mathbf{x}\in\mathbb{R}^{d^{2}}, denoted \mathbf{Cx}^{(2)}\in\mathbb{R}^{D}, from \mathbf{C_{1}x} and \mathbf{C_{2}x}. We define the Count Sketch \mathbf{Cx}^{(2)} in terms of the hash functions H:[d^{2}]\mapsto[D] and S:[d^{2}]\mapsto\left\{-1,1\right\} derived from the functions h_{1},h_{2},s_{1},s_{2}, where:

H(i_{1},i_{2})=\left(h_{1}(i_{1})+h_{2}(i_{2})\right)\text{ mod }D\quad\text{ and }\quad S(i_{1},i_{2})=s_{1}(i_{1})s_{2}(i_{2})\enspace. (2)

It is well known that since h_{1}, h_{2}, s_{1}, s_{2} are sampled from 2-wise independent families, H and S are also 2-wise independent (Pǎtraşcu and Thorup, 2012). Naïvely computing \mathbf{Cx}^{(2)} would require {\mathcal{O}}\left(d^{2}\right) time. However, by viewing Count Sketches as polynomials, we can exploit the FFT to quickly compute \mathbf{Cx}^{(2)} under the hash functions H and S defined in Equation 2.

In particular, we represent the Count Sketches \mathbf{C_{1}x},\mathbf{C_{2}x}\in\mathbb{R}^{D} as polynomials of degree D-1, where the entries (\mathbf{C_{1}x})_{k} and (\mathbf{C_{2}x})_{k} are the coefficients of \omega^{k} in the polynomials

P_{\mathbf{x}}(\omega)=\sum_{i=1}^{d}s_{1}(i)x_{i}\,\omega^{h_{1}(i)}\quad\text{ and }\quad Q_{\mathbf{x}}(\omega)=\sum_{i=1}^{d}s_{2}(i)x_{i}\,\omega^{h_{2}(i)}\enspace,

respectively. In other words, the k-th features of \mathbf{C_{1}x} and \mathbf{C_{2}x}, corresponding to the coefficients of the term \omega^{k}, are \sum_{i,h_{1}(i)=k}s_{1}(i)x_{i} and \sum_{i,h_{2}(i)=k}s_{2}(i)x_{i}, respectively. We derive the polynomial multiplication P_{\mathbf{x}}(\omega)\times Q_{\mathbf{x}}(\omega) as follows.

P_{\mathbf{x}}(\omega)\times Q_{\mathbf{x}}(\omega)=\left(\sum_{i=1}^{d}{s_{1}(i)x_{i}\,\omega^{h_{1}(i)}}\right)\left(\sum_{i=1}^{d}{s_{2}(i)x_{i}\,\omega^{h_{2}(i)}}\right)
=\sum_{i_{1},i_{2}=1}^{d}{s_{1}(i_{1})s_{2}(i_{2})x_{i_{1}}x_{i_{2}}\,\omega^{h_{1}(i_{1})+h_{2}(i_{2})}}=\sum_{i_{1},i_{2}=1}^{d}{S(i_{1},i_{2})x_{i_{1}}x_{i_{2}}\,\omega^{h_{1}(i_{1})+h_{2}(i_{2})}}\,.

We note that the polynomial P_{\mathbf{x}}(\omega)\times Q_{\mathbf{x}}(\omega) has degree 2D-2 since h_{1},h_{2}:[d]\mapsto[D]. We reduce it to a polynomial of degree D-1 by folding the coefficient of each term \omega^{k} into the term \omega^{k\text{ mod }D}, for 0\leq k\leq 2D-2. We denote by P_{\mathbf{x}^{(2)}}(\omega) the resulting (D-1)-degree polynomial obtained from P_{\mathbf{x}}(\omega)\times Q_{\mathbf{x}}(\omega). It is clear that P_{\mathbf{x}^{(2)}}(\omega) is the polynomial representation of the Count Sketch \mathbf{Cx}^{(2)}\in\mathbb{R}^{D} of \mathbf{x}^{(2)} using H and S, as

P_{\mathbf{x}^{(2)}}(\omega)=\sum_{i_{1},i_{2}=1}^{d}{S(i_{1},i_{2})x_{i_{1}}x_{i_{2}}\,\omega^{\left(h_{1}(i_{1})+h_{2}(i_{2})\right)\text{ mod }D}}=\sum_{i_{1},i_{2}=1}^{d}{S(i_{1},i_{2})x_{i_{1}}x_{i_{2}}\,\omega^{H(i_{1},i_{2})}}\,.

The (D-1)-degree polynomial P_{\mathbf{x}^{(2)}}(\omega) derived from the polynomial multiplication P_{\mathbf{x}}(\omega)\times Q_{\mathbf{x}}(\omega) can be computed in time {\mathcal{O}}\left(D\log{D}\right) using the FFT and its inverse \text{FFT}^{-1}:

P_{\mathbf{x}^{(2)}}(\omega)=\text{FFT}^{-1}\left(\text{FFT}(P_{\mathbf{x}}(\omega))\circ\text{FFT}(Q_{\mathbf{x}}(\omega))\right)\enspace,

where \circ is the component-wise product operator defined by ({\bf a}\circ{\bf b})_{i}=a_{i}b_{i}. In other words, the Count Sketch \mathbf{Cx}^{(2)} of \mathbf{x}\otimes\mathbf{x} can be efficiently computed from the Count Sketches \mathbf{C_{1}x} and \mathbf{C_{2}x} in {\mathcal{O}}\left(d+D\log{D}\right) time. The first term comes from constructing the Count Sketches and the latter from running the FFT three times.
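The following self-contained NumPy sketch checks this equivalence numerically: the circular convolution of \mathbf{C_{1}x} and \mathbf{C_{2}x} computed via the FFT coincides with the Count Sketch of \mathbf{x}\otimes\mathbf{x} built directly from the composite hash functions H and S of Equation 2. The i.i.d. hash values are an illustrative stand-in for the 2-wise independent families.

    import numpy as np

    rng = np.random.default_rng(0)
    d, D = 40, 64
    x = rng.normal(size=d)
    h1, h2 = rng.integers(0, D, size=d), rng.integers(0, D, size=d)
    s1, s2 = rng.choice([-1.0, 1.0], size=d), rng.choice([-1.0, 1.0], size=d)

    def count_sketch(x, h, s, D):
        C = np.zeros(D)
        np.add.at(C, h, s * x)          # (Cx)_k = sum_{i: h(i)=k} s(i) x_i
        return C

    C1x, C2x = count_sketch(x, h1, s1, D), count_sketch(x, h2, s2, D)
    # FFT route: circular convolution of the two sketches.
    conv = np.fft.irfft(np.fft.rfft(C1x) * np.fft.rfft(C2x), n=D)

    # Direct route: Count Sketch of the outer product using the composite hashes H and S of Equation 2.
    direct = np.zeros(D)
    for i in range(d):
        for j in range(d):
            direct[(h1[i] + h2[j]) % D] += s1[i] * s2[j] * x[i] * x[j]

    print(np.allclose(conv, direct))    # True: the FFT route matches the composite-hash sketch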

4.2 Tensor Sketch

We now extend the previous method to compute the Tensor Sketch of a p-th tensor power \mathbf{x}^{(p)}. This is achieved by convolving p independent Count Sketches, each constructed using hash functions h_{1},\ldots,h_{p}:[d]\to[D] and s_{1},\ldots,s_{p}:[d]\to\{-1,1\}. The hash functions h_{i} are drawn from 2-wise independent families, but the sign functions s_{i} are drawn from 4-wise independent families to ensure variance bounds in the resulting sketch. This construction yields a Count Sketch for \mathbf{x}^{(p)}, referred to as the p-th order Tensor Sketch, defined by the following composite hash functions:

H(i_{1},\ldots,i_{p})=\left(\sum_{j=1}^{p}{h_{j}(i_{j})}\right)\text{ mod }D\quad\text{ and }\quad S(i_{1},\ldots,i_{p})=\prod_{j=1}^{p}{s_{j}(i_{j})}\,.

We leverage the efficient computation of Count Sketches over tensor domains to develop a fast algorithm for approximating the homogeneous polynomial kernel \kappa(\mathbf{x},\mathbf{y})=\langle\mathbf{x},\mathbf{y}\rangle^{p}, where p is a positive integer. For each input vector \mathbf{x}\in\mathbb{R}^{d}, Tensor Sketch computes a Count Sketch of the p-fold tensor product \mathbf{x}^{(p)}, producing a D-dimensional random feature map in \mathbb{R}^{D} that approximates the kernel. The full procedure is outlined in Algorithm 1, which illustrates how Tensor Sketch efficiently maps input vectors to the lower-dimensional kernel feature space.

Algorithm 1 Tensor Sketch(\mathbf{x},p,D)
Input: Vector \mathbf{x}\in\mathbb{R}^{d}; integers D and p>1; hash functions h_{1},\ldots,h_{p}:[d]\mapsto[D] independently sampled from a 2-wise independent family; s_{1},\ldots,s_{p}:[d]\mapsto\{-1,1\} independently sampled from a 4-wise independent family.
Output: A feature vector f(\mathbf{x})\in\mathbb{R}^{D} such that \mathbb{E}\left[\left\langle f(\mathbf{x}),f(\mathbf{y})\right\rangle\right]=\kappa(\mathbf{x},\mathbf{y})=\left\langle\mathbf{x},\mathbf{y}\right\rangle^{p}
1:  For i=1,\ldots,p create the Count Sketch \mathbf{C_{i}x} using hash functions h_{i},s_{i}
2:  For i=1,\ldots,p let \widehat{\mathbf{C}}_{i}\mathbf{x}\leftarrow\text{FFT}(\mathbf{C_{i}x})
3:  Let \widehat{\mathbf{C}}\mathbf{x}\leftarrow\widehat{\mathbf{C}}_{1}\mathbf{x}\circ\ldots\circ\widehat{\mathbf{C}}_{p}\mathbf{x} (component-wise multiplication)
4:  Let f(\mathbf{x})\leftarrow\text{FFT}^{-1}(\widehat{\mathbf{C}}\mathbf{x})
5:  return f(\mathbf{x})

We maintain 2p independent hash functions: h_{1},\ldots,h_{p}, which are 2-wise independent, and s_{1},\ldots,s_{p}, which are 4-wise independent. For each input vector \mathbf{x}, we construct p independent Count Sketches of size D using these hash functions (Line 1 of Algorithm 1). The final sketch of the p-fold tensor product \mathbf{x}^{(p)} is then computed via fast polynomial multiplication using the Fast Fourier Transform (FFT). This procedure yields a random feature map f that serves as an unbiased estimator of the homogeneous polynomial kernel for any pair of input vectors.

We now analyze the computational and space complexity of Tensor Sketch. Since the degree p is fixed and typically small, only \mathcal{O}(1) space is required to store the 2p hash functions (Carter and Wegman, 1979; Thorup and Zhang, 2012). For each vector, computing the sketch of its p-fold tensor product takes \mathcal{O}(d+D\log D) time, due to the use of FFT. Hence, given n data points, the total runtime of Tensor Sketch is \mathcal{O}(n(d+D\log D)).
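A minimal NumPy rendering of Algorithm 1 is given below. The i.i.d. hash values stand in for the 2-wise and 4-wise independent families, the helper names are our own, and (as noted at the start of Section 4) the inhomogeneous kernel (c+\left\langle\mathbf{x},\mathbf{y}\right\rangle)^{p} can be handled by appending a coordinate of value \sqrt{c} to every row before sketching.

    import numpy as np

    def count_sketch_rows(X, h, s, D):
        # Count Sketch of every row of X with bucket hash h and sign hash s.
        n, d = X.shape
        C = np.zeros((n, D))
        for k in range(d):                      # scatter-add column k into its bucket h[k]
            C[:, h[k]] += s[k] * X[:, k]
        return C

    def tensor_sketch(X, p, D, seed=0):
        # Algorithm 1: f(x) = FFT^{-1}( FFT(C_1 x) o ... o FFT(C_p x) ), one row per input vector.
        rng = np.random.default_rng(seed)
        n, d = X.shape
        fft_prod = np.ones((n, D // 2 + 1), dtype=complex)
        for _ in range(p):
            h = rng.integers(0, D, size=d)      # stand-in for a 2-wise independent hash function
            s = rng.choice([-1.0, 1.0], size=d) # stand-in for a 4-wise independent sign function
            fft_prod *= np.fft.rfft(count_sketch_rows(X, h, s, D), axis=1)
        return np.fft.irfft(fft_prod, n=D, axis=1)

    rng = np.random.default_rng(1)
    X = rng.normal(size=(4, 30))
    X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm rows make the comparison easy to read
    Z = tensor_sketch(X, p=2, D=4096)
    print(np.round(Z @ Z.T, 3))                     # approximates the exact Gram matrix below
    print(np.round((X @ X.T) ** 2, 3))              # diagonal entries match closely; small off-diagonal
                                                    # entries are noisier, as Theorem 9 predicts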

To improve the accuracy of kernel approximation, D is often chosen to be \mathcal{O}(d), resulting in an overall time complexity of \mathcal{O}(nd\log d). This is significantly faster than earlier approaches such as Kar and Karnick (2012), Hamid et al. (2014), and Pennington et al. (2015), which typically require \mathcal{O}(nd^{2}) time. Furthermore, Tensor Sketch only requires \mathcal{O}(1) additional memory to store the hash functions, while previous methods demand \mathcal{O}(d) space (Kar and Karnick, 2012; Hamid et al., 2014; Meister et al., 2019) or even \mathcal{O}(d^{2}) (Pennington et al., 2015; Liu et al., 2021).

4.3 Error Analysis

In this section, we analyze the estimation accuracy of the polynomial kernel \kappa(\mathbf{x},\mathbf{y})=\langle\mathbf{x},\mathbf{y}\rangle^{p} for a positive integer p, focusing on the concentration behavior of the estimator produced by Tensor Sketch. We derive bounds on the variance of the estimator as a function of the number of random features D, and show that the estimate concentrates tightly around its expected value. The following theorem establishes both the unbiasedness and a variance bound for the approximation provided by Tensor Sketch.

Theorem 9

Given two vectors \mathbf{x},\mathbf{y}\in\mathbb{R}^{d}, we denote by \mathbf{Cx}^{(p)},\mathbf{Cy}^{(p)}\in\mathbb{R}^{D} the Count Sketches of \mathbf{x}^{(p)},\mathbf{y}^{(p)}\in\mathbb{R}^{d^{p}} using hash functions h_{1},\ldots,h_{p}:[d]\mapsto[D] and s_{1},\ldots,s_{p}:[d]\mapsto\left\{-1,1\right\} chosen independently from 2-wise and 4-wise independent families, respectively. Then we have

\mathbb{E}\left[\left\langle\mathbf{Cx}^{(p)},\mathbf{Cy}^{(p)}\right\rangle\right]=\left\langle\mathbf{x}^{(p)},\mathbf{y}^{(p)}\right\rangle=\left\langle\mathbf{x},\mathbf{y}\right\rangle^{p},
\mathrm{Var}\left[\left\langle\mathbf{Cx}^{(p)},\mathbf{Cy}^{(p)}\right\rangle\right]\leq\frac{3^{p}-1}{D}\|\mathbf{x}\|_{2}^{2p}\|\mathbf{y}\|_{2}^{2p}\enspace.

Proof  We note that the Tensor Sketches \mathbf{Cx}^{(p)},\mathbf{Cy}^{(p)} are the Count Sketches of the tensor powers \mathbf{X}=\mathbf{x}^{(p)},\mathbf{Y}=\mathbf{y}^{(p)} using the two new hash functions H:[d]^{p}\mapsto[D] and S:[d]^{p}\mapsto\left\{-1,1\right\} such that

H(i_{1},\ldots,i_{p})=\left(\sum_{j=1}^{p}h_{j}(i_{j})\right)\text{ mod }D\quad\text{ and }\quad S(i_{1},\ldots,i_{p})=\prod_{j=1}^{p}s_{j}(i_{j})\enspace.

For notational simplicity, we use \mathbf{u},\mathbf{v}\in[d]^{p} to index the d^{p}-dimensional vectors \mathbf{X},\mathbf{Y}. Also note that H is still 2-wise independent (Pǎtraşcu and Thorup, 2012), but S is no longer 4-wise independent (Indyk and McGregor, 2008; Braverman et al., 2010). This is what leads to the incorrect variance bound in Lemma 6 of the conference version (Pham and Pagh, 2013), where the dependence on p is missing.

Define the indicator variable ξ𝐮𝐯=𝕀[H(𝐮)=H(𝐯)]\xi_{\mathbf{u}\mathbf{v}}=\mathbb{I}[H(\mathbf{u})=H(\mathbf{v})] for any 𝐮,𝐯[d]p\mathbf{u},\mathbf{v}\in[d]^{p}, we have

𝐂𝐱(p),𝐂𝐲(p)\displaystyle\left\langle{\mathbf{Cx}}^{(p)},\mathbf{Cy}^{(p)}\right\rangle =𝐮,𝐯[d]p𝐗𝐮𝐘𝐯S(𝐮)S(𝐯)ξ𝐮𝐯\displaystyle=\sum_{\mathbf{u},\mathbf{v}\in[d]^{p}}{\mathbf{X}_{\mathbf{u}}\mathbf{Y}_{\mathbf{v}}S(\mathbf{u})S(\mathbf{v})\xi_{\mathbf{u}\mathbf{v}}} =𝐗,𝐘+𝐮𝐯[d]p𝐗𝐮𝐘𝐯S(𝐮)S(𝐯)ξ𝐮𝐯.\displaystyle=\left\langle\mathbf{X},\mathbf{Y}\right\rangle+\sum_{\mathbf{u}\neq\mathbf{v}\in[d]^{p}}{\mathbf{X}_{\mathbf{u}}\mathbf{Y}_{\mathbf{v}}S(\mathbf{u})S(\mathbf{v})\xi_{\mathbf{u}\mathbf{v}}}\,.

Recall that S:[d]^{p}\rightarrow\left\{-1,1\right\} is 2-wise independent. Applying the independence property of this hash function, we can verify that \mathbb{E}\left[\left\langle\mathbf{Cx}^{(p)},\mathbf{Cy}^{(p)}\right\rangle\right]=\left\langle\mathbf{X},\mathbf{Y}\right\rangle=\left\langle\mathbf{x},\mathbf{y}\right\rangle^{p}.

For the variance, we compute \mathbb{E}\left[\left\langle\mathbf{Cx}^{(p)},\mathbf{Cy}^{(p)}\right\rangle^{2}\right] by expanding \left\langle\mathbf{Cx}^{(p)},\mathbf{Cy}^{(p)}\right\rangle^{2}:

𝐂𝐱(p),𝐂𝐲(p)2=(𝐗,𝐘+𝐮𝐯𝐗𝐮𝐘𝐯S(𝐮)S(𝐯)ξ𝐮𝐯)2\displaystyle\left\langle{\mathbf{Cx}}^{(p)},\mathbf{Cy}^{(p)}\right\rangle^{2}=\left(\left\langle\mathbf{X},\mathbf{Y}\right\rangle+\sum_{\mathbf{u}\neq\mathbf{v}}{\mathbf{X}_{\mathbf{u}}\mathbf{Y}_{\mathbf{v}}S(\mathbf{u})S(\mathbf{v})\xi_{\mathbf{u}\mathbf{v}}}\right)^{2}
=𝐗,𝐘2+2𝐗,𝐘𝐮𝐯𝐗𝐮𝐘𝐯S(𝐮)S(𝐯)ξ𝐮𝐯+(𝐮𝐯𝐗𝐮𝐘𝐯S(𝐮)S(𝐯)ξ𝐮𝐯)2,\displaystyle=\left\langle\mathbf{X},\mathbf{Y}\right\rangle^{2}+2\left\langle\mathbf{X},\mathbf{Y}\right\rangle\sum_{\mathbf{u}\neq\mathbf{v}}{\mathbf{X}_{\mathbf{u}}\mathbf{Y}_{\mathbf{v}}S(\mathbf{u})S(\mathbf{v})\xi_{\mathbf{u}\mathbf{v}}}+\left(\sum_{\mathbf{u}\neq\mathbf{v}}{\mathbf{X}_{\mathbf{u}}\mathbf{Y}_{\mathbf{v}}S(\mathbf{u})S(\mathbf{v})\xi_{\mathbf{u}\mathbf{v}}}\right)^{2}\,,

where the expectation for the second term is 0. Applying the independence between SS and HH together with Lemma 8, we can bound the expectation of the last term and prove the variance claim as follows.

𝔼[(𝐮𝐯𝐗𝐮𝐘𝐯S(𝐮)S(𝐯)ξ𝐮𝐯)2]\displaystyle\mathbb{E}\left[\left(\sum_{\mathbf{u}\neq\mathbf{v}}{\mathbf{X}_{\mathbf{u}}\mathbf{Y}_{\mathbf{v}}S(\mathbf{u})S(\mathbf{v})\xi_{\mathbf{u}\mathbf{v}}}\right)^{2}\right]
=𝔼[𝐮1𝐯1𝐮2𝐯2𝐗𝐮1𝐘𝐯1𝐗𝐮2𝐘𝐯2S(𝐮1)S(𝐯1)S(𝐮2)S(𝐯2)ξ𝐮1𝐯1ξ𝐮2𝐯2]\displaystyle=\mathbb{E}\left[\sum_{\begin{subarray}{c}\mathbf{u}_{1}\neq\mathbf{v}_{1}\\ \mathbf{u}_{2}\neq\mathbf{v}_{2}\end{subarray}}\mathbf{X}_{\mathbf{u}_{1}}\mathbf{Y}_{\mathbf{v}_{1}}\mathbf{X}_{\mathbf{u}_{2}}\mathbf{Y}_{\mathbf{v}_{2}}S(\mathbf{u}_{1})S(\mathbf{v}_{1})S(\mathbf{u}_{2})S(\mathbf{v}_{2})\xi_{\mathbf{u}_{1}\mathbf{v}_{1}}\xi_{\mathbf{u}_{2}\mathbf{v}_{2}}\right]
=𝐮1𝐯1𝐮2𝐯2𝔼[𝐗𝐮1𝐘𝐯1𝐗𝐮2𝐘𝐯2S(𝐮1)S(𝐯1)S(𝐮2)S(𝐯2)]𝔼[ξ𝐮1𝐯1ξ𝐮2𝐯2]\displaystyle=\sum_{\begin{subarray}{c}\mathbf{u}_{1}\neq\mathbf{v}_{1}\\ \mathbf{u}_{2}\neq\mathbf{v}_{2}\end{subarray}}\mathbb{E}\left[\mathbf{X}_{\mathbf{u}_{1}}\mathbf{Y}_{\mathbf{v}_{1}}\mathbf{X}_{\mathbf{u}_{2}}\mathbf{Y}_{\mathbf{v}_{2}}S(\mathbf{u}_{1})S(\mathbf{v}_{1})S(\mathbf{u}_{2})S(\mathbf{v}_{2})\right]\,\cdot\,\mathbb{E}\left[\xi_{\mathbf{u}_{1}\mathbf{v}_{1}}\xi_{\mathbf{u}_{2}\mathbf{v}_{2}}\right]
1D𝐮1𝐯1𝐮2𝐯2𝔼[𝐗𝐮1𝐘𝐯1𝐗𝐮2𝐘𝐯2S(𝐮1)S(𝐯1)S(𝐮2)S(𝐯2)]\displaystyle\leq\frac{1}{D}\sum_{\begin{subarray}{c}\mathbf{u}_{1}\neq\mathbf{v}_{1}\\ \mathbf{u}_{2}\neq\mathbf{v}_{2}\end{subarray}}\mathbb{E}\left[\mathbf{X}_{\mathbf{u}_{1}}\mathbf{Y}_{\mathbf{v}_{1}}\mathbf{X}_{\mathbf{u}_{2}}\mathbf{Y}_{\mathbf{v}_{2}}S(\mathbf{u}_{1})S(\mathbf{v}_{1})S(\mathbf{u}_{2})S(\mathbf{v}_{2})\right]
1D𝐮1𝐯1𝐮2𝐯2𝔼[|𝐗𝐮1||𝐘𝐯1||𝐗𝐮2||𝐘𝐯2|S(𝐮1)S(𝐯1)S(𝐮2)S(𝐯2)]\displaystyle\leq\frac{1}{D}\sum_{\begin{subarray}{c}\mathbf{u}_{1}\neq\mathbf{v}_{1}\\ \mathbf{u}_{2}\neq\mathbf{v}_{2}\end{subarray}}\mathbb{E}\left[|\mathbf{X}_{\mathbf{u}_{1}}||\mathbf{Y}_{\mathbf{v}_{1}}||\mathbf{X}_{\mathbf{u}_{2}}||\mathbf{Y}_{\mathbf{v}_{2}}|S(\mathbf{u}_{1})S(\mathbf{v}_{1})S(\mathbf{u}_{2})S(\mathbf{v}_{2})\right]
=1D𝔼[(𝐮𝐯[d]p|𝐗𝐮||𝐘𝐯|S(𝐮)S(𝐯))2]3p1D𝐗22𝐘22=3p1D𝐱22p𝐲22p.\displaystyle=\frac{1}{D}\,\mathbb{E}\left[\left(\sum_{\mathbf{u}\neq\mathbf{v}\in[d]^{p}}|\mathbf{X}_{\mathbf{u}}|\,|\mathbf{Y}_{\mathbf{v}}|S(\mathbf{u})S(\mathbf{v})\right)^{2}\right]\leq\frac{3^{p}-1}{D}\|\mathbf{X}\|_{2}^{2}\|\mathbf{Y}\|_{2}^{2}=\frac{3^{p}-1}{D}\|\mathbf{x}\|_{2}^{2p}\|\mathbf{y}\|_{2}^{2p}.

The last line holds using the variance bound from Lemma 8 over |𝐗|,|𝐘||\mathbf{X}|,|\mathbf{Y}|, coordinate-wise absolute vectors of 𝐗,𝐘\mathbf{X},\mathbf{Y}. Note that when p=1p=1, this general bound matches the bound of Lemma 6.  
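As a simple (and not necessarily tight) consequence of Theorem 9, which we state here only for concreteness, Chebyshev's inequality turns the variance bound into a rule of thumb for choosing D: for any \epsilon>0,

\textrm{\bf Pr}\left[\left|\left\langle\mathbf{Cx}^{(p)},\mathbf{Cy}^{(p)}\right\rangle-\left\langle\mathbf{x},\mathbf{y}\right\rangle^{p}\right|\geq\epsilon\,\|\mathbf{x}\|_{2}^{p}\|\mathbf{y}\|_{2}^{p}\right]\leq\frac{\mathrm{Var}\left[\left\langle\mathbf{Cx}^{(p)},\mathbf{Cy}^{(p)}\right\rangle\right]}{\epsilon^{2}\,\|\mathbf{x}\|_{2}^{2p}\|\mathbf{y}\|_{2}^{2p}}\leq\frac{3^{p}-1}{D\epsilon^{2}}\,,

so D\geq(3^{p}-1)/(\epsilon^{2}\delta) random features suffice to guarantee an additive error of at most \epsilon\,\|\mathbf{x}\|_{2}^{p}\|\mathbf{y}\|_{2}^{p} with probability at least 1-\delta for any fixed pair of vectors.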

Empirically, it has been shown that normalizing a kernel may improve the performance of SVMs. One way to do so is to normalize the data such that \|\mathbf{x}\|_{2}=1, so that the exact kernel is properly normalized, i.e. \kappa(\mathbf{x},\mathbf{x})=\left\langle\mathbf{x},\mathbf{x}\right\rangle^{p}=1. Theorem 9 shows that Tensor Sketch can preserve the normalization of kernels given a sufficiently large number D of random features.

5 Recent applications of Tensor Sketches

As polynomial kernels can be used as feature representations in several computational models and applications, Tensor Sketch has emerged as a powerful dimensionality reduction tool for scaling up many computational tasks on high-dimensional datasets. We briefly describe some recent applications that leverage Tensor Sketch as the core algorithmic component for scalability and efficiency.

Dimensionality Reduction.

Tensor Sketch has become a key tool for compressing and processing high-dimensional data, particularly in applications requiring compact representations of polynomial or multilinear feature expansions. Studies such as Wang et al. (2015); Shi and Anandkumar (2020) show its use in reducing storage and computational costs in tensor-based data representations, while preserving task-relevant information. Diao et al. (2018) and Malik and Becker (2018) explore sketching methods in low-rank approximation and matrix/tensor completion settings, where Tensor Sketch provides fast randomized projections that retain structure in the data. Additionally, works like Cichocki et al. (2016) and Han et al. (2020) incorporate Tensor Sketch into larger frameworks for scalable tensor factorization and deep model compression. These methods highlight Tensor Sketch’s flexibility in balancing approximation quality with substantial reductions in dimensionality and runtime.

In high-dimensional approximation tasks, Tensor Sketch offers near-linear time algorithms for polynomial kernel approximation and randomized linear algebra operations. It enables subspace embeddings and low-distortion projections with provable guarantees, as demonstrated in Charikar and Siminelakis (2017); Ahle et al. (2020). Avron et al. (2014); Song et al. (2019); Martinsson and Tropp (2020) further investigate sketching’s role in randomized numerical linear algebra, where Tensor Sketch preserves inner products and norms in reduced dimensions. These advances underscore its power in scalable, approximate computation for large-scale scientific and machine learning applications.

Machine Learning and Neural Network Acceleration.

Tensor Sketch has also seen impactful applications in accelerating machine learning and neural network pipelines, particularly in scenarios demanding high-dimensional interactions such as fine-grained classification and multimodal fusion. By approximating polynomial feature maps efficiently, Tensor Sketch enables compact bilinear pooling and interaction modeling without incurring prohibitive computational costs (Fukui et al., 2016; Gao et al., 2016; Cui et al., 2018). This technique has been particularly effective in deep learning settings, where high-dimensional bilinear features are essential for tasks like few-shot learning and multimodal representation (Schwartz et al., 2017; Sun et al., 2020). Tensor Sketch is also widely used to reduce the parameter footprint in attention mechanisms and fusion layers, leading to more scalable architectures for both visual and language understanding (Dai et al., 2017; Li et al., 2020). In addition, recent work has extended its use to kernel-based learning over structured representations, such as those arising in human action recognition and molecule generation (Rahmani and Bennamoun, 2017; Tripp et al., 2024). With further improvements in scalability and robustness (Fettal et al., 2023; Kleyko et al., 2025), Tensor Sketch continues to serve as a foundational tool for compressing deep architectures, enabling neural models to scale more efficiently in resource-constrained environments.

Graph, Molecular and Biological Data.

In the realm of graph and network analysis, Tensor Sketch has emerged as an effective tool for compressing high-dimensional representations of graph-structured data. By enabling efficient approximation of polynomial kernels, it supports scalable algorithms for clustering and representation learning in large networks and ordered-neighborhood graphs (Draief et al., 2018; Fettal et al., 2023). These methods demonstrate the sketch’s utility in preserving structural and semantic information, making it highly suitable for tasks such as node classification, community detection, and network summarization under tight memory and computational constraints.

Tensor Sketch has proven valuable in bioinformatics and molecular machine learning, where high-dimensional and structured data representations are prevalent. Its ability to approximate polynomial kernels efficiently enables fast similarity estimation and scalable computation over molecular fingerprints and genomic data (Joudaki et al., 2020). In molecular property prediction, Tripp et al. (2024) leverage Tensor Sketch to project molecular descriptors into compressed feature spaces that preserve interaction patterns critical for prediction tasks. These approaches illustrate how sketching techniques can maintain predictive accuracy while enabling tractable computation on large molecular databases, facilitating scalable discovery pipelines in drug design and genomics.

6 Conclusion

In this paper, we introduced a fast and scalable sketching technique for approximating polynomial kernels, enabling efficient training of kernel-based learning algorithms. By leveraging the connection between tensor products and the fast convolution structure of Count Sketches, our method computes random feature maps in time $\mathcal{O}(n(d + D\log D))$ for $n$ training samples in $\mathbb{R}^d$ and $D$ random features. We also provided a theoretical analysis bounding the variance of the inner product between two Tensor Sketches, thereby establishing formal guarantees on the accuracy and reliability of the approximation.
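To make the construction concrete, the sketch below gives a minimal NumPy illustration of the degree-$p$ feature map: $p$ independent Count Sketches per sample, combined via the FFT convolution theorem. Parameter values, hash-function choices, and the toy usage example are for exposition only. Per sample, the Count Sketches cost $\mathcal{O}(pd)$ and the FFTs cost $\mathcal{O}(pD\log D)$, which for constant $p$ matches the stated $\mathcal{O}(n(d + D\log D))$ total.

import numpy as np

def tensor_sketch(X, D, p, seed=0):
    """Map each row of X (shape n x d) to a D-dimensional feature vector whose
    pairwise inner products approximate the polynomial kernel (x . y)^p."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    h = rng.integers(0, D, size=(p, d))          # p independent hash functions [d] -> [D]
    s = rng.choice([-1.0, 1.0], size=(p, d))     # p independent sign functions [d] -> {-1, +1}
    prod = np.ones((n, D // 2 + 1), dtype=complex)
    for i in range(p):
        C = np.zeros((n, D))
        for j in range(n):                       # Count Sketch of each sample: O(d) per sample
            np.add.at(C[j], h[i], s[i] * X[j])
        prod *= np.fft.rfft(C, axis=1)           # convolution theorem: multiply in the frequency domain
    return np.fft.irfft(prod, n=D, axis=1)       # back to a length-D real sketch per sample

# Usage: degree-3 polynomial kernel between two correlated samples.
rng = np.random.default_rng(1)
x = rng.standard_normal(500)
y = 0.9 * x + 0.1 * rng.standard_normal(500)
Z = tensor_sketch(np.vstack([x, y]), D=8192, p=3)
print("exact (x.y)^3  :", x.dot(y) ** 3)
print("sketch estimate:", Z[0].dot(Z[1]))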


Acknowledgments

We thank Graham Cormode for pointing out the error in the proof of Lemma 6 in the conference version.

References

  • Ahle et al. (2020) Thomas D Ahle, Michael Kapralov, Jakob BT Knudsen, Rasmus Pagh, Ameya Velingker, David P Woodruff, and Amir Zandieh. Oblivious sketching of high-degree polynomial kernels. In Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 141–160, 2020.
  • Ailon and Chazelle (2009) Nir Ailon and Bernard Chazelle. The fast Johnson–Lindenstrauss transform and approximate nearest neighbors. SIAM Journal on Computing, 39(1):302–322, 2009.
  • Allen-Zhu (2017) Zeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. Journal of Machine Learning Research (JMLR), 18:221:1–221:51, 2017.
  • Alon et al. (1999) Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 58(1):137–147, 1999.
  • Avron et al. (2014) Haim Avron, Huy L. Nguyen, and David P. Woodruff. Subspace embeddings for the polynomial kernel. In Advances in Neural Information Processing Systems (NIPS), pages 2258–2266, 2014.
  • Avron et al. (2017a) Haim Avron, Kenneth L. Clarkson, and David P. Woodruff. Faster kernel ridge regression using sketching and preconditioning. SIAM J. Matrix Analysis Applications, 38(4):1116–1138, 2017a.
  • Avron et al. (2017b) Haim Avron, Michael Kapralov, Cameron Musco, Christopher Musco, Ameya Velingker, and Amir Zandieh. Random Fourier features for kernel ridge regression: Approximation bounds and statistical guarantees. In International Conference on Machine Learning (ICML), pages 253–262, 2017b.
  • Bach (2017) Francis R. Bach. On the equivalence between kernel quadrature rules and random feature expansions. Journal of Machine Learning Research (JMLR), 18:21:1–21:38, 2017.
  • Bochner (2005) Salomon Bochner. Harmonic Analysis and the Theory of Probability. Courier Corporation, 2005.
  • Braverman et al. (2010) Vladimir Braverman, Kai-Min Chung, Zhenming Liu, Michael Mitzenmacher, and Rafail Ostrovsky. AMS without 4-wise independence on product domains. In International Symposium on Theoretical Aspects of Computer Science (STACS), pages 119–130, 2010.
  • Bubeck (2015) Sébastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015.
  • Carter and Wegman (1979) Larry Carter and Mark N. Wegman. Universal classes of hash functions. Journal of Computer and System Sciences, 18(2):143–154, 1979.
  • Chang and Lin (2011) Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011.
  • Charikar and Siminelakis (2017) Moses Charikar and Panos Siminelakis. Hashing-based-estimators for kernel density in high dimensions. In Annual Symposium on Foundations of Computer Science (FOCS), pages 1036–1049, 2017.
  • Charikar et al. (2002) Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items in data streams. In International Colloquium on Automata, Languages, and Programming (ICALP), pages 693–703, 2002.
  • Choromanski et al. (2021) Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamás Sarlós, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, David Benjamin Belanger, Lucy J. Colwell, and Adrian Weller. Rethinking attention with performers. In International Conference on Learning Representations (ICLR), 2021.
  • Cichocki et al. (2016) Andrzej Cichocki, Namgil Lee, Ivan Oseledets, Anh-Huy Phan, Qibin Zhao, and Danilo P Mandic. Tensor networks for dimensionality reduction and large-scale optimization: Part 1 low-rank tensor decompositions. Foundations and Trends in Machine Learning, 9(4-5):249–429, 2016.
  • Cui et al. (2017) Yin Cui, Feng Zhou, Jiang Wang, Xiao Liu, Yuanqing Lin, and Serge Belongie. Kernel pooling for convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3049–3058, 2017.
  • Cui et al. (2018) Yin Cui, Guandao Yang, Andreas Veit, Xun Huang, and Serge Belongie. Learning to evaluate image captioning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5804–5812, 2018.
  • Dai et al. (2017) Xiyang Dai, Joe Yue-Hei Ng, and Larry S. Davis. FASON: first and second order information fusion network for texture recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6100–6108, 2017.
  • d’Aspremont et al. (2021) Alexandre d’Aspremont, Damien Scieur, and Adrien B. Taylor. Acceleration methods. Foundations and Trends in Optimization, 5(1-2):1–245, 2021.
  • Diao et al. (2018) Huaian Diao, Zhao Song, Wen Sun, and David P. Woodruff. Sketching for Kronecker product regression and p-splines. In International Conference on Artificial Intelligence and Statistics (AISTATS), volume 84, pages 1299–1308, 2018.
  • Draief et al. (2018) Moez Draief, Konstantin Kutzkov, Kevin Scaman, and Milan Vojnovic. KONG: kernels for ordered-neighborhood graphs. In Advances in Neural Information Processing Systems (NeurIPS), pages 4055–4064, 2018.
  • Fan et al. (2008) Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research (JMLR), 9:1871–1874, 2008.
  • Fettal et al. (2023) Chafik Fettal, Linda Labiod, and Mohamed Nadif. Scalable attributed-graph subspace clustering. In AAAI Conference on Artificial Intelligence (AAAI), volume 37, pages 5256–5264, 2023.
  • Fu et al. (2023) Hengyu Fu, Tianyu Guo, Yu Bai, and Song Mei. What can a single attention layer learn? a study through the random features lens. In Advances in Neural Information Processing Systems (NeurIPS), pages 11912–11951, 2023.
  • Fukui et al. (2016) Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. In Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 457–468, 2016.
  • Gao et al. (2016) Yang Gao, Oscar Beijbom, Ning Zhang, and Trevor Darrell. Compact bilinear pooling. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 317–326, 2016.
  • Gittens and Mahoney (2016) Alex Gittens and Michael W. Mahoney. Revisiting the Nyström method for improved large-scale machine learning. Journal of Machine Learning Research (JMLR), 17:117:1–117:65, 2016.
  • Hamid et al. (2014) Raffay Hamid, Ying Xiao, Alex Gittens, and Dennis DeCoste. Compact random feature maps. In International Conference on Machine Learning (ICML), volume 32, pages 19–27, 2014.
  • Han et al. (2020) Insu Han, Haim Avron, and Jinwoo Shin. Polynomial tensor sketch for element-wise function of low-rank matrix. In International Conference on Machine Learning (ICML), pages 3984–3993. PMLR, 2020.
  • Indyk and McGregor (2008) Piotr Indyk and Andrew McGregor. Declaring independence via the sketching of sketches. In Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 737–745, 2008.
  • Joachims (1998) Thorsten Joachims. Text categorization with support vector machines: Learning with many relevant features. In European Conference on Machine Learning (ECML), volume 1398, pages 137–142, 1998.
  • Joachims (2006) Thorsten Joachims. Training linear SVMs in linear time. In International Conference on Knowledge Discovery and Data Mining (KDD), pages 217–226, 2006.
  • Joudaki et al. (2020) Alireza Joudaki, Gunnar Rätsch, and Andre Kahles. Fast alignment-free similarity estimation by tensor sketching. bioRxiv, 2020.
  • Kar and Karnick (2012) Purushottam Kar and Harish Karnick. Random feature maps for dot product kernels. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 583–591, 2012.
  • Kleinberg and Tardos (2005) Jon Kleinberg and Eva Tardos. Algorithm Design. Addison-Wesley Longman Publishing Co., Inc., 2005.
  • Kleyko et al. (2025) Denis Kleyko, Christopher J Kymn, Anthony Thomas, Bruno A Olshausen, Friedrich T Sommer, and E Paxon Frady. Principled neuromorphic reservoir computing. Nature Communications, 16(1):640, 2025.
  • Kumar et al. (2012) Sanjiv Kumar, Mehryar Mohri, and Ameet Talwalkar. Sampling methods for the Nyström method. Journal of Machine Learning Research (JMLR), 13:981–1006, 2012.
  • Lewis et al. (2004) David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research (JMLR), 5:361–397, 2004.
  • Li et al. (2020) Jiwei Li, Xiang Xie, Qing Pan, Yong Cao, Zhaoyang Zhao, and Guoqiang Shi. Sgm-net: Skeleton-guided multimodal network for action recognition. Pattern Recognition, 107:107490, 2020.
  • Li et al. (2019) Zhu Li, Jean-Francois Ton, Dino Oglic, and Dino Sejdinovic. Towards a unified analysis of random Fourier features. In International Conference on Machine Learning (ICML), pages 3905–3914, 2019.
  • Liu et al. (2020) Fanghui Liu, Xiaolin Huang, Yudong Chen, Jie Yang, and Johan A. K. Suykens. Random Fourier features via fast surrogate leverage weighted sampling. In AAAI Conference on Artificial Intelligence (AAAI), pages 4844–4851, 2020.
  • Liu et al. (2021) Fanghui Liu, Xiaolin Huang, Yingyi Chen, and Johan A. K. Suykens. Fast learning in reproducing kernel Krein spaces via signed measures. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 388–396, 2021.
  • Liu et al. (2022) Fanghui Liu, Xiaolin Huang, Yudong Chen, and Johan A. K. Suykens. Random features for kernel approximation: A survey on algorithms, theory, and beyond. Transactions on Pattern Analysis and Machine Intelligence (PAMI), 44(10):7128–7148, 2022.
  • Lopez-Paz et al. (2014) David Lopez-Paz, Suvrit Sra, Alexander J. Smola, Zoubin Ghahramani, and Bernhard Schölkopf. Randomized nonlinear component analysis. In International Conference on Machine Learning (ICML), pages 1359–1367, 2014.
  • Lu et al. (2016) Jing Lu, Steven C. H. Hoi, Jialei Wang, Peilin Zhao, and Zhiyong Liu. Large scale online kernel learning. Journal of Machine Learning Research (JMLR), 17:47:1–47:43, 2016.
  • Malik and Becker (2018) Osman Asif Malik and Stephen Becker. Low-rank Tucker decomposition of large tensors using TensorSketch. In Advances in Neural Information Processing Systems (NeurIPS), volume 31, 2018.
  • Martinsson and Tropp (2020) Per-Gunnar Martinsson and Joel A Tropp. Randomized numerical linear algebra: Foundations and algorithms. Acta Numerica, 29:403–572, 2020.
  • Meister et al. (2019) Michela Meister, Tamás Sarlós, and David P. Woodruff. Tight dimensionality reduction for sketching low degree polynomial kernels. In Advances in Neural Information Processing Systems (NeurIPS), pages 9470–9481, 2019.
  • Muandet et al. (2017) Krikamol Muandet, Kenji Fukumizu, Bharath K Sriperumbudur, and Bernhard Schölkopf. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning, 10(1-2):1–141, 2017.
  • Osuna et al. (1997) Edgar Osuna, Robert Freund, and Federico Girosi. An improved training algorithm for support vector machines. In Neural networks for signal processing VII. Proceedings of the 1997 IEEE signal processing society workshop, pages 276–285, 1997.
  • Pagh (2013) Rasmus Pagh. Compressed matrix multiplication. ACM Transactions on Computation Theory (TOCT), 5(3):9:1–9:17, 2013.
  • Pedregosa et al. (2011) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research (JMLR), 12:2825–2830, 2011.
  • Pennington et al. (2015) Jeffrey Pennington, Felix X. Yu, and Sanjiv Kumar. Spherical random features for polynomial kernels. In Advances in Neural Information Processing Systems (NIPS), pages 1846–1854, 2015.
  • Pham and Pagh (2013) Ninh Pham and Rasmus Pagh. Fast and scalable polynomial kernels via explicit feature maps. In International Conference on Knowledge Discovery and Data Mining (KDD), pages 239–247. ACM, 2013.
  • Pǎtraşcu and Thorup (2012) Mihai Pǎtraşcu and Mikkel Thorup. The power of simple tabulation hashing. Journal of the ACM, 59(3):1–50, 2012.
  • Rahimi and Recht (2007) Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems (NIPS), pages 1177–1184, 2007.
  • Rahmani and Bennamoun (2017) Hossein Rahmani and Mohammed Bennamoun. Learning action recognition model from depth and skeleton videos. In IEEE International Conference on Computer Vision (ICCV), pages 798–807, 2017.
  • Rudi and Rosasco (2017) Alessandro Rudi and Lorenzo Rosasco. Generalization properties of learning with random features. In Advances in Neural Information Processing Systems (NIPS), pages 3218–3228, 2017.
  • Scetbon and Harchaoui (2021) Meyer Scetbon and Zaïd Harchaoui. A spectral analysis of dot-product kernels. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 3394–3402, 2021.
  • Schoenberg (1942) I. J. Schoenberg. Positive definite functions on spheres. Duke Mathematical Journal, 9(1):96–108, 1942. doi: 10.1215/S0012-7094-42-00908-6.
  • Schölkopf and Smola (2002) Bernhard Schölkopf and Alexander Johannes Smola. Learning with Kernels: support vector machines, regularization, optimization, and beyond. MIT Press, 2002.
  • Schwartz et al. (2017) Israel Schwartz, Alexander Schwing, and Tamir Hazan. High-order attention models for visual question answering. In Advances in Neural Information Processing Systems (NIPS), pages 4906–4915, 2017.
  • Shalev-Shwartz et al. (2011) Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3–30, 2011.
  • Shawe-Taylor and Cristianini (2004) John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
  • Shi and Anandkumar (2020) Yang Shi and Animashree Anandkumar. Higher-order count sketch: Dimensionality reduction that retains efficient tensor operations. In Data Compression Conference (DCC), pages 394–394, 2020.
  • Song et al. (2019) Zhao Song, David P Woodruff, and Peilin Zhong. Relative error tensor low rank approximation. In Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1152–1169, 2019.
  • Sriperumbudur and Sterge (2022) Bharath K. Sriperumbudur and Nicholas Sterge. Approximate kernel PCA: Computational versus statistical trade-off. The Annals of Statistics, 50(5):2713–2736, 2022.
  • Sriperumbudur and Szabó (2015) Bharath K. Sriperumbudur and Zoltán Szabó. Optimal rates for random Fourier features. In Advances in Neural Information Processing Systems (NIPS), pages 1144–1152, 2015.
  • Sun et al. (2020) Xiang Sun, Haoming Xv, Jian Dong, Hao Zhou, and Sheng Tan. Few-shot learning for domain-specific fine-grained image classification. IEEE Transactions on Image Processing, 29:10272–10285, 2020.
  • Sun et al. (2018) Yitong Sun, Anna C. Gilbert, and Ambuj Tewari. But how does it work in theory? linear SVM with random features. In Advances in Neural Information Processing Systems (NeurIPS), pages 3383–3392, 2018.
  • Thorup and Zhang (2012) Mikkel Thorup and Yin Zhang. Tabulation-based 5-independent hashing with applications to linear probing and second moment estimation. SIAM Journal on Computing, 41(2):293–331, 2012.
  • Tripp et al. (2024) Alexander Tripp, Sergio Bacallado, Suhas Singh, and Fabio Anselmi. Tanimoto random features for scalable molecular machine learning. In Advances in Neural Information Processing Systems (NeurIPS), 2024.
  • Tropp (2011) Joel A. Tropp. Improved analysis of the subsampled randomized Hadamard transform. Advances in Adaptive Data Analysis, 3(1-2):115–126, 2011.
  • Wang et al. (2015) Yining Wang, Hsiao-Yu Fish Tung, Alexander J. Smola, and Anima Anandkumar. Fast and guaranteed tensor decomposition via sketching. In Advances in Neural Information Processing Systems (NIPS), pages 991–999, 2015.
  • Weinberger et al. (2009) Kilian Q. Weinberger, Anirban Dasgupta, John Langford, Alexander J. Smola, and Josh Attenberg. Feature hashing for large scale multitask learning. In International Conference on Machine Learning (ICML), pages 1113–1120, 2009.
  • Williams and Seeger (2000) Christopher K. I. Williams and Matthias W. Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems (NIPS), pages 682–688, 2000.
  • Woodruff (2014) David P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10(1-2):1–157, 2014.
  • Wu et al. (2018) Lingfei Wu, Pin-Yu Chen, Ian En-Hsu Yen, Fangli Xu, Yinglong Xia, and Charu C. Aggarwal. Scalable spectral clustering using random binning features. In International Conference on Knowledge Discovery and Data Mining (KDD), pages 2506–2515, 2018.
  • Xu and Pham (2024) Haochuan Xu and Ninh Pham. Scalable DBSCAN with random projections. In Advances in Neural Information Processing Systems (NeurIPS), 2024.
  • Yang et al. (2012) Tianbao Yang, Yu-Feng Li, Mehrdad Mahdavi, Rong Jin, and Zhi-Hua Zhou. Nyström method vs random Fourier features: A theoretical and empirical comparison. In Advances in Neural Information Processing Systems (NIPS), volume 25, 2012.
  • Yehudai and Shamir (2019) Gilad Yehudai and Ohad Shamir. On the power and limitations of random features for understanding neural networks. In Advances in Neural Information Processing Systems (NeurIPS), pages 6594–6604, 2019.
  • Zandieh et al. (2021) Amir Zandieh, Insu Han, Haim Avron, Neta Shoham, Chaewon Kim, and Jinwoo Shin. Scaling neural tangent kernels via sketching and random features. In Advances in Neural Information Processing Systems (NeurIPS), pages 1062–1073, 2021.