
Node-Variant Graph Filters in Graph Neural Networks

Fernando Gama, Brendon G. Anderson, Somayeh Sojoudi
Abstract

Graph neural networks (GNNs) have been successfully employed in a myriad of applications involving graph signals. Theoretical findings establish that GNNs use nonlinear activation functions to create low-eigenvalue frequency content that can be processed in a stable manner by subsequent graph convolutional filters. However, the exact shape of the frequency content created by nonlinear functions is not known and cannot be learned. In this work, we use node-variant graph filters (NVGFs), which are linear filters capable of creating frequencies, as a means of investigating the role that frequency creation plays in GNNs. We show that, by replacing nonlinear activation functions with NVGFs, frequency creation mechanisms can be designed or learned. By doing so, the role of frequency creation is separated from the nonlinear nature of traditional GNNs. Simulations on graph signal processing problems are carried out to pinpoint the role of frequency creation.

This work is partially supported by NSF and ONR. F. Gama is with the Dept. of Electrical and Computer Engineering, Rice University, Houston, TX. Email: fgama@rice.edu. B. G. Anderson and S. Sojoudi are with the Dept. of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA.

1 Introduction

Graph neural networks (GNNs) [1, 2] are learning architectures that have been successfully applied to a wide array of graph signal processing (GSP) problems ranging from recommendation systems [3, 4] and authorship attribution [5, 6], to physical network problems including wireless communications [7], control [8], and sensor networks [9]. In the context of GSP, frequency analysis has been successful at providing theoretical insight into the observed success of GNNs [10, 11, 12] and graph scattering transforms [13, 14].

Of particular interest is the seminal work by Mallat [15, 16], concerning discrete-time signals and images, which argues that the improved performance of convolutional neural networks (CNNs) over linear convolutional filters is due to the activation functions. Concretely, nonlinear activation functions allow high-frequency content to be spread into lower frequencies, where it can be processed in a stable manner, a feat that cannot be achieved by convolutional filters alone, which are unstable when processing high frequencies. Leveraging the notion of graph Fourier transform (GFT) [17], these results have been extended to GNNs [10], establishing that the use of functions capable of creating low-eigenvalue frequency content allows them to be robust to changes in the graph topology, facilitating scalability and transferability [18].

While nonlinear activation functions play a key role in the creation of low-eigenvalue frequency content, it is not possible to know the exact shape in which this frequency content is actually generated. Node-variant graph filters (NVGFs) [19], which essentially assign a different filter tap to each node in the graph, are also able to generate frequency content. Unlike with nonlinear activation functions, this frequency content can be computed exactly, given the filter taps. Thus, by learning or carefully designing these filter taps, it is possible to know exactly how the frequency content is being generated.

The NVGF is a linear filter, which means that replacing the nonlinear activation functions with NVGFs renders the whole architecture linear. Therefore, by comparing the performance of this architecture to that of a traditional GNN, it is possible to isolate the role of frequency creation in the overall performance from that of the nonlinearity of the mapping. The contributions of this paper can be summarized as follows:

  • (C1)

    We introduce NVGFs as a means of replacing nonlinear activation functions, motivated by their ability to create frequencies. We obtain closed-form expressions for the frequency response of NVGFs as a function of the filter taps.

  • (C2)

    We prove that NVGFs are Lipschitz continuous with respect to changes in the underlying graph topology.

  • (C3)

    We put forth a framework for designing NVGFs.

  • (C4)

    We propose a GNN architecture where the nonlinear activation function is replaced by a NVGF. The filter taps of the NVGF can be either learned from data (Learn NVGF) or obtained by design (Design NVGF). The aim of the resulting architecture is to decouple the role of frequency creation from the nonlinear nature of the GNN.

  • (C5)

    We investigate the problem of authorship attribution to demonstrate, both quantitatively and qualitatively, the role of frequency creation in the performance of a GNN, and its relationship to the nonlinear nature of traditional architectures.

In essence, we show that nonlinear activation functions are not strictly required for creating frequencies, as originally thought in [15, 10], but that linear NVGF activation functions are sufficient. Furthermore, we demonstrate that this frequency content can be learned with respect to the specific problem under study. All proofs, as well as further simulations, can be found in the appendices. Code can be found online at http://github.com/fgfgama/nvgf.


Related work. GNNs constitute a very active area of research [20, 21]. From a GSP perspective, spectral filtering is used in [22]; it is then replaced by computationally simpler Chebyshev polynomials [23], and subsequently by general graph convolutional filters [6]. GNNs were also adopted in non-GSP problems [24, 25, 26]. The proposed replacement of nonlinear activation functions by NVGFs creates a linear architecture that uses both convolutional and non-convolutional graph filters.

NVGFs are first introduced in [19] to extend time-variant filters into the realm of graph signals. In that work, NVGFs are used to optimally approximate linear operators in a distributed manner. In this paper, we focus on the frequency response of NVGFs and on their capacity to create frequency content.

Leveraging the notion of GFT, [10] shows that a GNN is Lipschitz continuous with respect to small changes in the underlying graph structure. Likewise, frequency analysis has been used to understand graph scattering transforms, where the filters used in the GNN are chosen to be wavelets (and not learned) [13, 14]. In this paper, we focus on the role of frequency creation that is put forth in [15, 10], and study NVGFs as linear mechanisms for achieving this.

2 The Node-Variant Graph Filter

Let $\mathcal{G}=(\mathcal{V},\mathcal{E})$ be an undirected, possibly weighted, graph with a set of $N$ nodes $\mathcal{V}=\{v_{1},\ldots,v_{N}\}$ and a set of edges $\mathcal{E}\subseteq\mathcal{V}\times\mathcal{V}$. Define a graph signal as the function $\mathsf{x}:\mathcal{V}\to\mathbb{R}$ that associates a scalar value to each node. For a fixed ordering of the nodes, the graph signal can be conveniently described by means of a vector $\mathbf{x}\in\mathbb{R}^{N}$ such that the $i^{\text{th}}$ entry $[\mathbf{x}]_{i}$ is the value $x_{i}$ of the signal on node $v_{i}$, i.e., $[\mathbf{x}]_{i}=x_{i}=\mathsf{x}(v_{i})$.

Describing a graph signal as the vector $\mathbf{x}\in\mathbb{R}^{N}$ is mathematically convenient but carries no information about the underlying graph topology that supports it. This information can be recovered by defining a graph matrix description (GMD) $\mathbf{S}\in\mathbb{R}^{N\times N}$, which is a matrix that respects the sparsity pattern of the graph, i.e., $[\mathbf{S}]_{ij}=0$ for all distinct indices $i$ and $j$ such that $(v_{j},v_{i})\notin\mathcal{E}$. Examples of widely used GMDs include the adjacency matrix, the Laplacian matrix, and their corresponding normalizations [27, 28, 29].

The GMD $\mathbf{S}$ can thus be leveraged to process the graph signal $\mathbf{x}$ in such a way that the underlying graph structure is exploited. The most elementary example is the linear map $\mathsf{S}:\mathbb{R}^{N}\to\mathbb{R}^{N}$ between graph signals given by $\mathbf{y}=\mathsf{S}(\mathbf{x})=\mathbf{S}\mathbf{x}$. This linear map is a linear combination of the information located in the one-hop neighborhood of each node:

$y_{i}=[\mathbf{y}]_{i}=\sum_{j=1}^{N}[\mathbf{S}]_{ij}[\mathbf{x}]_{j}=\sum_{j:v_{j}\in\mathcal{N}_{i}\cup\{v_{i}\}}s_{ij}x_{j}$ (1)

where $\mathcal{N}_{i}=\{v_{j}:(v_{j},v_{i})\in\mathcal{E}\}$ is the set of nodes that share an edge with node $v_{i}$. The last equality in the above equation is due to the sparsity pattern of the GMD $\mathbf{S}$.

More generally, a graph filter $\mathsf{H}(\cdot;\mathbf{S}):\mathbb{R}^{N}\to\mathbb{R}^{N}$ is a mapping between graph signals that leverages the structure encoded in $\mathbf{S}$ [19]. In particular, linear shift-invariant graph filters (LSIGFs) are those that can be built as a polynomial in $\mathbf{S}$:

$\mathsf{H}^{\text{lsi}}(\mathbf{x};\mathbf{S})=\sum_{k=0}^{K}h_{k}\mathbf{S}^{k}\mathbf{x}=\boldsymbol{H}^{\text{lsi}}(\mathbf{S})\mathbf{x}$ (2)

where $\boldsymbol{H}^{\text{lsi}}(\mathbf{S})=\sum_{k=0}^{K}h_{k}\mathbf{S}^{k}$ with $h_{k}\in\mathbb{R}$. Note that $\mathsf{H}^{\text{lsi}}(\mathbf{x};\mathbf{S})$ is written in the form $\boldsymbol{H}^{\text{lsi}}(\mathbf{S})\mathbf{x}$ to emphasize that the function is linear in the input $\mathbf{x}$, i.e., $\mathbf{x}$ is multiplied by a matrix $\boldsymbol{H}^{\text{lsi}}(\mathbf{S})$ that is parametrized by $\mathbf{S}$. LSIGFs inherit their name from the fact that they satisfy $\boldsymbol{H}^{\text{lsi}}(\mathbf{S})\mathbf{S}\mathbf{x}=\mathbf{S}\boldsymbol{H}^{\text{lsi}}(\mathbf{S})\mathbf{x}$. The polynomial coefficients $\{h_{k}\}_{k=0}^{K}$ are called filter taps, and can be collected in a vector $\mathbf{h}\in\mathbb{R}^{K+1}$ defined as $[\mathbf{h}]_{k+1}=h_{k}$ for all $k\in\{0,\ldots,K\}$. Note that the term $\mathbf{S}^{k}\mathbf{x}$ is a convenient mathematical formulation, but in practice $\mathbf{S}^{k}\mathbf{x}$ is computed by exchanging information $k$ times with one-hop neighbors, i.e., there are no matrix powers involved. In general, GSP regards $\mathbf{S}$ as given by the structure of the problem, and regards $\mathbf{x}$ as the actionable data [27, 28, 29].
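
To make the repeated-exchange computation of (2) concrete, the following minimal NumPy sketch applies an LSIGF without forming any matrix power. This is an illustrative sketch rather than the code released online; the function name `lsigf` and its arguments are placeholders.

```python
import numpy as np

def lsigf(S, x, h):
    """Apply the LSIGF of (2) by repeated one-hop exchanges.

    S : (N, N) graph matrix description (GMD)
    x : (N,)   input graph signal
    h : (K+1,) filter taps h_0, ..., h_K
    """
    y = h[0] * x          # k = 0 term: no exchange needed
    z = x.copy()          # running term S^k x
    for k in range(1, len(h)):
        z = S @ z         # one more exchange with one-hop neighbors
        y = y + h[k] * z
    return y
```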

This paper focuses on NVGFs [19], which are linear filters that assign a different filter tap to each node, for each application of $\mathbf{S}$. This can be compactly written as follows:

$\mathsf{H}^{\text{nv}}(\mathbf{x};\mathbf{S})=\sum_{k=0}^{K}\mathsf{diag}(\mathbf{h}^{(k)})\mathbf{S}^{k}\mathbf{x}=\boldsymbol{H}^{\text{nv}}(\mathbf{S})\mathbf{x}$ (3)

where $\mathbf{h}^{(k)}\in\mathbb{R}^{N}$ and $\mathsf{diag}(\cdot)$ is the diagonal operator that takes a vector and creates a diagonal matrix with the elements of the vector on the diagonal. Since the NVGF is linear in the input, it holds that $\mathsf{H}^{\text{nv}}(\mathbf{x};\mathbf{S})=\boldsymbol{H}^{\text{nv}}(\mathbf{S})\mathbf{x}$, where $\boldsymbol{H}^{\text{nv}}(\mathbf{S})=\sum_{k=0}^{K}\mathsf{diag}(\mathbf{h}^{(k)})\mathbf{S}^{k}$. The $i^{\text{th}}$ entry of the vector, $[\mathbf{h}^{(k)}]_{i}=h_{ik}$, is the filter tap that node $v_{i}$ uses to weigh the information incoming after $k$ exchanges with its neighbors. The set of all filter taps can be conveniently collected in a matrix $\mathbf{H}\in\mathbb{R}^{N\times(K+1)}$ where the $(k+1)^{\text{th}}$ column is $\mathbf{h}^{(k)}\in\mathbb{R}^{N}$ and the $i^{\text{th}}$ row, denoted by $\mathbf{h}_{i}\in\mathbb{R}^{K+1}$, contains the $K+1$ filter taps used by node $v_{i}$, i.e., $[\mathbf{h}_{i}]_{k+1}=h_{ik}$.

The LSIGF in (2) and the NVGF in (3) are both linear and local processing operators. They depend linearly on the input graph signal $\mathbf{x}$, as indicated by the matrix multiplication notation in (2) and (3). They are local in the sense that, to compute the output, each node requires information relayed directly by its immediate neighbors. The LSIGF is characterized by a collection of $K+1$ filter taps. The NVGF, on the other hand, is characterized by $N(K+1)$ filter taps. It is noted that while the NVGF requires additional memory to store more parameters, this storage can be distributed throughout the graph. It is also observed that both the LSIGF and the NVGF have the same computational complexity.
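
A corresponding sketch for the NVGF in (3) differs from the LSIGF sketch above only in that the shared scalar tap $h_{k}$ becomes a per-node vector of taps. Again, this is an illustrative sketch with hypothetical names, not the released implementation.

```python
import numpy as np

def nvgf(S, x, H):
    """Apply the NVGF of (3): node i weighs S^k x with its own tap H[i, k].

    S : (N, N) GMD
    x : (N,)   input graph signal
    H : (N, K+1) filter-tap matrix; column k is h^{(k)}
    """
    y = H[:, 0] * x           # diag(h^{(0)}) x
    z = x.copy()
    for k in range(1, H.shape[1]):
        z = S @ z             # k-th one-hop exchange, shared by all nodes
        y = y + H[:, k] * z   # each node applies its own tap
    return y
```

Note that the computation per exchange is the same as for the LSIGF; only the final per-node weighting changes, which is why both filters have the same computational complexity.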

3 Frequency Analysis

The GMD $\mathbf{S}\in\mathbb{R}^{N\times N}$ can be used to define a spectral representation of the graph signal $\mathbf{x}\in\mathbb{R}^{N}$ [17]. Since the graph is undirected, assume that $\mathbf{S}$ is symmetric so that it can be diagonalized by an orthonormal basis of eigenvectors $\{\mathbf{v}_{i}\}_{i=1}^{N}$, where $\mathbf{v}_{i}\in\mathbb{R}^{N}$ and $\mathbf{S}\mathbf{v}_{i}=\lambda_{i}\mathbf{v}_{i}$, with $\lambda_{i}\in\mathbb{R}$ being the corresponding eigenvalue. Then, it holds that $\mathbf{S}=\mathbf{V}\mathsf{diag}(\boldsymbol{\lambda})\mathbf{V}^{\mathsf{T}}$, where the $i^{\text{th}}$ column of $\mathbf{V}$ is $\mathbf{v}_{i}$ and where $\boldsymbol{\lambda}\in\mathbb{R}^{N}$ is given by $[\boldsymbol{\lambda}]_{i}=\lambda_{i}$ for $i=1,\ldots,N$. We assume throughout this paper that the eigenvalues are distinct, which is typically the case for random connected graphs.

The spectral representation of a graph signal $\mathbf{x}$ with respect to its underlying graph support described by $\mathbf{S}$ is given by

$\mathbf{\tilde{x}}=\mathbf{V}^{\mathsf{T}}\mathbf{x}$ (4)

where $\mathbf{\tilde{x}}\in\mathbb{R}^{N}$; see [17]. The spectral representation $\mathbf{\tilde{x}}$ of the graph signal $\mathbf{x}$ contains the coordinates of $\mathbf{x}$ in the eigenbasis $\{\mathbf{v}_{i}\}_{i=1}^{N}$ of the support matrix $\mathbf{S}$. The resulting vector $\mathbf{\tilde{x}}$ is known as the frequency response of the signal $\mathbf{x}$. The $i^{\text{th}}$ entry $[\mathbf{\tilde{x}}]_{i}=\mathbf{v}_{i}^{\mathsf{T}}\mathbf{x}=\tilde{x}_{i}\in\mathbb{R}$ of the frequency response $\mathbf{\tilde{x}}$ measures how much the $i^{\text{th}}$ eigenvector $\mathbf{v}_{i}$ contributes to the signal $\mathbf{x}$. The operation in (4) is called the GFT, and thus the frequency response $\mathbf{\tilde{x}}$ is often referred to as the GFT of the signal $\mathbf{x}$.

The GFT offers an alternative representation of the graph signal $\mathbf{x}$ that takes into account the graph structure in $\mathbf{S}$. The effect of a filter can be characterized in the spectral domain by computing the GFT of the output. For instance, when considering a LSIGF, the spectral representation of the output $\mathbf{y}=\boldsymbol{H}^{\text{lsi}}(\mathbf{S})\mathbf{x}$ is

$\mathbf{\tilde{y}}=\mathbf{V}^{\mathsf{T}}\sum_{k=0}^{K}h_{k}\mathbf{S}^{k}\mathbf{x}=\sum_{k=0}^{K}h_{k}\mathsf{diag}(\boldsymbol{\lambda}^{k})\mathbf{\tilde{x}}=\mathsf{diag}(\mathbf{\tilde{h}})\mathbf{\tilde{x}}$ (5)

where the eigendecomposition of $\mathbf{S}$, the GFT of $\mathbf{x}$, and the fact that $\mathbf{S}^{k}=\mathbf{V}\mathsf{diag}(\boldsymbol{\lambda}^{k})\mathbf{V}^{\mathsf{T}}$ were all used. Note that $\boldsymbol{\lambda}^{k}$ is a shorthand notation that means $[\boldsymbol{\lambda}^{k}]_{i}=\lambda_{i}^{k}$. The vector $\mathbf{\tilde{h}}\in\mathbb{R}^{N}$ is known as the frequency response of the filter and its $i^{\text{th}}$ entry is given by

$[\mathbf{\tilde{h}}]_{i}=\mathsf{\tilde{h}}(\lambda_{i})=\sum_{k=0}^{K}h_{k}\lambda_{i}^{k}$ (6)

where $\mathsf{\tilde{h}}:\mathbb{R}\to\mathbb{R}$ is a polynomial defined by the set of filter taps $\{h_{k}\}_{k=0}^{K}$. The function $\mathsf{\tilde{h}}(\cdot)$ depends only on the filter coefficients and not on the specific graph on which it is applied, and thus is valid for all graphs. The effect of filtering on a specific graph comes from instantiating $\mathsf{\tilde{h}}(\cdot)$ on the specific eigenvalues of that graph. The function $\mathsf{\tilde{h}}(\cdot)$ is also referred to as the frequency response, and it will be clear from context whether we refer to the function $\mathsf{\tilde{h}}(\cdot)$ or to the vector $\mathbf{\tilde{h}}$ given in (6) that stems from evaluating $\mathsf{\tilde{h}}(\cdot)$ at each of the $N$ eigenvalues $\lambda_{i}$. Note that since $\mathsf{\tilde{h}}(\cdot)$ is an analytic function, it can be applied to the square matrix $\mathbf{S}$ so that $\mathsf{\tilde{h}}(\mathbf{S})=\boldsymbol{H}^{\text{lsi}}(\mathbf{S})$.

In the case of the LSIGF, it is observed from (5) that the $i^{\text{th}}$ entry of the frequency response of the output, $[\mathbf{\tilde{y}}]_{i}=\tilde{y}_{i}$, is given by the elementwise multiplication

$\tilde{y}_{i}=\mathsf{\tilde{h}}(\lambda_{i})\tilde{x}_{i}.$ (7)

This implies that the frequency response of the output $\tilde{y}_{i}$ is the elementwise multiplication of the frequency response of the filter $\mathsf{\tilde{h}}(\lambda_{i})$ and the frequency response of the input $\tilde{x}_{i}$. This makes (7) the analogue of the convolution theorem for graph signals. Therefore, the LSIGF in (2) is oftentimes called a graph convolution. It is observed from (7) that LSIGFs are capable of learning any type of frequency response (low-pass, high-pass, etc.), but that they are not able to create frequencies, i.e., if $\tilde{x}_{i}=0$, then $\tilde{y}_{i}=0$.

Unlike in the case of discrete-time signals, the frequency response of the LSIGF is not computed in the same manner as the frequency response of graph signals. More specifically, the frequency response of the filter $\mathbf{\tilde{h}}$ can be directly computed from the filter taps $\mathbf{h}$ by means of a Vandermonde matrix $\mathbf{\Lambda}\in\mathbb{R}^{N\times(K+1)}$ given by $[\mathbf{\Lambda}]_{ik}=\lambda_{i}^{k-1}$, as follows:

$\mathbf{\tilde{h}}=\mathbf{\Lambda}\mathbf{h}.$ (8)

This implies that, when processing graph signals, the graph filter cannot be uniquely represented by a graph signal [27, Th. 5], and thus the concept of impulse response is no longer valid. Furthermore, the graph convolution is not symmetric, i.e., the filter and the graph signal do not commute. Finally, it is interesting to note that, when $\mathbf{S}$ is the adjacency matrix of a directed cycle and $K=N-1$, then $\mathbf{\Lambda}=\mathbf{V}^{\mathsf{H}}$ and the GFTs of the signal and of the filter taps (in time, known as the impulse response) become equivalent, as expected.
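
The following sketch (our own numerical check, using a randomly generated symmetric GMD as a stand-in for a real graph) verifies that the vertex-domain LSIGF output of (2) matches the spectral-domain computation obtained by combining (4), (7), and (8).

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 20, 3

# Random symmetric GMD (assumption: Erdos-Renyi adjacency, for illustration only)
A = (rng.random((N, N)) < 0.3).astype(float)
S = np.triu(A, 1); S = S + S.T

# Eigendecomposition S = V diag(lam) V^T (V orthonormal since S is symmetric)
lam, V = np.linalg.eigh(S)

h = rng.standard_normal(K + 1)           # filter taps h_0, ..., h_K
x = rng.standard_normal(N)               # input graph signal

# Vertex-domain output of the LSIGF in (2)
y = sum(h[k] * np.linalg.matrix_power(S, k) @ x for k in range(K + 1))

# Spectral-domain route: (8) for the filter, (4) for the signal, (7) for the output
Lambda = np.vander(lam, K + 1, increasing=True)   # [Lambda]_{ik} = lam_i^k
h_tilde = Lambda @ h                              # frequency response of the filter, (8)
x_tilde = V.T @ x                                 # GFT of the input, (4)
y_tilde = h_tilde * x_tilde                       # convolution theorem, (7)

assert np.allclose(V.T @ y, y_tilde)              # both routes agree
```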

When considering NVGFs, as in (3), the convolution theorem (7) no longer holds. Instead, the frequency response of the output is given in the following proposition.

Proposition 1 (Frequency response of NVGF).

Let $\mathbf{y}=\boldsymbol{H}^{\text{nv}}(\mathbf{S})\mathbf{x}=\sum_{k=0}^{K}\mathsf{diag}(\mathbf{h}^{(k)})\mathbf{S}^{k}\mathbf{x}$ be the output of an arbitrary NVGF characterized by some filter tap matrix $\mathbf{H}\in\mathbb{R}^{N\times(K+1)}$. Then, the frequency response $\mathbf{\tilde{y}}=\mathbf{V}^{\mathsf{T}}\mathbf{y}$ of the output is given by

$\mathbf{\tilde{y}}=\mathbf{V}^{\mathsf{T}}\big(\mathbf{V}\circ(\mathbf{H}\mathbf{\Lambda}^{\mathsf{T}})\big)\mathbf{\tilde{x}}$ (9)

where $\circ$ denotes the elementwise product of matrices.

It is immediately observed that NVGFs are capable of generating new frequency content, even though they are linear.

Corollary 2.

If the matrix $\mathbf{V}^{\mathsf{T}}\big(\mathbf{V}\circ(\mathbf{H}\mathbf{\Lambda}^{\mathsf{T}})\big)$ is not diagonal, then the output exhibits frequency content that is not present in the input.

As an example, consider the case where the input is given by a single-frequency signal, i.e., $\mathbf{x}=\mathbf{v}_{t}$ for some $t\in\{1,\ldots,N\}$, so that $\mathbf{\tilde{x}}=\mathbf{e}_{t}$ where $[\mathbf{e}_{t}]_{i}=1$ if $i=t$ and $[\mathbf{e}_{t}]_{i}=0$ otherwise. This is a signal that has a single frequency component. Yet, the $i^{\text{th}}$ entry of the frequency response of the NVGF output is $\tilde{y}_{i}=\sum_{j=1}^{N}\mathsf{\tilde{h}}_{j}(\lambda_{t})v_{ji}v_{jt}$, where $v_{ij}=[\mathbf{V}]_{ij}$ and $\mathsf{\tilde{h}}_{j}$ is the frequency response obtained from using the $K+1$ coefficients at node $v_{j}$. Unless $\mathsf{\tilde{h}}_{j}(\lambda_{t})$ is constant for all $j$, i.e., $\mathsf{\tilde{h}}_{j}(\lambda_{t})=\tilde{h}_{t}$, the $i^{\text{th}}$ entry of the GFT of the output is nonzero even when the $i^{\text{th}}$ entry of the GFT of the input is zero. This also shows that, by carefully designing the filter taps for each node, the frequencies that are allowed to be created can be chosen, analyzed, and understood.
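
This single-frequency example can be checked numerically along the lines of the previous sketch; the code below (again a hypothetical random graph with random taps, not the released code) compares the vertex-domain NVGF output with (9) and counts how many output frequencies are nonzero.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 20, 3

# Random symmetric GMD (assumption, for illustration only)
A = (rng.random((N, N)) < 0.3).astype(float)
S = np.triu(A, 1); S = S + S.T
lam, V = np.linalg.eigh(S)
Lambda = np.vander(lam, K + 1, increasing=True)

H = rng.standard_normal((N, K + 1))       # one row of K+1 taps per node

# Single-frequency input x = v_t, so that x_tilde = e_t
t = N - 1
x = V[:, t]

# Vertex-domain NVGF output, eq. (3)
y = sum(H[:, k] * (np.linalg.matrix_power(S, k) @ x) for k in range(K + 1))

# Spectral-domain output, eq. (9): y_tilde = V^T (V o (H Lambda^T)) x_tilde
B = V.T @ (V * (H @ Lambda.T))
y_tilde = B[:, t]                          # x_tilde = e_t picks out column t

assert np.allclose(V.T @ y, y_tilde)
print("nonzero output frequencies:", int(np.sum(np.abs(y_tilde) > 1e-9)), "out of", N)
```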

To finalize the frequency analysis of NVGFs, we establish their Lipschitz continuity with respect to changes in the underlying graph support, as described by the matrix $\mathbf{S}$. In what follows, we denote the spectral norm of a matrix $\mathbf{A}$ by $\|\mathbf{A}\|_{2}$.

Theorem 3 (Lipschitz continuity of the NVGF with respect to $\mathbf{S}$).

Let $\mathcal{G}$ and $\mathcal{\hat{G}}$ be two graphs with $N$ nodes, described by GMDs $\mathbf{S}\in\mathbb{R}^{N\times N}$ and $\mathbf{\hat{S}}\in\mathbb{R}^{N\times N}$, respectively. Let $\mathbf{H}\in\mathbb{R}^{N\times(K+1)}$ be the coefficients of any NVGF. Given a constant $\varepsilon>0$, if $\|\mathbf{\hat{S}}-\mathbf{S}\|_{2}\leq\varepsilon$, it holds that

$\big\|\big(\boldsymbol{H}^{\text{nv}}(\mathbf{\hat{S}})-\boldsymbol{H}^{\text{nv}}(\mathbf{S})\big)\mathbf{x}\big\|_{2}\leq\varepsilon C\sqrt{N}(1+8N)\|\mathbf{x}\|_{2}+\mathsf{O}(\varepsilon^{2})$ (10)

where $\boldsymbol{H}^{\text{nv}}(\mathbf{S})$ and $\boldsymbol{H}^{\text{nv}}(\mathbf{\hat{S}})$ are the NVGFs on $\mathbf{S}$ and $\mathbf{\hat{S}}$, respectively, and where $C$ is the Lipschitz constant of the frequency responses at each node, i.e., $|\mathsf{\tilde{h}}_{t}(\lambda_{j})-\mathsf{\tilde{h}}_{t}(\lambda_{i})|\leq C|\lambda_{j}-\lambda_{i}|$ for all $i,j,t\in\{1,\ldots,N\}$.

Theorem 3 establishes the Lipschitz continuity of the NVGF with respect to the support matrix $\mathbf{S}$ (Lipschitz continuity with respect to the input $\mathbf{x}$ follows immediately for bounded filter taps), as long as the graphs $\mathbf{S}$ and $\mathbf{\hat{S}}$ are similar, i.e., $\varepsilon\ll 1$. The bound is proportional to this difference $\varepsilon$ and to the shape of the frequency responses at each node through the Lipschitz constant $C$. It also depends on the number of nodes $N$, but this dependence is fixed for given graphs with the same number of nodes. In short, Theorem 3 gives mild guarantees on the expected performance of the NVGF across a wide range of graphs $\mathbf{\hat{S}}$ that are close to the graph $\mathbf{S}$.
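
As a sanity check of Theorem 3 (our own illustration, not part of the original experiments), the sketch below perturbs a random GMD by a symmetric matrix of spectral norm $\varepsilon$, estimates $C$ from the per-node frequency responses, and compares the output difference with the bound in (10), which is expected to hold up to the $\mathsf{O}(\varepsilon^{2})$ term.

```python
import numpy as np

rng = np.random.default_rng(2)
N, K, eps = 20, 3, 1e-3

A = (rng.random((N, N)) < 0.3).astype(float)
S = np.triu(A, 1); S = S + S.T

# Small symmetric perturbation with spectral norm at most eps
E = rng.standard_normal((N, N)); E = (E + E.T) / 2
E *= eps / np.linalg.norm(E, 2)
S_hat = S + E

H = rng.standard_normal((N, K + 1))
x = rng.standard_normal(N)

def nvgf(S, x, H):
    # Same helper as in the earlier NVGF sketch
    y = H[:, 0] * x
    z = x.copy()
    for k in range(1, H.shape[1]):
        z = S @ z
        y = y + H[:, k] * z
    return y

diff = np.linalg.norm(nvgf(S_hat, x, H) - nvgf(S, x, H))

# Empirical Lipschitz constant C of the per-node frequency responses on the eigenvalues of S
lam = np.linalg.eigvalsh(S)
Lambda = np.vander(lam, K + 1, increasing=True)
R = H @ Lambda.T                          # R[t, i] = h_t(lam_i)
C = max(abs(R[t, j] - R[t, i]) / abs(lam[j] - lam[i])
        for t in range(N) for i in range(N) for j in range(N) if i != j)

bound = eps * C * np.sqrt(N) * (1 + 8 * N) * np.linalg.norm(x)
print(diff <= bound)
```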

4 Approximating Activation Functions

One of the main roles of activation functions in neural networks is to create low-frequency content that can be processed in a stable manner [15]. However, the way the nonlinearities create this frequency content is unknown and cannot be shaped. One alternative for tailoring the frequency creation process to the specific problem under study is to learn the NVGF filter taps (Sec. 5). However, doing so implies that the number of learnable parameters depends on $N$, which may lead to overfitting. In what follows, we propose a method of designing, instead of learning, the NVGFs.

Problem statement. Assume that each data point $\mathbf{x}$ is a random graph signal with mean $\mathbb{E}[\mathbf{x}]=\boldsymbol{\mu}_{x}$, correlation matrix $\mathbf{R}_{x}=\mathbb{E}[\mathbf{x}\mathbf{x}^{\mathsf{T}}]$, and covariance matrix $\mathbf{C}_{x}=\mathbb{E}[(\mathbf{x}-\boldsymbol{\mu}_{x})(\mathbf{x}-\boldsymbol{\mu}_{x})^{\mathsf{T}}]$. The objective is to estimate a pointwise nonlinear function $\rho:\mathbb{R}\to\mathbb{R}$ such that $[\rho(\mathbf{x})]_{i}=\rho([\mathbf{x}]_{i})$, using a NVGF-based estimator $\mathbf{\hat{y}}=\boldsymbol{H}^{\text{nv}}(\mathbf{S})\mathbf{x}+\mathbf{c}$, with $\boldsymbol{H}^{\text{nv}}(\mathbf{S})$ as in (3) and $\mathbf{c}\in\mathbb{R}^{N}$. Given the random variable $\mathbf{y}=\rho(\mathbf{x})$, the aim is to find the filter taps $\mathbf{H}\in\mathbb{R}^{N\times(K+1)}$ that minimize the mean squared error (MSE)

$\mathbf{H}^{\star}=\operatorname*{arg\,min}_{\mathbf{H}\in\mathbb{R}^{N\times(K+1)}}\mathbb{E}\big[\|\mathbf{\hat{y}}-\mathbf{y}\|_{2}^{2}\big].$ (11)

Our particular focus is set on obtaining unbiased estimators.

Lemma 4 (Unbiased estimator).

Let $\boldsymbol{\mu}_{\rho}=\mathbb{E}[\rho(\mathbf{x})]$. The NVGF-based estimator is unbiased if and only if $\mathbf{c}=\boldsymbol{\mu}_{\rho}-\boldsymbol{H}^{\text{nv}}(\mathbf{S})\boldsymbol{\mu}_{x}$.

From Lemma 4, the unbiased estimator is now

$\mathbf{\hat{y}}=\boldsymbol{H}^{\text{nv}}(\mathbf{S})\big(\mathbf{x}-\boldsymbol{\mu}_{x}\big)+\boldsymbol{\mu}_{\rho}.$ (12)

Therefore, the objective becomes finding the filter tap matrix $\mathbf{H}\in\mathbb{R}^{N\times(K+1)}$, for some fixed value of $K$, that satisfies

$\mathbf{H}^{\star}=\operatorname*{arg\,min}_{\mathbf{H}\in\mathbb{R}^{N\times(K+1)}}\mathbb{E}\Big[\big\|\boldsymbol{H}^{\text{nv}}(\mathbf{S})(\mathbf{x}-\boldsymbol{\mu}_{x})-\big(\rho(\mathbf{x})-\boldsymbol{\mu}_{\rho}\big)\big\|_{2}^{2}\Big].$ (13)

Note that, due to the orthogonal nature of the GFT, minimizing (13) is equivalent to minimizing the difference of the corresponding frequency responses.

The optimal filter taps for each node, i.e., the rows $\mathbf{h}_{i}^{\star}\in\mathbb{R}^{K+1}$ of $\mathbf{H}^{\star}$ that solve (13), can be obtained by solving a linear system of equations as follows.

Proposition 5 (Optimal NVGF).

Let $\mathbf{u}_{i}$ denote the $i^{\text{th}}$ row of $\mathbf{V}$, let $\mathbf{R}_{i}=\mathbf{\Lambda}^{\mathsf{T}}\mathsf{diag}(\mathbf{u}_{i})\mathbf{V}^{\mathsf{T}}\mathbf{C}_{x}\mathbf{V}\mathsf{diag}(\mathbf{u}_{i})\mathbf{\Lambda}$ be the covariance matrix of the frequency response at node $v_{i}$ for the input $\mathbf{x}$, and let $\mathbf{p}_{i}=\mathbf{\Lambda}^{\mathsf{T}}\mathsf{diag}(\mathbf{u}_{i})\mathbf{V}^{\mathsf{T}}\mathbb{E}[(\rho(x_{i})-\mu_{\rho i})(\mathbf{x}-\boldsymbol{\mu}_{x})]$ denote the correlation between the filtered signal and the target nonlinearity. Then, a set of filter taps $\{\mathbf{h}_{1}^{\star},\ldots,\mathbf{h}_{N}^{\star}\}$ is optimal for (13) if and only if they solve the system of linear equations

$\mathbf{R}_{i}\mathbf{h}_{i}^{\star}=\mathbf{p}_{i},\quad i\in\{1,\ldots,N\}.$ (14)

From Proposition 5, it is immediate that only knowledge of the first and second moments of $\mathbf{x}$, and of the correlation between the input $\mathbf{x}$ and the output $\mathbf{y}=\rho(\mathbf{x})$, is required to solve for the optimal NVGF. These moments can be estimated from training data. It is also observed that the optimal filter taps can be computed separately at each node.
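
A possible implementation of this design procedure, estimating the required moments from a matrix of training signals and solving (14) node by node, is sketched below. The function name `design_nvgf` and the use of a least-squares solve for a possibly singular $\mathbf{R}_{i}$ are our own choices, not taken from the released code.

```python
import numpy as np

def design_nvgf(S, X, rho, K):
    """Estimate the Design-NVGF taps of Proposition 5 from training data.

    S   : (N, N) symmetric GMD
    X   : (T, N) matrix of T training graph signals (one per row)
    rho : pointwise nonlinearity, e.g. lambda x: np.maximum(x, 0)
    K   : filter order
    Returns H of shape (N, K+1) with row i solving R_i h_i = p_i in (14).
    """
    T, N = X.shape
    lam, V = np.linalg.eigh(S)
    Lambda = np.vander(lam, K + 1, increasing=True)

    mu_x = X.mean(axis=0)
    Xc = X - mu_x                                   # centered inputs
    C_x = Xc.T @ Xc / T                             # sample covariance
    Y = rho(X)
    Yc = Y - Y.mean(axis=0)                         # centered targets rho(x_i) - mu_{rho,i}

    H = np.zeros((N, K + 1))
    for i in range(N):
        D = np.diag(V[i, :])                        # diag(u_i), u_i = i-th row of V
        M = Lambda.T @ D @ V.T                      # common factor in R_i and p_i
        R_i = M @ C_x @ M.T                         # per-node covariance in Proposition 5
        p_i = M @ (Xc.T @ Yc[:, i] / T)             # sample estimate of E[(rho(x_i)-mu)(x-mu_x)]
        H[i, :] = np.linalg.lstsq(R_i, p_i, rcond=None)[0]
    return H
```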

We now consider the specific case of random graph signals with zero-mean, independent, identically distributed (i.i.d.) entries, to illustrate the form of the optimal NVGF.

Corollary 6 (Zero-mean, i.i.d., ReLU nonlinearity).

Assume that the elements of $\mathbf{x}$ are i.i.d., and that $\boldsymbol{\mu}_{x}=\mathbf{0}$ and $\mathbf{R}_{x}=\mathbf{C}_{x}=\sigma_{x}^{2}\mathbf{I}$ for some $\sigma_{x}^{2}>0$. Consider an unbiased NVGF-based estimator of the form

$\hat{y}_{i}=(\xi^{2}/\sigma_{x}^{2})x_{i}+\mu_{\rho}$ (15)

with $\xi^{2}=\mathbb{E}[\rho(x)x]$ and $\mu_{\rho}=\mathbb{E}[\rho(x)]$, where $x$ is distributed as the elements of $\mathbf{x}$. Then, the estimator (15) is optimal for (13). If, additionally, $\rho(\cdot)=\mathsf{ReLU}(\cdot)$ and the distribution of $x$ is symmetric around 0, then $\mu_{\rho}=0$ and $\xi^{2}=\sigma_{x}^{2}/2$, so that (15) reduces to $\hat{y}_{i}=x_{i}/2$.

Corollary 6 shows that if the elements of the graph signal are zero-mean and i.i.d., then the NVGF boils down to a LSI graph filter. Hence, this optimal unbiased NVGF is not capable of generating frequencies. This is sensible since the elements of the graph signal $\mathbf{x}$ bear no relationship to the underlying graph support, and thus the NVGF does not leverage that support to create the appropriate frequencies. Furthermore, if the distribution of each element of $\mathbf{x}$ is symmetric around zero, then an optimal approximation of the $\mathsf{ReLU}$ nonlinearity amounts to a LSI graph filter that outputs half of the input, as one would expect.

Corollary 7 (Stationary graph processes).

Let $\mathbf{x}$ be a stationary graph process [30, 31], so that $\boldsymbol{\mu}_{x}=\mathbf{0}$ and $\mathbf{R}_{x}=\mathbf{C}_{x}=\mathbf{V}\mathsf{diag}(\mathbf{q})\mathbf{V}^{\mathsf{T}}$ with power spectral density $\mathbf{q}\in\mathbb{R}^{N}$. Then, NVGF filter taps $\mathbf{h}_{i}^{\star}$ at node $v_{i}$ solving the linear equations $\mathsf{diag}(\mathbf{u}_{i}\circ\mathbf{q})\mathbf{\Lambda}\mathbf{h}_{i}^{\star}=\mathbf{V}^{\mathsf{T}}\mathbb{E}[\rho(x_{i})\mathbf{x}]$ are optimal.

It is observed from Corollary 7 that the contribution of each frequency component to node $v_{i}$ gets modulated by the power spectral density $\mathbf{q}$. Also, the filter in this case is different for each node, and thus optimal unbiased NVGFs for processing stationary graph processes actually create frequencies.

5 Graph Neural Network Architectures Using Node-Variant Graph Filters

Figure 1: Authorship attribution problem: Jane Austen. (a) Performance comparison: comparison between the Learn NVGF and three popular architectures (GCN [24], SGC [25], and GAT [26]). (b) Role of frequency creation: change with respect to a GCNN (16) when considering no activation function (LSIGF), a designed NVGF (17), and a learned NVGF (18). (c) Input frequencies: a sample of the input frequencies to the architectures, where a large high-frequency component is observed.

A graph convolutional neural network (GCNN) $\mathsf{\Phi}$ has a layered architecture [23, 6], where each layer applies a LSIGF as in (2), followed by a pointwise nonlinear activation function $\rho:\mathbb{R}\to\mathbb{R}$ applied to each node, $[\rho(\mathbf{x})]_{i}=\rho([\mathbf{x}]_{i})$, so that

$\mathbf{x}_{0}=\mathbf{x},\quad\mathbf{x}_{\ell}=\rho\big(\mathsf{H}_{\ell}^{\text{lsi}}(\mathbf{x}_{\ell-1};\mathbf{S},\mathbf{h}_{\ell})\big),\quad\mathbf{x}_{L}=\mathsf{\Phi}(\mathbf{x};\mathbf{S},\mathcal{H}).$ (16)

The LSIGF $\mathsf{H}_{\ell}^{\text{lsi}}(\cdot;\mathbf{S},\mathbf{h}_{\ell})$ is characterized by the specific filter taps $\mathbf{h}_{\ell}\in\mathbb{R}^{K_{\ell}+1}$. Note that the output of the GCNN is collected as the output of the last layer, $\mathbf{x}_{L}=\mathsf{\Phi}(\mathbf{x};\mathbf{S},\mathcal{H})$. This notation emphasizes that the input is $\mathbf{x}$, while the matrix $\mathbf{S}$ is given by the problem, and the filter taps $\mathcal{H}=\{\mathbf{h}_{\ell}\}_{\ell=1}^{L}$ are learned from data.

Nonlinear activation functions are used in GCNNs to enable them to learn nonlinear relationships between input and output. Additionally, theoretical results have found that they play a key role in creating frequency content that can be better processed by subsequent graph convolutional filters. As previously discussed, NVGFs are also capable of creating new frequency content, albeit in a linear manner. Therefore, by replacing the nonlinear activation functions by NVGFs, it is possible to decouple the contribution made by frequency creation from that made by the architecture’s nonlinearity.

The first architecture proposed here is to use a designed NVGF in lieu of the activation function. That is, instead of using the nonlinear activation function ρ\rho, an optimal NVGF filter designed as in Proposition 5 is used. This requires estimating the first and second moments of the NVGF input data. The architecture, herein termed “Design NVGF”, is given by

$\mathbf{x}_{\ell}=\mathsf{H}_{\ell}^{\text{nv}}\big(\mathsf{H}_{\ell}^{\text{lsi}}(\mathbf{x}_{\ell-1};\mathbf{S},\mathbf{h}_{\ell});\mathbf{S},\mathbf{H}_{\ell}^{\star}\big).$ (17)

Note that, in this case, the filter taps $\mathbf{H}_{\ell}^{\star}\in\mathbb{R}^{N\times(K_{\ell}+1)}$ of the NVGF are obtained by Proposition 5, while the filter taps $\mathcal{H}=\{\mathbf{h}_{\ell}\}_{\ell=1}^{L}$ of the LSIGF are learned from data.

Alternatively, the filter taps of the NVGF replacing the nonlinear activation function can also be learned from data, together with the filter taps of the LSIGF:

$\mathbf{x}_{\ell}=\mathsf{H}_{\ell}^{\text{nv}}\big(\mathsf{H}_{\ell}^{\text{lsi}}(\mathbf{x}_{\ell-1};\mathbf{S},\mathbf{h}_{\ell});\mathbf{S},\mathbf{H}_{\ell}\big)$ (18)

where the filter taps to be learned are $\mathcal{H}=\{(\mathbf{h}_{\ell},\mathbf{H}_{\ell})\}_{\ell=1}^{L}$. This approach avoids the need to estimate first and second moments. Additionally, it allows the NVGF to learn how to create frequency content tailored to the application at hand, instead of just approximating the chosen nonlinear activation function. We note that while the increased number of parameters may lead to overfitting, this can be tackled by dropout. This architecture is termed “Learn NVGF”.
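
A minimal single-feature sketch of one Learn NVGF layer (18) in PyTorch is given below. It omits the multiple hidden features, biases, and readout layer of the actual architectures, and the class name `LearnNVGFLayer` is our own; it is meant only to illustrate how the LSIGF and NVGF taps are both trained by backpropagation.

```python
import torch
import torch.nn as nn

class LearnNVGFLayer(nn.Module):
    """One layer of (18): a learnable LSIGF followed by a learnable NVGF."""

    def __init__(self, S, K):
        super().__init__()
        self.register_buffer("S", S)                           # (N, N) GMD, fixed
        N = S.shape[0]
        self.h = nn.Parameter(0.1 * torch.randn(K + 1))        # LSIGF taps, shared by all nodes
        self.H = nn.Parameter(0.1 * torch.randn(N, K + 1))     # NVGF taps, one row per node

    def _shifts(self, x):
        # Returns [x, Sx, ..., S^K x] stacked along the last dim: (batch, N, K+1)
        z, out = x, [x]
        for _ in range(self.h.numel() - 1):
            z = z @ self.S.T                                   # one-hop exchange
            out.append(z)
        return torch.stack(out, dim=-1)

    def forward(self, x):
        u = (self._shifts(x) * self.h).sum(dim=-1)             # LSIGF, eq. (2)
        return (self._shifts(u) * self.H).sum(dim=-1)          # NVGF in place of rho, eq. (3)
```

In the full architectures, several such filters are run in parallel per layer (the $F$ hidden units) and followed by the learnable linear readout described in Section 6.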

6 Numerical Experiments

The objectives of the experiments are twofold. First, they aim to show how the architecture obtained by replacing the nonlinear activation function with a NVGF performs when compared to other popular GNNs. The second objective is to isolate the impact that frequency creation has on the overall performance. Due to space constraints, we focus on the problem of authorship attribution [5]. Other applications can be found in the supplementary material; the corresponding observations are summarized at the end of this section for reference.

Problem statement. In the problem of authorship attribution, the goal is to determine whether a given text has been written by a certain author, relying on other texts that the author is known to have written. To this end, word adjacency networks (WANs) are leveraged. WANs are graphs that are built by considering function words (i.e., words without semantic meaning) as nodes, and considering their co-occurrences as edges [5]. As it happens, the way each author uses function words gives rise to a stylistic signature that can be used to attribute authorship. In what follows, we address this problem by leveraging previously constructed WANs as graphs, and the frequency count (histogram) of function words as the corresponding graph signals.

Dataset. For illustrative purposes, in what follows, works by Jane Austen are considered. Attribution of other 19th century authors can be found in the supplementary material. A WAN consisting of 189 nodes (function words) and 9812 edges is built from texts belonging to a given corpus considered to be the training set. These texts are partitioned into segments of approximately 1000 words each, and the frequency count of those 189 function words in each of the segments is obtained. These represent the graph signals $\mathbf{x}$ that are considered to be part of the training set. Each of these is assigned the label 1 to indicate that it has been authored by Jane Austen. An equal number of segments from other contemporary authors are randomly selected, and then their frequency counts are computed and added to the training set with the label 0 to indicate that they have not been written by Jane Austen. The total number of labeled samples in the training set is 1464, of which 118 are left aside for validation. The test set is built analogously by considering other text segments that were not part of the training set (and thus not used to build the WAN either), totaling 78 graph signals (half corresponding to texts authored by Jane Austen, and half corresponding to texts by other contemporary authors).

Architectures and training. For the first experiment, we compare the Learn NVGF architecture (18) with arguably three of the most popular non-GSP GNNs, namely, GCN [24], SGC [25], and GAT [26]. Note that the Learn NVGF is an entirely linear architecture, but one capable of creating frequencies due to the nature of the NVGF. The filter taps of both the LSIGF and the NVGF are learned from data. The other three architectures are nonlinear since they include a ReLU activation function after the first filtering layer. All architectures include a learnable linear readout layer. Dropout with probability 0.5 is included after the first layer. All architectures are trained for 25 epochs with a batch size of 20, using the ADAM algorithm [32] with forgetting factors 0.9 and 0.999, respectively, as well as a learning rate $\eta$. The value of the learning rate $\eta$, the number of hidden units $F$, and the number of filter taps $K$ are chosen by exhaustive search over triplets $(\eta,F,K)$ in the set $\{0.001,0.005,0.01\}\times\{16,32,64\}\times\{2,3,4\}$.

Figure 2: Frequency responses to a single high-frequency input $\mathbf{x}=\mathbf{v}_{N}$: (a) LSIGF, (b) GCNN, (c) Learn NVGF. The frequency responses include the outputs of all $F=64$ filters.

Experiment 1: Performance comparison. The objective of this first experiment is to illustrate that the performance of the Learn NVGF is comparable to the performance of popular (non-GSP) GNNs. The best results for each architecture are shown in Figure 1a, where the classification error was averaged over 10 random splits of texts assigned to the training and test sets. One third of the standard deviation is also shown. It is observed that the Learn NVGF architecture has a comparable performance. It is emphasized that the objective of this paper is not to achieve state-of-the-art performance, but to offer insight on the role of nonlinear activation functions in frequency creation and how this translates to performance, as discussed next. Among the three popular architectures, GAT ($\eta=0.01$, $F=64$) exhibits the best performance, which is 102% better than SGC ($\eta=0.005$, $F=64$, $K=2$) and 30% better than GCN ($\eta=0.01$, $F=64$). This is expected since the graph filters involved in GAT are non-convolutional, and thus have more expressivity. Yet, when compared to the linear Learn NVGF ($\eta=0.001$, $F=32$, $K=3$), it is observed that the latter exhibits 20% better performance than the GAT.

Experiment 2: The role of frequency creation. For the second experiment, we compare the Learn NVGF of the previous experiment with (i) a simple LSIGF, (ii) a Design NVGF as in (17), and (iii) a GCNN as in (16). The same values ($\eta=0.001$, $F=32$, $K=3$) are used for all architectures as a means of fixing all other sources of variability except for the nonlinearity/frequency creation. The results are shown in Figure 1b. Note that the GCNN in (16) can be interpreted as a stand-in for ChebNets [23], arguably the most popular GSP-based GNN architecture.

Discussion. First, note that the LSIGF, Design NVGF, and Learn NVGF architectures are linear, whereas the GCNN is not. The LSIGF, however, is not capable of creating frequencies, while the other three are, albeit through different mechanisms. Second, Figure 1b shows the percentage difference in performance with respect to the base architecture (the GCNN). Essentially, the difference in performance between the LSIGF and the Learn NVGF can be pinned down to the frequency creation, because both architectures are linear, while the difference between the Learn NVGF and the GCNN can be tied to the nonlinear nature of the GCNN. It is thus observed that the LSIGF performs considerably worse than the Learn NVGF and the GCNN, which perform the same. It is observed that the Design NVGF performs halfway between the GCNN and the LSIGF. The Design NVGF depends on the ability to accurately estimate the first and second moments from the data, and this has an impact on its performance. In any case, it is noted that this experiment suggests that the main driver of improved performance is the frequency creation and not necessarily the nonlinear nature of the GCNN.

From a qualitative standpoint, the average frequency response of the signals in the test set is shown in Figure 1c. Since the high-eigenvalue content is significant, it is expected that the ability to better process this content will impact the overall performance. This explains the relatively poor performance of the LSIGF. In Figure 2, we show the frequency response of the output for each of the three architectures (LSIGF, Learn NVGF, and GCNN) when the input has a single high frequency, i.e., $\mathbf{x}=\mathbf{v}_{N}$ so that $\mathbf{\tilde{x}}=\mathbf{e}_{N}$. Figure 2a shows that the output frequency response of the LSIGF exhibits a single frequency, the same as the input. Figure 2b shows that the output frequency response of the GCNN has content in all frequencies, but most notably a low-frequency peak appears. Figure 2c shows that the output frequency response of the Learn NVGF contains all frequencies, in a much more spread manner than the GCNN.

General observations. In the supplementary material, a similar analysis is carried out for 21 other authors. Additionally, it is noted that Jane Austen is representative of the largest group (consisting of 11 authors) for which the Learn NVGF and the GCNN have similar performance and are better than the LSIGF. For 7 other authors, the Learn NVGF actually performs better than the GCNN. Finally, for the remaining 3 authors, there is no significant difference between the performance of the LSIGF, the GCNN, and the Learn NVGF, which implies that the high-frequency eigenvalue content is less significant for these authors. Additionally, the problem of movie recommendation is considered [33]. First, it is observed that the Learn NVGF exhibits better performance than the methods in [3, 4] and the nearest neighbor algorithm. Second, it is noted that the input signals do not have significant high-eigenvalue content, and thus the LSIGF, the GCNN, and the Learn NVGF exhibit similar performance. For a detailed analysis, please see the supplementary material. It is noted that the popular benchmark of semi-supervised learning over citation networks does not fit the empirical risk minimization framework nor the graph signal processing framework, and thus it does not admit a frequency analysis, precluding its use in this work.

Problems beyond processing graph signals. Two popular tasks that utilize graph-based data are semi-supervised learning and graph classification. The former involves a framework where each sample represents a node in the graph, i.e., $\mathbf{x}\in\mathbb{R}^{F}$ for $F$ features instead of $\mathbf{x}\in\mathbb{R}^{N}$ for $N$ nodes. This implies that the notion of graph frequency response used in (4) does not translate to the semi-supervised setting. Therefore, extending these results to this problem requires careful determination of the notion of frequency. For the graph classification problem, each sample in the dataset represents a signal together with a graph, and the graph associated to each sample is usually different. In this case, while the notion of frequency is properly defined for each sample, an overarching frequency response for all the samples is not. Possible extensions of this framework to the graph classification problem would entail redefining a common notion of frequency response for the dataset, for example, using graphons.

7 Conclusion

The objective of this work was to study the role of frequency creation in GSP problems. To do so, nonlinear activation functions (which theoretical findings suggest give rise to frequency creation) were replaced by NVGFs, which are also capable of creating frequencies, but in a linear manner. In this way, frequency creation was decoupled from the nonlinear nature of activation functions. Numerical experiments show that the main driver of improved performance is frequency creation and not necessarily the nonlinear nature of GCNNs. As future work, we are interested in extending this frequency analysis to non-GSP problems such as semi-supervised node classification and graph classification; as discussed, these extensions require a careful definition of an appropriate notion of graph frequency, which opens up an interesting area of future research. We also note that it would be possible to use shift-variant filters to leverage this framework when using CNNs.

References

  • [1] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst, “Geometric deep learning: Going beyond Euclidean data,” IEEE Signal Process. Mag., vol. 34, no. 4, pp. 18–42, July 2017.
  • [2] F. Gama, E. Isufi, G. Leus, and A. Ribeiro, “Graphs, convolutions, and neural networks: From graph filters to graph neural networks,” IEEE Signal Process. Mag., vol. 37, no. 6, pp. 128–138, Nov. 2020.
  • [3] F. Monti, M. M. Bronstein, and X. Bresson, “Geometric matrix completion with recurrent multi-graph neural networks,” in 31st Conf. Neural Inform. Process. Syst. Long Beach, CA: Neural Inform. Process. Syst. Foundation, 4-9 Dec. 2017, pp. 3700–3710.
  • [4] R. Levie, F. Monti, X. Bresson, and M. M. Bronstein, “CayleyNets: Graph convolutional neural networks with complex rational spectral filters,” IEEE Trans. Signal Process., vol. 67, no. 1, pp. 97–109, 5 Nov. 2018.
  • [5] S. Segarra, M. Eisen, and A. Ribeiro, “Authorship attribution through function word adjacency networks,” IEEE Trans. Signal Process., vol. 63, no. 20, pp. 5464–5478, 30 June 2015.
  • [6] F. Gama, A. G. Marques, G. Leus, and A. Ribeiro, “Convolutional neural network architectures for signals supported on graphs,” IEEE Trans. Signal Process., vol. 67, no. 4, pp. 1034–1049, 17 Dec. 2018.
  • [7] M. Eisen and A. Ribeiro, “Optimal wireless resource allocation with random edge graph neural networks,” IEEE Trans. Signal Process., vol. 68, pp. 2977–2991, 20 Apr. 2020.
  • [8] F. Gama and S. Sojoudi, “Distributed linear-quadratic control with graph neural networks,” Signal Process., 11 Feb. 2022, early access. [Online]. Available: http://doi.org/10.1016/j.sigpro.2022.108506
  • [9] D. Owerko, F. Gama, and A. Ribeiro, “Predicting power outages using graph neural networks,” in 2018 IEEE Global Conf. Signal and Inform. Process. Anaheim, CA: IEEE, 26-29 Nov. 2018, pp. 743–747.
  • [10] F. Gama, J. Bruna, and A. Ribeiro, “Stability properties of graph neural networks,” IEEE Trans. Signal Process., vol. 68, pp. 5680–5695, 25 Sep. 2020.
  • [11] H. Kenlay, D. Thanou, and X. Dong, “On the stability of graph convolutional neural networks under edge rewiring,” in 46th IEEE Int. Conf. Acoust., Speech and Signal Process. Toronto, ON: IEEE, 6-11 June 2021.
  • [12] R. Levie, W. Huang, L. Bucci, M. Bronstein, and G. Kutyniok, “Transferability of spectral graph convolutional neural networks,” arXiv:1907.12972v2 [cs.LG], 5 March 2020. [Online]. Available: http://arxiv.org/abs/1907.12972
  • [13] F. Gao, G. Wolf, and M. Hirn, “Geometric scattering for graph data analysis,” in 36th Int. Conf. Mach. Learning, vol. 97. Long Beach, CA: Proc. Mach. Learning Res., 9-15 June 2019, pp. 2122–2131.
  • [14] D. Zou and G. Lerman, “Graph convolutional neural networks via scattering,” Appl. Comput. Harmonic Anal., vol. 49, no. 3, pp. 1046–1074, Nov. 2020.
  • [15] S. Mallat, “Group invariant scattering,” Commun. Pure, Appl. Math., vol. 65, no. 10, pp. 1331–1398, Oct. 2012.
  • [16] J. Bruna and S. Mallat, “Invariant scattering convolution networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1872–1886, Aug. 2013.
  • [17] A. Sandryhaila and J. M. F. Moura, “Discrete signal processing on graphs: Frequency analysis,” IEEE Trans. Signal Process., vol. 62, no. 12, pp. 3042–3054, 30 Apr. 2014.
  • [18] L. Ruiz, F. Gama, and A. Ribeiro, “Graph neural networks: Architectures, stability and transferability,” Proc. IEEE, vol. 109, no. 5, pp. 660–682, May 2021.
  • [19] S. Segarra, A. G. Marques, and A. Ribeiro, “Optimal graph-filter design and applications to distributed linear networks operators,” IEEE Trans. Signal Process., vol. 65, no. 15, pp. 4117–4131, 11 May 2017.
  • [20] S. Rey, V. Tenorio, S. Rozada, L. Martino, and A. G. Marques, “Deep encoder-decoder neural network architectures for graph output signals,” in 53rd Asilomar Conf. Signals, Systems and Comput. Pacific Grove, CA: IEEE, 3-6 Nov. 2019, pp. 225–229.
  • [21] S. Rey, V. Tenorio, S. Rozada, L. Martino, and A. G. Marques, “Overparametrized deep encoder-decoder schemes for inputs and outputs defined over graphs,” in 28th Eur. Signal Process. Conf. Amsterdam, the Netherlands: Eur. Assoc. Signal Process., 18-21 Jan. 2021, pp. 855–859.
  • [22] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, “Spectral networks and deep locally connected networks on graphs,” in 2nd Int. Conf. Learning Representations, Banff, AB, 14-16 Apr. 2014, pp. 1–14.
  • [23] M. Defferrard, X. Bresson, and P. Vandergheynst, “Convolutional neural networks on graphs with fast localized spectral filtering,” in 30th Conf. Neural Inform. Process. Syst. Barcelona, Spain: Neural Inform. Process. Syst. Foundation, 5-10 Dec. 2016, pp. 3844–3858.
  • [24] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in 5th Int. Conf. Learning Representations, Toulon, France, 24-26 Apr. 2017, pp. 1–14.
  • [25] F. Wu, T. Zhang, A. H. de Souza Jr., C. Fifty, T. Yu, and K. Q. Weinberger, “Simplifying graph convolutional networks,” in 36th Int. Conf. Mach. Learning, vol. 97. Long Beach, CA: Proc. Mach. Learning Res., 9-15 June 2019, pp. 6861–6871.
  • [26] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph attention networks,” in 6th Int. Conf. Learning Representations, Vancouver, BC, 30 Apr.-3 May 2018, pp. 1–12.
  • [27] A. Sandryhaila and J. M. F. Moura, “Discrete signal processing on graphs,” IEEE Trans. Signal Process., vol. 61, no. 7, pp. 1644–1656, 11 Jan. 2013.
  • [28] D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst, “The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains,” IEEE Signal Process. Mag., vol. 30, no. 3, pp. 83–98, May 2013.
  • [29] A. Ortega, P. Frossard, J. Kovačević, J. M. F. Moura, and P. Vandergheynst, “Graph signal processing: Overview, challenges and applications,” Proc. IEEE, vol. 106, no. 5, pp. 808–828, May 2018.
  • [30] N. Perraudin and P. Vandergheynst, “Stationary signal processing on graphs,” IEEE Trans. Signal Process., vol. 65, no. 13, pp. 3462–3477, 3 Apr. 2017.
  • [31] A. G. Marques, S. Segarra, G. Leus, and A. Ribeiro, “Stationary graph processes and spectral estimation,” IEEE Trans. Signal Process., vol. 65, no. 22, pp. 5911–5926, 11 Aug. 2017.
  • [32] D. P. Kingma and J. L. Ba, “ADAM: A method for stochastic optimization,” in 3rd Int. Conf. Learning Representations, San Diego, CA, 7-9 May 2015, pp. 1–15.
  • [33] F. M. Harper and J. A. Konstan, “The MovieLens datasets: History and context,” ACM Trans. Interactive Intell. Syst., vol. 5, no. 4, pp. 19:(1–19), Jan. 2016.

Appendix A Frequency Response of Node-Variant Graph Filters

Proof of Proposition 1.

The output graph signal $\mathbf{y}\in\mathbb{R}^{N}$ of a node-variant graph filter is given by (3), which is reproduced here for ease of exposition:

$\mathbf{y}=\sum_{k=0}^{K}\mathsf{diag}(\mathbf{h}^{(k)})\mathbf{S}^{k}\mathbf{x}.$ (19)

Recall that $\mathbf{h}^{(k)}\in\mathbb{R}^{N}$ is the $(k+1)^{\text{th}}$ column of the matrix $\mathbf{H}\in\mathbb{R}^{N\times(K+1)}$ containing the $N(K+1)$ filter taps, that $\mathbf{S}\in\mathbb{R}^{N\times N}$ is the support matrix describing the graph, and that $\mathbf{x}\in\mathbb{R}^{N}$ is the input graph signal. As given by (4), the spectral representation $\mathbf{\tilde{x}}\in\mathbb{R}^{N}$ of the input graph signal contains the coordinates of $\mathbf{x}$ in the eigenbasis $\{\mathbf{v}_{i}\}_{i=1}^{N}$ of the support matrix $\mathbf{S}$, i.e., $\mathbf{\tilde{x}}=\mathbf{V}^{\mathsf{T}}\mathbf{x}$. Similarly, $\mathbf{\tilde{y}}=\mathbf{V}^{\mathsf{T}}\mathbf{y}$. Therefore, by (19) together with the fact that $\mathbf{S}=\mathbf{V}\mathsf{diag}(\boldsymbol{\lambda})\mathbf{V}^{\mathsf{T}}$, it holds that

$\mathbf{\tilde{y}}=\mathbf{V}^{\mathsf{T}}\mathbf{y}=\mathbf{V}^{\mathsf{T}}\sum_{k=0}^{K}\mathsf{diag}(\mathbf{h}^{(k)})\mathbf{S}^{k}\mathbf{x}=\mathbf{V}^{\mathsf{T}}\sum_{k=0}^{K}\mathsf{diag}(\mathbf{h}^{(k)})\mathbf{V}\mathsf{diag}(\boldsymbol{\lambda}^{k})\mathbf{\tilde{x}}$ (20)

where $\boldsymbol{\lambda}^{k}$ is the $N$-vector with $i^{\text{th}}$ element given by $[\boldsymbol{\lambda}^{k}]_{i}=\lambda_{i}^{k}$. Denoting $h_{ik}=[\mathbf{h}^{(k)}]_{i}=[\mathbf{H}]_{i(k+1)}$, note that $\mathsf{diag}(\mathbf{h}^{(k)})\mathbf{V}\mathsf{diag}(\boldsymbol{\lambda}^{k})\in\mathbb{R}^{N\times N}$, so that

$\sum_{k=0}^{K}[\mathsf{diag}(\mathbf{h}^{(k)})\mathbf{V}\mathsf{diag}(\boldsymbol{\lambda}^{k})]_{ij}=\sum_{k=0}^{K}h_{ik}\lambda_{j}^{k}v_{ij}=v_{ij}\sum_{k=0}^{K}h_{ik}\lambda_{j}^{k}=v_{ij}\mathsf{\tilde{h}}_{i}(\lambda_{j})$ (21)

where $\mathsf{\tilde{h}}_{i}(\lambda)=\sum_{k=0}^{K}h_{ik}\lambda^{k}$ is the frequency response of the filter taps at node $v_{i}$; see (6). Next, observe that the filter taps for each node are collected in the rows of $\mathbf{H}$. Therefore, in analogy to (8), it holds that $\mathbf{H}\mathbf{\Lambda}^{\mathsf{T}}\in\mathbb{R}^{N\times N}$, with

$[\mathbf{H}\mathbf{\Lambda}^{\mathsf{T}}]_{ij}=\sum_{k=0}^{K}h_{ik}\lambda_{j}^{k}=\mathsf{\tilde{h}}_{i}(\lambda_{j}).$ (22)

Substituting (22) into (21) gives that

$\sum_{k=0}^{K}[\mathsf{diag}(\mathbf{h}^{(k)})\mathbf{V}\mathsf{diag}(\boldsymbol{\lambda}^{k})]_{ij}=[\mathbf{V}\circ(\mathbf{H}\mathbf{\Lambda}^{\mathsf{T}})]_{ij}$ (23)

so that (20) becomes

$\mathbf{\tilde{y}}=\mathbf{V}^{\mathsf{T}}\big(\mathbf{V}\circ(\mathbf{H}\mathbf{\Lambda}^{\mathsf{T}})\big)\mathbf{\tilde{x}}.$ (24)

This completes the proof. ∎

Proof of Corollary 2.

The $j^{\text{th}}$ element of the graph Fourier transform of the output, $\mathbf{\tilde{y}}$, is

$\tilde{y}_{j}=\sum_{i=1}^{N}\tilde{x}_{i}\mathbf{v}_{j}^{\mathsf{T}}\mathsf{diag}(\boldsymbol{\tilde{h}}(\lambda_{i}))\mathbf{v}_{i}$ (25)

where $\boldsymbol{\tilde{h}}(\lambda_{i})=[\mathsf{\tilde{h}}_{1}(\lambda_{i}),\mathsf{\tilde{h}}_{2}(\lambda_{i}),\dots,\mathsf{\tilde{h}}_{N}(\lambda_{i})]\in\mathbb{R}^{N}$ collects the frequency responses of all nodes at eigenvalue $\lambda_{i}$. Note that, if $\tilde{y}_{j}$ depends on $\tilde{x}_{i}$ for some $i\neq j$, then frequencies are created. When $\mathbf{V}^{\mathsf{T}}\big(\mathbf{V}\circ(\mathbf{H}\mathbf{\Lambda}^{\mathsf{T}})\big)$ is not diagonal, there exist $i,j\in\{1,\ldots,N\}$ such that $i\neq j$ and $\big(\mathbf{V}^{\mathsf{T}}\big(\mathbf{V}\circ(\mathbf{H}\mathbf{\Lambda}^{\mathsf{T}})\big)\big)_{ij}=\mathbf{v}_{j}^{\mathsf{T}}\mathsf{diag}(\boldsymbol{\tilde{h}}(\lambda_{i}))\mathbf{v}_{i}\neq 0$. In this case, (25) indeed shows that frequencies are created.

Now suppose that no frequency creation occurs. Then 𝐕𝖳(𝐕(𝐇𝚲𝖳))\mathbf{V}^{\mathsf{T}}\big{(}\mathbf{V}\circ(\mathbf{H}\mathbf{\Lambda}^{\mathsf{T}})\big{)} must be diagonal. Therefore, there exist α1,…,αN\alpha_{1},\ldots,\alpha_{N}\in\mathbb{R} such that 𝐯j𝖳𝖽𝗂𝖺𝗀(𝒉~(λi))𝐯i=αiδij\mathbf{v}_{j}^{\mathsf{T}}\mathsf{diag}(\boldsymbol{\tilde{h}}(\lambda_{i}))\mathbf{v}_{i}=\alpha_{i}\delta_{ij} for all i,ji,j. Recalling that the set {𝐯i}i=1N\{\mathbf{v}_{i}\}_{i=1}^{N} of eigenvectors of 𝐒\mathbf{S} is an orthonormal basis, we have that 𝐯j𝖳𝐯i=δij\mathbf{v}_{j}^{\mathsf{T}}\mathbf{v}_{i}=\delta_{ij}, so it must be that 𝐯i𝖳(𝖽𝗂𝖺𝗀(𝒉~(λi))αi𝐈)𝐯j=0\mathbf{v}_{i}^{\mathsf{T}}(\mathsf{diag}(\boldsymbol{\tilde{h}}(\lambda_{i}))-\alpha_{i}\mathbf{I})\mathbf{v}_{j}=0 for all i,ji,j. This implies for all i,ji,j that vji=0v_{ji}=0 or 𝗁~j(λi)=αi\mathsf{\tilde{h}}_{j}(\lambda_{i})=\alpha_{i}. In the case that vji0v_{ji}\neq 0 for all i,ji,j, it is clear that 𝖽𝗂𝖺𝗀(𝒉~(λi))=αi𝐈\mathsf{diag}(\boldsymbol{\tilde{h}}(\lambda_{i}))=\alpha_{i}\mathbf{I}, meaning that the frequency response at all nodes is the same. This restriction implies that the NVGF filter is a LSI graph filter. This concludes the proof. ∎
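The dichotomy established above can be illustrated with the same kind of random example (again, only an illustration): identical filter taps at every node yield a diagonal operator in the frequency domain, i.e., no frequencies are created, whereas generic node-variant taps produce off-diagonal entries.

```python
import numpy as np

# Illustration of Corollary 2 on random data: an LSI filter (same taps at every
# node) yields a diagonal operator V^T (V o (H Lambda^T)); node-variant taps do not.
rng = np.random.default_rng(1)
N, K = 8, 3
S = rng.standard_normal((N, N))
S = (S + S.T) / 2
lam, V = np.linalg.eigh(S)
Lambda = np.vander(lam, K + 1, increasing=True)      # [Lambda]_{ik} = lam_i^k

def freq_operator(H):
    """Map from xtilde to ytilde as in (24)."""
    return V.T @ (V * (H @ Lambda.T))

H_nv = rng.standard_normal((N, K + 1))               # node-variant filter taps
H_lsi = np.tile(rng.standard_normal(K + 1), (N, 1))  # identical taps at all nodes

def max_offdiag(M):
    return np.abs(M - np.diag(np.diag(M))).max()

print(max_offdiag(freq_operator(H_lsi)))             # ~1e-15: no frequency creation
print(max_offdiag(freq_operator(H_nv)))              # > 0: frequencies are created
```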

Appendix B Lipschitz Continuity of Node-Variant Graph Filters

Proof of Theorem 3.

Leveraging the fact that the filter taps of both 𝑯nv(𝐒^)\boldsymbol{H}^{\text{nv}}(\mathbf{\hat{S}}) and 𝑯nv(𝐒)\boldsymbol{H}^{\text{nv}}(\mathbf{S}) are the same, start by writing the difference between the filter outputs as

(𝑯nv(𝐒^)𝑯nv(𝐒))𝐱=k=0K𝖽𝗂𝖺𝗀(𝐡(k))(𝐒^k𝐒k)𝐱.\big{(}\boldsymbol{H}^{\text{nv}}(\mathbf{\hat{S}})-\boldsymbol{H}^{\text{nv}}(\mathbf{S})\big{)}\mathbf{x}=\sum_{k=0}^{K}\mathsf{diag}(\mathbf{h}^{(k)})\big{(}\mathbf{\hat{S}}^{k}-\mathbf{S}^{k}\big{)}\mathbf{x}. (26)

Let 𝐄=𝐒^𝐒\mathbf{E}=\mathbf{\hat{S}}-\mathbf{S} and note that 𝐄\mathbf{E} is symmetric and satisfies 𝐄=𝐒^𝐒ε\|\mathbf{E}\|=\|\mathbf{\hat{S}}-\mathbf{S}\|\leq\varepsilon by assumption. Recall that (𝐒+𝐄)k=𝐒k+r=0k1𝐒r𝐄𝐒kr1+𝐂(\mathbf{S}+\mathbf{E})^{k}=\mathbf{S}^{k}+\sum_{r=0}^{k-1}\mathbf{S}^{r}\mathbf{E}\mathbf{S}^{k-r-1}+\mathbf{C} with 𝐂\mathbf{C} such that 𝐂r=2k(kr)𝐄r𝐒kr\|\mathbf{C}\|\leq\sum_{r=2}^{k}\binom{k}{r}\|\mathbf{E}\|^{r}\|\mathbf{S}\|^{k-r}. Leveraging this fact in (26) yields

(𝑯nv\displaystyle\big{(}\boldsymbol{H}^{\text{nv}} (𝐒^)𝑯nv(𝐒))𝐱\displaystyle(\mathbf{\hat{S}})-\boldsymbol{H}^{\text{nv}}(\mathbf{S})\big{)}\mathbf{x} (27)
=k=0K𝖽𝗂𝖺𝗀(𝐡(k))r=0k1𝐒r𝐄𝐒kr1𝐱+𝐃𝐱\displaystyle=\sum_{k=0}^{K}\mathsf{diag}(\mathbf{h}^{(k)})\sum_{r=0}^{k-1}\mathbf{S}^{r}\mathbf{E}\mathbf{S}^{k-r-1}\mathbf{x}+\mathbf{D}\mathbf{x}

with 𝐃\mathbf{D} satisfying 𝐃=𝖮(𝐄2)\|\mathbf{D}\|=\mathsf{O}(\|\mathbf{E}\|^{2}), since the filter taps 𝐇\mathbf{H} define analytic frequency responses 𝗁~i(λ)\mathsf{\tilde{h}}_{i}(\lambda) with bounded derivatives for all i{1,…,N}i\in\{1,\ldots,N\}. The input graph signal 𝐱\mathbf{x} can be rewritten as 𝐱=i=1Nx~i𝐯i\mathbf{x}=\sum_{i=1}^{N}\tilde{x}_{i}\mathbf{v}_{i} using the GFT for the support matrix 𝐒\mathbf{S}. Then, (27) becomes

(𝑯nv(𝐒^)𝑯nv(𝐒))𝐱\displaystyle\big{(}\boldsymbol{H}^{\text{nv}}(\mathbf{\hat{S}})-\boldsymbol{H}^{\text{nv}}(\mathbf{S})\big{)}\mathbf{x} (28)
=i=1Nx~ik=0K𝖽𝗂𝖺𝗀(𝐡(k))r=0k1𝐒r𝐄𝐒kr1𝐯i+i=1Nx~i𝐃𝐯i.\displaystyle=\sum_{i=1}^{N}\tilde{x}_{i}\sum_{k=0}^{K}\mathsf{diag}(\mathbf{h}^{(k)})\sum_{r=0}^{k-1}\mathbf{S}^{r}\mathbf{E}\mathbf{S}^{k-r-1}\mathbf{v}_{i}+\sum_{i=1}^{N}\tilde{x}_{i}\mathbf{D}\mathbf{v}_{i}.

Using the fact that 𝐯i\mathbf{v}_{i} is an eigenvector of 𝐒\mathbf{S}, the first term in (28) can be conveniently rewritten as

i=1Nx~i\displaystyle\sum_{i=1}^{N}\tilde{x}_{i} k=0K𝖽𝗂𝖺𝗀(𝐡(k))r=0k1𝐒r𝐄𝐒kr1𝐯i\displaystyle\sum_{k=0}^{K}\mathsf{diag}(\mathbf{h}^{(k)})\sum_{r=0}^{k-1}\mathbf{S}^{r}\mathbf{E}\mathbf{S}^{k-r-1}\mathbf{v}_{i} (29)
=i=1Nx~ik=0K𝖽𝗂𝖺𝗀(𝐡(k))r=0k1λikr1𝐒r𝐄𝐯i.\displaystyle=\sum_{i=1}^{N}\tilde{x}_{i}\sum_{k=0}^{K}\mathsf{diag}(\mathbf{h}^{(k)})\sum_{r=0}^{k-1}\lambda_{i}^{k-r-1}\mathbf{S}^{r}\mathbf{E}\mathbf{v}_{i}.

Denoting by 𝐄=𝐔𝖽𝗂𝖺𝗀(𝐦)𝐔𝖳\mathbf{E}=\mathbf{U}\mathsf{diag}(\mathbf{m})\mathbf{U}^{\mathsf{T}} the eigendecomposition of 𝐄\mathbf{E} with 𝐄𝐮i=mi𝐮i\mathbf{E}\mathbf{u}_{i}=m_{i}\mathbf{u}_{i} the corresponding eigenvectors 𝐮i\mathbf{u}_{i} and eigenvalues mim_{i}, [10, Lemma 1] states that 𝐄𝐯i=mi𝐯i+𝐄U𝐯i\mathbf{E}\mathbf{v}_{i}=m_{i}\mathbf{v}_{i}+\mathbf{E}_{U}\mathbf{v}_{i} with 𝐄U8ε\|\mathbf{E}_{U}\|\leq 8\varepsilon for an appropriate matrix 𝐄U\mathbf{E}_{U} that depends on 𝐔\mathbf{U}, 𝐒\mathbf{S}, and 𝐄\mathbf{E}. Using this in (29),

i=1N\displaystyle\sum_{i=1}^{N} x~ik=0K𝖽𝗂𝖺𝗀(𝐡(k))r=0k1λikr1𝐒r𝐄𝐯i\displaystyle\tilde{x}_{i}\sum_{k=0}^{K}\mathsf{diag}(\mathbf{h}^{(k)})\sum_{r=0}^{k-1}\lambda_{i}^{k-r-1}\mathbf{S}^{r}\mathbf{E}\mathbf{v}_{i}
=i=1Nx~imik=0K𝖽𝗂𝖺𝗀(𝐡(k))r=0k1λikr1𝐒r𝐯i\displaystyle=\sum_{i=1}^{N}\tilde{x}_{i}m_{i}\sum_{k=0}^{K}\mathsf{diag}(\mathbf{h}^{(k)})\sum_{r=0}^{k-1}\lambda_{i}^{k-r-1}\mathbf{S}^{r}\mathbf{v}_{i} (30)
+i=1Nx~ik=0K𝖽𝗂𝖺𝗀(𝐡(k))r=0k1λikr1𝐒r𝐄U𝐯i\displaystyle\quad+\sum_{i=1}^{N}\tilde{x}_{i}\sum_{k=0}^{K}\mathsf{diag}(\mathbf{h}^{(k)})\sum_{r=0}^{k-1}\lambda_{i}^{k-r-1}\mathbf{S}^{r}\mathbf{E}_{U}\mathbf{v}_{i} (31)

is obtained. For the term (30), note that 𝐒r𝐯i=λir𝐯i\mathbf{S}^{r}\mathbf{v}_{i}=\lambda_{i}^{r}\mathbf{v}_{i} and that r=0k1λikr1λir=kλik1\sum_{r=0}^{k-1}\lambda_{i}^{k-r-1}\lambda_{i}^{r}=k\lambda_{i}^{k-1}, so that the term (30) is equivalent to

i=1Nx~imi\displaystyle\sum_{i=1}^{N}\tilde{x}_{i}m_{i} k=0Kk𝖽𝗂𝖺𝗀(𝐡(k))λik1𝐯i\displaystyle\sum_{k=0}^{K}k\mathsf{diag}(\mathbf{h}^{(k)})\lambda_{i}^{k-1}\mathbf{v}_{i} (32)
=i=1Nx~imi𝖽𝗂𝖺𝗀(𝒉~(λi))𝐯i\displaystyle=\sum_{i=1}^{N}\tilde{x}_{i}m_{i}\mathsf{diag}(\boldsymbol{\tilde{h}}^{\prime}(\lambda_{i}))\mathbf{v}_{i}

where 𝒉~(λ)N\boldsymbol{\tilde{h}}^{\prime}(\lambda)\in\mathbb{R}^{N} is a vector where the ithi^{\text{th}} entry is the derivative of the frequency response of node viv_{i}, i.e., [𝒉~(λ)]i=𝗁~i(λ)=ddλ𝗁~i(λ)[\boldsymbol{\tilde{h}}^{\prime}(\lambda)]_{i}=\mathsf{\tilde{h}}^{\prime}_{i}(\lambda)=\frac{d}{d\lambda}\mathsf{\tilde{h}}_{i}(\lambda). To rewrite (31), consider the following lemma, which is conveniently proved after completing the current proof.

Lemma 8.

For all i{1,,N}i\in\{1,\dots,N\}, define 𝐆iN×N\mathbf{G}_{i}\in\mathbb{R}^{N\times N} by

[𝐆i]tj={𝗁~t(λi) if j=i𝗁~t(λi)𝗁~t(λj)λiλj if ji[\mathbf{G}_{i}]_{tj}=\begin{cases}\mathsf{\tilde{h}}_{t}^{\prime}(\lambda_{i})&\text{ if }j=i\\ \frac{\mathsf{\tilde{h}}_{t}(\lambda_{i})-\mathsf{\tilde{h}}_{t}(\lambda_{j})}{\lambda_{i}-\lambda_{j}}&\text{ if }j\neq i\end{cases} (33)

where 𝗁~t\mathsf{\tilde{h}}_{t} is the frequency response at node vtv_{t} and 𝗁~t\mathsf{\tilde{h}}^{\prime}_{t} is its derivative. Then

i=1Nx~i\displaystyle\sum_{i=1}^{N}\tilde{x}_{i} k=0K𝖽𝗂𝖺𝗀(𝐡(k))r=0k1λikr1𝐒r𝐄U𝐯i\displaystyle\sum_{k=0}^{K}\mathsf{diag}(\mathbf{h}^{(k)})\sum_{r=0}^{k-1}\lambda_{i}^{k-r-1}\mathbf{S}^{r}\mathbf{E}_{U}\mathbf{v}_{i} (34)
=i=1Nx~i(𝐕𝐆i)𝐕𝖳𝐄U𝐯i.\displaystyle=\sum_{i=1}^{N}\tilde{x}_{i}\big{(}\mathbf{V}\circ\mathbf{G}_{i}\big{)}\mathbf{V}^{\mathsf{T}}\mathbf{E}_{U}\mathbf{v}_{i}.

With Lemma 8 in place, substitute (32) into (30) and (34) into (31), and then substitute these back into (28) to obtain

(𝑯nv\displaystyle\big{(}\boldsymbol{H}^{\text{nv}} (𝐒^)𝑯nv(𝐒))𝐱\displaystyle(\mathbf{\hat{S}})-\boldsymbol{H}^{\text{nv}}(\mathbf{S})\big{)}\mathbf{x} (35)
=i=1Nx~imi𝖽𝗂𝖺𝗀(𝒉~(λi))𝐯i\displaystyle=\sum_{i=1}^{N}\tilde{x}_{i}m_{i}\mathsf{diag}(\boldsymbol{\tilde{h}}^{\prime}(\lambda_{i}))\mathbf{v}_{i}
+i=1Nx~i(𝐕𝐆i)𝐕𝖳𝐄U𝐯i\displaystyle\quad+\sum_{i=1}^{N}\tilde{x}_{i}\big{(}\mathbf{V}\circ\mathbf{G}_{i}\big{)}\mathbf{V}^{\mathsf{T}}\mathbf{E}_{U}\mathbf{v}_{i}
+i=1Nx~i𝐃𝐯i.\displaystyle\quad+\sum_{i=1}^{N}\tilde{x}_{i}\mathbf{D}\mathbf{v}_{i}.

Next, bound the norm of (35) by applying the triangle inequality and bounding each of the three summands separately. For the norm of the first term in (35), the triangle inequality gives

i=1N\displaystyle\Big{\|}\sum_{i=1}^{N} x~imi𝖽𝗂𝖺𝗀(𝒉~(λi))𝐯i\displaystyle\tilde{x}_{i}m_{i}\mathsf{diag}(\boldsymbol{\tilde{h}}^{\prime}(\lambda_{i}))\mathbf{v}_{i}\Big{\|} (36)
i=1N|x~i||mi|𝖽𝗂𝖺𝗀(𝒉~(λi))𝐯i\displaystyle\leq\sum_{i=1}^{N}|\tilde{x}_{i}||m_{i}|\|\mathsf{diag}(\boldsymbol{\tilde{h}}^{\prime}(\lambda_{i}))\|\|\mathbf{v}_{i}\|

where the submultiplicative property of the operator norm was used to bound 𝖽𝗂𝖺𝗀(𝒉~(λi))𝐯i𝖽𝗂𝖺𝗀(𝒉~(λi))𝐯i\|\mathsf{diag}(\boldsymbol{\tilde{h}}^{\prime}(\lambda_{i}))\mathbf{v}_{i}\|\leq\|\mathsf{diag}(\boldsymbol{\tilde{h}}^{\prime}(\lambda_{i}))\|\|\mathbf{v}_{i}\|. Leveraging that 𝐯i=1\|\mathbf{v}_{i}\|=1, that 𝖽𝗂𝖺𝗀(𝒉~(λi))=maxj|𝗁~j(λi)|C\|\mathsf{diag}(\boldsymbol{\tilde{h}}^{\prime}(\lambda_{i}))\|=\max_{j}|\mathsf{\tilde{h}}^{\prime}_{j}(\lambda_{i})|\leq C by Lipschitz continuity, and that |mi|ε|m_{i}|\leq\varepsilon by the hypothesis that 𝐒𝐒^ε\|\mathbf{S}-\mathbf{\hat{S}}\|\leq\varepsilon, the norm can be further bounded as

i=1Nx~imi\displaystyle\Big{\|}\sum_{i=1}^{N}\tilde{x}_{i}m_{i} 𝖽𝗂𝖺𝗀(𝒉~(λi))𝐯i\displaystyle\mathsf{diag}(\boldsymbol{\tilde{h}}^{\prime}(\lambda_{i}))\mathbf{v}_{i}\Big{\|} (37)
εCi=1N|x~i|=εC𝐱~1\displaystyle\leq\varepsilon C\sum_{i=1}^{N}|\tilde{x}_{i}|=\varepsilon C\|\mathbf{\tilde{x}}\|_{1}
εCN𝐱~=εCN𝐱.\displaystyle\leq\varepsilon C\sqrt{N}\|\mathbf{\tilde{x}}\|=\varepsilon C\sqrt{N}\|\mathbf{x}\|.

Note that the inequality between the 11-norm and the 22-norm was used, as well as the fact that the GFT is a Parseval operator. For the norm of the second term in (35), the triangle inequality together with the submultiplicativity of the operator norm are used to get

\displaystyle\Big{\|} i=1Nx~i(𝐕𝐆i)𝐕𝖳𝐄U𝐯i\displaystyle\sum_{i=1}^{N}\tilde{x}_{i}\big{(}\mathbf{V}\circ\mathbf{G}_{i}\big{)}\mathbf{V}^{\mathsf{T}}\mathbf{E}_{U}\mathbf{v}_{i}\Big{\|} (38)
i=1N|x~i|𝐕𝐆i𝐕𝖳𝐄U𝐯i\displaystyle\leq\sum_{i=1}^{N}|\tilde{x}_{i}|\|\mathbf{V}\circ\mathbf{G}_{i}\|\|\mathbf{V}^{\mathsf{T}}\|\|\mathbf{E}_{U}\|\|\mathbf{v}_{i}\|
8εCNN𝐱.\displaystyle\leq 8\varepsilon CN\sqrt{N}\|\mathbf{x}\|.

For the last inequality to hold, denote by max\|\cdot\|_{\max} the entrywise maximum norm of a matrix, and note that 𝐕𝐆iN𝐕𝐆imax=Nmaxt,j{1,,N}(|[𝐕]tj||[𝐆i]tj|)NC\|\mathbf{V}\circ\mathbf{G}_{i}\|\leq N\|\mathbf{V}\circ\mathbf{G}_{i}\|_{\max}=N\max_{t,j\in\{1,\ldots,N\}}(|[\mathbf{V}]_{tj}|\ |[\mathbf{G}_{i}]_{tj}|)\leq NC for all i{1,,N}i\in\{1,\ldots,N\}, by the Lipschitz continuity of the frequency responses and the fact that 𝐕\mathbf{V} is an orthogonal matrix, so |[𝐕]tj|1|[\mathbf{V}]_{tj}|\leq 1 for all tt and jj. Also recall that 𝐄U8ε\|\mathbf{E}_{U}\|\leq 8\varepsilon. Finally, note that the inequality between the 11-norm and 22-norm of vectors together with the Parseval nature of the GFT were used. For the third term in (35), it simply holds that

\displaystyle\Big{\|} i=1Nx~i𝐃𝐯i=𝐃i=1Nx~i𝐯i=𝐃𝐱\sum_{i=1}^{N}\tilde{x}_{i}\mathbf{D}\mathbf{v}_{i}\Big{\|}=\Big{\|}\mathbf{D}\sum_{i=1}^{N}\tilde{x}_{i}\mathbf{v}_{i}\Big{\|}=\|\mathbf{D}\mathbf{x}\| (39)
𝐃𝐱𝖮(ε2)𝐱.\displaystyle\leq\|\mathbf{D}\|\|\mathbf{x}\|\leq\mathsf{O}(\varepsilon^{2})\|\mathbf{x}\|.

Substituting (37), (38), and (39) into (35) gives (10), which concludes the proof. ∎

Proof of Lemma 8.

We have for all i{1,,N}i\in\{1,\ldots,N\} that

k=0K𝖽𝗂𝖺𝗀(𝐡(k))r=0k1λikr1𝐕𝖽𝗂𝖺𝗀(𝝀r)\displaystyle\sum_{k=0}^{K}\mathsf{diag}(\mathbf{h}^{(k)})\sum_{r=0}^{k-1}\lambda_{i}^{k-r-1}\mathbf{V}\mathsf{diag}(\boldsymbol{\lambda}^{r}) (40)
=k=0K𝖽𝗂𝖺𝗀(𝐡(k))𝐕r=0k1λikr1𝖽𝗂𝖺𝗀(𝝀r).\displaystyle=\sum_{k=0}^{K}\mathsf{diag}(\mathbf{h}^{(k)})\mathbf{V}\sum_{r=0}^{k-1}\lambda_{i}^{k-r-1}\mathsf{diag}(\boldsymbol{\lambda}^{r}).

The last summation r=0k1λikr1𝖽𝗂𝖺𝗀(𝝀r)\sum_{r=0}^{k-1}\lambda_{i}^{k-r-1}\mathsf{diag}(\boldsymbol{\lambda}^{r}) is a diagonal matrix, where the jthj^{\text{th}} element of the diagonal can be written as

[\displaystyle\Big{[} r=0k1λikr1𝝀r]j=r=0k1λikr1λjr\displaystyle\sum_{r=0}^{k-1}\lambda_{i}^{k-r-1}\boldsymbol{\lambda}^{r}\Big{]}_{j}=\sum_{r=0}^{k-1}\lambda_{i}^{k-r-1}\lambda_{j}^{r} (41)
\displaystyle=[\boldsymbol{\gamma}_{i}]_{j}\coloneqq\begin{cases}k\lambda_{i}^{k-1}&\text{ if }j=i\\ \frac{\lambda_{i}^{k}-\lambda_{j}^{k}}{\lambda_{i}-\lambda_{j}}&\text{ if }j\neq i\end{cases}

so that

k=0K𝖽𝗂𝖺𝗀(𝐡(k))𝐕r=0k1λikr1𝖽𝗂𝖺𝗀(𝝀r)\displaystyle\sum_{k=0}^{K}\mathsf{diag}(\mathbf{h}^{(k)})\mathbf{V}\sum_{r=0}^{k-1}\lambda_{i}^{k-r-1}\mathsf{diag}(\boldsymbol{\lambda}^{r}) (42)
=k=0K𝖽𝗂𝖺𝗀(𝐡(k))𝐕𝖽𝗂𝖺𝗀(𝜸i).\displaystyle=\sum_{k=0}^{K}\mathsf{diag}(\mathbf{h}^{(k)})\mathbf{V}\mathsf{diag}(\boldsymbol{\gamma}_{i}).

Note that the ii subindex in 𝜸i\boldsymbol{\gamma}_{i} indicates that this is parametrized by λi\lambda_{i}, while each entry of this vector, i.e., the jthj^{\text{th}} entry, actually depends on λj\lambda_{j}. Now, remark that since we have a full matrix 𝐕\mathbf{V} in between two diagonal matrices, the matrix product does not commute.

To continue simplifying the expressions, start by considering a vector 𝐚N\mathbf{a}\in\mathbb{R}^{N}, a matrix 𝐁N×N\mathbf{B}\in\mathbb{R}^{N\times N} and another vector 𝐜N\mathbf{c}\in\mathbb{R}^{N}. Observe that

[𝖽𝗂𝖺𝗀(𝐚)𝐁𝖽𝗂𝖺𝗀(𝐜)]ij\displaystyle\big{[}\mathsf{diag}(\mathbf{a})\mathbf{B}\mathsf{diag}(\mathbf{c})\big{]}_{ij} =aicjbij\displaystyle=a_{i}c_{j}b_{ij}
𝖽𝗂𝖺𝗀(𝐚)𝐁𝖽𝗂𝖺𝗀(𝐜)\displaystyle\mathsf{diag}(\mathbf{a})\mathbf{B}\mathsf{diag}(\mathbf{c}) =𝐁(𝐚𝐜𝖳).\displaystyle=\mathbf{B}\circ(\mathbf{a}\mathbf{c}^{\mathsf{T}}).

Therefore, it can be written

k=0K𝖽𝗂𝖺𝗀(𝐡(k))𝐕𝖽𝗂𝖺𝗀(𝜸i)=𝐕(k=0K𝐡(k)𝜸i𝖳)=𝐕𝐆i\sum_{k=0}^{K}\mathsf{diag}(\mathbf{h}^{(k)})\mathbf{V}\mathsf{diag}(\boldsymbol{\gamma}_{i})=\mathbf{V}\circ\Big{(}\sum_{k=0}^{K}\mathbf{h}^{(k)}\boldsymbol{\gamma}_{i}^{\mathsf{T}}\Big{)}=\mathbf{V}\circ\mathbf{G}_{i} (43)

where 𝐆iN×N\mathbf{G}_{i}\in\mathbb{R}^{N\times N} is the matrix defined in (33);

\displaystyle\sum_{k=0}^{K}h_{tk}[\boldsymbol{\gamma}_{i}]_{j}=\begin{cases}\sum_{k=0}^{K}h_{tk}k\lambda_{i}^{k-1}&\text{ if }j=i\\ \sum_{k=0}^{K}h_{tk}\frac{\lambda_{i}^{k}-\lambda_{j}^{k}}{\lambda_{i}-\lambda_{j}}&\text{ if }j\neq i\end{cases} (44)
\displaystyle=\begin{cases}\mathsf{\tilde{h}}_{t}^{\prime}(\lambda_{i})&\text{ if }j=i\\ \frac{\mathsf{\tilde{h}}_{t}(\lambda_{i})-\mathsf{\tilde{h}}_{t}(\lambda_{j})}{\lambda_{i}-\lambda_{j}}&\text{ if }j\neq i\end{cases}=[\mathbf{G}_{i}]_{tj}.

Using this result in (40) and (42) gives that

k=0K𝖽𝗂𝖺𝗀(𝐡(k))r=0k1λikr1𝐕𝖽𝗂𝖺𝗀(𝝀r)=𝐕𝐆i.\sum_{k=0}^{K}\mathsf{diag}(\mathbf{h}^{(k)})\sum_{r=0}^{k-1}\lambda_{i}^{k-r-1}\mathbf{V}\mathsf{diag}(\boldsymbol{\lambda}^{r})=\mathbf{V}\circ\mathbf{G}_{i}. (45)

This equality, together with the fact that 𝐒r=𝐕𝖽𝗂𝖺𝗀(𝝀r)𝐕𝖳\mathbf{S}^{r}=\mathbf{V}\mathsf{diag}(\boldsymbol{\lambda}^{r})\mathbf{V}^{\mathsf{T}}, gives the desired result (34). ∎

Appendix C Optimal Unbiased Node-Variant Graph Filters

Proof of Lemma 4.

The NVGF-based estimator is given by:

𝐲^=𝑯nv(𝐒)𝐱+𝐜.\mathbf{\hat{y}}=\boldsymbol{H}^{\text{nv}}(\mathbf{S})\mathbf{x}+\mathbf{c}. (46)

This estimator is unbiased if and only if 𝔼[𝐲^]=𝔼[𝐲]=𝔼[ρ(𝐱)]=𝝁ρ\mathbb{E}[\mathbf{\hat{y}}]=\mathbb{E}[\mathbf{y}]=\mathbb{E}[\mathsf{\rho}(\mathbf{x})]=\boldsymbol{\mu}_{\rho}. By (46) and linearity of expectation, this holds precisely when

𝔼[𝐲^]=𝑯nv(𝐒)𝔼[𝐱]+𝐜=𝑯nv(𝐒)𝝁x+𝐜=𝝁ρ.\mathbb{E}[\mathbf{\hat{y}}]=\boldsymbol{H}^{\text{nv}}(\mathbf{S})\mathbb{E}[\mathbf{x}]+\mathbf{c}=\boldsymbol{H}^{\text{nv}}(\mathbf{S})\boldsymbol{\mu}_{x}+\mathbf{c}=\boldsymbol{\mu}_{\rho}. (47)

This is equivalent to the condition on 𝐜\mathbf{c} that

𝐜=𝝁ρ𝑯nv(𝐒)𝝁x,\mathbf{c}=\boldsymbol{\mu}_{\rho}-\boldsymbol{H}^{\text{nv}}(\mathbf{S})\boldsymbol{\mu}_{x}, (48)

which completes the proof. ∎

Proof of Proposition 5.

Given an unbiased estimator 𝐲^=𝑯nv(𝐒)(𝐱𝝁x)+𝝁ρ\mathbf{\hat{y}}=\boldsymbol{H}^{\text{nv}}(\mathbf{S})(\mathbf{x}-\boldsymbol{\mu}_{x})+\boldsymbol{\mu}_{\rho}, an optimal matrix 𝐇N×(K+1)\mathbf{H}^{\star}\in\mathbb{R}^{N\times(K+1)} of filter taps is one that minimizes (13). First, note that the estimator output at node viv_{i}, i.e., the ithi^{\text{th}} entry of 𝐲^\mathbf{\hat{y}}, is given by

y^i=[𝐲^]i=𝐮i𝖳𝖽𝗂𝖺𝗀(𝚲𝐡i)𝐕𝖳(𝐱𝝁x)+μρi\hat{y}_{i}=[\mathbf{\hat{y}}]_{i}=\mathbf{u}_{i}^{\mathsf{T}}\mathsf{diag}(\mathbf{\Lambda}\mathbf{h}_{i})\mathbf{V}^{\mathsf{T}}(\mathbf{x}-\boldsymbol{\mu}_{x})+\mu_{\rho i} (49)

where μρi=[𝝁ρ]i\mu_{\rho i}=[\boldsymbol{\mu}_{\rho}]_{i}, 𝐮iN\mathbf{u}_{i}\in\mathbb{R}^{N} is the ithi^{\text{th}} row of 𝐕\mathbf{V}, and 𝐡iK+1\mathbf{h}_{i}\in\mathbb{R}^{K+1} is the ithi^{\text{th}} row of the matrix 𝐇\mathbf{H}; see [19]. Note that if 𝐚,𝐛,𝐜\mathbf{a},\mathbf{b},\mathbf{c} are three NN-dimensional real vectors, then 𝐚𝖳𝖽𝗂𝖺𝗀(𝐛)𝐜=i=1Naibici\mathbf{a}^{\mathsf{T}}\mathsf{diag}(\mathbf{b})\mathbf{c}=\sum_{i=1}^{N}a_{i}b_{i}c_{i}, which means that the roles of the three vectors can be interchanged, i.e., 𝐚𝖳𝖽𝗂𝖺𝗀(𝐛)𝐜=𝐜𝖳𝖽𝗂𝖺𝗀(𝐚)𝐛\mathbf{a}^{\mathsf{T}}\mathsf{diag}(\mathbf{b})\mathbf{c}=\mathbf{c}^{\mathsf{T}}\mathsf{diag}(\mathbf{a})\mathbf{b}. Using this fact in (49) yields

y^i=(𝐱𝝁x)𝖳𝐕𝖽𝗂𝖺𝗀(𝐮i)𝚲𝐡i+μρi=𝒂i(𝐱)𝖳𝐡i+μρi\hat{y}_{i}=(\mathbf{x}-\boldsymbol{\mu}_{x})^{\mathsf{T}}\mathbf{V}\mathsf{diag}(\mathbf{u}_{i})\mathbf{\Lambda}\mathbf{h}_{i}+\mu_{\rho i}=\boldsymbol{a}_{i}(\mathbf{x})^{\mathsf{T}}\mathbf{h}_{i}+\mu_{\rho i} (50)

with

𝒂i(𝐱)=𝚲𝖳𝖽𝗂𝖺𝗀(𝐮i)𝐕𝖳(𝐱𝝁x).\boldsymbol{a}_{i}(\mathbf{x})=\mathbf{\Lambda}^{\mathsf{T}}\mathsf{diag}(\mathbf{u}_{i})\mathbf{V}^{\mathsf{T}}(\mathbf{x}-\boldsymbol{\mu}_{x}).

The objective function in (13) can be rewritten as

𝔼[𝐲^𝐲22]=𝔼[i=1N(y^iyi)2]=i=1N𝔼[(y^iyi)2].\mathbb{E}\big{[}\|\mathbf{\hat{y}}-\mathbf{y}\|_{2}^{2}\big{]}=\mathbb{E}\Big{[}\sum_{i=1}^{N}(\hat{y}_{i}-y_{i})^{2}\Big{]}=\sum_{i=1}^{N}\mathbb{E}\big{[}(\hat{y}_{i}-y_{i})^{2}\big{]}. (51)

Since each y^i\hat{y}_{i} depends only on 𝐡i\mathbf{h}_{i}, as indicated in (50), minimizing (51) over 𝐇\mathbf{H} is equivalent to minimizing each of the summands in (51) over 𝐡i\mathbf{h}_{i}. Therefore, (13) becomes equivalent to the following system of NN optimization problems over each of the rows of 𝐇\mathbf{H}:

min𝐡iK+1𝔼[(y^iyi)2],i{1,,N}.\min_{\mathbf{h}_{i}\in\mathbb{R}^{K+1}}\mathbb{E}\big{[}(\hat{y}_{i}-y_{i})^{2}\big{]},\quad i\in\{1,\ldots,N\}. (52)

Substituting (50) into the ithi^{\text{th}} objective function of (52) gives

𝔼[(y^iyi)2]=𝔼[(𝒂i(𝐱)𝖳𝐡i+μρiρ(xi))2]\displaystyle\mathbb{E}\big{[}(\hat{y}_{i}-y_{i})^{2}\big{]}=\mathbb{E}\big{[}\big{(}\boldsymbol{a}_{i}(\mathbf{x})^{\mathsf{T}}\mathbf{h}_{i}+\mu_{\rho i}-\rho(x_{i})\big{)}^{2}\big{]} (53)
=𝐡i𝖳𝔼[𝒂i(𝐱)𝒂i(𝐱)𝖳]𝐡i\displaystyle=\mathbf{h}_{i}^{\mathsf{T}}\mathbb{E}\big{[}\boldsymbol{a}_{i}(\mathbf{x})\boldsymbol{a}_{i}(\mathbf{x})^{\mathsf{T}}\big{]}\mathbf{h}_{i}
2𝔼[(ρ(xi)μρi)𝒂i(x)𝖳]𝐡i\displaystyle\quad-2\mathbb{E}\big{[}(\rho(x_{i})-\mu_{\rho i})\boldsymbol{a}_{i}(x)^{\mathsf{T}}\big{]}\mathbf{h}_{i}
+𝔼[(ρ(xi)μρi)2].\displaystyle\quad+\mathbb{E}\big{[}(\rho(x_{i})-\mu_{\rho i})^{2}\big{]}.

Now, with

𝐑i\displaystyle\mathbf{R}_{i} =𝔼[𝒂i(𝐱)𝒂i(𝐱)𝖳]\displaystyle=\mathbb{E}\big{[}\boldsymbol{a}_{i}(\mathbf{x})\boldsymbol{a}_{i}(\mathbf{x})^{\mathsf{T}}\big{]}
=𝚲𝖳𝖽𝗂𝖺𝗀(𝐮i)𝐕𝖳𝐂x𝐕𝖽𝗂𝖺𝗀(𝐮i)𝚲,\displaystyle=\mathbf{\Lambda}^{\mathsf{T}}\mathsf{diag}(\mathbf{u}_{i})\mathbf{V}^{\mathsf{T}}\mathbf{C}_{x}\mathbf{V}\mathsf{diag}(\mathbf{u}_{i})\mathbf{\Lambda},
𝐩i\displaystyle\mathbf{p}_{i} =𝔼[(ρ(xi)μρi)𝒂i(x)]\displaystyle=\mathbb{E}\big{[}(\rho(x_{i})-\mu_{\rho i})\boldsymbol{a}_{i}(x)\big{]}
=𝚲𝖳𝖽𝗂𝖺𝗀(𝐮i)𝐕𝖳𝔼[(ρ(xi)μρi)(𝐱𝝁x)]\displaystyle=\mathbf{\Lambda}^{\mathsf{T}}\mathsf{diag}(\mathbf{u}_{i})\mathbf{V}^{\mathsf{T}}\mathbb{E}\big{[}(\rho(x_{i})-\mu_{\rho i})(\mathbf{x}-\boldsymbol{\mu}_{x})\big{]}

the ithi^{\text{th}} objective (53) becomes

𝔼[(y^iyi)2]=𝐡i𝖳𝐑i𝐡i2𝐩i𝖳𝐡i+𝔼[(ρ(xi)μρi)2].\mathbb{E}\big{[}(\hat{y}_{i}-y_{i})^{2}\big{]}=\mathbf{h}_{i}^{\mathsf{T}}\mathbf{R}_{i}\mathbf{h}_{i}-2\mathbf{p}_{i}^{\mathsf{T}}\mathbf{h}_{i}+\mathbb{E}\big{[}(\rho(x_{i})-\mu_{\rho i})^{2}\big{]}. (54)

Since 𝐑i\mathbf{R}_{i} is a positive semidefinite matrix, (54) is a convex quadratic function in 𝐡i\mathbf{h}_{i}. Therefore, setting the gradient of this function to zero, it can be concluded that 𝐡i\mathbf{h}_{i}^{\star} is a global minimizer of (54) if and only if 2𝐑i𝐡i2𝐩i=𝟎2\mathbf{R}_{i}\mathbf{h}_{i}^{\star}-2\mathbf{p}_{i}=\mathbf{0}, or, equivalently,

𝐑i𝐡i=𝐩i.\mathbf{R}_{i}\mathbf{h}_{i}^{\star}=\mathbf{p}_{i}. (55)

This completes the proof. ∎
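The solution (55) can be assembled directly from data. The following sketch is illustrative only: the moments are estimated by Monte Carlo, a ReLU stands in for the nonlinearity ρ, and all data is randomly generated. It builds R_i and p_i for each node, solves (55) row by row, and forms the unbiased estimator of Lemma 4.

```python
import numpy as np

# Illustrative design of the optimal unbiased NVGF (Proposition 5), with the
# first- and second-order moments estimated from random samples (assumed data).
rng = np.random.default_rng(2)
N, K, T = 8, 3, 5000
S = rng.standard_normal((N, N))
S = (S + S.T) / 2
lam, V = np.linalg.eigh(S)
Lambda = np.vander(lam, K + 1, increasing=True)       # [Lambda]_{ik} = lam_i^k
rho = lambda z: np.maximum(z, 0.0)                    # nonlinearity to be mimicked

X = rng.standard_normal((T, N))                       # samples of the input signal
mu_x, mu_rho = X.mean(axis=0), rho(X).mean(axis=0)

H_opt = np.zeros((N, K + 1))
for i in range(N):
    u_i = V[i, :]                                     # i-th row of V
    A = (X - mu_x) @ V @ np.diag(u_i) @ Lambda        # rows are a_i(x_t)^T
    R_i = A.T @ A / T                                 # estimate of E[a_i(x) a_i(x)^T]
    p_i = A.T @ (rho(X[:, i]) - mu_rho[i]) / T        # estimate of E[(rho(x_i)-mu) a_i(x)]
    H_opt[i] = np.linalg.lstsq(R_i, p_i, rcond=None)[0]   # solve (55); R_i may be singular

# Unbiased NVGF estimator (Lemma 4): yhat = H^nv(S)(x - mu_x) + mu_rho.
def nvgf(H, z):
    return sum(np.diag(H[:, k]) @ np.linalg.matrix_power(S, k) @ z
               for k in range(K + 1))

yhat = nvgf(H_opt, X[0] - mu_x) + mu_rho              # estimate of rho(X[0])
```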

Appendix D Authorship Attribution

Problem objective. Consider a set 𝜣a\boldsymbol{\mathit{\Theta}}_{a} of texts that are known to be written by author aa. Given a new text θ𝜣a\theta\notin\boldsymbol{\mathit{\Theta}}_{a}, the objective is to determine whether θ\theta was written by aa or not.

Approach. The approach is to leverage word adjacency networks (WAN) built from the set of known texts 𝜣a\boldsymbol{\mathit{\Theta}}_{a} to build a graph support 𝒢\mathcal{G}, and then use the word frequency count of the function words in θ\theta as the graph signal 𝐱\mathbf{x}. Then, 𝐱\mathbf{x} is processed through a GNN Φ\mathsf{\Phi} (in any of the variants discussed in Section 5) to obtain a prediction of whether the text was written by author aa.

Figure 3: Comparison between architectures for all authors. It is observed that, in most cases, the Learn NVGF exhibits significantly better performance than the GAT [26], which in turn is better than the GCN [24], and all of them are better than the SGC [25]. The error bars reflect one third of the estimated standard deviation.

Graph construction. The WAN can be modeled by a graph 𝒢\mathcal{G} where the set of nodes 𝒱\mathcal{V} consists of function words (i.e., words that do not carry semantic meaning but express grammatical relationships among other words within a sentence, e.g., “the”, “and”, “a”, “of”, “to”, “for”, “but”). The existence of an edge connecting words and the corresponding edge weight are determined as follows. Consider each text θ𝜣a\theta\in\boldsymbol{\mathit{\Theta}}_{a} and split it into a total of SS sentences {ωθs}s=1S\{\mathsf{\omega}_{\theta}^{s}\}_{s=1}^{S} where each sentence ωθs:𝒱{}\mathsf{\omega}_{\theta}^{s}:\mathbb{N}\to\mathcal{V}\cup\{\emptyset\} gives the function word present in each position within a sentence, or \emptyset if the word is not a function word. Then, given a discount factor α(0,1)\alpha\in(0,1) and a window length DD, the edge weight wijw_{ij} between nodes viv_{i} and vjv_{j} is computed as

wij=t:θt𝜣a\displaystyle w_{ij}=\sum_{t:\theta_{t}\in\boldsymbol{\mathit{\Theta}}_{a}} s,e𝟣{ωθts(e)=vi}\displaystyle\sum_{s,e}\mathsf{1}\{\mathsf{\omega}_{\theta_{t}}^{s}(e)=v_{i}\} (56)
d=1Dαd1𝟣{ωθts(e+d)=vj}\displaystyle\sum_{d=1}^{D}\alpha^{d-1}\mathsf{1}\{\mathsf{\omega}_{\theta_{t}}^{s}(e+d)=v_{j}\}

where 𝟣{𝒜}\mathsf{1}\{\mathcal{A}\} is the indicator function that takes value 11 when condition 𝒜\mathcal{A} is met and 0 otherwise; see [5]. Equation (56) essentially computes each weight by going text by text θt𝜣a\theta_{t}\in\boldsymbol{\mathit{\Theta}}_{a} and sentence by sentence s{1,…,S}s\in\{1,\ldots,S\}, looking position by position ee for the corresponding word ωθts(e)\mathsf{\omega}_{\theta_{t}}^{s}(e) to match the function word viv_{i}. Once the word viv_{i} is matched, the following DD words within the window are examined and, if the (e+d)th(e+d)^{\text{th}} word matches vjv_{j}, then the discounted weight αd1\alpha^{d-1} is added. In this way, not only the co-occurrence of words but also their proximity counts in establishing the WAN. Note that the edge weight function (56) is asymmetric, which results in a directed graph.
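A direct implementation of the weight rule (56) could look like the following sketch (illustrative only; the tokenization into sentences of function-word positions, the discount α, and the window length D are assumed inputs rather than values prescribed here).

```python
from collections import defaultdict

def wan_weights(texts, function_words, alpha=0.75, D=10):
    """Directed WAN weights per (56). texts: list of texts, each a list of
    sentences, each a list of tokens (non-function words may be None)."""
    w = defaultdict(float)                        # w[(vi, vj)] accumulates weights
    vocab = set(function_words)
    for sentences in texts:                       # text by text, theta_t
        for sentence in sentences:                # sentence by sentence, s
            for e, vi in enumerate(sentence):     # position by position, e
                if vi not in vocab:
                    continue
                for d in range(1, D + 1):         # window of D following words
                    if e + d >= len(sentence):
                        break
                    vj = sentence[e + d]
                    if vj in vocab:
                        w[(vi, vj)] += alpha ** (d - 1)   # discounted proximity
    return w
```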

Graph signal processing description. The support matrix is chosen to be

𝐒=12(𝐃1𝐖+𝐖𝖳𝐃1)\mathbf{S}=\frac{1}{2}(\mathbf{D}^{-1}\mathbf{W}+\mathbf{W}^{\mathsf{T}}\mathbf{D}^{-1}) (57)

where 𝐖\mathbf{W} is the adjacency matrix with entry (i,j)(i,j) equal to wijw_{ij} and 𝐃=𝖽𝗂𝖺𝗀(𝐖𝟏)\mathbf{D}=\mathsf{diag}(\mathbf{W}\mathbf{1}) is the degree matrix. The operation 𝐃1𝐖\mathbf{D}^{-1}\mathbf{W} normalizes the matrix by rows, and the support matrix 𝐒\mathbf{S} comes from symmetrizing the matrix by adopting one half of the weight in each direction. The graph signal 𝐱\mathbf{x} contains the normalized frequency count of each function word

[\mathbf{x}]_{i}=\frac{\sum_{s,e}\mathsf{1}\{\mathsf{\omega}_{\theta}^{s}(e)=v_{i}\}}{\sum_{j:v_{j}\in\mathcal{V}}\sum_{s,e}\mathsf{1}\{\mathsf{\omega}_{\theta}^{s}(e)=v_{j}\}}. (58)

Note that the graph signals are normalized by the 11-norm and can therefore be interpreted as the probability of finding the function word viv_{i} in text θ\theta.
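As a concrete reference, the following sketch implements the support matrix (57) and the normalized graph signal (58); it assumes that W is the (asymmetric) WAN adjacency matrix with positive row sums and that counts is the vector of raw function-word counts of a text.

```python
import numpy as np

# Sketch of the GSP setup for authorship attribution: the symmetrized,
# row-normalized support matrix (57) and the 1-norm-normalized signal (58).
def support_matrix(W):
    D_inv = np.diag(1.0 / W.sum(axis=1))      # D^{-1}, with D = diag(W 1)
    P = D_inv @ W                             # row-normalized WAN
    return 0.5 * (P + P.T)                    # S = (D^{-1} W + W^T D^{-1}) / 2

def graph_signal(counts):
    counts = np.asarray(counts, dtype=float)  # raw function-word counts of a text
    return counts / counts.sum()              # word probabilities, cf. (58)
```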

Dataset and code. The dataset is presented in [5], and is publicly available at http://github.com/alelab-upenn/graph-neural-networks/tree/master/datasets/authorshipData. The dataset consists of 2121 authors from the 19th\text{19}^{\text{th}} century. The corpus of each author is split in texts of about 10001000 words. The WANs and the word frequency count for each text are already present in the dataset. The texts are split, at random, in 95%95\% for training and 5%5\% for testing. The weights of the WANs of the 95%95\% texts selected for training are averaged to obtain an average WAN from which the support matrix 𝐒\mathbf{S} is obtained by following (57). The resulting support matrix is further normalized to have unit spectral norm. Note that no text from the test set is used in building the WAN. From the texts in the training set, 8%8\% are further separated to build the validation set. Denote by NatrainN_{a}^{\text{train}}, NavalidN_{a}^{\text{valid}}, and NatestN_{a}^{\text{test}} the number of texts in the training, validation, and test set, respectively. The word frequency counts for each text are normalized as in (58) and used as graph signals. A label of 11 is attached to these signals to indicate that they have been written by author aa in a supervised learning context. To complete the datasets, an equal number of texts are obtained at random from other contemporary authors, and their frequency word counts are normalized and incorporated into the corresponding sets and assigned a label of 0. In this way, the resulting training, validation, and test set have 2Natrain2N_{a}^{\text{train}}, 2Navalid2N_{a}^{\text{valid}}, and 2Natest2N_{a}^{\text{test}} samples, respectively (half of them labeled with 11 and the other half with 0). The code to run the simulations will be provided as a .zip file.

Figure 4: Relative change in performance with respect to the GCNN baseline, divided into 33 groups of similar behavior. Group 1 (53%53\% of the authors) includes those where the Learn NVGF has a comparable performance to the GCNN and both of them are better than the LSIGF, showing that frequency creation plays a vital role in improving performance. Group 2 (33%33\% of the authors) shows that oftentimes, the Learn NVGF can improve significantly over the GCNN, suggesting that the nonlinear nature of the mapping may have a negative impact. Group 3 (14%14\% of the authors) consists of those cases when the Learn NVGF, the GCNN, and the LSIGF all exhibit comparable performance.

Architectures for comparison. The Learn NVGF architecture in (18) is compared against three of the most popular GNN architectures in the literature, namely, the GCN [24], the SGC [25], and the GAT [26]. The Learn NVGF adopts the support matrix in (57) and consists of a LSI graph filter (a graph convolution) that outputs FNVGFF_{\text{NVGF}} features (hidden units) and has KNVGF+1K_{\text{NVGF}}+1 filter taps, followed by a NVGF that takes those FNVGFF_{\text{NVGF}} input features and applies a NVGF with KNVGF+1K_{\text{NVGF}}+1 filter taps, and also outputs FNVGFF_{\text{NVGF}} features. The GCN considers 𝐒\mathbf{S} in (57) to be the adjacency matrix of the graph and thus adopts a support matrix given by 𝐒GCN=𝐃~1/2(𝐈+𝐒)𝐃~1/2\mathbf{S}_{\text{GCN}}=\mathbf{\tilde{D}}^{-1/2}(\mathbf{I}+\mathbf{S})\mathbf{\tilde{D}}^{-1/2} where 𝐃~=𝖽𝗂𝖺𝗀((𝐈+𝐒)𝟏)\mathbf{\tilde{D}}=\mathsf{diag}((\mathbf{I}+\mathbf{S})\mathbf{1}) as indicated in [24]. The graph convolutional layer consists of FGCNF_{\text{GCN}} LSI graph filters, each one of the form hf𝐒GCNh^{f}\mathbf{S}_{\text{GCN}} for f{1,,FGCN}f\in\{1,\ldots,F_{\text{GCN}}\}. This is followed by a ReLU activation function. The SGC considers the same support matrix 𝐒GCN\mathbf{S}_{\text{GCN}} as the GCN and learns FSGCF_{\text{SGC}} output features, where each filter in the bank is of the form hf𝐒GCNKSGCh^{f}\mathbf{S}_{\text{GCN}}^{K_{\text{SGC}}} for a predetermined hyperparameter KSGCK_{\text{SGC}}. This is followed by a ReLU activation function. Finally, for the GAT, the support matrix is 𝐒\mathbf{S} (although this is relevant only in terms of the nonzero elements, which are the same in 𝐒\mathbf{S} and 𝐒GCN\mathbf{S}_{\text{GCN}}—except for the diagonal elements—since the weights of each edge are learned through the attention mechanism), and the output features are FGATF_{\text{GAT}} learned through the attention mechanism exactly as described in [26]. All four architectures are followed by a readout layer consisting of a learnable linear transform that maps the NFNF output features into a vector of size 22 that is interpreted to be the logits for the two classes (either the text is written by the author or not).

Architectures for analysis. To analyze the impact of the nonlinearity, four architectures are considered. First, as a baseline, a GCNN consisting of a graph convolutional layer that outputs FF features, with K+1K+1 filter taps, followed by a ReLU activation function, as indicated in (16). Second, the Learn NVGF that replaces the ReLU activation function by a NVGF with filter taps that can be learned from data as in (18). Third, the Design NVGF where the nonlinear activation function of the GCNN is replaced by a NVGF, but one whose filter taps are designed to mimic the ReLU (the LSI graph filters in the Design NVGF are the same ones learned by the GCNN). Fourth, a LSI graph filter with K+1K+1 taps that outputs FF features. Note that the GCNN is a nonlinear architecture, while the other three are linear. Also, note that the first three architectures are capable of creating frequency content while the fourth one, the LSIGF, is not. For fair comparison, the values of KK and FF are the same for all architectures.
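To fix ideas, a possible PyTorch implementation of the two building blocks involved is sketched below (this is an assumed re-implementation for illustration, not the code released with the paper): an LSI graph filter bank and an NVGF layer that replaces the pointwise activation, which can be composed to form the GCNN, Learn NVGF, and LSIGF variants described above.

```python
import torch
import torch.nn as nn

class LSIGraphFilter(nn.Module):
    """Graph convolution y = sum_k S^k x h_k, mapping F_in to F_out features."""
    def __init__(self, S, F_in, F_out, K):
        super().__init__()
        self.register_buffer("S", S)                       # N x N support (tensor)
        self.h = nn.Parameter(0.1 * torch.randn(K + 1, F_in, F_out))

    def forward(self, x):                                  # x: batch x N x F_in
        z, out = x, x @ self.h[0]
        for k in range(1, self.h.shape[0]):
            z = torch.einsum("ij,bjf->bif", self.S, z)     # S^k x, computed recursively
            out = out + z @ self.h[k]
        return out

class NVGF(nn.Module):
    """Node-variant graph filter y = sum_k diag(h^(k)) S^k x, applied per feature."""
    def __init__(self, S, N, F, K):
        super().__init__()
        self.register_buffer("S", S)
        self.H = nn.Parameter(0.1 * torch.randn(K + 1, N, F))  # taps per node and feature

    def forward(self, x):                                  # x: batch x N x F
        z, out = x, self.H[0] * x
        for k in range(1, self.H.shape[0]):
            z = torch.einsum("ij,bjf->bif", self.S, z)
            out = out + self.H[k] * z
        return out

# Example compositions (S, N, F, K are assumed to be defined):
# gcnn       = nn.Sequential(LSIGraphFilter(S, 1, F, K), nn.ReLU())
# learn_nvgf = nn.Sequential(LSIGraphFilter(S, 1, F, K), NVGF(S, N, F, K))
# lsigf      = LSIGraphFilter(S, 1, F, K)
```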

Training. The loss function during training is a cross-entropy loss between the logits obtained from the output of each architecture, and the labels in the training set. All the architectures are trained by using an ADAM optimizer [32] with forgetting factors 0.90.9 and 0.9990.999, and with a learning rate η\eta. The training is carried out for 2525 epochs with batches of size 2020. Dropout with probability 0.50.5 is included before the readout layer, during training, to avoid overfitting. At testing time, the weights are correspondingly rescaled. Validation is run every 55 training steps. The learned filters that result in the best performance on the validation set are kept and used during the testing phase. For each experiment, 1010 realizations of the random train/test split are carried out (also randomizing the selection of the texts by other authors that complete the training, validation, and test sets). The average evaluation performance (measured as the classification error, i.e., the ratio of texts wrongly attributed in the test set) is reported, together with the estimated standard deviation.
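The training protocol just described can be summarized in the following compressed sketch (hyperparameter values, tensor shapes, and the model and readout objects are assumptions for illustration; the dropout layer before the readout is omitted for brevity).

```python
import torch
import torch.nn as nn

def train(model, readout, train_set, valid_set, lr=0.005, epochs=25, batch=20):
    """Cross-entropy training with ADAM(0.9, 0.999); validate every 5 steps."""
    params = list(model.parameters()) + list(readout.parameters())
    opt = torch.optim.Adam(params, lr=lr, betas=(0.9, 0.999))
    loss_fn = nn.CrossEntropyLoss()
    X, y = train_set                           # X: samples x N x 1, y: 0/1 labels
    best_err, best_state, step = float("inf"), None, 0
    for _ in range(epochs):
        perm = torch.randperm(X.shape[0])
        for b in range(0, X.shape[0], batch):
            idx = perm[b:b + batch]
            logits = readout(model(X[idx]).flatten(1))     # N*F features -> 2 logits
            loss = loss_fn(logits, y[idx])
            opt.zero_grad(); loss.backward(); opt.step()
            step += 1
            if step % 5 == 0:                              # validation step
                with torch.no_grad():
                    Xv, yv = valid_set
                    pred = readout(model(Xv).flatten(1)).argmax(dim=1)
                    err = (pred != yv).float().mean().item()
                if err < best_err:                         # keep the best filters
                    best_err = err
                    best_state = {k: v.clone() for k, v in model.state_dict().items()}
    return best_state, best_err
```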

Hyperparameter selection. The number of hidden units FNVGFF_{\text{NVGF}}, FGCNF_{\text{GCN}}, FSGCF_{\text{SGC}}, and FGATF_{\text{GAT}}, the polynomial order KNVGFK_{\text{NVGF}} and KSGCK_{\text{SGC}}, and the learning rate ηNVGF\eta_{\text{NVGF}}, ηGCN\eta_{\text{GCN}}, ηSGC\eta_{\text{SGC}}, and ηGAT\eta_{\text{GAT}} are selected, independently for each architecture, from the set {16,32,64}\{16,32,64\} for the number of hidden units, {2,3,4}\{2,3,4\} for the polynomial order, and {0.001,0.005,0.01}\{0.001,0.005,0.01\} for the learning rate. In other words, all possible combinations of these three parameters are run for each architecture, and the ones that show the best performance on the test set are kept.

Table 1: Hyperparameters for each architecture that lead to the best performance
Learn NVGF GCN [24] SGC [25] GAT [26]
Authors ηNVGF\eta_{\text{NVGF}} FNVGFF_{\text{NVGF}} KNVGFK_{\text{NVGF}} ηGCN\eta_{\text{GCN}} FGCNF_{\text{GCN}} ηSGC\eta_{\text{SGC}} FSGCF_{\text{SGC}} KSGCK_{\text{SGC}} ηGAT\eta_{\text{GAT}} FGATF_{\text{GAT}}
Abbott 0.001 32 2 0.005 64 0.01 64 2 0.01 64
Alcott 0.01 16 4 0.01 32 0.01 32 2 0.01 32
Alger 0.005 16 4 0.01 32 0.001 64 2 0.005 64
Allen 0.005 64 2 0.01 64 0.005 32 2 0.005 64
Austen 0.001 32 3 0.01 64 0.005 64 2 0.01 64
Brontë 0.001 32 3 0.01 64 0.005 16 2 0.01 16
Cooper 0.005 16 3 0.005 64 0.005 64 2 0.01 64
Dickens 0.001 32 4 0.005 32 0.005 64 2 0.01 64
Doyle 0.005 16 3 0.005 16 0.005 64 2 0.005 32
Garland 0.005 32 3 0.01 64 0.005 64 2 0.01 32
Hawthorne 0.01 16 3 0.005 64 0.01 64 2 0.01 16
Irving 0.005 16 3 0.005 64 0.01 64 2 0.005 64
James 0.001 32 2 0.005 64 0.01 64 2 0.01 64
Jewett 0.001 16 2 0.01 64 0.01 64 2 0.01 32
Melville 0.005 16 4 0.01 64 0.005 64 2 0.005 64
Page 0.005 16 2 0.005 64 0.005 64 2 0.005 64
Poe 0.005 32 4 0.005 64 0.005 32 2 0.01 32
Stevenson 0.001 16 4 0.005 64 0.005 32 2 0.01 64
Thoreau 0.005 64 3 0.01 64 0.01 16 2 0.01 32
Twain 0.005 16 2 0.01 16 0.001 64 2 0.01 32
Wharton 0.005 16 4 0.01 64 0.01 64 2 0.001 64

Experiment 1: Performance comparison. In the first experiment, the performance is measured by classification error (ratio of texts in the test set that were wrongly attributed), and the comparison between the Learn NVGF, the GCN [24], the SGC [25], and the GAT [26] is carried out for all 2121 authors. The hyperparameters used for each architecture and each author are presented in Table 1 (recall that the hyperparameters that exhibit the best performance—the lowest classification error on the test set—are the ones used).

Results are shown in Figure 3. The general trend observed is for the Learn NVGF to exhibit better performance than the GAT, which in turn is better than the GCN, and all of them are better than SGC. The differences are usually significant between the four architectures, with a marked improvement by the Learn NVGF. It is observed, however, that for Doyle, Hawthorne, and Stevenson, the performance of the Learn NVGF and the GAT is comparable.

Experiment 2: Impact of nonlinearities. In the second experiment, the objective is to decouple the contribution made by frequency creation from that made by the nonlinear nature of the architecture. To do this, the GCNN architecture is taken as a baseline (a nonlinear, frequency-creating architecture), and the relative change in performance of the three other architectures is measured (Learn NVGF and Design NVGF, both linear and frequency-creating architectures, and LSIGF, which is linear but cannot create frequencies). The learning rate η\eta, the number of hidden units FF, and order of the filters KK are the same for all four architectures, and are those found in Table 1 under the column of Learn NVGF. The results showing relative change in the mean classification error with respect to the GCNN are shown in Figure 4.

The authors have been classified into three groups according to their relative behavior. Group 1 consists of those authors where the Learn NVGF has a comparable performance with respect to the GCNN (i.e., 5%5\% or less relative variation in performance), and both have a considerably better performance than the LSIGF. These results suggest that, for this group of authors (the most numerous one, consisting of 1111 authors, or 53%53\% of the total), the improvement in performance is mostly due to the capability of the architectures to create frequency, and not necessarily due to the nonlinear nature of the GCNN.

Group 2 consists of those authors where the Learn NVGF exhibits a better performance than the GCNN, which in turn exhibits a similar performance to the LSIGF (except for Jewett and Thoreau, where the GCNN still exhibits considerably better performance than the LSIGF). This group of 7 authors (33%33\% of the total) suggests that in some cases, frequency creation is the key contributor to improved performance, and that the inclusion of a nonlinear mapping could possibly have a negative impact. In essence, we observe that a linear architecture capable of creating frequencies outperforms the rest, and that a frequency-creating nonlinear architecture performs similarly to a linear architecture that cannot create frequencies. This suggests that the relationship between input and output is approximately linear with frequencies being created, but that attempting to model this frequency creation with a nonlinear architecture is not a good approach.

Finally, Group 3 consists of those authors for which all architectures exhibit a similar performance. This group consists of 33 authors (14%14\% of the total). This may be explained by the fact that the high-frequency content for these authors does not carry useful information, and thus the role of frequency creation is less relevant.

With respect to the Design NVGF, a somewhat erratic behavior is observed. In roughly half of the cases (1111 authors), the performance of the Design NVGF closely resembles the performance of the GCNN, as expected. In a few other cases (Alcott, James, Cooper, and Allen), the Design NVGF results are better, and in the rest they are considerably worse (Brontë, Dickens, Hawthorne, Page, Stevenson, and Melville). The Design NVGF architecture is designed to mimic the GCNN, but its accurate design depends on good estimates of the first and second moments of the data. Thus, one possible explanation is that there is not enough data to get good estimates of these values. Another possible explanation is that higher-order moments have a larger impact in these cases, and the NVGF, being linear, is not able to accurately capture them.

Overall, this second experiment shows the importance of frequency creation in improving performance, especially when high-frequency content is significant. Between the two ways of creating frequencies (linear and nonlinear), we see that in most cases they perform essentially the same. But there are cases in which creating frequency content in a linear manner is better (Group 2). In any case, this experiment shows the importance of frequency creation and calls for further research on what other contributions the nonlinear nature of the activation function makes to performance.

Appendix E Movie Recommendation

Figure 5: Performance comparison
Figure 6: Relative change

Problem objective. Consider a set 𝒱={v1,…,vN}\mathcal{V}=\{v_{1},\ldots,v_{N}\} of NN items, and let 𝗑~t:𝒱{}\mathsf{\tilde{x}}_{t}:\mathcal{V}\to\mathbb{R}\cup\{\emptyset\} be the ratings assigned by user tt to these items, i.e., 𝗑~t(vi)\mathsf{\tilde{x}}_{t}(v_{i}) is the rating assigned by user tt to item viv_{i}; the function yields \emptyset if the item has not been rated. The objective is to estimate what rating 𝗑~t(vr)\mathsf{\tilde{x}}_{t}(v_{r}) a user would give to some target item vr𝒱v_{r}\in\mathcal{V} not yet rated [3, 4].

Approach. The idea is to create a graph 𝒢\mathcal{G} of rating similarities and take the graph signal 𝐱\mathbf{x} to be the ratings given by a user to some of the items. Then 𝐱\mathbf{x} is processed through a GNN Φ\mathsf{\Phi} (in any of the variants discussed in Section 5) to obtain the estimated rating the user would give to the target item vrv_{r}, by looking at the output value of the GNN on node vrv_{r}. In short, this amounts to an interpolation problem.

Graph construction. Let 𝒯~={𝐱~t}t\mathcal{\tilde{T}}=\{\mathbf{\tilde{x}}_{t}\}_{t} be a set where 𝐱~tN\mathbf{\tilde{x}}_{t}\in\mathbb{R}^{N} collects the ratings given by user tt to some of the items, such that [𝐱~t]i=𝗑~t(vi)[\mathbf{\tilde{x}}_{t}]_{i}=\mathsf{\tilde{x}}_{t}(v_{i})\in\mathbb{R} if item viv_{i} has been rated and [𝐱~t]i=0[\mathbf{\tilde{x}}_{t}]_{i}=0 if 𝗑~t(vi)=\mathsf{\tilde{x}}_{t}(v_{i})=\emptyset. Denote by 𝒯~i={𝐱~t𝒯~:[𝐱~t]i>0}\mathcal{\tilde{T}}_{i}=\{\mathbf{\tilde{x}}_{t}\in\mathcal{\tilde{T}}:[\mathbf{\tilde{x}}_{t}]_{i}>0\} the set of users that have rated item viv_{i}, and by 𝒯~ij=𝒯~i𝒯~j\mathcal{\tilde{T}}_{ij}=\mathcal{\tilde{T}}_{i}\cap\mathcal{\tilde{T}}_{j} the set of users that have rated both items viv_{i} and vjv_{j}. Define the mean intersection score as μij=|𝒯~ij|1𝐱~t𝒯~ij[𝐱~t]i\mu_{ij}=|\mathcal{\tilde{T}}_{ij}|^{-1}\sum_{\mathbf{\tilde{x}}_{t}\in\mathcal{\tilde{T}}_{ij}}[\mathbf{\tilde{x}}_{t}]_{i}. Note that this is the rating average for item viv_{i}, computed among those users that have rated both viv_{i} and vjv_{j}. Then, the rating similarity between items viv_{i} and vjv_{j} is computed by means of the Pearson correlation as

wij=1|𝒯~ij|𝐱~t𝒯~ij([𝐱~t]iμij)([𝐱~t]jμij).w_{ij}=\frac{1}{|\mathcal{\tilde{T}}_{ij}|}\sum_{\mathbf{\tilde{x}}_{t}\in\mathcal{\tilde{T}}_{ij}}\big{(}[\mathbf{\tilde{x}}_{t}]_{i}-\mu_{ij}\big{)}\big{(}[\mathbf{\tilde{x}}_{t}]_{j}-\mu_{ij}\big{)}. (59)

These weights can be used to build a complete graph 𝒢~=(𝒱,~)\mathcal{\tilde{G}}=(\mathcal{V},\mathcal{\tilde{E}}) where 𝒱\mathcal{V} is the set of items and ~=𝒱×𝒱\mathcal{\tilde{E}}=\mathcal{V}\times\mathcal{V} is the complete set of edges with wijw_{ij} being the corresponding weights. In what follows, the 1010-nearest neighbor graph 𝒢=(𝒱,)\mathcal{G}=(\mathcal{V},\mathcal{E}) of 𝒢~\mathcal{\tilde{G}} is built with ~\mathcal{E}\subseteq\mathcal{\tilde{E}}, and the matrix 𝐖\mathbf{W} is used to denote the weighted adjacency matrix of 𝒢\mathcal{G}, such that [𝐖]ij=wij[\mathbf{W}]_{ij}=w_{ij} if (i,j)(i,j)\in\mathcal{E}, and [𝐖]ij=0[\mathbf{W}]_{ij}=0 otherwise.
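A direct (if unoptimized) computation of these similarity weights, together with the 10-nearest-neighbor sparsification, could look as follows. X is assumed to be a users-by-items rating matrix with 0 marking unrated items, and the exact sparsification rule is a sketch rather than the one used in the paper.

```python
import numpy as np

def similarity_weights(X):
    """Covariance-like similarity (59) over users that rated both items."""
    n_items = X.shape[1]
    W = np.zeros((n_items, n_items))
    rated = X > 0
    for i in range(n_items):
        for j in range(n_items):
            both = rated[:, i] & rated[:, j]          # users in T_ij
            if not both.any():
                continue
            mu_ij = X[both, i].mean()                 # mean rating of item i over T_ij
            W[i, j] = np.mean((X[both, i] - mu_ij) * (X[both, j] - mu_ij))
    return W

def knn_sparsify(W, k=10):
    """Keep the k strongest off-diagonal weights per item (diagonal kept for (60))."""
    Wk = np.zeros_like(W)
    np.fill_diagonal(Wk, np.diag(W))
    off = W - np.diag(np.diag(W))
    for i in range(W.shape[0]):
        keep = np.argsort(off[i])[-k:]
        Wk[i, keep] = off[i, keep]
    return Wk
```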

Graph signal processing description. The support matrix is chosen to be

𝐒=(𝖽𝗂𝖺𝗀(𝐖))1/2𝐖(𝖽𝗂𝖺𝗀(𝐖))1/2𝐈.\mathbf{S}=\big{(}\mathsf{diag}(\mathbf{W})\big{)}^{-1/2}\mathbf{W}\big{(}\mathsf{diag}(\mathbf{W})\big{)}^{-1/2}-\mathbf{I}. (60)

This problem can be cast as a supervised interpolation problem. Given the set 𝒯~\mathcal{\tilde{T}} and the target item vrv_{r}, consider the set 𝒯~r\mathcal{\tilde{T}}_{r} of all users that have rated the item vrv_{r}. Extract the specific rating 𝗑~t(vr)=yt\mathsf{\tilde{x}}_{t}(v_{r})=y_{t} as a label, and set a 0 in the rthr^{\text{th}} entry of 𝐱~t\mathbf{\tilde{x}}_{t}. The resulting vector is a graph signal 𝐱t\mathbf{x}_{t} which always has a 0 in the rthr^{\text{th}} entry. The resulting set 𝒯r={(𝐱t,yt):𝐱~t𝒯~r}\mathcal{T}_{r}=\{(\mathbf{x}_{t},y_{t}):\mathbf{\tilde{x}}_{t}\in\mathcal{\tilde{T}}_{r}\} contains all the users that have rated the item vrv_{r}, with the corresponding rating extracted as the label yty_{t} and the rthr^{\text{th}} entry [𝐱t]r[\mathbf{x}_{t}]_{r} of the graph signal 𝐱t\mathbf{x}_{t} set to zero, i.e., [𝐱t]r=0[\mathbf{x}_{t}]_{r}=0 for all tt such that 𝐱~t𝒯~r\mathbf{\tilde{x}}_{t}\in\mathcal{\tilde{T}}_{r}.
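The support matrix (60) and the supervised interpolation samples can then be assembled as in the following sketch (W is the sparsified similarity matrix from the previous sketch, assumed to have a strictly positive diagonal; r indexes the target item).

```python
import numpy as np

def support_matrix(W):
    """S = diag(W)^{-1/2} W diag(W)^{-1/2} - I, cf. (60)."""
    d = 1.0 / np.sqrt(np.diag(W))                     # assumes a positive diagonal
    return d[:, None] * W * d[None, :] - np.eye(W.shape[0])

def interpolation_samples(X, r):
    """Users that rated item r: the rating becomes the label, its entry is zeroed."""
    rated_r = X[:, r] > 0
    labels = X[rated_r, r].copy()
    inputs = X[rated_r].astype(float)
    inputs[:, r] = 0.0                                # [x_t]_r = 0 for every sample
    return inputs, labels
```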

Figure 7: Input frequencies
Figure 8: Output frequency of LSIGF

Dataset and code. The dataset is the MovieLens-100k dataset [33] publicly available at http://files.grouplens.org/datasets/movielens/ml-100k.zip. This dataset consists of one hundred thousand ratings given by 943943 users to 16821682 movies, where each user has rated at least 2020 of them. The ratings are integers ranging from 11 to 55, meaning that 𝗑~t:𝒱{1,…,5}\mathsf{\tilde{x}}_{t}:\mathcal{V}\to\{1,\ldots,5\}\cup\emptyset for every user tt. In particular, the subset of 250250 movies that have received the largest number of ratings is used to build a graph with N=250N=250 nodes. The resulting dataset has 5474654746 ratings given by 943943 users to some of these 250250 movies. The five movies with the largest number of ratings are considered as target movies, namely “Star Wars” with |𝒯~Star Wars|=583|\mathcal{\tilde{T}}_{\text{Star Wars}}|=583 pairs (𝐱t,yt)(\mathbf{x}_{t},y_{t}), “Contact” with 509509 pairs, “Fargo” with 508508, “Return of the Jedi” with 507507, and “Liar, Liar” with 485485. Each of these datasets is split randomly into 81%81\% for training, 9%9\% for validation, and 10%10\% for testing. The code to run the simulations will be provided as a .zip file.

Architectures for comparison. The Learn NVGF architecture in (18) is compared against two graph signal processing based methods for movie recommendation, namely the RMGCNN in [3] and the RMCayley in [4], which are implemented exactly as in the corresponding papers, with the same values for the hyperparameters. For the Learn NVGF, the values are FF hidden units and filters of order KK. The values used are (64,3)(64,3) for Star Wars and Contact, (16,4)(16,4) for Fargo, (32,4)(32,4) for Return of the Jedi, and (16,3)(16,3) for Liar, Liar. The support matrix is given by (60). All architectures are followed by a learnable local linear transformation (the same for all nodes, i.e., a LSI graph filter with K=0K=0) that takes the value of the FF hidden units and outputs a single scalar that represents the estimated rating for that movie. A comparison with the nearest neighbor method (i.e., averaging the ratings of the nearest nodes) is also included.

Architectures for analysis. The architectures for analyzing the role of frequency creation are the same four architectures as in the authorship attribution problem. Namely, the LSI graph filter is a linear architecture unable to create frequencies, the Design NVGF and the Learn NVGF are linear frequency-creating architectures, and the GCNN with a ReLU nonlinearity is a nonlinear frequency-creating architecture. The values of FF and KK in all cases are the same.

Training. The loss function during training is the “Smooth L1” loss between the output scalar at the target node and the labels in the training set. All architectures are trained by using an ADAM optimizer [32] with forgetting factors 0.90.9 and 0.9990.999, and a learning rate η\eta. The training is carried out for 4040 epochs with batches of size 55. Validation is run every 55 training steps. The learned filters that result in the best performance on the validation set are kept and used during the testing phase. For each experiment, 1010 realizations of the random dataset split are carried out. The average evaluation performance (measured by the RMSE) is reported, together with the estimated standard deviation.

Experiment 1: Performance comparison. In the first experiment, the performance is measured by the RMSE, and the comparison between the Learn NVGF, the RMGCNN [3], the RMCayley [4], and the Nearest Neighbor approach is carried out for the 55 aforementioned movies with the largest number of ratings. Results are shown in Figure 5. It is generally observed that the Learn NVGF performs better than the alternatives, although the performance is comparable to the RMGCNN and the RMCayley in the case of the movie Contact. The nearest neighbor approach yields worse performance.

Experiment 2: Impact of nonlinearities. In this second experiment, the objective is to decouple the contribution made by frequency creation from that made by the nonlinear nature of the architecture. To do this, the GCNN architecture is taken as a baseline (a nonlinear, frequency-creating architecture), and the relative change in performance of the three other architectures is measured. The results in Figure 6 show that the relative change is quite small (approximately 1.5%1.5\% in the largest case, the movies Fargo and Liar, Liar), which implies that all four architectures have relatively similar performance. This can be explained by computing the average frequency content of the signals in the test set of the movie Star Wars. The result is shown in Figure 7. It is observed that the frequency content is concentrated at low eigenvalues. Therefore, as expected, there is not much to gain from using architectures that create frequencies. In Figures 8, 9, and 10, the frequency content of the output of each architecture is shown when the input is the eigenvector associated with the largest eigenvalue, i.e., 𝐱=𝐯N\mathbf{x}=\mathbf{v}_{N}. As expected, the LSIGF (Figure 8) does not create frequency content, while the other two architectures do (Figures 9 and 10). However, since this high-eigenvalue frequency content is not really significant, the frequency creation capabilities do not markedly improve the performance.

Figure 9: Output frequency of GCNN
Figure 10: Output frequency of Learn NVGF