
Node-Variant Graph Filters in Graph Neural Networks

Fernando Gama, Brendon G. Anderson, Somayeh Sojoudi
Abstract

Graph neural networks (GNNs) have been successfully employed in a myriad of applications involving graph signals. Theoretical findings establish that GNNs use nonlinear activation functions to create low-eigenvalue frequency content that can be processed in a stable manner by subsequent graph convolutional filters. However, the exact shape of the frequency content created by nonlinear functions is not known and cannot be learned. In this work, we use node-variant graph filters (NVGFs), which are linear filters capable of creating frequencies, as a means of investigating the role that frequency creation plays in GNNs. We show that, by replacing nonlinear activation functions with NVGFs, frequency creation mechanisms can be designed or learned. By doing so, the role of frequency creation is separated from the nonlinear nature of traditional GNNs. Simulations on graph signal processing problems are carried out to pinpoint the role of frequency creation.

This work is partially supported by NSF and ONR. F. Gama is with the Dept. of Electrical and Computer Engineering, Rice University, Houston, TX. Email: fgama@rice.edu. B. G. Anderson and S. Sojoudi are with the Dept. of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA.

1 Introduction

Graph neural networks (GNNs) [1, 2] are learning architectures that have been successfully applied to a wide array of graph signal processing (GSP) problems ranging from recommendation systems [3, 4] and authorship attribution [5, 6], to physical network problems including wireless communications [7], control [8], and sensor networks [9]. In the context of GSP, frequency analysis has been successful at providing theoretical insight into the observed success of GNNs [10, 11, 12] and graph scattering transforms [13, 14].

Of particular interest is the seminal work by Mallat [15, 16], concerning discrete-time signals and images, which argues that the improved performance of convolutional neural networks (CNNs) over linear convolutional filters is due to the activation functions. Concretely, nonlinear activation functions allow high-frequency content to be spread into lower frequencies, where it can be processed in a stable manner, a feat that cannot be achieved by convolutional filters alone, which are unstable when processing high frequencies. Leveraging the notion of graph Fourier transform (GFT) [17], these results have been extended to GNNs [10], establishing that the use of functions capable of creating low-eigenvalue frequency content allows them to be robust to changes in the graph topology, facilitating scalability and transferability [18].

While nonlinear activation functions play a key role in the creation of low-eigenvalue frequency content, it is not possible to know the exact shape in which this frequency content is actually generated. Node-variant graph filters (NVGFs) [19], which essentially assign a different filter tap to each node in the graph, are also able to generate frequency content. Unlike with nonlinear activation functions, this frequency content can be computed exactly, given the filter taps. Thus, by learning or carefully designing these filter taps, it is possible to know exactly how the frequency content is being generated.

The NVGF is a linear filter, which means that replacing the nonlinear activation functions with NVGFs renders the whole architecture linear. Therefore, by comparing the performance of this architecture to that of a traditional GNN, it is possible to isolate the role of frequency creation in the overall performance from that of the nonlinearity of the mapping. The contributions of this paper can be summarized as follows:

  • (C1)

    We introduce NVGFs as a means of replacing nonlinear activation functions, motivated by their ability to create frequencies. We obtain closed-form expressions for the frequency response of NVGFs as a function of the filter taps.

  • (C2)

    We prove that NVGFs are Lipschitz continuous with respect to changes in the underlying graph topology.

  • (C3)

    We put forth a framework for designing NVGFs.

  • (C4)

    We propose a GNN architecture where the nonlinear activation function is replaced by a NVGF. The filter taps of the NVGF can be either learned from data (Learn NVGF) or obtained by design (Design NVGF). The aim of the resulting architecture is to decouple the role of frequency creation from the nonlinear nature of the GNN.

  • (C5)

    We investigate the problem of authorship attribution to demonstrate, both quantitatively and qualitatively, the role of frequency creation in the performance of a GNN, and its relationship to the nonlinear nature of traditional architectures.

In essence, we show that nonlinear activation functions are not strictly required for creating frequencies, as originally thought in [15, 10], but that linear NVGF activation functions are sufficient. Furthermore, we demonstrate that this frequency content can be learned with respect to the specific problem under study. All proofs, as well as further simulations, can be found in the appendices. Code can be found online at http://github.com/fgfgama/nvgf.


Related work. GNNs constitute a very active area of research [20, 21]. From a GSP perspective, spectral filtering is used in [22]; it is then replaced by computationally simpler Chebyshev polynomials [23], and subsequently by general graph convolutional filters [6]. GNNs were also adopted in non-GSP problems [24, 25, 26]. The proposed replacement of nonlinear activation functions by NVGFs creates a linear architecture that uses both convolutional and non-convolutional graph filters.

NVGFs are first introduced in [19] to extend time-variant filters into the realm of graph signals. In that work, NVGFs are used to optimally approximate linear operators in a distributed manner. In this paper, we focus on the frequency response of NVGFs and on their capacity to create frequency content.

Leveraging the notion of GFT, [10] shows that a GNN is Lipschitz continuous with respect to small changes in the underlying graph structure. Likewise, frequency analysis has been used to understand graph scattering transforms, where the filters used in the GNN are chosen to be wavelets (and not learned) [13, 14]. In this paper, we focus on the role of frequency creation that is put forth in [15, 10], and study NVGFs as linear mechanisms for achieving this.

2 The Node-Variant Graph Filter

Let $\mathcal{G}=(\mathcal{V},\mathcal{E})$ be an undirected, possibly weighted, graph with a set of $N$ nodes $\mathcal{V}=\{v_{1},\ldots,v_{N}\}$ and a set of edges $\mathcal{E}\subseteq\mathcal{V}\times\mathcal{V}$. Define a graph signal as the function $\mathsf{x}:\mathcal{V}\to\mathbb{R}$ that associates a scalar value to each node. For a fixed ordering of the nodes, the graph signal can be conveniently described by means of a vector $\mathbf{x}\in\mathbb{R}^{N}$ such that the $i^{\text{th}}$ entry $[\mathbf{x}]_{i}$ is the value $x_{i}$ of the signal on node $v_{i}$, i.e., $[\mathbf{x}]_{i}=x_{i}=\mathsf{x}(v_{i})$.

Describing a graph signal as the vector $\mathbf{x}\in\mathbb{R}^{N}$ is mathematically convenient but carries no information about the underlying graph topology that supports it. This information can be recovered by defining a graph matrix description (GMD) $\mathbf{S}\in\mathbb{R}^{N\times N}$, which is a matrix that respects the sparsity pattern of the graph, i.e., $[\mathbf{S}]_{ij}=0$ for all distinct indices $i$ and $j$ such that $(v_{j},v_{i})\notin\mathcal{E}$. Examples of widely used GMDs include the adjacency matrix, the Laplacian matrix, and their corresponding normalizations [27, 28, 29].

The GMD $\mathbf{S}$ can thus be leveraged to process the graph signal $\mathbf{x}$ in such a way that the underlying graph structure is exploited. The most elementary example is the linear map $\mathsf{S}:\mathbb{R}^{N}\to\mathbb{R}^{N}$ between graph signals given by $\mathbf{y}=\mathsf{S}(\mathbf{x})=\mathbf{S}\mathbf{x}$. This linear map is a linear combination of the information located in the one-hop neighborhood of each node:

$y_{i}=[\mathbf{y}]_{i}=\sum_{j=1}^{N}[\mathbf{S}]_{ij}[\mathbf{x}]_{j}=\sum_{j:v_{j}\in\mathcal{N}_{i}\cup\{v_{i}\}}s_{ij}x_{j}$ (1)

where $\mathcal{N}_{i}=\{v_{j}:(v_{j},v_{i})\in\mathcal{E}\}$ is the set of nodes that share an edge with node $v_{i}$. The last equality in the above equation is due to the sparsity pattern of the GMD $\mathbf{S}$.

More generally, a graph filter $\mathsf{H}(\cdot;\mathbf{S}):\mathbb{R}^{N}\to\mathbb{R}^{N}$ is a mapping between graph signals that leverages the structure encoded in $\mathbf{S}$ [19]. In particular, linear shift-invariant graph filters (LSIGFs) are those that can be built as a polynomial in $\mathbf{S}$:

$\mathsf{H}^{\text{lsi}}(\mathbf{x};\mathbf{S})=\sum_{k=0}^{K}h_{k}\mathbf{S}^{k}\mathbf{x}=\boldsymbol{H}^{\text{lsi}}(\mathbf{S})\mathbf{x}$ (2)

where $\boldsymbol{H}^{\text{lsi}}(\mathbf{S})=\sum_{k=0}^{K}h_{k}\mathbf{S}^{k}$ with $h_{k}\in\mathbb{R}$. Note that $\mathsf{H}^{\text{lsi}}(\mathbf{x};\mathbf{S})$ is written in the form $\boldsymbol{H}^{\text{lsi}}(\mathbf{S})\mathbf{x}$ to emphasize that the function is linear in the input $\mathbf{x}$, i.e., $\mathbf{x}$ is multiplied by a matrix $\boldsymbol{H}^{\text{lsi}}(\mathbf{S})$ that is parametrized by $\mathbf{S}$. LSIGFs inherit their name from the fact that they satisfy $\boldsymbol{H}^{\text{lsi}}(\mathbf{S})\mathbf{S}\mathbf{x}=\mathbf{S}\boldsymbol{H}^{\text{lsi}}(\mathbf{S})\mathbf{x}$. The polynomial coefficients $\{h_{k}\}_{k=0}^{K}$ are called filter taps, and can be collected in a vector $\mathbf{h}\in\mathbb{R}^{K+1}$ defined as $[\mathbf{h}]_{k+1}=h_{k}$ for all $k\in\{0,\ldots,K\}$. Note that the term $\mathbf{S}^{k}\mathbf{x}$ is a convenient mathematical formulation, but in practice $\mathbf{S}^{k}\mathbf{x}$ is computed by exchanging information $k$ times with one-hop neighbors, i.e., there are no matrix powers involved. In general, GSP regards $\mathbf{S}$ as given by the structure of the problem, and regards $\mathbf{x}$ as the actionable data [27, 28, 29].
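
To make the repeated-exchange computation of (2) concrete, the following minimal NumPy sketch applies an LSIGF without forming any matrix power. This is an illustrative sketch rather than the code released online; the function name `lsigf` and its arguments are placeholders.

```python
import numpy as np

def lsigf(S, x, h):
    """Apply the LSIGF of (2) by repeated one-hop exchanges.

    S : (N, N) graph matrix description (GMD)
    x : (N,)   input graph signal
    h : (K+1,) filter taps h_0, ..., h_K
    """
    y = h[0] * x          # k = 0 term: no exchange needed
    z = x.copy()          # running term S^k x
    for k in range(1, len(h)):
        z = S @ z         # one more exchange with one-hop neighbors
        y = y + h[k] * z
    return y
```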

This paper focuses on NVGFs [19], which are linear filters that assign a different filter tap to each node, for each application of $\mathbf{S}$. This can be compactly written as follows:

$\mathsf{H}^{\text{nv}}(\mathbf{x};\mathbf{S})=\sum_{k=0}^{K}\mathsf{diag}(\mathbf{h}^{(k)})\mathbf{S}^{k}\mathbf{x}=\boldsymbol{H}^{\text{nv}}(\mathbf{S})\mathbf{x}$ (3)

where $\mathbf{h}^{(k)}\in\mathbb{R}^{N}$ and $\mathsf{diag}(\cdot)$ is the diagonal operator that takes a vector and creates a diagonal matrix with the elements of the vector on the diagonal. Since the NVGF is linear in the input, it holds that $\mathsf{H}^{\text{nv}}(\mathbf{x};\mathbf{S})=\boldsymbol{H}^{\text{nv}}(\mathbf{S})\mathbf{x}$, where $\boldsymbol{H}^{\text{nv}}(\mathbf{S})=\sum_{k=0}^{K}\mathsf{diag}(\mathbf{h}^{(k)})\mathbf{S}^{k}$. The $i^{\text{th}}$ entry of the vector, $[\mathbf{h}^{(k)}]_{i}=h_{ik}$, is the filter tap that node $v_{i}$ uses to weigh the information incoming after $k$ exchanges with its neighbors. The set of all filter taps can be conveniently collected in a matrix $\mathbf{H}\in\mathbb{R}^{N\times(K+1)}$ where the $(k+1)^{\text{th}}$ column is $\mathbf{h}^{(k)}\in\mathbb{R}^{N}$ and the $i^{\text{th}}$ row, denoted by $\mathbf{h}_{i}\in\mathbb{R}^{K+1}$, contains the $K+1$ filter taps used by node $v_{i}$, i.e., $[\mathbf{h}_{i}]_{k+1}=h_{ik}$.

The LSIGF in (2) and the NVGF in (3) are both linear and local processing operators. They depend linearly on the input graph signal $\mathbf{x}$, as indicated by the matrix multiplication notation in (2) and (3). They are local in the sense that, to compute the output, each node requires information relayed directly by its immediate neighbors. The LSIGF is characterized by a collection of $K+1$ filter taps. The NVGF, on the other hand, is characterized by $N(K+1)$ filter taps. It is noted that while the NVGF requires additional memory to store more parameters, this storage can be distributed throughout the graph. It is also observed that both the LSIGF and the NVGF have the same computational complexity.
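
A corresponding sketch for the NVGF in (3) differs from the LSIGF sketch above only in that the shared scalar tap $h_{k}$ becomes a per-node vector of taps. Again, this is an illustrative sketch with hypothetical names, not the released implementation.

```python
import numpy as np

def nvgf(S, x, H):
    """Apply the NVGF of (3): node i weighs S^k x with its own tap H[i, k].

    S : (N, N) GMD
    x : (N,)   input graph signal
    H : (N, K+1) filter-tap matrix; column k is h^{(k)}
    """
    y = H[:, 0] * x           # diag(h^{(0)}) x
    z = x.copy()
    for k in range(1, H.shape[1]):
        z = S @ z             # k-th one-hop exchange, shared by all nodes
        y = y + H[:, k] * z   # each node applies its own tap
    return y
```

Note that the computation per exchange is the same as for the LSIGF; only the final per-node weighting changes, which is why both filters have the same computational complexity.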

3 Frequency Analysis

The GMD $\mathbf{S}\in\mathbb{R}^{N\times N}$ can be used to define a spectral representation of the graph signal $\mathbf{x}\in\mathbb{R}^{N}$ [17]. Since the graph is undirected, assume that $\mathbf{S}$ is symmetric so that it can be diagonalized by an orthonormal basis of eigenvectors $\{\mathbf{v}_{i}\}_{i=1}^{N}$, where $\mathbf{v}_{i}\in\mathbb{R}^{N}$ and $\mathbf{S}\mathbf{v}_{i}=\lambda_{i}\mathbf{v}_{i}$, with $\lambda_{i}\in\mathbb{R}$ being the corresponding eigenvalue. Then, it holds that $\mathbf{S}=\mathbf{V}\mathsf{diag}(\boldsymbol{\lambda})\mathbf{V}^{\mathsf{T}}$, where the $i^{\text{th}}$ column of $\mathbf{V}$ is $\mathbf{v}_{i}$ and where $\boldsymbol{\lambda}\in\mathbb{R}^{N}$ is given by $[\boldsymbol{\lambda}]_{i}=\lambda_{i}$ for $i=1,\ldots,N$. We assume throughout this paper that the eigenvalues are distinct, which is typically the case for random connected graphs.

The spectral representation of a graph signal $\mathbf{x}$ with respect to its underlying graph support described by $\mathbf{S}$ is given by

$\mathbf{\tilde{x}}=\mathbf{V}^{\mathsf{T}}\mathbf{x}$ (4)

where $\mathbf{\tilde{x}}\in\mathbb{R}^{N}$; see [17]. The spectral representation $\mathbf{\tilde{x}}$ of the graph signal $\mathbf{x}$ contains the coordinates of $\mathbf{x}$ in the eigenbasis $\{\mathbf{v}_{i}\}_{i=1}^{N}$ of the support matrix $\mathbf{S}$. The resulting vector $\mathbf{\tilde{x}}$ is known as the frequency response of the signal $\mathbf{x}$. The $i^{\text{th}}$ entry $[\mathbf{\tilde{x}}]_{i}=\mathbf{v}_{i}^{\mathsf{T}}\mathbf{x}=\tilde{x}_{i}\in\mathbb{R}$ of the frequency response $\mathbf{\tilde{x}}$ measures how much the $i^{\text{th}}$ eigenvector $\mathbf{v}_{i}$ contributes to the signal $\mathbf{x}$. The operation in (4) is called the GFT, and thus the frequency response $\mathbf{\tilde{x}}$ is often referred to as the GFT of the signal $\mathbf{x}$.

The GFT offers an alternative representation of the graph signal $\mathbf{x}$ that takes into account the graph structure in $\mathbf{S}$. The effect of a filter can be characterized in the spectral domain by computing the GFT of the output. For instance, when considering a LSIGF, the spectral representation of the output $\mathbf{y}=\boldsymbol{H}^{\text{lsi}}(\mathbf{S})\mathbf{x}$ is

$\mathbf{\tilde{y}}=\mathbf{V}^{\mathsf{T}}\sum_{k=0}^{K}h_{k}\mathbf{S}^{k}\mathbf{x}=\sum_{k=0}^{K}h_{k}\mathsf{diag}(\boldsymbol{\lambda}^{k})\mathbf{\tilde{x}}=\mathsf{diag}(\mathbf{\tilde{h}})\mathbf{\tilde{x}}$ (5)

where the eigendecomposition of $\mathbf{S}$, the GFT of $\mathbf{x}$, and the fact that $\mathbf{S}^{k}=\mathbf{V}\mathsf{diag}(\boldsymbol{\lambda}^{k})\mathbf{V}^{\mathsf{T}}$ were all used. Note that $\boldsymbol{\lambda}^{k}$ is a shorthand notation that means $[\boldsymbol{\lambda}^{k}]_{i}=\lambda_{i}^{k}$. The vector $\mathbf{\tilde{h}}\in\mathbb{R}^{N}$ is known as the frequency response of the filter and its $i^{\text{th}}$ entry is given by

$[\mathbf{\tilde{h}}]_{i}=\mathsf{\tilde{h}}(\lambda_{i})=\sum_{k=0}^{K}h_{k}\lambda_{i}^{k}$ (6)

where $\mathsf{\tilde{h}}:\mathbb{R}\to\mathbb{R}$ is a polynomial defined by the set of filter taps $\{h_{k}\}_{k=0}^{K}$. The function $\mathsf{\tilde{h}}(\cdot)$ depends only on the filter coefficients and not on the specific graph on which it is applied, and thus is valid for all graphs. The effect of filtering on a specific graph comes from instantiating $\mathsf{\tilde{h}}(\cdot)$ on the specific eigenvalues of that graph. The function $\mathsf{\tilde{h}}(\cdot)$ is also referred to as the frequency response, and it will be clear from context whether we refer to the function $\mathsf{\tilde{h}}(\cdot)$ or to the vector $\mathbf{\tilde{h}}$ given in (6) that stems from evaluating $\mathsf{\tilde{h}}(\cdot)$ at each of the $N$ eigenvalues $\lambda_{i}$. Note that since $\mathsf{\tilde{h}}(\cdot)$ is an analytic function, it can be applied to the square matrix $\mathbf{S}$ so that $\mathsf{\tilde{h}}(\mathbf{S})=\boldsymbol{H}^{\text{lsi}}(\mathbf{S})$.

In the case of the LSIGF, it is observed from (5) that the $i^{\text{th}}$ entry of the frequency response of the output, $[\mathbf{\tilde{y}}]_{i}=\tilde{y}_{i}$, is given by the elementwise multiplication

$\tilde{y}_{i}=\mathsf{\tilde{h}}(\lambda_{i})\tilde{x}_{i}.$ (7)

This implies that the frequency response of the output $\tilde{y}_{i}$ is the elementwise multiplication of the frequency response of the filter $\mathsf{\tilde{h}}(\lambda_{i})$ and the frequency response of the input $\tilde{x}_{i}$. This makes (7) the analogue of the convolution theorem for graph signals. Therefore, the LSIGF in (2) is oftentimes called a graph convolution. It is observed from (7) that LSIGFs are capable of learning any type of frequency response (low-pass, high-pass, etc.), but that they are not able to create frequencies, i.e., if $\tilde{x}_{i}=0$, then $\tilde{y}_{i}=0$.

Unlike in the case of discrete-time signals, the frequency response of the LSIGF is not computed in the same manner as the frequency response of graph signals. More specifically, the frequency response of the filter $\mathbf{\tilde{h}}$ can be directly computed from the filter taps $\mathbf{h}$ by means of a Vandermonde matrix $\mathbf{\Lambda}\in\mathbb{R}^{N\times(K+1)}$ given by $[\mathbf{\Lambda}]_{ik}=\lambda_{i}^{k-1}$, as follows:

$\mathbf{\tilde{h}}=\mathbf{\Lambda}\mathbf{h}.$ (8)

This implies that, when processing graph signals, the graph filter cannot be uniquely represented by a graph signal [27, Th. 5], and thus the concept of impulse response is no longer valid. Furthermore, the graph convolution is not symmetric, i.e., the filter and the graph signal do not commute. Finally, it is interesting to note that, when $\mathbf{S}$ is the adjacency matrix of a directed cycle and $K=N-1$, then $\mathbf{\Lambda}=\mathbf{V}^{\mathsf{H}}$ and the GFTs of the signal and of the filter taps (in time, known as the impulse response) become equivalent, as expected.
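
The following sketch (our own numerical check, using a randomly generated symmetric GMD as a stand-in for a real graph) verifies that the vertex-domain LSIGF output of (2) matches the spectral-domain computation obtained by combining (4), (7), and (8).

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 20, 3

# Random symmetric GMD (assumption: Erdos-Renyi adjacency, for illustration only)
A = (rng.random((N, N)) < 0.3).astype(float)
S = np.triu(A, 1); S = S + S.T

# Eigendecomposition S = V diag(lam) V^T (V orthonormal since S is symmetric)
lam, V = np.linalg.eigh(S)

h = rng.standard_normal(K + 1)           # filter taps h_0, ..., h_K
x = rng.standard_normal(N)               # input graph signal

# Vertex-domain output of the LSIGF in (2)
y = sum(h[k] * np.linalg.matrix_power(S, k) @ x for k in range(K + 1))

# Spectral-domain route: (8) for the filter, (4) for the signal, (7) for the output
Lambda = np.vander(lam, K + 1, increasing=True)   # [Lambda]_{ik} = lam_i^k
h_tilde = Lambda @ h                              # frequency response of the filter, (8)
x_tilde = V.T @ x                                 # GFT of the input, (4)
y_tilde = h_tilde * x_tilde                       # convolution theorem, (7)

assert np.allclose(V.T @ y, y_tilde)              # both routes agree
```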

When considering NVGFs, as in (3), the convolution theorem (7) no longer holds. Instead, the frequency response of the output is given in the following proposition.

Proposition 1 (Frequency response of NVGF).

Let $\mathbf{y}=\boldsymbol{H}^{\text{nv}}(\mathbf{S})\mathbf{x}=\sum_{k=0}^{K}\mathsf{diag}(\mathbf{h}^{(k)})\mathbf{S}^{k}\mathbf{x}$ be the output of an arbitrary NVGF characterized by some filter tap matrix $\mathbf{H}\in\mathbb{R}^{N\times(K+1)}$. Then, the frequency response $\mathbf{\tilde{y}}=\mathbf{V}^{\mathsf{T}}\mathbf{y}$ of the output is given by

$\mathbf{\tilde{y}}=\mathbf{V}^{\mathsf{T}}\big(\mathbf{V}\circ(\mathbf{H}\mathbf{\Lambda}^{\mathsf{T}})\big)\mathbf{\tilde{x}}$ (9)

where $\circ$ denotes the elementwise product of matrices.

It is immediately observed that NVGFs are capable of generating new frequency content, even though they are linear.

Corollary 2.

If the matrix $\mathbf{V}^{\mathsf{T}}\big(\mathbf{V}\circ(\mathbf{H}\mathbf{\Lambda}^{\mathsf{T}})\big)$ is not diagonal, then the output exhibits frequency content that is not present in the input.

As an example, consider the case where the input is given by a single-frequency signal, i.e., $\mathbf{x}=\mathbf{v}_{t}$ for some $t\in\{1,\ldots,N\}$, so that $\mathbf{\tilde{x}}=\mathbf{e}_{t}$ where $[\mathbf{e}_{t}]_{i}=1$ if $i=t$ and $[\mathbf{e}_{t}]_{i}=0$ otherwise. This is a signal that has a single frequency component. Yet, the $i^{\text{th}}$ entry of the frequency response of the NVGF output is $\tilde{y}_{i}=\sum_{j=1}^{N}\mathsf{\tilde{h}}_{j}(\lambda_{t})v_{ji}v_{jt}$, where $v_{ij}=[\mathbf{V}]_{ij}$ and $\mathsf{\tilde{h}}_{j}$ is the frequency response obtained from using the $K+1$ coefficients at node $v_{j}$. Unless $\mathsf{\tilde{h}}_{j}(\lambda_{t})$ is constant for all $j$, i.e., $\mathsf{\tilde{h}}_{j}(\lambda_{t})=\tilde{h}_{t}$, the $i^{\text{th}}$ entry of the GFT of the output is nonzero even when the $i^{\text{th}}$ entry of the GFT of the input is zero. This also shows that, by carefully designing the filter taps for each node, the frequencies that are allowed to be created can be chosen, analyzed, and understood.
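
This single-frequency example can be checked numerically along the lines of the previous sketch; the code below (again a hypothetical random graph with random taps, not the released code) compares the vertex-domain NVGF output with (9) and counts how many output frequencies are nonzero.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 20, 3

# Random symmetric GMD (assumption, for illustration only)
A = (rng.random((N, N)) < 0.3).astype(float)
S = np.triu(A, 1); S = S + S.T
lam, V = np.linalg.eigh(S)
Lambda = np.vander(lam, K + 1, increasing=True)

H = rng.standard_normal((N, K + 1))       # one row of K+1 taps per node

# Single-frequency input x = v_t, so that x_tilde = e_t
t = N - 1
x = V[:, t]

# Vertex-domain NVGF output, eq. (3)
y = sum(H[:, k] * (np.linalg.matrix_power(S, k) @ x) for k in range(K + 1))

# Spectral-domain output, eq. (9): y_tilde = V^T (V o (H Lambda^T)) x_tilde
B = V.T @ (V * (H @ Lambda.T))
y_tilde = B[:, t]                          # x_tilde = e_t picks out column t

assert np.allclose(V.T @ y, y_tilde)
print("nonzero output frequencies:", int(np.sum(np.abs(y_tilde) > 1e-9)), "out of", N)
```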

To finalize the frequency analysis of NVGFs, we establish their Lipschitz continuity with respect to changes in the underlying graph support, as described by the matrix $\mathbf{S}$. In what follows, we denote the spectral norm of a matrix $\mathbf{A}$ by $\|\mathbf{A}\|_{2}$.

Theorem 3 (Lipschitz continuity of the NVGF with respect to $\mathbf{S}$).

Let $\mathcal{G}$ and $\mathcal{\hat{G}}$ be two graphs with $N$ nodes, described by GMDs $\mathbf{S}\in\mathbb{R}^{N\times N}$ and $\mathbf{\hat{S}}\in\mathbb{R}^{N\times N}$, respectively. Let $\mathbf{H}\in\mathbb{R}^{N\times(K+1)}$ be the coefficients of any NVGF. Given a constant $\varepsilon>0$, if $\|\mathbf{\hat{S}}-\mathbf{S}\|_{2}\leq\varepsilon$, it holds that

$\big\|\big(\boldsymbol{H}^{\text{nv}}(\mathbf{\hat{S}})-\boldsymbol{H}^{\text{nv}}(\mathbf{S})\big)\mathbf{x}\big\|_{2}\leq\varepsilon C\sqrt{N}(1+8N)\|\mathbf{x}\|_{2}+\mathsf{O}(\varepsilon^{2})$ (10)

where $\boldsymbol{H}^{\text{nv}}(\mathbf{S})$ and $\boldsymbol{H}^{\text{nv}}(\mathbf{\hat{S}})$ are the NVGFs on $\mathbf{S}$ and $\mathbf{\hat{S}}$, respectively, and where $C$ is the Lipschitz constant of the frequency responses at each node, i.e., $|\mathsf{\tilde{h}}_{t}(\lambda_{j})-\mathsf{\tilde{h}}_{t}(\lambda_{i})|\leq C|\lambda_{j}-\lambda_{i}|$ for all $i,j,t\in\{1,\ldots,N\}$.

Theorem 3 establishes the Lipschitz continuity of the NVGF with respect to the support matrix $\mathbf{S}$ (Lipschitz continuity with respect to the input $\mathbf{x}$ follows immediately for bounded filter taps), as long as the graphs $\mathbf{S}$ and $\mathbf{\hat{S}}$ are similar, i.e., $\varepsilon\ll 1$. The bound is proportional to this difference $\varepsilon$ and to the shape of the frequency responses at each node through the Lipschitz constant $C$. It also depends on the number of nodes $N$, but this dependence is fixed for given graphs with the same number of nodes. In short, Theorem 3 gives mild guarantees on the expected performance of the NVGF across a wide range of graphs $\mathbf{\hat{S}}$ that are close to the graph $\mathbf{S}$.
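
As a sanity check of Theorem 3 (our own illustration, not part of the original experiments), the sketch below perturbs a random GMD by a symmetric matrix of spectral norm $\varepsilon$, estimates $C$ from the per-node frequency responses, and compares the output difference with the bound in (10), which is expected to hold up to the $\mathsf{O}(\varepsilon^{2})$ term.

```python
import numpy as np

rng = np.random.default_rng(2)
N, K, eps = 20, 3, 1e-3

A = (rng.random((N, N)) < 0.3).astype(float)
S = np.triu(A, 1); S = S + S.T

# Small symmetric perturbation with spectral norm at most eps
E = rng.standard_normal((N, N)); E = (E + E.T) / 2
E *= eps / np.linalg.norm(E, 2)
S_hat = S + E

H = rng.standard_normal((N, K + 1))
x = rng.standard_normal(N)

def nvgf(S, x, H):
    # Same helper as in the earlier NVGF sketch
    y = H[:, 0] * x
    z = x.copy()
    for k in range(1, H.shape[1]):
        z = S @ z
        y = y + H[:, k] * z
    return y

diff = np.linalg.norm(nvgf(S_hat, x, H) - nvgf(S, x, H))

# Empirical Lipschitz constant C of the per-node frequency responses on the eigenvalues of S
lam = np.linalg.eigvalsh(S)
Lambda = np.vander(lam, K + 1, increasing=True)
R = H @ Lambda.T                          # R[t, i] = h_t(lam_i)
C = max(abs(R[t, j] - R[t, i]) / abs(lam[j] - lam[i])
        for t in range(N) for i in range(N) for j in range(N) if i != j)

bound = eps * C * np.sqrt(N) * (1 + 8 * N) * np.linalg.norm(x)
print(diff <= bound)
```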

4 Approximating Activation Functions

One of the main roles of activation functions in neural networks is to create low-frequency content that can be processed in a stable manner [15]. However, the way the nonlinearities create this frequency content is unknown and cannot be shaped. One alternative for tailoring the frequency creation process to the specific problem under study is to learn the NVGF filter taps (Sec. 5). However, doing so implies that the number of learnable parameters depends on $N$, which may lead to overfitting. In what follows, we propose a method of designing, instead of learning, the NVGFs.

Problem statement. Assume that each data point $\mathbf{x}$ is a random graph signal with mean $\mathbb{E}[\mathbf{x}]=\boldsymbol{\mu}_{x}$, correlation matrix $\mathbf{R}_{x}=\mathbb{E}[\mathbf{x}\mathbf{x}^{\mathsf{T}}]$, and covariance matrix $\mathbf{C}_{x}=\mathbb{E}[(\mathbf{x}-\boldsymbol{\mu}_{x})(\mathbf{x}-\boldsymbol{\mu}_{x})^{\mathsf{T}}]$. The objective is to estimate a pointwise nonlinear function $\rho:\mathbb{R}\to\mathbb{R}$ such that $[\rho(\mathbf{x})]_{i}=\rho([\mathbf{x}]_{i})$, using a NVGF-based estimator $\mathbf{\hat{y}}=\boldsymbol{H}^{\text{nv}}(\mathbf{S})\mathbf{x}+\mathbf{c}$, with $\boldsymbol{H}^{\text{nv}}(\mathbf{S})$ as in (3) and $\mathbf{c}\in\mathbb{R}^{N}$. Given the random variable $\mathbf{y}=\rho(\mathbf{x})$, the aim is to find the filter taps $\mathbf{H}\in\mathbb{R}^{N\times(K+1)}$ that minimize the mean squared error (MSE)

$\mathbf{H}^{\star}=\operatorname*{arg\,min}_{\mathbf{H}\in\mathbb{R}^{N\times(K+1)}}\mathbb{E}\big[\|\mathbf{\hat{y}}-\mathbf{y}\|_{2}^{2}\big].$ (11)

Our particular focus is set on obtaining unbiased estimators.

Lemma 4 (Unbiased estimator).

Let $\boldsymbol{\mu}_{\rho}=\mathbb{E}[\rho(\mathbf{x})]$. The NVGF-based estimator is unbiased if and only if $\mathbf{c}=\boldsymbol{\mu}_{\rho}-\boldsymbol{H}^{\text{nv}}(\mathbf{S})\boldsymbol{\mu}_{x}$.

From Lemma 4, the unbiased estimator is now

$\mathbf{\hat{y}}=\boldsymbol{H}^{\text{nv}}(\mathbf{S})\big(\mathbf{x}-\boldsymbol{\mu}_{x}\big)+\boldsymbol{\mu}_{\rho}.$ (12)

Therefore, the objective becomes finding the filter tap matrix $\mathbf{H}\in\mathbb{R}^{N\times(K+1)}$, for some fixed value of $K$, that satisfies

$\mathbf{H}^{\star}=\operatorname*{arg\,min}_{\mathbf{H}\in\mathbb{R}^{N\times(K+1)}}\mathbb{E}\Big[\big\|\boldsymbol{H}^{\text{nv}}(\mathbf{S})(\mathbf{x}-\boldsymbol{\mu}_{x})-\big(\rho(\mathbf{x})-\boldsymbol{\mu}_{\rho}\big)\big\|_{2}^{2}\Big].$ (13)

Note that, due to the orthogonal nature of the GFT, minimizing (13) is equivalent to minimizing the difference of the corresponding frequency responses.

The optimal filter taps for each node, i.e., the rows $\mathbf{h}_{i}^{\star}\in\mathbb{R}^{K+1}$ of $\mathbf{H}^{\star}$ that solve (13), can be obtained by solving a linear system of equations as follows.

Proposition 5 (Optimal NVGF).

Let $\mathbf{u}_{i}$ denote the $i^{\text{th}}$ row of $\mathbf{V}$, let $\mathbf{R}_{i}=\mathbf{\Lambda}^{\mathsf{T}}\mathsf{diag}(\mathbf{u}_{i})\mathbf{V}^{\mathsf{T}}\mathbf{C}_{x}\mathbf{V}\mathsf{diag}(\mathbf{u}_{i})\mathbf{\Lambda}$ be the covariance matrix of the frequency response at node $v_{i}$ for the input $\mathbf{x}$, and let $\mathbf{p}_{i}=\mathbf{\Lambda}^{\mathsf{T}}\mathsf{diag}(\mathbf{u}_{i})\mathbf{V}^{\mathsf{T}}\mathbb{E}[(\rho(x_{i})-\mu_{\rho i})(\mathbf{x}-\boldsymbol{\mu}_{x})]$ denote the correlation between the filtered signal and the target nonlinearity. Then, a set of filter taps $\{\mathbf{h}_{1}^{\star},\ldots,\mathbf{h}_{N}^{\star}\}$ is optimal for (13) if and only if they solve the system of linear equations

$\mathbf{R}_{i}\mathbf{h}_{i}^{\star}=\mathbf{p}_{i},\quad i\in\{1,\ldots,N\}.$ (14)

From Proposition 5, it is immediate that only knowledge of the first and second moments of $\mathbf{x}$, and of the correlation between the input $\mathbf{x}$ and the output $\mathbf{y}=\rho(\mathbf{x})$, is required to solve for the optimal NVGF. These moments can be estimated from training data. It is also observed that the optimal filter taps can be computed separately at each node.
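
A possible implementation of this design procedure, estimating the required moments from a matrix of training signals and solving (14) node by node, is sketched below. The function name `design_nvgf` and the use of a least-squares solve for a possibly singular $\mathbf{R}_{i}$ are our own choices, not taken from the released code.

```python
import numpy as np

def design_nvgf(S, X, rho, K):
    """Estimate the Design-NVGF taps of Proposition 5 from training data.

    S   : (N, N) symmetric GMD
    X   : (T, N) matrix of T training graph signals (one per row)
    rho : pointwise nonlinearity, e.g. lambda x: np.maximum(x, 0)
    K   : filter order
    Returns H of shape (N, K+1) with row i solving R_i h_i = p_i in (14).
    """
    T, N = X.shape
    lam, V = np.linalg.eigh(S)
    Lambda = np.vander(lam, K + 1, increasing=True)

    mu_x = X.mean(axis=0)
    Xc = X - mu_x                                   # centered inputs
    C_x = Xc.T @ Xc / T                             # sample covariance
    Y = rho(X)
    Yc = Y - Y.mean(axis=0)                         # centered targets rho(x_i) - mu_{rho,i}

    H = np.zeros((N, K + 1))
    for i in range(N):
        D = np.diag(V[i, :])                        # diag(u_i), u_i = i-th row of V
        M = Lambda.T @ D @ V.T                      # common factor in R_i and p_i
        R_i = M @ C_x @ M.T                         # per-node covariance in Proposition 5
        p_i = M @ (Xc.T @ Yc[:, i] / T)             # sample estimate of E[(rho(x_i)-mu)(x-mu_x)]
        H[i, :] = np.linalg.lstsq(R_i, p_i, rcond=None)[0]
    return H
```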

We now consider the specific case of random graph signals with zero-mean, independent, identically distributed (i.i.d.) entries, to illustrate the form of the optimal NVGF.

Corollary 6 (Zero-mean, i.i.d., ReLU nonlinearity).

Assume that the elements of $\mathbf{x}$ are i.i.d., and that $\boldsymbol{\mu}_{x}=\mathbf{0}$ and $\mathbf{R}_{x}=\mathbf{C}_{x}=\sigma_{x}^{2}\mathbf{I}$ for some $\sigma_{x}^{2}>0$. Consider an unbiased NVGF-based estimator of the form

$\hat{y}_{i}=(\xi^{2}/\sigma_{x}^{2})x_{i}+\mu_{\rho}$ (15)

with $\xi^{2}=\mathbb{E}[\rho(x)x]$ and $\mu_{\rho}=\mathbb{E}[\rho(x)]$, where $x$ is distributed as the elements of $\mathbf{x}$. Then, the estimator (15) is optimal for (13). If, additionally, $\rho(\cdot)=\mathsf{ReLU}(\cdot)$ and the distribution of $x$ is symmetric around 0, then $\mu_{\rho}=0$ and $\xi^{2}=\sigma_{x}^{2}/2$, so that (15) reduces to $\hat{y}_{i}=x_{i}/2$.

Corollary 6 shows that if the elements of the graph signal are zero-mean and i.i.d., then the NVGF boils down to a LSI graph filter. Hence, this optimal unbiased NVGF is not capable of generating frequencies. This is sensible since the elements of the graph signal $\mathbf{x}$ bear no relationship to the underlying graph support, and thus the NVGF does not leverage that support to create the appropriate frequencies. Furthermore, if the distribution of each element of $\mathbf{x}$ is symmetric around zero, then an optimal approximation of the $\mathsf{ReLU}$ nonlinearity amounts to a LSI graph filter that outputs half of the input, as one would expect.

Corollary 7 (Stationary graph processes).

Let $\mathbf{x}$ be a stationary graph process [30, 31], so that $\boldsymbol{\mu}_{x}=\mathbf{0}$ and $\mathbf{R}_{x}=\mathbf{C}_{x}=\mathbf{V}\mathsf{diag}(\mathbf{q})\mathbf{V}^{\mathsf{T}}$ with power spectral density $\mathbf{q}\in\mathbb{R}^{N}$. Then, NVGF filter taps $\mathbf{h}_{i}^{\star}$ at node $v_{i}$ solving the linear equations $\mathsf{diag}(\mathbf{u}_{i}\circ\mathbf{q})\mathbf{\Lambda}\mathbf{h}_{i}^{\star}=\mathbf{V}^{\mathsf{T}}\mathbb{E}[\rho(x_{i})\mathbf{x}]$ are optimal.

It is observed from Corollary 7 that the contribution of each frequency component to node $v_{i}$ gets modulated by the power spectral density $\mathbf{q}$. Also, the filter in this case is different for each node, and thus optimal unbiased NVGFs for processing stationary graph processes actually create frequencies.

5 Graph Neural Network Architectures Using Node-Variant Graph Filters

Figure 1: Authorship attribution problem: Jane Austen. (a) Performance comparison: comparison between the Learn NVGF and three popular architectures (GCN [24], SGC [25], and GAT [26]). (b) Role of frequency creation: change with respect to a GCNN (16) when considering no activation function (LSIGF), a designed NVGF (17), and a learned NVGF (18). (c) Input frequencies: a sample of the input frequencies to the architectures, where a large high-frequency component is observed.

A graph convolutional neural network (GCNN) $\mathsf{\Phi}$ has a layered architecture [23, 6], where each layer applies a LSIGF as in (2), followed by a pointwise nonlinear activation function $\rho:\mathbb{R}\to\mathbb{R}$ applied to each node, $[\rho(\mathbf{x})]_{i}=\rho([\mathbf{x}]_{i})$, so that

$\mathbf{x}_{0}=\mathbf{x},\quad\mathbf{x}_{\ell}=\rho\big(\mathsf{H}_{\ell}^{\text{lsi}}(\mathbf{x}_{\ell-1};\mathbf{S},\mathbf{h}_{\ell})\big),\quad\mathbf{x}_{L}=\mathsf{\Phi}(\mathbf{x};\mathbf{S},\mathcal{H}).$ (16)

The LSIGF $\mathsf{H}_{\ell}^{\text{lsi}}(\cdot;\mathbf{S},\mathbf{h}_{\ell})$ is characterized by the specific filter taps $\mathbf{h}_{\ell}\in\mathbb{R}^{K_{\ell}+1}$. Note that the output of the GCNN is collected as the output of the last layer, $\mathbf{x}_{L}=\mathsf{\Phi}(\mathbf{x};\mathbf{S},\mathcal{H})$. This notation emphasizes that the input is $\mathbf{x}$, while the matrix $\mathbf{S}$ is given by the problem, and the filter taps $\mathcal{H}=\{\mathbf{h}_{\ell}\}_{\ell=1}^{L}$ are learned from data.

Nonlinear activation functions are used in GCNNs to enable them to learn nonlinear relationships between input and output. Additionally, theoretical results have found that they play a key role in creating frequency content that can be better processed by subsequent graph convolutional filters. As previously discussed, NVGFs are also capable of creating new frequency content, albeit in a linear manner. Therefore, by replacing the nonlinear activation functions by NVGFs, it is possible to decouple the contribution made by frequency creation from that made by the architecture’s nonlinearity.

The first architecture proposed here is to use a designed NVGF in lieu of the activation function. That is, instead of using the nonlinear activation function ρ\rho, an optimal NVGF filter designed as in Proposition 5 is used. This requires estimating the first and second moments of the NVGF input data. The architecture, herein termed “Design NVGF”, is given by

$\mathbf{x}_{\ell}=\mathsf{H}_{\ell}^{\text{nv}}\big(\mathsf{H}_{\ell}^{\text{lsi}}(\mathbf{x}_{\ell-1};\mathbf{S},\mathbf{h}_{\ell});\mathbf{S},\mathbf{H}_{\ell}^{\star}\big).$ (17)

Note that, in this case, the filter taps $\mathbf{H}_{\ell}^{\star}\in\mathbb{R}^{N\times(K_{\ell}+1)}$ of the NVGF are obtained by Proposition 5, while the filter taps $\mathcal{H}=\{\mathbf{h}_{\ell}\}_{\ell=1}^{L}$ of the LSIGF are learned from data.

Alternatively, the filter taps of the NVGF replacing the nonlinear activation function can also be learned from data, together with the filter taps of the LSIGF:

$\mathbf{x}_{\ell}=\mathsf{H}_{\ell}^{\text{nv}}\big(\mathsf{H}_{\ell}^{\text{lsi}}(\mathbf{x}_{\ell-1};\mathbf{S},\mathbf{h}_{\ell});\mathbf{S},\mathbf{H}_{\ell}\big)$ (18)

where the filter taps to be learned are $\mathcal{H}=\{(\mathbf{h}_{\ell},\mathbf{H}_{\ell})\}_{\ell=1}^{L}$. This approach avoids the need to estimate first and second moments. Additionally, it allows the NVGF to learn how to create frequency content tailored to the application at hand, instead of just approximating the chosen nonlinear activation function. We note that while the increased number of parameters may lead to overfitting, this can be tackled by dropout. This architecture is termed “Learn NVGF”.
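
A minimal single-feature sketch of one Learn NVGF layer (18) in PyTorch is given below. It omits the multiple hidden features, biases, and readout layer of the actual architectures, and the class name `LearnNVGFLayer` is our own; it is meant only to illustrate how the LSIGF and NVGF taps are both trained by backpropagation.

```python
import torch
import torch.nn as nn

class LearnNVGFLayer(nn.Module):
    """One layer of (18): a learnable LSIGF followed by a learnable NVGF."""

    def __init__(self, S, K):
        super().__init__()
        self.register_buffer("S", S)                           # (N, N) GMD, fixed
        N = S.shape[0]
        self.h = nn.Parameter(0.1 * torch.randn(K + 1))        # LSIGF taps, shared by all nodes
        self.H = nn.Parameter(0.1 * torch.randn(N, K + 1))     # NVGF taps, one row per node

    def _shifts(self, x):
        # Returns [x, Sx, ..., S^K x] stacked along the last dim: (batch, N, K+1)
        z, out = x, [x]
        for _ in range(self.h.numel() - 1):
            z = z @ self.S.T                                   # one-hop exchange
            out.append(z)
        return torch.stack(out, dim=-1)

    def forward(self, x):
        u = (self._shifts(x) * self.h).sum(dim=-1)             # LSIGF, eq. (2)
        return (self._shifts(u) * self.H).sum(dim=-1)          # NVGF in place of rho, eq. (3)
```

In the full architectures, several such filters are run in parallel per layer (the $F$ hidden units) and followed by the learnable linear readout described in Section 6.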

6 Numerical Experiments

The objectives of the experiments are twofold. First, they aim to show how the architecture obtained by replacing the nonlinear activation function with a NVGF performs when compared to other popular GNNs. The second objective is to isolate the impact that frequency creation has on the overall performance. Due to space constraints, we focus on the problem of authorship attribution [5]. Other applications can be found in the supplementary material; the corresponding observations are summarized at the end of this section for reference.

Problem statement. In the problem of authorship attribution, the goal is to determine whether a given text has been written by a certain author, relying on other texts that the author is known to have written. To this end, word adjacency networks (WANs) are leveraged. WANs are graphs that are built by considering function words (i.e., words without semantic meaning) as nodes, and considering their co-occurrences as edges [5]. As it happens, the way each author uses function words gives rise to a stylistic signature that can be used to attribute authorship. In what follows, we address this problem by leveraging previously constructed WANs as graphs, and the frequency count (histogram) of function words as the corresponding graph signals.

Dataset. For illustrative purposes, in what follows, works by Jane Austen are considered. Attribution of other 19th century authors can be found in the supplementary material. A WAN consisting of 189 nodes (function words) and 9812 edges is built from texts belonging to a given corpus considered to be the training set. These texts are partitioned into segments of approximately 1000 words each, and the frequency count of those 189 function words in each of the segments is obtained. These represent the graph signals $\mathbf{x}$ that are considered to be part of the training set. Each of these is assigned the label 1 to indicate that it has been authored by Jane Austen. An equal number of segments from other contemporary authors are randomly selected, and then their frequency counts are computed and added to the training set with the label 0 to indicate that they have not been written by Jane Austen. The total number of labeled samples in the training set is 1464, of which 118 are left aside for validation. The test set is built analogously by considering other text segments that were not part of the training set (and thus not used to build the WAN either), totaling 78 graph signals (half corresponding to texts authored by Jane Austen, and half corresponding to texts by other contemporary authors).

Architectures and training. For the first experiment, we compare the Learn NVGF architecture (18) with arguably three of the most popular non-GSP GNNs, namely, GCN [24], SGC [25], and GAT [26]. Note that the Learn NVGF is an entirely linear architecture, but one capable of creating frequencies due to the nature of the NVGF. The filter taps of both the LSIGF and the NVGF are learned from data. The other three architectures are nonlinear since they include a ReLU activation function after the first filtering layer. All architectures include a learnable linear readout layer. Dropout with probability 0.5 is included after the first layer. All architectures are trained for 25 epochs with a batch size of 20, using the ADAM algorithm [32] with forgetting factors 0.9 and 0.999, respectively, as well as a learning rate $\eta$. The value of the learning rate $\eta$, the number of hidden units $F$, and the number of filter taps $K$ are chosen by exhaustive search over triplets $(\eta,F,K)$ in the set $\{0.001,0.005,0.01\}\times\{16,32,64\}\times\{2,3,4\}$.

Figure 2: Frequency responses to a single high-frequency input $\mathbf{x}=\mathbf{v}_{N}$: (a) LSIGF, (b) GCNN, (c) Learn NVGF. The frequency responses include the outputs of all $F=64$ filters.

Experiment 1: Performance comparison. The objective of this first experiment is to illustrate that the performance of the Learn NVGF is comparable to the performance of popular (non-GSP) GNNs. The best results for each architecture are shown in Figure 1a, where the classification error was averaged over 10 random splits of texts assigned to the training and test sets. One third of the standard deviation is also shown. It is observed that the Learn NVGF architecture has a comparable performance. It is emphasized that the objective of this paper is not to achieve state-of-the-art performance, but to offer insight on the role of nonlinear activation functions in frequency creation and how this translates to performance, as discussed next. Among the three popular architectures, GAT ($\eta=0.01$, $F=64$) exhibits the best performance, which is 102% better than SGC ($\eta=0.005$, $F=64$, $K=2$) and 30% better than GCN ($\eta=0.01$, $F=64$). This is expected since the graph filters involved in GAT are non-convolutional, and thus have more expressivity. Yet, when compared to the linear Learn NVGF ($\eta=0.001$, $F=32$, $K=3$), it is observed that the latter exhibits 20% better performance than the GAT.

Experiment 2: The role of frequency creation. For the second experiment, we compare the Learn NVGF of the previous experiment with (i) a simple LSIGF, (ii) a Design NVGF as in (17), and (iii) a GCNN as in (16). The same values ($\eta=0.001$, $F=32$, $K=3$) are used for all architectures as a means of fixing all other sources of variability except for the nonlinearity/frequency creation. The results are shown in Figure 1b. Note that the GCNN in (16) can be interpreted as a stand-in for ChebNets [23], arguably the most popular GSP-based GNN architecture.

Discussion. First, note that the LSIGF, Design NVGF, and Learn NVGF architectures are linear, whereas the GCNN is not. The LSIGF, however, is not capable of creating frequencies, while the other three are, albeit through different mechanisms. Second, Figure 1b shows the percentage difference in performance with respect to the base architecture (the GCNN). Essentially, the difference in performance between the LSIGF and the Learn NVGF can be pinned down to the frequency creation, because both architectures are linear, while the difference between the Learn NVGF and the GCNN can be tied to the nonlinear nature of the GCNN. It is thus observed that the LSIGF performs considerably worse than the Learn NVGF and the GCNN, which perform the same. It is observed that the Design NVGF performs halfway between the GCNN and the LSIGF. The Design NVGF depends on the ability to accurately estimate the first and second moments from the data, and this has an impact on its performance. In any case, it is noted that this experiment suggests that the main driver of improved performance is the frequency creation and not necessarily the nonlinear nature of the GCNN.

From a qualitative standpoint, the average frequency response of the signals in the test set is shown in Figure 1c. Since the high-eigenvalue content is significant, it is expected that the ability to better process this content will impact the overall performance. This explains the relatively poor performance of the LSIGF. In Figure 2, we show the frequency response of the output for each of the three architectures (LSIGF, Learn NVGF, and GCNN) when the input has a single high frequency, i.e., $\mathbf{x}=\mathbf{v}_{N}$ so that $\mathbf{\tilde{x}}=\mathbf{e}_{N}$. Figure 2a shows that the output frequency response of the LSIGF exhibits a single frequency, the same as the input. Figure 2b shows that the output frequency response of the GCNN has content in all frequencies, but most notably a low-frequency peak appears. Figure 2c shows that the output frequency response of the Learn NVGF contains all frequencies, in a much more spread manner than the GCNN.

General observations. In the supplementary material, a similar analysis is carried out for 21 other authors. Additionally, it is noted that Jane Austen is representative of the largest group (consisting of 11 authors) for which the Learn NVGF and the GCNN have similar performance and are better than the LSIGF. For 7 other authors, the Learn NVGF actually performs better than the GCNN. Finally, for the remaining 3 authors, there is no significant difference between the performance of the LSIGF, the GCNN, and the Learn NVGF, which implies that the high-frequency eigenvalue content is less significant for these authors. Additionally, the problem of movie recommendation is considered [33]. First, it is observed that the Learn NVGF exhibits better performance than the methods in [3, 4] and the nearest neighbor algorithm. Second, it is noted that the input signals do not have significant high-eigenvalue content, and thus the LSIGF, the GCNN, and the Learn NVGF exhibit similar performance. For a detailed analysis, please see the supplementary material. It is noted that the popular benchmark of semi-supervised learning over citation networks does not fit the empirical risk minimization framework nor the graph signal processing framework, and thus it does not admit a frequency analysis, precluding its use in this work.

Problems beyond processing graph signals. Two popular tasks that utilize graph-based data are semi-supervised learning and graph classification. The former involves a framework where each sample represents a node in the graph, i.e., $\mathbf{x}\in\mathbb{R}^{F}$ for $F$ features instead of $\mathbf{x}\in\mathbb{R}^{N}$ for $N$ nodes. This implies that the notion of graph frequency response used in (4) does not translate to the semi-supervised setting. Therefore, extending these results to this problem requires careful determination of the notion of frequency. For the graph classification problem, each sample in the dataset represents a signal together with a graph, and the graph associated to each sample is usually different. In this case, while the notion of frequency is properly defined for each sample, an overarching frequency response for all the samples is not. Possible extensions of this framework to the graph classification problem would entail redefining a common notion of frequency response for the dataset, for example, using graphons.

7 Conclusion

The objective of this work was to study the role of frequency creation in GSP problems. To do so, nonlinear activation functions (which theoretical findings suggest give rise to frequency creation) were replaced by NVGFs, which are also capable of creating frequencies, but in a linear manner. In this way, frequency creation was decoupled from the nonlinear nature of activation functions. Numerical experiments show that the main driver of improved performance is frequency creation and not necessarily the nonlinear nature of GCNNs. As future work, we are interested in extending this frequency analysis to non-GSP problems such as semi-supervised node classification and graph classification; as discussed, these extensions require a careful definition of an appropriate notion of graph frequency, which opens up an interesting area of future research. We also note that it would be possible to use shift-variant filters to leverage this framework when using CNNs.

References

  • [1] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst, “Geometric deep learning: Going beyond Euclidean data,” IEEE Signal Process. Mag., vol. 34, no. 4, pp. 18–42, July 2017.
  • [2] F. Gama, E. Isufi, G. Leus, and A. Ribeiro, “Graphs, convolutions, and neural networks: From graph filters to graph neural networks,” IEEE Signal Process. Mag., vol. 37, no. 6, pp. 128–138, Nov. 2020.
  • [3] F. Monti, M. M. Bronstein, and X. Bresson, “Geometric matrix completion with recurrent multi-graph neural networks,” in 31st Conf. Neural Inform. Process. Syst. Long Beach, CA: Neural Inform. Process. Syst. Foundation, 4-9 Dec. 2017, pp. 3700–3710.
  • [4] R. Levie, F. Monti, X. Bresson, and M. M. Bronstein, “CayleyNets: Graph convolutional neural networks with complex rational spectral filters,” IEEE Trans. Signal Process., vol. 67, no. 1, pp. 97–109, 5 Nov. 2018.
  • [5] S. Segarra, M. Eisen, and A. Ribeiro, “Authorship attribution through function word adjacency networks,” IEEE Trans. Signal Process., vol. 63, no. 20, pp. 5464–5478, 30 June 2015.
  • [6] F. Gama, A. G. Marques, G. Leus, and A. Ribeiro, “Convolutional neural network architectures for signals supported on graphs,” IEEE Trans. Signal Process., vol. 67, no. 4, pp. 1034–1049, 17 Dec. 2018.
  • [7] M. Eisen and A. Ribeiro, “Optimal wireless resource allocation with random edge graph neural networks,” IEEE Trans. Signal Process., vol. 68, pp. 2977–2991, 20 Apr. 2020.
  • [8] F. Gama and S. Sojoudi, “Distributed linear-quadratic control with graph neural networks,” Signal Process., 11 Feb. 2022, early access. [Online]. Available: http://doi.org/10.1016/j.sigpro.2022.108506
  • [9] D. Owerko, F. Gama, and A. Ribeiro, “Predicting power outages using graph neural networks,” in 2018 IEEE Global Conf. Signal and Inform. Process. Anaheim, CA: IEEE, 26-29 Nov. 2018, pp. 743–747.
  • [10] F. Gama, J. Bruna, and A. Ribeiro, “Stability properties of graph neural networks,” IEEE Trans. Signal Process., vol. 68, pp. 5680–5695, 25 Sep. 2020.
  • [11] H. Kenlay, D. Thanou, and X. Dong, “On the stability of graph convolutional neural networks under edge rewiring,” in 46th IEEE Int. Conf. Acoust., Speech and Signal Process. Toronto, ON: IEEE, 6-11 June 2021.
  • [12] R. Levie, W. Huang, L. Bucci, M. Bronstein, and G. Kutyniok, “Transferability of spectral graph convolutional neural networks,” arXiv:1907.12972v2 [cs.LG], 5 March 2020. [Online]. Available: http://arxiv.org/abs/1907.12972
  • [13] F. Gao, G. Wolf, and M. Hirn, “Geometric scattering for graph data analysis,” in 36th Int. Conf. Mach. Learning, vol. 97. Long Beach, CA: Proc. Mach. Learning Res., 9-15 June 2019, pp. 2122–2131.
  • [14] D. Zou and G. Lerman, “Graph convolutional neural networks via scattering,” Appl. Comput. Harmonic Anal., vol. 49, no. 3, pp. 1046–1074, Nov. 2020.
  • [15] S. Mallat, “Group invariant scattering,” Commun. Pure, Appl. Math., vol. 65, no. 10, pp. 1331–1398, Oct. 2012.
  • [16] J. Bruna and S. Mallat, “Invariant scattering convolution networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1872–1886, Aug. 2013.
  • [17] A. Sandryhaila and J. M. F. Moura, “Discrete signal processing on graphs: Frequency analysis,” IEEE Trans. Signal Process., vol. 62, no. 12, pp. 3042–3054, 30 Apr. 2014.
  • [18] L. Ruiz, F. Gama, and A. Ribeiro, “Graph neural networks: Architectures, stability and transferability,” Proc. IEEE, vol. 109, no. 5, pp. 660–682, May 2021.
  • [19] S. Segarra, A. G. Marques, and A. Ribeiro, “Optimal graph-filter design and applications to distributed linear networks operators,” IEEE Trans. Signal Process., vol. 65, no. 15, pp. 4117–4131, 11 May 2017.
  • [20] S. Rey, V. Tenorio, S. Rozada, L. Martino, and A. G. Marques, “Deep encoder-decoder neural network architectures for graph output signals,” in 53rd Asilomar Conf. Signals, Systems and Comput. Pacific Grove, CA: IEEE, 3-6 Nov. 2019, pp. 225–229.
  • [21] S. Rey, V. Tenorio, S. Rozada, L. Martino, and A. G. Marques, “Overparametrized deep encoder-decoder schemes for inputs and outputs defined over graphs,” in 28th Eur. Signal Process. Conf. Amsterdam, the Netherlands: Eur. Assoc. Signal Process., 18-21 Jan. 2021, pp. 855–859.
  • [22] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, “Spectral networks and deep locally connected networks on graphs,” in 2nd Int. Conf. Learning Representations, Banff, AB, 14-16 Apr. 2014, pp. 1–14.
  • [23] M. Defferrard, X. Bresson, and P. Vandergheynst, “Convolutional neural networks on graphs with fast localized spectral filtering,” in 30th Conf. Neural Inform. Process. Syst. Barcelona, Spain: Neural Inform. Process. Syst. Foundation, 5-10 Dec. 2016, pp. 3844–3858.
  • [24] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in 5th Int. Conf. Learning Representations, Toulon, France, 24-26 Apr. 2017, pp. 1–14.
  • [25] F. Wu, T. Zhang, A. H. de Souza Jr., C. Fifty, T. Yu, and K. Q. Weinberger, “Simplifying graph convolutional networks,” in 36th Int. Conf. Mach. Learning, vol. 97. Long Beach, CA: Proc. Mach. Learning Res., 9-15 June 2019, pp. 6861–6871.
  • [26] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph attention networks,” in 6th Int. Conf. Learning Representations, Vancouver, BC, 30 Apr.-3 May 2018, pp. 1–12.
  • [27] A. Sandryhaila and J. M. F. Moura, “Discrete signal processing on graphs,” IEEE Trans. Signal Process., vol. 61, no. 7, pp. 1644–1656, 11 Jan. 2013.
  • [28] D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst, “The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains,” IEEE Signal Process. Mag., vol. 30, no. 3, pp. 83–98, May 2013.
  • [29] A. Ortega, P. Frossard, J. Kovačević, J. M. F. Moura, and P. Vandergheynst, “Graph signal processing: Overview, challenges and applications,” Proc. IEEE, vol. 106, no. 5, pp. 808–828, May 2018.
  • [30] N. Perraudin and P. Vandergheynst, “Stationary signal processing on graphs,” IEEE Trans. Signal Process., vol. 65, no. 13, pp. 3462–3477, 3 Apr. 2017.
  • [31] A. G. Marques, S. Segarra, G. Leus, and A. Ribeiro, “Stationary graph processes and spectral estimation,” IEEE Trans. Signal Process., vol. 65, no. 22, pp. 5911–5926, 11 Aug. 2017.
  • [32] D. P. Kingma and J. L. Ba, “ADAM: A method for stochastic optimization,” in 3rd Int. Conf. Learning Representations, San Diego, CA, 7-9 May 2015, pp. 1–15.
  • [33] F. M. Harper and J. A. Konstan, “The MovieLens datasets: History and context,” ACM Trans. Interactive Intell. Syst., vol. 5, no. 4, pp. 19:(1–19), Jan. 2016.

Appendix A Frequency Response of Node-Variant Graph Filters

Proof of Proposition 1.

The output graph signal $\mathbf{y}\in\mathbb{R}^{N}$ of a node-variant graph filter is given by (3), which is reproduced here for ease of exposition:

$\mathbf{y}=\sum_{k=0}^{K}\mathsf{diag}(\mathbf{h}^{(k)})\mathbf{S}^{k}\mathbf{x}.$ (19)

Recall that $\mathbf{h}^{(k)}\in\mathbb{R}^{N}$ is the $(k+1)^{\text{th}}$ column of the matrix $\mathbf{H}\in\mathbb{R}^{N\times(K+1)}$ containing the $N(K+1)$ filter taps, that $\mathbf{S}\in\mathbb{R}^{N\times N}$ is the support matrix describing the graph, and that $\mathbf{x}\in\mathbb{R}^{N}$ is the input graph signal. As given by (4), the spectral representation $\mathbf{\tilde{x}}\in\mathbb{R}^{N}$ of the input graph signal contains the coordinates of $\mathbf{x}$ in the eigenbasis $\{\mathbf{v}_{i}\}_{i=1}^{N}$ of the support matrix $\mathbf{S}$, i.e., $\mathbf{\tilde{x}}=\mathbf{V}^{\mathsf{T}}\mathbf{x}$. Similarly, $\mathbf{\tilde{y}}=\mathbf{V}^{\mathsf{T}}\mathbf{y}$. Therefore, by (19) together with the fact that $\mathbf{S}=\mathbf{V}\mathsf{diag}(\boldsymbol{\lambda})\mathbf{V}^{\mathsf{T}}$, it holds that

$\mathbf{\tilde{y}}=\mathbf{V}^{\mathsf{T}}\mathbf{y}=\mathbf{V}^{\mathsf{T}}\sum_{k=0}^{K}\mathsf{diag}(\mathbf{h}^{(k)})\mathbf{S}^{k}\mathbf{x}=\mathbf{V}^{\mathsf{T}}\sum_{k=0}^{K}\mathsf{diag}(\mathbf{h}^{(k)})\mathbf{V}\mathsf{diag}(\boldsymbol{\lambda}^{k})\mathbf{\tilde{x}}$ (20)

where $\boldsymbol{\lambda}^{k}$ is the $N$-vector with $i^{\text{th}}$ element given by $[\boldsymbol{\lambda}^{k}]_{i}=\lambda_{i}^{k}$. Denoting $h_{ik}=[\mathbf{h}^{(k)}]_{i}=[\mathbf{H}]_{i(k+1)}$, note that $\mathsf{diag}(\mathbf{h}^{(k)})\mathbf{V}\mathsf{diag}(\boldsymbol{\lambda}^{k})\in\mathbb{R}^{N\times N}$, so that

$\sum_{k=0}^{K}[\mathsf{diag}(\mathbf{h}^{(k)})\mathbf{V}\mathsf{diag}(\boldsymbol{\lambda}^{k})]_{ij}=\sum_{k=0}^{K}h_{ik}\lambda_{j}^{k}v_{ij}=v_{ij}\sum_{k=0}^{K}h_{ik}\lambda_{j}^{k}=v_{ij}\mathsf{\tilde{h}}_{i}(\lambda_{j})$ (21)

where $\mathsf{\tilde{h}}_{i}(\lambda)=\sum_{k=0}^{K}h_{ik}\lambda^{k}$ is the frequency response of the filter taps at node $v_{i}$; see (6). Next, observe that the filter taps for each node are collected in the rows of $\mathbf{H}$. Therefore, in analogy to (8), it holds that $\mathbf{H}\mathbf{\Lambda}^{\mathsf{T}}\in\mathbb{R}^{N\times N}$, with

$[\mathbf{H}\mathbf{\Lambda}^{\mathsf{T}}]_{ij}=\sum_{k=0}^{K}h_{ik}\lambda_{j}^{k}=\mathsf{\tilde{h}}_{i}(\lambda_{j}).$ (22)

Substituting (22) into (21) gives that

$\sum_{k=0}^{K}[\mathsf{diag}(\mathbf{h}^{(k)})\mathbf{V}\mathsf{diag}(\boldsymbol{\lambda}^{k})]_{ij}=[\mathbf{V}\circ(\mathbf{H}\mathbf{\Lambda}^{\mathsf{T}})]_{ij}$ (23)

so that (20) becomes

$\mathbf{\tilde{y}}=\mathbf{V}^{\mathsf{T}}\big(\mathbf{V}\circ(\mathbf{H}\mathbf{\Lambda}^{\mathsf{T}})\big)\mathbf{\tilde{x}}.$ (24)

This completes the proof. ∎

Proof of Corollary 2.

The $j^{\text{th}}$ element of the graph Fourier transform of the output, $\mathbf{\tilde{y}}$, is

$\tilde{y}_{j}=\sum_{i=1}^{N}\tilde{x}_{i}\mathbf{v}_{j}^{\mathsf{T}}\mathsf{diag}(\boldsymbol{\tilde{h}}(\lambda_{i}))\mathbf{v}_{i}$ (25)

where $\boldsymbol{\tilde{h}}(\lambda_{i})=[\mathsf{\tilde{h}}_{1}(\lambda_{i}),\mathsf{\tilde{h}}_{2}(\lambda_{i}),\dots,\mathsf{\tilde{h}}_{N}(\lambda_{i})]\in\mathbb{R}^{N}$ collects the frequency responses of all nodes at eigenvalue $\lambda_{i}$. Note that, if $\tilde{y}_{j}$ depends on $\tilde{x}_{i}$ for some $i\neq j$, then frequencies are created. When $\mathbf{V}^{\mathsf{T}}\big(\mathbf{V}\circ(\mathbf{H}\mathbf{\Lambda}^{\mathsf{T}})\big)$ is not diagonal, there exist $i,j\in\{1,\ldots,N\}$ such that $i\neq j$ and $\big(\mathbf{V}^{\mathsf{T}}\big(\mathbf{V}\circ(\mathbf{H}\mathbf{\Lambda}^{\mathsf{T}})\big)\big)_{ij}=\mathbf{v}_{j}^{\mathsf{T}}\mathsf{diag}(\boldsymbol{\tilde{h}}(\lambda_{i}))\mathbf{v}_{i}\neq 0$. In this case, (25) indeed shows that frequencies are created.

Now suppose that no frequency creation occurs. Then 𝐕𝖳(𝐕(𝐇𝚲𝖳))\mathbf{V}^{\mathsf{T}}\big{(}\mathbf{V}\circ(\mathbf{H}\mathbf{\Lambda}^{\mathsf{T}})\big{)} must be diagonal. Therefore, there exist α1,…,αN\alpha_{1},\ldots,\alpha_{N}\in\mathbb{R} such that 𝐯j𝖳𝖽𝗂𝖺𝗀(𝒉~(λi))𝐯i=αiδij\mathbf{v}_{j}^{\mathsf{T}}\mathsf{diag}(\boldsymbol{\tilde{h}}(\lambda_{i}))\mathbf{v}_{i}=\alpha_{i}\delta_{ij} for all i,ji,j. Recalling that the set {𝐯i}i=1N\{\mathbf{v}_{i}\}_{i=1}^{N} of eigenvectors of 𝐒\mathbf{S} is an orthonormal basis, we have that 𝐯j𝖳𝐯i=δij\mathbf{v}_{j}^{\mathsf{T}}\mathbf{v}_{i}=\delta_{ij}, so it must be that 𝐯i𝖳(𝖽𝗂𝖺𝗀(𝒉~(λi))αi𝐈)𝐯j=0\mathbf{v}_{i}^{\mathsf{T}}(\mathsf{diag}(\boldsymbol{\tilde{h}}(\lambda_{i}))-\alpha_{i}\mathbf{I})\mathbf{v}_{j}=0 for all i,ji,j. This implies for all i,ji,j that vji=0v_{ji}=0 or 𝗁~j(λi)=αi\mathsf{\tilde{h}}_{j}(\lambda_{i})=\alpha_{i}. In the case that vji0v_{ji}\neq 0 for all i,ji,j, it is clear that 𝖽𝗂𝖺𝗀(𝒉~(λi))=αi𝐈\mathsf{diag}(\boldsymbol{\tilde{h}}(\lambda_{i}))=\alpha_{i}\mathbf{I}, meaning that the frequency response at all nodes is the same. This restriction implies that the NVGF filter is a LSI graph filter. This concludes the proof. ∎
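The dichotomy established above can be illustrated with the same kind of random example (again, only an illustration): identical filter taps at every node yield a diagonal operator in the frequency domain, i.e., no frequencies are created, whereas generic node-variant taps produce off-diagonal entries.

```python
import numpy as np

# Illustration of Corollary 2 on random data: an LSI filter (same taps at every
# node) yields a diagonal operator V^T (V o (H Lambda^T)); node-variant taps do not.
rng = np.random.default_rng(1)
N, K = 8, 3
S = rng.standard_normal((N, N))
S = (S + S.T) / 2
lam, V = np.linalg.eigh(S)
Lambda = np.vander(lam, K + 1, increasing=True)      # [Lambda]_{ik} = lam_i^k

def freq_operator(H):
    """Map from xtilde to ytilde as in (24)."""
    return V.T @ (V * (H @ Lambda.T))

H_nv = rng.standard_normal((N, K + 1))               # node-variant filter taps
H_lsi = np.tile(rng.standard_normal(K + 1), (N, 1))  # identical taps at all nodes

def max_offdiag(M):
    return np.abs(M - np.diag(np.diag(M))).max()

print(max_offdiag(freq_operator(H_lsi)))             # ~1e-15: no frequency creation
print(max_offdiag(freq_operator(H_nv)))              # > 0: frequencies are created
```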

Appendix B Lipschitz Continuity of Node-Variant Graph Filters

Proof of Theorem 3.

Leveraging the fact that the filter taps of both 𝑯nv(𝐒^)\boldsymbol{H}^{\text{nv}}(\mathbf{\hat{S}}) and 𝑯nv(𝐒)\boldsymbol{H}^{\text{nv}}(\mathbf{S}) are the same, start by writing the difference between the filter outputs as

(𝑯nv(𝐒^)𝑯nv(𝐒))𝐱=k=0K𝖽𝗂𝖺𝗀(𝐡(k))(𝐒^k𝐒k)𝐱.\big{(}\boldsymbol{H}^{\text{nv}}(\mathbf{\hat{S}})-\boldsymbol{H}^{\text{nv}}(\mathbf{S})\big{)}\mathbf{x}=\sum_{k=0}^{K}\mathsf{diag}(\mathbf{h}^{(k)})\big{(}\mathbf{\hat{S}}^{k}-\mathbf{S}^{k}\big{)}\mathbf{x}. (26)

Let 𝐄=𝐒^𝐒\mathbf{E}=\mathbf{\hat{S}}-\mathbf{S} and note that 𝐄\mathbf{E} is symmetric and satisfies 𝐄=𝐒^𝐒ε\|\mathbf{E}\|=\|\mathbf{\hat{S}}-\mathbf{S}\|\leq\varepsilon by assumption. Recall that (𝐒+𝐄)k=𝐒k+r=0k1𝐒r𝐄𝐒kr1+𝐂(\mathbf{S}+\mathbf{E})^{k}=\mathbf{S}^{k}+\sum_{r=0}^{k-1}\mathbf{S}^{r}\mathbf{E}\mathbf{S}^{k-r-1}+\mathbf{C} with 𝐂\mathbf{C} such that 𝐂r=2k(kr)𝐄r𝐒kr\|\mathbf{C}\|\leq\sum_{r=2}^{k}\binom{k}{r}\|\mathbf{E}\|^{r}\|\mathbf{S}\|^{k-r}. Leveraging this fact in (26) yields

(𝑯nv\displaystyle\big{(}\boldsymbol{H}^{\text{nv}} (𝐒^)𝑯nv(𝐒))𝐱\displaystyle(\mathbf{\hat{S}})-\boldsymbol{H}^{\text{nv}}(\mathbf{S})\big{)}\mathbf{x} (27)
=k=0K𝖽𝗂𝖺𝗀(𝐡(k))r=0k1𝐒r𝐄𝐒kr1𝐱+𝐃𝐱\displaystyle=\sum_{k=0}^{K}\mathsf{diag}(\mathbf{h}^{(k)})\sum_{r=0}^{k-1}\mathbf{S}^{r}\mathbf{E}\mathbf{S}^{k-r-1}\mathbf{x}+\mathbf{D}\mathbf{x}

with 𝐃\mathbf{D} satisfying 𝐃=𝖮(𝐄2)\|\mathbf{D}\|=\mathsf{O}(\|\mathbf{E}\|^{2}), since the filter taps 𝐇\mathbf{H} define analytic frequency responses 𝗁~i(λ)\mathsf{\tilde{h}}_{i}(\lambda) with bounded derivatives for all i{1,…,N}i\in\{1,\ldots,N\}. The input graph signal 𝐱\mathbf{x} can be rewritten as 𝐱=i=1Nx~i𝐯i\mathbf{x}=\sum_{i=1}^{N}\tilde{x}_{i}\mathbf{v}_{i} using the GFT for the support matrix 𝐒\mathbf{S}. Then, (27) becomes

(𝑯nv(𝐒^)𝑯nv(𝐒))𝐱\displaystyle\big{(}\boldsymbol{H}^{\text{nv}}(\mathbf{\hat{S}})-\boldsymbol{H}^{\text{nv}}(\mathbf{S})\big{)}\mathbf{x} (28)
=i=1Nx~ik=0K𝖽𝗂𝖺𝗀(𝐡(k))r=0k1𝐒r𝐄𝐒kr1𝐯i+i=1Nx~i𝐃𝐯i.\displaystyle=\sum_{i=1}^{N}\tilde{x}_{i}\sum_{k=0}^{K}\mathsf{diag}(\mathbf{h}^{(k)})\sum_{r=0}^{k-1}\mathbf{S}^{r}\mathbf{E}\mathbf{S}^{k-r-1}\mathbf{v}_{i}+\sum_{i=1}^{N}\tilde{x}_{i}\mathbf{D}\mathbf{v}_{i}.

Using the fact that 𝐯i\mathbf{v}_{i} is an eigenvector of 𝐒\mathbf{S}, the first term in (28) can be conveniently rewritten as

i=1Nx~i\displaystyle\sum_{i=1}^{N}\tilde{x}_{i} k=0K𝖽𝗂𝖺𝗀(𝐡(k))r=0k1𝐒r𝐄𝐒kr1𝐯i\displaystyle\sum_{k=0}^{K}\mathsf{diag}(\mathbf{h}^{(k)})\sum_{r=0}^{k-1}\mathbf{S}^{r}\mathbf{E}\mathbf{S}^{k-r-1}\mathbf{v}_{i} (29)
=i=1Nx~ik=0K𝖽𝗂𝖺𝗀(𝐡(k))r=0k1λikr1𝐒r𝐄𝐯i.\displaystyle=\sum_{i=1}^{N}\tilde{x}_{i}\sum_{k=0}^{K}\mathsf{diag}(\mathbf{h}^{(k)})\sum_{r=0}^{k-1}\lambda_{i}^{k-r-1}\mathbf{S}^{r}\mathbf{E}\mathbf{v}_{i}.

Denoting by 𝐄=𝐔𝖽𝗂𝖺𝗀(𝐦)𝐔𝖳\mathbf{E}=\mathbf{U}\mathsf{diag}(\mathbf{m})\mathbf{U}^{\mathsf{T}} the eigendecomposition of 𝐄\mathbf{E} with 𝐄𝐮i=mi𝐮i\mathbf{E}\mathbf{u}_{i}=m_{i}\mathbf{u}_{i} the corresponding eigenvectors 𝐮i\mathbf{u}_{i} and eigenvalues mim_{i}, [10, Lemma 1] states that 𝐄𝐯i=mi𝐯i+𝐄U𝐯i\mathbf{E}\mathbf{v}_{i}=m_{i}\mathbf{v}_{i}+\mathbf{E}_{U}\mathbf{v}_{i} with 𝐄U8ε\|\mathbf{E}_{U}\|\leq 8\varepsilon for an appropriate matrix 𝐄U\mathbf{E}_{U} that depends on 𝐔\mathbf{U}, 𝐒\mathbf{S}, and 𝐄\mathbf{E}. Using this in (29),

i=1N\displaystyle\sum_{i=1}^{N} x~ik=0K𝖽𝗂𝖺𝗀(𝐡(k))r=0k1λikr1𝐒r𝐄𝐯i\displaystyle\tilde{x}_{i}\sum_{k=0}^{K}\mathsf{diag}(\mathbf{h}^{(k)})\sum_{r=0}^{k-1}\lambda_{i}^{k-r-1}\mathbf{S}^{r}\mathbf{E}\mathbf{v}_{i}
=i=1Nx~imik=0K𝖽𝗂𝖺𝗀(𝐡(k))r=0k1λikr1𝐒r𝐯i\displaystyle=\sum_{i=1}^{N}\tilde{x}_{i}m_{i}\sum_{k=0}^{K}\mathsf{diag}(\mathbf{h}^{(k)})\sum_{r=0}^{k-1}\lambda_{i}^{k-r-1}\mathbf{S}^{r}\mathbf{v}_{i} (30)
+i=1Nx~ik=0K𝖽𝗂𝖺𝗀(𝐡(k))r=0k1λikr1𝐒r𝐄U𝐯i\displaystyle\quad+\sum_{i=1}^{N}\tilde{x}_{i}\sum_{k=0}^{K}\mathsf{diag}(\mathbf{h}^{(k)})\sum_{r=0}^{k-1}\lambda_{i}^{k-r-1}\mathbf{S}^{r}\mathbf{E}_{U}\mathbf{v}_{i} (31)

is obtained. For the term (30), note that 𝐒r𝐯i=λir𝐯i\mathbf{S}^{r}\mathbf{v}_{i}=\lambda_{i}^{r}\mathbf{v}_{i} and that r=0k1λikr1λir=kλik1\sum_{r=0}^{k-1}\lambda_{i}^{k-r-1}\lambda_{i}^{r}=k\lambda_{i}^{k-1}, so that the term (30) is equivalent to

i=1Nx~imi\displaystyle\sum_{i=1}^{N}\tilde{x}_{i}m_{i} k=0Kk𝖽𝗂𝖺𝗀(𝐡(k))λik1𝐯i\displaystyle\sum_{k=0}^{K}k\mathsf{diag}(\mathbf{h}^{(k)})\lambda_{i}^{k-1}\mathbf{v}_{i} (32)
=i=1Nx~imi𝖽𝗂𝖺𝗀(𝒉~(λi))𝐯i\displaystyle=\sum_{i=1}^{N}\tilde{x}_{i}m_{i}\mathsf{diag}(\boldsymbol{\tilde{h}}^{\prime}(\lambda_{i}))\mathbf{v}_{i}

where 𝒉~(λ)N\boldsymbol{\tilde{h}}^{\prime}(\lambda)\in\mathbb{R}^{N} is a vector where the ithi^{\text{th}} entry is the derivative of the frequency response of node viv_{i}, i.e., [𝒉~(λ)]i=𝗁~i(λ)=ddλ𝗁~i(λ)[\boldsymbol{\tilde{h}}^{\prime}(\lambda)]_{i}=\mathsf{\tilde{h}}^{\prime}_{i}(\lambda)=\frac{d}{d\lambda}\mathsf{\tilde{h}}_{i}(\lambda). To rewrite (31), consider the following lemma, which is conveniently proved after completing the current proof.

Lemma 8.

For all i{1,,N}i\in\{1,\dots,N\}, define 𝐆iN×N\mathbf{G}_{i}\in\mathbb{R}^{N\times N} by

[𝐆i]tj={𝗁~t(λi) if j=i𝗁~t(λi)𝗁~t(λj)λiλj if ji[\mathbf{G}_{i}]_{tj}=\begin{cases}\mathsf{\tilde{h}}_{t}^{\prime}(\lambda_{i})&\text{ if }j=i\\ \frac{\mathsf{\tilde{h}}_{t}(\lambda_{i})-\mathsf{\tilde{h}}_{t}(\lambda_{j})}{\lambda_{i}-\lambda_{j}}&\text{ if }j\neq i\end{cases} (33)

where 𝗁~t\mathsf{\tilde{h}}_{t} is the frequency response at node vtv_{t} and 𝗁~t\mathsf{\tilde{h}}^{\prime}_{t} is its derivative. Then

i=1Nx~i\displaystyle\sum_{i=1}^{N}\tilde{x}_{i} k=0K𝖽𝗂𝖺𝗀(𝐡(k))r=0k1λikr1𝐒r𝐄U𝐯i\displaystyle\sum_{k=0}^{K}\mathsf{diag}(\mathbf{h}^{(k)})\sum_{r=0}^{k-1}\lambda_{i}^{k-r-1}\mathbf{S}^{r}\mathbf{E}_{U}\mathbf{v}_{i} (34)
=i=1Nx~i(𝐕𝐆i)𝐕𝖳𝐄U𝐯i.\displaystyle=\sum_{i=1}^{N}\tilde{x}_{i}\big{(}\mathbf{V}\circ\mathbf{G}_{i}\big{)}\mathbf{V}^{\mathsf{T}}\mathbf{E}_{U}\mathbf{v}_{i}.

With Lemma 8 in place, substitute (32) into (30) and (34) into (31), and then substitute these back into (28) to obtain

(𝑯nv\displaystyle\big{(}\boldsymbol{H}^{\text{nv}} (𝐒^)𝑯nv(𝐒))𝐱\displaystyle(\mathbf{\hat{S}})-\boldsymbol{H}^{\text{nv}}(\mathbf{S})\big{)}\mathbf{x} (35)
=i=1Nx~imi𝖽𝗂𝖺𝗀(𝒉~(λi))𝐯i\displaystyle=\sum_{i=1}^{N}\tilde{x}_{i}m_{i}\mathsf{diag}(\boldsymbol{\tilde{h}}^{\prime}(\lambda_{i}))\mathbf{v}_{i}
+i=1Nx~i(𝐕𝐆i)𝐕𝖳𝐄U𝐯i\displaystyle\quad+\sum_{i=1}^{N}\tilde{x}_{i}\big{(}\mathbf{V}\circ\mathbf{G}_{i}\big{)}\mathbf{V}^{\mathsf{T}}\mathbf{E}_{U}\mathbf{v}_{i}
+i=1Nx~i𝐃𝐯i.\displaystyle\quad+\sum_{i=1}^{N}\tilde{x}_{i}\mathbf{D}\mathbf{v}_{i}.

Next, bound the norm of (35) by applying the triangle inequality and bounding each of the three summands separately. For the norm of the first term in (35), the triangle inequality gives

i=1N\displaystyle\Big{\|}\sum_{i=1}^{N} x~imi𝖽𝗂𝖺𝗀(𝒉~(λi))𝐯i\displaystyle\tilde{x}_{i}m_{i}\mathsf{diag}(\boldsymbol{\tilde{h}}^{\prime}(\lambda_{i}))\mathbf{v}_{i}\Big{\|} (36)
i=1N|x~i||mi|𝖽𝗂𝖺𝗀(𝒉~(λi))𝐯i\displaystyle\leq\sum_{i=1}^{N}|\tilde{x}_{i}||m_{i}|\|\mathsf{diag}(\boldsymbol{\tilde{h}}^{\prime}(\lambda_{i}))\|\|\mathbf{v}_{i}\|

where the submultiplicative property of the operator norm was used to bound 𝖽𝗂𝖺𝗀(𝒉~(λi))𝐯i𝖽𝗂𝖺𝗀(𝒉~(λi))𝐯i\|\mathsf{diag}(\boldsymbol{\tilde{h}}^{\prime}(\lambda_{i}))\mathbf{v}_{i}\|\leq\|\mathsf{diag}(\boldsymbol{\tilde{h}}^{\prime}(\lambda_{i}))\|\|\mathbf{v}_{i}\|. Leveraging that 𝐯i=1\|\mathbf{v}_{i}\|=1, that 𝖽𝗂𝖺𝗀(𝒉~(λi))=maxj|𝗁~j(λi)|C\|\mathsf{diag}(\boldsymbol{\tilde{h}}^{\prime}(\lambda_{i}))\|=\max_{j}|\mathsf{\tilde{h}}^{\prime}_{j}(\lambda_{i})|\leq C by Lipschitz continuity, and that |mi|ε|m_{i}|\leq\varepsilon by the hypothesis that 𝐒𝐒^ε\|\mathbf{S}-\mathbf{\hat{S}}\|\leq\varepsilon, the norm can be further bounded as

i=1Nx~imi\displaystyle\Big{\|}\sum_{i=1}^{N}\tilde{x}_{i}m_{i} 𝖽𝗂𝖺𝗀(𝒉~(λi))𝐯i\displaystyle\mathsf{diag}(\boldsymbol{\tilde{h}}^{\prime}(\lambda_{i}))\mathbf{v}_{i}\Big{\|} (37)
εCi=1N|x~i|=εC𝐱~1\displaystyle\leq\varepsilon C\sum_{i=1}^{N}|\tilde{x}_{i}|=\varepsilon C\|\mathbf{\tilde{x}}\|_{1}
εCN𝐱~=εCN𝐱.\displaystyle\leq\varepsilon C\sqrt{N}\|\mathbf{\tilde{x}}\|=\varepsilon C\sqrt{N}\|\mathbf{x}\|.

Note that the inequality between the 11-norm and the 22-norm was used, as well as the fact that the GFT is a Parseval operator. For the norm of the second term in (35), the triangle inequality together with the submultiplicativity of the operator norm are used to get

\displaystyle\Big{\|} i=1Nx~i(𝐕𝐆i)𝐕𝖳𝐄U𝐯i\displaystyle\sum_{i=1}^{N}\tilde{x}_{i}\big{(}\mathbf{V}\circ\mathbf{G}_{i}\big{)}\mathbf{V}^{\mathsf{T}}\mathbf{E}_{U}\mathbf{v}_{i}\Big{\|} (38)
i=1N|x~i|𝐕𝐆i𝐕𝖳𝐄U𝐯i\displaystyle\leq\sum_{i=1}^{N}|\tilde{x}_{i}|\|\mathbf{V}\circ\mathbf{G}_{i}\|\|\mathbf{V}^{\mathsf{T}}\|\|\mathbf{E}_{U}\|\|\mathbf{v}_{i}\|
8εCNN𝐱.\displaystyle\leq 8\varepsilon CN\sqrt{N}\|\mathbf{x}\|.

For the last inequality to hold, denote by max\|\cdot\|_{\max} the entrywise maximum norm of a matrix, and note that 𝐕𝐆iN𝐕𝐆imax=Nmaxt,j{1,,N}(|[𝐕]tj||[𝐆i]tj|)NC\|\mathbf{V}\circ\mathbf{G}_{i}\|\leq N\|\mathbf{V}\circ\mathbf{G}_{i}\|_{\max}=N\max_{t,j\in\{1,\ldots,N\}}(|[\mathbf{V}]_{tj}|\ |[\mathbf{G}_{i}]_{tj}|)\leq NC for all i{1,,N}i\in\{1,\ldots,N\}, by the Lipschitz continuity of the frequency responses and the fact that 𝐕\mathbf{V} is an orthogonal matrix, so |[𝐕]tj|1|[\mathbf{V}]_{tj}|\leq 1 for all tt and jj. Also recall that 𝐄U8ε\|\mathbf{E}_{U}\|\leq 8\varepsilon. Finally, note that the inequality between the 11-norm and 22-norm of vectors together with the Parseval nature of the GFT were used. For the third term in (35), it simply holds that

\displaystyle\Big{\|} i=1Nx~i𝐃𝐯i=𝐃i=1Nx~i𝐯i=𝐃𝐱\sum_{i=1}^{N}\tilde{x}_{i}\mathbf{D}\mathbf{v}_{i}\Big{\|}=\Big{\|}\mathbf{D}\sum_{i=1}^{N}\tilde{x}_{i}\mathbf{v}_{i}\Big{\|}=\|\mathbf{D}\mathbf{x}\| (39)
𝐃𝐱𝖮(ε2)𝐱.\displaystyle\leq\|\mathbf{D}\|\|\mathbf{x}\|\leq\mathsf{O}(\varepsilon^{2})\|\mathbf{x}\|.

Substituting (37), (38), and (39) into (35) gives (10), which concludes the proof. ∎

Proof of Lemma 8.

We have for all i{1,,N}i\in\{1,\ldots,N\} that

k=0K𝖽𝗂𝖺𝗀(𝐡(k))r=0k1λikr1𝐕𝖽𝗂𝖺𝗀(𝝀r)\displaystyle\sum_{k=0}^{K}\mathsf{diag}(\mathbf{h}^{(k)})\sum_{r=0}^{k-1}\lambda_{i}^{k-r-1}\mathbf{V}\mathsf{diag}(\boldsymbol{\lambda}^{r}) (40)
=k=0K𝖽𝗂𝖺𝗀(𝐡(k))𝐕r=0k1λikr1𝖽𝗂𝖺𝗀(𝝀r).\displaystyle=\sum_{k=0}^{K}\mathsf{diag}(\mathbf{h}^{(k)})\mathbf{V}\sum_{r=0}^{k-1}\lambda_{i}^{k-r-1}\mathsf{diag}(\boldsymbol{\lambda}^{r}).

The last summation r=0k1λikr1𝖽𝗂𝖺𝗀(𝝀r)\sum_{r=0}^{k-1}\lambda_{i}^{k-r-1}\mathsf{diag}(\boldsymbol{\lambda}^{r}) is a diagonal matrix, where the jthj^{\text{th}} element of the diagonal can be written as

[\displaystyle\Big{[} r=0k1λikr1𝝀r]j=r=0k1λikr1λjr\displaystyle\sum_{r=0}^{k-1}\lambda_{i}^{k-r-1}\boldsymbol{\lambda}^{r}\Big{]}_{j}=\sum_{r=0}^{k-1}\lambda_{i}^{k-r-1}\lambda_{j}^{r} (41)
\displaystyle=[\boldsymbol{\gamma}_{i}]_{j}\coloneqq\begin{cases}k\lambda_{i}^{k-1}&\text{ if }j=i\\ \frac{\lambda_{i}^{k}-\lambda_{j}^{k}}{\lambda_{i}-\lambda_{j}}&\text{ if }j\neq i\end{cases}

so that

k=0K𝖽𝗂𝖺𝗀(𝐡(k))𝐕r=0k1λikr1𝖽𝗂𝖺𝗀(𝝀r)\displaystyle\sum_{k=0}^{K}\mathsf{diag}(\mathbf{h}^{(k)})\mathbf{V}\sum_{r=0}^{k-1}\lambda_{i}^{k-r-1}\mathsf{diag}(\boldsymbol{\lambda}^{r}) (42)
=k=0K𝖽𝗂𝖺𝗀(𝐡(k))𝐕𝖽𝗂𝖺𝗀(𝜸i).\displaystyle=\sum_{k=0}^{K}\mathsf{diag}(\mathbf{h}^{(k)})\mathbf{V}\mathsf{diag}(\boldsymbol{\gamma}_{i}).

Note that the ii subindex in 𝜸i\boldsymbol{\gamma}_{i} indicates that this is parametrized by λi\lambda_{i}, while each entry of this vector, i.e., the jthj^{\text{th}} entry, actually depends on λj\lambda_{j}. Now, remark that since we have a full matrix 𝐕\mathbf{V} in between two diagonal matrices, the matrix product does not commute.

To continue simplifying the expressions, start by considering a vector 𝐚N\mathbf{a}\in\mathbb{R}^{N}, a matrix 𝐁N×N\mathbf{B}\in\mathbb{R}^{N\times N} and another vector 𝐜N\mathbf{c}\in\mathbb{R}^{N}. Observe that

[𝖽𝗂𝖺𝗀(𝐚)𝐁𝖽𝗂𝖺𝗀(𝐜)]ij\displaystyle\big{[}\mathsf{diag}(\mathbf{a})\mathbf{B}\mathsf{diag}(\mathbf{c})\big{]}_{ij} =aicjbij\displaystyle=a_{i}c_{j}b_{ij}
𝖽𝗂𝖺𝗀(𝐚)𝐁𝖽𝗂𝖺𝗀(𝐜)\displaystyle\mathsf{diag}(\mathbf{a})\mathbf{B}\mathsf{diag}(\mathbf{c}) =𝐁(𝐚𝐜𝖳).\displaystyle=\mathbf{B}\circ(\mathbf{a}\mathbf{c}^{\mathsf{T}}).

Therefore, it can be written

k=0K𝖽𝗂𝖺𝗀(𝐡(k))𝐕𝖽𝗂𝖺𝗀(𝜸i)=𝐕(k=0K𝐡(k)𝜸i𝖳)=𝐕𝐆i\sum_{k=0}^{K}\mathsf{diag}(\mathbf{h}^{(k)})\mathbf{V}\mathsf{diag}(\boldsymbol{\gamma}_{i})=\mathbf{V}\circ\Big{(}\sum_{k=0}^{K}\mathbf{h}^{(k)}\boldsymbol{\gamma}_{i}^{\mathsf{T}}\Big{)}=\mathbf{V}\circ\mathbf{G}_{i} (43)

where 𝐆iN×N\mathbf{G}_{i}\in\mathbb{R}^{N\times N} is the matrix defined in (33);

\displaystyle\sum_{k=0}^{K}h_{tk}[\boldsymbol{\gamma}_{i}]_{j}=\begin{cases}\sum_{k=0}^{K}h_{tk}k\lambda_{i}^{k-1}&\text{ if }j=i\\ \sum_{k=0}^{K}h_{tk}\frac{\lambda_{i}^{k}-\lambda_{j}^{k}}{\lambda_{i}-\lambda_{j}}&\text{ if }j\neq i\end{cases} (44)
\displaystyle=\begin{cases}\mathsf{\tilde{h}}_{t}^{\prime}(\lambda_{i})&\text{ if }j=i\\ \frac{\mathsf{\tilde{h}}_{t}(\lambda_{i})-\mathsf{\tilde{h}}_{t}(\lambda_{j})}{\lambda_{i}-\lambda_{j}}&\text{ if }j\neq i\end{cases}=[\mathbf{G}_{i}]_{tj}.

Using this result in (40) and (42) gives that

k=0K𝖽𝗂𝖺𝗀(𝐡(k))r=0k1λikr1𝐕𝖽𝗂𝖺𝗀(𝝀r)=𝐕𝐆i.\sum_{k=0}^{K}\mathsf{diag}(\mathbf{h}^{(k)})\sum_{r=0}^{k-1}\lambda_{i}^{k-r-1}\mathbf{V}\mathsf{diag}(\boldsymbol{\lambda}^{r})=\mathbf{V}\circ\mathbf{G}_{i}. (45)

This equality, together with the fact that 𝐒r=𝐕𝖽𝗂𝖺𝗀(𝝀r)𝐕𝖳\mathbf{S}^{r}=\mathbf{V}\mathsf{diag}(\boldsymbol{\lambda}^{r})\mathbf{V}^{\mathsf{T}}, gives the desired result (34). ∎

Appendix C Optimal Unbiased Node-Variant Graph Filters

Proof of Lemma 4.

The NVGF-based estimator is given by:

𝐲^=𝑯nv(𝐒)𝐱+𝐜.\mathbf{\hat{y}}=\boldsymbol{H}^{\text{nv}}(\mathbf{S})\mathbf{x}+\mathbf{c}. (46)

This estimator is unbiased if and only if 𝔼[𝐲^]=𝔼[𝐲]=𝔼[ρ(𝐱)]=𝝁ρ\mathbb{E}[\mathbf{\hat{y}}]=\mathbb{E}[\mathbf{y}]=\mathbb{E}[\mathsf{\rho}(\mathbf{x})]=\boldsymbol{\mu}_{\rho}. By (46) and linearity of expectation, this holds precisely when

𝔼[𝐲^]=𝑯nv(𝐒)𝔼[𝐱]+𝐜=𝑯nv(𝐒)𝝁x+𝐜=𝝁ρ.\mathbb{E}[\mathbf{\hat{y}}]=\boldsymbol{H}^{\text{nv}}(\mathbf{S})\mathbb{E}[\mathbf{x}]+\mathbf{c}=\boldsymbol{H}^{\text{nv}}(\mathbf{S})\boldsymbol{\mu}_{x}+\mathbf{c}=\boldsymbol{\mu}_{\rho}. (47)

This is equivalent to the condition on 𝐜\mathbf{c} that

𝐜=𝝁ρ𝑯nv(𝐒)𝝁x,\mathbf{c}=\boldsymbol{\mu}_{\rho}-\boldsymbol{H}^{\text{nv}}(\mathbf{S})\boldsymbol{\mu}_{x}, (48)

which completes the proof. ∎

Proof of Proposition 5.

Given an unbiased estimator 𝐲^=𝑯nv(𝐒)(𝐱𝝁x)+𝝁ρ\mathbf{\hat{y}}=\boldsymbol{H}^{\text{nv}}(\mathbf{S})(\mathbf{x}-\boldsymbol{\mu}_{x})+\boldsymbol{\mu}_{\rho}, an optimal matrix 𝐇N×(K+1)\mathbf{H}^{\star}\in\mathbb{R}^{N\times(K+1)} of filter taps is one that minimizes (13). First, note that the estimator output at node viv_{i}, i.e., the ithi^{\text{th}} entry of 𝐲^\mathbf{\hat{y}}, is given by

y^i=[𝐲^]i=𝐮i𝖳𝖽𝗂𝖺𝗀(𝚲𝐡i)𝐕𝖳(𝐱𝝁x)+μρi\hat{y}_{i}=[\mathbf{\hat{y}}]_{i}=\mathbf{u}_{i}^{\mathsf{T}}\mathsf{diag}(\mathbf{\Lambda}\mathbf{h}_{i})\mathbf{V}^{\mathsf{T}}(\mathbf{x}-\boldsymbol{\mu}_{x})+\mu_{\rho i} (49)

where μρi=[𝝁ρ]i\mu_{\rho i}=[\boldsymbol{\mu}_{\rho}]_{i}, 𝐮iN\mathbf{u}_{i}\in\mathbb{R}^{N} is the ithi^{\text{th}} row of 𝐕\mathbf{V}, and 𝐡iK+1\mathbf{h}_{i}\in\mathbb{R}^{K+1} is the ithi^{\text{th}} row of the matrix 𝐇\mathbf{H}; see [19]. Note that if 𝐚,𝐛,𝐜\mathbf{a},\mathbf{b},\mathbf{c} are three NN-dimensional real vectors, then 𝐚𝖳𝖽𝗂𝖺𝗀(𝐛)𝐜=i=1Naibici\mathbf{a}^{\mathsf{T}}\mathsf{diag}(\mathbf{b})\mathbf{c}=\sum_{i=1}^{N}a_{i}b_{i}c_{i}, which means that the roles of the three vectors can be interchanged, i.e., 𝐚𝖳𝖽𝗂𝖺𝗀(𝐛)𝐜=𝐜𝖳𝖽𝗂𝖺𝗀(𝐚)𝐛\mathbf{a}^{\mathsf{T}}\mathsf{diag}(\mathbf{b})\mathbf{c}=\mathbf{c}^{\mathsf{T}}\mathsf{diag}(\mathbf{a})\mathbf{b}. Using this fact in (49) yields

y^i=(𝐱𝝁x)𝖳𝐕𝖽𝗂𝖺𝗀(𝐮i)𝚲𝐡i+μρi=𝒂i(𝐱)𝖳𝐡i+μρi\hat{y}_{i}=(\mathbf{x}-\boldsymbol{\mu}_{x})^{\mathsf{T}}\mathbf{V}\mathsf{diag}(\mathbf{u}_{i})\mathbf{\Lambda}\mathbf{h}_{i}+\mu_{\rho i}=\boldsymbol{a}_{i}(\mathbf{x})^{\mathsf{T}}\mathbf{h}_{i}+\mu_{\rho i} (50)

with

𝒂i(𝐱)=𝚲𝖳𝖽𝗂𝖺𝗀(𝐮i)𝐕𝖳(𝐱𝝁x).\boldsymbol{a}_{i}(\mathbf{x})=\mathbf{\Lambda}^{\mathsf{T}}\mathsf{diag}(\mathbf{u}_{i})\mathbf{V}^{\mathsf{T}}(\mathbf{x}-\boldsymbol{\mu}_{x}).

The objective function in (13) can be rewritten as

𝔼[𝐲^𝐲22]=𝔼[i=1N(y^iyi)2]=i=1N𝔼[(y^iyi)2].\mathbb{E}\big{[}\|\mathbf{\hat{y}}-\mathbf{y}\|_{2}^{2}\big{]}=\mathbb{E}\Big{[}\sum_{i=1}^{N}(\hat{y}_{i}-y_{i})^{2}\Big{]}=\sum_{i=1}^{N}\mathbb{E}\big{[}(\hat{y}_{i}-y_{i})^{2}\big{]}. (51)

Since each y^i\hat{y}_{i} depends only on 𝐡i\mathbf{h}_{i}, as indicated in (50), minimizing (51) over 𝐇\mathbf{H} is equivalent to minimizing each of the summands in (51) over 𝐡i\mathbf{h}_{i}. Therefore, (13) becomes equivalent to the following system of NN optimization problems over each of the rows of 𝐇\mathbf{H}:

min𝐡iK+1𝔼[(y^iyi)2],i{1,,N}.\min_{\mathbf{h}_{i}\in\mathbb{R}^{K+1}}\mathbb{E}\big{[}(\hat{y}_{i}-y_{i})^{2}\big{]},\quad i\in\{1,\ldots,N\}. (52)

Substituting (50) into the ithi^{\text{th}} objective function of (52) gives

𝔼[(y^iyi)2]=𝔼[(𝒂i(𝐱)𝖳𝐡i+μρiρ(xi))2]\displaystyle\mathbb{E}\big{[}(\hat{y}_{i}-y_{i})^{2}\big{]}=\mathbb{E}\big{[}\big{(}\boldsymbol{a}_{i}(\mathbf{x})^{\mathsf{T}}\mathbf{h}_{i}+\mu_{\rho i}-\rho(x_{i})\big{)}^{2}\big{]} (53)
=𝐡i𝖳𝔼[𝒂i(𝐱)𝒂i(𝐱)𝖳]𝐡i\displaystyle=\mathbf{h}_{i}^{\mathsf{T}}\mathbb{E}\big{[}\boldsymbol{a}_{i}(\mathbf{x})\boldsymbol{a}_{i}(\mathbf{x})^{\mathsf{T}}\big{]}\mathbf{h}_{i}
2𝔼[(ρ(xi)μρi)𝒂i(x)𝖳]𝐡i\displaystyle\quad-2\mathbb{E}\big{[}(\rho(x_{i})-\mu_{\rho i})\boldsymbol{a}_{i}(x)^{\mathsf{T}}\big{]}\mathbf{h}_{i}
+𝔼[(ρ(xi)μρi)2].\displaystyle\quad+\mathbb{E}\big{[}(\rho(x_{i})-\mu_{\rho i})^{2}\big{]}.

Now, with

𝐑i\displaystyle\mathbf{R}_{i} =𝔼[𝒂i(𝐱)𝒂i(𝐱)𝖳]\displaystyle=\mathbb{E}\big{[}\boldsymbol{a}_{i}(\mathbf{x})\boldsymbol{a}_{i}(\mathbf{x})^{\mathsf{T}}\big{]}
=𝚲𝖳𝖽𝗂𝖺𝗀(𝐮i)𝐕𝖳𝐂x𝐕𝖽𝗂𝖺𝗀(𝐮i)𝚲,\displaystyle=\mathbf{\Lambda}^{\mathsf{T}}\mathsf{diag}(\mathbf{u}_{i})\mathbf{V}^{\mathsf{T}}\mathbf{C}_{x}\mathbf{V}\mathsf{diag}(\mathbf{u}_{i})\mathbf{\Lambda},
𝐩i\displaystyle\mathbf{p}_{i} =𝔼[(ρ(xi)μρi)𝒂i(x)]\displaystyle=\mathbb{E}\big{[}(\rho(x_{i})-\mu_{\rho i})\boldsymbol{a}_{i}(x)\big{]}
=𝚲𝖳𝖽𝗂𝖺𝗀(𝐮i)𝐕𝖳𝔼[(ρ(xi)μρi)(𝐱𝝁x)]\displaystyle=\mathbf{\Lambda}^{\mathsf{T}}\mathsf{diag}(\mathbf{u}_{i})\mathbf{V}^{\mathsf{T}}\mathbb{E}\big{[}(\rho(x_{i})-\mu_{\rho i})(\mathbf{x}-\boldsymbol{\mu}_{x})\big{]}

the ithi^{\text{th}} objective (53) becomes

𝔼[(y^iyi)2]=𝐡i𝖳𝐑i𝐡i2𝐩i𝖳𝐡i+𝔼[(ρ(xi)μρi)2].\mathbb{E}\big{[}(\hat{y}_{i}-y_{i})^{2}\big{]}=\mathbf{h}_{i}^{\mathsf{T}}\mathbf{R}_{i}\mathbf{h}_{i}-2\mathbf{p}_{i}^{\mathsf{T}}\mathbf{h}_{i}+\mathbb{E}\big{[}(\rho(x_{i})-\mu_{\rho i})^{2}\big{]}. (54)

Since 𝐑i\mathbf{R}_{i} is a positive semidefinite matrix, (54) is a convex quadratic function in 𝐡i\mathbf{h}_{i}. Therefore, setting the gradient of this function to zero, it can be concluded that 𝐡i\mathbf{h}_{i}^{\star} is a global minimizer of (54) if and only if 2𝐑i𝐡i2𝐩i=𝟎2\mathbf{R}_{i}\mathbf{h}_{i}^{\star}-2\mathbf{p}_{i}=\mathbf{0}, or, equivalently,

𝐑i𝐡i=𝐩i.\mathbf{R}_{i}\mathbf{h}_{i}^{\star}=\mathbf{p}_{i}. (55)

This completes the proof. ∎
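The solution (55) can be assembled directly from data. The following sketch is illustrative only: the moments are estimated by Monte Carlo, a ReLU stands in for the nonlinearity ρ, and all data is randomly generated. It builds R_i and p_i for each node, solves (55) row by row, and forms the unbiased estimator of Lemma 4.

```python
import numpy as np

# Illustrative design of the optimal unbiased NVGF (Proposition 5), with the
# first- and second-order moments estimated from random samples (assumed data).
rng = np.random.default_rng(2)
N, K, T = 8, 3, 5000
S = rng.standard_normal((N, N))
S = (S + S.T) / 2
lam, V = np.linalg.eigh(S)
Lambda = np.vander(lam, K + 1, increasing=True)       # [Lambda]_{ik} = lam_i^k
rho = lambda z: np.maximum(z, 0.0)                    # nonlinearity to be mimicked

X = rng.standard_normal((T, N))                       # samples of the input signal
mu_x, mu_rho = X.mean(axis=0), rho(X).mean(axis=0)

H_opt = np.zeros((N, K + 1))
for i in range(N):
    u_i = V[i, :]                                     # i-th row of V
    A = (X - mu_x) @ V @ np.diag(u_i) @ Lambda        # rows are a_i(x_t)^T
    R_i = A.T @ A / T                                 # estimate of E[a_i(x) a_i(x)^T]
    p_i = A.T @ (rho(X[:, i]) - mu_rho[i]) / T        # estimate of E[(rho(x_i)-mu) a_i(x)]
    H_opt[i] = np.linalg.lstsq(R_i, p_i, rcond=None)[0]   # solve (55); R_i may be singular

# Unbiased NVGF estimator (Lemma 4): yhat = H^nv(S)(x - mu_x) + mu_rho.
def nvgf(H, z):
    return sum(np.diag(H[:, k]) @ np.linalg.matrix_power(S, k) @ z
               for k in range(K + 1))

yhat = nvgf(H_opt, X[0] - mu_x) + mu_rho              # estimate of rho(X[0])
```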

Appendix D Authorship Attribution

Problem objective. Consider a set 𝜣a\boldsymbol{\mathit{\Theta}}_{a} of texts that are known to be written by author aa. Given a new text θ𝜣a\theta\notin\boldsymbol{\mathit{\Theta}}_{a}, the objective is to determine whether θ\theta was written by aa or not.

Approach. The approach is to leverage word adjacency networks (WAN) built from the set of known texts 𝜣a\boldsymbol{\mathit{\Theta}}_{a} to build a graph support 𝒢\mathcal{G}, and then use the word frequency count of the function words in θ\theta as the graph signal 𝐱\mathbf{x}. Then, 𝐱\mathbf{x} is processed through a GNN Φ\mathsf{\Phi} (in any of the variants discussed in Section 5) to obtain a prediction of whether the text was written by author aa.

Figure 3: Comparison between architectures for all authors. It is observed that, in most cases, the Learn NVGF exhibits significantly better performance than the GAT [26], which in turn is better than the GCN [24], and all of them are better than the SGC [25]. The error bars reflect one third of the estimated standard deviation.

Graph construction. The WAN can be modeled by a graph 𝒢\mathcal{G} where the set of nodes 𝒱\mathcal{V} consists of function words (i.e., words that do not carry semantic meaning but express grammatical relationships among other words within a sentence, e.g., “the”, “and”, “a”, “of”, “to”, “for”, “but”). The existence of an edge connecting words and the corresponding edge weight are determined as follows. Consider each text θ𝜣a\theta\in\boldsymbol{\mathit{\Theta}}_{a} and split it into a total of SS sentences {ωθs}s=1S\{\mathsf{\omega}_{\theta}^{s}\}_{s=1}^{S} where each sentence ωθs:𝒱{}\mathsf{\omega}_{\theta}^{s}:\mathbb{N}\to\mathcal{V}\cup\{\emptyset\} gives the function word present in each position within a sentence, or \emptyset if the word is not a function word. Then, given a discount factor α(0,1)\alpha\in(0,1) and a window length DD, the edge weight wijw_{ij} between nodes viv_{i} and vjv_{j} is computed as

wij=t:θt𝜣a\displaystyle w_{ij}=\sum_{t:\theta_{t}\in\boldsymbol{\mathit{\Theta}}_{a}} s,e𝟣{ωθts(e)=vi}\displaystyle\sum_{s,e}\mathsf{1}\{\mathsf{\omega}_{\theta_{t}}^{s}(e)=v_{i}\} (56)
d=1Dαd1𝟣{ωθts(e+d)=vj}\displaystyle\sum_{d=1}^{D}\alpha^{d-1}\mathsf{1}\{\mathsf{\omega}_{\theta_{t}}^{s}(e+d)=v_{j}\}

where 𝟣{𝒜}\mathsf{1}\{\mathcal{A}\} is the indicator function that takes value 11 when condition 𝒜\mathcal{A} is met and 0 otherwise; see [5]. Equation (56) essentially computes each weight by going text by text θt𝜣a\theta_{t}\in\boldsymbol{\mathit{\Theta}}_{a} and sentence by sentence s{1,…,S}s\in\{1,\ldots,S\}, looking position by position ee for the corresponding word ωθts(e)\mathsf{\omega}_{\theta_{t}}^{s}(e) to match the function word viv_{i}. Once the word viv_{i} is matched, the following DD words within the window are examined and, if the (e+d)th(e+d)^{\text{th}} word matches vjv_{j}, then the discounted weight αd1\alpha^{d-1} is added. In this way, not only the co-occurrence of words but also their proximity counts in establishing the WAN. Note that the edge weight function (56) is asymmetric, which results in a directed graph.
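A direct implementation of the weight rule (56) could look like the following sketch (illustrative only; the tokenization into sentences of function-word positions, the discount α, and the window length D are assumed inputs rather than values prescribed here).

```python
from collections import defaultdict

def wan_weights(texts, function_words, alpha=0.75, D=10):
    """Directed WAN weights per (56). texts: list of texts, each a list of
    sentences, each a list of tokens (non-function words may be None)."""
    w = defaultdict(float)                        # w[(vi, vj)] accumulates weights
    vocab = set(function_words)
    for sentences in texts:                       # text by text, theta_t
        for sentence in sentences:                # sentence by sentence, s
            for e, vi in enumerate(sentence):     # position by position, e
                if vi not in vocab:
                    continue
                for d in range(1, D + 1):         # window of D following words
                    if e + d >= len(sentence):
                        break
                    vj = sentence[e + d]
                    if vj in vocab:
                        w[(vi, vj)] += alpha ** (d - 1)   # discounted proximity
    return w
```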

Graph signal processing description. The support matrix is chosen to be

𝐒=12(𝐃1𝐖+𝐖𝖳𝐃1)\mathbf{S}=\frac{1}{2}(\mathbf{D}^{-1}\mathbf{W}+\mathbf{W}^{\mathsf{T}}\mathbf{D}^{-1}) (57)

where 𝐖\mathbf{W} is the adjacency matrix with entry (i,j)(i,j) equal to wijw_{ij} and 𝐃=𝖽𝗂𝖺𝗀(𝐖𝟏)\mathbf{D}=\mathsf{diag}(\mathbf{W}\mathbf{1}) is the degree matrix. The operation 𝐃1𝐖\mathbf{D}^{-1}\mathbf{W} normalizes the matrix by rows, and the support matrix 𝐒\mathbf{S} comes from symmetrizing the matrix by adopting one half of the weight in each direction. The graph signal 𝐱\mathbf{x} contains the normalized frequency count of each function word

[\mathbf{x}]_{i}=\frac{\sum_{s,e}\mathsf{1}\{\mathsf{\omega}_{\theta}^{s}(e)=v_{i}\}}{\sum_{j:v_{j}\in\mathcal{V}}\sum_{s,e}\mathsf{1}\{\mathsf{\omega}_{\theta}^{s}(e)=v_{j}\}}. (58)

Note that the graph signals are normalized by the 11-norm and can therefore be interpreted as the probability of finding the function word viv_{i} in text θ\theta.
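As a concrete reference, the following sketch implements the support matrix (57) and the normalized graph signal (58); it assumes that W is the (asymmetric) WAN adjacency matrix with positive row sums and that counts is the vector of raw function-word counts of a text.

```python
import numpy as np

# Sketch of the GSP setup for authorship attribution: the symmetrized,
# row-normalized support matrix (57) and the 1-norm-normalized signal (58).
def support_matrix(W):
    D_inv = np.diag(1.0 / W.sum(axis=1))      # D^{-1}, with D = diag(W 1)
    P = D_inv @ W                             # row-normalized WAN
    return 0.5 * (P + P.T)                    # S = (D^{-1} W + W^T D^{-1}) / 2

def graph_signal(counts):
    counts = np.asarray(counts, dtype=float)  # raw function-word counts of a text
    return counts / counts.sum()              # word probabilities, cf. (58)
```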

Dataset and code. The dataset is presented in [5], and is publicly available at http://github.com/alelab-upenn/graph-neural-networks/tree/master/datasets/authorshipData. The dataset consists of 2121 authors from the 19th\text{19}^{\text{th}} century. The corpus of each author is split in texts of about 10001000 words. The WANs and the word frequency count for each text are already present in the dataset. The texts are split, at random, in 95%95\% for training and 5%5\% for testing. The weights of the WANs of the 95%95\% texts selected for training are averaged to obtain an average WAN from which the support matrix 𝐒\mathbf{S} is obtained by following (57). The resulting support matrix is further normalized to have unit spectral norm. Note that no text from the test set is used in building the WAN. From the texts in the training set, 8%8\% are further separated to build the validation set. Denote by NatrainN_{a}^{\text{train}}, NavalidN_{a}^{\text{valid}}, and NatestN_{a}^{\text{test}} the number of texts in the training, validation, and test set, respectively. The word frequency counts for each text are normalized as in (58) and used as graph signals. A label of 11 is attached to these signals to indicate that they have been written by author aa in a supervised learning context. To complete the datasets, an equal number of texts are obtained at random from other contemporary authors, and their frequency word counts are normalized and incorporated into the corresponding sets and assigned a label of 0. In this way, the resulting training, validation, and test set have 2Natrain2N_{a}^{\text{train}}, 2Navalid2N_{a}^{\text{valid}}, and 2Natest2N_{a}^{\text{test}} samples, respectively (half of them labeled with 11 and the other half with 0). The code to run the simulations will be provided as a .zip file.

Figure 4: Relative change in performance with respect to the GCNN baseline, divided into 33 groups of similar behavior. Group 1 (53%53\% of the authors) includes those where the Learn NVGF has a comparable performance to the GCNN and both of them are better than the LSIGF, showing that frequency creation plays a vital role in improving performance. Group 2 (33%33\% of the authors) shows that oftentimes, the Learn NVGF can improve significantly over the GCNN, suggesting that the nonlinear nature of the mapping may have a negative impact. Group 3 (14%14\% of the authors) consists of those cases when the Learn NVGF, the GCNN, and the LSIGF all exhibit comparable performance.

Architectures for comparison. The Learn NVGF architecture in (18) is compared against three of the most popular GNN architectures in the literature, namely, the GCN [24], the SGC [25], and the GAT [26]. The Learn NVGF adopts the support matrix in (57) and consists of a LSI graph filter (a graph convolution) that outputs FNVGFF_{\text{NVGF}} features (hidden units) and has KNVGF+1K_{\text{NVGF}}+1 filter taps, followed by a NVGF that takes those FNVGFF_{\text{NVGF}} input features and applies a NVGF with KNVGF+1K_{\text{NVGF}}+1 filter taps, and also outputs FNVGFF_{\text{NVGF}} features. The GCN considers 𝐒\mathbf{S} in (57) to be the adjacency matrix of the graph and thus adopts a support matrix given by 𝐒GCN=𝐃~1/2(𝐈+𝐒)𝐃~1/2\mathbf{S}_{\text{GCN}}=\mathbf{\tilde{D}}^{-1/2}(\mathbf{I}+\mathbf{S})\mathbf{\tilde{D}}^{-1/2} where 𝐃~=𝖽𝗂𝖺𝗀((𝐈+𝐒)𝟏)\mathbf{\tilde{D}}=\mathsf{diag}((\mathbf{I}+\mathbf{S})\mathbf{1}) as indicated in [24]. The graph convolutional layer consists of FGCNF_{\text{GCN}} LSI graph filters, each one of the form hf𝐒GCNh^{f}\mathbf{S}_{\text{GCN}} for f{1,,FGCN}f\in\{1,\ldots,F_{\text{GCN}}\}. This is followed by a ReLU activation function. The SGC considers the same support matrix 𝐒GCN\mathbf{S}_{\text{GCN}} as the GCN and learns FSGCF_{\text{SGC}} output features, where each filter in the bank is of the form hf𝐒GCNKSGCh^{f}\mathbf{S}_{\text{GCN}}^{K_{\text{SGC}}} for a predetermined hyperparameter KSGCK_{\text{SGC}}. This is followed by a ReLU activation function. Finally, for the GAT, the support matrix is 𝐒\mathbf{S} (although this is relevant only in terms of the nonzero elements, which are the same in 𝐒\mathbf{S} and 𝐒GCN\mathbf{S}_{\text{GCN}}—except for the diagonal elements—since the weights of each edge are learned through the attention mechanism), and the output features are FGATF_{\text{GAT}} learned through the attention mechanism exactly as described in [26]. All four architectures are followed by a readout layer consisting of a learnable linear transform that maps the NFNF output features into a vector of size 22 that is interpreted to be the logits for the two classes (either the text is written by the author or not).

Architectures for analysis. To analyze the impact of the nonlinearity, four architectures are considered. First, as a baseline, a GCNN consisting of a graph convolutional layer that outputs FF features, with K+1K+1 filter taps, followed by a ReLU activation function, as indicated in (16). Second, the Learn NVGF that replaces the ReLU activation function by a NVGF with filter taps that can be learned from data as in (18). Third, the Design NVGF where the nonlinear activation function of the GCNN is replaced by a NVGF, but one whose filter taps are designed to mimic the ReLU (the LSI graph filters in the Design NVGF are the same ones learned by the GCNN). Fourth, a LSI graph filter with K+1K+1 taps that outputs FF features. Note that the GCNN is a nonlinear architecture, while the other three are linear. Also, note that the first three architectures are capable of creating frequency content while the fourth one, the LSIGF, is not. For fair comparison, the values of KK and FF are the same for all architectures.
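To fix ideas, a possible PyTorch implementation of the two building blocks involved is sketched below (this is an assumed re-implementation for illustration, not the code released with the paper): an LSI graph filter bank and an NVGF layer that replaces the pointwise activation, which can be composed to form the GCNN, Learn NVGF, and LSIGF variants described above.

```python
import torch
import torch.nn as nn

class LSIGraphFilter(nn.Module):
    """Graph convolution y = sum_k S^k x h_k, mapping F_in to F_out features."""
    def __init__(self, S, F_in, F_out, K):
        super().__init__()
        self.register_buffer("S", S)                       # N x N support (tensor)
        self.h = nn.Parameter(0.1 * torch.randn(K + 1, F_in, F_out))

    def forward(self, x):                                  # x: batch x N x F_in
        z, out = x, x @ self.h[0]
        for k in range(1, self.h.shape[0]):
            z = torch.einsum("ij,bjf->bif", self.S, z)     # S^k x, computed recursively
            out = out + z @ self.h[k]
        return out

class NVGF(nn.Module):
    """Node-variant graph filter y = sum_k diag(h^(k)) S^k x, applied per feature."""
    def __init__(self, S, N, F, K):
        super().__init__()
        self.register_buffer("S", S)
        self.H = nn.Parameter(0.1 * torch.randn(K + 1, N, F))  # taps per node and feature

    def forward(self, x):                                  # x: batch x N x F
        z, out = x, self.H[0] * x
        for k in range(1, self.H.shape[0]):
            z = torch.einsum("ij,bjf->bif", self.S, z)
            out = out + self.H[k] * z
        return out

# Example compositions (S, N, F, K are assumed to be defined):
# gcnn       = nn.Sequential(LSIGraphFilter(S, 1, F, K), nn.ReLU())
# learn_nvgf = nn.Sequential(LSIGraphFilter(S, 1, F, K), NVGF(S, N, F, K))
# lsigf      = LSIGraphFilter(S, 1, F, K)
```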

Training. The loss function during training is a cross-entropy loss between the logits obtained from the output of each architecture, and the labels in the training set. All the architectures are trained by using an ADAM optimizer [32] with forgetting factors 0.90.9 and 0.9990.999, and with a learning rate η\eta. The training is carried out for 2525 epochs with batches of size 2020. Dropout with probability 0.50.5 is included before the readout layer, during training, to avoid overfitting. At testing time, the weights are correspondingly rescaled. Validation is run every 55 training steps. The learned filters that result in the best performance on the validation set are kept and used during the testing phase. For each experiment, 1010 realizations of the random train/test split are carried out (also randomizing the selection of the texts by other authors that complete the training, validation, and test sets). The average evaluation performance (measured as the classification error, i.e., the ratio of texts wrongly attributed in the test set) is reported, together with the estimated standard deviation.
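The training protocol just described can be summarized in the following compressed sketch (hyperparameter values, tensor shapes, and the model and readout objects are assumptions for illustration; the dropout layer before the readout is omitted for brevity).

```python
import torch
import torch.nn as nn

def train(model, readout, train_set, valid_set, lr=0.005, epochs=25, batch=20):
    """Cross-entropy training with ADAM(0.9, 0.999); validate every 5 steps."""
    params = list(model.parameters()) + list(readout.parameters())
    opt = torch.optim.Adam(params, lr=lr, betas=(0.9, 0.999))
    loss_fn = nn.CrossEntropyLoss()
    X, y = train_set                           # X: samples x N x 1, y: 0/1 labels
    best_err, best_state, step = float("inf"), None, 0
    for _ in range(epochs):
        perm = torch.randperm(X.shape[0])
        for b in range(0, X.shape[0], batch):
            idx = perm[b:b + batch]
            logits = readout(model(X[idx]).flatten(1))     # N*F features -> 2 logits
            loss = loss_fn(logits, y[idx])
            opt.zero_grad(); loss.backward(); opt.step()
            step += 1
            if step % 5 == 0:                              # validation step
                with torch.no_grad():
                    Xv, yv = valid_set
                    pred = readout(model(Xv).flatten(1)).argmax(dim=1)
                    err = (pred != yv).float().mean().item()
                if err < best_err:                         # keep the best filters
                    best_err = err
                    best_state = {k: v.clone() for k, v in model.state_dict().items()}
    return best_state, best_err
```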

Hyperparameter selection. The number of hidden units FNVGFF_{\text{NVGF}}, FGCNF_{\text{GCN}}, FSGCF_{\text{SGC}}, and FGATF_{\text{GAT}}, the polynomial order KNVGFK_{\text{NVGF}} and KSGCK_{\text{SGC}}, and the learning rate ηNVGF\eta_{\text{NVGF}}, ηGCN\eta_{\text{GCN}}, ηSGC\eta_{\text{SGC}}, and ηGAT\eta_{\text{GAT}} are selected, independently for each architecture, from the set {16,32,64}\{16,32,64\} for the number of hidden units, {2,3,4}\{2,3,4\} for the polynomial order, and {0.001,0.005,0.01}\{0.001,0.005,0.01\} for the learning rate. In other words, all possible combinations of these three parameters are run for each architecture, and the ones that show the best performance on the test set are kept.

Table 1: Hyperparameters for each architecture that lead to the best performance
Learn NVGF GCN [24] SGC [25] GAT [26]
Authors ηNVGF\eta_{\text{NVGF}} FNVGFF_{\text{NVGF}} KNVGFK_{\text{NVGF}} ηGCN\eta_{\text{GCN}} FGCNF_{\text{GCN}} ηSGC\eta_{\text{SGC}} FSGCF_{\text{SGC}} KSGCK_{\text{SGC}} ηGAT\eta_{\text{GAT}} FGATF_{\text{GAT}}
Abbott 0.001 32 2 0.005 64 0.01 64 2 0.01 64
Alcott 0.01 16 4 0.01 32 0.01 32 2 0.01 32
Alger 0.005 16 4 0.01 32 0.001 64 2 0.005 64
Allen 0.005 64 2 0.01 64 0.005 32 2 0.005 64
Austen 0.001 32 3 0.01 64 0.005 64 2 0.01 64
Brontë 0.001 32 3 0.01 64 0.005 16 2 0.01 16
Cooper 0.005 16 3 0.005 64 0.005 64 2 0.01 64
Dickens 0.001 32 4 0.005 32 0.005 64 2 0.01 64
Doyle 0.005 16 3 0.005 16 0.005 64 2 0.005 32
Garland 0.005 32 3 0.01 64 0.005 64 2 0.01 32
Hawthorne 0.01 16 3 0.005 64 0.01 64 2 0.01 16
Irving 0.005 16 3 0.005 64 0.01 64 2 0.005 64
James 0.001 32 2 0.005 64 0.01 64 2 0.01 64
Jewett 0.001 16 2 0.01 64 0.01 64 2 0.01 32
Melville 0.005 16 4 0.01 64 0.005 64 2 0.005 64
Page 0.005 16 2 0.005 64 0.005 64 2 0.005 64
Poe 0.005 32 4 0.005 64 0.005 32 2 0.01 32
Stevenson 0.001 16 4 0.005 64 0.005 32 2 0.01 64
Thoreau 0.005 64 3 0.01 64 0.01 16 2 0.01 32
Twain 0.005 16 2 0.01 16 0.001 64 2 0.01 32
Wharton 0.005 16 4 0.01 64 0.01 64 2 0.001 64

Experiment 1: Performance comparison. In the first experiment, the performance is measured by classification error (ratio of texts in the test set that were wrongly attributed), and the comparison between the Learn NVGF, the GCN [24], the SGC [25], and the GAT [26] is carried out for all 2121 authors. The hyperparameters used for each architecture and each author are presented in Table 1 (recall that the hyperparameters that exhibit the best performance—the lowest classification error on the test set—are the ones used).

Results are shown in Figure 3. The general trend observed is for the Learn NVGF to exhibit better performance than the GAT, which in turn is better than the GCN, and all of them are better than SGC. The differences are usually significant between the four architectures, with a marked improvement by the Learn NVGF. It is observed, however, that for Doyle, Hawthorne, and Stevenson, the performance of the Learn NVGF and the GAT is comparable.

Experiment 2: Impact of nonlinearities. In the second experiment, the objective is to decouple the contribution made by frequency creation from that made by the nonlinear nature of the architecture. To do this, the GCNN architecture is taken as a baseline (a nonlinear, frequency-creating architecture), and the relative change in performance of the three other architectures is measured (Learn NVGF and Design NVGF, both linear and frequency-creating architectures, and LSIGF, which is linear but cannot create frequencies). The learning rate η\eta, the number of hidden units FF, and order of the filters KK are the same for all four architectures, and are those found in Table 1 under the column of Learn NVGF. The results showing relative change in the mean classification error with respect to the GCNN are shown in Figure 4.

The authors have been classified into three groups according to their relative behavior. Group 1 consists of those authors where the Learn NVGF has a comparable performance with respect to the GCNN (i.e., 5%5\% or less relative variation in performance), and both have a considerably better performance than the LSIGF. These results suggest that, for this group of authors (the most numerous one, consisting of 1111 authors, or 53%53\% of the total), the improvement in performance is mostly due to the capability of the architectures to create frequency, and not necessarily due to the nonlinear nature of the GCNN.

Group 2 consists of those authors where the Learn NVGF exhibits a better performance than the GCNN, which in turn exhibits a similar performance to the LSIGF (except for Jewett and Thoreau, where the GCNN still exhibits considerably better performance than the LSIGF). This group of 7 authors (33%33\% of the total) suggests that in some cases, frequency creation is the key contributor to improved performance, and that the inclusion of a nonlinear mapping could possibly have a negative impact. In essence, we observe that a linear architecture capable of creating frequencies outperforms the rest, and that a frequency-creating nonlinear architecture performs similarly to a linear architecture that cannot create frequencies. This suggests that the relationship between input and output is approximately linear with frequencies being created, but that attempting to model this frequency creation with a nonlinear architecture is not a good approach.

Finally, Group 3 consists of those authors for which all architectures exhibit a similar performance. This group consists of 33 authors (14%14\% of the total). This may be explained by the fact that the high-frequency content for these authors does not carry useful information, and thus the role of frequency creation is less relevant.

With respect to the Design NVGF, a somewhat erratic behavior is observed. In roughly half of the cases (1111 authors), the performance of the Design NVGF closely resembles the performance of the GCNN, as expected. In a few other cases (Alcott, James, Cooper, and Allen), the Design NVGF results are better, and in the rest they are considerably worse (Brontë, Dickens, Hawthorne, Page, Stevenson, and Melville). The Design NVGF architecture is designed to mimic the GCNN, but its accurate design depends on good estimates of the first and second moments of the data. Thus, one possible explanation is that there is not enough data to get good estimates of these values. Another possible explanation is that higher-order moments have a larger impact in these cases, and the NVGF, being linear, is not able to accurately capture them.

Overall, this second experiment shows the importance of frequency creation in improving performance, especially when high-frequency content is significant. Between the two ways of creating frequencies (linear and nonlinear), we see that in most cases they perform essentially the same. But there are cases in which creating frequency content in a linear manner is better (Group 2). In any case, this experiment shows the importance of frequency creation and calls for further research on what other contributions the nonlinear nature of the activation function makes to performance.

Appendix E Movie Recommendation

Figure 5: Performance comparison
Figure 6: Relative change

Problem objective. Consider a set 𝒱={v1,…,vN}\mathcal{V}=\{v_{1},\ldots,v_{N}\} of NN items, and let 𝗑~t:𝒱{}\mathsf{\tilde{x}}_{t}:\mathcal{V}\to\mathbb{R}\cup\{\emptyset\} be the ratings assigned by user tt to these items, i.e., 𝗑~t(vi)\mathsf{\tilde{x}}_{t}(v_{i}) is the rating assigned by user tt to item viv_{i}; the function yields \emptyset if the item has not been rated. The objective is to estimate what rating 𝗑~t(vr)\mathsf{\tilde{x}}_{t}(v_{r}) a user would give to some target item vr𝒱v_{r}\in\mathcal{V} not yet rated [3, 4].

Approach. The idea is to create a graph 𝒢\mathcal{G} of rating similarities and take the graph signal 𝐱\mathbf{x} to be the ratings given by a user to some of the items. Then 𝐱\mathbf{x} is processed through a GNN Φ\mathsf{\Phi} (in any of the variants discussed in Section 5) to obtain the estimated rating the user would give to the target item vrv_{r}, by looking at the output value of the GNN on node vrv_{r}. In short, this amounts to an interpolation problem.

Graph construction. Let 𝒯~={𝐱~t}t\mathcal{\tilde{T}}=\{\mathbf{\tilde{x}}_{t}\}_{t} be a set where 𝐱~tN\mathbf{\tilde{x}}_{t}\in\mathbb{R}^{N} collects the ratings given by user tt to some of the items, such that [𝐱~t]i=𝗑~t(vi)[\mathbf{\tilde{x}}_{t}]_{i}=\mathsf{\tilde{x}}_{t}(v_{i})\in\mathbb{R} if item viv_{i} has been rated and [𝐱~t]i=0[\mathbf{\tilde{x}}_{t}]_{i}=0 if 𝗑~t(vi)=\mathsf{\tilde{x}}_{t}(v_{i})=\emptyset. Denote by 𝒯~i={𝐱~t𝒯~:[𝐱~t]i>0}\mathcal{\tilde{T}}_{i}=\{\mathbf{\tilde{x}}_{t}\in\mathcal{\tilde{T}}:[\mathbf{\tilde{x}}_{t}]_{i}>0\} the set of users that have rated item viv_{i}, and by 𝒯~ij=𝒯~i𝒯~j\mathcal{\tilde{T}}_{ij}=\mathcal{\tilde{T}}_{i}\cap\mathcal{\tilde{T}}_{j} the set of users that have rated both items viv_{i} and vjv_{j}. Define the mean intersection score as μij=|𝒯~ij|1𝐱~t𝒯~ij[𝐱~t]i\mu_{ij}=|\mathcal{\tilde{T}}_{ij}|^{-1}\sum_{\mathbf{\tilde{x}}_{t}\in\mathcal{\tilde{T}}_{ij}}[\mathbf{\tilde{x}}_{t}]_{i}. Note that this is the rating average for item viv_{i}, computed among those users that have rated both viv_{i} and vjv_{j}. Then, the rating similarity between items viv_{i} and vjv_{j} is computed by means of the Pearson correlation as

wij=1|𝒯~ij|𝐱~t𝒯~ij([𝐱~t]iμij)([𝐱~t]jμij).w_{ij}=\frac{1}{|\mathcal{\tilde{T}}_{ij}|}\sum_{\mathbf{\tilde{x}}_{t}\in\mathcal{\tilde{T}}_{ij}}\big{(}[\mathbf{\tilde{x}}_{t}]_{i}-\mu_{ij}\big{)}\big{(}[\mathbf{\tilde{x}}_{t}]_{j}-\mu_{ij}\big{)}. (59)

These weights can be used to build a complete graph 𝒢~=(𝒱,~)\mathcal{\tilde{G}}=(\mathcal{V},\mathcal{\tilde{E}}) where 𝒱\mathcal{V} is the set of items and ~=𝒱×𝒱\mathcal{\tilde{E}}=\mathcal{V}\times\mathcal{V} is the complete set of edges with wijw_{ij} being the corresponding weights. In what follows, the 1010-nearest neighbor graph 𝒢=(𝒱,)\mathcal{G}=(\mathcal{V},\mathcal{E}) of 𝒢~\mathcal{\tilde{G}} is built with ~\mathcal{E}\subseteq\mathcal{\tilde{E}}, and the matrix 𝐖\mathbf{W} is used to denote the weighted adjacency matrix of 𝒢\mathcal{G}, such that [𝐖]ij=wij[\mathbf{W}]_{ij}=w_{ij} if (i,j)(i,j)\in\mathcal{E}, and [𝐖]ij=0[\mathbf{W}]_{ij}=0 otherwise.
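A direct (if unoptimized) computation of these similarity weights, together with the 10-nearest-neighbor sparsification, could look as follows. X is assumed to be a users-by-items rating matrix with 0 marking unrated items, and the exact sparsification rule is a sketch rather than the one used in the paper.

```python
import numpy as np

def similarity_weights(X):
    """Covariance-like similarity (59) over users that rated both items."""
    n_items = X.shape[1]
    W = np.zeros((n_items, n_items))
    rated = X > 0
    for i in range(n_items):
        for j in range(n_items):
            both = rated[:, i] & rated[:, j]          # users in T_ij
            if not both.any():
                continue
            mu_ij = X[both, i].mean()                 # mean rating of item i over T_ij
            W[i, j] = np.mean((X[both, i] - mu_ij) * (X[both, j] - mu_ij))
    return W

def knn_sparsify(W, k=10):
    """Keep the k strongest off-diagonal weights per item (diagonal kept for (60))."""
    Wk = np.zeros_like(W)
    np.fill_diagonal(Wk, np.diag(W))
    off = W - np.diag(np.diag(W))
    for i in range(W.shape[0]):
        keep = np.argsort(off[i])[-k:]
        Wk[i, keep] = off[i, keep]
    return Wk
```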

Graph signal processing description. The support matrix is chosen to be

𝐒=(𝖽𝗂𝖺𝗀(𝐖))1/2𝐖(𝖽𝗂𝖺𝗀(𝐖))1/2𝐈.\mathbf{S}=\big{(}\mathsf{diag}(\mathbf{W})\big{)}^{-1/2}\mathbf{W}\big{(}\mathsf{diag}(\mathbf{W})\big{)}^{-1/2}-\mathbf{I}. (60)

This problem can be cast as a supervised interpolation problem. Given the set 𝒯~\mathcal{\tilde{T}} and the target item vrv_{r}, consider the set 𝒯~r\mathcal{\tilde{T}}_{r} of all users that have rated the item vrv_{r}. Extract the specific rating 𝗑~t(vr)=yt\mathsf{\tilde{x}}_{t}(v_{r})=y_{t} as a label, and set a 0 in the rthr^{\text{th}} entry of 𝐱~t\mathbf{\tilde{x}}_{t}. The resulting vector is a graph signal 𝐱t\mathbf{x}_{t} which always has a 0 in the rthr^{\text{th}} entry. The resulting set 𝒯r={(𝐱t,yt):𝐱~t𝒯~r}\mathcal{T}_{r}=\{(\mathbf{x}_{t},y_{t}):\mathbf{\tilde{x}}_{t}\in\mathcal{\tilde{T}}_{r}\} contains all the users that have rated the item vrv_{r}, with the corresponding rating extracted as the label yty_{t} and the rthr^{\text{th}} entry [𝐱t]r[\mathbf{x}_{t}]_{r} of the graph signal 𝐱t\mathbf{x}_{t} set to zero, i.e., [𝐱t]r=0[\mathbf{x}_{t}]_{r}=0 for all tt such that 𝐱~t𝒯~r\mathbf{\tilde{x}}_{t}\in\mathcal{\tilde{T}}_{r}.
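The support matrix (60) and the supervised interpolation samples can then be assembled as in the following sketch (W is the sparsified similarity matrix from the previous sketch, assumed to have a strictly positive diagonal; r indexes the target item).

```python
import numpy as np

def support_matrix(W):
    """S = diag(W)^{-1/2} W diag(W)^{-1/2} - I, cf. (60)."""
    d = 1.0 / np.sqrt(np.diag(W))                     # assumes a positive diagonal
    return d[:, None] * W * d[None, :] - np.eye(W.shape[0])

def interpolation_samples(X, r):
    """Users that rated item r: the rating becomes the label, its entry is zeroed."""
    rated_r = X[:, r] > 0
    labels = X[rated_r, r].copy()
    inputs = X[rated_r].astype(float)
    inputs[:, r] = 0.0                                # [x_t]_r = 0 for every sample
    return inputs, labels
```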

Figure 7: Input frequencies
Figure 8: Output frequency of LSIGF

Dataset and code. The dataset is the MovieLens-100k dataset [33] publicly available at http://files.grouplens.org/datasets/movielens/ml-100k.zip. This dataset consists of one hundred thousand ratings given by 943943 users to 16821682 movies, where each user has rated at least 2020 of them. The ratings are integers ranging from 11 to 55, meaning that 𝗑~t:𝒱{1,…,5}\mathsf{\tilde{x}}_{t}:\mathcal{V}\to\{1,\ldots,5\}\cup\emptyset for every user tt. In particular, the subset of 250250 movies that have received the largest number of ratings is used to build a graph with N=250N=250 nodes. The resulting dataset has 5474654746 ratings given by 943943 users to some of these 250250 movies. The five movies with the largest number of ratings are considered as target movies, namely “Star Wars” with |𝒯~Star Wars|=583|\mathcal{\tilde{T}}_{\text{Star Wars}}|=583 pairs (𝐱t,yt)(\mathbf{x}_{t},y_{t}), “Contact” with 509509 pairs, “Fargo” with 508508, “Return of the Jedi” with 507507, and “Liar, Liar” with 485485. Each of these datasets is split randomly into 81%81\% for training, 9%9\% for validation, and 10%10\% for testing. The code to run the simulations will be provided as a .zip file.

Architectures for comparison. The Learn NVGF architecture in (18) is compared against two graph signal processing based methods for movie recommendation, namely the RMGCNN in [3] and the RMCayley in [4], which are implemented exactly as in the corresponding papers, with the same values for the hyperparameters. For the Learn NVGF, the values are FF hidden units and filters of order KK. The values used are (64,3)(64,3) for Star Wars and Contact, (16,4)(16,4) for Fargo, (32,4)(32,4) for Return of the Jedi, and (16,3)(16,3) for Liar, Liar. The support matrix is given by (60). All architectures are followed by a learnable local linear transformation (the same for all nodes, i.e., a LSI graph filter with K=0K=0) that takes the value of the FF hidden units and outputs a single scalar that represents the estimated rating for that movie. A comparison with the nearest neighbor method (i.e., averaging the ratings of the nearest nodes) is also included.

Architectures for analysis. The architectures for analyzing the role of frequency creation are the same four architectures as in the authorship attribution problem. Namely, the LSI graph filter is a linear architecture unable to create frequencies, the Design NVGF and the Learn NVGF are linear frequency-creating architectures, and the GCNN with a ReLU nonlinearity is a nonlinear frequency-creating architecture. The values of FF and KK in all cases are the same.

Training. The loss function during training is the “Smooth L1” loss between the output scalar at the target node and the labels in the training set. All architectures are trained by using an ADAM optimizer [32] with forgetting factors 0.90.9 and 0.9990.999, and a learning rate η\eta. The training is carried out for 4040 epochs with batches of size 55. Validation is run every 55 training steps. The learned filters that result in the best performance on the validation set are kept and used during the testing phase. For each experiment, 1010 realizations of the random dataset split are carried out. The average evaluation performance (measured by the RMSE) is reported, together with the estimated standard deviation.

Experiment 1: Performance comparison. In the first experiment, the performance is measured by the RMSE, and the comparison between the Learn NVGF, the RMGCNN [3], the RMCayley [4], and the Nearest Neighbor approach is carried out for the 55 aforementioned movies with the largest number of ratings. Results are shown in Figure 5. It is generally observed that the Learn NVGF performs better than the alternatives, although the performance is comparable to the RMGCNN and the RMCayley in the case of the movie Contact. The nearest neighbor approach yields worse performance.

Experiment 2: Impact of nonlinearities. In this second experiment, the objective is to decouple the contribution made by frequency creation from that made by the nonlinear nature of the architecture. To do this, the GCNN architecture is taken as a baseline (a nonlinear, frequency-creating architecture), and the relative change in performance of the three other architectures is measured. The results in Figure 6 show that the relative change is quite small (approximately 1.5%1.5\% in the largest case, the movies Fargo and Liar, Liar), which implies that all four architectures have relatively similar performance. This can be explained by computing the average frequency content of the signals in the test set of the movie Star Wars. The result is shown in Figure 7. It is observed that the frequency content is concentrated at low eigenvalues. Therefore, as expected, there is not much to gain from using architectures that create frequencies. In Figures 8, 9, and 10, the frequency content of the output of each architecture is shown when the input is the eigenvector associated with the largest eigenvalue, i.e., 𝐱=𝐯N\mathbf{x}=\mathbf{v}_{N}. As expected, the LSIGF (Figure 8) does not create frequency content, while the other two architectures do (Figures 9 and 10). However, since this high-eigenvalue frequency content is not really significant, the frequency creation capabilities do not markedly improve the performance.

Figure 9: Output frequency of GCNN
Figure 10: Output frequency of Learn NVGF