Revisiting Generalized p-Laplacian Regularized Framelet GCNs: Convergence, Energy Dynamic and Training with Non-Linear Diffusion

Dai Shi,¹¹1Western Sydney Univeristy, 20195423@student.westernsydney.edu.au, y.guo@westernsydney.edu.au Zhiqi Shao²²2University of Sydney, zsha2911@uni.sydney.edu.au, junbin.gao@sydney.edu.au
Dai Shi and Zhiqi Shao are with equal contribution., Yi Guo, Qibin Zhao³³3AIP, RIKEN, qibin.zhao@riken.jp, Junbin Gao

Abstract

This paper presents a comprehensive theoretical analysis of the graph p-Laplacian regularized framelet network (pL-UFG) to establish a solid understanding of its properties. We conduct a convergence analysis on pL-UFG, addressing the gap in the understanding of its asymptotic behaviors. Further by investigating the generalized Dirichlet energy of pL-UFG, we demonstrate that the Dirichlet energy remains non-zero throughout convergence, ensuring the avoidance of over-smoothing issues. Additionally, we elucidate the energy dynamic perspective, highlighting the synergistic relationship between the implicit layer in pL-UFG and graph framelets. This synergy enhances the model’s adaptability to both homophilic and heterophilic data. Notably, we reveal that pL-UFG can be interpreted as a generalized non-linear diffusion process, thereby bridging the gap between pL-UFG and differential equations on the graph. Importantly, these multifaceted analyses lead to unified conclusions that offer novel insights for understanding and implementing pL-UFG, as well as other graph neural network (GNN) models. Finally, based on our dynamic analysis, we propose two novel pL-UFG models with manually controlled energy dynamics. We demonstrate empirically and theoretically that our proposed models not only inherit the advantages of pL-UFG but also significantly reduce computational costs for training on large-scale graph datasets.

1 Introduction

Graph neural networks (GNNs) have emerged as a popular tool for the representation learning on the graph-structured data [35]. To enhance the learning power of GNNs, many attempts have been made by considering the propagation of GNNs via different aspects such as optimization [45, 32], statistical test [36] and gradient flow [2, 10]. In particular, treating GNNs propagation as an optimization manner allows one to assign different types of regularizers on the GNNs’ output so that the variation of the node features, usually measured by so-called Dirichlet energy, can be properly constrained [45, 6]. The underlying reason for this regularization operation is due to the recently identified computational issue of GNNs on different types of graphs, namely homophily and heterophily [39]. With the former most of the nodes are connected with those nodes with identical labels, and the latter is not [24]. Accordingly, an ideal GNN shall be able to produce a rather smoother node features for homophily graph and more distinguishable node features when the input graph is heterophilic [24, 1].

Based on the above statement, a proper design of the regularizer that is flexible to let GNN fit both two types of the graph naturally becomes the next challenge. A recent research [13] proposed new energy based regularizer, namely p-Laplacian based regularizer to the optimization of GNN and resulted in an iterative algorithm to approximate the so-called implicit layer induced from the solution of the regularization. To engage a more flexible design of p-Laplacian GNN in [13], [25] further proposed p-Laplacian based graph framelet GNN (pL-UFG) to assign the p-Laplacian based regularization act on multiscale GNNs (e.g., graph framelet). While remarkable learning accuracy has been observed empirically, the underlying properties of the models proposed in [25] are still unclear. In this paper, our primary focus is on pL-UFG (see Section 2 for the formulation). Our objective is to analyze pL-UFG from various perspectives, including convergence of its implicit layer, model’s asymptotic energy behavior, changes of model’s dynamics due to the implicit layer, and relationship with existing diffusion models. To the best of our knowledge, these aspects have not been thoroughly explored in the context of p-Laplacian based GNNs, leaving notable knowledge gaps. Accordingly, we summarize our contribution as follows:

•

We rigorously prove the convergence of pL-UFG, providing insights into the asymptotic behavior of the model. This analysis addresses a crucial gap in the understanding of GNN models regularized with p-Laplacian based energy regularizer.
•

We show that by assigning the proper values of two key model parameters (denoted as $\mu$ and $p$ ) of pL-UFG based on our theoretical analysis, the (generalized) Dirichlet energy of the node feature produced from pL-UFG will never converge to 0; thus the inclusion of the implicit layer will prevent the model (graph framelet) from potential over-smoothing issue.
•

We demonstrate how the implicit layer in pL-UFG interacts with the energy dynamics of the graph framelet. Furthermore, we prove that pL-UFG can adapt to both homophily and heterophily graphs, enhancing its versatility and applicability.
•

We establish that the propagation mechanism within pL-UFG enables a generalized non-linear graph diffusion. The conclusions based on our analysis from different perspectives are unified at the end of the paper, suggesting a promising framework for evaluating other GNNs.
•

Based on our theoretical results, we propose two generalized pL-UFG models with controlled model dynamics, namely pL-UFG low-frequency dominant model (pL-UFG-LFD) and pL-UFG high frequency dominant model (pL-UFG-HFD). we further show that with controllable model dynamics, the computational cost of pL-UFG is largely reduced, making our proposed model capable of handling large-scale graph datasets.
•

We conduct extensive experiments to validate our theoretical claims. The empirical results not only confirm pL-UFG’s capability to handle both homophily and heterophily graphs but also demonstrate that our proposed models achieve comparable or superior classification accuracy with reduced computational cost. These findings are consistent across commonly tested and large-scale graph datasets.

The remaining sections of this paper are structured as follows. Section 2 presents fundamental notations related to graphs, GNN models, graph framelets and pL-UFG. In Section 3, we conduct a theoretical analysis on pL-UFG, focusing on the aforementioned aspects. Specifically, Section 3.1 presents the convergence analysis, while Section 3.2 examines the behavior of the p-Laplacian based implicit layer through a generalized Dirichlet energy analysis. Furthermore, Section 3.3 demystifies the interaction between the implicit layer and graph framelets from an energy dynamic perspective. We provide our proposed models (pL-UFG-LFD and pL-UFG-HFD) in section 3.4. Lastly, in Section 3.5, we demonstrate that the iterative algorithm derived from the implicit layer is equivalent to a generalized non-linear diffusion process on the graph. Additionally, in Section 4 we further verify our theoretical claims by comprehensive numerical experiments. Lastly, in conclusion 5, we summarize the findings of this paper and provide suggestions for future research directions.

2 Preliminaries

In this section, we provide necessary notations and formulations utilized in this paper. We list the necessary notations with their meanings in the Table 1 below, although we will mention the meaning of them again when we first use them.

Table 1: Necessary notations

Notations	Brief Interpretation
$\mathcal{H}(\mathcal{G})$	Heterophily index of a given graph $\mathcal{G}$
$\mathbf{X}$	Initial node feature matrix
$\mathbf{F}^{(k)}$	Feature representation on $k$ -th layer of GNN model
$\mathbf{f}_{i}$	Individual row of $\mathbf{F}$
$\mathbf{F}_{i,:}$	One or more operation acts on each row of $\mathbf{F}$
$\mathbf{D}$	Graph degree matrix
$\widehat{\mathbf{A}}$	Normalized adjacency matrix
$\widetilde{\mathbf{L}}$	Normalized Laplacian matrix
$\mathbf{W}$	Graph weight matrix
$\mathcal{W}$	Framelet decomposition matrix
$\mathcal{I}$	Index set of all framelet decomposition matrices.
$\widehat{\mathbf{W}}$	Learnable weight matrix in GNN models
$\widetilde{\mathbf{W}}$ , $\mathbf{\Omega}$ , $\widehat{\mathbf{W}}$	Learnable weight matrices in defining generalized Dirichlet energy
$\mathbf{Y}$	Feature propagation result for the pL-UFG defined in [25].
$\theta$	N-dimensional vector for diagonal scaling ( $\mathrm{diag}(\theta)$ ) in framelet models.
$\mathbf{E}^{PF}(\mathbf{F})$	Generalized Dirichlet energy for node feature induced from implicit layer
$\mathbf{E}^{Fr}(\mathbf{F})$	Generalized framelet Dirichlet energy
$\mathbf{E}^{total}(\mathbf{F})$	Total generalized Dirichlet energy
$\{\lambda_{i},\mathbf{u}_{i}\}_{i=1}^{N}$	Eigen-pairs of $\widetilde{\mathbf{L}}$

We also provide essential background information on the developmental history before the formulation of certain models, serving as a concise introduction to the related works.

Graph, Graph Convolution and Graph Consistency

We denote a weighted graph as $\mathcal{G}=(\mathcal{V},\mathcal{E},\mathbf{W})$ with nodes set $\mathcal{V}=\{v_{1},v_{2},\cdots,v_{N}\}$ of total $N$ nodes, edge set $\mathcal{E}\subseteq\mathcal{V}\times\mathcal{V}$ and graph adjacency matrix $\mathbf{W}$ , where $\mathbf{W}=[w_{i,j}]\in\mathbb{R}^{N\times N}$ and $w_{i,j}=1$ if $(v_{i},v_{j})\in\mathcal{E}$ , else, $w_{i,j}=0$ . The nodes feature matrix is ${\bf X}\in\mathbb{R}^{N\times c}$ for $\mathcal{G}$ with each row $\mathbf{x}_{i}\in\mathbb{R}^{c}$ as the feature vector associated with node $v_{i}$ . For a matrix $\mathbf{A}$ , we denote its transpose as $\mathbf{A}^{\top}$ , and we use $[N]$ for set $\{1,2,\dots,N\}$ . Throughout this paper, we will only focus on the undirect graph and use matrix $\mathbf{A}$ and $\mathbf{W}$ interchangeably for graph adjacency matrix⁴⁴4We initially set $\mathbf{W}$ as the graph adjacency matrix while $\mathbf{W}$ is a generic edge weight matrix in align with the notations used in [14, 25]. The normalized graph Laplacian is defined as $\widetilde{\mathbf{L}}=\mathbf{I}-\mathbf{D}^{-\frac{1}{2}}(\mathbf{W}+\mathbf{I})\mathbf{D}^{-\frac{1}{2}}$ , where $\mathbf{D}=\mathrm{diag}(d_{1,1},\dots,d_{N,N})$ is a diagonal degree matrix with $d_{i,i}=\sum^{N}_{j=1}w_{i,j}$ for $i=1,\dots,N$ , and $\mathbf{I}$ is the identity matrix. From the spectral graph theory [9], we have $\widetilde{\mathbf{L}}\succeq 0$ , i.e. $\widetilde{\mathbf{L}}$ is a positive semi-definite (SPD) matrix. Let $\{\lambda_{i}\}^{N}_{i=1}$ in decreasing order be all the eigenvalues of $\widetilde{\mathbf{L}}$ , also known as graph spectra, and $\lambda_{i}\in[0,2]$ . For any given graph, we let $\rho_{\widetilde{\mathbf{L}}}$ be the largest eigenvalue of $\widetilde{\mathbf{L}}$ . Lastly, for any vector $\mathbf{x}=[x_{1},...,x_{c}]\in\mathbb{R}^{c}$ , $\|\mathbf{x}\|_{2}=(\sum^{c}_{i=1}x^{2}_{i})^{\frac{1}{2}}$ is the L₂-norm of $\mathbf{x}$ , and similarly, for any matrix $\mathbf{M}=[m_{i,j}]$ , denote by $\|\mathbf{M}\|:=\|\mathbf{M}\|_{F}=(\sum_{i,j}m^{2}_{i,j})^{\frac{1}{2}}$ the matrix Frobenius norm.

Graph convolution network (GCN) [21] produces a layer-wise (node feature) propagation rule based on the information from the normalized adjacency matrix as:

\displaystyle\mathbf{F}^{(k+1)}=\sigma\big{(}\widehat{\mathbf{A}}\mathbf{F}^{(k)}\widehat{\mathbf{W}}^{(k)}\big{)},

(1)

where $\mathbf{F}^{(k)}$ is the embedded node feature, $\widehat{\mathbf{W}}^{(k)}$ the weight matrix for channel mixing [3], and $\sigma$ any activation function such as sigmoid. The superscript ^(k) indicates the quantity associated with layer $k$ , and $\mathbf{F}^{(0)}=\mathbf{X}$ . We write $\widehat{\mathbf{A}}=\mathbf{D}^{-\frac{1}{2}}(\mathbf{W}+\mathbf{I})\mathbf{D}^{-\frac{1}{2}}$ , the normalized adjacency matrix of $\mathcal{G}$ . It is easy to see that the operation conducted in GCN before activation can be interpreted as a localized filter by the graph Fourier transform, i.e., $\mathbf{F}^{(k+1)}=\mathbf{U}(\mathbf{I}_{n}-\mathbf{\Lambda})\mathbf{U}^{\top}\mathbf{F}^{(k)}$ , where $\mathbf{U},\mathbf{\Lambda}$ are from the eigendecomposition $\widetilde{\mathbf{L}}=\mathbf{U}\mathbf{\Lambda}\mathbf{U}^{\top}$ . In fact, $\mathbf{U}\mathbf{F}$ is known as the Fourier transform of graph signals in $\mathbf{F}$ .

Over the development of GNNs, most of GNNs are designed under the homophily assumption in which connected (neighbouring) nodes are more likely to share the same label. The recent work by [44] identifies that the general topology GNN fails to obtain outstanding results on the graphs with different class labels and dissimilar features in their connected nodes, such as the so-call heterophilic graphs. The definition of homophilic and heterophilic graphs are given by:

Definition 1 (Homophily and Heterophily [14]).

The homophily or heterophily of a network is used to define the relationship between labels of connected nodes. The level of homophily of a graph can be measured by $\mathcal{H(G)}=\mathbb{E}_{v_{i}\in\mathcal{V}}[|\{v_{j}\}_{j\in\mathcal{N}_{i,y_{i}=y_{i}}}|/|\mathcal{N}_{i}|]$ , where $|\{v_{j}\}_{j\in\mathcal{N}_{i,y_{i}=y_{i}}}|$ denotes the number of neighbours of $v_{i}\in\mathcal{V}$ that share the same label as $v_{i}$ , i.e. $y_{i}=y_{j}$ . $\mathcal{H(G)}\rightarrow 1$ corresponds to strong homophily while $\mathcal{H(G)}\rightarrow 0$ indicates strong heterophily. We say that a graph is a homophilic (heterophilic) graph if it has strong homophily (heterophily).

Graph Framelet.

As the main target for this paper to explore is pL-UFG defined in [25] in which p-Laplacian based implicit layer is combined with so-called graph framelet or framelets in short. Framelets are a type of wavelet frames arising from signal processing which can be extended for analysing graph signals. The first wavelet frame with a lifting scheme for graph analysis was presented in [27]. As computational power increased, [18] proposed a framework for wavelet transformation on graphs using Chebyshev polynomials for approximations. Later, [11] developed tight framelets on graphs by approximating smooth functions with filtered Chebyshev polynomials.

Framelets have been applied to graph learning tasks with outstanding results, as demonstrated in [40]. They are capable of decomposing graph signals and re-aggregating them effectively, as shown in the study on graph noise reduction by [42] Combining framelets with singular value decomposition (SVD) has also made them applicable to directed graphs [46]. Recently, [38] suggested a simple method for building more versatile and stable framelet families, known as Quasi-Framelets. In this study, we will introduce graph framelets using the same architecture described in [38]. To begin, we define the filtering functions for Quasi-framelets.

Definition 2.

A set of $R+1$ positive functions $\mathcal{F}=\{g_{0}(\xi),g_{1}(\xi),...,g_{R}(\xi)\}$ defined on the interval $[0,\pi]$ is considered as (a set of) Quasi-Framelet scaling functions, if these functions adhere to the following identity condition:

\displaystyle g_{0}(\xi)^{2}+g_{1}(\xi)^{2}+\cdots+g_{R}(\xi)^{2}\equiv 1,\;\;\;\forall\xi\in[0,\pi].

(2)

The identity condition (2) ensures a perfect reconstruction of a signal from its spectral space to the spatial space, see [38] for a proof. Particularly we are interested in the scaling function set in which $g_{0}$ descents from 1 to 0, i.e., $g_{0}(0)=1$ and $g_{0}(\pi)=0$ and $g_{R}$ ascends from 0 to 1, i.e., $g_{R}(0)=0$ and $g_{R}(\pi)=1$ . The purpose of setting these conditions is for $g_{0}$ to regulate the highest frequency and for $g_{R}$ to control the lowest frequency, while the remaining functions govern the frequencies lying between them.

With a given set of framelet scaling functions, the so-called Quasi-Framelet signal transformation can be defined by the following transformation matrices:

$\displaystyle\mathcal{W}_{0,J}$	$\displaystyle=\mathbf{U}g_{0}(\frac{\boldsymbol{\Lambda}}{2^{m+J}})\cdots g_{0}(\frac{\boldsymbol{\Lambda}}{2^{m}})\mathbf{U}^{\top},$	(3)
$\displaystyle\mathcal{W}_{r,0}$	$\displaystyle=\mathbf{U}g_{r}(\frac{\boldsymbol{\Lambda}}{2^{m}})\mathbf{U}^{\top},\;\;\text{for }r=1,...,R,$	(4)
$\displaystyle\mathcal{W}_{r,\ell}$	$\displaystyle=\mathbf{U}g_{r}(\frac{\boldsymbol{\Lambda}}{2^{m+\ell}})g_{0}(\frac{\boldsymbol{\Lambda}}{2^{m+\ell-1}})\cdots g_{0}(\frac{\boldsymbol{\Lambda}}{2^{m}})\mathbf{U}^{\top},$	(5)
	$\displaystyle\text{for }r=1,...,R,\ell=1,...,J,$

where $\mathcal{F}$ is a given set of Quasi-Framelet functions satisfying (2) and $J\geq 0$ is a given level on a graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$ with normalized graph Laplacian $\widetilde{\mathbf{L}}=\mathbf{U}^{\top}\mathbf{\Lambda}\mathbf{U}$ . $\mathcal{W}_{0,J}$ is defined as the product of $J+1$ Quasi-Framelet scaling functions $g_{0}$ applied to the Laplacian spectra $\Lambda$ at different scales. defined as $g_{r}(\frac{\boldsymbol{\Lambda}}{2^{m}})$ applied to spectra $\Lambda$ , where $m$ is the coarsest scale level which is the smallest value satisfying $2^{-m}\lambda_{n}\leq\pi$ . For $1\leq r\leq R$ and $1\leq\ell\leq J$ , $\mathcal{W}_{r,\ell}$ is defined as the product of $J-\ell+1$ Quasi-Framelet scaling functions $g_{0}$ and $\ell$ Quasi-Framelet scaling functions $g_{r}$ applied to spectra $\Lambda$ .

Let $\mathcal{W}=[\mathcal{W}_{0,J}$ ; $\mathcal{W}_{1,0}$ ; …; $\mathcal{W}_{R,0}]$ be the stacked matrix. It can be proven that $\mathcal{W}^{T}\mathcal{W}=\mathbf{I}$ , see [38], which provides a signal decomposition and reconstruction process based on $\mathcal{W}$ . This is referred to as the graph Quasi-Framelet transformation.

Since the computation of the Quasi-framelet transformation matrices requires the eigendecomposition of graph Laplacian, to reduce the computational cost, Chebyshev polynomials are used to approximate the Quasi-Framelet transformation matrices. The approximated transformation matrices are defined by replacing $g_{r}(\xi)$ in (3)-(5) with Chebyshev polynomials $\mathcal{T}_{r}(\xi)$ of a fixed degree, which is typically set to 3. The Quasi-Framelet transformation matrices defined in (3) - (5) can be approximated by,

$\displaystyle\mathcal{W}_{0,J}$	$\displaystyle\approx\mathcal{T}_{0}(\frac{1}{2^{m+J}}\widetilde{\mathbf{L}})\cdots\mathcal{T}_{0}(\frac{1}{2^{m}}\widetilde{\mathbf{L}}),$	(6)
$\displaystyle\mathcal{W}_{r,0}$	$\displaystyle\approx\mathcal{T}_{r}(\frac{1}{2^{m}}\widetilde{\mathbf{L}}),\;\;\;\text{for }r=1,...,R,$	(7)
$\displaystyle\mathcal{W}_{r,\ell}$	$\displaystyle\approx\mathcal{T}_{r}(\frac{1}{2^{m+\ell}}\widetilde{\mathbf{L}})\mathcal{T}_{0}(\frac{1}{2^{m+\ell-1}}\widetilde{\mathbf{L}})\cdots\mathcal{T}_{0}(\frac{1}{2^{m}}\widetilde{\mathbf{L}}),$	(8)
	$\displaystyle\;\;\;\;\text{for }r=1,...,R,\ell=1,...,J.$

Based on the approximated Quasi-Framelet transformation defined above, two types of graph framelet convolutions have been developed recently:

The Spectral Framelet Models [40, 41, 38, 26]:

\displaystyle\mathbf{F}^{(k+1)}

\displaystyle=\sigma\left(\mathcal{W}^{\top}\mathrm{diag}(\theta)\mathcal{W}\mathbf{F}^{(k)}\right):=\sigma\left(\sum_{(r,\ell)\in\mathcal{I}}\mathcal{W}_{r,\ell}^{\top}{\rm diag}(\mathbf{\theta}_{r,\ell})\mathcal{W}_{r,\ell}\mathbf{F}^{(k)}\widehat{\mathbf{W}}^{(k)}\right),

(9)

where $\theta_{r,\ell}\in\mathbb{R}^{N}$ , $\widehat{\mathbf{W}}^{(k)}$ are learnable matrices for channel/feature mixing, and $\mathcal{I}=\{(r,j):r=1,...,R,\ell=0,1,...,J\}\cup\{(0,J)\}$ is the index set for all framelet decomposition matrices.

The Spatial Framelet Models [6]:

\displaystyle\mathbf{F}^{(k+1)}

\displaystyle=\sigma\left(\mathcal{W}^{\top}_{0,J}\widehat{\mathbf{A}}\mathcal{W}_{0,J}\mathbf{F}^{(k)}\widehat{\mathbf{W}}^{(k)}_{0,J}+\sum_{r,\ell}\mathcal{W}_{r,\ell}^{\top}\widehat{\mathbf{A}}\mathcal{W}_{r,\ell}\mathbf{F}^{(k)}\widehat{\mathbf{W}}^{(k)}_{r,\ell}\right).

(10)

The spectral framelet models conduct framelet decomposition and reconstruction on the spectral domain of the graph. Clearly $\theta_{r,\ell}\in\mathbb{R}^{N}$ can be interpreted as the frequency filters, given that the framelet system provides a perfect reconstruction on the input graph signal (i.e., $\mathcal{W}^{\top}\mathcal{W}=\mathbf{I}$ ). Instead of frequency domain filtering, the spatial framelet models implement the framelet-based propagation via spatial (graph adjacency) domain.

There is a major difference between two schemes. In the spectral framelet methods, the weight matrix $\widehat{\mathbf{W}}^{(k)}$ is shared across different (filtered) frequency domains, while in the spatial framelet methods, an individual weight matrix $\widehat{\mathbf{W}}^{(k)}_{r,\ell}$ is applied to each (filtered) spatial domain to produce the graph convolution.

Finally, it is worth to noting that applying framelet/quasi-framelet transforms on graph signals can decomposes graph signals on different frequency domains for processing, e.g., the filtering used in the spectral framelet models and the spatial aggregating used in the spatial framelet models, thus the perfect reconstruction property guarantees less information loss in the signal processing pipeline. The learning advantage of graph framelet models has been proved via both theoretical and empirical studies [19, 40, 6].

Generalized p-Laplacian Regularized Framelet GCN.

In this part, we provide several additional definitions to formulate the model (pL-UFG) that we are interested in analyzing.

Definition 3 (The $p$ -Laplace Operator [12]).

Let $\Omega\subset\mathbb{R}^{d}$ be a domain and $u$ is a function defined on $\Omega$ . The $p$ -Laplace operator $\Delta$ over functions is defined as

\Delta u:=\nabla\cdot(\|\nabla u\|^{p-2}\nabla u)

where $\nabla$ is the gradient operator and $\|\cdot\|$ is the Euclidean norm and $p$ is a scalar satisfying $1<p<+\infty$ . The $p$ -Laplace operator, is known as a quasi-linear elliptic partial differential operator.

There are a line of research on the properties of $p$ -Laplacian in regarding to its uniqueness and existence [15], geometrical property [20] and boundary conditions on so-called p-Laplacian equation [29].

The concept of $p$ -Laplace operator can be extended for discrete domains such as graph (nodes) based on the concepts of the so-called graph gradient and divergence, see below, one of the recent works [14] considers assigning an adjustable $p$ -Laplacian regularizer to the (discrete) graph regularization problem that is conventionally treated as a way of producing GNN outcomes (i.e., Laplacian smoothing) [43]. In view of the fact that the classic graph Laplacian regularizer measures the graph signal energy along edges under $L_{2}$ metric, it would be beneficial if GNN training process can be regularized under $L_{p}$ metric in order to adapt to different graph inputs. Following these pioneer works, [25] further integrated graph framelet and a generalized $p$ -Laplacian regularizer to develop the so-called generalized $p$ -Laplacian regularized framelet model. It involves a regularization problem over the energy quadratic form induced from the graph $p$ -Laplacian. To show this, we start by defining graph gradient as follows:

To introduce graph gradient and divergence, we define the following notation:

Given a graph $\mathcal{G}=(\mathcal{V},\mathcal{E},\mathbf{W})$ , let $\mathcal{F_{V}}:=\{\mathbf{F}|\mathbf{F}:\mathcal{V}\rightarrow\mathbb{R}^{d}\}$ be the space of the vector-valued functions defined on $\mathcal{V}$ and $\mathcal{F_{E}}:=\{\mathbf{g}|\mathbf{g}:\mathcal{E}\rightarrow\mathbb{R}^{d}\}$ be the vector-valued function space on edges, respectively.

Definition 4 (Graph Gradient [43]).

For a given function $\mathbf{F}\in\mathcal{F_{V}}$ , its graph gradient is an operator $\nabla_{W}$ : $\mathcal{F_{V}}\rightarrow\mathcal{F_{E}}$ defined as for all $(v_{i},v_{j})\in\mathcal{E}$ ,

\displaystyle(\nabla_{W}\mathbf{F})([i,j]):=\sqrt{\frac{w_{i,j}}{d_{j,j}}}\mathbf{f}_{j}-\sqrt{\frac{w_{i,j}}{d_{i,i}}}\mathbf{f}_{i},

(11)

where $\mathbf{f}_{i}$ and $\mathbf{f}_{j}$ are the signal vectors on nodes $v_{i}$ and $v_{j}$ , i.e., the rows of $\mathbf{F}$ .

For simplicity, we denote $\nabla_{W}\mathbf{F}$ as $\nabla\mathbf{F}$ as the graph gradient. The definition of (discrete) graph gradient is analogized from the notion of gradient from the continuous space. Similarly, we can further define the so-called graph divergence:

Definition 5 (Graph Divergence [43]).

The graph divergence is an operator $\text{div}:\mathcal{F}_{\mathcal{E}}\rightarrow\mathcal{F}_{\mathcal{V}}$ which is defined by the following way. For a given function $\mathbf{g}\in\mathcal{F}_{\mathcal{E}}$ , the resulting $\text{div}(\mathbf{g})\in\mathcal{F}_{\mathcal{V}}$ satisfies the following condition, for any functions $\mathbf{F}\in\mathcal{F}_{\mathcal{V}}$ ,

\displaystyle\langle\nabla\mathbf{F},\mathbf{g}\rangle=\langle\mathbf{F},-\text{div}(\mathbf{g})\rangle.

(12)

It is easy to check that the graph divergence can be computed by:

\displaystyle\text{div}(\mathbf{g})(i)=\sum_{j=1}^{N}\sqrt{\frac{w_{i,j}}{d_{i,i}}}(\mathbf{g}[i,j]-\mathbf{g}[j,i]).

(13)

With the formulation of graph gradient and divergence we are ready to define the graph p-Laplacian operator and the corresponding p-Dirichlet form [43, 14] that serves as the regularizer in the model developed in [25]. The graph p-Laplacian can be defined as follows:

Definition 6 (Graph $p$ -Laplacian).

Given a graph $\mathcal{G}=(\mathcal{V},\mathcal{E},\mathbf{W})$ and a multiple channel signal function $\mathbf{F}:\mathcal{V}\rightarrow\mathbb{R}^{d}$ , the graph $p$ -Laplacian is an operator $\Delta_{p}:\mathcal{F}_{\mathcal{V}}\rightarrow\mathcal{F}_{\mathcal{V}}$ , defined by:

\displaystyle\Delta_{p}\mathbf{F}:=-\frac{1}{2}\text{div}(\|\nabla\mathbf{F}\|^{p-2}\nabla\mathbf{F}),\quad\text{for}\,\,\,p\geq 1.

(14)

where $\|\cdot\|^{p-2}$ is element-wise power over the node gradient $\nabla\mathbf{F}$ .

The corresponding p-Dirichlet form can be denoted as:

\displaystyle\mathcal{S}_{p}(\mathbf{F})=\frac{1}{2}\sum_{(v_{i},v_{j})\in\mathcal{E}}\left\|\sqrt{\frac{w_{i,j}}{d_{j,j}}}\mathbf{f}_{j}-\sqrt{\frac{w_{i,j}}{d_{i,i}}}\mathbf{f}_{i}\right\|^{p},

(15)

where we adopt the definition of $p$ -norm as [14]. It is not difficult to verify that once we set $p=2$ , we recover the graph Dirichlet energy [43] that is widely used to measure the difference between node features along the GNN propagation process.

Remark 1 (Dirichlet Energy, Graph Homophily and Heterophily).

Graph Dirichlet energy [14, 3] has become a commonly applied measure of variation between node features via GNNs. It has been shown that once the graph is highly heterophilic where the connected nodes are not likely to share identical labels, one may prefer GNNs that exhibit nodes feature sharpening effect, thus increasing Dirichlet energy, such that the final classification output of the connected nodes from these GNNs tend to be different. Whereas, when the graph is highly homophilic, a smoothing effect (thus a decrease of Dirichlet energy) is preferred.

[25] further generalized the p-Dirichlet form in (15) as:

		$\displaystyle\mathcal{S}_{p}(\mathbf{F})=\frac{1}{2}\sum_{(v_{i},v_{j})\in\mathcal{E}}\left\\|\nabla_{W}\mathbf{F}([i,j])\right\\|^{p}$
	$\displaystyle=$	$\displaystyle\frac{1}{2}\sum_{v_{i}\in\mathcal{V}}\left[\left(\sum_{v_{j}\sim v_{i}}\left\\|\nabla_{W}\mathbf{F}([i,j])\right\\|^{p}\right)^{\frac{1}{p}}\right]^{p}=\frac{1}{2}\sum_{v_{i}\in\mathcal{V}}\\|\nabla_{W}\mathbf{F}(v_{i})\\|_{p}^{p},$		(16)

where $v_{j}\sim v_{i}$ stands for the node $v_{j}$ that is connected to node $v_{i}$ and $\nabla_{W}\mathbf{F}(v_{i})=\left(\nabla_{W}\mathbf{F}([i,j])\right)_{v_{j}:(v_{i},v_{j})\in\mathcal{E}}$ is the node gradient vector for each node $v_{i}$ and $\|\cdot\|_{p}$ is the vector $p$ -norm. Moreover, we can further generalize the regularizer $\mathcal{S}_{p}(\mathbf{F})$ by considering any positive convex function $\phi$ as:

\displaystyle\mathcal{S}^{\phi}_{p}(\mathbf{F})=\frac{1}{2}\sum_{v_{i}\in\mathcal{V}}\phi(\|\nabla_{W}\mathbf{F}(v_{i})\|_{p}).

(17)

There are many choices of $\phi$ and $p$ . When $\phi(\xi)=\xi^{p}$ , we recover the $p$ -Laplacian regularizer. Interestingly, by setting $\phi(\xi)=\xi^{2}$ , we recover the so-called Tikhonov regularization which is frequently applied in image processing. When $\phi(\xi)=\xi$ , i.e. identity map written as $id$ , and $p=1$ , $\mathcal{S}^{id}_{1}(\mathbf{F})$ becomes the classic total variation regularization. Last but not the least, $\phi(\xi)=r^{2}\log(1+\xi^{2}/r^{2})$ gives nonlinear diffusion. We note that there are many other choices on the form of $\phi$ . In this paper we will only focus on those mentioned in [25] (i.e., the smooth ones). As a result, the flexible design of the energy regularizer in (17) provides different penalty strength in regularizing the node features propagated from GNNs.

Accordingly, the regularization problem proposed in [25] is:

\displaystyle\mathbf{F}=\operatorname*{arg\,min}_{\mathbf{F}}\mathcal{S}^{\phi}_{p}(\mathbf{F})+\mu\|\mathbf{F}-\mathcal{W}^{\top}\mathrm{diag}(\theta)\mathcal{W}\mathbf{F}^{(k)}\|^{2}_{F},

(18)

where $\mathbf{Y}:=\mathcal{W}^{\top}\mathrm{diag}(\theta)\mathcal{W}\mathbf{F}^{(k)}$ stands for the node feature generated by the spectral framelet models (9) without activation $\sigma$ . This is the implicit layer proposed in [25]. As the optimization problem defined in (18) does not have a closed-form solution when $p\neq 2$ , an iterative algorithm is developed in [25] to address this issue. The justification is summarized by the following proposition (Theorem 1 in [25]):

Proposition 1.

For a given positive convex function $\phi(\xi)$ , define

	$\displaystyle M_{i,j}=$	$\displaystyle\frac{w_{i,j}}{2}\left\\|\nabla_{W}\mathbf{F}([i,j])\right\\|^{p-2}\cdot\left[\frac{\phi^{\prime}(\\|\nabla_{W}\mathbf{F}(v_{i})\\|_{p})}{\\|\nabla_{W}\mathbf{F}(v_{i})\\|^{p-1}_{p}}\right.+\left.\frac{\phi^{\prime}(\\|\nabla_{W}\mathbf{F}(v_{j})\\|_{p})}{\\|\nabla_{W}\mathbf{F}(v_{j})\\|^{p-1}_{p}}\right],$
	$\displaystyle\alpha_{ii}=$	$\displaystyle 1/\left(\sum_{v_{j}\sim v_{i}}\frac{M_{i,j}}{d_{i,i}}+2\mu\right),\quad\quad\beta_{ii}=2\mu\alpha_{ii},$

and denote the matrices $\mathbf{M}=[M_{i,j}]$ , $\boldsymbol{\alpha}=\text{diag}(\alpha_{11},...,\alpha_{NN})$ and $\boldsymbol{\beta}=\text{diag}(\beta_{11},...,\beta_{NN})$ . Then problem (18) can be solved by the following message passing process

\displaystyle\mathbf{F}^{(k+1)}=\boldsymbol{\alpha}^{(k)}\mathbf{D}^{-1/2}\mathbf{M}^{(k)}\mathbf{D}^{-1/2}\mathbf{F}^{(k)}+\boldsymbol{\beta}^{(k)}\mathbf{Y},

(19)

with an initial value, e.g., $\mathbf{F}^{(0)}=\mathbf{0}$ or $\mathbf{Y}$ . Note, $k$ denotes the discrete time index (iteration).

In this paper, we call model (18) together with its iteration algorithm (19) the pL-UFG model.

Due to the extensive analysis conducted on the graph framelet’s properties, our subsequent analysis will primarily concentrate on the iterative scheme presented in (19). However, we will also unveil the interaction between this implicit layer and the framelet in the following sections.

3 Theoretical Analysis of the pL-UFG

In this section, we show detailed analysis on the convergence (Section 3.1) and energy behavior (Section 3.2) of the iterative algorithm in solving the implicit layer presented in (19). In addition, we will also present some results regarding to the interaction between the implicit layer and graph framelet in Section 3.3 via the energy dynamic aspect based on the conclusion from Section 3.2. Lastly in Section 3.5, we will verify that the iterative algorithm induced from the p-Laplacian implicit layer admits a generalized non-linear diffusion process, thereby connecting the discrete iterative algorithm to the differential equations on graph.

First, we consider the form of matrix $\mathbf{M}$ in (19). Write

\displaystyle\zeta_{i,j}^{\phi}(\mathbf{F})=\frac{1}{2}\left[\frac{\phi^{\prime}(\|\nabla_{W}\mathbf{F}^{(k+1)}(v_{i})\|_{p})}{\|\nabla_{W}\mathbf{F}^{(k+1)}(v_{i})\|^{p-1}_{p}}+\frac{\phi^{\prime}(\|\nabla_{W}\mathbf{F}^{(k+1)}(v_{j})\|_{p})}{\|\nabla_{W}\mathbf{F}^{(k+1)}(v_{j})\|^{p-1}_{p}}\right].

(20)

$M_{i,j}$ can be simplified as

\displaystyle M_{i,j}=\zeta_{i,j}^{\phi}(\mathbf{F})w_{i,j}\left\|\nabla_{W}\mathbf{F}([i,j])\right\|^{p-2}.

(21)

$\zeta_{i,j}^{\phi}(\mathbf{F})$ is bounded as shown in the following lemma.

Lemma 1.

Assume

\displaystyle\frac{\phi^{\prime}(\xi)}{\xi^{p-1}}\leq C,

(22)

for a suitable constant $C$ . We have $|\zeta_{i,j}^{\phi}(\mathbf{F})|\leq C$ .

The proof is trivial thus we omit it here. In the sequel, we use $\zeta_{i,j}(\mathbf{F})$ for $\zeta_{i,j}^{\phi}(\mathbf{F})$ instead.

Remark 2.

It is reasonable for assuming condition in (22) in Lemma 1 so that $\zeta_{i,j}(\mathbf{F})$ is bounded. For example, one can easily verify that when $\phi(\xi)=\xi^{p}$ , $\zeta_{i,j}(\mathbf{F})$ is bounded for all $p$ . In particular, when $p=2$ , i.e., $\phi(\xi)=\xi^{2}$ , we have $\frac{\phi^{\prime}(\xi)}{\xi^{p-1}}=\frac{2\xi}{\xi^{p-1}}=2\frac{\xi^{2}}{\xi^{p}}$ , thus $\zeta_{i,j}(\mathbf{F})$ is bounded for all $0<p\leq 2$ . Furthermore, when $\phi(\xi)=\xi$ , then $\frac{\phi^{\prime}(\xi)}{\xi^{p-1}}=\frac{\xi}{\xi^{p-1}}$ , indicating $\zeta_{i,j}(\mathbf{F})$ is bounded for all $0<p\leq 1$ . In addition, when $\phi(\xi)=\sqrt{\xi^{2}+\epsilon^{2}}-\epsilon$ , we have $\frac{\phi^{\prime}(\xi)}{\xi^{p-1}}=\frac{(\xi^{2}+\epsilon^{2})^{1/2}\xi}{\xi^{p-1}}\leq C\frac{\xi}{\xi^{p-1}}$ . Therefore $\zeta_{i,j}(\mathbf{F})$ is bounded for all $0<p\leq 2$ . Lastly, when $\phi(\xi)=r^{2}\log(1+\frac{\xi^{2}}{r^{2}})$ , the result of $\frac{\phi^{\prime}(\xi)}{\xi^{p-1}}$ yields $r^{2}\frac{1}{1+\frac{\xi^{2}}{r^{2}}}\cdot\frac{\frac{2}{r^{2}}\xi}{\xi^{p-1}}\leq 2\frac{\xi}{\xi^{p-1}}$ . Hence $\zeta_{i,j}(\mathbf{F})$ remain bounded for all $0<p\leq 2$ . In summary, for all forms of $\phi$ we included in the model, $\zeta_{i,j}(\mathbf{F})$ is bounded.

The boundedness of $\zeta_{i,j}(\mathbf{F})$ from Lemma 1 is useful in the following convergence analysis.

3.1 Convergence Analysis of pL-UFG

We show the iterative algorithm presented in (19) will converge with a suitable choice of $\mu$ . We further note that although the format of Theorem 1 is similar to Theorem 2 in [13], our message passing scheme presented in (19) is different compared to the one defined in [13] via the forms of $\mathbf{M}$ , $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ . In fact, the model defined in [13] can be considered as a special case where $\phi(\xi)=\xi^{p}$ . As a generalization of the model proposed in [13], we provide a uniform convergence analysis for the pL-UFG.

Theorem 1 (Weak Convergence of the Proposed Model).

Given a graph $\mathcal{G}(\mathcal{V},\mathcal{E},\mathbf{W}$ ) with node features $\mathbf{X}$ , if $\boldsymbol{\alpha}^{(k)}$ , $\boldsymbol{\beta}^{(k)}$ , $\mathbf{M}^{(k)}$ and $\mathbf{F}^{(k)}$ are updated according to (19), then there exist some real positive value $\mu$ , which depends on the input graph ( $\mathcal{G}$ , $\mathbf{X}$ ) and the quantity of $p$ , updated in each iteration, such that:

\mathcal{L}_{p}^{\phi}(\mathbf{F}^{(k+1)})\leq\mathcal{L}_{p}^{\phi}(\mathbf{F}^{(k)}),

where $\mathcal{L}_{p}^{\phi}(\mathbf{F}):=\mathcal{S}^{\phi}_{p}(\mathbf{F})+\mu\|\mathbf{F}-\mathbf{Y}\|^{2}_{F}$ .

Proof.

First, write

\displaystyle M^{(k)}_{i,j}=

\displaystyle\frac{w_{i,j}}{2}\left\|\nabla_{W}\mathbf{F}^{(k)}([i,j])\right\|^{p-2}\cdot\left[\frac{\phi^{\prime}(\|\nabla_{W}\mathbf{F}^{(k)}(v_{i})\|_{p})}{\|\nabla_{W}\mathbf{F}^{(k)}(v_{i})\|^{p-1}_{p}}\right.+\left.\frac{\phi^{\prime}(\|\nabla_{W}\mathbf{F}^{(k)}(v_{j})\|_{p})}{\|\nabla_{W}\mathbf{F}^{(k)}(v_{j})\|^{p-1}_{p}}\right].

(23)

The derivative of the regularization problem defined in (18) is:

\displaystyle\begin{aligned} \left.\frac{\partial\mathcal{L}^{\phi}_{p}(\mathbf{F})}{\partial\mathbf{F}_{i,:}}\right|_{\mathbf{F}^{(k)}}=&2\mu(\mathbf{F}^{(k)}_{i,:}-\mathbf{Y}_{i,:})+\sum_{v_{j}\sim v_{i}}M^{(k)}_{ij}\frac{1}{\sqrt{d_{ii}w_{ij}}}\nabla_{W}\mathbf{F}^{(k)}([j,i])\\ =&2\mu(\mathbf{F}^{(k)}_{i,:}-\mathbf{Y}_{i,:})+\sum_{v_{j}\sim v_{i}}M^{(k)}_{ij}\left(\frac{1}{d_{ii}}\mathbf{F}^{(k)}_{i,:}-\frac{1}{\sqrt{d_{ii}d_{jj}}}\mathbf{F}^{(k)}_{j,:}\right)\\ =&(2\mu+\sum_{v_{j}\sim v_{i}}M^{(k)}_{ij}/d_{ii})\mathbf{F}^{(k)}_{i,:}-2\mu\mathbf{Y}_{i,:}-\sum_{v_{j}\sim v_{i}}\frac{M^{(k)}_{ij}}{\sqrt{d_{ii}d_{jj}}}\mathbf{F}^{(k)}_{j,:}\\ =&\frac{1}{\alpha^{(k)}_{ii}}\mathbf{F}^{(k)}_{i,:}-\frac{1}{\alpha^{(k)}_{ii}}\left(\beta^{(k)}_{ii}\mathbf{Y}_{i,:}+\alpha^{(k)}_{ii}\sum_{v_{j}\sim v_{i}}\frac{M^{(k)}_{ij}}{\sqrt{d_{ii}d_{jj}}}\mathbf{F}^{(k)}_{j,:}\right)\end{aligned}

(24)

Thus, according the update rule of $\mathbf{F}^{(k+1)}$ in (19), we have

\displaystyle\left.\frac{\partial\mathcal{L}^{\phi}_{p}(\mathbf{F})}{\partial\mathbf{F}_{i,:}}\right|_{\mathbf{F}^{(k)}}=\frac{\mathbf{F}^{(k)}_{i,:}-\mathbf{F}^{(k+1)}_{i,:}}{\alpha^{(k)}_{ii}}.

(25)

For our purpose, we denote the partial derivative at $\mathbf{F}^{(*)}$ of the objective function with respect to the node feature $\mathbf{F}_{i,;}$ as

\displaystyle\partial\mathcal{L}^{\phi}_{p}(\mathbf{F}^{(*)}_{i,:}):=\left.\frac{\partial\mathcal{L}^{\phi}_{p}(\mathbf{F})}{\partial\mathbf{F}_{i,:}}\right|_{\mathbf{F}^{(*)}}

(26)

For all $i,j\in[N]$ , let $\mathbf{v}\in\mathbb{R}^{1\times c}$ be a disturbance acting on node $i$ . Define the following:

\displaystyle\begin{aligned} &N^{(k)}_{i,j}=W_{i,j}\left\|\sqrt{\frac{W_{i,j}}{D_{i,i}}}\mathbf{F}^{(k)}_{i,:}-\sqrt{\frac{W_{i,j}}{D_{j,j}}}\mathbf{F}^{(k)}_{j,:}\right\|^{p-2}\\ &N^{\prime(k)}_{i,j}=W_{i,j}\left\|\sqrt{\frac{W_{i,j}}{D_{i,i}}}(\mathbf{F}^{(k)}_{i,:}+\mathbf{v})-\sqrt{\frac{W_{i,j}}{D_{j,j}}}\mathbf{F}^{(k)}_{j,:}\right\|^{p-2}\\ &M^{(k)}_{i,j}=N^{(k)}_{ij}\zeta_{i,j}(\mathbf{F}^{(k)}),\;\;\;M^{\prime(k)}_{i,j}=N^{\prime(k)}_{ij}\zeta_{i,j}(\mathbf{F}^{(k)}+\mathbf{v})\\ &\alpha^{\prime(k)}_{ii}=1/\left(\sum_{v_{j}\sim v_{i}}\frac{M^{\prime(k)}_{i,j}}{D_{i,i}}+2\mu\right),\quad\quad\beta^{\prime(k)}_{ii}=2\mu\alpha^{\prime(k)}_{ii}\\ &\mathbf{F}^{\prime(k+1)}_{i,:}=\alpha^{\prime(k)}_{i,i}\sum_{v_{j}\sim v_{i}}\frac{\mathbf{M}^{\prime(k)}_{i,j}}{\sqrt{D_{i,i}D_{j,j}}}\mathbf{F}^{(k)}_{j,:}+\boldsymbol{\beta}^{\prime(k)}\mathbf{Y}_{i,:},\end{aligned}

(27)

where $\zeta_{ij}(\mathbf{F})$ is defined as (20) and $\mathbf{F}^{(k)}+\mathbf{v}$ means that $\mathbf{v}$ only applies to the $i$ -th of $\mathbf{F}^{(k)}$ ⁵⁵5With slightly abuse of notation, we denote $N^{\prime}$ as the matrix after assigning the disturbance $\mathbf{v}$ to the matrix $N$ ..

Similar to (25), we compute

\displaystyle\partial\mathcal{L}_{p}^{\phi}(\mathbf{F}^{(k)}_{i,:}+\mathbf{v})=\frac{1}{\alpha^{\prime(k)}_{i,i}}\left((\mathbf{F}^{(k)}_{i,:}+\mathbf{v})-\mathbf{F}^{\prime(k+1)}_{i,:}\right).

(28)

Hence from both (25) and (28) we will have

		$\displaystyle\left\\|\partial\mathcal{L}_{p}^{\phi}(\mathbf{F}^{(k)}_{i,:}+\mathbf{v})-\partial\mathcal{L}_{p}^{\phi}(\mathbf{F}^{(k)}_{i,:})\right\\|=\left\\|\frac{1}{\alpha^{\prime(k)}_{i,i}}\left((\mathbf{F}^{(k)}_{i,:}+\mathbf{v})-\mathbf{F}^{\prime(k+1)}_{i,:}\right)-\frac{1}{\alpha^{(k)}_{i,i}}\left(\mathbf{F}^{(k)}_{i,:}-\mathbf{F}^{(k+1)}_{i,:}\right)\right\\|$
	$\displaystyle\leq$	$\displaystyle\frac{1}{\alpha^{\prime(k)}_{i,i}}\\|\mathbf{v}\\|+\left\\|\frac{1}{\alpha^{\prime(k)}_{i,i}}\left(\mathbf{F}^{(k)}_{i,:}-\mathbf{F}^{\prime(k+1)}_{i,:}\right)-\frac{1}{\alpha^{(k)}_{i,i}}\left(\mathbf{F}^{(k)}_{i,:}-\mathbf{F}^{(k+1)}_{i,:}\right)\right\\|$
	$\displaystyle=$	$\displaystyle\frac{1}{\alpha^{\prime(k)}_{i,i}}\\|\mathbf{v}\\|+\left\\|\left(\frac{1}{\alpha^{\prime(k)}_{i,i}}-\frac{1}{\alpha^{(k)}_{i,i}}\right)\mathbf{F}^{(k)}_{i,:}-\frac{1}{\alpha^{\prime(k)}_{i,i}}\mathbf{F}^{\prime(k+1)}_{i,:}+\frac{1}{\alpha^{(k)}_{i,i}}\mathbf{F}^{(k+1)}_{i,:}\right\\|$
	$\displaystyle=$	$\displaystyle\frac{1}{\alpha^{\prime(k)}_{i,i}}\\|\mathbf{v}\\|+\left\\|\sum_{v_{j}\sim v_{i}}\left(\frac{M^{\prime(k)}_{i,j}}{D_{i,i}}-\frac{M^{(k)}_{i,j}}{D_{i,i}}\right)\mathbf{F}^{(k)}_{i,:}-\sum_{v_{j}\sim v_{i}}\left(\frac{M^{\prime(k)}_{i,j}}{\sqrt{D_{i,i}D_{j,j}}}\right)\mathbf{F}^{\prime(k)}_{j,:}+\left(\frac{M^{(k)}_{i,j}}{\sqrt{D_{i,i}D_{j,j}}}\right)\mathbf{F}^{(k)}_{j,:}\right\\|$
	$\displaystyle=$	$\displaystyle\left(\sum_{v_{j}\sim v_{i}}\frac{M^{(k)}_{i,j}}{D_{i,i}}+2\mu\right)\\|\mathbf{v}\\|+\sum_{v_{j}\sim v_{i}}\left(\frac{M^{\prime(k)}_{i,j}-M^{(k)}_{i,j}}{D_{i,i}}\right)\\|\mathbf{v}\\|$
		$\displaystyle+\left\\|\sum_{v_{j}\sim v_{i}}\left(\frac{M^{\prime(k)}_{i,j}-M^{(k)}_{i,j}}{D_{i,i}}\right)\mathbf{F}^{(k)}_{i,:}-\sum_{v_{j}\sim v_{i}}\left(\frac{M^{\prime(k)}_{i,j}-M^{(k)}_{i,j}}{\sqrt{D_{i,i}D_{j,j}}}\right)\mathbf{F}^{(k)}_{j,:}\right\\|.$

Note that in (27), $\|\cdot\|^{p-2}$ is the matrix $L_{2}$ norm raised to power $p-2$ , that is $\|\mathbf{X}\|^{p-2}=\left((\sum_{i,j}x^{2}_{i,j})^{\frac{1}{2}}\right)^{p-2}$ . It is known that the matrix $L_{2}$ norm as a function is Lipschitz [23], so is its exponential to $p-2$ . Furthermore, it is easy to verify that $\|\mathbf{N}^{\prime}-\mathbf{N}\|\leq c\|\mathbf{v}\|$ due to the property of $\mathbf{N}$ and $\mathbf{N}^{\prime}$ . Hence, according to Lemma 1, the following holds

|M^{\prime(k)}_{i,j}-M^{(k)}_{i,j}|\leq C|N^{\prime(k)}_{i,j}-N^{(k)}_{i,j}|\leq C^{\prime}\|\mathbf{v}\|.

Combining all the above, we have

\displaystyle\left\|\partial\mathcal{L}_{p}^{\phi}(\mathbf{F}^{(k)}_{i,:}+\mathbf{v})-\partial\mathcal{L}_{p}^{\phi}(\mathbf{F}^{(k)}_{i,:})\right\|\leq\left(\sum_{v_{j}\sim v_{i}}\frac{M^{(k)}_{i,j}}{D_{i,i}}+2\mu+o(\mathcal{G},\mathbf{v},\mathbf{X},p)\right)\|\mathbf{v}\|,

(29)

where $o(\mathcal{G},\mathbf{v},\mathbf{X},p)$ is bounded. It is worth noting that the quantity of $o(\mathcal{G},\mathbf{v},\mathbf{X},p)$ is bounded by

\sum_{v_{j}\sim v_{i}}\left(\frac{M^{\prime(k)}_{i,j}-M^{(k)}_{i,j}}{D_{i,i}}\right)\|\mathbf{v}\|+\left\|\sum_{v_{j}\sim v_{i}}\left(\frac{M^{\prime(k)}_{i,j}-M^{(k)}_{i,j}}{D_{i,i}}\right)\mathbf{F}^{(k)}_{i,:}-\sum_{v_{j}\sim v_{i}}\left(\frac{M^{\prime(k)}_{i,j}-M^{(k)}_{i,j}}{\sqrt{D_{i,i}D_{j,j}}}\right)\mathbf{F}^{(k)}_{j,:}\right\|.

Let $\overline{o}=o(\mathcal{G},\mathbf{v},\mathbf{X},p)$ , $\boldsymbol{\gamma}=\{\gamma_{1},...\gamma_{N}\}^{\top}$ , and $\boldsymbol{\eta}\in\mathbb{R}^{N\times c}$ . By the Taylor expansion theorem we have:

	$\displaystyle\mathcal{L}_{p}^{\phi}(\mathbf{F}_{i,:}^{(k)}+\gamma_{i}\boldsymbol{\eta}_{i,:})=\mathcal{L}_{p}^{\phi}(\mathbf{F}^{(k)}_{i,:})+\gamma_{i}\int_{0}^{1}\langle\partial\mathcal{L}_{p}^{\phi}(\mathbf{F}^{(k)}_{i,:}+\epsilon\gamma_{i}\boldsymbol{\eta}_{i,:}),\boldsymbol{\eta}_{i,:}\rangle d\epsilon\quad\forall i$
	$\displaystyle=\mathcal{L}_{p}^{\phi}(\mathbf{F}^{(k)}_{i,:})+\langle\partial\mathcal{L}_{p}^{\phi}(\mathbf{F}^{(k)}_{i,:}),\boldsymbol{\eta}_{i,:}\rangle+\gamma_{i}\int_{0}^{1}\left\langle\partial\mathcal{L}_{p}^{\phi}\left(\mathbf{F}^{(k)}_{i,:}+\epsilon\gamma_{i}\boldsymbol{\eta}_{i,:}-\partial\mathcal{L}_{p}^{\phi}\left(\mathbf{F}^{(k)}_{i,:}\right)\right),\boldsymbol{\eta}_{i,:}\right\rangle d\epsilon$
	$\displaystyle\leq\mathcal{L}_{p}^{\phi}(\mathbf{F}^{(k)}_{i,:})+\langle\partial\mathcal{L}_{p}^{\phi}(\mathbf{F}^{(k)}_{i,:}),\boldsymbol{\eta}_{i,:}\rangle\gamma_{i}+\gamma_{i}\int_{0}^{1}\left\\|\partial\mathcal{L}_{p}^{\phi}\left(\mathbf{F}^{(k)}_{i,:}+\epsilon\gamma_{i}\boldsymbol{\eta}_{i,:}-\partial\mathcal{L}_{p}^{\phi}\left(\mathbf{F}^{(k)}_{i,:}\right)\right)\right\\|\left\\|\boldsymbol{\eta}_{i,:}\right\\|d\epsilon$
	$\displaystyle\leq\mathcal{L}_{p}^{\phi}(\mathbf{F}^{(k)}_{i,:})+\langle\partial\mathcal{L}_{p}^{\phi}(\mathbf{F}^{(k)}_{i,:}),\boldsymbol{\eta}_{i,:}\rangle\gamma_{i}+\left(\frac{1}{\alpha^{(k)}_{i,i}}+\overline{o}\right)\gamma_{i}^{2}\\|\boldsymbol{\eta}_{i,:}\\|^{2}$

where the last inequality comes from (29).

Taking $\gamma_{i}=\alpha^{(k)}_{ii}$ and $\boldsymbol{\eta}_{i,:}=-\partial\mathcal{L}_{p}^{\phi}(\mathbf{F}^{(k)}_{i,:})$ in the above inequality gives

	$\displaystyle\mathcal{L}_{p}^{\phi}\left(\mathbf{F}_{i,:}^{(k)}-\alpha^{(k)}_{ii}\partial\mathcal{L}_{p}^{\phi}(\mathbf{F}_{i,:}^{(k)})\right)$
$\displaystyle\leq$	$\displaystyle\mathcal{L}_{p}^{\phi}(\mathbf{F}_{i,:}^{(k)})-\alpha^{(k)}_{ii}\left\langle\partial\mathcal{L}_{p}^{\phi}(\mathbf{F}_{i,:}^{(k)}),\partial\mathcal{L}_{p}^{\phi}(\mathbf{F}_{i,:}^{(k)})\right\rangle+\frac{1}{2}\left(\frac{1}{\alpha^{(k)}_{i,i}}+\overline{o}\right)\alpha^{2(k)}_{i,i}\\|\partial\mathcal{L}_{p}^{\phi}(\mathbf{F}_{i,:}^{(k)})\\|^{2}$
$\displaystyle=$	$\displaystyle\mathcal{L}_{p}^{\phi}(\mathbf{F}_{i,:}^{(k)})-\frac{1}{2}\alpha^{(k)}_{i,i}\left(1-\alpha^{(k)}_{i,i}\overline{o}\right)\\|\partial\mathcal{L}_{p}^{\phi}(\mathbf{F}_{i,:}^{(k)})\\|^{2}.$	(30)

Given that $\overline{o}$ is bounded, if we choose a large $\mu$ , e.g., $2\mu>\overline{o}$ , we will have

1-\alpha^{(k)}_{i,i}\overline{o}=1-\frac{\overline{o}}{\sum_{v_{j}\sim v_{i}}\frac{M^{(k)}_{i,j}}{D_{i,i}}+2\mu}>0.

Thus the second term in (30) is positive. Hence we have

\mathcal{L}_{p}^{\phi}(\mathbf{F}_{i,:}^{(k+1)}):=\mathcal{L}_{p}^{\phi}\left(\mathbf{F}_{i,:}^{(k)}-\alpha^{(k)}_{ii}\partial\mathcal{L}_{p}^{\phi}(\mathbf{F}_{i,:}^{(k)})\right)\leq\mathcal{L}_{p}^{\phi}(\mathbf{F}_{i,:}^{(k)}).

This completes the proof. ∎

Theorem 1 shows that with an appropriately chosen value of $\mu$ , the iteration scheme for the implicit layer (18) is guaranteed to coverage. This inspires us to explore further on the variation of the node feature produced from implicit layer asymptotically. Recall that to measure the difference between node features, one common choice is to analyze its Dirichlet energy, which is initially considered in the setting $p=2$ in (15). It is known that the Dirichlet energy of the node feature tend to approach to 0 after sufficiently large number of iterations in many GNN models [21, 35, 4, 10], known as over-smoothing problem. However, as we will show in the next section, by taking large $\mu$ or small $p$ , the iteration from the implicit layer will always lift up the Dirichlet energy of the node features, and over-smoothing issue can be resolved completely in pL-UFG.

3.2 Energy Behavior of the pL-UFG

In this section, we show the energy behavior of the p-Laplacian based implicit layer. Specifically, we are interested in analyzing the property of the generalized Dirichlet energy defined in [3].We start by denoting generalized graph convolution as follows:

\displaystyle\mathbf{F}^{(k+\tau)}=\mathbf{F}^{(k)}+\tau\sigma\left(-\mathbf{F}^{(k)}\mathbf{\Omega}^{(k)}+\widehat{\mathbf{A}}\mathbf{F}^{(k)}\widehat{\mathbf{W}}^{(k)}-\mathbf{F}^{(0)}\widetilde{\mathbf{W}}^{(k)}\right),

(31)

where $\mathbf{\Omega}^{(k)}$ , $\widehat{\mathbf{W}}^{(k)}$ and $\widetilde{\mathbf{W}}^{(k)}\in\mathbb{R}^{c\times c}$ act on each node feature vector independently and perform channel mixing. When $\tau=1$ , and $\mathbf{\Omega}^{(k)}=\widetilde{\mathbf{W}}^{(k)}=\mathbf{0}$ , it returns to GCN [21]. Additionally, by setting $\mathbf{\Omega}^{(k)}\neq 0$ , we have the anisotropic instance of GraphSAGE [36]. To quantify the quality of the node features generated by (31), specifically, [3] considered a new class of energy as defined below,

\displaystyle\mathbf{E}(\mathbf{F})=\frac{1}{2}\sum^{N}_{i=1}\langle\mathbf{f}_{i},\mathbf{\Omega}\mathbf{f}_{i}\rangle-\frac{1}{2}\sum^{N}_{i,j=1}\widehat{\mathbf{A}}_{i,j}\langle\mathbf{f}_{i},\widehat{\mathbf{W}}\mathbf{f}_{j}\rangle+\varphi^{(0)}(\mathbf{F},\mathbf{F}^{(0)}),

(32)

in which $\varphi^{(0)}(\mathbf{F},\mathbf{F}^{(0)})$ serves a function of that induces the source term from $\mathbf{F}$ or $\mathbf{F}^{(0)}$ . It is worth noting that by setting $\mathbf{\Omega}=\widehat{\mathbf{W}}=\mathbf{I}_{c}$ and $\varphi^{(0)}=0$ , we recover the classic Dirichlet energy when setting $p=2$ in (15) that is, $\mathbf{E}(\mathbf{F})=\frac{1}{2}\sum_{(v_{i},v_{j})\in\mathcal{E}}\left\|\sqrt{\frac{w_{i,j}}{d_{j,j}}}\mathbf{f}_{j}-\sqrt{\frac{w_{i,j}}{d_{i,i}}}\mathbf{f}_{i}\right\|^{2}$ . Additionally, when we set $\varphi^{(0)}{(\mathbf{F},\mathbf{F}^{(0)})}=\sum_{i}\langle\mathbf{f}_{i},\widetilde{\mathbf{W}}\mathbf{f}^{(0)}_{i}\rangle$ , (32) can be rewritten as:

\displaystyle\mathbf{E}(\mathbf{F})=\left\langle\mathrm{vec}(\mathbf{F}),\frac{1}{2}(\mathbf{\Omega}\otimes\mathbf{I}_{N}-\widehat{\mathbf{W}}\otimes\widehat{\mathbf{A}})\mathrm{vec}(\mathbf{F})+(\widetilde{\mathbf{W}}\otimes\mathbf{I}_{N})\mathrm{vec}(\mathbf{F}^{(0)})\right\rangle.

(33)

Recall that (19) produces the node feature $\mathbf{F}^{(k+1)}$ according to the edge diffusion $\boldsymbol{\alpha}\mathbf{D}^{-1/2}\mathbf{M}\mathbf{D}^{-1/2}$ on $\mathbf{F}^{(k)}$ and the scaled source term $2\mu\boldsymbol{\alpha}\mathbf{F}^{(0)}$ where $\mathbf{F}^{(0)}$ can be set to $\mathbf{Y}$ . To be specific, in (33), we set $\mathbf{\Omega}=\widehat{\mathbf{W}}=\widetilde{\mathbf{W}}=\mathbf{I}_{c}$ and replace the edge diffusion $\widehat{\mathbf{A}}$ with $\boldsymbol{\alpha}\mathbf{D}^{-1/2}\mathbf{M}\mathbf{D}^{-1/2}$ and set the identity matrix $\mathbf{I}_{N}$ in the residual term to be the diagonal matrix $2\mu\boldsymbol{\alpha}$ . Finally we propose

Definition 7 (The Generalized Dirichlet Energy).

Based on the previous notation setting, the generalized Dirichlet energy for the node features $\mathbf{F}^{(k+1)}$ in (19) is:

	$\displaystyle\mathbf{E}^{PF}(\mathbf{F}^{(k+1)})=\left\langle\mathrm{vec}(\mathbf{F}^{(k+1)}),\right.$
	$\displaystyle\left.\frac{1}{2}\left(\mathbf{I}_{c}\otimes\mathbf{I}_{N}-\mathbf{I}_{c}\otimes\left(\boldsymbol{\alpha}^{(k+1)}\mathbf{D}^{-1/2}\mathbf{M}^{(k+1)}\mathbf{D}^{-1/2}\right)\right)\mathrm{vec}(\mathbf{F}^{(k+1)})+(\mathbf{I}_{c}\otimes 2\mu\boldsymbol{\alpha}^{(k+1)})\mathrm{vec}(\mathbf{F}^{(0)})\right\rangle,$		(34)

where the superscript “^PF” is short for p-Laplacian based framelet models.

It is worth noting that the generalized Dirichlet energy defined in (34) is dynamic along the iterative layers due to the non-linear nature of the implicit layer defined in (18). We are now able to analyze the energy ( $\mathbf{E}^{PF}(\mathbf{F})$ ) behavior of the pL-UFG, concluded as the following proposition.

Proposition 2 (Energy Behavior).

Assume $\mathcal{G}$ is connected, unweighted and undirected. There exists sufficiently large value of $\mu$ or small value of $p$ such that $\mathbf{E}^{PF}(\mathbf{F})$ will stay away above 0 at each iterative layer $k\,$ and increases with the increase of $\mu$ or the decrease of $p$ .

Proof.

We start with the definition of the generalized Dirichlet energy above, we can re-write $\mathbf{E}^{PF}(\mathbf{F}^{(k+1)})$ in the following inner product between $\mathrm{vec}(\mathbf{F}^{(k+1)})$ and $\mathrm{vec}(\mathbf{F}^{(0)})$ , based on $\mathbf{M}$ , $\boldsymbol{\alpha}$ , $\boldsymbol{\beta}$ and the iterative scheme defined in (19):

	$\displaystyle\mathbf{E}^{PF}(\mathbf{F}^{(k+1)})=\left\langle\mathrm{vec}(\mathbf{F}^{(k+1)}),\right.$
	$\displaystyle\left.\frac{1}{2}\left(\mathbf{I}_{c}\otimes\mathbf{I}_{N}-\mathbf{I}_{c}\otimes\left(\boldsymbol{\alpha}^{(k+1)}\mathbf{D}^{-1/2}\mathbf{M}^{(k+1)}\mathbf{D}^{-1/2}\right)\right)\mathrm{vec}(\mathbf{F}^{(k+1)})+(\mathbf{I}_{c}\otimes 2\mu\boldsymbol{\alpha}^{(k+1)})\mathrm{vec}(\mathbf{F}^{(0)})\right\rangle$
	$\displaystyle=\left\langle\mathrm{vec}(\mathbf{F}^{(k+1)}),\mathrm{vec}(\mathbf{F}^{(k+1)})\right\rangle-\frac{1}{2}\left\langle\mathrm{vec}(\mathbf{F}^{(k+1)}),\mathbf{I}_{c}\otimes\left(\boldsymbol{\alpha}^{(k+1)}\mathbf{D}^{-1/2}\mathbf{M}^{(k+1)}\mathbf{D}^{-1/2}\right)\mathrm{vec}(\mathbf{F}^{(k+1)})\right.$
	$\displaystyle\left.-(\mathbf{I}_{c}\otimes 4\mu\boldsymbol{\alpha}^{(k+1)})\mathrm{vec}(\mathbf{F}^{(0)})\right\rangle.$		(35)

Based on the form of (35), it is straightforward to see that to let $\mathbf{E}^{PF}(\mathbf{F}^{(k+1)})>0$ and further increase with the desired quantities of $\mu$ and $p$ , it is sufficient to require⁶⁶6Strictly speaking, one shall further require all elements in $\mathbf{F}^{(k+1)}$ larger than or equal to 0. As this can be achieved by assigned a non-linear activation function (i.e., ReLU) to the framelet, we omit it here in our main analysis.:

\displaystyle\mathbf{I}_{c}\otimes\left(\boldsymbol{\alpha}^{(k+1)}\mathbf{D}^{-1/2}\mathbf{M}^{(k+1)}\mathbf{D}^{-1/2}\right)\mathrm{vec}(\mathbf{F}^{(k+1)})-(\mathbf{I}_{c}\otimes 2\mu\boldsymbol{\alpha}^{(k+1)})\mathrm{vec}(\mathbf{F}^{(0)})<0.

(36)

To explicitly show how the quantities of $\mu$ and $p$ affect the term in (36), we start with the case when $k=0$ . When $k=0$ , (36) becomes:

	$\displaystyle\mathbf{I}_{c}\otimes\left(\boldsymbol{\alpha}^{(1)}\mathbf{D}^{-1/2}\mathbf{M}^{(1)}\mathbf{D}^{-1/2}\right)\mathrm{vec}(\mathbf{F}^{(1)})-(\mathbf{I}_{c}\otimes 2\mu\boldsymbol{\alpha}^{(1)})\mathrm{vec}(\mathbf{F}^{(0)})$
$\displaystyle=$	$\displaystyle\mathbf{I}_{c}\otimes\left(\boldsymbol{\alpha}^{(1)}\mathbf{D}^{-1/2}\mathbf{M}^{(1)}\mathbf{D}^{-1/2}\right)\mathrm{vec}\left(\boldsymbol{\alpha}^{(0)}\mathbf{D}^{-1/2}\mathbf{M}^{(0)}\mathbf{D}^{-1/2}\mathbf{F}^{(0)}+2\mu\boldsymbol{\alpha}^{(0)}\mathbf{F}^{(0)}\right)-(\mathbf{I}_{c}\otimes 2\mu\boldsymbol{\alpha}^{(1)})\mathrm{vec}(\mathbf{F}^{(0)}),$
$\displaystyle=$	$\displaystyle\mathbf{I}_{c}\otimes\left(\boldsymbol{\alpha}^{(1)}\mathbf{D}^{-1/2}\mathbf{M}^{(1)}\mathbf{D}^{-1/2}\right)\left(\mathbf{I}_{c}\otimes\left(\boldsymbol{\alpha}^{(0)}\mathbf{D}^{-1/2}\mathbf{M}^{(0)}\mathbf{D}^{-1/2}+2\mu\boldsymbol{\alpha}^{(0)}\right)\mathrm{vec}(\mathbf{F}^{(0)})\right)-(\mathbf{I}_{c}\otimes 2\mu\boldsymbol{\alpha}^{(1)})\mathrm{vec}(\mathbf{F}^{(0)}),$
$\displaystyle=$	$\displaystyle\mathbf{I}_{c}\otimes\left(\prod_{s=0}^{1}\boldsymbol{\alpha}^{(s)}\mathbf{D}^{-1/2}\mathbf{M}^{(s)}\mathbf{D}^{-1/2}+\left(\boldsymbol{\alpha}^{(1)}\mathbf{D}^{-1/2}\mathbf{M}^{(1)}\mathbf{D}^{-1/2}2\mu\boldsymbol{\alpha}^{(0)}\right)-2\mu\boldsymbol{\alpha}^{(1)}\right)\mathrm{vec}(\mathbf{F}^{(0)}).$	(37)

We note that, in (37), $\left(\prod_{s=0}^{1}\boldsymbol{\alpha}^{(s)}\mathbf{D}^{-1/2}\mathbf{M}^{(s)}\mathbf{D}^{-1/2}+\left(\boldsymbol{\alpha}^{(1)}\mathbf{D}^{-1/2}\mathbf{M}^{(1)}\mathbf{D}^{-1/2}2\mu\boldsymbol{\alpha}^{(0)}\right)-2\mu\boldsymbol{\alpha}^{(1)}\right)$ can be computed as:

	$\displaystyle\prod_{s=0}^{1}\alpha^{(s)}_{i,i}{d}^{-1/2}_{i,i}M^{(s)}_{i,j}{d}^{-1/2}_{j,j}+\left(\alpha^{(1)}_{i,i}{d}^{-1/2}_{i,i}M^{(1)}_{i,j}{d}^{-1/2}_{j,j}2\mu\alpha^{(0)}_{i,i}\right)-2\mu\alpha^{(1)}_{i,i}$
	$\displaystyle=\prod_{s=0}^{1}\left[\left(1/\left(\sum_{v_{j}\sim v_{i}}\frac{M^{(s)}_{i,j}}{d_{i,i}}+2\mu\right)\right)\left(\frac{\left\\|\nabla_{W}\mathbf{F}^{(s)}([i,j])\right\\|^{p-2}}{\sqrt{d_{i,i}d_{j,j}}}\right)\right]+$
	$\displaystyle\left[\left(1/\left(\sum_{v_{j}\sim v_{i}}\frac{M^{(1)}_{i,j}}{d_{i,i}}+2\mu\right)\right)\left(\frac{\left\\|\nabla_{W}\mathbf{F}^{(1)}([i,j])\right\\|^{p-2}}{\sqrt{d_{i,i}d_{j,j}}}\right)\left(2\mu/\left(\sum_{v_{j}\sim v_{i}}\frac{M^{(0)}_{i,j}}{d_{i,i}}+2\mu\right)\right)\right]$
	$\displaystyle-\left(2\mu/\left(\sum_{v_{j}\sim v_{i}}\frac{M^{(1)}_{i,j}}{d_{i,i}}+2\mu\right)\right),$
	$\displaystyle=\left(\frac{\left\\|\nabla_{W}\mathbf{F}^{(0)}([i,j])\right\\|^{p-2}}{\left(\sum_{v_{j}\sim v_{i}}\frac{M^{(0)}_{i,j}}{d_{i,i}}+2\mu\right)\cdot\sqrt{d_{i,i}d_{j,j}}}\right)\left(\frac{\left\\|\nabla_{W}\mathbf{F}^{(1)}([i,j])\right\\|^{p-2}}{\left(\sum_{v_{j}\sim v_{i}}\frac{M^{(1)}_{i,j}}{d_{i,i}}+2\mu\right)\cdot\sqrt{d_{i,i}d_{j,j}}}\right)+$
	$\displaystyle\left(\frac{\left\\|\nabla_{W}\mathbf{F}^{(1)}([i,j])\right\\|^{p-2}}{\left(\sum_{v_{j}\sim v_{i}}\frac{M^{(1)}_{i,j}}{d_{i,i}}+2\mu\right)\cdot\sqrt{d_{i,i}d_{j,j}}}\cdot 2\mu/\left(\sum_{v_{j}\sim v_{i}}\frac{M^{(0)}_{i,j}}{d_{i,i}}+2\mu\right)\right)-\left(2\mu/\left(\sum_{v_{j}\sim v_{i}}\frac{M^{(1)}_{i,j}}{d_{i,i}}+2\mu\right)\right).$		(38)

Now we see that by assigning a sufficient large of $\mu$ or small value of $p$ , we can see terms like $\frac{\left\|\nabla_{W}\mathbf{F}^{(1)}([i,j])\right\|^{p-2}}{\left(\sum_{v_{j}\sim v_{i}}\frac{M^{(1)}_{i,j}}{d_{i,i}}+2\mu\right)\cdot\sqrt{d_{i,i}d_{j,j}}}$ in (38) are getting smaller Additionally, we have both $2\mu/\left(\sum_{v_{j}\sim v_{i}}\frac{M^{(0)}_{i,j}}{d_{i,i}}+2\mu\right)$ and $2\mu/\left(\sum_{v_{j}\sim v_{i}}\frac{M^{(1)}_{i,j}}{d_{i,i}}+2\mu\right)\approx 1$ . Therefore, the summation result of (38) tends to be negative. Based on (35), $\mathbf{E}^{PF}(\mathbf{F}^{(k+1)})$ will stay above 0.

For the case that $k\geq 1$ , by taking into the iterative algorithm (19), (37) becomes:

\mathbf{I}_{c}\otimes\left(\left(\prod_{s=0}^{k+1}\boldsymbol{\alpha}^{(s)}\mathbf{D}^{-1/2}\mathbf{M}^{(s)}\mathbf{D}^{-1/2}+\sum_{s=0}^{k+1}\left(\prod_{l=k-s}^{k+1}\boldsymbol{\alpha}^{(l)}\mathbf{D}^{-1/2}\mathbf{M}^{(l)}\mathbf{D}^{-1/2}\right)\left(2\mu\boldsymbol{\alpha}^{(l-1)}\right)-2\mu\boldsymbol{\alpha}^{(k+1)}\right)\right)\mathrm{vec}(\mathbf{F}^{(0)}).

Applying the same reasoning as before, it is not hard to verify that with sufficient large of $\mu$ and small of $p$ , the term $\left(\prod_{s=0}^{k+1}\boldsymbol{\alpha}^{(s)}\mathbf{D}^{-1/2}\mathbf{M}^{(s)}\mathbf{D}^{-1/2}+\sum_{s=0}^{k+1}\left(\prod_{l=k-s}^{k+1}\boldsymbol{\alpha}^{(l)}\mathbf{D}^{-1/2}\mathbf{M}^{(l)}\mathbf{D}^{-1/2}\right)\left(2\mu\boldsymbol{\alpha}^{(l-1)}\right)-2\mu\boldsymbol{\alpha}^{(k+1)}\right)$ in the above equation tend to be negative, yielding a positive $\mathbf{E}^{PF}(\mathbf{F}^{(k+1)})$ . Asymptotically, we have:

\displaystyle\mathbf{E}^{PF}(\mathbf{F}^{(k+1)}){\approx}\left\langle\mathrm{vec}(\mathbf{F}^{(k+1)}),\mathrm{vec}(\mathbf{F}^{(k+1)})\right\rangle+\left\langle\mathrm{vec}(\mathbf{F}^{(k+1)}),\frac{1}{2}\left(\mathbf{I}_{c}\otimes\left(4\mu\boldsymbol{\alpha}^{(k+1)}+\mathbf{I}_{N}\right)\right)\mathrm{vec}(\mathbf{F}^{(0)})\right\rangle.

(39)

This shows that the energy increases along with the magnitude of $\mu$ , and it is not hard to express (39) as the similar form of (38) and verify that the energy decreases with the quantity of $p$ . This completes the proof. ∎

Remark 3.

Proposition 2 shows that, for any of our framelet convolution models, the p-Laplacian based implicit layer will not generate identical node feature across graph nodes, and thus the so-called over-smoothing issue will not appear asymptotically. Furthermore, it is worth noting that the result from Proposition 2 provides the theoretical justification of the empirical observations in [25], where a large value of $\mu$ or small value of $p$ is suitable for fitting heterophily datasets which commonly require the output of GNN to have higher Dirichlet energy.

Remark 4 (Regarding to the quantity of $p$ ).

The conclusion of Proposition 2 is under sufficient large of $\mu$ or small of $p$ . However, it is well-known that the quantity of $p$ cannot be set as arbitrary and in fact it is necessary to have $p\geq 1$ so that the iteration for the solution of the optimization problem defined in (18) can converge. Therefore, it is not hard to see that the effect of $p$ is weaker than $\mu$ in terms of analyzing the asymptotic behavior of the model (i.e., via (38)). Without loss of generality, in the sequel, when we analyze the property of the model with conditions on $\mu$ and $p$ , we mainly target on the effect from $\mu$ and one can check from (38) $\mu$ and $p$ are with opposite effect on the model.

3.3 Interaction with Framelet Energy Dynamic

To analyze the interaction between the energy dynamic of framelet convolution defined in (9) and the p-Laplacian based implicit propagation [25], We first briefly review some recent work on the energy dynamic of the GNNs. In [10], the propagation of GNNs was considered as the gradient flow of the Dirichlet energy that can be formulated as:

\displaystyle\mathbf{E}({\mathbf{F}})=\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\left\|\sqrt{\frac{w_{i,j}}{d_{j,j}}}\mathbf{f}_{j}-\sqrt{\frac{w_{i,j}}{d_{i,i}}}\mathbf{f}_{i}\right\|^{2},

(40)

and similarly by setting the power from 2 to $p$ , we recover the p-Dirichlet form presented in (15). The gradient flow of the Dirichlet energy yields the so-called graph heat equation [9] as $\dot{\mathbf{F}}^{(k)}=-\nabla\mathbf{E}(\mathbf{F}^{(k)})=-\widetilde{\mathbf{L}}\mathbf{F}^{(k)}$ . Its Euler discretization leads to the propagation of linear GCN models [33, 31]. The process is called Laplacian smoothing [22] and it converges to the kernel of $\widetilde{\mathbf{L}}$ , i.e., ${\rm ker}(\widetilde{\mathbf{L}})$ as $k\xrightarrow{}\infty$ , resulting in non-separation of nodes with same degrees, known as the over-smoothing issue.

Following this observation, the work [19, 10] also show even with the help of the non-linear activation function and the weight matrix via classic GCN ((1)), the process described is still dominated by the low frequency (small Laplacian eigenvalues) of the graph, hence eventually converging to the kernel of $\widetilde{\mathbf{L}}$ , for almost every initialization. To quantify such behavior, [10, 19] consider a general dynamic as $\dot{\mathbf{F}}^{(k)}={\rm GNN}_{\theta}(\mathbf{F}^{(k)},k)$ , with ${\rm GNN}_{\theta}(\cdot)$ as an arbitrary graph neural network function, and also characterizes its behavior by low/high-frequency-dominance (L/HFD).

Definition 8 ([10]).

$\dot{\mathbf{F}}^{(k)}={\rm GNN}_{\theta}(\mathbf{F}^{(k)},k)$ is Low-Frequency-Dominant (LFD) if $\mathbf{E}\big{(}\mathbf{F}^{(k)}/\|\mathbf{F}^{(k)}\|\big{)}\xrightarrow{}0$ as $k\xrightarrow{}\infty$ , and is High-Frequency-Dominant (HFD) if $\mathbf{E}\big{(}\mathbf{F}^{(k)}/\|\mathbf{F}^{(k)}\|\big{)}\xrightarrow{}\rho_{\widetilde{\mathbf{L}}}/2$ as $t\xrightarrow{}\infty$ .

Lemma 2 ([10]).

A GNN model is LFD (resp. HFD) if and only if for each $t_{j}\xrightarrow{}\infty$ , there exists a sub-sequence indexed by $k_{j_{l}}\xrightarrow{}\infty$ and $\mathbf{F}_{\infty}$ such that $\mathbf{F}(k_{j_{l}})/\|\mathbf{F}(k_{j_{l}})\|\xrightarrow{}\mathbf{F}_{\infty}$ and $\widetilde{\mathbf{L}}\mathbf{F}_{\infty}=0$ (resp. $\widetilde{\mathbf{L}}\mathbf{F}_{\infty}=\rho_{\widetilde{\mathbf{L}}}\mathbf{F}_{\infty}$ ).

Remark 5 (LFD, HFD and graph homophily).

Based on Definition 8 and Lemma 2, for a given GNN model, if $\mathcal{G}$ is homophilic, i.e., adjacency nodes are more likely to share the same label, one may prefer for the model to induce a LFD dynamic in order to fit the characteristic of $\mathcal{G}$ . On the other hand, if $\mathcal{G}$ is heterophilic, the model is expected to induce a HFD dynamic, so that even in the adjacent nodes, their predicted labels still tend to be different. Thus, ideally, a model should be flexible enough to accommodate both LFD and HFD dynamics.

Generalized from the energy dynamic framework provided in [10], [19] developed a framelet Dirichlet energy and analyzed the energy behavior of both spectral ((9)) and spatial framelet ((10)) convolutions. Specifically, let

\mathbf{E}^{Fr}(\mathbf{F})=\frac{1}{2}\mathrm{Tr}\big{(}(\mathcal{W}_{r,\ell}\mathbf{F})^{\top}\mathcal{W}_{r,\ell}\mathbf{F}\mathbf{\Omega}_{r,\ell}\big{)}-\frac{1}{2}\mathrm{Tr}\big{(}(\mathcal{W}_{r,\ell}\mathbf{F})^{\top}\mathrm{diag}(\theta)_{r,\ell}\mathcal{W}_{r,\ell}\mathbf{F}\widehat{\mathbf{W}}\big{)}

for all $(r,\ell)\in\mathcal{I}$ . The generated framelet energy is given by:

	$\displaystyle\mathbf{E}^{Fr}(\mathbf{F})$	$\displaystyle=\mathbf{E}^{Fr}_{0,J}(\mathbf{F})+\sum_{r,\ell}\mathbf{E}^{Fr}_{r,\ell}(\mathbf{F})$
		$\displaystyle=\frac{1}{2}\sum_{(r,\ell)\in\mathcal{I}}\left\langle\mathrm{vec}(\mathbf{F}),\Big{(}\mathbf{\Omega}_{r,\ell}\otimes\mathcal{W}_{r,\ell}^{\top}\mathcal{W}_{r,\ell}-\widehat{\mathbf{W}}\otimes\mathcal{W}_{r,\ell}^{\top}\mathrm{diag}(\theta)_{r,\ell}\mathcal{W}_{r,\ell}\Big{)}\mathrm{vec}(\mathbf{F})\right\rangle,$		(41)

where the superscript “^Fr” stands for the framelet convolution. This definition is based on the fact that the total Dirichlet energy is conserved under framelet decomposition [19, 10]. By analyzing the gradient flow of the framelet energy ⁷⁷7Similar to the requirement on our p-Laplacian based framelet energy ( $\mathbf{E}^{PF}(\mathbf{F}^{(k+1)})$ , to thoroughly verify the framelet energy in (3.3) is a type of energy, we shall further require: $\nabla^{2}\mathbf{E}^{Fr}_{r,\ell}(\mathbf{F})=\mathbf{\Omega}_{r,\ell}\otimes\mathcal{W}_{r,\ell}^{\top}\mathcal{W}_{r,\ell}-\widehat{\mathbf{W}}\otimes\mathcal{W}_{r,\ell}^{\top}\widetilde{\mathbf{L}}\mathcal{W}_{r,\ell}$ is symmetric, which can be satisfied by requiring both $\mathbf{\Omega}$ and $\widehat{\mathbf{W}}$ are symmetric. defined above, [19] concluded the energy dynamic of framelet as:

Proposition 3 ([19]).

The spectral graph framelet convolution (9) with Haar-type filter (i.e. $R=1$ in the case of scaling function set) can induce both LFD and HFD dynamics. Specifically, let $\boldsymbol{\theta}_{0,\ell}=\mathbf{1}_{N}$ and $\boldsymbol{\theta}_{r,\ell}=\theta\mathbf{1}_{N}$ for $r=1,...,L,\ell=1,...,J$ where $\mathbf{1}_{N}$ is a size $N$ vector of all $1$ s. When $\theta\in[0,1)$ , the spectral framelet convolution is LFD and when $\theta>1$ , the spectral framelet convolution is HFD.

It is worth noting that there are many other settings rather than $\boldsymbol{\theta}_{0,\ell}=\mathbf{1}_{N}$ and $\boldsymbol{\theta}_{r,\ell}=\theta\mathbf{1}_{N}$ , i.e. adjusting $\theta$ , for inducing LFD/HFD from framelet. However, in this paper, we only consider the conditions described in Proposition 3. To properly compare the energy dynamics between the framelet models, we present the following definition.

Definition 9 (Stronger/Weaker Dynamic).

Let $\mathcal{Q}_{\theta}$ be the family of framelet models with the settings described in Proposition 3 and choice of $\theta$ . We say that one framelet model $\mathcal{Q}_{\theta_{1}}$ is with a stronger LFD than another framelet model $\mathcal{Q}_{\theta_{2}}$ if $\theta_{1}<\theta_{2}$ , and weaker otherwise. Similarly, we say $\mathcal{Q}_{\theta_{1}}$ is with a stronger HFD than $\mathcal{Q}_{\theta_{2}}$ if $\theta_{1}>\theta_{2}$ , and weaker otherwise ⁸⁸8In case of any confusion, we note that in this paper we only compare the model’s dynamics relationship when both of two (framelet) models are with the same frequency dominated dynamics (i.e., LFD, HFD)..

Remark 6.

Similar reasoning of Proposition 3 can be easily generalized to other commonly used framelet types such as $\mathrm{Linear},\mathrm{Sigmoid}$ and $\mathrm{Entropy}$ [38].

Before we present our conclusion, we note that to evaluate the changes of (framelet) energy behavior from the impact of implicit layer, one shall also define a layer-wised framelet energy such as $\mathbf{E}^{PF}(\mathbf{F}^{(k+1)})$ by only considering the energy from one step of propagation of graph framelet. With all these settings, we summarize the interaction between framelet and p-Laplacian based implicit propagation as:

Lemma 3 (Stronger HFD).

Based on the condition described in Proposition 3, when framelet is HFD, with sufficient large value of $\mu$ or small of $p$ , the p-Laplacian implicit propagation further amplify the energy $\mathbf{E}^{Fr}(\mathbf{F})$ in (3.3) of the node feature (i.e., $\mathbf{Y}$ in (18)) produced from the framelets, and thus achieving a higher HFD dynamic than original framelet in (9).

Proof.

Recall that by setting sufficient large of $\mu$ or small of $p$ , $\mathbf{E}^{PF}(\mathbf{F}^{(k+1)})$ in (39) has the form

\displaystyle\mathbf{E}^{PF}(\mathbf{F}^{(k+1)})\approx\left\langle\mathrm{vec}(\mathbf{F}^{(k+1)}),\mathrm{vec}(\mathbf{F}^{(k+1)})\right\rangle+\left\langle\mathrm{vec}(\mathbf{F}^{(k+1)}),\frac{1}{2}\left(\mathbf{I}_{c}\otimes\left(4\mu\boldsymbol{\alpha}^{(k+1)}+\mathbf{I}_{N}\right)\right)\mathrm{vec}(\mathbf{F}^{(0)})\right\rangle.

Similarly, when framelet is HFD, with $\boldsymbol{\theta}_{0,\ell}=\mathbf{1}_{N}$ , $\boldsymbol{\theta}_{r,\ell}=\theta\mathbf{1}_{N}$ and $\theta>1$ , the Dirichlet energy (of $\mathbf{F}^{(k+1)}$ ) (3.3) can be rewritten as:

	$\displaystyle\mathbf{E}^{Fr}(\mathbf{F}^{(k+1)})$	$\displaystyle=\frac{1}{2}\sum_{(r,\ell)\in\mathcal{I}}\left\langle\mathrm{vec}(\mathbf{F}^{(k+1)}),\Big{(}\mathbf{\Omega}_{r,\ell}\otimes\mathcal{W}_{r,\ell}^{\top}\mathcal{W}_{r,\ell}-\widehat{\mathbf{W}}\otimes\mathcal{W}_{r,\ell}^{\top}\mathrm{diag}(\theta)_{r,\ell}\mathcal{W}_{r,\ell}\Big{)}\mathrm{vec}(\mathbf{F}^{(k+1)})\right\rangle,$
		$\displaystyle=\frac{1}{2}\sum_{(r,\ell)\in\mathcal{I}}\left\langle\mathrm{vec}(\mathbf{F}^{(k+1)}),\Big{(}\widehat{\mathbf{W}}\otimes\left(\mathcal{W}_{r,\ell}^{\top}\mathcal{W}_{r,\ell}-\mathcal{W}_{r,\ell}^{\top}\mathrm{diag}(\theta)_{r,\ell}\mathcal{W}_{r,\ell}\right)\Big{)}\mathrm{vec}(\mathbf{F}^{(k+1)})\right\rangle,$		(42)

where the last equality is achieved by letting $\mathbf{\Omega}=\widehat{\mathbf{W}}$ , meaning that no external force ⁹⁹9For details, please check [3] exist within the space that contains the node features. We note that it is reasonable to have such assumption in order to explicitly analyze the energy changes in (3.3) via the changes of $\theta$ . Now we take the $\mathrm{Haar}$ framelet with $\ell=1$ as an example, meaning there will be only one high-pass and low-pass frequency domain in the framelet model. Specifically, the R.H.S of (42) can be further rewritten as:

	$\displaystyle\frac{1}{2}\sum_{(r,\ell)\in\mathcal{I}}\left\langle\mathrm{vec}(\mathbf{F}^{(k+1)}),\Big{(}\widehat{\mathbf{W}}\otimes\left(\mathcal{W}_{r,\ell}^{\top}\mathcal{W}_{r,\ell}-\mathcal{W}_{r,\ell}^{\top}\mathrm{diag}(\theta)_{r,\ell}\mathcal{W}_{r,\ell}\right)\Big{)}\mathrm{vec}(\mathbf{F}^{(k+1)})\right\rangle$
	$\displaystyle\approx\frac{1}{2}\left\langle\mathrm{vec}(\mathbf{F}^{(k+1)}),\left(\widehat{\mathbf{W}}\otimes\left(\mathbf{I}_{N}-\mathcal{W}_{1,1}^{\top}\mathrm{diag}(\theta)_{1,1}\mathcal{W}_{1,1}\right)\right)\mathrm{vec}(\mathbf{F}^{(k+1)})\right\rangle.$		(43)

The inclusion of $\mathcal{W}_{1,1}^{\top}\mathrm{diag}(\theta)_{1,1}\mathcal{W}_{1,1}$ is based on the form of $\mathrm{Haar}$ type framelet with one scale. In addition, the approximation in (43) is due to the outcome of HFD ¹⁰¹⁰10The result in (43) provides identical conclusion on the claim in [10] such that in order to have a HFD dynamic, $\widehat{\mathbf{W}}$ must have negative eigenvalue(s).. Now we combine the framelet energy in the above equation ((43)) with the energy induced from p-Laplacian based implicit propagation ((39)). Denote the total energy induced from framelet and implicit layer as:

	$\displaystyle\mathbf{E}^{(total)}(\mathbf{F}^{(k+1)})=\left\langle\mathrm{vec}(\mathbf{F}^{(k+1)}),\mathrm{vec}(\mathbf{F}^{(k+1)})\right\rangle$		(44)
	$\displaystyle+\frac{1}{2}\left\langle\mathrm{vec}(\mathbf{F}^{(k+1)}),\left(\left(\widehat{\mathbf{W}}\otimes\left(\mathbf{I}_{N}-\mathcal{W}_{1,1}^{\top}\mathrm{diag}(\theta)_{1,1}\mathcal{W}_{1,1}\right)\right)\mathrm{vec}(\mathbf{F}^{(k+1)})+\mathbf{I}_{c}\otimes\left(4\mu\boldsymbol{\alpha}^{(k+1)}+\mathbf{I}_{N}\right)\mathrm{vec}(\mathbf{F}^{(0)})\right)\right\rangle.$

It is not hard to check that $\mathbf{E}^{(total)}(\mathbf{F}^{(k+1)})$ is larger than $\mathbf{E}^{Fr}(\mathbf{F}^{(k+1)})$ (the framelet energy under HFD). Hence we have verified that the implicit layer further amplifies the Dirichlet energy. Moreover, one can approximate this stronger dynamic by re-parameterizing $\mathbf{E}^{(total)}(\mathbf{F}^{(k+1)})$ via assigning a higher quantity of $\theta^{\prime}>\theta>0$ and excluding the residual term. Hence, the inclusion of the implicit layer induces a higher HFD dynamic to framelet, and that completes the proof. ∎

Corollary 1 (Escape from Over-smoothing).

With the same conditions in Proposition 3, when framelet is LFD, the implicit layer (with sufficient large $\mu$ or small $p$ ) ensures the Dirichlet energy of node features does not converge to 0, thus preventing the model from the over-smoothing issue.

Proof.

The proof can be done by combining Proposition 3 and Proposition 2, with the former illustrates that when model is HFD, there will be no over-smoothing problem, and the latter shows that even when the model is LFD, the Dirichlet energy of the node features will not converge to 0. Accordingly, pL-UFG is capable of escaping from over-smoothing issue. ∎

Remark 7 (Stronger LFD).

Based on the condition described in Proposition 3, when framelet is LFD, with sufficient small of $\mu$ or larger of $p$ , it is not hard to verify that according to (38), the p-Laplacian implicit propagation further shrink the Dirichlet energy of the node feature produced from framelet, and thus achieving a stronger LFD dynamic.

Remark 8.

In Proposition 2 we showed that the Dirichlet energy of the node features produced from the implicit layer will not coverage to zero, indicating the robustness of the implicit layer in regarding to the over-smoothing issue. Additionally, we further verified in Lemma 3 that when graph framelet is with a monotonic dynamic (e.g., L/HFD), the inclusion of the implicit layer can even amplify the dynamic of framelet by a proper setting of $\mu$ and $p$ . Our conclusion explicitly suggests the effectiveness on incorporating p-Laplacian based implicit propagation to multiscale GNNs which is with flexible control of model dynamics.

3.4 Proposed Model with Controlled Dynamics

Based on the aforementioned conclusions regarding energy behavior and the interaction between the implicit layer and framelet’s energy dynamics, it becomes evident that irrespective of the homophily index of any given graph input, one can readily apply the condition of $\boldsymbol{\theta}$ (s) in Proposition 3 to facilitate the adaptation of the pL-UFG model to the input graph by simply adjusting the quantities of $\mu$ and $p$ . This adjustment significantly reduces the training cost of the graph framelet. For instance, consider the case of employing a $\mathrm{Haar}$ type frame with $\ell=1$ , where we have only one low-pass and one high-pass domain. In this scenario, the trainable matrices for this model are $\boldsymbol{\theta}_{0,1}$ , $\boldsymbol{\theta}_{1,1}$ , and $\widehat{\mathbf{W}}$ . Based on our conclusions, we can manually set both $\boldsymbol{\theta}_{0,1}$ and $\boldsymbol{\theta}_{1,1}$ to our requested quantities, thereby inducing either LFD or HFD. Consequently, the only remaining training cost is associated with $\widehat{\mathbf{W}}$ , leading to large reduction on the overall training cost while preserving model’s capability of handling both types of graphs. Accordingly, we proposed two additional pL-UFG variants with controlled model dynamics, namely pL-UFG-LFD and pL-UFG-HFD. More explicitly, the propagation of graph framelet with controlled dynamic takes the form as:

\displaystyle\mathbf{F}^{(k+1)}=

\displaystyle\sigma\left(\mathcal{W}_{0,1}^{\top}{\rm diag}(\mathbf{1}_{N})\mathcal{W}_{0,1}\mathbf{F}^{(k)}\widehat{\mathbf{W}}+\mathcal{W}_{1,1}^{\top}{\rm diag}(\theta\mathbf{1}_{N})\mathcal{W}_{1,1}\mathbf{F}^{(k)}\widehat{\mathbf{W}}\right),

after which the output node features will be propagated through the iterative layers in defined in (19) for the implicit layer (18) for certain layers, and the resulting node feature will be forwarded to the next graph framelet convolution and implicit layer propagation. We note that to properly represent the Dirichlet energy of node features, we borrow the concept of electronic orbital energy levels in Figure. 1. The shaded outermost electrons correspond to higher energy levels, which can be analogously interpreted as higher variations in node features. Conversely, the closer the electrons are to the nucleus, the lower their energy levels, indicating lower variations in node features.

Refer to caption — Figure 1: Illustration of the working flow of pL-UFG-LFD and pL-UFG-HFD under the $\mathrm{Haar}$ type frame with $\ell=1$ . The input graph features are first decomposed onto two frequency domains and further filtered by the diagonal matrix $\boldsymbol{\theta}_{0,1}$ and $\boldsymbol{\theta}_{1,1}$ . With controlled model dynamics from Proposition 3 i.e., $\boldsymbol{\theta}_{0,1}=\mathbf{1}_{N}$ and $\boldsymbol{\theta}_{1,1}=\theta\boldsymbol{\theta}_{0,1}$ , framelet can induce both LFD and HFD dynamics resulting as different level of Dirichlet energy of the produced node features. It is straightforward to check that when framelet is LFD, the level of node Dirichlet energy is less than its HFD counterpart. The generated node features from graph framelet is then inputted into p-Laplacian (with graph gradient as one component) based implicit layer. Based on our conclusions in Lemma 3 and Remark 7 with small/large quantity of $p$ and large/small quantity of $\mu$ , the model’s (framelet) dynamics are further strengthened resulting even smaller/higher energy levels.

3.5 Equivalence to Non-Linear Diffusion

Diffusion on graph has gained its popularity recently [5, 28] by providing a framework (i.e., PDE) to understand the GNNs architecture and as a principled way to develop a broad class of new methods. To the best of our knowledge, although the GNNs based on linear diffusion on graph [5, 4, 28] have been intensively explored, models built from non-linear graph diffusion have not attracted much attention in general. In this section, we aim to verify that the iteration (19) admits a scaled nonlinear diffusion with a source term. To see this, recall that p-Laplacian operator defined in (14) has the form:

\displaystyle\Delta_{p}\mathbf{F}:=-\frac{1}{2}\text{div}(\|\nabla\mathbf{F}\|^{p-2}\nabla\mathbf{F}),\quad\text{for}\,\,\,p\geq 1.

(45)

Plugging in the definition of graph gradient and divergence defined in (11) and (13) into the above equation, one can compactly write out the form of p-Laplacian as:

\displaystyle(\Delta_{p}\mathbf{F})(i)=\sum_{v_{j}\sim v_{i}}\sqrt{\frac{w_{i,j}}{d_{i,i}}}\left\|\nabla_{W}\mathbf{F}([i,j])\right\|^{p-2}\left(\sqrt{\frac{w_{i,j}}{d_{i,i}}}\mathbf{f}_{i}-\sqrt{\frac{w_{i,j}}{d_{j,j}}}\mathbf{f}_{j}\right).

(46)

Furthermore, if we treat the iteration equation (19) as a diffusion process, its forward Euler scheme has the form:

	$\displaystyle\frac{\mathbf{F}^{(k+1)}-\mathbf{F}^{(k)}}{\tau}$	$\displaystyle=\boldsymbol{\alpha}^{(k)}\mathbf{D}^{-1/2}\mathbf{M}^{(k)}\mathbf{D}^{-1/2}\mathbf{F}^{(k)}-\mathbf{F}^{(k)}+\boldsymbol{\beta}^{(k)}\mathbf{Y},$
		$\displaystyle=\left(\boldsymbol{\alpha}^{(k)}\mathbf{D}^{-1/2}\mathbf{M}^{(k)}\mathbf{D}^{-1/2}-\mathbf{I}\right)\mathbf{F}^{(k)}+\boldsymbol{\beta}^{(k)}\mathbf{Y}.$		(47)

We set $\tau=1$ for the rest of analysis for the convenience reasons. With all these setups, we summarize our results in the following:

Lemma 4 (Non-Linear Diffusion).

Assuming $\mathcal{G}$ is connected, the forward Euler scheme presented in (47) admits a generalized non-linear diffusion on the graph. Specifically, we have:

\displaystyle\left(\boldsymbol{\alpha}^{(k)}\mathbf{D}^{-1/2}\mathbf{M}^{(k)}\mathbf{D}^{-1/2}-\mathbf{I}\right)\mathbf{F}^{(k)}+\boldsymbol{\beta}^{(k)}\mathbf{Y}=\boldsymbol{\alpha}\left(\mathrm{div}(\|\nabla\mathbf{F}^{(k)}\|^{p-2}\nabla\mathbf{F}^{(k)})\right)+{2\mu}{\boldsymbol{\alpha}}^{(k)}\mathbf{D}\mathbf{F}^{(k)}+2\mu\boldsymbol{\alpha}^{(k)}\mathbf{F}^{(0)}.

(48)

Proof.

The proof can be done by verification. We can explicitly write out the computation on the $i$ -th row of the left hand side of (48) as:

First let us denote the rows of $\mathbf{F}^{(k)}$ as $\mathbf{f}^{(k)}(i)$ ’s.

	$\displaystyle\sum_{v_{j}\sim v_{i}}\left({\alpha}^{(k)}_{i,i}{d}^{-1/2}_{ii}M^{(k)}_{i,j}{d}^{-1/2}_{jj}\right)\mathbf{f}^{(k)}(j)-\mathbf{f}^{(k)}(i)+{\beta}^{(k)}_{i,i}Y(i)$
$\displaystyle=$	$\displaystyle{\alpha}^{(k)}_{i,i}\left(\sum_{v_{j}\sim v_{i}}\left(\frac{M_{ij}}{\sqrt{d_{ii}}\sqrt{d_{jj}}}\mathbf{f}^{(k)}(j)\right)-\frac{1}{{\alpha}^{(k)}_{i,i}}\mathbf{f}^{(k)}(i)\right)+2\mu{\alpha}^{(k)}_{i,i}\mathbf{f}^{(0)}(i)$
$\displaystyle=$	$\displaystyle{\alpha}^{(k)}_{i,i}\left(\sum_{v_{j}\sim v_{i}}\left(\frac{M_{ij}}{\sqrt{d_{ii}}\sqrt{d_{jj}}}\mathbf{f}^{(k)}(j)\right)-\sum_{v_{j}\sim v_{i}}\left(\frac{M_{ij}}{d_{ii}}+2\mu\right)\mathbf{f}^{(k)}(i)\right)+2\mu{\alpha}^{(k)}_{i,i}\mathbf{f}^{(0)}(i)$
$\displaystyle=$	$\displaystyle{\alpha}^{(k)}_{i,i}\left(\sum_{v_{j}\sim v_{i}}\sqrt{\frac{w_{i,j}}{d_{i,i}}}\left\\|\nabla_{W}\mathbf{F}([i,j])\right\\|^{p-2}\left(\sqrt{\frac{w_{i,j}}{d_{j,j}}}\mathbf{f}^{(k)}_{j}-\sqrt{\frac{w_{i,j}}{d_{i,i}}}\mathbf{f}^{(k)}_{i}\right)+2\mu\sum_{v_{j}\sim v_{i}}\mathbf{f}^{(k)}_{i}\right)$
	$\displaystyle+2\mu{\alpha}^{(k)}_{i,i}\mathbf{f}^{(0)}(i)$
$\displaystyle=$	$\displaystyle\alpha_{i,i}^{(k)}\left((\Delta_{p}\mathbf{F})(i)\right)+{2\mu}{\alpha_{i,i}^{(k)}}d_{ii}\mathbf{f}^{(k)}_{i}+2\mu{\alpha}^{(k)}_{i,i}\mathbf{f}^{(0)}_{i}$	(49)

When $i$ takes from 1 to $N$ , it gives (48) according to (45) and (46).Thus we complete the proof. ∎

Based on the conclusion of Lemma 4, it is clear that the propagation via p-Laplacian implicit layer admits a scaled non-linear diffusion with two source terms. We note that the form of our non-linear diffusion coincidences to the one developed in [7]. However, in [7] the linear operator is assigned via the calculation of graph Laplacian whereas in our model, the transformation acts over the whole p-Laplacian. Finally, it is worth noting that the conclusion in Lemma 4 can be transferred to the implicit schemes¹¹¹¹11With a duplication of terminology, here the term “implicit” refers to the implicit scheme (i.e., backward propagation) in the training of the diffusion model.. We omit it here.

Remark 9.

With sufficiently large $\mu$ or small $p$ , one can check that the strength of the diffusion, i.e. $\mathrm{div}(\|\nabla\mathbf{F}^{(k)}\|^{p-2}\nabla\mathbf{F}^{(k)})$ , is diluted. Once two source terms ${2\mu}{\boldsymbol{\alpha}}^{(k)}\mathbf{D}\mathbf{F}^{(k)}+2\mu\alpha^{(k)}\mathbf{F}^{(0)}$ dominant the whole process, the generated node features approach to $\mathbf{DF^{(k)}}+\mathbf{F}^{(0)}$ , which suggests a framelet together with two source terms. The first term can be treated as the degree normalization of the node features from the last layer and the second term simply maintains the initial feature embedding. Therefore, the energy of the remaining node features in this case is just with the form presented in (39), suggesting a preservation of node feature variations. Furthermore, this observation suggests our conclusion on the energy behavior of pL-UFG (Proposition 2); the interaction within pL-UFG described in Lemma 3 and Corollary 1 and lastly, the conclusion from Lemma 4 can be unified and eventually forms a well defined framework in assessing and understanding the property of pL-UFG.

4 Experiment

Experiment outlines

In this section, we present comprehensive experimental results on the claims that we made from the theoretical aspects of our model. All experiments were conducted in PyTorch on NVIDIA Tesla V100 GPU with 5,120 CUDA cores and 16GB HBM2 mounted on an HPC cluster. In addition, for the sake of convenience, we listed the summary of each experimental section as follows:

•

In Section 4.1, we show how a sufficient large/small $\mu$ can affect model’s performance on heterophilic/homophilic graphs, and the results are almost invariant to the choice of $p$ .
•

In Section 4.2 we show some tests regarding to the results (i.e., Remark 7 and Lemma 3) of model’s dynamics. Specifically, we verified the conclusions of stronger LFD and HFD in Section 3.3 with controlled model dynamics (quantity of $\theta$ ) of framelet to illustrate how the p-Laplacian based implicit layer interact with framelet model.
•

In Section 4.3 we test the performances of pL-UFG-LFD and pL-UFG-HFD via real-world graph benchmarks versus various baseline models. Furthermore, as these two controllable pL-UFG models largely reduced the computational cost (as we claimed in Section 3.4), we show pL-UFG-LFD and pL-UFG-HFD can even handle the large-scale graph datasets and achieve remarkable learning accuracies.

Hyper-parameter tuning We applied exactly same hyper-parameter tunning strategy as [25] to make a fair comparsion. In terms of the settings for graph framelets, the framelet type is fixed as $\mathrm{Haar}$ ([38]) and the level $J$ is set to 1. The dilation scale ${s}\in\{1,1.5,2,3,6\}$ , and for $n$ , the degree of Chebyshev polynomial approximation is drawn from {2, 3, 7}. It is worth noting that in graph framelets, the Chebyshev polynomial is utilized for approximating the spectral filtering of the Laplacian eigenvalues. Thus, a $d$ -degree polynomial approximates $d$ -hop neighbouring information of each node of the graph. Therefore, when the input graph is heterophilic, one may prefer to have a relatively larger $d$ as node labels tend to be different between directly connected (1-hop) nodes.

4.1 Synthetic Experiment on Variation of $\mu$

Setup

In this section, we show how a sufficiently large/small of $\mu$ can affect model’s performance on hetero/homophilic graphs. In order to make a fair comparison, all the parameters of pL-UFG followed the settings included in [25]. For this test, we selected two datasets: Cora (heterophilic index: 0.825, 2708 nodes and 5278 edges) and Wisconsin (heterophilic index: 0.15, 499 nodes and 1703 edges) from https://www.pyg.org/. We assigned the quantity of $p=\{1,1.5,2,2.5\}$ combined with a set of $\mu=\{0.1,0.5,1,5,10,20,30,50,70\}$ . The number of epochs was set to 200 and the test accuracy (in %) is obtained as the average test accuracy of 10 runs.

Results and Discussion

The experimental results are presented in Figure 2. When examining the results obtained through the homophily graph (Figure 2(a)), it is apparent that all variants of pL-UFGs achieved the best performance when $\mu=0.1$ , which is the minimum value of $\mu$ . As the value of $\mu$ increased, the learning accuracy decreased. This suggests that a larger sharpening effect was induced by the model, as stated in Remark 7 and Proposition 2, causing pL-UFGs to incorporate higher amounts of Dirichlet energy into their generated node features. Consequently, pL-UFGs are better suited for adapting to heterophily graphs. This observation is further supported by the results in Figure 2(b), where all pL-UFG variants achieved their optimal performance with a sufficiently large $\mu$ when the input graph is heterophilic.

Additional interesting observation on the above result is despite the fact that all model variants demonstrated superior learning outcomes on both homophilic and heterophilic graphs when assigned sufficiently large or small values of $\mu$ , it can be observed that when the quantity of $p$ is small, pL-UFG requires a smaller value of $\mu$ to fit the heterophily graph (blue line in Fig. 2(b)). On the other hand, when the models have relatively large value of $p$ (i.e., $p=2.5$ ), it is obvious that these models yielded the most robust results when there is an increase of $\mu$ (red line in Fig. 2(a)). These phenomena further support the notion that $p$ and $\mu$ exhibit opposite effects on the model’s energy behavior as well as its adaptation to homophily and heterophily graphs.

4.2 Synthetic Experiment on Testing of Model’s Dynamics

Now, we take one step ahead. Based on Lemma 3 and Remark 7, with the settings of $\theta$ provided in Proposition 3, the inclusion of $p$ -Laplacian based implicit layer can further enhance framelet’s LFD and HFD dynamics. This suggests that one can control the entries of $\theta$ based on the conditions provided in Proposition 3 and by only changing the quantity of $\mu$ and $p$ to test model’s adaption power on both homophily and heterophily graphs. Therefore, in this section, we show how a (dynamic) controlled framelet model can be further enhanced by the assistant from the $p$ -Laplacian regularizer. Similarly, we applied the same setting to the experiments in [25].

Setup and Results

To verify the claims on in Lemma 3 and Remark 7, we deployed the same settings mentioned in Proposition 3. Specifically, we utilized Haar frame with $\ell=1$ and set $\boldsymbol{\theta}_{0,1}=\mathbf{I}_{N}$ , $\boldsymbol{\theta}_{0,1}=\theta\mathbf{I}_{N}$ . For heterophilic graphs (Wisconsin), $\theta=2$ , and for the homophilic graph (Cora), $\theta=0.2$ . The result of the experiment is presented in Figure 3. Similar to the results observed from Section 4.1, it is shown that when the relatively large quantity of $\mu$ is assigned, model’s capability of adapting to homophily/heterophily graph decreased/increased. This directly verifies that the p-Laplacian based implicit layer interacts and further enhances the (controlled) dynamic of the framelet by the value of $p$ and $\mu$ , in terms of adaptation.

4.3 Real-world Node Classification and Scalability

Previous synthetic numerical results show predictable performance of both pL-UFG-LFD and pL-UFG-HFD. In this section, we present the learning accuracy of our proposed models via real-world homophily and heterophily graphs. Similarly, we deployed the same experimental setting from [25]. In addition, to verify the claim in Remark 3.4, we tested our proposed model via large-scale graph dataset (ogbn-arxiv) to show the proposed model’s scalability which is rarely explored. We include the summary statistic of the datasets in Table 2. All datasets are split according to [17].

For the settings of $\mu$ , $p$ and $\theta$ within pL-UFG-LFD and pL-UFG-HFD, we assigned $\mu=\{0.1,0.5,1,2.0\}$ , $p=\{1,1.5,2,2.5\}$ and $\theta=\{0.2,0.5,0.8\}$ for pL-UFG-LFD in order to fit the homophily graphs, and for pL-UFG-HFD, we assigned $\mu=\{10,20,30\}$ , $p=\{1,1.5,2,2.0,2.5\}$ and $\theta=\{5,7.5,10\}$ for heterophily graphs. The learning accuracy are presented in Table 3 and 4. Furthermore, rather than only reporting the average accuracy and related standard deviation, to further verify the significance of the improvement, we also computed the 95% confidence interval under $t$ -distribution for the highest learning accuracy of the baselines and mark $*$ to our model’s learning accuracy if it is outside the confidence interval.

We include a brief introduction on the baseline models used in this experiment:

•

MLP: Standard feedward multiple layer perceptron.
•

GCN [21]: GCN is the first of its kind to implement linear approximation to spectral graph convolutions.
•

SGC [34]: SGC reduces GCNs’ complexity by removing nonlinearities and collapsing weight matrices between consecutive layers. Thus serves as a more powerful and efficient GNN baseline.
•

GAT [30]: GAT generates attention coefficient matrix that element-wisely multiplied on the graph adjacency matrix according to the node feature based attention mechanism via each layer to propagate node features via the relative importance between them.
•

JKNet [37]: JKNet offers the capability to adaptively exploit diverse neighbourhood ranges, facilitating enhanced structure-aware representation for individual nodes.
•

APPNP [16]: APPNP leverages personalized PageRank to disentangle the neural network from the propagation scheme, thereby merging GNN functionality.
•

GPRGNN [8]: The GPRGNN architecture dynamically learns General Pagerank (GPR) weights to optimize the extraction of node features and topological information from a graph, irrespective of the level of homophily present.
•

p-GNN [14]:p-GNN is a $p$ -Laplacian based graph neural network model that incorporates a message-passing mechanism derived from a discrete regularization framework. To make a fair comparison, we test p-GNN model with different quantity of $p$ .
•

UFG [41]: UFG, a class of GNNs built upon framelet transforms utilizes framelet decomposition to effectively merge graph features into low-pass and high-pass spectra.
•

pL-UFG [25]: pL-UFG employs a p-Laplacian based implicit layer to enhance the adaptability of multi-scale graph convolution networks (i.e.,UFG) to filter-based domains, effectively improving the model’s adaptation to both homophily and heterophily graphs. Furthermore, as two types of pL-UFG models are proposed in [25], we test both two pL-UFG variants as our baseline models. For more details including the precise formulation of the model, please check [25].

Table 2: Statistics of the datasets,

\mathcal{H}(\mathcal{G})

represent the level of homophily of overall benchmark datasets.

Datasets	Class	Feature	Node	Edge	$\mathcal{H}(\mathcal{G})$
Cora	7	1433	2708	5278	0.825
CiteSeer	6	3703	3327	4552	0.717
PubMed	3	500	19717	44324	0.792
Computers	10	767	13381	245778	0.802
Photo	8	745	7487	119043	0.849
CS	15	6805	18333	81894	0.832
Physics	5	8415	34493	247962	0.915
Arxiv	23	128	169343	1166243	0.681
Chameleon	5	2325	2277	31371	0.247
Squirrel	5	2089	5201	198353	0.216
Actor	5	932	7600	26659	0.221
Wisconsin	5	251	499	1703	0.150
Texas	5	1703	183	279	0.097
Cornell	5	1703	183	277	0.386

Table 3: Test accuracy (%) on homophilic graphs, the top two learning accuracies are highlighted in red and blue. The term OOM means out of memory.

Method	Cora	CiteSeer	PubMed	Computers	Photos	CS	Physics	Arxiv
MLP	66.04 $\pm$ 1.11	68.99 $\pm$ 0.48	82.03 $\pm$ 0.24	71.89 $\pm$ 5.36	86.11 $\pm$ 1.35	93.50 $\pm$ 0.24	94.56 $\pm$ 0.11	55.50 $\pm$ 0.78
GCN	84.72 $\pm$ 0.38	75.04 $\pm$ 1.46	83.19 $\pm$ 0.13	78.82 $\pm$ 1.87	90.00 $\pm$ 1.49	93.00 $\pm$ 0.12	95.55 $\pm$ 0.09	70.07 $\pm$ 0.79
SGC	83.79 $\pm$ 0.37	73.52 $\pm$ 0.89	75.92 $\pm$ 0.26	77.56 $\pm$ 0.88	86.44 $\pm$ 0.35	92.18 $\pm$ 0.22	94.99 $\pm$ 0.13	71.01 $\pm$ 0.30
GAT	84.37 $\pm$ 1.13	74.80 $\pm$ 1.00	83.92 $\pm$ 0.28	78.68 $\pm$ 2.09	89.63 $\pm$ 1.75	92.57 $\pm$ 0.14	95.13 $\pm$ 0.15	OOM
JKNet	83.69 $\pm$ 0.71	74.49 $\pm$ 0.74	82.59 $\pm$ 0.54	69.32 $\pm$ 3.94	86.12 $\pm$ 1.12	91.11 $\pm$ 0.22	94.45 $\pm$ 0.33	OOM
APPNP	83.69 $\pm$ 0.71	75.84 $\pm$ 0.64	80.42 $\pm$ 0.29	73.73 $\pm$ 2.49	87.03 $\pm$ 0.95	91.52 $\pm$ 0.14	94.71 $\pm$ 0.11	OOM
GPRGNN	83.79 $\pm$ 0.93	75.94 $\pm$ 0.65	82.32 $\pm$ 0.25	74.26 $\pm$ 2.94	88.69 $\pm$ 1.32	91.89 $\pm$ 0.08	94.85 $\pm$ 0.23	OOM
UFG	80.64 $\pm$ 0.74	73.30 $\pm$ 0.19	81.52 $\pm$ 0.80	66.39 $\pm$ 6.09	86.60 $\pm$ 4.69	95.27 $\pm$ 0.04	95.77 $\pm$ 0.04	71.08 $\pm$ 0.49
PGNN^1.0	84.21 $\pm$ 0.91	75.38 $\pm$ 0.82	84.34 $\pm$ 0.33	81.22 $\pm$ 2.62	87.64 $\pm$ 5.05	94.88 $\pm$ 0.12	96.15 $\pm$ 0.12	OOM
PGNN^1.5	84.42 $\pm$ 0.71	75.44 $\pm$ 0.98	84.48 $\pm$ 0.21	82.68 $\pm$ 1.15	91.83 $\pm$ 0.77	94.13 $\pm$ 0.08	96.14 $\pm$ 0.08	OOM
PGNN^2.0	84.74 $\pm$ 0.67	75.62 $\pm$ 1.07	84.25 $\pm$ 0.35	83.40 $\pm$ 0.68	91.71 $\pm$ 0.93	94.28 $\pm$ 0.10	96.03 $\pm$ 0.07	OOM
PGNN^2.5	84.48 $\pm$ 0.77	75.22 $\pm$ 0.73	83.94 $\pm$ 0.47	82.91 $\pm$ 1.34	91.41 $\pm$ 0.66	93.40 $\pm$ 0.07	95.75 $\pm$ 0.05	OOM
pL-UFG1^1.0	84.54 $\pm$ 0.62	75.88 $\pm$ 0.60	85.56 $\pm$ 0.18	82.07 $\pm$ 2.78	85.57 $\pm$ 19.92	95.03 $\pm$ 0.22	96.19 $\pm$ 0.06	70.28 $\pm$ 9.13
pL-UFG1^1.5	84.96 $\pm$ 0.38	76.04 $\pm$ 0.85	85.59 $\pm$ 0.18	85.04 $\pm$ 1.06	92.92 $\pm$ 0.37	95.03 $\pm$ 0.22	96.27 $\pm$ 0.06	71.25 $\pm$ 8.37
pL-UFG1^2.0	85.20 $\pm$ 0.42	76.12 $\pm$ 0.82	85.59 $\pm$ 0.17	85.26 $\pm$ 1.15	92.65 $\pm$ 0.65	94.77 $\pm$ 0.27	96.04 $\pm$ 0.07	OOM
pL-UFG1^2.5	85.30 $\pm$ 0.60	76.11 $\pm$ 0.82	85.54 $\pm$ 0.18	85.18 $\pm$ 0.88	91.49 $\pm$ 1.29	94.86 $\pm$ 0.14	95.96 $\pm$ 0.11	OOM
pL-UFG2^1.0	84.42 $\pm$ 0.32	74.79 $\pm$ 0.62	85.45 $\pm$ 0.18	84.88 $\pm$ 0.84	85.30 $\pm$ 19.50	95.03 $\pm$ 0.19	96.06 $\pm$ 0.11	71.01 $\pm$ 7.28
pL-UFG2^1.5	85.60 $\pm$ 0.36	75.61 $\pm$ 0.60	85.59 $\pm$ 0.18	84.55 $\pm$ 1.57	93.00 $\pm$ 0.61	95.03 $\pm$ 0.19	96.14 $\pm$ 0.09	71.21 $\pm$ 6.19
pL-UFG2^2.0	85.20 $\pm$ 0.42	76.12 $\pm$ 0.82	85.59 $\pm$ 0.17	85.27 $\pm$ 1.15	92.50 $\pm$ 0.40	94.77 $\pm$ 0.27	96.05 $\pm$ 0.07	OOM
pL-UFG-LFD	85.64 $\pm$ 1.36	77.39^∗ $\pm$ 1.59	85.08 $\pm$ 1.33	85.36^∗ $\pm$ 1.39	93.17^∗ $\pm$ 1.30	96.13^∗ $\pm$ 1.08	96.49^∗ $\pm$ 1.04	71.96 $\pm$ 1.25

Table 4: Test accuracy (%) on heterophilic graphs. the top two learning accuracies are highlighted in red and blue.

Method	Chameleon	Squirrel	Actor	Wisconsin	Texas	Cornell
MLP	48.82 $\pm$ 1.43	34.30 $\pm$ 1.13	41.66 $\pm$ 0.83	93.45 $\pm$ 2.09	71.25 $\pm$ 12.99	83.33 $\pm$ 4.55
GCN	33.71 $\pm$ 2.27	26.19 $\pm$ 1.34	33.46 $\pm$ 1.42	67.90 $\pm$ 8.16	53.44 $\pm$ 11.23	55.68 $\pm$ 10.57
SGC	33.83 $\pm$ 1.69	26.89 $\pm$ 0.98	32.08 $\pm$ 2.22	59.56 $\pm$ 11.19	64.38 $\pm$ 7.53	43.18 $\pm$ 16.41
GAT	41.95 $\pm$ 2.65	25.66 $\pm$ 1.72	33.64 $\pm$ 3.45	60.65 $\pm$ 11.08	50.63 $\pm$ 28.36	34.09 $\pm$ 29.15
JKNet	33.50 $\pm$ 3.46	26.95 $\pm$ 1.29	31.14 $\pm$ 3.63	60.42 $\pm$ 8.70	63.75 $\pm$ 5.38	45.45 $\pm$ 9.99
APPNP	34.61 $\pm$ 3.15	32.61 $\pm$ 0.93	39.11 $\pm$ 1.11	82.41 $\pm$ 2.17	80.00 $\pm$ 5.38	60.98 $\pm$ 13.44
GPRGNN	34.23 $\pm$ 4.09	34.01 $\pm$ 0.82	34.63 $\pm$ 0.58	86.11 $\pm$ 1.31	84.38 $\pm$ 11.20	66.29 $\pm$ 11.20
UFG	50.11 $\pm$ 1.67	31.48 $\pm$ 2.05	40.13 $\pm$ 1.11	93.52 $\pm$ 2.36	84.69 $\pm$ 4.87	83.71 $\pm$ 3.28
PGNN^1.0	49.04 $\pm$ 1.16	34.79 $\pm$ 1.01	40.91 $\pm$ 1.41	94.35 $\pm$ 2.16	82.00 $\pm$ 11.31	82.73 $\pm$ 6.92
PGNN^1.5	49.12 $\pm$ 1.14	34.86 $\pm$ 1.25	40.87 $\pm$ 1.47	94.72 $\pm$ 1.91	81.50 $\pm$ 10.70	81.97 $\pm$ 10.16
PGNN^2.0	49.34 $\pm$ 1.15	34.97 $\pm$ 1.41	40.83 $\pm$ 1.81	94.44 $\pm$ 1.75	84.38 $\pm$ 11.52	81.06 $\pm$ 10.18
PGNN^2.5	49.16 $\pm$ 1.40	34.94 $\pm$ 1.57	40.78 $\pm$ 1.51	94.35 $\pm$ 2.16	83.38 $\pm$ 12.95	81.82 $\pm$ 8.86
pL-UFG1^1.0	56.81 $\pm$ 1.69	38.81 $\pm$ 1.97	41.26 $\pm$ 1.66	96.48 $\pm$ 0.94	86.13 $\pm$ 7.47	86.06 $\pm$ 3.16
pL-UFG1^1.5	56.89 $\pm$ 1.17	39.73 $\pm$ 1.22	40.95 $\pm$ 0.93	96.48 $\pm$ 1.07	87.00 $\pm$ 5.16	86.52 $\pm$ 2.29
pL-UFG1^2.0	56.24 $\pm$ 1.02	39.72 $\pm$ 1.86	40.95 $\pm$ 0.93	96.59 $\pm$ 0.72	86.50 $\pm$ 8.84	85.30 $\pm$ 2.35
pL-UFG1^2.5	56.11 $\pm$ 1.25	39.38 $\pm$ 1.78	41.04 $\pm$ 0.99	95.34 $\pm$ 1.64	89.00 $\pm$ 4.99	83.94 $\pm$ 3.53
pL-UFG2^1.0	55.51 $\pm$ 1.53	36.94 $\pm$ 5.69	29.28 $\pm$ 19.25	93.98 $\pm$ 2.94	85.00 $\pm$ 5.27	87.73 $\pm$ 2.49
pL-UFG2^1.5	57.22 $\pm$ 1.19	39.80 $\pm$ 1.42	40.89 $\pm$ 0.75	96.48 $\pm$ 0.94	87.63 $\pm$ 5.32	86.82 $\pm$ 1.67
pL-UFG2^2.0	56.19 $\pm$ 0.99	39.74 $\pm$ 1.66	41.01 $\pm$ 0.80	96.14 $\pm$ 1.16	86.50 $\pm$ 8.84	85.30 $\pm$ 2.35
pL-UFG2^2.5	55.69 $\pm$ 1.15	39.30 $\pm$ 1.68	40.86 $\pm$ 0.74	95.80 $\pm$ 1.44	86.38 $\pm$ 2.98	84.55 $\pm$ 3.31
pL-fUFG^1.0	55.80 $\pm$ 1.93	38.43 $\pm$ 1.26	32.84 $\pm$ 16.54	93.98 $\pm$ 3.47	86.25 $\pm$ 6.89	87.27 $\pm$ 2.27
pL-fUFG^1.5	55.65 $\pm$ 1.96	38.40 $\pm$ 1.52	41.00 $\pm$ 0.99	96.48 $\pm$ 1.29	87.25 $\pm$ 3.61	86.21 $\pm$ 2.19
pL-fUFG^2.0	55.95 $\pm$ 1.29	38.33 $\pm$ 1.71	41.25 $\pm$ 0.84	96.25 $\pm$ 1.25	88.75 $\pm$ 4.97	83.94 $\pm$ 3.78
pL-fUFG^2.5	55.56 $\pm$ 1.66	38.39 $\pm$ 1.48	40.55 $\pm$ 0.50	95.28 $\pm$ 2.24	88.50 $\pm$ 7.37	83.64 $\pm$ 3.88
pL-UFG-HFD	58.60^∗ $\pm$ 1.74	39.63 $\pm$ 2.01	44.63^∗ $\pm$ 2.75	96.64 $\pm$ 1.77	89.31 $\pm$ 8.40	88.97^∗ $\pm$ 3.36

Discussion on the Results, Scalability and Computational Complexity

From both Table 3 and 4, it is clear that our proposed model (pL-UFG-LFD and pL-UFG-HFD) produce state-of-the-art learning accuracy compared to various baseline models. For the datasets (i.e.,Pubmed and Squirrel) on which pL-UFG-LFD and pL-UFG-HFD are not the best, one can observe that pL-UFG-LFD and pL-UFG-HFD still have nearly identical learning outcomes compared to the best pL-UFG results. This suggests even within the pL-UFG with controlled framelet dynamics, by adjusting the values of $\mu$ and $p$ , our proposed models are still able to generate state-of-the-art learning results with the computational complexity largely reduced compared to the pL-UFG and UFG. This observation directly verifies Lemma 3 and Remark 7. In addition, due to the reduction of computational cost, our dynamic controlled models (pL-UFG-LFD and pL-UFG-HFD) show a strong capability of handling the large-scale graph dataset, which is a challenging issue (scalability) for some GNNs especially multi-scale graph convolutions such as framelets [41] without additional data pre-processing steps. Accordingly, one can check that pL-UFG-LFD outperforms all included baselines on Arxiv datasets. Lastly, one can also find that the most of the improvements between the learning accuracy produced from our model and the baselines are significant.

4.4 Limitation of the Proposed Models and Future Studies

First, we note that our analysis on the convergence, energy dynamic and equivalence between our proposed model can be applied or partially applied to most of existing GNNs. Based on we have claimed in regarding to the theoretical perspective of pL-UFG, although we assessed model property via different perspective, eventually all theoretical conclusions come to the same conclusion (i.e., the asymptotic behavior of pL-UFG). Therefore, it would be beneficial to deploy our analyzing framework to other famous GNNs. Since the main propose of this paper is to re-assess the property of pL-UFG, we leave this to the future work.

In addition, to induce LFD/HFD to pL-UFG, we set the value of $\theta$ as constant according to Proposition 3, however, due to large variety of real-world graphs, it is challenging to determine the most suitable $\theta$ when we fix it as a constant. This suggests the exploration on controlling model’s dynamic via selecting $\theta$ is still rough. Moreover, based on Definition 1, the homophily index of a graph is summary statistic over all nodes. However, even in the highly homophilic graph, there are still some nodes with their neighbours with different labels. This suggests the index is only capable of presenting the global rather than local labelling information of the graph. Accordingly, assigning a constant $\theta$ to induce LFD/HFD might not be able to equip pL-UFG enough power to capture detailed labelling information of the graph. Therefore, another future research direction is to potentially explore the design of $\theta$ via the local labelling information of the graph. Finally, we note that another consequence of setting $\theta_{0,1}$ and $\theta_{1,1}$ as constant is such setting narrows the model’s parameter space, as one can check the only learnable matrix left via explicit part of pL-UFG ((9)) is $\widehat{\mathbf{W}}$ . Accordingly, the narrowed parameter space might make the solution of the model optimization apart from desired solution as before, causing potential increase of learning variance.

5 Concluding Remarks

In this work, we performed theoretical analysis on pL-UFG. Specifically, we verified that by choosing suitable quantify of the model parameters ( $\mu$ and $p$ ), the implicit propagation induced from p-Laplacian is capable of amplifying or shrinking the Dirichlet energy of the node features produced from the framelet. Consequently, such manipulation of the energy results in a stronger energy dynamic of framelet and therefore enhancing model’s adaption power on both homophilic and heterophilic graphs. We further explicitly showed the proof of the convergence of pL-UFG, which to our best of knowledge, fills the knowledge gap at least in the field of p-Laplacian based multi-scale GNNs. Moreover, we showed the equivalence between pL-UFG and the non-linear graph diffusion, indicating that pL-UFG can be trained via various training schemes. Finally, it should be noted that for the simplicity of the analysis, we have made several assumptions and only focus on the $\mathrm{Haar}$ type frames. It suffices in regards to the scope of this work. However, it would be interesting to consider more complex energy dynamics by reasonably dropping some of the assumptions or from other types of frames, we leave this to future work.

References

[1] Wendong Bi, Lun Du, Qiang Fu, Yanlin Wang, Shi Han, and Dongmei Zhang. Make heterophily graphs better fit gnn: A graph rewiring approach, 2022.
[2] Cristian Bodnar, Francesco Di Giovanni, Benjamin Paul Chamberlain, Pietro Liò, and Michael M Bronstein. Neural sheaf diffusion: A topological perspective on heterophily and oversmoothing in gnns. arXiv:2202.04579, 2022.
[3] Michael M Bronstein, Joan Bruna, Taco Cohen, and Petar Veličković. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. arXiv preprint arXiv:2104.13478, 2021.
[4] Ben Chamberlain, James Rowbottom, Maria I Gorinova, Michael Bronstein, Stefan Webb, and Emanuele Rossi. Grand: Graph neural diffusion. In International Conference on Machine Learning, pages 1407–1418. PMLR, 2021.
[5] Benjamin Chamberlain, James Rowbottom, Davide Eynard, Francesco Di Giovanni, Xiaowen Dong, and Michael Bronstein. Beltrami flow and neural diffusion on graphs. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 1594–1609. Curran Associates, Inc., 2021.
[6] Jialin Chen, Yuelin Wang, Cristian Bodnar, Pietro Liò, and Yu Guang Wang. Dirichlet energy enhancement of graph neural networks by framelet augmentation. github, 2022.
[7] Qi Chen, Yifei Wang, Yisen Wang, Jiansheng Yang, and Zhouchen Lin. Optimization-induced graph implicit nonlinear diffusion. In International Conference on Machine Learning, pages 3648–3661. PMLR, 2022.
[8] Eli Chien, Jianhao Peng, Pan Li, and Olgica Milenkovic. Adaptive universal generalized pagerank graph neural network. In Proceedings of International Conference on Learning Representations, 2021.
[9] Fan RK Chung. Spectral graph theory, volume 92. American Mathematical Soc., 1997.
[10] Francesco Di Giovanni, James Rowbottom, Benjamin P Chamberlain, Thomas Markovich, and Michael M Bronstein. Graph neural networks as gradient flows. arXiv:2206.10991, 2022.
[11] Bin Dong. Sparse representation on graphs by tight wavelet frames and applications. Applied and Computational Harmonic Analysis, 42(3):452–479, 2017.
[12] Pavel Drábek and Stanislav I Pohozaev. Positive solutions for the p-laplacian: application of the fibrering method. Proceedings of the Royal Society of Edinburgh Section A: Mathematics, 127(4):703–726, 1997.
[13] Guoji Fu, Peilin Zhao, and Yatao Bian. $p$ -laplacian based graph neural networks. In International Conference on Machine Learning, pages 6878–6917. PMLR, 2022.
[14] Guoji Fu, Peilin Zhao, and Yatao Bian. $p$ -Laplacian based graph neural networks. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of PMLR, pages 6878–6917, 2022.
[15] JP García Azorero and I Peral Alonso. Existence and nonuniqueness for the p-laplacian. Communications in Partial Differential Equations, 12(12):126–202, 1987.
[16] Johannes Gasteiger, Aleksandar Bojchevski, and Stephan Günnemann. Predict then propagate: Graph neural networks meet personalized pagerank. In Proceedings of International Conference on Learning Representations, 2019.
[17] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. Advances in Neural Information Processing Systems, 30, 2017.
[18] David K Hammond, Pierre Vandergheynst, and Rémi Gribonval. Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis, 30(2):129–150, 2011.
[19] Andi Han, Dai Shi, Zhiqi Shao, and Junbin Gao. Generalized energy and gradient flow via graph framelets. arXiv preprint arXiv:2210.04124, 2022.
[20] Bernd Kawohl and Jiri Horak. On the geometry of the $p$ -laplacian operator. arXiv preprint arXiv:1604.07675, 2016.
[21] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
[22] Qimai Li, Zhichao Han, and Xiao-Ming Wu. Deeper insights into graph convolutional networks for semi-supervised learning. In AAAI Conference on Artificial Intelligence, 2018.
[23] Remigijus Paulavičius and Julius Žilinskas. Analysis of different norms and corresponding lipschitz constants for global optimization. Technological and Economic Development of Economy, 12(4):301–306, 2006.
[24] Hongbin Pei, Bingzhe Wei, Kevin Chen-Chuan Chang, Yu Lei, and Bo Yang. Geom-GCN: Geometric graph convolutional networks. In International Conference on Learning Representations, 2019.
[25] Zhiqi Shao, Andi Han, Dai Shi, Andrey Vasnev, and Junbin Gao. Generalized Laplacian regularized framelet gcns. arXiv:2210.15092, 2022.
[26] Dai Shi, Yi Guo, Zhiqi Shao, and Junbin Gao. How curvature enhance the adaptation power of framelet gcns. arXiv preprint arXiv:2307.09768, 2023.
[27] Wim Sweldens. The lifting scheme: A construction of second generation wavelets. SIAM Journal on Mathematical Analysis, 29(2):511–546, 1998.
[28] Matthew Thorpe, Tan Minh Nguyen, Hedi Xia, Thomas Strohmer, Andrea Bertozzi, Stanley Osher, and Bao Wang. GRAND++: Graph neural diffusion with a source term. In International Conference on Learning Representations, 2022.
[29] César Torres. Boundary value problem with fractional p-laplacian operator. arXiv preprint arXiv:1412.6438, 2014.
[30] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations, 2018.
[31] Yifei Wang, Yisen Wang, Jiansheng Yang, and Zhouchen Lin. Dissecting the diffusion process in linear graph convolutional networks. Advances in Neural Information Processing Systems, 34:5758–5769, 2021.
[32] Quanmin Wei, Jinyan Wang, Jun Hu, Xianxian Li, and Tong Yi. Ogt: optimize graph then training gnns for node classification. Neural Computing and Applications, 34(24):22209–22222, 2022.
[33] Felix Wu, Amauri Souza, Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Weinberger. Simplifying graph convolutional networks. In International Conference on Machine Learning, pages 6861–6871. PMLR, 2019.
[34] Felix Wu, Tianyi Zhang, Amauri Holanda de Souza, Christopher Fifty, Tao Yu, and Kilian Q. Weinberger. Simplifying graph convolutional networks. In Proceedings of International Conference on Machine Learning, 2019.
[35] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S Yu Philip. A comprehensive survey on graph neural networks. IEEE transactions on Neural Networks and Learning Systems, 32(1):4–24, 2020.
[36] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations, 2019.
[37] Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and Stefanie Jegelka. Representation learning on graphs with jumping knowledge networks. In Proceedings of International Conference on Machine Learning, 2018.
[38] Mengxi Yang, Xuebin Zheng, Jie Yin, and Junbin Gao. Quasi-framelets: Another improvement to graph neural networks. arXiv:2201.04728, 2022.
[39] Xin Zheng, Yixin Liu, Shirui Pan, Miao Zhang, Di Jin, and Philip S. Yu. Graph neural networks for graphs with heterophily: A survey, 2022.
[40] Xuebin Zheng, Bingxin Zhou, Junbin Gao, Yuguang Wang, Pietro Lió, Ming Li, and Guido Montufar. How framelets enhance graph neural networks. In International Conference on Machine Learning, pages 12761–12771. PMLR, 2021.
[41] Xuebin Zheng, Bingxin Zhou, Yu Guang Wang, and Xiaosheng Zhuang. Decimated framelet system on graphs and fast g-framelet transforms. Journal of Machine Learning Research, 23:18–1, 2022.
[42] Bingxin Zhou, Ruikun Li, Xuebin Zheng, Yu Guang Wang, and Junbin Gao. Graph denoising with framelet regularizer. IEEE Transactions on Artificial Intelligence, 2021.
[43] Dengyong Zhou and Bernhard Schölkopf. Regularization on discrete spaces. In Joint Pattern Recognition Symposium, pages 361–368. Springer, 2005.
[44] Jiong Zhu, Yujun Yan, Lingxiao Zhao, Mark Heimann, Leman Akoglu, and Danai Koutra. Beyond homophily in graph neural networks: Current limitations and effective designs. In Advances in Neural Information Processing Systems, volume 33, pages 7793–7804, 2020.
[45] Meiqi Zhu, Xiao Wang, Chuan Shi, Houye Ji, and Peng Cui. Interpreting and unifying graph neural networks with an optimization framework. In Proceedings of the Web Conference 2021, pages 1215–1226, 2021.
[46] Chunya Zou, Andi Han, Lequan Lin, and Junbin Gao. A simple yet effective SVD-GCN for directed graphs. arXiv:2205.09335, 2022.

		$\displaystyle\left\\|\partial\mathcal{L}_{p}^{\phi}(\mathbf{F}^{(k)}_{i,:}+\mathbf{v})-\partial\mathcal{L}_{p}^{\phi}(\mathbf{F}^{(k)}_{i,:})\right\\|=\left\\|\frac{1}{\alpha^{\prime(k)}_{i,i}}\left((\mathbf{F}^{(k)}_{i,:}+\mathbf{v})-\mathbf{F}^{\prime(k+1)}_{i,:}\right)-\frac{1}{\alpha^{(k)}_{i,i}}\left(\mathbf{F}^{(k)}_{i,:}-\mathbf{F}^{(k+1)}_{i,:}\right)\right\\|$
	$\displaystyle\leq$	$\displaystyle\frac{1}{\alpha^{\prime(k)}_{i,i}}\\|\mathbf{v}\\|+\left\\|\frac{1}{\alpha^{\prime(k)}_{i,i}}\left(\mathbf{F}^{(k)}_{i,:}-\mathbf{F}^{\prime(k+1)}_{i,:}\right)-\frac{1}{\alpha^{(k)}_{i,i}}\left(\mathbf{F}^{(k)}_{i,:}-\mathbf{F}^{(k+1)}_{i,:}\right)\right\\|$
	$\displaystyle=$	$\displaystyle\frac{1}{\alpha^{\prime(k)}_{i,i}}\\|\mathbf{v}\\|+\left\\|\left(\frac{1}{\alpha^{\prime(k)}_{i,i}}-\frac{1}{\alpha^{(k)}_{i,i}}\right)\mathbf{F}^{(k)}_{i,:}-\frac{1}{\alpha^{\prime(k)}_{i,i}}\mathbf{F}^{\prime(k+1)}_{i,:}+\frac{1}{\alpha^{(k)}_{i,i}}\mathbf{F}^{(k+1)}_{i,:}\right\\|$
	$\displaystyle=$	$\displaystyle\frac{1}{\alpha^{\prime(k)}_{i,i}}\\|\mathbf{v}\\|+\left\\|\sum_{v_{j}\sim v_{i}}\left(\frac{M^{\prime(k)}_{i,j}}{D_{i,i}}-\frac{M^{(k)}_{i,j}}{D_{i,i}}\right)\mathbf{F}^{(k)}_{i,:}-\sum_{v_{j}\sim v_{i}}\left(\frac{M^{\prime(k)}_{i,j}}{\sqrt{D_{i,i}D_{j,j}}}\right)\mathbf{F}^{\prime(k)}_{j,:}+\left(\frac{M^{(k)}_{i,j}}{\sqrt{D_{i,i}D_{j,j}}}\right)\mathbf{F}^{(k)}_{j,:}\right\\|$
	$\displaystyle=$	$\displaystyle\left(\sum_{v_{j}\sim v_{i}}\frac{M^{(k)}_{i,j}}{D_{i,i}}+2\mu\right)\\|\mathbf{v}\\|+\sum_{v_{j}\sim v_{i}}\left(\frac{M^{\prime(k)}_{i,j}-M^{(k)}_{i,j}}{D_{i,i}}\right)\\|\mathbf{v}\\|$
		$\displaystyle+\left\\|\sum_{v_{j}\sim v_{i}}\left(\frac{M^{\prime(k)}_{i,j}-M^{(k)}_{i,j}}{D_{i,i}}\right)\mathbf{F}^{(k)}_{i,:}-\sum_{v_{j}\sim v_{i}}\left(\frac{M^{\prime(k)}_{i,j}-M^{(k)}_{i,j}}{\sqrt{D_{i,i}D_{j,j}}}\right)\mathbf{F}^{(k)}_{j,:}\right\\|.$

	$\displaystyle\prod_{s=0}^{1}\alpha^{(s)}_{i,i}{d}^{-1/2}_{i,i}M^{(s)}_{i,j}{d}^{-1/2}_{j,j}+\left(\alpha^{(1)}_{i,i}{d}^{-1/2}_{i,i}M^{(1)}_{i,j}{d}^{-1/2}_{j,j}2\mu\alpha^{(0)}_{i,i}\right)-2\mu\alpha^{(1)}_{i,i}$
	$\displaystyle=\prod_{s=0}^{1}\left[\left(1/\left(\sum_{v_{j}\sim v_{i}}\frac{M^{(s)}_{i,j}}{d_{i,i}}+2\mu\right)\right)\left(\frac{\left\\|\nabla_{W}\mathbf{F}^{(s)}([i,j])\right\\|^{p-2}}{\sqrt{d_{i,i}d_{j,j}}}\right)\right]+$
	$\displaystyle\left[\left(1/\left(\sum_{v_{j}\sim v_{i}}\frac{M^{(1)}_{i,j}}{d_{i,i}}+2\mu\right)\right)\left(\frac{\left\\|\nabla_{W}\mathbf{F}^{(1)}([i,j])\right\\|^{p-2}}{\sqrt{d_{i,i}d_{j,j}}}\right)\left(2\mu/\left(\sum_{v_{j}\sim v_{i}}\frac{M^{(0)}_{i,j}}{d_{i,i}}+2\mu\right)\right)\right]$
	$\displaystyle-\left(2\mu/\left(\sum_{v_{j}\sim v_{i}}\frac{M^{(1)}_{i,j}}{d_{i,i}}+2\mu\right)\right),$
	$\displaystyle=\left(\frac{\left\\|\nabla_{W}\mathbf{F}^{(0)}([i,j])\right\\|^{p-2}}{\left(\sum_{v_{j}\sim v_{i}}\frac{M^{(0)}_{i,j}}{d_{i,i}}+2\mu\right)\cdot\sqrt{d_{i,i}d_{j,j}}}\right)\left(\frac{\left\\|\nabla_{W}\mathbf{F}^{(1)}([i,j])\right\\|^{p-2}}{\left(\sum_{v_{j}\sim v_{i}}\frac{M^{(1)}_{i,j}}{d_{i,i}}+2\mu\right)\cdot\sqrt{d_{i,i}d_{j,j}}}\right)+$
	$\displaystyle\left(\frac{\left\\|\nabla_{W}\mathbf{F}^{(1)}([i,j])\right\\|^{p-2}}{\left(\sum_{v_{j}\sim v_{i}}\frac{M^{(1)}_{i,j}}{d_{i,i}}+2\mu\right)\cdot\sqrt{d_{i,i}d_{j,j}}}\cdot 2\mu/\left(\sum_{v_{j}\sim v_{i}}\frac{M^{(0)}_{i,j}}{d_{i,i}}+2\mu\right)\right)-\left(2\mu/\left(\sum_{v_{j}\sim v_{i}}\frac{M^{(1)}_{i,j}}{d_{i,i}}+2\mu\right)\right).$		(38)

Revisiting Generalized p-Laplacian Regularized Framelet GCNs: Convergence, Energy Dynamic and Training with Non-Linear Diffusion

Abstract

1 Introduction

2 Preliminaries

Graph, Graph Convolution and Graph Consistency

Definition 1 (Homophily and Heterophily [14]).

Graph Framelet.

Definition 2.

Generalized p-Laplacian Regularized Framelet GCN.

Definition 3 (The pp-Laplace Operator [12]).

Definition 4 (Graph Gradient [43]).

Definition 5 (Graph Divergence [43]).

Definition 6 (Graph pp-Laplacian).

Remark 1 (Dirichlet Energy, Graph Homophily and Heterophily).

Proposition 1.

3 Theoretical Analysis of the pL-UFG

Lemma 1.

Remark 2.

3.1 Convergence Analysis of pL-UFG

Theorem 1 (Weak Convergence of the Proposed Model).

Proof.

3.2 Energy Behavior of the pL-UFG

Definition 7 (The Generalized Dirichlet Energy).

Proposition 2 (Energy Behavior).

Proof.

Remark 3.

Remark 4 (Regarding to the quantity of pp).

3.3 Interaction with Framelet Energy Dynamic

Definition 8 ([10]).

Lemma 2 ([10]).

Remark 5 (LFD, HFD and graph homophily).

Proposition 3 ([19]).

Definition 9 (Stronger/Weaker Dynamic).

Remark 6.

Lemma 3 (Stronger HFD).

Proof.

Corollary 1 (Escape from Over-smoothing).

Proof.

Remark 7 (Stronger LFD).

Remark 8.

3.4 Proposed Model with Controlled Dynamics

3.5 Equivalence to Non-Linear Diffusion

Lemma 4 (Non-Linear Diffusion).

Proof.

Remark 9.

4 Experiment

Experiment outlines

4.1 Synthetic Experiment on Variation of μ\mu

Setup

Results and Discussion

4.2 Synthetic Experiment on Testing of Model’s Dynamics

Setup and Results

4.3 Real-world Node Classification and Scalability

Discussion on the Results, Scalability and Computational Complexity

4.4 Limitation of the Proposed Models and Future Studies

5 Concluding Remarks

References

Definition 3 (The $p$ -Laplace Operator [12]).

Definition 6 (Graph $p$ -Laplacian).

Remark 4 (Regarding to the quantity of $p$ ).

4.1 Synthetic Experiment on Variation of $\mu$