
Training MLPs on Graphs without Supervision

Zehong Wang (University of Notre Dame, Notre Dame, Indiana, USA; zwang43@nd.edu), Zheyuan Zhang (University of Notre Dame, Notre Dame, Indiana, USA; zzhang42@nd.edu), Chuxu Zhang (University of Connecticut, Storrs, Connecticut, USA; chuxu.zhang@uconn.edu), and Yanfang Ye (University of Notre Dame, Notre Dame, Indiana, USA; yye7@nd.edu)
(2025)
Abstract.

Graph Neural Networks (GNNs) have demonstrated their effectiveness in various graph learning tasks, yet their reliance on neighborhood aggregation during inference poses challenges for deployment in latency-sensitive applications, such as real-time financial fraud detection. To address this limitation, recent studies have proposed distilling knowledge from teacher GNNs into student Multi-Layer Perceptrons (MLPs) trained on node content, aiming to accelerate inference. However, these approaches often inadequately explore structural information when inferring unseen nodes. To this end, we introduce SimMLP, a Self-supervised framework for learning MLPs on graphs, designed to fully integrate rich structural information into MLPs. Notably, SimMLP is the first MLP-learning method that can achieve equivalence to GNNs in the optimal case. The key idea is to employ self-supervised learning to align the representations encoded by graph context-aware GNNs and neighborhood dependency-free MLPs, thereby fully integrating the structural information into MLPs. We provide a comprehensive theoretical analysis, demonstrating the equivalence between SimMLP and GNNs based on mutual information and inductive bias, highlighting SimMLP’s advanced structural learning capabilities. Additionally, we conduct extensive experiments on 20 benchmark datasets, covering node classification, link prediction, and graph classification, to showcase SimMLP’s superiority over state-of-the-art baselines, particularly in scenarios involving unseen nodes (e.g., inductive and cold-start node classification) where structural insights are crucial. Our codes are available at: https://github.com/Zehong-Wang/SimMLP.

Graph Neural Networks; Inference Acceleration; Self-Supervision
journalyear: 2025; copyright: acmlicensed; booktitle: Proceedings of the Eighteenth ACM International Conference on Web Search and Data Mining (WSDM ’25), March 10–14, 2025, Hannover, Germany; doi: 10.1145/3701551.3703550; isbn: 979-8-4007-1329-3/25/03; ccs: Information systems → Data mining; Computing methodologies → Unsupervised learning; Computing methodologies → Neural networks

1. Introduction

Figure 1. Accuracy vs. Inference Time on the Arxiv dataset under the cold-start setting.

Given the widespread presence of graph-structured data, such as computing networks, e-commerce recommender systems, citation networks, and social networks, Graph Neural Networks (GNNs) have drawn significant attention in recent years. Typically, GNNs rely on message-passing mechanisms (Gilmer et al., 2017) to iteratively capture neighborhood information and learn representations of graphs. Despite their effectiveness across various graph learning tasks, GNNs face challenges in latency-sensitive applications, such as financial fraud detection (Wang et al., 2021), due to the computational overhead associated with neighborhood fetching (Zhang et al., 2022). To enable faster inference in downstream tasks, existing approaches primarily employ techniques like quantization (Ding et al., 2021), pruning (Zhou et al., 2021), and knowledge distillation (Yan et al., 2020) for accelerating graph inference. However, these methods are still constrained by the need to fetch neighborhoods, limiting their effectiveness in real-time scenarios.

To address this issue, Multi-Layer Perceptrons (MLPs), trained solely on node features without message passing, have emerged as efficient alternatives for latency-sensitive applications, offering up to two orders of magnitude acceleration (100×) compared to GNNs (Zhang et al., 2022; Guo et al., 2023; Tian et al., 2023). However, despite this significant speedup, the absence of message passing inevitably hinders the ability to capture structural information, resulting in degraded model performance. To mitigate this, Zhang et al. (2022) proposed distilling knowledge from pre-trained GNN teachers into MLP students by minimizing the KL divergence between the predictions of GNNs and MLPs. Although this approach has notably improved MLP performance, it has been observed that the model tends to mimic GNN predictions using MLPs rather than truly understanding the localized structural information of nodes, leading to sub-optimal performance.

To address this limitation, researchers intend to incorporate structural knowledge into MLPs. For instance, Tian et al. (2023); Wang et al. (2023a) employ random walks on the original graph to learn positional embeddings for each node, which are then appended to the raw node features as complementary information. Yang et al. (2024) use vector quantization (Van Den Oord et al., 2017) to learn a structural codebook during GNN pre-training and subsequently distill the knowledge from the codebook into downstream MLPs. Additionally, Hu et al. (2021); Xiao et al. (2024b) leverage contrastive learning to encode localized structures by pulling the target node closer to its neighbors in the embedding space. In summary, these methods utilize different heuristics–positional embeddings (Tian et al., 2023; Wang et al., 2023a), structural codebooks (Yang et al., 2024), and neighborhood relationships (Hu et al., 2021; Xiao et al., 2024b)–to provide structural information during MLP training. However, these heuristic approaches cannot fully replicate the functionality of GNNs in capturing the complete structural information of graphs, which may lead to reduced generalization to unseen nodes.

To this end, we present a simple yet effective method, SimMLP: a Self-supervised framework for learning MLPs on graphs, which is the first MLP learning method equivalent to GNNs in the optimal case. SimMLP is based on the insight that modeling the fine-grained correlation between node features and graph structures can enhance the generalization of node embeddings (Tian et al., 2020b). Building on this insight, we employ a self-supervised loss to maximize the alignment between GNNs and MLPs in the embedding space, preserving intricate semantic knowledge. Theoretically, we demonstrate the equivalence between SimMLP and GNNs based on mutual information maximization and inductive biases. Furthermore, we interpret SimMLP through the lens of the information bottleneck theory to illustrate its generalization capability. This equivalence to GNNs offers three distinct advantages compared to existing methods: (1) Generalization: SimMLP has the capability to fully comprehend the localized structures of nodes, making it well-suited for scenarios involving unseen nodes, such as inductive (Hamilton et al., 2017) or cold-start settings (Zheng et al., 2022). (2) Robustness: SimMLP demonstrates robustness to both feature and edge noise due to its advanced structural utilization. Additionally, the self-supervised nature mitigates the risk of overfitting to specific structural patterns, enhancing its resilience in situations with label scarcity. (3) Versatility: The self-supervised alignment in SimMLP enables the acquisition of task-agnostic knowledge, allowing the model to be applied across various graph-related tasks, whereas existing methods are tailored for a single task. To evaluate these benefits, we conducted experiments across 20 benchmark datasets, encompassing node classification on homophily and heterophily graphs, link prediction, and graph classification. Notably, SimMLP proves highly effective in inductive and cold-start settings with unseen nodes, as well as in link prediction tasks where localized structures are crucial. In terms of inference efficiency, SimMLP demonstrates significant acceleration compared to GNNs (90–126×) and other acceleration techniques (5–90×). Our contributions are summarized as follows:

  • We introduce SimMLP, a self-supervised MLP learning method for graphs that aims to maximize the alignment between node features and graph structures in the embedding space.

  • SimMLP is the first MLP learning method that is equivalent to GNNs in the optimal case. We provide a comprehensive theoretical analysis to demonstrate SimMLP’s generalization capabilities and its equivalence to GNNs.

  • We conduct extensive experiments to showcase the superiority of SimMLP, particularly in scenarios where structural insights are essential.

2. Related Work

Graph Neural Networks (Kipf and Welling, 2017; Hamilton et al., 2017; Veličković et al., 2018; Gilmer et al., 2017; Liu et al., 2023, 2024; Wang et al., 2024c; Zhang et al., 2024) encode node embeddings following the message passing framework. Basic GNNs include GCN (Kipf and Welling, 2017), GraphSAGE (Hamilton et al., 2017), GAT (Veličković et al., 2018), and so forth. For rapid inference, SGC (Wu et al., 2019) and APPNP (Gasteiger et al., 2019) decompose feature transformation and neighborhood aggregation, and have recently been proven to be expressive (Han et al., 2023; Yang et al., 2023). Despite their success, the neighborhood dependency significantly constrains the inference speed.

Self-Supervised Learning (SSL) (Chen et al., 2020a; He et al., 2020) acts as a pre-training strategy for learning discriminative (Tian et al., 2020b) and generalizable (Huang et al., 2023) representations without supervision. Numerous studies extend SSL on graphs to train GNNs (Veličković et al., 2019; Hassani and Khasahmadi, 2020; Zhu et al., 2020, 2021; Thakoor et al., 2022; Hou et al., 2022; Sun et al., 2023; Wang et al., 2023b, 2024b, 2024a). In particular, GCA (Zhu et al., 2021) extends instance discrimination (Chen et al., 2020a) to align similar instances in two graph views, BGRL (Thakoor et al., 2022) employs bootstrapping (Grill et al., 2020) to further enhance training efficiency, and GraphACL (Xiao et al., 2024a) leverages an asymmetric contrastive loss to encode structural information on graphs. However, the dependency on neighborhood information still limits the inference speed.

Figure 2. The overview of SimMLP. During pre-training, SimMLP uses GNN and MLP encoders to obtain node embeddings individually, and employs a self-supervised loss to maximize their alignment. To prevent the risk of trivial solutions, SimMLP further applies two strategies discussed in Section 4.2. During inference, SimMLP utilizes the pre-trained MLP to encode node embeddings, achieving significant acceleration by avoiding neighborhood fetching.

Inference Acceleration on GNNs encompasses quantization (Gupta et al., 2015; Jacob et al., 2018), pruning (Han et al., 2015; Frankle and Carbin, 2019), and knowledge distillation (KD) (Hinton et al., 2015; Sun et al., 2019). Quantization (Ding et al., 2021) approximates continuous data with limited discrete values, pruning (Zhou et al., 2021) drops redundant neurons in the model, and KD transfers knowledge from large models to small models (Yan et al., 2020). However, these methods still need to fetch neighborhoods, resulting in constrained inference acceleration. Considering this, GLNN (Zhang et al., 2022) utilizes structure-independent MLPs for predictions, significantly accelerating inference by eliminating message passing. However, the introduction of MLPs inevitably compromises structural learning capability. Follow-up works further integrate structural information into MLPs via positional embeddings (Tian et al., 2023; Wang et al., 2023a), label propagation (Yang et al., 2021), neighborhood alignment (Hu et al., 2021; Dong et al., 2022; Liu et al., 2022; Xiao et al., 2024b), or a motif codebook (Yang et al., 2024), but these heuristic methods only consider one aspect of graph structures, failing to fully integrate structural knowledge. Unlike these methods, SimMLP is equivalent to GNNs in the optimal case, demonstrating better structural learning capabilities.

3. Preliminary

Notations. Consider a graph ${\mathcal{G}}=({\mathbf{A}},{\mathbf{X}})$ consisting of node set ${\mathcal{V}}$ and edge set $E$, with $N$ nodes in total. We have node features ${\mathbf{X}}\in\mathbb{R}^{N\times d}$ and an adjacency matrix ${\mathbf{A}}\in\{0,1\}^{N\times N}$, where ${\mathbf{A}}_{ij}=1$ iff $e_{i,j}\in E$, and ${\mathbf{A}}_{ij}=0$ otherwise. The GNN $\phi(\cdot,\cdot)$ takes node features ${\mathbf{x}}_{i}$ and graph structure ${\mathbf{A}}$ as input and outputs the structure-aware node embeddings ${\mathbf{h}}_{i}^{GNN}$. The embedding is followed by a linear head that classifies nodes into different classes, defined as:

(1) ${\mathbf{h}}_{i}^{GNN}=\phi({\mathbf{x}}_{i},{\mathbf{A}}),\quad\hat{{\mathbf{y}}}_{i}^{GNN}=\text{head}^{GNN}({\mathbf{h}}_{i}^{GNN}).$

GNNs rely heavily on neighborhood information ${\mathbf{A}}$, and neighborhood fetching poses considerable computational overhead during inference. In contrast, the MLP ${\mathcal{E}}(\cdot)$ takes the node features ${\mathbf{x}}_{i}$ as input and outputs the node embeddings ${\mathbf{h}}_{i}^{MLP}$, achieving fast inference by avoiding neighborhood fetching. The embeddings are then decoded via a prediction head:

(2) ${\mathbf{h}}_{i}^{MLP}={\mathcal{E}}({\mathbf{x}}_{i}),\quad\hat{{\mathbf{y}}}_{i}^{MLP}=\text{head}^{MLP}({\mathbf{h}}_{i}^{MLP}).$

Although MLPs provide significantly faster inference on graph-structured datasets, omitting structural information inevitably sacrifices model performance.
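To make the contrast concrete, the following is a minimal PyTorch sketch of the two encoder types in Equations (1) and (2); the class names and the toy normalized adjacency are illustrative assumptions, not the implementation used in this paper. The GNN encoder needs the adjacency at inference time, while the MLP encoder consumes node features only.

```python
# A minimal sketch (plain PyTorch) contrasting Eq. (1) and Eq. (2): the GNN encoder
# needs the adjacency at inference time, while the MLP encoder uses node features only.
# Class names and the toy normalized adjacency below are illustrative assumptions.
import torch
import torch.nn as nn

class GNNEncoder(nn.Module):
    """One-layer GCN-style encoder: aggregate transformed neighbor features."""
    def __init__(self, d_in, d_hid):
        super().__init__()
        self.lin = nn.Linear(d_in, d_hid)

    def forward(self, x, adj_norm):            # adj_norm: normalized adjacency [N, N]
        return torch.relu(adj_norm @ self.lin(x))

class MLPEncoder(nn.Module):
    """Structure-free encoder: no neighborhood fetching at inference."""
    def __init__(self, d_in, d_hid):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_in, d_hid), nn.ReLU(),
                                 nn.Linear(d_hid, d_hid))

    def forward(self, x):
        return self.net(x)

x = torch.randn(4, 8)                          # 4 nodes with 8-dim features
adj_norm = torch.eye(4)                        # placeholder for D^-1/2 (A + I) D^-1/2
h_gnn = GNNEncoder(8, 16)(x, adj_norm)         # Eq. (1): requires the graph
h_mlp = MLPEncoder(8, 16)(x)                   # Eq. (2): features only
```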

Training MLPs on Graphs. To jointly leverage the benefits of GNNs and MLPs, researchers propose methods to distill knowledge from pre-trained GNNs to MLPs by mimicking the predictions (Zhang et al., 2022). The training objective is defined as:

(3) ${\mathcal{L}}=\sum_{i\in{\mathcal{V}}_{train}}{\mathcal{L}}_{CE}(\hat{{\mathbf{y}}}_{i}^{MLP},{\mathbf{y}}_{i})+\lambda\sum_{i\in{\mathcal{V}}}{\mathcal{L}}_{KD}(\hat{{\mathbf{y}}}_{i}^{MLP},\hat{{\mathbf{y}}}_{i}^{GNN}),$

where ${\mathcal{L}}_{CE}$ is the cross-entropy between the prediction and the ground truth, and ${\mathcal{L}}_{KD}$ optimizes the KL divergence between the predictions of the teacher GNN and the student MLP. During inference, only the MLP is used to encode node embeddings and make predictions, leading to substantial inference acceleration. Despite this, alignment in the label space maximizes only the coarse-grained, task-specific correlation between GNNs and MLPs, failing to capture the fine-grained and generalizable relationship between node features and graph structures (Tian et al., 2020a). To this end, SimMLP applies self-supervised learning to align GNNs and MLPs in a more intricate embedding space, better capturing structural information.
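As a reference point, the sketch below shows how the distillation objective in Equation (3) could be computed; the tensor names (logits_mlp, logits_gnn, labels, train_mask) and the temperature tau are assumptions for illustration rather than the original implementation.

```python
# A hedged sketch of the distillation objective in Eq. (3). Tensor names
# (logits_mlp, logits_gnn, labels, train_mask) and the temperature tau are
# assumptions for illustration, not the original implementation.
import torch
import torch.nn.functional as F

def glnn_style_loss(logits_mlp, logits_gnn, labels, train_mask, lam=1.0, tau=1.0):
    # Cross-entropy on labeled training nodes only.
    ce = F.cross_entropy(logits_mlp[train_mask], labels[train_mask])
    # KL divergence between student (MLP) and teacher (GNN) predictions on all nodes.
    kd = F.kl_div(F.log_softmax(logits_mlp / tau, dim=-1),
                  F.softmax(logits_gnn / tau, dim=-1),
                  reduction="batchmean")
    return ce + lam * kd
```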

4. Proposed SimMLP

4.1. Framework

We present SimMLP: a Self-supervised framework for learning MLPs on graphs. The framework consists of three components: (1) a GNN encoder, (2) an MLP encoder, and (3) an alignment loss. As illustrated in Figure 2, SimMLP maximizes the alignment between the GNN and the MLP via a self-supervised loss. Specifically, given a graph ${\mathcal{G}}=({\mathbf{A}},{\mathbf{X}})$, we use the GNN encoder $\phi(\cdot,\cdot)$ to extract structure-aware GNN embeddings ${\mathbf{h}}_{i}^{GNN}$ and the MLP encoder ${\mathcal{E}}(\cdot)$ to obtain structure-free MLP embeddings ${\mathbf{h}}_{i}^{MLP}$. The choice of GNN encoder is arbitrary; different encoders can be used to adapt to different tasks. For alignment, we employ the following loss function to pre-train the model:

(4) ${\mathcal{L}}=\sum_{i\in{\mathcal{V}}}\underbrace{\|\rho({\mathbf{h}}_{i}^{MLP})-{\mathbf{h}}_{i}^{GNN}\|^{2\gamma}}_{\text{invariance}}+\lambda\underbrace{\|{\mathcal{D}}({\mathbf{h}}_{i}^{GNN})-{\mathbf{x}}_{i}\|^{2}}_{\text{reconstruction}},$

where $\gamma\geq 1$ serves as a scaling term, akin to an adaptive sample reweighing technique (Lin et al., 2017), and $\lambda$ denotes the trade-off coefficient. The projectors $\rho(\cdot)$ and ${\mathcal{D}}(\cdot)$ can either be identity mappings or learnable; we opt for non-linear MLPs to enhance the expressiveness in estimating instance distances (Chen et al., 2020a). The invariance term enforces alignment between GNN and MLP embeddings (Grill et al., 2020), modeling the fine-grained and generalizable correlation between node features and localized graph structures. The reconstruction term acts as a regularizer to prevent potential distribution shift (Batson and Royer, 2019), providing better signals for training MLPs. In downstream tasks, we further train a task head for classification, as shown in Figure 2.
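A minimal sketch of how the pre-training loss in Equation (4) could be implemented is given below, assuming hypothetical callables `rho` (the projector on MLP embeddings) and `dec` (the decoder ${\mathcal{D}}$ on GNN embeddings); it is a sketch under these assumptions, not the authors' code.

```python
# A minimal sketch of the pre-training loss in Eq. (4), assuming hypothetical
# callables `rho` (projector on MLP embeddings) and `dec` (decoder D on GNN
# embeddings); h_mlp, h_gnn, and x are [N, d] tensors.
import torch

def simmlp_loss(h_mlp, h_gnn, x, rho, dec, lam=1.0, gamma=1.0):
    # Invariance: ||rho(h_mlp) - h_gnn||^(2 * gamma), computed per node.
    inv = ((rho(h_mlp) - h_gnn) ** 2).sum(dim=-1).pow(gamma)
    # Reconstruction: regularize GNN embeddings to retain node feature information.
    rec = ((dec(h_gnn) - x) ** 2).sum(dim=-1)
    return (inv + lam * rec).sum()
```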

Proposition 4.1.

Suppose ${\mathcal{G}}=({\mathbf{A}},{\mathbf{X}})$ is sampled from a latent graph ${\mathcal{G}}_{\mathcal{I}}=({\mathbf{A}},{\mathbf{F}})$, i.e., ${\mathcal{G}}\sim P({\mathcal{G}}_{\mathcal{I}})$, and ${\mathbf{F}}^{*}$ is the lossless compression of ${\mathbf{F}}$ such that $\mathbb{E}[{\mathbf{X}}|{\mathbf{A}},{\mathbf{F}}^{*}]={\mathbf{F}}$. Let $\rho$ be an identity projector, and $\lambda=1,\gamma=1$. The optimal MLP encoder ${\mathcal{E}}^{*}$ satisfies

(5) ${\mathcal{E}}^{*}=\operatorname*{arg\,min}_{{\mathcal{E}}}\;\mathbb{E}\left[\left\|{\mathbf{H}}^{MLP}-{\mathbf{F}}^{*}\right\|^{2}+\left\|{\mathbf{H}}^{GNN}-{\mathbf{F}}^{*}\right\|^{2}\right]+\mathbb{E}\left[\left\|{\mathcal{D}}({\mathbf{H}}^{GNN})-{\mathbf{X}}\right\|^{2}\right]-2\,\mathbb{E}_{{\mathbf{F}}^{*}}\left[\sum_{i}\mathrm{Cov}({\mathbf{H}}^{MLP}_{i},{\mathbf{H}}^{GNN}_{i})\,|\,{\mathbf{F}}^{*}\right].$
Proof.

All proofs in the paper are presented in Appendix A. ∎

That is, the alignment loss implicitly makes the learned embeddings invariant to latent variables (Muthén, 2004; Xie et al., 2022) and maximizes the consistency between GNNs and MLPs in terms of covariance.

4.2. Preventing Model Collapse

Challenges. Training MLPs on graphs without supervision is a non-trivial task. As illustrated in Figure 3, naively applying the basic loss function (Equation 4) to align GNNs and MLPs results in model collapse, as evidenced by lower training loss and reduced accuracy. Consistent with our findings, Xiao et al. (2024b) also show that simply employing the InfoNCE loss results in reduced performance.

Causes. We attribute this issue to the heterogeneity of information sources, specifically node features and localized structures, encoded by MLPs and GNNs respectively. Before delving into the root causes, it is important to note that existing self-supervised methods primarily focus on aligning different aspects of a homogeneous information source (i.e., different views of the same graph). These methods typically use a single encoder to map various graph views into a unified embedding space, applying a self-supervised loss to align the distances between views. This approach facilitates the learning of informative and generalizable embeddings. However, in scenarios involving heterogeneous information sources, each source often requires a distinct encoder, leading to projections into separate embedding spaces. Consequently, the self-supervised objective fails to accurately measure the true distance between sources, resulting in non-informative embeddings. We propose two strategies to address this issue.

Figure 3. (a), (b): Model collapse happens when naively applying the alignment loss (Equation 4). The strategies proposed in Sec. 4.2 prevent the model collapse.

Strategy 1: Enhancing Consistency between GNN and MLP Encoders. The challenge of handling heterogeneous information sources primarily stems from using different encoders. A straightforward solution is to use a single encoder to process all sources, including node features and localized structures. Fortunately, this approach is feasible for graphs. In the learning process of a GCN, neighborhood information is iteratively aggregated and updated, where aggregation can be viewed as a non-parametric averaging over neighborhoods, and updating as a parametric non-linear transformation. Thus, the learning process of GNNs can be approximated using MLPs by (1) applying an MLP encoder to node features to obtain MLP embeddings ${\mathbf{h}}^{M}_{i}$, and (2) aggregating the MLP embeddings of neighboring nodes to approximate GNN embeddings ${\mathbf{h}}^{G}_{i}$:

(6) Approx.: ${\mathbf{h}}_{i}^{G}=\sigma\left({\mathbf{h}}_{i}^{M}+\textstyle\sum_{(i,j)\in E}\alpha_{ij}\,{\mathbf{h}}_{j}^{M}\right),$

where $\alpha_{ij}$ denotes the aggregation weight (Kipf and Welling, 2017), and $\sigma(\cdot)$ is the activation function. This approach allows the use of a single MLP encoder for both node features and localized structures, ensuring that ${\mathbf{h}}^{M}_{i}$ and ${\mathbf{h}}^{G}_{i}$ reside in the same embedding space. The form of this approximation is similar to SGC (Wu et al., 2019) and APPNP (Gasteiger et al., 2019), which decompose feature transformation and message passing to reduce model complexity and redundant computation. However, SimMLP applies this strategy specifically to enhance the consistency between GNN and MLP encoders. Notably, this approximation can be adapted to various GNN architectures with simple modifications.
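The sketch below illustrates one way Equation (6) could be realized, using GCN-style symmetric normalization as the aggregation weight $\alpha_{ij}$; the function name and the choice of normalization are illustrative assumptions.

```python
# A sketch of Eq. (6) with GCN-style symmetric normalization as the aggregation
# weight alpha_ij; the function name and normalization choice are illustrative.
import torch

def approx_gnn_embeddings(h_mlp, edge_index, num_nodes):
    row, col = edge_index                                   # [2, E] long tensor of edges
    deg = torch.zeros(num_nodes).scatter_add_(0, row, torch.ones(row.size(0)))
    alpha = (deg[row].clamp(min=1) * deg[col].clamp(min=1)).rsqrt()
    adj = torch.sparse_coo_tensor(edge_index, alpha, (num_nodes, num_nodes))
    # h_i^G = sigma(h_i^M + sum_j alpha_ij * h_j^M), sharing one MLP encoder.
    return torch.relu(h_mlp + torch.sparse.mm(adj, h_mlp))
```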

Strategy 2: Enhancing Data Diversity through Augmentation. Increasing data diversity (Lee et al., 2020) is beneficial for handling heterogeneous information sources by creating multiple pairs of the same instances (i.e., target nodes and localized structures). We use augmentation techniques (You et al., 2020; Zhao et al., 2021) to generate multiple views of nodes, allowing a single node to be associated with various pairs of node features and localized structures. The augmentation is defined as:

(7) $\hat{\mathcal{G}}=(\hat{\mathbf{A}},\hat{\mathbf{X}})\sim t({\mathcal{G}}),\ \text{s.t.}\ t({\mathcal{G}})=\langle q_{e}({\mathbf{A}}),q_{f}({\mathbf{X}})\rangle,$

where $t(\cdot)$ represents the augmentation function, which includes structural augmentation $q_{e}(\cdot)$ and node feature augmentation $q_{f}(\cdot)$. For simplicity, we apply random edge masking and node feature masking with pre-defined augmentation ratios (Zhu et al., 2020). The application of these two strategies prevents model collapse, as shown in Figure 3. It is important to note that these strategies are only employed during pre-training and do not affect the inference phase.
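A possible instantiation of the augmentation $t(\cdot)$ in Equation (7), using random edge masking and feature masking with fixed ratios, is sketched below; the masking granularity (per edge, per feature dimension) is an assumption for illustration.

```python
# A possible instantiation of t(G) in Eq. (7): random edge masking q_e and node
# feature masking q_f with pre-defined ratios. Masking granularity is an assumption.
import torch

def augment(x, edge_index, p_edge=0.2, p_feat=0.2):
    keep = torch.rand(edge_index.size(1)) > p_edge           # q_e: drop edges at rate p_edge
    edge_index_aug = edge_index[:, keep]
    feat_mask = (torch.rand(1, x.size(1)) > p_feat).float()  # q_f: mask feature dimensions
    x_aug = x * feat_mask
    return x_aug, edge_index_aug
```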

4.3. Theoretical Understanding

Table 1. The learning objectives from the perspective of mutual information maximization. SimMLP is equivalent to GNNs in the optimal case, whereas other MLP methods lack the capability to (fully) leverage localized structures.
Methods Learning Objective
GNNs (Kipf and Welling, 2017; Hamilton et al., 2017) $\sum_{i\in{\mathcal{V}}}I({\mathbf{y}}_{i};{\mathcal{S}}_{i})=\sum_{i\in{\mathcal{V}}}I({\mathbf{y}}_{i};{\mathbf{X}}^{[i]})+\sum_{i\in{\mathcal{V}}}I({\mathbf{y}}_{i};{\mathbf{A}}^{[i]}|{\mathbf{X}}^{[i]})$
MLP MLP (Zhang et al., 2022) $\sum_{i\in{\mathcal{V}}}I({\mathbf{x}}_{i};{\mathbf{y}}_{i})$
GraphMLP (Hu et al., 2021), GraphECL (Xiao et al., 2024b) $\sum_{i\in{\mathcal{V}}}I({\mathbf{y}}_{i};{\mathbf{x}}_{i})+I({\mathbf{x}}_{i};{\mathbf{X}}^{[i]})$
GLNN (Zhang et al., 2022) $\sum_{i\in{\mathcal{V}}}I({\mathbf{x}}_{i};{\mathbf{y}}_{i}|{\mathcal{S}}_{i})+I({\mathbf{x}}_{i};{\mathbf{y}}_{i})$
NOSMOG (Tian et al., 2023), GENN (Wang et al., 2023a) $\sum_{i\in{\mathcal{V}}}I({\mathbf{x}}_{i};{\mathbf{y}}_{i}|{\mathcal{S}}_{i})+I({\mathbf{x}}_{i};{\mathbf{y}}_{i})+I({\mathbf{A}};{\mathbf{y}}_{i})$
VQGraph (Yang et al., 2024) $\sum_{i\in{\mathcal{V}}}I({\mathbf{x}}_{i};{\mathbf{y}}_{i}|{\mathcal{S}}_{i})+I({\mathbf{x}}_{i};{\mathbf{y}}_{i})+I({\mathbf{A}};{\mathbf{x}}_{i})$
SimMLP $\sum_{i\in{\mathcal{V}}}I({\mathbf{y}}_{i};{\mathbf{x}}_{i})+I({\mathbf{x}}_{i};{\mathcal{S}}_{i})$

Mutual Information Maximization. Mutual information $I(\cdot;\cdot)$ is a concept in information theory that measures the mutual dependency between two random variables and has been widely used in signal processing and machine learning (Belghazi et al., 2018). Intuitively, maximizing the mutual information between two variables increases their correlation (i.e., decreases their uncertainty). In this section, we interpret SimMLP and existing MLP learning methods from the perspective of mutual information maximization to analyze their learning objectives, as summarized in Table 1. First, we unify the notation and introduce the key lemmas. A graph ${\mathcal{G}}=({\mathbf{X}},{\mathbf{A}},{\mathbf{Y}})$ consists of node features ${\mathbf{X}}$, graph structure ${\mathbf{A}}$, and node labels ${\mathbf{Y}}$. We define the ego-graph around node $i$ as ${\mathcal{S}}_{i}=({\mathbf{X}}^{[i]},{\mathbf{A}}^{[i]})$, where ${\mathbf{X}}^{[i]}$ and ${\mathbf{A}}^{[i]}$ denote the node features and graph structure of ${\mathcal{S}}_{i}$, respectively.

Lemma 4.2.

Minimizing the cross-entropy $H({\mathbf{Y}};\hat{{\mathbf{Y}}}|{\mathbf{X}})$ is equivalent to maximizing the mutual information $I({\mathbf{X}};{\mathbf{Y}})$.

MLPs minimize the cross-entropy between the ground truth ${\mathbf{Y}}$ and the predictions made from node features, i.e., $\hat{{\mathbf{Y}}}|{\mathbf{X}}$, which is equivalent to maximizing the mutual information $\sum_{i\in{\mathcal{V}}}I({\mathbf{y}}_{i};{\mathbf{x}}_{i})$. GNNs apply message passing to use localized structures in making predictions, which aims to maximize $\sum_{i\in{\mathcal{V}}}I({\mathbf{y}}_{i};{\mathcal{S}}_{i})=\sum_{i\in{\mathcal{V}}}I({\mathbf{y}}_{i};{\mathbf{X}}^{[i]})+\sum_{i\in{\mathcal{V}}}I({\mathbf{y}}_{i};{\mathbf{A}}^{[i]}|{\mathbf{X}}^{[i]})$. GraphMLP and GraphECL maximize the consistency between the target node and its neighborhood encoded by MLPs, $I({\mathbf{x}}_{i};{\mathbf{X}}^{[i]})$, failing to learn the intrinsic correlation between node features and graph structures. GLNN distills knowledge from GNNs to MLPs to maximize $\sum_{i\in{\mathcal{V}}}I({\mathbf{x}}_{i};{\mathbf{y}}_{i}|{\mathcal{S}}_{i})+I({\mathbf{x}}_{i};{\mathbf{y}}_{i})$, where ${\mathbf{y}}_{i}|{\mathcal{S}}_{i}$ denotes the soft label from GNNs, but it fails to explicitly utilize structural information in making predictions. Following GLNN, GENN and NOSMOG employ positional embeddings to leverage structural information by optimizing $I({\mathbf{A}};{\mathbf{y}}_{i})$, and VQGraph uses a codebook to inject structural knowledge into node features, further optimizing $I({\mathbf{A}};{\mathbf{x}}_{i})$. However, these methods cannot fully leverage the localized structures in making predictions, leading to sub-optimal performance.

SimMLP employs self-supervised learning to maximize the mutual information between GNNs and MLPs (Bachman et al., 2020). The objective is to maximize $\sum_{i\in{\mathcal{V}}}I({\mathbf{y}}_{i};{\mathbf{x}}_{i})+I({\mathbf{x}}_{i};{\mathcal{S}}_{i})$. The first term optimizes the model on downstream tasks, corresponding to the task-specific prediction head. The second term is the training objective of SimMLP (Equation 4), denoting the alignment between GNNs and MLPs. When the second term is maximized, ${\mathbf{x}}_{i}$ preserves all information of ${\mathcal{S}}_{i}$, turning the overall objective into maximizing $\sum_{i\in{\mathcal{V}}}I({\mathbf{y}}_{i};{\mathcal{S}}_{i})$. This demonstrates the equivalence between SimMLP (in the optimal case) and GNNs, showing the superiority of SimMLP in leveraging graph structures for predictions. Our analysis aligns with Chen et al. (2021) and Zhang et al. (2022) in that the expressiveness of MLPs on the node classification task is bounded by the induced ego-graphs ${\mathcal{S}}_{i}$. Although our analysis is based on node classification, it readily extends to link prediction and graph classification.

Information Bottleneck Principle. The information bottleneck (Tishby et al., 2000; Tishby and Zaslavsky, 2015) focuses on finding the optimal compression of observed random variables by achieving a trade-off between informativeness and generalization (Shwartz-Ziv and Tishby, 2017). For example, given a random variable ${\mathbf{X}}$ sampled from a latent variable ${\mathbf{Y}}$, the aim is to find the optimal compression ${\mathbf{Z}}^{*}=\operatorname*{arg\,min}_{{\mathbf{Z}}}I({\mathbf{X}};{\mathbf{Z}})-\beta I({\mathbf{Z}};{\mathbf{Y}})$. Intuitively, minimizing $I({\mathbf{X}};{\mathbf{Z}})$ aims to obtain the minimum compression, and maximizing $I({\mathbf{Z}};{\mathbf{Y}})$ preserves the essential information of ${\mathbf{Y}}$. For SimMLP, we assume the observed graph ${\mathcal{G}}$ is sampled from the latent graph ${\mathcal{G}}_{\mathcal{I}}$ (Proposition 4.1), and aim to compress the graph into ${\mathbf{T}}=({\mathbf{H}}^{MLP},{\mathbf{H}}^{GNN})$.

Proposition 4.3.

The optimal compression ${\mathbf{T}}^{*}$ satisfies

(8) ${\mathbf{T}}^{*}=\operatorname*{arg\,min}_{{\mathbf{H}}^{M},{\mathbf{H}}^{G}}\lambda H({\mathbf{H}}^{M}|{\mathcal{G}}_{\mathcal{I}})+H({\mathbf{H}}^{G})+\lambda H({\mathbf{H}}^{G}|{\mathcal{G}}_{\mathcal{I}})+H({\mathbf{H}}^{M}|{\mathbf{H}}^{G}),$

where $\lambda=\frac{\beta}{1-\beta}>0$, and ${\mathbf{H}}^{M}$ and ${\mathbf{H}}^{G}$ denote ${\mathbf{H}}^{MLP}$ and ${\mathbf{H}}^{GNN}$, respectively.

$I(\cdot;\cdot)$ and $H(\cdot)$ denote mutual information and entropy, respectively. Intuitively, the optimal compression (Equation 8) is attainable with the optimal encoder ${\mathcal{E}}^{*}$ (Equation 5). In particular, minimizing $H({\mathbf{H}}^{MLP}|{\mathcal{G}}_{\mathcal{I}})$ and $H({\mathbf{H}}^{GNN}|{\mathcal{G}}_{\mathcal{I}})$ preserves the latent information in GNN and MLP embeddings, which can be instantiated as minimizing $\|{\mathbf{H}}^{MLP}-{\mathbf{F}}^{*}\|^{2}+\|{\mathbf{H}}^{GNN}-{\mathbf{F}}^{*}\|^{2}$. In addition, the minimum conditional entropy $H({\mathbf{H}}^{MLP}|{\mathbf{H}}^{GNN})$ denotes the alignment between GNN and MLP embeddings, which can be modeled as maximizing $\sum_{i}\mathrm{Cov}({\mathbf{H}}^{MLP}_{i},{\mathbf{H}}^{GNN}_{i})$. Furthermore, minimizing the entropy $H({\mathbf{H}}^{GNN})$ reduces the uncertainty of the GNN embeddings, which can be achieved by preserving more node feature information, i.e., minimizing $\|{\mathcal{D}}({\mathbf{H}}^{GNN})-{\mathbf{X}}\|^{2}$. This analysis bridges the information bottleneck and the objective function of SimMLP, showing the potential for learning generalizable embeddings (Alemi et al., 2017).

Table 2. SimMLP shares two inductive biases with GNNs, i.e., homophily and local structure importance, which are measured by smoothness and min-cut, respectively.
Smoothness (↓) Min-Cut (↑)
Methods Cora Citeseer Pubmed Cora Citeseer Pubmed
Raw Node Feature 0.822 0.783 0.734 - - -
SAGE (Hamilton et al., 2017) 0.113 0.184 0.143 0.924 0.943 0.918
BGRL (Thakoor et al., 2022) 0.155 0.102 0.333 0.885 0.935 0.856
MLP (Zhang et al., 2022) 0.463 0.444 0.485 0.666 0.804 0.863
GLNN (Zhang et al., 2022) 0.282 0.268 0.421 0.886 0.916 0.793
NOSMOG (Tian et al., 2023) 0.267 0.230 0.394 0.902 0.932 0.834
VQGraph (Yang et al., 2024) 0.253 0.212 0.396 0.914 0.940 0.831
SimMLP 0.196 0.170 0.360 0.934 0.958 0.886
Table 3. Node classification accuracy in transductive setting, where the overall best and the sub-category best are indicated by bold and underline, respectively. We report the mean and standard deviation of ten runs with different random seeds.
Methods Cora Citeseer Pubmed Computer Photo Co-CS Co-Phys Wiki-CS Flickr Arxiv Avg.
GNN SAGE (Hamilton et al., 2017) 81.4±0.9 70.4±1.4 85.9±0.4 88.9±0.3 93.8±0.4 93.4±0.2 95.7±0.1 80.9±0.6 48.5±0.8 72.1±0.3 81.1
GAT (Veličković et al., 2018) 82.3±1.2 68.9±1.5 84.7±0.4 89.9±0.5 91.9±0.4 92.0±0.3 95.1±0.2 80.0±0.6 51.4±0.2 71.8±0.4 80.8
APPNP (Gasteiger et al., 2019) 75.5±1.6 68.1±1.2 84.6±0.3 87.4±0.3 93.4±0.5 94.6±0.2 95.4±0.1 79.1±0.3 47.5±0.3 71.0±0.2 79.7
SGC (Wu et al., 2019) 81.8±0.9 69.0±1.6 85.3±0.3 89.3±0.6 92.7±0.4 94.0±0.2 94.8±0.2 81.1±0.6 51.8±0.2 70.0±0.4 81.0
GCL DGI (Veličković et al., 2019) 82.3±0.6 71.8±0.7 76.8±0.7 80.0±0.2 91.6±0.2 92.2±0.5 94.5±0.0 76.4±0.6 46.9±0.1 70.1±0.2 78.3
MVGRL (Hassani and Khasahmadi, 2020) 83.9±0.5 72.1±1.3 86.3±0.6 87.9±0.3 91.9±0.2 92.2±0.1 95.3±0.0 77.6±0.1 49.3±0.1 70.9±0.1 80.7
GRACE (Zhu et al., 2020) 80.5±1.0 65.5±2.1 84.6±0.5 88.4±0.3 92.8±0.6 93.0±0.3 95.4±0.1 78.6±0.5 49.3±0.1 71.0±0.1 79.5
GCA (Zhu et al., 2021) 83.5±0.5 71.3±0.2 86.0±0.4 87.4±0.3 92.6±0.2 93.1±0.0 95.7±0.0 78.4±0.1 49.0±0.1 70.9±0.1 80.8
BGRL (Thakoor et al., 2022) 81.3±0.6 66.9±0.6 84.9±0.2 88.2±0.2 92.5±0.1 92.1±0.1 95.2±0.1 77.5±0.8 49.7±0.1 70.8±0.1 79.9
MLP MLP (Zhang et al., 2022) 64.5±1.9 64.0±1.3 80.7±0.3 80.8±0.3 87.8±0.5 91.7±0.3 95.1±0.1 75.2±0.5 46.2±0.1 56.4±0.3 74.2
GraphMLP (Hu et al., 2021) 79.5±0.8 72.1±0.5 84.3±0.2 84.0±0.6 90.9±1.0 90.4±0.6 93.5±0.2 76.4±0.5 46.3±0.2 63.4±0.2 78.1
GLNN (Zhang et al., 2022) 81.3±1.2 71.2±0.7 86.3±0.5 87.5±0.6 93.9±0.3 94.2±0.2 95.4±0.1 80.7±0.7 46.2±0.2 64.0±0.5 80.1
GENN (Wang et al., 2023a) 82.1±0.8 71.4±1.3 86.3±0.3 87.1±0.6 93.6±0.7 93.8±0.3 95.5±0.1 80.5±0.7 46.4±0.3 70.1±0.6 80.7
VQGraph (Yang et al., 2024) 82.3±0.6 73.0±1.2 86.0±0.4 87.5±0.7 93.9±0.3 93.8±0.1 95.6±0.1 79.9±0.2 47.0±0.2 70.8±0.8 81.0
NOSMOG (Tian et al., 2023) 82.3±1.1 72.4±1.3 86.2±0.3 87.6±1.1 93.9±0.5 93.8±0.2 95.7±0.1 80.5±0.8 46.7±0.3 70.8±0.4 81.0
SimMLP 84.6±0.2 73.5±0.5 87.0±0.1 88.5±0.2 94.3±0.1 94.9±0.1 96.2±0.0 81.2±0.1 49.9±0.1 71.1±0.1 82.1
Table 4. Node classification accuracy in inductive settings.
Methods Cora Citeseer Pubmed Arxiv
SAGE (Hamilton et al., 2017) 77.5±1.8 68.4±1.6 85.0±0.4 68.5±0.6
BGRL (Thakoor et al., 2022) 77.7±1.1 64.3±1.6 84.0±0.5 69.3±0.4
MLP (Zhang et al., 2022) 63.8±1.7 64.0±1.2 80.9±0.5 55.9±0.5
GLNN (Zhang et al., 2022) 78.3±1.0 69.6±1.1 85.5±0.5 63.5±0.5
GENN (Wang et al., 2023a) 77.8±1.6 67.3±1.5 84.3±0.5 68.5±0.5
VQGraph (Yang et al., 2024) 78.4±1.8 70.4±1.1 85.4±0.6 69.3±0.9
NOSMOG (Tian et al., 2023) 77.8±1.9 68.6±1.4 83.8±0.5 69.1±0.8
GraphECL (Xiao et al., 2024b) 77.8±1.3 69.2±1.2 84.5±0.5 69.1±0.6
SimMLP 81.4±1.2 72.3±0.9 86.5±0.3 70.2±0.5

Inductive Bias. To further analyze the equivalence between SimMLP and GNNs, we investigate whether SimMLP and GNNs share similar inductive biases. We consider SimMLP to have two key inductive biases, i.e., the homophily philosophy and local structure importance, as shown in Table 2 (more results are in Appendix C.3). Homophily implies that topologically close nodes have similar properties, which is naturally incorporated in message passing (Li et al., 2022), where node embeddings are updated based on neighborhoods. However, MLP-based methods cannot (or can only partially) leverage the homophily bias due to the lack of structural learning ability. To evaluate homophily, we measure the distance between embeddings of directly connected nodes, which corresponds to smoothness. In particular, we instantiate this as the Mean Average Distance (MAD) (Chen et al., 2020b):

(9) $\text{MAD}=\frac{\sum_{i\in{\mathcal{V}}}\sum_{j\in{\mathcal{N}}(i)}({\mathbf{h}}_{i}^{MLP}-{\mathbf{h}}_{j}^{MLP})^{2}}{\sum_{i\in{\mathcal{V}}}\sum_{j\in{\mathcal{N}}(i)}\mathbf{1}}.$

Intuitively, a low smoothness value indicates high similarity between directly connected nodes, demonstrating the capability to leverage graph structural information (Hou et al., 2019). As shown in Table 2, compared to GNNs, MLP-based methods fall short in reducing the distance between topologically close nodes, even NOSMOG and VQGraph, which directly integrate graph structures. SimMLP goes beyond these methods by aligning GNNs and MLPs in an intricate embedding space, approaching the behavior of GNNs.
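For reference, the smoothness metric in Equation (9) can be computed as sketched below, assuming `h` is an [N, d] embedding matrix and `edge_index` lists the connected node pairs; the names are illustrative.

```python
# A sketch of the smoothness metric in Eq. (9): mean squared distance between
# embeddings of directly connected nodes. `h` is [N, d]; `edge_index` lists (i, j) pairs.
import torch

def mean_average_distance(h, edge_index):
    src, dst = edge_index
    return ((h[src] - h[dst]) ** 2).sum(dim=-1).mean().item()
```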

Table 5. Node classification accuracy under cold-start setting.
Methods Cora Citeseer Pubmed Arxiv
SAGE (Hamilton et al., 2017) 69.7±2.9 67.1±2.6 82.9±1.0 55.5±0.8
BGRL (Thakoor et al., 2022) 79.4±1.7 65.0±2.2 84.0±1.0 65.0±0.5
MLP (Zhang et al., 2022) 64.2±2.1 64.4±1.8 80.9±0.7 55.9±0.7
GLNN (Zhang et al., 2022) 72.0±1.7 69.1±2.6 84.4±0.9 60.6±0.6
GENN (Wang et al., 2023a) 68.1±2.2 65.1±2.8 78.4±0.8 62.6±0.7
VQGraph (Yang et al., 2024) 70.4±3.4 70.0±1.6 84.5±1.5 64.0±1.7
NOSMOG (Tian et al., 2023) 68.1±3.0 67.1±2.1 77.4±0.8 63.5±0.8
GraphECL (Xiao et al., 2024b) 71.5±4.2 70.9±2.4 82.4±0.9 63.3±0.7
SimMLP 80.5±2.2 72.8±1.6 86.4±0.5 66.1±1.1
Table 6. The performance on graph classification tasks with accuracy (%).
Methods IMDB-B IMDB-M COLLAB PTC-MR MUTAG DD PROTEINS
Supervised GIN (Xu et al., 2019) 75.1±5.1 52.3±2.8 80.2±1.9 64.6±1.7 89.4±5.6 74.9±3.1 76.2±2.8
Graph Kernel WL (Shervashidze et al., 2011) 72.3±3.4 47.0±0.5 - 58.0±0.5 80.7±3.0 - 72.9±0.6
DGK (Yanardag and Vishwanathan, 2015) 67.0±0.6 44.6±0.5 - 60.1±2.6 87.4±2.7 - 73.3±0.8
GCL graph2vec (Narayanan et al., 2017) 71.1±0.5 50.4±0.9 - 60.2±6.9 83.2±9.3 - 73.3±2.1
MVGRL (Hassani and Khasahmadi, 2020) 71.8±0.8 50.8±0.9 73.1±0.6 - 89.2±1.3 75.2±0.6 74.0±0.3
InfoGraph (Sun et al., 2020) 73.0±0.9 49.7±0.5 70.7±1.1 61.7±1.4 89.0±1.1 72.9±1.8 74.4±0.3
GraphCL (You et al., 2020) 71.1±0.4 48.6±0.7 71.4±1.2 - 86.8±1.3 78.6±0.4 74.4±0.5
JOAO (You et al., 2021) 70.2±3.1 49.2±0.8 69.5±0.4 - 87.4±1.0 - 74.6±0.4
MLP MLP 49.5±1.7 33.1±1.6 51.9±1.0 54.4±1.4 67.2±1.0 58.6±1.4 59.2±1.0
MLP + KD 72.9±1.0 48.1±0.5 75.4±1.5 59.4±1.4 87.4±0.7 73.6±1.7 73.5±1.8
SimMLP 74.1±0.2 51.4±0.5 81.0±0.1 60.3±1.1 87.7±0.2 78.4±0.5 75.3±0.1

The reported results of baselines are from previous papers if available (You et al., 2020, 2021; Hou et al., 2022). indicates the results are from our implementation.

Local structure importance means that local neighborhoods preserve information crucial for predictions. GNNs utilize localized information in making predictions, naturally emphasizing localized structures, whereas MLPs generally take only node features as input, failing to fully leverage structural information. We measure this bias as the alignment between localized structures and model predictions, which aligns with the philosophy of Min-Cut (Stoer and Wagner, 1997). In particular, treating the predictions $\hat{{\mathbf{Y}}}$ as graph partitions, a high Min-Cut value indicates a high correlation between predictions and localized structures, as it implies high intra-partition connectivity and low inter-partition connectivity. We follow Shi and Malik (2000) to model the Min-Cut problem:

(10) $\text{Min-Cut}=\mathrm{tr}(\hat{\mathbf{Y}}^{T}{\mathbf{A}}\hat{\mathbf{Y}})/\mathrm{tr}(\hat{\mathbf{Y}}^{T}{\mathbf{D}}\hat{\mathbf{Y}}),$

where ${\mathbf{A}}$ and ${\mathbf{D}}$ are the adjacency matrix and the diagonal node degree matrix, respectively. As shown in Table 2, SimMLP demonstrates an inductive bias towards local structure importance, evidenced by the best average Min-Cut result.
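As a reference, the Min-Cut score in Equation (10) could be computed as in the sketch below for small graphs with a dense adjacency; turning hard predictions into a one-hot partition matrix is an illustrative assumption.

```python
# A sketch of the Min-Cut score in Eq. (10) for small graphs with a dense adjacency.
# Hard predictions are turned into a one-hot partition matrix Y (an assumption).
import torch
import torch.nn.functional as F

def min_cut_score(pred, A, num_classes):
    Y = F.one_hot(pred, num_classes).float()   # [N, C] partition indicator
    D = torch.diag(A.sum(dim=1))               # diagonal degree matrix
    return (torch.trace(Y.T @ A @ Y) / torch.trace(Y.T @ D @ Y)).item()
```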

5. Experiments

Figure 4. Node classification on heterophily graphs.
Figure 5. Link prediction performance.

5.1. Experimental Setup

We conduct experiments on node classification, link prediction, and graph classification. We use 20 datasets in total, including homophily graphs (Cora, Citeseer, Pubmed, Computer, Photo, Co-CS, Co-Phys, Wiki-CS, Flickr, and Arxiv), heterophily graphs (Actor, Texas, and Wisconsin), and molecule/social networks for graph classification (IMDB-B, IMDB-M, COLLAB, PTC-MR, MUTAG, DD, and PROTEINS). More evaluation protocols are in Appendix B.

5.2. Node Classification

Transductive Setting. Given a graph ${\mathcal{G}}=({\mathcal{V}},E)$, all nodes $v\in{\mathcal{V}}$ are visible during training, with the nodes partitioned into non-overlapping sets: ${\mathcal{V}}={\mathcal{V}}_{train}\sqcup{\mathcal{V}}_{val}\sqcup{\mathcal{V}}_{test}$. We use ${\mathcal{V}}_{train}$ for training and ${\mathcal{V}}_{val}$ and ${\mathcal{V}}_{test}$ for evaluation. The results, presented in Table 3, show that SimMLP outperforms self-supervised GCL methods in all settings and surpasses supervised GNNs on 7 out of 10 datasets. Compared to other MLP-based methods, SimMLP achieves a superior average performance of 82.1, exceeding the second-best score of 81.0, thereby demonstrating its effectiveness in node classification. We present a comprehensive ablation study in Appendix E.

Table 7. SimMLP achieves significant acceleration and improved performance over inference acceleration techniques.
Flickr Arxiv
Methods Time (ms) Acc. Time (ms) Acc.
SAGE (Hamilton et al., 2017) 80.7 47.2 314.7 68.5
SGC (Wu et al., 2019) 76.9 (1.1×) 47.4 265.9 (1.2×) 68.9
APPNP (Gasteiger et al., 2019) 78.1 (1.0×) 47.5 284.1 (1.1×) 69.1
QSAGE (Zhang et al., 2022) 70.6 (1.1×) 47.2 289.5 (1.1×) 68.5
PSAGE (Zhang et al., 2022) 67.4 (1.2×) 47.3 297.5 (1.1×) 68.6
Neighbor Sample (Zhang et al., 2022) 25.5 (3.2×) 47.0 78.3 (4.0×) 68.4
SimMLP 0.9 (89.7×) 49.3 2.5 (125.9×) 70.2
Figure 6. Left: Feature Noise. SimMLP is robust to feature noise, whereas other MLP-based methods are susceptible to it. Middle: Edge Noise. SimMLP consistently demonstrates robustness against edge noise, even in scenarios with exceptionally high noise ratios. Right: Label Scarcity. SimMLP significantly outperforms baselines, especially with limited labels.

Inductive Setting. In this paper, we consider inductive inference for unseen nodes within the same graph. We partition the graph ${\mathcal{G}}=({\mathcal{V}},E)$ into a non-overlapping transductive graph ${\mathcal{G}}^{T}=({\mathcal{V}}^{T},E^{T})$ and an inductive graph ${\mathcal{G}}^{I}=({\mathcal{V}}^{I},E^{I})$. The transductive graph ${\mathcal{G}}^{T}$ contains 80% of the nodes, further divided into ${\mathcal{V}}^{T}={\mathcal{V}}^{T}_{train}\sqcup{\mathcal{V}}^{T}_{valid}\sqcup{\mathcal{V}}^{T}_{test}$, which are used for training in the transductive setting. The inductive graph ${\mathcal{G}}^{I}$ consists of the remaining 20% of nodes, which are unseen during training. Compared to the settings in (Zhang et al., 2022; Tian et al., 2023), our approach is more challenging because ${\mathcal{V}}^{I}$ is disconnected from ${\mathcal{V}}^{T}$ during inference. We evaluate three measures: (1) transductive results, evaluated on ${\mathcal{V}}^{T}_{test}$, (2) inductive results, evaluated on ${\mathcal{V}}^{I}$, and (3) production results, which are the weighted average of the previous two. The production results are reported in Table 4. We observe that SimMLP outperforms various baselines, highlighting the advantage of leveraging structural information for unseen nodes. Additional results can be found in Appendix C.1. A sketch of one way to construct such a split is shown below.
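A hedged sketch of one way to build such a split (hold out 20% of the nodes as ${\mathcal{V}}^{I}$ and drop all cross edges so that ${\mathcal{V}}^{I}$ is disconnected from ${\mathcal{V}}^{T}$ at inference) is given below; the function name and seeding are illustrative assumptions.

```python
# A hedged sketch of one way to build the inductive split: hold out 20% of nodes
# as V^I and drop all cross edges so V^I is disconnected from V^T at inference.
# The function name and seeding are illustrative assumptions.
import torch

def inductive_split(num_nodes, edge_index, ratio=0.2, seed=0):
    g = torch.Generator().manual_seed(seed)
    perm = torch.randperm(num_nodes, generator=g)
    is_ind = torch.zeros(num_nodes, dtype=torch.bool)
    is_ind[perm[: int(ratio * num_nodes)]] = True            # inductive node set V^I
    src, dst = edge_index
    same_side = is_ind[src] == is_ind[dst]                   # remove cross edges only
    edge_T = edge_index[:, same_side & ~is_ind[src]]         # E^T: both ends in V^T
    edge_I = edge_index[:, same_side & is_ind[src]]          # E^I: both ends in V^I
    return is_ind, edge_T, edge_I
```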

Cold-Start Setting. We follow the inductive setting by partitioning the graph ${\mathcal{G}}=({\mathcal{V}},E)$ into a transductive graph ${\mathcal{G}}^{T}$ and an inductive graph ${\mathcal{G}}^{I}$. The key difference is that the nodes in ${\mathcal{G}}^{I}$ are isolated, with all edges removed, such that ${\mathcal{G}}^{I}=({\mathcal{V}}^{I},\emptyset)$. This approach aligns with real-world applications where new users often emerge independently (Hao et al., 2021). We report the performance on ${\mathcal{V}}^{I}$ as the cold-start results, as shown in Table 5. SimMLP achieves significant improvements over all baselines, highlighting its superior structural learning capabilities. Notably, SimMLP shows performance gains of 7% and 18% over the vanilla MLP on Flickr and Arxiv, respectively, and 7% and 6% enhancements over VQGraph. Additional results can be found in Appendix C.2.

Heterophily Graphs. The homophily inductive bias in SimMLP (Table 2) stems from the chosen GNN encoder, which in our case is GCN. However, by employing different GNN architectures, we can introduce varying inductive biases that are better suited for heterophily graphs. To explore this, we use two advanced methods–ACMGCN (Luan et al., 2022) and GCNII (Chen et al., 2020c)–which are known for their effectiveness in handling heterophily graphs, alongside MLP and GCN. We evaluate these models on three challenging heterophily datasets: Actor, Texas, and Wisconsin. The results, presented in Figure 4, demonstrate that SimMLP can adapt to different heterophily-oriented GNN architectures, consistently enhancing their performance. This highlights SimMLP’s adaptability and broad applicability across various graph domains and GNN architectures.

5.3. Link Prediction

Due to its self-supervised nature, SimMLP can be easily extended to link-level and graph-level tasks. We compare SimMLP against MLP, GNN, and LLP (Guo et al., 2023)–a state-of-the-art MLP-based method for link prediction–using Cora, Citeseer, and Pubmed as benchmark datasets. We strictly adhere to the experimental setup from (Guo et al., 2023). The results, illustrated in Figure 5, show that SimMLP achieves the best performance across all three datasets, with particularly significant improvements on Cora and Citeseer. These findings underscore SimMLP’s superiority in modeling localized structural information for accurately determining node connectivity.

5.4. Graph Classification

Although SimMLP is primarily designed for learning localized structures, it still delivers strong performance on graph classification tasks, which require an understanding of global structural knowledge. We compare SimMLP against various baselines, including supervised GNNs such as GIN (Xu et al., 2019), graph kernels such as the WL kernel (Shervashidze et al., 2011) and DGK (Yanardag and Vishwanathan, 2015), as well as self-supervised methods like graph2vec (Narayanan et al., 2017), MVGRL (Hassani and Khasahmadi, 2020), InfoGraph (Sun et al., 2020), GraphCL (You et al., 2020), and JOAO (You et al., 2021). Additionally, we implemented an MLP learning method on graphs by applying a KL divergence loss similar to that used in GLNN (Zhang et al., 2022). The graph classification results, presented in Table 6, show that SimMLP achieves the best or second-best performance on 6 out of 7 datasets across various domains, with particularly strong results on the large-scale COLLAB dataset. These findings highlight the potential of SimMLP for graph-level tasks.

5.5. Inference Acceleration

Table 7 compares SimMLP with existing acceleration methods (Zhang et al., 2022), including quantization (QSAGE), pruning (PSAGE), and neighbor sampling (Neighbor Sample), in an inductive setting. We also include SGC and APPNP, which simplify message passing for faster inference. Our observations indicate that even the most efficient of these methods offers only marginal acceleration (3.2–4.0×) and inevitably sacrifices model performance. In contrast, SimMLP achieves remarkable inference acceleration (89.7–125.9×) by eliminating the neighborhood fetching process. Additionally, we provide a comparison with MLP-based methods, as shown in Figure 1. This figure illustrates the trade-off between model performance and inference time on Arxiv in a cold-start setting. SimMLP delivers significant performance improvements over MLP-based methods and substantial inference acceleration compared to GNN-based methods, demonstrating the best overall trade-off.

5.6. Robustness Analysis

In this section, we analyze the robustness of SimMLP on noisy data and with scarce labels. We report the results for the inductive set ${\mathcal{V}}^{I}$ in the inductive setting, averaging over seven datasets: Cora, Citeseer, Pubmed, Computer, Photo, Co-CS, and Wiki-CS.

Noisy Node Features. We assess the impact of node feature noise by introducing random Gaussian noise. Specifically, we replace ${\mathbf{X}}$ with $\tilde{\mathbf{X}}=(1-\alpha){\mathbf{X}}+\alpha x$, where $x$ represents random noise independent of ${\mathbf{X}}$, and the noise level $\alpha\in[0,1]$. As shown in Figure 6 (Left), SimMLP outperforms all baselines across all settings, despite the fact that node content quality is critical for MLP-based methods (Zhang et al., 2022; Guo et al., 2023). We attribute this robustness to the augmentation process, which synthesizes additional high-quality node and ego-graph pairs, thereby enhancing MLP training. This augmentation also contributes to the robustness of BGRL. However, we observe that the performance of other MLP-based methods deteriorates rapidly as noise levels increase.
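The feature-noise protocol above amounts to the following short sketch (variable names are illustrative, and the noise level `alpha` lies in [0, 1]).

```python
# A short sketch of the feature-noise protocol: X_tilde = (1 - alpha) X + alpha * noise,
# with Gaussian noise independent of X; variable names are illustrative.
import torch

def add_feature_noise(x, alpha):
    return (1.0 - alpha) * x + alpha * torch.randn_like(x)
```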

Noisy Topology. To introduce structural noise, we randomly flip edges within the graph. Specifically, we replace ${\mathbf{A}}$ with $\tilde{\mathbf{A}}={\mathbf{M}}\odot(1-{\mathbf{A}})+(1-{\mathbf{M}})\odot{\mathbf{A}}$, where ${\mathbf{M}}_{ij}\sim{\mathcal{B}}(p)$, and ${\mathcal{B}}(p)$ is a Bernoulli distribution with probability $p$. The results under varying noise levels are depicted in Figure 6 (Middle). SimMLP consistently outperforms other methods, demonstrating its robustness. While increasing noise levels significantly degrade the performance of GNNs, especially the self-supervised BGRL, they have minimal impact on MLPs, even when $\tilde{\mathbf{A}}$ becomes independent of ${\mathbf{A}}$.
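Similarly, the edge-flipping protocol can be sketched as below for a dense adjacency; this is an illustrative reading of the formula, not the authors' implementation.

```python
# A sketch of the edge-flipping protocol: A_tilde = M * (1 - A) + (1 - M) * A,
# with M_ij ~ Bernoulli(p), applied to a dense adjacency (illustrative only).
import torch

def flip_edges(A, p):
    M = torch.bernoulli(torch.full_like(A, p))
    return M * (1.0 - A) + (1.0 - M) * A
```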

Label Scarcity. We also examine SimMLP’s robustness under label scarcity. Figure 6 (Right) presents model performance across various label ratios for node classification. Our method consistently outperforms all other baselines, even with extremely limited training data (0.001). This highlights SimMLP’s resilience to label scarcity. Furthermore, we observe that self-supervised methods demonstrate greater robustness (Huang et al., 2023) compared to supervised approaches, likely due to their ability to leverage unlabeled data during training.

6. Conclusion

MLPs offer rapid inference on graph-structured datasets, yet their inability to learn structural information limits their performance. In this paper, we propose SimMLP, the first MLP learning method that is equivalent to GNNs (in the optimal case). Our insight is that modeling the fine-grained correlation between node features and graph structures preserves generalizable structural information. To instantiate this insight, we apply self-supervised alignment between GNNs and MLPs in the embedding space. Experimental results show that SimMLP generalizes to unseen nodes, is robust against noisy graphs and label scarcity, and is flexible across various graph-related tasks.

Acknowledgements.
This work was partially supported by the NSF under grants IIS-2321504, IIS-2334193, IIS-2340346, IIS-2217239, CNS-2426514, CNS-2203261, and CMMI-2146076. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the sponsors.

References

  • Alemi et al. (2017) Alexander A. Alemi, Ian Fischer, Joshua V. Dillon, and Kevin Murphy. 2017. Deep Variational Information Bottleneck. In ICLR.
  • Bachman et al. (2020) Philip Bachman, R Devon Hjelm, and William Buchwalter. 2020. Learning representations by maximizing mutual information across views. In NeurIPS.
  • Batson and Royer (2019) Joshua Batson and Loic Royer. 2019. Noise2self: Blind denoising by self-supervision. In ICML.
  • Belghazi et al. (2018) Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeshwar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and Devon Hjelm. 2018. Mutual information neural estimation. In ICML.
  • Boudiaf et al. (2020) Malik Boudiaf, Jérôme Rony, Imtiaz Masud Ziko, Eric Granger, Marco Pedersoli, Pablo Piantanida, and Ismail Ben Ayed. 2020. A unifying mutual information view of metric learning: cross-entropy vs. pairwise losses. In ECCV.
  • Chen et al. (2020b) Deli Chen, Yankai Lin, Wei Li, Peng Li, Jie Zhou, and Xu Sun. 2020b. Measuring and relieving the over-smoothing problem for graph neural networks from the topological view. In AAAI.
  • Chen et al. (2021) Lei Chen, Zhengdao Chen, and Joan Bruna. 2021. On Graph Neural Networks versus Graph-Augmented MLPs. In ICLR.
  • Chen et al. (2020c) Ming Chen, Zhewei Wei, Zengfeng Huang, Bolin Ding, and Yaliang Li. 2020c. Simple and deep graph convolutional networks. In ICML.
  • Chen et al. (2020a) Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020a. A simple framework for contrastive learning of visual representations. In ICML.
  • Ding et al. (2021) Mucong Ding, Kezhi Kong, Jingling Li, Chen Zhu, John Dickerson, Furong Huang, and Tom Goldstein. 2021. VQ-GNN: A universal framework to scale up graph neural networks using vector quantization. In NeurIPS.
  • Dong et al. (2022) Wei Dong, Junsheng Wu, Yi Luo, Zongyuan Ge, and Peng Wang. 2022. Node representation learning in graph via node-to-neighbourhood mutual information maximization. In CVPR.
  • Dwivedi et al. (2022) Vijay Prakash Dwivedi, Anh Tuan Luu, Thomas Laurent, Yoshua Bengio, and Xavier Bresson. 2022. Graph Neural Networks with Learnable Structural and Positional Representations. In ICLR.
  • Frankle and Carbin (2019) Jonathan Frankle and Michael Carbin. 2019. The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. In ICLR.
  • Gasteiger et al. (2019) Johannes Gasteiger, Aleksandar Bojchevski, and Stephan Günnemann. 2019. Combining Neural Networks with Personalized PageRank for Classification on Graphs. In ICLR.
  • Gilmer et al. (2017) Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. 2017. Neural message passing for quantum chemistry. In ICML.
  • Grill et al. (2020) Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. 2020. Bootstrap your own latent-a new approach to self-supervised learning. In NeurIPS.
  • Guo et al. (2023) Zhichun Guo, William Shiao, Shichang Zhang, Yozen Liu, Nitesh V Chawla, Neil Shah, and Tong Zhao. 2023. Linkless link prediction via relational distillation. In ICML.
  • Gupta et al. (2015) Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. 2015. Deep learning with limited numerical precision. In ICML.
  • Hamilton et al. (2017) Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In NeurIPS.
  • Han et al. (2015) Song Han, Jeff Pool, John Tran, and William Dally. 2015. Learning both weights and connections for efficient neural network. In NeurIPS.
  • Han et al. (2023) Xiaotian Han, Tong Zhao, Yozen Liu, Xia Hu, and Neil Shah. 2023. MLPInit: Embarrassingly Simple GNN Training Acceleration with MLP Initialization. In ICLR.
  • Hao et al. (2021) Bowen Hao, Jing Zhang, Hongzhi Yin, Cuiping Li, and Hong Chen. 2021. Pre-training graph neural networks for cold-start users and items representation. In WSDM.
  • Hassani and Khasahmadi (2020) Kaveh Hassani and Amir Hosein Khasahmadi. 2020. Contrastive multi-view representation learning on graphs. In ICML.
  • He et al. (2020) Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2020. Momentum contrast for unsupervised visual representation learning. In CVPR.
  • Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv (2015).
  • Hou et al. (2019) Yifan Hou, Jian Zhang, James Cheng, Kaili Ma, Richard TB Ma, Hongzhi Chen, and Ming-Chang Yang. 2019. Measuring and Improving the Use of Graph Information in Graph Neural Networks. In ICLR.
  • Hou et al. (2022) Zhenyu Hou, Xiao Liu, Yukuo Cen, Yuxiao Dong, Hongxia Yang, Chunjie Wang, and Jie Tang. 2022. Graphmae: Self-supervised masked graph autoencoders. In KDD.
  • Hu et al. (2020) Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. 2020. Open graph benchmark: Datasets for machine learning on graphs. In NeurIPS.
  • Hu et al. (2021) Yang Hu, Haoxuan You, Zhecan Wang, Zhicheng Wang, Erjin Zhou, and Yue Gao. 2021. Graph-mlp: Node classification without message passing in graph. arXiv (2021).
  • Huang et al. (2023) Weiran Huang, Mingyang Yi, Xuyang Zhao, and Zihao Jiang. 2023. Towards the Generalization of Contrastive Self-Supervised Learning. In ICLR.
  • Jacob et al. (2018) Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. 2018. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In CVPR.
  • Kipf and Welling (2017) Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In ICLR.
  • Lee et al. (2020) Hankook Lee, Sung Ju Hwang, and Jinwoo Shin. 2020. Self-supervised label augmentation via input transformations. In ICML.
  • Li et al. (2019) Guohao Li, Matthias Muller, Ali Thabet, and Bernard Ghanem. 2019. Deepgcns: Can gcns go as deep as cnns?. In ICCV.
  • Li et al. (2022) Xiang Li, Renyu Zhu, Yao Cheng, Caihua Shan, Siqiang Luo, Dongsheng Li, and Weining Qian. 2022. Finding global homophily in graph neural networks when meeting heterophily. In ICML.
  • Lin et al. (2017) Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. 2017. Focal loss for dense object detection. In ICCV.
  • Liu et al. (2022) Siwei Liu, Iadh Ounis, and Craig Macdonald. 2022. An MLP-based algorithm for efficient contrastive graph recommendations. In SIGIR.
  • Liu et al. (2024) Zheyuan Liu, Xiaoxin He, Yijun Tian, and Nitesh V Chawla. 2024. Can we soft prompt LLMs for graph learning tasks?. In WWW.
  • Liu et al. (2023) Zheyuan Liu, Chunhui Zhang, Yijun Tian, Erchi Zhang, Chao Huang, Yanfang Ye, and Chuxu Zhang. 2023. Fair graph representation learning via diverse mixture-of-experts. In WWW.
  • Luan et al. (2022) Sitao Luan, Chenqing Hua, Qincheng Lu, Jiaqi Zhu, Mingde Zhao, Shuyuan Zhang, Xiao-Wen Chang, and Doina Precup. 2022. Revisiting heterophily for graph neural networks. In NeurIPS.
  • Mernyei and Cangea (2020) Péter Mernyei and Cătălina Cangea. 2020. Wiki-cs: A wikipedia-based benchmark for graph neural networks. arXiv (2020).
  • Morris et al. (2020) Christopher Morris, Nils M Kriege, Franka Bause, Kristian Kersting, Petra Mutzel, and Marion Neumann. 2020. Tudataset: A collection of benchmark datasets for learning with graphs. arXiv (2020).
  • Muthén (2004) Bengt Muthén. 2004. Latent variable analysis. The Sage handbook of quantitative methodology for the social sciences (2004).
  • Narayanan et al. (2017) Annamalai Narayanan, Mahinthan Chandramohan, Rajasekar Venkatesan, Lihui Chen, Yang Liu, and Shantanu Jaiswal. 2017. graph2vec: Learning distributed representations of graphs. arXiv (2017).
  • Shchur et al. (2018) Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, and Stephan Günnemann. 2018. Pitfalls of graph neural network evaluation. arXiv (2018).
  • Shervashidze et al. (2011) Nino Shervashidze, Pascal Schweitzer, Erik Jan Van Leeuwen, Kurt Mehlhorn, and Karsten M Borgwardt. 2011. Weisfeiler-lehman graph kernels. JMLR (2011).
  • Shi and Malik (2000) Jianbo Shi and Jitendra Malik. 2000. Normalized cuts and image segmentation. TPAMI (2000).
  • Shwartz-Ziv and Tishby (2017) Ravid Shwartz-Ziv and Naftali Tishby. 2017. Opening the black box of deep neural networks via information. arXiv (2017).
  • Stoer and Wagner (1997) Mechthild Stoer and Frank Wagner. 1997. A simple min-cut algorithm. J. ACM (1997).
  • Sun et al. (2020) Fan-Yun Sun, Jordan Hoffman, Vikas Verma, and Jian Tang. 2020. InfoGraph: Unsupervised and Semi-supervised Graph-Level Representation Learning via Mutual Information Maximization. In ICLR.
  • Sun et al. (2019) Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. 2019. Patient knowledge distillation for bert model compression. In EMNLP.
  • Sun et al. (2023) Xiangguo Sun, Hong Cheng, Jia Li, Bo Liu, and Jihong Guan. 2023. All in One: Multi-Task Prompting for Graph Neural Networks. In KDD.
  • Thakoor et al. (2022) Shantanu Thakoor, Corentin Tallec, Mohammad Gheshlaghi Azar, Mehdi Azabou, Eva L Dyer, Remi Munos, Petar Veličković, and Michal Valko. 2022. Large-Scale Representation Learning on Graphs via Bootstrapping. In ICLR.
  • Tian et al. (2020a) Yonglong Tian, Dilip Krishnan, and Phillip Isola. 2020a. Contrastive Representation Distillation. In ICLR.
  • Tian et al. (2020b) Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, and Phillip Isola. 2020b. What makes for good views for contrastive learning? NeurIPS (2020).
  • Tian et al. (2023) Yijun Tian, Chuxu Zhang, Zhichun Guo, Xiangliang Zhang, and Nitesh Chawla. 2023. Learning MLPs on graphs: A unified view of effectiveness, robustness, and efficiency. In ICLR.
  • Tishby et al. (2000) Naftali Tishby, Fernando C Pereira, and William Bialek. 2000. The information bottleneck method. arXiv (2000).
  • Tishby and Zaslavsky (2015) Naftali Tishby and Noga Zaslavsky. 2015. Deep learning and the information bottleneck principle. In ITW.
  • Van Den Oord et al. (2017) Aaron Van Den Oord, Oriol Vinyals, et al. 2017. Neural discrete representation learning. In NeurIPS.
  • Veličković et al. (2018) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph Attention Networks. In ICLR.
  • Veličković et al. (2019) Petar Veličković, William Fedus, William L. Hamilton, Pietro Liò, Yoshua Bengio, and R Devon Hjelm. 2019. Deep Graph Infomax. In ICLR.
  • Virmaux and Scaman (2018) Aladin Virmaux and Kevin Scaman. 2018. Lipschitz regularity of deep neural networks: analysis and efficient estimation. In NeurIPS.
  • Wang et al. (2021) Xuhong Wang, Ding Lyu, Mengjian Li, Yang Xia, Qi Yang, Xinwen Wang, Xinguang Wang, Ping Cui, Yupu Yang, Bowen Sun, et al. 2021. Apan: Asynchronous propagation attention network for real-time temporal graph embedding. In SIGMOD.
  • Wang et al. (2023a) Yiwei Wang, Bryan Hooi, Yozen Liu, and Neil Shah. 2023a. Graph Explicit Neural Networks: Explicitly Encoding Graphs for Efficient and Accurate Inference. In WSDM.
  • Wang et al. (2023b) Zehong Wang, Qi Li, Donghua Yu, Xiaolong Han, Xiao-Zhi Gao, and Shigen Shen. 2023b. Heterogeneous graph contrastive multi-view learning. In SDM.
  • Wang et al. (2024a) Zehong Wang, Donghua Yu, Shigen Shen, Shichao Zhang, Huawen Liu, Shuang Yao, and Maozu Guo. 2024a. Select Your Own Counterparts: Self-Supervised Graph Contrastive Learning With Positive Sampling. TNNLS (2024).
  • Wang et al. (2024b) Zehong Wang, Zheyuan Zhang, Nitesh V Chawla, Chuxu Zhang, and Yanfang Ye. 2024b. GFT: Graph Foundation Model with Transferable Tree Vocabulary. In NeurIPS.
  • Wang et al. (2024c) Zehong Wang, Zheyuan Zhang, Chuxu Zhang, and Yanfang Ye. 2024c. Subgraph Pooling: Tackling Negative Transfer on Graphs. In IJCAI.
  • Wu et al. (2019) Felix Wu, Amauri Souza, Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Weinberger. 2019. Simplifying graph convolutional networks. In ICML.
  • Xiao et al. (2024a) Teng Xiao, Huaisheng Zhu, Zhengyu Chen, and Suhang Wang. 2024a. Simple and asymmetric graph contrastive learning without augmentations. In NeurIPS.
  • Xiao et al. (2024b) Teng Xiao, Huaisheng Zhu, Zhiwei Zhang, Zhimeng Guo, Charu C Aggarwal, Suhang Wang, and Vasant G Honavar. 2024b. Efficient Contrastive Learning for Fast and Accurate Inference on Graphs. In ICML.
  • Xie et al. (2022) Yaochen Xie, Zhao Xu, and Shuiwang Ji. 2022. Self-supervised representation learning via latent graph prediction. In ICML.
  • Xu et al. (2019) Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2019. How Powerful are Graph Neural Networks?. In ICLR.
  • Yan et al. (2020) Bencheng Yan, Chaokun Wang, Gaoyang Guo, and Yunkai Lou. 2020. Tinygnn: Learning efficient graph neural networks. In KDD.
  • Yanardag and Vishwanathan (2015) Pinar Yanardag and SVN Vishwanathan. 2015. Deep graph kernels. In KDD.
  • Yang et al. (2021) Cheng Yang, Jiawei Liu, and Chuan Shi. 2021. Extract the knowledge of graph neural networks and go beyond it: An effective knowledge distillation framework. In WWW.
  • Yang et al. (2023) Chenxiao Yang, Qitian Wu, Jiahua Wang, and Junchi Yan. 2023. Graph Neural Networks are Inherently Good Generalizers: Insights by Bridging GNNs and MLPs. In ICLR.
  • Yang et al. (2024) Ling Yang, Ye Tian, Minkai Xu, Zhongyi Liu, Shenda Hong, Wei Qu, Wentao Zhang, Bin Cui, Muhan Zhang, and Jure Leskovec. 2024. VQGraph: Rethinking Graph Representation Space for Bridging GNNs and MLPs. In ICLR.
  • Yang et al. (2016) Zhilin Yang, William Cohen, and Ruslan Salakhudinov. 2016. Revisiting semi-supervised learning with graph embeddings. In ICML.
  • You et al. (2021) Yuning You, Tianlong Chen, Yang Shen, and Zhangyang Wang. 2021. Graph contrastive learning automated. In ICML.
  • You et al. (2020) Yuning You, Tianlong Chen, Yongduo Sui, Ting Chen, Zhangyang Wang, and Yang Shen. 2020. Graph contrastive learning with augmentations. In NeurIPS.
  • Zeng et al. (2020) Hanqing Zeng, Hongkuan Zhou, Ajitesh Srivastava, Rajgopal Kannan, and Viktor Prasanna. 2020. GraphSAINT: Graph Sampling Based Inductive Learning Method. In ICLR.
  • Zhang et al. (2022) Shichang Zhang, Yozen Liu, Yizhou Sun, and Neil Shah. 2022. Graph-less Neural Networks: Teaching Old MLPs New Tricks Via Distillation. In ICLR.
  • Zhang et al. (2024) Zheyuan Zhang, Zehong Wang, Shifu Hou, Evan Hall, Landon Bachman, Jasmine White, Vincent Galassi, Nitesh V Chawla, Chuxu Zhang, and Yanfang Ye. 2024. Diet-ODIN: A Novel Framework for Opioid Misuse Detection with Interpretable Dietary Patterns. In KDD.
  • Zhao et al. (2021) Tong Zhao, Yozen Liu, Leonardo Neves, Oliver Woodford, Meng Jiang, and Neil Shah. 2021. Data augmentation for graph neural networks. In AAAI.
  • Zheng et al. (2022) Wenqing Zheng, Edward W Huang, Nikhil Rao, Sumeet Katariya, Zhangyang Wang, and Karthik Subbian. 2022. Cold Brew: Distilling Graph Node Representations with Incomplete or Missing Neighborhoods. In ICLR.
  • Zhou et al. (2021) Hongkuan Zhou, Ajitesh Srivastava, Hanqing Zeng, Rajgopal Kannan, and Viktor Prasanna. 2021. Accelerating large scale real-time GNN inference using channel pruning. In VLDB.
  • Zhu et al. (2020) Yanqiao Zhu, Yichen Xu, Feng Yu, Qiang Liu, Shu Wu, and Liang Wang. 2020. Deep graph contrastive representation learning. arXiv (2020).
  • Zhu et al. (2021) Yanqiao Zhu, Yichen Xu, Feng Yu, Qiang Liu, Shu Wu, and Liang Wang. 2021. Graph contrastive learning with adaptive augmentation. In WWW.

Appendix A Proof

A.1. Proof of Proposition 4.1

Proof.

Our proof is based on a mild assumption:

  • The graph $\mathcal{G}=(\mathbf{A},\mathbf{X})$ is sampled from a latent graph $\mathcal{G}_{\mathcal{I}}=(\mathbf{A},\mathbf{F})$ (Xie et al., 2022) following the distribution $\mathcal{G}\sim P(\mathcal{G}_{\mathcal{I}})$, where $\mathbf{F}\in\mathbb{R}^{N\times d}$ represents the latent node semantics. This assumption extends the latent variable assumption, a general assumption in statistics and machine learning based on the principle that observed data are sampled from an unobserved distribution.

Then, we introduce some notations used in the proof:

  • The MLP encoder $\mathcal{E}$ and the decoder $\mathcal{D}$ are defined as fully-connected layers, which ensures $l$-Lipschitz continuity with respect to the $l_{2}$-norm. This is a common property of neural networks with continuous activation functions, e.g., ReLU (Virmaux and Scaman, 2018); it is data-agnostic and thus suitable for most real-world graphs. The output of the MLP encoder is $\mathbf{H}^{MLP}=\mathcal{E}(\mathbf{X})$ with $\mathbf{H}^{MLP}\in\mathbb{R}^{N\times d^{\prime}}$.

  • The GNN encoder $\phi$ yields $\mathbf{H}^{GNN}=\phi(\mathbf{X},\mathbf{A})$ with $\mathbf{H}^{GNN}\in\mathbb{R}^{N\times d^{\prime}}$.

  • $\mathbf{F}^{*}\in\mathbb{R}^{N\times d^{\prime}}$ denotes the lossless compression of $\mathbf{F}$ such that $\mathbb{E}[\mathbf{X}\,|\,\mathbf{A},\mathbf{F}^{*}]=\mathbf{F}$.

Equation 5 can be rewritten as:

\begin{align}
\mathcal{E}^{*} &= \operatorname*{arg\,min}_{\mathcal{E}} \; \mathbb{E}\left\|\mathbf{H}^{MLP}-\mathbf{H}^{GNN}\right\|^{2}+\left\|\mathcal{D}(\mathbf{H}^{GNN})-\mathbf{X}\right\|^{2} \nonumber \\
&= \operatorname*{arg\,min}_{\mathcal{E}} \; \mathbb{E}\left\|(\mathbf{H}^{MLP}-\mathbf{F}^{*})-(\mathbf{H}^{GNN}-\mathbf{F}^{*})\right\|^{2}+\left\|\mathcal{D}(\mathbf{H}^{GNN})-\mathbf{X}\right\|^{2} \nonumber \\
&= \operatorname*{arg\,min}_{\mathcal{E}} \; \mathbb{E}\left[\left\|\mathbf{H}^{MLP}-\mathbf{F}^{*}\right\|^{2}+\left\|\mathbf{H}^{GNN}-\mathbf{F}^{*}\right\|^{2}+\left\|\mathcal{D}(\mathbf{H}^{GNN})-\mathbf{X}\right\|^{2}\right]-2\,\mathbb{E}\left[\left\langle\mathbf{H}^{MLP}-\mathbf{F}^{*},\,\mathbf{H}^{GNN}-\mathbf{F}^{*}\right\rangle\right] \nonumber \\
&= \operatorname*{arg\,min}_{\mathcal{E}} \; \mathbb{E}\left[\left\|\mathbf{H}^{MLP}-\mathbf{F}^{*}\right\|^{2}+\left\|\mathbf{H}^{GNN}-\mathbf{F}^{*}\right\|^{2}+\left\|\mathcal{D}(\mathbf{H}^{GNN})-\mathbf{X}\right\|^{2}\right]-2\,\mathbb{E}_{\mathbf{F}^{*}}\left[\sum_{i}\left(\mathbf{H}_{i}^{MLP}-\mathbf{F}^{*}_{i}\right)\left(\mathbf{H}_{i}^{GNN}-\mathbf{F}^{*}_{i}\right)\,\middle|\,\mathbf{F}^{*}\right] \nonumber \\
&= \operatorname*{arg\,min}_{\mathcal{E}} \; \mathbb{E}\left[\left\|\mathbf{H}^{MLP}-\mathbf{F}^{*}\right\|^{2}+\left\|\mathbf{H}^{GNN}-\mathbf{F}^{*}\right\|^{2}+\left\|\mathcal{D}(\mathbf{H}^{GNN})-\mathbf{X}\right\|^{2}\right]-2\,\mathbb{E}_{\mathbf{F}^{*}}\left[\sum_{i}\mathrm{Cov}\left(\mathbf{H}^{MLP}_{i}-\mathbf{F}^{*},\,\mathbf{H}^{GNN}_{i}-\mathbf{F}^{*}\right)\,\middle|\,\mathbf{F}^{*}\right] \nonumber \\
&= \operatorname*{arg\,min}_{\mathcal{E}} \; \mathbb{E}\left[\left\|\mathbf{H}^{MLP}-\mathbf{F}^{*}\right\|^{2}+\left\|\mathbf{H}^{GNN}-\mathbf{F}^{*}\right\|^{2}+\left\|\mathcal{D}(\mathbf{H}^{GNN})-\mathbf{X}\right\|^{2}\right]-2\,\mathbb{E}_{\mathbf{F}^{*}}\left[\sum_{i}\mathrm{Cov}\left(\mathbf{H}^{MLP}_{i},\,\mathbf{H}^{GNN}_{i}\right)\,\middle|\,\mathbf{F}^{*}\right]. \tag{11}
\end{align}

Then, with a few simple transformations, Equation 11 can be expressed in the form of Equation 5. We explain the four terms in detail. The first two terms, $\left\|\mathbf{H}^{MLP}-\mathbf{F}^{*}\right\|^{2}$ and $\left\|\mathbf{H}^{GNN}-\mathbf{F}^{*}\right\|^{2}$, are the reconstruction errors of the MLP embedding $\mathbf{H}^{MLP}$ and the GNN embedding $\mathbf{H}^{GNN}$ with respect to the latent variable $\mathbf{F}^{*}$, ensuring invariance on the latent graph $\mathcal{G}_{\mathcal{I}}$. The third term $\left\|\mathcal{D}(\mathbf{H}^{GNN})-\mathbf{X}\right\|^{2}$ reconstructs the node features $\mathbf{X}$ from the GNN embeddings $\mathbf{H}^{GNN}$, mitigating the risk of potential distribution shifts. The last term $-2\sum_{i}\mathrm{Cov}(\mathbf{H}^{MLP}_{i},\mathbf{H}^{GNN}_{i})$ maximizes the covariance between GNN and MLP embeddings at each dimension, aligning GNNs and MLPs in the embedding space.
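To make this objective concrete, below is a minimal PyTorch Geometric sketch of the alignment-plus-reconstruction loss as written above; the class and module names (SimMLPSketch, mlp_encoder, decoder) and the plain MSE formulation are illustrative placeholders rather than the released implementation.

```python
import torch.nn.functional as F
from torch import nn
from torch_geometric.nn import GCNConv

class SimMLPSketch(nn.Module):
    """Sketch of E ||H_MLP - H_GNN||^2 + ||D(H_GNN) - X||^2 (Equation 5)."""
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        # MLP encoder E: uses node features only, no neighborhood access.
        self.mlp_encoder = nn.Sequential(
            nn.Linear(in_dim, hid_dim), nn.PReLU(), nn.Linear(hid_dim, hid_dim))
        # GNN encoder phi: sees the graph structure during training.
        self.conv1 = GCNConv(in_dim, hid_dim)
        self.conv2 = GCNConv(hid_dim, hid_dim)
        # Decoder D: reconstructs node features from GNN embeddings.
        self.decoder = nn.Linear(hid_dim, in_dim)

    def forward(self, x, edge_index):
        h_mlp = self.mlp_encoder(x)                                        # H_MLP = E(X)
        h_gnn = self.conv2(F.relu(self.conv1(x, edge_index)), edge_index)  # H_GNN = phi(X, A)
        align = F.mse_loss(h_mlp, h_gnn)               # pull the two embeddings together
        recon = F.mse_loss(self.decoder(h_gnn), x)     # regularize H_GNN toward X
        return align + recon
```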

A.2. Proof of Lemma 4.2

Proof.

We follow Boudiaf et al. (2020) to prove the lemma. We show the equivalence between the two terms by expanding $H(\mathbf{Y};\hat{\mathbf{Y}}|\mathbf{X})$ and $I(\mathbf{X};\mathbf{Y})$. We first expand the mutual information as

(12) $I(\mathbf{X};\mathbf{Y})=H(\mathbf{Y})-H(\mathbf{Y}|\mathbf{X}).$

Maximizing the mutual information $I(\mathbf{X};\mathbf{Y})$ amounts to minimizing the conditional entropy $H(\mathbf{Y}|\mathbf{X})$, since the label entropy $H(\mathbf{Y})$ is a constant and can be ignored.

The cross-entropy $H(\mathbf{Y};\hat{\mathbf{Y}}|\mathbf{X})$ can be written as the combination of the conditional entropy $H(\mathbf{Y}|\mathbf{X})$ and the KL divergence $\mathcal{D}_{KL}(\mathbf{Y}\|\hat{\mathbf{Y}}|\mathbf{X})$:

\begin{align}
H(\mathbf{Y};\hat{\mathbf{Y}}|\mathbf{X}) &= -\sum_{i}(\mathbf{Y}_{i}|\mathbf{X})\log(\hat{\mathbf{Y}}_{i}|\mathbf{X}) \nonumber \\
&= -\sum_{i}(\mathbf{Y}_{i}|\mathbf{X})\log(\mathbf{Y}_{i}|\mathbf{X}) + \sum_{i}(\mathbf{Y}_{i}|\mathbf{X})\log(\mathbf{Y}_{i}|\mathbf{X}) - \sum_{i}(\mathbf{Y}_{i}|\mathbf{X})\log(\hat{\mathbf{Y}}_{i}|\mathbf{X}) \nonumber \\
&= H(\mathbf{Y}|\mathbf{X}) + \sum_{i}(\mathbf{Y}_{i}|\mathbf{X})\log\frac{(\mathbf{Y}_{i}|\mathbf{X})}{(\hat{\mathbf{Y}}_{i}|\mathbf{X})} \nonumber \\
&= H(\mathbf{Y}|\mathbf{X}) + \mathcal{D}_{KL}(\mathbf{Y}\|\hat{\mathbf{Y}}|\mathbf{X}) \tag{13}
\end{align}

Given Equation 13, minimizing the cross-entropy $H(\mathbf{Y};\hat{\mathbf{Y}}|\mathbf{X})$ minimizes $H(\mathbf{Y}|\mathbf{X})$ (as well as $\mathcal{D}_{KL}(\mathbf{Y}\|\hat{\mathbf{Y}}|\mathbf{X})$), which is equivalent to maximizing the mutual information $I(\mathbf{X};\mathbf{Y})$. Based on the analysis in (Boudiaf et al., 2020), Equation 13 can be optimized in a max-min manner. In the first step, we freeze the encoder and optimize only the classifier, which fixes $H(\mathbf{Y}|\mathbf{X})$ and minimizes $\mathcal{D}_{KL}(\mathbf{Y}\|\hat{\mathbf{Y}}|\mathbf{X})$; the KL term ideally vanishes at the end of this step. The second step optimizes the parameters of the encoder while keeping the classifier fixed.
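As a quick numerical sanity check of Equation 13, the following snippet verifies that the cross-entropy between two discrete distributions equals the entropy of the first plus their KL divergence; the distributions p and q are arbitrary illustrative values.

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # ground-truth distribution, playing the role of (Y | X)
q = np.array([0.5, 0.3, 0.2])   # predicted distribution, playing the role of (Y_hat | X)

cross_entropy = -np.sum(p * np.log(q))        # H(Y; Y_hat | X)
entropy       = -np.sum(p * np.log(p))        # H(Y | X)
kl_divergence =  np.sum(p * np.log(p / q))    # D_KL(Y || Y_hat | X)

# Decomposition in Equation 13: cross-entropy = conditional entropy + KL divergence.
assert np.isclose(cross_entropy, entropy + kl_divergence)
print(cross_entropy, entropy + kl_divergence)
```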

A.3. Proof of Proposition 4.3

Proof. We first introduce some notation. We aim to compress the original graph $\mathcal{G}$ into $\mathbf{T}=(\mathbf{H}^{MLP},\mathbf{H}^{GNN})$ while preserving the information of the latent graph $\mathcal{G}_{\mathcal{I}}$. Based on the definition of the information bottleneck (Tishby et al., 2000), the optimal compression is

(14) $\mathbf{T}^{*}=\operatorname*{arg\,min}_{\mathbf{T}} I(\mathcal{G};\mathbf{T})-\beta I(\mathbf{T};\mathcal{G}_{\mathcal{I}}),$

where $\beta$ denotes the Lagrange multiplier and $I(\cdot;\cdot)$ is the mutual information. The optimal compression $\mathbf{T}^{*}$ preserves the essential latent information by maximizing $\beta I(\mathbf{T};\mathcal{G}_{\mathcal{I}})$ and discards the noise contained in the observed data $\mathcal{G}$ by minimizing $I(\mathcal{G};\mathbf{T})$. To handle the objective in a more accessible form, we rewrite it as

\begin{align}
\mathbf{T}^{*} &= \operatorname*{arg\,min}_{\mathbf{T}} \; I(\mathcal{G};\mathbf{T})-\beta I(\mathbf{T};\mathcal{G}_{\mathcal{I}}) \nonumber \\
&= \operatorname*{arg\,min}_{\mathbf{T}} \; (1-\beta)H(\mathbf{T})+\beta H(\mathbf{T}|\mathcal{G}_{\mathcal{I}})-H(\mathbf{T}|\mathcal{G}) \nonumber \\
&= \operatorname*{arg\,min}_{\mathbf{T}} \; H(\mathbf{T})+\lambda H(\mathbf{T}|\mathcal{G}_{\mathcal{I}}) \nonumber \\
&= \operatorname*{arg\,min}_{\mathbf{H}^{MLP},\mathbf{H}^{GNN}} \; \lambda H(\mathbf{H}^{MLP}|\mathcal{G}_{\mathcal{I}}) + H(\mathbf{H}^{GNN}) + \lambda H(\mathbf{H}^{GNN}|\mathcal{G}_{\mathcal{I}}) + H(\mathbf{H}^{MLP}|\mathbf{H}^{GNN}), \tag{15}
\end{align}

where $\lambda=\frac{\beta}{1-\beta}>0$ (assuming $0<\beta<1$) and $H(\cdot)$ denotes entropy. The third equality uses the fact that $\mathbf{T}$ is a deterministic function of the observed graph $\mathcal{G}$, so $H(\mathbf{T}|\mathcal{G})=0$, and then divides the objective by $1-\beta$, which does not change the minimizer.

Appendix B Experimental Setup

B.1. Dataset Statistics

Table 8. The statistics of node classification datasets.
Dataset # Nodes # Edges # Features # Classes
Cora 2,708 10,556 1,433 7
Citeseer 3,327 9,104 3,703 6
Pubmed 19,717 88,648 500 3
Computer 13,752 491,722 767 10
Photo 7,650 238,162 745 8
Co-CS 18,333 163,788 6,805 15
Co-Phys 34,493 495,924 8,415 5
Wiki-CS 11,701 432,246 300 10
Flickr 89,250 899,756 500 7
Arxiv 169,343 1,166,243 128 40
Texas 183 295 1,703 5
Wisconsin 251 515 1,703 5
Actor 7,600 30,019 932 5
Table 9. The statistics of graph classification datasets.
Dataset # Graphs Avg. # Nodes Avg. # Edges # Features # Classes
IMDB-B 1,000 19.8 193.1 - 2
IMDB-M 1,500 13.0 65.9 - 3
COLLAB 5,000 74.5 4,914.4 - 3
PTC-MR 344 14.3 14.7 18 2
MUTAG 118 17.9 39.6 7 2
DD 1,178 284.3 715.6 89 2
PROTEINS 1,113 39.1 145.6 3 2

Node Classification.

We select 10 benchmark datasets to evaluate the performance of SimMLP and other baselines. These datasets are collected from diverse domains, encompassing citation networks, social networks, Wikipedia networks, etc. We present the statistics of these datasets in Table 8. Specifically, Cora, Citeseer, and Pubmed (Yang et al., 2016) are three citation networks, in which nodes denote papers and edges represent citations; node features are bag-of-words vectors built from paper keywords. Computer (Amazon-CS) and Photo (Amazon-Photo) (Shchur et al., 2018) are two co-purchase networks that describe frequent co-purchases between items (nodes). Co-CS (Coauthor-CS) and Co-Phys (Coauthor-Physics) (Shchur et al., 2018) consist of nodes representing authors and edges indicating collaborations between authors. Wiki-CS (Mernyei and Cangea, 2020) is extracted from Wikipedia, comprising computer science articles (nodes) connected by hyperlinks (edges). Flickr (Zeng et al., 2020) consists of online images, with the goal of categorizing images based on their descriptions and common properties. All of these datasets are available through PyG (PyTorch Geometric), and we partition them randomly into training, validation, and testing sets with a split ratio of 10%/10%/80%. Additionally, we employ the Arxiv dataset from the OGB benchmark (Hu et al., 2020) to evaluate model performance on large-scale graphs, processing it in PyG via the public OGB interfaces with the standard public split.
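For reference, below is a minimal PyG sketch of loading one of these datasets and drawing a random 10%/10%/80% node split; the helper random_split is an illustrative stand-in for the split utility used in our pipeline.

```python
import torch
from torch_geometric.datasets import Planetoid

# Load Cora; the Citeseer/Pubmed/Amazon/Coauthor datasets follow the same pattern in PyG.
dataset = Planetoid(root="data/Planetoid", name="Cora")
data = dataset[0]

def random_split(num_nodes, train_ratio=0.1, val_ratio=0.1, seed=0):
    """Random 10%/10%/80% node split used for the non-OGB datasets."""
    g = torch.Generator().manual_seed(seed)
    perm = torch.randperm(num_nodes, generator=g)
    n_train, n_val = int(train_ratio * num_nodes), int(val_ratio * num_nodes)
    masks = [torch.zeros(num_nodes, dtype=torch.bool) for _ in range(3)]
    masks[0][perm[:n_train]] = True                  # training nodes
    masks[1][perm[n_train:n_train + n_val]] = True   # validation nodes
    masks[2][perm[n_train + n_val:]] = True          # test nodes
    return masks

data.train_mask, data.val_mask, data.test_mask = random_split(data.num_nodes)
```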

Link Prediction.

We use five public benchmark datasets, including Cora, Citeseer, Pubmed, Computer, and Photo, to evaluate link prediction performance. The statistics are presented in Table 8, and we adopt the hyper-parameters listed in Table 10.

Graph Classification.

For graph classification, all datasets are sourced from the TU datasets (Morris et al., 2020), which are available in the PyG library, including biochemical molecule datasets (PTC-MR, MUTAG, DD, PROTEINS) and social networks (IMDB-B, IMDB-M, COLLAB). Table 9 shows the statistics of these datasets. For PTC-MR and DD, we utilize the original node features, whereas for the other datasets, which lack rich node features, we generate one-hot features based on node degrees. We use 10-fold cross-validation to evaluate model performance.
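The following sketch illustrates this graph-level protocol, generating one-hot degree features for a dataset without node attributes and iterating over 10 cross-validation folds; the dataset name and the fold-handling code are illustrative assumptions.

```python
import numpy as np
import torch
from sklearn.model_selection import StratifiedKFold
from torch_geometric.datasets import TUDataset
from torch_geometric.transforms import OneHotDegree
from torch_geometric.utils import degree

# IMDB-B has no node attributes, so we encode node degrees as one-hot features.
dataset = TUDataset(root="data/TU", name="IMDB-BINARY")
max_deg = int(max(degree(g.edge_index[0], g.num_nodes).max() for g in dataset))
dataset.transform = OneHotDegree(max_degree=max_deg)
labels = np.array([int(g.y) for g in dataset])

# Standard 10-fold cross-validation over graphs.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(skf.split(np.zeros(len(dataset)), labels)):
    train_set = dataset[torch.as_tensor(train_idx)]
    test_set = dataset[torch.as_tensor(test_idx)]
    # ... train the encoder on train_set and evaluate on test_set ...
    print(f"fold {fold}: {len(train_set)} train / {len(test_set)} test graphs")
```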

B.2. Summary of Baselines

We compare SimMLP against a range of baselines, encompassing supervised GNNs, self-supervised graph contrastive learning (GCL) methods, and MLP-based graph learning methods.

Node Classification.

Supervised GNNs: Our primary node classification baselines include GraphSAGE (Hamilton et al., 2017) and GAT (Veličković et al., 2018); we also include SGC (Wu et al., 2019) and APPNP (Gasteiger et al., 2019). Self-supervised GNNs: We compare SimMLP to self-supervised graph learning methods. DGI (Veličković et al., 2019) and MVGRL (Hassani and Khasahmadi, 2020) contrast graph patches with graph summaries to integrate knowledge into node embeddings. GRACE (Zhu et al., 2020) and its successor GCA (Zhu et al., 2021) contrast nodes across two corrupted views to acquire augmentation-invariant embeddings. BGRL (Thakoor et al., 2022) uses a predictive objective for node-level contrastive learning to achieve efficient training. MLPs on Graphs: We employ a basic MLP that considers only node content as a baseline. We also include GraphMLP (Hu et al., 2021), which trains an MLP by enforcing consistency between target nodes and their direct neighborhoods; we exclude (Dong et al., 2022; Liu et al., 2022) as baselines since they are high-order versions of GraphMLP, and instead slightly modify the original GraphMLP to learn high-order information, searching the number of layers within {1, 2, 3}. GLNN (Zhang et al., 2022) employs knowledge distillation to transfer knowledge from GNNs to MLPs; GENN (Wang et al., 2023a) leverages positional encoding to acquire structural knowledge; NOSMOG (Tian et al., 2023) builds on GLNN by jointly integrating positional information and robust training strategies. Since the public code of GENN is not available, we implement it based on the code of NOSMOG. VQGraph (Yang et al., 2024) is a recent state-of-the-art method that leverages a codebook to learn structural information.

Link Prediction.

For baselines, we use the basic GNN, the MLP, and LLP (Guo et al., 2023), a state-of-the-art MLP learning framework for link prediction. We strictly follow the experimental settings of Guo et al. (2023) and adopt AUC as the metric.

Graph Classification.

We use the following baselines. Supervised GNNs: We use a 5-layer GIN (Xu et al., 2019). Self-supervised GNNs: For graph-level tasks, we include traditional graph kernels, i.e., the WL kernel (Shervashidze et al., 2011) and DGK (Yanardag and Vishwanathan, 2015), as well as contrastive learning approaches such as graph2vec (Narayanan et al., 2017), MVGRL (Hassani and Khasahmadi, 2020), InfoGraph (Sun et al., 2020), GraphCL (You et al., 2020), and JOAO (You et al., 2021), which contrast the embeddings of two augmented graphs. MLPs on Graphs: For the standard MLP, we append a pooling function after the encoder to generate graph embeddings for prediction. Other MLP learning baselines cannot be directly applied to graph-level tasks; we therefore extend GLNN (Zhang et al., 2022) to graph classification by distilling knowledge from pre-trained GINs into MLPs on graph-level embeddings, dubbed MLP + KD.
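Below is a minimal sketch of the MLP + KD baseline described above, which distills graph-level embeddings from a frozen pre-trained GIN teacher into an MLP student followed by pooling; the class names, the teacher's call signature, and the choice of an MSE distillation loss on embeddings are illustrative assumptions.

```python
import torch
import torch.nn.functional as F
from torch import nn
from torch_geometric.nn import global_mean_pool

class MLPStudent(nn.Module):
    """MLP encoder followed by mean pooling to produce graph-level embeddings."""
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.PReLU(),
                                 nn.Linear(hid_dim, hid_dim))

    def forward(self, x, batch):
        return global_mean_pool(self.mlp(x), batch)

def distill_step(teacher_gin, student, data, optimizer):
    """One distillation step: match the student's graph embedding to the teacher's."""
    teacher_gin.eval()
    with torch.no_grad():
        # Assumed teacher interface: returns graph-level embeddings for a PyG batch.
        t_emb = teacher_gin(data.x, data.edge_index, data.batch)
    s_emb = student(data.x, data.batch)
    loss = F.mse_loss(s_emb, t_emb)   # embedding-level knowledge distillation
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```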

B.3. Hyper-parameter setting

Table 10. Hyper-parameters used for SimMLP for node-level task.
Node Classification & Link Prediction
Hyper-parameters Cora Citeseer Pubmed Computer Photo Co-CS Co-Phys Wiki-CS Flickr Arxiv
Epochs 1000 1000 1000 1000 1000 2000 1000 2000 2000 5000
Optimizer AdamW used for all datasets
Learning Rate 1e-3 5e-4 5e-4 1e-3 1e-3 1e-4 1e-3 5e-4 1e-3 1e-3
Weight Decay 0 5e-5 1e-5 - 1e-4 - 1e-4 1e-5 5e-4 -
Activation PReLU used for all datasets
Hidden Dimension 512 512 512 512 512 512 512 512 1024 1024
Normalization Batchnorm used for all datasets
# MLP Layers 2 2 2 3 2 2 2 2 2 8
# GNN Layers 2 3 3 2 1 1 1 2 3 3
Feature Mask Ratio 0.50 0.75 0.25 0.25 0.25 0.50 0.75 0.00 0.25 0.00
Edge Mask Ratio 0.25 0.50 0.25 0.25 0.50 0.75 0.50 0.25 0.50 0.25
Table 11. Hyper-parameters of SimMLP on graph-level task.
Graph Classification
Hyper-parameters IMDB-B IMDB-M COLLAB PTC-MR MUTAG DD PROTEINS
Epochs 200 100 30 100 100 100 500
Optimizer AdamW used for all datasets
Learning Rate 1e-2 1e-2 5e-4 1e-2 1e-2 1e-3 1e-3
Weight Decay 0 used for all datasets
Batch Size 64 128 32 64 64 32 64
Hidden Dimension 512 used for all datasets
Pooling MEAN MEAN MEAN SUM SUM MEAN SUM
Activation PReLU used for all datasets
Normalization Batchnorm used for all datasets
Raw Feature N N N Y N Y N
Deg4Feature Y Y Y N Y N Y
# Encoder Layers 2 used for all datasets
# Aggregator Layers 2 2 2 2 1 2 1
Feature Mask Ratio 0.50 0.25 0.75 0.25 0.5 0.00 0.00
Edge Mask Ratio 0.75 0.50 0.75 0.00 0.25 0.00 0.50

We run each experiment 10 times with different seeds to alleviate the impact of randomness, and tune the hyper-parameters of each approach via grid search. Specifically, we set the number of epochs to 1,000, the hidden dimension to 512, and employ PReLU as the activation function. We search learning rates over {1e-4, 5e-4, 1e-3, 5e-3, 1e-2}, weight decay values over {0, 1e-5, 5e-5, 1e-4, 5e-3}, and the number of layers over {1, 2, 3}. For self-supervised methods, we employ a 2-layer GCN (Kipf and Welling, 2017) as the encoder for node-level tasks and assess the quality of the learned embeddings by training a logistic regression classifier on the downstream tasks (Zhu et al., 2020). For the other baselines, we follow the settings reported in the original papers. For SimMLP, Table 10 provides a comprehensive overview of the hyper-parameter settings for node classification. For graph classification, the experimental setting follows the node-level setup above, except that we use a 5-layer GIN (Xu et al., 2019) as the encoder and select the readout function from {MEAN, SUM, MAX}; the corresponding hyper-parameters are listed in Table 11.
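For completeness, a minimal sketch of the linear-probe protocol (Zhu et al., 2020) used to assess embedding quality: fit a logistic regression classifier on frozen embeddings and report test accuracy. The encoder interface and mask handling here are illustrative assumptions.

```python
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

@torch.no_grad()
def linear_probe(encoder, data, train_mask, test_mask):
    """Evaluate frozen embeddings with a logistic regression classifier."""
    encoder.eval()
    z = encoder(data.x).cpu().numpy()   # structure-free MLP embeddings used at inference
    y = data.y.cpu().numpy()
    train_idx = train_mask.cpu().numpy()
    test_idx = test_mask.cpu().numpy()
    clf = LogisticRegression(max_iter=1000)
    clf.fit(z[train_idx], y[train_idx])
    return accuracy_score(y[test_idx], clf.predict(z[test_idx]))
```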

Appendix C Additional Results

C.1. Full Inductive Setting Results

See Table 12.

Table 12. Node classification accuracy (%) under the inductive (production) scenario for both transductive and inductive settings. ind denotes the accuracy on $\mathcal{V}^{I}$, trans denotes the accuracy on $\mathcal{V}_{test}^{T}$, and prod is the interpolated accuracy of ind and trans.
Methods Setting Cora Citeseer Pubmed Computer Photo Co-CS Co-Phys Wiki-CS Flickr OGB-Arxiv Avg.
prod 77.5±1.8 68.4±1.6 85.0±0.4 87.2±0.4 93.2±0.5 92.9±0.4 95.7±0.1 79.3±0.7 47.2±0.7 68.5±0.6 79.5
trans 79.5±1.5 68.7±1.4 85.6±0.3 88.0±0.3 93.7±0.4 93.1±0.3 95.8±0.0 80.0±0.4 48.2±0.6 71.8±0.5 80.4
SAGE ind 69.7±2.9 67.1±2.6 82.9±1.0 84.5±0.9 91.2±0.6 91.9±0.7 95.6±0.1 76.3±1.6 43.3±1.1 55.5±0.8 75.8
prod 77.7±1.1 64.3±1.6 84.0±0.5 87.3±0.5 91.5±0.6 91.3±0.4 94.4±0.3 76.3±1.1 49.1±0.3 69.3±0.4 78.5
trans 77.3±0.9 64.2±1.4 84.0±0.3 87.3±0.4 91.5±0.5 91.3±0.3 94.4±0.3 76.3±1.0 49.1±0.2 70.4±0.4 78.6
BGRL ind 79.4±1.7 65.0±2.2 84.0±1.0 87.6±0.8 91.5±1.1 91.1±0.5 94.3±0.5 76.0±1.6 49.3±0.6 65.0±0.5 78.3
prod 63.8±1.7 64.0±1.2 80.9±0.5 81.0±0.5 87.7±0.9 91.7±0.6 95.2±0.1 75.1±0.7 46.1±0.2 55.9±0.5 74.1
trans 63.7±1.5 63.9±1.1 80.9±0.4 81.1±0.5 87.7±0.9 91.7±0.5 95.2±0.1 75.1±0.4 46.2±0.2 55.9±0.5 74.1
MLP ind 64.2±2.1 64.4±1.8 80.9±0.7 80.8±0.9 87.9±1.0 91.8±0.8 95.2±0.2 74.9±1.8 46.1±0.5 55.9±0.7 74.2
prod 78.3±1.0 69.6±1.1 85.4±0.5 87.0±0.5 93.3±0.4 93.7±0.4 95.8±0.1 78.4±0.5 46.1±0.3 63.5±0.5 79.1
trans 79.9±0.9 69.7±0.8 85.7±0.4 87.8±0.5 93.8±0.4 93.8±0.3 95.8±0.0 78.6±0.3 46.1±0.2 64.3±0.5 79.6
GLNN ind 72.0±1.7 69.1±2.6 84.4±0.9 84.0±0.7 91.1±0.5 93.3±0.5 95.7±0.1 77.6±1.4 46.1±0.4 60.6±0.6 77.4
prod 77.8±1.6 67.3±1.5 84.3±0.5 85.8±1.2 92.1±1.0 93.6±0.4 95.7±0.1 78.3±1.0 45.6±0.5 68.5±0.5 78.9
trans 80.3±1.4 67.9±1.2 85.8±0.4 87.4±1.0 93.4±0.6 93.8±0.4 95.8±0.1 80.3±0.9 45.7±0.5 70.0±0.5 80.0
GENN ind 68.1±2.2 65.1±2.8 78.4±0.8 79.1±1.8 87.1±2.4 92.7±0.5 95.2±0.1 70.1±1.7 45.1±0.7 62.6±0.7 74.3
prod 78.4±1.8 70.4±1.1 85.4±0.6 87.4±1.0 93.3±0.7 93.7±0.4 95.8±0.1 79.0±1.0 46.4±0.4 69.3±0.9 79.9
trans 80.4±1.4 70.4±1.0 85.6±0.3 88.2±0.9 93.8±0.4 93.9±0.4 95.8±0.1 79.4±0.9 46.4±0.3 70.6±0.7 80.4
VQGraph ind 70.4±3.4 70.0±1.6 84.5±1.5 84.3±1.1 91.5±1.8 93.0±0.6 95.7±0.3 77.5±1.7 46.3±0.9 64.0±1.7 77.7
prod 77.8±1.9 68.6±1.4 83.8±0.5 86.6±1.2 92.5±0.7 93.5±0.4 95.8±0.1 78.4±0.7 46.1±0.6 69.1±0.8 79.2
trans 80.3±1.7 69.0±1.2 85.4±0.4 88.3±1.1 93.9±0.5 93.7±0.4 95.9±0.1 80.4±0.6 46.2±0.5 70.5±0.8 80.4
NOSMOG ind 68.1±3.0 67.1±2.1 77.4±0.8 79.8±1.5 87.1±1.5 92.6±0.7 95.5±0.1 70.4±1.2 45.3±0.7 63.5±0.8 74.7
prod 81.4±1.2 72.3±0.9 86.5±0.3 87.7±0.4 93.9±0.3 94.6±0.2 96.0±0.1 79.3±0.8 49.3±0.2 70.2±0.5 81.1
trans 81.6±1.0 72.2±0.7 86.5±0.2 87.7±0.3 93.9±0.3 94.7±0.2 96.1±0.1 79.5±0.7 49.2±0.1 71.3±0.3 81.3
SimMLP ind 80.5±2.2 72.8±1.6 86.4±0.5 87.6±1.0 93.9±0.6 94.5±0.2 96.0±0.2 78.5±1.5 49.4±0.5 66.1±1.1 80.6

C.2. Full Cold Start Results

See Table 13.

Table 13. Node classification accuracy under cold-start setting.
Methods Cora Citeseer Pubmed Computer Photo Co-CS Co-Phys Wiki-CS Flickr OGB-Arxiv Avg.
SAGE (Hamilton et al., 2017) 69.7±2.9 67.1±2.6 82.9±1.0 84.5±0.9 91.2±0.6 91.9±0.7 95.6±0.1 76.3±1.6 43.3±1.1 55.5±0.8 75.8
BGRL (Thakoor et al., 2022) 79.4±1.7 65.0±2.2 84.0±1.0 87.6±0.8 91.5±1.1 91.1±0.5 94.3±0.5 76.0±1.6 49.3±0.6 65.0±0.5 78.3
MLP (Zhang et al., 2022) 64.2±2.1 64.4±1.8 80.9±0.7 80.8±0.9 87.9±1.0 91.8±0.8 95.2±0.2 74.9±1.8 46.1±0.5 55.9±0.7 74.2
GLNN (Zhang et al., 2022) 72.0±1.7 69.1±2.6 84.4±0.9 84.0±0.7 91.1±0.5 93.3±0.5 95.7±0.1 77.6±1.4 46.1±0.4 60.6±0.6 77.4
GENN (Wang et al., 2023a) 68.1±2.2 65.1±2.8 78.4±0.8 79.1±1.8 87.1±2.4 92.7±0.5 95.2±0.1 70.1±1.7 45.1±0.7 62.6±0.7 74.3
VQGraph (Yang et al., 2024) 70.4±3.4 70.0±1.6 84.5±1.5 84.3±1.1 91.5±1.8 93.0±0.6 95.7±0.3 77.5±1.7 46.3±0.9 64.0±1.7 77.7
NOSMOG (Tian et al., 2023) 68.1±3.0 67.1±2.1 77.4±0.8 79.8±1.5 87.1±1.5 92.6±0.7 95.5±0.1 70.4±1.2 45.3±0.7 63.5±0.8 74.7
SimMLP 80.5±2.2 72.8±1.6 86.4±0.5 87.6±1.0 93.9±0.6 94.5±0.2 96.0±0.2 78.5±1.5 49.4±0.5 66.1±1.1 80.6

C.3. Full Inductive Bias Analysis

See Table 14.

Table 14. SimMLP shares two inductive biases with GNNs, i.e., homophily and local structure importance, which are measured by smoothness and min-cut, respectively.
Homophily (Smoothness\downarrow) Local Structure Importance (Min-Cut\uparrow)
Methods Cora Citeseer Pubmed Computer Photo Avg. Cora Citeseer Pubmed Computer Photo Avg.
Raw Node Feature 0.822 0.783 0.734 0.539 0.540 0.684 - - - - - -
SAGE (Hamilton et al., 2017) 0.113 0.184 0.143 0.156 0.109 0.141 0.924 0.943 0.918 0.854 0.872 0.902
BGRL (Thakoor et al., 2022) 0.155 0.102 0.333 0.251 0.203 0.209 0.885 0.935 0.856 0.834 0.849 0.872
MLP (Zhang et al., 2022) 0.463 0.444 0.485 0.456 0.432 0.456 0.666 0.804 0.863 0.718 0.747 0.759
GLNN (Zhang et al., 2022) 0.282 0.268 0.421 0.355 0.398 0.345 0.886 0.916 0.793 0.804 0.811 0.842
NOSMOG (Tian et al., 2023) 0.267 0.230 0.394 0.306 0.277 0.295 0.902 0.932 0.834 0.838 0.823 0.866
VQGraph (Yang et al., 2024) 0.253 0.212 0.396 0.328 0.310 0.300 0.914 0.940 0.831 0.858 0.836 0.876
SimMLP 0.196 0.170 0.360 0.299 0.288 0.263 0.934 0.958 0.886 0.901 0.860 0.908

Appendix D Training Efficiency

Table 15 compares the running time and memory usage of SimMLP with GAT (Veličković et al., 2018), GRACE (Zhu et al., 2020), and BGRL (Thakoor et al., 2022). Apart from its significant inference acceleration, SimMLP also requires less training time and memory. In particular, GAT with 4 attention heads incurs substantial computational cost during training, most likely due to learning the attention scores. GRACE uses an InfoNCE loss to align two corrupted graph views, where the pairwise similarity computations introduce significant overhead; compared to GRACE, SimMLP reduces memory usage by $3.8\sim6.8\times$ and training time by $4.8\sim8.3\times$. BGRL employs bootstrapping (Grill et al., 2020) to remove the need for negative samples in InfoNCE, avoiding the cost of measuring distances between negative pairs; nevertheless, SimMLP remains more efficient than BGRL thanks to its MLP encoder.

Table 15. Computational requirements of different baseline methods on a set of standard benchmark graphs. The experiments are performed on a 24GB Nvidia GeForce RTX 3090.
Computer Photo Coauthor-CS Coauthor-Phys Wiki-CS
Methods Memory Training Time Memory Training Time Memory Training Time Memory Training Time Memory Training Time
GAT 5239 MB 73.8 (s) 2571 MB 41.9 (s) 2539 MB 60.4 (s) 13199 MB 265.2 (s) 4568 MB 74.4 (s)
GRACE 8142 MB 349.5 (s) 2755 MB 138.4 (s) 11643 MB 261.4 (s) 16294 MB 573.2 (s) 5966 MB 290.9 (s)
BGRL 2196 MB 96.8 (s) 1088 MB 64.1 (s) 2513 MB 129.9 (s) 5556 MB 273.8 (s) 1899 MB 108.8 (s)
SimMLP 1969 MB 53.4 (s) 694 MB 27.0 (s) 1716 MB 54.8 (s) 3920 MB 110.7 (s) 1590 MB 35.5 (s)

SimMLP can scale to very large graphs via mini-batch training. We report the time and memory consumption on the OGB-Products dataset in Table 16, comparing SimMLP against BGRL, an efficient SSL method on graphs. The results demonstrate the scalability and practicality of SimMLP in real-world applications.

Table 16. Time and memory consumption on the large-scale OGB-Products dataset.
Method Training Time (Per Epoch) Memory Consumption
BGRL 263s 17,394MB
SimMLP 158s 11,993MB
Table 17. Comparison between MLP-based methods in training the MLP for downstream node classification (5000 epochs).
Cora Citeseer Pubmed Flickr Arxiv
GLNN 1.6s 1.9s 2.0s 2.5s 3.3s
VQGraph 1.9s 2.3s 2.7s 3.2s 4.5s
NOSMOG 2.3s 2.5s 2.7s 3.6s 4.7s
SimMLP 1.6s 1.9s 1.9s 2.5s 3.2s

Appendix E Comprehensive Ablation Study

E.1. The necessity of incorporating structural information

The GNN encoder is essential for SimMLP to learn structure-aware knowledge, as the model directly aligns the embeddings of the two encoders. Without the GNN encoder, the model fails to capture the fine-grained and generalizable correlation between node features and graph structure, as demonstrated in Table 18.

Table 18. Ablation study on incorporating structural information using GNNs. Without the GNN encoder (i.e., using only MLPs), model performance decreases significantly.
Methods Cora Citeseer Pubmed Computer Photo Co-CS Co-Phys Wiki-CS Flickr Arxiv
SimMLP 84.60±0.24 73.52±0.53 86.99±0.09 88.46±0.16 94.28±0.08 94.87±0.07 96.17±0.03 81.21±0.13 49.85±0.09 71.12±0.10
w/o GNN 55.91±0.66 57.36±0.33 79.93±0.32 72.76±0.71 77.05±0.18 91.19±0.13 93.35±0.12 73.87±0.26 45.82±0.07 54.83±0.41

E.2. The design choice of Strategy 1

In this section, we analyze the design choices of Strategy 1, i.e., using an MLP to approximate a GNN. The learning process is similar to SGC (Wu et al., 2019) and APPNP (Gasteiger et al., 2019), which decouple feature transformation from message passing. In SimMLP, we use a normalized aggregation matrix to direct message passing due to its simplicity. Depending on the normalization, we have three design choices: (1) Col: the column-normalized matrix $\tilde{\mathbf{D}}^{-1}\tilde{\mathbf{A}}$, (2) Row: the row-normalized matrix $\tilde{\mathbf{A}}\tilde{\mathbf{D}}^{-1}$, and (3) Bi: the bi-normalized matrix $\tilde{\mathbf{D}}^{-1/2}\tilde{\mathbf{A}}\tilde{\mathbf{D}}^{-1/2}$, where $\tilde{\mathbf{A}}=\mathbf{A}+\mathbf{I}$ and $\tilde{\mathbf{D}}$ is the diagonal degree matrix of $\tilde{\mathbf{A}}$. We present the results of these three choices on ten benchmark datasets in Table 19.

Table 19. Ablation study on node aggregation choices. Col indicates the column-normalized aggregation matrix $\tilde{\mathbf{D}}^{-1}\tilde{\mathbf{A}}$, Row indicates the row-normalized aggregation matrix $\tilde{\mathbf{A}}\tilde{\mathbf{D}}^{-1}$, and Bi. indicates the bi-normalized aggregation matrix $\tilde{\mathbf{D}}^{-1/2}\tilde{\mathbf{A}}\tilde{\mathbf{D}}^{-1/2}$. SimMLP employs Bi. since it consistently outperforms the others, even though the improvements may not be significant.
Cora Citeseer Pubmed Computer Photo Co-CS Co-Phys Wiki-CS Flickr Arxiv
Bi. 84.60±0.24 73.52±0.53 86.99±0.09 88.46±0.16 94.28±0.08 94.87±0.07 96.17±0.03 81.21±0.13 49.85±0.09 71.12±0.10
Col 84.14±0.34 73.48±0.53 86.92±0.08 87.93±0.27 93.11±0.15 94.81±0.06 96.09±0.03 80.62±0.30 49.15±0.16 71.03±0.09
Row 84.09±0.32 73.49±0.54 86.92±0.08 87.96±0.27 93.07±0.15 94.82±0.06 96.07±0.04 80.63±0.25 49.18±0.10 71.04±0.09

In this table, we observe no significant performance difference among the various aggregation methods; all of them achieve desirable performance. Nevertheless, the bi-normalized aggregation (Bi.) consistently outperforms the others. In principle, we could directly use the message passing functions of SGC or APPNP. For SGC, we do not observe significant performance differences compared to the three choices discussed above. For APPNP, which performs message passing based on PageRank, computing the PageRank aggregation matrix would incur significant time cost, especially with graph structural augmentations; we leave this to future work.

Figure 7. Node classification accuracy in the transductive setting with different numbers of aggregation layers (panels a–e).

Apart from the choice of message passing scheme, the number of message passing layers is also important. Figure 7 shows the performance of SimMLP with varying numbers of message passing layers on five benchmark datasets. The optimal performance is achieved with 2 or 3 layers, consistent with prior research on GNNs (Li et al., 2019); a larger number of message passing layers can lead to over-smoothing.
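To make the three normalization choices and the number of aggregation steps concrete, the following dense-tensor sketch builds the Col, Row, and Bi propagation matrices over $\tilde{\mathbf{A}}=\mathbf{A}+\mathbf{I}$ and applies $K$ aggregation steps in an SGC-like manner; it is purely illustrative and not the sparse implementation used in practice.

```python
import torch

def propagation_matrix(adj, mode="bi"):
    """Build a normalized propagation matrix over A~ = A + I (dense, for illustration)."""
    a_tilde = adj + torch.eye(adj.size(0))
    deg = a_tilde.sum(dim=1)                       # degrees of A~
    if mode == "col":                              # D~^{-1} A~
        return torch.diag(deg.pow(-1)) @ a_tilde
    if mode == "row":                              # A~ D~^{-1}
        return a_tilde @ torch.diag(deg.pow(-1))
    if mode == "bi":                               # D~^{-1/2} A~ D~^{-1/2}
        d_inv_sqrt = torch.diag(deg.pow(-0.5))
        return d_inv_sqrt @ a_tilde @ d_inv_sqrt
    raise ValueError(f"unknown mode: {mode}")

def propagate(x, adj, k=2, mode="bi"):
    """Apply K message-passing (aggregation) steps to the node features."""
    p = propagation_matrix(adj, mode)
    for _ in range(k):
        x = p @ x
    return x
```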

E.3. How does Strategy 2 (augmentation) prevent trivial solutions?

Additionally, we conduct a detailed analysis of how augmentations impact model performance. Figure 8 illustrates model performance at different augmentation probabilities on the Cora, Citeseer, Pubmed, Computer, and Photo datasets under the transductive setting, with the augmentation ratio searched over {0.0, 0.25, 0.5, 0.75}. These figures provide insight into the specific effects of augmentation on model performance.

Figure 8. Node classification accuracy in the transductive setting with different augmentation ratios (panels a–e).
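For reference, a minimal sketch of the two augmentations tuned above, random feature masking and edge masking, implemented with simple Bernoulli masks; the function names and the exact masking granularity are illustrative assumptions.

```python
import torch

def mask_features(x, mask_ratio=0.25):
    """Zero out a random subset of feature dimensions for all nodes."""
    if mask_ratio <= 0:
        return x
    keep = torch.rand(x.size(1), device=x.device) >= mask_ratio
    return x * keep.float()

def mask_edges(edge_index, mask_ratio=0.25):
    """Randomly drop a subset of edges from the edge list."""
    if mask_ratio <= 0:
        return edge_index
    keep = torch.rand(edge_index.size(1), device=edge_index.device) >= mask_ratio
    return edge_index[:, keep]

# Example: generate one augmented view of a PyG `data` object.
# x_aug = mask_features(data.x, 0.25); ei_aug = mask_edges(data.edge_index, 0.5)
```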

E.4. How does the reconstruction term in Equation 4 work?

In this section, we evaluate the role of the reconstruction term of SimMLP in Equation 4. We treat this term as a regularizer that mitigates potential distribution shifts; similar to positional embeddings (Dwivedi et al., 2022), it preserves more localized information in the GNN embeddings. Table 20 shows the impact of the reconstruction term on model performance. Our observations indicate that the reconstruction term matters most on large-scale datasets such as Arxiv, possibly because these datasets contain more noise.

Table 20. The reconstruction term in Equation 4 serves as a regularizer that prevents potential distribution shifts.
Methods Cora Citeseer Pubmed Computer Photo Co-CS Co-Phys Wiki-CS Flickr Arxiv
SimMLP 84.60±0.24 73.52±0.53 86.99±0.09 88.46±0.16 94.28±0.08 94.87±0.07 96.17±0.03 81.21±0.13 49.85±0.09 71.12±0.10
w/o Rec. 84.37±0.27 73.18±0.24 86.86±0.10 88.25±0.07 94.15±0.07 94.64±0.06 96.01±0.07 81.10±0.13 49.60±0.11 70.38±0.22

E.5. Different GNN Architectures

It is straightforward to adapt our approach from GCN to other GNN architectures. Table 21 reports results with GCN, SAGE, and APPNP backbones on Cora, Citeseer, and Pubmed. SimMLP consistently enhances performance across GNN architectures, owing to the capability of SSL to capture generalizable patterns, underscoring its adaptability and effectiveness across a wide range of tasks and architectures.

Table 21. Model performance with different GNN backbones.
GCN GCN + SimMLP SAGE SAGE + SimMLP APPNP APPNP + SimMLP
Cora 82.1 84.6 81.4 84.1 81.4 84.3
Citeseer 70.7 73.5 70.4 73.5 70.3 73.6
Pubmed 85.6 87.0 85.9 86.9 85.7 86.8
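A minimal sketch of swapping the GNN backbone used as the structure-aware encoder is given below; the constructor arguments are illustrative, and the APPNP variant is assumed to combine a linear transformation with PyG's APPNP propagation layer.

```python
import torch.nn.functional as F
from torch import nn
from torch_geometric.nn import GCNConv, SAGEConv, APPNP

class GNNEncoder(nn.Module):
    """Structure-aware encoder with a configurable backbone (GCN, SAGE, or APPNP)."""
    def __init__(self, in_dim, hid_dim, backbone="gcn"):
        super().__init__()
        self.backbone = backbone
        if backbone == "gcn":
            self.conv1, self.conv2 = GCNConv(in_dim, hid_dim), GCNConv(hid_dim, hid_dim)
        elif backbone == "sage":
            self.conv1, self.conv2 = SAGEConv(in_dim, hid_dim), SAGEConv(hid_dim, hid_dim)
        elif backbone == "appnp":
            self.lin = nn.Linear(in_dim, hid_dim)
            self.prop = APPNP(K=10, alpha=0.1)   # PageRank-style propagation
        else:
            raise ValueError(f"unknown backbone: {backbone}")

    def forward(self, x, edge_index):
        if self.backbone == "appnp":
            return self.prop(self.lin(x), edge_index)
        return self.conv2(F.relu(self.conv1(x, edge_index)), edge_index)
```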