

Graph Contrastive Learning with
Cross-view Reconstruction

Qianlong Wen1, Zhongyu Ouyang1, Chunhui Zhang2, Yiyue Qian1,
Yanfang Ye1, Chuxu Zhang2
1University of Notre Dame, 2Brandeis University
1{qwen,zouyang2,yqian5,yye7}@nd.edu; 2{chunhuizhang,chuxuzhang}@brandeis.edu
Abstract

Graph self-supervised learning is commonly taken as an effective framework to tackle the supervision shortage issue in graph learning tasks. Among existing graph self-supervised learning strategies, graph contrastive learning (GCL) has been one of the most prevalent approaches to this problem. Despite their remarkable performance, existing GCL methods that heavily depend on manually designed augmentation techniques still struggle to alleviate the feature suppression issue without risking the loss of task-relevant information. Consequently, the learned representation is either brittle or unilluminating. In light of this, we introduce Graph Contrastive Learning with Cross-View Reconstruction (GraphCV), which follows the information bottleneck principle to learn minimal yet sufficient representations from graph data. Specifically, GraphCV aims to elicit the predictive (useful for downstream instance discrimination) and other non-predictive features separately. Besides the conventional contrastive loss, which guarantees the consistency and sufficiency of the representations across different augmentation views, we introduce a cross-view reconstruction mechanism to pursue the disentanglement of the two learned representations. In addition, an adversarial view perturbed from the original view is added as a third view for the contrastive loss to guarantee the intactness of the global semantics and improve the representation robustness. We empirically demonstrate that our proposed model outperforms the state-of-the-art on the graph classification task over multiple benchmark datasets.

1 Introduction

Graph representation learning (GRL) has attracted significant attention due to its widespread applications in real-world interaction systems, such as social, molecular, biological, and citation networks (Hamilton et al., 2017b). The current state-of-the-art supervised GRL methods are mostly based on Graph Neural Networks (GNNs) (Kipf & Welling, 2017; Veličković et al., 2018; Hamilton et al., 2017a; Xu et al., 2019), which require a large amount of task-specific supervision. Despite their remarkable performance, they are usually limited by the deficiency of label supervision in real-world graph data: it is usually easy to collect unlabeled graphs but very costly to obtain enough annotated labels, especially in certain fields like biochemistry. Therefore, many recent works (Qiu et al., 2020; Hassani & Khasahmadi, 2020a; Sun et al., 2019) study how to fully utilize the unlabeled information on graphs, further stimulating the application of self-supervised learning (SSL) to GRL, where only limited or even no labels are needed.

As a prevalent and effective strategy of SSL, contrastive learning follows the mutual information maximization principle (InfoMax) (Veličković et al., 2019) to maximize the agreement of positive pairs while minimizing that of negative pairs in the embedding space. However, the graph contrastive learning paradigm guided by the InfoMax principle is insufficient to learn robust and transferable representations. State-of-the-art GCL methods (Qiu et al., 2020; Hassani & Khasahmadi, 2020a; You et al., 2020) usually rely on augmentor(s) t(\cdot) (e.g., identity, subgraph sampling, node dropping, edge removing, and attribute masking) applied to the anchor graph G to generate a positive pair of graphs t_{1}(G) and t_{2}(G). Then, the graph feature encoder f is trained to ensure representation consistency within the positive pair, i.e., \mathbf{z}=f(t_{1}(G))=f(t_{2}(G)). Consequently, such a training strategy is heavily dependent on the choice and strength of the graph augmentation techniques. To be more specific, moderate graph augmentation pushes encoders to capture redundant and biased information (Tschannen et al., 2019), which could inadvertently suppress the space of important predictive features and negatively affect representation transferability via so-called "shortcut" solutions (Geirhos et al., 2020; Minderer et al., 2020). A more intuitive illustration is provided in Figure 1 (a), where the part shared by the two augmentation views includes both predictive information (the overlapping area with y) and non-predictive information (shaded area). Such an optimization result usually yields a lower contrastive loss; however, it has been empirically shown that the redundant information can lead to poor robustness (Robinson et al., 2021), especially under the out-of-distribution (OOD) setting (Ye et al., 2021). We provide a showcase example in Appendix A to illustrate the OOD scenario on graph learning tasks. On the other hand, overly aggressive augmentation may easily lead to another extreme where many predictive features are randomly dropped and the learned representation no longer contains sufficient predictive information for downstream instance discrimination. Recent works (Suresh et al., 2021; Li et al., 2022; You et al., 2021) propose to use automated augmentations to extract invariant rationale features (Wu et al., 2022b; a). These methods assume the most salient sub-structures (those resistant to graph augmentation) are sufficient to make rational and correct label identification, and thereby implement trainable augmentation operations (e.g., edge deleting, node dropping) to strictly regularize the graph topological structure. Although these methods can alleviate the aforementioned feature suppression issue to some extent, they still suffer from an inherent limitation: the harsh regularization may force the encoders to focus on easily-learned "shallow" features (e.g., graph size and node degree), which might be helpful in certain domains but not necessarily in others (Bevilacqua et al., 2021), and thus fail to guarantee stronger robustness. Therefore, GCL methods guided by the saliency philosophy are not flexible enough to balance representation sufficiency and robustness without the guidance of explicit domain knowledge.
To reconcile the robustness and sufficiency of the learned representation, a method that reduces redundant and biased information without sacrificing the sufficiency of the predictive graph features is urgently needed.

Refer to caption
Figure 1: Illustration of the relation between graph G, label y, the predictive feature subset G^{p}, and the non-predictive feature subset G^{c} in terms of information entropy. Ideally, the green areas in the three figures are null. (a) The usual optimization result of graph contrastive learning, where the features shared by the two augmentation views are extracted into the learned representation \mathbf{z}. Owing to the lack of supervision or domain knowledge, redundant and biased information (shaded area) is usually included in \mathbf{z}; (b) G^{p} covers the feature subset that is sufficient to make correct graph label identification (I(y;G\mid G^{p})=0); the other features (G^{c}) are either useless or misguiding; (c) G^{p} and G^{c} are supposed to be mutually disentangled (I(G^{p};G^{c})=0). Their union covers all the features of the original data.

Recently, the information bottleneck (IB) principle (Tishby et al., 2000) has been introduced to graph learning; it encourages extracting minimal yet sufficient information for representation learning. The core idea of the IB principle is in accordance with the ultimate optimization objective for solving the feature suppression issue (Robinson et al., 2021), and thus sheds more light on this problem. Moreover, representation learning guided by the IB principle has been empirically shown to generate more robust and transferable representations across different domains (Wu et al., 2020). Therefore, a graph contrastive learning framework in accordance with the IB principle is promising for balancing representation robustness and sufficiency. Given an input graph G, we denote G^{p} and G^{c} as its predictive feature subset and the complementary non-predictive feature subset, respectively. According to the assumption of recent studies on invariant rationale discovery (Wu et al., 2022b; a), the two feature subsets satisfy I(y;G\mid G^{p})=0 (sufficiency condition) and the disentanglement condition (i.e., I(G^{p};G^{c})=0). We illustrate the relations among the two feature subsets and G in Figure 1 (b) and (c). It is inevitable that the learned representation maintains some redundant information for a specific downstream task. However, a GCL framework under the guidance of the IB principle is expected to suppress the feature space of G^{c} as much as possible while keeping the predictive feature G^{p} intact in the learned representation.
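For reference, the classical IB objective (Tishby et al., 2000) makes this trade-off explicit. Written in the notation of this paper (the trade-off coefficient \beta is the standard one from the IB literature, not a symbol defined in this work), it reads:

\min_{p(\mathbf{z}\mid G)}\; I(G;\mathbf{z})-\beta\,I(\mathbf{z};y),

where the first term compresses the input and the second term (weighted by \beta>0) preserves label-relevant information. A minimal sufficient representation is one that attains I(\mathbf{z};y)=I(G;y) with the smallest possible I(G;\mathbf{z}).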

In this paper, we propose the novel Graph Contrastive Learning with Cross-View Reconstruction, named GraphCV, to pursue the optimization objective of the IB principle. Specifically, GraphCV consists of a graph encoder followed by two decoders that are trained to extract information specific to the predictive and non-predictive features, respectively. To approximate the disentanglement objective, we propose a reconstruction-based representation learning scheme, including intra-view and inter-view reconstructions, to reconstruct the originally learned representation from the two separated feature subsets. Furthermore, the encoded representation of the original view, perturbed in an adversarial fashion, serves as a third view when computing the contrastive loss, apart from the predictive representations of the two augmentation views, to further improve the representations' robustness and prevent them from collapsing into partial or even trivial ones. We provide theoretical analysis to show that GraphCV is capable of learning minimal sufficient representations. Finally, we conduct experiments to validate the effectiveness of GraphCV on commonly-used graph benchmark datasets. The experimental results demonstrate that GraphCV achieves significant performance gains over state-of-the-art baselines across different datasets and settings.

The main contributions of this work are summarized from three aspects: (i) We propose GraphCV to alleviate the feature suppression issue with a cross-view reconstruction mechanism; (ii) We provide solid theoretical analysis of our model designs; (iii) Thorough experiments are conducted to demonstrate the robustness and transferability of the representations learned via GraphCV.

2 Preliminaries

2.1 Graph Representation Learning

In this work, we focus on graph-level tasks. Let \mathcal{G}=\left\{G_{i}=(V_{i},E_{i})\right\}_{i=1}^{N} denote a graph dataset with N graphs, where V_{i} and E_{i} are the node set and edge set of graph G_{i}, respectively. We use x_{v}\in\mathbb{R}^{d} and x_{e}\in\mathbb{R}^{d} to denote the attribute vectors of each node v\in V_{i} and edge e\in E_{i}. Each graph is associated with a label, denoted as y_{i}. The goal of graph representation learning is to learn an encoder f:G_{i}\rightarrow\mathbb{R}^{d} so that the learned representation \mathbf{z}_{i}=f(G_{i}) is sufficient to predict the label y_{i} of the downstream task. We define the sufficiency of \mathbf{z}_{i} as containing no less information about the label of G_{i} than G_{i} itself (Achille & Soatto, 2018), which is formulated as:

I(G_{i};y_{i}\mid\mathbf{z}_{i})=0, (1)

where I(\cdot\,;\cdot) denotes the mutual information between two variables.

2.2 Contrastive Learning

Contrastive Learning (CL) is a self-supervised representation learning method that leverages instance-level identity for supervision. During the training phase, each graph G first goes through data augmentation to generate two augmented views t_{1}(G) and t_{2}(G), where t_{1}(\cdot) and t_{2}(\cdot) are two augmentation operators. Then, the CL method encourages the encoder f (a backbone network plus a projection layer) to map t_{1}(G) and t_{2}(G) closer in the hidden space so that the learned representations \mathbf{z}_{1} and \mathbf{z}_{2} maintain all the information shared by t_{1}(G) and t_{2}(G). The learning of the encoder is usually directed by a contrastive loss, such as the NCE loss (Wu et al., 2018b), InfoNCE loss (van den Oord et al., 2018), or NT-Xent loss (Chen et al., 2020). In Graph Contrastive Learning (GCL), we usually adopt a GNN, such as GCN (Kipf & Welling, 2017) or GIN (Xu et al., 2019), as the backbone network, together with the commonly-used graph data augmentation operators (You et al., 2020), such as node dropping, edge perturbation, subgraph sampling, and attribute masking.
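To make the objective concrete, below is a minimal PyTorch sketch of the InfoNCE loss over a batch of paired graph embeddings; the function name, the temperature value, and the simplification of using only cross-view negatives are our own illustrative choices, not the exact configuration of any specific GCL method.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """InfoNCE / NT-Xent-style loss where (z1[i], z2[i]) are positive pairs.

    z1, z2: [batch_size, dim] graph embeddings of the two augmented views.
    """
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau                    # [B, B] cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    # Diagonal entries are positives; all off-diagonal pairs act as negatives.
    return F.cross_entropy(logits, targets)

# Usage: z1, z2 = f(t1(G)), f(t2(G)) for a batch of graphs, then
# loss = info_nce_loss(z1, z2).
```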

All GCL-based methods are built on the assumption that augmentations do not break the sufficiency requirement for making correct predictions. Here, we follow (Federici et al., 2020) to formalize the notion of mutual redundancy: t_{1}(G) is redundant to t_{2}(G) with respect to y iff t_{1}(G) and t_{2}(G) share the same predictive information. Mathematically, mutual redundancy in CL exists when:

I(t_{1}(G);y\mid t_{2}(G))=I(t_{2}(G);y\mid t_{1}(G))=0. (2)

Although GCL-based methods are usually capable of extracting useful information for label identification, it is unavoidable to include non-predictive features under the SSL setting owing to the lack of explicit domain knowledge. There exist situations (e.g., the OOD setting) in which the latent space of the learned representation is dominated by non-predictive features in SSL (Chen et al., 2021), and it is then no longer informative enough to make correct predictions. Therefore, feature suppression is a prevalent issue not only in supervised learning but also in SSL. Due to page limitations, we provide more discussion about the relation between feature suppression and GCL in Appendix B.

3 Proposed Model

Refer to caption
Figure 2: The illustration of the proposed GraphCV. (1) Graph augmentations are applied to the input graph G to produce two augmented graphs, which are then fed into the shared graph encoder f(\cdot) to generate two graph representations \mathbf{z}_{1} and \mathbf{z}_{2}. (2) \mathbf{z}_{1} and \mathbf{z}_{2} are used as the inputs of the two decoders to generate two pairs of graph representations, where \mathbf{z}^{p} captures the predictive factors and \mathbf{z}^{c} keeps the other complementary non-predictive features. We then use the two pairs of representations to reconstruct \mathbf{z}_{1} and \mathbf{z}_{2} in both the intra-view and inter-view manner. (3) An adversarial sample generated from G goes through the same procedure to generate \mathbf{z}_{adv}^{p}. We take it as the third view, besides \mathbf{z}_{1}^{p} and \mathbf{z}_{2}^{p}, in CL to guarantee that \mathbf{z}^{p} keeps the global semantics.

In this section, we introduce the details of our proposed GraphCV, whose framework is shown in Figure 2. Corresponding theoretical analyses are provided to justify the rationality of our designs. Before diving into the details of GraphCV, we briefly introduce the overall framework of our model.

The proposed GraphCV model is designed in accordance with the IB principle to extract minimal yet sufficient representations through the designed cross-view reconstruction mechanism. Given f(\cdot) as the graph encoder, we aim to map the graph representation \mathbf{z}=f(G)\in\mathbb{R}^{d} into two different feature spaces (\mathbf{z}^{p},\mathbf{z}^{c}), where \mathbf{z}^{p}\in\mathbb{R}^{d} is expected to be specific to the predictive information G^{p}, and \mathbf{z}^{c}\in\mathbb{R}^{d} is optimized to elicit the complementary non-predictive factors G^{c}. We then reconstruct the representation \mathbf{z} from the feature subsets mapped from the same and different augmentation views to approximate the disentanglement objective demonstrated in Figure 1. By separating the learned representation into two sets of disentangled features and later utilizing them for reconstruction, we alleviate the feature suppression issue (Robinson et al., 2021) at no cost of information sufficiency. We further add extra regularization to guarantee that \mathbf{z}^{p} does not collapse into shallow or partial features during the reconstruction process. We introduce more details of GraphCV in the following subsections.

3.1 Disentanglement by Cross-View Reconstruction

In GCL, we usually leverage a graph encoder, such as GCN (Kipf & Welling, 2017) or GIN (Xu et al., 2019), to encode the graph data into its representation. In this work, we adopt GIN as the backbone network f for simplicity; note that any other commonly-used graph encoder can also be adapted to our model. Given two augmentation views t_{1}(G) and t_{2}(G) (where t_{1}(\cdot) and t_{2}(\cdot) are IID sampled from the same family of augmentations \mathcal{T}), we first use the encoder f(\cdot) to map them into a lower-dimensional hidden space, yielding the two embeddings \mathbf{z}_{1} and \mathbf{z}_{2}. Instead of directly maximizing the agreement between \mathbf{z}_{1} and \mathbf{z}_{2}, we further feed each of them into a pair of decoders (g_{p}, g_{c}) (both MLP-based networks or GNNs) and optimize the two decoders to map each representation into the two disentangled feature sub-spaces:

\left[\mathbf{z}^{p}=g_{p}(f(t(G))),\;\;\mathbf{z}^{c}=g_{c}(f(t(G)))\right], (3)

where a pair of embeddings is generated for each of t_{1}(G) and t_{2}(G). Ideally, \mathbf{z}_{1}^{p} and \mathbf{z}_{2}^{p} satisfy the mutual redundancy assumption stated in Section 2.2, because t_{1}(G) and t_{2}(G) are augmented from the same original graph and thus naturally share the same predictive factors.
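A minimal PyTorch sketch of this dual-decoder design (Equation 3); the module name, hidden sizes, and the choice of two-layer MLP decoders are illustrative assumptions, with the graph encoder abstracted as any network returning a graph-level embedding:

```python
import torch
import torch.nn as nn

class DualDecoder(nn.Module):
    """Maps a graph embedding z into a predictive part z_p (via g_p)
    and a complementary non-predictive part z_c (via g_c), as in Eq. 3."""

    def __init__(self, dim: int):
        super().__init__()
        self.g_p = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.g_c = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, z: torch.Tensor):
        return self.g_p(z), self.g_c(z)

# For two augmented views of a batch of graphs:
# z1, z2 = encoder(t1(G)), encoder(t2(G))   # encoder: e.g., a GIN + readout
# z1_p, z1_c = decoders(z1)
# z2_p, z2_c = decoders(z2)
```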

Here, Theorem 1 states a lower bound on the mutual information between one augmented view and the two representations mapped from the other augmented view.

Theorem 1

Suppose f(\cdot) is a GNN encoder as powerful as the 1-WL test. Let \mathbf{z}_{1}^{p} and \mathbf{z}_{2}^{p} be specific to the predictive information of G, while \mathbf{z}_{1}^{c} and \mathbf{z}_{2}^{c} account for the non-predictive factors of t_{1}(G) and t_{2}(G). Then we have:

I\left(t_{1}(G);\mathbf{z}_{2}^{p},\mathbf{z}_{2}^{c}\right)\geq I\left(\mathbf{z}_{1}^{p};\mathbf{z}_{2}^{p}\right),\text{ where }G\in\mathcal{G}\text{ and }t_{1}(\cdot),t_{2}(\cdot)\in\mathcal{T}.

The detailed proof is provided in Appendix E. Given this lower bound, we substitute the objective with the mutual information between the two predictive representations (\mathbf{z}_{1}^{p} and \mathbf{z}_{2}^{p}) to maximize the consistency between the information of the two views. We therefore derive the objective function ensuring view invariance as follows:

\mathcal{L}_{\text{pre}}=\frac{1}{N}\sum_{i=1}^{N}\mathcal{L}_{\text{CL}}(\mathbf{z}_{1,i}^{p},\mathbf{z}_{2,i}^{p}), (4)

where \mathcal{L}_{\text{CL}}(\cdot) is the adopted InfoNCE loss (van den Oord et al., 2018). To further pursue the feature disentanglement illustrated in Figure 1(c), we propose the cross-view reconstruction mechanism. To be specific, we would like the representation pairs (\mathbf{z}^{p},\mathbf{z}^{c}) within and across the augmentation views to be able to recover the raw data, so that the two objectives can be approached simultaneously. Because graphs are non-Euclidean structured data, we instead try to recover \mathbf{z}=f(t(G)) given (\mathbf{z}^{p}, \mathbf{z}^{c}).

More specifically, we first perform the reconstruction within each augmentation view, namely mapping (\mathbf{z}_{w}^{p},\mathbf{z}_{w}^{c}) to \mathbf{z}_{w}, where w\in\{1,2\} indexes the augmentation view. Then, we define (\mathbf{z}_{w^{\prime}}^{p},\mathbf{z}_{w}^{c}) as a cross-view representation pair and repeat the reconstruction procedure on it to predict \mathbf{z}_{w}, where w=1,w^{\prime}=2 or w=2,w^{\prime}=1, aiming to ensure that \mathbf{z}^{p} and \mathbf{z}^{c} are optimized to approximate mutual disentanglement. Intuitively, the reconstruction process is capable of separating the information of the shared feature sets from that residing in the feature sets unique to each of the two augmentation views. Since the two IID-sampled augmentation operators (t_{1}(\cdot) and t_{2}(\cdot)) are expected to preserve the predictive/rational features while varying the augmentation-related ones, we disentangle the rational features from G following the rationale discovery studies (Chang et al., 2020) to ensure the features' robustness for downstream tasks. We formulate the reconstruction procedures as:

\mathbf{z}_{w}^{r}=g_{r}\left(\mathbf{z}_{w}^{p}\odot\mathbf{z}_{w}^{c}\right),\;\;\mathbf{z}_{w}^{cr}=g_{r}\left(\mathbf{z}_{w^{\prime}}^{p}\odot\mathbf{z}_{w}^{c}\right), (5)

where g_{r} is the parameterized reconstruction model and \odot is a free-to-choose fusion operator, such as element-wise product or concatenation. The reconstruction procedures are optimized by minimizing the entropy H\left(\mathbf{z}_{w}\mid\mathbf{z}_{w^{\prime}}^{p},\mathbf{z}_{w}^{c}\right), where w=w^{\prime} or w\neq w^{\prime}. Ideally, we reach the optimal sufficiency and disentanglement conditions illustrated in Figure 1 (b) and (c) iff H\left(\mathbf{z}_{w}\mid\mathbf{z}_{w^{\prime}}^{p},\mathbf{z}_{w}^{c}\right)=-\mathbb{E}_{p\left(\mathbf{z}_{w},\mathbf{z}_{w^{\prime}}^{p},\mathbf{z}_{w}^{c}\right)}\left[\log p\left(\mathbf{z}_{w}\mid\mathbf{z}_{w^{\prime}}^{p},\mathbf{z}_{w}^{c}\right)\right]=0, i.e., \mathbf{z}_{w} is exactly recovered given its complementary representation and the predictive representation of either view. Nevertheless, the conditional probability p\left(\mathbf{z}_{w}\mid\mathbf{z}_{w^{\prime}}^{p},\mathbf{z}_{w}^{c}\right) is intractable; we hence use the variational distribution approximated by g_{r} instead, denoted as q\left(\mathbf{z}_{w}\mid\mathbf{z}_{w^{\prime}}^{p},\mathbf{z}_{w}^{c}\right). We provide the upper bound of H\left(\mathbf{z}_{w}\mid\mathbf{z}_{w^{\prime}}^{p},\mathbf{z}_{w}^{c}\right) in Theorem 2.

Theorem 2

Assume q is a Gaussian distribution and g_{r} is the parameterized reconstruction model that infers \mathbf{z}_{w} from \left(\mathbf{z}_{w^{\prime}}^{p},\mathbf{z}_{w}^{c}\right). Then we have:

H\left(\mathbf{z}_{w}\mid\mathbf{z}_{w^{\prime}}^{p},\mathbf{z}_{w}^{c}\right)\leq\left\|\mathbf{z}_{w}-g_{r}\left(\mathbf{z}_{w^{\prime}}^{p}\odot\mathbf{z}_{w}^{c}\right)\right\|_{2}^{2},\text{ where }w=w^{\prime}\text{ or }w\neq w^{\prime}.

The detailed proof is provided in Appendix E. Since we adopt two augmentation views, the objective function constraining representation disentanglement can be formulated as:

\mathcal{L}_{\text{recon}}=\frac{1}{2N}\sum_{i=1}^{N}\sum_{w=1}^{2}\left[\left\|\mathbf{z}_{w,i}-\mathbf{z}_{w,i}^{r}\right\|_{2}^{2}+\left\|\mathbf{z}_{w,i}-\mathbf{z}_{w,i}^{cr}\right\|_{2}^{2}\right]. (6)
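As a sketch of Equations 5 and 6, the snippet below uses concatenation as the fusion operator \odot and an MLP as g_{r}; both choices, and the function name, are illustrative:

```python
import torch
import torch.nn as nn

def reconstruction_loss(z1, z2, z1_p, z1_c, z2_p, z2_c, g_r: nn.Module):
    """Intra-view and cross-view reconstruction loss (Eq. 6).

    g_r maps the fused pair back to the view embedding; with concatenation
    as the fusion operator, g_r expects inputs of width 2 * dim.
    """
    rec = lambda z_p, z_c: g_r(torch.cat([z_p, z_c], dim=1))
    # Intra-view: (z_w^p, z_w^c) should recover z_w.
    intra = ((z1 - rec(z1_p, z1_c)) ** 2).sum(1) + ((z2 - rec(z2_p, z2_c)) ** 2).sum(1)
    # Cross-view: the predictive part of the *other* view, paired with the
    # complementary part of the same view, should still recover z_w.
    cross = ((z1 - rec(z2_p, z1_c)) ** 2).sum(1) + ((z2 - rec(z1_p, z2_c)) ** 2).sum(1)
    return (intra + cross).mean() / 2.0

# Example g_r: nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
```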

3.2 Adversarial Contrastive View

With the cross-view reconstruction mechanism above, the two learned representations are optimized in a disentangled manner. However, it is still necessary to prevent the learned predictive representation from focusing on partial features: we have no access to explicit domain knowledge, and such a narrow scope increases the risk of shortcut solutions. Therefore, we extend Equation 4 to three contrastive views and add an extra global view without topological perturbation as the third view, so as to guarantee that the learned \mathbf{z}^{p} maintains the global semantics rather than partial or even trivial features, i.e., \mathbf{z}_{1}^{p}\sim G and \mathbf{z}_{2}^{p}\sim G. In our experiments, we find that an adversarial graph sample perturbed from the original graph view helps the model achieve stronger robustness. A possible explanation is that redundant, non-predictive information is still left in the information shared by the two \mathbf{z}^{p}'s of the two augmentation views, especially when the implemented augmentations are moderate; an adversarial view may further alleviate this redundancy. We define the adversarial objective as follows:

\delta^{*}=\underset{\left\|\delta\right\|_{\infty}\leq\epsilon}{\operatorname{argmax}}\;\mathcal{L}_{\text{adv}}\left(t_{1}(G),t_{2}(G),G+\delta\right), (7)

where the adversarial sample G+\delta, together with the two augmentation views t_{1}(G) and t_{2}(G), is employed as the positive set. Our crafted perturbation is inspired by recent work (Yang et al., 2021) that adds the perturbation \delta to the output of the first hidden layer \mathbf{h}^{(1)}, since this has been empirically shown to generate more challenging views than perturbing the initial node features. Therefore, the adversarial contrastive objective is defined as:

\mathcal{L}_{\text{adv}}=\frac{1}{N}\sum_{i=1}^{N}\underset{\delta^{*}}{\max}\left[\mathcal{L}_{\text{CL}}\left(\mathbf{z}_{1,i}^{p},G+\delta^{*}\right)+\mathcal{L}_{\text{CL}}\left(\mathbf{z}_{2,i}^{p},G+\delta^{*}\right)\right], (8)

where the optimized perturbation \delta^{*} is solved by projected gradient descent (PGD) (Madry et al., 2018). Finally, we derive the joint objective of GraphCV by combining all of the objectives above:

\min_{f,g}\mathbb{E}_{G\in\mathcal{G}}\left[\mathcal{L}_{\text{pre}}+\lambda_{r}\mathcal{L}_{\text{recon}}+\lambda_{a}\underset{\left\|\delta\right\|_{\infty}\leq\epsilon}{\max}\mathcal{L}_{\text{adv}}\right], (9)

where \lambda_{r} and \lambda_{a} are coefficients balancing the magnitude of each loss term. With the joint objective, our proposed model is able to learn the optimal representation illustrated in Figure 1(c).
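A hedged sketch of the inner PGD maximization (Equation 7) and its place in training; following Yang et al. (2021), the perturbation acts on first-layer activations, which we abstract here as a tensor h1, and the step size, number of steps, and function names are illustrative assumptions:

```python
import torch

def pgd_perturbation(h1: torch.Tensor, adv_loss_fn, eps: float = 0.01,
                     alpha: float = 0.004, steps: int = 3) -> torch.Tensor:
    """Solve delta* = argmax_{||delta||_inf <= eps} L_adv via PGD (Eq. 7).

    h1: first-hidden-layer activations to perturb;
    adv_loss_fn: maps perturbed activations to the adversarial contrastive loss.
    """
    delta = torch.zeros_like(h1, requires_grad=True)
    for _ in range(steps):
        loss = adv_loss_fn(h1 + delta)           # L_adv at the perturbed view
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()   # gradient *ascent* step
            delta.clamp_(-eps, eps)              # project onto the l_inf ball
        delta.grad.zero_()
    return delta.detach()

# Joint objective (Eq. 9), with delta* held fixed in the outer minimization:
# loss = L_pre + lambda_r * L_recon + lambda_a * L_adv(delta_star)
```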

4 Experiments

In this section, we present the empirical evaluation of GraphCV on public graph benchmark datasets under different settings. An ablation study and hyper-parameter analysis are also conducted to evaluate the effectiveness of the designs in GraphCV. We further compare the robustness of GraphCV with an adversarial training-based GCL method. More details on dataset statistics, training, and other empirical analyses are provided in the Appendix.

4.1 Experimental Setups

Datasets. For the unsupervised learning setting, we evaluate our model on five graph benchmark datasets from the field of bioinformatics (MUTAG, PTC-MR, NCI1, DD, and PROTEINS) and four from the field of social networks (COLLAB, IMDB-B, RDT-B, and IMDB-M) for the task of graph-level property classification. For the transfer learning setting, we follow previous work (You et al., 2020; Xu et al., 2021b) to pretrain our model on the ZINC-2M dataset, which contains 2 million unlabeled molecule graphs sampled from MoleculeNet (Wu et al., 2018a), and then evaluate its performance on eight binary classification datasets from the chemistry domain, which are split by scaffold to simulate the out-of-distribution scenario in the real world. Additionally, we use ogbg-molhiv from the Open Graph Benchmark (Hu et al., 2020a) to evaluate our model on a large-scale dataset under the semi-supervised setting. More details about dataset statistics are included in Appendix C.

Baselines. Under the unsupervised representation learning setting, we compare GraphCV with eight state-of-the-art self-supervised learning methods, GraphCL (You et al., 2020), InfoGraph (Sun et al., 2019), MVGRL (Hassani & Khasahmadi, 2020a), AD-GCL (Suresh et al., 2021), GASSL (Yang et al., 2021), InfoGCL (Xu et al., 2021a), RGCL (Li et al., 2022), and DGCL (Li et al., 2021), as well as three classical unsupervised representation learning methods, node2vec (Grover & Leskovec, 2016), graph2vec (Narayanan et al., 2017), and VGAE (Kipf & Welling, 2016). Besides, we employ AttrMasking (Hu et al., 2020b), ContextPred (Hu et al., 2020b), GraphCL (You et al., 2020), GraphLoG (Xu et al., 2021b), JOAO (You et al., 2021), AD-GCL (Suresh et al., 2021), and RGCL (Li et al., 2022) as baselines to evaluate the effectiveness of our proposed GraphCV under the transfer learning setting.

Evaluation Protocol. For the unsupervised setting, we follow the evaluation protocols of previous works (Sun et al., 2019; You et al., 2020; Li et al., 2021) to verify the effectiveness of our model. The mean test accuracy evaluated by 10-fold cross validation, with the standard deviation over five random seeds, is reported as the final performance. For the transfer learning setting, we follow the fine-tuning procedures of previous work (You et al., 2020; Xu et al., 2021b) and report the mean ROC-AUC with standard deviation over 10 repeated runs on each downstream dataset. In addition, we follow the semi-supervised representation learning setting of GraphCL on the ogbg-molhiv dataset, with fine-tuning label rates of 1%, 10%, and 20%. The final performance is reported as the mean ROC-AUC over five random initialization seeds.
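For the unsupervised protocol, a common implementation (following the InfoGraph/GraphCL evaluation pipelines) fits an SVM on the frozen graph embeddings; a minimal scikit-learn sketch, where `embeddings` and `labels` are assumed to be numpy arrays produced by the frozen encoder and C=1.0 is an illustrative choice:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def evaluate_frozen_embeddings(embeddings: np.ndarray, labels: np.ndarray):
    """10-fold cross-validated accuracy of an SVM on frozen embeddings."""
    scores = cross_val_score(SVC(C=1.0), embeddings, labels, cv=10, scoring="accuracy")
    return scores.mean(), scores.std()
```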

Table 1: Overall comparison on multiple graph classification benchmarks under unsupervised learning setting. Results are reported as mean±std%, the best performance is bolded and runner-ups are underlined. "-" indicates the result is not reported in original papers.

Method MUTAG PTC-MR COLLAB NCI1 PROTEINS IMDB-B RDT-B IMDB-M DD
node2vec 72.6±10.2 58.6±8.0 - 54.9±1.6 57.5±3.6 - - - -
graph2vec 83.2±9.3 60.2±6.9 - 73.2±1.8 73.3±2.1 71.1±0.5 75.8±1.0 50.4±0.9 -
InfoGraph 89.0±1.1 61.7±1.4 70.7±1.1 76.2±1.1 74.4±0.3 73.0±0.9 82.5±1.4 49.7±0.5 72.9±1.8
VGAE 87.7±0.7 61.2±1.8 - - - 70.7±0.7 87.1±0.1 49.3±0.4 -
MVGRL 89.7±1.1 62.5±1.7 - - - 74.2±0.7 84.5±0.6 51.2±0.5 -
GraphCL 86.8±1.3 63.6±1.8 71.4±1.2 77.9±0.4 74.4±0.5 71.1±0.4 89.5±0.8 - 78.6±0.4
InfoGCL 91.2±1.3 63.5±1.5 80.0±1.3 80.2±0.6 - 75.1±0.9 - 51.4±0.8 -
DGCL 92.1±0.8 65.8±1.5 81.2±0.3 81.9±0.2 76.4±0.5 75.9±0.7 91.8±0.2 51.9±0.4 -
AD-GCL 89.7±1.0 - 73.3±0.6 69.7±0.5 73.8±0.5 72.3±0.6 85.5±0.8 49.9±0.7 75.1±0.4
RGCL 87.7±1.0 - 70.9±0.7 78.1±1.1 75.0±0.4 71.9±0.8 90.3±0.6 - 78.9±0.5
GASSL 90.9±7.9 64.6±6.1 78.0±2.0 80.2±1.9 - 74.2±0.5 - 51.7±2.5 -
GraphCV 92.3±0.7 67.4±1.3 80.5±0.5 82.0±1.0 76.8±0.4 75.6±0.4 92.4±0.9 52.2±0.5 80.5±0.5

4.2 Overall Performance Comparison

Unsupervised learning. The overall performance comparison is shown in Table 1, from which we make three observations: (1) The GCL-based methods generally yield higher performance than classical unsupervised learning methods, indicating the effectiveness of utilizing instance-level supervision; (2) RGCL, AD-GCL, and GASSL achieve better performance than GraphCL, which empirically supports the conclusion that the InfoMax objective can bring overwhelming redundant information and thus suffer from the feature suppression issue; (3) Our proposed GraphCV and DGCL consistently outperform the other baselines, proving the advantage of disentangled representations. More importantly, GraphCV achieves state-of-the-art results on most of the datasets, demonstrating the model's effectiveness.

Transfer learning. Table 2 presents the experimental results under the transfer learning setting, where No Pre-Train skips the self-supervised pre-training on the ZINC-2M dataset before fine-tuning. It is noteworthy that some strong baselines (AttrMasking and ContextPred) are trained under the guidance of domain knowledge. Despite lacking such domain knowledge, our model still outperforms all the other baselines on 3 out of 8 datasets and achieves the highest average performance. More importantly, JOAO, RGCL, and our proposed GraphCV are all developed from GraphCL, yet achieve higher average performance than GraphCL. This observation further empirically proves the poisoning effect of biased information and the necessity of suppressing it.

Table 2: Overall comparison on multiple graph classification benchmarks under transfer learning setting. Results are reported as mean±std%, the best performance is bolded and runner-ups are underlined.

Method BBBP Tox21 ToxCast SIDER ClinTox MUV HIV BACE Avg
No Pre-Train 65.8±4.5 74.0±0.8 63.4±0.6 57.3±1.6 58.0±4.4 71.8±2.5 75.3±1.9 70.1±5.4 67.0
AttrMasking 64.3±2.8 76.7±0.4 64.2±0.5 61.0±0.7 71.8±4.1 74.7±1.4 77.2±1.1 79.3±1.6 71.1
ContextPred 68.0±2.0 75.7±0.7 63.9±0.6 60.9±0.6 65.9±3.8 75.8±1.7 77.3±1.0 79.6±1.2 70.9
GraphCL 69.5±0.5 75.4±0.9 63.8±0.4 60.8±0.7 70.1±1.9 74.5±1.3 77.6±0.9 78.2±1.2 70.8
GraphLoG 72.5±0.8 75.7±0.5 63.5±0.7 61.2±1.1 76.7±3.3 76.0±1.1 77.8±0.8 83.5±1.2 73.4
JOAO 70.2±1.0 75.0±0.3 62.9±0.5 60.0±0.8 81.3±2.5 71.7±1.4 76.7±1.2 51.5±0.4 71.9
RGCL 71.4±0.7 75.2±0.3 63.3±0.2 61.4±0.6 83.4±0.9 76.7±1.0 77.9±0.8 76.0±0.8 73.2
GraphCV 71.6±0.6 75.7±0.6 63.2±0.5 62.2±0.7 83.6±1.5 76.4±0.8 77.9±1.0 80.8±1.8 73.9

Refer to caption
Figure 3: Performance comparison of semi-supervised learning on ogbg-molhiv.

Semi-supervised learning. The experimental results are shown in Figure 3. Our model gains significant improvements under all three label-rate fine-tuning settings. We also notice that as the label rate increases, the amount of improvement increases as well (1%, 1.8%, and 4.4% for label rates of 1%, 10%, and 20%, respectively). A possible explanation is that more training data brings more redundant information, further aggravating the feature suppression issue; removing the redundant information therefore yields a larger performance boost.

4.3 Ablation Study

To further verify the effectiveness of the different modules in GraphCV, we perform ablation studies by creating two model variants: (1) w/o CV Recon, where the cross-view reconstruction process is discarded; (2) w/o Adv. Training, where the third adversarial view is discarded. The comparison results are shown in Table 3. We observe that our full model, with the combination of the cross-view reconstruction and adversarial training modules, outperforms both variants. Omitting the reconstruction process fails to optimize the representation in the disentangled manner illustrated in Figure 1(c), so the learned representation still suffers from the feature suppression issue. Compared with our full model, the w/o Adv. Training variant may lead to representation collapse and bring extra redundant information, resulting in sub-optimal performance in downstream tasks.

Table 3: Overall comparison of the model variants’ performance. Results are reported as mean±std%, the best performance is bolded.

Variant MUTAG PTC-MR COLLAB NCI1 PROTEINS IMDB-B RDT-B IMDB-M DD
w/o CV Recon 91.0±0.9 64.7±1.4 78.0±0.8 78.7±1.2 74.9±0.7 75.0±0.6 91.1±0.7 51.7±0.6 79.0±0.8
w/o Adv. Training 92.1±0.6 66.8±0.5 76.5±0.6 81.2±0.9 76.0±0.3 75.1±0.6 92.2±1.0 50.8±0.4 80.1±0.6
GraphCV 92.3±0.7 67.4±0.5 80.5±0.5 82.0±1.0 76.8±0.4 75.6±0.4 92.5±0.9 52.2±0.5 80.5±0.5

4.4 Robustness and Hyper-parameter Analysis

Refer to caption
Figure 4: Model performance under different perturbation bounds and attack steps, and sensitivity analysis of the two important hyper-parameters (i.e., \lambda_{r} and \lambda_{a}).

In this section, we first conduct extra experiments on the ogbg-molhiv dataset to evaluate the representation robustness under aggressive augmentation and perturbation. The results are shown in the left two subplots of Figure 4, where we compare our method with GASSL under different perturbation bounds and attack steps to demonstrate its robustness against adversarial attacks. Since both our model and GASSL use GIN as the backbone network, we also include the performance of GIN as a baseline. Although aggressive adversarial attacks can largely deteriorate the performance, our proposed GraphCV still achieves more robust performance than GASSL. In the right two subplots, we analyze the model's sensitivity to the two important hyper-parameters, \lambda_{r} and \lambda_{a}. The consistent superiority of non-zero values over the initial point (i.e., \lambda_{r}=\lambda_{a}=0) proves the effectiveness of our design once again. We also observe that the appropriate ranges of the two hyper-parameters are 5.0 to 10.0 and 0.0 to 0.5, respectively. Depending on the dataset size and attributes, these ranges can vary somewhat; we suggest tuning the two hyper-parameters around 10.0 and 0.25 to find appropriate values when adopting our model on a new dataset. More experiments and discussion on hyper-parameter sensitivity are provided in Appendix H. Besides, we conduct extra experiments to analyze the disentanglement of \mathbf{z}^{p} and \mathbf{z}^{c} in Appendix F.

5 Related Work

Graph contrastive learning. Contrastive learning was popularized in the computer vision field (Chen et al., 2020) and has raised a surge of interest in self-supervised graph representation learning over the past few years. The principle behind contrastive learning is to utilize instance-level identity as supervision and maximize the consistency between positive pairs in the hidden space through a designed contrast mode. Previous graph contrastive learning works generally rely on various graph augmentation (transformation) techniques (Veličković et al., 2019; Qiu et al., 2020; Hassani & Khasahmadi, 2020b; You et al., 2020; Sun et al., 2019) to generate positive pairs from the original data as similar samples. Recent works in this field try to improve the effectiveness of graph contrastive learning by finding more challenging views (Suresh et al., 2021; Xu et al., 2021a; You et al., 2021) or adding adversarial perturbations (Yang et al., 2021). However, most existing methods contrast over entangled embeddings, where the complex intertwined information may pose obstacles to extracting useful information for downstream tasks. Our model is spared from this issue by contrasting over disentangled representations.

Disentangled representation learning on graphs. Disentangled representation learning arose in the computer vision field (Hsieh et al., 2018; Zhao et al., 2021) to disentangle the heterogeneous latent factors of representations, thereby making the representations more robust and interpretable (Bengio et al., 2013). This idea has now been widely adopted in graph representation learning. (Liu et al., 2020; Ma et al., 2019) utilize neighborhood routing mechanisms to identify the latent factors in node representations. Other generative models (Kipf & Welling, 2016; Simonovsky & Komodakis, 2018) utilize variational autoencoders to balance reconstruction and disentanglement. Recent work (Li et al., 2021) extends disentangled representation learning to self-supervised graph learning by contrasting the factorized representations. Although these methods gain significant benefit from representation disentanglement, the underlying excessive information can still overload the model, resulting in limited capacity. Our model targets this issue by removing the redundant information that is irrelevant to the graph property.

Graph information bottleneck. The information bottleneck (IB) (Tishby et al., 2000) has been widely adopted as a critical principle of representation learning. A representation containing minimal yet sufficient information is considered to comply with the IB principle, and many works (Alemi et al., 2017; Shwartz-Ziv & Tishby, 2017; Federici et al., 2020) have empirically and theoretically shown that representations agreeing with the IB principle are both informative and robust. Recently, the IB principle has also been borrowed to guide representation learning on graph-structured data. Current methods (Wu et al., 2020; Xu et al., 2021a; Suresh et al., 2021; Li et al., 2022) usually propose different regularization designs to learn compressed yet informative representations in accordance with the IB principle. In this work, we follow the information bottleneck to learn expressive and robust representations from disentangled information.

6 Conclusion

In this paper, we study the feature suppression problem in representation learning. To avoid predictive features being suppressed in the learned representation, we propose a novel model, namely GraphCV, which is designed in accordance with the information bottleneck principle. The cross-view reconstruction in GraphCV disentangles the more robust and transferable features from the easily-disturbed ones. Meanwhile, we add an adversarial view as the third view of contrastive learning to guarantee the global semantics and further enhance representation robustness. In addition, we theoretically analyze the effectiveness of each component in our model and derive the objective based on the analysis. Extensive experiments on multiple graph benchmark datasets and different settings prove the ability of GraphCV to learn robust and transferable graph representations. In the future, we plan to explore a practical objective to further decrease the upper bound of the mutual information between the disentangled representations, and to utilize more efficient training strategies to make the proposed model more time-efficient on large-scale graphs.

Ethics Statement

This idea is proposed to solve the general graph learning problem, so we believe no ethical concerns apply to our work. Any unethical application that benefits from our work is against our original intent.

Reproducibility Statement

We provide the source code along with the submission in the supplementary materials for reproducibility. The source code and all implementation details will be made public upon the acceptance of this paper.

References

  • Achille & Soatto (2018) Alessandro Achille and Stefano Soatto. Emergence of invariance and disentanglement in deep representations. JMLR, 2018.
  • Alemi et al. (2017) Alexander A. Alemi, Ian Fischer, Joshua V. Dillon, and Kevin Murphy. Deep Variational Information Bottleneck. ICLR, 2017.
  • Bengio et al. (2013) Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation Learning: A Review and New Perspectives. TPAMI, 2013. doi: 10.1109/TPAMI.2013.50.
  • Bevilacqua et al. (2021) Beatrice Bevilacqua, Yangze Zhou, and Bruno Ribeiro. Size-invariant graph representations for graph classification extrapolations. In ICML, 2021.
  • Chang et al. (2020) Shiyu Chang, Yang Zhang, Mo Yu, and Tommi Jaakkola. Invariant rationalization. In ICML, 2020.
  • Chen et al. (2020) Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A Simple Framework for Contrastive Learning of Visual Representations. In ICML, 2020. URL https://proceedings.mlr.press/v119/chen20j.html.
  • Chen et al. (2021) Ting Chen, Calvin Luo, and Lala Li. Intriguing properties of contrastive losses. In NeurIPS, 2021.
  • Cover (1999) Thomas M Cover. Elements of information theory. John Wiley & Sons, 1999.
  • Federici et al. (2020) Marco Federici, Anjan Dutta, Patrick Forré, Nate Kushman, and Zeynep Akata. Learning robust representations via multi-view information bottleneck. In ICLR, 2020.
  • Geirhos et al. (2020) Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2020.
  • Grover & Leskovec (2016) Aditya Grover and Jure Leskovec. node2vec: Scalable Feature Learning for Networks. In KDD, 2016. URL https://doi.org/10.1145/2939672.2939754.
  • Hamilton et al. (2017a) Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive Representation Learning on Large Graphs. In NeurIPS, 2017a. URL https://proceedings.neurips.cc/paper/2017/hash/5dd9db5e033da9c6fb5ba83c7a7ebea9-Abstract.html.
  • Hamilton et al. (2017b) William L Hamilton, Rex Ying, and Jure Leskovec. Representation learning on graphs: Methods and applications. arXiv preprint arXiv:1709.05584, 2017b.
  • Hassani & Khasahmadi (2020a) Kaveh Hassani and Amir Hosein Khasahmadi. Contrastive Multi-View Representation Learning on Graphs. In ICML, 2020a. URL https://proceedings.mlr.press/v119/hassani20a.html.
  • Hassani & Khasahmadi (2020b) Kaveh Hassani and Amir Hosein Khasahmadi. Contrastive multi-view representation learning on graphs. In ICML, 2020b.
  • Hsieh et al. (2018) Jun-Ting Hsieh, Bingbin Liu, De-An Huang, Li F Fei-Fei, and Juan Carlos Niebles. Learning to Decompose and Disentangle Representations for Video Prediction. In NeurIPS, 2018. URL https://proceedings.neurips.cc/paper/2018/hash/496e05e1aea0a9c4655800e8a7b9ea28-Abstract.html.
  • Hu et al. (2020a) Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open Graph Benchmark: Datasets for Machine Learning on Graphs. In NeurIPS, 2020a. URL https://proceedings.neurips.cc/paper/2020/hash/fb60d411a5c5b72b2e7d3527cfc84fd0-Abstract.html.
  • Hu et al. (2020b) Weihua Hu, Bowen Liu, Joseph Gomes, Marinka Zitnik, Percy Liang, Vijay Pande, and Jure Leskovec. Strategies for pre-training graph neural networks. In ICLR, 2020b.
  • Kipf & Welling (2016) Thomas N. Kipf and Max Welling. Variational Graph Auto-Encoders. In NeurIPS, 2016.
  • Kipf & Welling (2017) Thomas N. Kipf and Max Welling. Semi-Supervised Classification with Graph Convolutional Networks. In ICLR, 2017.
  • Li et al. (2021) Haoyang Li, Xin Wang, Ziwei Zhang, Zehuan Yuan, Hang Li, and Wenwu Zhu. Disentangled Contrastive Learning on Graphs. In NeurIPS, 2021. URL https://openreview.net/forum?id=C_L0Xw_Qf8M.
  • Li et al. (2022) Sihang Li, Xiang Wang, An Zhang, Xiangnan He, and Tat-Seng Chua. Let invariant rationale discovery inspire graph contrastive learning. In ICML, 2022.
  • Liu et al. (2020) Yanbei Liu, Xiao Wang, Shu Wu, and Zhitao Xiao. Independence promoted graph disentangled networks. In AAAI, 2020.
  • Ma et al. (2019) Jianxin Ma, Peng Cui, Kun Kuang, Xin Wang, and Wenwu Zhu. Disentangled Graph Convolutional Networks. In ICML, 2019. URL https://proceedings.mlr.press/v97/ma19a.html.
  • Madry et al. (2018) Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In ICLR, 2018.
  • Minderer et al. (2020) Matthias Minderer, Olivier Bachem, Neil Houlsby, and Michael Tschannen. Automatic shortcut removal for self-supervised representation learning. In ICML, 2020.
  • Morris et al. (2020) Christopher Morris, Nils M. Kriege, Franka Bause, Kristian Kersting, Petra Mutzel, and Marion Neumann. TUDataset: A collection of benchmark datasets for learning with graphs. Technical report, arXiv, 2020. URL http://arxiv.org/abs/2007.08663.
  • Narayanan et al. (2017) A. Narayanan, Mahinthan Chandramohan, R. Venkatesan, Lihui Chen, Yang Liu, and Shantanu Jaiswal. graph2vec: Learning Distributed Representations of Graphs. arXiv preprint, 2017.
  • Qiu et al. (2020) Jiezhong Qiu, Qibin Chen, Yuxiao Dong, Jing Zhang, Hongxia Yang, Ming Ding, Kuansan Wang, and Jie Tang. GCC: Graph Contrastive Coding for Graph Neural Network Pre-Training. In KDD, 2020. ISBN 978-1-4503-7998-4. doi: 10.1145/3394486.3403168. URL https://doi.org/10.1145/3394486.3403168.
  • Robinson et al. (2021) Joshua Robinson, Li Sun, Ke Yu, Kayhan Batmanghelich, Stefanie Jegelka, and Suvrit Sra. Can contrastive learning avoid shortcut solutions? In NeurIPS, 2021.
  • Shwartz-Ziv & Tishby (2017) Ravid Shwartz-Ziv and Naftali Tishby. Opening the Black Box of Deep Neural Networks via Information. arXiv:1703.00810 [cs], April 2017. URL http://arxiv.org/abs/1703.00810.
  • Simonovsky & Komodakis (2018) Martin Simonovsky and Nikos Komodakis. GraphVAE: Towards generation of small graphs using variational autoencoders. In ICLR, 2018.
  • Sun et al. (2019) Fan-Yun Sun, Jordan Hoffmann, Vikas Verma, and Jian Tang. InfoGraph: Unsupervised and Semi-supervised Graph-Level Representation Learning via Mutual Information Maximization. In ICLR, 2019. URL http://arxiv.org/abs/1908.01000.
  • Suresh et al. (2021) Susheel Suresh, Pan Li, Cong Hao, and Jennifer Neville. Adversarial Graph Augmentation to Improve Graph Contrastive Learning. In NeurIPS, 2021. URL https://proceedings.neurips.cc/paper/2021/hash/854f1fb6f65734d9e49f708d6cd84ad6-Abstract.html.
  • Tishby et al. (2000) Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000.
  • Tschannen et al. (2019) Michael Tschannen, Josip Djolonga, Paul K Rubenstein, Sylvain Gelly, and Mario Lucic. On mutual information maximization for representation learning. In ICLR, 2019.
  • van den Oord et al. (2018) Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation Learning with Contrastive Predictive Coding. arXiv e-prints, 2018. URL https://ui.adsabs.harvard.edu/abs/2018arXiv180703748V.
  • Veličković et al. (2018) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph Attention Networks. In ICLR, 2018. URL https://openreview.net/forum?id=rJXMpikCZ.
  • Veličković et al. (2019) Petar Veličković, William Fedus, William L. Hamilton, Pietro Liò, Yoshua Bengio, and R. Devon Hjelm. Deep Graph Infomax. In ICLR, 2019. URL https://openreview.net/forum?id=rklz9iAcKQ.
  • Wu et al. (2022a) Qitian Wu, Hengrui Zhang, Junchi Yan, and David Wipf. Handling distribution shifts on graphs: An invariance perspective. In ICLR, 2022a.
  • Wu et al. (2020) Tailin Wu, Hongyu Ren, Pan Li, and Jure Leskovec. Graph Information Bottleneck. In NeurIPS. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/hash/ebc2aa04e75e3caabda543a1317160c0-Abstract.html.
  • Wu et al. (2022b) Ying-Xin Wu, Xiang Wang, An Zhang, Xiangnan He, and Tat seng Chua. Discovering invariant rationales for graph neural networks. In ICLR, 2022b.
  • Wu et al. (2018a) Zhenqin Wu, Bharath Ramsundar, Evan N Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S Pappu, Karl Leswing, and Vijay Pande. Moleculenet: a benchmark for molecular machine learning. Chemical science, 9(2):513–530, 2018a.
  • Wu et al. (2018b) Zhirong Wu, Yuanjun Xiong, Stella X. Yu, and Dahua Lin. Unsupervised Feature Learning via Non-Parametric Instance Discrimination. In CVPR, 2018b. URL https://openaccess.thecvf.com/content_cvpr_2018/html/Wu_Unsupervised_Feature_Learning_CVPR_2018_paper.html.
  • Xu et al. (2021a) Dongkuan Xu, Wei Cheng, Dongsheng Luo, Haifeng Chen, and Xiang Zhang. InfoGCL: Information-Aware Graph Contrastive Learning. In NeurIPS, 2021a. URL https://proceedings.neurips.cc/paper/2021/hash/ff1e68e74c6b16a1a7b5d958b95e120c-Abstract.html.
  • Xu et al. (2019) Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How Powerful are Graph Neural Networks? In ICLR, 2019. URL https://openreview.net/forum?id=ryGs6iA5Km.
  • Xu et al. (2021b) Minghao Xu, Hang Wang, Bingbing Ni, Hongyu Guo, and Jian Tang. Self-supervised graph-level representation learning with local and global structure. In ICML, 2021b.
  • Yang et al. (2021) Longqi Yang, Liangliang Zhang, and Wenjing Yang. Graph Adversarial Self-Supervised Learning. In NeurIPS, 2021. URL https://proceedings.neurips.cc/paper/2021/hash/7d3010c11d08cf990b7614d2c2ca9098-Abstract.html.
  • Ye et al. (2021) Haotian Ye, Chuanlong Xie, Tianle Cai, Ruichen Li, Zhenguo Li, and Liwei Wang. Towards a theoretical framework of out-of-distribution generalization. In NeurIPS, 2021.
  • You et al. (2020) Yuning You, Tianlong Chen, Yongduo Sui, Ting Chen, Zhangyang Wang, and Yang Shen. Graph Contrastive Learning with Augmentations. In NeurIPS, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/3fe230348e9a12c13120749e3f9fa4cd-Abstract.html.
  • You et al. (2021) Yuning You, Tianlong Chen, Yang Shen, and Zhangyang Wang. Graph contrastive learning automated. In ICLR, 2021.
  • Zhao et al. (2021) Long Zhao, Yuxiao Wang, Jiaping Zhao, Liangzhe Yuan, Jennifer J. Sun, Florian Schroff, Hartwig Adam, Xi Peng, Dimitris Metaxas, and Ting Liu. Learning View-Disentangled Human Pose Representation by Contrastive Cross-View Mutual Information Maximization. In CVPR, 2021. URL https://openaccess.thecvf.com/content/CVPR2021/html/Zhao_Learning_View-Disentangled_Human_Pose_Representation_by_Contrastive_Cross-View_Mutual_Information_CVPR_2021_paper.html.
  • Zhu et al. (2021) Yanqiao Zhu, Yichen Xu, Qiang Liu, and Shu Wu. An Empirical Study of Graph Contrastive Learning. arXiv preprint, 2021.

Appendix A Out-of-distribution Scenario on Graph

Refer to caption
Figure 5: An out-of-distribution situation in a molecule graph prediction task. The causal functional sub-structure (red) is spuriously correlated with different trivial sub-structures in the training and test sets. These statistical correlations can lead to poor robustness and transferability.

In this section, we illustrate the out-of-distribution scenario in graph learning tasks. In molecule property studies, a specific property of a molecule (e.g., toxicity or lipophilicity) usually depends on whether it has a corresponding sub-structure (termed a functional group). For example, hydrophilic molecules usually have a hydroxyl group (-OH). Therefore, a GNN model well trained on a molecule graph prediction task is capable of reflecting the sub-structure information in the graph representation. However, it is usually the case in real-world scenarios that the predictive functional group is accompanied by some irrelevant groups in certain environments, causing spurious correlations. These correlations usually lead to poor generalization performance when the model is evaluated in another environment with different spurious correlations. Figure 5 intuitively demonstrates this kind of scenario, where the red subgraph is the feature we can rely on to make causal predictions. However, in the training set it usually shows up with the green subgraph, which does not serve as the functional group of the property. Consequently, the model is easily misled into treating the green subgraph as an important indicator of the property. When we evaluate the model on the test set, where the causal subgraph is correlated with another kind of group (yellow subgraph), there is usually a huge gap between its performances on the two sets.

Appendix B Discussion on Feature Suppression

In this section, we follow previous works (Chen et al., 2021; Robinson et al., 2021) to present a more formal definition of feature suppression and clarify its relation to contrastive learning.

First, we assume the graph data G has n feature sub-spaces, G^{1},\ldots,G^{n}, where each G^{i} corresponds to a distinct feature of G. To quantify the relation between G and its feature sub-spaces, we measure the conditional probability of G given a specific feature sub-space G^{i} (i\in[n]), denoted as p(G\mid G^{i}). We further define an injective map g:G^{i}\rightarrow G that produces the observation G=g(G^{i}). Since G^{i} is not explicit, we aim to train an encoder f:G\rightarrow\mathbb{R}^{d} that maps the input graph data G into a latent space, extracting useful high-level information \mathbf{z}^{i} corresponding to each feature sub-space G^{i} during contrastive learning. Therefore, we use p(G\mid\mathbf{z}^{i}) as an approximation of p(G\mid G^{i}). Then we have:

  • For any feature sub-space G^{i} and its complementary feature sub-space G^{\bar{i}}, f suppresses feature i\in[n] if p(G\mid\mathbf{z}^{i})=p(G\mid\mathbf{z}^{\bar{i}});

  • For any feature sub-space G^{i} and its complementary feature sub-space G^{\bar{i}}, f distinguishes feature i\in[n] if p(G\mid\mathbf{z}^{i}) and p(G\mid\mathbf{z}^{\bar{i}}) have disjoint supports.

To sum up, a feature is suppressed if it makes no difference to instance discrimination. A common belief about unsupervised learning strategies is that, due to the lack of supervision, they usually produce representations with a uniform feature space distribution, i.e., every feature sub-space is treated equally without feature suppression. However, this is not necessarily the case in contrastive learning. Taking the commonly used InfoNCE loss (van den Oord et al., 2018) as an example, it can be divided into two parts, an alignment term and a uniformity term (Chen et al., 2020), as follows:

\tau\mathcal{L}^{\mathrm{InfoNCE}}=\underbrace{-\frac{1}{m}\sum_{i,j}\operatorname{sim}\left(\bm{z}_{i},\bm{z}_{j}\right)}_{\mathcal{L}_{\text{alignment}}}+\underbrace{\frac{\tau}{m}\sum_{i}\log\sum_{k=1}^{2m}\mathbf{1}_{[k\neq i]}\exp\left(\operatorname{sim}\left(\bm{z}_{i},\bm{z}_{k}\right)/\tau\right)}_{\mathcal{L}_{\text{uniformity}}} (10)

Aligning the positive pair distinguishes the shared feature sub-space G^{i}. Meanwhile, there also exist randomly drawn negative samples that might share the same factors in G^{i}, so the uniformity term might suppress the feature sub-space G^{i}. Therefore, for any feature i\in[n], the optimization process can either suppress or distinguish it, and both outcomes can lower the contrastive loss. From this analysis we can derive the conclusion mentioned in Section 1 that a lower contrastive loss does not necessarily yield better performance.
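To make the decomposition in Equation 10 concrete, the following sketch computes the two terms for a batch of paired view embeddings. This is an illustrative snippet rather than our released implementation; it assumes the embeddings are L2-normalized so that the inner product realizes sim(·,·).

```python
import torch

# Illustrative computation of the two terms in Equation 10 for m positive
# pairs (2m embeddings in total). Assumes z1, z2 are L2-normalized view
# embeddings of shape (m, d), so z @ z.T gives cosine similarities.
def infonce_terms(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.5):
    m = z1.size(0)
    z = torch.cat([z1, z2], dim=0)                      # all 2m embeddings
    # Alignment term: -(1/m) * sum over positive pairs of sim(z_i, z_j).
    align = -(z1 * z2).sum(dim=-1).mean()
    # Uniformity term: (tau/m) * sum_i log sum_{k != i} exp(sim(z_i, z_k)/tau).
    sim = (z @ z.t()) / tau
    mask = torch.eye(2 * m, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float('-inf'))          # implements 1[k != i]
    uniform = (tau / m) * torch.logsumexp(sim, dim=1).sum()
    return align, uniform
```

A feature shared by a positive pair lowers the alignment term, while the same feature recurring in random negatives raises the uniformity term, so the optimization can trade one against the other; the sum of the two returned values equals \tau\mathcal{L}^{\mathrm{InfoNCE}}.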

Appendix C Summary of Datasets

In this work, we use nine datasets from the TU Benchmark (Morris et al., 2020) to evaluate our proposed GraphCV under the unsupervised setting, where five of them are biochemical datasets and the other four are social network datasets. We also utilize the ogbg-molhiv dataset from the Open Graph Benchmark (OGB) (Hu et al., 2020a) to further evaluate GraphCV under the semi-supervised setting. Besides, datasets sampled from MoleculeNet (Wu et al., 2018a) are employed to evaluate our model under the transfer learning setting. The statistics of these datasets are shown in Tables 4 and 5.

Dataset #Graphs Avg #Nodes Avg #Edges #Class Metric Category
MUTAG 188 17.93 19.79 2 Accuracy biochemical
PTC-MR 344 14.29 14.69 2 Accuracy biochemical
PROTEINS 1,113 39.06 72.82 2 Accuracy biochemical
NCI1 4,110 29.87 32.30 2 Accuracy biochemical
DD 1,178 284.32 715.66 2 Accuracy biochemical
COLLAB 5,000 74.49 2457.78 3 Accuracy social network
IMDB-B 1,000 19.77 96.53 2 Accuracy social network
RDT-B 2,000 429.63 497.75 2 Accuracy social network
IMDB-M 1,500 13.00 65.94 3 Accuracy social network
ogbg-molhiv 41,127 25.50 27.50 2 ROC-AUC biochemical (OGB)
Table 4: Statistics of TU-datasets and OGB dataset.
Dataset #Graphs Avg #Nodes Avg Degree #Tasks Metric Category
ZINC-2M 2,000,000 26.62 57.72 - - biochemical
BBBP 2,039 24.06 51.90 1 ROC-AUC biochemical
Tox21 7,813 18.57 38.58 12 ROC-AUC biochemical
ToxCast 8,576 18.78 38.62 617 ROC-AUC biochemical
SIDER 1,427 33.64 70.71 27 ROC-AUC biochemical
ClinTox 1,477 26.15 55.76 2 ROC-AUC biochemical
MUV 93,087 24.23 52.55 17 ROC-AUC biochemical
HIV 41,127 25.51 54.93 1 ROC-AUC biochemical
BACE 1,513 34.08 73.71 1 ROC-AUC biochemical
Table 5: Statistics of MoleculeNet datasets.

All eleven datasets are publicly available.

Appendix D Implementation Details

All experiments are conducted with the following settings:

  • Operating System: Ubuntu 18.04.5 LTS

  • CPU: AMD(R) Ryzen 9 3900x

  • GPU: NVIDIA GeForce RTX 2080ti

  • Software: Python 3.8.5; PyTorch 1.10.1; PyTorch Geometric 2.0.4; PyGCL 0.1.2; NumPy 1.20.1; scikit-learn 0.24.1.

We implement our framework with PyTorch and the PyGCL library (Zhu et al., 2021). We choose GIN (Xu et al., 2019) as the backbone graph encoder, and the model is optimized with the Adam optimizer. Following (You et al., 2020; Yang et al., 2021; Li et al., 2021), we employ a linear SVM classifier for downstream task-specific classification. The graph augmentation operations used in our work are the same as in (You et al., 2020), including node dropping, edge perturbation, attribute masking and subgraph sampling, all borrowed from the implementation of (Zhu et al., 2021). There are two model-specific hyper-parameters, \lambda_{r} and \lambda_{a}, whose search spaces are {0.0, 1.0, 3.0, 5.0, 10.0} and {0.0, 0.25, 0.5, 0.75, 1.0}, respectively. For the other important hyper-parameters, we search the learning rate in {0.01, 0.005, 0.001, 0.0005, 0.0001}, the embedding dimension in {32, 64, 128, 256, 512}, the number of GNN layers in {2, 3, 4, 5}, and the batch size in {32, 64, 128, 256, 512} (except for ogbg-molhiv, where we use {64, 128, 256, 512, 1024}). Besides, we fix the perturbation bound \epsilon, the ascent step size \alpha and the number of ascent steps T to 0.008, 0.008 and 5 during hyper-parameter fine-tuning. As for the implementation details of transfer learning, we basically follow the pre-training settings of previous works (You et al., 2020; Xu et al., 2021b).
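For reference, a minimal sketch of the backbone setup is given below. It uses the high-level GIN wrapper from PyTorch Geometric with illustrative hyper-parameter values drawn from the search spaces above; `num_features` is a placeholder for the dataset's node feature dimension, and the disentangling heads and adversarial components of GraphCV are omitted.

```python
import torch
from torch_geometric.nn import GIN, global_add_pool

num_features = 7                 # placeholder: node feature dimension of the dataset
hidden_dim, num_layers = 128, 3  # illustrative values from the search spaces above

# GIN backbone producing node embeddings; graph-level representations are
# obtained with a sum readout, as is standard for GIN.
encoder = GIN(in_channels=num_features, hidden_channels=hidden_dim,
              num_layers=num_layers)
optimizer = torch.optim.Adam(encoder.parameters(), lr=0.001)

def embed(data):
    h = encoder(data.x, data.edge_index)   # node-level embeddings
    return global_add_pool(h, data.batch)  # graph-level representation
```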

Appendix E Proof

E.1 Proof of Theorem 1

We repeat Theorem 1 as follows.

Theorem 1. Suppose f(\cdot) is a GNN encoder as powerful as the 1-WL test. Let g_{c}(\cdot) elicit only the augmentation-specific information from \mathbf{z}, while g_{p}(\cdot) extracts the essential factors of G from \mathbf{z}_{1} and \mathbf{z}_{2}. Then we have:

I\left(t_{1}(G);\mathbf{z}_{2}^{c},\mathbf{z}_{2}^{p}\right)\geq I\left(\mathbf{z}_{1}^{p};\mathbf{z}_{2}^{p}\right)\quad\text{where }G\in\mathcal{G}\text{ and }t_{1}(\cdot),t_{2}(\cdot)\in\mathcal{T}.

Proof. According to the assumption in Theorem 1, for any two graphs G,G^{\prime}\in\mathcal{G}, if G\cong G^{\prime} then we have \mathbf{z}=\mathbf{z}^{\prime}, where \mathbf{z}=f(G) and \mathbf{z}^{\prime}=f(G^{\prime}).

Besides, \mathbf{z}^{p}=g_{p}(\mathbf{z}) is specific to the predictive factors and \mathbf{z}^{c}=g_{c}(\mathbf{z}) is specific to the non-predictive factors, which means \mathbf{z}^{p} and \mathbf{z}^{c} are mutually exclusive and \mathbf{z}^{p}\sim G. So we have,

p\left(\mathbf{z}^{p},\mathbf{z}^{c}\right)=p\left(\mathbf{z}^{p}\right)p\left(\mathbf{z}^{c}\right),\qquad p\left(\mathbf{z}^{p},\mathbf{z}^{c}\mid t(G)\right)=p\left(\mathbf{z}^{p}\mid t(G)\right)p\left(\mathbf{z}^{c}\mid t(G)\right). (11)

Then, we want to prove that, given three random variables a, b and c satisfying p(b,c)=p(b)p(c) and p(b,c\mid a)=p(b\mid a)p(c\mid a), we have I(a;b\mid c)=I(a;b). According to the definition of conditional mutual information, we have:

I\left(a;b\mid c\right)=\sum_{a}\sum_{b}\sum_{c}p\left(a,b,c\right)\log\frac{p\left(a,b,c\right)p\left(c\right)}{p\left(a,c\right)p\left(b,c\right)} (12)
=\sum_{a}\sum_{b}\sum_{c}p\left(a\right)p\left(b,c\mid a\right)\log\frac{p\left(b,c\mid a\right)p\left(c\right)}{p\left(c\mid a\right)p\left(b\right)p\left(c\right)}
=\sum_{a}\sum_{b}\sum_{c}p\left(a\right)p\left(b\mid a\right)p\left(c\mid a\right)\log\frac{p\left(b\mid a\right)p\left(c\mid a\right)}{p\left(c\mid a\right)p\left(b\right)}
=\sum_{a}\sum_{b}p\left(a\right)p\left(b\mid a\right)\log\frac{p\left(b\mid a\right)}{p\left(b\right)}
=\sum_{a}\sum_{b}p\left(a,b\right)\log\frac{p\left(b\mid a\right)}{p\left(b\right)}
=I\left(a;b\right).

After that, by applying the chain rule to I\left(t_{1}(G);\mathbf{z}_{2}^{p},\mathbf{z}_{2}^{c}\right), we have,

I\left(t_{1}(G);\mathbf{z}_{2}^{p},\mathbf{z}_{2}^{c}\right)=I\left(t_{1}(G);\mathbf{z}_{2}^{p}\mid\mathbf{z}_{2}^{c}\right)+I\left(t_{1}(G);\mathbf{z}_{2}^{c}\right) (13)
\stackrel{(12)}{=}I\left(t_{1}(G);\mathbf{z}_{2}^{p}\right)+I\left(t_{1}(G);\mathbf{z}_{2}^{c}\right)
\stackrel{(a)}{\geq}I\left(t_{1}(G);\mathbf{z}_{2}^{p}\right)
\stackrel{(b)}{\geq}I\left(\mathbf{z}_{1}^{c},\mathbf{z}_{1}^{p};\mathbf{z}_{2}^{p}\right)
\stackrel{(12)}{=}I\left(\mathbf{z}_{1}^{c};\mathbf{z}_{2}^{p}\right)+I\left(\mathbf{z}_{1}^{p};\mathbf{z}_{2}^{p}\right)
\stackrel{(a)}{\geq}I\left(\mathbf{z}_{1}^{p};\mathbf{z}_{2}^{p}\right),

where \stackrel{(12)}{=} is derived from the identity established in Equation 12, \stackrel{(a)}{\geq} is based on the non-negativity of mutual information, i.e., I(\cdot\,;\cdot)\geq 0, and \stackrel{(b)}{\geq} follows from the data processing inequality (Cover, 1999). We thus reach the lower bound of I\left(t_{1}(G);\mathbf{z}_{2}^{p},\mathbf{z}_{2}^{c}\right) in Equation 13, so we can maximize the consistency between the information captured from the two augmentation graph views by minimizing \mathcal{L}_{\text{pre}}.
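As a sanity check on the identity I(a;b\mid c)=I(a;b) used above, the following snippet (not part of the paper's pipeline) evaluates both sides numerically on a toy discrete distribution that satisfies the two independence conditions: a is a uniform two-bit variable and b, c are its two bits, so b and c are independent both marginally and conditionally on a.

```python
import numpy as np

# p(a, b, c): a is uniform over {0,1,2,3}; b, c are its low and high bits.
p = np.zeros((4, 2, 2))
for a in range(4):
    p[a, a & 1, (a >> 1) & 1] = 0.25

def mi(pxy):
    """Mutual information (in nats) of a 2-D joint distribution."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return (pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum()

I_ab = mi(p.sum(axis=2))                        # I(a;b)
pc = p.sum(axis=(0, 1))                         # p(c)
I_ab_c = sum(pc[c] * mi(p[:, :, c] / pc[c])     # I(a;b|c), averaged over c
             for c in range(2))
print(I_ab, I_ab_c)                             # both equal log(2) ≈ 0.693
```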

E.2 Proof of Theorem 2

We repeat Theorem 2 as follows.

Theorem 2. Assume q is a Gaussian distribution and g_{r} is the parameterized reconstruction model that infers \mathbf{z}_{w} from \left(\mathbf{z}_{w^{\prime}}^{p},\mathbf{z}_{w}^{c}\right). Then we have:

H\left(\mathbf{z}_{w}\mid\mathbf{z}_{w^{\prime}}^{p},\mathbf{z}_{w}^{c}\right)\leq\left\|\mathbf{z}_{w}-g_{r}\left(\mathbf{z}_{w^{\prime}}^{p}\odot\mathbf{z}_{w}^{c}\right)\right\|_{2}^{2}\quad\text{where }w=w^{\prime}\text{ or }w\neq w^{\prime}.

Proof. To reconstruct the entangled representation \mathbf{z}_{w} from its corresponding non-predictive representation \mathbf{z}_{w}^{c} and the predictive representation of any augmentation view \mathbf{z}_{w^{\prime}}^{p} (w and w^{\prime} are not necessarily equal), we need to minimize the conditional entropy:

H\left(\mathbf{z}_{w}\mid\mathbf{z}_{w^{\prime}}^{p},\mathbf{z}_{w}^{c}\right)=-\mathbb{E}_{p\left(\mathbf{z}_{w},\mathbf{z}_{w^{\prime}}^{p},\mathbf{z}_{w}^{c}\right)}\left[\log p\left(\mathbf{z}_{w}\mid\mathbf{z}_{w^{\prime}}^{p},\mathbf{z}_{w}^{c}\right)\right]. (14)

Since the real distribution p\left(\mathbf{z}_{w}\mid\mathbf{z}_{w^{\prime}}^{p},\mathbf{z}_{w}^{c}\right) is unknown and intractable, we introduce a variational distribution q\left(\mathbf{z}_{w}\mid\mathbf{z}_{w^{\prime}}^{p},\mathbf{z}_{w}^{c}\right) to approximate it. Therefore, we have,

\mathbb{E}_{p\left(\mathbf{z}_{w},\mathbf{z}_{w^{\prime}}^{p},\mathbf{z}_{w}^{c}\right)}\left[\log p\left(\mathbf{z}_{w}\mid\mathbf{z}_{w^{\prime}}^{p},\mathbf{z}_{w}^{c}\right)\right]=\mathbb{E}_{p\left(\mathbf{z}_{w},\mathbf{z}_{w^{\prime}}^{p},\mathbf{z}_{w}^{c}\right)}\left[\log q\left(\mathbf{z}_{w}\mid\mathbf{z}_{w^{\prime}}^{p},\mathbf{z}_{w}^{c}\right)\right]+D_{\mathrm{KL}}\left(p\left(\mathbf{z}_{w}\mid\mathbf{z}_{w^{\prime}}^{p},\mathbf{z}_{w}^{c}\right)\,\|\,q\left(\mathbf{z}_{w}\mid\mathbf{z}_{w^{\prime}}^{p},\mathbf{z}_{w}^{c}\right)\right). (15)

Due to the non-negativity of the KL-divergence between any two distributions, -\mathbb{E}_{p\left(\mathbf{z}_{w},\mathbf{z}_{w^{\prime}}^{p},\mathbf{z}_{w}^{c}\right)}\left[\log q\left(\mathbf{z}_{w}\mid\mathbf{z}_{w^{\prime}}^{p},\mathbf{z}_{w}^{c}\right)\right] is an upper bound of H\left(\mathbf{z}_{w}\mid\mathbf{z}_{w^{\prime}}^{p},\mathbf{z}_{w}^{c}\right). Based on the assumption of Theorem 2, we let q\left(\mathbf{z}_{w}\mid\mathbf{z}_{w^{\prime}}^{p},\mathbf{z}_{w}^{c}\right) be a Gaussian distribution \mathcal{N}\left(\mathbf{z}_{w}\mid g_{r}\left(\mathbf{z}_{w^{\prime}}^{p}\odot\mathbf{z}_{w}^{c}\right),\sigma^{2}\mathbf{I}\right), where g_{r}(\cdot) is the reconstruction network that predicts \mathbf{z}_{w} from \left(\mathbf{z}_{w^{\prime}}^{p},\mathbf{z}_{w}^{c}\right) and \sigma^{2} is the variance. Thus we have,

H\left(\mathbf{z}_{w}\mid\mathbf{z}_{w^{\prime}}^{p},\mathbf{z}_{w}^{c}\right)\leq-\mathbb{E}_{p\left(\mathbf{z}_{w},\mathbf{z}_{w^{\prime}}^{p},\mathbf{z}_{w}^{c}\right)}\left[\log q\left(\mathbf{z}_{w}\mid\mathbf{z}_{w^{\prime}}^{p},\mathbf{z}_{w}^{c}\right)\right] (16)
=-\mathbb{E}_{p\left(\mathbf{z}_{w},\mathbf{z}_{w^{\prime}}^{p},\mathbf{z}_{w}^{c}\right)}\left[\log\left(\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{\left\|\mathbf{z}_{w}-g_{r}\left(\mathbf{z}_{w^{\prime}}^{p}\odot\mathbf{z}_{w}^{c}\right)\right\|^{2}}{2\sigma^{2}}\right)\right)\right]
=-\mathbb{E}_{p\left(\mathbf{z}_{w},\mathbf{z}_{w^{\prime}}^{p},\mathbf{z}_{w}^{c}\right)}\left[\log\frac{1}{\sqrt{2\pi}\,\sigma}-\frac{\left\|\mathbf{z}_{w}-g_{r}\left(\mathbf{z}_{w^{\prime}}^{p}\odot\mathbf{z}_{w}^{c}\right)\right\|^{2}}{2\sigma^{2}}\right].

Hence, we obtain the upper bound of H\left(\mathbf{z}_{w}\mid\mathbf{z}_{w^{\prime}}^{p},\mathbf{z}_{w}^{c}\right) in Equation 16. To minimize this intractable entropy, we can instead minimize its upper bound, and thereby derive the following objective by neglecting the constant terms,

\min\;\mathbb{E}_{p\left(\mathbf{z}_{w},\mathbf{z}_{w^{\prime}}^{p},\mathbf{z}_{w}^{c}\right)}\left\|\mathbf{z}_{w}-g_{r}\left(\mathbf{z}_{w^{\prime}}^{p}\odot\mathbf{z}_{w}^{c}\right)\right\|_{2}^{2}. (17)

Since we adopt two augmentation views and propose the cross-view reconstruction mechanism in our method, we can minimize this entropy by minimizing \mathcal{L}_{\text{recon}} and thus guarantee the disentanglement of \mathbf{z}^{p} and \mathbf{z}^{c}.
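For illustration, a minimal PyTorch sketch of the resulting objective is given below. It assumes g_{r} is a small MLP (the proof only requires a parameterized reconstruction network) and interprets \odot as the element-wise product; we mirror the cross-view pairing by summing both the w=w^{\prime} and w\neq w^{\prime} terms of Equation 17.

```python
import torch
import torch.nn as nn

class Reconstructor(nn.Module):
    """A sketch of g_r: rebuilds z_w from g_r(z_{w'}^p ⊙ z_w^c)."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                 nn.Linear(dim, dim))

    def forward(self, z_p: torch.Tensor, z_c: torch.Tensor) -> torch.Tensor:
        return self.net(z_p * z_c)  # element-wise product, then MLP

def recon_loss(g_r, z1, z2, z1p, z2p, z1c, z2c):
    """Sum of the squared-error bounds in Equation 17 over the
    same-view (w = w') and cross-view (w != w') pairings."""
    pairs = [(z1, z1p, z1c), (z2, z2p, z2c),   # w = w'
             (z1, z2p, z1c), (z2, z1p, z2c)]   # w != w'
    return sum((zw - g_r(zwp, zwc)).pow(2).sum(-1).mean()
               for zw, zwp, zwc in pairs)
```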

Appendix F Effects of Representation Disentanglement

Figure 6: InfoNCE loss of the two disentangled representations between the two augmentation graph views, where orange lines are the InfoNCE loss between the two non-predictive representations and blue lines are the InfoNCE loss between the two predictive representations.

In this section, we conduct experiments to investigate the representation disentanglement of our proposed GraphCV. Specifically, we use the InfoNCE loss (van den Oord et al., 2018) to dynamically measure the representation difference between the two augmentation graph views based on the two disentangled representations, where blue lines indicate the InfoNCE loss between \mathbf{z}^{p}_{1} and \mathbf{z}^{p}_{2}, and orange lines represent the InfoNCE loss between \mathbf{z}^{c}_{1} and \mathbf{z}^{c}_{2}. For simplicity, we only show the first 100 pre-training epochs on PROTEINS and COLLAB in Figure 6; similar phenomena can be observed on other datasets. From the loss curves in Figure 6, we find that the contrastive loss between the predictive representations gradually decreases, indicating that the predictive representation is optimized to capture all the information shared between the two augmentation views. Meanwhile, the contrastive loss between the non-predictive representations achieves a noticeable increase, consistent with our expectation that the two independently sampled augmentation operators cause a distribution shift between the two augmentation views. To further investigate whether the feature suppression problem is equally serious in \mathbf{z}^{p} and \mathbf{z}^{c}, we compare the performance of the two representations on downstream tasks. The comparison results are as follows:

Table 6: Performance comparison of the two learned representations. Results are reported as mean±std (%); the better result on each dataset is achieved by \mathbf{z}^{p}.

Representation MUTAG COLLAB NCI1 PROTEINS IMDB-B RDT-B DD ogbg-molhiv
\mathbf{z}^{c} 88.1±1.2 75.1±0.7 72.2±2.0 73.5±0.8 71.8±0.9 89.4±1.0 75.8±0.6 69.70±2.8
\mathbf{z}^{p} 92.3±0.7 80.5±0.5 82.0±1.0 76.8±0.4 75.6±0.4 92.5±0.9 80.5±0.5 75.36±1.4

It is easy to observe an obvious performance gap between the two learned representations, indicating that they suffer from feature suppression to different degrees, and that the feature subset that is more robust to augmentation is more informative and transferable than the one sensitive to augmentations. Therefore, we believe our proposed GraphCV can further alleviate the feature suppression issue with the disentanglement design.

Appendix G Training Algorithm

In this section, we summarize the details of our proposed method in the following algorithm.

Input: Graph dataset \mathcal{G}=\left\{G_{i}=(V_{i},E_{i})\right\}_{i=1}^{N}; augmentation family \mathcal{T}; loss coefficients \lambda_{r}, \lambda_{a}; number of ascent steps T; ascent step size \alpha; perturbation bound \epsilon.
Output: The disentangled predictive representations \mathbf{Z}^{p}=\left\{\mathbf{z}_{i}^{p}\right\}_{i=1}^{N}.
for each training epoch do
      for sampled minibatch \mathcal{B}=\left\{G_{i}\right\}_{i=1}^{|\mathcal{B}|} do
            for G_{i}\in\mathcal{B} do
                  \mathbf{z}_{1,i}=f\left(t_{1}(G_{i})\right), \mathbf{z}_{2,i}=f\left(t_{2}(G_{i})\right) ; \triangleright t_{1}(\cdot),t_{2}(\cdot)\in\mathcal{T}
                  \mathbf{z}_{1,i}^{p}=g_{p}\left(\mathbf{z}_{1,i}\right), \mathbf{z}_{2,i}^{p}=g_{p}\left(\mathbf{z}_{2,i}\right) ;
                  \mathbf{z}_{1,i}^{c}=g_{c}\left(\mathbf{z}_{1,i}\right), \mathbf{z}_{2,i}^{c}=g_{c}\left(\mathbf{z}_{2,i}\right) ;
            Calculate \mathcal{L}_{\text{pre}} according to Equation 6 ;
            Calculate \mathcal{L}_{\text{recon}} according to Equation 8 ;
            \mathcal{L}\leftarrow\mathcal{L}_{\text{pre}}+\lambda_{r}\mathcal{L}_{\text{recon}} ;
            \delta_{0}\leftarrow U(-\epsilon,\epsilon) ;
            for t=1 to T do
                  Calculate \mathcal{L}_{\text{adv}} according to Equation 10 ;
                  \delta_{t}\leftarrow\delta_{t-1}+\alpha\nabla_{\delta}\mathcal{L}_{\text{adv}} ; \triangleright Update perturbation to maximize \mathcal{L}_{\text{adv}}
                  \mathcal{L}\leftarrow\mathcal{L}+\frac{\lambda_{a}}{T}\mathcal{L}_{\text{adv}} ;
            Update the parameters \theta of f and g with the gradient \nabla_{\theta}\mathcal{L}(\theta,\mathcal{B}) over the minibatch ;
return \mathbf{Z}^{p}=\left\{\mathbf{z}_{i}^{p}\right\}_{i=1}^{N}, where \mathbf{z}_{i}^{p}=g_{p}\left(f(G_{i})\right)
Algorithm 1: The training algorithm of GraphCV
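A condensed PyTorch rendering of Algorithm 1 is sketched below. Here f, g_p, g_c and g_r denote the encoder, the two disentangling heads and the reconstruction network, while `augment`, `contrastive_loss`, `recon_loss` and `adv_loss` are hypothetical helpers standing in for the augmentation family \mathcal{T} and Equations 6, 8 and 10; the released implementation may differ in details such as how the perturbation \delta enters the encoder.

```python
import torch

def train_epoch(loader, f, g_p, g_c, g_r, opt,
                lambda_r=1.0, lambda_a=0.5, T=5, alpha=0.008, eps=0.008):
    for batch in loader:
        g1, g2 = augment(batch), augment(batch)          # t1(G), t2(G)
        z1, z2 = f(g1), f(g2)
        z1p, z2p = g_p(z1), g_p(z2)                      # predictive parts
        z1c, z2c = g_c(z1), g_c(z2)                      # non-predictive parts

        loss = contrastive_loss(z1p, z2p)                # L_pre  (Eq. 6)
        loss = loss + lambda_r * recon_loss(g_r, z1, z2,
                                            z1p, z2p, z1c, z2c)  # Eq. 8

        # Adversarial view: T ascent steps on the perturbation delta.
        delta = torch.empty_like(batch.x).uniform_(-eps, eps).requires_grad_(True)
        for _ in range(T):
            l_adv = adv_loss(f, g_p, batch, delta)       # L_adv (Eq. 10)
            grad, = torch.autograd.grad(l_adv, delta, retain_graph=True)
            loss = loss + (lambda_a / T) * l_adv         # accumulate as in Alg. 1
            delta = (delta + alpha * grad).detach().requires_grad_(True)

        opt.zero_grad()
        loss.backward()
        opt.step()
```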

Appendix H Hyper-parameter Sensitivity

In this section, we study the impacts of several important hyper-parameters in our method, including the reconstruction loss coefficient \lambda_{r}, the adversarial loss coefficient \lambda_{a}, the embedding dimension d, the batch size |\mathcal{B}| and the number of GNN layers L. For simplicity, we report results on four datasets, i.e., MUTAG, PROTEINS, RDT-B and COLLAB, as they cover different domains and scales. We illustrate the impacts of these hyper-parameters in the figures below.

Figure 7: Impact of the reconstruction loss coefficient \lambda_{r} on different datasets; the non-reconstruction situation (\lambda_{r}=0) is marked with a dashed line for comparison.

From the results shown in Figure 7, we can see that the optimal reconstruction loss coefficient \lambda_{r} differs across datasets, but all the values in our experiments enhance performance compared with the non-reconstruction variant (\lambda_{r}=0), indicating the effectiveness of our proposed cross-view reconstruction mechanism.

Figure 8: Impact of the adversarial loss coefficient \lambda_{a} on different datasets; the non-adversarial situation (\lambda_{a}=0) is marked with a dashed line for comparison.

Figure 8 shows that adversarial training can further improve model performance, which supports the claim that a robust representation with less redundant information usually yields a larger performance gain than a brittle one. During this process, an appropriate adversarial loss coefficient \lambda_{a} needs to be chosen, since an overly large \lambda_{a} may hurt the information sufficiency of the learned representation.

Figure 9: Impact of the embedding dimension d and the number of GNN layers L on different datasets.

We discuss the impacts of the embedding dimension d and the number of GNN layers L together because their experimental results show a similar pattern. From Figure 9, we observe that the optimal values of the two hyper-parameters generally increase with the dataset scale. The reason behind this phenomenon could be that large datasets usually contain more latent factors than small ones, so a model with larger capacity is needed to fit them. However, such a high-capacity message-passing model deteriorates performance on small datasets, as it may over-smooth the learned representations and render them less informative.