Barriers for the performance of graph neural networks (GNN) in discrete random structures. A comment on [SBK22],[ART23],[SBK23]
Recently, graph neural network (GNN) based algorithms were proposed to solve a variety of combinatorial optimization problems, including the Maximum Cut problem, the Maximum Independent Set problem, and similar problems [SBK22],[SBZK22]. The algorithms were tested in particular on random instances of these problems, namely when the underlying graph is generated according to some specified probability distribution. Earlier, a similar proposal using a somewhat different learning architecture was put forward to solve another optimization problem, that of finding ground states of spin glass models [SNL+22].
The publication [SBK22] stirred a debate about whether the GNN based method was adequately benchmarked against the best prior methods. In particular, the critical commentaries [ART23] and [Boe23] point out that a simple greedy algorithm performs better than the GNN in the setting of random graphs, and that in fact stronger algorithmic performance can be reached with more sophisticated methods. A response from the authors [SBK23] pointed out that the GNN performance can be improved further by better tuning of the parameters.
We do not intend to discuss the merits of the arguments and counter-arguments in [SBK22],[ART23],[Boe23],[SBK23]. Rather, in this note we establish a fundamental limitation on running GNN on the random graphs considered in these references, for a broad range of choices of GNN architecture. Specifically, these barriers hold when the depth of the GNN does not scale with the graph size (we note that depth 2 was used in the experiments in [SBK22]), and importantly regardless of any other parameters of the GNN architecture, including the internal dimensions and the update functions. These limitations arise from the presence of the Overlap Gap Property (OGP) phase transition, which is a barrier for many algorithms, both classical and quantum, and in particular for local algorithms [Gam21],[GMZ22]. As we demonstrate in this paper, it is also a barrier to GNN due to its local structure. At the same time, known algorithms, ranging from simple greedy algorithms to more sophisticated algorithms based on message passing, provide the best results for these problems up to the OGP phase transition. This leaves very little space for GNN to outperform the known algorithms, and based on this we side with the conclusions made in [ART23] and [Boe23].
1 GNN for Combinatorial Optimization in Random Graphs
A class of problems discussed in [SBK22] and solved using GNN based methods falls into the domain of combinatorial optimization in random graphs. A graph $G=(V,E)$ is a collection of nodes $V$ and edges $E$, which is a subset of unordered pairs or, more generally, tuples (hyper-edges) of nodes. A generic combinatorial optimization problem is defined by introducing a cost function (also called a Hamiltonian in physics jargon) $H:\{0,1\}^n\to\mathbb{R}$, $n=|V|$, which maps bit strings $\sigma$ (aka "decisions") into real values (aka "cost" or "energy"), and solving the problem $\max_{\sigma\in\{0,1\}^n}H(\sigma)$. An equivalent choice of the decision space $\{-1,1\}^n$ will often be adopted here for convenience. The presence of various kinds of combinatorial constraints on decisions arising from the presence of edges and hyper-edges can be encoded into the cost function $H$.
A canonical example considered in the aforementioned references is the Independent Set problem (which we abbreviate as IS): the problem of finding a largest-cardinality subset $I\subset V$ such that no two nodes in $I$ are spanned by an edge, namely $(u,v)\notin E$ for all $u,v\in I$. This corresponds to a special case of the setting above where $H(\sigma)=\left(\sum_{v\in V}\sigma_v\right)\mathbf{1}\left(\sigma_u\sigma_v=0 \text{ for all } (u,v)\in E\right)$, $\sigma\in\{0,1\}^n$. Namely, $H(\sigma)$ is the number of ones in the string $\sigma$ (indicating inclusion into the independent set) multiplied by the indicator function of the event that $\sigma$ indeed encodes a legitimate independent set. Another example discussed in the same collection of references is the graph Maximum Cut problem (which we abbreviate as MAXCUT). This is the problem of partitioning the nodes of a graph into two sets so as to maximize the number of crossed edges. Formally, this corresponds to the cost function defined by $H(\sigma)=\sum_{(u,v)\in E}\frac{1-\sigma_u\sigma_v}{2}$, $\sigma\in\{-1,1\}^n$. This model extends naturally to hypergraphs as follows. A $K$-uniform hypergraph is a pair $(V,E)$ of a node set $V$ and a collection $E$ of hyperedges, where each hyperedge is an unordered subset of $K$ nodes. Thus a $2$-uniform hypergraph is just a graph. An extension of MAXCUT to hypergraphs is obtained by considering the cost function $H(\sigma)=\sum_{(u_1,\ldots,u_K)\in E}\frac{1-\sigma_{u_1}\cdots\sigma_{u_K}}{2}$. The case $K=2$ then again reduces to the case of MAXCUT on graphs.
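For concreteness, here is a minimal sketch (our own illustration, not part of the cited works) of the two Hamiltonians above, evaluated by brute force on a small graph; the 4-cycle, the variable names, and the use of 0/1 labels for the cut (equivalent to the $\pm 1$ convention in the graph case) are purely illustrative.

```python
import itertools

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]  # a 4-cycle on nodes 0..3
n = 4

def is_cost(sigma):
    """IS Hamiltonian: number of selected nodes if sigma encodes an independent set, else 0."""
    independent = all(not (sigma[u] == 1 and sigma[v] == 1) for u, v in edges)
    return sum(sigma) if independent else 0

def maxcut_cost(sigma):
    """MAXCUT Hamiltonian: number of edges whose endpoints receive different labels."""
    return sum(1 for u, v in edges if sigma[u] != sigma[v])

# Brute-force optimization of both objectives over {0,1}^n.
best_is = max(itertools.product([0, 1], repeat=n), key=is_cost)
best_cut = max(itertools.product([0, 1], repeat=n), key=maxcut_cost)
print(is_cost(best_is), maxcut_cost(best_cut))  # prints 2 and 4 for the 4-cycle
```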
Our last example, arising from the study of spin glasses, corresponds to fixing an order-$p$ tensor $J=(J_{i_1,\ldots,i_p})$ and defining $H(\sigma)=\sum_{1\le i_1,\ldots,i_p\le n}J_{i_1,\ldots,i_p}\,\sigma_{i_1}\cdots\sigma_{i_p}$ for each $\sigma\in\{-1,1\}^n$. The optimization problem is that of finding the value of $\max_{\sigma\in\{-1,1\}^n}H(\sigma)$. Put differently, this is an unconstrained optimization problem on a complete weighted hyper-graph with hyper-edges $(i_1,\ldots,i_p)$.
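A minimal sketch of this Hamiltonian, again our own illustration: it draws an order-$p$ Gaussian coupling tensor and evaluates $H$ at a random spin configuration. The normalization $n^{-(p+1)/2}$ used below is one standard convention (under which the maximum is of constant order) and is an assumption, not something fixed by the text above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 3
J = rng.standard_normal(size=(n,) * p)  # order-p coupling tensor with i.i.d. N(0,1) entries

def p_spin_energy(sigma, J):
    """Evaluate the p-spin Hamiltonian at sigma in {-1,+1}^n (normalized by n^((p+1)/2))."""
    n = len(sigma)
    p = J.ndim
    val = J
    for _ in range(p):
        # contract one tensor index against sigma at a time
        val = np.tensordot(val, sigma, axes=([0], [0]))
    return val / n ** ((p + 1) / 2)

sigma = rng.choice([-1, 1], size=n)
print(p_spin_energy(sigma, J))
```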
In the random setting, either the cost function or the graph (or both) are generated randomly according to some probability distribution. The setting discussed in [SBK22] is the IS problem when the underlying graph is a random $d$-regular graph on the set of nodes denoted for convenience by $[n]=\{1,\ldots,n\}$. $d$-regular means that every node has exactly $d$ neighbors. The graph is generated uniformly at random from the space of all $d$-regular graphs on $n$ nodes (see [JLR11],[FK15] for some background regarding existence and constructions). The random graph constructed this way will be denoted by $\mathbb{G}(n,d)$. The setting of spin glasses corresponds to assuming that the entries of the tensor $J$ are generated randomly and independently from some common distribution, such as the standard normal distribution.
Next we turn to a generic description of GNN algorithms. We follow the notation used in [SBK22]. Given a graph $G=(V,E)$, the algorithm generates a sequence of node- and time-dependent feature vectors $h_v^t\in\mathbb{R}^{d_t}$, $v\in V$. Time is assumed to evolve in discrete steps $t=0,1,\ldots,T$, and $d_t$ represents the dimension of the feature space at time $t$. The feature vectors are generated as follows. The algorithm designer creates node- and time-dependent functions $f_{v,t}$, where each $f_{v,t}$ maps the feature vector of node $v$ together with the feature vectors of its neighbors into $\mathbb{R}^{d_{t+1}}$. Here $N(v)$ denotes the set of neighbors of $v$ (the set of nodes $u$ such that $(u,v)\in E$). The features are then updated according to the rule $h_v^{t+1}=f_{v,t}\!\left(h_v^t,\,(h_u^t,\ u\in N(v))\right)$. The update rules can be parametric or non-parametric (our conclusions do not depend on that), and can be learned using various learning algorithms. The algorithm runs for a certain time $T$, which is also the depth of the underlying neural architecture. The obtained vector of features $h_v^T$, $v\in V$, is then projected to a desired solution of the problem. As we will see below, regardless of the actual details of how the update functions $f_{v,t}$ come about, and regardless of the dimensions $d_t$ that the algorithm designer opts to work with, the power of GNN algorithms is fundamentally limited by the Overlap Gap Property, to which we turn next.
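The following is a minimal sketch of the generic message-passing scheme just described. It is our own illustration, not the architecture of [SBK22]; the particular update (mean aggregation of neighbor features, a random linear map, and a tanh nonlinearity) is a placeholder, and the locality property noted in the final comment does not depend on this choice.

```python
import numpy as np

def run_gnn(adjacency, T, dim=8, seed=0):
    """Run T rounds of h_v^{t+1} = f_t(h_v^t, {h_u^t : u in N(v)}) and return the final features.

    adjacency: dict mapping each node v to the list of its neighbors N(v).
    """
    rng = np.random.default_rng(seed)
    nodes = sorted(adjacency)
    h = {v: rng.standard_normal(dim) for v in nodes}         # initial features h_v^0
    for t in range(T):
        W = rng.standard_normal((dim, 2 * dim))              # placeholder update parameters
        new_h = {}
        for v in nodes:
            neigh = adjacency[v]
            agg = np.mean([h[u] for u in neigh], axis=0) if neigh else np.zeros(dim)
            new_h[v] = np.tanh(W @ np.concatenate([h[v], agg]))
        h = new_h                                            # h_v^{t+1} depends only on v and N(v)
    return h

# After T rounds, h_v^T is a function of the radius-T neighborhood of v only;
# this is exactly the T-locality used in Section 2.
```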
2 Limits of GNN
We begin with some background on the problems introduced earlier: IS and MAXCUT in the setting of random graphs, and ground states of spin glasses. Let $I_n^*$ denote (any) maximum size independent set in $\mathbb{G}(n,d)$, which we recall is a random $d$-regular graph, and let $|I_n^*|$ denote its size (cardinality). The following two facts were established in [BGT13] and [FŁ92] respectively. For each $d$ there exists $\alpha_d$ such that $|I_n^*|/n$ converges to $\alpha_d$ with high probability as $n\to\infty$. Furthermore, $\alpha_d=(1+o_d(1))\frac{2\ln d}{d}$. Here $o_d(1)$ denotes a function which converges to zero as $d\to\infty$. Informally, we summarize this by saying that the size of a largest independent set in $\mathbb{G}(n,d)$ is approximately $\frac{2\ln d}{d}\,n$.
Next we turn to the discussion of algorithms for finding large independent sets in $\mathbb{G}(n,d)$. It turns out that the best known algorithm for this problem is in fact the Greedy algorithm (one of the algorithms discussed in [ART23],[Boe23]), which recovers a factor $1/2$-optimal independent set. More precisely, let $I_n^{\mathrm{Greedy}}$ be the independent set produced by the Greedy algorithm for $\mathbb{G}(n,d)$. Then $|I_n^{\mathrm{Greedy}}|=(1+o_d(1))\frac{\ln d}{d}\,n$ with high probability as $n\to\infty$; see Exercise 6.7.20 in [FK15]. No algorithm is known which beats Greedy by a factor non-vanishing in $d$.
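To make the Greedy algorithm concrete, here is a minimal sketch (our own illustration, assuming the networkx library is available for generating a random $d$-regular graph): scan the nodes in a fixed order and add a node whenever none of its neighbors is already in the set. For large $n$ and large $d$ the resulting density is roughly $(\ln d)/d$, in line with the statement above; the particular values of $n$ and $d$ below are arbitrary.

```python
import math
import networkx as nx
import numpy as np

n, d = 20000, 20
G = nx.random_regular_graph(d, n, seed=0)

rng = np.random.default_rng(0)
order = rng.permutation(n)
in_set = set()
for v in order:
    # add v if no neighbor of v was already selected
    if all(u not in in_set for u in G.neighbors(v)):
        in_set.add(v)

print(len(in_set) / n, math.log(d) / d)  # empirical density vs. the (ln d)/d benchmark
```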
The theory based on the Overlap Gap Property (OGP) explains this phenomenon rigorously. The OGP for this problem was established in [GS17] and it reads as follows: for every multiplicative factor $\alpha$ of optimality exceeding a certain threshold in $(1/2,1)$, there exist $0<\tau_1<\tau_2<1$ such that for every two independent sets $I_1,I_2$ which are $\alpha$-optimal, namely $|I_1|\ge\alpha\,\frac{2\ln d}{d}\,n$ and $|I_2|\ge\alpha\,\frac{2\ln d}{d}\,n$, it is the case that either $|I_1\cap I_2|\ge\tau_2\,\frac{2\ln d}{d}\,n$ or $|I_1\cap I_2|\le\tau_1\,\frac{2\ln d}{d}\,n$, for all $d$ large enough, with high probability as $n\to\infty$. Informally, every two sufficiently large independent sets (namely those which are multiplicative factor $\alpha$-close to optimality) are either "close" to each other (overlap in at least $\tau_2\,\frac{2\ln d}{d}\,n$ nodes) or "far" from each other (overlap in at most $\tau_1\,\frac{2\ln d}{d}\,n$ nodes). Namely, solutions to the IS optimization problem with sufficiently large objective values exhibit a gap in the overlaps (hence the name of the property).
It turns out that OGP is a barrier to a broad class of algorithms, in particular algorithms which are local in an appropriately defined sense. This was established in the same paper [GS17]. We introduce the notion of locality only informally; the formal definition involves the concept of Factors of IID, for which we refer the reader to [GS17]. An algorithm which maps graphs $G$ to independent sets in $G$ is called $r$-local if, for every node $v$ of the graph $G$, the algorithmic decision as to whether to make this node a part of the constructed independent set or not is based entirely on the radius-$r$ neighborhood of the node $v$. In particular, we see that the GNN algorithm is $T$-local provided that the number of iterations of the GNN is at most $T$. Importantly, this holds regardless of the complexity of the feature dimensions $d_t$ and the choice of the update functions $f_{v,t}$. We recall that the GNN algorithm reported in [SBK22] was based on 2 iterations and as such it is 2-local.
A main theorem proved in [GS17] states that OGP is a barrier for all $r$-local algorithms, as long as $r$ is any constant not growing with the size of the graph. Specifically, for any $d$, consider any algorithm $\mathcal{A}$ which is $r$-local. Then the independent set produced by $\mathcal{A}$ has size at most $\left(1+\frac{1}{\sqrt{2}}+o_d(1)\right)\frac{\ln d}{d}\,n$ for $d$ large enough, with high probability as $n\to\infty$. Using a more sophisticated notion of multi-overlaps, the result was improved in [RV17] to the factor $1/2$ of optimality, that is to $(1+o_d(1))\frac{\ln d}{d}\,n$, for the same class of all local algorithms. Importantly, as we recall, this is the threshold achievable by the Greedy algorithm. The result was recently extended to the class of algorithms based on low-degree polynomials and small-depth Boolean circuits in [GJW20],[Wei22]. It is conjectured that beating the factor $1/2$ of optimality is not possible within the class of polynomial time algorithms (but showing this would amount to proving $\mathrm{P}\ne\mathrm{NP}$).
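For reference, the quantities discussed in this section, compiled from the cited results and stated in the large-$d$ asymptotics with the $o_d(1)$ corrections suppressed, are:

\[
\begin{aligned}
\text{optimum [BGT13, F\L 92]:}\qquad & |I_n^*| \approx 2\,\tfrac{\ln d}{d}\, n,\\
\text{$r$-local algorithms [GS17]:}\qquad & |I_n| \le \left(1+\tfrac{1}{\sqrt{2}}\right)\tfrac{\ln d}{d}\, n,\\
\text{$r$-local algorithms [RV17]:}\qquad & |I_n| \le \tfrac{\ln d}{d}\, n \quad\text{(half-optimal)},\\
\text{Greedy [FK15]:}\qquad & |I_n^{\mathrm{Greedy}}| \approx \tfrac{\ln d}{d}\, n.
\end{aligned}
\]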
As a consequence of the discussion above we obtain an important conclusion regarding the power of GNN for solving the IS problem in $\mathbb{G}(n,d)$.
Theorem 2.1.
Consider any architecture of the GNN algorithm with any choice of dimensions $d_t$, any choice of feature functions, and any choice of update functions $f_{v,t}$. Suppose the GNN algorithm iterates for $T$ steps and produces an independent set $I$ in the random regular graph $\mathbb{G}(n,d)$. Then the size of $I$ is at most half-optimal asymptotically in $d$, namely $|I|\le(1+o_d(1))\frac{\ln d}{d}\,n$ with high probability as $n\to\infty$, for any value of $T$.
We stress here that the depth parameter $T$ can be arbitrarily large and, in particular, may depend on the average degree $d$, provided it does not depend on the size $n$ of the graph. We recall that $T=2$ in the implementation reported in [SBK22]. Since the Greedy algorithm already achieves half-optimality, as we have remarked earlier, this result leaves very little space for GNN to outperform the known (Greedy) algorithm for the IS problem on random regular graphs. We note that while the results above are stated in the asymptotic sense of increasing degrees $d$, the fact is that OGP is a barrier to local algorithms as soon as OGP holds. For example, if the graph $\mathbb{G}(n,d)$ for some fixed small degree $d$ exhibits the OGP above some approximation factor $\rho$ to optimality, this would imply that GNN cannot beat the $\rho$-factor approximation for the IS problem in the random graph $\mathbb{G}(n,d)$ for any graph-size-independent depth (number of rounds) $T$. The obstruction to this is proving the OGP for small values of $d$, which is more challenging mathematically.
Next we turn to the MAXCUT problem on random graphs and random hypergraphs. The situation here is rather similar, but better developed in the context of random Erdős-Rényi graphs and hypergraphs, as opposed to random regular graphs, and thus this is the class of random graphs we now turn to. A random Erdős-Rényi graph with average degree $d$, denoted by $\mathbb{G}(n,d/n)$, is obtained by connecting every pair of nodes $u,v$ among the $n$ nodes with probability $d/n$, independently across all unordered pairs $u,v$. A random $K$-uniform hypergraph is obtained similarly by creating a hyperedge on each collection of $K$ nodes independently, with the probability chosen so that the expected degree of every node is $d$. We denote this hypergraph by $\mathbb{G}_K(n,d)$. It is easy to see that the average degree in both $\mathbb{G}(n,d/n)$ and $\mathbb{G}_K(n,d)$ is (asymptotically) $d$. It has been known for a while that the optimum value of MAXCUT in $\mathbb{G}(n,d/n)$ is of the form $\left(\frac{d}{4}+c_2\sqrt{d}+o_d(\sqrt{d})\right)n$ with high probability as $n\to\infty$ [CGHS04], for some constant $c_2$, with an analogous expansion $\left(a_K d+c_K\sqrt{d}+o_d(\sqrt{d})\right)n$ for the hypergraph version. Namely, the optimum value is known up to the order $\sqrt{d}\,n$. The constant $c_K$ was computed in [DMS17] and [Sen18] first for the case $K=2$ and then extended to general $K$ in [CGPR19]. As it turns out, this constant is the value of the ground state of a $K$-spin model, known since the work of Parisi [Par80], Talagrand [Tal06] and Panchenko [Pan13].
Interestingly, as far as algorithms are concerned, there is a fundamental difference between the case $K=2$ (aka graphs) and the case $K\ge 3$. Specifically, for $K=2$ algorithms achieving the asymptotically optimal value are known, based on Approximate Message Passing (AMP) schemes [AMS21]. Furthermore, conjecturally, the OGP does not hold for this problem. However, when $K\ge 4$ and $K$ is even, the OGP provably does hold and again presents a barrier to all local algorithms, as was established in [CGPR19]. Furthermore, a sophisticated version of the multi-OGP, called the Branching-OGP, was established in [HS22]; the threshold associated with it, which we denote by $c_K^{\mathrm{ALG}}$, matches the value achieved by the best known algorithms, which are again of the AMP type. The formal statement of the OGP is very similar to the one for the IS problem and we skip it. As an implication we obtain our second conclusion.
Theorem 2.2.
Consider any architecture of the GNN algorithm which, after $T$ iterations, produces a partition (cut) of the nodes of the random hypergraph $\mathbb{G}_K(n,d)$. Suppose $K\ge 4$ and $K$ is even. For sufficiently large degree values $d$, the size of the cut associated with this solution is at most $\left(a_K d+c_K^{\mathrm{ALG}}\sqrt{d}+o_d(\sqrt{d})\right)n$ with high probability as $n\to\infty$, for any choice of $T$ not growing with $n$. This is suboptimal since $c_K^{\mathrm{ALG}}<c_K$.
As the threshold $c_K^{\mathrm{ALG}}$ is achievable by the AMP algorithm, again this leaves very little space for GNN to outperform the best known (namely AMP) algorithm for this problem.
The story for the problem of finding near ground states in spin glasses is very similar and is skipped. We refer the reader to the surveys [Gam21] and [GMZ22] for details. In fact, many of the results described above for the MAXCUT problem were obtained first by deriving them for the spin glass model, and then transferring them to the case of random graphs using an interpolation technique.
3 Discussion
In this paper we have presented barriers faced by GNN based algorithms in solving combinatorial optimization problems on random graphs and random structures. These barriers stem from a complex geometry of the solution space in the form of the Overlap Gap Property (OGP), a known barrier to broad classes of algorithms, local algorithms in particular. As GNN falls within the framework of local algorithms, OGP is a barrier to GNN as well. Since algorithms are known which achieve all of the optimality values below the OGP phase transition threshold, this leaves very little room for GNN to outperform the known algorithms.
Some further investigation can be done, however, to obtain a more refined picture. Most of the OGP results are obtained in the doubly-asymptotic regime where not only the graph size $n$ diverges but also the degree $d$ (and degree-type parameters) diverges. While it is possible to prove the OGP for fixed and sufficiently large values of the degree, the degree values arising from these computations tend to be quite large. Instead, it would be nice to see whether the OGP takes place, say, in random regular graphs for small degree values. We need sharper mathematical techniques for this. Knowing this might point to regimes where non-trivial algorithms going beyond the simple Greedy algorithm might provide some value. It is known (as already observed in [ART23]) that a more clever version of the Greedy algorithm, known as the Degree Greedy algorithm, provably outperforms the naive Greedy algorithm for fixed values of $d$. It is possible thus that a more sophisticated version of the GNN can perhaps achieve performance values even stronger than the ones obtained by the Degree Greedy algorithm. Whether this is possible is yet to be seen, but in case this is indeed verified rigorously, it would provide a more compelling argument in favor of GNN than the one currently presented in [SBK22].
References
- [AMS21] Ahmed El Alaoui, Andrea Montanari, and Mark Sellke. Local algorithms for maximum cut and minimum bisection on locally treelike regular graphs of large degree. arXiv preprint arXiv:2111.06813, 2021.
- [ART23] Maria Chiara Angelini and Federico Ricci-Tersenghi. Modern graph neural networks do worse than classical greedy algorithms in solving combinatorial optimization problems like maximum independent set. Nature Machine Intelligence, 5(1):29–31, 2023.
- [BGT13] M. Bayati, D. Gamarnik, and P. Tetali. Combinatorial approach to the interpolation method and scaling limits in sparse random graphs. Annals of Probability. (Conference version in Proc. 42nd Ann. Symposium on the Theory of Computing (STOC) 2010), 41:4080–4115, 2013.
- [Boe23] Stefan Boettcher. Inability of a graph neural network heuristic to outperform greedy algorithms in solving combinatorial optimization problems. Nature Machine Intelligence, 5(1):24–25, 2023.
- [CGHS04] D. Coppersmith, D. Gamarnik, M. Hajiaghayi, and G. Sorkin. Random MAXSAT, random MAXCUT, and their phase transitions. Random Structures and Algorithms, 24(4):502–545, 2004.
- [CGPR19] Wei-Kuo Chen, David Gamarnik, Dmitry Panchenko, and Mustazee Rahman. Suboptimality of local algorithms for a class of max-cut problems. The Annals of Probability, 47(3):1587–1618, 2019.
- [DMS17] Amir Dembo, Andrea Montanari, and Subhabrata Sen. Extremal cuts of sparse random graphs. The Annals of Probability, 45(2):1190–1217, 2017.
- [FK15] Alan Frieze and Michał Karoński. Introduction to random graphs. Cambridge University Press, 2015.
- [FŁ92] A.M. Frieze and T. Łuczak. On the independence and chromatic numbers of random regular graphs. Journal of Combinatorial Theory, Series B, 54(1):123–132, 1992.
- [Gam21] David Gamarnik. The overlap gap property: A topological barrier to optimizing over random structures. Proceedings of the National Academy of Sciences, 118(41), 2021.
- [GJW20] David Gamarnik, Aukosh Jagannath, and Alexander S Wein. Low-degree hardness of random optimization problems. In 61st Annual Symposium on Foundations of Computer Science, 2020.
- [GMZ22] David Gamarnik, Cristopher Moore, and Lenka Zdeborová. Disordered systems insights on computational hardness. Journal of Statistical Mechanics: Theory and Experiment, 2022(11):114015, 2022.
- [GS17] David Gamarnik and Madhu Sudan. Limits of local algorithms over sparse random graphs. Annals of Probability, 45:2353–2376, 2017.
- [HS22] Brice Huang and Mark Sellke. Tight lipschitz hardness for optimizing mean field spin glasses. In 2022 IEEE 63rd Annual Symposium on Foundations of Computer Science (FOCS), pages 312–322. IEEE, 2022.
- [JLR11] Svante Janson, Tomasz Luczak, and Andrzej Rucinski. Random graphs, volume 45. John Wiley & Sons, 2011.
- [Pan13] Dmitry Panchenko. The Sherrington-Kirkpatrick model. Springer Science & Business Media, 2013.
- [Par80] Giorgio Parisi. A sequence of approximated solutions to the SK model for spin glasses. Journal of Physics A: Mathematical and General, 13(4):L115, 1980.
- [RV17] Mustazee Rahman and Balint Virag. Local algorithms for independent sets are half-optimal. The Annals of Probability, 45(3):1543–1577, 2017.
- [SBK22] Martin JA Schuetz, J Kyle Brubaker, and Helmut G Katzgraber. Combinatorial optimization with physics-inspired graph neural networks. Nature Machine Intelligence, 4(4):367–377, 2022.
- [SBK23] Martin JA Schuetz, J Kyle Brubaker, and Helmut G Katzgraber. Reply to: Modern graph neural networks do worse than classical greedy algorithms in solving combinatorial optimization problems like maximum independent set. Nature Machine Intelligence, 5(1):32–34, 2023.
- [SBZK22] Martin JA Schuetz, J Kyle Brubaker, Zhihuai Zhu, and Helmut G Katzgraber. Graph coloring with physics-inspired graph neural networks. Physical Review Research, 4(4):043131, 2022.
- [Sen18] Subhabrata Sen. Optimization on sparse random hypergraphs and spin glasses. Random Structures & Algorithms, 53(3):504–536, 2018.
- [SNL+22] Mutian Shen, Zohar Nussinov, Yang-Yu Liu, Changjun Fan, Yizhou Sun, and Zhong Liu. Finding spin glass ground states through deep reinforcement learning. In APS March Meeting Abstracts, volume 2022, pages K09–001, 2022.
- [Tal06] Michel Talagrand. The Parisi formula. Annals of Mathematics, pages 221–263, 2006.
- [Wei22] Alexander S Wein. Optimal low-degree hardness of maximum independent set. Mathematical Statistics and Learning, 4(3):221–251, 2022.