
Rethinking the Capacity of Graph Neural Networks for Branching Strategy

Ziang Chen (ZC) Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA 02139. ziang@mit.edu Jialin Liu (JL) Department of Statistics and Data Science, University of Central Florida, Orlando, FL 32816. jialin.liu@ucf.edu Xiaohan Chen (XC) Decision Intelligence Lab, Damo Academy, Alibaba US, Bellevue, WA 98004. xiaohan.chen@alibaba-inc.com Xinshang Wang (XW) Decision Intelligence Lab, Damo Academy, Alibaba US, Bellevue, WA 98004. xinshang.w@alibaba-inc.com  and  Wotao Yin (WY) Decision Intelligence Lab, Damo Academy, Alibaba US, Bellevue, WA 98004. wotao.yin@alibaba-inc.com
Abstract.

Graph neural networks (GNNs) have been widely used to predict properties and heuristics of mixed-integer linear programs (MILPs) and hence accelerate MILP solvers. This paper investigates the capacity of GNNs to represent strong branching (SB), the most effective yet computationally expensive heuristic employed in the branch-and-bound algorithm. In the literature, the message-passing GNN (MP-GNN), as the simplest GNN structure, is frequently used as a fast approximation of SB, and we find that not every MILP's SB result can be represented by an MP-GNN. We precisely define a class of "MP-tractable" MILPs for which MP-GNNs can accurately approximate SB scores. Particularly, we establish a universal approximation theorem: for any data distribution over the MP-tractable class, there always exists an MP-GNN that can approximate the SB score with arbitrarily high accuracy and arbitrarily high probability, which lays a theoretical foundation for existing works on imitating SB with MP-GNNs. For MILPs without MP-tractability, unfortunately, a similar result is impossible, which can be illustrated by two MILP instances with different SB scores that cannot be distinguished by any MP-GNN, regardless of the number of parameters. Recognizing this, we explore another GNN structure called the second-order folklore GNN (2-FGNN) that overcomes this limitation, and the aforementioned universal approximation theorem can be extended to the entire MILP space using 2-FGNNs, regardless of MP-tractability. A small-scale numerical experiment is conducted to directly validate our theoretical findings.

A major part of the work of ZC was completed during his internship at Alibaba US DAMO Academy.

1. Introduction

Mixed-integer linear programming (MILP) involves optimization problems with linear objectives and constraints, where some variables must be integers. These problems appear in various fields, from logistics and supply chain management to planning and scheduling, and are in general NP-hard. The branch-and-bound (BnB) algorithm [land1960automatic] is the core of a MILP solver. It works by repeatedly solving relaxed versions of the problem, called linear relaxations, which allow the integer variables to take on fractional values. If a relaxation’s solution satisfies the integer requirements, it is a valid solution to the original problem. Otherwise, the algorithm divides the problem into two subproblems and solves their relaxations. This process continues until it finds the best solution that meets all the constraints.

Branching is the process of dividing a linear relaxation into two subproblems. When branching, the solver selects a variable with a fractional value in the relaxation's solution and creates two new subproblems. In one subproblem, the variable is forced to be less than or equal to the nearest integer below the fractional value. In the other, it is forced to be greater than or equal to the nearest integer above the fractional value. The branching variable choice is critical because it can impact the solver's efficiency by orders of magnitude.

A well-chosen branching variable can lead to a significant improvement in the lower bound, which is a quantity that can quickly prove that a subproblem and its further subdivisions are infeasible or not promising, thus reducing the total number of subproblems to explore. This means fewer linear relaxations to solve and faster convergence to the optimal solution. On the contrary, a poor choice may result in branches that do little to improve the bounds or reduce the solution space, thus leading to a large number of subproblems to be solved, significantly increasing the total solution time. The choice of which variable to branch on is a pivotal decision. This is where branching strategies, such as strong branching and learning to branch, come into play, evaluating the impact of different branching choices before making a decision.

Strong branching (SB) [applegate1995finding] is a sophisticated strategy to select the most promising branches to explore. In SB, before actually performing a branch, the solver tentatively branches on several variables and calculates the potential impact of each branch on the objective function. This “look-ahead” strategy evaluates the quality of branching choices by solving linear relaxations of the subproblems created by the branching. The variable that leads to the most significant improvement in the objective function is selected for the actual branching. Usually recognized as the most effective branching strategy, SB often results in a significantly lower number of subproblems to resolve during the branch-and-bound (BnB) process compared to other methods [gamrath2018measuring]. As such, SB is frequently utilized directly or as a fundamental component in cutting-edge solvers.

While SB can significantly reduce the size of the BnB search space, it comes with high computational cost: evaluating multiple potential branches at each decision point requires solving many LPs. This leads to a trade-off between the time spent on SB and the overall time saved due to a smaller search space. In practice, MILP solvers use heuristics to limit the use of SB to where it is most beneficial.

Learning to branch (L2B) introduces a new approach by incorporating machine learning (ML) to develop branching strategies, offering new solutions to address this trade-off. This line of research begins with imitation learning [khalil2016learning, alvarez2017machine, balcan2018learning, gasse2019exact, gupta2020hybrid, zarpellon2021parameterizing, gupta2022lookback, lin2022learning, yang2022learning], where models, including SVM, decision tree, and neural networks, are trained to mimic SB outcomes based on the features of the underlying MILP. They aim to create a computationally efficient strategy that achieves the effectiveness of SB on specific datasets. Furthermore, in recent reinforcement learning approaches, mimicking SB continues to take crucial roles in initialization or regularization [qu2022improved, zhang2022deep].

While using a heuristic (an ML model) to approximate another heuristic (the SB procedure) may seem counterintuitive, it is important to recognize the potential benefits. The former can significantly reduce the time required to make branching decisions as effectively as the latter. As MILPs become larger and more complex, the computational cost of SB grows at least cubically, but some ML models grow quadratically, even just linearly after training on a set of similar MILPs. Although SB can theoretically solve LP relaxations in parallel, the time required for different LPs may vary greatly, and there is a lack of GPU-friendly methods that can effectively utilize starting bases for warm starts. In contrast, ML models, particularly GNNs, are more amenable to efficient implementation on GPUs, making them a more practical choice for accelerating the branching variable selection process. Furthermore, additional problem-specific characteristics can be incorporated into the ML model, allowing it to make more informed branching decisions tailored to each problem instance.

Graph neural networks (GNNs) stand out as an effective class of ML models for L2B, surpassing other models like SVM and MLP, due to their excellent scalability and permutation-invariance/equivariance. To utilize a GNN on a MILP, one first conceptualizes the MILP as a graph; the GNN is then applied to that graph and returns a branching decision. This approach [gasse2019exact, ding2020accelerating] has gained prominence not only in L2B but also in various other MILP-related learning tasks [nair2020solving, wu2021learning, scavuzzo2022learning, paulus2022learning, chi2022deep, liu2022learning, khalil2022mip, labassi2022learning, falkner2022learning, song2022learning, hosny2023automatic, wang2023learning, turner2023adaptive, ye2023gnn, marty2023learning]. More details are provided in Section 2.

Despite the widespread use of GNNs on MILPs, a theoretical understanding remains largely elusive. A vital concept for any ML model, including GNNs, is its capacity or expressive power [sato2020survey, Li2022, jegelka2022theory], which in our context is the ability to accurately approximate the mapping from MILPs to their SB results. Specifically, this paper aims to answer the following question:

(1.1) Given a distribution of MILPs, is there a GNN model capable of mapping each MILP problem to its strong branching result with a specified level of precision?

Related works and our contributions

While the capacity of GNNs for general graph tasks, such as node and link prediction or function approximation on graphs, has been extensively studied [xu2019powerful, maron2019universality, chen2019equivalence, keriven2019universal, sato2019approximation, Loukas2020What, azizian2020expressive, geerts2022expressiveness, zhang2023rethinking], their capacity to approximate SB remains largely unexplored. The closest studies [chen2022representing-lp, chen2022representing-milp] have explored GNNs' ability to represent properties of linear programs (LPs) and MILPs, such as feasibility, boundedness, or optimal solutions, but have not specifically focused on branching strategies. Recognizing this gap, our paper makes the following contributions:

  • In the context of L2B using GNNs, we first focus on the most widely used type: message-passing GNNs (MP-GNNs). Our study reveals that MP-GNNs can reliably predict SB results, but only for a specific class of MILPs that we introduce as message-passing-tractable (MP-tractable). We prove that for any distribution of MP-tractable MILPs, there exists an MP-GNN capable of accurately predicting their SB results. This finding establishes a theoretical basis for the widespread use of MP-GNNs to approximate SB results in current research.

  • Through a counter-example, we demonstrate that MP-GNNs are incapable of predicting SB results beyond the class of MP-tractable MILPs. The counter-example consists of two MILPs with distinct SB results to which all MP-GNNs, however, yield identical branching predictions.

  • For general MILPs, we explore the capabilities of second-order folklore GNNs (2-FGNNs), a type of higher-order GNN with enhanced expressive power. Our results show that 2-FGNNs can reliably answer question (1.1) positively, effectively replicating SB results across any distribution of MILP problems, surpassing the capabilities of standard MP-GNNs.

Overall, as a series of works have empirically shown that learning an MP-GNN as a fast approximation of SB significantly benefits the performance of an MILP solver on specific data sets [khalil2016learning, alvarez2017machine, balcan2018learning, gasse2019exact, gupta2020hybrid, zarpellon2021parameterizing, gupta2022lookback, lin2022learning, yang2022learning], our goal is to determine whether there is room, in theory, to further understand and improve the GNNs’ performance on this task.

2. Preliminaries and problem setup

We consider the MILP defined in its general form as follows:

(2.1) \min_{x\in\mathbb{R}^{n}}\ c^{\top}x,\quad\textup{s.t.}\ Ax\circ b,\ \ell\leq x\leq u,\ x_{j}\in\mathbb{Z},\ \forall\,j\in I,

where $A\in\mathbb{R}^{m\times n}$, $b\in\mathbb{R}^{m}$, $c\in\mathbb{R}^{n}$, $\circ\in\{\leq,=,\geq\}^{m}$ is the type of constraints, $\ell\in(\{-\infty\}\cup\mathbb{R})^{n}$ and $u\in(\mathbb{R}\cup\{\infty\})^{n}$ are the lower and upper bounds of the variable $x$, and $I\subset\{1,2,\dots,n\}$ identifies which variables are constrained to be integers.

Graph Representation of MILP

Here we present an approach to represent a MILP as a bipartite graph, termed the MILP-graph. This conceptualization was initially proposed by [gasse2019exact] and has quickly become a prevalent model in ML for MILP-related tasks. The MILP-graph is defined as a tuple $G=(V,W,A,F_{V},F_{W})$, where the components are specified as follows: $V=\{1,2,\dots,m\}$ and $W=\{1,2,\dots,n\}$ are sets of nodes representing the constraints and variables, respectively. An edge $(i,j)$ connects node $i\in V$ to node $j\in W$ if the corresponding entry $A_{ij}$ in the coefficient matrix of (2.1) is non-zero, with $A_{ij}$ serving as the edge weight. $F_{V}$ collects the features/attributes of constraints, with features $v_{i}=(b_{i},\circ_{i})$ attached to node $i\in V$. $F_{W}$ collects the features/attributes of variables, with features $w_{j}=(c_{j},\ell_{j},u_{j},\delta_{I}(j))$ attached to node $j\in W$, where $\delta_{I}(j)\in\{0,1\}$ indicates whether the variable $x_{j}$ is integer-constrained.

We define $\mathcal{N}_{W}(i):=\{j\in W:A_{ij}\neq 0\}\subset W$ as the neighbors of $i\in V$ and similarly define $\mathcal{N}_{V}(j):=\{i\in V:A_{ij}\neq 0\}\subset V$. This graphical representation completely describes a MILP's information, allowing us to interchangeably refer to a MILP and its graph throughout this paper. An illustrative example is presented in Figure 1. We also introduce a space of MILP-graphs:

[Figure 1 layout: the MILP instance $\min\ x_{1}+2x_{2}+3x_{3}$, s.t. $2x_{1}+x_{2}\leq 5$, $x_{2}+3x_{3}\geq 0$, $0\leq x_{1},x_{2},x_{3}\leq 1$, $x_{1}\in\mathbb{Z}$, and its bipartite MILP-graph with variable nodes $w_{1},w_{2},w_{3}$ carrying features $(1,0,1,1)$, $(2,0,1,0)$, $(3,0,1,0)$, constraint nodes $v_{1},v_{2}$ carrying features $(5,\leq)$, $(0,\geq)$, and edge weights $2,1,1,3$.]
Figure 1. An illustrative example of MILP and its graph representation.
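To make the graph encoding concrete, below is a minimal Python sketch of assembling a MILP-graph from the data in (2.1). The container `MILPGraph`, the helper names, and the choice to keep constraint senses as strings are illustrative assumptions, not part of the cited works.

```python
# A minimal sketch of the MILP-graph G = (V, W, A, F_V, F_W); names are illustrative.
from dataclasses import dataclass
import numpy as np

@dataclass
class MILPGraph:
    A: np.ndarray   # (m, n) coefficient matrix; nonzero entries are edge weights
    v_feat: list    # constraint features v_i = (b_i, circ_i)
    w_feat: list    # variable features w_j = (c_j, l_j, u_j, 1_{j in I})

def build_milp_graph(A, b, c, l, u, senses, int_set):
    m, n = np.asarray(A).shape
    v_feat = [(b[i], senses[i]) for i in range(m)]
    w_feat = [(c[j], l[j], u[j], int(j in int_set)) for j in range(n)]
    return MILPGraph(np.asarray(A, dtype=float), v_feat, w_feat)

def neighbors_W(G, i):
    """N_W(i): variable nodes j with A_ij != 0."""
    return np.nonzero(G.A[i])[0]

def neighbors_V(G, j):
    """N_V(j): constraint nodes i with A_ij != 0."""
    return np.nonzero(G.A[:, j])[0]

# The instance of Figure 1: min x1 + 2x2 + 3x3 s.t. 2x1 + x2 <= 5, x2 + 3x3 >= 0.
G = build_milp_graph(A=[[2.0, 1.0, 0.0], [0.0, 1.0, 3.0]],
                     b=[5.0, 0.0], c=[1.0, 2.0, 3.0],
                     l=[0.0, 0.0, 0.0], u=[1.0, 1.0, 1.0],
                     senses=["<=", ">="], int_set={0})
print(neighbors_V(G, 1))   # constraints involving x2: [0 1]
```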
Definition 2.1 (Space of MILP-graphs).

We use $\mathcal{G}_{m,n}$ to denote the collection of all MILP-graphs induced from MILPs of the form (2.1) with $n$ variables and $m$ constraints. (Footnote 1: Rigorously, the space $\mathcal{G}_{m,n}\cong\mathbb{R}^{m\times n}\times\mathbb{R}^{n}\times\mathbb{R}^{m}\times(\{-\infty\}\cup\mathbb{R})^{n}\times(\mathbb{R}\cup\{+\infty\})^{n}\times\{\leq,=,\geq\}^{m}\times\{0,1\}^{n}$ is equipped with the product topology, where all Euclidean spaces have the standard Euclidean topologies, the discrete spaces $\{-\infty\}$, $\{+\infty\}$, $\{\leq,=,\geq\}$, and $\{0,1\}$ have the discrete topologies, and all unions are disjoint unions.)

Message-passing graph neural networks (MP-GNNs) are a class of GNNs that operate on graph-structured data by passing messages between nodes to aggregate information from their local neighborhoods. In our context, the input is the aforementioned MILP-graph $G=(V,W,A,F_{V},F_{W})$, and each node in $W$ is associated with a real-number output. We use the standard MP-GNNs for MILPs in the literature [gasse2019exact, chen2022representing-milp].

Specifically, the initial layer assigns features $s_{i}^{0},t_{j}^{0}$ to each node as

  • $s_{i}^{0}=p^{0}(v_{i})$ for each constraint $i\in V$, and $t_{j}^{0}=q^{0}(w_{j})$ for each variable $j\in W$.

Then message-passing layers $l=1,2,\dots,L$ update the features via

  • $s_{i}^{l}=p^{l}\big(s_{i}^{l-1},\sum_{j\in\mathcal{N}_{W}(i)}f^{l}(t_{j}^{l-1},A_{ij})\big)$ for each constraint $i\in V$, and

  • $t_{j}^{l}=q^{l}\big(t_{j}^{l-1},\sum_{i\in\mathcal{N}_{V}(j)}g^{l}(s_{i}^{l-1},A_{ij})\big)$ for each variable $j\in W$.

Finally, the output layer produces a real-number output $y_{j}$ for each node $j\in W$:

  • $y_{j}=r\big(\sum_{i\in V}s_{i}^{L},\sum_{j^{\prime}\in W}t_{j^{\prime}}^{L},t_{j}^{L}\big)$.

In practice, the functions $\{p^{l},q^{l},f^{l},g^{l}\}_{l=1}^{L},r,p^{0},q^{0}$ are learnable and usually parameterized with multi-layer perceptrons (MLPs). In our theoretical analysis, we assume for simplicity that those functions are continuous on their domains. A minimal code sketch of the updates above is given below, and the space of MP-GNNs is then introduced in Definition 2.2.
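Below is a minimal NumPy sketch of the message-passing updates above. The fixed random two-layer `mlp` stands in for the learnable maps $p^{l},q^{l},f^{l},g^{l},r$, and the dense loops are for clarity only; the function and variable names are illustrative and not taken from any cited implementation.

```python
# A minimal NumPy sketch of the MP-GNN updates; random MLPs replace learned ones.
import numpy as np

def mlp(in_dim, out_dim, rng):
    W1, W2 = rng.standard_normal((in_dim, 16)), rng.standard_normal((16, out_dim))
    return lambda x: np.maximum(x @ W1, 0.0) @ W2   # ReLU two-layer map

def mp_gnn_forward(A, V0, W0, L=2, d=8, seed=0):
    """A: (m, n) coefficients; V0: (m, .) constraint features; W0: (n, .) variable features."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    p0, q0 = mlp(V0.shape[1], d, rng), mlp(W0.shape[1], d, rng)
    s = np.stack([p0(V0[i]) for i in range(m)])          # s_i^0
    t = np.stack([q0(W0[j]) for j in range(n)])          # t_j^0
    for _ in range(L):
        p, q = mlp(2 * d, d, rng), mlp(2 * d, d, rng)
        f, g = mlp(d + 1, d, rng), mlp(d + 1, d, rng)
        # sum_{j in N_W(i)} f(t_j, A_ij)  and  sum_{i in N_V(j)} g(s_i, A_ij)
        aggV = [sum(f(np.append(t[j], A[i, j])) for j in np.nonzero(A[i])[0])
                if np.any(A[i]) else np.zeros(d) for i in range(m)]
        aggW = [sum(g(np.append(s[i], A[i, j])) for i in np.nonzero(A[:, j])[0])
                if np.any(A[:, j]) else np.zeros(d) for j in range(n)]
        s = np.stack([p(np.append(s[i], aggV[i])) for i in range(m)])
        t = np.stack([q(np.append(t[j], aggW[j])) for j in range(n)])
    r = mlp(3 * d, 1, rng)
    # y_j = r(sum_i s_i^L, sum_j' t_j'^L, t_j^L)
    return np.array([r(np.concatenate([s.sum(0), t.sum(0), t[j]]))[0] for j in range(n)])

# Toy call on the Figure 1 instance; senses encoded as +1 ("<=") / -1 (">=").
A = np.array([[2.0, 1.0, 0.0], [0.0, 1.0, 3.0]])
V0 = np.array([[5.0, 1.0], [0.0, -1.0]])                       # (b_i, encoded circ_i)
W0 = np.array([[1.0, 0.0, 1.0, 1.0], [2.0, 0.0, 1.0, 0.0], [3.0, 0.0, 1.0, 0.0]])
print(mp_gnn_forward(A, V0, W0))                               # one output per variable
```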

Definition 2.2 (Space of MP-GNNs).

We use $\mathcal{F}_{\textup{MP-GNN}}$ to denote the collection of all MP-GNNs constructed as above with $p^{l},q^{l},f^{l},g^{l},r$ being continuous and satisfying $f^{l}(\cdot,0)\equiv 0$ and $g^{l}(\cdot,0)\equiv 0$. (Footnote 2: We require $f^{l},g^{l}$ to output $0$ when the edge weight is $0$ in order to avoid discontinuity of the functions in $\mathcal{F}_{\textup{MP-GNN}}$.)

Overall, any MP-GNN $F\in\mathcal{F}_{\textup{MP-GNN}}$ maps a MILP-graph $G$ to an $n$-dimensional vector $y=F(G)\in\mathbb{R}^{n}$.

Second-order folklore graph neural networks (2-FGNNs) are an extension of MP-GNNs designed to overcome some of their capacity limitations. It is proved in [xu2019powerful] that the expressive power of MP-GNNs can be measured by the Weisfeiler-Lehman test (WL test [weisfeiler1968reduction]). To enhance the ability to identify more complex graph patterns, [morris2019weisfeiler] developed high-order GNNs, inspired by high-order WL tests [cai1992optimal]. Since then, there has been a growing literature on high-order GNNs and other variants, including high-order folklore GNNs [maron2019provably, geerts2020expressive, geerts2020walk, azizian2020expressive, zhao2022practical, geerts2022expressiveness]. Instead of operating on individual nodes of the given graph, 2-FGNNs operate on pairs of nodes (regardless of whether the two nodes in a pair are neighbors or not) and the neighbors of those pairs; two node pairs are neighbors if they share a common node. Let $G=(V,W,A,F_{V},F_{W})$ be the input graph. The initial layer performs:

  • $s_{ij}^{0}=p^{0}(v_{i},w_{j},A_{ij})$ for each constraint $i\in V$ and each variable $j\in W$, and

  • $t_{j_{1}j_{2}}^{0}=q^{0}(w_{j_{1}},w_{j_{2}},\delta_{j_{1}j_{2}})$ for variables $j_{1},j_{2}\in W$,

where $\delta_{j_{1}j_{2}}=1$ if $j_{1}=j_{2}$ and $\delta_{j_{1}j_{2}}=0$ otherwise. The internal layers $l=1,2,\dots,L$ compute

  • $s_{ij}^{l}=p^{l}\big(s_{ij}^{l-1},\sum_{j_{1}\in W}f^{l}(t_{j_{1}j}^{l-1},s_{ij_{1}}^{l-1})\big)$ for all $i\in V,j\in W$, and

  • $t_{j_{1}j_{2}}^{l}=q^{l}\big(t_{j_{1}j_{2}}^{l-1},\sum_{i\in V}g^{l}(s_{ij_{2}}^{l-1},s_{ij_{1}}^{l-1})\big)$ for all $j_{1},j_{2}\in W$.

The final layer produces the output $y_{j}$ for each node $j\in W$:

  • $y_{j}=r\big(\sum_{i\in V}s_{ij}^{L},\sum_{j_{1}\in W}t_{j_{1}j}^{L}\big)$.

Similar to MP-GNNs, the functions within 2-FGNNs, including $\{p^{l},q^{l},f^{l},g^{l}\}_{l=1}^{L},r,p^{0},q^{0}$, are also learnable and typically parameterized with MLPs. The space of 2-FGNNs is defined as follows.

Definition 2.3.

We use $\mathcal{F}_{\textup{2-FGNN}}$ to denote the set of all 2-FGNNs with continuous $p^{l},q^{l},f^{l},g^{l},r$.

Any 2-FGNN $F\in\mathcal{F}_{\textup{2-FGNN}}$ maps a MILP-graph $G$ to an $n$-dimensional vector $y=F(G)$. While MP-GNNs and 2-FGNNs share the same input-output structure, their internal structures differ, leading to distinct expressive powers. A minimal sketch of the 2-FGNN updates, mirroring the MP-GNN sketch above, is given below.
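The following is a minimal NumPy sketch of the 2-FGNN updates above: features live on (constraint, variable) pairs $s_{ij}$ and (variable, variable) pairs $t_{j_{1}j_{2}}$, and fixed random MLPs again stand in for the learnable maps. It is meant only to illustrate the pair-based computation pattern, not to serve as an efficient implementation.

```python
# A minimal NumPy sketch of the 2-FGNN updates; random MLPs replace learned ones.
import numpy as np

def mlp(in_dim, out_dim, rng):
    W1, W2 = rng.standard_normal((in_dim, 16)), rng.standard_normal((16, out_dim))
    return lambda x: np.maximum(x @ W1, 0.0) @ W2

def fgnn2_forward(A, V0, W0, L=2, d=8, seed=0):
    rng = np.random.default_rng(seed)
    m, n = A.shape
    p0 = mlp(V0.shape[1] + W0.shape[1] + 1, d, rng)
    q0 = mlp(2 * W0.shape[1] + 1, d, rng)
    s = np.stack([[p0(np.concatenate([V0[i], W0[j], [A[i, j]]])) for j in range(n)]
                  for i in range(m)])                                   # s_ij^0
    t = np.stack([[q0(np.concatenate([W0[j1], W0[j2], [1.0 * (j1 == j2)]]))
                   for j2 in range(n)] for j1 in range(n)])             # t_{j1 j2}^0
    for _ in range(L):
        p, q = mlp(2 * d, d, rng), mlp(2 * d, d, rng)
        f, g = mlp(2 * d, d, rng), mlp(2 * d, d, rng)
        s_new = np.stack([[p(np.concatenate([s[i, j],
                             sum(f(np.concatenate([t[j1, j], s[i, j1]])) for j1 in range(n))]))
                           for j in range(n)] for i in range(m)])
        t_new = np.stack([[q(np.concatenate([t[j1, j2],
                             sum(g(np.concatenate([s[i, j2], s[i, j1]])) for i in range(m))]))
                           for j2 in range(n)] for j1 in range(n)])
        s, t = s_new, t_new
    r = mlp(2 * d, 1, rng)
    # y_j = r(sum_i s_ij^L, sum_{j1} t_{j1 j}^L)
    return np.array([r(np.concatenate([s[:, j].sum(0), t[:, j].sum(0)]))[0] for j in range(n)])

# Same toy inputs as in the MP-GNN sketch above (Figure 1 instance).
A = np.array([[2.0, 1.0, 0.0], [0.0, 1.0, 3.0]])
V0 = np.array([[5.0, 1.0], [0.0, -1.0]])
W0 = np.array([[1.0, 0.0, 1.0, 1.0], [2.0, 0.0, 1.0, 0.0], [3.0, 0.0, 1.0, 0.0]])
print(fgnn2_forward(A, V0, W0))
```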

3. Imitating strong branching by GNNs

In this section, we present some observations and mathematical concepts underlying the imitation of strong branching by GNNs. This line of research, which aims to replicate SB strategies through GNNs, has shown promising empirical results across a spectrum of studies [gasse2019exact, gupta2020hybrid, zarpellon2021parameterizing, gupta2022lookback, lin2022learning, yang2022learning, scavuzzo2022learning], yet it still lacks theoretical foundations. Its motivation stems from two key observations introduced earlier in Section 1, which we elaborate on here in detail.

Observation I

SB is notably effective in reducing the size of the BnB search space. This size is measured by the size of the BnB tree. Here, a "tree" refers to a hierarchical structure of "nodes", each representing a decision point or a subdivision of the problem. The tree's size corresponds to the number of these nodes. For instance, consider the instance "neos-3761878-oglio" from MIPLIB [gleixner2021miplib]. When solved using SCIP [BolusaniEtal2024OO, BolusaniEtal2024ZR] under standard configurations, the BnB tree size is 851, and it takes 61.04 seconds to attain optimality. However, disabling SB, along with all branching rules dependent on SB, results in an increased BnB tree size of 35548 and an increased runtime of 531.0 seconds.

Observation II

SB itself is computationally expensive. In the above experiment under standard settings, SB consumes on average 70.40% of the total runtime, i.e., 42.97 out of 61.04 seconds.

Therefore, there is a clear need for approximating SB with efficient ML models. Ideally, if we could substantially reduce the SB calculation time from 42.97 seconds to a negligible duration while maintaining its effectiveness, the total runtime would shrink to roughly $61.04-42.97=18.07$ seconds, a significant improvement.

To move forward, we introduce some basic concepts related to SB.

Concepts for SB

SB begins by identifying candidate variables for branching, typically those with non-integer values in the solution to the linear relaxation but which are required to be integers. Each candidate is then assigned an SB score, a non-negative real number determined by creating two linear relaxations and calculating the objective improvement. A higher SB score indicates that the variable has a higher priority to be chosen for branching. Variables that do not qualify as branching candidates are assigned a zero score. Compiling these scores for each variable results in an $n$-dimensional SB score vector, denoted as $\textup{SB}(G)=(\textup{SB}(G)_{1},\textup{SB}(G)_{2},\dots,\textup{SB}(G)_{n})$.

Consequently, the task of approximating SB with GNNs can be described in mathematical language: finding an $F\in\mathcal{F}_{\textup{MP-GNN}}$ or $F\in\mathcal{F}_{\textup{2-FGNN}}$ such that $F(G)\approx\textup{SB}(G)$. Formally, it is:

Formal statement of Problem (1.1): Given a distribution of $G$, is there $F\in\mathcal{F}_{\textup{MP-GNN}}$ or $F\in\mathcal{F}_{\textup{2-FGNN}}$ such that $\|F(G)-\textup{SB}(G)\|$ is smaller than some error tolerance with high probability?

To provide clarity, we present a formal definition of SB scores:

Definition 3.1 (LP relaxation with a single bound change).

Pick $G\in\mathcal{G}_{m,n}$. For any $j\in\{1,2,\dots,n\}$, $\hat{l}_{j}\in\{-\infty\}\cup\mathbb{R}$, and $\hat{u}_{j}\in\mathbb{R}\cup\{+\infty\}$, we denote by $\textup{LP}(G,j,\hat{l}_{j},\hat{u}_{j})$ the following LP problem obtained by changing the lower/upper bound of $x_{j}$ in the LP relaxation of (2.1):

\min_{x\in\mathbb{R}^{n}}\ c^{\top}x,\quad\textup{s.t.}\ Ax\circ b,\ \hat{l}_{j}\leq x_{j}\leq\hat{u}_{j},\ l_{j^{\prime}}\leq x_{j^{\prime}}\leq u_{j^{\prime}}\ \textup{for }j^{\prime}\in\{1,2,\dots,n\}\backslash\{j\}.
Definition 3.2 (Strong branching scores).

Let $G\in\mathcal{G}_{m,n}$ be a MILP-graph associated with the problem (2.1) whose LP relaxation is feasible and bounded. Denote by $f^{*}_{\textup{LP}}(G)\in\mathbb{R}$ the optimal objective value of the LP relaxation of $G$ and by $x^{*}_{\textup{LP}}(G)\in\mathbb{R}^{n}$ the optimal solution with the smallest $\ell_{2}$-norm. The SB score $\textup{SB}(G)_{j}$ for variable $x_{j}$ is defined via

\textup{SB}(G)_{j}=\begin{cases}0,&\textup{if }j\notin I,\\ \big(f^{*}_{\textup{LP}}(G,j,l_{j},\hat{u}_{j})-f^{*}_{\textup{LP}}(G)\big)\cdot\big(f^{*}_{\textup{LP}}(G,j,\hat{l}_{j},u_{j})-f^{*}_{\textup{LP}}(G)\big),&\textup{otherwise},\end{cases}

where $f^{*}_{\textup{LP}}(G,j,l_{j},\hat{u}_{j})$ and $f^{*}_{\textup{LP}}(G,j,\hat{l}_{j},u_{j})$ are the optimal objective values of $\textup{LP}(G,j,l_{j},\hat{u}_{j})$ and $\textup{LP}(G,j,\hat{l}_{j},u_{j})$, respectively, with $\hat{u}_{j}=\lfloor x^{*}_{\textup{LP}}(G)_{j}\rfloor$ being the largest integer no greater than $x^{*}_{\textup{LP}}(G)_{j}$ and $\hat{l}_{j}=\lceil x^{*}_{\textup{LP}}(G)_{j}\rceil$ being the smallest integer no less than $x^{*}_{\textup{LP}}(G)_{j}$, for $j=1,2,\dots,n$.

Remark: LP solution with the smallest 2\ell_{2}-norm

We only define the SB score for MILP problems with feasible and bounded LP relaxations; otherwise the optimal solution $x^{*}_{\textup{LP}}(G)$ does not exist. If the LP relaxation of $G$ admits multiple optimal solutions, then the strong branching score $\textup{SB}(G)$ depends on the choice of the particular optimal solution. To guarantee that the SB score is uniquely defined, in Definition 3.2 we use the optimal solution with the smallest $\ell_{2}$-norm, which is unique.
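As an illustration of Definition 3.2, the following Python sketch computes product SB scores with `scipy.optimize.linprog`, assuming all constraints are supplied in the form $A_{\textup{ub}}x\leq b_{\textup{ub}}$ and that the LP relaxation has a unique optimum (so the minimum-$\ell_{2}$-norm tie-breaking above is not needed); infeasible child LPs are treated as $+\infty$. The function names are illustrative.

```python
# A minimal sketch of the product SB score in Definition 3.2 (assumptions above).
import numpy as np
from scipy.optimize import linprog

def lp_value(c, A_ub, b_ub, bounds):
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.fun if res.status == 0 else np.inf       # infeasible child -> +inf

def sb_scores(c, A_ub, b_ub, bounds, int_set, tol=1e-6):
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    x_lp, f_lp = res.x, res.fun                          # LP relaxation optimum
    scores = np.zeros(len(c))
    for j in int_set:
        if abs(x_lp[j] - round(x_lp[j])) < tol:
            continue                                     # already integral: score stays 0
        lo, hi = bounds[j]
        down = list(bounds); down[j] = (lo, np.floor(x_lp[j]))   # x_j <= floor(x*_j)
        up   = list(bounds); up[j]   = (np.ceil(x_lp[j]), hi)    # x_j >= ceil(x*_j)
        scores[j] = (lp_value(c, A_ub, b_ub, down) - f_lp) * \
                    (lp_value(c, A_ub, b_ub, up) - f_lp)
    return scores
```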

Remark: SB at leaf nodes

While the strong branching score discussed here primarily pertains to root SB, it is equally relevant to SB at leaf nodes within the BnB framework. By interpreting the MILP-graph GG in Definition 3.2 as representing the subproblems encountered during the BnB process, we can extend our findings to strong branching decisions at any point in the BnB tree. Here, root SB refers to the initial branching decisions made at the root of the BnB tree, while leaf nodes represent subsequent branching points deeper in the tree, where similar SB strategies can be applied.

Remark: Other types of SB scores

Although this paper primarily focuses on the product SB scores (where the SB score is defined as the product of objective value changes when branching up and down), our analysis can extend to other forms of SB scores in [dey2024theoretical]. (Refer to Appendix D.1)

4. Main results

4.1. MP-GNNs can represent SB for MP-tractable MILPs

In this subsection, we define a class of MILPs, named message-passing-tractable (MP-tractable) MILPs, and then show that MP-GNNs can represent SB within this class.

To define MP-tractability, we first present the Weisfeiler-Lehman (WL) test [weisfeiler1968reduction], a well-known criterion for assessing the expressive power of MP-GNNs [xu2019powerful]. The WL test in the context of MILP-graphs is stated in Algorithm 1. It follows the same updating pattern as an MP-GNN, except that the local updates are performed via hash functions.

Algorithm 1 The WL test for MILP-Graphs
1: A graph instance $G\in\mathcal{G}_{m,n}$ and iteration limit $L>0$.
2: Initialize with $C_{0}^{V}(i)=\text{HASH}_{0}^{V}(v_{i})$, $C_{0}^{W}(j)=\text{HASH}_{0}^{W}(w_{j})$.
3: for $l=1,2,\cdots,L$ do
4:     $C_{l}^{V}(i)=\text{HASH}_{l}^{V}\left(C_{l-1}^{V}(i),\left\{\left\{\left(C_{l-1}^{W}(j),A_{ij}\right):j\in\mathcal{N}_{W}(i)\right\}\right\}\right)$.
5:     $C_{l}^{W}(j)=\text{HASH}_{l}^{W}\left(C_{l-1}^{W}(j),\left\{\left\{\left(C_{l-1}^{V}(i),A_{ij}\right):i\in\mathcal{N}_{V}(j)\right\}\right\}\right)$.
6: end for
7: Output: Final colors $C_{L}^{V}(i)$ for all $i\in V$ and $C_{L}^{W}(j)$ for all $j\in W$.

The WL test can be understood as a color refinement algorithm. In particular, each vertex in $G$ is initially assigned a color $C_{0}^{V}(i)$ or $C_{0}^{W}(j)$ according to its initial feature $v_{i}$ or $w_{j}$. Then the vertex colors $C_{l}^{V}(i)$ and $C_{l}^{W}(j)$ are iteratively refined via aggregation of neighbors' information and the corresponding edge weights. If there is no collision of hash functions (Footnote 3: here, "no collision of a hash function" means that the hash function does not map two distinct inputs to the same output during the WL test on a specific instance; a stronger assumption, commonly used in WL test analysis [jegelka2022theory], is that all hash functions are injective), then two vertices have the same color at some iteration if and only if, at the previous iteration, they had the same color and the same multiset of neighbors' information and corresponding edge weights. Such a color refinement process is illustrated by the example shown in Figure 2.

[Figure 2 layout: the MILP-graph $G$ of the instance below, with constraint nodes $v_{1},v_{2}$ and variable nodes $w_{1},w_{2},w_{3}$, and its WL color refinement (Algorithm 1) from the initialization through iterations $l=1$ and $l=2$.]
MILP formula:
$\min$ $x_{1}+x_{2}+x_{3}$,
s.t. $x_{1}+x_{2}+x_{3}\leq 1$,
$x_{1}+x_{2}\leq 1$,
$0\leq x_{1},x_{2},x_{3}\leq 1$,
$x_{1},x_{2},x_{3}\in\mathbb{Z}$.
Figure 2. An illustrative example of color refinement and partitions. Initially, all variables share a common color due to their identical node attributes, as do the constraint nodes. After a round of the WL test, $x_{1}$ and $x_{2}$ retain their shared color, while $x_{3}$ is assigned a distinct color, as it connects solely to the first constraint, unlike $x_{1}$ and $x_{2}$. Similarly, the colors of the two constraints can also be differentiated. Finally, this partition stabilizes, resulting in $\mathcal{I}=\{\{1\},\{2\}\}$, $\mathcal{J}=\{\{1,2\},\{3\}\}$.
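The following is a minimal Python sketch of Algorithm 1, where sorted tuples of (color, weight) pairs play the role of collision-free hash functions; the function name and data layout are illustrative.

```python
# A minimal sketch of WL color refinement (Algorithm 1) for a MILP-graph given by
# the coefficient matrix A and hashable node features v_feat, w_feat.
import numpy as np

def wl_partition(A, v_feat, w_feat):
    """Return the stable partitions (I, J) of constraint and variable nodes."""
    m, n = A.shape
    cV = [tuple(v) for v in v_feat]       # initial colors = node features
    cW = [tuple(w) for w in w_feat]
    for _ in range(m + n):                # O(m+n) rounds suffice (Theorem 4.1)
        newV = [(cV[i], tuple(sorted((cW[j], A[i, j]) for j in range(n) if A[i, j] != 0)))
                for i in range(m)]
        newW = [(cW[j], tuple(sorted((cV[i], A[i, j]) for i in range(m) if A[i, j] != 0)))
                for j in range(n)]
        if len(set(newV)) == len(set(cV)) and len(set(newW)) == len(set(cW)):
            break                         # partition stabilized (refinement only splits classes)
        cV, cW = newV, newW
    groupV, groupW = {}, {}
    for i, color in enumerate(cV):
        groupV.setdefault(color, []).append(i)
    for j, color in enumerate(cW):
        groupW.setdefault(color, []).append(j)
    return list(groupV.values()), list(groupW.values())

# The instance of Figure 2: identical variable features, two "<=" constraints.
A = np.array([[1.0, 1.0, 1.0], [1.0, 1.0, 0.0]])
v_feat = [(1.0, "<="), (1.0, "<=")]
w_feat = [(1.0, 0.0, 1.0, 1)] * 3
print(wl_partition(A, v_feat, w_feat))    # ([[0], [1]], [[0, 1], [2]])
```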

One can also view a vertex coloring as a partition, i.e., all vertices are partitioned into several classes such that two vertices are in the same class if and only if they are of the same color. After each round of Algorithm 1, the partition always becomes finer if no collision happens, though it may not be strictly finer. The following theorem suggests that this partition eventually stabilizes or converges, with the final limit uniquely determined by the graph $G$, independent of the hash functions selected.

Theorem 4.1 ([chen2022representing-lp, Theorem A.2]).

For any $G\in\mathcal{G}_{m,n}$, the vertex partition induced by Algorithm 1 (if no collision occurs) converges within $\mathcal{O}(m+n)$ iterations to a partition $(\mathcal{I},\mathcal{J})$, where $\mathcal{I}=\{I_{1},I_{2},\dots,I_{s}\}$ is a partition of $\{1,2,\dots,m\}$ and $\mathcal{J}=\{J_{1},J_{2},\dots,J_{t}\}$ is a partition of $\{1,2,\dots,n\}$, and that partition $(\mathcal{I},\mathcal{J})$ is uniquely determined by the input graph $G$.

With the concepts of color refinement and partition, we can introduce the core concept of this paper:

Definition 4.2 (Message-passing-tractability).

For $G\in\mathcal{G}_{m,n}$, let $(\mathcal{I},\mathcal{J})$ be the partition as in Theorem 4.1. We say that $G$ is message-passing-tractable (MP-tractable) if for any $p\in\{1,2,\dots,s\}$ and $q\in\{1,2,\dots,t\}$, all entries of the submatrix $(A_{ij})_{i\in I_{p},j\in J_{q}}$ are the same. We use $\mathcal{G}_{m,n}^{\textup{MP}}\subset\mathcal{G}_{m,n}$ to denote the subset of all MILP-graphs in $\mathcal{G}_{m,n}$ that are MP-tractable.
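Given the stable partition $(\mathcal{I},\mathcal{J})$ (for instance, as produced by a WL refinement routine like the sketch after Figure 2), checking Definition 4.2 amounts to verifying that every induced block of $A$ is constant. A minimal sketch, with illustrative names and using the Figure 2 instance discussed next:

```python
# A minimal sketch of the MP-tractability check in Definition 4.2.
import numpy as np

def is_mp_tractable(A, I, J):
    """I, J: lists of index lists partitioning the constraint and variable nodes."""
    for Ip in I:
        for Jq in J:
            block = A[np.ix_(Ip, Jq)]
            if not np.all(block == block.flat[0]):   # every entry of the block must coincide
                return False
    return True

# Figure 2 instance with (0-based) stable partition I = {{1},{2}}, J = {{1,2},{3}}.
A = np.array([[1.0, 1.0, 1.0], [1.0, 1.0, 0.0]])
print(is_mp_tractable(A, [[0], [1]], [[0, 1], [2]]))   # True
```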

In order to help readers better understand the concept of "MP-tractable", let us examine the MILP instance shown in Figure 2. After enough rounds of the WL test, the partition stabilizes to $\mathcal{I}=\{\{1\},\{2\}\}$ and $\mathcal{J}=\{\{1,2\},\{3\}\}$. According to Definition 4.2, one must examine the following submatrices to determine whether the MILP is MP-tractable:

A[1,1:2]=[1,1],\quad A[2,1:2]=[1,1],\quad A[1,3]=[1],\quad A[2,3]=[0].

All elements within each submatrix are identical. Hence, this MILP is indeed MP-tractable. To rigorously state our result, we require the following assumption of the MILP data distribution.

Assumption 4.3.

$\mathbb{P}$ is a Borel regular probability measure on $\mathcal{G}_{m,n}$ and $\mathbb{P}[\textup{SB}(G)\in\mathbb{R}^{n}]=1$.

Borel regularity is a "minimal" assumption that is satisfied by almost all practically used data distributions, such as normal distributions, discrete distributions, etc. Let us also comment on the other assumption $\mathbb{P}[\textup{SB}(G)\in\mathbb{R}^{n}]=1$. In Definition 3.2, the linear relaxation of $G$ is feasible and bounded, which implies $f^{*}_{\textup{LP}}(G)\in\mathbb{R}$. However, it is possible for a linear program that is initially bounded and feasible to become infeasible upon adjusting a single variable's bounds, potentially resulting in $f^{*}_{\textup{LP}}(G,j,l_{j},\hat{u}_{j})=+\infty$ or $f^{*}_{\textup{LP}}(G,j,\hat{l}_{j},u_{j})=+\infty$ and leading to an infinite SB score $\textup{SB}(G)_{j}=+\infty$. Although we ignore such cases by assuming $\mathbb{P}[\textup{SB}(G)\in\mathbb{R}^{n}]=1$, it is straightforward to extend all our results by simply representing $+\infty$ as $-1$; since $\textup{SB}(G)_{j}$ is otherwise a non-negative real number, this avoids any collision in the definition.

Based on the above assumption, together with an extra assumption that $G$ is message-passing tractable with probability one, we can show the existence of an MP-GNN capable of accurately mapping a MILP-graph $G$ to its corresponding SB score, with an arbitrarily high degree of precision and probability. The formal theorem is stated as follows.

Theorem 4.4.

Let $\mathbb{P}$ be any probability distribution over $\mathcal{G}_{m,n}$ that satisfies Assumption 4.3 and $\mathbb{P}[G\in\mathcal{G}_{m,n}^{\textup{MP}}]=1$. Then for any $\epsilon,\delta>0$, there exists a GNN $F\in\mathcal{F}_{\textup{MP-GNN}}$ such that

\mathbb{P}[\|F(G)-\textup{SB}(G)\|\leq\delta]\geq 1-\epsilon.

The proof of Theorem 4.4 is deferred to Appendix A, with key ideas outlined here. First, we show that if Algorithm 1 produces identical results for two MP-tractable MILPs, they must share the same SB score. That is, if two MP-tractable MILPs have different SB scores, the WL test (or equivalently MP-GNNs) can capture this distinction. Building on this result, along with a generalized version of the Stone-Weierstrass theorem and Luzin’s theorem, we reach the final conclusion.

Let us compare our findings with [chen2022representing-milp], which establishes the existence of an MP-GNN capable of directly mapping $G$ to one of its optimal solutions, under the assumption that $G$ is unfoldable. Unfoldability means that, after enough rounds of the WL test, each node receives a distinct color. Essentially, it assumes that the WL test can differentiate between all nodes in $G$, so that the elements of the corresponding partition $(\mathcal{I},\mathcal{J})$ have cardinality one: $|I_{p}|=1$ and $|J_{q}|=1$ for all $p\in\{1,2,\dots,s\}$ and $q\in\{1,2,\dots,t\}$. Consequently, any unfoldable MILP must be MP-tractable, because the submatrices $(A_{ij})_{i\in I_{p},j\in J_{q}}$ under the partition of an unfoldable MILP are $1\times 1$ and obviously satisfy the condition in Definition 4.2. However, the reverse assertion is not true: the example in Figure 2 serves as a case in point; it is MP-tractable but not unfoldable. Therefore, unfoldability is a stronger assumption than MP-tractability. Our Theorem 4.4 demonstrates that, to establish the expressive power of MP-GNNs in approximating SB, MP-tractability suffices; we do not need assumptions as strong as those required when considering MP-GNNs for approximating the optimal solution.

4.2. MP-GNNs cannot universally represent SB beyond MP-tractability

Our next main result is that MP-GNNs do not have sufficient capacity to represent SB scores on the entire MILP space without the assumption of MP-tractability, stated as follows.

Theorem 4.5.

There exist two MILP problems with different SB scores, such that any MP-GNN has the same output on them, regardless of the number of parameters.

There are infinitely many pairs of examples proving Theorem 4.5, and we show two simple examples:

(4.1) \begin{split}\min&\quad x_{1}+x_{2}+x_{3}+x_{4}+x_{5}+x_{6}+x_{7}+x_{8},\\ \textup{s.t.}&\quad x_{1}+x_{2}\geq 1,\ x_{2}+x_{3}\geq 1,\ x_{3}+x_{4}\geq 1,\ x_{4}+x_{5}\geq 1,\ x_{5}+x_{6}\geq 1,\\ &\quad x_{6}+x_{7}\geq 1,\ x_{7}+x_{8}\geq 1,\ x_{8}+x_{1}\geq 1,\ 0\leq x_{j}\leq 1,\ x_{j}\in\mathbb{Z},\ 1\leq j\leq 8,\end{split}
(4.2) \begin{split}\min&\quad x_{1}+x_{2}+x_{3}+x_{4}+x_{5}+x_{6}+x_{7}+x_{8},\\ \textup{s.t.}&\quad x_{1}+x_{2}\geq 1,\ x_{2}+x_{3}\geq 1,\ x_{3}+x_{1}\geq 1,\ x_{4}+x_{5}\geq 1,\ x_{5}+x_{6}\geq 1,\\ &\quad x_{6}+x_{4}\geq 1,\ x_{7}+x_{8}\geq 1,\ x_{8}+x_{7}\geq 1,\ 0\leq x_{j}\leq 1,\ x_{j}\in\mathbb{Z},\ 1\leq j\leq 8.\end{split}

We will prove in Appendix B that these two MILP instances have different SB scores, but they cannot be distinguished by any MP-GNN, in the sense that for any $F\in\mathcal{F}_{\textup{MP-GNN}}$, inputs (4.1) and (4.2) lead to the same output. Therefore, it is impossible to train an MP-GNN that approximates the SB score to a required level of accuracy with high probability on both instances, regardless of the complexity of the MP-GNN: any MP-GNN that accurately predicts one MILP's SB score will necessarily fail on the other. We also remark that our analysis of (4.1) and (4.2) generalizes easily to any aggregation mechanism for neighbors' information in the construction of MP-GNNs, not only the sum aggregation used in Section 2.
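As a numerical sanity check of the first claim, the following sketch evaluates the product SB scores of (4.1) and (4.2) with `scipy.optimize.linprog`, assuming the minimum-norm LP optimum $x^{*}=(0.5,\dots,0.5)$ (which holds for both instances), so each branching compares the children with $x_{j}\leq 0$ and $x_{j}\geq 1$; constraints $x_{i}+x_{k}\geq 1$ are passed to the solver as $-x_{i}-x_{k}\leq-1$, and the helper names are illustrative.

```python
# SB scores of the two counterexample instances, under the assumptions stated above.
import numpy as np
from scipy.optimize import linprog

def cover_matrix(pairs, n=8):
    A = np.zeros((len(pairs), n))
    for row, (i, k) in enumerate(pairs):
        A[row, i] = A[row, k] = 1.0
    return A

def product_sb_scores(A):
    n = A.shape[1]
    c = np.ones(n)
    def lp(bounds):
        res = linprog(c, A_ub=-A, b_ub=-np.ones(A.shape[0]), bounds=bounds, method="highs")
        return res.fun if res.status == 0 else np.inf
    f0 = lp([(0, 1)] * n)                          # LP relaxation value (4 for both instances)
    scores = []
    for j in range(n):
        down = [(0, 1)] * n; down[j] = (0, 0)      # x_j <= floor(0.5) = 0
        up   = [(0, 1)] * n; up[j]   = (1, 1)      # x_j >= ceil(0.5)  = 1
        scores.append((lp(down) - f0) * (lp(up) - f0))
    return np.array(scores)

# (4.1): one 8-cycle of constraints.  (4.2): two triangles plus a doubled pair.
cycle8 = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 0)]
split  = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (6, 7), (7, 6)]
print(product_sb_scores(cover_matrix(cycle8)))     # approx. 0 for every variable
print(product_sb_scores(cover_matrix(split)))      # approx. 0.25 for x1..x6, 0 for x7, x8
```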

The MILP instances (4.1) and (4.2), on which MP-GNNs fail to approximate SB scores, are not MP-tractable. It can be verified that for both (4.1) and (4.2), the partition as in Theorem 4.1 is given by $\mathcal{I}=\{I_{1}\}$ with $I_{1}=\{1,2,\dots,8\}$ and $\mathcal{J}=\{J_{1}\}$ with $J_{1}=\{1,2,\dots,8\}$, i.e., all vertices in $V$ form one class and all vertices in $W$ form the other class. Then the matrices $(A_{ij})_{i\in I_{1},j\in J_{1}}$ and $(\bar{A}_{ij})_{i\in I_{1},j\in J_{1}}$ are just $A$ and $\bar{A}$, the coefficient matrices in (4.1) and (4.2), and have both $0$ and $1$ as entries, which does not satisfy Definition 4.2.

Based on Theorem 4.5, we can directly derive the following corollary by considering the discrete uniform distribution $\mathbb{P}$ supported on only the two instances (4.1) and (4.2).

Corollary 4.6.

There exists a probability distribution $\mathbb{P}$ over $\mathcal{G}_{m,n}$ satisfying Assumption 4.3 and constants $\epsilon,\delta>0$, such that for any MP-GNN $F\in\mathcal{F}_{\textup{MP-GNN}}$, it holds that

\mathbb{P}[\|F(G)-\textup{SB}(G)\|\geq\delta]\geq\epsilon.

This corollary indicates that the assumption of MP-tractability in Theorem 4.4 is not removable.

4.3. 2-FGNNs are capable of universally representing SB

Although the universal approximation of MP-GNNs for SB scores is conditioned on the MP-tractability, we find an unconditional positive result stating that when we increase the order of GNNs a bit, it is possible to represent SB scores of MILPs, regardless of the MP-tractability.

Theorem 4.7.

Let $\mathbb{P}$ be any probability distribution over $\mathcal{G}_{m,n}$ that satisfies Assumption 4.3. Then for any $\epsilon,\delta>0$, there exists a GNN $F\in\mathcal{F}_{\textup{2-FGNN}}$ such that

\mathbb{P}[\|F(G)-\textup{SB}(G)\|\leq\delta]\geq 1-\epsilon.

The proof of Theorem 4.7 leverages the second-order folklore Weisfeiler-Lehman (2-FWL) test. We show that for any two MILPs, whether MP-tractable or not, identical 2-FWL results imply they share the same SB score, thus removing the need for MP-tractability. Details are provided in Appendix C.

Theorem 4.7 establishes the existence of a 2-FGNN that can approximate the SB scores of MILPs well with high probability. This is a fundamental result illustrating the possibility of training a GNN to predict branching strategies for MILPs that are not MP-tractable. In particular, for any probability distribution $\mathbb{P}$ as in Corollary 4.6 on which MP-GNNs fail to predict the SB scores well, Theorem 4.7 confirms the capability of 2-FGNNs to work on it.

However, it is worth noting that 2-FGNNs typically have higher computational costs than MP-GNNs, both at the training and the inference stage. This burden comes from the fact that the calculations of 2-FGNNs rely on pairs of nodes instead of individual nodes, as discussed in Section 2. To mitigate such computational challenges, one could explore the use of sparse or local variants of high-order GNNs that enjoy cheaper information aggregation with strictly stronger separation power than GNNs associated with the original high-order WL test [morris2020weisfeiler].

4.4. Practical insights of our theoretical results

Theorem 4.4 and Corollary 4.6 indicate the significance of MP-tractability in practice. Before attempting to train an MP-GNN to imitate SB, practitioners can first verify whether the MILPs in their dataset are MP-tractable. If the dataset contains a substantial number of MP-intractable instances, careful consideration of this approach is necessary, and 2-FGNNs may be more suitable according to Theorem 4.7. Notably, assessing MP-tractability relies solely on conducting the WL test (Algorithm 1). This algorithm is well-established in graph theory and benefits from abundant resources and repositories for implementation. Moreover, it operates with polynomial complexity (detailed below), which is modest compared to the cost of solving MILPs.

Complexity of verifying MP-tractability

To verify MP-tractability of a MILP, one requires at most $\mathcal{O}(m+n)$ color refinement iterations according to Theorem 4.1. The complexity of each iteration is bounded by the number of edges in the graph [shervashidze2011weisfeiler], which in our context is the number of nonzeros in the matrix $A$, denoted $\textup{nnz}(A)$. Therefore, the overall complexity is $\mathcal{O}((m+n)\cdot\textup{nnz}(A))$, which is linear in $(m+n)$ and in $\textup{nnz}(A)$. In contrast, solving an MILP, or even calculating all of its SB scores, requires significantly higher complexity. To calculate the SB scores of a MILP, one needs to solve up to $n$ LPs. Denoting the complexity of solving each LP by $\textup{CompLP}(m,n)$, the overall complexity of calculating SB scores is $\mathcal{O}(n\cdot\textup{CompLP}(m,n))$. Note that there is currently no strongly polynomial-time algorithm for LP, so this complexity is significantly higher than that of verifying MP-tractability.

While verifying MP-tractability is polynomial in complexity, the complexity of GNNs is still not guaranteed. Theorems 4.4 and 4.7 address existence, not complexity. In other words, this paper answers the question of whether GNNs can represent the SB score. To explore how well GNNs can represent SB, further investigation is needed.

Frequency of MP-tractability

In practice, the occurrence of MP-tractable instances is highly dependent on the dataset. In both instances (4.1) and (4.2) (both MP-intractable), all variables exhibit symmetry, as they are assigned the same color by the WL test, which fails to distinguish them. Conversely, in the 3-variable example in Figure 2 (MP-tractable), only two of the three variables, $x_{1}$ and $x_{2}$, are symmetric. Generally, the frequency of MP-tractability depends on the level of symmetry in the data: higher levels of symmetry increase the risk of MP-intractability. Such symmetry is commonly seen in practical MILP datasets, such as MIPLIB 2017 [gleixner2021miplib]. According to [chen2022representing-milp], approximately one-quarter of the instances there exhibit significant symmetry in over half of the variables.

5. Numerical results

We conduct numerical experiments to validate our theoretical findings in Section 4.

Experimental settings

We train an MP-GNN and a 2-FGNN with $L=2$, where we replace the functions $f^{l}(t_{j}^{l-1},A_{ij})$ and $g^{l}(s_{i}^{l-1},A_{ij})$ in the MP-GNN by $A_{ij}f^{l}(t_{j}^{l-1})$ and $A_{ij}g^{l}(s_{i}^{l-1})$ to guarantee that they take the value $0$ whenever $A_{ij}=0$. For both GNNs, $p^{0},q^{0}$ are parameterized as linear transformations followed by a non-linear activation function; $\{p^{l},q^{l},f^{l},g^{l}\}_{l=1}^{L}$ are parameterized as 3-layer multi-layer perceptrons (MLPs) with respective learnable parameters; and the output mapping $r$ is parameterized as a 2-layer MLP. All layers map their input to a 1024-dimensional vector and use the ReLU activation function. With $\theta$ denoting the set of all learnable parameters of a network, we train both MP-GNN and 2-FGNN to fit the SB scores of the MILP dataset $\mathcal{G}$ by minimizing $\frac{1}{2}\sum_{G\in\mathcal{G}}\|F_{\theta}(G)-\textup{SB}(G)\|^{2}$ with respect to $\theta$, using Adam [kingma2014adam]. The networks and training scheme are implemented with Python and TensorFlow [abadi2016tensorflow]; a minimal sketch of one training step is given after the dataset description below. The numerical experiments are conducted on a single NVIDIA Tesla V100 GPU for two datasets:

  • We randomly generate 100 MILP instances, with 6 constraints and 20 variables, that are MP-tractable with probability 1. SB scores are collected using SCIP [scip]. More details about instance generation are provided in Appendix E.

  • We train the MP-GNN and 2-FGNN to fit the SB scores of (4.1) and (4.2), i.e., the dataset only consists of two instances that are not MP-tractable.
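As a minimal illustration of the training objective above, the following TensorFlow sketch performs one gradient step of the least-squares imitation loss; the dense stand-in model, the tensor shapes, and the learning rate are placeholder assumptions and do not reflect the GNN architectures or hyperparameters used in the experiments.

```python
# One Adam step on the imitation loss (1/2)||F_theta(G) - SB(G)||^2 (sketch only).
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(64, activation="relu"),
                             tf.keras.layers.Dense(20)])       # 20 variables per instance
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)

def train_step(features, sb_target):
    with tf.GradientTape() as tape:
        pred = model(features, training=True)
        loss = 0.5 * tf.reduce_sum(tf.square(pred - sb_target))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

# One step on random placeholder data (a batch of 4 flattened instances).
features = tf.random.normal((4, 150))
targets = tf.random.normal((4, 20))
print(float(train_step(features, targets)))
```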

Experimental results

The numerical results are displayed in Figure 3. One can see from Figure 3(a) that both MP-GNN and 2-FGNN can approximate the SB scores over the dataset of random MILP instances very well, which validates Theorem 4.4 and Theorem 4.7. As illustrated in Figure 3(b), 2-FGNN can perfectly fit the SB scores of (4.1) and (4.2) simultaneously while MP-GNN cannot, which is consistent with Theorem 4.5 and Theorem 4.7 and serves as a numerical verification of the capacity difference between MP-GNN and 2-FGNN for SB prediction. A detailed exploration of training and performance evaluations of GNNs is deferred to future work to keep this paper focused on the theoretical capabilities of GNNs.

(a) MP-tractable MILPs: Both MP-GNN and 2-FGNN can fit the SB scores.
(b) MP-intractable MILPs (4.1) and (4.2): 2-FGNN can fit the SB scores while MP-GNN cannot.
Figure 3. Numerical results of MP-GNN and 2-FGNN for SB score fitting. In the right figure, the training error of MP-GNN on the MP-intractable examples does not decrease, regardless of the number of epochs.

Number of parameters

In Figure 3(b), the behavior of MP-GNN remains unchanged regardless of the number of parameters used, as guaranteed by Theorem 4.5. This error is intrinsically due to the structure of MP-intractable MILPs and cannot be reduced by adding parameters. Conversely, 2-FGNN can achieve near-zero loss with sufficient parameters, as guaranteed by Theorem 4.7 and confirmed by our numerical experiments. To further verify this, we tested 2-FGNN with embedding sizes from 64 to 2,048. All models reached near-zero errors, though epoch counts varied, as shown in Table 1. The results suggest that larger embeddings improve model capacity to fit counterexamples. The gains level off beyond an embedding size of 1,024 due to increased training complexity.

Table 1. Epochs required to reach specified errors with varying embedding sizes for 2-FGNN.
Embedding size                    | 64     | 128   | 256   | 512   | 1,024 | 2,048
Epochs to reach $10^{-6}$ error   | 16,570 | 5,414 | 2,736 | 1,442 | 980   | 1,126
Epochs to reach $10^{-12}$ error  | 18,762 | 7,474 | 4,412 | 2,484 | 1,128 | 1,174

Larger instances

While our study primarily focuses on theory, and numerous empirical studies have shown the effectiveness of GNNs in branching strategies (as noted in Section 1), we conducted experiments on larger instances to further assess the scalability of this approach. We trained an MP-GNN on 100 large-scale set covering problems, each with 1,000 variables and 2,000 constraints, generated following the methodology in [gasse2019exact]. The MP-GNN achieved a training loss of $1.94\times 10^{-4}$, calculated as the average $\ell_{2}$ norm of errors across all training instances.

6. Conclusion

In this work, we study the expressive power of two types of GNNs for representing SB scores. We find that MP-GNNs can accurately predict SB results for MILPs within a specific class termed "message-passing-tractable" (MP-tractable). However, their performance is limited outside this class. In contrast, 2-FGNNs, which update node-pair features instead of the node features used in MP-GNNs, can universally approximate the SB scores on every MILP dataset or for every MILP distribution. These findings offer insights into the suitability of different GNN architectures for varying MILP datasets, particularly considering the ease of assessing MP-tractability. We also comment on limitations and future research topics. Although the universal approximation result is established for MP-GNNs and 2-FGNNs to represent SB scores, it is still unclear what complexity or number of parameters is required to achieve a given precision. It would thus be interesting and more practically useful to derive quantitative results. In addition, exploring efficient training strategies or alternatives to higher-order GNNs for MILP tasks is an interesting and significant future direction.

Acknowledgements

We would like to express our deepest gratitude to Prof. Pan Li from the School of Electrical and Computer Engineering at Georgia Institute of Technology (GaTech ECE), for insightful discussions on second-order folklore GNNs and their capacities for general graph tasks. We would also like to thank Haoyu Wang from GaTech ECE for helpful discussions during his internship at Alibaba US DAMO Academy.

References

Appendix A Proof of Theorem 4.4

This section presents the proof of Theorem 4.4. We define the separation power of the WL test in Definition A.1 and prove in Theorem A.3 that two MP-tractable MILP-graphs, or two vertices in a single MP-tractable graph, that are indistinguishable by the WL test must share the same SB score. In other words, the WL test has sufficient separation power to distinguish MP-tractable MILP-graphs, or vertices in a single MP-tractable graph, with different SB scores.

Before stating the major result, we first introduce some definitions and useful theorems.

Definition A.1.

Let $G,\bar{G}\in\mathcal{G}_{m,n}$ and let $C_{l}^{V}(i),C_{l}^{W}(j)$ and $\bar{C}_{l}^{V}(i),\bar{C}_{l}^{W}(j)$ be the colors generated by the WL test (Algorithm 1) for $G$ and $\bar{G}$, respectively. We say $G\stackrel{W}{\sim}\bar{G}$ if $\{\{C_{L}^{V}(i):i\in V\}\}=\{\{\bar{C}_{L}^{V}(i):i\in V\}\}$ and $C_{L}^{W}(j)=\bar{C}_{L}^{W}(j)$ for all $j\in W$ hold for any $L$ and any hash functions.

Theorem A.2 ([chen2022representing-lp, Theorem A.2]).

The partition defined in Theorem 4.1 satisfies:

  1. (a) $v_{i}=v_{i^{\prime}}$ for all $i,i^{\prime}\in I_{p}$, $p\in\{1,2,\dots,s\}$,

  2. (b) $w_{j}=w_{j^{\prime}}$ for all $j,j^{\prime}\in J_{q}$, $q\in\{1,2,\dots,t\}$,

  3. (c) $\{\{A_{ij}:j\in J_{q}\}\}=\{\{A_{i^{\prime}j}:j\in J_{q}\}\}$ for all $i,i^{\prime}\in I_{p}$, $p\in\{1,2,\dots,s\}$, $q\in\{1,2,\dots,t\}$,

  4. (d) $\{\{A_{ij}:i\in I_{p}\}\}=\{\{A_{ij^{\prime}}:i\in I_{p}\}\}$ for all $j,j^{\prime}\in J_{q}$, $p\in\{1,2,\dots,s\}$, $q\in\{1,2,\dots,t\}$,

where $\{\{\cdot\}\}$ denotes a multiset, accounting for both the elements and their multiplicities.

In Theorem A.2, conditions (a) and (b) mean that vertices in the same class share the same features, while conditions (c) and (d) state that vertices in the same class interact with another class through the same multiset of weights. In other words, for any $p\in\{1,2,\dots,s\}$ and $q\in\{1,2,\dots,t\}$, different rows/columns of the submatrix $(A_{ij})_{i\in I_{p},j\in J_{q}}$ give the same multiset of entries.

With the above preparations, we can state and prove the main result now.

Theorem A.3.

For any $G,\bar{G}\in\mathcal{G}_{m,n}^{\textup{MP}}$ with $\textup{SB}(G)\in\mathbb{R}^{n}$ and $\textup{SB}(\bar{G})\in\mathbb{R}^{n}$, the following statements are true:

  1. (a) If $G\stackrel{W}{\sim}\bar{G}$, then $\textup{SB}(G)=\textup{SB}(\bar{G})$.

  2. (b) If $C_{L}^{W}(j_{1})=C_{L}^{W}(j_{2})$ holds for any $L$ and any hash functions, then $\textup{SB}(G)_{j_{1}}=\textup{SB}(G)_{j_{2}}$.

Proof.

(a) Since GWG¯G\stackrel{{\scriptstyle W}}{{\sim}}\bar{G}, after applying some permutation on VV (relabelling vertices in VV) in the graph G¯\bar{G}, the two GG and G¯\bar{G} share the same partition ={I1,I2,,Is}\mathcal{I}=\{I_{1},I_{2},\dots,I_{s}\} and 𝒥={J1,J2,,Jt}\mathcal{J}=\{J_{1},J_{2},\dots,J_{t}\} as in Theorem A.2 and we have

  • For any p{1,2,,s}p\in\{1,2,\dots,s\}, vi=v¯iv_{i}=\bar{v}_{i} is constant over all iIpi\in I_{p},

  • For any q{1,2,,t}q\in\{1,2,\dots,t\}, wj=w¯jw_{j}=\bar{w}_{j} is constant over all jJqj\in J_{q},

  • For any p{1,2,,s}p\in\{1,2,\dots,s\} and q{1,2,,t}q\in\{1,2,\dots,t\}, {{Aij:jJq}}={{A¯ij:jJq}}\{\{A_{ij}:j\in J_{q}\}\}=\{\{\bar{A}_{ij}:j\in J_{q}\}\} is constant over all iIpi\in I_{p},

  • For any p{1,2,,s}p\in\{1,2,\dots,s\} and q{1,2,,t}q\in\{1,2,\dots,t\}, {{Aij:iIp}}={{A¯ij:iIp}}\{\{A_{ij}:i\in I_{p}\}\}=\{\{\bar{A}_{ij}:i\in I_{p}\}\} is constant over all jJqj\in J_{q}.

Here, we slightly abuse the notation and do not distinguish between \bar{G} and the MILP-graph obtained from \bar{G} by relabelling vertices in V; these two graphs have exactly the same SB scores since the vertices in W are not relabelled.

Note that both GG and G¯\bar{G} are MP-tractable, i.e., for any p{1,2,,s}p\in\{1,2,\dots,s\} and q{1,2,,t}q\in\{1,2,\dots,t\}, (Aij)iIp,jJq(A_{ij})_{i\in I_{p},j\in J_{q}} and (A¯ij)iIp,jJq(\bar{A}_{ij})_{i\in I_{p},j\in J_{q}} are both matrices with identical entries, which combined with the third and the fourth conditions above implies that Aij=A¯ijA_{ij}=\bar{A}_{ij} for all iIpi\in I_{p} and jJqj\in J_{q}. Therefore, we have G=G¯G=\bar{G} and hence SB(G)=SB(G¯)\textup{SB}(G)=\textup{SB}(\bar{G}).

(b) The result is a direct corollary of (a) by considering G and the MILP-graph obtained from G by relabeling j_{1} as j_{2} and relabeling j_{2} as j_{1}. ∎

In addition to Theorem A.3, we also need the following two theorems to prove Theorem 4.4.

Theorem A.4 (Lusin’s theorem [evans2018measure, Theorem 1.14]).

Suppose that μ\mu is a Borel regular measure on n\mathbb{R}^{n} and that f:nmf:\mathbb{R}^{n}\rightarrow\mathbb{R}^{m} is μ\mu-measurable, i.e., for any open subset UmU\subset\mathbb{R}^{m}, f1(U)f^{-1}(U) is μ\mu-measurable. Then for any μ\mu-measurable XnX\subset\mathbb{R}^{n} with μ(X)<\mu(X)<\infty and any ϵ>0\epsilon>0, there exists a compact set EXE\subset X with μ(X\E)<ϵ\mu(X\backslash E)<\epsilon, such that f|Ef|_{E} is continuous.

Theorem A.5 ([chen2022representing-lp, Theorem E.1]).

Let X\subset\mathcal{G}_{m,n} be a compact subset that is closed under the action of S_{m}\times S_{n}. Suppose that \Phi\in\mathcal{C}(X,\mathbb{R}^{n}) satisfies the following conditions:

  1. (a)

    For any σVSm,σWSn\sigma_{V}\in S_{m},\sigma_{W}\in S_{n}, and GXG\in X, it holds that Φ((σV,σW)G)=σW(Φ(G))\Phi((\sigma_{V},\sigma_{W})\ast G)=\sigma_{W}(\Phi(G)), where (σV,σW)G(\sigma_{V},\sigma_{W})\ast G represents the MILP-graph obtained from GG by reordering vertices with permutations σV\sigma_{V} and σW\sigma_{W}.

  2. (b)

    \Phi(G)=\Phi(\bar{G}) holds for all G,\bar{G}\in X with G\stackrel{W}{\sim}\bar{G}.

  3. (c)

    Given any GXG\in X and any j1,j2{1,2,,n}j_{1},j_{2}\in\{1,2,\dots,n\}, if CLW(j1)=CLW(j2)C_{L}^{W}(j_{1})=C_{L}^{W}(j_{2}) holds for any LL and any hash functions, then Φ(G)j1=Φ(G)j2\Phi(G)_{j_{1}}=\Phi(G)_{j_{2}}.

Then for any ϵ>0\epsilon>0, there exists FMP-GNNF\in\mathcal{F}_{\textup{MP-GNN}} such that

supGXΦ(G)F(G)<ϵ.\sup_{G\in X}\|\Phi(G)-F(G)\|<\epsilon.

Now we can present the proof of Theorem 4.4.

Proof of Theorem 4.4.

Lemma F.2 and Lemma F.3 in [chen2022representing-lp] prove that the function that maps LP instances to its optimal objective value/optimal solution with the smallest 2\ell_{2}-norm is Borel measurable. Thus, SB:𝒢m,nSB1(n)n\textup{SB}:\mathcal{G}_{m,n}\supset\textup{SB}^{-1}(\mathbb{R}^{n})\rightarrow\mathbb{R}^{n} is also Borel measurable, and is hence \mathbb{P}-measurable due to Assumption 4.3. In addition, 𝒢m,nMP\mathcal{G}_{m,n}^{\textup{MP}} is a Borel subset of 𝒢m,n\mathcal{G}_{m,n} since the MP-tractability is defined by finitely many operations of comparison and aggregations. By Theorem A.4 and the assumption [G𝒢m,nMP]=1\mathbb{P}[G\in\mathcal{G}_{m,n}^{\textup{MP}}]=1, there exists a compact subset X1𝒢m,nMPSB1(n)X_{1}\subset\mathcal{G}_{m,n}^{\textup{MP}}\cap\textup{SB}^{-1}(\mathbb{R}^{n}) such that [𝒢m,n\X1]ϵ\mathbb{P}[\mathcal{G}_{m,n}\backslash X_{1}]\leq\epsilon and SB|X1\textup{SB}|_{X_{1}} is continuous. For any σVSm\sigma_{V}\in S_{m} and σWSn\sigma_{W}\in S_{n}, (σV,σW)X1(\sigma_{V},\sigma_{W})\ast X_{1} is also compact and SB|(σV,σW)X1\textup{SB}|_{(\sigma_{V},\sigma_{W})\ast X_{1}} is also continuous by the permutation-equivariance of SB. Set

X2=σVSm,σWSn(σV,σW)X1.X_{2}=\bigcup_{\sigma_{V}\in S_{m},\sigma_{W}\in S_{n}}(\sigma_{V},\sigma_{W})\ast X_{1}.

Then X2X_{2} is permutation-invariant and compact with

[𝒢m,n\X2][𝒢m,n\X1]ϵ.\mathbb{P}[\mathcal{G}_{m,n}\backslash X_{2}]\leq\mathbb{P}[\mathcal{G}_{m,n}\backslash X_{1}]\leq\epsilon.

In addition, SB|X2\textup{SB}|_{X_{2}} is continuous by pasting lemma.

The rest of the proof is to apply Theorem A.5 with X=X_{2} and \Phi=\textup{SB}, for which we need to verify the three conditions in Theorem A.5. Condition (a) is true since SB is permutation-equivariant by its definition. Conditions (b) and (c) follow directly from Theorem A.3. According to Theorem A.5, there exists some F\in\mathcal{F}_{\textup{MP-GNN}} such that

supGX2F(G)SB(G)δ.\sup_{G\in X_{2}}\|F(G)-\textup{SB}(G)\|\leq\delta.

Therefore, one has

[F(G)SB(G)>δ][𝒢m,n\X2]ϵ,\mathbb{P}[\|F(G)-\textup{SB}(G)\|>\delta]\leq\mathbb{P}[\mathcal{G}_{m,n}\backslash X_{2}]\leq\epsilon,

which completes the proof. ∎

Appendix B Proof of Theorem 4.5

In this section, we verify that the MILP instances (4.1) and (4.2) prove Theorem 4.5. We show that they have different SB scores but cannot be distinguished by any MP-GNN.

Different SB scores

Denote the graph representations of (4.1) and (4.2) as G and \bar{G}, respectively. For both (4.1) and (4.2), the optimal objective value of the LP relaxation is 4 and the optimal solution with the smallest \ell_{2}-norm is (1/2,1/2,1/2,1/2,1/2,1/2,1/2,1/2). To calculate \textup{SB}(G)_{j} or \textup{SB}(\bar{G})_{j}, it is necessary to create two LPs for each variable x_{j}. In one LP, the upper bound of x_{j} is set to \hat{u}_{j}=\lfloor 1/2\rfloor=0, effectively fixing x_{j} at its lower bound l_{j}=0. In the other LP, the lower bound of x_{j} is set to \hat{l}_{j}=\lceil 1/2\rceil=1, effectively fixing x_{j} at 1.
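The two-LP computation just described can be sketched in a few lines of Python. This is only a hedged illustration: it assumes constraints of the form Ax ≥ b with box bounds (as in (4.1) and (4.2)), uses SciPy's linprog with the HiGHS backend of a recent SciPy, and adopts the product form (f_down − f_root)·(f_up − f_root), matching the (9/2−4)·(9/2−4) computation below; the instance data of (4.1)–(4.2) is given in the main text and not repeated here.

```python
import numpy as np
from scipy.optimize import linprog

def lp_value(c, A_ge, b_ge, bounds):
    """Optimal value of min c^T x subject to A_ge x >= b_ge and box bounds."""
    # linprog expects A_ub x <= b_ub, so negate the >= constraints
    res = linprog(c, A_ub=-np.asarray(A_ge), b_ub=-np.asarray(b_ge),
                  bounds=bounds, method="highs")
    assert res.success, "subproblems are assumed feasible and bounded in this sketch"
    return res.fun

def sb_score_product(c, A_ge, b_ge, bounds, j, frac_value):
    """Product-form strong-branching score of variable j at a fractional value."""
    f_root = lp_value(c, A_ge, b_ge, bounds)
    lo, hi = bounds[j]
    down = list(bounds); down[j] = (lo, np.floor(frac_value))  # upper bound -> floor
    up = list(bounds);   up[j] = (np.ceil(frac_value), hi)     # lower bound -> ceil
    f_down = lp_value(c, A_ge, b_ge, down)
    f_up = lp_value(c, A_ge, b_ge, up)
    return (f_down - f_root) * (f_up - f_root)
```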

For the problem (4.1), even if we fix x1=1x_{1}=1, the objective value of the LP relaxation can still achieve 44 by x=(1,0,1,0,1,0,1,0)x=(1,0,1,0,1,0,1,0). A similar observation also holds for fixing x1=0x_{1}=0. Therefore, the SB score for x1x_{1} (also for any xjx_{j} in (4.1)) is 0. In other words,

SB(G)=(0,0,0,0,0,0,0,0).\textup{SB}(G)=(0,0,0,0,0,0,0,0).

However, for the problem (4.2), if we fix x1=1x_{1}=1, then the optimal objective value of the LP relaxation is 9/29/2 since

i=18xi=1+(x2+x3)+12(x4+x5)+12(x5+x6)+12(x6+x4)+(x7+x8)9/2\sum_{i=1}^{8}x_{i}=1+(x_{2}+x_{3})+\frac{1}{2}(x_{4}+x_{5})+\frac{1}{2}(x_{5}+x_{6})+\frac{1}{2}(x_{6}+x_{4})+(x_{7}+x_{8})\geq 9/2

and the above inequality is tight at x=(1,1/2,1/2,1/2,1/2,1/2,1/2,1/2). If we fix x_{1}=0, then x_{2},x_{3}\geq 1 and the optimal objective value of the LP relaxation is also 9/2 since

i=18xi0+1+1+12(x4+x5)+12(x5+x6)+12(x6+x4)+(x7+x8)9/2,\sum_{i=1}^{8}x_{i}\geq 0+1+1+\frac{1}{2}(x_{4}+x_{5})+\frac{1}{2}(x_{5}+x_{6})+\frac{1}{2}(x_{6}+x_{4})+(x_{7}+x_{8})\geq 9/2,

and the equality holds when x=(0,1,1,1/2,1/2,1/2,1/2,1/2). Therefore, the SB score for x_{1} (also for any x_{i}\ (1\leq i\leq 6) in (4.2)) is (9/2-4)\cdot(9/2-4)=1/4. If we fix x_{7}=1, the optimal objective value of the LP relaxation is still 4 since (1/2,1/2,1/2,1/2,1/2,1/2,1,0) is an optimal solution. A similar observation still holds if x_{7} is fixed to 0. Thus the SB scores for x_{7} and x_{8} are both 0. Combining these calculations, we obtain that

SB(G¯)=(14,14,14,14,14,14,0,0).\textup{SB}(\bar{G})=\left(\frac{1}{4},\frac{1}{4},\frac{1}{4},\frac{1}{4},\frac{1}{4},\frac{1}{4},0,0\right).

MP-GNNs’ output

Although GG and G¯\bar{G} are non-isomorphic with different SB scores, they still have the same output for every MP-GNN. We prove this by induction. Referencing the graph representations in Section 2, we explicitly write down the features:

vi=v¯i=(1,),wj=w¯j=(1,0,1,1),for all i{1,,8},j{1,,8}.v_{i}=\bar{v}_{i}=(1,\geq),~{}~{}~{}w_{j}=\bar{w}_{j}=(1,0,1,1),~{}~{}~{}\textup{for all }i\in\{1,\cdots,8\},\ j\in\{1,\cdots,8\}.

Considering the MP-GNN’s initial step where si0=p0(vi)s_{i}^{0}=p^{0}(v_{i}) and tj0=q0(wj)t_{j}^{0}=q^{0}(w_{j}), we can conclude that si0=s¯i0s_{i}^{0}=\bar{s}_{i}^{0} is a constant for all ii and tj0=t¯j0t_{j}^{0}=\bar{t}_{j}^{0} is a constant for all jj, regardless of the choice of functions p0p^{0} and q0q^{0}. Thus, the initial layer generates uniform outcomes for nodes in VV and WW across both graphs, which is the induction base. Suppose that the principle of uniformity applies to sil,s¯il,tjl,t¯jls_{i}^{l},\bar{s}_{i}^{l},t_{j}^{l},\bar{t}_{j}^{l} for some 0lL10\leq l\leq L-1. Since sil,s¯ils_{i}^{l},\bar{s}_{i}^{l} are constant across all ii, we can denote their common value as sls^{l} and hence sl=sil=s¯ils^{l}=s_{i}^{l}=\bar{s}_{i}^{l} for all ii. Similarly, we can define tlt^{l} with tl=tjl=t¯jlt^{l}=t_{j}^{l}=\bar{t}_{j}^{l} for all jj. Then it holds that

sil+1=s¯il+1=pl(sl,2fl(tl,1))andtjl+1=t¯jl+1=ql(tl,2gl(sl,1)),s_{i}^{l+1}=\bar{s}_{i}^{l+1}=p^{l}\big{(}s^{l},2f^{l}(t^{l},1)\big{)}~{}~{}\textup{and}~{}~{}t_{j}^{l+1}=\bar{t}_{j}^{l+1}=q^{l}\big{(}t^{l},2g^{l}(s^{l},1)\big{)},

where we used \{\{A_{ij^{\prime}}:j^{\prime}\in W\}\}=\{\{\bar{A}_{ij^{\prime}}:j^{\prime}\in W\}\}=\{\{A_{i^{\prime}j}:i^{\prime}\in V\}\}=\{\{\bar{A}_{i^{\prime}j}:i^{\prime}\in V\}\}=\{\{1,1,0,0,0,0,0,0\}\} for all i and j. This proves the uniformity for l+1. Therefore, we obtain the existence of s^{L},t^{L} such that s_{i}^{L}=\bar{s}_{i}^{L}=s^{L} and t_{j}^{L}=\bar{t}_{j}^{L}=t^{L} for all i,j. Finally, the output layer yields:

yj=y¯j=r(8sL,8tL,tL)for all j{1,,8},y_{j}=\bar{y}_{j}=r\big{(}8s^{L},8t^{L},t^{L}\big{)}~{}~{}~{}\textup{for all }j\in\{1,\cdots,8\},

which finishes the proof.
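The induction above can also be checked numerically. The sketch below builds two hypothetical 0/1 incidence matrices with uniform features and identical row/column coefficient multisets (one 8-cycle versus two 4-cycles; these are stand-ins chosen for illustration, not the exact coefficient matrices of (4.1) and (4.2)), and runs a simplified sum-aggregation MP-GNN in which messages are weighted by A_ij, a slight variant of the update maps in the proof: the two graphs always receive identical outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_mlp(din, dout, hidden=64):
    """A fixed, randomly initialized two-layer ReLU network."""
    W1, b1 = rng.standard_normal((din, hidden)), rng.standard_normal(hidden)
    W2, b2 = rng.standard_normal((hidden, dout)), rng.standard_normal(dout)
    return lambda x: np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

def make_layers(L=3, d=8):
    # per-layer update maps (f^l, g^l, p^l, q^l), shared across both graphs
    return [(random_mlp(d, d), random_mlp(d, d),
             random_mlp(2 * d, d), random_mlp(2 * d, d)) for _ in range(L)]

def mp_gnn(A, layers, d=8):
    m, n = A.shape
    s, t = np.ones((m, d)), np.ones((n, d))     # identical initial embeddings
    for f, g, p, q in layers:
        s_new = p(np.concatenate([s, A @ f(t)], axis=1))    # messages from W to V
        t_new = q(np.concatenate([t, A.T @ g(s)], axis=1))  # messages from V to W
        s, t = s_new, t_new
    return t

def cycle_incidence(cycles, n=8):
    """Constraint-variable incidence matrix of a disjoint union of cycles."""
    A, row = np.zeros((n, n)), 0
    for cyc in cycles:
        for k in range(len(cyc)):
            A[row, cyc[k]] = A[row, cyc[(k + 1) % len(cyc)]] = 1.0
            row += 1
    return A

A1 = cycle_incidence([[0, 1, 2, 3, 4, 5, 6, 7]])      # one 8-cycle
A2 = cycle_incidence([[0, 1, 2, 3], [4, 5, 6, 7]])    # two 4-cycles
layers = make_layers()
print(np.allclose(mp_gnn(A1, layers), mp_gnn(A2, layers)))  # True
```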

Appendix C Proof of Theorem 4.7

This section presents the proof of Theorem 4.7. The central idea is to establish a separation result in the sense that two MILPs with distinct SB scores must be distinguished by at least one F2-FGNNF\in\mathcal{F}_{\textup{2-FGNN}}, and then apply a generalized Stone-Weierstrass theorem in [azizian2020expressive].

C.1. 2-FWL test and its separation power

The 2-FWL test [cai1992optimal], as an extension of the classic WL test [weisfeiler1968reduction], is a more powerful algorithm for the graph isomorphism problem. By applying the 2-FWL test (formally stated in Algorithm 2) to two graphs and comparing the outcomes, one can conclude that the two graphs are non-isomorphic whenever the outcomes differ. However, identical 2-FWL outcomes do not confirm isomorphism. Although this test does not solve the graph isomorphism problem entirely, it can serve as a measure of 2-FGNN’s separation power, analogous to how the WL test characterizes the separation power of MP-GNN [xu2019powerful].

Algorithm 2 22-FWL test for MILP-Graphs
1:Input: A graph instance G=(V,W,A,FV,FW)G=(V,W,A,F_{V},F_{W}) and iteration limit L>0L>0.
2:Initialize with
C0VW(i,j)\displaystyle C_{0}^{VW}(i,j) =HASH0VW(vi,wj,Aij),\displaystyle=\textup{HASH}_{0}^{VW}(v_{i},w_{j},A_{ij}),
C0WW(j1,j2)\displaystyle C_{0}^{WW}(j_{1},j_{2}) =HASH0WW(wj1,wj2,δj1j2).\displaystyle=\textup{HASH}_{0}^{WW}(w_{j_{1}},w_{j_{2}},\delta_{j_{1}j_{2}}).
3:for l=1,2,,Ll=1,2,\dots,L do
4:     Refine the color
ClVW(i,j)\displaystyle C_{l}^{VW}(i,j) =HASHlVW(Cl1VW(i,j),{{(Cl1WW(j1,j),Cl1VW(i,j1)):j1W}}),\displaystyle=\textup{HASH}_{l}^{VW}\left(C_{l-1}^{VW}(i,j),\left\{\left\{(C_{l-1}^{WW}(j_{1},j),C_{l-1}^{VW}(i,j_{1})):j_{1}\in W\right\}\right\}\right),
ClWW(j1,j2)\displaystyle C_{l}^{WW}(j_{1},j_{2}) =HASHlWW(Cl1WW(j1,j2),{{(Cl1VW(i,j2),Cl1VW(i,j1)):iV}}).\displaystyle=\textup{HASH}_{l}^{WW}\left(C_{l-1}^{WW}(j_{1},j_{2}),\left\{\left\{(C_{l-1}^{VW}(i,j_{2}),C_{l-1}^{VW}(i,j_{1})):i\in V\right\}\right\}\right).
5:end for
6:Output: Final colors CLVW(i,j)C_{L}^{VW}(i,j) for all iV,jWi\in V,j\in W and CLWW(j1,j2)C_{L}^{WW}(j_{1},j_{2}) for all j1,j2Wj_{1},j_{2}\in W.

In particular, given the input graph GG, the 2-FWL test assigns a color for every pair of nodes in the form of (i,j)(i,j) with iV,jWi\in V,j\in W or (j1,j2)(j_{1},j_{2}) with j1,j2Wj_{1},j_{2}\in W. The initial colors are assigned based on the input features and the colors are refined to subcolors at each iteration in the way that two node pairs are of the same subcolor if and only if they have the same color and the same neighbors’ color information. Here, the neighborhood of (i,j)(i,j) involves {{((j1,j),(i,j1)):j1W}}\left\{\left\{((j_{1},j),(i,j_{1})):j_{1}\in W\right\}\right\} and the neighborhood of (j1,j2)(j_{1},j_{2}) involves {{((i,j2),(i,j1)):iV}}\left\{\left\{((i,j_{2}),(i,j_{1})):i\in V\right\}\right\}. After sufficient iterations, the final colors are determined. If the final color multisets of two graphs GG and G¯\bar{G} are identical, they are deemed indistinguishable by the 2-FWL test, denoted by G2G¯G\sim_{2}\bar{G}. One can formally define the separation power of 2-FWL test via two equivalence relations on 𝒢m,n\mathcal{G}_{m,n} as follows.
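The following Python sketch mirrors the refinement of Algorithm 2, again using Python's hashing of nested tuples/multisets as a stand-in for collision-free hash functions (an assumption valid only within one process). The coefficient matrix is taken dense here because the initial colors depend on A_ij for every pair, including zero entries.

```python
from collections import Counter

def fwl2_refine_milp(v_feats, w_feats, A, num_iters):
    """Illustrative 2-FWL refinement on a MILP-graph, following Algorithm 2.

    v_feats[i], w_feats[j]: hashable vertex features; A: dense m x n matrix
    (list of lists), zeros included.
    """
    m, n = len(v_feats), len(w_feats)
    cVW = {(i, j): hash(("VW0", v_feats[i], w_feats[j], A[i][j]))
           for i in range(m) for j in range(n)}
    cWW = {(j1, j2): hash(("WW0", w_feats[j1], w_feats[j2], int(j1 == j2)))
           for j1 in range(n) for j2 in range(n)}
    for _ in range(num_iters):
        newVW = {(i, j): hash(("VW", cVW[i, j],
                               frozenset(Counter((cWW[j1, j], cVW[i, j1])
                                                 for j1 in range(n)).items())))
                 for i in range(m) for j in range(n)}
        newWW = {(j1, j2): hash(("WW", cWW[j1, j2],
                                 frozenset(Counter((cVW[i, j2], cVW[i, j1])
                                                   for i in range(m)).items())))
                 for j1 in range(n) for j2 in range(n)}
        cVW, cWW = newVW, newWW
    return cVW, cWW
```

Comparing the multisets of final colors of two graphs implements, up to hash collisions, the relation of Definition C.1 (a).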

Definition C.1.

Let G,G¯𝒢m,nG,\bar{G}\in\mathcal{G}_{m,n} and let ClVW(i,j),ClWW(j1,j2)C_{l}^{VW}(i,j),C_{l}^{WW}(j_{1},j_{2}) and C¯lVW(i,j),C¯lWW(j1,j2)\bar{C}_{l}^{VW}(i,j),\bar{C}_{l}^{WW}(j_{1},j_{2}) be the colors generated by 2-FWL test for GG and G¯\bar{G}.

  1. (a)

    We define G\sim_{2}\bar{G} if the following hold for any L and any hash functions:

    (C.1) {{CLVW(i,j):iV,jW}}\displaystyle\left\{\left\{C_{L}^{VW}(i,j):i\in V,j\in W\right\}\right\} ={{C¯LVW(i,j):iV,jW}},\displaystyle=\left\{\left\{\bar{C}_{L}^{VW}(i,j):i\in V,j\in W\right\}\right\},
    (C.2) {{CLWW(j1,j2):j1,j2W}}\displaystyle\left\{\left\{C_{L}^{WW}(j_{1},j_{2}):j_{1},j_{2}\in W\right\}\right\} ={{C¯LWW(j1,j2):j1,j2W}}.\displaystyle=\left\{\left\{\bar{C}_{L}^{WW}(j_{1},j_{2}):j_{1},j_{2}\in W\right\}\right\}.
  2. (b)

    We define G\stackrel{W}{\sim}_{2}\bar{G} if the following hold for any L and any hash functions:

    (C.3) {{CLVW(i,j):iV}}\displaystyle\left\{\left\{C_{L}^{VW}(i,j):i\in V\right\}\right\} ={{C¯LVW(i,j):iV}},jW,\displaystyle=\left\{\left\{\bar{C}_{L}^{VW}(i,j):i\in V\right\}\right\},\quad\forall~{}j\in W,
    (C.4) {{CLWW(j1,j):j1W}}\displaystyle\left\{\left\{C_{L}^{WW}(j_{1},j):j_{1}\in W\right\}\right\} ={{C¯LWW(j1,j):j1W}},jW.\displaystyle=\left\{\left\{\bar{C}_{L}^{WW}(j_{1},j):j_{1}\in W\right\}\right\},\quad\forall~{}j\in W.

It can be seen that (C.3) and (C.4) are stronger than (C.1) and (C.2): the latter only require that the entire color multisets are the same, while the former require that the color multisets associated with every j\in W are the same. However, we can show that they are equivalent up to a permutation.

Theorem C.2.

For any G,G¯𝒢m,nG,\bar{G}\in\mathcal{G}_{m,n}, G2G¯G\sim_{2}\bar{G} if and only if there exists a permutation σWSn\sigma_{W}\in S_{n} such that G2WσWG¯G\stackrel{{\scriptstyle W}}{{\sim}}_{2}\sigma_{W}\ast\bar{G}, where σWG¯\sigma_{W}\ast\bar{G} is the graph obtained by relabeling vertices in WW using σW\sigma_{W}.

One can understand that both G\sim_{2}\bar{G} and G\stackrel{W}{\sim}_{2}\bar{G} mean that G and \bar{G} cannot be distinguished by the 2-FWL test, with the difference that G\sim_{2}\bar{G} allows a permutation on W.

Proof of Theorem C.2.

It is clear that G\stackrel{W}{\sim}_{2}\sigma_{W}\ast\bar{G} implies that G\sim_{2}\bar{G}. We then prove the reverse direction, i.e., that G\sim_{2}\bar{G} implies G\stackrel{W}{\sim}_{2}\sigma_{W}\ast\bar{G} for some \sigma_{W}\in S_{n}. It suffices to consider L and hash functions such that there are no collisions in Algorithm 2 and no strict color refinement in the L-th iteration when G and \bar{G} are the input, which means that two node pairs are assigned the same color in the L-th iteration if and only if their colors are the same in the (L-1)-th iteration. For any j_{1},j_{2},j_{1}^{\prime},j_{2}^{\prime}\in W, it holds that

CLWW(j1,j2)=CLWW(j1,j2)\displaystyle C_{L}^{WW}(j_{1},j_{2})=C_{L}^{WW}(j_{1}^{\prime},j_{2}^{\prime})
\displaystyle\implies {{(CLVW(i,j2),CLVW(i,j1)):iV}}={{(CLVW(i,j2),CLVW(i,j1)):iV}}\displaystyle\left\{\left\{(C_{L}^{VW}(i,j_{2}),C_{L}^{VW}(i,j_{1})):i\in V\right\}\right\}=\left\{\left\{(C_{L}^{VW}(i,j_{2}^{\prime}),C_{L}^{VW}(i,j_{1}^{\prime})):i\in V\right\}\right\}
\displaystyle\implies {{CLVW(i,j1):iV}}={{CLVW(i,j1):iV}}and\displaystyle\left\{\left\{C_{L}^{VW}(i,j_{1}):i\in V\right\}\right\}=\left\{\left\{C_{L}^{VW}(i,j_{1}^{\prime}):i\in V\right\}\right\}\ \textup{and}
{{CLVW(i,j2):iV}}={{CLVW(i,j2):iV}}.\displaystyle\left\{\left\{C_{L}^{VW}(i,j_{2}):i\in V\right\}\right\}=\left\{\left\{C_{L}^{VW}(i,j_{2}^{\prime}):i\in V\right\}\right\}.

Similarly, one has that

CLWW(j1,j2)=C¯LWW(j1,j2)\displaystyle C_{L}^{WW}(j_{1},j_{2})=\bar{C}_{L}^{WW}(j_{1}^{\prime},j_{2}^{\prime})
\displaystyle\implies {{CLVW(i,j1):iV}}={{C¯LVW(i,j1):iV}}and\displaystyle\left\{\left\{C_{L}^{VW}(i,j_{1}):i\in V\right\}\right\}=\left\{\left\{\bar{C}_{L}^{VW}(i,j_{1}^{\prime}):i\in V\right\}\right\}\ \textup{and}
{{CLVW(i,j2):iV}}={{C¯LVW(i,j2):iV}},\displaystyle\left\{\left\{C_{L}^{VW}(i,j_{2}):i\in V\right\}\right\}=\left\{\left\{\bar{C}_{L}^{VW}(i,j_{2}^{\prime}):i\in V\right\}\right\},

and that

C¯LWW(j1,j2)=C¯LWW(j1,j2)\displaystyle\bar{C}_{L}^{WW}(j_{1},j_{2})=\bar{C}_{L}^{WW}(j_{1}^{\prime},j_{2}^{\prime})
\displaystyle\implies {{C¯LVW(i,j1):iV}}={{C¯LVW(i,j1):iV}}and\displaystyle\left\{\left\{\bar{C}_{L}^{VW}(i,j_{1}):i\in V\right\}\right\}=\left\{\left\{\bar{C}_{L}^{VW}(i,j_{1}^{\prime}):i\in V\right\}\right\}\ \textup{and}
{{C¯LVW(i,j2):iV}}={{C¯LVW(i,j2):iV}}.\displaystyle\left\{\left\{\bar{C}_{L}^{VW}(i,j_{2}):i\in V\right\}\right\}=\left\{\left\{\bar{C}_{L}^{VW}(i,j_{2}^{\prime}):i\in V\right\}\right\}.

Therefore, for any

𝐂{{{CLVW(i,j):iV}}:jW}{{{C¯LVW(i,j):iV}}:jW},\mathbf{C}\in\left\{\left\{\left\{C_{L}^{VW}(i,j):i\in V\right\}\right\}:j\in W\right\}\cup\left\{\left\{\left\{\bar{C}_{L}^{VW}(i,j):i\in V\right\}\right\}:j\in W\right\},

it follows from (C.2) that

(C.5) {{CLWW(j1,j2):{{CLVW(i,j1):iV}}={{CLVW(i,j2):iV}}=𝐂}}={{C¯LWW(j1,j2):{{C¯LVW(i,j1):iV}}={{C¯LVW(i,j2):iV}}=𝐂}}.\begin{split}&\left\{\left\{C_{L}^{WW}(j_{1},j_{2}):\left\{\left\{C_{L}^{VW}(i,j_{1}):i\in V\right\}\right\}=\left\{\left\{C_{L}^{VW}(i,j_{2}):i\in V\right\}\right\}=\mathbf{C}\right\}\right\}\\ =&\left\{\left\{\bar{C}_{L}^{WW}(j_{1},j_{2}):\left\{\left\{\bar{C}_{L}^{VW}(i,j_{1}):i\in V\right\}\right\}=\left\{\left\{\bar{C}_{L}^{VW}(i,j_{2}):i\in V\right\}\right\}=\mathbf{C}\right\}\right\}.\end{split}

In particular, the number of elements in the two multisets in (C.5) must be the same, which implies that

#{jW:{{CLVW(i,j):iV}}=𝐂}=#{jW:{{C¯LVW(i,j):iV}}=𝐂},\#\left\{j\in W:\left\{\left\{C_{L}^{VW}(i,j):i\in V\right\}\right\}=\mathbf{C}\right\}=\#\left\{j\in W:\left\{\left\{\bar{C}_{L}^{VW}(i,j):i\in V\right\}\right\}=\mathbf{C}\right\},

which then leads to

{{{{CLVW(i,j):iV}}:jW}}={{{{C¯LVW(i,j):iV}}:jW}}.\left\{\left\{\left\{\left\{C_{L}^{VW}(i,j):i\in V\right\}\right\}:j\in W\right\}\right\}=\left\{\left\{\left\{\left\{\bar{C}_{L}^{VW}(i,j):i\in V\right\}\right\}:j\in W\right\}\right\}.

One can hence apply some permutation on WW to obtain (C.3). Next we prove (C.4). For any jWj\in W, we have

{{CLVW(i,j):iV}}={{C¯LVW(i,j):iV}}\displaystyle\left\{\left\{C_{L}^{VW}(i,j):i\in V\right\}\right\}=\left\{\left\{\bar{C}_{L}^{VW}(i,j):i\in V\right\}\right\}
\displaystyle\implies CLVW(i1,j)=C¯LVW(i2,j)for some i1,i2V\displaystyle C_{L}^{VW}(i_{1},j)=\bar{C}_{L}^{VW}(i_{2},j)\quad\textup{for some }i_{1},i_{2}\in V
\displaystyle\implies \left\{\left\{(C_{L}^{WW}(j_{1},j),C_{L-1}^{VW}(i_{1},j_{1})):j_{1}\in W\right\}\right\}=\left\{\left\{(\bar{C}_{L}^{WW}(j_{1},j),\bar{C}_{L-1}^{VW}(i_{2},j_{1})):j_{1}\in W\right\}\right\}
for some i1,i2V\displaystyle\qquad\qquad\textup{for some }i_{1},i_{2}\in V
\displaystyle\implies {{CLWW(j1,j):j1W}}={{C¯LWW(j1,j):j1W}},\displaystyle\left\{\left\{C_{L}^{WW}(j_{1},j):j_{1}\in W\right\}\right\}=\left\{\left\{\bar{C}_{L}^{WW}(j_{1},j):j_{1}\in W\right\}\right\},

which completes the proof. ∎

C.2. SB scores of MILPs distinguishable by 2-FWL test

The following theorem establishes that the separation power of 2-FWL test is stronger than or equal to that of SB, in the sense that two MILP-graphs, or two vertices in a single graph, that cannot be distinguished by the 2-FWL test must share the same SB score.

Theorem C.3.

For any G,\bar{G}\in\mathcal{G}_{m,n}, the following statements are true:

  1. (a)

    If G2WG¯G\stackrel{{\scriptstyle W}}{{\sim}}_{2}\bar{G}, then SB(G)=SB(G¯)\textup{SB}(G)=\textup{SB}(\bar{G}).

  2. (b)

    If G2G¯G\sim_{2}\bar{G}, then there exists some permutation σWSn\sigma_{W}\in S_{n} such that SB(G)=σW(SB(G¯))\textup{SB}(G)=\sigma_{W}(\textup{SB}(\bar{G})).

  3. (c)

    If {{CLWW(j,j1):jW}}={{CLWW(j,j2):jW}}\left\{\left\{C_{L}^{WW}(j,j_{1}):j\in W\right\}\right\}=\left\{\left\{C_{L}^{WW}(j,j_{2}):j\in W\right\}\right\} holds for any LL and any hash functions, then SB(G)j1=SB(G)j2\textup{SB}(G)_{j_{1}}=\textup{SB}(G)_{j_{2}}.

We briefly describe the intuition behind the proof here. The color updating rule of the 2-FWL test is based on monitoring triangles, while that of the classic WL test is based on tracking edges. More specifically, in the 2-FWL test colors are defined on node pairs and two neighboring pairs share a triangle, while in the WL test colors are assigned to nodes and neighboring nodes are connected by edges. When computing the j-th entry of \textup{SB}(G), we change the upper/lower bound of x_{j} and solve two LP problems. We can regard j\in W as a special node; if we fix it in the 2-FWL test, a triangle containing j is determined by the other two nodes, one in V and one in W, and the edge between them. This “reduces” the situation to the setting of the WL test. It is proved in [chen2022representing-lp] that the separation power of the WL test is stronger than or equal to that of LP properties (optimal objective values and solutions). This is to say that even after fixing a special node, the 2-FWL test still has enough separation power to distinguish different LP properties, and hence the 2-FWL test can separate different SB scores. We present the detailed proof of Theorem C.3 in the rest of this subsection.

Theorem C.4.

For any G,G¯𝒢m,nG,\bar{G}\in\mathcal{G}_{m,n}, if G2WG¯G\stackrel{{\scriptstyle W}}{{\sim}}_{2}\bar{G}, then for any j{1,2,,n}j\in\{1,2,\dots,n\}, l^j{}\hat{l}_{j}\in\{-\infty\}\cup\mathbb{R}, and u^j{+}\hat{u}_{j}\in\mathbb{R}\cup\{+\infty\}, the two LP problems LP(G,j,l^j,u^j)\textup{LP}(G,j,\hat{l}_{j},\hat{u}_{j}) and LP(G¯,j,l^j,u^j)\textup{LP}(\bar{G},j,\hat{l}_{j},\hat{u}_{j}) have the same optimal objective value.

Theorem C.5 ([chen2022representing-lp]).

Consider two LP problems with nn variables and mm constraints

(C.6) minxncx,s.t.Axb,lxu,\min_{x\in\mathbb{R}^{n}}~{}~{}c^{\top}x,\quad\textup{s.t.}~{}~{}Ax\circ b,~{}~{}l\leq x\leq u,

and

(C.7) minxnc¯x,s.t.A¯x¯b¯,l¯xu¯.\min_{x\in\mathbb{R}^{n}}~{}~{}\bar{c}^{\top}x,\quad\textup{s.t.}~{}~{}\bar{A}x\ \bar{\circ}\ \bar{b},~{}~{}\bar{l}\leq x\leq\bar{u}.

Suppose that there exist \mathcal{I}=\{I_{1},I_{2},\dots,I_{s}\} and \mathcal{J}=\{J_{1},J_{2},\dots,J_{t}\} that are partitions of V=\{1,2,\dots,m\} and W=\{1,2,\dots,n\} respectively, such that the following hold:

  1. (a)

    For any p{1,2,,s}p\in\{1,2,\dots,s\}, (bi,i)=(b¯i,¯i)(b_{i},\circ_{i})=(\bar{b}_{i},\bar{\circ}_{i}) is constant over all iIpi\in I_{p};

  2. (b)

    For any q{1,2,,t}q\in\{1,2,\dots,t\}, (cj,lj,uj)=(c¯j,l¯j,u¯j)(c_{j},l_{j},u_{j})=(\bar{c}_{j},\bar{l}_{j},\bar{u}_{j}) is constant over all jJqj\in J_{q};

  3. (c)

    For any p{1,2,,s}p\in\{1,2,\dots,s\} and q{1,2,,t}q\in\{1,2,\dots,t\}, jJqAij=jJqA¯ij\sum_{j\in J_{q}}A_{ij}=\sum_{j\in J_{q}}\bar{A}_{ij} is constant over all iIpi\in I_{p}.

  4. (d)

    For any p{1,2,,s}p\in\{1,2,\dots,s\} and q{1,2,,t}q\in\{1,2,\dots,t\}, iIpAij=iIpA¯ij\sum_{i\in I_{p}}A_{ij}=\sum_{i\in I_{p}}\bar{A}_{ij} is constant over all jJqj\in J_{q}.

Then the two problems (C.6) and (C.7) have the same feasibility, the same optimal objective value, and the same optimal solution with the smallest 2\ell_{2}-norm (if feasible and bounded).

Proof of Theorem C.4.

Choose L and hash functions such that there are no collisions in Algorithm 2 and no strict color refinement in the L-th iteration when G and \bar{G} are the input. Fix any j\in W and construct the partitions \mathcal{I}=\{I_{1},I_{2},\dots,I_{s}\} and \mathcal{J}=\{J_{1},J_{2},\dots,J_{t}\} as follows:

  • i1,i2Ipi_{1},i_{2}\in I_{p} for some p{1,2,,s}p\in\{1,2,\dots,s\} if and only if CLVW(i1,j)=CLVW(i2,j)C_{L}^{VW}(i_{1},j)=C_{L}^{VW}(i_{2},j).

  • j1,j2Jqj_{1},j_{2}\in J_{q} for some q{1,2,,t}q\in\{1,2,\dots,t\} if and only if CLWW(j1,j)=CLWW(j2,j)C_{L}^{WW}(j_{1},j)=C_{L}^{WW}(j_{2},j).

Without loss of generality, we can assume that j\in J_{1}. One observation is that J_{1}=\{j\}. This is because j_{1}\in J_{1} implies that C_{L}^{WW}(j_{1},j)=C_{L}^{WW}(j,j), which then leads to C_{0}^{WW}(j_{1},j)=C_{0}^{WW}(j,j) and \delta_{j_{1}j}=\delta_{jj}=1 since there are no collisions. We thus have j_{1}=j.

Note that we have (C.3) and (C.4) from the assumption G\stackrel{W}{\sim}_{2}\bar{G}. So after permuting \bar{G} on V and W\backslash\{j\}, one can obtain C_{L}^{VW}(i,j)=\bar{C}_{L}^{VW}(i,j) for all i\in V and C_{L}^{WW}(j_{1},j)=\bar{C}_{L}^{WW}(j_{1},j) for all j_{1}\in W. Another observation is that such a permutation does not change the optimal objective value of \textup{LP}(\bar{G},j,\hat{l}_{j},\hat{u}_{j}) as j is fixed.

Next, we verify the four conditions in Theorem C.5 for two LP problems LP(G,j,l^j,u^j)\textup{LP}(G,j,\hat{l}_{j},\hat{u}_{j}) and LP(G¯,j,l^j,u^j)\textup{LP}(\bar{G},j,\hat{l}_{j},\hat{u}_{j}) with respect to the partitions ={I1,I2,,Is}\mathcal{I}=\{I_{1},I_{2},\dots,I_{s}\} and 𝒥={J1,J2,,Jt}\mathcal{J}=\{J_{1},J_{2},\dots,J_{t}\}.

Verification of Condition (a) in Theorem C.5

Since there is no collision in the 2-FWL test Algorithm 2, C_{L}^{VW}(i,j)=\bar{C}_{L}^{VW}(i,j) implies that C_{0}^{VW}(i,j)=\bar{C}_{0}^{VW}(i,j) and hence that v_{i}=\bar{v}_{i}, which is also constant over all i\in I_{p} since C_{L}^{VW}(i,j) is constant over all i\in I_{p} by definition.

Verification of Condition (b) in Theorem C.5

It follows from C_{L}^{WW}(j_{1},j)=\bar{C}_{L}^{WW}(j_{1},j) that C_{0}^{WW}(j_{1},j)=\bar{C}_{0}^{WW}(j_{1},j) and hence that w_{j_{1}}=\bar{w}_{j_{1}}, which is also constant over all j_{1}\in J_{q} since C_{L}^{WW}(j_{1},j) is constant over all j_{1}\in J_{q} by definition.

Verification of Condition (c) in Theorem C.5

Consider any p{1,2,,s}p\in\{1,2,\dots,s\} and any iIpi\in I_{p}. It follows from CLVW(i,j)=C¯LVW(i,j)C_{L}^{VW}(i,j)=\bar{C}_{L}^{VW}(i,j) that

{{(CL1WW(j1,j),CL1VW(i,j1)):j1W}}={{(C¯L1WW(j1,j),C¯L1VW(i,j1)):j1W}},\left\{\left\{(C_{L-1}^{WW}(j_{1},j),C_{L-1}^{VW}(i,j_{1})):j_{1}\in W\right\}\right\}=\left\{\left\{(\bar{C}_{L-1}^{WW}(j_{1},j),\bar{C}_{L-1}^{VW}(i,j_{1})):j_{1}\in W\right\}\right\},

and hence that

{{(CLWW(j1,j),Aij1):j1W}}={{(C¯LWW(j1,j),A¯ij1):j1W}},\left\{\left\{(C_{L}^{WW}(j_{1},j),A_{ij_{1}}):j_{1}\in W\right\}\right\}=\left\{\left\{(\bar{C}_{L}^{WW}(j_{1},j),\bar{A}_{ij_{1}}):j_{1}\in W\right\}\right\},

where we used the fact that there is no strict color refinement in the LL-th iteration and there is no collision in Algorithm 2. We can thus conclude for any q{1,2,,t}q\in\{1,2,\dots,t\} that

{{Aij1:j1Jq}}={{A¯ij1:j1Jq}},\{\{A_{ij_{1}}:j_{1}\in J_{q}\}\}=\{\{\bar{A}_{ij_{1}}:j_{1}\in J_{q}\}\},

which implies that \sum_{j_{1}\in J_{q}}A_{ij_{1}}=\sum_{j_{1}\in J_{q}}\bar{A}_{ij_{1}}, and this common value is constant over i\in I_{p} since C_{L}^{VW}(i,j)=\bar{C}_{L}^{VW}(i,j) is constant over i\in I_{p}.

Verification of Condition (d) in Theorem C.5

Consider any q{1,2,,t}q\in\{1,2,\dots,t\} and any j1Jqj_{1}\in J_{q}. It follows from CLWW(j1,j)=C¯LWW(j1,j)C_{L}^{WW}(j_{1},j)=\bar{C}_{L}^{WW}(j_{1},j) that

{{(CL1VW(i,j),CL1VW(i,j1)):iV}}={{(C¯L1VW(i,j),C¯L1VW(i,j1)):iV}},\left\{\left\{(C_{L-1}^{VW}(i,j),C_{L-1}^{VW}(i,j_{1})):i\in V\right\}\right\}=\left\{\left\{(\bar{C}_{L-1}^{VW}(i,j),\bar{C}_{L-1}^{VW}(i,j_{1})):i\in V\right\}\right\},

and hence that

{{(CLVW(i,j),Aij1):iV}}={{(C¯LVW(i,j),A¯ij1):iV}},\left\{\left\{(C_{L}^{VW}(i,j),A_{ij_{1}}):i\in V\right\}\right\}=\left\{\left\{(\bar{C}_{L}^{VW}(i,j),\bar{A}_{ij_{1}}):i\in V\right\}\right\},

where we used the fact that there is no strict color refinement at the LL-th iteration and there is no collision in Algorithm 2. We can thus conclude for any p{1,2,,s}p\in\{1,2,\dots,s\} that

{{Aij1:iIp}}={{A¯ij1:iIp}},\{\{A_{ij_{1}}:i\in I_{p}\}\}=\{\{\bar{A}_{ij_{1}}:i\in I_{p}\}\},

which implies that \sum_{i\in I_{p}}A_{ij_{1}}=\sum_{i\in I_{p}}\bar{A}_{ij_{1}}, and this common value is constant over j_{1}\in J_{q} since C_{L}^{WW}(j_{1},j)=\bar{C}_{L}^{WW}(j_{1},j) is constant over j_{1}\in J_{q}.

Combining all discussion above and noticing that J1={j}J_{1}=\{j\}, one can apply Theorem C.5 and conclude that the two LP problems LP(G,j,l^j,u^j)\textup{LP}(G,j,\hat{l}_{j},\hat{u}_{j}) and LP(G¯,j,l^j,u^j)\textup{LP}(\bar{G},j,\hat{l}_{j},\hat{u}_{j}) have the same optimal objective value, which completes the proof. ∎

Corollary C.6.

For any G,G¯𝒢m,nG,\bar{G}\in\mathcal{G}_{m,n}, if G2WG¯G\stackrel{{\scriptstyle W}}{{\sim}}_{2}\bar{G}, then the LP relaxations of GG and G¯\bar{G} have the same optimal objective value and the same optimal solution with the smallest 2\ell_{2}-norm (if feasible and bounded).

Proof.

If there are no collisions, it follows from (C.4) that C_{L}^{WW}(j,j)=\bar{C}_{L}^{WW}(j,j), which implies l_{j}=\bar{l}_{j} and u_{j}=\bar{u}_{j} for any j\in W. Then we can apply Theorem C.4 to conclude that the two LP problems \textup{LP}(G,j,l_{j},u_{j}) and \textup{LP}(\bar{G},j,\bar{l}_{j},\bar{u}_{j}), which are the LP relaxations of G and \bar{G}, have the same optimal objective value.

In the case that the LP relaxations of G and \bar{G} are both feasible and bounded, we use x and \bar{x} to denote their optimal solutions with the smallest \ell_{2}-norm. For any j\in W, x and \bar{x} are also the optimal solutions with the smallest \ell_{2}-norm for \textup{LP}(G,j,l_{j},u_{j}) and \textup{LP}(\bar{G},j,\bar{l}_{j},\bar{u}_{j}), respectively. By Theorem C.5 and the same arguments as in the proof of Theorem C.4, we have x_{j}=\bar{x}_{j}. Note that we cannot infer x=\bar{x} by considering a single j\in W because we apply a permutation on V and W\backslash\{j\} in the proof of Theorem C.4. But we have x_{j}=\bar{x}_{j} for any j\in W, which leads to x=\bar{x}. ∎

Proof of Theorem C.3.

(a) By Corollary C.6 and Theorem C.4.

(b) By Theorem C.2 and (a).

(c) Apply (a) on GG and the graph obtained from GG by switching j1j_{1} and j2j_{2}. ∎

C.3. Equivalence between the separation powers of the 2-FWL test and 2-FGNNs

This section establishes the equivalence between the separation powers of the 2-FWL test and 2-FGNNs.

Theorem C.7.

For any G,\bar{G}\in\mathcal{G}_{m,n}, the following statements are true:

  1. (a)

    G2WG¯G\stackrel{{\scriptstyle W}}{{\sim}}_{2}\bar{G} if and only if F(G)=F(G¯)F(G)=F(\bar{G}) for all F2-FGNNF\in\mathcal{F}_{\textup{2-FGNN}}.

  2. (b)

    {{CLWW(j,j1):jW}}={{CLWW(j,j2):jW}}\left\{\left\{C_{L}^{WW}(j,j_{1}):j\in W\right\}\right\}=\left\{\left\{C_{L}^{WW}(j,j_{2}):j\in W\right\}\right\} holds for any LL and any hash functions if and only if F(G)j1=F(G)j2,F2-FGNNF(G)_{j_{1}}=F(G)_{j_{2}},\ \forall~{}F\in\mathcal{F}_{\textup{2-FGNN}}.

  3. (c)

    G\sim_{2}\bar{G} if and only if f(G)=f(\bar{G}) for all scalar functions f with f\mathbf{1}\in\mathcal{F}_{\textup{2-FGNN}}.

The intuition behind Theorem C.7 is that the color updating rule in the 2-FWL test is of the same format as the feature updating rule in 2-FGNN, and that the local update mappings p^{l},q^{l},f^{l},g^{l},r can be chosen to be injective on the current features. Results of a similar spirit also exist in the previous literature; see, e.g., [xu2019powerful, azizian2020expressive, geerts2022expressiveness, chen2022representing-lp]. We present the detailed proof of Theorem C.7 in the rest of this subsection.

Lemma C.8.

For any G,G¯𝒢m,nG,\bar{G}\in\mathcal{G}_{m,n}, if G2WG¯G\stackrel{{\scriptstyle W}}{{\sim}}_{2}\bar{G}, then F(G)=F(G¯)F(G)=F(\bar{G}) for all F2-FGNNF\in\mathcal{F}_{\textup{2-FGNN}}.

Proof.

Consider any F\in\mathcal{F}_{\textup{2-FGNN}} with L layers and let s_{ij}^{l},t_{j_{1}j_{2}}^{l} and \bar{s}_{ij}^{l},\bar{t}_{j_{1}j_{2}}^{l} be the features in the l-th layer of F. Choose L and hash functions such that there are no collisions in Algorithm 2 when G and \bar{G} are the input. We will prove the following claims by induction for 0\leq l\leq L:

  1. (a)

    ClVW(i,j)=ClVW(i,j)C_{l}^{VW}(i,j)=C_{l}^{VW}(i^{\prime},j^{\prime}) implies sijl=sijls_{ij}^{l}=s_{i^{\prime}j^{\prime}}^{l}.

  2. (b)

    ClVW(i,j)=C¯lVW(i,j)C_{l}^{VW}(i,j)=\bar{C}_{l}^{VW}(i^{\prime},j^{\prime}) implies sijl=s¯ijls_{ij}^{l}=\bar{s}_{i^{\prime}j^{\prime}}^{l}.

  3. (c)

    C¯lVW(i,j)=C¯lVW(i,j)\bar{C}_{l}^{VW}(i,j)=\bar{C}_{l}^{VW}(i^{\prime},j^{\prime}) implies s¯ijl=s¯ijl\bar{s}_{ij}^{l}=\bar{s}_{i^{\prime}j^{\prime}}^{l}.

  4. (d)

    ClWW(j1,j2)=ClWW(j1,j2)C_{l}^{WW}(j_{1},j_{2})=C_{l}^{WW}(j_{1}^{\prime},j_{2}^{\prime}) implies tj1j2l=tj1j2lt_{j_{1}j_{2}}^{l}=t_{j_{1}^{\prime}j_{2}^{\prime}}^{l}.

  5. (e)

    ClWW(j1,j2)=C¯lWW(j1,j2)C_{l}^{WW}(j_{1},j_{2})=\bar{C}_{l}^{WW}(j_{1}^{\prime},j_{2}^{\prime}) implies tj1j2l=t¯j1j2lt_{j_{1}j_{2}}^{l}=\bar{t}_{j_{1}^{\prime}j_{2}^{\prime}}^{l}.

  6. (f)

    C¯lWW(j1,j2)=C¯lWW(j1,j2)\bar{C}_{l}^{WW}(j_{1},j_{2})=\bar{C}_{l}^{WW}(j_{1}^{\prime},j_{2}^{\prime}) implies t¯j1j2l=t¯j1j2l\bar{t}_{j_{1}j_{2}}^{l}=\bar{t}_{j_{1}^{\prime}j_{2}^{\prime}}^{l}.

As the induction base, the claims (a)-(f) are true for l=0 since \textup{HASH}_{0}^{VW} and \textup{HASH}_{0}^{WW} do not have collisions. Now we assume that the claims (a)-(f) are all true for l-1, where l\in\{1,2,\dots,L\}, and prove them for l. In fact, one can prove the claim (a) for l as follows:

ClVW(i,j)=ClVW(i,j)\displaystyle C_{l}^{VW}(i,j)=C_{l}^{VW}(i^{\prime},j^{\prime})
\displaystyle\implies Cl1VW(i,j)=Cl1VW(i,j)and\displaystyle C_{l-1}^{VW}(i,j)=C_{l-1}^{VW}(i^{\prime},j^{\prime})\quad\textup{and}
{{(Cl1WW(j1,j),Cl1VW(i,j1)):j1W}}={{(Cl1WW(j1,j),Cl1VW(i,j1)):j1W}}\displaystyle\left\{\left\{(C_{l-1}^{WW}(j_{1},j),C_{l-1}^{VW}(i,j_{1})):j_{1}\in W\right\}\right\}=\left\{\left\{(C_{l-1}^{WW}(j_{1},j^{\prime}),C_{l-1}^{VW}(i^{\prime},j_{1})):j_{1}\in W\right\}\right\}
\displaystyle\implies sijl1=sijl1and{{(tj1jl1,sij1l1):j1W}}={{(tj1jl1,sij1l1):j1W}}\displaystyle s_{ij}^{l-1}=s_{i^{\prime}j^{\prime}}^{l-1}\quad\textup{and}\quad\left\{\left\{(t_{j_{1}j}^{l-1},s_{ij_{1}}^{l-1}):j_{1}\in W\right\}\right\}=\left\{\left\{(t_{j_{1}j^{\prime}}^{l-1},s_{i^{\prime}j_{1}}^{l-1}):j_{1}\in W\right\}\right\}
\displaystyle\implies sijl=sijl.\displaystyle s_{ij}^{l}=s_{i^{\prime}j^{\prime}}^{l}.

The proof of claims (b)-(f) for ll is very similar and hence omitted.

Using the claims (a)-(f) for LL, we can conclude that

G2WG¯\displaystyle G\stackrel{{\scriptstyle W}}{{\sim}}_{2}\bar{G}
\displaystyle\implies {{CLVW(i,j):iV}}={{C¯LVW(i,j):iV}},jW,and\displaystyle\left\{\left\{C_{L}^{VW}(i,j):i\in V\right\}\right\}=\left\{\left\{\bar{C}_{L}^{VW}(i,j):i\in V\right\}\right\},\ \forall~{}j\in W,\ \textup{and}
{{CLWW(j1,j):j1W}}={{C¯LWW(j1,j):j1W}},jW\displaystyle\left\{\left\{C_{L}^{WW}(j_{1},j):j_{1}\in W\right\}\right\}=\left\{\left\{\bar{C}_{L}^{WW}(j_{1},j):j_{1}\in W\right\}\right\},\ \forall~{}j\in W
\displaystyle\implies {{sijL:iV}}={{s¯ijL:iV}},jW,and\displaystyle\left\{\left\{s_{ij}^{L}:i\in V\right\}\right\}=\left\{\left\{\bar{s}_{ij}^{L}:i\in V\right\}\right\},\ \forall~{}j\in W,\ \textup{and}
{{tj1jL:j1W}}={{t¯j1jL:j1W}},jW\displaystyle\left\{\left\{t_{j_{1}j}^{L}:j_{1}\in W\right\}\right\}=\left\{\left\{\bar{t}_{j_{1}j}^{L}:j_{1}\in W\right\}\right\},\ \forall~{}j\in W
\displaystyle\implies r(iVsijL,j1Wtj1jL)=r(iVs¯ijL,j1Wt¯j1jL),jW\displaystyle r\left(\sum_{i\in V}s_{ij}^{L},\sum_{j_{1}\in W}t_{j_{1}j}^{L}\right)=r\left(\sum_{i\in V}\bar{s}_{ij}^{L},\sum_{j_{1}\in W}\bar{t}_{j_{1}j}^{L}\right),\ \forall~{}j\in W
\displaystyle\implies F(G)=F(G¯),\displaystyle F(G)=F(\bar{G}),

which completes the proof. ∎

Lemma C.9.

For any G,G¯𝒢m,nG,\bar{G}\in\mathcal{G}_{m,n}, if F(G)=F(G¯)F(G)=F(\bar{G}) for all F2-FGNNF\in\mathcal{F}_{\textup{2-FGNN}}, then G2WG¯G\stackrel{{\scriptstyle W}}{{\sim}}_{2}\bar{G}.

Proof.

We claim that for any L there exist 2-FGNN layers for l=0,1,2,\dots,L, such that the following hold true for any 0\leq l\leq L and any hash functions:

  1. (a)

    sijl=sijls_{ij}^{l}=s_{i^{\prime}j^{\prime}}^{l} implies ClVW(i,j)=ClVW(i,j)C_{l}^{VW}(i,j)=C_{l}^{VW}(i^{\prime},j^{\prime}).

  2. (b)

    sijl=s¯ijls_{ij}^{l}=\bar{s}_{i^{\prime}j^{\prime}}^{l} implies ClVW(i,j)=C¯lVW(i,j)C_{l}^{VW}(i,j)=\bar{C}_{l}^{VW}(i^{\prime},j^{\prime}).

  3. (c)

    s¯ijl=s¯ijl\bar{s}_{ij}^{l}=\bar{s}_{i^{\prime}j^{\prime}}^{l} implies C¯lVW(i,j)=C¯lVW(i,j)\bar{C}_{l}^{VW}(i,j)=\bar{C}_{l}^{VW}(i^{\prime},j^{\prime}).

  4. (d)

    tj1j2l=tj1j2lt_{j_{1}j_{2}}^{l}=t_{j_{1}^{\prime}j_{2}^{\prime}}^{l} implies ClWW(j1,j2)=ClWW(j1,j2)C_{l}^{WW}(j_{1},j_{2})=C_{l}^{WW}(j_{1}^{\prime},j_{2}^{\prime}).

  5. (e)

    tj1j2l=t¯j1j2lt_{j_{1}j_{2}}^{l}=\bar{t}_{j_{1}^{\prime}j_{2}^{\prime}}^{l} implies ClWW(j1,j2)=C¯lWW(j1,j2)C_{l}^{WW}(j_{1},j_{2})=\bar{C}_{l}^{WW}(j_{1}^{\prime},j_{2}^{\prime}).

  6. (f)

    t¯j1j2l=t¯j1j2l\bar{t}_{j_{1}j_{2}}^{l}=\bar{t}_{j_{1}^{\prime}j_{2}^{\prime}}^{l} implies C¯lWW(j1,j2)=C¯lWW(j1,j2)\bar{C}_{l}^{WW}(j_{1},j_{2})=\bar{C}_{l}^{WW}(j_{1}^{\prime},j_{2}^{\prime}).

Such layers can be constructed inductively. First, for l=0l=0, we can simply choose p0p^{0} that is injective on {(vi,wj,Aij):iV,jW}{(v¯i,w¯j,A¯ij):iV,jW}\{(v_{i},w_{j},A_{ij}):i\in V,j\in W\}\cup\{(\bar{v}_{i},\bar{w}_{j},\bar{A}_{ij}):i\in V,j\in W\} and q0q^{0} that is injective on {(wj1,wj2,δj1j2):j1,j2W}{(w¯j1,w¯j2,δj1j2):j1,j2W}\{(w_{j_{1}},w_{j_{2}},\delta_{j_{1}j_{2}}):j_{1},j_{2}\in W\}\cup\{(\bar{w}_{j_{1}},\bar{w}_{j_{2}},\delta_{j_{1}j_{2}}):j_{1},j_{2}\in W\}.

Assume that the conditions (a)-(f) are true for l-1, where 1\leq l\leq L; we aim to construct the l-th layer such that (a)-(f) are also true for l. Let \alpha_{1},\alpha_{2},\dots,\alpha_{u} collect all different elements in \{s_{ij}^{l-1}:i\in V,j\in W\}\cup\{\bar{s}_{ij}^{l-1}:i\in V,j\in W\} and let \beta_{1},\beta_{2},\dots,\beta_{u^{\prime}} collect all different elements in \{t_{j_{1}j_{2}}^{l-1}:j_{1},j_{2}\in W\}\cup\{\bar{t}_{j_{1}j_{2}}^{l-1}:j_{1},j_{2}\in W\}. Choose some continuous f^{l} such that f^{l}(\beta_{k^{\prime}},\alpha_{k})=e_{k^{\prime}}^{u^{\prime}}\otimes e_{k}^{u}\in\mathbb{R}^{u^{\prime}\times u}, where e_{k^{\prime}}^{u^{\prime}} is the vector in \mathbb{R}^{u^{\prime}} with the k^{\prime}-th entry being 1 and all other entries being 0, and e_{k}^{u} is the vector in \mathbb{R}^{u} with the k-th entry being 1 and all other entries being 0. Choose some continuous p^{l} that is injective on the set \left\{\left(s_{ij}^{l-1},\sum_{j_{1}\in W}f^{l}(t_{j_{1}j}^{l-1},s_{ij_{1}}^{l-1})\right):i\in V,j\in W\right\}\cup\left\{\left(\bar{s}_{ij}^{l-1},\sum_{j_{1}\in W}f^{l}(\bar{t}_{j_{1}j}^{l-1},\bar{s}_{ij_{1}}^{l-1})\right):i\in V,j\in W\right\}. By the injectivity of p^{l} and the linear independence of \{e_{k^{\prime}}^{u^{\prime}}\otimes e_{k}^{u}:1\leq k\leq u,1\leq k^{\prime}\leq u^{\prime}\}, we have that

sijl=sijl\displaystyle s_{ij}^{l}=s_{i^{\prime}j^{\prime}}^{l}
\displaystyle\implies sijl1=sijl1andj1Wfl(tj1jl1,sij1l1)=j1Wfl(tj1jl1,sij1l1)\displaystyle s_{ij}^{l-1}=s_{i^{\prime}j^{\prime}}^{l-1}\quad\textup{and}\quad\sum_{j_{1}\in W}f^{l}(t_{j_{1}j}^{l-1},s_{ij_{1}}^{l-1})=\sum_{j_{1}\in W}f^{l}(t_{j_{1}j^{\prime}}^{l-1},s_{i^{\prime}j_{1}}^{l-1})
\displaystyle\implies sijl1=sijl1and for any 1ku, 1ku\displaystyle s_{ij}^{l-1}=s_{i^{\prime}j^{\prime}}^{l-1}\quad\textup{and for any }1\leq k\leq u,\ 1\leq k^{\prime}\leq u^{\prime}
#{j1W:tj1jl1=βk,sij1l1=αk}=#{j1W:tj1jl1=βk,sij1l1=αk}\displaystyle\#\big{\{}j_{1}\in W:t_{j_{1}j}^{l-1}=\beta_{k^{\prime}},s_{ij_{1}}^{l-1}=\alpha_{k}\big{\}}=\#\big{\{}j_{1}\in W:t_{j_{1}j^{\prime}}^{l-1}=\beta_{k^{\prime}},s_{i^{\prime}j_{1}}^{l-1}=\alpha_{k}\big{\}}
\displaystyle\implies sijl1=sijl1and{{(tj1jl1,sij1l1):j1W}}={{(tj1jl1,sij1l1):j1W}}\displaystyle s_{ij}^{l-1}=s_{i^{\prime}j^{\prime}}^{l-1}\quad\textup{and}\quad\big{\{}\big{\{}(t_{j_{1}j}^{l-1},s_{ij_{1}}^{l-1}):j_{1}\in W\big{\}}\big{\}}=\big{\{}\big{\{}(t_{j_{1}j^{\prime}}^{l-1},s_{i^{\prime}j_{1}}^{l-1}):j_{1}\in W\big{\}}\big{\}}
\displaystyle\implies Cl1VW(i,j)=Cl1VW(i,j)and\displaystyle C_{l-1}^{VW}(i,j)=C_{l-1}^{VW}(i^{\prime},j^{\prime})\quad\textup{and}
{{(Cl1WW(j1,j),Cl1VW(i,j1)):j1W}}={{(Cl1WW(j1,j),Cl1VW(i,j1)):j1W}}\displaystyle\Big{\{}\Big{\{}(C_{l-1}^{WW}(j_{1},j),C_{l-1}^{VW}(i,j_{1})):j_{1}\in W\Big{\}}\Big{\}}=\Big{\{}\Big{\{}(C_{l-1}^{WW}(j_{1},j^{\prime}),C_{l-1}^{VW}(i^{\prime},j_{1})):j_{1}\in W\Big{\}}\Big{\}}
\displaystyle\implies ClVW(i,j)=ClVW(i,j),\displaystyle C_{l}^{VW}(i,j)=C_{l}^{VW}(i^{\prime},j^{\prime}),

which is to say that the condition (a) is satisfied. One can also verify the conditions (b) and (c) by using the same argument. Similarly, we can construct g^{l} and q^{l} such that the conditions (d)-(f) are satisfied.

Suppose that G2WG¯G\stackrel{{\scriptstyle W}}{{\sim}}_{2}\bar{G} is not true. Then there exists LL and hash functions such that

{{CLVW(i,j):iV}}{{C¯LVW(i,j):iV}},\left\{\left\{C_{L}^{VW}(i,j):i\in V\right\}\right\}\neq\left\{\left\{\bar{C}_{L}^{VW}(i,j):i\in V\right\}\right\},

or

{{CLWW(j1,j):j1W}}{{C¯LWW(j1,j):j1W}},\left\{\left\{C_{L}^{WW}(j_{1},j):j_{1}\in W\right\}\right\}\neq\left\{\left\{\bar{C}_{L}^{WW}(j_{1},j):j_{1}\in W\right\}\right\},

holds for some jWj\in W. We have shown above that the conditions (a)-(f) are true for LL and some carefully constructed 22-FGNN layers. Then it holds for some jWj\in W that

(C.8) {{sijL:iV}}{{s¯ijL:iV}},\left\{\left\{s_{ij}^{L}:i\in V\right\}\right\}\neq\left\{\left\{\bar{s}_{ij}^{L}:i\in V\right\}\right\},

or

(C.9) {{tj1jL:j1W}}{{t¯j1jL:j1W}}.\left\{\left\{t_{j_{1}j}^{L}:j_{1}\in W\right\}\right\}\neq\left\{\left\{\bar{t}_{j_{1}j}^{L}:j_{1}\in W\right\}\right\}.

In the rest of the proof we work with (C.8), and the argument can be easily modified in the case that (C.9) is true. It follows from (C.8) that there exists some continuous function φ\varphi such that

iVφ(sijL)iVφ(s¯ijL).\sum_{i\in V}\varphi(s_{ij}^{L})\neq\sum_{i\in V}\varphi(\bar{s}_{ij}^{L}).

Then let us construct the (L+1)(L+1)-th layer yielding

sijL+1=φ(sijL)ands¯ijL+1=φ(s¯ijL),s_{ij}^{L+1}=\varphi(s_{ij}^{L})\quad\textup{and}\quad\bar{s}_{ij}^{L+1}=\varphi(\bar{s}_{ij}^{L}),

and the output layer with

r(iVsijL+1,j1Wtj1jL+1)=iVφ(sijL)iVφ(s¯ijL)=r(iVs¯ijL+1,j1Wt¯j1jL+1).r\left(\sum_{i\in V}s_{ij}^{L+1},\sum_{j_{1}\in W}t_{j_{1}j}^{L+1}\right)=\sum_{i\in V}\varphi(s_{ij}^{L})\neq\sum_{i\in V}\varphi(\bar{s}_{ij}^{L})=r\left(\sum_{i\in V}\bar{s}_{ij}^{L+1},\sum_{j_{1}\in W}\bar{t}_{j_{1}j}^{L+1}\right).

This is to say that F(G)_{j}\neq F(\bar{G})_{j} for some F\in\mathcal{F}_{\textup{2-FGNN}}, which contradicts the assumption that F has the same output on G and \bar{G}. Thus we can conclude that G\stackrel{W}{\sim}_{2}\bar{G}. ∎

Proof of Theorem C.7 (a).

By Lemma C.8 and Lemma C.9. ∎

Proof of Theorem C.7 (b).

Apply Theorem C.7 (a) to G and the graph obtained from G by switching j_{1} and j_{2}. ∎

Proof of Theorem C.7 (c).

Suppose that G2G¯G\sim_{2}\bar{G}. By Theorem C.2, there exists some permutation σWSn\sigma_{W}\in S_{n} such that G2WσWG¯G\stackrel{{\scriptstyle W}}{{\sim}}_{2}\sigma_{W}\ast\bar{G}. For any scalar function ff with f𝟏2-FGNNf\mathbf{1}\in\mathcal{F}_{\textup{2-FGNN}}, by Theorem C.7, it holds that (f𝟏)(G)=(f𝟏)(σWG¯)=(f𝟏)(G¯)(f\mathbf{1})(G)=(f\mathbf{1})(\sigma_{W}\ast\bar{G})=(f\mathbf{1})(\bar{G}), where we used the fact that f𝟏f\mathbf{1} is permutation-equivariant. We can thus conclude that f(G)=f(G¯)f(G)=f(\bar{G}).

Now suppose that G2G¯G\sim_{2}\bar{G} is not true. Then there exist some LL and hash functions such that

{{CLVW(i,j):iV,jW}}{{C¯LVW(i,j):iV,jW}},\left\{\left\{C_{L}^{VW}(i,j):i\in V,j\in W\right\}\right\}\neq\left\{\left\{\bar{C}_{L}^{VW}(i,j):i\in V,j\in W\right\}\right\},

or

{{CLWW(j1,j2):j1,j2W}}{{C¯LWW(j1,j2):j1,j2W}}.\left\{\left\{C_{L}^{WW}(j_{1},j_{2}):j_{1},j_{2}\in W\right\}\right\}\neq\left\{\left\{\bar{C}_{L}^{WW}(j_{1},j_{2}):j_{1},j_{2}\in W\right\}\right\}.

By the proof of Lemma C.9, one can construct the ll-th 22-FGNN layers inductively for 0lL0\leq l\leq L, such that the condition (a)-(f) in the proof of Lemma C.9 are true. Then we have

(C.10) {{sijL:iV,jW}}{{s¯ijL:iV,jW}},\left\{\left\{s_{ij}^{L}:i\in V,j\in W\right\}\right\}\neq\left\{\left\{\bar{s}_{ij}^{L}:i\in V,j\in W\right\}\right\},

or

(C.11) {{tj1j2L:j1,j2W}}{{t¯j1j2L:j1,j2W}}.\left\{\left\{t_{j_{1}j_{2}}^{L}:j_{1},j_{2}\in W\right\}\right\}\neq\left\{\left\{\bar{t}_{j_{1}j_{2}}^{L}:j_{1},j_{2}\in W\right\}\right\}.

We first assume that (C.10) is true. Then there exists a continuous function φ\varphi with

iV,jWφ(sijL)iV,jWφ(s¯ijL).\sum_{i\in V,j\in W}\varphi(s_{ij}^{L})\neq\sum_{i\in V,j\in W}\varphi(\bar{s}_{ij}^{L}).

Let us construct the (L+1)(L+1)-th layer such that

sijL+1=pL+1(sijL,j1WfL+1(tj1jL,sij1L))=j1Wφ(sij1L),\displaystyle s_{ij}^{L+1}=p^{L+1}\left(s_{ij}^{L},\sum_{j_{1}\in W}f^{L+1}(t_{j_{1}j}^{L},s_{ij_{1}}^{L})\right)=\sum_{j_{1}\in W}\varphi(s_{ij_{1}}^{L}),
s¯ijL+1=pL+1(s¯ijL,j1WfL+1(t¯j1jL,s¯ij1L))=j1Wφ(s¯ij1L),\displaystyle\bar{s}_{ij}^{L+1}=p^{L+1}\left(\bar{s}_{ij}^{L},\sum_{j_{1}\in W}f^{L+1}(\bar{t}_{j_{1}j}^{L},\bar{s}_{ij_{1}}^{L})\right)=\sum_{j_{1}\in W}\varphi(\bar{s}_{ij_{1}}^{L}),

and the output layer with

r(iVsijL+1,j1Wtj1jL+1)=iVj1Wφ(sij1L)iVj1Wφ(s¯ij1L)=r(iVs¯ijL+1,j1Wt¯j1jL+1),r\left(\sum_{i\in V}s_{ij}^{L+1},\sum_{j_{1}\in W}t_{j_{1}j}^{L+1}\right)=\sum_{i\in V}\sum_{j_{1}\in W}\varphi(s_{ij_{1}}^{L})\neq\sum_{i\in V}\sum_{j_{1}\in W}\varphi(\bar{s}_{ij_{1}}^{L})=r\left(\sum_{i\in V}\bar{s}_{ij}^{L+1},\sum_{j_{1}\in W}\bar{t}_{j_{1}j}^{L+1}\right),

which is independent of jWj\in W. This constructs F2-FGNNF\in\mathcal{F}_{\textup{2-FGNN}} of the form F=f𝟏F=f\mathbf{1} with f(G)f(G¯)f(G)\neq f(\bar{G}).

Next, we consider the case that (C.11) is true. Then

(C.12) {{{{tj1j2L:j1W}}:j2W}}{{{{t¯j1j2L:j1W}}:j2W}},\left\{\left\{\left\{\left\{t_{j_{1}j_{2}}^{L}:j_{1}\in W\right\}\right\}:j_{2}\in W\right\}\right\}\neq\left\{\left\{\left\{\left\{\bar{t}_{j_{1}j_{2}}^{L}:j_{1}\in W\right\}\right\}:j_{2}\in W\right\}\right\},

and hence there exists some continuous ψ\psi such that

{{j1Wψ(tj1j2L):j2W}}{{j1Wψ(t¯j1j2L):j2W}}.\left\{\left\{\sum_{j_{1}\in W}\psi(t_{j_{1}j_{2}}^{L}):j_{2}\in W\right\}\right\}\neq\left\{\left\{\sum_{j_{1}\in W}\psi(\bar{t}_{j_{1}j_{2}}^{L}):j_{2}\in W\right\}\right\}.

Let us construct the (L+1)(L+1)-th layer such that

sijL+1=pL+1(sijL,j1WfL+1(tj1jL,sij1L))=j1Wψ(tj1jL),\displaystyle s_{ij}^{L+1}=p^{L+1}\left(s_{ij}^{L},\sum_{j_{1}\in W}f^{L+1}(t_{j_{1}j}^{L},s_{ij_{1}}^{L})\right)=\sum_{j_{1}\in W}\psi(t_{j_{1}j}^{L}),
s¯ijL+1=pL+1(s¯ijL,j1WfL+1(t¯j1jL,s¯ij1L))=j1Wψ(t¯j1jL),\displaystyle\bar{s}_{ij}^{L+1}=p^{L+1}\left(\bar{s}_{ij}^{L},\sum_{j_{1}\in W}f^{L+1}(\bar{t}_{j_{1}j}^{L},\bar{s}_{ij_{1}}^{L})\right)=\sum_{j_{1}\in W}\psi(\bar{t}_{j_{1}j}^{L}),

and we have from (C.12) that

{{sijL+1:iV,jW}}{{s¯ijL+1:iV,jW}}.\left\{\left\{s_{ij}^{L+1}:i\in V,j\in W\right\}\right\}\neq\left\{\left\{\bar{s}_{ij}^{L+1}:i\in V,j\in W\right\}\right\}.

We can therefore repeat the argument for (C.10) and show the existence of ff with f𝟏2-FGNNf\mathbf{1}\in\mathcal{F}_{\textup{2-FGNN}} and f(G)f(G¯)f(G)\neq f(\bar{G}). The proof is hence completed. ∎

C.4. Proof of Theorem 4.7

We finalize the proof of Theorem 4.7 in this subsection. Combining Theorem C.3 and Theorem C.7, one can conclude that the separation power of \mathcal{F}_{\textup{2-FGNN}} is stronger than or equal to that of SB scores. Hence, we can apply the following Stone-Weierstrass-type theorem to prove Theorem 4.7.

Theorem C.10 (Generalized Stone-Weierstrass theorem [azizian2020expressive]).

Let XX be a compact topology space and let 𝐆\mathbf{G} be a finite group that acts continuously on XX and n\mathbb{R}^{n}. Define the collection of all equivariant continuous functions from XX to n\mathbb{R}^{n} as follows:

𝒞E(X,n)={F𝒞(X,n):F(gx)=gF(x),xX,g𝐆}.\mathcal{C}_{E}(X,\mathbb{R}^{n})=\{F\in\mathcal{C}(X,\mathbb{R}^{n}):F(g\ast x)=g\ast F(x),\ \forall~{}x\in X,g\in\mathbf{G}\}.

Consider any 𝒞E(X,n)\mathcal{F}\subset\mathcal{C}_{E}(X,\mathbb{R}^{n}) and any Φ𝒞E(X,n)\Phi\in\mathcal{C}_{E}(X,\mathbb{R}^{n}). Suppose the following conditions hold:

  1. (a)

    \mathcal{F} is a subalgebra of \mathcal{C}(X,\mathbb{R}^{n}) and \mathbf{1}\in\mathcal{F}, where \mathbf{1} is the constant function whose output is always (1,1,\dots,1)\in\mathbb{R}^{n}.

  2. (b)

    For any x,xXx,x^{\prime}\in X, if f(x)=f(x)f(x)=f(x^{\prime}) holds for any f𝒞(X,)f\in\mathcal{C}(X,\mathbb{R}) with f𝟏f\mathbf{1}\in\mathcal{F}, then for any FF\in\mathcal{F}, there exists g𝐆g\in\mathbf{G} such that F(x)=gF(x)F(x)=g\ast F(x^{\prime}).

  3. (c)

    For any x,xXx,x^{\prime}\in X, if F(x)=F(x)F(x)=F(x^{\prime}) holds for any FF\in\mathcal{F}, then Φ(x)=Φ(x)\Phi(x)=\Phi(x^{\prime}).

  4. (d)

    For any xXx\in X, it holds that Φ(x)j1=Φ(x)j2,(j1,j2)J(x)\Phi(x)_{j_{1}}=\Phi(x)_{j_{2}},\ \forall~{}(j_{1},j_{2})\in J(x), where

    J(x)={(j1,j2){1,2,,n}2:F(x)j1=F(x)j2,F}.J(x)=\left\{(j_{1},j_{2})\in\{1,2,\dots,n\}^{2}:F(x)_{j_{1}}=F(x)_{j_{2}},\ \forall~{}F\in\mathcal{F}\right\}.

Then for any ϵ>0\epsilon>0, there exists FF\in\mathcal{F} such that

supxXF(x)Φ(x)ϵ.\sup_{x\in X}\|F(x)-\Phi(x)\|\leq\epsilon.
Proof of Theorem 4.7.

Lemma F.2 and Lemma F.3 in [chen2022representing-lp] prove that the function that maps LP instances to its optimal objective value/optimal solution with the smallest 2\ell_{2}-norm is Borel measurable. Thus, SB:𝒢m,nSB1(n)n\textup{SB}:\mathcal{G}_{m,n}\supset\textup{SB}^{-1}(\mathbb{R}^{n})\rightarrow\mathbb{R}^{n} is also Borel measurable, and is hence \mathbb{P}-measurable due to Assumption 4.3. By Theorem A.4 and Assumption 4.3, there exists a compact subset X1SB1(n)X_{1}\subset\textup{SB}^{-1}(\mathbb{R}^{n}) such that [𝒢m,n\X1]ϵ\mathbb{P}[\mathcal{G}_{m,n}\backslash X_{1}]\leq\epsilon and SB|X1\textup{SB}|_{X_{1}} is continuous. For any σVSm\sigma_{V}\in S_{m} and σWSn\sigma_{W}\in S_{n}, (σV,σW)X1(\sigma_{V},\sigma_{W})\ast X_{1} is also compact and SB|(σV,σW)X1\textup{SB}|_{(\sigma_{V},\sigma_{W})\ast X_{1}} is also continuous by the permutation-equivariance of SB. Set

X2=σVSm,σWSn(σV,σW)X1.X_{2}=\bigcup_{\sigma_{V}\in S_{m},\sigma_{W}\in S_{n}}(\sigma_{V},\sigma_{W})\ast X_{1}.

Then X2X_{2} is permutation-invariant and compact with

[𝒢m,n\X2][𝒢m,n\X1]ϵ.\mathbb{P}[\mathcal{G}_{m,n}\backslash X_{2}]\leq\mathbb{P}[\mathcal{G}_{m,n}\backslash X_{1}]\leq\epsilon.

In addition, SB|X2\textup{SB}|_{X_{2}} is continuous by pasting lemma.

The rest of the proof is to apply Theorem C.10 for X=X2X=X_{2}, 𝐆=Sm×Sn\mathbf{G}=S_{m}\times S_{n}, Φ=SB\Phi=\textup{SB}, and =2-FGNN\mathcal{F}=\mathcal{F}_{\textup{2-FGNN}}. We need to verify the four conditions in Theorem C.10. Condition (a) can be proved by similar arguments as in the proof of Lemma D.2 in [chen2022representing-lp]. Condition (b) follows directly from Theorem C.7 (a) and (c) and Theorem C.2. Condition (c) follows directly from Theorem C.7 (a) and Theorem C.3 (a). Condition (d) follows directly from Theorem C.7 (b) and Theorem C.3 (c). According to Theorem C.10, there exists some F2-FGNNF\in\mathcal{F}_{\textup{2-FGNN}} such that

supGX2F(G)SB(G)δ.\sup_{G\in X_{2}}\|F(G)-\textup{SB}(G)\|\leq\delta.

Therefore, one has

[F(G)SB(G)>δ][𝒢m,n\X2]ϵ,\mathbb{P}[\|F(G)-\textup{SB}(G)\|>\delta]\leq\mathbb{P}[\mathcal{G}_{m,n}\backslash X_{2}]\leq\epsilon,

which completes the proof. ∎

Appendix D Extensions of the theoretical results

This section will explore some extensions of our theoretical results.

D.1. Extension to other types of SB scores

The same analysis for Theorem 4.4 and Theorem 4.7 still works as long as the SB score is a function of fLP(G,j,lj,u^j)f_{\textup{LP}}^{*}(G,j,l_{j},\hat{u}_{j}), fLP(G,j,l^j,uj)f_{\textup{LP}}^{*}(G,j,\hat{l}_{j},u_{j}), and fLP(G)f_{\textup{LP}}^{*}(G):

  • We prove in Theorem A.3 that if two MP-tractable MILP-graphs are indistinguishable by the WL test, then they must be isomorphic and hence have identical SB scores (no matter how we define the SB scores). So Theorem 4.4 is still true.

  • We prove in Theorem C.4 that if two MILP-graphs are indistinguishable by 2-FWL test, then they have the same value of fLP(G,j,lj,u^j)f_{\textup{LP}}^{*}(G,j,l_{j},\hat{u}_{j}) (and fLP(G,j,l^j,uj)f_{\textup{LP}}^{*}(G,j,\hat{l}_{j},u_{j})). Therefore, Theorem C.3 still holds if the SB score is a function of fLP(G,j,lj,u^j)f_{\textup{LP}}^{*}(G,j,l_{j},\hat{u}_{j}), fLP(G,j,l^j,uj)f_{\textup{LP}}^{*}(G,j,\hat{l}_{j},u_{j}), and fLP(G)f_{\textup{LP}}^{*}(G), which implies that Theorem 4.7 is still true.

Therefore, Theorems 4.4 and 4.7 work for both the linear and the product score functions in [dey2024theoretical].

D.2. Extension to varying MILP sizes

While Theorems 4.4 and 4.7 assume MILP sizes mm and nn are fixed, we now discuss extending these results to data distributions with variable mm and nn.

First, our theoretical results can be directly extended to MILP datasets or distributions where mm and nn vary but remain bounded. Following Lemma 36 in [azizian2020expressive], if a universal-approximation theorem applies to 𝒢m,n\mathcal{G}_{m,n} for any fixed mm and nn (as shown in our work) and at least one GNN can distinguish graphs of different sizes, then the result holds across a disjoint union of finitely many 𝒢m,n\mathcal{G}_{m,n}.

If the distribution has unbounded mm or nn, for any ϵ>0\epsilon>0, one can always remove a portion of the tail to ensure boundedness in mm and nn. In particular, there always exist large enough m0m_{0} and n0n_{0} such that [m(G)m0]1ϵ\mathbb{P}[m(G)\leq m_{0}]\geq 1-\epsilon and [n(G)n0]1ϵ\mathbb{P}[n(G)\leq n_{0}]\geq 1-\epsilon. The key point is that for any ϵ>0\epsilon>0, such m0m_{0} and n0n_{0} can always be found. Although these values may be large and dependent on ϵ\epsilon, they are still finite. This allows us to apply the results for the bounded-support case.

Note that the “tail removal” technique mentioned above comes from the fact that a probability distribution has a total mass of 1:

1=n=0[n(G)=n]=limn0n=0n0[n(G)=n]=limn0[n(G)n0].1=\sum_{n=0}^{\infty}\mathbb{P}[n(G)=n]=\lim_{n_{0}\to\infty}\sum_{n=0}^{n_{0}}\mathbb{P}[n(G)=n]=\lim_{n_{0}\to\infty}\mathbb{P}[n(G)\leq n_{0}].

By the definition of a limit, this clearly implies that for any ϵ>0\epsilon>0, there exists a sufficiently large n0n_{0} such that [n(G)n0]1ϵ\mathbb{P}[n(G)\leq n_{0}]\geq 1-\epsilon. A similar argument applies to mm.
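As a concrete illustration of this tail-removal argument (purely for intuition, with a hypothetical size distribution), the following snippet computes the smallest cutoff n_0 with \mathbb{P}[n(G)\leq n_0]\geq 1-\epsilon for a given probability mass function:

```python
import math

def tail_cutoff(pmf, eps):
    """Smallest n0 such that P[n <= n0] >= 1 - eps for a pmf on {0, 1, 2, ...}."""
    total, n0 = 0.0, -1
    while total < 1.0 - eps:
        n0 += 1
        total += pmf(n0)
    return n0

# example: sizes distributed as Poisson(20); keep all but an eps = 1e-3 tail
poisson_pmf = lambda k, lam=20.0: math.exp(-lam) * lam ** k / math.factorial(k)
print(tail_cutoff(poisson_pmf, 1e-3))
```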

Appendix E Details about numerical experiments

Random MILP instances generation

We generate 100 random MILP instances for the experiments in Section 5. We set m=6 and n=20, which means each MILP instance contains 6 constraints and 20 variables. The sampling schemes of the problem parameters are described below; a small Python sketch of the generation procedure is given after the list.

  • The bounds of linear constraints: bi𝒩(0,1)b_{i}\sim\mathcal{N}(0,1).

  • The coefficients of the objective function: cj𝒩(0,1)c_{j}\sim\mathcal{N}(0,1).

  • The non-zero elements in the coefficient matrix: Aij𝒩(0,1)A_{ij}\sim\mathcal{N}(0,1). The coefficient matrix AA contains 60 non-zero elements. The positions are sampled randomly.

  • The lower and upper bounds of variables: lj,uj𝒩(0,102)l_{j},u_{j}\sim\mathcal{N}(0,10^{2}). We swap their values if lj>ujl_{j}>u_{j} after sampling.

  • The constraint types \circ are randomly sampled. Each type (\leq, == or \geq) occurs with equal probability.

  • The variable types are randomly sampled. Each type (continuous or integer) occurs with equal probability.
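The sketch below mirrors the sampling scheme above using NumPy. The exact encodings of the constraint senses and variable types are assumptions made only for illustration; the experiments themselves were run with the pipeline described in the next paragraph.

```python
import numpy as np

def generate_instance(m=6, n=20, nnz=60, seed=0):
    """Sample one random MILP instance following the scheme described above."""
    rng = np.random.default_rng(seed)
    b = rng.standard_normal(m)                       # constraint right-hand sides
    c = rng.standard_normal(n)                       # objective coefficients
    A = np.zeros((m, n))
    pos = rng.choice(m * n, size=nnz, replace=False) # random positions of nonzeros
    A.flat[pos] = rng.standard_normal(nnz)
    l = rng.normal(0.0, 10.0, size=n)                # variable bounds ~ N(0, 10^2)
    u = rng.normal(0.0, 10.0, size=n)
    l, u = np.minimum(l, u), np.maximum(l, u)        # swap if l_j > u_j
    senses = rng.choice(["<=", "==", ">="], size=m)  # constraint types, equal prob.
    vtypes = rng.choice(["continuous", "integer"], size=n)
    return dict(A=A, b=b, c=c, l=l, u=u, senses=senses, vtypes=vtypes)
```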

Implementation and training details

We implement MP-GNN and 2-FGNN with Python 3.6 and TensorFlow 1.15.0 [abadi2016tensorflow]. Our implementation is built by extending the MP-GNN implementation of [gasse2019exact] in https://github.com/ds4dm/learn2branch. The SB scores of randomly generated MILP instances are collected using SCIP [scip].

For both GNNs, p^{0},q^{0} are parameterized as linear transformations followed by a non-linear activation function; \{p^{l},q^{l},f^{l},g^{l}\}_{l=1}^{L} are parameterized as 3-layer multi-layer perceptrons (MLPs) with respective learnable parameters; and the output mapping r is parameterized as a 2-layer MLP. All layers map their input to a 1024-dimensional vector and use the ReLU activation function. Under these settings, MP-GNN contains 43.0 million learnable parameters and 2-FGNN contains 35.7 million parameters.
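For concreteness, the parameterization just described can be sketched in plain NumPy as follows. This is only an illustrative stand-in for the actual TensorFlow 1.15 implementation built on [gasse2019exact]; the hidden width is shrunk from 1024, and the input feature dimensions of p^0 and q^0 are assumptions for this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

def linear(din, dout):
    W, b = 0.1 * rng.standard_normal((din, dout)), np.zeros(dout)
    return lambda x: x @ W + b

def relu(x):
    return np.maximum(x, 0.0)

def mlp(din, dout, hidden=64, depth=3):
    """A depth-layer MLP with ReLU activations, as used for p^l, q^l, f^l, g^l."""
    dims = [din] + [hidden] * (depth - 1) + [dout]
    layers = [linear(a, b) for a, b in zip(dims[:-1], dims[1:])]
    def forward(x):
        for lin in layers[:-1]:
            x = relu(lin(x))
        return layers[-1](x)
    return forward

d = 64
emb_v, emb_w = linear(2, d), linear(4, d)     # p^0, q^0: linear map + activation
p0 = lambda v: relu(emb_v(v))
q0 = lambda w: relu(emb_w(w))
p1, q1 = mlp(2 * d, d), mlp(2 * d, d)         # layer-wise update maps (3-layer MLPs)
f1, g1 = mlp(d, d), mlp(d, d)                 # message maps (3-layer MLPs)
r = mlp(3 * d, 1, depth=2)                    # 2-layer output map
```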

We adopt Adam [kingma2014adam] to optimize the learnable parameters during training with a learning rate of 10^{-5} for all networks. We decay the learning rate to 10^{-6} and 10^{-7} when the training error reaches 10^{-6} and 10^{-12}, respectively, to help stabilize the training process.