
Graph Sparsifications using Neural Network Assisted Monte Carlo Tree Search

Alvin Chiu1, Mithun Ghosh2, Reyan Ahmed2, Kwang-Sung Jun2, Stephen Kobourov2, Michael T. Goodrich1

1University of California, Irvine, CA, USA
2University of Arizona, Tucson, AZ, USA

Abstract

Graph neural networks have been successful in machine learning, as well as for combinatorial and graph problems such as the Subgraph Isomorphism Problem and the Traveling Salesman Problem. We describe an approach for computing graph sparsifiers that combines a graph neural network and Monte Carlo Tree Search. We first train a graph neural network that takes a partial solution as input and proposes a new node to be added as output. This neural network is then used in a Monte Carlo Tree Search to compute a sparsifier. The proposed method consistently outperforms several standard approximation algorithms on different types of graphs and often finds the optimal solution.

1 Introduction

Representing data as a graph is a powerful way to capture the relationships between different objects, and graphs arise in many real-world applications that deal with relational information. Classical machine learning models, such as standard neural networks and recurrent neural networks, do not naturally handle graphs. Graph neural networks (GNN) were introduced to better capture graph structure [22]. A GNN is a recursive neural network in which nodes are treated as state vectors and the relationships between nodes are quantified by the edges.

Many real-world problems are modeled by combinatorial and graph problems that are known to be NP-hard. GNNs offer an alternative to traditional heuristics and approximation algorithms; indeed the initial GNN model [22] was used to approximate solutions to two classical graph problems: subgraph isomorphism and clique detection.

Recent GNN work [17, 26] suggests that combining neural networks and tree search leads to better results than neural networks alone. Li et al. [17] combine a convolutional neural network with tree search to compute independent sets and to solve other NP-hard problems that are efficiently reducible to the independent set problem. AlphaGo combines deep convolutional neural networks and Monte Carlo Tree Search (MCTS) to assess Go board positions and reduce the search space. Xing et al. [26] used the same framework to tackle the traveling salesman problem (TSP).

Since Xing et al. [26] showed that the AlphaGo framework is effective for TSP, a natural question is whether this framework can be applied to other combinatorial problems, such as graph sparsification problems [6]. Steiner trees and graph spanners are examples of graph sparsification. Although these sparsification problems are NP-hard, like TSP, they differ from it in several important ways. First, the sparsification problems contain a subset of the nodes, called terminals, that must be spanned, whereas in TSP all nodes are equivalent. Second, the output of a sparsification problem is a subgraph, whereas the output of TSP is a path (or a cycle). When iteratively computing a TSP solution, the next node can only be attached to the previous one, which is much easier than choosing from a set of nodes when growing a sparsification. Third, TSP and Go are similar in terms of solution length: the length of a game and the number of nodes in a TSP tour are fixed, and taking an action in Go corresponds to adding a node to the tour, whereas the number of nodes in a sparsification varies with the graph instance. Finally, Xing et al. [26] only considered geometric graphs, which form a restricted class of graphs.

1.1 Background:

A sparsification of a graph $G$ is a subgraph that preserves some properties of $G$ [6]. Examples of sparsifications include spanning trees, Steiner trees, spanners, and distance preservers. Many sparsification problems are defined with respect to a given subset of vertices $T \subseteq V$, which we call terminals: e.g., a Steiner tree over $(G,T)$ is a tree in $G$ that spans $T$.

The Steiner tree problem is a classical NP-hard problem [2]. In this problem, we are given an edge-weighted graph $G=(V,E)$ and a set of terminals $T \subseteq V$, and we want to compute a minimum-weight subtree that spans all terminals. For $|T|=2$ this is equivalent to the shortest path problem, and for $|T|=|V|$ it is equivalent to the minimum spanning tree problem, while the problem is NP-hard for $2<|T|<|V|$ [9]. Due to applications in many domains, there is a long history of heuristics, approximation algorithms, and exact algorithms for the Steiner tree problem [2].

A spanner is a subgraph that approximately preserves pairwise distances in the original graph $G$ [4]. A subset spanner need only approximately preserve distances between a subset $T \subseteq V$ of vertices. Two common types of spanners are multiplicative $\alpha$-spanners, which preserve distances in $G$ up to a multiplicative factor $\alpha$, and additive $+\beta$ spanners, which preserve distances up to an additive error $+\beta$. A distance preserver is the special case of a spanner in which distances are preserved exactly. The multiplicative $\alpha$-spanner problem is NP-hard [4]; moreover, it is NP-hard to approximate within a factor of $O(\log|V|)$, even when restricted to bipartite graphs [4]. A classical greedy algorithm [4] constructs a multiplicative $\alpha$-spanner given a graph $G$ and a real number $\alpha \geq 1$. It has been shown that, given a weighted graph $G$ and $t \geq 1$, the greedy $(2t-1)$-spanner ($\alpha=2t-1$) $H$ contains at most $n\lceil n^{1/t}\rceil$ edges, and its weight is at most $w(MST(G)) \cdot (n/t)$, where $w(MST(G))$ denotes the weight of a minimum spanning tree of $G$. This greedy spanner algorithm has later been generalized to subsetwise spanners [6].

For very large graphs, additive error is arguably a much more appealing paradigm. It has been shown that all graphs have $+2$, $+4$, and $+6$ spanners on $O(n^{3/2})$, $O(n^{7/5})$, and $O(n^{4/3})$ edges, respectively [4]. There are several major differences between multiplicative and additive spanners. The construction of additive spanners depends on the additive error, as mentioned above. Unlike multiplicative spanners, the number of edges in additive spanners does not always decrease as the error increases; an interesting result shows that there exists a class of graphs that do not have $+c$ spanners on $n^{4/3-\epsilon}$ edges [4], where $c$ is a small integer and $\epsilon$ is a positive real number. Also, algorithms for additive spanners do not naturally generalize to weighted graphs. Recently, several algorithms for weighted additive spanners have been proposed that require significant changes to the algorithms for unweighted spanners [11, 3].

Figure 1: GNN assisted MCTS: first, train a GNN to evaluate non-terminal nodes, then use the network and heuristics to compute a Steiner tree with MCTS.

1.2 Problem Statement:

We consider three sparsification problems: the Steiner tree, subsetwise multiplicative spanner, and subsetwise additive spanner problems. In the Steiner tree problem, we are given a weighted graph $G=(V,E)$ and a set of terminals $T \subseteq V$, and we want to compute a minimum-weight subtree $H$ that spans $T$. The non-terminal nodes in $H$ are called Steiner nodes. We denote the shortest path distance from $u$ to $v$ in the graph $G$ by $d_G(u,v)$. In the subsetwise multiplicative spanner problem, besides $G$ and $T$, we are also given a multiplicative stretch $\alpha \geq 1$, and we want to compute a subgraph $H$ such that $d_H(u,v) \leq \alpha \cdot d_G(u,v)$ for all $u,v \in T$. In the subsetwise additive spanner problem, instead of a multiplicative stretch $\alpha$, we are given an additive error $\beta \geq 0$; the objective is to compute a subgraph $H$ such that $d_H(u,v) \leq d_G(u,v)+\beta W$ for all $u,v \in T$, where $W$ is the maximum edge weight in $G$. In all three problems, the objective is to minimize either the total edge weight or the number of edges of $H$.
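To make these conditions concrete, the following is a minimal sketch (not part of the paper's implementation) of a validity check for a candidate subgraph, written with networkx; the function name and the `add_err` parameter, which takes the absolute additive term (e.g., $2W$), are our own.

```python
import networkx as nx

def is_subsetwise_spanner(G, H, T, alpha=1.0, add_err=0.0):
    """Check d_H(u, v) <= alpha * d_G(u, v) + add_err for all terminal pairs.
    Multiplicative spanner: add_err = 0; additive spanner: alpha = 1."""
    for i, u in enumerate(T):
        for v in T[i + 1:]:
            if u not in H or v not in H or not nx.has_path(H, u, v):
                return False
            d_G = nx.shortest_path_length(G, u, v, weight="weight")
            d_H = nx.shortest_path_length(H, u, v, weight="weight")
            if d_H > alpha * d_G + add_err:
                return False
    return True
```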

1.3 Summary of Contributions:

We describe an approach for computing the above sparsifications by combining a graph neural network and Monte Carlo Tree Search (MCTS). We first train a graph neural network that takes a partial solution as input and proposes a new node to be added as output. This neural network is then used within an MCTS to compute a sparsification. The proposed method consistently outperforms the standard approximation algorithms on different types of graphs and often finds the optimal solution. We illustrate our approach in Figure 1. Our approach builds on the work of Xing et al. [26] for TSP. Since TSP differs non-trivially from the sparsification problems, we had to address challenges in both training the graph neural network and running the MCTS. We summarize our contributions below:

  • To train the neural network we generate exact solutions of different graph sparsification instances. From each instance, we generate several data points. The purpose of the neural network is to predict a new important node, given a set of current important nodes. Since any permutation of the set of solution nodes can lead to a valid sequence of predictions, we use random permutations to generate data points for the network.

  • After we select a set of important nodes for a given instance, it is not straightforward to compute a sparsification. We utilize various existing algorithms to compute the sparsification from the selected set of important nodes.

  • Our method can work for non-Euclidean graphs as well. We evaluate our method on some known hard instances from the SteinLib database [16] that are non-Euclidean.

  • We compare our framework with different well-known approximation algorithms. The experimental results show that our method outperforms these existing algorithms. The proposed method is fully functional and available on GitHub.

2 Our approach

Our approach maintains a set of important nodes $S=\{v_1,v_2,\cdots,v_i\}$ for the sparsification instance and gradually adds more nodes to $S$. Initially, $S$ is equal to the set of terminals $T$. Then $\overline{S}=V-S$ is the set of candidate nodes to be added to $S$. A natural approach is to train a neural network to predict which node to add to $S$ at a particular step. That is, the neural network $f(G|S)$ takes the graph $G$ and $S$ as input and returns probabilities for the remaining nodes, indicating how likely each is to be important for the sparsification. We adapt the GNN of [26] to represent $f(G|S)$.

Intuitively, we could use the probability values directly, selecting all nodes with probability higher than a given threshold. We can then construct a subgraph from the selected nodes in different ways. For the Steiner tree problem, for example, we can compute the subgraph induced by the selected nodes (adding an edge if both of its endpoints are selected) and extract a minimum spanning tree [9]. Note that the induced subgraph may be disconnected, in which case the resulting spanning forest is not a valid solution. This issue can be addressed by lowering the threshold until we obtain a valid solution.
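As a rough illustration of this threshold heuristic for the Steiner tree case, the sketch below (our own, not the paper's code) selects nodes by a probability threshold, lowers the threshold until the induced subgraph is connected, and returns its minimum spanning tree; `probs` is assumed to map each node to the probability produced by $f(G|S)$.

```python
import networkx as nx

def steiner_from_probabilities(G, T, probs, threshold=0.5, step=0.05):
    """Threshold the predicted node probabilities, take the induced subgraph
    (plus the terminals), and extract an MST; lower the threshold on failure."""
    while threshold >= 0.0:
        selected = set(T) | {v for v, p in probs.items() if p >= threshold}
        H = G.subgraph(selected)
        if nx.is_connected(H):
            return nx.minimum_spanning_tree(H, weight="weight")
        threshold -= step  # relax the threshold and try again
    return None  # should not happen if G itself is connected
```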

However, deriving sparsifications in this fashion may not be reliable, since the method has only one chance to compute the solution and never goes back to reverse a decision. To overcome this drawback, we leverage MCTS. In the MCTS, each tree node represents a state, i.e., a possible set of important nodes for the sparsification problem. We use a variant of PUCT [20] to balance exploration (visiting a state suggested by the neural network policy) and exploitation (visiting the state with the best value). The overall approach is illustrated in Figure 1.

2.1 Graph neural network architecture:

Some combinatorial problems, such as the independent set and minimum vertex cover problems, do not involve edge weights. However, edge weights are an essential feature of our sparsification problems, as both the objective and the shortest path distances are computed from them. Hence, we adapt the static edge graph neural network (SE-GNN) [26] to efficiently extract node and edge features. The SE-GNN model works only for Euclidean graphs because it depends on node positions. Our generalized SE-GNN (GSE-GNN) model can handle non-Euclidean graphs as well. We illustrate the architecture of the GSE-GNN model in Figure 2.

Figure 2: The generalized static edge graph neural network (GSE-GNN) model: (a) the different modules of GSE-GNN, (b) the embedding module, and (c) the convolution module.

2.1.1 The input module:

To train the neural network, we need information about the structure of the graph, the terminal nodes, and contextual information, i.e., the set of important nodes $S$. We tag node $u$ with $x^t_u=1$ if it is a terminal and $x^t_u=0$ otherwise. We also tag $u$ with $x^a_u=1$ if it has already been added and $x^a_u=0$ otherwise. The SE-GNN model only considers Euclidean graphs, since it tags each node with its position; for non-Euclidean graphs there is no canonical way to compute node coordinates. In our GSE-GNN model, we resolve this issue by computing coordinates with a spring embedder [15]. In addition, we tag each node with several other properties of the input instance: node degree, clustering coefficient [21], and different node centrality values [8]. Let $x_u$ be the feature vector containing all the tags of node $u$. Intuitively, $f(G|S)$ should summarize the state of such a "tagged" graph (a concatenation of all feature vectors) and generate, for each node, the prior probability of being included in $S$.
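A sketch of how such a feature vector can be assembled with networkx follows; the exact set and ordering of the twelve features used in the paper's implementation may differ, and betweenness centrality here stands in for the "different node centrality values".

```python
import networkx as nx

def node_features(G, T, S):
    """Per-node feature vector: terminal tag, already-added tag, spring-embedder
    coordinates, degree, clustering coefficient, and a centrality value."""
    pos = nx.spring_layout(G, weight="weight")        # coordinates for non-Euclidean graphs
    clust = nx.clustering(G, weight="weight")
    cent = nx.betweenness_centrality(G, weight="weight")
    feats = {}
    for u in G.nodes():
        feats[u] = [
            1.0 if u in T else 0.0,   # x_u^t: terminal tag
            1.0 if u in S else 0.0,   # x_u^a: already-added tag
            pos[u][0], pos[u][1],     # node position
            float(G.degree(u)),
            clust[u],
            cent[u],
        ]
    return feats
```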

2.1.2 The embedding module:

The embedding module contains a multi-layer perceptron MLP1 that maps the feature vector $x_u$ of node $u$ to a higher-dimensional embedding vector $H_u^0$; see Figure 2. MLP1 is followed by a convolution module that consists of a stack of $L$ neural network layers, where each layer aggregates local neighborhood information, i.e., the features of each node's neighbors, and passes the aggregated information to the next layer. This procedure is known as message passing; the original SE-GNN model only considered the graph convolutional message-passing procedure [14], whereas we incorporate the graph attention procedure [24] as well. We use $H_u^l \in \mathbb{R}^d$ to denote the real-valued feature vector associated with node $u$ at layer $l$. Specifically, the basic GNN model [12] can be implemented as follows: in layer $l=1,2,\cdots,L$, a new feature is computed as given by (2.1).

(2.1)  $H_u^{l+1}=\sigma\Big(\theta_1^l H_u^l+\sum_{v\in N(u)}\theta_2^l H_v^l\Big)$

In (2.1), $N(u)$ is the set of neighbors of node $u$, $\theta_1^l$ and $\theta_2^l$ are the parameter matrices for layer $l$, and $\sigma(\cdot)$ denotes a component-wise non-linear function such as ReLU. Edge features are not taken into account in (2.1). However, there are edge features, e.g., edge weights and common neighborhoods [18], that we want to incorporate into our model. We denote the feature vector of edge $uv$ by $e_{uv}$. Some previous methods [25] use the following equation to incorporate edge features.

(2.2)  $H_u^{l+1}=\sigma\Big(\theta_1 x_u+\theta_2\sum_{v\in N(u)}H_v^l+\theta_3\sum_{v\in N(u)}\sigma(\theta_4 e_{uv})\Big)$

In (2.2), $\theta_1$, $\theta_2$, $\theta_3$, and $\theta_4$ are all model parameters. In both (2.1) and (2.2), the nonlinear mapping of the aggregated information is a single-layer perceptron, which is not sufficient to map distinct multisets to unique embeddings. Hence, as suggested in [26], we replace the single perceptron with a multi-layer perceptron. Finally, we compute the new node feature $H_u^{l+1}$ using (2.3).

(2.3)  $H_u^{l+1}=\text{MLP}_2^l\Big(\theta_1^l H_u^l+\sum_{v\in N(u)}\theta_2^l H_v^l+\sum_{v\in N(u)}\theta_3^l e_{uv}\Big)$

In (2.3), $\theta_1^l$, $\theta_2^l$, and $\theta_3^l$ are parameter matrices, and $\text{MLP}_2^l$ is the multi-layer perceptron for layer $l$.
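The following is a minimal PyTorch sketch of one layer of (2.3), using dense adjacency and edge-feature tensors; it is our own illustration, not the paper's implementation, which may use a graph learning library and attention-based message passing instead.

```python
import torch
import torch.nn as nn

class GSELayer(nn.Module):
    """One message-passing layer of (2.3): combine the node's own features,
    the sum of its neighbors' features, and the sum of incident edge features,
    then apply MLP_2^l."""
    def __init__(self, node_dim, edge_dim, hidden_dim):
        super().__init__()
        self.theta1 = nn.Linear(node_dim, hidden_dim, bias=False)
        self.theta2 = nn.Linear(node_dim, hidden_dim, bias=False)
        self.theta3 = nn.Linear(edge_dim, hidden_dim, bias=False)
        self.mlp2 = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, node_dim))

    def forward(self, H, E, A):
        # H: (n, node_dim) node features, E: (n, n, edge_dim) edge features,
        # A: (n, n) binary adjacency matrix (float tensor).
        self_term = self.theta1(H)
        nbr_term = A @ self.theta2(H)                           # sum over neighbors of theta2 * H_v
        edge_term = (A.unsqueeze(-1) * self.theta3(E)).sum(1)   # sum over neighbors of theta3 * e_uv
        return self.mlp2(self_term + nbr_term + edge_term)
```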

2.1.3 The aggregation and output modules:

Once the features of all nodes have been computed after $L$ layers, we aggregate each node's feature vector by summing its elements. We then pass the aggregated values to the softmax function ($\text{softmax}(z)_i=e^{z_i}/\sum_j e^{z_j}$) and denote the result by $f(G|S;\theta)$. The function $f(G|S;\theta)$ returns, for each node, the prior probability of that node being in $S$. Specifically, we fuse all node features $H_u^L$ as the current state representation of the graph and parameterize $f(G|S;\theta)$ as expressed by (2.4).

(2.4)  $f(G|S;\theta)=\text{softmax}\big(\text{sum}(H_1^L),\cdots,\text{sum}(H_{|V|}^L)\big)$

Here, $\text{sum}(z)=\sum_i z_i$. During training, we minimize the cross-entropy loss for each training sample $(G_i,S_i)$ in a supervised manner, as given by (2.5).

(2.5)  $\ell(S_i,f(G_i|S_i;\theta))=-\sum_{j=|T_i|+1}^{N}y_j^T\log f(G_i|S_i(1{:}j-1);\theta)$

In (2.5), $S_i$ is an ordered set of important nodes of a sparsification, i.e., a permutation of a subset of the nodes of graph $G_i$; $S_i(1{:}j-1)$ is the ordered subset containing the first $j-1$ elements of $S_i$; $N$ is the number of nodes in the sparsification; $y_j$ is a vector of length $|V|$ with a 1 in the $S_i(j)$-th position and 0 elsewhere; and $y_j^T$ is its transpose. We provide more details in Section 3.
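As an illustration of (2.5), the sketch below computes the loss of a single training sample; `model` is assumed to return unnormalized scores over all nodes given the graph and the current prefix of $S_i$ (the softmax of (2.4) is applied inside `cross_entropy`).

```python
import torch
import torch.nn.functional as F

def sample_loss(model, G, S, num_terminals):
    """Cross-entropy loss (2.5) for one sample: S is an ordered list of solution
    node indices whose first num_terminals entries are the terminals."""
    loss = 0.0
    for j in range(num_terminals, len(S)):
        scores = model(G, S[:j])          # shape (|V|,): score of each node given the prefix
        target = torch.tensor([S[j]])     # index of the next solution node
        loss = loss + F.cross_entropy(scores.unsqueeze(0), target)
    return loss
```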

2.2 GNN assisted MCTS:

Several recent GNN-based models for solving combinatorial problems leverage some form of greedy or tree search [10, 17, 26]. We use an MCTS for our sparsification problems. The search space of a sparsification instance can be huge; in a traditional MCTS, random sampling of the search space gradually expands the search tree, whereas our graph neural network assisted MCTS (GNN-MCTS) adds new nodes to the search tree based on the predictions of the GSE-GNN.

For each MCTS node $v$, there is an action space $A(v)$. Each action $a\in A(v)$ represents a node in $\overline{S}$. The MCTS counts how many times a particular action $a$ has been selected at an MCTS node $v$ in order to quantify the uncertainty of $a$ at $v$; we denote this count by $N(v,a)$. We adapt the standard PUCT algorithm [20] to compute the uncertainty $U(v,a)$ of $a$ at $v$. Following PUCT [20], we set $U(v,a)=c_{puct}\,P(v,a)\,\frac{\sqrt{\sum_b N(v,b)}}{1+N(v,a)}$, where $c_{puct}$ is a tuning parameter and $P(v,a)$ is the neural network policy.

Another important quantity in our MCTS is the quality $Q(v,a)$ of an action $a$ at an MCTS node $v$. Let $H$ be the sparsification obtained after executing $a$ from $v$, and denote its cost by $\text{cost}(H)$. The value $\text{cost}(H)$ can be large when $H$ has relatively many edges, whereas the standard MCTS expects quality values in the range $[0,1]$ [20]. We address this by normalizing the sparsification cost as suggested in [26]: we set $Q(v,a)=\frac{\text{cost}(H)-w}{b-w}$, where $b$ and $w$ are the minimum and maximum sparsification costs among the actions of $v$, respectively.

Following the standard MCTS algorithm, we aggregate the quality and the uncertainty and select the best action according to the aggregated value. We also strengthen the MCTS by selecting an action uniformly at random with a small probability $\epsilon$ to better explore the search space [23]. In other words, at time step $t$, our MCTS selects the action $a_t$ with probability $1-\epsilon$ such that:

(2.6)  $a_t=\arg\max_a\big(Q(v_t,a)+U(v_t,a)\big)$

With the remaining small probability $\epsilon$, the MCTS selects uniformly at random from among all the nodes in $\overline{S}$. Each round of the MCTS consists of four steps (a code sketch of the selection rule follows the list):

  • Selection: The MCTS selects a leaf node $u$, starting from the root node, using (2.6).

  • Expansion: The MCTS creates a new leaf node $v$ as a child of the selected node $u$.

  • Simulation: The MCTS gradually adds nodes from $\overline{S}$ using the neural network's predictions. After each addition, the MCTS computes a sparsification (as described later). The number of nodes added is the sample size. Finally, the MCTS selects the best sample and updates the state of $v$ accordingly.

  • Backpropagation: The MCTS updates the best and worst costs from the state of $v$ up to its ancestors.
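A minimal sketch of the selection rule, combining (2.6) with the $\epsilon$-uniform exploration described above, is given below; the dictionaries `Q`, `N`, and `P` (keyed by state-action pairs) are our own bookkeeping, not the authors' data structures.

```python
import math
import random

def select_action(v, actions, Q, N, P, c_puct=1.3, eps=0.1):
    """With probability eps pick a uniformly random action; otherwise
    maximize Q(v, a) + U(v, a) as in (2.6)."""
    if random.random() < eps:
        return random.choice(actions)
    total_visits = sum(N[(v, b)] for b in actions)
    best, best_score = None, -float("inf")
    for a in actions:
        u = c_puct * P[(v, a)] * math.sqrt(total_visits) / (1 + N[(v, a)])
        score = Q[(v, a)] + u
        if score > best_score:
            best, best_score = a, score
    return best
```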

Our MCTS is similar to a recent MCTS proposed for TSP [26]. However, our method differs in several important ways, as described below:

  • The graph sparsification problems differ significantly from the TSP considered in [26]. In TSP, all nodes must appear in the tour; hence, in the MCTS of [26], the set $S$ is initially empty and all nodes are gradually added to it. In the sparsification problems, all terminals must be in the final solution, so at the beginning of our search the set $S$ contains all terminals. Our initial experiments also showed that starting with $S$ equal to the terminal set significantly improves the running time compared to starting from an empty set.

  • The sample space of TSP is huge, since different permutations of the nodes yield different tours. A sparsification, in contrast, is the same for different permutations, and a large sample size would also increase the running time, since we compute a shortest path tree for each additional node in $O(n\log n)$ time [9]. Hence we keep the sample size relatively small. Details are provided in the following sections.

  • Since we keep the sample size relatively small, we strengthen the exploration process by mixing in a random search strategy that has been found effective in reinforcement learning [23]: most of the time we use the uncertainty value derived from the visit counts, but occasionally, with a small probability $\epsilon$, we select uniformly at random from among all the nodes in $\overline{S}$.

Figure 3: Example graph for the Steiner tree heuristic. Treating $D$ as a terminal node and computing the MST on the metric closure provides a better solution than the 2-approximation algorithm.

2.3 Computing a sparsification from $S$:

Our heuristics are motivated by existing algorithms for the sparsification problems. Each such algorithm takes the set of terminals $T$ as a parameter; our MCTS uses the same algorithm but passes $S$ instead of $T$ as the terminal set. Initially, the MCTS sets $S=T$, as described above, and gradually adds more nodes under the guidance of the GNN. After computing the sparsification, the MCTS applies a pruning algorithm, since $S$ usually contains more nodes than $T$. We now describe the existing algorithms we use.

  1. The 2-approximation algorithm for computing Steiner trees: In this algorithm [1], given an input graph $G=(V,E)$ and a set of terminals $S$, we first compute a metric closure graph $G'=(S,E')$: every pair of nodes in $G'$ is connected by an edge whose weight equals the shortest path distance between them in $G$. The minimum spanning tree of the metric closure provides a 2-approximation of the optimal Steiner tree (when $S=T$). The MCTS improves the quality by adding new nodes to $S$. For example, in Figure 3, $A$, $B$, and $C$ are terminal nodes and $D$ is not. Note that $D$ does not appear on any shortest path between terminals, as each pairwise shortest path distance is 5 and none of these paths goes through $D$. Without loss of generality, the 2-approximation algorithm (when $S=T$) chooses the path $A$-$C$-$B$ with total cost 10, while the optimal solution, which uses $D$, has cost 9. The 2-approximation algorithm (when $S=T$) never considers a node that does not lie on a shortest path between two terminals; the MCTS does consider such nodes. A code sketch of this construction appears after this list.

  2. The greedy algorithm for computing subsetwise multiplicative spanners: In this greedy algorithm [6], we are additionally given a multiplicative stretch $\alpha$. We again first compute a metric closure graph and sort its edges in non-decreasing order of weight. Initially, the sparsification $H$ contains no edges. We go through each edge $e=uv$ in sorted order and add it to $H$ if $\alpha\cdot w(e)\leq dist_H(u,v)$. Finally, we replace each abstract edge of $H$ with the corresponding shortest path in $G$.

  3. The subsetwise $+2W$ algorithm for computing additive spanners: Here, the additive stretch is $\beta=2W$, where $W$ is the maximum edge weight of $G$. There exist several algorithms for this problem; a recent study compares them [5]. In this paper, we use an algorithm that performs well in practice. It starts with an empty edge set $H$ and, for each node in $G$, adds the $|S|^{2/3}$ lightest incident edges to $H$. It then adds further edges to $H$ so that $dist_H(u,v)\leq dist_G(u,v)+2W$ for all $u,v\in S$. We call this algorithm the subsetwise $+2W$ algorithm.
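The code sketch referenced in item 1 is given below: a networkx version of the metric-closure construction, with $S$ playing the role of the terminal set. It is an illustration under our own naming, not the paper's implementation.

```python
import networkx as nx

def metric_closure_steiner(G, S):
    """Build the metric closure on S, take its MST, and expand each closure
    edge back into a shortest path of G."""
    closure, paths = nx.Graph(), {}
    for i, u in enumerate(S):
        dist, path = nx.single_source_dijkstra(G, u, weight="weight")
        for v in S[i + 1:]:
            closure.add_edge(u, v, weight=dist[v])
            paths[(u, v)] = path[v]
    mst = nx.minimum_spanning_tree(closure, weight="weight")
    H = nx.Graph()
    for u, v in mst.edges():
        p = paths[(u, v)] if (u, v) in paths else paths[(v, u)]
        nx.add_path(H, p)                       # insert the underlying shortest path
    for a, b in H.edges():
        H[a][b]["weight"] = G[a][b]["weight"]   # copy original edge weights
    return H
```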

Since the MCTS adds additional nodes to $S$, at the end of the algorithm we prune nodes and edges that are not necessary. We now describe the pruning algorithms we use.

  1. Pruning for Steiner trees: Let $H$ be the output of the 2-approximation. Since our goal is to compute a tree, we remove edges from $H$ if it contains any cycles: we compute a minimum spanning tree $H'$ of $H$. A node is a pendant node if its degree equals one. We then remove from $H'$ all pendant nodes that are not in $T$. We denote the resulting tree by $H''$ and return it as the final output.

  2. Pruning for spanners: We sort all the edges of the computed spanner $H$ in decreasing order of weight. We go through each edge $e$ in this order and delete $e$ from $H$ if $H-e$ is still a valid spanner. We use the same pruning algorithm for both multiplicative and additive spanners; a code sketch of this pruning step follows.
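The spanner pruning can be sketched as follows, reusing the hypothetical `is_subsetwise_spanner` checker from our sketch in Section 1.2; the actual implementation may differ.

```python
def prune_spanner(G, H, T, alpha=1.0, add_err=0.0):
    """Scan edges of H in decreasing weight order and drop an edge whenever the
    subsetwise spanner condition still holds for all terminal pairs without it."""
    for u, v, w in sorted(H.edges(data="weight"), key=lambda e: e[2], reverse=True):
        H.remove_edge(u, v)
        if not is_subsetwise_spanner(G, H, T, alpha=alpha, add_err=add_err):
            H.add_edge(u, v, weight=w)   # the edge is still needed; restore it
    return H
```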

3 Model setup and training

Our training data consists of input graphs $G=(V,E)$, edge weights $w:E\rightarrow\mathbb{R}^+$, terminals $T\subseteq V$, and a stretch value depending on the type of sparsification. Given $G$, $w$, $T$, and a stretch value (for spanner instances), our goal is to assign label 1 to the next node to be added and 0 to all others. Initially, we set $S=T$, as all terminals must be in the sparsification. Consider a graph with 6 nodes $u_1,u_2,\cdots,u_6$, a set of terminals $T=\{u_1,u_2,u_3\}$, and an optimal sparsification $H$ containing the first five nodes $u_1,u_2,\cdots,u_5$. For this example, we initially set $S=T=\{u_1,u_2,u_3\}$. Since $H$ contains two non-terminal nodes, $u_4$ and $u_5$, both permutations $u_4,u_5$ and $u_5,u_4$ are valid. For the first permutation, after setting $S=\{u_1,u_2,u_3\}$, the next node to be added to the solution is $u_4$; hence, for this data point, only the label of $u_4$ is 1. The same permutation provides another data point where $S=\{u_1,u_2,u_3,u_4\}$ and only the label of $u_5$ is 1. Similarly, we can generate two more data points from the other permutation. Exhaustively considering all permutations does not scale to larger graphs, so we randomly select at most 100 permutations from each optimal solution. The model is trained with the ADAM optimizer [13] to minimize the cross-entropy loss between the model's prediction and the ground truth (a vector in $\{0,1\}^{|V|}$ indicating whether a node is the next solution node) for each training sample.
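The following sketch (with our own function name) turns one solved instance into training points by sampling random permutations of the non-terminal solution nodes, as in the example above.

```python
import random

def training_points(terminals, solution_nodes, num_permutations=100):
    """Generate (prefix of S, next node) pairs: S starts as the terminal set and
    the non-terminal solution nodes are appended in a random order."""
    steiner_nodes = [v for v in solution_nodes if v not in terminals]
    points = []
    for _ in range(num_permutations):
        perm = random.sample(steiner_nodes, len(steiner_nodes))  # one random permutation
        prefix = list(terminals)
        for v in perm:
            points.append((list(prefix), v))  # label 1 for v, 0 for every other node
            prefix.append(v)
    return points
```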

3.1 Data generation:

We produce sparsification instances using the random geometric graph model [19]. Let $n$ be the number of nodes. In this model, we select $n$ points uniformly at random from the unit square and connect any two nodes whose Euclidean distance is at most a threshold $r$. If $r\geq\sqrt{\frac{\ln n}{\pi n}}$, the graph is connected with high probability. To produce relatively denser graphs, we set $r=\sqrt{\frac{2\ln n}{\pi n}}$.
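A sketch of this instance generator with networkx follows; `random_geometric_graph` places the points in the unit square, and we simply resample in the rare case that the graph is disconnected (an assumption of ours, since the paper does not state how disconnected samples are handled).

```python
import math
import networkx as nx

def random_geometric_instance(n):
    """Sample n points uniformly in the unit square and connect pairs whose
    distance is at most r = sqrt(2 ln n / (pi n))."""
    r = math.sqrt(2.0 * math.log(n) / (math.pi * n))
    G = nx.random_geometric_graph(n, r)
    while not nx.is_connected(G):   # resample the rare disconnected instance
        G = nx.random_geometric_graph(n, r)
    return G
```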

We generate Steiner tree, multiplicative spanner, and additive spanner instances using the above random graph model. We assign each edge a random integer weight in the range $\{1,2,\cdots,10\}$. As discussed earlier, we set the multiplicative stretch $\alpha=2$ and the additive stretch $\beta=2W$, where $W$ is the maximum edge weight of the graph. The number of nodes is in $\{20,50,100\}$. We randomly select half of the nodes of each graph as terminals.

For the Steiner tree and multiplicative spanner problems, we train the graph neural network on 5000 random geometric instances with 100 nodes. For the additive spanner problem, we train on the same number of geometric instances with 50 nodes; we use smaller instances for additive spanners because optimal solutions of larger instances cannot be computed within the maximum 20-hour time limit we use. Each instance generates multiple training data points from different permutations of its non-terminal nodes, as described above. The test instances for the MCTS have 20, 50, or 100 nodes. Since random geometric instances can be "easy" to solve, we also evaluate our approach on graphs from the SteinLib library [16], which provides hard instances for the Steiner tree problem. Specifically, we perform experiments on two SteinLib datasets, I080 and I160, whose instances contain 80 and 160 nodes, respectively. Unlike the geometric graphs, these instances are non-Euclidean; we use the spring embedder [15] to compute node positions for them, since node position is one of our input features.

Figure 4: Performance on random geometric Steiner tree instances with 20 nodes (top row), 50 nodes (middle row), and 100 nodes (bottom row). The lower the cost, the better the algorithm. Our algorithm (MCTS) is nearly optimal and performs better than the 2-approximation.

3.2 Computing optimal solutions:

We need to compute optimal solutions in order to evaluate the performance of our approach (and of the existing algorithms). There are different integer linear programming (ILP) models for the sparsification problems. The cut-based approach considers all partitions of the terminals and ensures that at least one edge crosses each partition; this ILP is simple but introduces an exponential number of constraints. An approach that works better in practice treats a terminal as a source node and sends flow from it to the remaining terminals; see [2, 7] for details on these and other ILP formulations of the exact sparsification problems.
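For the Steiner tree case, one standard single-commodity flow formulation reads as follows (a sketch, with a fixed root terminal $r\in T$ sending one unit of flow to every other terminal; the exact formulation used in the paper may differ, see [2, 7]):

```latex
\begin{align*}
\min\ & \sum_{e\in E} w_e\,x_e \\
\text{s.t.}\ & \sum_{u:\,\{u,v\}\in E} f_{vu} \;-\; \sum_{u:\,\{u,v\}\in E} f_{uv} \;=\;
  \begin{cases} |T|-1 & v=r,\\ -1 & v\in T\setminus\{r\},\\ 0 & \text{otherwise}, \end{cases}
  && \forall v\in V,\\
& f_{uv}+f_{vu} \le (|T|-1)\,x_e, && \forall e=\{u,v\}\in E,\\
& f_{uv},\,f_{vu}\ge 0,\quad x_e\in\{0,1\}, && \forall e=\{u,v\}\in E.
\end{align*}
```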

We compute exact solutions with the flow-based ILP, using CPLEX 20.10 as the ILP solver on a high-performance computer (a Lenovo NeXtScale nx360 M5 system with 400 nodes, each with 192 GB of memory). We implement the algorithms described above in Python 3.10.

Figure 5: Performance on the SteinLib I080 dataset. The lower the cost, the better the algorithm. Our algorithm (MCTS) performs better than the 2-approximation.

3.3 GNN architecture:

We illustrate the architecture of our GNN in Figure 2. We use a 12-dimensional node feature vector that includes node positions, an indicator for terminal nodes, an indicator for solution nodes, node degree, clustering coefficients [21], and different node centrality values [8]. For edge features, we use the edge weight and common neighborhoods [18]. The input feature vector is embedded into a higher dimension using a multi-layer perceptron (MLP). We keep three hidden layers and use ReLU activation in the MLP. We set the embedding dimension equal to 128. We use a graph convolutional network (GCN) [14] as a message-passing procedure for our experimental analysis and provide more details about this design choice in the Appendix.

As discussed in Section 2, we use another MLP to map the node embeddings into the probability space; it has two hidden layers with ReLU activation. We observed that the GNN reaches good accuracy after about 30 epochs and then saturates during training. Hence we set the maximum number of epochs to 60 with early stopping after 15 epochs (training stops automatically when the chosen metric does not improve for 15 epochs).

3.4 MCTS parameters:

We set $c_{puct}=1.3$ based on our initial experiments and on suggestions from previous experimental results [26]. With probability $\epsilon$, the MCTS selects an action uniformly at random; we set $\epsilon=0.1$, which gives reasonable performance, in line with the existing literature [23]. The MCTS gradually adds new nodes in the simulation step; we allow at most $n$ new nodes, where $n$ is the number of nodes in the input graph. We stop the MCTS when the height of the search tree reaches $20\%$ of the number of nodes in the input graph. We discuss the reasons for these design choices in the Appendix.

Figure 6: Performance on random geometric additive spanner instances ($\beta=2W$) with 20 nodes (top row) and 50 nodes (bottom row). The lower the cost, the better the algorithm. Our algorithm (MCTS) is nearly optimal and performs better than the subsetwise +2W spanner algorithm.

4 Experimental results

We evaluate the proposed approach by comparing its sparsifications to those computed by the standard algorithms described in Section 2.3 and to the optimal solutions. The proposed approach never performs worse than the standard algorithms. We also report running times.

The results for the Steiner tree problem on geometric graphs are shown in Figure 4. We train the model only on geometric graphs with 100 nodes and test the MCTS on geometric graphs of different sizes. The top row of Figure 4 shows the performance of the different algorithms on geometric graphs with 20 nodes, including a comparison between the MCTS and the standard 2-approximation algorithm. The costs obtained by the MCTS are noticeably smaller than those of the 2-approximation for several instances, and the cost difference between the MCTS and the optimal solution is much smaller than the gap between the 2-approximation and the optimum. The middle and bottom rows of Figure 4 show the corresponding results on geometric graphs with 50 and 100 nodes, respectively.

It is natural that our method performs well on geometric graphs, since it was trained on geometric graphs. A more interesting experiment is to run it on graphs from a different generator. The SteinLib library contains datasets of hard, non-geometric Steiner tree instances. We test our MCTS algorithm on the I080 SteinLib dataset; each of these instances contains 80 nodes, six of which are terminals. A cost comparison on this dataset is shown in Figure 5.

Due to the page limit, we discuss the experimental results for the subsetwise multiplicative spanner problem in the Appendix and focus here on the subsetwise additive spanner problem. For this problem, we again consider geometric graphs, but we train the GNN on instances with 50 nodes instead of 100. The additive spanner problem is comparatively harder [4], and computing an exact solution takes significantly longer: with a time limit of 20 hours, additive spanner instances with 100 nodes cannot be solved exactly. We therefore test the MCTS on instances with 20 and 50 nodes and compare it with the subsetwise $+2W$ algorithm, which performs well in practice [5]. The results are illustrated in Figure 6: the MCTS performs significantly better than the subsetwise $+2W$ algorithm and produces nearly optimal solutions.

Algorithm   ST 20   ST 50   ST 80   ST 100   MS 20   MS 50   MS 100   AS 20   AS 50
Approx.     0.16    0.74    1.29    1.86     0.21    2.38    10.92    0.27    2.72
MCTS        0.64    3.90    5.77    6.32     0.98    9.83    57.17    1.24    11.79
OPT         5.92    165.8   1051    3139     11.79   318.9   16139    37.19   19107

Table 1: Average running time (in seconds) of the different algorithms on the test datasets.

4.1 Running time:

We train the GNN separately for each sparsification problem. For the Steiner tree and subsetwise multiplicative spanner problems, we train on geometric instances with 100 nodes; the training times are 20.48 and 21.29 hours, respectively. For the subsetwise additive spanner problem, we train on geometric instances with 50 nodes; the training time is 4.92 hours. The average running times of the optimal algorithm, the existing approximation algorithms, and our algorithm on the different test datasets are shown in Table 1. We denote Steiner tree, multiplicative spanner, and additive spanner instances by ST, MS, and AS, respectively, followed by the number of nodes. All instances are geometric except the ST 80 dataset, which is the SteinLib I080 dataset. Table 1 shows that the approximation algorithms (Approx.) are the fastest. Our algorithm is somewhat slower, but its solution values are closer to the optimal values.

4.2 Impact of GNN prediction:

The performance of the MCTS depends on the GNN's predictions. The task of the GNN is to predict, at each step, an important node $u$ from $\overline{S}$; this non-terminal node should connect the terminal nodes in a way that reduces the overall cost of the sparsification. To indicate the importance of the GNN prediction, we compare the MCTS that uses the GNN with an otherwise identical MCTS that selects random non-terminal nodes (Random MCTS). We use a dataset of geometric Steiner tree instances with 50 nodes. The comparison is shown in Figure 7; as expected, our MCTS computes Steiner trees with lower costs.

Figure 7: A comparison of our MCTS with an MCTS that selects random nodes from $\overline{S}$ (Random MCTS), illustrating the importance of the GNN prediction, on a dataset of geometric Steiner tree instances with 50 nodes.

4.3 Performance on larger instances:

To show the scalability of the model, we test it on instances larger than the training instances; additional results appear in the Appendix due to the page limit. For the additive spanner problem, we train the GNN on instances with 50 nodes. We are unable to compute optimal solutions for instances with 100 nodes within the time limit, whereas the MCTS finds a solution in only a few minutes. The solution computed by the MCTS is significantly better than that of the subsetwise +2W spanner algorithm; see Figure 8. The average running times of the subsetwise +2W spanner algorithm and the MCTS are 14.82 and 79.13 seconds, respectively.

Figure 8: Performance on random geometric additive spanner instances ($\beta=2W$) with 100 nodes. Our algorithm (MCTS) is significantly better than the subsetwise +2W spanner algorithm.

5 Conclusion

We have described an approach to several graph sparsification problems based on GNNs and MCTS. An experimental evaluation shows that the proposed method computes solutions close to the optimal ones on different datasets in reasonable time. The proposed method generalizes the approximation algorithms and never performs worse than them. The source code and experimental data can be found on GitHub.

References

  • [1] Ajit Agrawal, Philip Klein, and Ramamoorthi Ravi. When trees collide: An approximation algorithm for the generalized steiner problem on networks. SIAM journal on Computing, 24(3):440–456, 1995.
  • [2] Reyan Ahmed, Patrizio Angelini, Faryad Darabi Sahneh, Alon Efrat, David Glickenstein, Martin Gronemann, Niklas Heinsohn, Stephen G Kobourov, Richard Spence, Joseph Watkins, and Alexander Wolff. Multi-level steiner trees. Journal of Experimental Algorithmics (JEA), 24:1–22, 2019.
  • [3] Reyan Ahmed, Greg Bodwin, Keaton Hamm, Stephen Kobourov, and Richard Spence. On additive spanners in weighted graphs with local error. In Graph-Theoretic Concepts in Computer Science: 47th International Workshop, WG 2021, Warsaw, Poland, June 23–25, 2021, Revised Selected Papers 47, pages 361–373. Springer, 2021.
  • [4] Reyan Ahmed, Greg Bodwin, Faryad Darabi Sahneh, Keaton Hamm, Mohammad Javad Latifi Jebelli, Stephen Kobourov, and Richard Spence. Graph spanners: A tutorial review. Computer Science Review, 37:100253, 2020.
  • [5] Reyan Ahmed, Greg Bodwin, Faryad Darabi Sahneh, Keaton Hamm, Stephen Kobourov, and Richard Spence. Multi-level weighted additive spanners. arXiv preprint arXiv:2102.05831, 2021.
  • [6] Reyan Ahmed, Keaton Hamm, Stephen Kobourov, Mohammad Javad Latifi Jebelli, Faryad Darabi Sahneh, and Richard Spence. Multi-priority graph sparsification. In International Workshop on Combinatorial Algorithms, pages 1–12. Springer, 2023.
  • [7] Reyan Ahmed, Stephen Kobourov, Faryad Darabi Sahneh, and Richard Spence. Approximation algorithms and an integer program for multi-level graph spanners. Analysis of Experimental Algorithms, page 541.
  • [8] Phillip Bonacich. Power and centrality: A family of measures. American journal of sociology, 92(5):1170–1182, 1987.
  • [9] Thomas H Cormen, Charles E Leiserson, Ronald L Rivest, and Clifford Stein. Introduction to algorithms. MIT press, 2009.
  • [10] Hanjun Dai, Elias B Khalil, Yuyu Zhang, Bistra Dilkina, and Le Song. Learning combinatorial optimization algorithms over graphs. In Proceedings of Neural Information Processing Systems, 2017.
  • [11] Michael Elkin, Yuval Gitlitz, and Ofer Neiman. Improved weighted additive spanners. Distributed Computing, pages 1–10, 2022.
  • [12] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in neural information processing systems, pages 1024–1034, 2017.
  • [13] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [14] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In Proceedings of The International Conference on Learning Representations (ICLR), 2016.
  • [15] Stephen G Kobourov. Spring embedders and force directed graph drawing algorithms. arXiv preprint arXiv:1201.3011, 2012.
  • [16] T. Koch, A. Martin, and S. Voß. SteinLib: An updated library on Steiner tree problems in graphs. Technical Report ZIB-Report 00-37, Konrad-Zuse-Zentrum für Informationstechnik Berlin, Takustr. 7, Berlin, 2000.
  • [17] Zhuwen Li, Qifeng Chen, and Vladlen Koltun. Combinatorial optimization with graph convolutional networks and guided tree search. In Proceedings of Neural Information Processing Systems, 2018.
  • [18] David Liben-Nowell and Jon Kleinberg. The link prediction problem for social networks. In Proceedings of the twelfth international conference on Information and knowledge management, pages 556–559, 2003.
  • [19] Mathew Penrose. Random geometric graphs, volume 5. Oxford university press, 2003.
  • [20] Christopher D Rosin. Multi-armed bandits with episode context. Annals of Mathematics and Artificial Intelligence, 61(3):203–230, 2011.
  • [21] Jari Saramäki, Mikko Kivelä, Jukka-Pekka Onnela, Kimmo Kaski, and Janos Kertesz. Generalizations of the clustering coefficient to weighted complex networks. Physical Review E, 75(2):027105, 2007.
  • [22] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2008.
  • [23] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
  • [24] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
  • [25] Tian Xie and Jeffrey C Grossman. Crystal graph convolutional neural networks for an accurate and interpretable prediction of material properties. Physical review letters, 120(14):145301, 2018.
  • [26] Zhihao Xing and Shikui Tu. A graph neural network assisted monte carlo tree search approach to traveling salesman problem. IEEE Access, 8:108418–108428, 2020.

Appendix

A GNN architecture:

One key component of the GNN model is the message-passing procedure that aggregates information from the local neighborhood. We have incorporated two common message-passing procedures: graph convolutional network (GCN) [14] and graph attention network (GAT) [24]. We show a comparison of these two networks on a set of geometric Steiner tree instances having 50 nodes in Figure 9.

Figure 9: A comparison of the GCN and GAT message-passing frameworks on geometric Steiner tree instances with 50 nodes.

Since the cost difference between the two types of message-passing models is not significant, we set the GCN model as the default.

B Experimental results

We now consider the subsetwise multiplicative spanner problem with multiplicative stretch $\alpha=2$. As for the Steiner tree problem, we train the GNN on geometric instances with 100 nodes and test the MCTS on instances with 20, 50, and 100 nodes. We compare our method with the well-known greedy algorithm and with the optimal solution. The greedy algorithm produces asymptotically tight spanners assuming the Erdős girth conjecture and performs well in practice [4]. The results are illustrated in Figure 10: the MCTS performs significantly better than the greedy algorithm on instances of all sizes, and its cost is comparable to the optimal cost.

Figure 10: Performance on random geometric multiplicative spanner instances ($\alpha=2$) with 20 nodes (top row), 50 nodes (middle row), and 100 nodes (bottom row). The lower the cost, the better the algorithm. Our algorithm (MCTS) performs better than the greedy algorithm.

C Impact of sample size and height of the search tree:

The sample size and the height of the search tree are important parameters of the MCTS. We keep the sample size equal to $n$, the number of nodes of the input graph, and stop the MCTS when the height of the search tree reaches 20% of $n$. The reasons for limiting the height to 20% of $n$ are the computational cost per iteration and the effectiveness of the GNN prediction, as discussed in Section 4.2: for each sampled node we must compute a single-source shortest path, which increases the total computational cost, and the GNN predicts most of the important nodes within the initial set of samples, after which the sampling process saturates and barely improves the solution quality. As an example, Figure 11 illustrates the impact of a larger sample size and search tree height on geometric Steiner tree instances with 50 nodes. The solution quality does not improve much when we increase the sample size from $n$ to $2n$ and the height of the search tree from 20% to 40% of $n$, whereas a sample size of $n$ and a height of 20% of $n$ already yield a significant improvement in solution quality (see Figure 4). Moreover, the average running time of the MCTS increases from 3.90 seconds with the first setting to 7.13 seconds with the larger sample size and tree height. For each of the settings studied in this paper, we found that a sample size of $n$ and a search tree height of 20% of $n$ is sufficient; hence we use this setting for all experiments.

Figure 11: A comparison of the MCTS using a sample size of $n$ and a search tree height of 20% of $n$ with the MCTS using a sample size of $2n$ and a height of 40% of $n$, where $n$ is the number of nodes in the input graph. The MCTS saturates with the first setting, and the second setting does not provide a significant improvement. The dataset consists of geometric Steiner tree instances with 50 nodes.

D Performance on larger instances:

For the Steiner tree problem, we train the GNN on instances with 100 nodes. The previous sections compared our method on instances with at most 100 nodes; we now evaluate it on larger instances. Figure 12 shows the performance of our method on the SteinLib I160 graphs, each of which contains 160 nodes. These results indicate that our method performs well on larger instances even though it was trained on smaller ones.

Figure 12: Performance on the SteinLib I160 dataset. The lower the cost, the better the algorithm. Our algorithm (MCTS) performs better than the 2-approximation.