Graph Reinforcement Learning for Network Control
via Bi-Level Optimization
Abstract
Optimization problems over dynamic networks have been extensively studied and widely used in the past decades to formulate numerous real-world problems. However, (1) traditional optimization-based approaches do not scale to large networks, and (2) the design of good heuristics or approximation algorithms often requires significant manual trial-and-error. In this work, we argue that data-driven strategies can automate this process and learn efficient algorithms without compromising optimality. To do so, we present network control problems through the lens of reinforcement learning and propose a graph network-based framework to handle a broad class of problems. Instead of naively computing actions over high-dimensional graph elements, e.g., edges, we propose a bi-level formulation where we (1) specify a desired next state via RL, and (2) solve a convex program to best achieve it, leading to drastically improved scalability and performance. We further highlight a collection of desirable features to system designers, investigate design decisions, and present experiments on real-world control problems showing the utility, scalability, and flexibility of our framework.
1 Introduction
Many economically-critical real-world systems are well framed through the lens of control on graphs. For instance, the system-level coordination of power generation systems (Dommel & Tinney, 1968; Huneault & Galiana, 1991; Bienstock et al., 2014); road, rail, and air transportation systems (Wang et al., 2018; Gammelli et al., 2021); complex manufacturing systems, supply chain, and distribution networks (Sarimveis et al., 2008; Bellamy & Basole, 2013); telecommunication networks (Jakobson & Weissman, 1995; Flood, 1997; Popovskij et al., 2011); and many other systems can be cast as controlling flows of products, vehicles, or other quantities on graph-structured environments.
A collection of highly effective solution strategies exist for versions of these problems. Some of the earliest applications of linear programming were network optimization problems (Dantzig, 1982), including examples such as maximum flow (Hillier & Lieberman, 1995; Sarimveis et al., 2008; Ford & Fulkerson, 1956). Within this context, handling multi-stage decision-making is typically addressed via time expansion techniques (Ford & Fulkerson, 1958, 1962). However, despite their broad applicability, these approaches are limited in their ability to handle several classes of problems efficiently. Large-scale time-expanded networks may be prohibitively expensive, as are stochastic systems that require sampling realizations of random variables (Birge & Louveaux, 2011; Shapiro et al., 2014). Moreover, nonlinearities may result in intractable optimization problems.
In this paper, we propose a strategy for simultaneously exploiting the tried-and-true optimization toolkit associated with network control problems while also handling the difficulties associated with stochastic, nonlinear, multi-stage decision-making. To do so, we present dynamic network problems through the lens of reinforcement learning and formalize a problem that is largely scattered across the control, management science, and optimization literature. Specifically, we propose a learning-based framework to handle a broad class of network problems by exploiting the main strengths of graph representation learning, reinforcement learning, and classical operations research tools (Figure 1).

The contributions of this paper are threefold (code available at: https://github.com/DanieleGammelli/graph-rl-for-network-optimization):

- We present a graph network-based, bi-level RL approach that leverages the specific strengths of direct optimization and reinforcement learning.
- We investigate architectural components and design decisions within our framework, such as the choice of graph aggregation function, action parametrization, and how exploration should be achieved, and analyze their impact on system performance.
- We show that our approach is highly performant, scalable, and robust to changes in operating conditions and network topologies, both on artificial test problems and on real-world problems such as supply chain inventory control and dynamic vehicle routing. Crucially, we show that our approach outperforms classical optimization-based approaches, domain-specific heuristics, and pure end-to-end reinforcement learning.
2 Related Work
Many real-world network control problems rely heavily on convex optimization (Boyd & Vandenberghe, 2004; Hillier & Lieberman, 1995). This is often due to the relative simplicity of constraints and cost functions; for example, capacity constraints on edges may be written as simple linear combinations of flow values, and costs are linear in quantities due to the linearity of prices. In particular, linear programming (as well as specialized versions thereof) is fundamental in problems such as flow optimization, matching, cost minimization and optimal production, and many more. While algorithmic improvements have made many convex problem formulations tractable and efficient to solve, these methods are still not able to handle (i) nonlinear dynamics, (ii) stochasticity, or (iii) the curse of dimensionality in time-expanded networks. In this work, we aim to address these challenges by combining the strengths of direct optimization and reinforcement learning.
Nonlinear dynamics typically require linearization to yield a tractable optimization problem: either around a nominal trajectory, or iteratively during solution. While sequential convex optimization often yields an effective approximate solution, it is expensive, and guaranteeing convergence in practice while preserving efficiency may be difficult (Dinh & Diehl, 2010). Stochasticity may be handled in many ways: common strategies are distributional assumptions to achieve analytic tractability (Astrom, 2012), building in sufficient buffer to correct via re-planning in the future (Powell, 2022), or sampling-based methods, often with fixed recourse (Shapiro et al., 2014). Addressing the curse of dimensionality relies on limiting the amount of online optimization; typical approaches include limited-lookahead methods (Bertsekas, 2019) or computing a parameterized policy via approximate dynamic programming or reinforcement learning (Bertsekas, 1995; Bertsekas & Tsitsiklis, 1996; Sutton & Barto, 1998). However, these policies may be strongly sub-optimal depending on representation capacity and state/action-space coverage. In contrast to these methods, we leverage the strong performance of optimization over short horizons (in which the impact of nonlinearity and stochasticity is typically limited) and exploit an RL-based heuristic for future returns, which avoids the curse of dimensionality and the need to solve non-convex or sampled optimization problems.
Our proposed approach results in a bi-level optimization problem. Bi-level optimization—in which one optimization problem depends on the solution to another optimization problem, and is thus nested—has recently attracted substantial attention in machine learning, reinforcement learning, and control (Finn et al., 2017; Harrison et al., 2018; Agrawal et al., 2019a, b; Amos & Kolter, 2017; Landry et al., 2019; Metz et al., 2019). Of particular relevance to our framework are methods that combine principled control strategies with learned components in a hierarchical way. Examples include using LQR control in the inner problem with learnable cost and dynamics (Tamar et al., 2017; Amos et al., 2018; Agrawal et al., 2019b), learning sampling distributions in planning and control (Ichter et al., 2018; Power & Berenson, 2022; Amos & Yarats, 2020), or learning optimization strategies or goals for optimization-based control (Sacks & Boots, 2022; Xiao et al., 2022; Metz et al., 2019, 2022; Lew et al., 2022).
Numerous strategies for learning control with bi-level formulations have been proposed. A simple approach is to insert intermediate objectives to train lower-level components, for example via imitation (Ichter et al., 2018). This approach is inherently limited by the choice of the intermediate objective; if this objective does not strongly correlate with the downstream task, learning could emphasize unnecessary elements or miss critical ones. An alternate strategy, which we take in this work, is to directly optimize through an inner controller, thus avoiding the problem of goal misspecification. A large body of work has focused on exploiting exact solutions to the gradient of (convex) optimization problems at fixed points (Amos et al., 2018; Agrawal et al., 2019b; Donti et al., 2017). This allows direct backpropagation through optimization problems, allowing them to be used as a generic component in a differentiable computation graph (or neural network). Our approach instead leverages likelihood ratio gradients (equivalently, policy gradients), an alternate zeroth-order gradient estimator (Glynn, 1990). This enables easy differentiation through lower-level optimization problems without the technical machinery required by fixed-point differentiation.
3 Problem Setting: Dynamic Network Control
To outline our problem formulation, we first define the linear problem, which yields a classic convex problem formulation. We will then define a nonlinear, dynamic, non-convex problem setting that better corresponds to real-world instances. Much of the classical flow control literature and practice substitutes the former linear problem for the latter nonlinear one to yield tractable optimization problems (Li & Bo, 2007; Zhang et al., 2016; Key & Cope, 1990). Let us consider the control of commodities on graphs, for example, vehicles in a transportation problem. A graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ is defined as a set of nodes $\mathcal{V}$ and a set of ordered pairs of nodes $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$, called edges, each described by a travel time $\tau_{ij}$. We use $\mathcal{N}^+(i)$ and $\mathcal{N}^-(i)$ for the sets of nodes connected by edges pointing away from or toward node $i$, respectively. We use $x^t_{i,k}$ to denote the quantity of commodity $k$ at node $i$ and time $t$. (We consider several reduced views over these quantities: we write $x^t$ to denote the vector of all commodities at all nodes, $x^t_k$ to denote the vector of commodity $k$ at all nodes, and $x_{i,k}$ to denote commodity $k$ at node $i$ for all times $t$.)
3.1 The Linear Network Control Problem
Within the linear model, our commodity quantities evolve in time as

$$x^{t+1} = x^t + \Delta x^t_{\mathrm{flow}} + \Delta x^t_{\mathrm{ex}}, \qquad (1)$$

where $\Delta x^t_{\mathrm{flow}}$ denotes the change due to flow of commodities along edges and $\Delta x^t_{\mathrm{ex}}$ denotes the change due to exchange between commodities at the same graph node. We refer to this expression as the conservation of flow. We also accrue money as

$$m^{t+1} = m^t + \Delta m^t_{\mathrm{flow}} + \Delta m^t_{\mathrm{ex}}, \qquad (2)$$

where $\Delta m^t_{\mathrm{flow}}$ and $\Delta m^t_{\mathrm{ex}}$ denote the money gained due to flows and exchanges, respectively. Our overall problem formulation will typically be to control flows and exchanges so as to maximize money over one or more steps, subject to additional constraints such as, e.g., flow limitations through a particular edge.
Flows. We will denote the flow of commodity $k$ along edge $(i,j)$ at time $t$ with $f^t_{ij,k}$. From these flows, we have

$$\Delta x^t_{\mathrm{flow},i,k} = \sum_{j \in \mathcal{N}^-(i)} f^{t-\tau_{ji}}_{ji,k} - \sum_{j \in \mathcal{N}^+(i)} f^t_{ij,k}, \qquad (3)$$

which is the net flow (inflows minus outflows). As discussed, associated with each flow is a cost $c_{ij}$. Note that, given this formulation, the flow cost for commodity $k$ can be written as $\sum_{(i,j) \in \mathcal{E}} c_{ij} f^t_{ij,k}$. Thus, we can write the total flow cost at time $t$ as

$$\Delta m^t_{\mathrm{flow}} = -\sum_{k} \sum_{(i,j) \in \mathcal{E}} c_{ij} f^t_{ij,k}. \qquad (4)$$
Exchanges. To define our exchange relations and their effect on commodity quantities and costs, we will write the effect that exchanges have on money for each node $i$ as $\Delta m^t_{\mathrm{ex},i}$. Thus, we have $\Delta m^t_{\mathrm{ex}} = \sum_{i \in \mathcal{V}} \Delta m^t_{\mathrm{ex},i}$. We assume there are $n_i$ exchange options at each node $i$. The exchange relation takes the form

$$\begin{bmatrix} \Delta x^t_{\mathrm{ex},i} \\ \Delta m^t_{\mathrm{ex},i} \end{bmatrix} = E_i \, w^t_i, \qquad (5)$$

where $E_i$ is an exchange matrix and $w^t_i \in \mathbb{R}^{n_i}$ are the weights for each exchange. Each column of this exchange matrix denotes an (exogenous) exchange rate between commodities; for example, for the $j$'th column $[-1,\, 1,\, p]^\top$, one unit of commodity one is exchanged for one unit of commodity two plus $p$ units of money. Thus, the choice of exchange weights $w^t_i$ uniquely determines the exchanges $\Delta x^t_{\mathrm{ex},i}$ and the money change due to exchanges, $\Delta m^t_{\mathrm{ex},i}$.
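To make the exchange mechanism concrete, the following minimal NumPy sketch evaluates Eq. (5) for a single node, with two commodities and money stacked into one vector; the matrix entries and the price $p = 3$ are illustrative values chosen for the example, not taken from the paper.

```python
import numpy as np

# One exchange option at a node: column 0 trades one unit of commodity 1
# for one unit of commodity 2 plus p = 3.0 units of money (illustrative numbers).
E_i = np.array([
    [-1.0],   # commodity 1 decreases by one unit
    [+1.0],   # commodity 2 increases by one unit
    [+3.0],   # money increases by p = 3.0
])
w_i = np.array([2.0])   # exchange weight: execute this option twice

delta = E_i @ w_i       # change in (commodity 1, commodity 2, money)
print(delta)            # [-2.  2.  6.]
```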
Convex constraints. We may impose additional convex constraints on the problem beyond the conservation of flow we have discussed so far. There are a few common examples that one may use in several applications. A common constraint is the non-negativity of commodity values, which we may express as
$$x^t \geq 0. \qquad (6)$$
Note that this inequality is defined element-wise. We may also limit the flow of all commodities through a particular edge via
$$\sum_k f^t_{ij,k} \leq u_{ij}, \qquad (7)$$

where $u_{ij}$ is the edge capacity and this sum could also be weighted per commodity. These linear constraints are only a limited selection of common examples, and the particular choice of constraints is problem-specific.
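As a concrete (and heavily simplified) illustration of the linear problem above, the sketch below solves a one-step, single-commodity instance with CVXPY. The toy graph, costs, capacities, and the requirement of delivering eight units to the sink node are assumptions made for the example, not part of the paper's formulation.

```python
import cvxpy as cp
import numpy as np

# One-step, single-commodity sketch of the linear problem (no exchanges).
# A is the node-edge incidence matrix: A[i, e] = +1 if edge e enters node i,
# -1 if it leaves node i. Numbers are illustrative.
num_nodes, num_edges = 4, 5
A = np.array([[-1, -1,  0,  0,  0],
              [ 1,  0, -1, -1,  0],
              [ 0,  1,  1,  0, -1],
              [ 0,  0,  0,  1,  1]])
x_t = np.array([10.0, 0.0, 0.0, 0.0])      # current commodity at each node
c = np.array([1.0, 4.0, 1.0, 2.0, 1.0])    # per-unit flow cost on each edge
u = np.full(num_edges, 6.0)                # edge capacities

f = cp.Variable(num_edges, nonneg=True)    # flows along edges
x_next = x_t + A @ f                       # conservation of flow, Eqs. (1)/(3)

objective = cp.Minimize(c @ f)             # money spent on flows, Eq. (4)
constraints = [x_next >= 0,                # non-negative commodities, Eq. (6)
               f <= u,                     # per-edge capacity, Eq. (7)
               x_next[3] >= 8.0]           # e.g., deliver 8 units to the sink node
cp.Problem(objective, constraints).solve()
print(f.value, x_next.value)
```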
3.2 The Nonlinear Dynamic Network Control Problem
The previous subsection presented a linear, deterministic problem formulation that yields a convex optimization problem for the decision variables—the chosen flows and exchange weights. However, the formulation is limited by the assumption of linear, deterministic state transitions (among others), and is thus limited in its ability to represent typical real-world systems (please refer to Appendix A for a more complete treatment). In this paper, we focus on solving the nonlinear problem (reflecting real, highly-general problem statements) via a bi-level optimization approach, wherein the linear problem (which has been shown to be extremely useful in practice) is used as an inner control primitive.
4 Methodology
In this section, we first introduce a Markov decision process (MDP) for our problem setting in Section 4.1. We further describe the bi-level formulation that is the primary contribution of this paper and provide insights on architectural considerations in Sections 4.2 and 4.3, respectively.
4.1 The Dynamic Network MDP
We consider a discounted MDP $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, r, \gamma)$. Here, $s^t \in \mathcal{S}$ is the state and $a^t \in \mathcal{A}$ is the action, both at time $t$. The dynamics $P$ are probabilistic, with $P(s^{t+1} \mid s^t, a^t)$ denoting a conditional distribution over next states. Finally, we use $r(s^t, a^t)$ to denote the reward function and $\gamma \in (0, 1)$ the discount factor.
State and state space. Real-world network control problems are typically partially observed, and many features of the world impact the state evolution. However, a small number of features are typically of primary importance, and the impact of the other partially-observed elements can be modeled as stochastic disturbances. Our formulation requires, at each timestep, the commodity values $x^t$. Furthermore, the constraint values are required, such as costs, exchange rates, flow capacities, etc. If the graph topology is time-varying, the connectivity at time $t$ is also critical. More precisely, the state elements that we have discussed so far are either properties of the graph nodes (commodity values) or of the edges (such as flow constraints). This difference is of critical importance in our graph neural network architecture.
Generally, the choice of state elements will depend on the information available to a system designer (what can be measured) and on the particular problem setting. Possible examples of further state elements include forecasts of prices, demand and supply, or constraints at future times.
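To make the node/edge split of the state concrete, the sketch below assembles a graph observation from per-node and per-edge quantities. The particular feature choices, tensor shapes, and dictionary layout are assumptions made for illustration, not the paper's exact state definition.

```python
import torch

def build_observation(x_t, demand_forecast, edge_index, travel_time, flow_capacity):
    """Assemble a graph observation from node- and edge-level quantities.

    x_t: [num_nodes, num_commodities] commodity levels,
    demand_forecast: [num_nodes, horizon] per-node forecasts,
    edge_index: [2, num_edges] (source, target) node indices,
    travel_time / flow_capacity: [num_edges] per-edge attributes.
    """
    node_features = torch.cat([x_t, demand_forecast], dim=-1)          # per-node state
    edge_features = torch.stack([travel_time, flow_capacity], dim=-1)  # per-edge state
    return {"node_features": node_features,
            "edge_features": edge_features,
            "edge_index": edge_index}
```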
Action and action space. As discussed in Section 3, an action is defined as all flows and exchanges, $a^t = (f^t, w^t)$. In the following subsections, we describe in detail the action parametrization under the bi-level formulation.
Dynamics. The dynamics of the MDP, $P$, describe the evolution of state elements. We split our discussion into two parts: the dynamics associated with commodity and non-commodity elements.
The commodity dynamics are assumed to be reasonably well-modeled by the conservation of flow (1), subject to the constraints; this forms the basis of the bi-level approach that we describe in the next subsection.
The non-commodity dynamics are assumed to be substantially more complex. For example, buying and selling prices may have a complex dependency on past sales, current demand, current supply (commodity values), as well as random exogenous factors. Thus, we place few assumptions on the evolution of non-commodity dynamics and assume that current values are measurable.
Reward. We assume that our reward is the total discounted money earned over the problem duration. This results in a stage-wise reward function that corresponds to the money earned in that time period, i.e., $r(s^t, a^t) = m^{t+1} - m^t = \Delta m^t_{\mathrm{flow}} + \Delta m^t_{\mathrm{ex}}$.
4.2 The Bi-Level Formulation
The previous subsection presented a general MDP formulation that represents a broad class of relevant network optimization problems. The goal is to find a policy $\pi^\star = \arg\max_{\pi \in \Pi} \, \mathbb{E}_{\tau \sim \pi} \left[ \sum_t \gamma^t r(s^t, a^t) \right]$ (where $\Pi$ is the space of realizable Markovian policies), and where $\tau = (s^0, a^0, s^1, a^1, \ldots)$ denotes the trajectory of states and actions. This formulation requires specifying a distribution over all flow/exchange actions, which may be an extremely large space. We instead consider a bi-level formulation
$$\max_{\theta} \;\; \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_t \gamma^t r(s^t, a^t) \right] \qquad (8)$$

$$\text{s.t.} \;\; a^t = \mathrm{LCP}(s^t, \hat{s}^{t+1}), \quad \hat{s}^{t+1} \sim \pi_\theta(\cdot \mid s^t), \qquad (9)$$
where we compute $a^t$ by replacing a single policy that maps from states to actions (i.e., $s^t \mapsto a^t$) with two nested policies, mapping from states to desired next states to actions (i.e., $s^t \mapsto \hat{s}^{t+1} \mapsto a^t$). As a consequence of this formulation, the desired next state $\hat{s}^{t+1}$ acts as an intermediate variable, thus avoiding the direct parametrization of an extremely large action space, e.g., flows over edges in a graph. This desired next state is then used in a linear control problem (LCP), which leverages a (slightly modified) one-step version of the linear problem formulation of Section 3 to map from desired next state to action. Thus, the resulting formulation is a bi-level optimization problem, whereby the overall policy is the composition of the upper-level policy $\pi_\theta$ and the solution to the linear control problem. Specifically, given a sample of $\hat{s}^{t+1}$ from the stochastic policy, we select flow and exchange actions by solving
$$\begin{aligned}
\min_{f^t, w^t} \quad & D\left(x^{t+1}, \hat{x}^{t+1}\right) - \Delta m^t & \quad & (10\text{a}) \\
\text{s.t.} \quad & x^{t+1} = x^t + \Delta x^t_{\mathrm{flow}} + \Delta x^t_{\mathrm{ex}}, & & (10\text{b}) \\
& m^{t+1} = m^t + \Delta m^t_{\mathrm{flow}} + \Delta m^t_{\mathrm{ex}}, & & (10\text{c}) \\
& x^{t+1} \geq 0, \quad \textstyle\sum_k f^t_{ij,k} \leq u_{ij} \;\; \forall (i,j) \in \mathcal{E}, \quad f^t \geq 0, & & (10\text{d})
\end{aligned}$$
where $D(\cdot, \cdot)$ is a convex metric which penalizes deviation from the desired next state. The resultant problem is convex and thus may be easily and inexpensively solved to choose actions $a^t = (f^t, w^t)$, even for very large problems. Please see Appendices B.2 and C for a broader discussion.
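The following CVXPY sketch shows one way the lower-level problem (10) could look for a single commodity with no exchanges. The L1 distance metric, the incidence-matrix dynamics, and the function signature are illustrative assumptions rather than the paper's exact implementation.

```python
import cvxpy as cp

def solve_lcp(x_t, x_hat, A, cost, capacity):
    """Minimal sketch of the lower-level LCP: choose flows that steer the
    commodity state toward the desired next state x_hat while respecting
    the linear model. A is the node-edge incidence matrix as before."""
    f = cp.Variable(A.shape[1], nonneg=True)
    x_next = x_t + A @ f                               # conservation of flow
    distance = cp.norm1(x_next - x_hat)                # convex distance metric D
    flow_cost = cost @ f                               # one-step money spent on flows
    problem = cp.Problem(cp.Minimize(distance + flow_cost),
                         [x_next >= 0, f <= capacity])
    problem.solve()
    return f.value
```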
As is standard in reinforcement learning, we will aim to solve this problem via learning the policy from data. This may be in the form of online learning (Sutton & Barto, 1998) or via learning from offline data (Levine et al., 2020). There are large bodies of work on both problems, and our presentation will generally aim to be as-agnostic-as-possible to the underlying reinforcement learning algorithm used. Of critical importance is the fact that the majority of reinforcement learning algorithms use likelihood ratio gradient estimation (Williams, 1992), which does not require path-wise backpropagation through the inner problem.
We also note that our formulation assumes access to a model (the linear problem) that is a reasonable approximation of the true dynamics over short horizons. This short-term correspondence is central to our formulation: we exploit exact optimization when it is useful, and otherwise push the impacts of the nonlinearity over time to the learned policy. We assume this model is known in our experiments—which we feel is a reasonable assumption across the problem settings we investigate—but it could be learned from state transitions or as learnable parameters in policy learning.
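Because the policy is trained with a likelihood-ratio (policy-gradient) estimator, the inner LCP can be treated as a black box: gradients flow only through the log-probability of the sampled desired state. The schematic loop below illustrates this; `policy_net`, `env`, and `solve_lcp` are assumed components, and the plain REINFORCE update (without a critic baseline) is a simplification of the actor-critic setup described in Appendix B.1.

```python
import torch
from torch.distributions import Dirichlet

def training_step(policy_net, env, solve_lcp, optimizer, gamma=0.99):
    """One on-policy episode followed by a policy-gradient update.
    The LCP solver is never differentiated through."""
    state = env.reset()
    log_probs, rewards = [], []
    done = False
    while not done:
        concentration = policy_net(state)          # GNN output, strictly positive
        dist = Dirichlet(concentration)
        desired_share = dist.sample()              # desired per-node distribution
        log_probs.append(dist.log_prob(desired_share))
        action = solve_lcp(state, desired_share)   # inner convex problem (black box)
        state, reward, done = env.step(action)
        rewards.append(reward)

    # Discounted returns as the policy-gradient weight.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```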
4.3 Architectural Considerations
Having introduced the problem formulation and a general framework to control graph-structured systems from experience, here and in Appendix B.1 we broaden the discussion of specific algorithmic components.
Table 1: Minimum cost flow results (mean ± std, as a percentage of oracle performance; success rates in parentheses for the capacitated setting).

| | Random | MLP-RL | GCN-RL | GAT-RL | MPNN-RL | Oracle |
|---|---|---|---|---|---|---|
| 2-hops | 9.9% ± 4.8% | 60.2% ± 2.1% | 31.3% ± 1.3% | 22.9% ± 1.1% | 89.7% | - |
| 3-hops | 50.3% ± 8.4% | 53.8% ± 1.6% | 68.7% ± 2.0% | 62.4% ± 1.9% | 89.5% | - |
| 4-hops | 63.1% ± 3.9% | 67.8% ± 2.5% | 71.4% ± 1.7% | 68.2% ± 2.3% | 87.1% | - |
| Dynamic travel time | -23.4% ± 4.3% | -0.7% ± 1.7% | 18.7% ± 2.0% | 17.1% ± 1.6% | 99.1% | - |
| Dynamic topology | 42.5% ± 6.8% | N/A | 53.4% ± 2.8% | 43.4% ± 3.1% | 83.9% | - |
| Multi-commodity | 22.5% ± 8.2% | 41.7% ± 3.2% | 33.8% ± 2.1% | 33.0% ± 1.7% | 72.0% | - |
| Capacity (Success Rate) | 62.6% (82%) | 62.7% (82%) | 65.2% (87%) | 62.9% (80%) | 89.8% (87%) | - (88%) |
Network architectures. We argue that graph networks represent a natural choice for network optimization problems because of three main properties. First, permutation invariance. Crucially, non-permutation-invariant computations would consider each node ordering as fundamentally different and thus require an exponential number of input/output training examples before being able to generalize. Second, locality of the operator. GNNs typically express a local parametric filter (e.g., a convolution operator), which enables the same neural network to be applied to graphs of varying size and connectivity and thus achieve non-parametric expansibility. This property is of fundamental importance for many real-world graph control problems, which are often dynamic or frequently re-configured and for which it is desirable to reuse the same policy without re-training. Lastly, alignment with the computations used for network optimization problems. As shown in (Xu et al., 2020), GNNs can better match the structure of many network optimization algorithms and are thus likely to achieve better performance.
Action parametrization. Let us consider the problem of controlling flows in a network. We are interested in defining a desired next state that is ideally (i) lower dimensional, (ii) able to capture relevant aspects for control, and (iii) as robust as possible to domain shifts. At a high level, we achieve this by avoiding the direct parametrization of per-edge desired flow values and instead computing per-node desired inflow quantities. Concretely, given the total availability of $N^t$ commodity units in the graph, we define $\hat{x}^{t+1}$ as a desired per-node number of commodity units. We do so by first determining $\rho^{t+1}$, where $\rho^{t+1}_i$ defines the percentage of currently available commodity units to be moved to node $i$ in time step $t+1$, with $\sum_i \rho^{t+1}_i = 1$. We then use this to compute $\hat{x}^{t+1}_i = \rho^{t+1}_i N^t$ as the actual number of commodity units. In practice, we achieve this by defining the intermediate policy as a Dirichlet distribution over nodes, i.e., $\rho^{t+1} \sim \mathrm{Dir}(\alpha^t)$, with concentration parameters $\alpha^t$ computed by the policy network. Crucially, the representation of the desired next state via $\rho^{t+1}$ (i) is lower-dimensional, as it only acts over nodes in the graph, (ii) uses a meaningful aggregated quantity to control flows, and (iii) is scale-invariant by construction, as it acts on ratios as opposed to raw commodity quantities. Additionally, for problems that require generation of commodities (e.g., products in a supply chain), we define the desired next state via the exchange weights introduced in Eq. (5), with $\hat{w}^t_i$ representing the number of commodity units to generate. In practice, this can be achieved by defining the intermediate policy as a Gaussian distribution over nodes (followed by rounding), i.e., $\hat{w}^t_i \sim \mathcal{N}(\mu^t_i, \sigma^t_i)$.
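A minimal sketch of this Dirichlet parametrization is shown below: the policy network is assumed to output strictly positive concentration parameters, a sampled simplex vector gives per-node shares, and scaling by the total availability yields the desired per-node quantities. Integrality, if required, can be handled downstream by the LCP or a rounding heuristic (see Appendix B.2).

```python
import torch
from torch.distributions import Dirichlet

def desired_next_state(concentration, total_units):
    """concentration: [num_nodes] strictly positive Dirichlet parameters
    (e.g., a Softplus applied to the policy-network output);
    total_units: scalar number of commodity units currently in the graph."""
    dist = Dirichlet(concentration)
    share = dist.sample()                 # per-node fractions summing to one
    desired = share * total_units         # desired commodity units per node
    return desired, dist.log_prob(share)  # log-prob used by the policy gradient
```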
5 Experiments
In this section, we first consider an artificial minimum cost flow problem as a simple graph control problem that illustrates the basic principles of our formulation and allows us to investigate architectural components (Section 5.1). We further assess the versatility of our framework by applying it to two distinct real-world network problems: the supply chain inventory management problem (Section 5.2) and the dynamic vehicle routing problem (Section 5.3). Specifically, these problems represent two instantiations of economically-critical graph control problems where the task is to control flows of quantities (i.e., packages and vehicles, respectively), generate commodities (i.e., products within a supply chain), or both.
Experimental design. While the specific benchmarks will necessarily depend on the individual problem, in all real-world experiments, we will always compare against the following classes of methods: (i) an Oracle benchmark characterized as an MPC controller which has access to perfect information of all future states of the system and can thus plan for the perfect action sequence, (ii) a Domain-driven Heuristic, i.e., algorithms which are generally accepted as go-to approaches for the types of problems we consider, and (iii) a Randomized heuristic to quantify a reasonable lower-bound of performance within the environment.
5.1 Minimum Cost Flow
Let us consider an artificial minimum cost flow problem where the goal is to control commodities from one or more source nodes to one or more sink nodes in the minimum time possible. We assess the capability of our formulation to handle several practically-relevant situations. Specifically, we do so by comparing different versions of our method against an oracle benchmark to investigate the effect of different neural network architectures. Results in Table 1 and in Appendix D.1.3 show that graph-RL approaches are able to achieve close-to-optimal performance in all proposed scenarios while greatly reducing the computation cost compared to traditional solutions (Figure 2 and Appendix C.2; all methods used the same CPU resources, namely an AMD Ryzen Threadripper 2950X with 16 cores, 32 threads, 40 MB cache, and a 3.4 GHz base clock). Among all formulations (please refer to Appendix D.1 for additional details), MPNN-RL is clearly the best-performing architecture, achieving the largest fraction of oracle performance, on average. As discussed in Section 4.3, this highlights the importance of the algorithmic alignment (Xu et al., 2020) between the neural network architecture and the nature of the computations needed to solve the task. Crucially, the results show that our formulation is able to operate reliably within a broad set of situations, including networks of different depth (2-hops, 3-hops, 4-hops), dynamic travel times, dynamic topologies, i.e., with nodes and edges that can be removed or added during an episode, capacitated networks (Capacity), and multi-commodity problems.

5.2 Supply Chain Inventory Management (SCIM)
Table 2: Supply chain inventory management results (profit, mean ± std, and percentage of oracle performance).

| | Avg. Prod. | S-type Policy | End-to-End RL (MLP/GNN) | Graph-RL (ours) | Oracle |
|---|---|---|---|---|---|
| 1F 2S | -20,334 (± 4,723) | -4,327 (± 251) | -1,832 (± 352) / -17 (± 89) | 192 (± 119) | 852 (± 152) |
| % Oracle | 0.0% | 75.5% | 87.3% / 95.8% | 96.8% | 100.0% |
| 1F 3S | -53,113 (± 7,231) | -5,650 (± 298) | -4,672 (± 258) / -810 (± 258) | 997 (± 109) | 3,249 (± 102) |
| % Oracle | 0.0% | 84.2% | 85.9% / 92.7% | 96.0% | 100.0% |
| 1F 10S | -114,151 (± 4,611) | -14,327 (± 365) | -587,887 (± 5,255) / -568,374 (± 5,255) | 890 (± 288) | 1,358 (± 460) |
| % Oracle | 0.0% | 86.4% | N.A. / N.A. | 99.5% | 100.0% |
In our first real-world experiment, we aim to optimize the performance of a supply chain inventory system. Specifically, this describes the problem of ordering and shipping product inventory within a network of interconnected warehouses and stores in order to meet customer demand while simultaneously minimizing storage and transportation costs. A supply chain system is naturally expressed via a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is the set of both store and warehouse nodes, and $\mathcal{E}$ the set of edges connecting stores to warehouses. Demand materializes in stores at each period $t$. If inventory is available at the store, it is used to meet customer demand and sold at a given price. Unsatisfied orders are maintained over time and are represented as a negative stock (i.e., backorder). At each time step, the warehouse orders additional units of inventory from the manufacturers and stores available ones. As commodities travel across the network, they are delayed by transportation times. Both warehouses and storage facilities have limited storage capacities, such that the current inventory cannot exceed them. The system incurs a number of operations-related costs: storage costs, production costs, backorder costs, and transportation costs.
SCIM Markov decision process. To apply the methodologies introduced in Section 4, we formulate the SCIM problem as an MDP characterized by the following elements (please refer to Appendix D.2.2 for a formal definition):
Action space ($\mathcal{A}$): we consider the problem of determining (1) the amount of additional inventory to order from manufacturers at all warehouse nodes, and (2) the flow of commodities to be shipped from warehouses to stores.
Reward ($r$): we select the reward function in the MDP as the profit of the inventory manager, computed as the difference between sales revenues and costs.
State space ($\mathcal{S}$): the state space describes the current status of the supply network via node and edge features. Node features contain information on (i) current inventory, (ii) current and estimated demand, (iii) incoming flow, and (iv) incoming orders. Edge features are characterized by (i) travel time and (ii) transportation cost.
Bi-Level formulation. In what follows and in Appendix D.2.4, we illustrate a specific instantiation of our framework for the SCIM problem. We define the desired outcome as being characterized by two elements: (i) the desired production in warehouse nodes, and (ii) a desired inventory in store nodes.
The LCP selects flow and production actions to best achieve this desired outcome via distance minimization between desired and actual inventory levels. The LCP is further defined by domain-related constraints, such as ensuring that the inventory in store and warehouse nodes does not exceed storage capacity and that shipped products are non-negative and upper bounded by the available inventory.
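A minimal CVXPY sketch of such an LCP for a single warehouse serving several stores is given below; the variable names, the specific distance terms, and the constraint set are illustrative simplifications rather than the exact formulation in Appendix D.2.4.

```python
import cvxpy as cp

def scim_lcp(inv_warehouse, inv_stores, desired_prod, desired_inv_stores,
             cap_warehouse, cap_stores):
    """Single-warehouse SCIM lower-level problem: choose production and
    per-store shipments so the resulting inventories track the desired ones."""
    n_stores = len(inv_stores)
    prod = cp.Variable(nonneg=True)              # units ordered from the manufacturer
    ship = cp.Variable(n_stores, nonneg=True)    # units shipped to each store

    next_inv_w = inv_warehouse + prod - cp.sum(ship)
    next_inv_s = inv_stores + ship

    objective = cp.Minimize(cp.abs(prod - desired_prod)
                            + cp.norm1(next_inv_s - desired_inv_stores))
    constraints = [cp.sum(ship) <= inv_warehouse,   # cannot ship more than is on hand
                   next_inv_w <= cap_warehouse,     # warehouse storage capacity
                   next_inv_s <= cap_stores]        # per-store storage capacity
    cp.Problem(objective, constraints).solve()
    return prod.value, ship.value
```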
Inventory management via graph control.

For the SCIM problem, we define the domain-driven heuristic as a prototypical S-type (or “order-up-to”) policy, which is generally accepted as an effective heuristic (Van Roy et al., 1997). Appendix D.2 provides further experimental details.
Concretely, we measure overall system performance on three different supply chain networks characterized by increasing complexity. Results in Table 2 show that our framework achieves close-to-optimal performance in all tasks. Specifically, Graph-RL achieves 96.8% (1F2S), 96% (1F3S), and 99.5% (1F10S) of oracle performance. Qualitatively, Figure 3 highlights how Graph-RL learns to control the production and shipping policies to match consumer demand while maintaining low inventory storage. More subtly, Figure 3 shows how policies learned through Graph-RL manage to anticipate demand, taking production and shipping times into consideration so that products are promptly available in stores. Results in Table 2 also show how S-type policies, despite being explicitly fine-tuned for all tasks, are largely inefficient and thus incur unnecessary costs and revenue losses, resulting in a substantial profit gap compared to Graph-RL, on average.
Table 3: Dynamic vehicle routing results (profit, mean ± std, and percentage of oracle performance). The last two rows report zero-shot transfer between cities.

| | Random | Evenly-balanced System | End-to-end RL | Graph-RL (ours) | Oracle |
|---|---|---|---|---|---|
| New York | -10,778 (± 659) | 9,037 (± 797) | -6,043 (± 2,584) | 15,481 (± 397) | 16,867 (± 547) |
| % Oracle | 0.0% | 71.6% | 17.2% | 94.9% | 100.0% |
| Shenzhen | 19,406 (± 1,894) | 29,826 (± 706) | 18,889 (± 1,207) | 36,918 (± 616) | 40,332 (± 724) |
| % Oracle | 0.0% | 50.1% | -0.02% | 83.8% | 100.0% |
| Zero Shot NY → SHE | - | - | 18,568 (± 1,358) | 36,100 (± 657) | - |
| Zero Shot SHE → NY | - | - | -4,083 (± 1,278) | 14,495 (± 426) | - |



LCP as inductive bias for network computations. As a further analysis, we compare with an ablation of our framework which, as in the majority of the literature, is defined as a purely end-to-end RL agent that avoids the LCP and directly maps from environment states to production and shipping actions through either MLPs (Peng et al., 2019; Oroojlooyjadid et al., 2022) or GNNs. Results in Figure 4 clearly highlight how the bi-level formulation exhibits significantly improved sample efficiency and performance compared to its end-to-end counterpart, which is either substantially slower at converging to good-quality solutions or does not converge at all, as in Figure 4(c). We argue that this behavior is due to two main factors: (1) the bi-level agent operates on a lower-dimensional and well-structured representation via the desired next state, and (2) the bi-level formulation provides an implicit inductive bias towards feasible, high-quality solutions via the definition of the LCP. Together, these two properties define an RL agent that exhibits improved efficiency and performance.
5.3 Dynamic Vehicle Routing
Table 4: Comparison between a greedy one-step policy (which ignores the distance term) and the proposed bi-level Graph-RL approach.

| | | | Greedy (i.e., short-term reward only) | Graph-RL (i.e., with distance term) |
|---|---|---|---|---|
| SCIM | 1F2S | Reward | -102,919 (± 2,767) | 192 (± 119) |
| | | % Oracle | N.A. | 96.8% |
| | 1F3S | Reward | -169,433 (± 2,880) | 997 (± 109) |
| | | % Oracle | N.A. | 96.0% |
| | 1F10S | Reward | -587,661 (± 3,862) | 890 (± 288) |
| | | % Oracle | N.A. | 99.5% |
| DVR | New York | Reward | 13,978 (± 391) | 15,481 (± 397) |
| | | Served Demand | 1,357 (± 92) | 1,824 (± 87) |
| | | % Oracle | 90.13% | 94.9% |
| | Shenzhen | Reward | 35,996 (± 499) | 36,918 (± 616) |
| | | Served Demand | 2,881 (± 98) | 3,310 (± 92) |
| | | % Oracle | 79.27% | 83.9% |
In the second real-world experiment, we apply our framework to the field of mobility. Specifically, we focus on the dynamic vehicle routing (DVR) problem, which describes the task of finding the least-cost routes for a fleet of vehicles such that it can satisfy the demand of a set of customers geographically dispersed in a dynamic, stochastic network. Towards this aim, we consider a transportation network $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with single-occupancy vehicles, where $\mathcal{V}$ represents the set of stations (e.g., pickup or drop-off locations) and $\mathcal{E}$ represents the set of links in the transportation network (e.g., roads), each characterized by a travel time and a cost. At each time step, customers arrive at their origin stations and wait for idle vehicles to transport them to their destinations. The trip from station $i$ to station $j$ at time $t$ is characterized by a demand and a price; passengers not served by any vehicle leave the system, and the revenue from their trips is lost. The system operator coordinates a fleet of vehicles to best serve the demand for transportation while minimizing the cost of operations. Concretely, the operator achieves this by controlling the passenger flow (i.e., vehicles delivering passengers to their destination) and the rebalancing flow (i.e., vehicles not assigned to passengers and used, for example, to anticipate future demand) at each time step $t$. Please refer to Appendix D.3 for further details.
DVR Markov decision process. We formulate the DVR MDP through the following elements:
Action space ($\mathcal{A}$): we compute the rebalancing flow at each time step. Without loss of generality, we assume the passenger flow is assigned through an independent routine, although the ideas described in this section can be extended to also include passenger flows.
Reward ($r$): we select the reward function in the MDP as the operator profit, computed as the difference between trip revenues and operation-related costs.
State space ($\mathcal{S}$): the transportation network is described via node features such as the current and projected availability of idle vehicles in each station, current and estimated demand, and provider-level information, e.g., trip price.
Bi-Level formulation. We further describe an additional instantiation of our bi-level framework for the DVR problem. First, we define the desired next state to represent the desired number of idle vehicles in all stations. The second step entails the solution of the LCP to transform the desired number of idle vehicles into feasible environment actions (i.e., rebalancing flows). At a high level, the LCP aims to minimize rebalancing costs while satisfying domain-related constraints, such as ensuring that the total rebalancing flow leaving a region is non-negative and upper-bounded by the number of idle vehicles in that region. Please refer to Appendix D.3.4 for further details.
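A minimal CVXPY sketch of such an LCP is shown below; the L1 distance to the desired idle-vehicle distribution, the incidence-matrix bookkeeping, and the function signature are assumptions made for illustration, not the exact formulation of Appendix D.3.4.

```python
import cvxpy as cp
import numpy as np

def dvr_lcp(idle_vehicles, desired_idle, A, rebalancing_cost):
    """Turn a desired idle-vehicle distribution into rebalancing flows.
    A is the node-edge incidence matrix (+1 entering, -1 leaving a station)."""
    reb_flow = cp.Variable(A.shape[1], nonneg=True)
    next_idle = idle_vehicles + A @ reb_flow          # idle vehicles after rebalancing
    outgoing = np.clip(-A, 0, None) @ reb_flow        # rebalancing flow leaving each station

    objective = cp.Minimize(cp.norm1(next_idle - desired_idle)
                            + rebalancing_cost @ reb_flow)
    constraints = [outgoing <= idle_vehicles]         # cannot move more vehicles than are idle
    cp.Problem(objective, constraints).solve()
    return reb_flow.value
```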
Vehicle routing via network flow. We evaluate the algorithms on two real-world urban mobility scenarios based on the cities of New York, USA, and Shenzhen, China. Results in Table 3 show how Graph-RL is able to achieve close-to-optimal performance in both environments. Specifically, the vehicle routing policies learned through Graph-RL achieve 94.9% (New York) and 83.8% (Shenzhen) of oracle performance, while showing a 23.3% (New York) and 33.7% (Shenzhen) increase in operator profit compared to the domain-driven heuristic, which attempts to preserve equal access to vehicles across stations in the transportation network. As observed for SCIM problems, the results confirm that end-to-end RL approaches struggle with the high-dimensional action spaces defined over the edges of the New York and Shenzhen networks and fail to learn effective routing strategies. Lastly, to assess the transferability and generalization capabilities of Graph-RL, we study the extent to which policies can be trained on one city and later applied to the other without further training (i.e., zero-shot). Table 3 shows that routing policies learned in one city exhibit a promising degree of portability to novel environments, with only minimal performance decay. As introduced in Section 4.3, this experiment further highlights the importance of the locality of graph network-based policies: by learning a shared, local operator, policies learned through graph-RL can potentially be applied to arbitrary graph topologies. Crucially, policies with structural transfer capabilities could enable system operators to re-use previous experience, thus avoiding expensive re-training when exposed to new problem instances.
5.4 Comparison to Greedy Planning
The role of the distance metric (and the generated desired next state) in Eq. (10a) is to capture the value of future reward in the greedy one-step inner optimization problem, ultimately allowing for implicit long-term planning (please refer to Appendix C.1 for a broader discussion). To quantify this intuition, in Table 4 we compare the proposed bi-level approach to a greedy policy that acts optimally with respect to the one-step optimization problem. Concretely, while the proposed bi-level approach attempts to achieve the desired next state as closely as possible (i.e., it minimizes the distance term in Eq. (10a)), the greedy policy ignores the distance term and optimizes solely for short-term reward. Results in Table 4 highlight how the presence of the desired next state, and ultimately of the bi-level approach, is instrumental in achieving effective long-term performance. Crucially, since both producing a commodity (SCIM) and rebalancing a vehicle (DVR) incur only negative rewards, these actions contribute to long-term positive reward only indirectly, via (i) better product availability or (ii) better positioning of vehicles, and thus cannot be valued by the one-step optimization problem. This results in the greedy policy (i) being unable to fulfill any demand in the SCIM problem and (ii) achieving lower profit in the DVR problem. It is important to highlight how, in the DVR problem, the greedy policy achieves reasonably good reward (i.e., profit) because the system can partially sustain itself through passenger trips alone. However, greediness causes the number of served customers to be considerably smaller, with Graph-RL serving +35% more customers in New York and +15% more in Shenzhen, thus clearly showing the benefit of optimizing for long-term reward via the minimization of the distance metric.
6 Conclusion
Research in network optimization problems, in both theory and practice, is largely scattered across the control, management science, and optimization literature, potentially hindering scientific progress. In this work, we propose a general framework that could enable learning-based approaches to help address the open challenges in this space: handling nonlinear dynamics and scalability, among others. Specifically, instead of approaching the problem through pure end-to-end reinforcement learning, we introduced a general bi-level formulation that leverages the specific strengths of direct optimization, reinforcement learning, and graph representation learning. Our approach shows strong performance on all problem settings we evaluate, substantially outperforming both optimization-based and RL-based approaches. In future work, we plan to investigate ways to exploit the non-parametric nature of our approach and take a step in the direction of learning generalist graph optimizers. More generally, we believe this research opens several promising directions for the extension of these concepts to a broader class of large-scale, real-world applications.
References
- Agrawal et al. (2019a) Agrawal, A., Barratt, S., Boyd, S., Busseti, E., and Moursi, W. M. Differentiating through a conic program. Journal of Applied and Numerical Optimization, 1(2):107–115, 2019a.
- Agrawal et al. (2019b) Agrawal, A., Barratt, S., Boyd, S., and Stellato, B. Learning convex optimization control policies. In Learning for Dynamics & Control, 2019b.
- Amos & Kolter (2017) Amos, B. and Kolter, J. Z. OptNet: Differentiable optimization as a layer in neural networks. In Int. Conf. on Machine Learning, 2017.
- Amos & Yarats (2020) Amos, B. and Yarats, D. The differentiable cross-entropy method. In Int. Conf. on Machine Learning, pp. 291–302, 2020.
- Amos et al. (2018) Amos, B., Jimenez, I., Sacks, J., Boots, B., and Kolter, J. Z. Differentiable mpc for end-to-end planning and control. Conf. on Neural Information Processing Systems, 31, 2018.
- Astrom (2012) Astrom, K. J. Introduction to stochastic control theory. Courier Corporation, 2012.
- Bellamy & Basole (2013) Bellamy, M. A. and Basole, R. C. Network analysis of supply chain systems: A systematic review and future research. Systems Engineering, 16(2):235–249, 2013.
- Bertsekas (1995) Bertsekas, D. Dynamic programming and optimal control. Athena Scientific, first edition, 1995.
- Bertsekas (2019) Bertsekas, D. Reinforcement learning and optimal control. Athena Scientific, 2019.
- Bertsekas & Tsitsiklis (1996) Bertsekas, D. and Tsitsiklis, J. N. Neuro-dynamic programming. Athena Scientific, 1996.
- Bienstock et al. (2014) Bienstock, D., Chertkov, M., and Harnett, S. Chance-constrained optimal power flow: Risk-aware network control under uncertainty. SIAM Review, 56(3):461–495, 2014.
- Birge & Louveaux (2011) Birge, J. R. and Louveaux, F. Introduction to stochastic programming. Springer Science & Business Media, 2011.
- Boyd & Vandenberghe (2004) Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge Univ. Press, 2004.
- Daganzo (1978) Daganzo, C.-F. An approximate analytic model of many-to-many demand responsive transportation systems. Transportation Research, 12(5):325–333, 1978.
- Dantzig (1982) Dantzig, G. B. Reminiscences about the origins of linear programming. Operations Research Letters, 1(2):43–48, 1982.
- Dinh & Diehl (2010) Dinh, Q. T. and Diehl, M. Local convergence of sequential convex programming for nonconvex optimization. In Recent Advances in Optimization and its Applications in Engineering. Springer, 2010.
- Dommel & Tinney (1968) Dommel, H. W. and Tinney, W. F. Optimal power flow solutions. IEEE Transactions on power apparatus and systems, (10):1866–1876, 1968.
- Donti et al. (2017) Donti, P., Amos, B., and Kolter, J. Z. Task-based end-to-end model learning in stochastic optimization. Conf. on Neural Information Processing Systems, 30, 2017.
- Dumouchelle et al. (2022) Dumouchelle, J., Patel, R., Khalil, E. B., and Bodur, M. Neur2sp: Neural two-stage stochastic programming. arXiv:2205.12006, 2022.
- Finn et al. (2017) Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Int. Conf. on Machine Learning, 2017.
- Flood (1997) Flood, J. E. Telecommunication networks. IET, 1997.
- Ford & Fulkerson (1956) Ford, L. R. and Fulkerson, D. R. Maximal flow through a network. Canadian journal of Mathematics, 8:399–404, 1956.
- Ford & Fulkerson (1958) Ford, L. R. and Fulkerson, D. R. Constructing maximal dynamic flows from static flows. Operations Research, 6(3):419–433, 1958.
- Ford & Fulkerson (1962) Ford, L. R. and Fulkerson, D. R. Flows in Networks. Princeton Univ. Press, 1962.
- Fujimoto et al. (2022) Fujimoto, S., Meger, D., Precup, D., Nachum, O., and Gu, S. S. Why should i trust you, bellman? the bellman error is a poor replacement for value error. arXiv:2201.12417, 2022.
- Gammelli et al. (2021) Gammelli, D., Yang, K., Harrison, J., Rodrigues, F., Pereira, F. C., and Pavone, M. Graph neural network reinforcement learning for autonomous mobility-on-demand systems. In Proc. IEEE Conf. on Decision and Control, 2021.
- Gammelli et al. (2022) Gammelli, D., Yang, K., Harrison, J., Rodrigues, F., Pereira, F., and Pavone, M. Graph meta-reinforcement learning for transferable autonomous mobility-on-demand. In ACM Int. Conf. on Knowledge Discovery and Data Mining, 2022.
- Gilmer et al. (2017) Gilmer, J., Schoenholz, S., Riley, P., Vinyals, O., and Dahl, G. Neural message passing for quantum chemistry. In Int. Conf. on Machine Learning, 2017.
- Glynn (1990) Glynn, P. W. Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM, 33(10):75–84, 1990.
- Harrison et al. (2018) Harrison, J., Sharma, A., and Pavone, M. Meta-learning priors for efficient online bayesian regression. In Workshop on Algorithmic Foundations of Robotics, pp. 318–337, 2018.
- Hillier & Lieberman (1995) Hillier, F. and Lieberman, G. Introduction to operations research. 1995.
- Huneault & Galiana (1991) Huneault, M. and Galiana, F. D. A survey of the optimal power flow literature. IEEE transactions on Power Systems, 6(2):762–770, 1991.
- IBM (1987) IBM. ILOG CPLEX User’s guide. IBM ILOG, 1987.
- Ichter et al. (2018) Ichter, B., Harrison, J., and Pavone, M. Learning sampling distributions for robot motion planning. In Proc. IEEE Conf. on Robotics and Automation, pp. 7087–7094, 2018.
- Jakobson & Weissman (1995) Jakobson, G. and Weissman, M. Real-time telecommunication network management: Extending event correlation with temporal constraints. In International Symposium on Integrated Network Management, pp. 290–301, 1995.
- Key & Cope (1990) Key, P. B. and Cope, G. A. Distributed dynamic routing schemes. IEEE Communications Magazine, 28(10):54–58, 1990.
- Kipf & Welling (2017) Kipf, T.-N. and Welling, M. Semi-supervised classification with graph convolutional networks. In Int. Conf. on Learning Representations, 2017.
- Konda & Tsitsiklis (1999) Konda, V. and Tsitsiklis, J. Actor-critic algorithms. In Conf. on Neural Information Processing Systems, 1999.
- Landry et al. (2019) Landry, B., Lorenzetti, J., Manchester, Z., and Pavone, M. Bilevel optimization for planning through contact: A semidirect method. In The International Symposium of Robotics Research, pp. 789–804, 2019.
- Levine et al. (2020) Levine, S., Kumar, A., Tucker, G., and Fu, J. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv:2005.01643, 2020.
- Lew et al. (2022) Lew, T., Singh, S., Prats, M., Bingham, J., Weisz, J., Holson, B., Zhang, X., Sindhwani, V., Lu, Y., Xia, F., et al. Robotic table wiping via reinforcement learning and whole-body trajectory optimization. arXiv preprint arXiv:2210.10865, 2022.
- Li & Bo (2007) Li, F. and Bo, R. Dcopf-based lmp simulation: algorithm, comparison with acopf, and sensitivity. IEEE Transactions on Power Systems, 22(4):1475–1485, 2007.
- Metz et al. (2019) Metz, L., Maheswaranathan, N., Nixon, J., Freeman, D., and Sohl-Dickstein, J. Understanding and correcting pathologies in the training of learned optimizers. In Int. Conf. on Machine Learning, pp. 4556–4565, 2019.
- Metz et al. (2022) Metz, L., Harrison, J., Freeman, C. D., Merchant, A., Beyer, L., Bradbury, J., Agrawal, N., Poole, B., Mordatch, I., Roberts, A., et al. Velo: Training versatile learned optimizers by scaling up. arXiv preprint arXiv:2211.09760, 2022.
- Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
- Mnih et al. (2016) Mnih, V., Puigdomenech, A., Mirza, M., Graves, A., Lillicrap, T.-P., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In Int. Conf. on Learning Representations, 2016.
- Murota (2009) Murota, K. Matrices and Matroids for Systems Analysis. Springer Science & Business Media, 1 edition, 2009.
- Oroojlooyjadid et al. (2022) Oroojlooyjadid, A., Nazari, M., Snyder, L. V., and Takáč, M. A deep q-network for the beer game: Deep reinforcement learning for inventory optimization. Manufacturing and Service Operations Management, 24(1):285–304, 2022.
- Paszke et al. (2019) Paszke, A., Gross, S., Massa, F., Lerer, A., et al. Pytorch: An imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703, 2019.
- Peng et al. (2019) Peng, Z., Zhang, Y., Feng, Y., Zhang, T., Wu, Z., and Su, H. Deep reinforcement learning approach for capacitated supply chain optimization under demand uncertainty. In 2019 Chinese Automation Congress (CAC), 2019.
- Pereira & Pinto (1991) Pereira, M. V. and Pinto, L. M. Multi-stage stochastic optimization applied to energy planning. Mathematical Programming, 52(1):359–375, 1991.
- Popovskij et al. (2011) Popovskij, V., Barkalov, A., and Titarenko, L. Control and adaptation in telecommunication systems: Mathematical Foundations, volume 94. Springer Science & Business Media, 2011.
- Powell (2022) Powell, W. B. Reinforcement Learning and Stochastic Optimization: A unified framework for sequential decisions. Wiley, 2022.
- Power & Berenson (2022) Power, T. and Berenson, D. Variational inference mpc using normalizing flows and out-of-distribution projection. arXiv:2205.04667, 2022.
- Rawlings & Mayne (2013) Rawlings, J. and Mayne, D. Model predictive control: Theory and design. Nob Hill Publishing, 2013.
- Sacks & Boots (2022) Sacks, J. and Boots, B. Learning to optimize in model predictive control. In Proc. IEEE Conf. on Robotics and Automation, pp. 10549–10556, 2022.
- Sarimveis et al. (2008) Sarimveis, H., Patrinos, P., Tarantilis, C. D., and Kiranoudis, C. T. Dynamic modeling and control of supply chain systems: A review. Computers & operations research, 35(11):3530–3561, 2008.
- Shapiro et al. (2014) Shapiro, A., Dentcheva, D., and Ruszczyński, A. Lectures on stochastic programming: Modeling and theory. SIAM, second edition, 2014.
- Sutton & Barto (1998) Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 1 edition, 1998.
- Tamar et al. (2017) Tamar, A., Thomas, G., Zhang, T., Levine, S., and Abbeel, P. Learning from the hindsight plan—episodic mpc improvement. In Proc. IEEE Conf. on Robotics and Automation, pp. 336–343, 2017.
- Van de Wiele et al. (2020) Van de Wiele, T., Warde-Farley, D., Mnih, A., and Mnih, V. Q-learning in enormous action spaces via amortized approximate maximization. arXiv:2001.08116, 2020.
- Van Roy et al. (1997) Van Roy, B., Bertsekas, D. P., Lee, Y., and Tsitsiklis, J. N. A neuro-dynamic programming approach to retailer inventory management. In Proc. IEEE Conf. on Decision and Control, 1997.
- Veličković et al. (2018) Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and Bengio, Y. Graph attention networks. In Int. Conf. on Learning Representations, 2018.
- Wang et al. (2018) Wang, Y., Szeto, W. Y., Han, K., and Friesz, T. Dynamic traffic assignment: A review of the methodological advances for environmentally sustainable road transportation applications. Transportation Research Part B: Methodological, 111:370–394, 2018.
- Williams (1992) Williams, R.-J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 1992.
- Xiao et al. (2022) Xiao, X., Zhang, T., Choromanski, K. M., Lee, T.-W. E., Francis, A., Varley, J., Tu, S., Singh, S., Xu, P., Xia, F., Takayama, L., Frostig, R., Tan, J., Parada, C., and Sindhwani, V. Learning model predictive controllers with real-time attention for real-world navigation. In Conf. on Robot Learning, 2022.
- Xu et al. (2020) Xu, K., Li, J., Zhang, M., Du, S., Kawarabayashi, K., and Jegelka, S. What can neural networks reason about? In Int. Conf. on Learning Representations, 2020.
- Zhang et al. (2016) Zhang, R., Rossi, F., and Pavone, M. Model predictive control of Autonomous Mobility-on-Demand systems. In Proc. IEEE Conf. on Robotics and Automation, 2016.
Appendix A Dynamic Network Control
In this section, we make concrete our discussion on nonlinear problem formulations for network control problems.
Elements violating the linearity assumption
Real-world systems are characterized by many factors that cannot be reliably modeled through the linear problem described in Section 3. In what follows, we discuss a (non-exhaustive) list of factors potentially breaking such linearity assumptions:
- Stochasticity. Various stochastic elements can impact the problem. Commodity transitions in Section 3.1 were defined as being deterministic; in practice, in many problems there are elements of stochasticity to these transitions. For example, random demand may reduce supply by an unpredictable amount; vehicles may be randomly added in a transportation problem; and packages may be lost in a supply chain setting. In addition to these state transitions, constraints may be stochastic as well: flow times or edge capacities may be stochastic, as when a road is shared with other users, or costs for flows and exchanges may be stochastic.
- Nonlinearity. Various elements of the state evolution, constraints, or cost function may be nonlinear. The objective may be chosen to be a risk-sensitive or robust metric applied to the distribution of outcomes, as is common in financial problems. The state evolution may have natural saturating behavior (e.g., automatic load shedding). Indeed, many real constraints will have natural nonlinear behavior.
- Time-varying costs and constraints. Similar to the stochastic case, various quantities may be time-varying. However, it is possible that they are time-varying in a structured way, as opposed to randomly. For example, demand for transportation may vary over the time of day, or purchasing costs may vary over the year.
- Unknown dynamics elements. While not a major focus of discussion in the paper up to this point, elements of the underlying dynamics may be partially or wholly unknown. Our reinforcement learning formulation is capable of addressing this by learning policies directly from data, in contrast to standard control techniques.
Appendix B Methodology
In this section, we discuss network architectures and RL components in more detail.
B.1 Network Architecture
Specifically, we first introduce the basic building blocks of our graph neural network architecture. Let us denote with $h_i \in \mathbb{R}^{d_n}$ and $e_{ij} \in \mathbb{R}^{d_e}$ the $d_n$-dimensional vector of node features of node $i$ and the $d_e$-dimensional vector of edge features from node $i$ to node $j$, respectively.
We define the update function of node features through either:
- Message passing neural network (MPNN) (Gilmer et al., 2017), defined as
$$h_i^{(l+1)} = \phi\left(h_i^{(l)}, \bigoplus_{j \in \mathcal{N}(i)} \psi\left(h_i^{(l)}, h_j^{(l)}, e_{ij}\right)\right), \qquad (11)$$
where $l$ indicates the $l$-th layer of message passing in the GNN, with $h_i^{(0)}$ indicating raw environment features, i.e., the node features in $s^t$, and $\bigoplus$ denotes a differentiable, permutation-invariant function, e.g., sum, mean, or max. A minimal sketch of one such layer is given after this list.
- Graph convolutional network (GCN) (Kipf & Welling, 2017), defined as
$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right), \qquad (12)$$
where $H^{(l)}$ is the feature matrix, $\tilde{A} = A + I$ is the adjacency matrix with added self-loops and $I$ is the identity matrix, $\tilde{D}$ is the diagonal node degree matrix of $\tilde{A}$, $\sigma$ is a non-linear activation function (e.g., ReLU), and $W^{(l)}$ is a matrix of learnable parameters.
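The following PyTorch sketch illustrates a single message-passing layer in the spirit of Eq. (11); the two-layer MLPs, hidden sizes, and the choice of sum aggregation are assumptions made for the sketch rather than the exact architecture used in our experiments.

```python
import torch
import torch.nn as nn

class MPNNLayer(nn.Module):
    """One message-passing layer: compute per-edge messages, sum them per
    target node (permutation-invariant), and update the node features."""
    def __init__(self, node_dim, edge_dim, hidden_dim):
        super().__init__()
        self.message_fn = nn.Sequential(
            nn.Linear(2 * node_dim + edge_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim))
        self.update_fn = nn.Sequential(
            nn.Linear(node_dim + hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, node_dim))

    def forward(self, h, edge_index, e):
        # h: [N, node_dim], edge_index: [2, E] (source, target), e: [E, edge_dim]
        src, dst = edge_index
        messages = self.message_fn(torch.cat([h[src], h[dst], e], dim=-1))
        aggregated = torch.zeros(h.size(0), messages.size(-1), device=h.device)
        aggregated.index_add_(0, dst, messages)   # sum aggregation per target node
        return self.update_fn(torch.cat([h, aggregated], dim=-1))
```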
We select the specific architecture based on the alignment with the problem characteristics. We note that these network architectures can be used to define both policy and value function estimator, depending on the reinforcement learning algorithm of interest (e.g., actor-critic (Konda & Tsitsiklis, 1999), value-based (Mnih et al., 2015), etc.). As an example, in our implementation, we define two separate decoder architectures for the actor and critic networks of an Advantage Actor Critic (A2C) (Mnih et al., 2016) algorithm. Below is a summary of the specific architectures used in this work:
- Section 5.1. We use an MPNN as in (11) with a permutation-invariant aggregation function. We define the output of our policy network to represent the concentration parameters of a Dirichlet distribution, such that $\pi_\theta(\cdot \mid s^t) = \mathrm{Dir}(\alpha^t)$, where the positivity of $\alpha^t$ is ensured by a Softplus nonlinearity. On the other hand, the critic is characterized by a global sum-pooling performed after the MPNN layers.
- Section 5.2. We use an MPNN as in (11) with a sum aggregation function. We define the output of our policy network to represent (1) the concentration parameters of a Dirichlet distribution for computing the flow actions, and (2) the mean and standard deviation of a Gaussian distribution for the production action. On the other hand, the critic is characterized by a global sum-pooling performed after the MPNN layers.
Handling dynamic topologies.
A defining property of our framework is its ability to deal with time-dependent graph connectivity (e.g., edges or nodes are added/dropped during the course of an episode). Specifically, our framework achieves this by (i) considering the problem as a one-step decision-making problem, i.e., avoiding the dependency on potentially unknown future topologies, and (ii) exploiting the capacity of GNNs to handle diverse graph topologies. Crucially, no matter the current state of the graph, GNN-based agents are capable of computing a desired next state for the network, which will then be converted into actionable flow decisions by the LCP.
B.2 RL Details
We further discuss practical aspects within our bi-level reinforcement learning approach.
Exploration.
In practice, we choose large penalty terms so that the inner problem closely tracks the desired next state specified by the policy rather than acting greedily. Early in training, however, the penalty terms produced by a randomly initialized policy can harm exploration. We found it sufficient to down-weight the penalty term early in training; as a result, the inner action selection is temporarily biased toward short-term rewards, i.e., greedy action selection. There are many further possibilities for exploiting random penalty functions to induce exploration, which we discuss in the next section.
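One simple way to implement this down-weighting, purely as an illustration (the exact schedule used in our experiments may differ), is a linear warm-up on the penalty weight of the inner problem:

```python
def penalty_weight(step, warmup_steps=10_000, initial_weight=1.0, final_weight=100.0):
    """Linearly anneal the LCP penalty weight during training (illustrative schedule).

    With a small weight, the inner problem is only loosely tied to the (still
    random) desired next state and acts greedily on short-term reward; once the
    weight is large, the LCP closely tracks the state proposed by the policy.
    """
    frac = min(step / warmup_steps, 1.0)
    return initial_weight + frac * (final_weight - initial_weight)
```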
Integer-valued flows.
For several problem settings, it is desirable that the chosen flows be integer-valued. For example, in a transportation problem, we may wish to allocate some number of vehicles, which cannot be infinitely sub-divided (Gammelli et al., 2021, 2022). There are several ways to introduce integer-valued constraints into our framework. First, we note that because the RL agent is trained through policy gradients, and thus we do not require a differentiable inner problem, we can simply introduce integer constraints into the lower-level problem (note that several problems exhibit a total unimodularity property (Murota, 2009), for which the relaxed integer-valued problem is tight). However, solving integer-constrained problems is typically expensive in practice. An alternative is to apply a heuristic rounding operation to the output of the inner problem. Again, because of the choice of gradient estimator, this operation does not need to be differentiable; moreover, the RL policy learns to adapt to this heuristic rounding. Thus, we generally recommend this strategy over directly imposing integer constraints in the inner problem.
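As a minimal sketch of the rounding strategy, assuming the inner problem returns a vector of fractional edge flows and per-edge capacities (names are illustrative):

```python
import numpy as np

def round_flows(fractional_flows, capacities):
    """Heuristically round LP flows to integers without exceeding capacity.

    Because the policy is trained with a score-function (policy-gradient)
    estimator, this operation does not need to be differentiable; the RL
    policy simply learns to adapt to it. Note that naive rounding may
    slightly violate flow conservation, so a small repair step can be
    applied downstream if needed.
    """
    rounded = np.rint(fractional_flows)
    return np.clip(rounded, 0, np.floor(capacities))
```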
Appendix C Discussion and Algorithmic Components
In this section, we discuss various elements of the proposed framework, highlight correspondences and design decisions, and discuss component-level extensions.
C.1 Distance metric as value function
The role of the distance metric (and the generated desired next state) is to capture the value of future reward in the greedy one-step inner optimization problem. This is closely related to the value function in dynamic programming and reinforcement learning, which in expectation captures the sum of future rewards for a particular policy. Indeed, under moderate technical assumptions, our linear problem formulation with stochasticity yields convex expected cost-to-go (the negative of the value) (Pereira & Pinto, 1991; Dumouchelle et al., 2022).
There are several critical differences between our penalty term and a learned value function. First, in a Markovian setting, a value function for a given policy is a function of the state alone. For example, in the LCP, a value function would depend only on the current state s^t. In contrast, our penalty term depends on the desired next state, which is the output of a policy taking s^t as input. Thus, the penalty term is a function of both the current and the desired next state. Given this, the penalty term is better understood as a local approximation of the value function for which convex optimization is tractable, or as a form of state-action value function with a reduced action space (also referred to as a Q-function).
The second major distinction between the penalty term and a value function is particular to reinforcement learning. Value functions in modern RL are typically learned via minimizing the Bellman residual (Sutton & Barto, 1998), although there is disagreement on whether this is a desirable objective (Fujimoto et al., 2022). In contrast, our policy is trained directly via gradient descent on the total reward (potentially incorporating value function control variates). Thus, the objective for this penalty method is better aligned with maximizing total reward.
C.2 Computational efficiency
Consider solving the full nonlinear control problem via direct optimization over a finite horizon of T timesteps, which corresponds to a model predictive control (Rawlings & Mayne, 2013) formulation. How many actions must be selected? The number of possible flows for a fully dense graph (worst case) scales as |V|^2 per commodity. In addition, each node admits its own exchange actions; if we assume the number of exchange options is the same for all nodes, this contributes a further term proportional to |V|. Finally, each of the K commodities requires its own decisions. Thus, the worst-case number of actions to select grows on the order of T·K·|V|^2; it is evident that for even moderate choices of each variable, the complexity of action selection in our problem formulation quickly grows beyond tractability.
While moderately-sized problems may be tractable within the direct optimization setting, we aim to incorporate the impacts of stochasticity, nonlinearity, and uncertainty, which typically result in non-convexity. The reinforcement learning approach, in addition to being able to improve directly from data, reduces the number of actions required to those of a single step. If we were to directly parameterize a naive policy that outputs flows and exchanges, this would correspond to on the order of K·|V|^2 actions. For even moderate values of |V| and K, this can result in millions of actions. It is well known that reinforcement learning algorithms struggle with high-dimensional action spaces (Van de Wiele et al., 2020), and thus this approach is unlikely to be successful. In contrast, our bi-level formulation requires only on the order of K·|V| actions for the learned policy (one desired value per node and commodity), while additionally leveraging the beneficial inductive biases of optimization over short time horizons.
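As a rough, back-of-the-envelope illustration of these scaling arguments (the counting below is approximate and the constants are purely illustrative):

```python
# Rough illustration of action-space sizes; the formulas below are approximate.
num_nodes = 500      # |V|
num_commodities = 5  # K
horizon = 50         # T, planning horizon for direct optimization

flows_per_step = num_commodities * num_nodes ** 2   # dense graph, per-step flows
naive_policy_actions = flows_per_step               # direct edge-level policy: 1,250,000
direct_opt_actions = horizon * flows_per_step       # time-expanded program: 62,500,000
bilevel_actions = num_commodities * num_nodes       # one desired value per node: 2,500

print(naive_policy_actions, direct_opt_actions, bilevel_actions)
```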
Appendix D Additional Experiment Details
In this section, we provide additional details on the experimental set-up and hyperparameters. All RL modules were implemented using PyTorch (Paszke et al., 2019), and the optimization problems were solved using the IBM CPLEX solver (IBM, 1987).
D.1 Minimum Cost Flow
We start by describing the properties of the environments in Section D.1.1. We further expand the discussion on model implementation (Section D.1.2) and present additional results (Section D.1.3).
D.1.1 Environment details
We select environment variables so as to cover a sufficiently wide range of scenarios, e.g., different travel times and, thus, different optimal actions.
Generalities.
As discussed in Section 5, the environments describe a dynamic minimum cost flow problem, whereby the goal is to let commodities flow from source to sink nodes in the minimum time possible (i.e., cost is equal to time). Formally, given a graph G, the reward function across all environments is defined as R(s^t, a^t) = - ∑_(i,j) ∈ E f_ij^t t_ij + λ f_sink^t, where f_ij^t and t_ij represent the flow and travel time along edge (i,j) at time t, respectively, f_sink^t is the flow arriving at all sink nodes at time t, and λ is a weighting factor between the two reward terms. In our experiments, the resulting policy proved to be broadly insensitive to the value of λ, with a wide range of values performing comparably.
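For concreteness, the reward above can be computed from per-edge flows as in the following sketch (data structures and names are illustrative):

```python
def min_cost_flow_reward(flows, travel_times, sink_inflow, lam):
    """R(s^t, a^t) = - sum_{(i,j)} f_ij^t * t_ij + lam * f_sink^t.

    flows, travel_times: dicts keyed by edge (i, j); sink_inflow: total flow
    reaching the sink nodes at this step; lam: weighting factor.
    """
    transport_cost = sum(flows[edge] * travel_times[edge] for edge in flows)
    return -transport_cost + lam * sink_inflow
```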
2-hop, 3-hop, 4-hop.
Given a single-source, single-sink network, we assume travel times to be constant over the episode and requirements (i.e., demand) to be sampled randomly at each time step. Capacities are fixed to a very high positive number, thus not representing a constraint in practice. Cost is equal to the travel time. An episode has a duration of 30 time steps and terminates when there is no more flow traversing the network. To present a variety of scenarios to the agent at training time, we sample random travel times for each new episode and use the topologies shown in Fig. 6. In our experiments, we apply as many layers of message passing as there are hops from source to sink node in the graph, e.g., two and three layers in the 2-hop and 3-hop environments, respectively.
Dynamic travel times.
To train our MPNN-RL, we select the 3-hop environment and generate travel times as follows for every episode: (i) we sample random initial travel times, and (ii) at every time step, we gradually change the travel times.
Capacity constraints.
In this experiment, we focus on the 3-hop environment and assume a constant capacity on all edges, while keeping a high capacity on the edges entering the sink node, since tight capacities there would more easily generate infeasible scenarios. From an RL perspective, we add the following edge-level features:
• Edge capacity at the current time step.
• Accumulated flow on the edge.
Multi-commodity.
Let K define the number of commodities to consider, indexed by k. From an RL perspective, we extend the proposed policy to represent a K-dimensional Dirichlet distribution. Concretely, we define the output of the policy network to represent the concentration parameters of a Dirichlet distribution over nodes for each commodity. In other words, to extend our approach to the multi-commodity setting, we define a multi-head policy network characterized by one head per commodity. In our experiments, we train our multi-head agent on the topology shown in Fig. 10, whereby we assume two parallel commodities: commodity A going from node 0 to node 10, and commodity B going from node 0 to node 11. We choose this topology so that the only way to solve the scenario is to discover distinct behaviors between the two network heads (i.e., the policy head controlling flow for commodity A needs to go up or it won't get any reward, and vice-versa for commodity B).
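A minimal sketch of such a multi-head Dirichlet output, assuming per-node embeddings produced by a GNN encoder (layer sizes and names are illustrative):

```python
import torch
import torch.nn as nn
from torch.distributions import Dirichlet

class MultiCommodityDirichletHead(nn.Module):
    """One Dirichlet concentration per node, with one head per commodity."""

    def __init__(self, embed_dim, num_commodities):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(embed_dim, 1) for _ in range(num_commodities)]
        )
        self.softplus = nn.Softplus()  # ensures strictly positive concentrations

    def forward(self, node_embeddings):
        # node_embeddings: [num_nodes, embed_dim]
        dists = []
        for head in self.heads:
            alpha = self.softplus(head(node_embeddings)).squeeze(-1) + 1e-6  # [num_nodes]
            dists.append(Dirichlet(alpha))
        return dists  # one Dirichlet over nodes per commodity
```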
Computational analysis.
In this experiment, we generate different versions of the 3-hop environment, where the different environments are characterized by intermediate layers with an increasing number of nodes and edges. The results are computed by applying an MPNN-RL agent pre-trained on the original 3-hop environment (i.e., with 8 nodes in the graph) to these larger graphs. In light of this, Figure 2 showcases a promising degree of transfer and generalization across graphs of different dimensions.
D.1.2 Model implementation
In our experiments, we implement the following methods:
Randomized heuristics.
In this class of methods, we focus on measuring the performance of simple heuristics.
1. Random policy: at each timestep, we sample the desired next state from a Dirichlet prior with a fixed concentration parameter. This benchmark provides a lower bound on performance by choosing desired next states randomly.
Learning-based.
Within this class of methods, we focus on measuring how different architectures affect the quality of the solutions for the dynamic network control problem. For all methods, the A2C algorithm is kept fixed; thus, the differences lie solely in the neural network architecture.
2. MLP-RL: both the policy and the value function estimator are parametrized by feed-forward neural networks. In all our experiments, we use two layers of 32 hidden units and an output layer mapping to the output's support (e.g., a scalar value for the critic network). Through this comparison, we highlight the performance and flexibility of graph representations for network-structured data.
3. GCN-RL: in all our experiments, we use as many layers of graph convolution (each with 32 hidden units) as there are source-to-sink hops in the graph, followed by a linear output layer mapping to the output's support. See below for a broader discussion of graph convolution operators.
4. GAT-RL: in all our experiments, we use as many layers of graph attention (Veličković et al., 2018) as there are source-to-sink hops in the graph, each with 32 hidden units and a single attention head. The output is further computed by a linear output layer mapping to the output's support. Together with GCN-RL, this model represents an approach based on graph convolutions rather than explicit message passing along the edges (as in MPNNs). Through this comparison, we argue in favor of explicit, pair-wise messages along the edges, as opposed to the sole aggregation of node features within a neighborhood. Specifically, we argue in favor of the alignment between MPNNs and the kind of computations required to solve flow optimization tasks, e.g., the propagation of travel times and the selection of the best path among a set of candidates (max aggregation).
MPC-based.
Within this class of methods, we focus on measuring the performance of MPC approaches that serve as state-of-the-art benchmarks for the dynamic network flow problem.
6. Oracle: we directly optimize the flows using a standard formulation of MPC (Zhang et al., 2016). Notice that although the embedded optimization is a linear program, it may not meet the computational requirements of real-time applications (e.g., obtaining a solution within several seconds) for large-scale networks. In this work, the MPC is assumed to have access to future state elements (e.g., future travel times, connectivity, etc.). Crucially, assuming knowledge of future state elements is equivalent to assuming oracle knowledge of the realization of all stochastic elements in the system. In other words, there is no uncertainty for the MPC (in contrast with the RL-based benchmarks, which assume access only to current state elements). In our experiments, the "Oracle" MPC benchmark enables us to quantify the optimal solution for all environments, thus giving a sense of the optimality gap between the ground-truth optimum and the solution achieved via RL.
D.1.3 Additional results
Minimum cost flow through message passing.
In this first experiment, we consider 3 different environments (Fig. 6), such that different topologies enforce a different number of required hops of message passing between source and sink nodes to select the best path. Results in Table 1 (2-hop, 3-hop, 4-hop) show that MPNN-RL achieves performance close to that of the oracle. Table 1 further shows how agents based on graph convolutions (i.e., GCN, GAT) fail to learn an effective flow optimization strategy.
Dynamic travel times.
In many real-world systems, travel times evolve over time. To study this setting, in Fig. 7 and Table 1 (Dyn travel time) we measure results on a dynamic network characterized by two change-points, i.e., time steps at which the optimal path changes because of a change in travel times. Results show that the proposed MPNN-RL achieves performance close to that of the oracle.
Dynamic topology.
In real-world systems, operations are often characterized by time-dependent topologies, i.e., nodes and edges can be dropped or added during an episode, such as roadblocks in transportation systems or the opening of a new shipping center in supply chain networks. However, most traditional approaches cannot deal with these conditions easily. On the other hand, the locality of graph network-based agents, together with the one-step implicit planning of RL, enables our framework to deal with multiple time-varying graph configurations during the same episode. Fig. 8 and Table 1 (Dyn topology) show how MPNN-RL achieves 83.9% of oracle performance, clearly outperforming the other benchmarks. Crucially, these results highlight how agents based on MLPs result in highly inflexible network controllers that are limited to the same topology they were exposed to during training.
Capacity constraints.
Real-world systems are often represented as capacity-constrained networks. In this experiment, we relax the assumption that capacities are always able to accommodate any flow on the graph. Compared to previous sections, the lower capacities introduce the possibility of infeasible states. To measure this, the Success Rate computes the percentage of episodes that terminate successfully. Results in Table 1 (Capacity) highlight how MPNN-RL achieves 89.8% of oracle performance while successfully terminating 87% of episodes. Qualitatively, Fig. 9 shows a visualization of the policy for a specific test episode. The plots show how MPNN-RL learns the effects of capacity on the optimal strategy by allocating flow to a different node when the corresponding edge is approaching its capacity limit.
Multi-commodity.
Often, system operators might be interested in controlling multiple commodities over the same network. In this scenario, we extend the current architecture to deal with multiple commodities and source-sink combinations. Results in Table 1 (Multi-commodity) and Fig. 10 show how MPNN-RL effectively recovers distinct policies for each commodity, and is thus able to successfully operate multi-commodity flows within the same network.
D.2 Supply Chain Inventory Management
We start by describing the properties of the environments in Section D.2.1. We further expand the discussion on MDP definitions (Section D.2.2), model implementation (Section D.2.3), and specifics on the linear control problem (Section D.2.4).
D.2.1 Environment details
In our experiments, all stores are assumed to have an independent demand-generating process. We simulate a seasonal demand behavior by representing the demand as a co-sinusoidal function with a stochastic component, defined as follows:
(13)
where ⌊·⌋ denotes the floor function, d_max is the maximum demand value, the stochastic term is a uniformly distributed random variable, and T is the episode length.
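Since the exact expression in (13) is not reproduced here, the following generator is only an illustrative stand-in with the same qualitative behavior: a co-sinusoidal seasonal profile over the episode plus a uniform stochastic component, floored to an integer demand. The period, phase, and noise parameterization are assumptions for illustration.

```python
import numpy as np

def seasonal_demand(t, d_max, noise_width, episode_length, rng):
    """Illustrative co-sinusoidal demand with a uniform stochastic component.

    This mimics the qualitative shape described in the text; the exact
    expression used in the paper may differ. `noise_width` is treated here
    simply as the width of the uniform noise term.
    """
    seasonal = 0.5 * d_max * (1.0 + np.cos(2.0 * np.pi * t / episode_length))
    noise = rng.uniform(0.0, noise_width)
    return int(np.floor(seasonal + noise))

rng = np.random.default_rng(0)
demand_store_0 = [seasonal_demand(t, d_max=16, noise_width=2, episode_length=30, rng=rng)
                  for t in range(30)]
```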
Environment parameters are defined as follows:
The first scenario (one warehouse and two stores):

Parameter | Value | Parameter | Value
---|---|---|---
Maximum demand | [2, 16] | Storage cost | [3, 2, 1]
Demand variance | [2, 2] | Production cost | 5
Episode length | 30 | Backorder cost | 21
Production time | 1 | Transportation cost | [0.3, 0.6]
Travel time | [1, 1] | Price | 15
Storage capacity | [20, 9, 12] | |
The second scenario (one warehouse and three stores):

Parameter | Value | Parameter | Value
---|---|---|---
Maximum demand | [1, 5, 24] | Storage cost | [2, 1, 1]
Demand variance | [2, 2, 2] | Production cost | 5
Episode length | 30 | Backorder cost | 21
Production time | 1 | Transportation cost | [0.3, 0.3, 0.3]
Travel time | [1, 1, 1] | Price | 15
Storage capacity | [30, 15, 15, 15] | |
The third scenario (one warehouse and ten stores):

Parameter | Value | Parameter | Value
---|---|---|---
Maximum demand | [2, 2, 2, 2, 10, 10, 10, 18, 18, 18] | Storage cost |
Demand variance | | Production cost | 5
Episode length | 30 | Backorder cost | 21
Production time | 1 | Transportation cost |
Travel time | | Price | 15
Storage capacity | | |
D.2.2 MDP details
In what follows, we complement Section 5.2 with a formal definition of the SCIM MDP.
Reward R(s^t, a^t): we select the reward function of the MDP as the profit of the inventory manager, computed as the difference between revenues and the sum of storage, production, transportation, and backorder costs:
(14)
State space (S): the state space contains the information needed to describe the current status of the supply network, via the definition of node and edge features. Node features contain information on (i) the current inventory, (ii) the current and estimated demand over a fixed look-ahead horizon, (iii) the incoming flow over the same horizon, and (iv) the incoming orders over the same horizon. Edge features are represented by the concatenation of (i) the travel time and (ii) the transportation cost.
D.2.3 Model implementation
In what follows, we provide additional details for the implemented baselines and models:
Randomized heuristics.
In this class of methods, we focus on measuring the performance of simple heuristics.
1. Avg. Prod: at each timestep, we (1) select production to be the average episode demand across all stores, and (2) sample the desired distribution from a Dirichlet prior with a fixed concentration parameter to simulate random shipping behavior.
Domain-driven heuristics.
Within this class of methods we measure the performance of heuristics generally accepted as effective baselines.
2. S-type Policy: also referred to as an "order-up-to" policy, this heuristic is parametrized by two values: a warehouse order-up-to level and a store order-up-to level. At a high level, at each time step the inventory manager orders inventory such that the inventory at, and expected to arrive at, the warehouse and the stores equals the warehouse and store order-up-to levels, respectively. Concretely, we fine-tune the S-type policy on each environment individually by running an exhaustive search for the best order-up-to levels, as shown in Figure 11 and as sketched below.
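A compact sketch of the order-up-to rule and of the exhaustive search over its two levels; the inventory-position bookkeeping is simplified and the function names are illustrative.

```python
import itertools

def order_up_to_actions(inv_warehouse, inv_stores, pipeline_w, pipeline_s, S_w, S_s):
    """Order quantities that bring each inventory position up to its target level."""
    # Warehouse re-order: top the warehouse position (on-hand + in-transit) up to S_w.
    production_order = max(S_w - (inv_warehouse + pipeline_w), 0)
    # Store shipments: top each store position up to S_s.
    shipments = [max(S_s - (inv + pipe), 0) for inv, pipe in zip(inv_stores, pipeline_s)]
    return production_order, shipments

def tune_order_up_to(evaluate, warehouse_levels, store_levels):
    """Exhaustive grid search over (S_w, S_s); `evaluate` returns mean episode profit."""
    return max(itertools.product(warehouse_levels, store_levels),
               key=lambda levels: evaluate(*levels))
```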
Learning-based approaches.
3. End-to-end RL: with this benchmark, we evaluate the performance of RL architectures that do not approach the problem via the proposed bi-level formulation. Specifically, as traditionally done in RL, we define the policy network as a direct mapping from states to environment actions. In our experiments, both the policy and the value function estimator are parametrized by feed-forward neural networks with two layers of 64 hidden units, followed by linear layers mapping to either (i) the mean and standard deviation parameters for the policy network, or (ii) a scalar value function estimate for the critic. Across the three scenarios, we adjust the input layer based on the input dimensionality (which is topology-dependent, since we unroll all node and edge features into a vector representation of the graph). Through this comparison, we highlight the benefits of the bi-level formulation for graph control problems.
D.2.4 LCP formulation
Given a desired next state described by (i) the desired production at the warehouse nodes and (ii) the desired inventory at the store nodes, we define the following linear control problem:
minimize   (15a)   [distance metric to the desired next state]
subject to (15b)-(15g)   [constraints described below]
where, as introduced in Section 4, the objective function (15a) represents the distance metric penalizing the deviation from the desired next state; constraint (15b) ensures that the total incoming flow at store nodes is as close as possible to the desired inventory; constraint (15c) ensures that the inventory at store nodes after shipping and demand satisfaction does not exceed storage capacity; constraint (15d) ensures that the shipped products are upper-bounded by the available inventory; constraint (15e) ensures that the inventory at warehouse nodes after shipping and re-ordering does not exceed storage capacity; constraint (15f) ensures that orders from manufacturers are close to the desired orders specified through RL; and, lastly, (15g) ensures that commodity flows are non-negative.
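As the symbols in (15) are not reproduced above, the following cvxpy sketch conveys the structure of the lower-level problem under simplifying assumptions (a single warehouse shipping directly to the stores, with the closeness requirements of (15b) and (15f) expressed as soft penalties in the objective); the implementation used in the paper relies on CPLEX and may differ in its details.

```python
import cvxpy as cp

def scim_lcp(desired_inventory, desired_production, inv_stores, inv_warehouse,
             demand, store_capacity, warehouse_capacity):
    """Illustrative lower-level problem: match the RL-specified desired next state."""
    n = len(inv_stores)
    ship = cp.Variable(n, nonneg=True)   # shipments warehouse -> stores, cf. (15g)
    order = cp.Variable(nonneg=True)     # production / re-order at the warehouse

    next_store_inv = inv_stores + ship - demand
    next_warehouse_inv = inv_warehouse - cp.sum(ship) + order

    objective = cp.Minimize(
        cp.sum(cp.abs(next_store_inv - desired_inventory))   # soft version of (15a)/(15b)
        + cp.abs(order - desired_production)                 # soft version of (15f)
    )
    constraints = [
        next_store_inv <= store_capacity,          # cf. (15c)
        cp.sum(ship) <= inv_warehouse,             # cf. (15d)
        next_warehouse_inv <= warehouse_capacity,  # cf. (15e)
    ]
    cp.Problem(objective, constraints).solve()
    return ship.value, order.value
```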
D.3 Dynamic Vehicle Routing
We start by describing the properties of the environments in Section D.3.1. We further expand the discussion on MDP definition (Section D.3.2), model implementation (Section D.3.3), specifics on the linear control problem (Section D.3.4), and additional results (Section D.3.5).
D.3.1 Environment details
We use two case studies from the cities of New York, USA, and Shenzhen, China, whereby we study a hypothetical deployment of taxi-like systems to serve the peak-time commute demand in popular areas of Brooklyn and Shenzhen, respectively. The cities are divided into geographical areas, each of which represents a station. The case studies in our experiments are generated using trip record datasets, which we provide together with our codebase. The trip records are converted to demand, travel times, and trip prices between stations. Here, we consider stochastic time-varying demand patterns, whereby customer arrival is assumed to be a time-dependent Poisson process, and the Poisson rates are aggregated from the trip record data every 3 minutes. We assume the stations to be spatially connected, whereby moving vehicles from one station to the other requires non-trivial sequential actions (i.e., vehicles cannot directly be repositioned from one station to any other station, rather they have to adhere to the available paths given by the city’s topology).
A few remarks are in order. First, we assume travel times are given and independent of operator actions. This assumption applies to cities where the fleet constitutes a relatively small proportion of the entire vehicle population on the transportation network, and thus the impact on traffic congestion is marginal. This assumption can be relaxed by training the proposed RL model in an environment that considers the endogenous congestion caused by the controlled vehicle fleet. Second, without loss of generality, we assume that the arrival process of passengers for each origin-destination pair is a time-dependent Poisson process. We further assume that such a process is independent of the arrival processes of other origin-destination pairs and of the coordination of vehicles. These assumptions are commonly used to model transportation requests (Daganzo, 1978).
D.3.2 MDP details
In what follows, we complement Section 5.3 with a formal definition of the DVR MDP.
Reward R(s^t, a^t): we select the reward function of the MDP as the operator profit, computed as the difference between revenues and operation-related costs:
(16)
State space (S): the state space contains the information needed to describe the current status of the transportation network via the definition of node features. Node features contain information on (i) the current availability of idle vehicles in each station, (ii) the current and estimated demand over a fixed look-ahead horizon, (iii) the projected vehicle availability over the same horizon, and (iv) provider-level information such as trip price and cost.
D.3.3 Model implementation
In what follows, we provide additional details for the implemented baselines and models:
Randomized heuristics.
In this class of methods, we focus on measuring the performance of simple heuristics.
1. Random: at each timestep, we sample the desired distribution from a Dirichlet prior with a fixed concentration parameter. This benchmark provides a lower bound on performance by choosing desired next states randomly.
Domain-driven heuristics.
Within this class of methods, we measure the performance of heuristics generally accepted as reasonable baselines.
2. Equally-balanced System: at each decision step, we take rebalancing actions so as to recover an equal distribution of idle vehicles across all areas of the transportation network. Concretely, the heuristic achieves this by solving the DVR LCP with a fixed desired number of idle vehicles for all stations, i.e., given the total number of available vehicles at time t, each station is assigned an equal share of the fleet.
Learning-based approaches.
3. End-to-end RL: both the policy and the value function estimator are parametrized by neural networks that mirror the architecture of Graph-RL. While the critic has the exact same architecture, the actor differs in the last layer, which is characterized by an edge convolution (consisting of 2 linear layers of 32 hidden units) that outputs the mean and standard deviation parameters of a Gaussian policy for each edge in the graph.
4. Graph-RL (ours): for both actor and critic networks, we use one layer of graph convolution (Kipf & Welling, 2017) with 32 hidden units and a sum aggregation function, as defined in Section B.1, followed by 2 linear layers of 32 hidden units and a final linear layer mapping to the respective output's support.
D.3.4 LCP formulation
Given a desired next state described by the desired number of idle vehicles across stations, we define the following linear control problem:
minimize   (17a)   [rebalancing cost]
subject to (17b)-(17d)   [constraints described below]
where the objective function (17a) represents the rebalancing cost, constraint (17b) ensures that the resulting number of vehicles is close to the desired number of vehicles, and constraints (17c) and (17d) ensure that the total rebalancing flow out of a region is upper-bounded by the number of idle vehicles in that region and that flows are non-negative.
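Analogously to the sketch for (15), the following is an illustrative cvxpy version of the rebalancing problem (17), with the closeness requirement of (17b) expressed as a soft penalty in the objective; the edge structure, cost vectors, and names are assumptions for illustration.

```python
import cvxpy as cp
import numpy as np

def dvr_lcp(desired_idle, idle, edges, rebalancing_cost):
    """Illustrative rebalancing LCP: move idle vehicles toward the desired distribution.

    desired_idle, idle: per-station vehicle counts (arrays of length n);
    edges: list of (i, j) station pairs along which rebalancing is allowed;
    rebalancing_cost: array of per-edge rebalancing costs (same order as edges).
    """
    n, m = len(idle), len(edges)
    out_inc = np.zeros((n, m))  # out_inc[s, e] = 1 if edge e leaves station s
    in_inc = np.zeros((n, m))   # in_inc[s, e] = 1 if edge e enters station s
    for e, (i, j) in enumerate(edges):
        out_inc[i, e] = 1.0
        in_inc[j, e] = 1.0

    flow = cp.Variable(m, nonneg=True)                       # cf. (17d)
    next_idle = idle - out_inc @ flow + in_inc @ flow

    objective = cp.Minimize(
        rebalancing_cost @ flow                               # cf. (17a) rebalancing cost
        + cp.sum(cp.abs(next_idle - desired_idle))            # soft version of (17b)
    )
    constraints = [out_inc @ flow <= idle]                    # cf. (17c)
    cp.Problem(objective, constraints).solve()
    return flow.value
```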
D.3.5 Additional results
(Figure 5: training curves comparing the bi-level Graph-RL agent with the end-to-end RL baseline.)
Results in Figure 5 highlight the sample efficiency of our bi-level approach compared to its end-to-end counterpart, which exhibits (i) much slower convergence and sample inefficiency, and (ii) worse overall performance.
Appendix E Additional Visualizations