
Exploiting Instance and Variable Similarity to Improve Learning-Enhanced Branching

Xiaoyi Gu xiaoyigu@gatech.edu, H. Milton Stewart School of Industrial & Systems Engineering, Georgia Institute of Technology, Atlanta, GA 30332. Santanu S. Dey santanu.dey@isye.gatech.edu, H. Milton Stewart School of Industrial & Systems Engineering, Georgia Institute of Technology, Atlanta, GA 30332. Álinson S. Xavier axavier@anl.gov, Energy Systems Division, Argonne National Laboratory Feng Qiu fqiu@anl.gov, Energy Systems Division, Argonne National Laboratory
(Aug 21, 2022)
Abstract

In many operational applications, it is necessary to routinely find, within a very limited time window, provably good solutions to challenging mixed-integer linear programming (MILP) problems. An example is the Security-Constrained Unit Commitment (SCUC) problem, solved daily to clear the day-ahead electricity markets. Previous research demonstrated that machine learning (ML) methods can produce high-quality heuristic solutions to combinatorial problems, but proving the optimality of these solutions, even with recently-proposed learning-enhanced branching methods, can still be time-consuming. In this paper, we propose a simple modification to improve the performance of learning-enhanced branching methods based on the key observation that, in such operational applications, instances are highly similar to one another. Specifically, instances typically share the same size and problem structure, with slight differences only in matrix coefficients, right-hand sides and objective functions. In addition, certain groups of variables within a given instance are also typically similar to each other. Therefore, unlike previous works in the literature, which predicted all branching scores with a single ML model, we propose training separate ML models per variable or per group of variables, based on their similarity. We evaluate this enhancement on realistic large-scale SCUC instances and obtain significantly better gap closures than previous works with the same amount of training data.

1 Introduction

Many practical optimization problems related to business or industrial operations require the solution of challenging Mixed-Integer Linear Programming (MILP) problems on a regular basis. Unlike MILPs related to long-term planning, these operational problems typically need to be solved within a limited time window, since problem data only becomes available a few hours or even minutes before the solution must be implemented in the real world; they therefore pose great challenges even to state-of-the-art MILP solvers. An example is the Security-Constrained Unit Commitment (SCUC) problem, solved daily by Independent System Operators to clear the day-ahead electricity markets. Because operators only have 3 to 4 hours to clear the markets after receiving bids and offers from market participants, SCUC needs to be solved within 15 to 30 minutes [10]. Another example is Optimal Transmission Switching (OTS) [9, 20, 32, 29], which aims to optimize the topology of the power grid. To be used in real-time operations, OTS needs to be solved every 15 minutes, within approximately 5 minutes, to allow time for feasibility checks and other post-processing.

Previous research has demonstrated that machine learning (ML) methods can quickly produce high-quality heuristic solutions to a variety of challenging combinatorial problems, including some operational problems. For example, [29] uses k-nearest neighbors to predict solutions for real-time DC optimal transmission switching; [40] applies deep learning to construct solutions to the newsvendor problem; [44] uses machine learning to construct partial solutions to the security-constrained unit commitment problem; and [34] uses deep learning to predict solutions to stochastic load planning problems. Proving the optimality of these heuristic solutions, however, may still require the exploration of large branch-and-bound trees. Motivated by this, previous research has also shown that ML methods can help MILP solvers to explore the branch-and-bound tree more efficiently through learning-enhanced branching rules. For example, [3] approximates strong branching decisions using supervised learning and extremely-randomized trees; [31] models the variable branching task as an online learning-to-rank problem; [23] mimics strong branching decisions using graph convolutional networks and a variable-constraint bipartite graph; [39] uses deep learning to mimic a variant of full strong branching that scales to large instances using GPUs; [42] and [48] use reinforcement learning to guide branching decisions.

In this paper, we propose a simple enhancement to improve the performance of learning-enhanced branching methods based on the key observation that, in operational problems, instances that are solved on a regular basis share significant similarities with each other. Specifically, we routinely need to solve problems of the form

$$\begin{array}{rl}\textup{minimize} & c_{x}^{T}x+c_{y}^{T}y\\ \textup{subject to} & Ax+Gy\geq b\\ & x\in\{0,1\}^{n}\\ & y\in\mathbb{R}^{m}\end{array} \qquad (1)$$

where the matrices $A$ and $G$, and the vectors $c$ and $b$, only differ slightly across instances. The identity of each decision variable, therefore, remains the same across instances. In addition, certain variables within a given instance can be naturally grouped together, based on their physical representation, and these groups remain the same across multiple instances. For example, SCUC includes groups of decision variables corresponding to the same generator, and groups of variables corresponding to the same operation, such as switching on a generator. We propose, therefore, training separate ML models per variable or per group of variables, based on their meaning and similarities. This is in contrast with previous works, which used a single ML model to predict branching decisions for all variables. While sophisticated machine learning models could potentially learn these groupings automatically, given enough training data, we stress that generating training data for learning-enhanced branching methods can be extremely time-consuming, especially for large-scale MILP instances, as highlighted in previous works [39]. Furthermore, the groupings we propose are straightforward to identify and require negligible manual labeling effort.

To evaluate the proposed enhancement, we modify the supervised learning method proposed by [3] and perform comprehensive benchmarks on realistic large-scale SCUC instances. We carefully evaluate the impact of various groupings, including “per-variable”, “per-generator”, “per-time” and “per-name”, on ML accuracy scores, as well as on MIP gap closure. Our experiments show that variable grouping can significantly improve the performance of the original method, bringing it much closer to strong branching, while using exactly the same training data set.

The rest of the paper is organized as follows. In Section 2, we review SCUC and existing learning-enhanced branching rules. In Section 3, we explain our proposed branching schemes in further detail. In Section 4, we describe our experimental setup, including implementation details, training data generation and hyperparameters. In Section 5 we present our experimental results, and finally in Section 6 we discuss some conclusions and future directions of research.

2 Background

In this section, we review the Security-Constrained Unit Commitment problem, the branch-and-bound algorithm and previously-proposed learning-enhanced branching rules.

2.1 Security-Constrained Unit Commitment

Security-Constrained Unit Commitment (SCUC) is a fundamental optimization problem solved daily by power system operators. It seeks to minimize electricity production costs by finding the most effective generator commitment schedule and power output levels, and it is widely used in power system applications such as electricity market clearing and reliability assessment. SCUC is a generalization of the unit commitment problem ([12]) and is typically modeled in industrial practice as a large-scale Mixed-Integer Linear Programming (MILP) problem. Numerous MILP formulations for the problem have been proposed (see, for example, [22, 41, 38]). Besides being difficult in the theoretical sense (see [8] for the NP-hardness of SCUC), SCUC is made even more challenging by the fact that a new near-optimal solution must be obtained every day within a very limited time window.

2.2 Branch-and-Bound Algorithm

The branch-and-bound algorithm was proposed in the landmark paper [33] to provide a finite algorithm to solve MILPs, and today it is at the heart of all state-of-the-art MILP solvers. To completely define the algorithm, it is necessary to specify a variable selection rule, which decides how to partition the problem into multiple subproblems, and a node selection rule, which defines what subproblem to process next ([43, 13]). On the node selection side, it is well-known that using the best-bound rule produces the smallest trees for realistic benchmark instances ([1]). Recently, [16] also showed that, for certain classes of random integer programs, using this rule guarantees a polynomial-size branch-and-bound tree with high probability. The more challenging rule to optimize is the variable selection rule. Negative results due to [28, 11, 14, 15, 7, 17, 19] show that, no matter what variable selection rule is used, the branch-and-bound tree is of exponential size in the worst case. In practice, the variable selection rule has a huge influence on the size of the branch-and-bound tree. Today, it is well-established that strong branching ([6]) typically produces the smallest trees for a variety of optimization problems. Although this rule produces high-quality decisions ([18]), it is extremely computationally demanding and often prohibitive to implement ([2]). See [35, 36] for a nice overview of computational issues with the branch-and-bound algorithm. Since strong branching produces high-quality decisions, but at a high computational cost, multiple works in the literature have attempted to mimic it using machine learning methods [31, 4, 23, 47, 25, 39]. See [37] for a review of this direction of research.

2.3 Non-ML Branching Methods

Most traditional non-ML branching methods can be generally modelled as a score function $S(i,I)$, where $i$ represents the index of the candidate variable and $I$ represents general information about current and previous nodes. Given this function, the candidate variable chosen for branching is the one with the best (highest) score.

Most-infeasible branching (MIB) is one of the simplest branching strategies; it uses the score function $S_{MIB}(i,I)=\min\{x_{i},1-x_{i}\}$, the infeasibility or “fractionality” of the candidate variable at the current node. While MIB is extremely cheap to evaluate, it is unfortunately known to perform poorly compared to other branching methods. Reliability branching (RB:$\lambda$:$\eta$) is the state-of-the-art branching scheme with respect to running time ([2]). The method relies on probing, during which it actually solves the linear relaxations of both the upward and downward branches of a fractional variable. More specifically, at each node, the candidate fractional variables are sorted by pseudocost, then at most $\lambda$ variables are probed. If a given variable has already been probed $\eta$ times during the entire execution of the algorithm, its pseudocosts are deemed reliable, and no further probes are performed for that variable. The score function is usually $S(i,I)=\max\{\tilde{\Delta}_{i}^{-},\epsilon\}\cdot\max\{\tilde{\Delta}_{i}^{+},\epsilon\}$, where $\tilde{\Delta}_{i}^{\pm}$ are the objective increases if the variable is probed, or the pseudocost estimates otherwise.

The scheme covers RB:inf:inf, so-called full strong branching, in which every fractional variable is probed, and which has been shown in numerous studies (e.g. [2]) to generate the smallest branch-and-bound trees among common rules. However, since each probed variable requires two linear relaxations to be solved, this scheme is computationally prohibitive for large-scale problems. On the other hand, reduced versions of RB:$\lambda$:$\eta$ with smaller values of $\lambda$ and $\eta$ can achieve good branching decisions while being considerably faster. Considering the size of our problems, we use RB:100:inf as the practical strong branching oracle.
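As an illustration of the two score functions above, the following sketch (in Python, with illustrative names and data that do not come from any solver API) computes the MIB score and the pseudocost-product score used by reliability branching.

```python
# Illustrative sketch of the MIB and pseudocost-product score functions;
# variable names and the candidate data below are placeholders.
import math

EPSILON = 1e-6  # small constant guarding against zero pseudocosts

def score_mib(x_i: float) -> float:
    """Most-infeasible branching: fractionality of the candidate variable."""
    frac = x_i - math.floor(x_i)
    return min(frac, 1.0 - frac)

def score_pseudocost_product(delta_down: float, delta_up: float) -> float:
    """Product score used by reliability branching, where delta_down and
    delta_up are the probed (or pseudocost-estimated) objective increases
    of the downward and upward branches."""
    return max(delta_down, EPSILON) * max(delta_up, EPSILON)

# Branch on the candidate with the highest score.
lp_values = {"is_on[7,13]": 0.42, "switch_on[3,5]": 0.97}
branch_var = max(lp_values, key=lambda v: score_mib(lp_values[v]))
```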

2.4 Previous ML Branching Methods

In this work, we evaluate our proposed ML scheme against the method described in [4], denoted as ML:ET. In general, ML:ET can be formulated as

$$S_{ML:ET}(i,I)=f_{ML:ET}(\phi_{i})\approx S_{oracle}(i,I),$$

where $\phi_{i}$ is the feature vector built from the available information $I$, and $f_{ML:ET}$ is the learned machine-learning model, which is used to mimic a good branching oracle like strong branching. A wide variety of handcrafted features, including static problem features, dynamic problem features and dynamic optimization features, are used. Training problem instances are solved using strong branching, without any heuristics or cut generation, to collect strong branching scores, and then Extremely Randomized Trees (ExtraTrees, [24]) are used to approximate them.
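As a rough illustration (not the original implementation of [4]), ML:ET amounts to fitting a single ExtraTrees regressor to pairs of feature vectors and strong branching scores, then scoring all candidates at a node with that one model; the shapes and random data below are placeholders.

```python
# Minimal sketch of ML:ET: one shared regressor approximating the strong
# branching score from a handcrafted feature vector.  Data is synthetic.
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

phi = np.random.rand(5000, 30)       # one feature vector per strong branching call
sb_scores = np.random.rand(5000)     # corresponding oracle scores

f_ml_et = ExtraTreesRegressor(n_estimators=25, random_state=0).fit(phi, sb_scores)

# At a node, score every fractional candidate and branch on the best one.
candidate_features = np.random.rand(12, 30)
branch_index = int(np.argmax(f_ml_et.predict(candidate_features)))
```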

3 Proposed Per-Variable and Per-Group ML Methods

The per-variable ML scheme we propose, denoted by ML:PV, is similar to ML:ET, with the key difference that one model is built for each variable. In general, our scheme can be formulated as

$$S_{ML:PV}(i,I)=f^{i}_{ML:PV}(\phi_{i})\approx S_{oracle}(i,I).$$

While the idea applies to any regressor, we use ExtraTrees to keep it consistent with ML:ET. However, since the size of the training dataset for each per-variable model is significantly smaller, the hyperparameters need to be modified to avoid over-fitting. Specifically, smaller trees need to be generated.

We also propose a per-group ML scheme, denoted by ML:PG, which is similar to the per-variable scheme, but trains one model per variable group. It is formulated as

$$S_{ML:PG}(i,I)=f^{g(i)}_{ML:PG}(\phi_{i})\approx S_{oracle}(i,I),$$

where $g(i)$ is the group to which the $i$-th column belongs. Similarly, we use ExtraTrees as the regressor and we tune the hyperparameters individually, for each group of variables, to avoid over-fitting.

Note that both ML:ET and ML:PV could be viewed as extreme cases of per-group settings, since ML:ET is basically grouping all variables into one group, while ML:PV is grouping every variable into its own group. We anticipate more accurate predictions with finer grouping, at the cost of reduced model generality.
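The sketch below illustrates this family of schemes for a generic group function g: returning a constant key recovers ML:ET, returning the variable name itself recovers ML:PV, and coarser keys give the per-group variants. Data, names and hyperparameters are illustrative only.

```python
# Per-group training sketch: one ExtraTrees regressor per group key g(name).
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def g(variable_name: str) -> str:
    # One group per variable (ML:PV); return a constant to recover ML:ET.
    return variable_name

# Each record: (variable name, feature vector, strong branching score).
training_data = [("is_on[1,5]", np.random.rand(30), 0.7) for _ in range(200)]

by_group = {}
for name, features, score in training_data:
    by_group.setdefault(g(name), []).append((features, score))

models = {}
for key, rows in by_group.items():
    X = np.vstack([f for f, _ in rows])
    y = np.array([s for _, s in rows])
    # Smaller trees than ML:ET, since each group sees far less data.
    models[key] = ExtraTreesRegressor(
        n_estimators=25, max_depth=12, min_samples_split=5).fit(X, y)

def predict_score(name, features):
    return models[g(name)].predict(features.reshape(1, -1))[0]
```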

4 Experimental Setup

In this section, we describe our branch-and-bound implementation, computational environment, benchmark instances and training data generation.

4.1 Implementation

To ensure consistent training data and reproducible experimental results, in this work we implemented, in the Julia programming language, a textbook version of the branch-and-bound algorithm, as well as multiple variable branching rules and node selection rules. While cutting planes can significantly enhance the performance of textbook branch-and-bound methods, in this work we did not consider their usage, as they make the analysis of branching rules significantly more difficult. The algorithm was also given the optimal primal value, to eliminate the influence of primal heuristics. The implementation relies on an external LP solver, accessed through MathOptInterface, to process node relaxations and evaluate strong branching decisions. In our experiments, we used best-bound with plunging for node selection and Gurobi 9.0 ([26]) as the LP solver. The branch-and-bound implementation has been made available as part of the open-source package MIPLearn ([45]). MIPLearn was also used to compute variable features. Machine learning models were implemented in Python 3.9 with scikit-learn, and PyCall.jl was used to call the ML models from Julia.

4.2 Environment

Both training and testing were run in parallel on the high-performance computing cluster of ISyE, Georgia Tech, which contains roughly 2,340 x86-64 cores and over 28.9 TB of memory spread across its systems; each task was processed on a dedicated core with 8 GB of memory. A time limit of one week (604,800 seconds) was imposed when solving all MILPs, though it was never reached, since all runs finished within the allotted time. In all runs, the branch-and-bound algorithm was also configured to terminate when a 0.01% relative gap was reached.

4.3 Instances

Five realistic large-scale security-constrained unit commitment instances from UnitCommitment.jl [46], corresponding to five European transmission networks from MATPOWER ([49, 21, 30]), were used to evaluate the performance of the ML branching rules. For each network, 50 instance variations were generated by applying the randomization algorithm described in [44], giving us 250 instance variations in total. For each network, the first 40 instance variations were used for training, while the remaining 10 were used for testing. The sizes of the resulting instances are displayed in Table 1. We highlight that the instances used in our experiments are quite large, with up to 400,000 decision variables and 350,000 constraints. To the best of our knowledge, most papers in the learning-to-branch literature have not dealt with such large instances, with the notable exception of [39].

Table 1: Size of Networks
Network Hours Generators Buses Lines Variables Rows Binaries
case1888rte 24 296 1,888 2,531 235,591 196,783 41,232
case1951rte 24 390 1,951 2,596 266,088 244,220 54,144
case2848rte 24 544 2,848 3,776 377,760 340,228 71,904
case3012wp 24 496 3,012 3,572 357,146 305,076 59,712
case3375wp 24 590 3,374 4,161 413,161 357,065 71,856

The package UnitCommitment.jl was also used to construct the MILP, using a state-of-the-art formulation of the problem. In the following, we list descriptions of all binary variables in the formulated MILP, as they are of particular interest for both the per-variable and per-group approaches. We refer to the package documentation for more details:

  • is_on[g,t]: True if generator g is on at time t.

  • switch_on[g,t]: True if generator g switches on at time t.

  • switch_off[g,t]: True if generator g switches off at time t.

  • startup[g,t,s]: True if generator g switches on at time t incurring start-up costs from start-up category s.

4.4 Training dataset

To generate the training dataset, we solved each training instance using RB:100:inf and, for each strong branching call made by the algorithm, we collected features describing the evaluated variable, as well as the logarithm of the computed strong branching score (since the score spans several orders of magnitude). We enforced a node limit of 1,000, a time limit of $1.2\times 10^{6}$ seconds (roughly two weeks, due to server limitations) and a relative gap limit of 0.01%. The algorithm stopped when any of these limits was reached.
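For illustration, the label transformation might look as follows (a sketch with placeholder numbers); since the logarithm is monotone, candidates can also be ranked directly on the predicted log-scores, without mapping back.

```python
# Strong branching scores span several orders of magnitude, so the
# regressors are trained on their logarithm.  Values are placeholders.
import numpy as np

raw_scores = np.array([3.2e-2, 1.7e1, 4.9e4])  # scores collected from RB:100:inf
labels = np.log(raw_scores)                     # regression targets

def to_score(predicted_log_score: float) -> float:
    return float(np.exp(predicted_log_score))   # only needed for reporting
```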

4.5 ML models

For each network, one model was trained for ML:ET, while a series of models were trained for the per-variable and per-group schemes. To the best of our ability and understanding, ML:ET is an accurate implementation of the method described in [4]. In particular, we used the same set of features and the same ML regressors. We also performed hyperparameter tuning, using Scikit-Learn’s GridSearchCV, to improve the model’s performance and to manage its memory requirements. Our final set of hyperparameters for ML:ET was min_samples_split=10, max_depth=25 and n_estimators=25.
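A sketch of this tuning step is shown below, assuming the feature matrix and log-score labels of Subsection 4.4 have already been assembled; the parameter grid simply lists the values mentioned in this section and is not necessarily the exact grid we searched.

```python
# Hyperparameter search for the ExtraTrees regressor via GridSearchCV.
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV

phi = np.random.rand(2000, 30)      # placeholder feature matrix
labels = np.random.rand(2000)       # placeholder log-scores

param_grid = {
    "min_samples_split": [5, 8, 10],
    "max_depth": [12, 16, 25],
    "n_estimators": [25],
}
search = GridSearchCV(ExtraTreesRegressor(random_state=0), param_grid,
                      scoring="neg_mean_squared_error", cv=5)
search.fit(phi, labels)
print(search.best_params_)
```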

For the per-group scheme, we chose three different grouping methods:

  • Per-generator (ML:PGE): Two variables are grouped together if they have the same base name (e.g. is_on, switch_on, switch_off), the same generator index g and the same startup category s. Each group, therefore, consists of 24 variables, corresponding to different values of time t.

  • Per-time (ML:PTI): Variables are grouped together if they have the same base name, the same time t and the same startup category s. We allow, therefore, grouping of variables corresponding to different generators.

  • Per-name (ML:PNA): Variables are grouped together if they have the same base name and the same startup category s. Since the benchmark instances have three different startup categories, we have exactly 6 groups in total. This grouping strategy was the closest to ML:ET in the number of models.

As described in Section 3, we also used ExtraTrees as the regressor for the per-group strategies, but with a different set of hyperparameters, found through Scikit-Learn’s GridSearchCV, to avoid over-fitting. For all regressors, we used n_estimators=25. For the per-variable ML:PV and per-generator ML:PGE schemes, we selected min_samples_split=5 and max_depth=12. For per-time ML:PTI, we used min_samples_split=8 and max_depth=12. For per-name ML:PNA, we used min_samples_split=8 and max_depth=16.

In Table 2, we display the number of groups for each grouping strategy and for each network. Note that these numbers indicate the maximum number of ML models trained; if the strong branching routine is never called during training for any variable in a given group, or if it is called fewer than 10 times, then not enough training data is available for that group, and therefore no ML model is trained. We recall that the strong branching routine is not called for a given variable if either the variable is already integral at the node, or if the maximum number of strong branching evaluations has been reached. The actual number of generated models is displayed in Table 3. We noticed that the difference between the number of generated models and the theoretical limits becomes more prominent as the groups become finer, which is expected. At inference time, if a per-variable or per-group model is not available, we fall back to the general-purpose ML:ET model.
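The sketch below illustrates, assuming variable names follow the patterns listed in Subsection 4.3 (e.g. is_on[g,t] and startup[g,t,s]), how the group keys for each strategy and the fallback to the shared ML:ET model could be implemented.

```python
# Group-key derivation and fallback lookup; the name format is an assumption
# based on the binary variables listed in Subsection 4.3.
import re

def parse(name):
    base, indices = re.match(r"(\w+)\[([\d,]+)\]", name).groups()
    idx = indices.split(",")
    g, t = idx[0], idx[1]
    s = idx[2] if len(idx) > 2 else None   # startup category, if present
    return base, g, t, s

def group_key(name, strategy):
    base, g, t, s = parse(name)
    if strategy == "PNA":                  # per-name
        return (base, s)
    if strategy == "PTI":                  # per-time
        return (base, t, s)
    if strategy == "PGE":                  # per-generator
        return (base, g, s)
    return name                            # "PV": per-variable

def predict(name, features, models, fallback_model, strategy):
    # Fall back to the general-purpose ML:ET model when no per-group model
    # exists (fewer than 10 strong branching calls for that group).
    model = models.get(group_key(name, strategy), fallback_model)
    return model.predict(features.reshape(1, -1))[0]
```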

Table 2: Maximum number of ML models for different grouping strategies.
Network ML:ET ML:PNA ML:PTI ML:PGE ML:PV
case1888rte 1 6 144 1,718 41,232
case1951rte 1 6 144 2,256 54,144
case2848rte 1 6 144 2,996 71,904
case3012wp 1 6 144 2,488 59,712
case3375wp 1 6 144 2,994 71,856
Table 3: Actual number of ML models for different grouping strategies.
Network ML:ET ML:PNA ML:PTI ML:PGE ML:PV
case1888rte 1 6 130 488 3,156
case1951rte 1 6 123 406 2,957
case2848rte 1 6 129 502 3,076
case3012wp 1 6 127 479 3,053
case3375wp 1 6 126 703 3,710

5 Evaluation

In this section, we evaluate the performance of different grouping strategies on SCUC problems. In Subsection 5.1, we start by evaluating the accuracy of different ML models in predicting strong branching scores. In Subsections 5.2 and 5.3, we evaluate their impact on MIP gap closure after 1,000 and 10,000 branch-and-bound nodes, respectively. Finally, in Subsection 5.4, we evaluate the impact of presolve on the effectiveness of different methods.

5.1 Model accuracy

To measure the accuracy of different grouping strategies, we used $k$-fold cross-validation, a standard approach to test the quality of ML classifiers and regressors. The training data was split into $k$ parts (folds), then $k$ estimators were trained on data from $k-1$ folds and their performance was evaluated on the remaining fold. See [27] for details.
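A minimal sketch of this evaluation protocol with scikit-learn, using synthetic placeholder data, is shown below.

```python
# 5-fold cross-validated mean squared error of an ExtraTrees regressor.
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import cross_val_score

phi = np.random.rand(1000, 30)       # placeholder features
log_scores = np.random.rand(1000)    # placeholder log strong branching scores

mse_per_fold = -cross_val_score(
    ExtraTreesRegressor(n_estimators=25, max_depth=12, min_samples_split=5),
    phi, log_scores, cv=5, scoring="neg_mean_squared_error")
print(mse_per_fold.mean())
```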

Figure 1 shows the 5-fold cross-validated mean squared error (MSE) for all grouping strategies on one particular network (case1888rte). We omit other networks, since they presented similar results. In the charts, “actual value” corresponds to the strong branching scores, as generated by the non-ML branching oracle RB:100:inf during training data collection, while “predicted value” is the branching score predicted by each ML model.

While all grouping strategies were able to approximate the strong branching scores to some extent, MSE scores improved as the grouping became finer and each group became smaller. This was especially true for the per-variable model ML:PV, the finest grouping, which achieved much better MSE results than all other methods, including ML:ET.

Figure 1: Comparison of actual versus predicted strong branching scores on network case1888rte.

5.2 Gap closure on small trees (1,000 node limit)

In the previous subsection, we evaluated the performance of the ML models in isolation, based on MSE. In this subsection, we integrate the models into the branch-and-bound algorithm and evaluate their effectiveness at solving an actual MILP. Our main focus is the quality of branching, measured by the final relative MIP gap after a certain number of branch-and-bound nodes. We do not focus on total running time since this metric depends heavily on the quality of the software implementation, instead of simply on the fundamental properties of the branching rule. It is also self-evident, in our opinion, that all evaluated ML-based branching rules, if carefully implemented, would require significantly less computational effort per node than strong branching, since they involve only the evaluation of a few small decision trees per variable, instead of the solution of a large-scale linear programming problem.

Table 4: Relative MIP gap after at most 1,000 branch-and-bound nodes.
Network Relative MIP gap (%)
MIB RB:100:inf ML:ET ML:PNA ML:PTI ML:PGE ML:PV
case1888rte 1.78 0.76 1.31 1.35 0.90 1.22 0.88
case1951rte 0.41 0.19 0.23 0.24 0.20 0.22 0.20
case2848rte 0.83 0.37 0.54 0.59 0.45 0.58 0.41
case3012wp 0.29 0.10 0.08 0.01 0.02 0.01 0.04
case3375wp 0.48 0.14 0.52 0.46 0.50 0.41 0.48
Average 0.76 0.31 0.54 0.53 0.41 0.49 0.40

In our first set of experiments, we compared the relative MIP gap attained by seven different ML and non-ML branching methods (MIB, RB:100:inf, ML:ET, ML:PNA, ML:PTI, ML:PGE and ML:PV) after the exploration of at most 1,000 branch-and-bound nodes. This node limit was set relatively low because we wanted to include RB:100:inf, a computationally expensive branching rule. Note that even though RB:100:inf is faster than full strong branching (RB:inf:inf), for the size of our instances, it is not fast enough to be a practical branching method. Table 4 presents a summary of the results. We remind the reader that 10 test instances were solved for each network; therefore, each cell in the table represents the average of 10 values. For each network, we also highlighted the ML branching rule with best performance.

As clearly illustrated, the per-variable approach ML:PV outperformed ML:ET on almost all networks. Moreover, ML:PV presented a relative MIP gap of 0.40%, which is only about 22% worse than RB:100:inf, and therefore provides a relatively close approximation of strong branching.

Per-generator ML:PGE and per-time ML:PTI approaches also significantly outperformed ML:ET, although to a smaller degree in most instances. The per-name ML:PNA approach overall offered little improvement, likely due to its similarity to the original ML:ET.

As explained in Subsection 4.5, when we need to estimate the strong branching score of a variable that does not have a per-group ML model, we fall back to the most general ML:ET model. It is thus worth investigating how often this fallback occurred. Table 5 shows what percentage of all evaluations relied on the fallback model, instead of the specific per-group model, due to missing data.

It is clear that, with finer grouping strategies, fallback happened more often, indicating that fewer variables were covered during training. This highlights the need to balance the number of groups against the amount of data per group: while finer groups lead to improved accuracy, this comes at the cost of reduced coverage. Nevertheless, it is also clear that the fallback rate is quite low, and that fallback does not hinder the potential of the per-group or per-variable approach. With a more substantial dataset, we also expect to be able to reduce the occurrence of fallback or even eliminate it altogether.

Table 5: Fallback Rate, node_limit=1000
Network Fallback rate (%)
ML:ET ML:PNA ML:PTI ML:PGE ML:PV
case1888rte 0.0 0.0 0.0 0.1 9.8
case1951rte 0.0 0.0 0.2 0.1 4.5
case2848rte 0.0 0.0 0.0 0.1 7.2
case3012wp 0.0 0.0 0.0 2.8 14.6
case3375wp 0.0 0.0 0.0 0.4 8.5
Average 0.0 0.0 0.0 0.7 9.0

5.3 Gap closure on large trees (10,000 node limit)

While the experiments in Section 5.2 clearly demonstrate the superiority of the per-group methods in smaller branch-and-bound trees, it is natural to ask whether these results still hold for larger trees. To answer this question, in this section we repeat the experiments of Section 5.2 with a larger budget of 10,000 nodes. Due to the significant computational cost of RB:100:inf, we omit this method in the following experiments.

Table 6 shows a summary of our results. With larger trees, the performance improvement of ML:PV over ML:ET becomes clearer, and the per-variable method now outperforms ML:ET in every network. Consistent with our previous experiments, other per-group schemes, especially per-time (ML:PTI) and per-generator (ML:PGE), also outperform ML:ET in most cases. For network case3375wp, in particular, ML:PGE presents much better performance than all the other ML methods.

Table 6: Relative MIP Gap, node_limit=10000
Network Relative MIP gap (%)
MIB ML:ET ML:PNA ML:PTI ML:PGE ML:PV
case1888rte 1.75 1.13 1.20 0.66 1.03 0.61
case1951rte 0.37 0.11 0.11 0.08 0.11 0.08
case2848rte 0.77 0.43 0.47 0.33 0.46 0.30
case3012wp 0.27 0.05 0.01 0.01 0.01 0.02
case3375wp 0.45 0.51 0.42 0.47 0.31 0.44
Average 0.72 0.45 0.44 0.31 0.39 0.29

5.4 Presolved Instances

The approach we propose in this work fundamentally depends on variables and variable groups keeping their identity across multiple instances. While it is true that, in operational problems, one often has to solve instances where changes are made only to the matrix coefficients, objective function and right-hand side, in practice these MILPs would still go through an extra presolve step just before being solved, in which the MILP solver makes further changes to the problem, based on the actual problem data, in an attempt to make it easier. These changes could potentially modify the problem structure. To determine whether the proposed scheme would still work under this scenario, we ran further computational experiments in which the test instances are presolved by Gurobi prior to being solved by our branch-and-bound implementation. No presolve, however, is applied to the training instances. Because of this, during inference, we provide the ML models with static variable features (e.g. objective coefficient) corresponding to the original problem, not the presolved one, together with dynamic features extracted at the current node (e.g. depth of the node).

Tables 7 and 8 show the relative MIP gap results after 1,000 and 10,000 branch-and-bound nodes, respectively. As shown in the tables, even though the test instances are slightly different from the training ones, the machine learning methods still presented good relative MIP gap closures, demonstrating the robustness of the approach. Moreover, the per-variable and per-group approaches again outperformed ML:ET on most networks, for both node limit settings.

Table 7: Relative MIP Gap after 1,000 nodes, with presolved test instances.
Network Relative MIP gap (%)
MIB RB:100:inf ML:ET ML:PNA ML:PTI ML:PGE ML:PV
case1888rte 1.00 0.23 0.51 0.52 0.38 0.41 0.33
case1951rte 0.29 0.14 0.17 0.17 0.15 0.17 0.16
case2848rte 0.43 0.19 0.30 0.33 0.28 0.29 0.26
case3012wp 0.05 0.01 0.01 0.01 0.01 0.01 0.01
case3375wp 0.31 0.02 0.31 0.27 0.32 0.20 0.31
Average 0.41 0.12 0.26 0.26 0.23 0.22 0.22
Table 8: Relative MIP Gap after 10,000 nodes, with presolved test instances.
Network Relative MIP gap (%)
MIB ML:ET ML:PNA ML:PTI ML:PGE ML:PV
case1888rte 0.95 0.37 0.36 0.29 0.27 0.25
case1951rte 0.25 0.04 0.04 0.04 0.05 0.05
case2848rte 0.39 0.21 0.24 0.19 0.19 0.17
case3012wp 0.03 0.01 0.01 0.01 0.01 0.01
case3375wp 0.30 0.27 0.22 0.29 0.13 0.25
Average 0.38 0.18 0.18 0.16 0.13 0.14

6 Discussion and future research directions

In this work, we have provided evidence that building separate ML models for individual variables, or groups of variables, can result in significantly better branching decisions in the context of SCUC. Specifically, in our experiments, the proposed per-variable ML:PV method outperformed the previously described ML:ET in almost every instance, under various settings. We also found that ML:PV could mimic strong branching decisions very well. The per-generator ML:PGE and per-time ML:PTI methods presented strong performance on selected networks, frequently outperforming ML:ET. Both per-variable and per-group schemes offered stable and robust performance, even on presolved instances.

Finally, we discuss some future research directions.

  1. Application to other ML branching methods. While we focused on modifying the method proposed by [4], note that the enhancement can be applied to other ML branching methods as well. For example, it would be straightforward to adapt it to the online learning method proposed by [5], and we could apply it to the learning-to-rank approach of [31] by focusing on ranking variables within each group. Another question is whether graph-based problem representations, as proposed by [23], would eliminate the advantages of variable grouping. In our understanding, the main advantage of such representations is handling instances with significantly different structure. When the problems have a fixed structure, which is the setting of our present work, the advantage of graph-based representations is much less clear, since the graph would remain fixed. We note that many other works in the learning-to-branch literature are orthogonal to our work. For example, the GPU-based strong branching approach in [39] could be applied in our work to accelerate training data generation but, in our understanding, would not fundamentally change the results presented.

  2. Training on small instances then testing on larger instances. With appropriate variable grouping, it may be possible to collect data from smaller instances, where it becomes feasible to use high-quality oracles, such as full strong branching, then use the trained ML models to solve instances of much larger scale. In the SCUC problem, for example, if the variable groups don’t take time into consideration, then training on instances that have a shorter time horizon may be a feasible strategy.

  3. Integrating with other MILP solver techniques. One particularly interesting topic is integrating this method with primal heuristics and cutting planes, which are widely used in current commercial solvers to accelerate the solution of MILPs. In this work we demonstrated that the proposed per-variable and per-group approaches are robust against presolve, but it would be interesting to evaluate their performance in the presence of other solver features.

  4. Evaluating on other operational problems. In this work, we focused exclusively on SCUC, but the method could potentially benefit other operational problems. As mentioned in the introduction, another example would be the Optimal Transmission Switching problem (OTS), which is not currently used in real-time operations due to slow computational performance. An important future direction is to explore whether OTS and other important operational problems could also benefit from the proposed method.

7 Acknowledgments

This material is based upon work supported by the U.S. Department of Energy Advanced Grid Modeling Program. Santanu S. Dey gratefully acknowledges the support of the Air Force Office of Scientific Research.

References

  • [1] Tobias Achterberg. Constraint Integer Programming. Doctoral thesis, Technische Universität Berlin, Fakultät II - Mathematik und Naturwissenschaften, Berlin, 2007.
  • [2] Tobias Achterberg, Thorsten Koch, and Alexander Martin. Branching rules revisited. Operations Research Letters, 33(1):42–54, 2005.
  • [3] Alejandro Marcos Alvarez, Quentin Louveaux, and Louis Wehenkel. A supervised machine learning approach to variable branching in branch-and-bound. In ECML. Citeseer, 2014.
  • [4] Alejandro Marcos Alvarez, Quentin Louveaux, and Louis Wehenkel. A machine learning-based approximation of strong branching. INFORMS Journal on Computing, 29(1):185–195, 2017.
  • [5] Alejandro Marcos Alvarez, Louis Wehenkel, and Quentin Louveaux. Online learning for strong branching approximation in branch-and-bound. 2016.
  • [6] David Applegate, Robert Bixby, Vašek Chvátal, and William Cook. Finding cuts in the tsp (a preliminary report). Technical report, Citeseer, 1995.
  • [7] Amitabh Basu, Michele Conforti, Marco Di Summa, and Hongyi Jiang. Complexity of branch-and-bound and cutting planes in mixed-integer optimization-ii. In Integer Programming and Combinatorial Optimization: 22nd International Conference, IPCO 2021, Atlanta, GA, USA, May 19–21, 2021, Proceedings, volume 12707, page 383. Springer Nature, 2021.
  • [8] Pascale Bendotti, Pierre Fouilhoux, and Cécile Rottner. On the complexity of the unit commitment problem. Annals of Operations Research, 274(1-2):119–130, 2019.
  • [9] Seth Blumsack, Lester B Lave, and Marija Ilic. A quantitative analysis of the relationship between congestion and reliability in electric power networks. The Energy Journal, 28(4), 2007.
  • [10] Yonghong Chen, Aaron Casto, Fengyu Wang, Qianfan Wang, Xing Wang, and Jie Wan. Improving large scale day-ahead security constrained unit commitment performance. IEEE Transactions on Power Systems, 31(6):4732–4743, 2016.
  • [11] Vasek Chvátal. Hard knapsack problems. Operations Research, 28(6):1402–1411, 1980.
  • [12] Arthur I Cohen and Vahid R Sherkat. Optimization-based methods for operations scheduling. Proceedings of the IEEE, 75(12):1574–1591, 1987.
  • [13] Michele Conforti, Gérard Cornuéjols, Giacomo Zambelli, et al. Integer programming, volume 271. Springer, 2014.
  • [14] William J Cook and Mark Hartmann. On the complexity of branch and cut methods for the traveling salesman problem. Polyhedral Combinatorics, 1:75–82, 1990.
  • [15] Daniel Dadush and Samarth Tiwari. On the complexity of branching proofs. arXiv preprint arXiv:2006.04124, 2020.
  • [16] Santanu S Dey, Yatharth Dubey, and Marco Molinaro. Branch-and-bound solves random binary packing ips in polytime. arXiv preprint arXiv:2007.15192, 2020.
  • [17] Santanu S Dey, Yatharth Dubey, and Marco Molinaro. Lower bounds on the size of general branch-and-bound trees. Mathematical Programming, pages 1–21, 2022.
  • [18] Santanu S Dey, Yatharth Dubey, Marco Molinaro, and Prachi Shah. A theoretical and computational analysis of full strong-branching. arXiv preprint arXiv:2110.10754, 2021.
  • [19] Santanu S Dey and Prachi Shah. Lower bound on size of branch-and-bound trees for solving lot-sizing problem. arXiv preprint arXiv:2112.03965, 2021.
  • [20] Emily B Fisher, Richard P O’Neill, and Michael C Ferris. Optimal transmission switching. IEEE Transactions on Power Systems, 23(3):1346–1355, 2008.
  • [21] Stéphane Fliscounakis, Patrick Panciatici, Florin Capitanescu, and Louis Wehenkel. Contingency ranking with respect to overloads in very large power systems taking into account uncertainty, preventive, and corrective actions. IEEE Transactions on Power Systems, 28(4):4909–4917, 2013.
  • [22] Len L Garver. Power generation scheduling by integer programming-development of theory. Transactions of the American Institute of Electrical Engineers. Part III: Power Apparatus and Systems, 81(3):730–734, 1962.
  • [23] Maxime Gasse, Didier Chételat, Nicola Ferroni, Laurent Charlin, and Andrea Lodi. Exact combinatorial optimization with graph convolutional neural networks. arXiv preprint arXiv:1906.01629, 2019.
  • [24] Pierre Geurts, Damien Ernst, and Louis Wehenkel. Extremely randomized trees. Machine learning, 63(1):3–42, 2006.
  • [25] Prateek Gupta, Maxime Gasse, Elias B Khalil, M Pawan Kumar, Andrea Lodi, and Yoshua Bengio. Hybrid models for learning to branch. arXiv preprint arXiv:2006.15212, 2020.
  • [26] Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual, 2022.
  • [27] Trevor Hastie, Robert Tibshirani, Jerome H Friedman, and Jerome H Friedman. The elements of statistical learning: data mining, inference, and prediction, volume 2. Springer, 2009.
  • [28] Robert G Jeroslow. Trivial integer programs unsolvable by branch-and-bound. Mathematical Programming, 6(1):105–109, 1974.
  • [29] Emma S Johnson, Shabbir Ahmed, Santanu S Dey, and Jean-Paul Watson. A k-nearest neighbor heuristic for real-time dc optimal transmission switching. arXiv preprint arXiv:2003.10565, 2020.
  • [30] Cédric Josz, Stéphane Fliscounakis, Jean Maeght, and Patrick Panciatici. Ac power flow data in matpower and qcqp format: itesla, rte snapshots, and pegase. arXiv preprint arXiv:1603.01533, 2016.
  • [31] Elias Khalil, Pierre Le Bodic, Le Song, George Nemhauser, and Bistra Dilkina. Learning to branch in mixed integer programming. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.
  • [32] Burak Kocuk, Hyemin Jeon, Santanu S Dey, Jeff Linderoth, James Luedtke, and Xu Andy Sun. A cycle-based formulation and valid inequalities for dc power transmission problems with switching. Operations Research, 64(4):922–938, 2016.
  • [33] A. H. Land and A. G. Doig. An automatic method for solving discrete programming problems. ECONOMETRICA, 28(3):497–520, 1960.
  • [34] Eric Larsen, Sébastien Lachapelle, Yoshua Bengio, Emma Frejinger, Simon Lacoste-Julien, and Andrea Lodi. Predicting tactical solutions to operational planning problems under imperfect information. INFORMS Journal on Computing, 34(1):227–242, 2022.
  • [35] Jeff T Linderoth and Martin WP Savelsbergh. A computational study of search strategies for mixed integer programming. INFORMS Journal on Computing, 11(2):173–187, 1999.
  • [36] Andrea Lodi. Mixed integer programming computation. In Michael Jünger, Thomas M. Liebling, Denis Naddef, George L. Nemhauser, William R. Pulleyblank, Gerhard Reinelt, Giovanni Rinaldi, and Laurence A. Wolsey, editors, 50 Years of Integer Programming 1958-2008 - From the Early Years to the State-of-the-Art, pages 619–645. Springer, 2010.
  • [37] Andrea Lodi and Giulia Zarpellon. On learning and branching: a survey. Top, 25(2):207–236, 2017.
  • [38] Germán Morales-España, Jesus M Latorre, and Andres Ramos. Tight and compact milp formulation for the thermal unit commitment problem. IEEE Transactions on Power Systems, 28(4):4897–4908, 2013.
  • [39] Vinod Nair, Sergey Bartunov, Felix Gimeno, Ingrid von Glehn, Pawel Lichocki, Ivan Lobov, Brendan O’Donoghue, Nicolas Sonnerat, Christian Tjandraatmadja, Pengming Wang, et al. Solving mixed integer programs using neural networks. arXiv preprint arXiv:2012.13349, 2020.
  • [40] Afshin Oroojlooyjadid, Lawrence V Snyder, and Martin Takáč. Applying deep learning to the newsvendor problem. IISE Transactions, 52(4):444–463, 2020.
  • [41] James Ostrowski, Miguel F Anjos, and Anthony Vannelli. Tight mixed integer linear programming formulations for the unit commitment problem. IEEE Transactions on Power Systems, 27(1):39–46, 2011.
  • [42] Antoine Prouvost, Justin Dumouchelle, Lara Scavuzzo, Maxime Gasse, Didier Chételat, and Andrea Lodi. Ecole: A gym-like library for machine learning in combinatorial optimization solvers. arXiv preprint arXiv:2011.06069, 2020.
  • [43] Laurence A Wolsey and George L Nemhauser. Integer and combinatorial optimization, volume 55. John Wiley & Sons, 1999.
  • [44] Álinson S Xavier, Feng Qiu, and Shabbir Ahmed. Learning to solve large-scale security-constrained unit commitment problems. INFORMS Journal on Computing, 33(2):739–756, 2021.
  • [45] Alinson Santos Xavier and Feng Qiu. MIPLearn: An Extensible Framework for Learning-Enhanced Optimization, 2020.
  • [46] Alinson Santos Xavier and Feng Qiu. UnitCommitment.jl: A Julia/JuMP Optimization Package for Security-Constrained Unit Commitment, 2020.
  • [47] Yu Yang, Natashia Boland, Bistra Dilkina, and Martin Savelsbergh. Learning generalized strong branching for set covering, set packing, and 0-1 knapsack problems. Technical report, URL http://www.optimization-online.org/DB_HTML/2020/02 …, 2020.
  • [48] Tianyu Zhang, Amin Banitalebi-Dehkordi, and Yong Zhang. Deep reinforcement learning for exact combinatorial optimization: Learning to branch. arXiv preprint arXiv:2206.06965, 2022.
  • [49] Ray Daniel Zimmerman, Carlos Edmundo Murillo-Sánchez, and Robert John Thomas. Matpower: Steady-state operations, planning, and analysis tools for power systems research and education. IEEE Transactions on power systems, 26(1):12–19, 2010.