
Differentiating Through Integer Linear Programs with Quadratic Regularization and Davis-Yin Splitting

Daniel McKenzie dmckenzie@mines.edu
Department of Applied Mathematics and Statistics
Colorado School of Mines
Samy Wu Fung swufung@mines.edu
Department of Applied Mathematics and Statistics
Department of Computer Science
Colorado School of Mines
Howard Heaton research@typal.academy
Typal Academy
Abstract

In many applications, a combinatorial problem must be repeatedly solved with similar, but distinct, parameters. Yet the parameters w are not directly observed; only contextual data d that correlates with w is available. It is tempting to use a neural network to predict w given d. However, training such a model requires reconciling the discrete nature of combinatorial optimization with the gradient-based frameworks used to train neural networks. We study the case where the problem in question is an Integer Linear Program (ILP). We propose applying a three-operator splitting technique, also known as Davis-Yin splitting (DYS), to the quadratically regularized continuous relaxation of the ILP. We prove that the resulting scheme is compatible with the recently introduced Jacobian-free backpropagation (JFB). Our experiments on two representative ILPs, the shortest path problem and the knapsack problem, demonstrate that this combination (DYS on the forward pass, JFB on the backward pass) yields a scheme which scales more effectively to high-dimensional problems than existing schemes. All code associated with this paper is available at https://github.com/mines-opt-ml/fpo-dys.

1 Introduction

Many high-stakes decision problems in healthcare (Zhong & Tang, 2021), logistics and scheduling (Sbihi & Eglese, 2010; Kacem et al., 2021), and transportation (Wang & Tang, 2021) can be viewed as a two-step process. In the first step, one gathers data about the situation at hand. This data is used to assign a value (or cost) to outcomes arising from each possible action. The second step is to select the action yielding maximum value (alternatively, lowest cost). Mathematically, this can be framed as an optimization problem with a data-dependent cost function:

x^{\star}(d)\triangleq\operatorname*{arg\,min}_{x\in\mathcal{X}}f(x;d). (1)

In this work, we focus on the case where \mathcal{X}\subset\mathbb{R}^{n} is a finite constraint set and f(x;d)=w(d)^{\top}x is a linear function. This class of problems is quite rich, containing the shortest path, traveling salesperson, and sequence alignment problems, to name a few. Given w(d), solving equation 1 may be straightforward (e.g. shortest path) or NP-hard (e.g. the traveling salesperson problem (Karp, 1972)). However, our present interest is in settings where the dependence of w(d) on d is unknown. In such settings, it is intuitive to learn a mapping w_{\Theta}(d)\approx w(d) and then solve

x_{\Theta}(d)\triangleq\operatorname*{arg\,min}_{x\in\mathcal{X}}w_{\Theta}(d)^{\top}x (2)

in lieu of w(d). The observed data d is called the context. As an illustrative running example, consider the shortest path prediction problem shown in Figure 1, which is studied in Berthet et al. (2020) and Pogančić et al. (2019).

At first glance, it may appear natural to tune the weights \Theta so as to minimize the difference between w_{\Theta}(d) and w(d). However, this is only possible if w(d) is available at training time. Even if this approach is feasible, it may not be advisable. If w_{\Theta}(d) is a near-perfect predictor of w(d), we can expect x_{\Theta}(d)\approx x^{\star}(d). However, even small differences between w_{\Theta}(d) and w(d) can manifest in wildly different solutions (Bengio, 1997; Wilder et al., 2019). Thus, we focus on methods where \Theta is tuned by minimizing the discrepancy between x_{\Theta}(d) and x^{\star}(d). Such methods are instances of the decision-focused learning paradigm (Mandi et al., 2023), as the criterion we are optimizing for is the quality of x_{\Theta}(d) (the “decision”), not the fidelity of w_{\Theta}(d) (the “prediction”).

The key obstacle in using gradient-based algorithms to tune \Theta in such approaches is “differentiating through” the solution x_{\Theta}(d) of equation 2 to obtain a gradient with which to update \Theta. Specifically, the combinatorial nature of \mathcal{X} may cause the solution x_{\Theta}(d) to remain unchanged for many small perturbations to \Theta; yet, for some perturbations x_{\Theta}(d) may “jump” to a different point in \mathcal{X}. Hence, the gradient \mathrm{d}x_{\Theta}/\mathrm{d}w_{\Theta} is always either zero or undefined. To compute an informative gradient, we follow Wilder et al. (2019) by relaxing equation 2 to a quadratic program over the convex hull of \mathcal{X} with an added regularizer (see equation 8).

Figure 1: The shortest path prediction problem (Pogančić et al., 2019). The goal is to find the shortest path (from top-left to bottom-right) through a randomly generated terrain map from the Warcraft II tileset (Guyomarch, 2017). The contextual data d, shown in (a), is an image sub-divided into 8-by-8 squares, each representing a vertex in a 12-by-12 grid graph. The cost of traversing each square, i.e. w(d), is shown in (b), with darker shading representing lower cost. The true shortest path is shown in (c).
Contribution

Drawing upon recent advances in convex optimization (Ryu & Yin, 2022) and implicit neural networks (Fung et al., 2022; McKenzie et al., 2021), we propose a method designed specifically for “differentiating through” large-scale integer linear programs. Our approach is fast, easy to implement using our provided code, and, unlike several prior works (e.g. Pogančić et al. (2019); Berthet et al. (2020)), trains completely on GPU. Numerical examples herein demonstrate that our approach, run using only standard computing resources, easily scales to problems with tens of thousands of variables. Theoretically, we verify our approach computes an informative gradient via a refined analysis of Jacobian-free Backpropagation (JFB) (Fung et al., 2022). In summary, we do the following.

  • Building upon McKenzie et al. (2021), we use the three-operator splitting technique of Davis & Yin (2017) to propose DYS-Net.

  • We provide theoretical guarantees for differentiating through the fixed point of a certain non-expansive, but not contractive, operator.

  • We numerically show DYS-Net easily handles combinatorial problems with tens of thousands of variables.

2 Preliminaries

LP Reformulation

We focus on optimization problems of the form equation 1 where f(x;d)=w(d)^{\top}x and \mathcal{X} is the set of integer or binary points of a polytope, which we may assume to be expressed in standard form (Ziegler, 2012):

\mathcal{X}=\mathcal{C}\cap\mathbb{Z}^{n}\ \ \text{ or }\ \ \mathcal{X}=\mathcal{C}\cap\{0,1\}^{n},\quad\text{ where }\quad\mathcal{C}=\left\{x\in\mathbb{R}^{n}:\ Ax=b\text{ and }x\geq 0\right\}. (3)

In other words, equation 1 is an Integer Linear Program (ILP). Similar to works by Elmachtoub & Grigas (2022); Wilder et al. (2019); Mandi & Guns (2020) and others, we replace the model equation 2 with its continuous relaxation, and redefine

x_{\Theta}(d)\triangleq\operatorname*{arg\,min}_{x\in\mathcal{C}}w_{\Theta}(d)^{\top}x (4)

as a step towards making the computation of \mathrm{d}x_{\Theta}/\mathrm{d}w_{\Theta} feasible.

Losses and Training Data

We assume access to training data in the tuple form (d,x^{\star}(d)). Such data may be obtained by observing experienced agents solve equation 1, for example taxi drivers in a shortest path problem (Piorkowski et al., 2009). We focus on the \ell_{2} loss:

\mathcal{L}(\Theta)\triangleq\mathbb{E}_{d\sim\mathcal{D}}\left[\ell_{2}(\Theta;d)\right],\quad\text{ where }\quad\ell_{2}(\Theta;d)=\|x^{\star}(d)-x_{\Theta}(d)\|^{2}, (5)

and \mathcal{D} is the distribution of contextual data, although more sophisticated losses can be used with only superficial changes to our results. In principle, we select the optimal weights by solving \operatorname*{arg\,min}_{\Theta}\mathcal{L}(\Theta). In practice, the population risk is inaccessible, and so we minimize the empirical risk instead (Vapnik, 1999):

\operatorname*{arg\,min}_{\Theta}\frac{1}{N}\sum_{i=1}^{N}\ell_{2}\left(\Theta;d_{i}\right). (6)
Argmin Differentiation

For notational brevity, we temporarily suppress the dependence on d. The gradient of \ell_{2} with respect to \Theta, evaluated on a single data point, is

\frac{\mathrm{d}}{\mathrm{d}\Theta}\left[\ell_{2}(\Theta)\right]=\frac{\mathrm{d}}{\mathrm{d}\Theta}\left[\|x_{\Theta}-x^{\star}\|^{2}\right]=2(x_{\Theta}-x^{\star})^{\top}\frac{\partial x_{\Theta}}{\partial w_{\Theta}}\frac{\mathrm{d}w_{\Theta}}{\mathrm{d}\Theta}.

As discussed in Section 1, x_{\Theta} is piecewise constant as a function of w_{\Theta}, and this remains true for the LP relaxation equation 4. Consequently, for all w_{\Theta}, either \partial x_{\Theta}/\partial\Theta=0 or \partial x_{\Theta}/\partial\Theta is undefined; neither case yields an informative gradient. To remedy this, Wilder et al. (2019); Mandi & Guns (2020) propose adding a small amount of regularization to the objective function in equation 4 to make it strongly convex. This ensures x_{\Theta} is a continuously differentiable function of w_{\Theta}. We add a small quadratic regularizer, modulated by \gamma\geq 0, to obtain the objective

f_{\Theta}(x;\gamma,d)\triangleq w_{\Theta}(d)^{\top}x+\frac{\gamma}{2}\|x\|_{2}^{2} (7)

and henceforth replace equation 4 by

x_{\Theta}(d)\triangleq\operatorname*{arg\,min}_{x\in\mathcal{C}}f_{\Theta}(x;\gamma,d). (8)

During training, we aim to solve equation 8 and efficiently compute the derivative \partial x_{\Theta}/\partial\Theta.

Pitfalls of Relaxation

Finally, we offer a note of caution when using the quadratically regularized problem equation 8 as a proxy for the original integer linear program equation 1. Temporarily defining

\hat{x}_{\Theta}(d)\triangleq\operatorname*{arg\,min}_{x\in\mathcal{X}}w_{\Theta}(d)^{\top}x,\quad\tilde{x}_{\Theta}(d)\triangleq\operatorname*{arg\,min}_{x\in\mathcal{C}}w_{\Theta}(d)^{\top}x,\quad x_{\Theta}(d)\triangleq\operatorname*{arg\,min}_{x\in\mathcal{C}}f_{\Theta}(x;\gamma,d),

we observe that two sources of error have arisen: one from replacing \mathcal{X} with \mathcal{C} (i.e. the discrepancy between \hat{x}_{\Theta} and \tilde{x}_{\Theta}), and one from replacing the linear objective function with its quadratically regularized counterpart (i.e. the discrepancy between \tilde{x}_{\Theta} and x_{\Theta}). When A is totally unimodular, \tilde{x}_{\Theta} is guaranteed to be integral, in which case \hat{x}_{\Theta}=\tilde{x}_{\Theta}. Moreover, Wilder et al. (2019, Theorem 2) shows that

0\leq w_{\Theta}(d)^{\top}\left(x_{\Theta}(d)-\tilde{x}_{\Theta}(d)\right)\leq\gamma\max_{x,y\in\mathcal{C}}\|x-y\|^{2}

whence, as in the proof of Berthet et al. (2020, Proposition 2.2), \lim_{\gamma\to 0^{+}}x_{\Theta}(d)=\tilde{x}_{\Theta}(d) (at least when \mathcal{C} is compact). Thus, for totally unimodular ILPs (e.g. the shortest path problem), it is reasonable to expect that the x_{\Theta}(d) obtained from equation 8, for well-tuned \Theta and small \gamma, is of acceptable quality. For ILPs which are not totally unimodular, no such theoretical guarantees exist. However, we observe that this relaxation works reasonably well, at least for the knapsack problem.

3 Related Works

We highlight recent works in decision-focused learning—particularly for mixed integer programs—and related areas.

Optimization Layers

Introduced in Amos & Kolter (2017), and studied further in Agrawal et al. (2019b; c); Bolte et al. (2021); Blondel et al. (2022), an optimization layer is a modular component of a deep learning framework where forward propagation consists of solving a parametrized optimization problem. Consequently, backpropagation entails differentiating the solution to this problem with respect to the parameters. Optimization layers are a promising technology as they are able to incorporate hard constraints into deep neural networks. Moreover, as their output is a (constrained) minimizer of the objective function, it is easier to interpret than the output of a generic layer.

Decision-focused learning for LPs

When an optimization layer is fitted to a data-dependent optimization problem of the form equation 1, with the goal of maximizing the quality of the predicted solution, this is usually referred to as decision-focused learning. We narrow our focus to the case where the objective function in equation 1 is linear. Wilder et al. (2019) and Elmachtoub & Grigas (2022) are two of the earliest works applying deep learning techniques to data-driven LPs, and delineate two fundamentally different approaches. Specifically, Wilder et al. (2019) propose replacing the LP with a continuous and strongly convex relaxation, as described in Section 2. This approach is extended to ILPs in Ferber et al. (2020), and to non-quadratic regularizers in Mandi & Guns (2020). On the other hand, Elmachtoub & Grigas (2022) avoid modifying the underlying optimization problem and instead propose a new loss function, dubbed the Smart Predict-then-Optimize (SPO) loss, to be used instead of the \ell_{2} loss. We emphasize that the SPO loss requires access to the true cost vectors w(d), a setting which we do not consider.
Several works define a continuously differentiable proxy for the solution to the unregularized LP equation 4, which we rewrite here (suppressing, for notational simplicity, the dependence on d) as

x^{\star}(w)=\operatorname*{arg\,min}_{x\in\mathcal{C}}w^{\top}x. (9)

In Berthet et al. (2020), a stochastic perturbation is considered:

x^{\star}_{\varepsilon}(w)=\mathbb{E}_{Z}\left[\operatorname*{arg\,min}_{x\in\mathcal{C}}\left(w+\varepsilon Z\right)^{\top}x\right], (10)

which is somewhat analogous to Nesterov-Spokoiny smoothing (Nesterov & Spokoiny, 2017) in zeroth-order optimization. For some draws of Z the additively perturbed cost vector w+\varepsilon z may have negative entries, which may cause a solver (e.g. Dijkstra’s algorithm) applied to the associated LP to fail. To avoid this, Dalle et al. (2022) suggest using a multiplicative perturbation of w instead. Pogančić et al. (2019) define a piecewise-affine interpolant to \ell(x^{\star}(w)), where \ell is a suitable loss function.
We note a bifurcation in the literature: Wilder et al. (2019); Ferber et al. (2020); Elmachtoub & Grigas (2022) assume access to training data of the form (d,w(d)), whereas Pogančić et al. (2019); Berthet et al. (2020); Sahoo et al. (2022) use training data of the form (d,x^{\star}(d)). The former setting is frequently referred to as the predict-then-optimize problem. Our focus is on the latter setting. We refer the reader to Kotary et al. (2021); Mandi et al. (2023) for recent surveys of this area.

Deep Equilibrium Models

Bai et al. (2019); El Ghaoui et al. (2021) propose the notion of a deep equilibrium model (DEQ), also known as an implicit neural network. A DEQ is a neural network for which forward propagation consists of (approximately) computing a fixed point of a parametrized operator. We note that equation 8 may be reformulated as a fixed point problem:

\text{Find }x_{\Theta}\text{ such that }x_{\Theta}=P_{\mathcal{C}}\left(x_{\Theta}-\alpha\nabla_{x}f_{\Theta}(x_{\Theta};d)\right), (11)

where P_{\mathcal{C}} is the orthogonal projection onto \mathcal{C} (for a set \mathcal{A}\subseteq\mathbb{R}^{n}, P_{\mathcal{A}}(x)\triangleq\operatorname*{arg\,min}_{z\in\mathcal{A}}\|z-x\|). Thus, DEQ techniques may be applied to constrained optimization layers (Chen et al., 2021; Blondel et al., 2022; McKenzie et al., 2021). However, the cost of computing P_{\mathcal{C}} can be prohibitive; see the discussion in Section 4.

Learning-to-Optimize (L2O)

Our work is related to the learning-to-optimize (L2O) framework (Chen et al., 2022; Li & Malik, 2017; Chen et al., 2018; Liu et al., 2023), where an optimization algorithm is learned and its outputs are used to perform inferences. However, traditional L2O methods use a fixed number of layers (i.e. unrolled algorithms). Our approach attempts to differentiate through the solution x_{\Theta} and is therefore most closely aligned with works that employ implicit networks within the L2O framework (Amos & Kolter, 2017; Heaton & Fung, 2023; McKenzie et al., 2021; Gilton et al., 2021; Liu et al., 2022). We highlight that, unlike many L2O works, our goal is not to learn a faster optimization method for fully specified problems. Rather, we seek to solve partially specified problems given contextual data by combining learning and optimization techniques.

Computing the derivative of a minimizer with respect to parameters

In all of the aforementioned works, the same fundamental problem is encountered: \partial x_{\Theta}/\partial\Theta must be computed, where x_{\Theta} is the solution of a (constrained) optimization problem with objective function parametrized by \Theta. The most common approach to doing so, proposed in Amos & Kolter (2017) and used in Ruthotto et al. (2018); Wilder et al. (2019); Mandi & Guns (2020); Ferber et al. (2020), starts with the KKT conditions for constrained optimality:

\frac{\partial f_{\Theta}}{\partial x}(x_{\Theta})+A^{\top}\hat{\lambda}+\hat{\nu}=0,\quad Ax_{\Theta}-b=0,\quad D(\hat{\nu})x_{\Theta}=0,

where \hat{\lambda} and \hat{\nu}\geq 0 are Lagrange multipliers associated to the optimal solution x_{\Theta} (Bertsekas, 1997) and D(\hat{\nu}) is a matrix with \hat{\nu} along its diagonal. Differentiating these equations with respect to \Theta and rearranging yields

\begin{bmatrix}\frac{\partial^{2}f_{\Theta}}{\partial x^{2}}&A&I\\ A^{\top}&0&0\\ D(\hat{\nu})&0&D(x_{\Theta})\end{bmatrix}\begin{bmatrix}\frac{\mathrm{d}x_{\Theta}}{\mathrm{d}\Theta}\\ \frac{\mathrm{d}\hat{\lambda}}{\mathrm{d}\Theta}\\ \frac{\mathrm{d}\hat{\nu}}{\mathrm{d}\Theta}\end{bmatrix}=\begin{bmatrix}\frac{\partial^{2}f_{\Theta}}{\partial x\partial\Theta}\\ 0\\ 0\end{bmatrix}. (12)

The matrix and right-hand side vector in equation 12 are computable, thus enabling one to solve for \frac{\mathrm{d}x_{\Theta}}{\mathrm{d}\Theta} (as well as \frac{\mathrm{d}\hat{\lambda}}{\mathrm{d}\Theta} and \frac{\mathrm{d}\hat{\nu}}{\mathrm{d}\Theta}). We emphasize that both primal (i.e., x_{\Theta}) and dual (i.e., \hat{\lambda} and \hat{\nu}) variables at optimality are required. This constrains the choice of optimization schemes to those that track primal and dual variables, and prohibits the use of fast, primal-only schemes. Following Amos & Kolter (2017), most works employing this approach use a primal-dual interior point method, which computes acceptable approximations to x_{\Theta}, \hat{\lambda}, and \hat{\nu} at a cost of \mathcal{O}\left(\max\{n,m\}^{3}\right), assuming x\in\mathbb{R}^{n} and A\in\mathbb{R}^{m\times n}.
Using the stochastic proxy equation 10, Berthet et al. (2020) derive a formula for \mathrm{d}x^{\star}_{\varepsilon}/\mathrm{d}w which is also an expectation, and hence can be efficiently approximated using Monte Carlo methods. Pogančić et al. (2019) show that the gradients of their interpolant are strikingly easy to compute, requiring just one additional solve of equation 9 with perturbed cost w^{\prime}. This approach is extended by Sahoo et al. (2022), who propose to avoid computing \mathrm{d}x^{\star}_{\varepsilon}/\mathrm{d}w entirely, replacing it with the negative identity matrix. This is similar in spirit to our use of Jacobian-free Backpropagation (see Section 4).

4 DYS-Net

We now introduce our proposed model, DYS-net. We use this term to refer to both the model and the training procedure. Fixing an architecture for w_{\Theta} and an input d, DYS-net computes an approximation to x_{\Theta}(d) in a way that is easy to backpropagate through:

\texttt{DYS-net}(d;\Theta)\approx x_{\Theta}\triangleq\operatorname*{arg\,min}_{x\in\mathcal{C}}f_{\Theta}(x;\gamma,d). (13)
The Forward Pass

As we wish to compute x_{\Theta} and \partial x_{\Theta}/\partial\Theta in high dimensional settings (i.e. large n), we eschew second-order methods (e.g. Newton’s method) in favor of first-order methods such as projected gradient descent (PGD). With PGD, a sequence \{x^{k}\} of estimates of x_{\Theta} is computed so that x_{\Theta}=\lim_{k\rightarrow\infty}x^{k}, where

x^{k+1}=P_{\mathcal{C}}\left(x^{k}-\alpha\nabla_{x}f(x^{k};\gamma,d)\right)\quad\text{for all}\ k\in\mathbb{N}.

This approach works for simple sets \mathcal{C} for which there exists an explicit form of P_{\mathcal{C}}, e.g. when \mathcal{C} is the probability simplex (Duchi et al., 2008; Condat, 2016; Li et al., 2023). However, for general polytopes \mathcal{C} no such form exists, thereby requiring a second iterative procedure to compute P_{\mathcal{C}}(x^{k}). This projection must be computed at every iteration of every forward pass, for every sample of every epoch, and it dominates the computational cost; see McKenzie et al. (2021, Appendix 4).

To avoid this expense, we draw on recent advances in the convex optimization literature, particularly Davis-Yin splitting, or DYS (Davis & Yin, 2017). DYS is a three-operator splitting technique, and hence is able to decouple the polytope constraints into relatively simple constraints (Davis & Yin, 2017). This technique has been used elsewhere, e.g. Pedregosa & Gidel (2018); Yurtsever et al. (2021). Specifically, we adapt the architecture incorporating DYS proposed in McKenzie et al. (2021). To this end, we rewrite \mathcal{C} as

\mathcal{C}=\{x:\ Ax=b\text{ and }x\geq 0\}=\underbrace{\{x:\ Ax=b\}}_{\triangleq\,\mathcal{C}_{1}}\cap\underbrace{\{x:\ x\geq 0\}}_{\triangleq\,\mathcal{C}_{2}}=\mathcal{C}_{1}\cap\mathcal{C}_{2}. (14)

While P_{\mathcal{C}} is hard to compute, both P_{\mathcal{C}_{1}} and P_{\mathcal{C}_{2}} can be computed cheaply via explicit formulas, once a singular value decomposition (SVD) of A is computed. We verify this via the following lemma (included for completeness, as the two results are already known).

Lemma 1.

If \mathcal{C}_{1},\mathcal{C}_{2} are as in equation 14 and A is full-rank, then:

  1. P_{\mathcal{C}_{1}}(z)=z-A^{\dagger}(Az-b), where A^{\dagger}=V\Sigma^{-1}U^{\top} and U\Sigma V^{\top} is the compact SVD of A.

  2. P_{\mathcal{C}_{2}}(z)=\operatorname{ReLU}(z)\triangleq\max\{0,z\}.

We remark that the SVD of A is computed offline, and is a once-off cost. Alternatively, one could compute P_{\mathcal{C}_{1}} by using an iterative method (e.g. GMRES) to find the (least squares) solution to Av=Az-b, whence P_{\mathcal{C}_{1}}(z)=z-v, but this would need to be done at each iteration of each forward pass, for every sample of every epoch. Thus, it is unclear whether this provides computational savings over the once-off computation of the SVD. However, it could prove useful when A is only provided implicitly, i.e. via access to a matrix-vector product oracle.
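For concreteness, here is a minimal PyTorch sketch of both projections in Lemma 1 (assuming a full-rank A stored as a dense tensor; the helper names are ours, not the repository's API):

```python
import torch

def make_projections(A, b):
    """Build P_C1 and P_C2 from Lemma 1. The compact SVD of A is
    computed once, offline; each later projection is just two
    matrix-vector products and a vector addition."""
    U, S, Vh = torch.linalg.svd(A, full_matrices=False)  # A = U Σ V^T
    A_pinv = Vh.T @ torch.diag(1.0 / S) @ U.T            # A^† = V Σ^{-1} U^T

    def P_C1(z):
        """Projection onto C_1 = {x : Ax = b}."""
        return z - A_pinv @ (A @ z - b)

    def P_C2(z):
        """Projection onto C_2 = {x : x >= 0}."""
        return torch.relu(z)

    return P_C1, P_C2
```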

The following theorem allows one to approximate x_{\Theta} using only P_{\mathcal{C}_{1}} and P_{\mathcal{C}_{2}}, not P_{\mathcal{C}}.

Theorem 2.

Let \mathcal{C}_{1},\mathcal{C}_{2} be as in equation 14, and suppose f_{\Theta}(x;\gamma,d)=w_{\Theta}(d)^{\top}x+\frac{\gamma}{2}\|x\|_{2}^{2} for any neural network w_{\Theta}(d). For any \alpha\in(0,2/\gamma), define the sequence \{z^{k}\} by:

z^{k+1}=T_{\Theta}(z^{k})\quad\text{ for all }k\in\mathbb{N}, (15)

where

T_{\Theta}(z)\triangleq z-P_{\mathcal{C}_{2}}(z)+P_{\mathcal{C}_{1}}\left(\left(2-\alpha\gamma\right)P_{\mathcal{C}_{2}}(z)-z-\alpha w_{\Theta}(d)\right). (16)

If x^{k}\triangleq P_{\mathcal{C}_{2}}(z^{k}), then the sequence \{x^{k}\} converges to x_{\Theta} in equation 8 and \|x^{k+1}-x^{k}\|^{2}_{2}=O(1/k).

A full proof is included in Appendix A; here we provide a proof sketch. Substituting the gradient

\nabla_{z}f_{\Theta}(z;\gamma,d)=w_{\Theta}(d)+\gamma z

for F_{\Theta}(\cdot,d) into the formula for T_{\Theta}(\cdot) given in McKenzie et al. (2021, Theorem 3.2) and rearranging yields equation 16. Verifying that the assumptions of McKenzie et al. (2021, Theorem 3.3) are satisfied is straightforward. For example, \nabla_{z}f_{\Theta}(z;\gamma,d) is 1/\gamma-cocoercive by the Baillon-Haddad theorem (Baillon & Haddad, 1977; Bauschke & Combettes, 2009).
The forward pass of DYS-net consists of iterating equation 15 until a suitable stopping criterion is met. For simplicity, in presenting Algorithm 1 we take this criterion to be reaching a maximum number of iterations. One may also consider the stopping criterion \|z^{k+1}-z^{k}\|\leq\mathrm{tol}.
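A minimal sketch of this forward pass, assuming the projections P_C1 and P_C2 from the previous listing and a cost vector w = w_{\Theta}(d) (the default values of gamma and the zero initialization are illustrative choices):

```python
import torch

def dys_forward(w, P_C1, P_C2, alpha=0.05, gamma=0.1, K=1000, tol=0.01):
    """Iterate z^{k+1} = T_Θ(z^k) of equations 15-16, with the early
    stopping criterion ||z^{k+1} - z^k|| <= tol."""
    z = torch.zeros_like(w)
    for _ in range(K):
        x = P_C2(z)
        z_new = z - x + P_C1((2 - alpha * gamma) * x - z - alpha * w)
        if torch.norm(z_new - z) <= tol:
            z = z_new
            break
        z = z_new
    return P_C2(z)  # x^K, per Theorem 2
```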

The Backward Pass

From Theorem 2, we deduce the following fixed point condition:

x_{\Theta}=P_{\mathcal{C}_{2}}(z_{\Theta}),\quad\text{ for some }z_{\Theta}\in\{z:z=T_{\Theta}(z)\}. (17)

As discussed in McKenzie et al. (2021), instead of backpropagating through every iterate of the forward pass, we may derive a formula for \mathrm{d}x_{\Theta}/\mathrm{d}\Theta by appealing to the implicit function theorem and differentiating both sides of equation 17:

dzΘdΘ=TΘΘ+TΘzdzΘdΘ𝒥Θ(zΘ)dzΘdΘ=TΘΘ where 𝒥Θ(z)=ITΘz.\frac{\mathrm{d}{z}_{\Theta}}{\mathrm{d}\Theta}=\frac{\partial T_{\Theta}}{\partial\Theta}+\frac{\partial T_{\Theta}}{\partial z}\frac{\mathrm{d}{z}_{\Theta}}{\mathrm{d}\Theta}\quad\implies\quad\mathcal{J}_{\Theta}(z_{\Theta})\ \frac{\mathrm{d}{z}_{\Theta}}{\mathrm{d}\Theta}=\frac{\partial T_{\Theta}}{\partial\Theta}\text{ where }\mathcal{J}_{\Theta}(z)=I-\frac{\partial T_{\Theta}}{\partial z}. (18)

We notice two immediate problems not addressed by McKenzie et al. (2021): (i) T_{\Theta} is not everywhere differentiable with respect to z, as P_{\mathcal{C}_{2}} is not; (ii) if T_{\Theta} were a contraction (i.e. Lipschitz in z with constant less than 1), then \mathcal{J}_{\Theta} would be invertible. However, this is not necessarily the case. Thus, it is not clear a priori that equation 18 can be solved for \mathrm{d}z_{\Theta}/\mathrm{d}\Theta. Our key result (Theorem 6 below) provides reasonable conditions under which \mathcal{J}_{\Theta}(z_{\Theta}) is invertible.

Assuming these issues can be resolved, one may compute the gradient of the loss using the chain rule:

\frac{\mathrm{d}\ell}{\mathrm{d}\Theta}=\frac{\mathrm{d}\ell}{\mathrm{d}x}\frac{\mathrm{d}x_{\Theta}}{\mathrm{d}\Theta}=\frac{\mathrm{d}\ell}{\mathrm{d}x}\left(\frac{\mathrm{d}P_{\mathcal{C}_{2}}}{\mathrm{d}z}\frac{\mathrm{d}z_{\Theta}}{\mathrm{d}\Theta}\right)=\frac{\mathrm{d}\ell}{\mathrm{d}x}\frac{\mathrm{d}P_{\mathcal{C}_{2}}}{\mathrm{d}z}\mathcal{J}_{\Theta}^{-1}\frac{\partial T_{\Theta}}{\partial\Theta}.

This approach requires solving a linear system with \mathcal{J}_{\Theta}, which becomes particularly expensive when n is large. Instead, we use Jacobian-free Backpropagation (JFB) (Fung et al., 2022), also known as one-step differentiation (Bolte et al., 2024), in which the Jacobian \mathcal{J}_{\Theta} is replaced with the identity matrix. This leads to an approximation of the true gradient \mathrm{d}\ell/\mathrm{d}\Theta using

pΘ(xΘ)=[xdP𝒞2dzTΘΘ](x,z)=(xΘ,zΘ).p_{\Theta}(x_{\Theta})=\left[\frac{\partial\ell}{\partial x}\frac{\mathrm{d}P_{\mathcal{C}_{2}}}{\mathrm{d}z}\frac{\partial T_{\Theta}}{\partial\Theta}\right]_{(x,z)=(x_{\Theta},z_{\Theta})}. (19)

This update can be seen as a zeroth order Neumann expansion of \mathcal{J}_{\Theta}^{-1} (Fung et al., 2022; Bolte et al., 2024). We show equation 19 is a valid descent direction by resolving the two problems highlighted above. We begin by rigorously deriving a formula for \partial T_{\Theta}/\partial z. Recall the following generalization of the Jacobian to non-smooth operators due to Clarke (1983).

Definition 3.

For any locally Lipschitz F:\mathbb{R}^{n}\to\mathbb{R}^{n}, let D_{F} denote the set upon which F is differentiable. The Clarke Jacobian of F is the set-valued function defined as

\partial F(\bar{z})=\begin{cases}\left.\frac{\mathrm{d}F}{\mathrm{d}z}\right|_{z=\bar{z}}&\text{ if }\bar{z}\in D_{F}\\ \operatorname{Con}\left\{\lim_{z^{\prime}\to\bar{z}:\,z^{\prime}\in D_{F}}\left.\frac{\mathrm{d}F}{\mathrm{d}z}\right|_{z=z^{\prime}}\right\}&\text{ if }\bar{z}\notin D_{F}\end{cases}

where \operatorname{Con}\{\cdot\} denotes the convex hull of a set.

The Clarke Jacobian of P_{\mathcal{C}_{2}} is easily computable; see Lemma 9. Define the (set-valued) functions

c(\alpha)\triangleq\partial\max(0,\alpha)=\begin{cases}1&\text{ if }\alpha>0\\ 0&\text{ if }\alpha<0\\ [0,1]&\text{ if }\alpha=0\end{cases}\quad\text{ and }\quad\tilde{c}(\alpha)=\begin{cases}1&\text{ if }\alpha>0\\ 0&\text{ if }\alpha\leq 0\end{cases}

Then

\partial P_{\mathcal{C}_{2}}(\bar{z})=\left[\frac{\mathrm{d}}{\mathrm{d}z}\operatorname{ReLU}(z)\right]_{z=\bar{z}}=\mathrm{diag}(c(\bar{z})), (20)

where c is applied element-wise. We shall now provide a consistent rule for selecting, at any \bar{z}, an element of \partial P_{\mathcal{C}_{2}}(\bar{z}). If \bar{z}_{i}\neq 0 for all i, then \partial P_{\mathcal{C}_{2}}(\bar{z}) is a singleton. If one or more \bar{z}_{i}=0, then \partial P_{\mathcal{C}_{2}}(\bar{z}) is multi-valued, so we choose the element of \partial P_{\mathcal{C}_{2}}(\bar{z}) with 0 in the (i,i) position for every \bar{z}_{i}=0. We write \mathrm{d}P_{\mathcal{C}_{2}}/\mathrm{d}z for this chosen element, and note that

\left.\frac{\mathrm{d}P_{\mathcal{C}_{2}}}{\mathrm{d}z}\right|_{z=\bar{z}}=\mathrm{diag}(\tilde{c}(\bar{z}))\in\partial P_{\mathcal{C}_{2}}(\bar{z}).

This aligns with the default rule for assigning a sub-gradient to \operatorname{ReLU} used in the popular machine learning libraries TensorFlow (Abadi et al., 2016), PyTorch (Paszke et al., 2019), and JAX (Bradbury et al., 2018), and has been observed to yield networks which are more stable to train than other choices (Bertoin et al., 2021).
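This convention is easy to check empirically; for instance, in PyTorch:

```python
import torch

z = torch.zeros(3, requires_grad=True)
torch.relu(z).sum().backward()
print(z.grad)  # tensor([0., 0., 0.]): the subgradient 0 is chosen at z_i = 0
```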

Given the above convention, we can compute \partial T_{\Theta}/\partial z, interpreted as an element of the Clarke Jacobian of T_{\Theta} chosen according to a consistent rule. Surprisingly, \partial T_{\Theta}/\partial z may be expressed using only orthogonal projections onto hyperplanes. Throughout, we let e_{i}\in\mathbb{R}^{n} be the one-hot vector with 1 in the i-th position and zeros elsewhere, and a_{i}^{\top} be the i-th row of A. For any subspace \mathcal{H}, we denote the orthogonal subspace by \mathcal{H}^{\perp}. The following theorem, the proof of which is given in Appendix A, characterizes \partial T_{\Theta}/\partial z.

Theorem 4.

Suppose A is full-rank, \mathcal{H}_{1}\triangleq\operatorname{Null}(A), and \mathcal{H}_{2,z}\triangleq\operatorname{Span}\left(e_{i}:z_{i}>0\right). Then for all \bar{z}\in\mathbb{R}^{n},

TΘz|z=z¯=P1P2,z¯+(1αγ)P1P2,z¯.\left.\frac{\partial T_{\Theta}}{\partial z}\right|_{z=\bar{z}}=P_{\mathcal{H}_{1}^{\bot}}P_{\mathcal{H}_{2,\bar{z}}^{\bot}}+(1-\alpha\gamma)\cdot P_{\mathcal{H}_{1}}P_{\mathcal{H}_{2,\bar{z}}}. (21)
Algorithm 1 DYS-Net
1:  Inputs: A and b defining \mathcal{C}; hyperparameters \alpha,\gamma,K
2:  Initialize: Attach neural network w_{\Theta}(\cdot). Compute the SVD of A for the P_{\mathcal{C}_{1}} formula.
3:  Forward Pass: Initialize z^{0}. Given w_{\Theta}(d), compute z^{1},\ldots,z^{K} using equation 15. Return x^{K}\triangleq P_{\mathcal{C}_{1}}(z^{K})\approx x_{\Theta}.
4:  Backward Pass: Compute pΘ(xK)pΘ(xΘ)p_{\Theta}(x^{K})\approx p_{\Theta}({x}_{\Theta}) using equation 19 and return.
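The following PyTorch sketch condenses Algorithm 1 into a single module (an illustration of the JFB mechanism under our notation, not the exact layer in the repository; batching is omitted for clarity). Running the fixed-point iterations inside torch.no_grad() and then applying T_{\Theta} once more with gradients enabled means a subsequent call to .backward() differentiates only through that last step, which is exactly the Jacobian-free update of equation 19. Here we return P_{\mathcal{C}_{2}}(z^{K}), per Theorem 2.

```python
import torch

class DYSLayer(torch.nn.Module):
    """Sketch of Algorithm 1 with JFB. Not the repository's exact layer."""

    def __init__(self, w_theta, P_C1, P_C2, alpha=0.05, gamma=0.1, K=1000):
        super().__init__()
        self.w_theta = w_theta            # neural network d -> cost vector
        self.P_C1, self.P_C2 = P_C1, P_C2
        self.alpha, self.gamma, self.K = alpha, gamma, K

    def T(self, z, w):
        """The operator T_Θ of equation 16."""
        x = self.P_C2(z)
        return z - x + self.P_C1((2 - self.alpha * self.gamma) * x
                                 - z - self.alpha * w)

    def forward(self, d):
        w = self.w_theta(d)
        z = torch.zeros_like(w)
        with torch.no_grad():             # fixed-point loop: no graph stored
            for _ in range(self.K):
                z = self.T(z, w)
        z = self.T(z, w)                  # one graph-attached step => JFB
        return self.P_C2(z)               # x^K ≈ x_Θ, per Theorem 2
```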

To show JFB is applicable, it suffices to verify \|\partial T_{\Theta}/\partial z\|<1 when evaluated at z_{\Theta}. Theorem 4 enables us to show this inequality holds when x_{\Theta} satisfies a commonly-used regularity condition, which we formalize as follows.

Definition 5 (LICQ condition, specialized to our case).

Let x_{\Theta} denote the solution to equation 8. Let \mathcal{A}(x_{\Theta})\subseteq\{1,\ldots,n\} denote the set of active positivity constraints:

\mathcal{A}(x_{\Theta})\triangleq\{i:\ [x_{\Theta}]_{i}=0\}. (22)

The point x_{\Theta} satisfies the Linear Independence Constraint Qualification (LICQ) condition if the vectors

\{a_{1},\ldots,a_{m}\}\cup\{e_{i}:i\in\mathcal{A}(x_{\Theta})\} (23)

are linearly independent.

Theorem 6.

If the LICQ condition holds at x_{\Theta}, A is full-rank, and \alpha\in(0,2/\gamma), then \left\|\partial T_{\Theta}/\partial z\right\|_{z=z_{\Theta}}<1.

The significance of Theorem 6, which is proved in Appendix A, is outlined by the following corollary, stating that using p_{\Theta} instead of the true gradient \mathrm{d}\ell/\mathrm{d}\Theta is still guaranteed to decrease \ell(\Theta), at least for a small enough step-size.

Corollary 7.

If T_{\Theta} is continuously differentiable with respect to \Theta at z_{\Theta}, the assumptions in Theorem 6 hold, and (\partial T_{\Theta}/\partial\Theta)^{\top}(\partial T_{\Theta}/\partial\Theta) has sufficiently small condition number, then p_{\Theta}(x_{\Theta}) is a descent direction for \ell with respect to \Theta.

Theorem 6 also proves that \mathcal{J}_{\Theta}(z_{\Theta}) in equation 18 is invertible, thus justifying the use of Jacobian-based backpropagation as well as numerous other gradient approximation techniques (Liao et al., 2018; Geng et al., 2021).

Finally, we note that in practice p_{\Theta}(x^{K}) is used instead of p_{\Theta}(x_{\Theta}). Fung et al. (2022, Corollary 0.1) guarantees that p_{\Theta}(x^{K}) will still be a descent direction for K sufficiently large; see also Bolte et al. (2024, Cor. 1).

Synergy between forward and backward pass

We highlight that the pseudogradient computed in the backward pass (i.e., p_{\Theta}) does not require any Lagrange multipliers, in contrast with the gradient computed using equation 12. This is why we may use Davis-Yin splitting, as opposed to a primal-dual interior point method, on the forward pass. This allows efficiency gains on the forward pass, as Davis-Yin splitting scales favorably with the problem dimension.

Implementation

DYS-net is implemented as an abstract PyTorch (Paszke et al., 2019) layer in the provided code. To instantiate it, a user need only specify A and b (see equation 3) and choose an architecture for w_{\Theta}. At test time, it can be beneficial to solve the underlying combinatorial problem equation 2 exactly using a specialized algorithm, e.g. commercial integer programming software for the knapsack prediction problem. This can easily be done with DYS-net by instantiating the test-time-forward abstract method.
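As a hedged usage sketch only (building on the DYSLayer sketch above; my_network, d, and x_true are placeholders, and the actual interface in the repository may differ):

```python
# Hypothetical usage sketch; names are illustrative, not the repo's API.
layer = DYSLayer(w_theta=my_network, P_C1=P_C1, P_C2=P_C2)
x_pred = layer(d)                         # differentiable approx. of x_Θ(d)
loss = ((x_pred - x_true) ** 2).sum()     # the ℓ2 loss of equation 5
loss.backward()                           # gradients via JFB, equation 19
```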

5 Numerical Experiments

We demonstrate the effectiveness of DYS-net through four experiments. In all our experiments with DYS-net we used \alpha=0.05 and K=1{,}000, with an early stopping condition of \|z^{k+1}-z^{k}\|_{2}\leq\mathrm{tol} with \mathrm{tol}=0.01. We compare DYS-net against three other methods: Perturbed Optimization (Berthet et al., 2020), denoted PertOpt-net; an approach using Blackbox Backpropagation (Vlastelica et al., 2019), denoted BB-net; and an approach using cvxpylayers (Agrawal et al., 2019a), denoted CVX-net. All code needed to reproduce our experiments is available at https://github.com/mines-opt-ml/fpo-dys.

5.1 Experiments using PyEPO

First, we verify the efficiency of DYS-net by performing benchmarking experiments within the PyEPO (Tang & Khalil, 2022) framework.

Models

We use the PyEPO implementations of Perturbed Optimization and Blackbox Backpropagation as the cores of PertOpt-net and BB-net, respectively. We use the same architecture for w_{\Theta}(d) for all models (the architecture we use for shortest path prediction problems differs from that used for knapsack prediction problems; see the appendix for details) and use an appropriate combinatorial solver at test time, via the PyEPO interface to Gurobi. Thus, the methods differ only in training procedure. The precise architectures used for w_{\Theta}(d) for each problem are described in Appendix C, and are also viewable in the accompanying code repository. Interestingly, we observed that adding dropout to the output layer during training proved beneficial for the knapsack prediction problem. Without this, we find that w_{\Theta} tends to output a sparse approximation to w supported on a feasible set of items, and so does not generalize well.

Training

We train for a maximum of 100 epochs or 30 minutes, whichever comes first. We use a validation set for model selection as we observe that, for all methods, the best loss is seldom achieved at the final iteration. Exact hyperparameters are discussed in Appendix C. For each problem size, we perform three repeats with different randomly generated data. Results are displayed in Figure 2.

Evaluation

Given test data \{(d_{i},w(d_{i}),x^{\star}(d_{i}))\}_{i=1}^{N} and a fully trained model w_{\Theta}(d), we define the regret:

r(d_{i})=w(d_{i})^{\top}\left(x_{\Theta}(d_{i})-x^{\star}(d_{i})\right). (24)

Note that regret is non-negative, and is zero precisely when x_{\Theta}(d_{i}) is a solution of equal quality to x^{\star}(d_{i}). Following Tang & Khalil (2022), we evaluate the quality of w_{\Theta}(d) using the normalized regret, defined as

r~=i=1Nr(di)i=1N[w(di)x(di)].\tilde{r}=\frac{\sum_{i=1}^{N}r(d_{i})}{\sum_{i=1}^{N}\left[w(d_{i})^{\top}x^{\star}(d_{i})\right]}. (25)
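In code, equation 25 reduces to two inner products per test sample; a sketch assuming tensors w_true, x_pred, and x_true, each of shape (N, n):

```python
def normalized_regret(w_true, x_pred, x_true):
    """Equation 25: summed regret over summed optimal objective values."""
    regret = ((x_pred - x_true) * w_true).sum()   # Σ_i w(d_i)^T (x_Θ - x*)
    optimal = (w_true * x_true).sum()             # Σ_i w(d_i)^T x*(d_i)
    return (regret / optimal).item()
```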

5.1.1 Shortest Path Prediction

The shortest path between two vertices in a graph \mathcal{G}=(\mathcal{V},\mathcal{E}) can be found by solving:

x^{\star}=\operatorname*{arg\,min}_{x\in\mathcal{X}}w(d)^{\top}x;\quad\mathcal{X}=\{x\in\{0,1\}^{|\mathcal{E}|}:Ex=b\}, (26)

where E is the vertex-edge incidence matrix, b encodes the initial and terminal vertices, and w(d)\in\mathbb{R}^{|\mathcal{E}|} is a vector encoding the (d-dependent) edge lengths. Here, x^{\star} encodes the edges included in the optimal path. In this experiment we focus on the case where \mathcal{G} is the k\times k grid graph. We use the PyEPO (Tang & Khalil, 2022) benchmarking software to generate training datasets of the form \{(d_{i},x^{\star}_{i}\approx x^{\star}(d_{i}))\}_{i=1}^{N}, where N=1000, for each k\in\{5,10,15,20,25,30\}. The d_{i} are sampled from the five-dimensional multivariate Gaussian distribution with mean 0 and identity covariance. Note that the number of variables in equation 26 scales like k^{2}, not k.
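As a concrete sketch, the following constructs E and b for the k-by-k grid graph under the convention that edges point right and down (this convention yields 2k(k-1) edges, matching the variable counts in Table 1; the exact construction used in our experiments may differ):

```python
import numpy as np

def grid_incidence(k):
    """Incidence matrix E and right-hand side b for the k-by-k grid graph,
    with edges directed right and down. Ex = b enforces one unit of flow
    from the top-left vertex to the bottom-right vertex."""
    edges = []
    for i in range(k):
        for j in range(k):
            v = i * k + j
            if j + 1 < k:
                edges.append((v, v + 1))   # edge to the right neighbor
            if i + 1 < k:
                edges.append((v, v + k))   # edge to the neighbor below
    E = np.zeros((k * k, len(edges)))
    for e, (u, v) in enumerate(edges):
        E[u, e], E[v, e] = 1.0, -1.0       # +1 where an edge leaves, -1 enters
    b = np.zeros(k * k)
    b[0], b[-1] = 1.0, -1.0                # source top-left, sink bottom-right
    return E, b
```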

5.1.2 Knapsack Prediction

In the (0–1, single) knapsack prediction problem, we are presented with a container (i.e. a knapsack) of size c and I items, of sizes s_{1},\ldots,s_{I} and values w_{1}(d),\ldots,w_{I}(d). The goal is to select the subset of maximum value that fits in the container, i.e. to solve:

x^{\star}=\operatorname*{arg\,max}_{x\in\mathcal{X}}w(d)^{\top}x;\quad\mathcal{X}=\{x\in\{0,1\}^{I}:\ s^{\top}x\leq c\}.

Here, x^{\star} encodes the selected items. In the (0–1) k-knapsack prediction problem, we imagine the container having various notions of “size” (e.g. length, volume, weight limit), and hence a k-tuple of capacities c\in\mathbb{R}^{k}. Correspondingly, the items each have a k-tuple of sizes s_{1},\ldots,s_{I}\in\mathbb{R}^{k}. We aim to select a subset of maximum value, amongst all subsets satisfying the k capacity constraints:

\begin{split}&x^{\star}=\operatorname*{arg\,max}_{x\in\mathcal{X}}w(d)^{\top}x\\ &\mathcal{X}=\{x\in\{0,1\}^{I}:\ Sx\leq c\}\\ &S=\begin{bmatrix}s_{1}&\cdots&s_{I}\end{bmatrix}\in\mathbb{R}^{k\times I}\end{split} (27)

In Appendix B we discuss how to transform \mathcal{X} into the canonical form discussed in Section 2. We again use PyEPO to generate datasets of the form \{(d_{i},x^{\star}_{i}\approx x^{\star}(d_{i}))\}_{i=1}^{N}, where N=1000, for k=2 and I varying from 50 to 750 inclusive in increments of 50. The d_{i} are sampled from the five-dimensional multivariate Gaussian distribution with mean 0 and identity covariance.
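The reduction to the canonical form of equation 3 follows the standard slack-variable recipe; as a sketch (the precise construction in Appendix B may differ in details), one introduces slacks \xi\in\mathbb{R}^{k} for the capacity constraints and \zeta\in\mathbb{R}^{I} for the box constraints x\leq 1, and negates w to convert maximization into minimization:

\mathcal{C}=\Bigl\{(x,\xi,\zeta)\in\mathbb{R}^{I}\times\mathbb{R}^{k}\times\mathbb{R}^{I}:\ Sx+\xi=c,\ \ x+\zeta=\mathbf{1},\ \ x,\xi,\zeta\geq 0\Bigr\},
\qquad
A=\begin{bmatrix}S&I_{k}&0\\ I_{I}&0&I_{I}\end{bmatrix},
\qquad
b=\begin{bmatrix}c\\ \mathbf{1}\end{bmatrix}.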

(a) Normalized Regret
(b) Train Time (in seconds)
(c) Normalized Regret
(d) Train Time (in seconds)
Figure 2: Results for the shortest path and knapsack prediction problems. Figures (a) and (b) show normalized regret and train time for the shortest path prediction problem, while Figures (c) and (d) show normalized regret and train time for the knapsack prediction problem. Note that the train time in Figures (b) and (d) is the time until the model achieving the best normalized regret on the validation set is reached.

5.1.3 Results

As is clear from Figure 2, DYS-net trains the fastest among the four methods, and achieves the best or near-best normalized regret in all cases. We attribute the accuracy to the fact that it is not a zeroth order method. Moreover, avoiding computation of, and backpropagation through, the projection P_{\mathcal{C}} allows DYS-net to have fast runtimes.

5.2 Large-scale shortest path prediction

Our shortest path prediction experiment described in Section 5.1.1 is bottlenecked by the fact that the “academic” license of Gurobi does not allow for shortest path prediction problems on grids larger than 30\times 30. To further explore the limits of DYS-net, we redo this experiment using a custom PyTorch implementation of Dijkstra’s algorithm (adapted from TensorFlow code available at https://github.com/google-research/google-research/blob/master/perturbations/experiments/shortest_path.py) as our base solver for equation 26.

Data Generation

We generate datasets \mathcal{D}=\{(d_{i},x^{\star}(d_{i}))\}_{i=1}^{1000} for k-by-k grids, where k\in\{5,10,20,30,50,100\} and the d_{i} are sampled uniformly at random from [0,1]^{5}, the true edge weights are computed as w(d)=Wd for fixed W\in\mathbb{R}^{|\mathcal{E}|\times 5}, and x^{\star}(d) is computed from w(d) using the aforementioned PyTorch implementation of Dijkstra’s algorithm. Thus, all entries of w(d) are non-negative, and are positive with overwhelming probability.

Figure 3: Accuracy (in percentage) of predicted paths on 5-by-5 grid during training.
grid size | number of variables | neural network size
5-by-5 | 40 | 500
10-by-10 | 180 | 2040
20-by-20 | 760 | 8420
30-by-30 | 1740 | 19200
50-by-50 | 4900 | 53960
100-by-100 | 19800 | 217860
Table 1: Number of variables (i.e. number of edges) per grid size for the shortest path prediction problem of Section 5.2. Third column: number of parameters in w_{\Theta}(d) for DYS-net, CVX-net, and PertOpt-net. For BB-net, we found a latent dimension 20 times larger than that of the aforementioned three to be more effective.
(a) Test MSE Loss
(b) Training Time (min)
(c) Regret Values
Figure 4: Results for the shortest path prediction problem: (a) test MSE loss, (b) training time in minutes, and (c) regret values, each vs. grid size, for DYS-net (proposed) and approaches using cvxpylayers (Agrawal et al., 2019a), labeled CVX; Perturbed Optimization (Berthet et al., 2020), labeled PertOpt; and Blackbox Backpropagation (Vlastelica et al., 2019), labeled BB. Note CVX is unable to load or run problems with grid size over 30. Dimensions of the variables can be found in Table 1.
Models and Training

We test the same four approaches as in Sections 5.1.1 and 5.1.2, but unlike in those sections we do not use the PyEPO implementations of PertOpt-net or BB-net. We cannot, as the PyEPO implementations call Gurobi to solve equation 26 which, as mentioned above, cannot handle grids larger than 30\times 30. Instead, we use custom implementations based on the code accompanying the works introducing PertOpt (Berthet et al., 2020) and BB (Vlastelica et al., 2019). We use the same architecture for w_{\Theta}(d) for DYS-net, PertOpt-net, and CVX-net: a two-layer fully connected neural network with leaky ReLU activation functions. For DYS-net and CVX-net we do not modify the forward pass at test time. For BB-net we use a larger network, with a latent dimension 20 times larger than that of the first three, as we found this more effective. Network sizes can be seen in Table 1.
We tuned the hyperparameters for each approach to the best of our ability on the smallest problem (the 5-by-5 grid) and then used these hyperparameter values for all other graph sizes. See Figure 3 for the results of this training run. We train all approaches for 100 epochs total on each problem using the \ell_{2} loss equation 5.

Results

The results are displayed in Figure 4. While CVX-net and PertOpt-net achieve low regret for small grids, DYS-net achieves low regret for all grids. In addition to training faster, DYS-net can also be trained for much larger problems, e.g. 100-by-100 grids, as shown in Figure 4. We found that CVX-net could not handle grids larger than 30-by-30, i.e. problems with more than 1740 variables; this is to be expected, as discussed in Amos & Kolter (2017); Agrawal et al. (2019a) (see Table 1). Importantly, PertOpt-net takes close to a week to train for the 100-by-100 problem, whereas DYS-net takes about a day (see Figure 4b). On the other hand, the training speed of BB-net is comparable to that of DYS-net, but does not lead to competitive accuracy, as shown in Figure 4a. Interpreting the outputs of DYS-net and CVX-net as (unnormalized) probabilities over the grid, one can use a greedy algorithm to determine the most probable path from top-left to bottom-right. For small grids, e.g. 5-by-5, this path coincides exactly with the true path for most d (see Figure 3).

5.3 Warcraft shortest path prediction

Finally, as an illustrative example, we consider the Warcraft terrains dataset first studied in Vlastelica et al. (2019). As shown in Figure 1, d is a 96-by-96 RGB image, divided into a 12-by-12 grid of terrain tiles. Each tile has a different cost to traverse, and the goal is to find the quickest path from the top-left corner to the bottom-right corner.

Models, Training, and Evaluation

We build upon the code provided as part of the PyEPO package (Tang & Khalil, 2022). We use the same architecture for w_{\Theta}(d) for BB-net, PertOpt-net, and DYS-net: a truncated ResNet18 (He et al., 2016), as first proposed in Vlastelica et al. (2019). We train each network for 50 epochs, except for the baseline. The initial learning rate is 5\times 10^{-4}, and it is decreased by a factor of 10 after 30 and 40 epochs respectively. PertOpt-net and DYS-net are trained to minimize the \ell_{2} loss equation 5, while BB-net is trained to minimize the Hamming loss, as described in Vlastelica et al. (2019). We evaluate the quality of w_{\Theta}(d) using the normalized regret equation 25.

Results

The results for this experiment are shown in Table 2. Interestingly, BB-net and PertOpt-net are much more competitive in this experiment than in the experiments of Sections 5.1.1, 5.1.2, and 5.2. We attribute this to the “discrete” nature of the true cost vector w(d): for the Warcraft problem, entries of w(d) can only take on four, well-separated values, as opposed to the more “continuous” nature of w(d) in the previous experiments. Thus, an algorithm that learns a rough approximation to the true cost vector will be more successful on the Warcraft problem than on the shortest path problems of Sections 5.1.1 and 5.2. This difference is illustrated in Figure 5. We also note that model selection using a validation set is beneficial for all approaches, but particularly so for BB-net and DYS-net, which achieve their best-performing models in the first several epochs of training.

Figure 5: Left two figures: sample cost matrices for the shortest path problem considered in Section 5.1.1. Right two figures: sample cost matrices for the Warcraft shortest path prediction problem considered in Section 5.3. Note that Section 5.3 considers the node-weighted shortest path prediction problem, while Section 5.1.1 solves the edge-weighted variant. For ease of comparison, in the left two figures we have reshaped the edge cost vector into a node cost matrix.
Algorithm | Test Normalized Regret | Time (in hours)
BB-net | 0.1204 | 0.11
PertOpt-net | 0.1089 | 2.07
DYS-net | 0.0889 | 0.09
CVX-net | 0.0214 | 2.33
Table 2: Results for the 12-by-12 Warcraft shortest path prediction problem. We select the model achieving the best normalized regret on the validation set. The time displayed is the time until this best normalized regret is achieved. All results are averaged over three runs.

6 Conclusions

This work presents a new method, which we call DYS-net, for learning to solve ILPs using Davis-Yin splitting. We prove that the gradient approximation computed in the backward pass of DYS-net is indeed a descent direction, thus advancing the current understanding of implicit networks. Our experiments show that DYS-net is capable of scaling to truly large ILPs, and outperforms existing state-of-the-art methods, at least when training data is of the form (d,x^{\star}(d)). We note that in principle DYS-net may be applied given training data of the form (d,w(d)), a common paradigm for many predict-then-optimize problems (Mandi et al., 2023). However, preliminary experiments showed that SPO+ (Elmachtoub & Grigas, 2022) achieved a substantially lower regret than DYS-net in this setting. We leave the extension of DYS-net to this data type to future work. Our experiments also reveal an interesting dichotomy between problems in which entries of w(d) may take on only a handful of discrete values and problems in which w(d) is more “continuous”. Future work could explore this dichotomy further, as well as apply DYS-net to additional ILPs, for example the traveling salesperson problem.

References

  • Abadi et al. (2016) Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: a system for large-scale machine learning. In 12th USENIX symposium on operating systems design and implementation (OSDI 16), pp.  265–283, 2016.
  • Agrawal et al. (2019a) A. Agrawal, B. Amos, S. Barratt, S. Boyd, S. Diamond, and Z. Kolter. Differentiable convex optimization layers. In Advances in Neural Information Processing Systems, 2019a.
  • Agrawal et al. (2019b) Akshay Agrawal, Brandon Amos, Shane Barratt, Stephen Boyd, Steven Diamond, and J Zico Kolter. Differentiable convex optimization layers. Advances in neural information processing systems, 32, 2019b.
  • Agrawal et al. (2019c) Akshay Agrawal, Shane Barratt, Stephen Boyd, Enzo Busseti, and Walaa M Moursi. Differentiating through a cone program. arXiv preprint arXiv:1904.09043, 2019c.
  • Amos & Kolter (2017) Brandon Amos and J Zico Kolter. Optnet: Differentiable optimization as a layer in neural networks. In International Conference on Machine Learning, pp.  136–145. PMLR, 2017.
  • Bai et al. (2019) Shaojie Bai, J Zico Kolter, and Vladlen Koltun. Deep equilibrium models. Advances in Neural Information Processing Systems, 32, 2019.
  • Baillon & Haddad (1977) Jean-Bernard Baillon and Georges Haddad. Quelques propriétés des opérateurs angle-bornés et n-cycliquement monotones. Israel Journal of Mathematics, 26:137–150, 1977.
  • Bauschke & Combettes (2009) Heinz H Bauschke and Patrick L Combettes. The Baillon-Haddad theorem revisited. arXiv preprint arXiv:0906.0807, 2009.
  • Bengio (1997) Yoshua Bengio. Using a financial training criterion rather than a prediction criterion. International journal of neural systems, 8(04):433–443, 1997.
  • Berthet et al. (2020) Quentin Berthet, Mathieu Blondel, Olivier Teboul, Marco Cuturi, Jean-Philippe Vert, and Francis Bach. Learning with differentiable perturbed optimizers. Advances in neural information processing systems, 33:9508–9519, 2020.
  • Bertoin et al. (2021) David Bertoin, Jérôme Bolte, Sébastien Gerchinovitz, and Edouard Pauwels. Numerical influence of ReLU’(0) on backpropagation. Advances in Neural Information Processing Systems, 34:468–479, 2021.
  • Bertsekas (1997) Dimitri P Bertsekas. Nonlinear programming. Journal of the Operational Research Society, 48(3):334–334, 1997.
  • Blondel et al. (2022) Mathieu Blondel, Quentin Berthet, Marco Cuturi, Roy Frostig, Stephan Hoyer, Felipe Llinares-López, Fabian Pedregosa, and Jean-Philippe Vert. Efficient and modular implicit differentiation. Advances in neural information processing systems, 35:5230–5242, 2022.
  • Bolte et al. (2021) Jérôme Bolte, Tam Le, Edouard Pauwels, and Tony Silveti-Falls. Nonsmooth implicit differentiation for machine-learning and optimization. Advances in neural information processing systems, 34:13537–13549, 2021.
  • Bolte et al. (2024) Jérôme Bolte, Edouard Pauwels, and Samuel Vaiter. One-step differentiation of iterative algorithms. Advances in Neural Information Processing Systems, 36, 2024.
  • Bradbury et al. (2018) James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, et al. JAX: composable transformations of Python+NumPy programs, 2018.
  • Chen et al. (2021) Bingqing Chen, Priya L Donti, Kyri Baker, J Zico Kolter, and Mario Bergés. Enforcing policy feasibility constraints through differentiable projection for energy optimization. In Proceedings of the Twelfth ACM International Conference on Future Energy Systems, pp.  199–210, 2021.
  • Chen et al. (2022) Tianlong Chen, Xiaohan Chen, Wuyang Chen, Howard Heaton, Jialin Liu, Zhangyang Wang, and Wotao Yin. Learning to optimize: A primer and a benchmark. Journal of Machine Learning Research, 23(189):1–59, 2022.
  • Chen et al. (2018) Xiaohan Chen, Jialin Liu, Zhangyang Wang, and Wotao Yin. Theoretical linear convergence of unfolded ista and its practical weights and thresholds. Advances in Neural Information Processing Systems, 31, 2018.
  • Clarke (1983) F. H. Clarke. Optimization and Nonsmooth Analysis. Wiley-Interscience, New York, 1983.
  • Condat (2016) Laurent Condat. Fast projection onto the simplex and the \ell_{1} ball. Mathematical Programming, 158(1):575–585, 2016.
  • Dalle et al. (2022) Guillaume Dalle, Léo Baty, Louis Bouvier, and Axel Parmentier. Learning with combinatorial optimization layers: a probabilistic approach. arXiv preprint arXiv:2207.13513, 2022.
  • Davis & Yin (2017) Damek Davis and Wotao Yin. A three-operator splitting scheme and its optimization applications. Set-valued and variational analysis, 25(4):829–858, 2017.
  • Duchi et al. (2008) John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient projections onto the 1\ell_{1}-ball for learning in high dimensions. In Proceedings of the 25th international conference on Machine learning, pp.  272–279, 2008.
  • El Ghaoui et al. (2021) Laurent El Ghaoui, Fangda Gu, Bertrand Travacca, Armin Askari, and Alicia Tsai. Implicit deep learning. SIAM Journal on Mathematics of Data Science, 3(3):930–958, 2021.
  • Elmachtoub & Grigas (2022) Adam N Elmachtoub and Paul Grigas. Smart “predict, then optimize”. Management Science, 68(1):9–26, 2022.
  • Ferber et al. (2020) Aaron Ferber, Bryan Wilder, Bistra Dilkina, and Milind Tambe. MIPaaL: Mixed integer program as a layer. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp.  1504–1511, 2020.
  • Fung et al. (2022) Samy Wu Fung, Howard Heaton, Qiuwei Li, Daniel McKenzie, Stanley Osher, and Wotao Yin. JFB: Jacobian-free backpropagation for implicit networks. In Proceedings of the AAAI Conference on Artificial Intelligence, 2022.
  • Geng et al. (2021) Zhengyang Geng, Xin-Yu Zhang, Shaojie Bai, Yisen Wang, and Zhouchen Lin. On training implicit models. Advances in Neural Information Processing Systems, 34:24247–24260, 2021.
  • Gilton et al. (2021) Davis Gilton, Gregory Ongie, and Rebecca Willett. Deep equilibrium architectures for inverse problems in imaging. IEEE Transactions on Computational Imaging, 7:1123–1133, 2021.
  • Guyomarch (2017) Jean Guyomarch. Warcraft II open-source map editor, 2017. http://github.com/war2/war2edit.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  770–778, 2016.
  • Heaton & Fung (2023) Howard Heaton and Samy Wu Fung. Explainable ai via learning to optimize. Scientific Reports, 13(1):10103, 2023.
  • Kacem et al. (2021) Imed Kacem, Hans Kellerer, and A Ridha Mahjoub. Preface: New trends on combinatorial optimization for network and logistical applications. Annals of Operations Research, 298(1):1–5, 2021.
  • Karp (1972) Richard M Karp. Reducibility among combinatorial problems. In Complexity of computer computations, pp.  85–103. Springer, 1972.
  • Kotary et al. (2021) James Kotary, Ferdinando Fioretto, Pascal van Hentenryck, and Bryan Wilder. End-to-end constrained optimization learning: A survey. In 30th International Joint Conference on Artificial Intelligence, IJCAI 2021, pp.  4475–4482. International Joint Conferences on Artificial Intelligence, 2021.
  • Li & Malik (2017) Ke Li and Jitendra Malik. Learning to optimize. In International Conference on Learning Representations, 2017.
  • Li et al. (2023) Qiuwei Li, Daniel McKenzie, and Wotao Yin. From the simplex to the sphere: faster constrained optimization using the hadamard parametrization. Information and Inference: A Journal of the IMA, 12(3):1898–1937, 2023.
  • Liao et al. (2018) Renjie Liao, Yuwen Xiong, Ethan Fetaya, Lisa Zhang, KiJung Yoon, Xaq Pitkow, Raquel Urtasun, and Richard Zemel. Reviving and improving recurrent back-propagation. In International Conference on Machine Learning, pp.  3082–3091. PMLR, 2018.
  • Liu et al. (2023) Jialin Liu, Xiaohan Chen, Zhangyang Wang, Wotao Yin, and HanQin Cai. Towards constituting mathematical structures for learning to optimize. In International Conference on Machine Learning, pp.  21426–21449. PMLR, 2023.
  • Liu et al. (2022) Jiaming Liu, Xiaojian Xu, Weijie Gan, Ulugbek Kamilov, et al. Online deep equilibrium learning for regularization by denoising. Advances in Neural Information Processing Systems, 35:25363–25376, 2022.
  • Mandi & Guns (2020) Jayanta Mandi and Tias Guns. Interior point solving for LP-based prediction+optimisation. Advances in Neural Information Processing Systems, 33:7272–7282, 2020.
  • Mandi et al. (2023) Jayanta Mandi, James Kotary, Senne Berden, Maxime Mulamba, Victor Bucarey, Tias Guns, and Ferdinando Fioretto. Decision-focused learning: Foundations, state of the art, benchmark and future opportunities. arXiv preprint arXiv:2307.13565, 2023.
  • McKenzie et al. (2021) Daniel McKenzie, Howard Heaton, Qiuwei Li, Samy Wu Fung, Stanley Osher, and Wotao Yin. Operator splitting for learning to predict equilibria in convex games. arXiv e-prints, pp.  arXiv–2106, 2021.
  • Nesterov & Spokoiny (2017) Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17:527–566, 2017.
  • Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.
  • Pedregosa & Gidel (2018) Fabian Pedregosa and Gauthier Gidel. Adaptive three operator splitting. In International Conference on Machine Learning, pp.  4085–4094. PMLR, 2018.
  • Piorkowski et al. (2009) Michal Piorkowski, Natasa Sarafijanovic-Djukic, and Matthias Grossglauser. CRAWDAD data set epfl/mobility (v. 2009-02-24), 2009.
  • Pogančić et al. (2019) Marin Vlastelica Pogančić, Anselm Paulus, Vit Musil, Georg Martius, and Michal Rolinek. Differentiation of blackbox combinatorial solvers. In International Conference on Learning Representations, 2019.
  • Ruthotto et al. (2018) Lars Ruthotto, Julianne Chung, and Matthias Chung. Optimal experimental design for inverse problems with state constraints. SIAM Journal on Scientific Computing, 40(4):B1080–B1100, 2018.
  • Ryu & Yin (2022) Ernest Ryu and Wotao Yin. Large-Scale Convex Optimization: Algorithm Designs via Monotone Operators. Cambridge University Press, Cambridge, England, 2022.
  • Sahoo et al. (2022) Subham Sekhar Sahoo, Anselm Paulus, Marin Vlastelica, Vít Musil, Volodymyr Kuleshov, and Georg Martius. Backpropagation through combinatorial algorithms: Identity with projection works. In The Eleventh International Conference on Learning Representations, 2022.
  • Sbihi & Eglese (2010) Abdelkader Sbihi and Richard W Eglese. Combinatorial optimization and green logistics. Annals of Operations Research, 175(1):159–175, 2010.
  • Tang & Khalil (2022) Bo Tang and Elias Boutros Khalil. PyEPO: A PyTorch-based end-to-end predict-then-optimize library with linear objective function. In OPT 2022: Optimization for Machine Learning (NeurIPS 2022 Workshop), 2022.
  • Vapnik (1999) Vladimir Vapnik. The nature of statistical learning theory. Springer Science & Business Media, 1999.
  • Wang & Tang (2021) Qi Wang and Chunlei Tang. Deep reinforcement learning for transportation network combinatorial optimization: A survey. Knowledge-Based Systems, 233:107526, 2021.
  • Wilder et al. (2019) Bryan Wilder, Bistra Dilkina, and Milind Tambe. Melding the data-decisions pipeline: Decision-focused learning for combinatorial optimization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp.  1658–1665, 2019.
  • Yurtsever et al. (2021) Alp Yurtsever, Varun Mangalick, and Suvrit Sra. Three operator splitting with a nonconvex loss function. In International Conference on Machine Learning, pp.  12267–12277. PMLR, 2021.
  • Zhong & Tang (2021) Liwei Zhong and Guochun Tang. Preface: Combinatorial optimization drives the future of health care. Journal of Combinatorial Optimization, 42(4):675–676, 2021.
  • Ziegler (2012) Günter M Ziegler. Lectures on polytopes, volume 152. Springer Science & Business Media, 2012.

Appendix A Appendix A: Proofs

For the reader’s convenience we restate each result given in the main text before proving it.

Theorem 2. Let 𝒞1,𝒞2\mathcal{C}_{1},\mathcal{C}_{2} be as in equation 14, and suppose fΘ(x;γ,d)=wΘ(d)x+γ2x22f_{\Theta}(x;\gamma,d)=w_{\Theta}(d)^{\top}x+\frac{\gamma}{2}\|x\|_{2}^{2} for any neural network wΘ(d)w_{\Theta}(d). For any α(0,2/γ)\alpha\in(0,2/\gamma) define the sequence {zk}\{z^{k}\} by:

zk+1=TΘ(zk) for all kz^{k+1}=T_{\Theta}(z^{k})\quad\text{ for all }k\in\mathbb{N} (28)

where

TΘ(z)zP𝒞2(z)+P𝒞1((2αγ)P𝒞2(z)zαwΘ(d)).T_{\Theta}(z)\!\!\triangleq z\!-\!P_{\mathcal{C}_{2}}(z)+P_{\mathcal{C}_{1}}\!\left(\left(2-\alpha\gamma\right)P_{\mathcal{C}_{2}}(z)\!-\!z\!-\!\alpha w_{\Theta}(d)\right). (29)

If xkP𝒞2(zk)x^{k}\triangleq P_{\mathcal{C}_{2}}(z^{k}) then the sequence {xk}\{x^{k}\} converges to xΘx_{\Theta} in equation 8 and xk+1xk22=O(1/k)\|x^{k+1}-x^{k}\|^{2}_{2}=O(1/k).

Proof.

First note that \nabla_{x}f_{\Theta}(x;\gamma,d)=w_{\Theta}(d)+\gamma x. Because f_{\Theta}(\,\cdot\,;\gamma,d) is strongly convex, {x}_{\Theta} is unique and is characterized by the first-order optimality condition:

\nabla_{x}f_{\Theta}({x}_{\Theta};\gamma,d)^{\top}\left(x-{x}_{\Theta}\right)\geq 0\quad\text{ for all }x\in\mathcal{C}. (30)

Note that equation 30 can equally be viewed as a variational inequality with operator F_{\Theta}(x;d)=\nabla_{x}f_{\Theta}(x;\gamma,d). We deduce that z_{\Theta} with x_{\Theta}=P_{\mathcal{C}_{2}}(z_{\Theta}) is a fixed point of

TΘ(z)\displaystyle T_{\Theta}(z) zP𝒞2(z)+P𝒞1(2P𝒞2(z)zα[wΘ(d)+γP𝒞2(z)]))\displaystyle\triangleq z\!-\!P_{\mathcal{C}_{2}}(z)+P_{\mathcal{C}_{1}}\left(2\cdot P_{\mathcal{C}_{2}}(z)\!-\!z\!-\!\alpha\left[w_{\Theta}(d)+\gamma P_{\mathcal{C}_{2}}(z)\right])\right) (31a)
=zP𝒞2(z)+P𝒞1((2αγ)P𝒞2(z)zαwΘ(d))\displaystyle=z\!-\!P_{\mathcal{C}_{2}}(z)+P_{\mathcal{C}_{1}}\left(\left(2-\alpha\gamma\right)\cdot P_{\mathcal{C}_{2}}(z)-z-\alpha w_{\Theta}(d)\right) (31b)

from McKenzie et al. (2021, Theorem 4.2), which is itself a standard application of ideas from operator splitting (Davis & Yin, 2017; Ryu & Yin, 2022).
As \nabla_{x}f_{\Theta}(\,\cdot\,;\gamma,d) is \gamma-Lipschitz continuous, it is also 1/\gamma-cocoercive by the Baillon-Haddad theorem (Baillon & Haddad, 1977; Bauschke & Combettes, 2009). McKenzie et al. (2021, Theorem 3.3) then implies that z^{k}\to z_{\Theta} and x^{k}\to x_{\Theta}. However, this theorem does not give a convergence rate. In fact, the rate can be deduced from known results: because T_{\Theta} is averaged for \alpha<2/\gamma, the rate \|z^{k+1}-z^{k}\|_{2}^{2}=\mathcal{O}(1/k) follows from Ryu & Yin (2022, Theorem 1). Thus,

\|x^{k+1}-x^{k}\|_{2}^{2}=\|P_{\mathcal{C}_{2}}(z^{k+1})-P_{\mathcal{C}_{2}}(z^{k})\|_{2}^{2}\leq\|z^{k+1}-z^{k}\|_{2}^{2}=\mathcal{O}(1/k), (32)

as desired, where the inequality uses the nonexpansiveness of P_{\mathcal{C}_{2}}. ∎
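To make the iteration in equations 28–29 concrete, below is a minimal NumPy sketch, assuming \mathcal{C}_{1}=\{x:Ax=b\} and \mathcal{C}_{2}=\{x:x\geq 0\} as in equation 14, and using the pseudoinverse formula for P_{\mathcal{C}_{1}} from Lemma 1. The function name and stopping rule are ours, not those of the released code.

```python
import numpy as np

def dys_solve(w, A, b, alpha, gamma, max_iter=2000, tol=1e-8):
    """Davis-Yin iteration (equations 28-29) for
    min_x  w^T x + (gamma/2) ||x||^2   s.t.  Ax = b, x >= 0."""
    A_pinv = np.linalg.pinv(A)
    proj_C1 = lambda v: v - A_pinv @ (A @ v - b)  # projection onto {x : Ax = b}
    proj_C2 = lambda v: np.maximum(v, 0.0)        # projection onto {x : x >= 0}

    z = np.zeros(A.shape[1])
    for _ in range(max_iter):
        x = proj_C2(z)
        z_new = z - x + proj_C1((2.0 - alpha * gamma) * x - z - alpha * w)
        if np.linalg.norm(z_new - z) <= tol:
            z = z_new
            break
        z = z_new
    return proj_C2(z)  # x^k -> x_Theta by Theorem 2
```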

Next, we state two auxiliary lemmas relating the Jacobian matrices to projections onto linear subspaces.

Lemma 8.

If \mathcal{C}_{1}\triangleq\{x:Ax=b\}, for full-rank A\in\mathbb{R}^{m\times n}, b\in\mathbb{R}^{m} with m<n, and \mathcal{H}_{1}\triangleq\operatorname{Null}(A), then

P𝒞1z=P1,for allzn.\frac{\partial P_{\mathcal{C}_{1}}}{\partial z}=P_{\mathcal{H}_{1}},\quad\mbox{for all}\ z\in\mathbb{R}^{n}. (33)
Proof.

Let A=UΣVA=U\Sigma V^{\top} denote the reduced SVD of AA, and note that as Am×nA\in\mathbb{R}^{m\times n} with m<nm<n we have Um×m,Σm×mU\in\mathbb{R}^{m\times m},\Sigma\in\mathbb{R}^{m\times m} and Vn×mV\in\mathbb{R}^{n\times m}. Differentiating the formula for P𝒞1P_{\mathcal{C}_{1}} given in Lemma 1 yields

P𝒞1z=IAA,\frac{\partial P_{\mathcal{C}_{1}}}{\partial z}=I-A^{\dagger}A, (34)

where AVΣ1UA^{\dagger}\triangleq V\Sigma^{-1}U^{\top}. Note

AA=(VΣ1U)(UΣV)=VV,A^{\dagger}A=\left(V\Sigma^{-1}U^{\top}\right)\left(U\Sigma V^{\top}\right)=VV^{\top}, (35)

which is the orthogonal projection onto \operatorname{Range}(V)=\operatorname{Range}(A^{\top}). It follows that I-A^{\dagger}A is the orthogonal projection onto \operatorname{Range}(A^{\top})^{\bot}=\operatorname{Null}(A). ∎
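As a quick numerical sanity check of Lemma 8 (a sketch with a randomly drawn A, which is full-rank with probability one), the matrix I-A^{\dagger}A should be symmetric, idempotent, and annihilated by A:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 7
A = rng.standard_normal((m, n))

P = np.eye(n) - np.linalg.pinv(A) @ A  # Jacobian of P_C1, per Lemma 8
assert np.allclose(P, P.T)             # symmetric
assert np.allclose(P @ P, P)           # idempotent
assert np.allclose(A @ P, 0.0)         # range(P) lies in Null(A)
```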

Lemma 9.

Define the multi-valued function

c(α)max(0,α)={1 if α>00 if α<0[0,1] if α=0c(\alpha)\triangleq\partial\max(0,\alpha)=\begin{cases}1&\text{ if }\alpha>0\\ 0&\text{ if }\alpha<0\\ [0,1]&\text{ if }\alpha=0\end{cases} (36)

and, for znz\in\mathbb{R}^{n}, define 2,zSpan(ei:zi>0)\mathcal{H}_{2,z}\triangleq\operatorname{Span}(e_{i}:z_{i}>0). Then

P𝒞2(z¯)=[ddzReLU(z)]z=z¯=diag(c(z¯)),\partial P_{\mathcal{C}_{2}}(\bar{z})=\left[\frac{\mathrm{d}}{\mathrm{d}z}\operatorname{ReLU}(z)\right]_{z=\overline{z}}=\mathrm{diag}(c(\overline{z})), (37)

and adopting the convention for choosing an element of P𝒞2(z¯)\partial P_{\mathcal{C}_{2}}(\bar{z}) stated in the main text:

dP𝒞2dz|z=z¯=diag(c~(z¯))=P2,z¯.\left.\frac{\mathrm{d}P_{\mathcal{C}_{2}}}{\mathrm{d}z}\right|_{z=\bar{z}}=\mathrm{diag}(\tilde{c}(\overline{z}))=P_{\mathcal{H}_{2,\overline{z}}}. (38)
Proof.

First, suppose z¯n\overline{z}\in\mathbb{R}^{n} satisfies z¯i0\overline{z}_{i}\neq 0, for all i[n]i\in[n], i.e. z¯\overline{z} is a smooth point of P𝒞2P_{\mathcal{C}_{2}}. Note

\frac{\partial[\operatorname{ReLU}(z)]_{i}}{\partial z_{j}}\bigg|_{z=\overline{z}}=1\ \mbox{if}\ i=j\ \mbox{and}\ \overline{z}_{i}>0\qquad\mbox{and}\qquad\frac{\partial[\operatorname{ReLU}(z)]_{i}}{\partial z_{j}}\bigg|_{z=\overline{z}}=0\ \mbox{if}\ i\neq j\ \mbox{or}\ \overline{z}_{i}<0. (39)

Thus, the Jacobian matrix is diagonal with a 1 in the (i,i)-th position whenever \overline{z}_{i}>0 and a 0 otherwise, i.e. \left.\frac{\mathrm{d}P_{\mathcal{C}_{2}}}{\mathrm{d}z}\right|_{z=\bar{z}}=\mathrm{diag}(c(\overline{z})). Now suppose \overline{z}_{i}=0 for exactly one i. For all z\in\mathbb{R}^{n} with z_{i}<0, the Jacobian \left.\frac{\mathrm{d}P_{\mathcal{C}_{2}}}{\mathrm{d}z}\right|_{z} is well-defined and has a 0 in the (i,i)-th position, while for z\in\mathbb{R}^{n} with z_{i}>0, the Jacobian \left.\frac{\mathrm{d}P_{\mathcal{C}_{2}}}{\mathrm{d}z}\right|_{z} is well-defined and has a 1 in the (i,i)-th position. Taking the convex hull of these limiting Jacobians yields the interval [0,1] in the (i,i)-th position, as claimed. The case where \overline{z}_{i}=0 for multiple i is similar.

Consequently, the product of \left.\frac{\mathrm{d}P_{\mathcal{C}_{2}}}{\mathrm{d}z}\right|_{z=\bar{z}}=\mathrm{diag}(\tilde{c}(\overline{z})) and any vector v\in\mathbb{R}^{n} equals v if and only if v\in\operatorname{Span}(e_{i}:\overline{z}_{i}>0). As this matrix is symmetric and idempotent with fixed-point set \mathcal{H}_{2,\overline{z}}, it is the orthogonal projection operator P_{\mathcal{H}_{2,\overline{z}}}. ∎
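Concretely, the convention above selects the element \tilde{c}(0)=0 of the interval [0,1], so the chosen Jacobian is the 0/1 diagonal matrix sketched below (variable names are ours):

```python
import numpy as np

def relu_jacobian(z_bar):
    """diag(c~(z_bar)): 1 where z_bar_i > 0, else 0 (including ties at 0)."""
    return np.diag((z_bar > 0).astype(float))

z_bar = np.array([1.5, -0.2, 0.0, 3.0])
J = relu_jacobian(z_bar)
assert np.allclose(J @ J, J)  # idempotent: the projection onto H_{2, z_bar}
```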

Theorem 4. Suppose AA is full-rank and 1Null(A)\mathcal{H}_{1}\triangleq\operatorname{Null}(A), 2,zSpan(ei:zi>0)\mathcal{H}_{2,z}\triangleq\operatorname{Span}\left(e_{i}:z_{i}>0\right). Then for all z^n\hat{z}\in\mathbb{R}^{n}

\left.\frac{\partial T_{\Theta}}{\partial z}\right|_{z=\hat{z}}=P_{\mathcal{H}_{1}^{\bot}}P_{\mathcal{H}_{2,\hat{z}}^{\bot}}+(1-\alpha\gamma)\cdot P_{\mathcal{H}_{1}}P_{\mathcal{H}_{2,\hat{z}}}. (40)
Proof.

Differentiating the expression for TΘT_{\Theta} in equation 16 with respect to zz yields

TΘz|z=z^\displaystyle\left.\frac{\partial T_{\Theta}}{\partial z}\right|_{z=\hat{z}} =IdP𝒞2dz|z=z^+dP𝒞1dz|z=y(z^)[(2αγ)dP𝒞2dz|z=z^I]\displaystyle=I-\left.\frac{\mathrm{d}P_{\mathcal{C}_{2}}}{\mathrm{d}z}\right|_{z=\hat{z}}+\left.\frac{\mathrm{d}P_{\mathcal{C}_{1}}}{\mathrm{d}z}\right|_{z=y(\hat{z})}\left[(2-\alpha\gamma)\cdot\left.\frac{\mathrm{d}P_{\mathcal{C}_{2}}}{\mathrm{d}z}\right|_{z=\hat{z}}-I\right] (41a)
=IP2,z^+P1((2αγ)P2,z^I),for allz^n,\displaystyle=I-P_{\mathcal{H}_{2,\hat{z}}}+P_{\mathcal{H}_{1}}\left((2-\alpha\gamma)P_{\mathcal{H}_{2,\hat{z}}}-I\right),\quad\mbox{for all}\ \hat{z}\in\mathbb{R}^{n}, (41b)

where, for notational brevity, we set y(z^)(2αγ)P𝒞2(z^)z^αwΘ(d)y(\hat{z})\triangleq\left(2-\alpha\gamma\right)\cdot P_{\mathcal{C}_{2}}(\hat{z})-\hat{z}-\alpha w_{\Theta}(d) in the first line and the second line follows from Lemmas 8 and 9. Repeatedly using the fact, for any subspace n\mathcal{H}\subset\mathbb{R}^{n}, P=IPP_{\mathcal{H}^{\bot}}=I-P_{\mathcal{H}}, the derivative TΘ/z\partial T_{\Theta}/\partial z can be further rewritten:

\left.\frac{\partial T_{\Theta}}{\partial z}\right|_{z=\hat{z}} =I-P_{\mathcal{H}_{2,\hat{z}}}+(2-\alpha\gamma)\cdot P_{\mathcal{H}_{1}}P_{\mathcal{H}_{2,\hat{z}}}-P_{\mathcal{H}_{1}} (42a)
=P_{\mathcal{H}_{2,\hat{z}}^{\bot}}+(2-\alpha\gamma)\cdot P_{\mathcal{H}_{1}}\left(I-P_{\mathcal{H}_{2,\hat{z}}^{\bot}}\right)-P_{\mathcal{H}_{1}} (42b)
=P_{\mathcal{H}_{2,\hat{z}}^{\bot}}+P_{\mathcal{H}_{1}}+(1-\alpha\gamma)\cdot P_{\mathcal{H}_{1}}-P_{\mathcal{H}_{1}}P_{\mathcal{H}_{2,\hat{z}}^{\bot}}-(1-\alpha\gamma)\cdot P_{\mathcal{H}_{1}}P_{\mathcal{H}_{2,\hat{z}}^{\bot}}-P_{\mathcal{H}_{1}} (42c)
=(I-P_{\mathcal{H}_{1}})P_{\mathcal{H}_{2,\hat{z}}^{\bot}}+(1-\alpha\gamma)\cdot P_{\mathcal{H}_{1}}\left(I-P_{\mathcal{H}_{2,\hat{z}}^{\bot}}\right) (42d)
=P_{\mathcal{H}_{1}^{\bot}}P_{\mathcal{H}_{2,\hat{z}}^{\bot}}+(1-\alpha\gamma)\cdot P_{\mathcal{H}_{1}}P_{\mathcal{H}_{2,\hat{z}}},\quad\mbox{for all}\ \hat{z}\in\mathbb{R}^{n}, (42e)

completing the proof. ∎
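The identity in equation 40 is easy to verify numerically by comparing the composed form (41b) with the factored form (42e); a small sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 7
A = rng.standard_normal((m, n))
z_hat = rng.standard_normal(n)
alpha, gamma = 0.1, 0.5

I = np.eye(n)
P1 = I - np.linalg.pinv(A) @ A           # P_{H_1},  H_1 = Null(A)
P2 = np.diag((z_hat > 0).astype(float))  # P_{H_{2, z_hat}}

lhs = I - P2 + P1 @ ((2 - alpha * gamma) * P2 - I)         # equation 41b
rhs = (I - P1) @ (I - P2) + (1 - alpha * gamma) * P1 @ P2  # equation 42e
assert np.allclose(lhs, rhs)
```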

We use the following lemma to prove Theorem 6.

Lemma 10.

If the LICQ condition holds at xΘ{x}_{\Theta}, then 12,zΘ={0}\mathcal{H}_{1}^{\bot}\cap\mathcal{H}_{2,{z}_{\Theta}}^{\bot}=\{0\}.

Proof.

We first rewrite \mathcal{H}_{1}^{\perp} and \mathcal{H}_{2,{z}_{\Theta}}^{\perp}. The subspace \mathcal{H}_{2,{z}_{\Theta}}^{\bot} is spanned by the standard basis vectors e_{i} for which [{z}_{\Theta}]_{i}\leq 0. By equation 17, [{x}_{\Theta}]_{i}=\max\{0,[{z}_{\Theta}]_{i}\}, and so i\in\mathcal{A}({x}_{\Theta}) if and only if [{z}_{\Theta}]_{i}\leq 0. It follows that

2,zΘSpan{ei:[zΘ]i0}=Span{ei:i𝒜(xΘ)}=Span{ei1,,ei},\mathcal{H}_{2,{z}_{\Theta}}^{\bot}\triangleq\operatorname{Span}\{e_{i}:[{z}_{\Theta}]_{i}\leq 0\}=\operatorname{Span}\{e_{i}:i\in\mathcal{A}({x}_{\Theta})\}=\operatorname{Span}\{e_{i_{1}},\ldots,e_{i_{\ell}}\}, (43)

where we enumerate 𝒜(xΘ)\mathcal{A}(x_{\Theta}) via 𝒜(xΘ)={i1,,i}\mathcal{A}(x_{\Theta})=\{i_{1},\ldots,i_{\ell}\}. On the other hand, 1=Range(A)=Span(a1,,am)\mathcal{H}_{1}^{\bot}=\operatorname{Range}(A^{\top})=\operatorname{Span}(a_{1},\ldots,a_{m}) where aia_{i}^{\top} denotes the ii-th row of AA.

Let v\in\mathcal{H}_{1}^{\bot}\cap\mathcal{H}_{2,{z}_{\Theta}}^{\bot} be given. Since v\in\mathcal{H}_{2,{z}_{\Theta}}^{\bot}, there are scalars \alpha_{1},\ldots,\alpha_{\ell} such that v=\alpha_{1}e_{i_{1}}+\cdots+\alpha_{\ell}e_{i_{\ell}}. Similarly, since v\in\mathcal{H}_{1}^{\bot}, there are scalars \beta_{1},\ldots,\beta_{m} such that v=\beta_{1}a_{1}+\cdots+\beta_{m}a_{m}. Hence

0=vv=(α1ei1++αei)(β1a1++βmam).\displaystyle 0=v-v=\big{(}\alpha_{1}e_{i_{1}}+\ldots+\alpha_{\ell}e_{i_{\ell}}\big{)}-\big{(}\beta_{1}a_{1}+\ldots+\beta_{m}a_{m}\big{)}. (44)

By the LICQ condition, {ei1,,ei}{a1,,am}\{e_{i_{1}},\ldots,e_{i_{\ell}}\}\cup\{a_{1},\ldots,a_{m}\} is a linearly independent set of vectors; hence α1==α=β1==βm=0\alpha_{1}=\ldots=\alpha_{\ell}=\beta_{1}=\ldots=\beta_{m}=0 and, thus, v=0v=0. Since vv was arbitrarily chosen in 12,zΘ\mathcal{H}_{1}^{\bot}\cap\mathcal{H}_{2,{z}_{\Theta}}^{\bot}, the result follows. ∎

Theorem 6. If the LICQ condition holds at xΘ{x}_{\Theta}, AA is full-rank and α(0,2/γ)\alpha\in(0,2/\gamma), then TΘ/zz=zΘ<1\|\partial T_{\Theta}/\partial z\|_{z={z}_{\Theta}}<1.

Proof.

By Lemma 10, the LICQ condition implies 12,zΘ={0}\mathcal{H}_{1}^{\bot}\cap\mathcal{H}_{2,{z}_{\Theta}}^{\bot}=\{0\}. This implies that either (i) the first principal angle τ\tau between these two subspaces is nonzero, and so the cosine of this angle is less than unity, i.e.

1>\cos(\tau)\triangleq\max_{u\in\mathcal{H}_{1}^{\bot}:\|u\|=1}\ \max_{v\in\mathcal{H}_{2,{z}_{\Theta}}^{\bot}:\|v\|=1}\langle u,v\rangle, (45)

or (ii) (at least) one of 1,2,zΘ\mathcal{H}_{1}^{\bot},\mathcal{H}_{2,{z}_{\Theta}}^{\bot} is the trivial vector space {0}\{0\}. In either case, let wnw\in\mathbb{R}^{n} be given. By Theorem 4, in case (ii)

[TΘzw]z=zΘ=P1P2,zΘw+(1αγ)P1P2,zΘw=(1αγ)P1P2,zΘw\left[\frac{\partial T_{\Theta}}{\partial z}w\right]_{z={z}_{\Theta}}=P_{\mathcal{H}_{1}^{\bot}}P_{\mathcal{H}_{2,{z}_{\Theta}}^{\bot}}w+(1-\alpha\gamma)\cdot P_{\mathcal{H}_{1}}P_{\mathcal{H}_{2,{z}_{\Theta}}}w=(1-\alpha\gamma)\cdot P_{\mathcal{H}_{1}}P_{\mathcal{H}_{2,{z}_{\Theta}}}w (46)

implying that

\left\|\frac{\partial T_{\Theta}}{\partial z}w\right\|_{z={z}_{\Theta}}=|1-\alpha\gamma|\left\|P_{\mathcal{H}_{1}}P_{\mathcal{H}_{2,{z}_{\Theta}}}w\right\|\leq|1-\alpha\gamma|\,\|w\|, (47)

where the inequality follows as projection operators are nonexpansive. In case (i), write w=w_{1}+w_{2}, where w_{1}\in\mathcal{H}_{2,{z}_{\Theta}} and w_{2}\in\mathcal{H}_{2,{z}_{\Theta}}^{\bot}. Appealing to Theorem 4 again,

[TΘzw]z=zΘ=P1P2,zΘw+(1αγ)P1P2,zΘw=P1w2+(1αγ)P1w1.\displaystyle\left[\frac{\partial T_{\Theta}}{\partial z}w\right]_{z={z}_{\Theta}}=P_{\mathcal{H}_{1}^{\bot}}P_{\mathcal{H}_{2,{z}_{\Theta}}^{\bot}}w+(1-\alpha\gamma)\cdot P_{\mathcal{H}_{1}}P_{\mathcal{H}_{2,{z}_{\Theta}}}w=P_{\mathcal{H}_{1}^{\bot}}w_{2}+(1-\alpha\gamma)\cdot P_{\mathcal{H}_{1}}w_{1}. (48)

Because P_{\mathcal{H}_{1}^{\bot}}w_{2} and P_{\mathcal{H}_{1}}w_{1} are orthogonal, Pythagoras’ theorem yields

TΘzwz=zΘ2=P1w22+(1αγ)2P1w12.\left\|\frac{\partial T_{\Theta}}{\partial z}w\right\|^{2}_{z={z}_{\Theta}}=\left\|P_{\mathcal{H}_{1}^{\bot}}w_{2}\right\|^{2}+(1-\alpha\gamma)^{2}\cdot\left\|P_{\mathcal{H}_{1}}w_{1}\right\|^{2}. (49)

Since w22,zΘw_{2}\in\mathcal{H}_{2,{z}_{\Theta}}^{\bot}, the angle condition (45) implies

\left\|P_{\mathcal{H}_{1}^{\bot}}w_{2}\right\|^{2}=\langle P_{\mathcal{H}_{1}^{\bot}}w_{2},P_{\mathcal{H}_{1}^{\bot}}w_{2}\rangle=\langle w_{2},P_{\mathcal{H}_{1}^{\bot}}P_{\mathcal{H}_{1}^{\bot}}w_{2}\rangle=\langle w_{2},P_{\mathcal{H}_{1}^{\bot}}w_{2}\rangle\leq\cos(\tau)\cdot\|w_{2}\|\left\|P_{\mathcal{H}_{1}^{\bot}}w_{2}\right\|\leq\cos(\tau)\cdot\|w_{2}\|^{2}, (50)

where the third equality holds since orthogonal projections are symmetric and idempotent, and the final inequality uses the nonexpansiveness of P_{\mathcal{H}_{1}^{\bot}}. Because P_{\mathcal{H}_{1}} is linear (so P_{\mathcal{H}_{1}}0=0) and nonexpansive,

P1w12=P1w1P102w102=w12.\left\|P_{\mathcal{H}_{1}}w_{1}\right\|^{2}=\left\|P_{\mathcal{H}_{1}}w_{1}-P_{\mathcal{H}_{1}}0\right\|^{2}\leq\|w_{1}-0\|^{2}=\|w_{1}\|^{2}. (51)

Combining (49), (50) and (51) reveals

TΘzwz=zΘ2\displaystyle{\left\|\frac{\partial T_{\Theta}}{\partial z}w\right\|^{2}_{z={z}_{\Theta}}} cos(τ)w22+(1αγ)2w12\displaystyle\leq\cos(\tau)\cdot\|w_{2}\|^{2}+(1-\alpha\gamma)^{2}\|w_{1}\|^{2} (52a)
max{cos(τ),(1αγ)2}(w12+w22)\displaystyle\leq\max\{\cos(\tau),(1-\alpha\gamma)^{2}\}\cdot\left(\|w_{1}\|^{2}+\|w_{2}\|^{2}\right) (52b)
=max{cos(τ),(1αγ)2}w2,\displaystyle=\max\{\cos(\tau),(1-\alpha\gamma)^{2}\}\cdot\|w\|^{2}, (52c)

noting the final equality holds since w1w_{1} and w2w_{2} are orthogonal. Because (52) holds for arbitrarily chosen wnw\in\mathbb{R}^{n},

TΘzz=zΘsup{TΘzwz=zΘ:w=1}max{cos(τ),(1αγ)2}<1,{\left\|\frac{\partial T_{\Theta}}{\partial z}\right\|_{z={z}_{\Theta}}}\triangleq\sup\left\{{\left\|\frac{\partial T_{\Theta}}{\partial z}w\right\|_{z=z_{\Theta}}}:\|w\|=1\right\}\leq\sqrt{\max\{\cos(\tau),(1-\alpha\gamma)^{2}\}}<1, (53)

where the final inequality holds by (45) and the fact α(0,2/γ)\alpha\in(0,2/\gamma) implies 1αγ(1,1)1-\alpha\gamma\in(-1,1), as desired. ∎
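To see the contraction bound in practice, one can form the Jacobian of equation 40 at a point and compute its spectral norm. A sketch, assuming (plausibly, since a generic Gaussian A makes the relevant vectors linearly independent) that LICQ holds for the instance below:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 3, 7
A = rng.standard_normal((m, n))
z_bar = np.array([1.0, 2.0, -1.0, -1.0, 3.0, 0.5, -2.0])  # three active coordinates
alpha, gamma = 1.0, 0.5  # alpha in (0, 2/gamma)

I = np.eye(n)
P1 = I - np.linalg.pinv(A) @ A           # P_{H_1}
P2 = np.diag((z_bar > 0).astype(float))  # P_{H_{2, z_bar}}
J = (I - P1) @ (I - P2) + (1 - alpha * gamma) * P1 @ P2  # equation 40

print(np.linalg.norm(J, 2))  # strictly below 1 when LICQ holds (Theorem 6)
```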

Corollary 7. If TΘT_{\Theta} is continuously differentiable with respect to Θ\Theta at zΘz_{\Theta}, the assumptions in Theorem 6 hold and (TΘ/Θ)(TΘ/Θ)(\partial T_{\Theta}/\partial\Theta)^{\top}(\partial T_{\Theta}/\partial\Theta) has condition number sufficiently small, then pΘ(xΘ)p_{\Theta}(x_{\Theta}) is a descent direction for \ell with respect to Θ\Theta.

Proof.

From the proof of Theorem 6 we see that T_{\Theta} is contractive with constant \Gamma=\sqrt{\max\{\cos(\tau),(1-\alpha\gamma)^{2}\}}, and so the main theorem of Fung et al. (2022) applies, guaranteeing p_{\Theta} is a descent direction, as long as the condition number of (\partial T_{\Theta}/\partial\Theta)^{\top}(\partial T_{\Theta}/\partial\Theta) is less than 1/\Gamma. ∎
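For concreteness, the JFB training step that this corollary justifies amounts to iterating T_{\Theta} to (near) convergence without tracking gradients and then backpropagating through a single application of T_{\Theta}; a minimal PyTorch-style sketch (names are ours):

```python
import torch

def jfb_fixed_point(T, z0, max_iter=100, tol=1e-6):
    """Run z^{k+1} = T(z^k) gradient-free, then apply T once differentiably."""
    z = z0
    with torch.no_grad():  # the iteration history is not stored
        for _ in range(max_iter):
            z_new = T(z)
            if torch.norm(z_new - z) < tol:
                z = z_new
                break
            z = z_new
    # Gradients w.r.t. Theta flow only through this final application,
    # i.e. the backward pass replaces the implicit Jacobian with the identity.
    return T(z.detach())
```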

Remark 11.

Similar guarantees, albeit with less restrictive assumptions on TΘ/Θ\partial T_{\Theta}/\partial\Theta, can be deduced from the results of the recent work (Bolte et al., 2024).

Appendix B Derivation for Canonical Form of Knapsack Prediction Problem

For completeness, we explain how to transform the kk-knapsack prediction problem into the canonical form equation 8, and how to derive the standardized representation of the constraint polytope 𝒞\mathcal{C}. Recall that the kk-knapsack prediction problem, as originally stated, is

x^{\star}=\operatorname*{arg\,max}_{x\in\mathcal{X}}w^{\top}x\text{ where }\mathcal{X}=\{x\in\{0,1\}^{\ell}:\ Sx\leq c\}\text{ and }S=\begin{bmatrix}s_{1}&\cdots&s_{\ell}\end{bmatrix}\in\mathbb{R}^{k\times\ell}. (54)

We introduce slack variables y1,,yky_{1},\ldots,y_{k} so that the inequality constraint SxcSx\leq c becomes

Sx+c0\displaystyle-Sx+c\geq 0 Sx+c=y and y0\displaystyle\Longrightarrow-Sx+c=y\text{ and }y\geq 0
[SIk][xy]=c\displaystyle\Longrightarrow\begin{bmatrix}S&I_{k}\end{bmatrix}\begin{bmatrix}x\\ y\end{bmatrix}=c

We relax the binary constraint xi{0,1}x_{i}\in\{0,1\} to 0xi10\leq x_{i}\leq 1. We add additional slack variables z1,,zz_{1},\ldots,z_{\ell} to account for the upper bound:

1xi01xi=zi and zi0[I×0×kI×][xyz]=𝟏\displaystyle 1-x_{i}\geq 0\Longrightarrow 1-x_{i}=z_{i}\text{ and }z_{i}\geq 0\Longrightarrow\begin{bmatrix}I_{\ell\times\ell}&0_{\ell\times k}&I_{\ell\times\ell}\end{bmatrix}\begin{bmatrix}x\\ y\\ z\end{bmatrix}=\mathbf{1} (55)

Putting this together, define

A=[SIk×k0k×I×0×kI×](k+)×(2+k) and b=[c𝟏]k+A=\begin{bmatrix}-S&-I_{k\times k}&0_{k\times\ell}\\ I_{\ell\times\ell}&0_{\ell\times k}&I_{\ell\times\ell}\end{bmatrix}\in\mathbb{R}^{(k+\ell)\times(2\ell+k)}\text{ and }b=\begin{bmatrix}-c\\ \mathbf{1}_{\ell}\end{bmatrix}\in\mathbb{R}^{k+\ell} (56)

Finally, redefine x=\begin{bmatrix}x&y&z\end{bmatrix}^{\top} and w=\begin{bmatrix}-w&\mathbf{0}_{k}&\mathbf{0}_{\ell}\end{bmatrix}^{\top} (where the -w switches the \arg\max into an \arg\min) to obtain:

x^{\star}=\operatorname*{arg\,min}_{x\in\mathrm{Conv}(\mathcal{X})}w^{\top}x+\frac{\gamma}{2}\|x\|_{2}^{2}\text{ where }\mathrm{Conv}(\mathcal{X})=\{x:Ax=b\text{ and }x\geq 0\}. (57)
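A short sketch assembling A and b from S and c as in equation 56 (the function name is ours):

```python
import numpy as np

def knapsack_canonical_form(S, c):
    """Build A in R^{(k+l) x (2l+k)} and b in R^{k+l} per equation 56."""
    k, l = S.shape
    A = np.block([
        [-S,              -np.eye(k), np.zeros((k, l))],
        [np.eye(l), np.zeros((l, k)),        np.eye(l)],
    ])
    b = np.concatenate([-c, np.ones(l)])
    return A, b
```

For instance, with S of shape (2, 5) this returns A of shape (7, 12) and b of length 7, matching the dimensions stated in equation 56.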

Appendix C Experimental Details

C.1 Additional Data Details for PyEPO problems

We use PyEPO (Tang & Khalil, 2022), specifically the functions data.shortestpath.genData and data.knapsack.genData to generate the training data. Both functions sample Bn×5B\in\mathbb{R}^{n\times 5} with Bij=+1B_{ij}=+1 with probability 0.50.5 and 1-1 with probability 0.50.5 and then construct the cost vector w(d)w(d) as

[w(d)]_{i}=\left[\frac{1}{3.5^{\mathrm{deg}}}\left(\frac{1}{\sqrt{5}}(Bd)_{i}+3\right)^{\mathrm{deg}}+1\right]\cdot\epsilon_{i}

where \mathrm{deg}=4 and each \epsilon_{i} is sampled uniformly from the interval [0.5,1.5]. Note that w(d) is, by construction, non-negative.
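The following sketch reproduces this generation process (ours, not the PyEPO source; B and d are assumed to be a NumPy matrix and context vector):

```python
import numpy as np

def gen_costs(d, B, deg=4, rng=None):
    """Sample a cost vector w(d) per the formula above; d in R^5, B in R^{n x 5}."""
    if rng is None:
        rng = np.random.default_rng()
    base = (1 / 3.5**deg) * ((B @ d) / np.sqrt(5) + 3) ** deg + 1
    eps = rng.uniform(0.5, 1.5, size=B.shape[0])  # multiplicative noise
    return base * eps
```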

C.2 Additional Model Details for PyEPO problems

For PertOpt-net and BBOpt-net we use the default hyperparameter values provided by PyEPO: \lambda=5 for BBOpt-net; the number of samples equal to 3, \epsilon=1, and Gumbel noise for PertOpt-net. See Pogančić et al. (2019) and Berthet et al. (2020), respectively, for descriptions of these hyperparameters. We use the defaults as they were selected via hyperparameter tuning in Tang & Khalil (2022). As DYS-net and CVX-net solve a regularized form of the underlying LP equation 7, the regularization parameter \gamma needs to be set. After experimenting with different values, we select \gamma=5\times 10^{-4} for DYS-net and \gamma=5\times 10^{-1} for CVX-net.

C.3 Additional Training Details for Knapsack Problem

For all models we use an initial learning rate of 10^{-3} and a scheduler that reduces the learning rate whenever the validation loss plateaus. We also use weight decay with a parameter of 5\times 10^{-4}.
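In PyTorch terms, this setup corresponds roughly to the sketch below; the choice of optimizer (Adam) and the scheduler defaults are illustrative assumptions, not taken from the training scripts.

```python
import torch

model = torch.nn.Linear(5, 40)  # placeholder for any of the prediction networks
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min')

# At the end of each epoch, step the scheduler on the validation loss:
# scheduler.step(val_loss)
```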

C.4 Additional Model Details for Large-Scale Shortest Path Problem

Our implementation of PertOpt-net uses a PyTorch implementation (see github.com/tuero/perturbations-differential-pytorch) of the original TensorFlow code (see github.com/google-research/google-research/tree/master/perturbations) associated with Berthet et al. (2020). We experimented with various hyperparameter settings on 5-by-5 grids and found that setting the number of samples to 3, setting the temperature (i.e. \varepsilon) to 1, and using Gumbel noise worked best, so we use these values for all other shortest path experiments. Our implementation of BB-net uses the blackbox-backprop package (see github.com/martius-lab/blackbox-backprop) associated with Pogančić et al. (2019). We found \lambda=100 to work best for 5-by-5 grids, so we use this value for all other shortest path experiments. Again, we select \gamma=5\times 10^{-4} for DYS-net and \gamma=5\times 10^{-1} for CVX-net.

C.5 Additional Training Details for Large-Scale Shortest Path Problem

We use the MSE loss to train DYS-net, PertOpt-net, and CVX-net. We tried using the MSE loss with BB-net but this did not work well, so we used the Hamming (also known as 0–1) loss, as done in Pogančić et al. (2019).

To train DYS-net and CVX-net, we use an initial learning rate of 10^{-2} and a scheduler that reduces the learning rate whenever the test loss plateaus; we found this to perform best for these two models. For PertOpt-net, we found that a fixed learning rate of 10^{-2} performed best. For BB-net, we performed a logarithmic grid search on the learning rate between 10^{-1} and 10^{-4} and found that 10^{-3} performed best; we also tried adaptive learning rate schemes, such as reducing the learning rate on plateau, but did not obtain improved performance.

C.6 Additional Model Details for Warcraft Experiment

For PertOpt-net and BBOpt-net we again use the default hyperparameter values provided by PyEPO: \lambda=10 for BBOpt-net; the number of samples equal to 3, \epsilon=1, and Gumbel noise for PertOpt-net. We set \gamma=0.5 for both CVX-net and DYS-net.

C.7 Hardware

All networks were trained on a machine with an AMD Threadripper Pro 3955WX CPU (16 cores, 3.90 GHz, 64 MB cache, PCIe 4.0) and an NVIDIA RTX A6000 GPU.

Appendix D Additional Experimental Results

In Figure 6, we show the test loss and training time per epoch of DYS-net, CVX-net, PertOpt-net, and BB-net for 10-by-10, 20-by-20, and 30-by-30 grids. In terms of MSE loss, CVX-net and DYS-net achieve comparable performance. In the second row of Figure 6, we observe the benefits of combining three-operator splitting with JFB (Fung et al., 2022); in particular, DYS-net trains much faster. Figure 7 shows some randomly selected outputs of the architectures once fully trained.

Figure 6: Comparison of DYS-net, CVX-net (cvxpylayers; Agrawal et al., 2019a), PertOpt-net (Berthet et al., 2020), and BB-net (Pogančić et al., 2019) for three different grid sizes: 10\times 10 (first column), 20\times 20 (second column), and 30\times 30 (third column). The first row shows the MSE loss on the testing dataset vs. epochs. The second row shows the training time vs. epochs.
Figure 7: True paths (column 1), paths predicted by DYS-net (column 2), CVX-net (column 3), and PertOpt-net (column 4). Samples are taken from different grid sizes: 10-by-10 (row 1), 20-by-20 (row 2), and 30-by-30 (row 3).