
Value-Function-based Sequential Minimization for Bi-level Optimization

Risheng Liu, Xuan Liu, Shangzhi Zeng, Jin Zhang, and Yixuan Zhang. R. Liu and X. Liu are with the DUT-RU International School of Information Science & Engineering, Dalian University of Technology, and the Key Laboratory for Ubiquitous Network and Service Software of Liaoning Province, Dalian, Liaoning, China. R. Liu is also with the Pazhou Lab, Guangzhou, Guangdong, China. E-mail: rsliu@dlut.edu.cn, liuxuan_16@126.com. S. Zeng is with the Department of Mathematics and Statistics, University of Victoria, Victoria, B.C., Canada. E-mail: zengshangzhi@uvic.ca. J. Zhang is with the Department of Mathematics, SUSTech International Center for Mathematics, Southern University of Science and Technology, National Center for Applied Mathematics Shenzhen, and Peng Cheng Laboratory, Shenzhen, Guangdong, China. (Corresponding author. E-mail: zhangj9@sustech.edu.cn.) Y. Zhang is with the Department of Applied Mathematics, The Hong Kong Polytechnic University, Hong Kong, China. E-mail: yi-xuan.zhang@connect.polyu.hk. Manuscript received April 19, 2005; revised August 26, 2015.
Abstract

Gradient-based Bi-Level Optimization (BLO) methods have been widely applied to modern learning tasks. However, most existing strategies are theoretically designed under restrictive assumptions (e.g., convexity of the lower-level sub-problem) and are computationally impractical for high-dimensional tasks. Moreover, almost no gradient-based methods can solve BLO in challenging scenarios such as BLO with functional constraints and pessimistic BLO. In this work, by reformulating BLO into approximated single-level problems, we provide a new algorithm, named Bi-level Value-Function-based Sequential Minimization (BVFSM), to address the above issues. Specifically, BVFSM constructs a series of value-function-based approximations, and thus avoids the repeated calculations of recurrent gradients and Hessian inverses required by existing approaches, which are especially time-consuming for high-dimensional tasks. We also extend BVFSM to address BLO with additional functional constraints. More importantly, BVFSM can be applied to the challenging pessimistic BLO, which has never been properly solved before. In theory, we prove the asymptotic convergence of BVFSM on these types of BLO, discarding the restrictive lower-level convexity assumption. To the best of our knowledge, this is the first gradient-based algorithm that can solve different kinds of BLO (e.g., optimistic, pessimistic, and with constraints) with solid convergence guarantees. Extensive experiments verify our theoretical investigations and demonstrate our superiority on various real-world applications.

Index Terms:
Bi-level optimization, gradient-based method, value-function, sequential minimization, hyper-parameter optimization.

1 Introduction

Currently, a number of important machine learning and deep learning tasks can be captured by hierarchical models, such as hyper-parameter optimization [1, 2, 3, 4], neural architecture search [5, 6, 7], meta learning [8, 9, 10], Generative Adversarial Networks (GAN) [11, 12], reinforcement learning [13], image processing [14, 15, 16, 17], and so on. In general, these hierarchical models can be formulated as the following Bi-Level Optimization (BLO) problem [18, 19, 20]:

$``\min_{\mathbf{x}\in\mathcal{X}}"\ F(\mathbf{x},\mathbf{y}),\quad\mathrm{s.t.}\ \mathbf{y}\in\mathcal{S}(\mathbf{x}):=\mathop{\arg\min}_{\mathbf{y}}f(\mathbf{x},\mathbf{y}),$ (1)

where $\mathbf{x}\in\mathcal{X}$ is the Upper-Level (UL) variable, $\mathbf{y}\in\mathbb{R}^{n}$ is the Lower-Level (LL) variable, the UL objective $F:\mathcal{X}\times\mathbb{R}^{n}\rightarrow\mathbb{R}$ and the LL objective $f:\mathbb{R}^{m}\times\mathbb{R}^{n}\rightarrow\mathbb{R}$ are continuously differentiable and jointly continuous, and the UL constraint $\mathcal{X}\subset\mathbb{R}^{m}$ is a compact set. Nevertheless, the model in Eq. (1) cannot be solved directly. Some existing works only consider the case where the LL solution set $\mathcal{S}(\mathbf{x})$ is a singleton. However, since this may not hold and $\mathbf{y}\in\mathcal{S}(\mathbf{x})$ may not be unique, Eq. (1) is not a mathematically rigorous BLO model, and thus we put quotation marks around "min" to denote the slightly imprecise definition of the UL objective [18, 21].

Strictly speaking, one usually focuses on an extreme case of the BLO model, namely the optimistic BLO [18]:

$\min_{\mathbf{x}\in\mathcal{X}}\min_{\mathbf{y}\in\mathbb{R}^{n}}\ F(\mathbf{x},\mathbf{y}),\quad\mathrm{s.t.}\ \mathbf{y}\in\mathcal{S}(\mathbf{x}).$ (2)

As the above expression shows, in optimistic BLO $\mathbf{x}$ and $\mathbf{y}$ are in a cooperative relationship, jointly minimizing $F(\mathbf{x},\mathbf{y})$. It therefore applies to a variety of learning and vision tasks, such as hyper-parameter optimization and meta learning. We sometimes also need to study BLO problems with inequality constraints on the UL or LL to capture constraints arising in real tasks. Another situation is the pessimistic BLO, which replaces $\min_{\mathbf{y}\in\mathbb{R}^{n}}$ in Eq. (2) with $\max_{\mathbf{y}\in\mathbb{R}^{n}}$ [18]. In the pessimistic case, $\mathbf{x}$ and $\mathbf{y}$ are in an adversarial relationship, and hence pessimistic BLO applies to adversarial learning and GANs.

BLO is challenging to solve because of its hierarchical structure: we must solve for $\mathcal{S}(\mathbf{x})$ under a fixed $\mathbf{x}$, and then select an appropriate $\mathbf{y}$ from $\mathcal{S}(\mathbf{x})$ to optimize the UL objective $F(\mathbf{x},\mathbf{y})$, making $\mathbf{x}$ and $\mathbf{y}$ intricately dependent on each other, especially when $\mathcal{S}(\mathbf{x})$ is not a singleton [22]. In classical optimization, the KKT conditions are used to characterize the problem, but this approach does not scale to large machine learning tasks because it introduces too many multipliers [23, 24]. In the machine learning community, the mainstream methods are gradient-based, divided into Explicit Gradient-Based Methods (EGBMs) [2, 8, 25, 26, 5] and Implicit Gradient-Based Methods (IGBMs) [27, 9, 28], according to how they calculate the gradient needed for gradient descent: EGBMs use unrolled differentiation, while IGBMs use the implicit function theorem. Both usually handle only the case where the LL solution set $\mathcal{S}(\mathbf{x})$ is a singleton, a quite restrictive condition in real applications. To address this, Liu et al. [29, 30] proposed Bi-level Descent Aggregation (BDA), a new EGBM that removes this assumption and solves the model from the perspective of optimistic BLO.

Nevertheless, a bottleneck remains: the LL problems in real learning tasks are usually too complex for EGBMs and IGBMs. In theory, all EGBMs and IGBMs require convexity of the LL problem, i.e., Lower-Level Convexity (LLC for short), a strong condition violated in many complicated real-world tasks. For example, when the chosen network has more than one layer, LLC fails, so the convergence of these methods cannot be guaranteed. In computation, EGBMs based on unrolled differentiation incur large time and space complexity, while IGBMs must approximate a matrix inverse, which is also costly, especially when the LL variable $\mathbf{y}$ is high-dimensional, since the procedure then generates correspondingly large matrices and vectors. Furthermore, it has rarely been discussed how to handle machine learning tasks by solving BLO with functional constraints on the UL and LL, or by solving pessimistic BLO. These problems are worth studying: pessimistic BLO captures min-max bi-level structures, making it suitable for GANs and similar models, and constrained formulations represent learning tasks more accurately. Unfortunately, existing methods, including EGBMs and IGBMs, cannot handle these problems.

To address the above limitations of existing methods, in this work we propose a novel framework named Bi-level Value-Function-based Sequential Minimization (BVFSM). (A preliminary version of this work has been published in [1].) Specifically, we first reformulate BLO into a simple bi-level optimization problem via the value-function [31, 32] of the UL objective, and then transform it into a single-level optimization problem with an inequality constraint through the value-function of the LL objective. By smoothing with regularization and absorbing the constraint into the objective via an auxiliary penalty or barrier function, the original problem is eventually transformed into a sequence of unconstrained differentiable single-level problems, which can be solved by gradient descent. Thanks to this re-characterization via the LL value-function, BVFSM incurs the lowest computational cost among these approaches, and simultaneously applies under more relaxed conditions.

Specifically, BVFSM avoids solving an unrolled dynamic system by recurrent gradients or approximating the Hessian inverse in each iteration, as existing methods do. Instead, we only need first-order gradients in each iteration, which reduces the computational complexity with respect to the LL problem size by an order of magnitude compared to existing gradient-based BLO methods, and thus requires less time and space than EGBMs and IGBMs, especially for complex high-dimensional BLO. Besides, BVFSM maintains the same level of complexity when applying BLO to networks, making it possible to use BLO within existing networks and significantly expanding its range of applications. We illustrate the efficiency of BVFSM over existing methods through complexity analysis in theory and various experimental results in practice. In addition, differently from previous gradient-based methods, we analyze asymptotic convergence from the perspective of sequential minimization, and prove that the solutions of the sequence of approximate sub-problems converge to the true solution of the original BLO without the previously required restrictive LLC assumption. BVFSM also extends to more complicated and challenging scenarios, namely BLO with functional constraints and pessimistic BLO. We regard pessimistic BLO as a new viewpoint for learning tasks, which, to the best of our knowledge, has not been solved by gradient-based methods before. In particular, we use a GAN experiment as an example to illustrate the application of our method to pessimistic BLO. We summarize our contributions as follows.

  • By reformulating the original BLO as an approximated single-level problem based on the value-function, BVFSM breaks the traditional mindset of gradient-based methods and establishes a completely new sequential minimization algorithmic framework, which not only addresses optimistic BLO but also handles BLO in other more challenging scenarios (i.e., with functional constraints and pessimistic), which have seldom been discussed.

  • With the help of the value-function-based reformulation, BVFSM significantly reduces the computational complexity, by an order of magnitude compared to existing gradient-based BLO methods. It also avoids the repeated calculation of recurrent gradients and Hessian inverses, the core bottleneck of existing approaches for high-dimensional BLO problems. This superiority allows BVFSM to be applied effectively to large-scale networks and frontier tasks.

  • We rigorously analyze the asymptotic convergence behavior of BVFSM on all types of BLO mentioned above. Our theoretical investigation successfully removes the restrictive LLC condition, which is required in most existing works but too stringent to hold in real-world applications.

  • We conduct extensive experiments to verify our theoretical findings and demonstrate the superiority of BVFSM on various learning tasks. In particular, by formulating and solving GANs with BVFSM, we show the application potential of our solution strategy on pessimistic BLO for complex learning problems.

2 Related Works

As aforementioned, BLO is challenging to solve due to the nested structure between the UL and LL. Early methods can only handle models with few hyper-parameters; for example, the standard approaches are random search [33] via random sampling, and Bayesian optimization [34]. However, in real learning tasks the dimension of hyper-parameters is very large, beyond the reach of early methods, so gradient-based methods were proposed. Here we first present a unified form of gradient-based methods, and then discuss existing methods for further comparison with our proposed method.

Existing gradient-based methods mainly focus on optimistic BLO, so we use the optimistic scenario to illustrate our algorithmic framework clearly; in Section 3.4, we will discuss how to use our method to solve pessimistic BLO. For optimistic BLO, Eq. (2) shows that the UL variable $\mathbf{x}$ and LL variable $\mathbf{y}$ affect each other in a nested relationship. To address this issue, one can transform it into the following form, where $\varphi(\mathbf{x})$ is the value-function of the sub-problem,

$\min_{\mathbf{x}\in\mathcal{X}}\ \varphi(\mathbf{x}),\quad\varphi(\mathbf{x}):=\min_{\mathbf{y}}\big\{F(\mathbf{x},\mathbf{y}):\mathbf{y}\in\mathcal{S}(\mathbf{x})\big\}.$ (3)

For a fixed $\mathbf{x}$, the sub-problem defining $\varphi(\mathbf{x})$ is an inner simple BLO task, since it involves only the variable $\mathbf{y}$, with $\mathbf{x}$ as a parameter. We then hope to minimize $\varphi(\mathbf{x})$ by gradient descent. However, as a value-function, $\varphi(\mathbf{x})$ is non-smooth, non-convex, possibly even discontinuous, and thus ill-conditioned, so we use a smooth function to approximate $\varphi(\mathbf{x})$ and its gradient $\frac{\partial\varphi(\mathbf{x})}{\partial\mathbf{x}}$. Existing methods can be classified into two categories according to how they calculate $\frac{\partial\varphi(\mathbf{x})}{\partial\mathbf{x}}$ [20]: Explicit Gradient-Based Methods (EGBMs), which derive the gradient by Automatic Differentiation (AD), and Implicit Gradient-Based Methods (IGBMs), which apply the implicit function theorem to the optimality conditions of the LL problem.

Note that both EGBMs and IGBMs (except BDA) require $\mathbf{y}\in\mathcal{S}(\mathbf{x})$ to be unique, denoted $\mathbf{y}^{*}(\mathbf{x})$; BDA instead integrates information from both the UL and LL sub-problems and obtains $\mathbf{y}_{T}(\mathbf{x})$ by iterations approaching the appropriate $\mathbf{y}^{*}(\mathbf{x})$. Hence $\varphi(\mathbf{x})=F(\mathbf{x},\mathbf{y}^{*}(\mathbf{x}))$, and by the chain rule the approximated $\frac{\partial\varphi(\mathbf{x})}{\partial\mathbf{x}}$ splits into direct and indirect gradients with respect to $\mathbf{x}$,

$\frac{\partial\varphi(\mathbf{x})}{\partial\mathbf{x}}=\frac{\partial F(\mathbf{x},\mathbf{y}^{*}(\mathbf{x}))}{\partial\mathbf{x}}+G(\mathbf{x}),$ (4)

where $\frac{\partial F(\mathbf{x},\mathbf{y})}{\partial\mathbf{x}}$ is the direct gradient and $G(\mathbf{x})=\left(\frac{\partial\mathbf{y}^{*}(\mathbf{x})}{\partial\mathbf{x}}\right)^{\top}\frac{\partial F(\mathbf{x},\mathbf{y}^{*})}{\partial\mathbf{y}^{*}}$ is the indirect gradient. We then need to compute $G(\mathbf{x})$, in other words, the value of $\frac{\partial\mathbf{y}^{*}(\mathbf{x})}{\partial\mathbf{x}}$.

Explicit Gradient-Based Methods (EGBMs). Maclaurin et al. [25] and Franceschi et al. [2, 8] first proposed Reverse Hyper-Gradient (RHG) and Forward Hyper-Gradient (FHG), respectively, which implement a dynamic system under the LLC assumption. Given an initial point $\mathbf{y}_{0}$, denote the iteration approaching $\mathbf{y}^{*}(\mathbf{x})$ as $\mathbf{y}_{t+1}(\mathbf{x})=\Phi_{t}(\mathbf{x},\mathbf{y}_{t}(\mathbf{x})),\ t=0,1,\cdots,T-1$, where $\Phi_{t}$ is a smooth mapping performed to compute $\mathbf{y}_{T}(\mathbf{x})$ and $T$ is the number of iterations. For example, if the process is gradient descent, $\Phi_{t}(\mathbf{x},\mathbf{y}_{t}(\mathbf{x}))=\mathbf{y}_{t}(\mathbf{x})-s_{t}\frac{\partial f(\mathbf{x},\mathbf{y}_{t}(\mathbf{x}))}{\partial\mathbf{y}_{t}}$, where $s_{t}>0$ is the corresponding step size. Then $\varphi(\mathbf{x})$ in Eq. (3) can be approximated by $\varphi(\mathbf{x})\approx\varphi_{T}(\mathbf{x})=F(\mathbf{x},\mathbf{y}_{T}(\mathbf{x}))$. As $T$ increases, $\varphi_{T}(\mathbf{x})$ generally approaches $\varphi(\mathbf{x})$, yielding a sequence of unconstrained minimization problems; thus gradient-based methods can be regarded as a kind of sequential-minimization scheme [35]. From the chain rule, $\frac{\partial\mathbf{y}_{t}(\mathbf{x})}{\partial\mathbf{x}}=\left(\frac{\partial\Phi_{t-1}(\mathbf{x},\mathbf{y}_{t-1})}{\partial\mathbf{y}_{t-1}}\right)^{\top}\frac{\partial\mathbf{y}_{t-1}(\mathbf{x})}{\partial\mathbf{x}}+\frac{\partial\Phi_{t-1}(\mathbf{x},\mathbf{y}_{t-1})}{\partial\mathbf{x}}$, and $\frac{\partial\mathbf{y}_{T}(\mathbf{x})}{\partial\mathbf{x}}$ is obtained from this unrolled procedure.
However, FHG and RHG must differentiate through the entire trajectory by AD, composing the first-order conditions of the LL problem, so the time and space complexity is very high. To mitigate this, Shaban et al. [26] proposed Truncated Reverse Hyper-Gradient (TRHG), which truncates the iteration and only stores the last $I$ iterations, reducing the computational load; however, it additionally requires $f$ to be strongly convex, and the truncation length is hard to determine. Liu et al. [5] instead approximate the gradient by a finite difference of vectors, but the accuracy of this difference is not guaranteed and the method has no theoretical guarantee. On the theoretical side, toward more relaxed conditions, Liu et al. [29, 30] proposed Bi-level Descent Aggregation (BDA), which removes the assumption that the LL solution set is a singleton (a simplification of real-world problems) by aggregating information from both the UL and LL problems during the iterations. However, the obstacles of LLC and computational cost remain.

Implicit Gradient-Based Methods (IGBMs). IGBMs, or implicit differentiation [27, 9, 28], obtain $\frac{\partial\mathbf{y}^{*}(\mathbf{x})}{\partial\mathbf{x}}$ under the LLC assumption. If $\frac{\partial^{2}f(\mathbf{x},\mathbf{y}^{*}(\mathbf{x}))}{\partial\mathbf{y}\partial\mathbf{y}}$ is additionally assumed to be invertible, then applying the implicit function theorem to the optimality condition $\frac{\partial f(\mathbf{x},\mathbf{y}^{*}(\mathbf{x}))}{\partial\mathbf{y}}=0$ replaces the LL problem with an implicit equation, and $\frac{\partial\mathbf{y}^{*}(\mathbf{x})}{\partial\mathbf{x}}=-\left(\frac{\partial^{2}f(\mathbf{x},\mathbf{y}^{*}(\mathbf{x}))}{\partial\mathbf{y}\partial\mathbf{y}}\right)^{-1}\frac{\partial^{2}f(\mathbf{x},\mathbf{y}^{*}(\mathbf{x}))}{\partial\mathbf{y}\partial\mathbf{x}}$. Unlike EGBMs, which rely on the first-order condition along the entire trajectory, IGBMs invoke it only once, decoupling the computational burden from the LL solution trajectory; but this leads to repeatedly computing the inverse of a Hessian matrix, which is still a heavy burden. To avoid direct inversion, the Conjugate Gradient (CG) method [27, 9] turns it into solving a linear system, and the Neumann method [28] uses a Neumann series to approximate the Hessian inverse. These reduce the computational requirements, but they remain large, since the matrix inversion becomes repeated Hessian-vector products. Additionally, the accuracy of solving a linear system depends heavily on its condition number [36], and ill-conditioning may cause numerical instability. In [9], a large quadratic term is added to the LL objective to eliminate the ill-conditioning, but this may change the solution set greatly.
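The implicit formula can be checked on the same illustrative 1-D toy problem ($f(x,y)=\frac{1}{2}(y-x)^2$, $F(x,y)=\frac{1}{2}(y-1)^2$), where the Hessian collapses to a scalar; this is a sketch of the implicit-function-theorem computation, not any particular IGBM implementation.

```python
# Implicit hypergradient (IGBM-style) on the 1-D toy problem:
#   f(x, y) = 0.5*(y - x)^2, so f_yy = 1, f_yx = -1, and y*(x) = x.
# dy*/dx = -(f_yy)^{-1} f_yx; a scalar stands in for the Hessian inverse.

def implicit_hypergradient(x):
    y_star = x                # exact LL solution for this toy f
    f_yy, f_yx = 1.0, -1.0    # second derivatives of f at (x, y*(x))
    dy_dx = -f_yx / f_yy      # implicit function theorem
    dF_dy = y_star - 1.0      # F(x, y) = 0.5*(y - 1)^2
    dF_dx = 0.0
    return dF_dx + dy_dx * dF_dy

print(implicit_hypergradient(3.0))  # 2.0, matching dphi/dx = x - 1
```

In higher dimensions the scalar division becomes a linear solve with the Hessian, which is exactly the step CG or Neumann approximations target.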

As discussed above, EGBMs and IGBMs need repeated calculations of recurrent gradients or Hessian inverses, leading to high time and space complexity in numerical computation, and they require the LLC assumption in theory. When the dimension of $\mathbf{y}$ is very large, as usually happens in practice, the burden of massively computing matrix-vector products may be too heavy to bear. In addition, the LLC assumption is unsuitable for most complex real-world tasks.

3 The Proposed Algorithm

In this section, we illustrate our algorithmic framework, named Bi-level Value-Function-based Sequential Minimization (BVFSM). Our method also follows the idea of constructing a sequence of unconstrained minimization problems to approximate the original bi-level problem, but unlike existing methods, BVFSM uses a re-characterization via the value-function of the LL problem. Thanks to this strategy, our algorithm can handle problems with complicated non-convex, high-dimensional LL objectives, which existing methods cannot deal with.

3.1 Value-Function-based Single-level Reformulation

BVFSM designs a sequence of single-level unconstrained minimization problems to approximate the original problem through a value-function-based reformulation. We first present this procedure for the optimistic BLO case.

Recall that the original optimistic BLO in Eq. (2) has been transformed into Eq. (3), and we hope to compute $G(\mathbf{x})$ in Eq. (4). Note that the difficulty of computing $\frac{\partial\varphi(\mathbf{x})}{\partial\mathbf{x}}$ comes from the ill-conditioning of $\varphi(\mathbf{x})$, owing to the nested structure of the bi-level sub-problem defining $\varphi(\mathbf{x})$. Hence, we introduce the value-function of the LL problem, $f^{*}(\mathbf{x}):=\min_{\mathbf{y}}f(\mathbf{x},\mathbf{y})$, to transform it into a single-level problem. The problem can then be reformulated as

$\varphi(\mathbf{x})=\min_{\mathbf{y}}\big\{F(\mathbf{x},\mathbf{y}):f(\mathbf{x},\mathbf{y})\leq f^{*}(\mathbf{x})\big\}.$ (5)

However, the inequality constraint $f(\mathbf{x},\mathbf{y})\leq f^{*}(\mathbf{x})$ is still ill-posed, because it satisfies no standard regularity condition and $f^{*}(\mathbf{x})$ is non-smooth. To handle this difficulty, we approximate $f^{*}(\mathbf{x})$ with regularization:

$f_{\mu}^{*}(\mathbf{x})=\min_{\mathbf{y}}\left\{f(\mathbf{x},\mathbf{y})+\frac{\mu}{2}\|\mathbf{y}\|^{2}\right\},$ (6)

where $\frac{\mu}{2}\|\mathbf{y}\|^{2}$ ($\mu>0$) is the regularization term.

We further add an auxiliary function of the inequality constraints to the objective, and obtain

$\varphi_{\mu,\theta,\sigma}(\mathbf{x})=\min_{\mathbf{y}}\left\{F(\mathbf{x},\mathbf{y})+P_{\sigma}\!\left(f(\mathbf{x},\mathbf{y})-f_{\mu}^{*}(\mathbf{x})\right)+\frac{\theta}{2}\|\mathbf{y}\|^{2}\right\},$ (7)

where $\mu,\theta,\sigma>0$, $\frac{\theta}{2}\|\mathbf{y}\|^{2}$ is the regularization term, and $P_{\sigma}:\mathbb{R}\rightarrow\overline{\mathbb{R}}$ (with $\overline{\mathbb{R}}=\mathbb{R}\cup\{\infty\}$) is the selected auxiliary function for the sequential unconstrained minimization method with parameter $\sigma$, which will be defined in Eq. (8) and discussed in detail next. This reformulation changes the constrained problem Eq. (5) into a sequence of unconstrained problems Eq. (7) under different parameters. The regularization terms in Eq. (6) and Eq. (7) guarantee the uniqueness of the solutions to these two problems, which is essential for the differentiability of $\varphi_{\mu,\theta,\sigma}(\mathbf{x})$ and will be discussed in Remark 1 of Section 3.2. Experiments in Section 5.1.1 also demonstrate that introducing the regularization terms, which ensure differentiability and avoid possible jumps, improves computational stability.
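One pass of the reformulation in Eqs. (6)-(7) can be sketched end-to-end on a 1-D toy problem. Everything here is an illustrative assumption rather than the paper's prescribed setup: the toy objectives $f(x,y)=\frac{1}{2}(y-x)^2$ and $F(x,y)=\frac{1}{2}(y-1)^2$, the quadratic penalty as $P_{\sigma}$, the ternary-search inner solver, and the particular parameter values.

```python
# A BVFSM-style inner solve of Eqs. (6)-(7) on a 1-D toy problem:
#   f(x, y) = 0.5*(y - x)^2 (so y*(x) = x),  F(x, y) = 0.5*(y - 1)^2,
# with the quadratic penalty P_sigma(w) = (max(w, 0))^2 / (2*sigma).
# Toy functions, solver, and parameter values are illustrative choices.

def argmin_1d(h, lo=-10.0, hi=10.0, iters=200):
    # ternary search; both objectives below are convex in y
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if h(m1) < h(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

def varphi_approx(x, mu=1e-6, theta=1e-6, sigma=1e-6):
    f = lambda y: 0.5 * (y - x) ** 2
    F = lambda y: 0.5 * (y - 1.0) ** 2
    pen = lambda w: max(w, 0.0) ** 2 / (2 * sigma)
    # Eq. (6): regularized LL value-function f*_mu(x)
    z = argmin_1d(lambda y: f(y) + 0.5 * mu * y ** 2)
    f_mu = f(z) + 0.5 * mu * z ** 2
    # Eq. (7): penalized, regularized single-level objective
    obj = lambda y: F(y) + pen(f(y) - f_mu) + 0.5 * theta * y ** 2
    return obj(argmin_1d(obj))

# As mu, theta, sigma -> 0, varphi_approx(x) approaches
# phi(x) = F(x, y*(x)) = 0.5*(x - 1)^2; here phi(3) = 2.
print(varphi_approx(3.0))
```

The penalty keeps $y$ near the relaxed feasible set $\{f(x,y)\leq f_{\mu}^{*}(x)\}$ while $F$ selects among near-optimal LL points, so the computed value approximates $\varphi(x)$ increasingly well as the parameters shrink.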

The sequential unconstrained minimization method is mainly used to solve constrained nonlinear programs by converting them into a sequence of unconstrained minimization problems [35, 37, 38]. Specifically, we add to the objective a selected auxiliary function of the constraints with a sequence of parameters, obtaining a series of unconstrained problems. As the parameters converge, the sequential unconstrained problems converge to the original constrained problem, leading to the convergence of the solutions. Based on their properties, auxiliary functions are mainly divided into two types, barrier functions and penalty functions [39, 40], whose definitions are provided here.

Definition 1

A continuous, differentiable, and non-decreasing function $\rho:\mathbb{R}\rightarrow\overline{\mathbb{R}}$ is called a standard barrier function if $\rho(\omega;\sigma)$ satisfies $\rho(\omega;\sigma)\geq 0$ and $\lim_{\sigma\rightarrow 0}\rho(\omega;\sigma)=0$ when $\omega<0$, and $\rho(\omega;\sigma)\rightarrow\infty$ as $\omega\rightarrow 0$. It is called a standard penalty function if $\rho(\omega;\sigma)=0$ when $\omega\leq 0$, and $\rho(\omega;\sigma)>0$ with $\lim_{\sigma\rightarrow 0}\rho(\omega;\sigma)=\infty$ when $\omega>0$. Here $\sigma>0$ is the barrier or penalty parameter. In addition, if $\rho(\omega;\sigma^{(1)})$ is a standard barrier function, then $\rho(\omega-\sigma^{(2)};\sigma^{(1)})$ is called a modified barrier function ($\sigma^{(1)},\sigma^{(2)}>0$).

For simplicity of expression later, we define the function $P_{\sigma}$ in Eq. (7) as

$P_{\sigma}(\omega):=\begin{cases}\rho(\omega;\sigma),&\text{if }\rho\text{ is a penalty or standard barrier function},\\ \rho(\omega-\sigma^{(2)};\sigma^{(1)}),&\text{if }\rho\text{ is a modified barrier function}.\end{cases}$ (8)

Here, for a modified barrier function, $\sigma^{(2)}>0$ guarantees that in Eq. (7), $f(\mathbf{x},\mathbf{y})-f_{\mu}^{*}(\mathbf{x})-\sigma^{(2)}<0$, so the barrier function is well-defined.

Classical examples of auxiliary functions are the quadratic penalty, inverse barrier, and log barrier functions [41, 40]. Other popular examples include the polynomial penalty function [39] and the truncated log barrier function [42]. These standard penalty and barrier functions are listed in Table I. Since we need $\varphi_{\mu,\theta,\sigma}(\mathbf{x})$ to be smooth and will later calculate the gradient of $P_{\sigma}$, we choose smooth auxiliary functions rather than non-smooth exact penalty functions. Note that when $\rho$ is a modified barrier function, the $\sigma$ of $P_{\sigma}(\omega)$ in Eq. (8) has two components $\sigma^{(1)}$ and $\sigma^{(2)}$, and the specific form of $P_{\sigma}(\omega)$ is obtained by substituting $\omega-\sigma^{(2)}$ and $\sigma^{(1)}$ for $\omega$ and $\sigma$ in Table I.

TABLE I: Some available standard penalty and barrier functions.
Penalty functions
Quadratic: $\rho(\omega;\sigma)=\frac{1}{2\sigma}(\omega^{+})^{2}$, where $\omega^{+}=\max\{\omega,0\}$
Polynomial: $\rho(\omega;\sigma)=\frac{1}{q\sigma}(\omega^{+})^{q}$, where $q$ is a positive integer chosen such that $\rho(\omega;\sigma)$ is differentiable
Barrier functions
Inverse: $\rho(\omega;\sigma)=\begin{cases}-\frac{\sigma}{\omega},&\omega<0\\ \infty,&\omega\geq 0\end{cases}$
Truncated Log: $\rho(\omega;\sigma)=\begin{cases}-\sigma\left(\log(-\omega)+\beta_{1}\right),&-\kappa\leq\omega<0\\ -\sigma\left(\beta_{2}+\frac{\beta_{3}}{\omega^{2}}+\frac{\beta_{4}}{\omega}\right),&\omega<-\kappa\\ \infty,&\omega\geq 0\end{cases}$, where $0<\kappa\leq 1$ and $\beta_{1},\beta_{2},\beta_{3},\beta_{4}$ are chosen such that $\rho(\omega;\sigma)\geq 0$ and is twice differentiable
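Two of the simplest entries in Table I, the quadratic penalty and the inverse barrier, can be written as plain functions so that the defining properties of Definition 1 can be checked numerically; this is a minimal sketch, with the specific test values chosen for illustration.

```python
# The quadratic penalty and inverse barrier of Table I as plain functions.

def quadratic_penalty(w, sigma):
    # standard penalty: zero for w <= 0, positive and growing as
    # sigma -> 0 for w > 0
    return max(w, 0.0) ** 2 / (2 * sigma)

def inverse_barrier(w, sigma):
    # standard barrier: nonnegative and vanishing with sigma for w < 0,
    # infinite at the boundary w >= 0
    return -sigma / w if w < 0 else float('inf')

print(quadratic_penalty(-1.0, 0.1), inverse_barrier(-2.0, 0.1))
```

In Eq. (7) these are applied to $\omega=f(\mathbf{x},\mathbf{y})-f_{\mu}^{*}(\mathbf{x})$: the penalty charges violations of the relaxed LL optimality constraint, while the barrier keeps iterates strictly inside it.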

3.2 Sequential Minimization Strategy

From the discussion above, we then hope to solve

$\min_{\mathbf{x}\in\mathcal{X}}\ \varphi_{\mu,\theta,\sigma}(\mathbf{x}),$ (9)

with $\varphi_{\mu,\theta,\sigma}(\mathbf{x})$ given in Eq. (7). First denote

$\mathbf{z}^{*}_{\mu}(\mathbf{x})=\mathop{\mathrm{argmin}}_{\mathbf{y}}\left\{f(\mathbf{x},\mathbf{y})+\frac{\mu}{2}\|\mathbf{y}\|^{2}\right\},$ (10)
$\mathbf{y}_{\mu,\theta,\sigma}^{*}(\mathbf{x})=\mathop{\mathrm{argmin}}_{\mathbf{y}}\left\{F(\mathbf{x},\mathbf{y})+P_{\sigma}\!\left(f(\mathbf{x},\mathbf{y})-f_{\mu}^{*}(\mathbf{x})\right)+\frac{\theta}{2}\|\mathbf{y}\|^{2}\right\}.$ (11)

The following proposition gives the smoothness of $\varphi_{\mu,\theta,\sigma}(\mathbf{x})$ and the formula for computing $\frac{\partial\varphi_{\mu,\theta,\sigma}(\mathbf{x})}{\partial\mathbf{x}}$, i.e., $G(\mathbf{x})$, which serves as the foundation of our algorithm.

Proposition 1 (Calculation of $G(\mathbf{x})$)

Suppose $F(\mathbf{x},\mathbf{y})$ and $f(\mathbf{x},\mathbf{y})$ are bounded below and continuously differentiable. Given $\mathbf{x}\in\mathcal{X}$ and $\mu,\theta,\sigma>0$, if $\mathbf{z}^{*}_{\mu}(\mathbf{x})$ and $\mathbf{y}_{\mu,\theta,\sigma}^{*}(\mathbf{x})$ are unique, then $\varphi_{\mu,\theta,\sigma}(\mathbf{x})$ is differentiable and

$G(\mathbf{x})=\frac{\partial P_{\sigma}\!\left(f(\mathbf{x},\mathbf{y})-f_{\mu}^{*}(\mathbf{x})\right)}{\partial\mathbf{x}}\Big|_{\mathbf{y}=\mathbf{y}_{\mu,\theta,\sigma}^{*}(\mathbf{x})},$ (12)

where $f_{\mu}^{*}(\mathbf{x})=f\!\left(\mathbf{x},\mathbf{z}^{*}_{\mu}(\mathbf{x})\right)+\frac{\mu}{2}\|\mathbf{z}^{*}_{\mu}(\mathbf{x})\|^{2}$ and $\frac{\partial f_{\mu}^{*}(\mathbf{x})}{\partial\mathbf{x}}=\frac{\partial f(\mathbf{x},\mathbf{y})}{\partial\mathbf{x}}\big|_{\mathbf{y}=\mathbf{z}^{*}_{\mu}(\mathbf{x})}$.
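The Danskin-type identity $\frac{\partial f_{\mu}^{*}(\mathbf{x})}{\partial\mathbf{x}}=\frac{\partial f(\mathbf{x},\mathbf{y})}{\partial\mathbf{x}}\big|_{\mathbf{y}=\mathbf{z}^{*}_{\mu}(\mathbf{x})}$ can be verified numerically on an illustrative 1-D example; the closed forms below are specific to the toy choice $f(x,y)=\frac{1}{2}(y-x)^2$, for which $\mathbf{z}^{*}_{\mu}(x)=x/(1+\mu)$.

```python
# Numerical check of the derivative formula in Proposition 1:
#   d f*_mu / dx = (df/dx)|_{y = z*_mu(x)}  on the toy f(x, y) = 0.5*(y - x)^2.
# The closed forms are specific to this illustrative toy problem.

def f_mu_star(x, mu):
    z = x / (1 + mu)                      # z*_mu(x) in closed form
    return 0.5 * (z - x) ** 2 + 0.5 * mu * z ** 2

def danskin_grad(x, mu):
    z = x / (1 + mu)
    return -(z - x)                       # df/dx = -(y - x) at y = z*_mu(x)

x, mu, h = 3.0, 0.1, 1e-6
fd = (f_mu_star(x + h, mu) - f_mu_star(x - h, mu)) / (2 * h)
print(abs(fd - danskin_grad(x, mu)))      # near-zero difference
```

The central finite difference of $f_{\mu}^{*}$ matches the partial derivative of $f$ evaluated at the regularized LL solution, even though $\mathbf{z}^{*}_{\mu}(x)$ itself depends on $x$: that dependence contributes nothing at the minimizer, which is exactly what the proposition exploits.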

Proof.

We first prove that for any 𝐱¯𝒳\bar{\mathbf{x}}\in\mathcal{X}, f(𝐱,𝐲)+μ2𝐲2f(\mathbf{x},\mathbf{y})+\frac{\mu}{2}\|\mathbf{y}\|^{2} is level-bounded in 𝐲\mathbf{y} locally uniformly around 𝐱¯\bar{\mathbf{x}} (see [29, Definition 3]). That is, for any cc\in\mathbb{R}, there exist δ>0\delta>0 and a bounded set n\mathcal{B}\subset\mathbb{R}^{n}, such that {𝐲n:f(𝐱,𝐲)+μ2𝐲2c},\left\{\mathbf{y}\in\mathbb{R}^{n}:f(\mathbf{x},\mathbf{y})+\frac{\mu}{2}\|\mathbf{y}\|^{2}\leq c\right\}\subset\mathcal{B}, for all 𝐱δ(𝐱¯)𝒳\mathbf{x}\in\mathcal{B}_{\delta}(\bar{\mathbf{x}})\cap\mathcal{X}, where δ(𝐱¯)\mathcal{B}_{\delta}(\bar{\mathbf{x}}) denotes the open ball with center at 𝐱¯\bar{\mathbf{x}} and radius δ\delta, i.e., δ(𝐱¯)={𝐱^𝒳:𝐱¯𝐱^<δ}\mathcal{B}_{\delta}(\bar{\mathbf{x}})=\{\hat{\mathbf{x}}\in\mathcal{X}:\|\bar{\mathbf{x}}-\hat{\mathbf{x}}\|<\delta\}. Assume by contradiction that this does not hold. Then there exist sequences {𝐱k}\{\mathbf{x}_{k}\} and {𝐲k}\{\mathbf{y}_{k}\} satisfying 𝐱k𝐱¯\mathbf{x}_{k}\rightarrow\bar{\mathbf{x}} and 𝐲k+\|\mathbf{y}_{k}\|\rightarrow+\infty, such that f(𝐱k,𝐲k)+μ2𝐲k2c.f(\mathbf{x}_{k},\mathbf{y}_{k})+\frac{\mu}{2}\|\mathbf{y}_{k}\|^{2}\leq c. Since f(𝐱,𝐲)f(\mathbf{x},\mathbf{y}) is bounded below, 𝐲k\|\mathbf{y}_{k}\|\rightarrow\infty implies f(𝐱k,𝐲k)+μ2𝐲k2f(\mathbf{x}_{k},\mathbf{y}_{k})+\frac{\mu}{2}\|\mathbf{y}_{k}\|^{2}\rightarrow\infty, which contradicts f(𝐱k,𝐲k)+μ2𝐲k2cf(\mathbf{x}_{k},\mathbf{y}_{k})+\frac{\mu}{2}\|\mathbf{y}_{k}\|^{2}\leq c with cc\in\mathbb{R}.

Hence, by the arbitrariness of 𝐱¯\bar{\mathbf{x}}, we have that f(𝐱,𝐲)+μ2𝐲2f(\mathbf{x},\mathbf{y})+\frac{\mu}{2}\|\mathbf{y}\|^{2} is level-bounded in 𝐲\mathbf{y} locally uniformly in 𝐱𝒳\mathbf{x}\in\mathcal{X}, and thus the inf-compactness condition in [43, Theorem 4.13] holds for f(𝐱,𝐲)+μ2𝐲2f(\mathbf{x},\mathbf{y})+\frac{\mu}{2}\|\mathbf{y}\|^{2}. Since argmin𝐲n{f(𝐱,𝐲)+μ2𝐲2}{\mathrm{argmin}}_{\mathbf{y}\in\mathbb{R}^{n}}\left\{f(\mathbf{x},\mathbf{y})+\frac{\mu}{2}\|\mathbf{y}\|^{2}\right\} is a singleton, it follows from [43, Theorem 4.13, Remark 4.14] that

fμ(𝐱)𝐱\displaystyle{\color[rgb]{0,0,0}\frac{\partial f_{\mu}^{*}(\mathbf{x})}{\partial\mathbf{x}}} =(f(𝐱,𝐲)+μ2𝐲2)𝐱|𝐲=𝐳μ(𝐱)=f(𝐱,𝐲)𝐱|𝐲=𝐳μ(𝐱).\displaystyle=\frac{\partial\left(f(\mathbf{x},\mathbf{y})+\frac{\mu}{2}\|\mathbf{y}\|^{2}\right)}{\partial\mathbf{x}}\big{|}_{\mathbf{y}=\mathbf{z}_{\mu}^{*}(\mathbf{x})}=\frac{\partial f(\mathbf{x},\mathbf{y})}{\partial\mathbf{x}}\big{|}_{\mathbf{y}=\mathbf{z}_{\mu}^{*}(\mathbf{x})}.

Next, from definitions of penalty and barrier functions (Definition 1), we have ρ(ω;σ)0\rho(\omega;\sigma)\geq 0 for any ω\omega, so Pσ(ω)0P_{\sigma}(\omega)\geq 0 holds. Then, since F(𝐱,𝐲)F(\mathbf{x},\mathbf{y}) is assumed to be bounded below, similar to f(𝐱,𝐲)+μ2𝐲2f(\mathbf{x},\mathbf{y})+\frac{\mu}{2}\|\mathbf{y}\|^{2}, the inf-compactness condition in [43, Theorem 4.13] also holds for F(𝐱,𝐲)+Pσ(f(𝐱,𝐲)fμ(𝐱))+θ2𝐲2F(\mathbf{x},\mathbf{y})+P_{\sigma}\!\left(f(\mathbf{x},\mathbf{y})-f_{\mu}^{*}(\mathbf{x})\right)+\frac{\theta}{2}\|\mathbf{y}\|^{2}. Combining with the fact that

argmin𝐲n{F(𝐱,𝐲)+Pσ(f(𝐱,𝐲)fμ(𝐱))+θ2𝐲2}{\mathrm{argmin}}_{\mathbf{y}\in\mathbb{R}^{n}}\left\{F(\mathbf{x},\mathbf{y})+P_{\sigma}\!\left(f(\mathbf{x},\mathbf{y})-f_{\mu}^{*}(\mathbf{x})\right)+\frac{\theta}{2}\|\mathbf{y}\|^{2}\right\}

is a singleton, [43, Theorem 4.13, Remark 4.14] shows that

φμ,θ,σ(𝐱)𝐱\displaystyle\frac{\partial\varphi_{\mu,\theta,\sigma}\left(\mathbf{x}\right)}{\partial\mathbf{x}}
=(F(𝐱,𝐲)+Pσ(f(𝐱,𝐲)fμ(𝐱))+θ2𝐲2)𝐱|𝐲=𝐲μ,θ,σ(𝐱)\displaystyle=\frac{\partial\left(F(\mathbf{x},\mathbf{y})+P_{\sigma}\!\left(f(\mathbf{x},\mathbf{y})-f_{\mu}^{*}(\mathbf{x})\right)+\frac{\theta}{2}\|\mathbf{y}\|^{2}\right)}{\partial\mathbf{x}}\big{|}_{\mathbf{y}=\mathbf{y}_{\mu,\theta,\sigma}^{*}(\mathbf{x})}
=(F(𝐱,𝐲)𝐱+Pσ(f(𝐱,𝐲)fμ(𝐱))𝐱)|𝐲=𝐲μ,θ,σ(𝐱).\displaystyle=\left(\frac{\partial F(\mathbf{x},\mathbf{y})}{\partial\mathbf{x}}+\frac{\partial P_{\sigma}\!\left(f(\mathbf{x},\mathbf{y})-f_{\mu}^{*}(\mathbf{x})\right)}{\partial\mathbf{x}}\right)\big{|}_{\mathbf{y}=\mathbf{y}_{\mu,\theta,\sigma}^{*}(\mathbf{x})}.

Therefore, the conclusion in Eq. (12) follows immediately. ∎
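The envelope-type formula for fμ(𝐱)𝐱\frac{\partial f_{\mu}^{*}(\mathbf{x})}{\partial\mathbf{x}} established above can be sanity-checked numerically. The following is an illustrative sketch on a 1-D toy problem of our own choosing (not from the paper's experiments), where the regularized minimizer has a closed form:

```python
# Toy check of the envelope formula d f*_mu / dx = (df/dx)|_{y = z*_mu(x)}.
# Here f(x, y) = (y - x)^2, a hypothetical 1-D example, so the minimizer
# z*_mu(x) of f + (mu/2) y^2 is available in closed form.
mu = 0.1

def z_star(x):
    # Stationarity: 2(y - x) + mu * y = 0  =>  y = 2x / (2 + mu)
    return 2.0 * x / (2.0 + mu)

def f_mu_star(x):
    y = z_star(x)
    return (y - x) ** 2 + 0.5 * mu * y ** 2

def envelope_grad(x):
    # Proposition 1: differentiate f only through its explicit x-argument,
    # evaluated at y = z*_mu(x); here df/dx = -2(y - x)
    return -2.0 * (z_star(x) - x)

x0, eps = 1.3, 1e-6
fd = (f_mu_star(x0 + eps) - f_mu_star(x0 - eps)) / (2 * eps)
print(abs(fd - envelope_grad(x0)) < 1e-6)  # True: they agree
```

The key point the sketch illustrates is that no derivative of z*μ(𝐱) with respect to 𝐱 is ever needed, which is what lets BVFSM avoid recurrent gradients and Hessian inverses.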

Remark 1

In Proposition 1 we require the uniqueness of 𝐳μ(𝐱)\mathbf{z}^{*}_{\mu}(\mathbf{x}) and 𝐲μ,θ,σ(𝐱)\mathbf{y}_{\mu,\theta,\sigma}^{*}(\mathbf{x}) to guarantee the differentiability of φμ,θ,σ(𝐱)\varphi_{\mu,\theta,\sigma}(\mathbf{x}). This can be achieved by conditions much weaker than convexity, such as convexity only on a level set. We start with the uniqueness of 𝐳μ(𝐱)\mathbf{z}^{*}_{\mu}(\mathbf{x}). For any given 𝐱𝒳\mathbf{x}\in\mathcal{X}, consider a function f(𝐱,𝐲)f(\mathbf{x},\mathbf{y}) for which there exists a constant c>min𝐲f(𝐱,𝐲)c>\min_{\mathbf{y}}f(\mathbf{x},\mathbf{y}) such that f(𝐱,𝐲)f(\mathbf{x},\mathbf{y}) is convex in 𝐲\mathbf{y} on the level set {𝐲:f(𝐱,𝐲)c}\{\mathbf{y}:f(\mathbf{x},\mathbf{y})\leq c\}. Suppose 𝐲^\hat{\mathbf{y}} is a minimizer of f(𝐱,𝐲)f(\mathbf{x},\mathbf{y}). Then inf𝐲{f(𝐱,𝐲)+μ/2𝐲2}f(𝐱,𝐲^)+μ/2𝐲^2=min𝐲f(𝐱,𝐲)+μ/2𝐲^2<c,\inf_{\mathbf{y}}\{f(\mathbf{x},\mathbf{y})+\mu/2\|\mathbf{y}\|^{2}\}\leq f(\mathbf{x},\hat{\mathbf{y}})+\mu/2\|\hat{\mathbf{y}}\|^{2}=\min_{\mathbf{y}}f(\mathbf{x},\mathbf{y})+\mu/2\|\hat{\mathbf{y}}\|^{2}<c, for a sufficiently small μ>0\mu>0. Thus, 𝐳μ(𝐱)\mathbf{z}^{*}_{\mu}(\mathbf{x}) lies inside the level set {𝐲:f(𝐱,𝐲)c}\{\mathbf{y}:f(\mathbf{x},\mathbf{y})\leq c\} on which f(𝐱,𝐲)f(\mathbf{x},\mathbf{y}) is convex. Hence, f(𝐱,𝐲)+μ/2𝐲2f(\mathbf{x},\mathbf{y})+\mu/2\|\mathbf{y}\|^{2} is strictly convex on {𝐲:f(𝐱,𝐲)c}\{\mathbf{y}:f(\mathbf{x},\mathbf{y})\leq c\}, and the uniqueness of 𝐳μ(𝐱)\mathbf{z}^{*}_{\mu}(\mathbf{x}) follows.

As for the uniqueness of 𝐲μ,θ,σ(𝐱)\mathbf{y}_{\mu,\theta,\sigma}^{*}(\mathbf{x}), suppose that, given 𝐱\mathbf{x}, there exist constants c1>min𝐲𝒮(𝐱)F(𝐱,𝐲)c_{1}>\min_{\mathbf{y}\in\mathcal{S}(\mathbf{x})}F(\mathbf{x},\mathbf{y}) and c2>min𝐲f(𝐱,𝐲)c_{2}>\min_{\mathbf{y}}f(\mathbf{x},\mathbf{y}) such that FF and ff are convex in 𝐲\mathbf{y} on the set {𝐲:F(𝐱,𝐲)c1 and f(𝐱,𝐲)c2}\{\mathbf{y}:F(\mathbf{x},\mathbf{y})\leq c_{1}\text{ and }f(\mathbf{x},\mathbf{y})\leq c_{2}\}. If we select a non-decreasing and convex auxiliary function Pσ()P_{\sigma}(\cdot) (such as those in Table I), then Pσ(f(𝐱,𝐲)fμ(𝐱))P_{\sigma}\!\left(f(\mathbf{x},\mathbf{y})-f_{\mu}^{*}(\mathbf{x})\right) is convex in 𝐲\mathbf{y} on the set (see [44, Proposition 1.54]). Alternatively, if there exists c>min𝐲F(𝐱,𝐲)+Pσ(f(𝐱,𝐲)fμ(𝐱))c>\min_{\mathbf{y}}F(\mathbf{x},\mathbf{y})+P_{\sigma}\!\left(f(\mathbf{x},\mathbf{y})-f_{\mu}^{*}(\mathbf{x})\right) such that F(𝐱,𝐲)+Pσ(f(𝐱,𝐲)fμ(𝐱))F(\mathbf{x},\mathbf{y})+P_{\sigma}\!\left(f(\mathbf{x},\mathbf{y})-f_{\mu}^{*}(\mathbf{x})\right) is convex on the set {𝐲:F(𝐱,𝐲)+Pσ(f(𝐱,𝐲)fμ(𝐱))c}\{\mathbf{y}:F(\mathbf{x},\mathbf{y})+P_{\sigma}\!\left(f(\mathbf{x},\mathbf{y})-f_{\mu}^{*}(\mathbf{x})\right)\leq c\}, then the strict convexity of F(𝐱,𝐲)+Pσ(f(𝐱,𝐲)fμ(𝐱))+θ2𝐲2F(\mathbf{x},\mathbf{y})+P_{\sigma}\!\left(f(\mathbf{x},\mathbf{y})-f_{\mu}^{*}(\mathbf{x})\right)+\frac{\theta}{2}\|\mathbf{y}\|^{2} in 𝐲\mathbf{y} on the set follows similarly, and so does the uniqueness of 𝐲μ,θ,σ(𝐱)\mathbf{y}_{\mu,\theta,\sigma}^{*}(\mathbf{x}).

Proposition 1 serves as the foundation for our algorithmic framework. Denote φk(𝐱):=φμk,θk,σk(𝐱)\varphi_{k}(\mathbf{x}):=\varphi_{\mu_{k},\theta_{k},\sigma_{k}}(\mathbf{x}). We next illustrate the implementation at the kk-th step of the outer loop and the ll-th step of the inner loop, namely the calculation of φk(𝐱l)𝐱\frac{\partial\varphi_{k}(\mathbf{x}_{l})}{\partial\mathbf{x}}.

We first calculate 𝐳μk(𝐱l)\mathbf{z}^{*}_{\mu_{k}}(\mathbf{x}_{l}) in Eq. (10) through T𝐳T_{\mathbf{z}} steps of gradient descent, and denote the output as 𝐳k,lT𝐳\mathbf{z}_{k,l}^{{T_{\mathbf{z}}}}, regarded as an approximation of 𝐳μk(𝐱l)\mathbf{z}^{*}_{\mu_{k}}(\mathbf{x}_{l}). After that, we calculate 𝐲μk,θk,σk(𝐱l)\mathbf{y}_{\mu_{k},\theta_{k},\sigma_{k}}^{*}(\mathbf{x}_{l}) in Eq. (11) through T𝐲T_{\mathbf{y}} steps of gradient descent, and denote the output as 𝐲k,lT𝐲\mathbf{y}_{k,l}^{T_{\mathbf{y}}}. Note that when the objective function is convex, running sufficiently many gradient descent steps yields an approximation of the minimizer. Moreover, the convexity of the objective functions for approximating 𝐳μk(𝐱l)\mathbf{z}^{*}_{\mu_{k}}(\mathbf{x}_{l}) and 𝐲μk,θk,σk(𝐱l)\mathbf{y}_{\mu_{k},\theta_{k},\sigma_{k}}^{*}(\mathbf{x}_{l}) is guaranteed if ff and FF are convex in 𝐲\mathbf{y}, as discussed in Remark 1. Even when the objective function is not convex, if all of its stationary points are minimizers (a condition weaker than convexity), gradient descent still converges to a minimizer.

Then, according to Proposition 1, we can obtain

φk(𝐱l)𝐱F(𝐱l,𝐲k,lT𝐲)𝐱+Gk,l,\frac{\partial\varphi_{k}(\mathbf{x}_{l})}{\partial\mathbf{x}}\approx\frac{\partial F(\mathbf{x}_{l},\mathbf{y}_{k,l}^{T_{\mathbf{y}}})}{\partial\mathbf{x}}+G_{k,l}\ , (13)

with Gk,l=Pσk(f(𝐱l,𝐲k,lT𝐲)fk,lT𝐳)𝐱,G_{k,l}=\frac{\partial P_{\sigma_{k}}\!\left(f\left(\mathbf{x}_{l},\mathbf{y}_{k,l}^{T_{\mathbf{y}}}\right)-f_{k,l}^{T_{\mathbf{z}}}\right)}{\partial\mathbf{x}}, where fk,lT𝐳=f(𝐱l,𝐳k,lT𝐳)+μk2𝐳k,lT𝐳2.f_{k,l}^{T_{\mathbf{z}}}=f(\mathbf{x}_{l},\mathbf{z}_{k,l}^{T_{\mathbf{z}}})+\frac{\mu_{k}}{2}\|\mathbf{z}_{k,l}^{{T_{\mathbf{z}}}}\|^{2}. In summary, the procedure for solving Eq. (9) is given in Algorithms 1 and 2.

Algorithm 1 Our Solution Strategy for Eq. (9).
Input: 𝐱0\mathbf{x}_{0}, (μ0,θ0,σ0)(\mu_{0},\theta_{0},\sigma_{0}), step size α>0\alpha>0.
Output: The optimal UL solution 𝐱\mathbf{x}^{*}.
1:  for k=0K1k=0\rightarrow K-1 do
2:     Calculate (μk,θk,σk)(\mu_{k},\theta_{k},\sigma_{k}).
3:     for l=0L1l=0\rightarrow L-1 do
4:        Calculate φk(𝐱l)𝐱\frac{\partial\varphi_{k}(\mathbf{x}_{l})}{\partial\mathbf{x}} by Algorithm 2.
5:        𝐱l+1=𝙿𝚛𝚘𝚓𝒳(𝐱lαφk(𝐱l)𝐱)\mathbf{x}_{l+1}=\mathtt{Proj}_{\mathcal{X}}\left(\mathbf{x}_{l}-\alpha\frac{\partial\varphi_{k}(\mathbf{x}_{l})}{\partial\mathbf{x}}\right).
6:     end for
7:     𝐱0=𝐱L\mathbf{x}_{0}=\mathbf{x}_{L}.
8:  end for
9:  return  𝐱\mathbf{x}^{*}.
Algorithm 2 Calculation of φk(𝐱l)𝐱\frac{\partial\varphi_{k}(\mathbf{x}_{l})}{\partial\mathbf{x}}.
Input: 𝐱l,(μk,θk,σk)\mathbf{x}_{l},(\mu_{k},\theta_{k},\sigma_{k}).
Output: φk(𝐱l)𝐱\frac{\partial\varphi_{k}(\mathbf{x}_{l})}{\partial\mathbf{x}}.
1:  Calculate 𝐳k,lT𝐳\mathbf{z}_{k,l}^{{T_{\mathbf{z}}}} as an approximation of 𝐳μk(𝐱l)\mathbf{z}^{*}_{\mu_{k}}\!(\mathbf{x}_{l}) by performing T𝐳T_{\mathbf{z}}-step gradient descent on Eq. (10).
2:  Calculate 𝐲k,lT𝐲\mathbf{y}_{k,l}^{T_{\mathbf{y}}} as an approximation of 𝐲μk,θk,σk(𝐱l)\mathbf{y}_{\mu_{k},\theta_{k},\sigma_{k}}^{*}\!(\mathbf{x}_{l}) by performing T𝐲T_{\mathbf{y}}-step gradient descent on Eq. (11).
3:  Calculate an approximation of φk(𝐱l)𝐱\frac{\partial\varphi_{k}(\mathbf{x}_{l})}{\partial\mathbf{x}} by Eq. (13).
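To make the loop structure of Algorithms 1 and 2 concrete, here is a minimal, self-contained sketch on a toy problem of our own design; the functions, the quadratic penalty Pσ(ω)=max(ω,0)²/(2σ) (one admissible auxiliary-function choice), the parameter schedule, and all step sizes below are illustrative assumptions, not the paper's experimental setup:

```python
# Toy BLO (our own illustrative example):
#   UL: F(x, y) = (x - 1)^2 + (y - 1)^2,   LL: f(x, y) = (y - x)^2 / 2,
# so S(x) = {x} and the optimistic solution is x* = y* = 1.  We take
# X = R, so the projection in Algorithm 1, line 5, is trivial.

def bvfsm(x=0.0, K=3, L=60, T=100, alpha=0.1, lr=0.05):
    y = 0.0
    for k in range(K):
        mu = theta = sigma = 0.1 / 2 ** k      # schedule for (mu_k, theta_k, sigma_k)
        for _ in range(L):
            # Algorithm 2, step 1: T_z gradient steps on f + (mu/2) y^2
            z = 0.0
            for _ in range(T):
                z -= lr * ((z - x) + mu * z)
            f_star = 0.5 * (z - x) ** 2 + 0.5 * mu * z ** 2
            # Algorithm 2, step 2: T_y gradient steps on F + P_sigma(f - f*) + (theta/2) y^2
            y = 0.0
            for _ in range(T):
                dP = max(0.5 * (y - x) ** 2 - f_star, 0.0) / sigma  # P'_sigma(gap)
                y -= lr * (2.0 * (y - 1.0) + dP * (y - x) + theta * y)
            # Eq. (13): the UL gradient is dF/dx plus G_{k,l} from the penalty term
            dP = max(0.5 * (y - x) ** 2 - f_star, 0.0) / sigma
            G = dP * ((x - y) - (x - z))       # df/dx at y minus df/dx at z
            x -= alpha * (2.0 * (x - 1.0) + G)
    return x, y

x_out, y_out = bvfsm()
print(round(x_out, 2), round(y_out, 2))
```

With these choices the iterates approach the optimistic solution (1, 1); the θ-regularization leaves a small bias in y that shrinks as θk decreases, and no second-order information is ever computed.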

3.3 Extension for BLO with Functional Constraints

In this subsection, we consider BLO with functional constraints on both the UL and LL problems, which is a more general setting; the preceding discussion for the unconstrained case extends to problems with inequality constraints.

The optimistic BLO problem in Eq. (2) with functional constraints then reads

min𝐱𝒳,𝐲nF(𝐱,𝐲)\displaystyle\min\limits_{\mathbf{x}\in\mathcal{X},\mathbf{y}\in\mathbb{R}^{n}}\ F(\mathbf{x},\mathbf{y})
s.t.{Hj(𝐱,𝐲)0,for any j𝐲𝒮(𝐱):=argmin𝐲{f(𝐱,𝐲):hj(𝐱,𝐲)0,for any j},\displaystyle\mathrm{\ s.t.\ }\left\{\begin{aligned} &H_{j}(\mathbf{x},\mathbf{y})\leq 0,\text{for any }j\ \\ &\mathbf{y}\in{\rm\mathcal{S}}(\mathbf{x}):=\mathop{\arg\min}_{\mathbf{y}}\Big{\{}f(\mathbf{x},\mathbf{y}):h_{j^{\prime}}(\mathbf{x},\mathbf{y})\leq 0,\text{for any }j^{\prime}\Big{\}}\end{aligned}\right.,

where the UL constraints Hj(𝐱,𝐲):m×nH_{j}(\mathbf{x},\mathbf{y}):\mathbb{R}^{m}\times\mathbb{R}^{n}\rightarrow\mathbb{R} (for any j{1,2,,J})\left(\text{for any }{j}\in\left\{1,2,\cdots,J\right\}\right) and the LL constraints hj(𝐱,𝐲):m×nh_{j^{\prime}}(\mathbf{x},\mathbf{y}):\mathbb{R}^{m}\times\mathbb{R}^{n}\rightarrow\mathbb{R} (for any j{1,2,,J})\left(\text{for any }{j^{\prime}}\in\left\{1,2,\cdots,J^{\prime}\right\}\right) are continuously differentiable functions. This is equivalent to min𝐱𝒳φ(𝐱),\min_{\mathbf{x}\in\mathcal{X}}\varphi(\mathbf{x}), where φ(𝐱)\varphi(\mathbf{x}), the value-function of the sub-problem in Eq. (3), is adapted correspondingly to be

φ(𝐱):=min𝐲{F(𝐱,𝐲):Hj(𝐱,𝐲)0,j, and 𝐲𝒮(𝐱)}.\varphi(\mathbf{x}):=\mathop{\min}_{\mathbf{y}}\Big{\{}F(\mathbf{x},\mathbf{y}):H_{j}(\mathbf{x},\mathbf{y})\leq 0,\ \forall j,\text{ and }\mathbf{y}\in{\rm\mathcal{S}}(\mathbf{x})\Big{\}}.

Similarly, the value-function of the LL problem becomes f(𝐱):=min𝐲{f(𝐱,𝐲):hj(𝐱,𝐲)0,j}f^{*}(\mathbf{x}):=\min_{\mathbf{y}}\left\{f(\mathbf{x},\mathbf{y}):h_{j^{\prime}}(\mathbf{x},\mathbf{y})\leq 0,\forall\ {j^{\prime}}\right\}, and by following the technique in Eq. (5) to transform the LL problem into an inequality constraint, the problem can be reformulated as

φ(𝐱)=min𝐲{\displaystyle\varphi(\mathbf{x})=\mathop{\min}_{\mathbf{y}}\Big{\{} F(𝐱,𝐲):Hj(𝐱,𝐲)0,j,\displaystyle F(\mathbf{x},\mathbf{y}):H_{j}(\mathbf{x},\mathbf{y})\leq 0,\forall\ j, (14)
f(𝐱,𝐲)f(𝐱), and hj(𝐱,𝐲)0,j}.\displaystyle f(\mathbf{x},\mathbf{y})\leq f^{*}(\mathbf{x}),\text{ and }h_{j^{\prime}}(\mathbf{x},\mathbf{y})\leq 0,\ \forall\ {j^{\prime}}\Big{\}}.

Then, following the same sequential minimization idea and inspired by the regularized smoothing method in [45], the value-function f(𝐱)f^{*}(\mathbf{x}) can be approximated with a barrier function (different from Eq. (6) due to the LL constraints) and a regularization term:

fμ,σB(𝐱)=min𝐲{f(𝐱,𝐲)+j=1JPB,σB(hj(𝐱,𝐲))+μ2𝐲2},f_{\mu,\sigma_{B}}^{*}(\mathbf{x})=\min\limits_{\mathbf{y}}\Bigg{\{}f(\mathbf{x},\mathbf{y})+\sum_{{j^{\prime}}=1}^{J^{\prime}}P_{B,\sigma_{B}}\!\left(h_{j^{\prime}}(\mathbf{x},\mathbf{y})\right)+\frac{\mu}{2}\|\mathbf{y}\|^{2}\Bigg{\}},

where PB,σB(ω):¯P_{B,\sigma_{B}}(\omega):\mathbb{R}\rightarrow\mathbb{\overline{R}} is the selected standard barrier function for the LL constraint hjh_{j^{\prime}} as defined in Eq. (8), with σB\sigma_{B} as the barrier parameter. Note that here we define PB,σBP_{B,\sigma_{B}} as a standard barrier function, not a modified barrier function. As for the approximation of φ(𝐱)\varphi(\mathbf{x}), Eq. (7) becomes φμ,θ,σ(𝐱)=min𝐲{F(𝐱,𝐲)+j=1JPH,σH(Hj(𝐱,𝐲))+j=1JPh,σh(hj(𝐱,𝐲))+Pf,σf(f(𝐱,𝐲)fμ,σB(𝐱))+θ2𝐲2},\small\begin{aligned} &\varphi_{\mu,\theta,\sigma}(\mathbf{x})=\min\limits_{\mathbf{y}}\Bigg{\{}F(\mathbf{x},\mathbf{y})+\sum_{j=1}^{J}P_{H,\sigma_{H}}\!\left(H_{j}(\mathbf{x},\mathbf{y})\right)\\ &+\sum_{{j^{\prime}}=1}^{J^{\prime}}P_{h,\sigma_{h}}\!\left(h_{j^{\prime}}(\mathbf{x},\mathbf{y})\right)+\ P_{f,\sigma_{f}}\!\left(f(\mathbf{x},\mathbf{y})-f_{\mu,\sigma_{B}}^{*}(\mathbf{x})\right)+\frac{\theta}{2}\|\mathbf{y}\|^{2}\Bigg{\}},\end{aligned} where PH,σH,Ph,σh,Pf,σf:¯P_{H,\sigma_{H}},P_{h,\sigma_{h}},P_{f,\sigma_{f}}:\mathbb{R}\rightarrow\mathbb{\overline{R}} are the selected auxiliary functions of penalty or modified barrier with parameters σH\sigma_{H}, σh\sigma_{h} and σf\sigma_{f}, and (μ,θ,σ)=(μ,θ,σB,σH,σh,σf)>0(\mu,\theta,\sigma)=(\mu,\theta,\sigma_{B},\sigma_{H},\sigma_{h},\sigma_{f})>0.
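To illustrate the barrier-regularized LL value function fμ,σB*(𝐱) above, here is a small numerical sketch on a toy instance of our own choosing (an assumption, not the paper's setup): a single LL constraint h(x, y) = y ≤ 0, f(x, y) = (y − x)², and the log-barrier PB,σB(ω) = −σB log(−ω) as the standard barrier. It checks that the approximate value approaches the true constrained value f*(1) = 1 as (μ, σB) → 0:

```python
import math

def value(x, mu, sigma_b):
    # Minimize f + P_B + (mu/2) y^2 = (y - x)^2 - sigma_b*log(-y) + (mu/2) y^2
    # over y < 0.  The objective is strictly convex there, so we locate the
    # root of its derivative g'(y) = 2(y - x) - sigma_b / y + mu * y by
    # bisection (g' < 0 at y = -10 and g' > 0 near y = 0- for our parameters).
    lo, hi = -10.0, -1e-12
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if 2 * (mid - x) - sigma_b / mid + mu * mid < 0:
            lo = mid
        else:
            hi = mid
    y = 0.5 * (lo + hi)
    return (y - x) ** 2 - sigma_b * math.log(-y) + 0.5 * mu * y ** 2

# Drive (mu, sigma_B) -> 0 together and watch the value approach f*(1) = 1,
# the constrained LL optimal value attained at y = 0.
vals = [value(1.0, 10.0 ** -k, 10.0 ** -k) for k in range(1, 5)]
print(abs(vals[-1] - 1.0) < 0.01)  # True: the approximation tightens
```

The barrier keeps every approximate minimizer strictly feasible (y < 0), which is why a standard rather than modified barrier is used for the LL constraints here.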

Then corresponding to Eq. (10) and Eq. (11), by denoting

𝐳μ,σB(𝐱)=argmin𝐲{f(𝐱,𝐲)+j=1JPB,σB(hj(𝐱,𝐲))+μ2𝐲2},\small\mathbf{z}^{*}_{\mu,\sigma_{B}}(\mathbf{x})\!=\!\underset{\mathbf{y}}{\mathrm{argmin}}\Bigg{\{}f(\mathbf{x},\mathbf{y})+\sum_{{j^{\prime}}=1}^{J^{\prime}}P_{B,\sigma_{B}}\!\left(h_{j^{\prime}}(\mathbf{x},\mathbf{y})\right)+\frac{\mu}{2}\|\mathbf{y}\|^{2}\Bigg{\}},

𝐲μ,θ,σ(𝐱)=argmin𝐲{F(𝐱,𝐲)+j=1JPH,σH(Hj(𝐱,𝐲))+j=1JPh,σh(hj(𝐱,𝐲))+Pf,σf(f(𝐱,𝐲)fμ,σB(𝐱))+θ2𝐲2},\small\begin{aligned} &\mathbf{y}_{\mu,\theta,\sigma}^{*}(\mathbf{x})=\underset{\mathbf{y}}{\mathrm{argmin}}\Bigg{\{}F(\mathbf{x},\mathbf{y})+\sum_{j=1}^{J}P_{H,\sigma_{H}}\!\left(H_{j}(\mathbf{x},\mathbf{y})\right)\\ &+\sum_{{j^{\prime}}=1}^{J^{\prime}}P_{h,\sigma_{h}}\!\left(h_{j^{\prime}}(\mathbf{x},\mathbf{y})\right)+\ P_{f,\sigma_{f}}\!\left(f(\mathbf{x},\mathbf{y})-f_{\mu,\sigma_{B}}^{*}(\mathbf{x})\right)+\frac{\theta}{2}\|\mathbf{y}\|^{2}\Bigg{\}},\end{aligned} we have the next proposition, which follows the same idea from Proposition 1.

Proposition 2

Suppose F(𝐱,𝐲)F(\mathbf{x},\mathbf{y}) and f(𝐱,𝐲)f(\mathbf{x},\mathbf{y}) are bounded below and continuously differentiable. Given 𝐱𝒳\mathbf{x}\in\mathcal{X} and μ,θ,σ>0\mu,\theta,\sigma>0, when 𝐳μ,σB(𝐱)\mathbf{z}^{*}_{\mu,\sigma_{B}}(\mathbf{x}) and 𝐲μ,θ,σ(𝐱)\mathbf{y}_{\mu,\theta,\sigma}^{*}(\mathbf{x}) are unique, then φμ,θ,σ(𝐱)\varphi_{\mu,\theta,\sigma}(\mathbf{x}) is differentiable and G(𝐱)=[j=1JPH,σH(Hj(𝐱,𝐲))𝐱+j=1JPh,σh(hj(𝐱,𝐲))𝐱+Pf,σf(f(𝐱,𝐲)fμ,σB(𝐱))𝐱]|𝐲=𝐲μ,θ,σ(𝐱),\begin{aligned} G(\mathbf{x})=&\Bigg{[}\sum_{j=1}^{J}\frac{\partial P_{H,\sigma_{H}}\!\left(H_{j}(\mathbf{x},\mathbf{y})\right)}{\partial\mathbf{x}}+\sum_{{j^{\prime}}=1}^{J^{\prime}}\frac{\partial P_{h,\sigma_{h}}\!\left(h_{j^{\prime}}(\mathbf{x},\mathbf{y})\right)}{\partial\mathbf{x}}\\ &+\frac{\partial P_{f,\sigma_{f}}\!\left(f\left(\mathbf{x},\mathbf{y}\right)-f_{\mu,\sigma_{B}}^{*}(\mathbf{x})\right)}{\partial\mathbf{x}}\Bigg{]}\big{|}_{\mathbf{y}=\mathbf{y}_{\mu,\theta,\sigma}^{*}(\mathbf{x})},\end{aligned} where

fμ,σB(𝐱)=f(𝐱,𝐳μ,σB(𝐱))+j=1JPB,σB(hj(𝐱,𝐳μ,σB(𝐱)))\displaystyle f_{\mu,\sigma_{B}}^{*}(\mathbf{x})=f\!\left(\mathbf{x},\mathbf{z}^{*}_{\mu,\sigma_{B}}(\mathbf{x})\right)+\sum_{{j^{\prime}}=1}^{J^{\prime}}P_{B,\sigma_{B}}\!\left(h_{j^{\prime}}(\mathbf{x},\mathbf{z}^{*}_{\mu,\sigma_{B}}(\mathbf{x}))\right)
+μ2𝐳μ,σB(𝐱)2,\displaystyle+\frac{\mu}{2}\|\mathbf{z}^{*}_{\mu,\sigma_{B}}(\mathbf{x})\|^{2},
fμ,σB(𝐱)𝐱=[f(𝐱,𝐲)𝐱+j=1JPB,σB(hj(𝐱,𝐲))𝐱]|𝐲=𝐳μ,σB(𝐱).\small\frac{\partial f_{\mu,\sigma_{B}}^{*}(\mathbf{x})}{\partial\mathbf{x}}\!\!=\!\!\Bigg{[}\frac{\partial f\!\left(\mathbf{x},\mathbf{y}\right)}{\partial\mathbf{x}}+\sum_{{j^{\prime}}=1}^{J^{\prime}}\frac{\partial P_{B,\sigma_{B}}\!\left(h_{j^{\prime}}\!\left(\mathbf{x},\mathbf{y}\right)\right)}{\partial\mathbf{x}}\Bigg{]}\big{|}_{\mathbf{y}=\mathbf{z}^{*}_{\mu,\sigma_{B}}(\mathbf{x})}.

The proof is similar to that of Proposition 1, obtained by applying [43, Theorem 4.13, Remark 4.14]. The resulting algorithm is based on Proposition 2 and parallels Algorithms 1 and 2. Note that in Section 4.1, our convergence analysis is carried out under this constrained setting, because problems without constraints can be regarded as a special case.

Remark 2

In terms of the uniqueness of 𝐳μ,σB(𝐱)\mathbf{z}^{*}_{\mu,\sigma_{B}}(\mathbf{x}) and 𝐲μ,θ,σ(𝐱)\mathbf{y}_{\mu,\theta,\sigma}^{*}(\mathbf{x}), if we select convex auxiliary functions PB,σBP_{B,\sigma_{B}}, PH,σHP_{H,\sigma_{H}}, Ph,σhP_{h,\sigma_{h}}, Pf,σfP_{f,\sigma_{f}}, and the constraint functions hj(𝐱,𝐲),j,h_{j^{\prime}}(\mathbf{x},\mathbf{y}),\forall\ j^{\prime}, and Hj(𝐱,𝐲),j,H_{j}(\mathbf{x},\mathbf{y}),\forall\ j, are convex in 𝐲\mathbf{y} on the corresponding level set, then the uniqueness follows as in Remark 1.

3.4 Extension for Pessimistic BLO

In this part, we consider pessimistic BLO, which, to the best of our knowledge, has rarely been discussed for handling learning tasks. For brevity, we focus on problems without UL and LL constraints; the discussion extends easily to the constrained case. As discussed at the very beginning, pessimistic BLO takes the form

min𝐱𝒳\displaystyle\min\limits_{\mathbf{x}\in\mathcal{X}} max𝐲nF(𝐱,𝐲),s.t.𝐲𝒮(𝐱)=argmin𝐲f(𝐱,𝐲).\displaystyle\max\limits_{\mathbf{y}\in\mathbb{R}^{n}}\ F(\mathbf{x},\mathbf{y}),\ \mathrm{\ s.t.\ }\mathbf{y}\in{\rm\mathcal{S}}(\mathbf{x})=\mathop{\arg\min}_{\mathbf{y}}f(\mathbf{x},\mathbf{y}). (15)

Similar to the optimistic case, this problem can be transformed into min𝐱φp(𝐱),\min_{\mathbf{x}}\ \varphi^{p}(\mathbf{x}), where the value-function φ(𝐱)\varphi(\mathbf{x}) in Eq. (3) is redefined as φp(𝐱):=max𝐲{F(𝐱,𝐲):𝐲𝒮(𝐱)}.\begin{aligned} \varphi^{p}(\mathbf{x}):=&\mathop{\max}_{\mathbf{y}}\Big{\{}F(\mathbf{x},\mathbf{y}):\mathbf{y}\in{\rm\mathcal{S}}(\mathbf{x})\Big{\}}.\end{aligned} Considering the value-function f(𝐱)f^{*}(\mathbf{x}), we have the same regularized fμ(𝐱)f_{\mu}^{*}(\mathbf{x}) as in the optimistic case, Eq. (6). As for the approximation of φp(𝐱)\varphi^{p}(\mathbf{x}), thanks to the value-function-based sequential minimization, and in contrast to Eq. (7), we have

φμ,θ,σp(𝐱)=max𝐲\displaystyle\varphi^{p}_{\mu,\theta,\sigma}(\mathbf{x})=\max\limits_{\mathbf{y}} {F(𝐱,𝐲)Pσ(f(𝐱,𝐲)fμ(𝐱))θ2𝐲2},\displaystyle\Big{\{}F(\mathbf{x},\mathbf{y})-P_{\sigma}\!\left(f(\mathbf{x},\mathbf{y})-f_{\mu}^{*}(\mathbf{x})\right)-\frac{\theta}{2}\|\mathbf{y}\|^{2}\Big{\}},

where PσP_{\sigma} is defined in Eq. (8). Same as before, our goal is to solve min𝐱φμ,θ,σp(𝐱).\min_{\mathbf{x}}\varphi^{p}_{\mu,\theta,\sigma}(\mathbf{x}).

Let 𝐳μ(𝐱)\mathbf{z}^{*}_{\mu}(\mathbf{x}) be the same as in Eq. (10); Eq. (11) then becomes

𝐲μ,θ,σ(𝐱)=argmax𝐲\displaystyle\mathbf{y}_{\mu,\theta,\sigma}^{*}(\mathbf{x})=\underset{\mathbf{y}}{\mathrm{argmax}} {F(𝐱,𝐲)Pσ(f(𝐱,𝐲)fμ(𝐱))θ2𝐲2}.\displaystyle\Big{\{}F(\mathbf{x},\mathbf{y})-P_{\sigma}\!\left(f(\mathbf{x},\mathbf{y})-f_{\mu}^{*}(\mathbf{x})\right)-\frac{\theta}{2}\|\mathbf{y}\|^{2}\Big{\}}.

Proposition 1 for the optimistic case then takes the following form in the pessimistic case.

Proposition 3

Suppose F(𝐱,𝐲)-F(\mathbf{x},\mathbf{y}) and f(𝐱,𝐲)f(\mathbf{x},\mathbf{y}) are bounded below and continuously differentiable. Given 𝐱𝒳\mathbf{x}\in\mathcal{X} and μ,θ,σ>0\mu,\theta,\sigma>0, if 𝐳μ(𝐱)\mathbf{z}^{*}_{\mu}(\mathbf{x}) and 𝐲μ,θ,σ(𝐱)\mathbf{y}_{\mu,\theta,\sigma}^{*}(\mathbf{x}) are unique, then φμ,θ,σp(𝐱)\varphi^{p}_{\mu,\theta,\sigma}(\mathbf{x}) is differentiable and

φμ,θ,σp(𝐱)𝐱=F(𝐱,𝐲μ,θ,σ(𝐱))𝐱+G(𝐱),\frac{\partial\varphi^{p}_{\mu,\theta,\sigma}\left(\mathbf{x}\right)}{\partial\mathbf{x}}=\frac{\partial F\!\left(\mathbf{x},\mathbf{y}_{\mu,\theta,\sigma}^{*}(\mathbf{x})\right)}{\partial\mathbf{x}}+G(\mathbf{x}),

with G(𝐱)=Pσ(f(𝐱,𝐲)fμ(𝐱))𝐱|𝐲=𝐲μ,θ,σ(𝐱),\begin{aligned} G(\mathbf{x})=\frac{-\partial P_{\sigma}\!\left(f\left(\mathbf{x},\mathbf{y}\right)-f_{\mu}^{*}(\mathbf{x})\right)}{\partial\mathbf{x}}\big{|}_{\mathbf{y}=\mathbf{y}_{\mu,\theta,\sigma}^{*}(\mathbf{x})},\end{aligned} where fμ(𝐱)=f(𝐱,𝐳μ(𝐱))+μ2𝐳μ(𝐱)2,\begin{aligned} f_{\mu}^{*}(\mathbf{x})=f\!\left(\mathbf{x},\mathbf{z}^{*}_{\mu}(\mathbf{x})\right)+\frac{\mu}{2}\|\mathbf{z}^{*}_{\mu}(\mathbf{x})\|^{2},\end{aligned} and fμ(𝐱)𝐱=f(𝐱,𝐲)𝐱|𝐲=𝐳μ(𝐱).\frac{\partial f_{\mu}^{*}(\mathbf{x})}{\partial\mathbf{x}}=\frac{\partial f\!\left(\mathbf{x},\mathbf{y}\right)}{\partial\mathbf{x}}\big{|}_{\mathbf{y}=\mathbf{z}^{*}_{\mu}(\mathbf{x})}.

The proof is similar to that of Proposition 1, obtained by applying [43, Theorem 4.13, Remark 4.14]. The algorithm for the pessimistic case can then be derived similarly to the optimistic case, except that gradient ascent, instead of gradient descent, is performed when calculating 𝐲μ,θ,σ(𝐱)\mathbf{y}_{\mu,\theta,\sigma}^{*}(\mathbf{x}). The convergence of BVFSM for pessimistic BLO will be discussed in detail in Section 4.1.
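Proposition 3 can be sanity-checked numerically. The following sketch uses a toy pessimistic problem of our own choosing (an illustrative assumption, not from the paper): F(x, y) = −(y − 2x)²/2 is concave in y, f(x, y) = (y − x)²/2, and Pσ(ω) = max(ω, 0)²/(2σ) is the quadratic penalty, so the inner maximization is strictly concave and gradient ascent finds y*:

```python
mu = theta = sigma = 0.1
lr = 0.05

def f_mu_star(x):
    z = x / (1.0 + mu)                        # closed-form z*_mu(x) for this f
    return 0.5 * (z - x) ** 2 + 0.5 * mu * z ** 2

def inner(x, T=4000):
    # Gradient *ascent* on F - P_sigma(f - f*_mu) - (theta/2) y^2
    y = 0.0
    for _ in range(T):
        dP = max(0.5 * (y - x) ** 2 - f_mu_star(x), 0.0) / sigma
        y += lr * (-(y - 2 * x) - dP * (y - x) - theta * y)
    val = (-0.5 * (y - 2 * x) ** 2
           - max(0.5 * (y - x) ** 2 - f_mu_star(x), 0.0) ** 2 / (2 * sigma)
           - 0.5 * theta * y ** 2)
    return val, y                             # phi^p_{mu,theta,sigma}(x) and y*

def grad_formula(x):
    # Proposition 3: dphi^p/dx = dF/dx|_{y*} + G(x)
    _, y = inner(x)
    dP = max(0.5 * (y - x) ** 2 - f_mu_star(x), 0.0) / sigma
    dF_dx = 2.0 * (y - 2 * x)                 # x-derivative of F
    df_dx = x - y                             # x-derivative of f
    dfstar_dx = x - x / (1.0 + mu)            # envelope derivative of f*_mu
    return dF_dx - dP * (df_dx - dfstar_dx)

x0, eps = 1.0, 1e-4
fd = (inner(x0 + eps)[0] - inner(x0 - eps)[0]) / (2 * eps)
print(abs(fd - grad_formula(x0)) < 1e-3)  # True: formula matches finite differences
```

As in the optimistic case, the only change from Algorithm 2 is the sign of the inner update; the outer loop is untouched.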

Remark 3

The uniqueness of 𝐳μ(𝐱)\mathbf{z}^{*}_{\mu}(\mathbf{x}) can be guaranteed by the same analysis as in Remark 1. Similarly, suppose that, given 𝐱\mathbf{x}, there exist constants c1<max𝐲𝒮(𝐱)F(𝐱,𝐲)c_{1}<\max_{\mathbf{y}\in\mathcal{S}(\mathbf{x})}F(\mathbf{x},\mathbf{y}) and c2>min𝐲f(𝐱,𝐲),c_{2}>\min_{\mathbf{y}}f(\mathbf{x},\mathbf{y}), such that FF is concave and ff is convex in 𝐲\mathbf{y} on the set {𝐲:F(𝐱,𝐲)c1 and f(𝐱,𝐲)c2}\{\mathbf{y}:F(\mathbf{x},\mathbf{y})\geq c_{1}\text{ and }f(\mathbf{x},\mathbf{y})\leq c_{2}\}, and we select a non-decreasing and convex auxiliary function Pσ()P_{\sigma}(\cdot). Or simply suppose there exists c<max𝐲F(𝐱,𝐲)Pσ(f(𝐱,𝐲)fμ(𝐱))c<\max_{\mathbf{y}}F(\mathbf{x},\mathbf{y})-P_{\sigma}\!\left(f(\mathbf{x},\mathbf{y})-f_{\mu}^{*}(\mathbf{x})\right) such that F(𝐱,𝐲)Pσ(f(𝐱,𝐲)fμ(𝐱))F(\mathbf{x},\mathbf{y})-P_{\sigma}\!\left(f(\mathbf{x},\mathbf{y})-f_{\mu}^{*}(\mathbf{x})\right) is concave on the set {𝐲:F(𝐱,𝐲)Pσ(f(𝐱,𝐲)fμ(𝐱))c}\{\mathbf{y}:F(\mathbf{x},\mathbf{y})-P_{\sigma}\!\left(f(\mathbf{x},\mathbf{y})-f_{\mu}^{*}(\mathbf{x})\right)\geq c\}. Then the uniqueness of 𝐲μ,θ,σ(𝐱)\mathbf{y}_{\mu,\theta,\sigma}^{*}(\mathbf{x}) follows.

4 Theoretical Analysis

This section presents the convergence analysis and complexity analysis of the proposed BVFSM.

4.1 Convergence Analysis

Here we present the convergence analysis of the proposed method. Since BLO without constraints can be seen as a special case of BLO with constraints, obtained by setting Hj(𝐱,𝐲)0,j,H_{j}(\mathbf{x},\mathbf{y})\equiv 0,\forall j, and hj(𝐱,𝐲)0,jh_{j^{\prime}}(\mathbf{x},\mathbf{y})\equiv 0,\forall{j^{\prime}}, we prove results for the more general constrained case. Also, for brevity, we first treat the optimistic BLO case; the pessimistic case is analyzed later in Corollary 1.

Note that for sequential-minimization-type schemes, including EGBMs and BVFSM, convergence analyses can be classified as asymptotic or non-asymptotic [46, 47]. This work considers asymptotic convergence and focuses on approximation quality, that is, whether the solutions of the sequentially constructed approximate sub-problems converge to a solution of the original bi-level problem. We prove asymptotic convergence in terms of global solutions, and start by recalling the equivalent definition of epi-convergence given in [43, pp. 41].

Definition 2

φkeφ\varphi_{k}\stackrel{{\scriptstyle e}}{{\longrightarrow}}\varphi if and only if for all 𝐱m\mathbf{x}\in\mathbb{R}^{m}, the following two conditions hold:

  1. (1)

    for any sequence {𝐱k}\{\mathbf{x}_{k}\} converging to 𝐱\mathbf{x},

    lim infkφk(𝐱k)φ(𝐱);\liminf\limits_{k\rightarrow\infty}\varphi_{k}(\mathbf{x}_{k})\geq\varphi(\mathbf{x}); (16)
  2. (2)

    there is a sequence {𝐱k}\{\mathbf{x}_{k}\} converging to 𝐱\mathbf{x} such that

    lim supkφk(𝐱k)φ(𝐱).\limsup\limits_{k\rightarrow\infty}\varphi_{k}(\mathbf{x}_{k})\leq\varphi(\mathbf{x}). (17)
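A standard illustrative example (ours, not from the original text) shows why the epi-limit can differ from the pointwise limit. Consider, on \(\mathbb{R}\),

```latex
\varphi_k(x) =
  \begin{cases}
    -1, & x = 1/k,\\[2pt]
    \;\,0, & \text{otherwise.}
  \end{cases}
```

Pointwise, \(\varphi_k(x)\to 0\) for every fixed \(x\). However, along \(x_k = 1/k \to 0\) we have \(\liminf_k \varphi_k(x_k) = -1\), so condition (1) forces \(\varphi(0)\le -1\), while condition (2) is attained by the same sequence; since \(\varphi_k \ge -1\) everywhere, the epi-limit is \(\varphi(0) = -1\) and \(\varphi(x) = 0\) for \(x \neq 0\). Epi-convergence thus tracks limiting minimal values that pointwise convergence misses, which is exactly what the analysis of \(\inf_{\mathbf{x}\in\mathcal{X}}\varphi_k(\mathbf{x})\) requires.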

The convergence results are established under the following blanket assumption.

Assumption 1 (Assumptions for the problem)

   

  1. (1)

    𝒮(𝐱)\mathcal{S}(\mathbf{x}) is nonempty for 𝐱𝒳\mathbf{x}\in\mathcal{X}.

  2. (2)

    Both F(𝐱,𝐲)F(\mathbf{x},\mathbf{y}) and f(𝐱,𝐲)f(\mathbf{x},\mathbf{y}) are jointly continuous and continuously differentiable. Both Hj(𝐱,𝐲)H_{j}(\mathbf{x},\mathbf{y}), j\forall\ {j} and hj(𝐱,𝐲)h_{j^{\prime}}(\mathbf{x},\mathbf{y}), j\forall\ {j^{\prime}} are continuously differentiable.

  3. (3)

    F(𝐱,𝐲)F(\mathbf{x},\mathbf{y}) is level-bounded in 𝐲\mathbf{y} locally uniformly in 𝐱𝒳\mathbf{x}\in\mathcal{X} (see [29, Definition 3]).

  4. (4)

    For constrained BLO, 0 is not a local optimal value of hj(𝐱,𝐲)h_{j^{\prime}}(\mathbf{x},\mathbf{y}) w.r.t. 𝐲\mathbf{y} for all j{j^{\prime}}.

For notational simplicity, we let J=J′=1J=J^{\prime}=1, meaning that there is a single constraint on the UL and LL problems respectively, denoted H(𝐱,𝐲)H(\mathbf{x},\mathbf{y}) and h(𝐱,𝐲)h(\mathbf{x},\mathbf{y}); the proofs for multiple constraints are parallel. In addition, denote fk(𝐱):=fμk,σB,k(𝐱)f^{*}_{k}(\mathbf{x}):=f_{\mu_{k},\sigma_{B,k}}^{*}(\mathbf{x}), φk(𝐱):=φμk,θk,σk(𝐱)\varphi_{k}(\mathbf{x}):=\varphi_{\mu_{k},\theta_{k},\sigma_{k}}(\mathbf{x}), and Pk(ω):=Pσk(ω)P_{k}(\omega)\!:=\!P_{\sigma_{k}}(\omega) defined in Eq. (8). Also, PB,k(ω),PH,k(ω),Ph,k(ω),Pf,k(ω)P_{B,k}(\omega),P_{H,k}(\omega),P_{h,k}(\omega),P_{f,k}(\omega) are defined similarly. Note that PB,kP_{B,k} is a standard barrier function, while PH,kP_{H,k}, Ph,kP_{h,k}, Pf,kP_{f,k} are penalty or modified barrier functions. Then

fk(𝐱)=min𝐲{f(𝐱,𝐲)+PB,k(h(𝐱,𝐲))+μk2𝐲2},f^{*}_{k}(\mathbf{x})=\min\limits_{\mathbf{y}}\left\{f(\mathbf{x},\mathbf{y})+P_{B,k}\!\left(h(\mathbf{x},\mathbf{y})\right)+\frac{\mu_{k}}{2}\|\mathbf{y}\|^{2}\right\},
φk(𝐱)=min𝐲{F(𝐱,𝐲)+PH,k(H(𝐱,𝐲))+Ph,k(h(𝐱,𝐲))\displaystyle\varphi_{k}(\mathbf{x})=\min\limits_{\mathbf{y}}\Big{\{}F(\mathbf{x},\mathbf{y})+P_{H,k}\!\left(H(\mathbf{x},\mathbf{y})\right)+P_{h,k}\!\left(h(\mathbf{x},\mathbf{y})\right)
+Pf,k(f(𝐱,𝐲)fk(𝐱))+θk2𝐲2}.\displaystyle+P_{f,k}\!\left(f(\mathbf{x},\mathbf{y})-f_{k}^{*}(\mathbf{x})\right)+\frac{\theta_{k}}{2}\|\mathbf{y}\|^{2}\Big{\}}.

To begin with, we present the following lemma on the properties of penalty and modified barrier functions, as preparation for further discussion and proofs. (Proofs of the four lemmas are provided in Appendix A, available at https://arxiv.org/abs/2110.04974.)

Lemma 1

Let {σk}\{\sigma_{k}\} in Pk(ω)=Pσk(ω)P_{k}(\omega)\!=\!P_{\sigma_{k}}(\omega) be a positive sequence such that limkσk=0\lim_{k\rightarrow\infty}\sigma_{k}=0. Additionally assume that limkρ(σk(2);σk(1))=0\lim_{k\rightarrow\infty}\rho(-\sigma^{(2)}_{k}\;;\sigma_{k}^{(1)})=0 when ρ\rho is a modified barrier function. Then we have

  1. (1)

    Pk(ω)P_{k}(\omega) is continuous, differentiable and non-decreasing, and satisfies Pk(ω)0P_{k}(\omega)\geq 0.

  2. (2)

    For any ω0\omega\leq 0, limkPk(ω)=0\lim_{k\rightarrow\infty}P_{k}(\omega)=0.

  3. (3)

    For any sequence {ωk}\{\omega_{k}\}, limkPk(ωk)<+\lim_{k\rightarrow\infty}P_{k}(\omega_{k})<+\infty implies that lim supkωk0\limsup_{k\rightarrow\infty}\omega_{k}\leq 0.
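The three properties can be verified directly for concrete choices of PkP_{k}; the sketch below checks them numerically for the quadratic penalty Pk(ω) = max(ω, 0)²/(2σk), our illustrative pick of an admissible penalty function (stated here as an assumption, one of the choices in the spirit of Table I):

```python
# Check Lemma 1's properties for P_k(w) = max(w, 0)^2 / (2 sigma_k), sigma_k -> 0.
sigmas = [10.0 ** (-k) for k in range(1, 7)]

def P(w, sigma):
    return max(w, 0.0) ** 2 / (2.0 * sigma)

grid = [-1.0, -0.5, 0.0, 0.5, 1.0]
for s in sigmas:
    vals = [P(w, s) for w in grid]
    assert all(v >= 0.0 for v in vals)                  # property (1): P_k >= 0
    assert all(a <= b for a, b in zip(vals, vals[1:]))  # property (1): non-decreasing

assert P(-0.1, sigmas[-1]) == 0.0                       # property (2): P_k(w) -> 0 for w <= 0

# Property (3): at any fixed w > 0 the penalty blows up as sigma_k -> 0,
# so a bounded sequence P_k(w_k) forces limsup w_k <= 0.
blow_up = [P(0.1, s) for s in sigmas]
print(blow_up[-1] > 1e3)  # True
```

Property (3) is what ultimately squeezes the approximate LL constraint f(𝐱, 𝐲) ≤ fμ*(𝐱) to hold in the limit.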

We will use these properties in later proofs, so we maintain these requirements on the parameters in Lemma 1 from now on. To prove the convergence result, we verify the two conditions given in Definition 2, and show that φk(𝐱)+δ𝒳(𝐱)eφ(𝐱)+δ𝒳(𝐱)\varphi_{k}(\mathbf{x})+\delta_{\mathcal{X}}(\mathbf{x})\stackrel{{\scriptstyle e}}{{\longrightarrow}}\varphi(\mathbf{x})+\delta_{\mathcal{X}}(\mathbf{x}), where δ𝒳(𝐱)\delta_{\mathcal{X}}(\mathbf{x}) denotes the indicator function of the set 𝒳\mathcal{X}, i.e., δ𝒳(𝐱)=0\delta_{\mathcal{X}}(\mathbf{x})=0 if 𝐱𝒳\mathbf{x}\in\mathcal{X} and δ𝒳(𝐱)=+\delta_{\mathcal{X}}(\mathbf{x})=+\infty if 𝐱𝒳\mathbf{x}\notin\mathcal{X}. We then present the following three lemmas to verify the conditions in Eq. (16) and Eq. (17) of Definition 2.

Lemma 2

Let {(μk,σk)}\{(\mu_{k},\sigma_{k})\} be a positive sequence such that (μk,σk)0(\mu_{k},\sigma_{k})\rightarrow 0, also satisfying the same setting as in Lemma 1. Then for any sequence {𝐱k}\{\mathbf{x}_{k}\} converging to 𝐱¯\bar{\mathbf{x}}, lim supkfk(𝐱k)f(𝐱¯).\limsup\limits_{k\rightarrow\infty}f_{k}^{*}(\mathbf{x}_{k})\leq~{}f^{*}(\bar{\mathbf{x}}).

Lemma 3

Let {(μk,θk,σk)}\{(\mu_{k},\theta_{k},\sigma_{k})\} be a positive sequence such that limk(μk,θk,σk)=0\lim_{k\rightarrow\infty}(\mu_{k},\theta_{k},\sigma_{k})=0, and satisfy the same setting as in Lemma 1. Given 𝐱¯𝒳\bar{\mathbf{x}}\in\mathcal{X}, then for any sequence {𝐱k}\{\mathbf{x}_{k}\} converging to 𝐱¯\bar{\mathbf{x}}, we have lim infkφk(𝐱k)φ(𝐱¯).\liminf\limits_{k\rightarrow\infty}\varphi_{k}(\mathbf{x}_{k})\geq\varphi(\bar{\mathbf{x}}).

Lemma 4

Let {(μk,θk,σk)}\{(\mu_{k},\theta_{k},\sigma_{k})\} be a positive sequence such that limk(μk,θk,σk)=0\lim_{k\rightarrow\infty}(\mu_{k},\theta_{k},\sigma_{k})=0, and satisfy the same setting as in Lemma 1. Then for any 𝐱𝒳\mathbf{x}\in\mathcal{X}, lim supkφk(𝐱)φ(𝐱).\limsup\limits_{k\rightarrow\infty}\varphi_{k}(\mathbf{x})\leq\varphi(\mathbf{x}).

TABLE II: Convergence of existing methods and BVFSM. For each method, we present the convergence results and required conditions, and indicate whether it applies without the LLC condition, whether it can be extended to BLO with constraints, and whether it handles pessimistic BLO. Note that these convergence results are of two different types, asymptotic and non-asymptotic [46, 47]; BVFSM achieves asymptotic convergence without the LLC condition. BVFSM can also be extended to BLO with constraints and to pessimistic BLO, which the other methods cannot handle.
Category Method Convergence Results Required Conditions w/o LLC Constraints Pessimistic
EGBMs FHG/RHG Asymptotic F(𝐱,𝐲)F(\mathbf{x},\mathbf{y}) and f(𝐱,𝐲)f(\mathbf{x},\mathbf{y}) are C1C^{1}.
inf𝐱𝒳φk(𝐱)inf𝐱𝒳φ(𝐱)\inf\limits_{\mathbf{x}\in\mathcal{X}}\varphi_{k}(\mathbf{x})\rightarrow\inf\limits_{\mathbf{x}\in\mathcal{X}}\varphi(\mathbf{x}) 𝒮(𝐱)\mathcal{S}(\mathbf{x}) is a singleton.
TRHG Non-asymptotic F(𝐱,𝐲)F(\mathbf{x},\mathbf{y}) is C1C^{1} and bounded below.
𝐱k𝐱^\mathbf{x}_{k}{\longrightarrow}\widehat{\mathbf{x}}^{*} f(𝐱,𝐲)f(\mathbf{x},\mathbf{y}) is C1C^{1}, LfL_{f}-smooth and strongly convex.
BDA Asymptotic F(𝐱,𝐲)F(\mathbf{x},\mathbf{y}) is LFL_{F}-smooth, convex, bounded below.
inf𝐱𝒳φk(𝐱)inf𝐱𝒳φ(𝐱)\inf\limits_{\mathbf{x}\in\mathcal{X}}\varphi_{k}(\mathbf{x})\rightarrow\inf\limits_{\mathbf{x}\in\mathcal{X}}\varphi(\mathbf{x}) f(𝐱,𝐲)f(\mathbf{x},\mathbf{y}) is LfL_{f}-smooth. 𝒮(𝐱)\mathcal{S}(\mathbf{x}) is a singleton.
IGBMs CG/Neumann Non-asymptotic F(𝐱,𝐲)F(\mathbf{x},\mathbf{y}) and f(𝐱,𝐲)f(\mathbf{x},\mathbf{y}) are C1C^{1}.
𝐱k𝐱^\mathbf{x}_{k}{\longrightarrow}\widehat{\mathbf{x}}^{*} 2f(𝐱,𝐲)𝐲𝐲\frac{\partial^{2}f(\mathbf{x},\mathbf{y})}{\partial\mathbf{y}\partial\mathbf{y}} is invertible. 𝒮(𝐱)\mathcal{S}(\mathbf{x}) is a singleton.
Ours BVFSM Asymptotic F(𝐱,𝐲)F(\mathbf{x},\mathbf{y}) and f(𝐱,𝐲)f(\mathbf{x},\mathbf{y}) are C1C^{1}
inf𝐱𝒳φk(𝐱)inf𝐱𝒳φ(𝐱)\inf\limits_{\mathbf{x}\in\mathcal{X}}\varphi_{k}(\mathbf{x})\rightarrow\inf\limits_{\mathbf{x}\in\mathcal{X}}\varphi(\mathbf{x}) and level-bounded.
1. C1C^{1} denotes continuously differentiable. LfL_{f} (or LFL_{F})-smooth means the gradient of ff (or FF) is Lipschitz continuous with Lipschitz constant LfL_{f} (or LFL_{F}). “Level-bounded” is short for “level-bounded in 𝐲\mathbf{y} locally uniformly in 𝐱𝒳\mathbf{x}\in\mathcal{X}”.
2. 𝐱^\widehat{\mathbf{x}}^{*} denotes a stationary point.

TABLE III: Complexity of existing gradient-based methods and BVFSM. We show the key update ideas for calculating G(𝐱)G(\mathbf{x}) in φ(𝐱)𝐱\frac{\partial\varphi(\mathbf{x})}{\partial\mathbf{x}}. Please see  [2, 27, 28] for more details of EGBMs and IGBMs. Note that our method avoids solving an unrolled dynamic system or approximating the inverse of Hessian.
Category Method Key point for calculating G(𝐱)G(\mathbf{x}) Time Space
EGBMs FHG G(𝐱)𝐙TF(𝐱,𝐲T)𝐲G(\mathbf{x})\approx\mathbf{Z}_{T}^{\top}\frac{\partial F(\mathbf{x},\mathbf{y}_{T})}{\partial\mathbf{y}} 𝐙t=2f𝐲2𝐙t1+2f𝐲𝐱\mathbf{Z}_{t}=\frac{\partial^{2}f}{\partial\mathbf{y}^{2}}\mathbf{Z}_{t-1}+\frac{\partial^{2}f}{\partial\mathbf{y}\partial\mathbf{x}} O(m2nT)O(m^{2}nT) O(mn)O(mn)
RHG G(𝐱)𝐪1G(\mathbf{x})\approx\mathbf{q}_{-1} 𝐪t1=𝐪t+(2f𝐱𝐲)𝐩t,𝐩t1=(2f𝐲2)𝐩t\mathbf{q}_{t-1}=\mathbf{q}_{t}+\left(\frac{\partial^{2}f}{\partial\mathbf{x}\partial\mathbf{y}}\right)^{\top}\mathbf{p}_{t},\ \mathbf{p}_{t-1}=\left(\frac{\partial^{2}f}{\partial\mathbf{y}^{2}}\right)^{\top}\mathbf{p}_{t} O(n(m+n)T)O(n(m+n)T) O(m+nT)O(m+nT)
TRHG G(𝐱)𝐪I1G(\mathbf{x})\approx\mathbf{q}_{I-1} O(n(m+n)I)O(n(m+n)I) O(m+nI)O(m+nI)
BDA G(𝐱)𝐪1G(\mathbf{x})\approx\mathbf{q}_{-1} Same as RHG, but replace ff with (1α)f+αF(1-\alpha)f+\alpha F O(n(m+n)T)O(n(m+n)T) O(m+nT)O(m+nT)
IGBMs CG G(𝐱)(2f(𝐱,𝐲T)𝐲𝐱)𝐪G(\mathbf{x})\approx-\left(\frac{\partial^{2}f(\mathbf{x},\mathbf{y}_{T})}{\partial\mathbf{y}\partial\mathbf{x}}\right)^{\top}\mathbf{q} 2f𝐲2𝐪=F𝐲\frac{\partial^{2}f}{\partial\mathbf{y}^{2}}\mathbf{q}=\frac{\partial F}{\partial\mathbf{y}} O(m+nT+n2Q)O(m+nT+n^{2}Q) O(m+n)O(m+n)
Neumann 𝐪=i=0Q(𝐈2f𝐲2)iF𝐲\mathbf{q}=\sum_{i=0}^{Q}\left(\mathbf{I}-\frac{\partial^{2}f}{\partial\mathbf{y}^{2}}\right)^{i}\frac{\partial F}{\partial\mathbf{y}} O(m+nT+n2Q)O(m+nT+n^{2}Q) O(m+n)O(m+n)
Ours BVFSM G(𝐱)Pσk(f(𝐱l,𝐲k,lT𝐲)fk,lT𝐳)𝐱G(\mathbf{x})\approx\frac{\partial P_{\sigma_{k}}\!\left(f\left(\mathbf{x}_{l},\mathbf{y}_{k,l}^{T_{\mathbf{y}}}\right)-f_{k,l}^{T_{\mathbf{z}}}\right)}{\partial\mathbf{x}} fk,lT𝐳=f(𝐱l,𝐳k,lT𝐳)+μk2𝐳k,lT𝐳2f_{k,l}^{T_{\mathbf{z}}}=f(\mathbf{x}_{l},\mathbf{z}_{k,l}^{T_{\mathbf{z}}})+\frac{\mu_{k}}{2}\|\mathbf{z}_{k,l}^{{T_{\mathbf{z}}}}\|^{2} O(m+n(T𝐳+T𝐲))O(m+n(T_{\mathbf{z}}+T_{\mathbf{y}})) O(m+n)O(m+n)

Now, by combining the above results, we obtain the desired epiconvergence result, which also implies the convergence of our method. Note that this asymptotic convergence of the algorithm iterates is a different type of guarantee from non-asymptotic convergence.

Theorem 1 (Convergence for Optimistic BLO)

   

Let {(μk,θk,σk)}\{(\mu_{k},\theta_{k},\sigma_{k})\} be a positive sequence such that (μk,θk,σk)0(\mu_{k},\theta_{k},\sigma_{k})\rightarrow 0, also satisfying the same setting as in Lemma 1.

  1. (1)

    The epiconvergence holds:

    φk(𝐱)+δ𝒳(𝐱)eφ(𝐱)+δ𝒳(𝐱).\varphi_{k}(\mathbf{x})+\delta_{\mathcal{X}}(\mathbf{x})\stackrel{{\scriptstyle e}}{{\longrightarrow}}\varphi(\mathbf{x})+\delta_{\mathcal{X}}(\mathbf{x}).
  2. (2)

    We have the following inequality:

    lim supk(inf𝐱𝒳φk(𝐱))inf𝐱𝒳φ(𝐱).\limsup\limits_{k\rightarrow\infty}\left(\inf\limits_{\mathbf{x}\in\mathcal{X}}\varphi_{k}(\mathbf{x})\right)\leq\inf\limits_{\mathbf{x}\in\mathcal{X}}\varphi(\mathbf{x}).

    In addition, if 𝐱argmin𝐱𝒳φ(𝐱)\mathbf{x}_{\ell}\in\mathrm{argmin}_{\mathbf{x}\in\mathcal{X}}\varphi_{\ell}(\mathbf{x}) for some subsequence {}\{\ell\}\subset\mathbb{N}, and 𝐱\mathbf{x}_{\ell} converges to 𝐱~\tilde{\mathbf{x}}, then 𝐱~argmin𝐱𝒳φ(𝐱)\tilde{\mathbf{x}}\in\mathrm{argmin}_{\mathbf{x}\in\mathcal{X}}\varphi(\mathbf{x}) and

    lim(inf𝐱𝒳φ(𝐱))=inf𝐱𝒳φ(𝐱).\lim\limits_{\ell\rightarrow\infty}\left(\inf\limits_{\mathbf{x}\in\mathcal{X}}\varphi_{\ell}(\mathbf{x})\right)=\inf\limits_{\mathbf{x}\in\mathcal{X}}\varphi(\mathbf{x}).
Proof.

To prove the epiconvergence of φk\varphi_{k} to φ\varphi, we just need to verify that the sequence {φk}\{\varphi_{k}\} satisfies the two conditions given in Definition 2. Considering any sequence {𝐱k}\{\mathbf{x}_{k}\} converging to 𝐱\mathbf{x}, if 𝐱𝒳\mathbf{x}\in\mathcal{X}, from Lemma 3 we have

φ(𝐱)+δ𝒳(𝐱)=φ(𝐱)\displaystyle\varphi(\mathbf{x})+\delta_{\mathcal{X}}(\mathbf{x})=\varphi(\mathbf{x}) lim infkφk(𝐱k)\displaystyle\leq\liminf_{k\rightarrow\infty}\varphi_{k}(\mathbf{x}_{k})
lim infkφk(𝐱k)+δ𝒳(𝐱k).\displaystyle\leq\liminf_{k\rightarrow\infty}\varphi_{k}(\mathbf{x}_{k})+\delta_{\mathcal{X}}(\mathbf{x}_{k}).

When 𝐱𝒳\mathbf{x}\notin\mathcal{X}, we have lim infkφk(𝐱k)+δ𝒳(𝐱k)=+\liminf_{k\rightarrow\infty}\varphi_{k}(\mathbf{x}_{k})+\delta_{\mathcal{X}}(\mathbf{x}_{k})=+\infty because 𝒳\mathcal{X} is closed. Thus the first condition Eq. (16) in Definition 2 is satisfied.

Next, for any 𝐱m\mathbf{x}\in\mathbb{R}^{m}, if 𝐱𝒳\mathbf{x}\in\mathcal{X}, then it follows from Lemma 4 that

lim supkφk(𝐱)+δ𝒳(𝐱)\displaystyle\limsup_{k\rightarrow\infty}\varphi_{k}(\mathbf{x})+\delta_{\mathcal{X}}(\mathbf{x}) =lim supkφk(𝐱)\displaystyle=\limsup_{k\rightarrow\infty}\varphi_{k}(\mathbf{x})
φ(𝐱)=φ(𝐱)+δ𝒳(𝐱).\displaystyle\leq\varphi(\mathbf{x})=\varphi(\mathbf{x})+\delta_{\mathcal{X}}(\mathbf{x}).

When 𝐱𝒳\mathbf{x}\notin\mathcal{X}, we have φ(𝐱)+δ𝒳(𝐱)=+\varphi(\mathbf{x})+\delta_{\mathcal{X}}(\mathbf{x})=+\infty. Thus, the second condition Eq. (17) in Definition 2 is satisfied. Therefore, we get the conclusion (1) immediately from Definition 2, and the conclusion (2) follows from [43, Proposition 4.6]. ∎
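To build intuition for how epiconvergence transfers minimizers to the limit problem, as in conclusion (2) above, consider a toy sequence that is not the paper's φ_k but exhibits the same mechanism: φ_k(x) = (x − 1/k)² epiconverges to φ(x) = x², and both the minimizers and the minimal values converge. A minimal numerical sketch:

```python
# Toy illustration of epiconvergence (an illustrative example, not the
# paper's actual phi_k): phi_k(x) = (x - 1/k)^2 epiconverges to phi(x) = x^2.
def phi_k(k, x):
    return (x - 1.0 / k) ** 2

def phi(x):
    return x * x

# Minimize each phi_k over a grid on [-1, 1].
grid = [i / 1000.0 - 1.0 for i in range(2001)]
ks = (1, 10, 100, 1000)
minimizers = [min(grid, key=lambda x, k=k: phi_k(k, x)) for k in ks]
min_values = [phi_k(k, x) for k, x in zip(ks, minimizers)]

# As k grows, argmin phi_k = 1/k -> 0 = argmin phi, and the minimal values
# converge to min phi = 0, mirroring conclusion (2) of Theorem 1.
```

The same transfer of minimizers is exactly what Theorem 1 guarantees for the sequence φ_k constructed by BVFSM.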

Next, we consider the convergence for pessimistic BLO. To begin with, for pessimistic BLO without functional constraints, we denote φkp(𝐱)\varphi^{p}_{k}(\mathbf{x}) similarly to the optimistic case:

φkp(𝐱):=\displaystyle\varphi^{p}_{k}(\mathbf{x}):= φμk,θk,σkp(𝐱)\displaystyle\varphi^{p}_{\mu_{k},\theta_{k},\sigma_{k}}(\mathbf{x})
=\displaystyle= max𝐲{F(𝐱,𝐲)Pk(f(𝐱,𝐲)fk(𝐱))θk2𝐲2},\displaystyle\max\limits_{\mathbf{y}}\Big{\{}F(\mathbf{x},\mathbf{y})-P_{k}\!\left(f(\mathbf{x},\mathbf{y})-f_{k}^{*}(\mathbf{x})\right)-\frac{\theta_{k}}{2}\|\mathbf{y}\|^{2}\Big{\}},

where fk(𝐱)=min𝐲{f(𝐱,𝐲)+μk2𝐲2}.f^{*}_{k}(\mathbf{x})=\min_{\mathbf{y}}\left\{f(\mathbf{x},\mathbf{y})+\frac{\mu_{k}}{2}\|\mathbf{y}\|^{2}\right\}. Then we have the following corollary. Note that this convergence result can also be easily extended to pessimistic BLO with constraints.

Corollary 1 (Convergence for Pessimistic BLO)

   

Let {(μk,θk,σk)}\{(\mu_{k},\theta_{k},\sigma_{k})\} be a positive sequence such that (μk,θk,σk)0(\mu_{k},\theta_{k},\sigma_{k})\rightarrow 0, also satisfying the same setting as in Lemma 1. Then we have the following inequality:

lim supk(inf𝐱𝒳φkp(𝐱))inf𝐱𝒳φp(𝐱).\limsup\limits_{k\rightarrow\infty}\left(\inf\limits_{\mathbf{x}\in\mathcal{X}}\varphi^{p}_{k}(\mathbf{x})\right)\leq\inf\limits_{\mathbf{x}\in\mathcal{X}}\varphi^{p}(\mathbf{x}).

In addition, if 𝐱argmin𝐱𝒳φp(𝐱)\mathbf{x}_{\ell}\in\mathrm{argmin}_{\mathbf{x}\in\mathcal{X}}\varphi^{p}_{\ell}(\mathbf{x}) for some subsequence {}\{\ell\}\subset\mathbb{N}, and 𝐱\mathbf{x}_{\ell} converges to 𝐱~\tilde{\mathbf{x}}, then we have 𝐱~argmin𝐱𝒳φp(𝐱)\tilde{\mathbf{x}}\in\mathrm{argmin}_{\mathbf{x}\in\mathcal{X}}\varphi^{p}(\mathbf{x}) and

lim(inf𝐱𝒳φp(𝐱))=inf𝐱𝒳φp(𝐱).\lim\limits_{\ell\rightarrow\infty}\left(\inf\limits_{\mathbf{x}\in\mathcal{X}}\varphi^{p}_{\ell}(\mathbf{x})\right)=\inf\limits_{\mathbf{x}\in\mathcal{X}}\varphi^{p}(\mathbf{x}).
Proof.

Based on the proof of Theorem 1, we first need to prove φkp(𝐱)+δ𝒳(𝐱)eφp(𝐱)+δ𝒳(𝐱)\varphi^{p}_{k}(\mathbf{x})+\delta_{\mathcal{X}}(\mathbf{x})\stackrel{{\scriptstyle e}}{{\longrightarrow}}\varphi^{p}(\mathbf{x})+\delta_{\mathcal{X}}(\mathbf{x}) by Lemmas 1, 2, 3, and 4. Lemmas 1 and 2 do not depend on whether the case is optimistic or pessimistic, and thus hold unchanged. The corresponding results for Lemmas 3 and 4 can be derived simply by replacing FF in their proofs with F-F. The conclusion then follows by the same argument as in Theorem 1. ∎
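To make the construction of the regularized pessimistic value function φ_k^p concrete, the following minimal sketch evaluates it for toy one-dimensional objectives. The quadratic penalty P_k(t) = max(t, 0)²/(2σ_k), the objectives F and f, and all coefficients below are illustrative assumptions rather than the paper's choices:

```python
# Minimal numerical sketch of phi_k^p(x) for toy objectives (assumptions,
# not the paper's F and f), with a quadratic penalty P(t) = max(t,0)^2/(2*sigma).
def f(x, y):
    return (y - x) ** 2                       # toy LL objective

def F(x, y):
    return (x - 1.0) ** 2 - (y - 2.0) ** 2    # toy UL objective

def phi_p_k(x, mu=0.01, theta=0.01, sigma=0.1, T=500, lr=0.05):
    # Step 1: f_k^*(x) = min_z f(x, z) + mu/2 z^2, by gradient descent.
    z = 0.0
    for _ in range(T):
        z -= lr * (2.0 * (z - x) + mu * z)
    f_star = f(x, z) + 0.5 * mu * z ** 2
    # Step 2: gradient ascent on F - P(f - f_k^*) - theta/2 y^2 over y.
    y = 0.0
    for _ in range(T):
        gap = f(x, y) - f_star
        g = -2.0 * (y - 2.0) - (max(gap, 0.0) / sigma) * 2.0 * (y - x) - theta * y
        y += lr * g
    gap = max(f(x, y) - f_star, 0.0)
    return F(x, y) - gap ** 2 / (2.0 * sigma) - 0.5 * theta * y ** 2
```

The penalty drags y back toward the LL solution set (here y ≈ x), so as (μ_k, θ_k, σ_k) → 0 the value phi_p_k(x) approaches the true pessimistic value, which for this toy pair is (x − 1)² − (x − 2)².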

In Table II, we compare existing methods with our BVFSM. It can be seen that, under mild assumptions, BVFSM achieves asymptotic convergence without the LLC restriction, and applies to BLO with constraints and pessimistic BLO, which other methods cannot handle. In addition, as shown in Theorem 1, our asymptotic convergence follows from the epiconvergence property, a stronger result than asymptotic convergence alone.

4.2 Complexity Analysis

In this part, we compare the time and space complexity of Algorithm 2 with EGBMs (i.e., FHG, RHG, TRHG and BDA) and IGBMs (i.e., CG and Neumann) for computing G(𝐱)G(\mathbf{x}) or φ(𝐱)𝐱\frac{\partial\varphi(\mathbf{x})}{\partial\mathbf{x}}, i.e., the direction for updating the variable 𝐱\mathbf{x}. Table III summarizes the complexity results. Our complexity analysis follows the assumptions in [5]. Note that BVFSM has an order of magnitude lower time complexity with respect to the LL dimension nn than existing methods. For all existing methods, we assume that solving the LL problem (in EGBMs, applying the transition function Φ\Phi to obtain 𝐲T(𝐱)\mathbf{y}_{T}(\mathbf{x})) is carried out by TT steps of gradient descent.

EGBMs. As discussed in [2, 26], after implementing TT steps of gradient descent with time and space complexity of O(n)O(n) to solve the LL problem, FHG, with forward calculation of Hessian-matrix products, can be evaluated in time O(m2nT)O(m^{2}nT) and space O(mn)O(mn), while RHG, with reverse calculation of Hessian- and Jacobian-vector products, can be evaluated in time O(n(m+n)T)O(n(m+n)T) and space O(m+nT)O(m+nT). TRHG truncates the back-propagation trajectory to length II after a TT-step gradient descent, and thus reduces the time and space complexity to O(n(m+n)I)O(n(m+n)I) and O(m+nI)O(m+nI), respectively. BDA uses the same idea as RHG, except that it combines the UL and LL objectives during back-propagation, so its time and space complexity are of the same order as RHG. The time complexity for EGBMs to calculate the UL gradient is proportional to TT, the number of iterations on the LL problem, so EGBMs take a large amount of time to ensure convergence.

IGBMs. After implementing a TT-step gradient descent for the LL problem, IGBMs approximate the inverse of the Hessian matrix by conjugate gradient (CG), which solves a linear system in QQ steps, or by a Neumann series. Each step of the CG and Neumann methods includes Hessian-vector products, requiring O(m+n2Q)O(m+n^{2}Q) time and O(m+n)O(m+n) space, so IGBMs run in time O(m+nT+n2Q)O(m+nT+n^{2}Q) and space O(m+n)O(m+n). IGBMs decouple the complexity of calculating the UL gradient from TT, but the iteration number QQ depends on the properties of the Hessian matrix, and in some cases QQ can be much larger than TT.

BVFSM. In our algorithm, it takes time O(nT𝐳)O(nT_{\mathbf{z}}) and space O(n)O(n) to calculate T𝐳T_{\mathbf{z}} steps of gradient descent on Eq. (10) for the solution of LL problem 𝐳T𝐳\mathbf{z}^{T_{\mathbf{z}}}. Then T𝐲T_{\mathbf{y}} steps of gradient descent on Eq. (11) are used to calculate 𝐲T𝐲\mathbf{y}^{T_{\mathbf{y}}}, which requires time O(nT𝐲)O(nT_{\mathbf{y}}) and space O(n)O(n). After that, the direction can be obtained according to the formula given in Eq. (13) by several computations of the gradient f𝐱\frac{\partial f}{\partial\mathbf{x}} and F𝐱\frac{\partial F}{\partial\mathbf{x}} without any intermediate update, which requires time O(m)O(m) and space O(m+n)O(m+n). Therefore, BVFSM runs in time O(m+n(T𝐳+T𝐲))O(m+n(T_{\mathbf{z}}+T_{\mathbf{y}})) and space O(m+n)O(m+n) for each iteration.
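To make the three stages concrete, the following sketch runs BVFSM-style updates on a toy problem with F(x, y) = (x − 2)² + (y − 2)² and f(x, y) = (y − x)², whose bilevel solution is x* = 2. The quadratic penalty, the warm start of y at z, and all step sizes are illustrative assumptions, not the paper's exact Eqs. (10)–(13):

```python
def bvfsm_step(x, mu=0.01, theta=0.01, sigma=0.1,
               Tz=50, Ty=25, lr=0.05, alpha=0.1):
    """One UL update on the toy problem F = (x-2)^2 + (y-2)^2, f = (y-x)^2
    (illustrative stand-ins for the paper's Eqs. (10)-(13))."""
    # Stage 1: Tz gradient steps on the regularized LL problem.
    z = 0.0
    for _ in range(Tz):
        z -= lr * (2.0 * (z - x) + mu * z)
    f_star = (z - x) ** 2 + 0.5 * mu * z ** 2
    # Stage 2: Ty gradient steps on the penalized single-level problem in y,
    # warm-started at z; quadratic penalty P(t) = max(t,0)^2/(2*sigma).
    y = z
    for _ in range(Ty):
        gap = (y - x) ** 2 - f_star
        dP_dy = (max(gap, 0.0) / sigma) * 2.0 * (y - x)
        y -= lr * (2.0 * (y - 2.0) + dP_dy + theta * y)
    # Stage 3: UL gradient w.r.t. x with y and z held fixed; no unrolling
    # and no Hessian- or Jacobian-vector products are needed.
    gap = (y - x) ** 2 - f_star
    dP_dx = (max(gap, 0.0) / sigma) * (-2.0 * (y - x) + 2.0 * (z - x))
    G = 2.0 * (x - 2.0) + dP_dx
    return x - alpha * G

x = 0.0
for _ in range(300):
    x = bvfsm_step(x)
# x approaches the bilevel solution x* = 2.
```

The per-iteration cost is exactly as tabulated: T_z + T_y first-order sweeps over the LL variable plus a handful of gradient evaluations for the UL update.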

It can be observed from Table III that BVFSM needs less space than EGBMs, and takes much less time than both EGBMs and IGBMs, especially when nn is large, i.e., when the LL problem is high-dimensional, as in tasks with a large-scale network. Overall, this is because BVFSM needs no Hessian- or Jacobian-vector products, whether for solving the unrolled dynamic system by recurrent iteration or for approximating the inverse of the Hessian. Its complexity comes only from calculating the gradients of FF and ff, which is much cheaper than calculating Hessian- and Jacobian-vector products (even by AD). Besides, although BVFSM has the same order of space complexity as IGBMs, its actual memory footprint is smaller, because it does not need to store the computational graph used for calculating the Hessian. We further verify these advantages through numerical results in Section 5.

(a) Convergence with the initial point (𝐱,𝐲)=(8,8,8)(\mathbf{x},\mathbf{y})=(8,8,8)
(b) Convergence with the initial point (𝐱,𝐲)=(0,0,0)(\mathbf{x},\mathbf{y})=(0,0,0)
Figure 1: Convergence behavior of gradient-based BLO algorithms with different initial points. The (1st,3rd) and (2nd,4th) subfigures respectively show the errors of UL variable (i.e., 𝐱𝐱/𝐱\|{\mathbf{x}}-{\mathbf{x}}^{*}\|/\|{\mathbf{x}}^{*}\|) and UL objective (i.e., F(𝐱,𝐲)F(𝐱,𝐲)/F(𝐱,𝐲)\|F({\mathbf{x}},{\mathbf{y}})-F({\mathbf{x}^{*}},{\mathbf{y}^{*}})\|/\|F({\mathbf{x}^{*}},{\mathbf{y}^{*}})\|). The legend is only shown in the first plot.

5 Experimental Results

In this section, we quantitatively demonstrate the performance of our BVFSM (code is available at https://github.com/vis-opt-group/BVFSM), especially when dealing with complicated and high-dimensional problems. We start by investigating the convergence performance, computational efficiency, and effect of hyper-parameters on numerical examples in Section 5.1. In Section 5.2, we apply BVFSM to hyper-parameter optimization for the data hyper-cleaning task under different settings, including the type of auxiliary functions, contamination rates, and various network structures. To further validate the generality of our method, we conduct experiments on other tasks, such as few-shot learning in Section 5.3 and GAN in Section 5.4. The experiments were conducted on a PC with an Intel Core i7-9700K CPU (4.2 GHz), 32GB RAM and an NVIDIA GeForce RTX 2060S GPU with 8GB VRAM, and the algorithm was implemented using PyTorch 1.6. We use the implementations in [48, 36] for the existing methods, and use MB (MegaByte) and S (Second) as the units of space and time complexity, respectively. Regarding the selection of coefficients and hyper-parameters, we evaluate them in the numerical experiments and use the same selection procedure in later tasks. Furthermore, we set 𝐲k,l0=𝐲k,l1T𝐲\mathbf{y}_{k,l}^{0}=\mathbf{y}_{k,l-1}^{T_{\mathbf{y}}}, 𝐳k,l0=𝐳k,l1T𝐳\mathbf{z}_{k,l}^{0}=\mathbf{z}_{k,l-1}^{T_{\mathbf{z}}} to warm-start each step of the sub-problems. As for the optimizer, we use SGD for solving the LL and UL sub-problems in the numerical experiments. In some applications, we change the UL optimizer to Adam to speed up convergence.

5.1 Numerical Evaluations

5.1.1 Optimistic BLO

We start with optimistic BLO, and use a numerical example with a non-convex LL, whose dimension can be adjusted, to validate the effectiveness of BVFSM over existing methods. In particular, consider

min𝐱,𝐲n𝐱a2+𝐲a𝐜2\displaystyle\min_{\mathbf{x}\in\mathbb{R},\mathbf{y}\in\mathbb{R}^{n}}\|\mathbf{x}-a\|^{2}+\|\mathbf{y}-a-\mathbf{c}\|^{2} (18)
 s.t. [𝐲]iargmin[𝐲]isin(𝐱+[𝐲]i[𝐜]i),i,\displaystyle\text{\ s.t.\ }\;[\mathbf{y}]_{i}\in\underset{[\mathbf{y}]_{i}\in\mathbb{R}}{\mathrm{argmin}}\;\sin(\mathbf{x}+[\mathbf{y}]_{i}-[\mathbf{c}]_{i}),\forall\ i,

where [𝐲]i[\mathbf{y}]_{i} denotes the ii-th component of 𝐲\mathbf{y}, while aa\in\mathbb{R} and 𝐜n\mathbf{c}\in\mathbb{R}^{n} are adjustable parameters. Note that here 𝐱\mathbf{x}\in\mathbb{R} is a one-dimensional real number, but we still use the bold letter for this scalar to maintain notational consistency. The solution of this problem is 𝐱=(1n)a+nC1+n, and [𝐲]i=C+[𝐜]i𝐱,i,\mathbf{x}^{*}=\frac{(1-n)a+nC}{1+n},\ \text{ and }\ [\mathbf{y}^{*}]_{i}=C+[\mathbf{c}]_{i}-\mathbf{x}^{*},\forall\ i, where C=argmin𝑘{Ck2a:Ck=π2+2kπ,k},C=\underset{{k}}{\operatorname{argmin}}\left\{\|C_{k}-2a\|:C_{k}=-\frac{\pi}{2}+2k\pi,k\in\mathbb{Z}\right\}, and the optimal value is F=n(C2a)21+nF^{*}=\frac{n(C-2a)^{2}}{1+n} (derivation of the closed-form solution is provided in Appendix B, available at https://arxiv.org/abs/2110.04974). This example satisfies all the assumptions of BVFSM, but does not meet the LLC assumption in [27, 9, 28], which makes it a good example for validating the advantages of BVFSM. In the following experiments we set a=2a=2 and [𝐜]i=2, for any i=1,2,,n[\mathbf{c}]_{i}=2,\text{ for any }i=1,2,\cdots,n.
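The closed form above can be checked directly. The rounding-based choice of k below is our own reformulation of the argmin over the discrete set C_k (ties ignored):

```python
import math

def closed_form(a, c):
    """Closed-form solution of the numerical example in Eq. (18),
    following the formulas in the text."""
    n = len(c)
    # C: the point of the form -pi/2 + 2*k*pi closest to 2a
    k = round((2.0 * a + math.pi / 2.0) / (2.0 * math.pi))
    C = -math.pi / 2.0 + 2.0 * k * math.pi
    x = ((1.0 - n) * a + n * C) / (1.0 + n)
    y = [C + ci - x for ci in c]
    F = n * (C - 2.0 * a) ** 2 / (1.0 + n)
    return x, y, F

x, y, F = closed_form(2.0, [2.0, 2.0])
# For n = 2, a = 2, c_i = 2 this reproduces x* = -2/3 + pi and
# y*_i = 8/3 + pi/2; each y_i minimizes the LL objective since
# sin(x + y_i - c_i) = sin(-pi/2 + 2*k*pi) = -1.
```

This is the optimal solution quoted for the 2-dimensional experiments below.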

We compare BVFSM with several gradient-based optimization methods, including RHG, BDA, CG and Neumann. Note that they all assume the solution of the LL problem is unique except BDA, so for these methods we directly regard the obtained local optimal solutions of LL problems as the unique solutions. We set T=100T=100 for RHG and BDA, T=100T=100, Q=20Q=20 for CG and Neumann, the aggregation parameters equal to 0.50.5 in BDA, and (μk,θk,σk(1))=(1.0,1.0,1.0)/1.01k(\mu_{k},\theta_{k},\sigma_{k}^{(1)})=(1.0,1.0,1.0)/1.01^{k}, σk(2)=f(𝐱k,𝐲k)+1\sigma_{k}^{(2)}=f(\mathbf{x}_{k},\mathbf{y}_{k})+1, step size α=0.01\alpha=0.01, T𝐳=50T_{\mathbf{z}}=50, T𝐲=25T_{\mathbf{y}}=25, and L=1L=1 in BVFSM.

TABLE IV: Errors of UL variable 𝐱𝐱/𝐱\|{\mathbf{x}}-{\mathbf{x}}^{*}\|/\|{\mathbf{x}}^{*}\| with large-scale LL of nn dimension.
nn RHG BDA CG Neumann BVFSM
50 2.296 2.336 2.058 2.260 0.117
100 2.253 2.294 2.073 2.236 0.159
150 2.213 2.253 2.032 2.202 0.190
200 2.187 2.227 1.972 2.178 0.209

Convergence performance. Figure 1 compares the convergence curves of the UL variable 𝐱\mathbf{x} and objective F(𝐱,𝐲)F(\mathbf{x},\mathbf{y}) in the 2-dimensional case (n=2n=2). Here the optimal solution is (𝐱,𝐲)=(2/3+π,8/3+π/2,8/3+π/2)(\mathbf{x}^{*},\mathbf{y}^{*})=(-2/3+\pi,8/3+\pi/2,8/3+\pi/2). To show the impact of initialization, we also test different initial points. From Figure 1(a), when the initial point is (𝐱,𝐲)=(8,8,8)(\mathbf{x},\mathbf{y})=(8,8,8), existing methods show a convergent trend at the beginning of the iteration, but soon stop converging because they fall into a local optimal solution. Furthermore, when the initial point is (𝐱,𝐲)=(0,0,0)(\mathbf{x},\mathbf{y})=(0,0,0) in Figure 1(b), the distance of existing methods to the optimal solution even increases throughout the iterative process, because they incorrectly converge to a local solution away from the global one. On the contrary, our method converges to the optimal solution from both initial points. Table IV further verifies the convergence performance on larger-scale problems with various LL dimensions nn. It shows that our method maintains good convergence with a high-dimensional LL, while existing methods fail because they cannot solve the non-convex LL with a convergence guarantee.

Figure 2: Computation time (seconds, S) in each step for calculating the gradient of 𝐱\mathbf{x} under various scales nn and mm (LL and UL dimensions). Blank areas not drawn indicate that the 3600-second time limit is exceeded.
TABLE V: The largest LL dimension nn that can be achieved by different methods for a single-step computation within 3600 seconds.
RHG BDA CG Neumann BVFSM
nn 13089 12871 15093 18118 283200

Computational efficiency for large-scale problems. Figure 2 compares the computation time for problems under various scales nn and mm. Note that the UL dimension mm is scaled up by replacing the one-dimensional 𝐱\mathbf{x} with the mean of a multi-dimensional 𝐱\mathbf{x}. As we can see, our method costs the least computation time at all scales, and the LL dimension nn has much more influence than the UL dimension mm. Table V shows the largest LL dimension achievable within the 3600-second time limit. This allows us to apply BVFSM to more complex LL problems, which existing methods cannot handle. We attribute these superior results to our novel value-function-based re-characterization. We further explore the performance on problems with complex network structures in Section 5.2.

(a) Errors when μ0,θ0=0.01\mu_{0},\theta_{0}=0.01
(b) Errors when μ0,θ0=0.001\mu_{0},\theta_{0}=0.001
Figure 3: Effect of specified T𝐳T_{\mathbf{z}}, T𝐲T_{\mathbf{y}}, μ\mu and θ\theta on the solution errors. Here T𝐳T_{\mathbf{z}} and T𝐲T_{\mathbf{y}} are the number of iterations for gradient descent on LL sub-problems in Eq. (10) and simple BLO problems in Eq. (11), which affect the accuracy of obtained solution. The curves on the surfaces denote the optimal T𝐳T_{\mathbf{z}} (or T𝐲T_{\mathbf{y}}) to minimize the error with fixed T𝐲T_{\mathbf{y}} (or T𝐳T_{\mathbf{z}}). Subfigure 3(a) shows that the increasing of T𝐲T_{\mathbf{y}} and T𝐳T_{\mathbf{z}} abnormally leads to the deviation of the solution, because with larger regularization coefficients μ\mu and θ\theta, the regularized problems are away from the original problems. Subfigure 3(b) shows that smaller μ\mu and θ\theta may not completely overcome the ill condition of the LL problem, resulting in the instability of the surfaces (in the second plot we reverse the axis to make it easier to observe the trend).

Effect of hyper-parameters. We next evaluate the effect of various hyper-parameters in the 2-dimensional case (n=2n=2). In Figure 3, we compare the errors under different settings of T𝐳T_{\mathbf{z}}, T𝐲T_{\mathbf{y}}, μ\mu and θ\theta. In Figure 3(a), with larger regularization coefficients μ\mu and θ\theta, the regularized problems deviate from the original problems. Figure 3(b) shows that smaller μ\mu and θ\theta may not completely overcome the ill-conditioning of the LL problem, i.e., small coefficients leave the approximate problem somewhat ill-conditioned, resulting in the instability of the surfaces. Hence, it is not easy to determine the regularization coefficients. Since the selection of such parameters is often highly problem-specific, to keep the comparison of computational burden with other methods fair, we set T𝐳+2T𝐲=TT_{\mathbf{z}}+2T_{\mathbf{y}}=T in all experiments (because the T𝐲T_{\mathbf{y}}-step gradient descent requires the gradients of the two functions FF and ff). Figure 4 shows the effect of μ\mu, θ\theta, and σ\sigma on the convergence results. Figure 4(a) reveals the effect of the regularization coefficients μ\mu and θ\theta. We find that when μ=0\mu=0, collapse may occur (with a collapse rate of around 44%44\%), which indicates the necessity of the regularization term for avoiding collapse and improving computational stability. Figure 4(b) shows that smaller μ,θ\mu,\theta and a larger σ\sigma with a suitable decay factor help avoid offsetting the solution and achieve better convergence. Figure 5 further analyzes the effect of LL, the number of inner-loop iterations, on the convergence speed. The smaller LL is, the faster the convergence, so we set L=1L=1 in all experiments.

(a) Effect of μ0\mu_{0} and θ0\theta_{0}
(b) Effect of (μ0,θ0\mu_{0},\theta_{0}) and σ0\sigma_{0}
Figure 4: Effect of μ\mu, θ\theta (in the regularization term) and σ\sigma (in the auxiliary function) on the solution errors. We also plot the optimal and easy-to-collapse parameters. Subfigure 4(a) reveals that the effect of μ\mu is greater than that of θ\theta, and collapses occur when μ=0\mu=0. Subfigure 4(b) shows that smaller μ\mu and θ\theta (but greater than 0) and larger σ\sigma would help to converge.
Figure 5: Effect of LL (inner-loop iterations) on the convergence speed.

5.1.2 BLO with Constraints

To show the performance of BVFSM on problems with constraints, discussed in Section 3.3, we use the following constrained example with a non-convex LL: min𝐱,𝐲n𝐱a2+𝐲a2 s.t. [𝐲]iargmin[𝐲]i{sin(𝐱+[𝐲]i[𝐜]i):𝐱+[𝐲]i[0,1]},i,\small\begin{aligned} &\min_{\mathbf{x}\in\mathbb{R},\mathbf{y}\in\mathbb{R}^{n}}\|\mathbf{x}-a\|^{2}+\left\|\mathbf{y}-a\right\|^{2}\\ &\text{ s.t. }[\mathbf{y}]_{i}\in\mathop{\arg\min}_{[\mathbf{y}]_{i}\in\mathbb{R}}\Big{\{}\sin\left(\mathbf{x}+[\mathbf{y}]_{i}-[\mathbf{c}]_{i}\right):\mathbf{x}+[\mathbf{y}]_{i}\in[0,1]\Big{\}},\forall\ i,\\ \end{aligned} where aa\in\mathbb{R} and 𝐜n\mathbf{c}\in\mathbb{R}^{n} are a fixed constant and a fixed vector satisfying [𝐜]i[0,1] for any i=1,,n[\mathbf{c}]_{i}\in[0,1]\text{ for any }i=1,\cdots,n. The optimal solution is 𝐱=1n1+na,[𝐲]i=𝐱,i,\mathbf{x}^{*}=\frac{1-n}{1+n}a,[\mathbf{y}^{*}]_{i}=-\mathbf{x}^{*},\forall\ i, and the optimal value is F=4n1+na2F^{*}=\frac{4n}{1+n}a^{2}. Derivation of the closed-form solution is provided in Appendix B. We conduct experiments in the 2-dimensional case (n=2n=2) and set a=2a=2 and [𝐜]i=1, for any i=1,2,,n[\mathbf{c}]_{i}=1,\text{ for any }i=1,2,\cdots,n. The constraint is imposed via (𝐱+[𝐲]i0.5)20.250(\mathbf{x}+[\mathbf{y}]_{i}-0.5)^{2}-0.25\leq 0 for each component of 𝐲\mathbf{y}, which is equivalent to 𝐱+[𝐲]i[0,1]\mathbf{x}+[\mathbf{y}]_{i}\in\left[0,1\right].
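The closed form can be sanity-checked with a small script. The reduction y_i(x) = −x below uses the fact that, with [c]_i = 1, sin(t − 1) is increasing on the feasible interval t = x + y_i ∈ [0, 1], so the constrained LL minimum sits at t = 0:

```python
a, n = 2.0, 2

# On the feasible set x + y_i in [0, 1] with c_i = 1, sin(x + y_i - 1) is
# increasing (cos is positive on [-1, 0]), so the constrained LL minimizer
# satisfies x + y_i = 0, i.e. y_i(x) = -x.  The bilevel problem then
# reduces to the one-dimensional problem min_x (x - a)^2 + n * (x + a)^2.
def reduced_F(x):
    return (x - a) ** 2 + n * (x + a) ** 2

grid = [i / 1000.0 - 2.0 for i in range(4001)]  # x in [-2, 2]
x_num = min(grid, key=reduced_F)

# Closed form from the text: x* = (1 - n) a / (1 + n), F* = 4 n a^2 / (1 + n)
x_star = (1.0 - n) * a / (1.0 + n)
F_star = 4.0 * n * a ** 2 / (1.0 + n)
```

For n = 2 and a = 2 this recovers x* = −2/3 and F* = 32/3, matching the values used in the experiment.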

Figure 6: The relationship between the true optimal solution (“Opt.” for short) and the solutions obtained by different methods during the UL iteration for constrained BLO. It can be seen that BVFSM can gradually converge to the true solution inside the feasible region, while other methods cannot deal with the constraint at all. The legend is only shown in the last plot.
(a) Barrier
(b) Penalty
Figure 7: Convergence results for constrained BLO with different initial points and parameters using 7(a) barrier and 7(b) penalty functions. Err. in the legend denotes 𝐱𝐱/𝐱\|{\mathbf{x}}-{\mathbf{x}^{*}}\|/\|\mathbf{x}^{*}\|. We select the truncated log barrier and quadratic penalty as the representatives of barrier and penalty functions respectively. Choosing the barrier function leads to higher stability and less sensitivity to parameters, while using the penalty function is more sensitive to parameters but can obtain smaller errors. Both methods are insensitive to initial values.

Figure 6 displays the solutions after KK iterations. It can be seen that, when dealing with constrained LL problems, only BVFSM can effectively handle the constraint. Hence, our method has a broader scope of application; in Section 5.2 we show an experiment on real learning tasks that involves UL constraints.

(a) Errors
(b) k=100k=100
(c) k=500k=500
Figure 8: Convergence behavior for pessimistic BLO. On the right side, we show the LL solution 𝐲(𝐱)\mathbf{y}(\mathbf{x}) obtained by different methods. Subfigures 8(b) and 8(c) illustrate the results when k=100k=100 and k=500k=500, respectively. Here the surfaces denote F(𝐱k,𝐲)F(\mathbf{x}_{k},\mathbf{y}), and 𝐲(𝐱k)\mathbf{y}^{*}(\mathbf{x}_{k}) means the optimal LL solution among multiple LL solutions 𝐲(𝐱k)𝒮(𝐱k)\mathbf{y}(\mathbf{x}_{k})\in\mathcal{S}(\mathbf{x}_{k}). Both RHG and BDA choose the incorrect LL solutions, while BVFSM chooses the correct solution.
TABLE VI: Comparison among existing methods, BVFSM, and BVFSM with constraints (BVFSM-C) for data hyper-cleaning tasks on three datasets: MNIST, FashionMNIST and CIFAR10. F1 score is the harmonic mean of precision and recall.
Method MNIST FashionMNIST CIFAR10
Accuracy F1 score Time (S)  Accuracy F1 score Time (S)  Accuracy F1 score Time (S)
RHG  87.90±\pm0.27 89.36±\pm0.11 0.4131  81.91±\pm0.18 87.12±\pm0.19 0.4589  34.95±\pm0.47 68.27±\pm0.72 1.3374
TRHG  88.57±\pm0.18 89.77±\pm0.29 0.2623  81.85±\pm0.17 86.76±\pm0.14 0.2840  35.42±\pm0.49 68.06±\pm0.55 0.8409
BDA  87.15±\pm0.82 90.38±\pm0.76 0.6694  79.97±\pm0.71 88.24±\pm0.58 0.8571  36.41±\pm0.23 67.33±\pm0.31 1.4869
CG  89.19±\pm0.35 85.96±\pm0.48 0.1799  83.15±\pm0.24 85.13±\pm0.27 0.2041  34.16±\pm0.75 69.10±\pm0.93 0.4796
Neumann  87.54±\pm0.13 89.58±\pm0.34 0.1723  81.37±\pm0.18 87.28±\pm0.19 0.1958  33.45±\pm0.16 68.87±\pm0.11 0.4694
BVFSM  90.41±\pm0.32 91.19±\pm0.25 0.1480  84.31±\pm0.27 88.35±\pm0.13 0.1612  38.19±\pm0.62 69.55±\pm0.42 0.4092
BVFSM-C  90.94±\pm0.32 91.83±\pm0.30 0.1566  83.23±\pm0.34 89.74±\pm0.24 0.1514  37.33±\pm0.33 69.73±\pm0.51 0.4374

To compare the performance of different auxiliary functions, we try barrier and penalty functions for PH,σH,Ph,σhP_{H,\sigma_{H}},P_{h,\sigma_{h}} and Pf,σfP_{f,\sigma_{f}}; these can in fact be chosen arbitrarily and independently, but here we choose them to be the same for a more direct comparison. Since all of these auxiliary functions guarantee convergence theoretically, we mainly focus on their robustness under different settings. From Figure 7, it can be seen that using a penalty function converges only under certain settings within a small region, while using a barrier function is more robust, so we use barrier functions in the other experiments. In Section 5.2, we further investigate penalty and barrier functions on complex networks.
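The two families compared here can be sketched as follows. This is a minimal illustration; the truncation point and the linear extension are our own assumed construction, not necessarily the exact truncated log barrier used in the experiments:

```python
import math

def quadratic_penalty(t, sigma):
    """Quadratic penalty for a constraint t <= 0: zero on the feasible side,
    growing quadratically with the violation."""
    return max(t, 0.0) ** 2 / (2.0 * sigma)

def truncated_log_barrier(t, sigma, eps=1e-6):
    """Log barrier -sigma*log(-t) for t < -eps, linearly extended near the
    boundary so that it stays finite for (slightly) infeasible points."""
    if t < -eps:
        return -sigma * math.log(-t)
    # continue with matching value and slope at t = -eps
    return -sigma * math.log(eps) + (sigma / eps) * (t + eps)

# A penalty vanishes strictly inside the feasible region, while a barrier
# already grows as the boundary is approached from inside; this is one
# source of the differing parameter sensitivity observed in Figure 7.
```

Both are smooth enough for the gradient steps of BVFSM, and either can replace the other in the construction without affecting the convergence theory.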

5.1.3 Pessimistic BLO

To study the performance on pessimistic BLO, we use an example similar to the optimistic one, obtained by changing Eq. (18) from min𝐱,𝐲n\min_{\mathbf{x}\in\mathbb{R},\mathbf{y}\in\mathbb{R}^{n}} to min𝐱max𝐲n\min_{\mathbf{x}\in\mathbb{R}}\max_{\mathbf{y}\in\mathbb{R}^{n}} and from 𝐱a2+𝐲a𝐜2\|\mathbf{x}-a\|^{2}+\|\mathbf{y}-a-\mathbf{c}\|^{2} to 𝐱a2𝐲a𝐜2\|\mathbf{x}-a\|^{2}-\|\mathbf{y}-a-\mathbf{c}\|^{2}. Here we consider the 2-dimensional case (LL dimension n=2n=2), and set a=2a=2 and [𝐜]i=2 for i=1,2[\mathbf{c}]_{i}=2\text{ for }i=1,2. In this case, the optimal solution is (𝐱,𝐲)=(2+π/2,4±π,4±π),(\mathbf{x}^{*},\mathbf{y}^{*})=(-2+\pi/2,4\pm\pi,4\pm\pi), and the optimal value is F=7/4π24π+16.F^{*}=-7/4\pi^{2}-4\pi+16. Derivation of the exact solution is provided in Appendix B. We select RHG and BDA as representatives of gradient-based methods with and without the unique-LL-solution assumption, respectively. We make no adaptive modifications to these methods, which were not designed for the pessimistic BLO setting.
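The reported solution can be checked numerically. Since F decreases in ‖y − a − c‖², the inner max over y ∈ S(x) picks the LL minimizer closest to a + [c]_i = 4, and a small grid search over x then recovers x*. A verification sketch for n = 2:

```python
import math

a, n = 2.0, 2

def dist_to_4(x):
    """Distance from 4 to the nearest LL minimizer y_i = -pi/2 + 2k*pi - x + 2."""
    u = x + 2.0 + math.pi / 2.0            # y_i - 4 = 2k*pi - u
    k = round(u / (2.0 * math.pi))
    return abs(2.0 * k * math.pi - u)

def phi_p(x):
    """Pessimistic UL value max_{y in S(x)} F(x, y) = (x - 2)^2 - n*d(x)^2."""
    return (x - a) ** 2 - n * dist_to_4(x) ** 2

grid = [i / 1000.0 - 4.0 for i in range(8001)]  # x in [-4, 4]
x_num = min(grid, key=phi_p)

# Values reported in the text
x_star = -2.0 + math.pi / 2.0
F_star = -7.0 / 4.0 * math.pi ** 2 - 4.0 * math.pi + 16.0
```

The grid minimizer agrees with x* = −2 + π/2, and φ_p(x*) reproduces F* = −7π²/4 − 4π + 16.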

Figure 8 shows the convergence curves of the UL objective and how various methods choose $\mathbf{y}\in\mathcal{S}(\mathbf{x})$ when $\mathcal{S}(\mathbf{x})$ is not a singleton. As shown in Figure 8(a), our method converges significantly better in the pessimistic case, while RHG and BDA fail to converge at all. Their distances to the optimal solution even increase, because they fail to select the optimal LL solution from the multiple solutions in $\mathcal{S}(\mathbf{x})$, as intuitively demonstrated in Figures 8(b) and 8(c).

Figure 9: Performance for data hyper-cleaning based on the FashionMNIST experiment in Table VI, showing (a) the UL objective $F(\mathbf{x},\mathbf{y})$ and (b) the F1 score. The legend is only shown in the first plot.
Figure 10: Performance ((a) accuracy, (b) F1 score) for data hyper-cleaning with barrier and penalty functions on FashionMNIST. We choose the truncated log barrier and the quadratic penalty as representatives of barrier and penalty functions, respectively. Using a barrier function leads to higher accuracy and greater stability of the F1 score. The legend is only shown in the second plot.
TABLE VII: The effect of contamination rates for data hyper-cleaning. Accuracy and F1 scores of existing methods drop sharply as the contamination rate increases, while BVFSM degrades only slightly, verifying its robustness to heavily corrupted data.
Contamination rate 0.6 0.7 0.8 0.9
Method Accuracy F1 score Accuracy F1 score Accuracy F1 score Accuracy F1 score
RHG 77.39±0.61 68.18±0.94 75.62±0.94 56.72±0.72 68.91±0.71 46.81±0.78 59.83±0.91 29.39±0.38
TRHG 77.37±0.52 76.76±0.13 75.60±0.84 65.30±0.10 68.89±0.30 55.39±0.97 59.81±0.38 37.97±0.48
BDA 75.44±0.44 78.24±0.34 73.67±0.59 66.78±0.82 66.96±0.69 56.87±0.79 57.88±0.87 39.45±0.08
CG 78.64±0.52 75.17±0.79 76.87±0.06 63.71±0.41 70.16±0.80 53.80±0.09 61.08±0.63 36.38±0.03
Neumann 76.85±0.95 77.29±0.29 75.08±0.15 65.83±0.22 68.37±0.40 55.92±0.46 59.29±0.26 38.50±0.63
BVFSM 81.49±0.22 85.51±0.70 81.34±0.42 82.55±0.33 80.06±0.97 73.51±0.83 79.73±0.20 55.97±0.73
Figure 11: Computation time (s) and memory (VRAM, MB) with various network structures for data hyper-cleaning, showing (a) time and (b) VRAM. Blank areas indicate that the 8 GB VRAM limit is exceeded. We use fully connected networks of 2, 5, and 10 layers with widths of 50, 100, 150, and 200.

5.2 Hyper-parameter Optimization

In this subsection, we use a specific task of hyper-parameter optimization, called data hyper-cleaning, to evaluate the performance of BVFSM when the LL problem is non-convex. Assuming that some of the labels in our dataset are contaminated, the goal of data hyper-cleaning is to reduce the impact of incorrect samples by adding hyper-parameters to them. In this experiment, we set $\mathbf{y}\in\mathbb{R}^{10\times 301}\times\mathbb{R}^{300\times d}$ as the parameter of a non-convex 2-layer linear network classifier, where $d$ is the dimension of data, and $\mathbf{x}\in\mathbb{R}^{|\mathcal{D}_{\mathtt{tr}}|}$ as the weight of each sample in the training set. Therefore, the LL problem is to learn a classifier $\mathbf{y}$ by the cross-entropy loss $g$ weighted with given $\mathbf{x}$:

f(\mathbf{x},\mathbf{y})=\sum_{(\mathbf{u}_{i},\mathbf{v}_{i})\in\mathcal{D}_{\mathtt{tr}}}[\mathtt{sigmoid}(\mathbf{x})]_{i}\,g(\mathbf{y},\mathbf{u}_{i},\mathbf{v}_{i}),

where $(\mathbf{u}_{i},\mathbf{v}_{i})$ are the training samples, and $\mathtt{sigmoid}(\mathbf{x})$ is the sigmoid function used to constrain the weights $\mathbf{x}$ into the range $[0,1]$. The UL problem is to find a weight $\mathbf{x}$ that reduces the cross-entropy loss $g$ of $\mathbf{y}$ on a cleanly labeled validation set:

F(\mathbf{x},\mathbf{y})=\sum_{(\mathbf{u}_{i},\mathbf{v}_{i})\in\mathcal{D}_{\mathtt{val}}}g(\mathbf{y},\mathbf{u}_{i},\mathbf{v}_{i}).

In addition, we also consider adding explicit constraints directly on $\mathbf{x}$ (as discussed in Section 3.3) instead of using the sigmoid function as an indirect constraint. The constraint $([\mathbf{x}]_{i}-0.5)^{2}-0.25\leq 0$ is imposed on each component of $\mathbf{x}$, so that $[\mathbf{x}]_{i}\in[0,1]$.
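The two objectives and the explicit constraint above can be sketched in a few lines. This is a toy, dependency-free illustration in which `g` is a scalar logistic loss standing in for the cross-entropy of the 2-layer network; the helper names and toy data are our assumptions, not the paper's implementation.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def g(y, u, v):
    # Binary cross-entropy of a toy 1-D "classifier" y on sample (u, v);
    # stands in for the cross-entropy loss of the 2-layer network.
    p = sigmoid(y * u)
    return -(v * math.log(p) + (1 - v) * math.log(1 - p))

def f_LL(x, y, D_tr):
    # LL objective: training loss, reweighted per sample by sigmoid(x_i).
    return sum(sigmoid(xi) * g(y, u, v) for xi, (u, v) in zip(x, D_tr))

def F_UL(y, D_val):
    # UL objective: unweighted loss on the clean validation set.
    return sum(g(y, u, v) for u, v in D_val)

def constraint(xi):
    # Explicit alternative to the sigmoid: ([x]_i - 0.5)^2 - 0.25 <= 0
    # keeps [x]_i inside [0, 1].
    return (xi - 0.5) ** 2 - 0.25

D_tr = [(1.0, 1), (-1.0, 0), (1.0, 0)]   # last label is "contaminated"
D_val = [(1.0, 1), (-1.0, 0)]
x = [2.0, 2.0, -2.0]                     # hyper-weights: down-weight the bad sample
print(f_LL(x, 1.0, D_tr), F_UL(1.0, D_val))
```

Down-weighting the contaminated sample (the negative entry of `x`) lowers the weighted training loss, which is exactly the mechanism the hyper-parameters exploit.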

Overall performance. Table VI shows the accuracy, F1 score and computation time on three different datasets. For each dataset, we randomly select 5000 samples as the training set $\mathcal{D}_{\mathtt{tr}}$, 5000 samples as the validation set $\mathcal{D}_{\mathtt{val}}$, and 10000 samples as the test set $\mathcal{D}_{\mathtt{test}}$. After that, we contaminate half of the labels in $\mathcal{D}_{\mathtt{tr}}$. From the results, BVFSM achieves the most competitive performance on all datasets. Furthermore, BVFSM is faster than EGBMs and IGBMs, and this advantage is more evident on CIFAR10 with its larger LL dimension, consistent with the complexity analysis in Section 4.2. The UL objective value and F1 score over iterations on FashionMNIST are also plotted in Figure 9.

As for the performance of BLO with constraints, we can see from Table VI that BVFSM with constraints (denoted BVFSM-C) has slightly lower accuracy but a higher F1 score than BVFSM using the sigmoid function without explicit constraints. This is because, without explicit constraints, composing the sigmoid function into the LL objective shrinks the gradient with respect to $\mathbf{x}$, so the UL variable $\mathbf{x}$ changes slowly and converges more slowly. Accuracy mainly reflects the convergence of the LL variable $\mathbf{y}$, while the F1 score mainly reflects the convergence of the UL variable $\mathbf{x}$. Therefore, BVFSM with constraints performs slightly worse in accuracy but better in F1 score than BVFSM with the sigmoid function and no explicit constraints.

Evaluations on the auxiliary functions, robustness, and network structures. Figure 10 compares the performance of different auxiliary functions. Consistent with the numerical experiment in Figure 7, the barrier function performs better, with higher stability and without the need for extensive parameter tuning. Table VII compares robustness under various data contamination rates. Figure 11 further shows the impact of network depth and width. For the LL variable $\mathbf{y}$ we use fully connected networks of various layers and widths. It is worth noting that the computational burden is overall not very sensitive to the network width, but very sensitive to the network depth. As networks deepen, other methods collapse to varying degrees from excessive memory use, while BVFSM always keeps the computation stable. Since there is no need to retain the LL iteration trajectory, our storage burden is much lower than that of EGBMs (RHG and BDA). Because BVFSM does not need to calculate Jacobian- and Hessian-vector products (realized by saving an additional computation graph in AD), our burden is also significantly lower than that of IGBMs (CG and Neumann).

TABLE VIII: Computation time (s) per epoch for data hyper-cleaning in VGG-style networks with different numbers of convolution layers (Conv.), batch sizes (B) and iteration numbers ($K$). N/A means exceeding the memory limit. Note that a smaller batch size may take more time because the batch-switching overhead increases. BVFSM maintains the lowest burden and highest speed, especially with large-scale LL problems in real-world networks.
Conv. (B,K) RHG CG Neumann BVFSM
2 (1,7) 7515 4730 3225 2252
2 (128,20) N/A N/A 415.4 60.81
13 (128,100) N/A N/A 472.9 171.9
13 (512,100) N/A N/A N/A 121.8
Figure 12: Effects of the number of Multiply-Accumulate operations (MACs) and of parameters on the computational efficiency for data hyper-cleaning in large-scale networks. The two BVFSM curves denote BVFSM on the 5-layer network (the same as other methods) and on the more challenging 50-layer network, respectively. For a clearer comparison, we use logarithmic coordinates.

Computational efficiency for large-scale networks. Next, we evaluate our computational burden on large-scale networks closer to real applications, such as VGG16 on the CIFAR10 dataset. Because VGG16 is too computationally demanding for existing methods, we adjust the experimental settings as follows to enable comparison. For each dataset, we randomly select 4096 samples as the training set $\mathcal{D}_{\mathtt{tr}}$, 4096 samples as the validation set $\mathcal{D}_{\mathtt{val}}$, and 512 samples as the test set $\mathcal{D}_{\mathtt{test}}$. Because the original network is too computationally intensive for EGBMs, we perform an additional experiment with a sufficiently small batch size and iteration number $K$. We also reduce the 13 convolution layers of VGG16 to only the first two, and retain the last 3 linear layers. As shown in Table VIII, BVFSM is always the fastest under various settings, and still works well with a large $K$ and batch size.

Additionally, we visualize how BVFSM can be applied to large-scale networks by expanding the width of the 5-layer network, and compare computational efficiency as the Multiply-Accumulate operations (MACs) and parameters increase. For a network of the same size, fully-connected layers typically have more parameters, while convolutional layers have more MACs, so we use fully-connected and convolutional layers respectively to simulate scaling up parameters and MACs. From Figure 12, other methods are computationally inefficient and can only handle small-scale networks, while BVFSM, with much higher efficiency, is applicable to larger-scale networks in frontier tasks. Moreover, considering the effect of the number of layers on efficiency shown in Figure 11, we also run BVFSM on a more challenging 50-layer network to further demonstrate its efficiency. Specifically, existing methods usually cannot work on MobileNet-scale models with around 1 GMACs, while BVFSM remains usable at StyleGAN scale with around 100 GMACs.

TABLE IX: Averaged accuracy using various methods (including model-based methods and gradient-based BLO methods) for the few-shot classification task (1- and 5-shot, i.e., $M=1,5$, and $N=5,20,30,40$) on Omniglot.
Method 5-way 20-way 30-way 40-way
  1-shot 5-shot  1-shot 5-shot  1-shot 5-shot  1-shot 5-shot
MAML  98.70±0.40 99.91±0.10  95.80±0.30 98.90±0.20  86.86±0.49 96.86±0.19  85.98±0.45 94.46±0.43
Meta-SGD  97.97±0.70 98.96±0.20  93.98±0.43 98.42±0.11  89.91±0.04 96.21±0.15  87.39±0.43 95.10±0.15
Reptile  97.68±0.04 99.48±0.06  89.43±0.14 97.12±0.32  85.40±0.30 95.28±0.30  82.50±0.30 92.79±0.33
iMAML  99.16±0.35 99.67±0.12  94.46±0.42 98.69±0.10  89.52±0.20 96.51±0.08  87.28±0.21 95.27±0.08
RHG  98.64±0.21 99.58±0.12  96.13±0.20 99.09±0.08  93.92±0.18 98.43±0.08  90.78±0.20 96.79±0.10
TRHG  98.74±0.21 99.71±0.07  95.82±0.20 98.95±0.07  94.02±0.18 98.39±0.07  90.73±0.20 96.79±0.10
BDA  99.04±0.18 99.74±0.05  96.50±0.16 99.19±0.07  94.37±0.18 98.53±0.07  92.49±0.18 97.12±0.09
BVFSM  98.85±0.12 99.21±0.18  96.73±0.30 98.95±0.20  94.65±0.20 98.56±0.17  92.73±0.12 97.61±0.47
Figure 13: Comparison of GAN training on a toy 2D mixture of Gaussians dataset (panels additionally show the initialization and the target distribution). We show the heat map of the generated distribution as the number of training steps increases. Subfigure 13(a) shows that vanilla GAN captures only one mode; in subfigure 13(b), WGAN attempts to cover all modes at once with a single Gaussian distribution, but fails to achieve satisfactory performance; unrolled GAN in subfigure 13(c) approximates all modes simultaneously with the help of a leader-follower structure, but lacks finer details. In contrast, our BVFSM fits all Gaussian modes well, with plenty of detail, as shown in subfigure 13(d).

5.3 Few-shot Learning

We then conduct experiments on the few-shot learning task, one of the most popular applications of meta-learning, whose goal is to learn an algorithm that also handles new tasks well. Specifically, each task is an $N$-way classification, and the aim is to learn the hyper-parameter $\mathbf{x}$ so that each task can be solved with only $M$ training samples (i.e., $N$-way $M$-shot). Similar to the works in [8, 29, 30], we model the network with two parts: a four-layer convolutional network $\mathbf{x}$ as a common feature-extraction layer shared among tasks, and a logistic regression layer $\mathbf{y}=\{\mathbf{y}^{i}\}$ as a separate classifier for each task. We also set the dataset as $\mathcal{D}=\{\mathcal{D}^{i}\}$, where $\mathcal{D}^{i}=\mathcal{D}^{i}_{\mathtt{tr}}\cup\mathcal{D}^{i}_{\mathtt{val}}$ for the $i$-th task. Setting the loss function of the $i$-th task to be the cross-entropy $g(\mathbf{x},\mathbf{y}^{i};\mathcal{D}^{i}_{\mathtt{tr}})$ for the LL problem, the LL objective can be defined as

f(\mathbf{x},\mathbf{y})=\sum_{i}g(\mathbf{x},\mathbf{y}^{i};\mathcal{D}^{i}_{\mathtt{tr}}).

As for the UL objective, we also utilize the cross-entropy function, but define it based on $\{\mathcal{D}^{i}_{\mathtt{val}}\}$ as

F(\mathbf{x},\mathbf{y})=\sum_{i}g(\mathbf{x},\mathbf{y}^{i};\mathcal{D}^{i}_{\mathtt{val}}).

Our experiment is performed on the widely used benchmark dataset: Omniglot [49], which contains examples of 1623 handwritten characters from 50 alphabets.
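The bilevel structure of this formulation, a shared UL variable $\mathbf{x}$ and task-specific LL variables $\mathbf{y}^{i}$, can be sketched with scalar stand-ins. The toy loss and data below are our illustrative assumptions, not the paper's convolutional model.

```python
import math

def g(x, y_i, D_i):
    # Toy cross-entropy stand-in for task i: shared parameter x
    # (the feature extractor) and task-specific head y_i.
    loss = 0.0
    for u, v in D_i:
        p = 1.0 / (1.0 + math.exp(-(x * u) * y_i))
        loss += -(v * math.log(p) + (1 - v) * math.log(1 - p))
    return loss

def f_LL(x, y, D_tr):
    # LL objective: sum of per-task training losses over heads y = {y_i}.
    return sum(g(x, y[i], D_tr[i]) for i in range(len(y)))

def F_UL(x, y, D_val):
    # UL objective: same form, evaluated on the validation splits.
    return sum(g(x, y[i], D_val[i]) for i in range(len(y)))

D_tr = [[(1.0, 1), (-1.0, 0)], [(0.5, 1)]]    # two toy tasks
D_val = [[(2.0, 1)], [(-0.5, 0)]]
y = [1.0, 0.5]                                # one head per task
print(f_LL(1.0, y, D_tr), F_UL(1.0, y, D_val))
```

The LL problem fits each head on its training split given $\mathbf{x}$, while the UL problem scores the shared $\mathbf{x}$ by the resulting validation losses.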

We compare our BVFSM with several approaches, such as MAML, Meta-SGD, Reptile, iMAML, RHG, TRHG and BDA [29, 30]. From Table IX, BVFSM performs slightly worse than existing methods on the 5-way task, because on small-scale LLC problems the benefit of the regularization term in accelerating convergence cannot fully compensate for the bias it introduces into the solution. However, for larger-scale LL problems (such as 20-way, 30-way and 40-way), the regularization term gives BVFSM significant advantages over the other methods.

5.4 Generative Adversarial Networks

Next we perform intuitive experiments on GANs to illustrate the application of BVFSM to pessimistic BLO. A GAN is an unsupervised learning framework that builds a min-max game between two players, i.e., the generator $\mathtt{Gen}(\mathbf{x};\cdot)$ with network parameter $\mathbf{x}$, and the discriminator $\mathtt{Dis}(\mathbf{y};\cdot)$ with network parameter $\mathbf{y}$. We denote the standard Gaussian distribution by $\mathcal{N}(0,1)$ and the real data distribution by $p_{\mathtt{data}}$. The generator $\mathtt{Gen}$ tries to fool the discriminator $\mathtt{Dis}$ by producing data from a random latent vector $\mathbf{v}\sim\mathcal{N}(0,1)$, while the discriminator $\mathtt{Dis}$ distinguishes between real data $\mathbf{u}\sim p_{\mathtt{data}}$ and generated data $\mathtt{Gen}(\mathbf{x};\mathbf{v})$ by outputting the probability that a sample is real. The goal of GAN is $\min_{\mathbf{x}}\max_{\mathbf{y}}\log(\mathtt{Dis}(\mathbf{y};\mathbf{u}))+\log(1-\mathtt{Dis}(\mathbf{y};\mathtt{Gen}(\mathbf{x};\mathbf{v})))$ [50].

However, this traditional formulation treats $\mathtt{Dis}$ and $\mathtt{Gen}$ as having equal status, and does not capture the leader-follower relationship in which $\mathtt{Gen}$ first generates data and $\mathtt{Dis}$ then judges it; this relationship can be modeled as a Stackelberg game and captured through BLO. Specifically, from this perspective, generative adversarial learning corresponds to a pessimistic BLO problem: the UL objective $F$ of $\mathtt{Gen}$ tries to generate adversarial samples, and the LL objective $f$ of $\mathtt{Dis}$ aims to learn a robust classifier that maximizes the UL objective. Therefore, we reformulate GAN into the form of Eq. (15) discussed in Section 3.4 to model this relationship, and call it bi-level GAN. Concretely, for the follower $\mathtt{Dis}(\mathbf{y};\cdot)$, the LL objective is the same as in the original GAN:

f(\mathbf{x},\mathbf{y})=\log(\mathtt{Dis}(\mathbf{y};\mathbf{u}))+\log(1-\mathtt{Dis}(\mathbf{y};\mathtt{Gen}(\mathbf{x};\mathbf{v}))).

As for the UL, considering the antagonistic goals of $\mathtt{Gen}$ and $\mathtt{Dis}$, we model the UL problem as

F(\mathbf{x},\mathbf{y})=\log(\mathtt{Dis}(\mathbf{y};\mathtt{Gen}(\mathbf{x};\mathbf{v}))).
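The two objectives of the bi-level GAN can be written down directly. Below is a minimal sketch with hypothetical 1-D stand-ins for `Gen` and `Dis` (the actual networks used in the experiment are described later in this section):

```python
import math

def f_LL(Dis, Gen, y, x, u, v):
    # LL objective (discriminator): identical to the vanilla GAN loss.
    return math.log(Dis(y, u)) + math.log(1.0 - Dis(y, Gen(x, v)))

def F_UL(Dis, Gen, y, x, v):
    # UL objective (generator): realism score of generated data; the
    # pessimistic BLO then takes the worst y among the LL solutions.
    return math.log(Dis(y, Gen(x, v)))

# Hypothetical 1-D stand-ins for the two networks.
Gen = lambda x, v: x * v                            # "generator"
Dis = lambda y, t: 1.0 / (1.0 + math.exp(-y * t))   # "discriminator" probability

print(f_LL(Dis, Gen, y=0.5, x=1.0, u=1.0, v=-1.0),
      F_UL(Dis, Gen, y=0.5, x=1.0, v=-1.0))
```

Both values are log-probabilities, hence negative; the bi-level structure lies in evaluating $F$ only at discriminators $\mathbf{y}$ that solve the LL problem for the given $\mathbf{x}$.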

Note that the popular WGAN [51] is a variant of the classic vanilla GAN [50] (or simply GAN), while unrolled GAN [11] and the GAN trained by our BVFSM are bi-level GANs, modeled from a BLO perspective. Our method has two advantages over other types of GAN. On the one hand, compared with vanilla GAN and WGAN, bi-level GAN effectively models the leader-follower relationship between the generator and discriminator, rather than treating them as having the same status. On the other hand, among bi-level GANs, our method handles objectives with multiple solutions from the viewpoint of pessimistic BLO, with a theoretical convergence guarantee that unrolled GAN cannot provide.

In this experiment we train a simple GAN architecture on a 2D mixture of 8 Gaussians arranged on a circle. The dataset is sampled from a mixture of 8 Gaussians with standard deviation 0.02, whose means are equally spaced around a circle of radius 2. The generator consists of a fully-connected network with 2 hidden layers of size 128 with ReLU activation, followed by a linear projection to 2 dimensions. The discriminator first scales its input down by a factor of 4 (to roughly map it into $(-1,1)$), followed by a 1-layer fully-connected network with ReLU activation and a linear layer of size 1 acting as the logit. As shown in Figure 13, we present a visual comparison of sample generation among GAN, WGAN, unrolled GAN, and our method. Vanilla GAN captures only one of the Gaussian modes at a time, because it ignores the leader-follower structure. WGAN benefits from its improved distance function and uses one distribution to approximate all Gaussian modes simultaneously, but fails to deliver satisfactory performance. Unrolled GAN captures all modes simultaneously thanks to the leader-follower modeling by BLO, but misses finer details of the distribution. In contrast, BVFSM's effective treatment of non-convex problems enables it to fit all modes well, with detail. In addition, we report the KL divergence between the generated and target distributions in Table X. The traditional alternately optimized GAN and WGAN yield larger KL divergence, while unrolled GAN and our method, which treat GAN as a BLO model, produce smaller KL divergence, with our method achieving the best result.

Figure 14 further validates that BVFSM scales to large GANs on real datasets. Specifically, we add BVFSM as a training strategy on top of StyleGAN2 [52] on the AFHQ dataset. It can be seen that our approach is effective in improving the generation quality and the performance metrics, Inception Score (IS) and Fréchet Inception Distance (FID).

TABLE X: KL divergence for various GANs. The divergence between the generated and target distributions is smallest for BVFSM.
GAN WGAN Unrolled GAN BVFSM
KL Divergence 2.56 2.48 0.26 0.15
Figure 14: Visualization results and performance metrics for (a) StyleGAN2 ((IS, FID) = (11.69, 5.93)) and (b) StyleGAN2 with BVFSM ((IS, FID) = (11.94, 6.08)) on the AFHQ dataset.

6 Conclusions

In this paper, we propose a novel bi-level algorithm, BVFSM, that provides a practical path to high-dimensional, large-scale BLO problems arising from complex real-world tasks. With the help of the value function, which departs from the traditional mindset of gradient-based methods, BVFSM removes the LLC condition required by earlier works and improves the efficiency of gradient-based methods, overcoming the bottleneck caused by high-dimensional non-convex LL problems. By transferring the regularized LL problem into the UL objective via value-function-based sequential minimization, we obtain a sequence of single-level, unconstrained, differentiable problems that approximate the original one. We prove asymptotic convergence without LLC, and demonstrate our numerical superiority through complexity analysis and evaluations on a variety of applications. We also extend our method to BLO problems with constraints and to pessimistic BLO problems.

Acknowledgments

This work is partially supported by the National Natural Science Foundation of China (Nos. U22B2052, 61922019, 12222106), the National Key R&D Program of China (2020YFB1313503, 2022YFA1004101), Shenzhen Science and Technology Program (No. RCYX20200714114700072), the Guangdong Basic and Applied Basic Research Foundation (No. 2022B1515020082), and Pacific Institute for the Mathematical Sciences (PIMS).

References

  • [1] R. Liu, X. Liu, X. Yuan, S. Zeng, and J. Zhang, “A value-function-based interior-point method for non-convex bi-level optimization,” in ICML, 2021.
  • [2] L. Franceschi, M. Donini, P. Frasconi, and M. Pontil, “Forward and reverse gradient-based hyperparameter optimization,” in ICML, 2017.
  • [3] T. Okuno, A. Takeda, A. Kawana, and M. Watanabe, “On lp-hyperparameter learning via bilevel nonsmooth optimization,” JMLR, vol. 22, no. 245, pp. 1–47, 2021.
  • [4] M. Mackay, P. Vicol, J. Lorraine, D. Duvenaud, and R. Grosse, “Self-tuning networks: Bilevel optimization of hyperparameters using structured best-response functions,” in ICLR, 2018.
  • [5] H. Liu, K. Simonyan, and Y. Yang, “DARTS: differentiable architecture search,” in ICLR, 2019.
  • [6] H. Liang, S. Zhang, J. Sun, X. He, W. Huang, K. Zhuang, and Z. Li, “Darts+: Improved differentiable architecture search with early stopping,” arXiv preprint arXiv:1909.06035, 2019.
  • [7] X. Chen, L. Xie, J. Wu, and Q. Tian, “Progressive differentiable architecture search: Bridging the depth gap between search and evaluation,” in ICCV, 2019.
  • [8] L. Franceschi, P. Frasconi, S. Salzo, R. Grazzi, and M. Pontil, “Bilevel programming for hyperparameter optimization and meta-learning,” in ICML, 2018.
  • [9] A. Rajeswaran, C. Finn, S. M. Kakade, and S. Levine, “Meta-learning with implicit gradients,” in NeurIPS, 2019.
  • [10] D. Zügner and S. Günnemann, “Adversarial attacks on graph neural networks via meta learning,” in ICLR, 2019.
  • [11] L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein, “Unrolled generative adversarial networks,” arXiv preprint arXiv:1611.02163, 2016.
  • [12] D. Pfau and O. Vinyals, “Connecting generative adversarial networks and actor-critic methods,” arXiv preprint arXiv:1610.01945, 2016.
  • [13] Z. Yang, Y. Chen, M. Hong, and Z. Wang, “Provably global convergence of actor-critic: A case for linear quadratic regulator with ergodic cost,” in NeurIPS, 2019.
  • [14] R. Liu, S. Cheng, Y. He, X. Fan, Z. Lin, and Z. Luo, “On the convergence of learning-based iterative methods for nonconvex inverse problems,” IEEE TPAMI, 2019.
  • [15] R. Liu, Z. Li, Y. Zhang, X. Fan, and Z. Luo, “Bi-level probabilistic feature learning for deformable image registration,” in IJCAI, 2020.
  • [16] R. Liu, J. Liu, Z. Jiang, X. Fan, and Z. Luo, “A bilevel integrated model with data-driven layer ensemble for multi-modality image fusion,” IEEE TIP, 2020.
  • [17] R. Liu, P. Mu, J. Chen, X. Fan, and Z. Luo, “Investigating task-driven latent feasibility for nonconvex image modeling,” IEEE TIP, 2020.
  • [18] S. Dempe, N. Gadhi, and L. Lafhim, “Optimality conditions for pessimistic bilevel problems using convexificator,” Positivity, pp. 1–19, 2020.
  • [19] S. Dempe, Bilevel optimization: theory, algorithms and applications.   TU Bergakademie Freiberg, Fakultät für Mathematik und Informatik, 2018.
  • [20] R. Liu, J. Gao, J. Zhang, D. Meng, and Z. Lin, “Investigating bi-level optimization for learning and vision from a unified perspective: A survey and beyond,” IEEE TPAMI, vol. 44, no. 12, pp. 10 045–10 067, 2021.
  • [21] M. J. Alves, C. H. Antunes, and J. P. Costa, “New concepts and an algorithm for multiobjective bilevel programming: optimistic, pessimistic and moderate solutions,” Operational Research, pp. 1–34, 2019.
  • [22] R. G. Jeroslow, “The polynomial hierarchy and a simple model for competitive analysis,” Mathematical programming, 1985.
  • [23] J. F. Bard and J. E. Falk, “An explicit solution to the multi-level programming problem,” Computers & Operations Research, vol. 9, no. 1, pp. 77–100, 1982.
  • [24] Z.-Q. Luo, J.-S. Pang, and D. Ralph, Mathematical programs with equilibrium constraints.   Cambridge University Press, 1996.
  • [25] D. Maclaurin, D. Duvenaud, and R. P. Adams, “Gradient-based hyperparameter optimization through reversible learning,” in ICML, ser. JMLR Workshop and Conference Proceedings, 2015.
  • [26] A. Shaban, C. Cheng, N. Hatch, and B. Boots, “Truncated back-propagation for bilevel optimization,” in AISTATS, 2019.
  • [27] F. Pedregosa, “Hyperparameter optimization with approximate gradient,” in ICML, 2016.
  • [28] J. Lorraine, P. Vicol, and D. Duvenaud, “Optimizing millions of hyperparameters by implicit differentiation,” in AISTATS, 2020.
  • [29] R. Liu, P. Mu, X. Yuan, S. Zeng, and J. Zhang, “A generic first-order algorithmic framework for bi-level programming beyond lower-level singleton,” in ICML, 2020.
  • [30] ——, “A general descent aggregation framework for gradient-based bi-level optimization,” IEEE TPAMI, vol. 45, no. 1, pp. 38–57, 2022.
  • [31] J. V. Outrata, “On the numerical solution of a class of stackelberg problems,” ZOR-Methods and Models of Operations Research, 1990.
  • [32] J. J. Ye and D. L. Zhu, “Optimality conditions for bilevel programming problems,” Optimization, 1995.
  • [33] J. Bergstra and Y. Bengio, “Random search for hyper-parameter optimization,” JMLR, vol. 13, pp. 281–305, 2012.
  • [34] F. Hutter, H. H. Hoos, and K. Leyton-Brown, “Sequential model-based optimization for general algorithm configuration,” in International conference on learning and intelligent optimization, 2011.
  • [35] A. V. Fiacco and G. P. McCormick, Nonlinear programming: sequential unconstrained minimization techniques.   SIAM, 1990.
  • [36] R. Grazzi, L. Franceschi, M. Pontil, and S. Salzo, “On the iteration complexity of hypergradient computation,” in ICML, 2020.
  • [37] L. S. Lasdon, “An efficient algorithm for minimizing barrier and penalty functions,” Mathematical Programming, vol. 2, no. 1, pp. 65–106, 1972.
  • [38] C. L. Byrne, “Alternating minimization as sequential unconstrained minimization: a survey,” Journal of Optimization Theory and Applications, vol. 156, no. 3, pp. 554–566, 2013.
  • [39] R. M. Freund, “Penalty and barrier methods for constrained optimization,” Lecture Notes, Massachusetts Institute of Technology, 2004.
  • [40] D. G. Luenberger, Y. Ye et al., Linear and nonlinear programming.   Springer, 1984, vol. 2.
  • [41] D. Boukari and A. Fiacco, “Survey of penalty, exact-penalty and multiplier methods from 1968 to 1993,” Optimization, vol. 32, no. 4, pp. 301–334, 1995.
  • [42] A. Auslender, “Penalty and barrier methods: a unified framework,” SIAM Journal on Optimization, vol. 10, no. 1, pp. 211–230, 1999.
  • [43] J. F. Bonnans and A. Shapiro, Perturbation analysis of optimization problems.   Springer Science & Business Media, 2013.
  • [44] B. S. Mordukhovich and N. M. Nam, “An easy path to convex analysis and applications,” Synthesis Lectures on Mathematics and Statistics, vol. 6, no. 2, pp. 1–218, 2013.
  • [45] P. Borges, C. Sagastizábal, and M. Solodov, “A regularized smoothing method for fully parameterized convex problems with applications to convex and nonconvex two-stage stochastic programming,” Mathematical Programming, 2020.
  • [46] R. Liu, Y. Liu, W. Yao, S. Zeng, and J. Zhang, “Averaged method of multipliers for bi-level optimization without lower-level strong convexity,” arXiv preprint arXiv:2302.03407, 2023.
  • [47] K. Ji and Y. Liang, “Lower bounds and accelerated algorithms for bilevel optimization,” JMLR, vol. 23, pp. 1–56, 2022.
  • [48] E. Grefenstette, B. Amos, D. Yarats, P. M. Htut, A. Molchanov, F. Meier, D. Kiela, K. Cho, and S. Chintala, “Generalized inner loop meta-learning,” arXiv preprint arXiv:1910.01727, 2019.
  • [49] B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum, “Human-level concept learning through probabilistic program induction,” Science, 2015.
  • [50] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” Advances in neural information processing systems, vol. 27, 2014.
  • [51] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein GAN,” in ICML, 2017.
  • [52] T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative adversarial networks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 4401–4410.
[Uncaptioned image] Risheng Liu received the B.Sc. and Ph.D. degrees in mathematics from Dalian University of Technology in 2007 and 2012, respectively. He was a Visiting Scholar with the Robotics Institute, Carnegie Mellon University, from 2010 to 2012. He served as a Hong Kong Scholar Research Fellow at the Hong Kong Polytechnic University from 2016 to 2017. He is currently a Professor with the DUT-RU International School of Information Science & Engineering, Dalian University of Technology. His research interests include machine learning, optimization, computer vision, and multimedia. He is a member of the ACM, and was a co-recipient of the IEEE ICME Best Student Paper Award in 2014 and 2015. Two papers were also selected as a Finalist of the Best Paper Award in ICME 2017.
[Uncaptioned image] Xuan Liu received the B.Sc. degree in mathematics from Dalian University of Technology in 2020. He is currently an M.Phil. student in the Department of Software Engineering at Dalian University of Technology. His research interests include computer vision, machine learning, and control and optimization.
[Uncaptioned image] Shangzhi Zeng received the B.Sc. degree in Mathematics and Applied Mathematics from Wuhan University in 2015, the M.Phil. degree from Hong Kong Baptist University in 2017, and the Ph.D. degree from the University of Hong Kong in 2021. He is currently a PIMS postdoctoral fellow in the Department of Mathematics and Statistics at University of Victoria. His current research interests include variational analysis and bilevel optimization.
[Uncaptioned image] Jin Zhang received the B.A. degree in journalism and the M.Phil. degree in mathematics and operational research and cybernetics from Dalian University of Technology in 2007 and 2010, respectively, and the Ph.D. degree in applied mathematics from University of Victoria, Canada, in 2015. After working with Hong Kong Baptist University for three years, he joined Southern University of Science and Technology as a tenure-track Assistant Professor with the Department of Mathematics and promoted to an Associate Professor in 2022. His broad research area is comprised of optimization, variational analysis and their applications in economics, engineering, and data science.
[Uncaptioned image] Yixuan Zhang received the B.Sc. degree in Mathematics and Applied Mathematics from Beijing Normal University in 2020, and the M.Phil. degree from Southern University of Science and Technology in 2022. She is currently a Ph.D. student in the Department of Applied Mathematics at the Hong Kong Polytechnic University. Her current research interests include optimization and machine learning.

Appendix A Proofs of Lemmas in Section 4.1

A.1 Lemma 1

Let {σk}\{\sigma_{k}\} in Pk(ω)=Pσk(ω)P_{k}(\omega)\!=\!P_{\sigma_{k}}(\omega) be a positive sequence such that limkσk=0\lim_{k\rightarrow\infty}\sigma_{k}=0. Additionally assume that limkρ(σk(2);σk(1))=0\lim_{k\rightarrow\infty}\rho(-\sigma^{(2)}_{k}\;;\sigma_{k}^{(1)})=0 when ρ\rho is a modified barrier function. Then we have

  1. (1)

    Pk(ω)P_{k}(\omega) is continuous, differentiable and non-decreasing, and satisfies Pk(ω)0P_{k}(\omega)\geq 0.

  2. (2)

    For any ω0\omega\leq 0, limkPk(ω)=0\lim_{k\rightarrow\infty}P_{k}(\omega)=0.

  3. (3)

    For any sequence {ωk}\{\omega_{k}\}, limkPk(ωk)<+\lim_{k\rightarrow\infty}P_{k}(\omega_{k})<+\infty implies that lim supkωk0\limsup_{k\rightarrow\infty}\omega_{k}\leq 0.

Proof.

From the definitions of penalty and barrier functions (see, e.g., Definition 1), the statement (1) follows immediately.

When ρ\rho is a penalty function, limσ0ρ(ω;σ)\lim_{\sigma\rightarrow 0}\rho(\omega;\sigma) is equal to ++\infty for ω>0\omega>0 and 0 for ω0\omega\leq 0. Hence, as kk\rightarrow\infty, σk0\sigma_{k}\rightarrow 0, and we have Pk(ω)0P_{k}(\omega)\rightarrow 0, for ω0\omega\leq 0. For any sequence {ωk}\{\omega_{k}\}, if lim supkωk>0\limsup_{k\rightarrow\infty}\omega_{k}>0, there exists a subsequence {ωt}\{\omega_{t}\} of {ωk}\{\omega_{k}\} and ε>0\varepsilon>0 such that ωtε\omega_{t}\geq\varepsilon for all tt. Then, it follows from the monotonicity of ρ\rho that limtPt(ωt)=limtρ(ωt;σt)limtρ(ε;σt)=+\lim_{t\rightarrow\infty}P_{t}(\omega_{t})=\lim_{t\rightarrow\infty}\rho(\omega_{t};\sigma_{t})\geq\lim_{t\rightarrow\infty}\rho(\varepsilon;\sigma_{t})=+\infty. Thus, limkPk(ωk)<+\lim_{k\rightarrow\infty}P_{k}(\omega_{k})<+\infty implies that lim supkωk0\limsup_{k\rightarrow\infty}\omega_{k}\leq 0.

If ρ\rho is a modified barrier function, since ρ\rho is non-decreasing, we have 0ρ(ωσk(2);σk(1))ρ(σk(2);σk(1))0\leq\rho(\omega-\sigma^{(2)}_{k};\sigma_{k}^{(1)})\leq\rho(-\sigma^{(2)}_{k};\sigma_{k}^{(1)}) when ω0\omega\leq 0. The assumption limkρ(σk(2);σk(1))=0\lim_{k\rightarrow\infty}\rho(-\sigma^{(2)}_{k};\sigma_{k}^{(1)})=0 implies ρ(ωσk(2);σk(1))0\rho(\omega-\sigma^{(2)}_{k};\sigma_{k}^{(1)})\rightarrow 0 when ω0\omega\leq 0. Hence, as kk\rightarrow\infty, we have σk(1)0\sigma_{k}^{(1)}\rightarrow 0, σk(2)0\sigma^{(2)}_{k}\rightarrow 0, and Pk(ω)0P_{k}(\omega)\rightarrow 0, for ω0\omega\leq 0. For any sequence {ωk}\{\omega_{k}\}, if limkPk(ωk)<+\lim_{k\rightarrow\infty}P_{k}(\omega_{k})<+\infty, then it follows from the definition of the modified barrier function that ωkσk(2)\omega_{k}\leq\sigma^{(2)}_{k}, and (3) follows immediately from σk(2)0\sigma^{(2)}_{k}\rightarrow 0. ∎
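To make the penalty regime of Lemma 1 concrete, the following minimal sketch instantiates P_k with the quadratic penalty rho(w; sigma) = max(w, 0)^2 / sigma — one admissible choice satisfying Definition 1, assumed here for illustration and not necessarily the function used in the paper's experiments — and checks statements (2) and (3) numerically:

```python
# A minimal numerical sketch of Lemma 1, assuming the quadratic penalty
# rho(w; sigma) = max(w, 0)^2 / sigma (one admissible choice, not necessarily
# the function used in the paper's experiments).
def rho(w, sigma):
    return max(w, 0.0) ** 2 / sigma

sigmas = [10.0 ** (-k) for k in range(1, 8)]  # sigma_k -> 0

# Statement (2): for w <= 0, P_k(w) = rho(w; sigma_k) -> 0 (here it is exactly 0).
assert all(rho(-0.5, s) == 0.0 for s in sigmas)

# Statement (3): for w > 0, P_k(w) blows up as sigma_k -> 0, so a bounded
# sequence P_k(w_k) forces limsup_k w_k <= 0.
vals = [rho(0.1, s) for s in sigmas]
assert all(v2 > v1 for v1, v2 in zip(vals, vals[1:]))  # strictly increasing
assert vals[-1] > 1e4  # rho(0.1, 1e-7) = 0.01 / 1e-7 = 1e5
```

Any other penalty family from Definition 1 would exhibit the same dichotomy between w ≤ 0 and w > 0 as sigma_k → 0.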

A.2 Lemma 2

Let {(μk,σk)}\{(\mu_{k},\sigma_{k})\} be a positive sequence such that (μk,σk)0(\mu_{k},\sigma_{k})\rightarrow 0, also satisfying the same setting as in Lemma 1. Then for any sequence {𝐱k}\{\mathbf{x}_{k}\} converging to 𝐱¯\bar{\mathbf{x}},

lim supkfk(𝐱k)f(𝐱¯).\limsup_{k\rightarrow\infty}f_{k}^{*}(\mathbf{x}_{k})\leq f^{*}(\bar{\mathbf{x}}).
Proof.

Given any ϵ>0\epsilon>0, there exists 𝐲¯n\bar{\mathbf{y}}\in\mathbb{R}^{n} such that f(𝐱¯,𝐲¯)<f(𝐱¯)+1/2ϵf(\bar{\mathbf{x}},\bar{\mathbf{y}})<f^{*}(\bar{\mathbf{x}})+1/2\ \epsilon, and h(𝐱¯,𝐲¯)0h(\bar{\mathbf{x}},\bar{\mathbf{y}})\leq 0. If h(𝐱¯,𝐲¯)=0h(\bar{\mathbf{x}},\bar{\mathbf{y}})=0, then by Assumption 1.(4), the minimum of hh w.r.t. 𝐲\mathbf{y} in any neighbourhood of 𝐲¯\bar{\mathbf{y}} is smaller than 0, so we can find a 𝐲^\hat{\mathbf{y}} close enough to 𝐲¯\bar{\mathbf{y}} such that f(𝐱¯,𝐲^)<f(𝐱¯)+ϵf(\bar{\mathbf{x}},\hat{\mathbf{y}})<f^{*}(\bar{\mathbf{x}})+\epsilon, and h(𝐱¯,𝐲^)δh(\bar{\mathbf{x}},\hat{\mathbf{y}})\leq-\delta for some δ>0\delta>0. If h(𝐱¯,𝐲¯)<0h(\bar{\mathbf{x}},\bar{\mathbf{y}})<0, such a 𝐲^\hat{\mathbf{y}} obviously exists.

As {𝐱k}\{\mathbf{x}_{k}\} converges to 𝐱¯\bar{\mathbf{x}}, h(𝐱¯,𝐲^)δh(\bar{\mathbf{x}},\hat{\mathbf{y}})\leq-\delta combined with the continuity of h(𝐱,𝐲)h(\mathbf{x},\mathbf{y}) implies the existence of K1>0K_{1}>0 such that h(𝐱k,𝐲^)δ/2h(\mathbf{x}_{k},\hat{\mathbf{y}})\leq-\delta/2 for all kK1k\geq K_{1}. Since the barrier function is non-decreasing, it follows that PB,k(h(𝐱k,𝐲^))ρ(δ/2;σk(1))P_{B,k}(h(\mathbf{x}_{k},\hat{\mathbf{y}}))\leq\rho(-\delta/2;\sigma_{k}^{(1)}). Then limkρ(δ/2;σk(1))=0\lim_{k\rightarrow\infty}\rho(-\delta/2;\sigma_{k}^{(1)})=0 yields

limkPB,k(h(𝐱k,𝐲^))=0.\lim_{k\rightarrow\infty}P_{B,k}\!\left(h(\mathbf{x}_{k},\hat{\mathbf{y}})\right)=0.

Next, as {𝐱k}\{\mathbf{x}_{k}\} converges to 𝐱¯\bar{\mathbf{x}}, it follows from the continuity of f(𝐱,𝐲)f(\mathbf{x},\mathbf{y}) and μk0\mu_{k}\rightarrow 0 that there exists K2K1K_{2}\geq K_{1}, such that for any kK2k\geq K_{2},

fk(𝐱k)\displaystyle f_{k}^{*}(\mathbf{x}_{k})\leq f(𝐱k,𝐲^)+PB,k(h(𝐱k,𝐲^))+μk2𝐲^2\displaystyle f(\mathbf{x}_{k},\hat{\mathbf{y}})+P_{B,k}\!\left(h(\mathbf{x}_{k},\hat{\mathbf{y}})\right)+\frac{\mu_{k}}{2}\|\hat{\mathbf{y}}\|^{2}
\displaystyle\leq f(𝐱¯,𝐲^)+ϵ\displaystyle f(\bar{\mathbf{x}},\hat{\mathbf{y}})+\epsilon
\displaystyle\leq f(𝐱¯)+2ϵ.\displaystyle f^{*}(\bar{\mathbf{x}})+2\epsilon.

By letting kk\rightarrow\infty, we obtain

lim supkfk(𝐱k)f(𝐱¯)+2ϵ,\limsup_{k\rightarrow\infty}f_{k}^{*}(\mathbf{x}_{k})\leq f^{*}(\bar{\mathbf{x}})+2\epsilon,

and letting ϵ0\epsilon\rightarrow 0 yields the conclusion. ∎
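As a toy illustration of Lemma 2 (our own example, not from the paper), take f(x, y) = (y − x)^2 with no LL constraint, so that f*(x) = 0 and the barrier term drops out of f_k*; the regularized value function then has the closed form min_y (y − x)^2 + (mu_k/2) y^2 = mu_k x^2 / (mu_k + 2), which indeed satisfies limsup_k f_k*(x_k) ≤ f*(x̄) along x_k → x̄:

```python
# Toy instance of Lemma 2 (illustrative, not from the paper):
# f(x, y) = (y - x)^2, no LL constraint, so f*(x) = 0 and
# f_k*(x) = min_y (y - x)^2 + (mu_k / 2) * y^2 = mu_k * x^2 / (mu_k + 2).
def f_k_star(x, mu):
    # closed form: the unique minimizer is y = 2x / (2 + mu)
    return mu * x ** 2 / (mu + 2)

# cross-check the closed form against a brute-force grid minimum (mu = 0.5, x = 1)
num = min((y / 1000 - 1.0) ** 2 + 0.25 * (y / 1000) ** 2 for y in range(-3000, 3001))
assert abs(num - f_k_star(1.0, 0.5)) < 1e-4

# along x_k -> x_bar = 1 and mu_k -> 0, limsup_k f_k*(x_k) <= f*(x_bar) = 0
mus = [2.0 ** (-k) for k in range(20)]
vals = [f_k_star(1.0 + mu, mu) for mu in mus]
assert all(v >= 0 for v in vals) and vals[-1] < 1e-5
```

Here the upper limit is actually a limit, since the unconstrained toy problem is smooth in mu.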

A.3 Lemma 3

Let {(μk,θk,σk)}\{(\mu_{k},\theta_{k},\sigma_{k})\} be a positive sequence such that limk(μk,θk,σk)=0\lim_{k\rightarrow\infty}(\mu_{k},\theta_{k},\sigma_{k})=0, and satisfy the same setting as in Lemma 1. Given 𝐱¯𝒳\bar{\mathbf{x}}\in\mathcal{X}, then for any sequence {𝐱k}\{\mathbf{x}_{k}\} converging to 𝐱¯\bar{\mathbf{x}}, we have

lim infkφk(𝐱k)φ(𝐱¯).\liminf\limits_{k\rightarrow\infty}\varphi_{k}(\mathbf{x}_{k})\geq\varphi(\bar{\mathbf{x}}).
Proof.

Suppose, by contradiction, that there exist 𝐱¯𝒳\bar{\mathbf{x}}\in\mathcal{X} and a sequence {𝐱k}\{\mathbf{x}_{k}\} with 𝐱k𝐱¯\mathbf{x}_{k}\to\bar{\mathbf{x}} as kk\to\infty such that

limkφk(𝐱k)<φ(𝐱¯).\lim_{k\rightarrow\infty}\varphi_{k}(\mathbf{x}_{k})<\varphi(\bar{\mathbf{x}}).

Then, there exist ϵ>0\epsilon>0 and a sequence {𝐲k}\{\mathbf{y}_{k}\} satisfying

F(𝐱k,𝐲k)+PH,k(H(𝐱k,𝐲k))+Ph,k(h(𝐱k,𝐲k))\displaystyle F(\mathbf{x}_{k},\mathbf{y}_{k})+P_{H,k}\!\left(H(\mathbf{x}_{k},\mathbf{y}_{k})\right)+P_{h,k}\!\left(h(\mathbf{x}_{k},\mathbf{y}_{k})\right) (19)
+Pf,k(f(𝐱k,𝐲k)fk(𝐱k))+θk2𝐲k2<φ(𝐱¯)ϵ.\displaystyle+P_{f,k}\!\left(f(\mathbf{x}_{k},\mathbf{y}_{k})-f_{k}^{*}(\mathbf{x}_{k})\right)+\frac{\theta_{k}}{2}\|\mathbf{y}_{k}\|^{2}<\varphi(\bar{\mathbf{x}})-\epsilon.

Since F(𝐱,𝐲)F(\mathbf{x},\mathbf{y}) is level-bounded in 𝐲\mathbf{y} locally uniformly in 𝐱¯\bar{\mathbf{x}}, the sequence {𝐲k}\{\mathbf{y}_{k}\} is bounded. Hence we can take a subsequence {𝐲t}\{\mathbf{y}_{t}\} of {𝐲k}\{\mathbf{y}_{k}\} converging to some 𝐲^\hat{\mathbf{y}}.

The inequality Eq. (19) yields that

Pf,t(f(𝐱t,𝐲t)ft(𝐱t))<φ(𝐱¯)ϵF(𝐱t,𝐲t).\displaystyle P_{f,t}\!\left(f(\mathbf{x}_{t},\mathbf{y}_{t})-f_{t}^{*}(\mathbf{x}_{t})\right)<\varphi(\bar{\mathbf{x}})-\epsilon-F(\mathbf{x}_{t},\mathbf{y}_{t}).

Taking tt\rightarrow\infty, we obtain limtPf,t(f(𝐱t,𝐲t)ft(𝐱t))<+\lim_{t\rightarrow\infty}P_{f,t}\!\left(f(\mathbf{x}_{t},\mathbf{y}_{t})-f_{t}^{*}(\mathbf{x}_{t})\right)<+\infty. From Lemma 1, we have lim supt{f(𝐱t,𝐲t)ft(𝐱t)}0\limsup_{t\rightarrow\infty}\left\{f(\mathbf{x}_{t},\mathbf{y}_{t})-f_{t}^{*}(\mathbf{x}_{t})\right\}\leq 0, and hence by the continuity of ff,

limtf(𝐱t,𝐲t)lim inftft(𝐱t).\lim_{t\rightarrow\infty}f(\mathbf{x}_{t},\mathbf{y}_{t})\leq\liminf_{t\rightarrow\infty}f_{t}^{*}(\mathbf{x}_{t}).

Then, by the continuity of ff and Lemma 2, we have

f(𝐱¯,𝐲^)=limtf(𝐱t,𝐲t)lim suptft(𝐱t)f(𝐱¯).f(\bar{\mathbf{x}},\hat{\mathbf{y}})=\lim_{t\rightarrow\infty}f(\mathbf{x}_{t},\mathbf{y}_{t})\leq\limsup_{t\rightarrow\infty}f_{t}^{*}(\mathbf{x}_{t})\leq f^{*}(\bar{\mathbf{x}}).

By using similar arguments and the continuity of hh and HH, one can show h(𝐱¯,𝐲^)0h(\bar{\mathbf{x}},\hat{\mathbf{y}})\leq 0 and H(𝐱¯,𝐲^)0.H(\bar{\mathbf{x}},\hat{\mathbf{y}})\leq 0. Thus, we have 𝐲^S(𝐱¯),\hat{\mathbf{y}}\in S(\bar{\mathbf{x}}), and 𝐲^\hat{\mathbf{y}} is a feasible point to problem Eq. (14) with 𝐱=𝐱¯\mathbf{x}=\bar{\mathbf{x}}. Then Eq. (19) yields

φ(𝐱¯)F(𝐱¯,𝐲^)lim supkF(𝐱k,𝐲k)φ(𝐱¯)ϵ,\varphi(\bar{\mathbf{x}})\leq F(\bar{\mathbf{x}},\hat{\mathbf{y}})\leq\limsup_{k\rightarrow\infty}F(\mathbf{x}_{k},\mathbf{y}_{k})\leq\varphi(\bar{\mathbf{x}})-\epsilon,

which implies a contradiction. Thus we get the conclusion. ∎

A.4 Lemma 4

Let {(μk,θk,σk)}\{(\mu_{k},\theta_{k},\sigma_{k})\} be a positive sequence such that limk(μk,θk,σk)=0\lim_{k\rightarrow\infty}(\mu_{k},\theta_{k},\sigma_{k})=0, and satisfy the same setting as in Lemma 1. Then for any 𝐱𝒳\mathbf{x}\in\mathcal{X},

lim supkφk(𝐱)φ(𝐱).\limsup_{k\rightarrow\infty}\varphi_{k}(\mathbf{x})\leq\varphi(\mathbf{x}).
Proof.

Given any 𝐱¯𝒳\bar{\mathbf{x}}\in\mathcal{X}, for any ϵ>0\epsilon>0, there exists 𝐲¯n\bar{\mathbf{y}}\in\mathbb{R}^{n} satisfying f(𝐱¯,𝐲¯)f(𝐱¯)f(\bar{\mathbf{x}},\bar{\mathbf{y}})\leq f^{*}(\bar{\mathbf{x}}), h(𝐱¯,𝐲¯)0h(\bar{\mathbf{x}},\bar{\mathbf{y}})\leq 0, H(𝐱¯,𝐲¯)0H(\bar{\mathbf{x}},\bar{\mathbf{y}})\leq 0, and F(𝐱¯,𝐲¯)φ(𝐱¯)+ϵF(\bar{\mathbf{x}},\bar{\mathbf{y}})\leq\varphi(\bar{\mathbf{x}})+\epsilon.

By the definition of φk\varphi_{k}, we have

φk(𝐱¯)\displaystyle\varphi_{k}(\bar{\mathbf{x}})\leq F(𝐱¯,𝐲¯)+PH,k(H(𝐱¯,𝐲¯))+Ph,k(h(𝐱¯,𝐲¯))\displaystyle F(\bar{\mathbf{x}},\bar{\mathbf{y}})+P_{H,k}\!\left(H(\bar{\mathbf{x}},\bar{\mathbf{y}})\right)+P_{h,k}\!\left(h(\bar{\mathbf{x}},\bar{\mathbf{y}})\right) (20)
+Pf,k(f(𝐱¯,𝐲¯)fk(𝐱¯))+θk2𝐲¯2.\displaystyle+P_{f,k}\!\left(f(\bar{\mathbf{x}},\bar{\mathbf{y}})-f_{k}^{*}(\bar{\mathbf{x}})\right)+\frac{\theta_{k}}{2}\|\bar{\mathbf{y}}\|^{2}.

From Lemma 1, as kk\rightarrow\infty, we have PH,k(H(𝐱¯,𝐲¯))0P_{H,k}\left(H(\bar{\mathbf{x}},\bar{\mathbf{y}})\right)\rightarrow 0, and Ph,k(h(𝐱¯,𝐲¯))0P_{h,k}\!\left(h(\bar{\mathbf{x}},\bar{\mathbf{y}})\right)\rightarrow 0. Since we choose the standard barrier function for PB,kP_{B,k} in the definition of fk(𝐱)f^{*}_{k}(\mathbf{x}), the quantity f(𝐱,𝐲)+PB,k(h(𝐱,𝐲))+μk2𝐲2f(\mathbf{x},\mathbf{y})+P_{B,k}\!\left(h(\mathbf{x},\mathbf{y})\right)+\frac{\mu_{k}}{2}\|\mathbf{y}\|^{2} is no smaller than f(𝐱,𝐲)f(\mathbf{x},\mathbf{y}) for any 𝐱\mathbf{x} and any 𝐲\mathbf{y} feasible to the LL problem, and hence fk(𝐱¯)f(𝐱¯)f_{k}^{*}(\bar{\mathbf{x}})\geq f^{*}(\bar{\mathbf{x}}). Then we have f(𝐱¯,𝐲¯)fk(𝐱¯)f(𝐱¯,𝐲¯)f(𝐱¯)0f(\bar{\mathbf{x}},\bar{\mathbf{y}})-f_{k}^{*}(\bar{\mathbf{x}})\leq f(\bar{\mathbf{x}},\bar{\mathbf{y}})-f^{*}(\bar{\mathbf{x}})\leq 0. It then follows from the monotonicity of Pf,kP_{f,k} and Lemma 1 that Pf,k(f(𝐱¯,𝐲¯)fk(𝐱¯))0P_{f,k}\!\left(f(\bar{\mathbf{x}},\bar{\mathbf{y}})-f_{k}^{*}(\bar{\mathbf{x}})\right)\rightarrow 0.

Therefore, as θk0\theta_{k}\rightarrow 0, by taking kk\rightarrow\infty in inequality Eq. (20), we have

lim supkφk(𝐱¯)φ(𝐱¯)+ϵ.\limsup_{k\rightarrow\infty}\varphi_{k}(\bar{\mathbf{x}})\leq\varphi(\bar{\mathbf{x}})+\epsilon.

Then, we get the conclusion by letting ϵ0\epsilon\rightarrow 0. ∎

Appendix B Closed-form Solution in Section 5.1

Here we provide the detailed derivation of the closed-form solutions for the numerical examples in Section 5.1.

B.1 Optimistic BLO in Section 5.1.1

We consider the following optimistic BLO:

min𝐱,𝐲n𝐱a2+𝐲a𝐜2\displaystyle\min_{\mathbf{x}\in\mathbb{R},\mathbf{y}\in\mathbb{R}^{n}}\|\mathbf{x}-a\|^{2}+\|\mathbf{y}-a-\mathbf{c}\|^{2}
 s.t. [𝐲]iargmin[𝐲]isin(𝐱+[𝐲]i[𝐜]i),i,\displaystyle\text{\ s.t.\ }\;[\mathbf{y}]_{i}\in\underset{[\mathbf{y}]_{i}\in\mathbb{R}}{\mathrm{argmin}}\;\sin(\mathbf{x}+[\mathbf{y}]_{i}-[\mathbf{c}]_{i}),\forall\ i,

where [𝐲]i[\mathbf{y}]_{i} denotes the ii-th component of 𝐲\mathbf{y}, while aa\in\mathbb{R} and 𝐜n\mathbf{c}\in\mathbb{R}^{n} are adjustable parameters. Note that in this numerical example 𝐱\mathbf{x}\in\mathbb{R} is a scalar; we keep the bold letter for consistency of notation. The solution of this problem is

𝐱=(1n)a+nC1+n, and [𝐲]i=C+[𝐜]i𝐱,i,\mathbf{x}^{*}=\frac{(1-n)a+nC}{1+n},\ \text{ and }\ [\mathbf{y}^{*}]_{i}=C+[\mathbf{c}]_{i}-\mathbf{x}^{*},\forall\ i,

where

C=argmin𝑘{Ck2a:Ck=π2+2kπ,k},C=\underset{{k}}{\operatorname{argmin}}\left\{\|C_{k}-2a\|:C_{k}=-\frac{\pi}{2}+2k\pi,k\in\mathbb{Z}\right\},

and the optimal value is F=n(C2a)21+nF^{*}=\frac{n(C-2a)^{2}}{1+n}. Derivation of the optimal solution and optimal value is as follows.

From the LL problem

[𝐲]iargmin[𝐲]isin(𝐱+[𝐲]i[𝐜]i),i,[\mathbf{y}]_{i}\in\underset{[\mathbf{y}]_{i}\in\mathbb{R}}{\mathrm{argmin}}\;\sin(\mathbf{x}+[\mathbf{y}]_{i}-[\mathbf{c}]_{i}),\forall\ i,

we have [𝐲]i{𝐱+[𝐜]iπ2+2kπ:k},i[\mathbf{y}]_{i}\in\left\{-\mathbf{x}+[\mathbf{c}]_{i}-\frac{\pi}{2}+2k\pi:k\in\mathbb{Z}\right\},\forall\ i. Then the problem is to find the 𝐱\mathbf{x} and kk to minimize

F\displaystyle F =𝐱a2+𝐲a𝐜2\displaystyle=\|\mathbf{x}-a\|^{2}+\|\mathbf{y}-a-\mathbf{c}\|^{2}
=(𝐱a)2+i=1n([𝐲]ia[𝐜]i)2\displaystyle=(\mathbf{x}-a)^{2}+\sum_{i=1}^{n}([\mathbf{y}]_{i}-a-[\mathbf{c}]_{i})^{2}
=(𝐱a)2+n(𝐱π2+2kπa)2\displaystyle=(\mathbf{x}-a)^{2}+n(-\mathbf{x}-\frac{\pi}{2}+2k\pi-a)^{2}
=(n+1)𝐱2+2[n(a+π22kπ)a]𝐱\displaystyle=(n+1)\mathbf{x}^{2}+2\left[n\left(a+\frac{\pi}{2}-2k\pi\right)-a\right]\mathbf{x}
+a2+n(a+π22kπ)2.\displaystyle\quad+a^{2}+n\left(a+\frac{\pi}{2}-2k\pi\right)^{2}.

For a given kk, denote Ck=π2+2kπC_{k}=-\frac{\pi}{2}+2k\pi, then

Fk=(n+1)𝐱2+2[n(aCk)a]𝐱+a2+n(aCk)2,F_{k}=(n+1)\mathbf{x}^{2}+2\left[n(a-C_{k})-a\right]\mathbf{x}+a^{2}+n\left(a-C_{k}\right)^{2},

which is strongly convex and quadratic w.r.t. 𝐱\mathbf{x}. Thus, it is easy to obtain 𝐱k=(1n)a+nCk1+n\mathbf{x}_{k}^{*}=\frac{(1-n)a+nC_{k}}{1+n}, and

Fk\displaystyle F_{k}^{*} =a2+n(aCk)2[n(aCk)a]2n+1\displaystyle=a^{2}+n(a-C_{k})^{2}-\frac{[n(a-C_{k})-a]^{2}}{n+1}
=nn+1(Ck2a)2.\displaystyle=\frac{n}{n+1}(C_{k}-2a)^{2}.

Hence, by denoting

C=argmin𝑘{Ck2a:Ck=π2+2kπ,k},C=\underset{{k}}{\operatorname{argmin}}\left\{\|C_{k}-2a\|:C_{k}=-\frac{\pi}{2}+2k\pi,k\in\mathbb{Z}\right\},

we have the optimal value F=nn+1(C2a)2F^{*}=\frac{n}{n+1}(C-2a)^{2}, and the corresponding solution

𝐱=(1n)a+nC1+n, and [𝐲]i=C+[𝐜]i𝐱,i.\mathbf{x}^{*}=\frac{(1-n)a+nC}{1+n},\ \text{ and }\ [\mathbf{y}^{*}]_{i}=C+[\mathbf{c}]_{i}-\mathbf{x}^{*},\forall\ i.
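As a sanity check (ours, not part of the paper), the closed form above can be verified numerically for an illustrative choice of parameters, say a = 2 and n = 2, by brute-forcing over 𝐱 and the lower-level branches C_k:

```python
import math

a, n = 2.0, 2  # illustrative parameter values; the closed form holds for any a, n

# candidate lower-level branches C_k = -pi/2 + 2*k*pi
ks = range(-5, 6)
Ck = {k: -math.pi / 2 + 2 * k * math.pi for k in ks}

# pick the branch minimizing F_k^* = n/(n+1) * (C_k - 2a)^2
k_best = min(ks, key=lambda k: n / (n + 1) * (Ck[k] - 2 * a) ** 2)
C = Ck[k_best]
x_star = ((1 - n) * a + n * C) / (1 + n)
F_star = n / (n + 1) * (C - 2 * a) ** 2

# brute force over x after eliminating y on each branch:
# F(x; k) = (x - a)^2 + n * (x + a - C_k)^2
grid_min = min(
    (x - a) ** 2 + n * (x + a - Ck[k]) ** 2
    for k in ks
    for x in (-10 + i * 1e-3 for i in range(20001))
)
assert abs(grid_min - F_star) < 1e-4
```

For a = 2 the closest branch to 2a is C = 3π/2, so the grid minimum matches F* = (2/3)(3π/2 − 4)^2 up to the grid resolution.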

B.2 BLO with constraints in Section 5.1.2

We consider the following constrained BLO problem with non-convex LL:

min𝐱,𝐲n𝐱a2+𝐲a2\displaystyle\min_{\mathbf{x}\in\mathbb{R},\mathbf{y}\in\mathbb{R}^{n}}\|\mathbf{x}-a\|^{2}+\left\|\mathbf{y}-a\right\|^{2}
s.t. [𝐲]iargmin[𝐲]i{sin(𝐱+[𝐲]i[𝐜]i):𝐱+[𝐲]i[0,1]},i,\displaystyle\text{ s.t. }[\mathbf{y}]_{i}\in\mathop{\arg\min}_{[\mathbf{y}]_{i}\in\mathbb{R}}\Big{\{}\sin\left(\mathbf{x}+[\mathbf{y}]_{i}-[\mathbf{c}]_{i}\right):\mathbf{x}+[\mathbf{y}]_{i}\in[0,1]\Big{\}},\forall\ i,

where aa\in\mathbb{R} is a fixed constant and 𝐜n\mathbf{c}\in\mathbb{R}^{n} is a fixed vector satisfying [𝐜]i[0,1][\mathbf{c}]_{i}\in[0,1] for any i=1,,ni=1,\cdots,n. The optimal solution is

𝐱=1n1+na, and [𝐲]i=𝐱,i,\mathbf{x}^{*}=\frac{1-n}{1+n}a,\text{ and }[\mathbf{y}^{*}]_{i}=-\mathbf{x}^{*},\forall\ i,

and the optimal value is F=4n1+na2F^{*}=\frac{4n}{1+n}a^{2}. Derivation of the optimal solution and optimal value is as follows.

From 𝐱+[𝐲]i[0,1],\mathbf{x}+[\mathbf{y}]_{i}\in[0,1], along with [𝐜]i[0,1],i,[\mathbf{c}]_{i}\in[0,1],\forall\ i, it is easy to obtain

𝐱+[𝐲]i−[𝐜]i∈[−[𝐜]i,1−[𝐜]i]⊂[−1,1]⊂[−π2,π2].\mathbf{x}+[\mathbf{y}]_{i}-[\mathbf{c}]_{i}\in\big{[}-[\mathbf{c}]_{i},1-[\mathbf{c}]_{i}\big{]}\subset[-1,1]\subset\left[-\frac{\pi}{2},\frac{\pi}{2}\right].

Hence, sin(𝐱+[𝐲]i[𝐜]i)\sin\left(\mathbf{x}+[\mathbf{y}]_{i}-[\mathbf{c}]_{i}\right) is increasing w.r.t. [𝐲]i[\mathbf{y}]_{i} under the constraints for all ii. Thus, from the LL problem we have 𝐱+[𝐲]i[𝐜]i=[𝐜]i\mathbf{x}+[\mathbf{y}]_{i}-[\mathbf{c}]_{i}=-[\mathbf{c}]_{i}, i.e., [𝐲]i=𝐱,i.[\mathbf{y}]_{i}=-\mathbf{x},\forall\ i. Then the problem is to find the 𝐱\mathbf{x} to minimize

F\displaystyle F =𝐱a2+𝐲a2\displaystyle=\|\mathbf{x}-a\|^{2}+\left\|\mathbf{y}-a\right\|^{2}
=(𝐱a)2+n(𝐱a)2\displaystyle=(\mathbf{x}-a)^{2}+n(-\mathbf{x}-a)^{2}
=(n+1)𝐱2+2a(n1)𝐱+(n+1)a2.\displaystyle=(n+1)\mathbf{x}^{2}+2a(n-1)\mathbf{x}+(n+1)a^{2}.

Therefore, the optimal solution 𝐱=1n1+na\mathbf{x}^{*}=\frac{1-n}{1+n}a, [𝐲]i=𝐱,i[\mathbf{y}^{*}]_{i}=-\mathbf{x}^{*},\forall\ i, and by substituting 𝐱\mathbf{x}^{*} into FF, we have the optimal value F=4n1+na2F^{*}=\frac{4n}{1+n}a^{2}.
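The reduced single-variable problem above is easy to check numerically. The following sketch (ours, with the assumed illustrative values a = 2 and n = 3) confirms the closed-form minimizer and optimal value against a grid search:

```python
a, n = 2.0, 3  # illustrative values; any a and [c]_i in [0, 1] work

x_star = (1 - n) / (1 + n) * a
F_star = 4 * n / (1 + n) * a ** 2

# after the LL problem forces [y]_i = -x, the UL objective reduces to
def F(x):
    return (x - a) ** 2 + n * (-x - a) ** 2

assert abs(F(x_star) - F_star) < 1e-9  # closed form is consistent
grid_min = min(F(-5 + i * 1e-3) for i in range(10001))  # grid over [-5, 5]
assert grid_min >= F_star - 1e-9 and grid_min - F_star < 1e-4
```

With these values x* = −1 and F* = 12, and the grid minimum agrees up to the grid resolution.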

B.3 Pessimistic BLO in Section 5.1.3

For the pessimistic BLO we use the following example:

min𝐱max𝐲n𝐱a2𝐲a𝐜2\displaystyle\min_{\mathbf{x}\in\mathbb{R}}\max_{\mathbf{y}\in\mathbb{R}^{n}}\|\mathbf{x}-a\|^{2}-\|\mathbf{y}-a-\mathbf{c}\|^{2} (21)
 s.t. [𝐲]iargmin[𝐲]isin(𝐱+[𝐲]i[𝐜]i),i,\displaystyle\text{\ s.t.\ }\;[\mathbf{y}]_{i}\in\underset{[\mathbf{y}]_{i}\in\mathbb{R}}{\mathrm{argmin}}\;\sin(\mathbf{x}+[\mathbf{y}]_{i}-[\mathbf{c}]_{i}),\forall\ i,

where [𝐲]i[\mathbf{y}]_{i} denotes the ii-th component of 𝐲\mathbf{y}, while aa\in\mathbb{R} and 𝐜n\mathbf{c}\in\mathbb{R}^{n} are adjustable parameters. In our experiment, we set a=2a=2 and [𝐜]i=2 for any i=1,2,,n[\mathbf{c}]_{i}=2\text{ for any }i=1,2,\cdots,n, and consider the 2-dimensional case (LL dimension n=2n=2). The optimal solution to this problem is

(𝐱,𝐲)=(2+π2,4±π,4±π),(\mathbf{x}^{*},\mathbf{y}^{*})=\left(-2+\frac{\pi}{2},4\pm\pi,4\pm\pi\right),

and the optimal value is

F=(4+π2)22π2=74π24π+16.F^{*}=\left(-4+\frac{\pi}{2}\right)^{2}-2\pi^{2}=-\frac{7}{4}\pi^{2}-4\pi+16.

Derivation of the optimal solution and optimal value is as follows.

From the LL problem

[𝐲]iargmin[𝐲]isin(𝐱+[𝐲]i[𝐜]i),i,[\mathbf{y}]_{i}\in\underset{[\mathbf{y}]_{i}\in\mathbb{R}}{\mathrm{argmin}}\;\sin(\mathbf{x}+[\mathbf{y}]_{i}-[\mathbf{c}]_{i}),\forall\ i,

we have [𝐲]i{𝐱+[𝐜]iπ2+2kπ:k},i[\mathbf{y}]_{i}\in\left\{-\mathbf{x}+[\mathbf{c}]_{i}-\frac{\pi}{2}+2k\pi:k\in\mathbb{Z}\right\},\forall\ i. Then the problem is transformed into

min𝐱maxk(𝐱a)2n[𝐱+a(π2+2kπ)]2.\min_{\mathbf{x}\in\mathbb{R}}\max_{k\in\mathbb{Z}}\ (\mathbf{x}-a)^{2}-n\left[\mathbf{x}+a-\left(-\frac{\pi}{2}+2k\pi\right)\right]^{2}.

Denote Ck=π2+2kπC_{k}=-\frac{\pi}{2}+2k\pi. For a given 𝐱\mathbf{x}, suppose

𝐱+a[Ck^π,Ck^+π](k^).\mathbf{x}+a\in[C_{\widehat{k}}-\pi,C_{\widehat{k}}+\pi]\ (\widehat{k}\in\mathbb{Z}).

Then maximizing (𝐱a)2n[𝐱+a(π2+2kπ)]2(\mathbf{x}-a)^{2}-n\left[\mathbf{x}+a-\left(-\frac{\pi}{2}+2k\pi\right)\right]^{2} w.r.t. kk amounts to minimizing [𝐱+a(π2+2kπ)]2=(𝐱+aCk)2\left[\mathbf{x}+a-\left(-\frac{\pi}{2}+2k\pi\right)\right]^{2}=(\mathbf{x}+a-C_{k})^{2} w.r.t. kk, and

argmink(𝐱+aCk)2=k^.\mathop{\arg\min}_{k}\ (\mathbf{x}+a-C_{k})^{2}=\widehat{k}.

Thus, the problem is transformed into

min𝐱(𝐱a)2n(𝐱+aCk^)2, if 𝐱+a[Ck^π,Ck^+π],\min_{\mathbf{x}}\ (\mathbf{x}-a)^{2}-n\left(\mathbf{x}+a-C_{\widehat{k}}\right)^{2},\text{ if }\mathbf{x}+a\in[C_{\widehat{k}}-\pi,C_{\widehat{k}}+\pi],

i.e.,

min𝐱φ(𝐱)\min_{\mathbf{x}\in\mathbb{R}}\varphi(\mathbf{x})

where

φ(𝐱):=(𝐱a)2\displaystyle\varphi(\mathbf{x}):=(\mathbf{x}-a)^{2} n(𝐱+aCk^)2,\displaystyle-n\left(\mathbf{x}+a-C_{\widehat{k}}\right)^{2},
if 𝐱[a+Ck^π,a+Ck^+π](k^).\displaystyle\text{ if }\mathbf{x}\in[-a+C_{\widehat{k}}-\pi,-a+C_{\widehat{k}}+\pi]\ (\widehat{k}\in\mathbb{Z}).

It is easy to see that φ(𝐱)\varphi(\mathbf{x}) is continuous on \mathbb{R} (at the interval endpoints, φ(𝐱)=(𝐱a)2nπ2\varphi(\mathbf{x})=(\mathbf{x}-a)^{2}-n\pi^{2}), and

12φ(𝐱)=𝐱an(𝐱+aCk^),\frac{1}{2}\varphi^{\prime}(\mathbf{x})=\mathbf{x}-a-n(\mathbf{x}+a-C_{\widehat{k}}),

when 𝐱[a+Ck^π,a+Ck^+π](k^)\mathbf{x}\in[-a+C_{\widehat{k}}-\pi,-a+C_{\widehat{k}}+\pi]\ (\widehat{k}\in\mathbb{Z}).

Because we set n=2n=2 and a=2a=2, then if 𝐱[2+Ck^π,2+Ck^+π]\mathbf{x}\in[-2+C_{\widehat{k}}-\pi,-2+C_{\widehat{k}}+\pi],

φ(𝐱):=(𝐱2)22(𝐱+2Ck^)2,\varphi(\mathbf{x}):=(\mathbf{x}-2)^{2}-2\left(\mathbf{x}+2-C_{\widehat{k}}\right)^{2},
12φ(𝐱)=𝐱6+2Ck^.\frac{1}{2}\varphi^{\prime}(\mathbf{x})=-\mathbf{x}-6+2C_{\widehat{k}}.

Hence,

12φ(𝐱)[4+Ck^π,4+Ck^+π].\frac{1}{2}\varphi^{\prime}(\mathbf{x})\in\left[-4+C_{\widehat{k}}-\pi,-4+C_{\widehat{k}}+\pi\right].

Therefore,

  1. (1)

    if k^0\widehat{k}\leq 0, then

    12φ(𝐱)4+Ck^+π4+π2<0,\frac{1}{2}\varphi^{\prime}(\mathbf{x})\leq-4+C_{\widehat{k}}+\pi\leq-4+\frac{\pi}{2}<0,

    and φ(𝐱)\varphi(\mathbf{x}) is decreasing. In this case, Ck^π2C_{\widehat{k}}\leq-\frac{\pi}{2}, and 𝐱2+π2\mathbf{x}\leq-2+\frac{\pi}{2}.

  2. (2)

    if k^2\widehat{k}\geq 2, then

12φ′(𝐱)≥−4+Ck^−π≥−4+5π2>0,\frac{1}{2}\varphi^{\prime}(\mathbf{x})\geq-4+C_{\widehat{k}}-\pi\geq-4+\frac{5\pi}{2}>0,

    and φ(𝐱)\varphi(\mathbf{x}) is increasing. In this case, Ck^7π2C_{\widehat{k}}\geq\frac{7\pi}{2}, and 𝐱2+5π2\mathbf{x}~{}\geq~{}-2+\frac{5\pi}{2}.

  3. (3)

    when 𝐱[2+π2,2+5π2]\mathbf{x}\in\left[-2+\frac{\pi}{2},-2+\frac{5\pi}{2}\right], the corresponding k^=1\widehat{k}=1, and

    φ(𝐱):=(𝐱2)22(𝐱+23π2)2,\varphi(\mathbf{x}):=(\mathbf{x}-2)^{2}-2\left(\mathbf{x}+2-\frac{3\pi}{2}\right)^{2},
    12φ(𝐱)=𝐱6+3π.\frac{1}{2}\varphi^{\prime}(\mathbf{x})=-\mathbf{x}-6+3\pi.

    Thus, when 𝐱(2+π2,3π6)\mathbf{x}\in\left(-2+\frac{\pi}{2},3\pi-6\right), φ(𝐱)>0\varphi^{\prime}(\mathbf{x})>0, and φ(𝐱)\varphi(\mathbf{x}) is increasing; when 𝐱(3π6,2+5π2)\mathbf{x}\in\left(3\pi-6,-2+\frac{5\pi}{2}\right), φ(𝐱)<0\varphi^{\prime}(\mathbf{x})<0, and φ(𝐱)\varphi(\mathbf{x}) is decreasing.

To sum up, φ(𝐱)\varphi(\mathbf{x}) is increasing on [2+π2,3π6]\left[-2+\frac{\pi}{2},3\pi-6\right] and [2+5π2,+)\left[-2+\frac{5\pi}{2},+\infty\right), and decreasing on (,2+π2]\left(-\infty,-2+\frac{\pi}{2}\right] and [3π6,2+5π2]\left[3\pi-6,-2+\frac{5\pi}{2}\right]. Therefore, the optimal 𝐱\mathbf{x} occurs at either 2+π2-2+\frac{\pi}{2} or 2+5π2-2+\frac{5\pi}{2}. Since at these two local minimizers,

φ(2+π2)=(4+π2)22π2\displaystyle\varphi\left(-2+\frac{\pi}{2}\right)=\left(-4+\frac{\pi}{2}\right)^{2}-2\pi^{2}
<\displaystyle< φ(2+5π2)=(4+5π2)22π2,\displaystyle\varphi\left(-2+\frac{5\pi}{2}\right)=\left(-4+\frac{5\pi}{2}\right)^{2}-2\pi^{2},

we have 𝐱=2+π2\mathbf{x}^{*}=-2+\frac{\pi}{2}, and

F=(4+π2)22π2=74π24π+16.F^{*}=\left(-4+\frac{\pi}{2}\right)^{2}-2\pi^{2}=-\frac{7}{4}\pi^{2}-4\pi+16.

Since 𝐱=2+π2\mathbf{x}^{*}=-2+\frac{\pi}{2} lies in both intervals [2π2π,2π2+π]\left[-2-\frac{\pi}{2}-\pi,-2-\frac{\pi}{2}+\pi\right] and [2+3π2π,2+3π2+π]\left[-2+\frac{3\pi}{2}-\pi,-2+\frac{3\pi}{2}+\pi\right], we have Ck^=π2C_{\widehat{k}}=-\frac{\pi}{2} or 3π2\frac{3\pi}{2}, i.e., k^=0\widehat{k}=0 or 11. Thus,

[𝐲]i=𝐱+[𝐜]iπ2=4π[\mathbf{y}^{*}]_{i}=-\mathbf{x}^{*}+[\mathbf{c}]_{i}-\frac{\pi}{2}=4-\pi

or

[𝐲]i=𝐱+[𝐜]i+3π2=4+π[\mathbf{y}^{*}]_{i}=-\mathbf{x}^{*}+[\mathbf{c}]_{i}+\frac{3\pi}{2}=4+\pi

for i=1,2i=1,2. Hence,

(𝐱,𝐲)=(2+π2,4±π,4±π).(\mathbf{x}^{*},\mathbf{y}^{*})=\left(-2+\frac{\pi}{2},4\pm\pi,4\pm\pi\right).
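The case analysis above can be confirmed numerically. The following sketch (ours, using the paper's experimental values a = 2 and n = 2) evaluates the reduced objective φ, whose inner maximization over k selects the branch C_k closest to 𝐱 + a, and checks that the grid minimum matches F*:

```python
import math

a, n = 2.0, 2  # the values used in the paper's experiment

def phi(x):
    # reduced UL objective: the inner max over k amounts to picking the
    # branch C_k = -pi/2 + 2*k*pi closest to x + a (cf. the derivation above)
    inner = min((x + a - (-math.pi / 2 + 2 * k * math.pi)) ** 2 for k in range(-4, 5))
    return (x - a) ** 2 - n * inner

x_star = -2 + math.pi / 2
F_star = (-4 + math.pi / 2) ** 2 - 2 * math.pi ** 2  # = -7*pi^2/4 - 4*pi + 16

assert abs(phi(x_star) - F_star) < 1e-9
# x_star is a kink (nonsmooth) minimizer, so allow a slightly looser grid tolerance
grid_min = min(phi(-8 + i * 1e-3) for i in range(16001))  # grid over [-8, 8]
assert grid_min >= F_star - 1e-9 and grid_min - F_star < 1e-2
```

The grid also exhibits the second, higher-valued local minimizer near −2 + 5π/2, consistent with the comparison of the two candidate points above.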