Value-Function-based Sequential Minimization for Bi-level Optimization

Risheng Liu, Xuan Liu, Shangzhi Zeng, Jin Zhang, and Yixuan Zhang

R. Liu and X. Liu are with the DUT-RU International School of Information Science & Engineering, Dalian University of Technology, and the Key Laboratory for Ubiquitous Network and Service Software of Liaoning Province, Dalian, Liaoning, China. R. Liu is also with the Pazhou Lab, Guangzhou, Guangdong, China. E-mail: rsliu@dlut.edu.cn, liuxuan_16@126.com. S. Zeng is with the Department of Mathematics and Statistics, University of Victoria, Victoria, B.C., Canada. E-mail: zengshangzhi@uvic.ca. J. Zhang is with the Department of Mathematics, SUSTech International Center for Mathematics, Southern University of Science and Technology, National Center for Applied Mathematics Shenzhen, and Peng Cheng Laboratory, Shenzhen, Guangdong, China. (Corresponding author, E-mail: zhangj9@sustech.edu.cn.) Y. Zhang is with the Department of Applied Mathematics, the Hong Kong Polytechnic University, Hong Kong, China. E-mail: yi-xuan.zhang@connect.polyu.hk. Manuscript received April 19, 2005; revised August 26, 2015.

Abstract

Gradient-based Bi-Level Optimization (BLO) methods have been widely applied to handle modern learning tasks. However, most existing strategies are theoretically designed based on restrictive assumptions (e.g., convexity of the lower-level sub-problem), and are computationally impractical for high-dimensional tasks. Moreover, almost no gradient-based methods can solve BLO in challenging scenarios such as BLO with functional constraints and pessimistic BLO. In this work, by reformulating BLO into approximated single-level problems, we provide a new algorithm, named Bi-level Value-Function-based Sequential Minimization (BVFSM), to address the above issues. Specifically, BVFSM constructs a series of value-function-based approximations, and thus avoids the repeated calculation of recurrent gradients and Hessian inverses required by existing approaches, which is time-consuming especially for high-dimensional tasks. We also extend BVFSM to address BLO with additional functional constraints. More importantly, BVFSM can be used for the challenging pessimistic BLO, which has never been properly solved before. In theory, we prove the asymptotic convergence of BVFSM on these types of BLO, in which the restrictive lower-level convexity assumption is discarded. To the best of our knowledge, this is the first gradient-based algorithm that can solve different kinds of BLO (e.g., optimistic, pessimistic, and with constraints) with solid convergence guarantees. Extensive experiments verify the theoretical investigations and demonstrate the superiority of BVFSM on various real-world applications.

Index Terms:

Bi-level optimization, gradient-based method, value-function, sequential minimization, hyper-parameter optimization.

1 Introduction

Currently, a number of important machine learning and deep learning tasks can be captured by hierarchical models, such as hyper-parameter optimization [1, 2, 3, 4], neural architecture search [5, 6, 7], meta learning [8, 9, 10], Generative Adversarial Networks (GAN) [11, 12], reinforcement learning [13], image processing [14, 15, 16, 17], and so on. In general, these hierarchical models can be formulated as the following Bi-Level Optimization (BLO) problem [18, 19, 20]:

``\min\limits_{\mathbf{x}\in\mathcal{X}}"\ F(\mathbf{x},\mathbf{y}),\ \mathrm{\ s.t.\ }\mathbf{y}\in{\rm\mathcal{S}}(\mathbf{x}):=\mathop{\arg\min}_{\mathbf{y}}f(\mathbf{x},\mathbf{y}),

(1)

where $\mathbf{x}\in\mathcal{X}$ is the Upper-Level (UL) variable, $\mathbf{y}\in\mathbb{R}^{n}$ is the Lower-Level (LL) variable, the UL objective $F(\mathbf{x},\mathbf{y}):\mathcal{X}\times\mathbb{R}^{n}\rightarrow\mathbb{R}$ and the LL objective $f(\mathbf{x},\mathbf{y}):\mathbb{R}^{m}\times\mathbb{R}^{n}\rightarrow\mathbb{R}$ , are continuously differentiable and jointly continuous functions, and the UL constraint $\mathcal{X}\subset\mathbb{R}^{m}$ is a compact set. Nevertheless, the model in Eq. (1) cannot be solved directly. Some existing works only consider the case that the LL solution set $\mathcal{S}(\mathbf{x})$ is a singleton. However, since this may not be satisfied and $\mathbf{y}\in{\rm\mathcal{S}}(\mathbf{x})$ may not be unique, Eq. (1) is not a rigorous BLO model in mathematics, and thus we use the quotation marks around “min” to denote the slightly imprecise definition of the UL objective [18, 21].

To obtain a rigorous model, one usually focuses on an extreme case of BLO, namely the optimistic BLO [18]:

\min\limits_{\mathbf{x}\in\mathcal{X}}\min\limits_{\mathbf{y}\in\mathbb{R}^{n}}\ F(\mathbf{x},\mathbf{y}),\ \mathrm{\ s.t.\ }\mathbf{y}\in{\rm\mathcal{S}}(\mathbf{x}).

(2)

It can be found from the above expression that in optimistic BLO, $\mathbf{x}$ and $\mathbf{y}$ are in a cooperative relationship, aiming to minimize $F(\mathbf{x},\mathbf{y})$ at the same time. Therefore, it can be applied to a variety of learning and vision tasks, such as hyper-parameter optimization, meta learning, and so on. Sometimes we also need to study BLO problems with inequality constraints on the UL or LL for capturing constraints in real tasks. Another situation one can consider is the pessimistic BLO, which changes the $\min_{\mathbf{y}\in\mathbb{R}^{n}}$ in Eq. (2) into $\max_{\mathbf{y}\in\mathbb{R}^{n}}$ [18]. In the pessimistic case, $\mathbf{x}$ and $\mathbf{y}$ are in an adversarial relationship, and hence solving pessimistic BLO can be applied to adversarial learning and GAN.
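To make the difference between the two cases concrete, the following minimal sketch evaluates both on a toy problem whose LL solution set is not a singleton. The objectives `f` and `F`, the grid, and the helper names are illustrative assumptions, not taken from the paper, and brute-force search over a grid is used only because the toy problem is one-dimensional.

```python
# Toy illustration (assumed objectives, not from the paper).
# LL objective f(x, y) = (y^2 - 1)^2 has two minimizers, S(x) = {-1, +1}, for every x.
# UL objective F(x, y) = (x - y)^2.

def f(x, y):
    return (y * y - 1.0) ** 2

def F(x, y):
    return (x - y) ** 2

def ll_solution_set(x, grid, tol=1e-12):
    """Approximate S(x) by brute force over a finite grid of candidate y."""
    best = min(f(x, y) for y in grid)
    return [y for y in grid if f(x, y) - best <= tol]

def optimistic_value(x, grid):
    # Optimistic BLO: y cooperates with x, so take the best y in S(x).
    return min(F(x, y) for y in ll_solution_set(x, grid))

def pessimistic_value(x, grid):
    # Pessimistic BLO: y acts adversarially, so take the worst y in S(x).
    return max(F(x, y) for y in ll_solution_set(x, grid))

grid = [i / 100.0 for i in range(-200, 201)]   # y in [-2, 2], step 0.01
x = 0.5
print(ll_solution_set(x, grid))     # the two LL minimizers
print(optimistic_value(x, grid))    # selects y = +1
print(pessimistic_value(x, grid))   # selects y = -1
```

For the same `x`, the two formulations select different elements of $\mathcal{S}(\mathbf{x})$ and therefore report different UL values, which is why the model in Eq. (1) is ambiguous unless one of the two cases is specified.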

BLO is challenging to solve because, in the hierarchical structure, we need to solve $\mathcal{S}(\mathbf{x})$ governed by the fixed $\mathbf{x}$ , and select an appropriate $\mathbf{y}$ from $\mathcal{S}(\mathbf{x})$ to optimize the UL objective $F(\mathbf{x},\mathbf{y})$ , which makes $\mathbf{x}$ and $\mathbf{y}$ intricately dependent on each other, especially when $\mathcal{S}(\mathbf{x})$ is not a singleton [22]. In classical optimization, the KKT conditions are used to characterize the problem, but this approach does not scale to large machine learning tasks because it introduces too many multipliers [23, 24]. In the machine learning community, the mainstream methods are gradient-based, divided into Explicit Gradient-Based Methods (EGBMs) [2, 8, 25, 26, 5] and Implicit Gradient-Based Methods (IGBMs) [27, 9, 28] according to how they calculate the gradient needed for gradient descent. EGBMs compute it via unrolled differentiation, and IGBMs use the implicit function theorem. Both usually deal with the case where the LL solution set $\mathcal{S}(\mathbf{x})$ is a singleton, which is a quite restrictive condition in real applications. To relax it, Liu et al. [29, 30] proposed Bi-level Descent Aggregation (BDA), a new EGBM that removes this assumption and solves the model from the perspective of optimistic BLO.

Nevertheless, a bottleneck remains: the LL problems in real learning tasks are usually too complex for EGBMs and IGBMs. In theory, all EGBMs and IGBMs require convexity of the LL problem (Lower-Level Convexity, LLC for short), which is a strong condition that is not satisfied in many complicated real-world tasks. For example, since the chosen network usually has more than one layer, LLC is not satisfied, so the convergence of these methods cannot be guaranteed. In computation, EGBMs using unrolled differentiation require large time and space complexity, while IGBMs need to approximate the inverse of a matrix, which is also computationally expensive, especially when the dimension of the LL variable $\mathbf{y}$ is large and high-dimensional matrices and vectors are generated during the computation. Furthermore, it has rarely been discussed how to handle machine learning tasks by solving an optimization problem with functional constraints on the UL and LL, or by solving a pessimistic BLO. These problems are worth discussing, because pessimistic BLO can capture min-max bi-level structures, which suit GAN and similar models, and constrained optimization problems can represent learning tasks more accurately. Unfortunately, existing methods, including EGBMs and IGBMs, are not able to handle these problems.

To address the above limitations of existing methods, in this work, we propose a novel framework, named Bi-level Value-Function-based Sequential Minimization (BVFSM); a preliminary version of this work has been published in [1]. Specifically, we start by reformulating BLO into a simple bi-level optimization problem through the value-function [31, 32] of the UL objective. We then transform it into a single-level optimization problem with an inequality constraint through the value-function of the LL objective. Finally, by smoothing via regularization and adding the constraint to the objective through an auxiliary penalty or barrier function, the original problem is transformed into a sequence of unconstrained differentiable single-level problems, which can be solved by gradient descent. Thanks to the re-characterization via the value-function of the LL problem, BVFSM has a lower computational cost than existing gradient-based methods, and it can be applied under more relaxed conditions.

Specifically, unlike existing methods, BVFSM avoids solving an unrolled dynamic system by recurrent gradients or approximating the inverse of the Hessian during each iteration. Instead, we only need to calculate first-order gradients in each iteration, which reduces the computational complexity relative to the LL problem size by an order of magnitude compared to existing gradient-based BLO methods, and thus requires less time and space than EGBMs and IGBMs, especially for complex high-dimensional BLO. Besides, BVFSM maintains the level of complexity when BLO is applied to networks, making it possible to use BLO in existing networks and expanding its range of applications significantly. We illustrate the efficiency of BVFSM over existing methods through theoretical complexity analysis and various experimental results. In addition, inspired by the perspective of sequential minimization, we consider asymptotic convergence, which differs from the analysis of some previous gradient-based methods, and prove that the solutions to the sequence of approximate sub-problems converge to the true solution of the original BLO without the restrictive LLC assumption. BVFSM can also be extended to more complicated and challenging scenarios, namely BLO with functional constraints and pessimistic BLO. We regard pessimistic BLO as a new viewpoint for dealing with learning tasks, which, to the best of our knowledge, has not been solved by gradient-based methods before. In particular, we use a GAN experiment as an example to illustrate the application of our method to pessimistic BLO. We summarize our contributions as follows.

• By reformulating the original BLO as an approximated single-level problem based on the value-function, BVFSM departs from the traditional design of gradient-based methods and establishes a completely new sequential minimization algorithmic framework, which not only addresses optimistic BLO, but can also handle BLO in more challenging scenarios (i.e., BLO with functional constraints and pessimistic BLO), which have seldom been discussed.

• With the help of the value-function-based reformulation, BVFSM reduces the computational complexity by an order of magnitude compared to existing gradient-based BLO methods. BVFSM also avoids the repeated calculation of recurrent gradients and Hessian inverses, which is the core bottleneck of existing approaches for high-dimensional BLO problems. This advantage allows BVFSM to be applied effectively to large-scale networks and frontier tasks.

• We rigorously analyze the asymptotic convergence behavior of BVFSM on all types of BLO mentioned above. Our theoretical investigations remove the restrictive LLC condition, which is required in most existing works but is too demanding to satisfy in real-world applications.

• We conduct extensive experiments to verify our theoretical findings and demonstrate the superiority of BVFSM on various learning tasks. In particular, by formulating and solving GAN with BVFSM, we also show the application potential of our solution strategy for pessimistic BLO in complex learning problems.

2 Related Works

As aforementioned, BLO is challenging to solve due to its nested structures between UL and LL. Early methods can only handle models with not too many hyper-parameters. For example, to find appropriate parameters, the standard method is to use random search [33] through randomly sampling, or to use Bayesian optimization [34]. However, in real learning tasks, the dimension of hyper-parameters is very large, which early methods cannot deal with, so gradient-based methods are proposed. Here we first put forward a unified form of gradient-based methods, and then discuss the existing methods for further comparing them with our proposed method.

Existing gradient-based methods mainly focus on optimistic BLO, so we use the optimistic scenario to illustrate our algorithmic framework, and discuss how to use our method to solve pessimistic BLO in Section 3.4. For optimistic BLO, Eq. (2) shows that the UL variable $\mathbf{x}$ and the LL variable $\mathbf{y}$ affect each other in a nested relationship. To address this issue, one can transform it into the following form, where $\varphi(\mathbf{x})$ is the value-function of the sub-problem,

\min\limits_{\mathbf{x}\in\mathcal{X}}\ \varphi(\mathbf{x}),\quad\varphi(\mathbf{x}):=\mathop{\min}_{\mathbf{y}}\Big{\{}F(\mathbf{x},\mathbf{y}):\mathbf{y}\in{\rm\mathcal{S}}(\mathbf{x})\Big{\}}.

(3)

For a fixed $\mathbf{x}$ , the sub-problem defining $\varphi(\mathbf{x})$ is an inner simple BLO task, as it involves only the variable $\mathbf{y}$ , with $\mathbf{x}$ as a parameter. We then hope to minimize $\varphi(\mathbf{x})$ through gradient descent. However, as a value-function, $\varphi(\mathbf{x})$ is non-smooth, non-convex, possibly with jumps, and thus ill-conditioned, so we use a smooth function to approximate $\varphi(\mathbf{x})$ and $\frac{\partial\varphi(\mathbf{x})}{\partial\mathbf{x}}$ . Existing methods can be classified into two categories according to how they calculate $\frac{\partial\varphi(\mathbf{x})}{\partial\mathbf{x}}$ [20], i.e., Explicit Gradient-Based Methods (EGBMs), which derive the gradient by Automatic Differentiation (AD), and Implicit Gradient-Based Methods (IGBMs), which apply the implicit function theorem to the optimality conditions of LL problems.

Note that both EGBMs and IGBMs require $\mathbf{y}\in\mathcal{S}(\mathbf{x})$ to be unique (except BDA), denoted as $\mathbf{y}^{*}(\mathbf{x})$ , while for BDA, by integrating information from both the UL and LL sub-problem, $\mathbf{y}_{T}(\mathbf{x})$ is obtained by iterations to approach the appropriate $\mathbf{y}^{*}(\mathbf{x})$ . Hence, $\varphi(\mathbf{x})=F(\mathbf{x},\mathbf{y}^{*}(\mathbf{x}))$ , and therefore by the chain rule, the approximated $\frac{\partial\varphi(\mathbf{x})}{\partial\mathbf{x}}$ is split into direct and indirect gradients of $\mathbf{x}$ ,

\frac{\partial\varphi(\mathbf{x})}{\partial\mathbf{x}}=\frac{\partial F(\mathbf{x},\mathbf{y}^{*}(\mathbf{x}))}{\partial\mathbf{x}}+G(\mathbf{x}),

(4)

where $\frac{\partial F(\mathbf{x},\mathbf{y})}{\partial\mathbf{x}}$ is the direct gradient and $G(\mathbf{x})$ is the indirect gradient, $G(\mathbf{x})=\left(\frac{\partial\mathbf{y}^{*}(\mathbf{x})}{\partial\mathbf{x}}\right)^{\top}\frac{\partial F(\mathbf{x},\mathbf{y}^{*})}{\partial\mathbf{y}^{*}}$ . Then we need to compute $G(\mathbf{x})$ , in other words, the value of $\frac{\partial\mathbf{y}^{*}(\mathbf{x})}{\partial\mathbf{x}}$ .

Explicit Gradient-Based Methods (EGBMs). Maclaurin et al. [25] and Franceschi et al. [2, 8] first proposed Reverse Hyper-Gradient (RHG) and Forward Hyper-Gradient (FHG) respectively, to implement a dynamic system, under the LLC assumption. Given an initial point $\mathbf{y}_{0}$ , denote the iteration process to approach $\mathbf{y}^{*}(\mathbf{x})$ as $\mathbf{y}_{t+1}(\mathbf{x})=\Phi_{t}(\mathbf{x},\mathbf{y}_{t}(\mathbf{x})),\ t=0,1,\cdots,T-1,$ where $\Phi_{t}$ is a smooth mapping performed to solve $\mathbf{y}_{T}(\mathbf{x})$ and $T$ is the number of iterations. In particular, for example, if the process is gradient descent, $\Phi_{t}\left(\mathbf{x},\mathbf{y}_{t}(\mathbf{x})\right)=\mathbf{y}_{t}(\mathbf{x})-s_{t}\frac{\partial f\left(\mathbf{x},\mathbf{y}_{t}(\mathbf{x})\right)}{\partial\mathbf{y}_{t}},$ where $s_{t}>0$ is the corresponding step size. Then $\varphi(\mathbf{x})$ in Eq. (3) can be approximated by $\varphi(\mathbf{x})\approx\varphi_{T}(\mathbf{x})=F(\mathbf{x},\mathbf{y}_{T}(\mathbf{x}))$ . As $T$ increases, $\varphi_{T}(\mathbf{x})$ approaches $\varphi(\mathbf{x})$ generally, and a sequence of unconstrained minimization problems is obtained. Thus, gradient-based methods can be regarded as a kind of sequential-minimization-type scheme [35]. From the chain rule, we have $\frac{\partial\mathbf{y}_{t}(\mathbf{x})}{\partial\mathbf{x}}=\left(\frac{\partial\Phi_{t-1}(\mathbf{x},\mathbf{y}_{t-1})}{\partial\mathbf{y}_{t-1}}\right)^{\top}\frac{\partial\mathbf{y}_{t-1}(\mathbf{x})}{\partial\mathbf{x}}+\frac{\partial\Phi_{t-1}(\mathbf{x},\mathbf{y}_{t-1})}{\partial\mathbf{x}},$ and $\frac{\partial\mathbf{y}_{T}(\mathbf{x})}{\partial\mathbf{x}}$ can be obtained from this unrolled procedure. However, FHG and RHG require calculating the gradient of $\mathbf{x}$ composed of the first-order condition of LL problem by AD during the entire trajectory, so the computational cost owing to the time and space complexity is very high. 
To address this, Shaban et al. [26] proposed Truncated Reverse Hyper-Gradient (TRHG), which truncates the iteration and thus only needs to store the last $I$ iterations, reducing the computational load. Nevertheless, it additionally requires $f$ to be strongly convex, and the truncated path length is hard to determine. Liu et al. [5] instead used a difference of vectors to approximate the gradient, but the accuracy of this approximation is not guaranteed and the method has no theoretical guarantee. On the other hand, from the theoretical viewpoint, Liu et al. [29, 30] proposed Bi-level Descent Aggregation (BDA) to remove the assumption that the LL solution set is a singleton, which is a simplification of real-world problems. Specifically, BDA aggregates information from both the UL and the LL problem during iterations. However, the obstacles of LLC and computational cost still exist.
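The unrolled procedure described above can be sketched on a scalar toy problem, where the recursion for $\frac{\partial\mathbf{y}_{t}(\mathbf{x})}{\partial\mathbf{x}}$ is propagated forward alongside the LL gradient-descent iterates. The objectives, step size, and function names below are assumptions chosen so that the exact answer is known in closed form; this is an illustration of the general EGBM idea, not the implementation of any cited method.

```python
# Toy scalar problem (assumed for illustration):
#   LL: f(x, y) = 0.5 * (y - x)^2                 -> y*(x) = x
#   UL: F(x, y) = 0.5 * (y - 1)^2 + 0.5 * x^2
# so phi(x) = F(x, y*(x)) has the exact derivative 2x - 1.

def dF_dx(x, y):
    return x

def dF_dy(x, y):
    return y - 1.0

def df_dy(x, y):
    return y - x

def unrolled_hypergradient(x, y0=0.0, step=0.5, T=20):
    """Forward-mode differentiation through T gradient-descent steps on the LL problem."""
    y, dy_dx = y0, 0.0
    for _ in range(T):
        # Phi(x, y) = y - step * df/dy(x, y). For this particular f:
        #   dPhi/dy = 1 - step * d2f/dydy = 1 - step
        #   dPhi/dx =   - step * d2f/dydx = step
        dy_dx = (1.0 - step) * dy_dx + step
        y = y - step * df_dy(x, y)
    # Direct gradient plus the indirect gradient G(x), as in Eq. (4).
    return dF_dx(x, y) + dy_dx * dF_dy(x, y)

x = 2.0
print(unrolled_hypergradient(x, T=2))    # crude estimate from only two LL steps
print(unrolled_hypergradient(x, T=20))   # close to the exact value 2x - 1 = 3
```

In this scalar case the derivative carried along the trajectory is a single number. For an LL variable of dimension $n$ and a UL variable of dimension $m$ it becomes an $n\times m$ Jacobian updated at every step (or, in reverse mode, a stored trajectory of length $T$), which is the cost the text refers to.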

Implicit Gradient-Based Methods (IGBMs). IGBMs, or implicit differentiation [27, 9, 28], can be applied to obtain $\frac{\partial\mathbf{y}^{*}(\mathbf{x})}{\partial\mathbf{x}}$ under the LLC assumption. If $\frac{\partial^{2}f(\mathbf{x},\mathbf{y}^{*}(\mathbf{x}))}{\partial\mathbf{y}\partial\mathbf{y}}$ is additionally assumed to be invertible, then by applying the implicit function theorem to the optimality condition $\frac{\partial f(\mathbf{x},\mathbf{y}^{*}(\mathbf{x}))}{\partial\mathbf{y}}=0$ , the LL problem is replaced with an implicit equation, and $\frac{\partial\mathbf{y}^{*}(\mathbf{x})}{\partial\mathbf{x}}=-\left(\frac{\partial^{2}f(\mathbf{x},\mathbf{y}^{*}(\mathbf{x}))}{\partial\mathbf{y}\partial\mathbf{y}}\right)^{-1}\frac{\partial^{2}f(\mathbf{x},\mathbf{y}^{*}(\mathbf{x}))}{\partial\mathbf{y}\partial\mathbf{x}}.$ Unlike EGBMs, which rely on the first-order condition during the entire trajectory, IGBMs depend on the first-order condition only once, which decouples the computational burden from the solution trajectory of the LL problem. However, this leads to repeated computation of the inverse of the Hessian matrix, which is still a heavy burden. To avoid direct inversion, the Conjugate Gradient (CG) method [27, 9] solves a linear system instead, and the Neumann method [28] uses the Neumann series to approximate the Hessian inverse. With these methods, the computational requirements are reduced but still large, because the burden of computing the matrix inverse is shifted to computing Hessian-vector products. Additionally, the accuracy of solving a linear system depends heavily on its condition number [36], and ill-conditioning may result in numerical instabilities. In [9], a large quadratic term is added to the LL objective to eliminate the ill-conditioning, but this approach may change the solution set greatly.
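As a rough illustration of the implicit route, the sketch below computes the indirect gradient $G(\mathbf{x})$ for a small quadratic LL problem, replacing the explicit Hessian inverse by a truncated Neumann series built from Hessian-vector products. The matrix `A`, the vectors `b` and `c`, the step size, and the helper names are assumptions for this toy example; it is not the implementation of [28].

```python
# Toy problem (assumed): f(x, y) = 0.5 * y^T A y - x * b^T y with A symmetric positive definite,
# and F(x, y) = 0.5 * ||y - c||^2. Then d2f/dydy = A and d2f/dydx = -b.
A = [[2.0, 0.5], [0.5, 1.0]]
b = [1.0, -1.0]
c = [0.0, 1.0]

def matvec(M, v):
    return [sum(m * vi for m, vi in zip(row, v)) for row in M]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def neumann_solve(M, v, step=0.4, K=200):
    """Approximate M^{-1} v by step * sum_{k=0}^{K} (I - step * M)^k v.

    Only valid if the spectral radius of (I - step * M) is below 1."""
    term = list(v)
    total = list(v)
    for _ in range(K):
        Mt = matvec(M, term)                       # one Hessian-vector product
        term = [t - step * mt for t, mt in zip(term, Mt)]
        total = [s + t for s, t in zip(total, term)]
    return [step * s for s in total]

def indirect_gradient(x):
    # LL solution from the optimality condition A y = x b.
    y_star = neumann_solve(A, [x * bi for bi in b])
    grad_F_y = [yi - ci for yi, ci in zip(y_star, c)]
    # G(x) = (dy*/dx)^T dF/dy = -(d2f/dydx)^T A^{-1} dF/dy = b^T A^{-1} dF/dy (A is symmetric).
    q = neumann_solve(A, grad_F_y)
    return dot(b, q)

print(indirect_gradient(1.0))
```

Here $F$ does not depend on $\mathbf{x}$ directly, so the direct gradient is zero and $G(\mathbf{x})$ is the whole hyper-gradient. Each term of the series costs one Hessian-vector product, and the number of terms needed grows as the Hessian becomes ill-conditioned, which matches the concerns raised in the paragraph above.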

As discussed above, EGBMs and IGBMs need repeated calculations of recurrent gradients or Hessian inverses, leading to high time and space complexity in numerical computation, and require the LLC assumption in theory. When the dimension of $\mathbf{y}$ is very large, as is common in practical problems, the computational burden of repeatedly computing matrix-vector products may be too heavy to carry. In addition, the LLC assumption is not suitable for most complex real-world tasks.

3 The Proposed Algorithm

In this section, we illustrate our algorithmic framework, named Bi-level Value-Function-based Sequential Minimization (BVFSM). Our method also follows the idea of constructing a sequence of unconstrained minimization problems to approximate the original bi-level problem, but different from existing methods, BVFSM uses the re-characterization via the value-function of the LL problem. Thanks to this strategy, our algorithm is able to handle problems with complicated non-convex high-dimensional LL, which existing methods are not able to deal with.

3.1 Value-Function-based Single-level Reformulation

BVFSM designs a sequence of single-level unconstrained minimization problems to approximate the original problem through a value-function-based reformulation. We first present this procedure under the optimistic BLO case.

Recall the original optimistic BLO in Eq. (2) has been transformed into Eq. (3), and we hope to compute $G(\mathbf{x})$ in Eq. (4). Note that the difficulty of computing $\frac{\partial\varphi(\mathbf{x})}{\partial\mathbf{x}}$ comes from the ill-condition of $\varphi(\mathbf{x})$ , owing to the nested structure of the bi-level sub-problem for solving $\varphi(\mathbf{x})$ . Hence, we introduce the value-function of the LL problem $f^{*}(\mathbf{x}):=\min_{\mathbf{y}}f(\mathbf{x},\mathbf{y})$ to transform it into a single-level problem. Then the problem can be reformulated as

\varphi(\mathbf{x})=\mathop{\min}_{\mathbf{y}}\Big{\{}F(\mathbf{x},\mathbf{y}):f(\mathbf{x},\mathbf{y})\leq f^{*}(\mathbf{x})\Big{\}}.

(5)

However, the inequality constraint $f(\mathbf{x},\mathbf{y})\leq f^{*}(\mathbf{x})$ is still ill-posed, because it does not satisfy any standard regularity condition and $f^{*}(\mathbf{x})$ is non-smooth. In dealing with such difficulty, we approximate $f^{*}(\mathbf{x})$ with regularization:

f_{\mu}^{*}(\mathbf{x})=\min\limits_{\mathbf{y}}\left\{f(\mathbf{x},\mathbf{y})+\frac{\mu}{2}\|\mathbf{y}\|^{2}\right\},

(6)

where $\frac{\mu}{2}\|\mathbf{y}\|^{2}$ ( $\mu>0$ ) is the regularization term.

We further add an auxiliary function of the inequality constraints to the objective, and obtain

\displaystyle\varphi_{\mu,\theta,\sigma}(\mathbf{x})\!=\!\min\limits_{\mathbf{y}}

\displaystyle\left\{F(\mathbf{x},\mathbf{y})\!+\!P_{\sigma}\!\left(f(\mathbf{x},\mathbf{y})-f_{\mu}^{*}(\mathbf{x})\right)\!+\!\frac{\theta}{2}\|\mathbf{y}\|^{2}\right\},

(7)

where $(\mu,\theta,\sigma)>0$ , $\frac{\theta}{2}\|\mathbf{y}\|^{2}$ is the regularization term, and $P_{\sigma}:\mathbb{R}\rightarrow\mathbb{\overline{R}}$ (where $\mathbb{\overline{R}}=\mathbb{R}\cup\{\infty\}$ ) is the selected auxiliary function for the sequential unconstrained minimization method with parameter $\sigma$ , which will be defined in Eq. (8) and discussed in detail next. This reformulation changes the constrained problem Eq. (5) into a sequence of unconstrained problems Eq. (7) under different parameters. The regularization terms in Eq. (6) and Eq. (7) guarantee the uniqueness of the solution to these two problems, which is essential for the differentiability of $\varphi_{\mu,\theta,\sigma}(\mathbf{x})$ , and will be discussed in Remark 1 of Section 3.2. Experiments in Section 5.1.1 also demonstrate that introducing the regularization terms for differentiability, which avoids possible jumps, improves the computational stability.
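The following sketch evaluates the approximation in Eq. (7) on a scalar toy problem and compares it with the exact value-function as the parameters shrink. The objectives, the choice of the quadratic penalty for $P_{\sigma}$ (one of the options listed in Table I), the parameter schedule, and the use of bisection for the inner one-dimensional problem are all assumptions made for illustration; the paper solves the inner problems by gradient descent.

```python
# Toy scalar problem (assumed): f(x, y) = 0.5 * (y - x)^2, F(x, y) = 0.5 * (y - 1)^2 + 0.5 * x^2.
# The LL solution is y*(x) = x, so the exact value-function is phi(x) = F(x, x).

def f(x, y):
    return 0.5 * (y - x) ** 2

def F(x, y):
    return 0.5 * (y - 1.0) ** 2 + 0.5 * x ** 2

def f_mu_star(x, mu):
    # Eq. (6) in closed form for this f: the unique minimizer is z = x / (1 + mu).
    z = x / (1.0 + mu)
    return f(x, z) + 0.5 * mu * z ** 2

def phi_approx(x, mu, theta, sigma):
    """Eq. (7) with the quadratic penalty P_sigma(w) = max(w, 0)^2 / (2 * sigma)."""
    fstar = f_mu_star(x, mu)

    def objective(y):
        w = max(f(x, y) - fstar, 0.0)
        return F(x, y) + w * w / (2.0 * sigma) + 0.5 * theta * y * y

    def d_objective(y):
        w = max(f(x, y) - fstar, 0.0)
        return (y - 1.0) + (w / sigma) * (y - x) + theta * y

    # For this toy problem the objective is strongly convex in y, so its derivative
    # is increasing and the unique minimizer can be located by bisection.
    # The bracket [-10, 10] is an assumption that holds for moderate |x|.
    lo, hi = -10.0, 10.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if d_objective(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return objective(0.5 * (lo + hi))

x = 2.0
exact = F(x, x)                       # phi(2) = 2.5
for k in range(1, 5):
    eps = 10.0 ** (-k)                # mu = theta = sigma = eps, shrinking together
    print(eps, phi_approx(x, eps, eps, eps), exact)
```

On this example the approximate values move toward the exact one as the parameters decrease, but slowly, which is consistent with treating Eq. (7) as a sequence of problems rather than a single one.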

The sequential unconstrained minimization method is mainly used for solving constrained nonlinear programming by changing the problem into a sequence of unconstrained minimization problems [35, 37, 38]. To be specific, we add to the objective a selected auxiliary function of the constraints with a sequence of parameters, and obtain a series of unconstrained problems. The convergence of parameters makes the sequential unconstrained problems converge to the original constrained problem, leading to the convergence of the solution. Based on the property of auxiliary functions, they are divided mainly into two types, barrier functions and penalty functions [39, 40], whose definitions are provided here.

Definition 1

A continuous, differentiable, and non-decreasing function $\rho:\mathbb{R}\rightarrow\mathbb{\overline{R}}$ is called a standard barrier function if $\rho(\omega;\sigma)$ satisfies $\rho(\omega;\sigma)\geq 0$ and $\lim_{\sigma\rightarrow 0}\rho(\omega;\sigma)=0$ , when $\omega<0$ ; and $\rho(\omega;\sigma)\rightarrow\infty$ when $\omega\rightarrow 0$ . It is called a standard penalty function if it satisfies $\rho(\omega;\sigma)=0$ when $\omega\leq 0$ ; and $\rho(\omega;\sigma)>0$ and $\lim_{\sigma\rightarrow 0}\rho(\omega;\sigma)=\infty$ when $\omega>0$ . Here $\sigma>0$ is the barrier or penalty parameter. In addition, if $\rho(\omega;\sigma^{(1)})$ is a standard barrier function, then $\rho(\omega-\sigma^{(2)};\sigma^{(1)})$ is called a modified barrier function $(\sigma^{(1)},\sigma^{(2)}>0)$ .

For the simplicity of expression later, we denote the function $P_{\sigma}$ in Eq. (7) to be

P_{\sigma}(\omega):=\left\{\begin{aligned} &\rho(\omega;\sigma),\text{if}\ \rho\ \text{ is a penalty}\\ &\qquad\qquad\qquad\text{ or standard barrier function},\\ &\rho(\omega-\sigma^{(2)};\sigma^{(1)}),\text{if}\ \rho\ \text{ is a modified barrier function}.\end{aligned}\right.

(8)

Here for a modified barrier function, $\sigma^{(2)}>0$ is to guarantee that in Eq. (7), $f(\mathbf{x},\mathbf{y})-f_{\mu}^{*}(\mathbf{x})-\sigma^{(2)}<0$ , and the barrier function is well-defined.

Classical examples of auxiliary functions are the quadratic penalty function, inverse barrier function and log barrier function [41, 40]. There are also some other popular examples, such as the polynomial penalty function [39] and truncated log barrier function [42]. These examples of standard penalty and barrier functions are listed in Table I. Note that we need the smoothness of $\varphi_{\mu,\theta,\sigma}(\mathbf{x})$ , and will calculate the gradient of $P_{\sigma}$ afterwards, so we choose smooth auxiliary functions rather than non-smooth exact penalty functions. Note that when $\rho$ is a modified barrier function, $\sigma$ of $P_{\sigma}(\omega)$ in Eq. (8) has two components $\sigma^{(1)}$ and $\sigma^{(2)}$ , and the specific form of $P_{\sigma}(\omega)$ comes from substituting $\omega-\sigma^{(2)}$ and $\sigma^{(1)}$ for $\omega$ and $\sigma$ in Table I.

TABLE I: Some available standard penalty and barrier functions.

Penalty functions
Quadratic	$\rho\!\left(\omega;\sigma\right)=\frac{1}{2\sigma}\left(\omega^{+}\right)^{2}$ , where $\omega^{+}=\max\left\{\omega,0\right\}$
Polynomial	$\rho\!\left(\omega;\sigma\right)=\frac{1}{q\sigma}\left(\omega^{+}\right)^{q}$ , where $q$ is a positive integer chosen such that $\rho\!\left(\omega;\sigma\right)$ is differentiable
Barrier functions
Inverse	$\rho\!\left(\omega;\sigma\right)=\left\{\begin{aligned} &-\frac{\sigma}{\omega},&\omega<0\\ &\infty,&\omega\geq 0\end{aligned}\right.$
Truncated Log	$\rho\!\left(\omega;\sigma\right)=\left\{\begin{aligned} &-\sigma\left(\log\left(-\omega\right)+\beta_{1}\right),&-\kappa\leq\omega<0\\ &-\sigma\left(\beta_{2}+\frac{\beta_{3}}{\omega^{2}}+\frac{\beta_{4}}{\omega}\right),&\omega<-\kappa\\ &\infty,&\omega\geq 0\end{aligned}\right.$ , where $0<\kappa\leq 1$ , and $\beta_{1},\beta_{2},\beta_{3},\beta_{4}$ are chosen such that $\rho\!\left(\omega;\sigma\right)\geq 0$ and $\rho$ is twice differentiable
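Three of the entries in Table I, together with the modified barrier of Definition 1, can be written down directly. The sketch below is only meant to make the limiting behavior in Definition 1 visible numerically; the function names and the sample values of $\omega$ and $\sigma$ are assumptions, and the truncated log barrier is omitted because its constants $\beta_{1},\dots,\beta_{4}$ are not specified here.

```python
import math

def quadratic_penalty(w, sigma):
    """rho(w; sigma) = max(w, 0)^2 / (2 * sigma)."""
    return max(w, 0.0) ** 2 / (2.0 * sigma)

def polynomial_penalty(w, sigma, q=4):
    """rho(w; sigma) = max(w, 0)^q / (q * sigma); an integer q >= 2 keeps it differentiable at 0."""
    return max(w, 0.0) ** q / (q * sigma)

def inverse_barrier(w, sigma):
    """rho(w; sigma) = -sigma / w for w < 0, and +infinity otherwise."""
    return -sigma / w if w < 0.0 else math.inf

def modified(barrier, w, sigma1, sigma2):
    """Modified barrier from Definition 1: shift the argument by sigma2 > 0."""
    return barrier(w - sigma2, sigma1)

for sigma in (1.0, 0.1, 0.01):
    print(sigma,
          quadratic_penalty(0.5, sigma),               # grows as sigma -> 0 (constraint violated)
          quadratic_penalty(-0.5, sigma),              # stays 0 (constraint satisfied)
          inverse_barrier(-0.5, sigma),                # shrinks to 0 as sigma -> 0
          modified(inverse_barrier, 0.0, sigma, 0.2))  # finite on the boundary w = 0
```

The last column shows why the shift $\sigma^{(2)}$ matters: a standard barrier is infinite at $\omega=0$, whereas the modified barrier remains finite there, so points with $f(\mathbf{x},\mathbf{y})=f_{\mu}^{*}(\mathbf{x})$ are still admissible.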

3.2 Sequential Minimization Strategy

From the discussion above, we then hope to solve

\min\limits_{\mathbf{x}\in\mathcal{X}}\ \varphi_{\mu,\theta,\sigma}(\mathbf{x}),

(9)

with $\varphi_{\mu,\theta,\sigma}(\mathbf{x})$ in Eq. (7). First denote

\mathbf{z}^{*}_{\mu}(\mathbf{x})=\underset{\mathbf{y}}{\mathrm{argmin}}\left\{f(\mathbf{x},\mathbf{y})+\frac{\mu}{2}\|\mathbf{y}\|^{2}\right\},

(10)

\displaystyle\mathbf{y}_{\mu,\theta,\sigma}^{*}(\mathbf{x})=\underset{\mathbf{y}}{\mathrm{argmin}}

\displaystyle\left\{F(\mathbf{x},\mathbf{y})+P_{\sigma}\!\left(f(\mathbf{x},\mathbf{y})-f_{\mu}^{*}(\mathbf{x})\right)+\frac{\theta}{2}\|\mathbf{y}\|^{2}\right\}.

(11)

The following proposition gives the smoothness of $\varphi_{\mu,\theta,\sigma}\left(\mathbf{x}\right)$ and the formula for computing $\frac{\partial\varphi_{\mu,\theta,\sigma}\left(\mathbf{x}\right)}{\partial\mathbf{x}}$ or $G(\mathbf{x})$ , which serves as the ground for our algorithm.

Proposition 1 (Calculation of $G(\mathbf{x})$ )

\displaystyle G(\mathbf{x})=\frac{\partial P_{\sigma}\!\left(f\left(\mathbf{x},\mathbf{y}\right)-f_{\mu}^{*}(\mathbf{x})\right)}{\partial\mathbf{x}}\big{|}_{\mathbf{y}=\mathbf{y}_{\mu,\theta,\sigma}^{*}(\mathbf{x})},

(12)

where $\begin{aligned} f_{\mu}^{*}(\mathbf{x})=f\!\left(\mathbf{x},\mathbf{z}^{*}_{\mu}(\mathbf{x})\right)+\frac{\mu}{2}\|\mathbf{z}^{*}_{\mu}(\mathbf{x})\|^{2},\end{aligned}$ and $\frac{\partial f_{\mu}^{*}(\mathbf{x})}{\partial\mathbf{x}}=\frac{\partial f\!\left(\mathbf{x},\mathbf{y}\right)}{\partial\mathbf{x}}\big{|}_{\mathbf{y}=\mathbf{z}^{*}_{\mu}(\mathbf{x})}.$
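As a numerical sanity check of Eq. (12), the sketch below computes the gradient of $\varphi_{\mu,\theta,\sigma}$ using only first-order derivatives of $F$ and $f$, and compares it with a central finite difference. The toy objectives, the quadratic penalty, the parameter values, the step sizes, and the function names are assumptions for illustration, and the total gradient is taken to be the direct gradient $\frac{\partial F}{\partial\mathbf{x}}$ plus $G(\mathbf{x})$, following the decomposition in Eq. (4).

```python
# Toy scalar problem (assumed): f(x, y) = 0.5 * (y - x)^2, F(x, y) = 0.5 * (y - 1)^2 + 0.5 * x^2,
# with the quadratic penalty P_sigma(w) = max(w, 0)^2 / (2 * sigma).
MU, THETA, SIGMA = 0.01, 0.01, 0.1

def f(x, y):
    return 0.5 * (y - x) ** 2

def df_dx(x, y):
    return x - y

def df_dy(x, y):
    return y - x

def F(x, y):
    return 0.5 * (y - 1.0) ** 2 + 0.5 * x ** 2

def solve_z(x, steps=200, lr=0.5):
    """Eq. (10): gradient descent on f(x, y) + MU / 2 * y^2."""
    z = 0.0
    for _ in range(steps):
        z -= lr * (df_dy(x, z) + MU * z)
    return z

def solve_y(x, fstar, y0, steps=4000, lr=0.01):
    """Eq. (11): gradient descent on F + P_sigma(f - f_mu^*) + THETA / 2 * y^2."""
    y = y0
    for _ in range(steps):
        w = max(f(x, y) - fstar, 0.0)
        y -= lr * ((y - 1.0) + (w / SIGMA) * df_dy(x, y) + THETA * y)
    return y

def phi_and_grad(x):
    z = solve_z(x)
    fstar = f(x, z) + 0.5 * MU * z * z
    y = solve_y(x, fstar, y0=z)
    w = max(f(x, y) - fstar, 0.0)
    value = F(x, y) + w * w / (2.0 * SIGMA) + 0.5 * THETA * y * y
    # Eq. (12): G(x) = P'_sigma(w) * (df/dx at y - df_mu^*/dx), with df_mu^*/dx = df/dx at y = z.
    G = (w / SIGMA) * (df_dx(x, y) - df_dx(x, z))
    grad = x + G                      # direct gradient dF/dx = x, plus G(x)
    return value, grad

x, h = 2.0, 1e-5
value, grad = phi_and_grad(x)
fd = (phi_and_grad(x + h)[0] - phi_and_grad(x - h)[0]) / (2.0 * h)
print(grad, fd)   # the two numbers should agree closely
```

No second-order derivative of $f$ and no differentiation through the inner iterations is used: the two inner solves only provide the points $\mathbf{z}^{*}_{\mu}(\mathbf{x})$ and $\mathbf{y}^{*}_{\mu,\theta,\sigma}(\mathbf{x})$ at which first-order partial derivatives are evaluated.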

Proof.

We first prove that for any $\bar{\mathbf{x}}\in\mathcal{X}$ , $f(\mathbf{x},\mathbf{y})+\frac{\mu}{2}\|\mathbf{y}\|^{2}$ is level-bounded in $\mathbf{y}$ locally uniformly in $\bar{\mathbf{x}}\in\mathcal{X}$ (see [29, Definition 3]). That is, for any $c\in\mathbb{R}$ , there exist $\delta>0$ and a bounded set $\mathcal{B}\subset\mathbb{R}^{n}$ , such that $\left\{\mathbf{y}\in\mathbb{R}^{n}:f(\mathbf{x},\mathbf{y})+\frac{\mu}{2}\|\mathbf{y}\|^{2}\leq c\right\}\subset\mathcal{B},$ for all $\mathbf{x}\in\mathcal{B}_{\delta}(\bar{\mathbf{x}})\cap\mathcal{X}$ , where $\mathcal{B}_{\delta}(\bar{\mathbf{x}})$ denotes the open ball with center at $\bar{\mathbf{x}}$ and radius $\delta$ , i.e., $\mathcal{B}_{\delta}(\bar{\mathbf{x}})=\{\hat{\mathbf{x}}\in\mathcal{X}:\|\bar{\mathbf{x}}-\hat{\mathbf{x}}\|<\delta\}$ . Assume by contradiction that the above does not hold. Then there exist sequences $\{\mathbf{x}_{k}\}$ and $\{\mathbf{y}_{k}\}$ satisfying $\mathbf{x}_{k}\rightarrow\bar{\mathbf{x}}$ and $\|\mathbf{y}_{k}\|\rightarrow+\infty$ , such that $f(\mathbf{x}_{k},\mathbf{y}_{k})+\frac{\mu}{2}\|\mathbf{y}_{k}\|^{2}\leq c.$ Since $f(\mathbf{x},\mathbf{y})$ is bounded below, $\|\mathbf{y}_{k}\|\rightarrow\infty$ implies $f(\mathbf{x}_{k},\mathbf{y}_{k})+\frac{\mu}{2}\|\mathbf{y}_{k}\|^{2}\rightarrow\infty$ , which contradicts $f(\mathbf{x}_{k},\mathbf{y}_{k})+\frac{\mu}{2}\|\mathbf{y}_{k}\|^{2}\leq c$ with $c\in\mathbb{R}$ .

Hence, since $\bar{\mathbf{x}}$ is arbitrary, $f(\mathbf{x},\mathbf{y})+\frac{\mu}{2}\|\mathbf{y}\|^{2}$ is level-bounded in $\mathbf{y}$ locally uniformly in $\mathbf{x}\in\mathcal{X}$, and the inf-compactness condition in [43, Theorem 4.13] therefore holds for $f(\mathbf{x},\mathbf{y})+\frac{\mu}{2}\|\mathbf{y}\|^{2}$. Since ${\mathrm{argmin}}_{\mathbf{y}\in\mathbb{R}^{n}}\left\{f(\mathbf{x},\mathbf{y})+\frac{\mu}{2}\|\mathbf{y}\|^{2}\right\}$ is a singleton, it follows from [43, Theorem 4.13, Remark 4.14] that

\displaystyle\frac{\partial f_{\mu}^{*}(\mathbf{x})}{\partial\mathbf{x}}=\frac{\partial\left(f(\mathbf{x},\mathbf{y})+\frac{\mu}{2}\|\mathbf{y}\|^{2}\right)}{\partial\mathbf{x}}\big{|}_{\mathbf{y}=\mathbf{z}_{\mu}^{*}(\mathbf{x})}=\frac{\partial f(\mathbf{x},\mathbf{y})}{\partial\mathbf{x}}\big{|}_{\mathbf{y}=\mathbf{z}_{\mu}^{*}(\mathbf{x})}.

Next, from definitions of penalty and barrier functions (Definition 1), we have $\rho(\omega;\sigma)\geq 0$ for any $\omega$ , so $P_{\sigma}(\omega)\geq 0$ holds. Then, since $F(\mathbf{x},\mathbf{y})$ is assumed to be bounded below, similar to $f(\mathbf{x},\mathbf{y})+\frac{\mu}{2}\|\mathbf{y}\|^{2}$ , the inf-compactness condition in [43, Theorem 4.13] also holds for $F(\mathbf{x},\mathbf{y})+P_{\sigma}\!\left(f(\mathbf{x},\mathbf{y})-f_{\mu}^{*}(\mathbf{x})\right)+\frac{\theta}{2}\|\mathbf{y}\|^{2}$ . Combining with the fact that

{\mathrm{argmin}}_{\mathbf{y}\in\mathbb{R}^{n}}\left\{F(\mathbf{x},\mathbf{y})+P_{\sigma}\!\left(f(\mathbf{x},\mathbf{y})-f_{\mu}^{*}(\mathbf{x})\right)+\frac{\theta}{2}\|\mathbf{y}\|^{2}\right\}

is a singleton, [43, Theorem 4.13, Remark 4.14] shows that

		$\displaystyle\frac{\partial\varphi_{\mu,\theta,\sigma}\left(\mathbf{x}\right)}{\partial\mathbf{x}}$
		$\displaystyle=\frac{\partial\left(F(\mathbf{x},\mathbf{y})+P_{\sigma}\!\left(f(\mathbf{x},\mathbf{y})-f_{\mu}^{*}(\mathbf{x})\right)+\frac{\theta}{2}\|\mathbf{y}\|^{2}\right)}{\partial\mathbf{x}}\big{|}_{\mathbf{y}=\mathbf{y}_{\mu,\theta,\sigma}^{*}(\mathbf{x})}$
		$\displaystyle=\left(\frac{\partial F(\mathbf{x},\mathbf{y})}{\partial\mathbf{x}}+\frac{\partial P_{\sigma}\!\left(f(\mathbf{x},\mathbf{y})-f_{\mu}^{*}(\mathbf{x})\right)}{\partial\mathbf{x}}\right)\big{|}_{\mathbf{y}=\mathbf{y}_{\mu,\theta,\sigma}^{*}(\mathbf{x})}.$

Therefore, the conclusion in Eq. (12) follows immediately. ∎
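The identity $\frac{\partial f_{\mu}^{*}(\mathbf{x})}{\partial\mathbf{x}}=\frac{\partial f(\mathbf{x},\mathbf{y})}{\partial\mathbf{x}}\big|_{\mathbf{y}=\mathbf{z}^{*}_{\mu}(\mathbf{x})}$ used in the proof can be checked numerically. The following minimal Python sketch assumes a toy scalar LL objective $f(x,y)=\frac{1}{2}(y-x)^{2}$, which is our own illustrative choice and not an example from the paper; all function names are hypothetical. It compares the formula with a central finite difference of $f_{\mu}^{*}$.

```python
# Toy check of the value-function gradient identity (illustrative assumption:
# f(x, y) = 0.5 * (y - x)^2 with scalar x and y).

def f(x, y):
    return 0.5 * (y - x) ** 2

def df_dx(x, y):
    return x - y

def df_dy(x, y):
    return y - x

def z_star(x, mu, steps=2000, lr=0.1):
    # Gradient descent on the regularized LL objective f(x, .) + mu/2 * ||.||^2.
    z = 0.0
    for _ in range(steps):
        z -= lr * (df_dy(x, z) + mu * z)
    return z

def f_mu_star(x, mu):
    # Regularized value function evaluated at its (approximate) minimizer.
    z = z_star(x, mu)
    return f(x, z) + 0.5 * mu * z ** 2

x, mu, eps = 1.5, 0.1, 1e-5
envelope_grad = df_dx(x, z_star(x, mu))  # partial derivative of f in x at y = z*
fd_grad = (f_mu_star(x + eps, mu) - f_mu_star(x - eps, mu)) / (2 * eps)
closed_form = mu * x / (1.0 + mu)  # derivative of mu * x^2 / (2 * (1 + mu))
```

For this toy problem, $z^{*}_{\mu}(x)=x/(1+\mu)$ and $f_{\mu}^{*}(x)=\mu x^{2}/(2(1+\mu))$ are available in closed form, so all three quantities should agree up to numerical error.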

Remark 1

In Proposition 1 we require the uniqueness of $\mathbf{z}^{*}_{\mu}(\mathbf{x})$ and $\mathbf{y}_{\mu,\theta,\sigma}^{*}(\mathbf{x})$ to guarantee the differentiability of $\varphi_{\mu,\theta,\sigma}(\mathbf{x})$. This can be achieved under conditions much weaker than convexity, such as convexity only on a level set. We start with the uniqueness of $\mathbf{z}^{*}_{\mu}(\mathbf{x})$. For any given $\mathbf{x}\in\mathcal{X}$, consider a function $f(\mathbf{x},\mathbf{y})$ for which there exists a constant $c>\min_{\mathbf{y}}f(\mathbf{x},\mathbf{y})$ such that $f(\mathbf{x},\mathbf{y})$ is convex in $\mathbf{y}$ on the level set $\{\mathbf{y}:f(\mathbf{x},\mathbf{y})\leq c\}$. Suppose $\hat{\mathbf{y}}$ is a minimizer of $f(\mathbf{x},\mathbf{y})$. Then $\inf_{\mathbf{y}}\{f(\mathbf{x},\mathbf{y})+\frac{\mu}{2}\|\mathbf{y}\|^{2}\}\leq f(\mathbf{x},\hat{\mathbf{y}})+\frac{\mu}{2}\|\hat{\mathbf{y}}\|^{2}=\min_{\mathbf{y}}f(\mathbf{x},\mathbf{y})+\frac{\mu}{2}\|\hat{\mathbf{y}}\|^{2}<c$ for a sufficiently small $\mu>0$. Since $f(\mathbf{x},\mathbf{z}^{*}_{\mu}(\mathbf{x}))\leq f(\mathbf{x},\mathbf{z}^{*}_{\mu}(\mathbf{x}))+\frac{\mu}{2}\|\mathbf{z}^{*}_{\mu}(\mathbf{x})\|^{2}<c$, the point $\mathbf{z}^{*}_{\mu}(\mathbf{x})$ lies inside the level set $\{\mathbf{y}:f(\mathbf{x},\mathbf{y})\leq c\}$ on which $f(\mathbf{x},\mathbf{y})$ is convex. Hence, $f(\mathbf{x},\mathbf{y})+\frac{\mu}{2}\|\mathbf{y}\|^{2}$ is strictly convex on $\{\mathbf{y}:f(\mathbf{x},\mathbf{y})\leq c\}$, and the uniqueness of $\mathbf{z}^{*}_{\mu}(\mathbf{x})$ follows.

As for the uniqueness of $\mathbf{y}_{\mu,\theta,\sigma}^{*}(\mathbf{x})$ , suppose given $\mathbf{x}$ , there exist constants $c_{1}>\min_{\mathbf{y}\in\mathcal{S}(\mathbf{x})}F(\mathbf{x},\mathbf{y})$ and $c_{2}>\min_{\mathbf{y}}f(\mathbf{x},\mathbf{y})$ such that $F$ and $f$ are convex in $\mathbf{y}$ on the set $\{\mathbf{y}:F(\mathbf{x},\mathbf{y})\leq c_{1}\text{ and }f(\mathbf{x},\mathbf{y})\leq c_{2}\}$ . If we select a non-decreasing and convex auxiliary function $P_{\sigma}(\cdot)$ (such as those in Table I), then $P_{\sigma}\!\left(f(\mathbf{x},\mathbf{y})-f_{\mu}^{*}(\mathbf{x})\right)$ is convex in $\mathbf{y}$ on the set (see [44] Proposition 1.54). Or simply if there exists $c>\min_{\mathbf{y}}F(\mathbf{x},\mathbf{y})+P_{\sigma}\!\left(f(\mathbf{x},\mathbf{y})-f_{\mu}^{*}(\mathbf{x})\right)$ such that $F(\mathbf{x},\mathbf{y})+P_{\sigma}\!\left(f(\mathbf{x},\mathbf{y})-f_{\mu}^{*}(\mathbf{x})\right)$ is convex on the set $\{\mathbf{y}:F(\mathbf{x},\mathbf{y})+P_{\sigma}\!\left(f(\mathbf{x},\mathbf{y})-f_{\mu}^{*}(\mathbf{x})\right)\leq c\}$ , then it derives the strict convexity of $F(\mathbf{x},\mathbf{y})+P_{\sigma}\!\left(f(\mathbf{x},\mathbf{y})-f_{\mu}^{*}(\mathbf{x})\right)+\frac{\theta}{2}\|\mathbf{y}\|^{2}$ in $\mathbf{y}$ on the set similarly, and the uniqueness of $\mathbf{y}_{\mu,\theta,\sigma}^{*}(\mathbf{x})$ follows.

Proposition 1 serves as the foundation for our algorithmic framework. Denote $\varphi_{k}(\mathbf{x}):=\varphi_{\mu_{k},\theta_{k},\sigma_{k}}(\mathbf{x})$. Next, we illustrate the implementation at the $k$-th step of the outer loop and the $l$-th step of the inner loop, that is, how to calculate $\frac{\partial\varphi_{k}(\mathbf{x}_{l})}{\partial\mathbf{x}}$.

We first calculate $\mathbf{z}^{*}_{\mu_{k}}(\mathbf{x}_{l})$ in Eq. (10) through $T_{\mathbf{z}}$ steps of gradient descent, and denote the output by $\mathbf{z}_{k,l}^{{T_{\mathbf{z}}}}$, regarded as an approximation of $\mathbf{z}^{*}_{\mu_{k}}(\mathbf{x}_{l})$. After that, we calculate $\mathbf{y}_{\mu_{k},\theta_{k},\sigma_{k}}^{*}(\mathbf{x}_{l})$ in Eq. (11) through $T_{\mathbf{y}}$ steps of gradient descent, and denote the output by $\mathbf{y}_{k,l}^{T_{\mathbf{y}}}$. Note that if the objective function is convex, running sufficiently many steps of gradient descent yields an approximation of the minimizer. The convexity of the objective functions for approximating $\mathbf{z}^{*}_{\mu_{k}}(\mathbf{x}_{l})$ and $\mathbf{y}_{\mu_{k},\theta_{k},\sigma_{k}}^{*}(\mathbf{x}_{l})$ can be guaranteed if $f$ and $F$ are convex in $\mathbf{y}$, as discussed in Remark 1. Moreover, if the objective function is not convex but all of its stationary points are minimizers, which is a weaker condition than convexity, gradient descent can still be expected to converge to minimizers.

Then, according to Proposition 1, we can obtain

\frac{\partial\varphi_{k}(\mathbf{x}_{l})}{\partial\mathbf{x}}\approx\frac{\partial F(\mathbf{x}_{l},\mathbf{y}_{k,l}^{T_{\mathbf{y}}})}{\partial\mathbf{x}}+G_{k,l}\ ,

(13)

with $G_{k,l}=\frac{\partial P_{\sigma_{k}}\!\left(f\left(\mathbf{x}_{l},\mathbf{y}_{k,l}^{T_{\mathbf{y}}}\right)-f_{k,l}^{T_{\mathbf{z}}}\right)}{\partial\mathbf{x}},$ where $f_{k,l}^{T_{\mathbf{z}}}=f(\mathbf{x}_{l},\mathbf{z}_{k,l}^{T_{\mathbf{z}}})+\frac{\mu_{k}}{2}\|\mathbf{z}_{k,l}^{{T_{\mathbf{z}}}}\|^{2}.$ In summary, the algorithm for solving Eq. (9) is given in Algorithms 1 and 2.

Algorithm 1 Our Solution Strategy for Eq. (9).

Input: $\mathbf{x}_{0}$, $(\mu_{0},\theta_{0},\sigma_{0})$, step size $\alpha>0$.
Output: The optimal UL solution $\mathbf{x}^{*}$.
1: for $k=0\rightarrow K-1$ do
2: Calculate $(\mu_{k},\theta_{k},\sigma_{k})$.
3: for $l=0\rightarrow L-1$ do
4: Calculate $\frac{\partial\varphi_{k}(\mathbf{x}_{l})}{\partial\mathbf{x}}$ by Algorithm 2.
5: $\mathbf{x}_{l+1}=\mathtt{Proj}_{\mathcal{X}}\left(\mathbf{x}_{l}-\alpha\frac{\partial\varphi_{k}(\mathbf{x}_{l})}{\partial\mathbf{x}}\right)$
6: end for
7: $\mathbf{x}_{0}=\mathbf{x}_{L}$
8: end for
9: return $\mathbf{x}^{*}$

Algorithm 2 Calculation of $\frac{\partial\varphi_{k}(\mathbf{x}_{l})}{\partial\mathbf{x}}$.

Input: $\mathbf{x}_{l}$, $(\mu_{k},\theta_{k},\sigma_{k})$.
Output: $\frac{\partial\varphi_{k}(\mathbf{x}_{l})}{\partial\mathbf{x}}$.
1: Calculate $\mathbf{z}_{k,l}^{{T_{\mathbf{z}}}}$ as an approximation of $\mathbf{z}^{*}_{\mu_{k}}\!(\mathbf{x}_{l})$ by performing $T_{\mathbf{z}}$-step gradient descent on Eq. (10).
2: Calculate $\mathbf{y}_{k,l}^{T_{\mathbf{y}}}$ as an approximation of $\mathbf{y}_{\mu_{k},\theta_{k},\sigma_{k}}^{*}\!(\mathbf{x}_{l})$ by performing $T_{\mathbf{y}}$-step gradient descent on Eq. (11).
3: Calculate an approximation of $\frac{\partial\varphi_{k}(\mathbf{x}_{l})}{\partial\mathbf{x}}$ by Eq. (13).
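The two procedures above can be sketched in a few lines of Python. This is only a minimal illustration under our own assumptions, not the authors' implementation: the toy objectives $F(x,y)=\frac{1}{2}(x-1)^{2}+\frac{1}{2}(y-2)^{2}$ and $f(x,y)=\frac{1}{2}(y-x)^{2}$, the quadratic penalty $P_{\sigma}(\omega)=\max(0,\omega)^{2}/(2\sigma)$, the geometric parameter schedule, the feasible set $\mathcal{X}=[-5,5]$, and all function names are hypothetical choices made for the example. For this toy problem $\mathcal{S}(x)=\{x\}$, so the exact UL solution is $x^{*}=1.5$.

```python
# Sketch of Algorithms 1 and 2 on a scalar toy problem (all modelling choices
# below are illustrative assumptions, not taken from the paper).

def f(x, y):            # LL objective
    return 0.5 * (y - x) ** 2

def df_dx(x, y):
    return x - y

def df_dy(x, y):
    return y - x

def dF_dx(x, y):        # UL objective F = 0.5*(x-1)^2 + 0.5*(y-2)^2
    return x - 1.0

def dF_dy(x, y):
    return y - 2.0

def dP(omega, sigma):   # derivative of the assumed penalty max(0, w)^2 / (2*sigma)
    return max(0.0, omega) / sigma

def approx_grad_phi(x, mu, theta, sigma, T_z=500, T_y=500, lr=0.05):
    # Step 1: T_z gradient-descent steps on f(x, .) + mu/2 * ||.||^2.
    z = 0.0
    for _ in range(T_z):
        z -= lr * (df_dy(x, z) + mu * z)
    f_val = f(x, z) + 0.5 * mu * z * z
    # Step 2: T_y gradient-descent steps on F + P(f - f_val) + theta/2 * ||.||^2.
    y = z
    for _ in range(T_y):
        w = dP(f(x, y) - f_val, sigma)
        y -= lr * (dF_dy(x, y) + w * df_dy(x, y) + theta * y)
    # Step 3: gradient estimate; the value-function term is differentiated at z.
    w = dP(f(x, y) - f_val, sigma)
    return dF_dx(x, y) + w * (df_dx(x, y) - df_dx(x, z))

def bvfsm_sketch(x0=0.0, K=10, L=30, alpha=0.1, lo=-5.0, hi=5.0):
    x = x0
    for k in range(K):
        mu = theta = sigma = 0.5 ** k           # assumed decreasing schedule
        for _ in range(L):
            g = approx_grad_phi(x, mu, theta, sigma)
            x = min(hi, max(lo, x - alpha * g))  # projection onto X = [lo, hi]
    return x

x_hat = bvfsm_sketch()
```

With finite $(\mu_{K-1},\theta_{K-1},\sigma_{K-1})$ the returned point is a solution of an approximate problem, so it should lie near, but not exactly at, $x^{*}=1.5$; the gap is expected to shrink as the parameters decrease further.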

3.3 Extension for BLO with Functional Constraints

In this subsection, we consider the more general setting of BLO with functional constraints on both the UL and LL problems, and show that the above discussion for the unconstrained case extends to the case with inequality constraints.

The optimistic BLO problem in Eq. (2) with functional constraints then reads

		$\displaystyle\min\limits_{\mathbf{x}\in\mathcal{X},\mathbf{y}\in\mathbb{R}^{n}}\ F(\mathbf{x},\mathbf{y})$
		$\displaystyle\mathrm{\ s.t.\ }\left\{\begin{aligned} &H_{j}(\mathbf{x},\mathbf{y})\leq 0,\text{for any }j\ \\ &\mathbf{y}\in{\rm\mathcal{S}}(\mathbf{x}):=\mathop{\arg\min}_{\mathbf{y}}\Big{\{}f(\mathbf{x},\mathbf{y}):h_{j^{\prime}}(\mathbf{x},\mathbf{y})\leq 0,\text{for any }j^{\prime}\Big{\}}\end{aligned}\right.,$

where the UL constraints $H_{j}(\mathbf{x},\mathbf{y}):\mathbb{R}^{m}\times\mathbb{R}^{n}\rightarrow\mathbb{R}$ $\left(\text{for any }{j}\in\left\{1,2,\cdots,J\right\}\right)$ and the LL constraints $h_{j^{\prime}}(\mathbf{x},\mathbf{y}):\mathbb{R}^{m}\times\mathbb{R}^{n}\rightarrow\mathbb{R}$ $\left(\text{for any }{j^{\prime}}\in\left\{1,2,\cdots,J^{\prime}\right\}\right)$ are continuously differentiable functions. This is equivalent to $\min_{\mathbf{x}\in\mathcal{X}}\varphi(\mathbf{x}),$ where $\varphi(\mathbf{x})$ , the value-function of the sub-problem in Eq. (3), is adapted correspondingly to be

\varphi(\mathbf{x}):=\mathop{\min}_{\mathbf{y}}\Big{\{}F(\mathbf{x},\mathbf{y}):H_{j}(\mathbf{x},\mathbf{y})\leq 0,\ \forall j,\text{ and }\mathbf{y}\in{\rm\mathcal{S}}(\mathbf{x})\Big{\}}.

Also, the counterpart for value-function of the LL problem is $f^{*}(\mathbf{x}):=\min_{\mathbf{y}}\left\{f(\mathbf{x},\mathbf{y}):h_{j^{\prime}}(\mathbf{x},\mathbf{y})\leq 0,\forall\ {j^{\prime}}\right\}$ , and following the technique within Eq. (5) to transform the LL problem into an inequality constraint, the problem can be reformulated as

\displaystyle\varphi(\mathbf{x})=\mathop{\min}_{\mathbf{y}}\Big{\{}F(\mathbf{x},\mathbf{y}):H_{j}(\mathbf{x},\mathbf{y})\leq 0,\ \forall\ j,\ f(\mathbf{x},\mathbf{y})\leq f^{*}(\mathbf{x}),\text{ and }h_{j^{\prime}}(\mathbf{x},\mathbf{y})\leq 0,\ \forall\ {j^{\prime}}\Big{\}}.

(14)

After that, using the same idea of sequential minimization method, inspired by the regularized smoothing method in [45], the value-function $f^{*}(\mathbf{x})$ can be approximated with a barrier function (different from Eq. (6) due to the LL constraints) and a regularization term:

f_{\mu,\sigma_{B}}^{*}(\mathbf{x})=\min\limits_{\mathbf{y}}\Bigg{\{}f(\mathbf{x},\mathbf{y})+\sum_{{j^{\prime}}=1}^{J^{\prime}}P_{B,\sigma_{B}}\!\left(h_{j^{\prime}}(\mathbf{x},\mathbf{y})\right)+\frac{\mu}{2}\|\mathbf{y}\|^{2}\Bigg{\}},

where $P_{B,\sigma_{B}}(\omega):\mathbb{R}\rightarrow\mathbb{\overline{R}}$ is the selected standard barrier function for the LL constraint $h_{j^{\prime}}$ as defined in Eq. (8), with $\sigma_{B}$ as the barrier parameter. Note that here we define $P_{B,\sigma_{B}}$ as a standard barrier function but not a modified barrier function. As for the approximation of $\varphi(\mathbf{x})$ , Eq. (7) is transferred into $\small\begin{aligned} &\varphi_{\mu,\theta,\sigma}(\mathbf{x})=\min\limits_{\mathbf{y}}\Bigg{\{}F(\mathbf{x},\mathbf{y})+\sum_{j=1}^{J}P_{H,\sigma_{H}}\!\left(H_{j}(\mathbf{x},\mathbf{y})\right)\\ &+\sum_{{j^{\prime}}=1}^{J^{\prime}}P_{h,\sigma_{h}}\!\left(h_{j^{\prime}}(\mathbf{x},\mathbf{y})\right)+\ P_{f,\sigma_{f}}\!\left(f(\mathbf{x},\mathbf{y})-f_{\mu,\sigma_{B}}^{*}(\mathbf{x})\right)+\frac{\theta}{2}\|\mathbf{y}\|^{2}\Bigg{\}},\end{aligned}$ where $P_{H,\sigma_{H}},P_{h,\sigma_{h}},P_{f,\sigma_{f}}:\mathbb{R}\rightarrow\mathbb{\overline{R}}$ are the selected auxiliary functions of penalty or modified barrier with parameters $\sigma_{H}$ , $\sigma_{h}$ and $\sigma_{f}$ , and $(\mu,\theta,\sigma)=(\mu,\theta,\sigma_{B},\sigma_{H},\sigma_{h},\sigma_{f})>0$ .
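To illustrate how the barrier-regularized value function $f_{\mu,\sigma_{B}}^{*}(\mathbf{x})$ approaches the constrained value function $f^{*}(\mathbf{x})$, the following Python sketch uses a toy LL problem of our own choosing: $f(x,y)=\frac{1}{2}(y-x)^{2}$ with the single constraint $h(x,y)=y-1\leq 0$, and the logarithmic barrier $P_{B,\sigma_{B}}(\omega)=-\sigma_{B}\log(-\omega)$. The log barrier is an assumed instance of a standard barrier function; the specific form in Eq. (8) is not reproduced here. At $x=2$ the constrained minimizer is $y=1$, so $f^{*}(2)=0.5$.

```python
import math

def barrier_value(x, mu, sigma_b, iters=200):
    """Value of min_y 0.5*(y-x)^2 - sigma_b*log(1-y) + mu/2*y^2 over y < 1.

    The objective is strictly convex on y < 1, so its derivative is increasing
    and the minimizer can be located by bisection on the derivative.
    """
    def dobj(y):
        return (y - x) + sigma_b / (1.0 - y) + mu * y

    lo, hi = -10.0, 1.0 - 1e-12   # bracket assumed wide enough for this toy case
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if dobj(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    y = 0.5 * (lo + hi)
    return 0.5 * (y - x) ** 2 - sigma_b * math.log(1.0 - y) + 0.5 * mu * y * y

x = 2.0
exact_value = 0.5  # f*(2) for the toy constrained problem
# Gap between the smoothed and exact value functions as (mu, sigma_B) shrink.
gaps = [abs(barrier_value(x, s, s) - exact_value) for s in (1.0, 0.1, 0.01, 0.001)]
```

In this toy setting the gap decreases monotonically along the chosen parameter sequence, which is the behaviour the sequential approximation relies on.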

Then corresponding to Eq. (10) and Eq. (11), by denoting

\small\mathbf{z}^{*}_{\mu,\sigma_{B}}(\mathbf{x})\!=\!\underset{\mathbf{y}}{\mathrm{argmin}}\Bigg{\{}f(\mathbf{x},\mathbf{y})+\sum_{{j^{\prime}}=1}^{J^{\prime}}P_{B,\sigma_{B}}\!\left(h_{j^{\prime}}(\mathbf{x},\mathbf{y})\right)+\frac{\mu}{2}\|\mathbf{y}\|^{2}\Bigg{\}},

$\small\begin{aligned} &\mathbf{y}_{\mu,\theta,\sigma}^{*}(\mathbf{x})=\underset{\mathbf{y}}{\mathrm{argmin}}\Bigg{\{}F(\mathbf{x},\mathbf{y})+\sum_{j=1}^{J}P_{H,\sigma_{H}}\!\left(H_{j}(\mathbf{x},\mathbf{y})\right)\\ &+\sum_{{j^{\prime}}=1}^{J^{\prime}}P_{h,\sigma_{h}}\!\left(h_{j^{\prime}}(\mathbf{x},\mathbf{y})\right)+\ P_{f,\sigma_{f}}\!\left(f(\mathbf{x},\mathbf{y})-f_{\mu,\sigma_{B}}^{*}(\mathbf{x})\right)+\frac{\theta}{2}\|\mathbf{y}\|^{2}\Bigg{\}},\end{aligned}$ we have the next proposition, which follows the same idea from Proposition 1.

Proposition 2

Suppose $F(\mathbf{x},\mathbf{y})$ and $f(\mathbf{x},\mathbf{y})$ are bounded below and continuously differentiable. Given $\mathbf{x}\in\mathcal{X}$ and $\mu,\theta,\sigma>0$, when $\mathbf{z}^{*}_{\mu,\sigma_{B}}(\mathbf{x})$ and $\mathbf{y}_{\mu,\theta,\sigma}^{*}(\mathbf{x})$ are unique, $\varphi_{\mu,\theta,\sigma}(\mathbf{x})$ is differentiable and $\frac{\partial\varphi_{\mu,\theta,\sigma}\left(\mathbf{x}\right)}{\partial\mathbf{x}}=\frac{\partial F\!\left(\mathbf{x},\mathbf{y}_{\mu,\theta,\sigma}^{*}(\mathbf{x})\right)}{\partial\mathbf{x}}+G(\mathbf{x})$, with $\begin{aligned} G(\mathbf{x})=&\Bigg{[}\sum_{j=1}^{J}\frac{\partial P_{H,\sigma_{H}}\!\left(H_{j}(\mathbf{x},\mathbf{y})\right)}{\partial\mathbf{x}}+\sum_{{j^{\prime}}=1}^{J^{\prime}}\frac{\partial P_{h,\sigma_{h}}\!\left(h_{j^{\prime}}(\mathbf{x},\mathbf{y})\right)}{\partial\mathbf{x}}\\ &+\frac{\partial P_{f,\sigma_{f}}\!\left(f\left(\mathbf{x},\mathbf{y}\right)-f_{\mu,\sigma_{B}}^{*}(\mathbf{x})\right)}{\partial\mathbf{x}}\Bigg{]}\big{|}_{\mathbf{y}=\mathbf{y}_{\mu,\theta,\sigma}^{*}(\mathbf{x})},\end{aligned}$ where

\displaystyle f_{\mu,\sigma_{B}}^{*}(\mathbf{x})=f\!\left(\mathbf{x},\mathbf{z}^{*}_{\mu,\sigma_{B}}(\mathbf{x})\right)+\sum_{{j^{\prime}}=1}^{J^{\prime}}P_{B,\sigma_{B}}\!\left(h_{j^{\prime}}(\mathbf{x},\mathbf{z}^{*}_{\mu,\sigma_{B}}(\mathbf{x}))\right)+\frac{\mu}{2}\|\mathbf{z}^{*}_{\mu,\sigma_{B}}(\mathbf{x})\|^{2},

\small\frac{\partial f_{\mu,\sigma_{B}}^{*}(\mathbf{x})}{\partial\mathbf{x}}\!\!=\!\!\Bigg{[}\frac{\partial f\!\left(\mathbf{x},\mathbf{y}\right)}{\partial\mathbf{x}}+\sum_{{j^{\prime}}=1}^{J^{\prime}}\frac{\partial P_{B,\sigma_{B}}\!\left(h_{j^{\prime}}\!\left(\mathbf{x},\mathbf{y}\right)\right)}{\partial\mathbf{x}}\Bigg{]}\big{|}_{\mathbf{y}=\mathbf{z}^{*}_{\mu,\sigma_{B}}(\mathbf{x})}.

The proof is similar to that of Proposition 1 and is obtained by applying [43, Theorem 4.13, Remark 4.14]. The algorithm is then based on Proposition 2 and is similar to Algorithms 1 and 2. Note that in Section 4.1, our convergence analysis is carried out under this constrained setting, because problems without constraints can be regarded as its special case.

Remark 2

Regarding the uniqueness of $\mathbf{z}^{*}_{\mu,\sigma_{B}}(\mathbf{x})$ and $\mathbf{y}_{\mu,\theta,\sigma}^{*}(\mathbf{x})$, if we select convex auxiliary functions $P_{B,\sigma_{B}}$, $P_{H,\sigma_{H}}$, $P_{h,\sigma_{h}}$, $P_{f,\sigma_{f}}$, and suppose that $h_{j^{\prime}}(\mathbf{x},\mathbf{y}),\forall\ j^{\prime},$ and $H_{j}(\mathbf{x},\mathbf{y}),\forall\ j,$ in the constraints are convex on the level set, then the uniqueness follows similarly as in Remark 1.

3.4 Extension for Pessimistic BLO

In this part, we consider the pessimistic BLO, which, to the best of our knowledge, has rarely been discussed for handling learning tasks. For brevity, we focus on problems without constraints on the UL and LL; the extension to the constrained case is straightforward. As discussed at the very beginning, the pessimistic BLO takes the form

\displaystyle\min\limits_{\mathbf{x}\in\mathcal{X}}\max\limits_{\mathbf{y}\in\mathbb{R}^{n}}\ F(\mathbf{x},\mathbf{y}),\ \mathrm{\ s.t.\ }\mathbf{y}\in{\rm\mathcal{S}}(\mathbf{x})=\mathop{\arg\min}_{\mathbf{y}}f(\mathbf{x},\mathbf{y}).

(15)

Similar to the optimistic case, this problem can be transformed into $\min_{\mathbf{x}}\ \varphi^{p}(\mathbf{x}),$ where the value-function $\varphi(\mathbf{x})$ in Eq. (3) is redefined as $\begin{aligned} \varphi^{p}(\mathbf{x}):=&\mathop{\max}_{\mathbf{y}}\Big{\{}F(\mathbf{x},\mathbf{y}):\mathbf{y}\in{\rm\mathcal{S}}(\mathbf{x})\Big{\}}.\end{aligned}$ For the value-function $f^{*}(\mathbf{x})$, we use the same regularized $f_{\mu}^{*}(\mathbf{x})$ as in the optimistic case in Eq. (6). As for the approximation of $\varphi^{p}(\mathbf{x})$, following the value-function-based sequential minimization and in contrast to Eq. (7), we have

\displaystyle\varphi^{p}_{\mu,\theta,\sigma}(\mathbf{x})=\max\limits_{\mathbf{y}}\Big{\{}F(\mathbf{x},\mathbf{y})-P_{\sigma}\!\left(f(\mathbf{x},\mathbf{y})-f_{\mu}^{*}(\mathbf{x})\right)-\frac{\theta}{2}\|\mathbf{y}\|^{2}\Big{\}},

where $P_{\sigma}$ is defined in Eq. (8). Same as before, our goal is to solve $\min_{\mathbf{x}}\varphi^{p}_{\mu,\theta,\sigma}(\mathbf{x}).$

Denote $\mathbf{z}^{*}_{\mu}(\mathbf{x})$ to be the same as in Eq. (10), and Eq. (11) is changed into

\displaystyle\mathbf{y}_{\mu,\theta,\sigma}^{*}(\mathbf{x})=\underset{\mathbf{y}}{\mathrm{argmax}}\Big{\{}F(\mathbf{x},\mathbf{y})-P_{\sigma}\!\left(f(\mathbf{x},\mathbf{y})-f_{\mu}^{*}(\mathbf{x})\right)-\frac{\theta}{2}\|\mathbf{y}\|^{2}\Big{\}}.

Proposition 1 for the optimistic case then takes the following form in the pessimistic case.

Proposition 3

Suppose $-F(\mathbf{x},\mathbf{y})$ and $f(\mathbf{x},\mathbf{y})$ are bounded below and continuously differentiable. Given $\mathbf{x}\in\mathcal{X}$ and $\mu,\theta,\sigma>0$ , when $\mathbf{z}^{*}_{\mu}(\mathbf{x})$ and $\mathbf{y}_{\mu,\theta,\sigma}^{*}(\mathbf{x})$ are unique, then $\varphi^{p}_{\mu,\theta,\sigma}(\mathbf{x})$ is differentiable and

\frac{\partial\varphi^{p}_{\mu,\theta,\sigma}\left(\mathbf{x}\right)}{\partial\mathbf{x}}=\frac{\partial F\!\left(\mathbf{x},\mathbf{y}_{\mu,\theta,\sigma}^{*}(\mathbf{x})\right)}{\partial\mathbf{x}}+G(\mathbf{x}),

with $\begin{aligned} G(\mathbf{x})=\frac{-\partial P_{\sigma}\!\left(f\left(\mathbf{x},\mathbf{y}\right)-f_{\mu}^{*}(\mathbf{x})\right)}{\partial\mathbf{x}}\big{|}_{\mathbf{y}=\mathbf{y}_{\mu,\theta,\sigma}^{*}(\mathbf{x})},\end{aligned}$ where $\begin{aligned} f_{\mu}^{*}(\mathbf{x})=f\!\left(\mathbf{x},\mathbf{z}^{*}_{\mu}(\mathbf{x})\right)+\frac{\mu}{2}\|\mathbf{z}^{*}_{\mu}(\mathbf{x})\|^{2},\end{aligned}$ and $\frac{\partial f_{\mu}^{*}(\mathbf{x})}{\partial\mathbf{x}}=\frac{\partial f\!\left(\mathbf{x},\mathbf{y}\right)}{\partial\mathbf{x}}\big{|}_{\mathbf{y}=\mathbf{z}^{*}_{\mu}(\mathbf{x})}.$

The proof is similar to that of Proposition 1 and is obtained by applying [43, Theorem 4.13, Remark 4.14]. The algorithm in the pessimistic case can then be derived similarly to the optimistic case, except that gradient ascent is performed instead of gradient descent when calculating $\mathbf{y}_{\mu,\theta,\sigma}^{*}(\mathbf{x})$. In addition, the convergence of BVFSM for pessimistic BLO will be discussed in detail in Section 4.1.
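The gradient formula of Proposition 3, with gradient ascent used for the inner maximization, can be checked on a small example. The Python sketch below is a hedged illustration under our own assumptions: the toy objectives $F(x,y)=-\frac{1}{2}(y-1)^{2}-\frac{1}{2}x^{2}$ (so that $-F$ is bounded below) and $f(x,y)=\frac{1}{2}(y-x)^{2}$, the quadratic penalty $P_{\sigma}(\omega)=\max(0,\omega)^{2}/(2\sigma)$, and all function names are hypothetical. It compares the formula with a central finite difference of $\varphi^{p}_{\mu,\theta,\sigma}$.

```python
# Finite-difference check of the pessimistic gradient formula on a toy problem
# (objectives and penalty are illustrative assumptions).

def f(x, y):
    return 0.5 * (y - x) ** 2

def df_dx(x, y):
    return x - y

def df_dy(x, y):
    return y - x

def F(x, y):
    return -0.5 * (y - 1.0) ** 2 - 0.5 * x ** 2

def dF_dx(x, y):
    return -x

def dF_dy(x, y):
    return -(y - 1.0)

def P(omega, sigma):
    return max(0.0, omega) ** 2 / (2.0 * sigma)

def dP(omega, sigma):
    return max(0.0, omega) / sigma

def solve_inner(x, mu, theta, sigma, steps=5000, lr=0.05):
    # Gradient descent for z (regularized LL problem).
    z = 0.0
    for _ in range(steps):
        z -= lr * (df_dy(x, z) + mu * z)
    f_val = f(x, z) + 0.5 * mu * z * z
    # Gradient ASCENT for y on F - P(f - f_val) - theta/2 * ||y||^2.
    y = z
    for _ in range(steps):
        w = dP(f(x, y) - f_val, sigma)
        y += lr * (dF_dy(x, y) - w * df_dy(x, y) - theta * y)
    return z, f_val, y

def phi_p(x, mu, theta, sigma):
    z, f_val, y = solve_inner(x, mu, theta, sigma)
    return F(x, y) - P(f(x, y) - f_val, sigma) - 0.5 * theta * y * y

def grad_phi_p(x, mu, theta, sigma):
    z, f_val, y = solve_inner(x, mu, theta, sigma)
    w = dP(f(x, y) - f_val, sigma)
    # dF/dx + G(x), where G carries the minus sign of the pessimistic case.
    return dF_dx(x, y) - w * (df_dx(x, y) - df_dx(x, z))

x, mu, theta, sigma, eps = 0.3, 0.1, 0.1, 0.1, 1e-4
g_formula = grad_phi_p(x, mu, theta, sigma)
g_fd = (phi_p(x + eps, mu, theta, sigma) - phi_p(x - eps, mu, theta, sigma)) / (2 * eps)
```

The only changes relative to the optimistic sketch are the sign of the penalty and regularization terms and the ascent direction in the update of `y`.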

Remark 3

The uniqueness of $\mathbf{z}^{*}_{\mu}(\mathbf{x})$ can be guaranteed in the same way as in Remark 1. Similarly, suppose that, given $\mathbf{x}$, there exist constants $c_{1}<\max_{\mathbf{y}\in\mathcal{S}(\mathbf{x})}F(\mathbf{x},\mathbf{y})$ and $c_{2}>\min_{\mathbf{y}}f(\mathbf{x},\mathbf{y})$ such that $F$ is concave and $f$ is convex in $\mathbf{y}$ on the set $\{\mathbf{y}:F(\mathbf{x},\mathbf{y})\geq c_{1}\text{ and }f(\mathbf{x},\mathbf{y})\leq c_{2}\}$, and that we select a non-decreasing and convex auxiliary function $P_{\sigma}(\cdot)$. Or simply suppose there exists $c<\max_{\mathbf{y}}\left\{F(\mathbf{x},\mathbf{y})-P_{\sigma}\!\left(f(\mathbf{x},\mathbf{y})-f_{\mu}^{*}(\mathbf{x})\right)\right\}$ such that $F(\mathbf{x},\mathbf{y})-P_{\sigma}\!\left(f(\mathbf{x},\mathbf{y})-f_{\mu}^{*}(\mathbf{x})\right)$ is concave on the set $\{\mathbf{y}:F(\mathbf{x},\mathbf{y})-P_{\sigma}\!\left(f(\mathbf{x},\mathbf{y})-f_{\mu}^{*}(\mathbf{x})\right)\geq c\}$. Then the uniqueness of $\mathbf{y}_{\mu,\theta,\sigma}^{*}(\mathbf{x})$ follows.

4 Theoretical Analysis

This section presents the convergence analysis and complexity analysis of the proposed BVFSM.

4.1 Convergence Analysis

Here we present the convergence analysis of the proposed method. As BLO without constraints can be seen as a special case of BLO with constraints by regarding the constraints as $H_{j}(\mathbf{x},\mathbf{y})\equiv 0,\forall j,$ and $h_{j^{\prime}}(\mathbf{x},\mathbf{y})\equiv 0,\forall{j^{\prime}}$, we prove the more general constrained case. Also, for brevity, we first prove the result for optimistic BLO; the pessimistic case will be analyzed later in Corollary 1.

Note that for sequential-minimization-type schemes, including EGBMs and BVFSM, the convergence analysis can be classified into asymptotic and non-asymptotic convergence [46, 47]. This work considers asymptotic convergence and focuses on the approximation quality, that is, whether the solutions of the approximate problems converge to a solution of the original problem as the sequence of approximated sub-problems converges to the original bi-level problem. We prove the asymptotic convergence from the perspective of global solutions, and start by recalling the equivalent definition of epiconvergence given in [43, pp. 41].

Definition 2

$\varphi_{k}\stackrel{{\scriptstyle e}}{{\longrightarrow}}\varphi$ if and only if for all $\mathbf{x}\in\mathbb{R}^{m}$ , the following two conditions hold:

(1)

for any sequence $\{\mathbf{x}_{k}\}$ converging to $\mathbf{x}$ ,

$\liminf\limits_{k\rightarrow\infty}\varphi_{k}(\mathbf{x}_{k})\geq\varphi(\mathbf{x});$ (16)
(2)

there is a sequence $\{\mathbf{x}_{k}\}$ converging to $\mathbf{x}$ such that

$\limsup\limits_{k\rightarrow\infty}\varphi_{k}(\mathbf{x}_{k})\leq\varphi(\mathbf{x}).$ (17)
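As a simple illustration of Definition 2 (a textbook-style example of our own, not part of the analysis), consider on $\mathbb{R}$ the functions $\varphi_{k}(x)=\min\{1,\,|kx-1|\}$. Pointwise, $\varphi_{k}(x)\rightarrow 1$ for every $x$: at $x=0$ we have $\varphi_{k}(0)=1$ for all $k$, and for $x\neq 0$ we have $|kx-1|\rightarrow\infty$. However, $\inf_{x}\varphi_{k}(x)=\varphi_{k}(1/k)=0$ for all $k$, so the pointwise limit does not capture the behaviour of the infima. The epi-limit is instead $\varphi(x)=1$ for $x\neq 0$ and $\varphi(0)=0$. Indeed, condition (1) holds at $x=0$ because $\varphi_{k}\geq 0$, and at $x\neq 0$ because $|kx_{k}-1|\rightarrow\infty$ for any $x_{k}\rightarrow x$; condition (2) holds at $x\neq 0$ with the constant sequence $x_{k}=x$, and at $x=0$ with $x_{k}=1/k$, for which $\varphi_{k}(x_{k})=0$. Here $\inf_{x}\varphi_{k}(x)=0=\inf_{x}\varphi(x)$, which is the kind of conclusion on optimal values that epiconvergence is designed to deliver.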

The convergence results are given under the following statements as our blanket assumption.

Assumption 1 (Assumptions for the problem)

(1)

$\mathcal{S}(\mathbf{x})$ is nonempty for $\mathbf{x}\in\mathcal{X}$ .
(2)

Both $F(\mathbf{x},\mathbf{y})$ and $f(\mathbf{x},\mathbf{y})$ are jointly continuous and continuously differentiable. Both $H_{j}(\mathbf{x},\mathbf{y})$ , $\forall\ {j}$ and $h_{j^{\prime}}(\mathbf{x},\mathbf{y})$ , $\forall\ {j^{\prime}}$ are continuously differentiable.
(3)

$F(\mathbf{x},\mathbf{y})$ is level-bounded in $\mathbf{y}$ locally uniformly in $\mathbf{x}\in\mathcal{X}$ (see [29, Definition 3]).
(4)

For constrained BLO, $0$ is not a local optimal value of $h_{j^{\prime}}(\mathbf{x},\mathbf{y})$ w.r.t. $\mathbf{y}$ for all ${j^{\prime}}$ .

For simplicity of notation, here we let $J=J^{\prime}=1$, meaning that there is one constraint on each of the UL and LL problems, and denote them by $H(\mathbf{x},\mathbf{y})$ and $h(\mathbf{x},\mathbf{y})$. When $J>1$ or $J^{\prime}>1$, the proofs are analogous. In addition, denote $f^{*}_{k}(\mathbf{x}):=f_{\mu_{k},\sigma_{B,k}}^{*}(\mathbf{x})$, $\varphi_{k}(\mathbf{x}):=\varphi_{\mu_{k},\theta_{k},\sigma_{k}}(\mathbf{x})$, and $P_{k}(\omega)\!:=\!P_{\sigma_{k}}(\omega)$ defined in Eq. (8). Also, $P_{B,k}(\omega),P_{H,k}(\omega),P_{h,k}(\omega),P_{f,k}(\omega)$ are defined similarly. Note that $P_{B,k}$ is the standard barrier function, while $P_{H,k}$, $P_{h,k}$, $P_{f,k}$ are penalty or modified barrier functions. Then

f^{*}_{k}(\mathbf{x})=\min\limits_{\mathbf{y}}\left\{f(\mathbf{x},\mathbf{y})+P_{B,k}\!\left(h(\mathbf{x},\mathbf{y})\right)+\frac{\mu_{k}}{2}\|\mathbf{y}\|^{2}\right\},

\displaystyle\varphi_{k}(\mathbf{x})=\min\limits_{\mathbf{y}}\Big{\{}F(\mathbf{x},\mathbf{y})+P_{H,k}\!\left(H(\mathbf{x},\mathbf{y})\right)+P_{h,k}\!\left(h(\mathbf{x},\mathbf{y})\right)+P_{f,k}\!\left(f(\mathbf{x},\mathbf{y})-f_{k}^{*}(\mathbf{x})\right)+\frac{\theta_{k}}{2}\|\mathbf{y}\|^{2}\Big{\}}.

To begin with, we present the following lemma on the properties of penalty and modified barrier functions, in preparation for further discussion and proofs. (Proofs of the four lemmas are provided in Appendix A, available at https://arxiv.org/abs/2110.04974.)

Lemma 1

Let $\{\sigma_{k}\}$ in $P_{k}(\omega)\!=\!P_{\sigma_{k}}(\omega)$ be a positive sequence such that $\lim_{k\rightarrow\infty}\sigma_{k}=0$ . Additionally assume that $\lim_{k\rightarrow\infty}\rho(-\sigma^{(2)}_{k}\;;\sigma_{k}^{(1)})=0$ when $\rho$ is a modified barrier function. Then we have

(1)

$P_{k}(\omega)$ is continuous, differentiable and non-decreasing, and satisfies $P_{k}(\omega)\geq 0$ .
(2)

For any $\omega\leq 0$ , $\lim_{k\rightarrow\infty}P_{k}(\omega)=0$ .
(3)

For any sequence $\{\omega_{k}\}$ , $\lim_{k\rightarrow\infty}P_{k}(\omega_{k})<+\infty$ implies that $\limsup_{k\rightarrow\infty}\omega_{k}\leq 0$ .
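The three properties in Lemma 1 can be observed numerically on a concrete auxiliary function. The Python sketch below assumes the quadratic penalty $P_{\sigma}(\omega)=\max(0,\omega)^{2}/(2\sigma)$, which is our own illustrative instance and not necessarily one of the functions the paper lists; the sequence $\sigma_{k}=10^{-k}$ and the variable names are likewise hypothetical.

```python
# Numerical illustration of Lemma 1 for an assumed quadratic penalty.

def quad_penalty(omega, sigma):
    # Non-negative, non-decreasing and differentiable in omega.
    return max(0.0, omega) ** 2 / (2.0 * sigma)

sigmas = [10.0 ** (-k) for k in range(1, 7)]  # positive sequence tending to 0

# Property (2): for a fixed omega <= 0 the penalty tends to (here: equals) 0.
vals_feasible = [quad_penalty(-0.5, s) for s in sigmas]

# Property (3), contrapositive: a fixed violation omega > 0 makes P_k blow up.
vals_violated = [quad_penalty(0.1, s) for s in sigmas]

# Property (3): P_k(omega_k) can stay bounded only if the violation vanishes,
# e.g. omega_k = sqrt(sigma_k) gives the constant value 1/2.
vals_shrinking = [quad_penalty(s ** 0.5, s) for s in sigmas]
```

A bounded penalty value along the sequence therefore forces $\limsup_{k\rightarrow\infty}\omega_{k}\leq 0$, which is how the value-function constraint is recovered in the limit.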

We will use these properties in later proofs, so from now on we keep the requirements on the parameters stated in Lemma 1. To prove the convergence result, we verify the two conditions given in Definition 2, and show that $\varphi_{k}(\mathbf{x})+\delta_{\mathcal{X}}(\mathbf{x})\stackrel{{\scriptstyle e}}{{\longrightarrow}}\varphi(\mathbf{x})+\delta_{\mathcal{X}}(\mathbf{x})$, where $\delta_{\mathcal{X}}(\mathbf{x})$ denotes the indicator function of the set $\mathcal{X}$, i.e., $\delta_{\mathcal{X}}(\mathbf{x})=0$ if $\mathbf{x}\in\mathcal{X}$ and $\delta_{\mathcal{X}}(\mathbf{x})=+\infty$ if $\mathbf{x}\notin\mathcal{X}$. To this end, we state the following three lemmas, which are used to verify the two conditions Eq. (16) and Eq. (17) in Definition 2.

Lemma 2

Let $\{(\mu_{k},\sigma_{k})\}$ be a positive sequence such that $(\mu_{k},\sigma_{k})\rightarrow 0$ , also satisfying the same setting as in Lemma 1. Then for any sequence $\{\mathbf{x}_{k}\}$ converging to $\bar{\mathbf{x}}$ , $\limsup\limits_{k\rightarrow\infty}f_{k}^{*}(\mathbf{x}_{k})\leq~{}f^{*}(\bar{\mathbf{x}}).$

Lemma 3

Let $\{(\mu_{k},\theta_{k},\sigma_{k})\}$ be a positive sequence such that $\lim_{k\rightarrow\infty}(\mu_{k},\theta_{k},\sigma_{k})=0$ , and satisfy the same setting as in Lemma 1. Given $\bar{\mathbf{x}}\in\mathcal{X}$ , then for any sequence $\{\mathbf{x}_{k}\}$ converging to $\bar{\mathbf{x}}$ , we have $\liminf\limits_{k\rightarrow\infty}\varphi_{k}(\mathbf{x}_{k})\geq\varphi(\bar{\mathbf{x}}).$

Lemma 4

Let $\{(\mu_{k},\theta_{k},\sigma_{k})\}$ be a positive sequence such that $\lim_{k\rightarrow\infty}(\mu_{k},\theta_{k},\sigma_{k})=0$ , and satisfy the same setting as in Lemma 1. Then for any $\mathbf{x}\in\mathcal{X}$ , $\limsup\limits_{k\rightarrow\infty}\varphi_{k}(\mathbf{x})\leq\varphi(\mathbf{x}).$

TABLE II: Convergence of existing methods and BVFSM. For each method, we list the convergence result, the required conditions, whether the result holds without the LLC condition, and whether the method extends to BLO with constraints and to pessimistic BLO. These convergence results are of two types, asymptotic and non-asymptotic [46, 47]; BVFSM achieves asymptotic convergence without the LLC condition. BVFSM also extends to BLO with constraints and to pessimistic BLO, which the other listed methods do not cover.

Category	Method	Convergence Results	Required Conditions	w/o LLC	Constraints	Pessimistic
EGBMs	FHG/RHG	Asymptotic: $\inf\limits_{\mathbf{x}\in\mathcal{X}}\varphi_{k}(\mathbf{x})\rightarrow\inf\limits_{\mathbf{x}\in\mathcal{X}}\varphi(\mathbf{x})$	$F(\mathbf{x},\mathbf{y})$ and $f(\mathbf{x},\mathbf{y})$ are $C^{1}$. $\mathcal{S}(\mathbf{x})$ is a singleton.	✗	✗	✗
EGBMs	TRHG	Non-asymptotic: $\mathbf{x}_{k}{\longrightarrow}\widehat{\mathbf{x}}^{*}$	$F(\mathbf{x},\mathbf{y})$ is $C^{1}$ and bounded below. $f(\mathbf{x},\mathbf{y})$ is $C^{1}$, $L_{f}$-smooth and strongly convex.	✗	✗	✗
EGBMs	BDA	Asymptotic: $\inf\limits_{\mathbf{x}\in\mathcal{X}}\varphi_{k}(\mathbf{x})\rightarrow\inf\limits_{\mathbf{x}\in\mathcal{X}}\varphi(\mathbf{x})$	$F(\mathbf{x},\mathbf{y})$ is $L_{F}$-smooth, convex, bounded below. $f(\mathbf{x},\mathbf{y})$ is $L_{f}$-smooth. $\mathcal{S}(\mathbf{x})$ is a singleton.	✗	✗	✗
IGBMs	CG/Neumann	Non-asymptotic: $\mathbf{x}_{k}{\longrightarrow}\widehat{\mathbf{x}}^{*}$	$F(\mathbf{x},\mathbf{y})$ and $f(\mathbf{x},\mathbf{y})$ are $C^{1}$. $\frac{\partial^{2}f(\mathbf{x},\mathbf{y})}{\partial\mathbf{y}\partial\mathbf{y}}$ is invertible. $\mathcal{S}(\mathbf{x})$ is a singleton.	✗	✗	✗
Ours	BVFSM	Asymptotic: $\inf\limits_{\mathbf{x}\in\mathcal{X}}\varphi_{k}(\mathbf{x})\rightarrow\inf\limits_{\mathbf{x}\in\mathcal{X}}\varphi(\mathbf{x})$	$F(\mathbf{x},\mathbf{y})$ and $f(\mathbf{x},\mathbf{y})$ are $C^{1}$ and level-bounded.	✔	✔	✔

1. $C^{1}$ denotes continuously differentiable. $L_{f}$ (or $L_{F}$)-smooth means the gradient of $f$ (or $F$) is Lipschitz continuous with Lipschitz constant $L_{f}$ (or $L_{F}$). “Level-bounded” is short for “level-bounded in $\mathbf{y}$ locally uniformly in $\mathbf{x}\in\mathcal{X}$”.
2. $\widehat{\mathbf{x}}^{*}$ denotes the stationary point.

TABLE III: Complexity of existing gradient-based methods and BVFSM. We show the key update ideas for calculating the term $G(\mathbf{x})$ of $\frac{\partial\varphi(\mathbf{x})}{\partial\mathbf{x}}$. Please see [2, 27, 28] for more details of EGBMs and IGBMs. Note that our method avoids solving an unrolled dynamic system or approximating the inverse of Hessian.

Category	Method	Key point for calculating $G(\mathbf{x})$		Time	Space
EGBMs	FHG	$G(\mathbf{x})\approx\mathbf{Z}_{T}^{\top}\frac{\partial F(\mathbf{x},\mathbf{y}_{T})}{\partial\mathbf{y}}$	$\mathbf{Z}_{t}=\frac{\partial^{2}f}{\partial\mathbf{y}^{2}}\mathbf{Z}_{t-1}+\frac{\partial^{2}f}{\partial\mathbf{y}\partial\mathbf{x}}$	$O(m^{2}nT)$	$O(mn)$
	RHG	$G(\mathbf{x})\approx\mathbf{q}_{-1}$	$\mathbf{q}_{t-1}=\mathbf{q}_{t}+\left(\frac{\partial^{2}f}{\partial\mathbf{x}\partial\mathbf{y}}\right)^{\top}\mathbf{p}_{t},\ \mathbf{p}_{t-1}=\left(\frac{\partial^{2}f}{\partial\mathbf{y}^{2}}\right)^{\top}\mathbf{p}_{t}$	$O(n(m+n)T)$	$O(m+nT)$
	TRHG	$G(\mathbf{x})\approx\mathbf{q}_{I-1}$		$O(n(m+n)I)$	$O(m+nI)$
	BDA	$G(\mathbf{x})\approx\mathbf{q}_{-1}$	Same as RHG, but replace $f$ with $(1-\alpha)f+\alpha F$	$O(n(m+n)T)$	$O(m+nT)$
IGBMs	CG	$G(\mathbf{x})\approx-\left(\frac{\partial^{2}f(\mathbf{x},\mathbf{y}_{T})}{\partial\mathbf{y}\partial\mathbf{x}}\right)^{\top}\mathbf{q}$	$\frac{\partial^{2}f}{\partial\mathbf{y}^{2}}\mathbf{q}=\frac{\partial F}{\partial\mathbf{y}}$	$O(m+nT+n^{2}Q)$	$O(m+n)$
IGBMs	Neumann		$\mathbf{q}=\sum_{i=0}^{Q}\left(\mathbf{I}-\frac{\partial^{2}f}{\partial\mathbf{y}^{2}}\right)^{i}\frac{\partial F}{\partial\mathbf{y}}$	$O(m+nT+n^{2}Q)$	$O(m+n)$
Ours	BVFSM	$G(\mathbf{x})\approx\frac{\partial P_{\sigma_{k}}\!\left(f\left(\mathbf{x}_{l},\mathbf{y}_{k,l}^{T_{\mathbf{y}}}\right)-f_{k,l}^{T_{\mathbf{z}}}\right)}{\partial\mathbf{x}}$	$f_{k,l}^{T_{\mathbf{z}}}=f(\mathbf{x}_{l},\mathbf{z}_{k,l}^{T_{\mathbf{z}}})+\frac{\mu_{k}}{2}\|\mathbf{z}_{k,l}^{{T_{\mathbf{z}}}}\|^{2}$	$O(m+n(T_{\mathbf{z}}+T_{\mathbf{y}}))$	$O(m+n)$

Now, by combining the above results, we can obtain the desired epiconvergence result, which also indicates the convergence of our method. Note that this asymptotic convergence is a different type of guarantee from the non-asymptotic convergence of algorithm iterates.

Theorem 1 (Convergence for Optimistic BLO)

Let $\{(\mu_{k},\theta_{k},\sigma_{k})\}$ be a positive sequence such that $(\mu_{k},\theta_{k},\sigma_{k})\rightarrow 0$ , also satisfying the same setting as in Lemma 1.

(1) The epiconvergence holds:

\varphi_{k}(\mathbf{x})+\delta_{\mathcal{X}}(\mathbf{x})\stackrel{e}{\longrightarrow}\varphi(\mathbf{x})+\delta_{\mathcal{X}}(\mathbf{x}).

(2) We have the following inequality:

\limsup\limits_{k\rightarrow\infty}\left(\inf\limits_{\mathbf{x}\in\mathcal{X}}\varphi_{k}(\mathbf{x})\right)\leq\inf\limits_{\mathbf{x}\in\mathcal{X}}\varphi(\mathbf{x}).

In addition, if $\mathbf{x}_{\ell}\in\mathrm{argmin}_{\mathbf{x}\in\mathcal{X}}\varphi_{\ell}(\mathbf{x})$ for some subsequence $\{\ell\}\subset\mathbb{N}$ , and $\mathbf{x}_{\ell}$ converges to $\tilde{\mathbf{x}}$ , then $\tilde{\mathbf{x}}\in\mathrm{argmin}_{\mathbf{x}\in\mathcal{X}}\varphi(\mathbf{x})$ and

\lim\limits_{\ell\rightarrow\infty}\left(\inf\limits_{\mathbf{x}\in\mathcal{X}}\varphi_{\ell}(\mathbf{x})\right)=\inf\limits_{\mathbf{x}\in\mathcal{X}}\varphi(\mathbf{x}).

Proof.

To prove the epiconvergence of $\varphi_{k}$ to $\varphi$ , we just need to verify that the sequence $\{\varphi_{k}\}$ satisfies the two conditions given in Definition 2. Considering any sequence $\{\mathbf{x}_{k}\}$ converging to $\mathbf{x}$ , if $\mathbf{x}\in\mathcal{X}$ , from Lemma 3 we have

	$\displaystyle\varphi(\mathbf{x})+\delta_{\mathcal{X}}(\mathbf{x})=\varphi(\mathbf{x})$	$\displaystyle\leq\liminf_{k\rightarrow\infty}\varphi_{k}(\mathbf{x}_{k})$
		$\displaystyle\leq\liminf_{k\rightarrow\infty}\varphi_{k}(\mathbf{x}_{k})+\delta_{\mathcal{X}}(\mathbf{x}_{k}).$

When $\mathbf{x}\notin\mathcal{X}$ , we have $\liminf_{k\rightarrow\infty}\varphi_{k}(\mathbf{x}_{k})+\delta_{\mathcal{X}}(\mathbf{x}_{k})=+\infty$ because $\mathcal{X}$ is closed. Thus the first condition Eq. (16) in Definition 2 is satisfied.

Next, for any $\mathbf{x}\in\mathbb{R}^{m}$ , if $\mathbf{x}\in\mathcal{X}$ , then it follows from Lemma 4 that

	$\displaystyle\limsup_{k\rightarrow\infty}\varphi_{k}(\mathbf{x})+\delta_{\mathcal{X}}(\mathbf{x})$	$\displaystyle=\limsup_{k\rightarrow\infty}\varphi_{k}(\mathbf{x})$
		$\displaystyle\leq\varphi(\mathbf{x})=\varphi(\mathbf{x})+\delta_{\mathcal{X}}(\mathbf{x}).$

When $\mathbf{x}\notin\mathcal{X}$ , we have $\varphi(\mathbf{x})+\delta_{\mathcal{X}}(\mathbf{x})=+\infty$ . Thus, the second condition Eq. (17) in Definition 2 is satisfied. Therefore, we get the conclusion (1) immediately from Definition 2, and the conclusion (2) follows from [43, Proposition 4.6]. ∎

Next, we consider the convergence for pessimistic BLO. To begin with, for pessimistic BLO without functional constraints, we denote $\varphi^{p}_{k}(\mathbf{x})$ similarly to the optimistic case:

	$\displaystyle\varphi^{p}_{k}(\mathbf{x}):=$	$\displaystyle\varphi^{p}_{\mu_{k},\theta_{k},\sigma_{k}}(\mathbf{x})$
	$\displaystyle=$	$\displaystyle\max\limits_{\mathbf{y}}\Big\{F(\mathbf{x},\mathbf{y})-P_{k}\!\left(f(\mathbf{x},\mathbf{y})-f_{k}^{*}(\mathbf{x})\right)-\frac{\theta_{k}}{2}\|\mathbf{y}\|^{2}\Big\},$

where $f^{*}_{k}(\mathbf{x})=\min_{\mathbf{y}}\left\{f(\mathbf{x},\mathbf{y})+\frac{\mu_{k}}{2}\|\mathbf{y}\|^{2}\right\}.$ Then we have the following corollary. Note that this convergence result can also be extended to pessimistic BLO with constraints easily.

Corollary 1 (Convergence for Pessimistic BLO)

Let $\{(\mu_{k},\theta_{k},\sigma_{k})\}$ be a positive sequence such that $(\mu_{k},\theta_{k},\sigma_{k})\rightarrow 0$ , also satisfying the same setting as in Lemma 1. Then we have the following inequality:

\limsup\limits_{k\rightarrow\infty}\left(\inf\limits_{\mathbf{x}\in\mathcal{X}}\varphi^{p}_{k}(\mathbf{x})\right)\leq\inf\limits_{\mathbf{x}\in\mathcal{X}}\varphi^{p}(\mathbf{x}).

In addition, if $\mathbf{x}_{\ell}\in\mathrm{argmin}_{\mathbf{x}\in\mathcal{X}}\varphi^{p}_{\ell}(\mathbf{x})$ for some subsequence $\{\ell\}\subset\mathbb{N}$ , and $\mathbf{x}_{\ell}$ converges to $\tilde{\mathbf{x}}$ , then we have $\tilde{\mathbf{x}}\in\mathrm{argmin}_{\mathbf{x}\in\mathcal{X}}\varphi^{p}(\mathbf{x})$ and

\lim\limits_{\ell\rightarrow\infty}\left(\inf\limits_{\mathbf{x}\in\mathcal{X}}\varphi^{p}_{\ell}(\mathbf{x})\right)=\inf\limits_{\mathbf{x}\in\mathcal{X}}\varphi^{p}(\mathbf{x}).

Proof.

Following the proof of Theorem 1, we first need to prove $\varphi^{p}_{k}(\mathbf{x})+\delta_{\mathcal{X}}(\mathbf{x})\stackrel{e}{\longrightarrow}\varphi^{p}(\mathbf{x})+\delta_{\mathcal{X}}(\mathbf{x})$ by means of Lemmas 1, 2, 3, and 4. Lemmas 1 and 2 do not depend on whether the problem is optimistic or pessimistic, and thus hold without change. The counterparts of Lemmas 3 and 4 can be derived simply by replacing $F$ in their proofs with $-F$ . The conclusion then follows by the same argument as in Theorem 1. ∎

In Table II, we compare existing methods with our BVFSM. Under mild assumptions, BVFSM achieves asymptotic convergence without the LLC restriction and can be applied to BLO with constraints and to pessimistic BLO, neither of which is supported by the other methods. In addition, as shown in Theorem 1, our asymptotic convergence is derived from the epiconvergence property, which is a stronger result than asymptotic convergence alone.

4.2 Complexity Analysis

In this part, we compare the time and space complexity of Algorithm 2 with that of EGBMs (i.e., FHG, RHG, TRHG and BDA) and IGBMs (i.e., CG and Neumann) for computing $G(\mathbf{x})$ or $\frac{\partial\varphi(\mathbf{x})}{\partial\mathbf{x}}$ , i.e., the direction for updating the variable $\mathbf{x}$ . Table III summarizes the complexity results. Our complexity analysis follows the assumptions in [5]. Note that BVFSM has an order of magnitude lower time complexity with respect to the LL dimension $n$ compared to existing methods. For all existing methods, we assume that the LL problem is solved by $T$ steps of gradient descent; for EGBMs, this $T$ -step gradient descent also plays the role of the transition function $\Phi$ used to obtain $\mathbf{y}_{T}(\mathbf{x})$ .

EGBMs. As discussed in [2, 26], after implementing $T$ steps of gradient descent with time and space complexity of $O(n)$ to solve the LL problem, FHG, which computes Hessian-matrix products in forward mode, can be evaluated in time $O(m^{2}nT)$ and space $O(mn)$ , and RHG, which computes Hessian- and Jacobian-vector products in reverse mode, can be evaluated in time $O(n(m+n)T)$ and space $O(m+nT)$ . TRHG truncates the length of the back-propagation trajectory to $I$ after a $T$ -step gradient descent, and thus reduces the time and space complexity to $O(n(m+n)I)$ and $O(m+nI)$ , respectively. BDA uses the same idea as RHG, except that it combines the UL and LL objectives during back-propagation, so its time and space complexity are of the same order as those of RHG. The time complexity for EGBMs to calculate the UL gradient is proportional to $T$ , the number of iterations for the LL problem, and thus EGBMs take a large amount of time to ensure convergence.

IGBMs. After implementing a $T$ -step gradient descent for the LL problem, IGBMs approximate the inverse of the Hessian matrix either by conjugate gradient (CG), which solves a linear system in $Q$ steps, or by a truncated Neumann series. Note that each step of CG and of the Neumann method involves a Hessian-vector product, so the $Q$ steps require $O(m+n^{2}Q)$ time and $O(m+n)$ space in total; hence IGBMs run in time $O(m+nT+n^{2}Q)$ and space $O(m+n)$ . IGBMs decouple the complexity of calculating the UL gradient from being proportional to $T$ , but the iteration number $Q$ always relies on the properties of the Hessian matrix, and in some cases, $Q$ can be much larger than $T$ .

BVFSM. In our algorithm, it takes time $O(nT_{\mathbf{z}})$ and space $O(n)$ to calculate $T_{\mathbf{z}}$ steps of gradient descent on Eq. (10) for the solution of LL problem $\mathbf{z}^{T_{\mathbf{z}}}$ . Then $T_{\mathbf{y}}$ steps of gradient descent on Eq. (11) are used to calculate $\mathbf{y}^{T_{\mathbf{y}}}$ , which requires time $O(nT_{\mathbf{y}})$ and space $O(n)$ . After that, the direction can be obtained according to the formula given in Eq. (13) by several computations of the gradient $\frac{\partial f}{\partial\mathbf{x}}$ and $\frac{\partial F}{\partial\mathbf{x}}$ without any intermediate update, which requires time $O(m)$ and space $O(m+n)$ . Therefore, BVFSM runs in time $O(m+n(T_{\mathbf{z}}+T_{\mathbf{y}}))$ and space $O(m+n)$ for each iteration.
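The three stages described above can be sketched on a toy problem with LL objective $f(\mathbf{x},\mathbf{y})=\sum_{i}\sin(\mathbf{x}+[\mathbf{y}]_{i}-[\mathbf{c}]_{i})$ and UL objective $F(\mathbf{x},\mathbf{y})=\|\mathbf{x}-a\|^{2}+\|\mathbf{y}-a-\mathbf{c}\|^{2}$ (the numerical example of Section 5.1.1). This is a minimal illustration under stated assumptions, not the authors' implementation: the name `bvfsm_direction`, the hand-written derivatives, and the truncated quadratic penalty $P_{\sigma}(\omega)=\max(\omega,0)^{2}/(2\sigma)$ are choices made here for illustration, and the direction is obtained by differentiating $F+P_{\sigma}(f(\mathbf{x},\mathbf{y})-f^{T_{\mathbf{z}}})$ with respect to $\mathbf{x}$ while holding $\mathbf{y}$ and $\mathbf{z}$ fixed. Only first-order derivatives of $F$ and $f$ appear.

```python
import math

def f(x, y, c):
    """LL objective of the toy problem: sum_i sin(x + y_i - c_i)."""
    return sum(math.sin(x + yi - ci) for yi, ci in zip(y, c))

def df_dx(x, y, c):
    return sum(math.cos(x + yi - ci) for yi, ci in zip(y, c))

def df_dy(x, y, c):
    return [math.cos(x + yi - ci) for yi, ci in zip(y, c)]

def bvfsm_direction(x, y, z, a, c, mu, theta, sigma,
                    step=0.01, T_z=50, T_y=25):
    """One hedged sketch of the update direction G(x); x is a scalar."""
    # (1) T_z gradient steps on the regularized LL problem
    #     f(x, z) + mu/2 * ||z||^2.
    for _ in range(T_z):
        g = df_dy(x, z, c)
        z = [zi - step * (gi + mu * zi) for zi, gi in zip(z, g)]
    f_mu = f(x, z, c) + 0.5 * mu * sum(zi * zi for zi in z)

    # (2) T_y gradient steps on F + P_sigma(f - f_mu) + theta/2 * ||y||^2,
    #     with the assumed penalty P_sigma(w) = max(w, 0)^2 / (2 * sigma).
    for _ in range(T_y):
        slope = max(f(x, y, c) - f_mu, 0.0) / sigma  # derivative of P_sigma
        g = df_dy(x, y, c)
        y = [yi - step * (2.0 * (yi - a - ci) + slope * gi + theta * yi)
             for yi, gi, ci in zip(y, g, c)]

    # (3) Direction: dF/dx plus the x-derivative of the penalty term,
    #     using only first-order derivatives at the current y and z.
    slope = max(f(x, y, c) - f_mu, 0.0) / sigma
    G = 2.0 * (x - a) + slope * (df_dx(x, y, c) - df_dx(x, z, c))
    return G, y, z

G, y_new, z_new = bvfsm_direction(8.0, [8.0, 8.0], [8.0, 8.0],
                                  a=2.0, c=[2.0, 2.0],
                                  mu=1e-3, theta=1e-3, sigma=1.0)
print(G, y_new, z_new)
```

Each stage costs a number of gradient evaluations proportional to $T_{\mathbf{z}}$ or $T_{\mathbf{y}}$ , and no trajectory or second-order quantity is stored.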

It can be observed from Table III that BVFSM needs less space than EGBMs, and it takes much less time than EGBMs and IGBMs, especially when $n$ is large, meaning the LL problem is high-dimensional, as in application tasks with a large-scale network. Overall, this is because BVFSM does not need to compute any Hessian- or Jacobian-vector products, either for solving the unrolled dynamic system by recurrent iteration or for approximating the inverse of the Hessian. Its complexity only comes from calculating the gradients of $F$ and $f$ , which is much easier than calculating Hessian- and Jacobian-vector products (even by AD). Besides, although BVFSM has the same order of space complexity as IGBMs, its actual memory usage is smaller, because it does not need to store the computational graph used for calculating the Hessian. We will further verify these advantages through numerical results in Section 5.
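To make the comparison in Table III concrete, the following sketch evaluates the leading-order time-cost expressions for given dimensions. The function name `time_cost` and the truncation length `I=50` are illustrative assumptions; the defaults for $T$ , $Q$ , $T_{\mathbf{z}}$ and $T_{\mathbf{y}}$ mirror the values used in the numerical experiments. The results are unit-free operation counts with constants dropped, not measured running times.

```python
def time_cost(m, n, T=100, Q=20, I=50, T_z=50, T_y=25):
    """Leading-order time-cost expressions from Table III (constants dropped)."""
    return {
        "FHG": m ** 2 * n * T,
        "RHG": n * (m + n) * T,
        "TRHG": n * (m + n) * I,
        "BDA": n * (m + n) * T,
        "CG": m + n * T + n ** 2 * Q,
        "Neumann": m + n * T + n ** 2 * Q,
        "BVFSM": m + n * (T_z + T_y),
    }

# Growth with the LL dimension n for a scalar UL variable (m = 1).
for n in (10, 1000, 100000):
    costs = time_cost(m=1, n=n)
    print(n, {name: costs[name] for name in ("RHG", "CG", "BVFSM")})
```

The expressions for RHG, TRHG, BDA, CG and Neumann contain an $n^{2}$ term, while the BVFSM expression is linear in $n$ , which is why the gap widens as the LL dimension grows.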

Figure 1(a): Convergence with the initial point $(\mathbf{x},\mathbf{y})=(8,8,8)$

5 Experimental Results

In this section, we quantitatively demonstrate the performance of our BVFSM (code is available at https://github.com/vis-opt-group/BVFSM), especially when dealing with complicated and high-dimensional problems. We start by investigating the convergence performance, computational efficiency, and effect of hyper-parameters on numerical examples in Section 5.1. In Section 5.2, we apply BVFSM to hyper-parameter optimization for the data hyper-cleaning task under different settings, including the type of auxiliary functions, contamination rates, and various network structures. To further validate the generality of our method, we conduct experiments on other tasks such as few-shot learning in Section 5.3 and GAN in Section 5.4. The experiments were conducted on a PC with an Intel Core i7-9700K CPU (4.2 GHz), 32GB RAM and an NVIDIA GeForce RTX 2060S GPU with 8GB VRAM, and the algorithm was implemented using PyTorch 1.6. We use the implementations in [48, 36] for the existing methods, and use MB (megabytes) and S (seconds) as the evaluation units of space and time complexity, respectively. Regarding the selection of coefficients and hyper-parameters, we evaluate them in the numerical experiments and use the same method to select them in later tasks. Furthermore, we set $\mathbf{y}_{k,l}^{0}=\mathbf{y}_{k,l-1}^{T_{\mathbf{y}}}$ , $\mathbf{z}_{k,l}^{0}=\mathbf{z}_{k,l-1}^{T_{\mathbf{z}}}$ to initialize each step of the sub-problems. As for the optimizer, we use SGD for solving the LL and UL sub-problems in numerical experiments. In some applications, we change the UL optimizer to Adam to speed up convergence.

5.1 Numerical Evaluations

5.1.1 Optimistic BLO

We start with the optimistic BLO and use a numerical example with a non-convex LL problem of adjustable dimension to validate the effectiveness of BVFSM over existing methods. In particular, consider

		$\displaystyle\min_{\mathbf{x}\in\mathbb{R},\mathbf{y}\in\mathbb{R}^{n}}\|\mathbf{x}-a\|^{2}+\|\mathbf{y}-a-\mathbf{c}\|^{2}$		
		$\displaystyle\text{\ s.t.\ }\;[\mathbf{y}]_{i}\in\underset{[\mathbf{y}]_{i}\in\mathbb{R}}{\mathrm{argmin}}\;\sin(\mathbf{x}+[\mathbf{y}]_{i}-[\mathbf{c}]_{i}),\forall\ i,$		(18)

where $[\mathbf{y}]_{i}$ denotes the $i$ -th component of $\mathbf{y}$ , while $a\in\mathbb{R}$ and $\mathbf{c}\in\mathbb{R}^{n}$ are adjustable parameters. Note that here $\mathbf{x}\in\mathbb{R}$ is a scalar, but we still use the bold letter to keep the notation consistent with the rest of the paper. The solution of this problem is $\mathbf{x}^{*}=\frac{(1-n)a+nC}{1+n},\ \text{ and }\ [\mathbf{y}^{*}]_{i}=C+[\mathbf{c}]_{i}-\mathbf{x}^{*},\forall\ i,$ where $C=\underset{C_{k}}{\operatorname{argmin}}\left\{\|C_{k}-2a\|:C_{k}=-\frac{\pi}{2}+2k\pi,k\in\mathbb{Z}\right\},$ and the optimal value is $F^{*}=\frac{n(C-2a)^{2}}{1+n}$ (the derivation of the closed-form solution is provided in Appendix B, available at https://arxiv.org/abs/2110.04974). This example satisfies all the assumptions of BVFSM, but does not meet the LLC assumption in [27, 9, 28], which makes it a good example to validate the advantages of BVFSM. In the following experiments we set $a=2$ and $[\mathbf{c}]_{i}=2,\text{ for any }i=1,2,\cdots,n$ .
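As a quick sanity check of the closed-form expressions above, the following sketch evaluates them and compares the resulting optimal value with a direct evaluation of the UL objective. The function names `closed_form_solution` and `upper_objective` are introduced here for illustration only; the nearest $C_{k}$ to $2a$ is found by rounding the real-valued index $k$ .

```python
import math

def closed_form_solution(n, a, c):
    """Evaluate the stated closed-form solution of Eq. (18)."""
    # LL minimizers satisfy x + y_i - c_i = C_k = -pi/2 + 2*k*pi;
    # pick the C_k closest to 2a by rounding the real-valued index k.
    k = round((2 * a + math.pi / 2) / (2 * math.pi))
    C = -math.pi / 2 + 2 * k * math.pi
    x_star = ((1 - n) * a + n * C) / (1 + n)
    y_star = [C + c_i - x_star for c_i in c]
    F_star = n * (C - 2 * a) ** 2 / (1 + n)
    return x_star, y_star, F_star

def upper_objective(x, y, a, c):
    """UL objective ||x - a||^2 + ||y - a - c||^2 for scalar x."""
    return (x - a) ** 2 + sum((y_i - a - c_i) ** 2 for y_i, c_i in zip(y, c))

# Setting used in the experiments: a = 2 and c_i = 2, here with n = 2.
x_star, y_star, F_star = closed_form_solution(2, 2.0, [2.0, 2.0])
print(x_star, y_star, F_star)
print(upper_objective(x_star, y_star, 2.0, [2.0, 2.0]))
```

For $n=2$ , $a=2$ and $[\mathbf{c}]_{i}=2$ , the nearest $C_{k}$ to $2a=4$ is $3\pi/2$ , which gives $\mathbf{x}^{*}=\pi-2/3$ and $[\mathbf{y}^{*}]_{i}=8/3+\pi/2$ .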

We compare BVFSM with several gradient-based optimization methods, including RHG, BDA, CG and Neumann. Note that they all assume the solution of the LL problem is unique except BDA, so for these methods we directly regard the obtained local optimal solutions of LL problems as the unique solutions. We set $T=100$ for RHG and BDA, $T=100$ , $Q=20$ for CG and Neumann, the aggregation parameters equal to $0.5$ in BDA, and $(\mu_{k},\theta_{k},\sigma_{k}^{(1)})=(1.0,1.0,1.0)/1.01^{k}$ , $\sigma_{k}^{(2)}=f(\mathbf{x}_{k},\mathbf{y}_{k})+1$ , step size $\alpha=0.01$ , $T_{\mathbf{z}}=50$ , $T_{\mathbf{y}}=25$ , and $L=1$ in BVFSM.
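The decaying parameter sequence used for BVFSM above can be written as a small helper. This is a sketch only: the function name `bvfsm_schedule` and its keyword arguments are assumptions introduced here, and the second auxiliary parameter $\sigma_{k}^{(2)}$ is omitted because it depends on the current iterate.

```python
def bvfsm_schedule(k, decay=1.01, init=(1.0, 1.0, 1.0)):
    """Return (mu_k, theta_k, sigma_k^(1)) = init / decay**k."""
    scale = decay ** k
    return tuple(value / scale for value in init)

# The sequence stays positive and tends to zero as k grows.
for k in (0, 100, 1000):
    print(k, bvfsm_schedule(k))
```

A decay factor close to 1 shrinks the regularization and auxiliary parameters slowly, which is consistent with the requirement that the sequence be positive and converge to zero.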

TABLE IV: Errors of the UL variable, $\|\mathbf{x}-\mathbf{x}^{*}\|/\|\mathbf{x}^{*}\|$ , with large-scale LL of dimension $n$ .

$n$	RHG	BDA	CG	Neumann	BVFSM
50	2.296	2.336	2.058	2.260	0.117
100	2.253	2.294	2.073	2.236	0.159
150	2.213	2.253	2.032	2.202	0.190
200	2.187	2.227	1.972	2.178	0.209

Convergence performance. Figure 1 compares the convergence curves of UL variable $\mathbf{x}$ and objective $F(\mathbf{x},\mathbf{y})$ in the 2-dimensional case ( $n=2$ ). Here the optimal solution is $(\mathbf{x}^{*},\mathbf{y}^{*})=(-2/3+\pi,8/3+\pi/2,8/3+\pi/2)$ . In order to show the impact of initial points, we also set different initial points. From Figure 1(a), when the initial point is $(\mathbf{x},\mathbf{y})=(8,8,8)$ , existing methods show the trend of convergence at the beginning of iteration, but they soon stop further converging due to falling into a local optimal solution. Furthermore, when the initial point is $(\mathbf{x},\mathbf{y})=(0,0,0)$ in Figure 1(b), existing methods show a trend that the distance to the optimal solution even increases during the whole iterative process because they incorrectly converge to the local solution away from the global solution. On the contrary, our method can converge to the optimal solution under different initial points. Table IV further verifies the convergence performance for larger-scale problems of various LL dimension $n$ . It shows that our method can still maintain good convergence performance with high-dimensional LL, while existing methods fail because they cannot solve the non-convex LL with convergence guarantee.

TABLE V: The largest LL dimension $n$ that can be achieved by different methods for a single-step computation within 3600 seconds.

	RHG	BDA	CG	Neumann	BVFSM
$n$	13089	12871	15093	18118	283200

Computational efficiency for large-scale problems. Figure 2 compares the computation time for problems under various scales $n$ and $m$ . Note that the UL dimension $m$ can be scaled up by replacing the one-dimensional $\mathbf{x}$ with the mean of a multi-dimensional $\mathbf{x}$ . As we can see, our method requires the least computation time for problems of all scales, and the LL dimension $n$ has much more influence than the UL dimension $m$ . Table V shows the largest LL dimension that each method can handle within the 3600-second time limit. This allows us to apply BVFSM to more complex LL problems, which existing methods cannot deal with. We attribute these superior results to our value-function-based re-characterization. We further explore our performance on problems with complex network structures in Section 5.2.

Effect of hyper-parameters. We next evaluate the effect of various hyper-parameters in the 2-dimensional case ( $n=2$ ). In Figure 3, we compare the errors under different settings of $T_{\mathbf{z}}$ , $T_{\mathbf{y}}$ , $\mu$ and $\theta$ . In Figure 3(a), with larger regularization coefficients $\mu$ and $\theta$ , the regularized problems deviate from the original problems. Figure 3(b) shows that smaller $\mu$ and $\theta$ may not completely overcome the ill-conditioning of the LL problem: the approximate problem remains somewhat ill-conditioned, which makes the surfaces unstable. Hence, determining the regularization coefficients is not an easy task. Since the selection of such parameters is often highly problem-specific, to keep the comparison of computational burden with other methods fair, we set $T_{\mathbf{z}}+2T_{\mathbf{y}}=T$ in all experiments (because each step of the $T_{\mathbf{y}}$ -step gradient descent requires the gradients of two functions, $F$ and $f$ ). Figure 4 shows the effect of $\mu$ , $\theta$ , and $\sigma$ on the convergence results. Figure 4(a) reveals the effect of the regularization coefficients $\mu$ and $\theta$ . We find that when $\mu=0$ , collapse may occur (with a collapse rate of around $44\%$ ), which indicates that the regularization term is necessary to avoid collapse and improve computational stability. Figure 4(b) shows that it is a good choice to use smaller $\mu,\theta$ and larger $\sigma$ with a suitable decay factor to avoid the offset of the solution and achieve better convergence. Figure 5 further analyzes the effect of $L$ , the number of inner-loop iterations, on the convergence speed. The smaller $L$ is, the faster the convergence, so we set $L=1$ in all experiments.

5.1.2 BLO with Constraints

To show the performance of BVFSM for problems with constraints discussed in Section 3.3, we use the following constrained example with non-convex LL: $\small\begin{aligned} &\min_{\mathbf{x}\in\mathbb{R},\mathbf{y}\in\mathbb{R}^{n}}\|\mathbf{x}-a\|^{2}+\left\|\mathbf{y}-a\right\|^{2}\\ &\text{ s.t. }[\mathbf{y}]_{i}\in\mathop{\arg\min}_{[\mathbf{y}]_{i}\in\mathbb{R}}\Big\{\sin\left(\mathbf{x}+[\mathbf{y}]_{i}-[\mathbf{c}]_{i}\right):\mathbf{x}+[\mathbf{y}]_{i}\in[0,1]\Big\},\forall\ i,\end{aligned}$ where $a\in\mathbb{R}$ is a fixed constant and $\mathbf{c}\in\mathbb{R}^{n}$ is a fixed vector satisfying $[\mathbf{c}]_{i}\in[0,1]\text{ for any }i=1,\cdots,n$ . The optimal solution is $\mathbf{x}^{*}=\frac{1-n}{1+n}a,\ [\mathbf{y}^{*}]_{i}=-\mathbf{x}^{*},\forall\ i,$ and the optimal value is $F^{*}=\frac{4n}{1+n}a^{2}$ . Derivation of the closed-form solution is provided in Appendix B. We conduct experiments under the 2-dimensional case ( $n=2$ ) and set $a=2$ and $[\mathbf{c}]_{i}=1,\text{ for any }i=1,2,\cdots,n$ . The constraint is imposed via $(\mathbf{x}+[\mathbf{y}]_{i}-0.5)^{2}-0.25\leq 0$ , for each component of $\mathbf{y}$ , which is equivalent to $\mathbf{x}+[\mathbf{y}]_{i}\in\left[0,1\right]$ .
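The inequality form of the constraint and the closed-form solution above can be checked with a few lines of code. The helper names `constraint_value` and `constrained_closed_form` are hypothetical and serve only this illustration. Since $[\mathbf{c}]_{i}\in[0,1]$ , the argument of the sine stays within $[-1,1]$ on the feasible set, where the sine is increasing, so the LL minimizer sits at $\mathbf{x}+[\mathbf{y}]_{i}=0$ .

```python
def constraint_value(x, y_i):
    """Inequality form from the text: value <= 0 iff x + y_i lies in [0, 1]."""
    return (x + y_i - 0.5) ** 2 - 0.25

def constrained_closed_form(n, a):
    """Evaluate the stated closed-form solution of the constrained example."""
    x_star = (1 - n) / (1 + n) * a
    y_star = [-x_star] * n
    F_star = 4 * n / (1 + n) * a ** 2
    return x_star, y_star, F_star

def upper_objective(x, y, a):
    """UL objective ||x - a||^2 + ||y - a||^2 for scalar x."""
    return (x - a) ** 2 + sum((y_i - a) ** 2 for y_i in y)

# Setting used in the experiments: n = 2 and a = 2.
x_star, y_star, F_star = constrained_closed_form(2, 2.0)
print(x_star, y_star, F_star, upper_objective(x_star, y_star, 2.0))
print([constraint_value(x_star, y_i) for y_i in y_star])
```

At the optimum the constraint is active, so `constraint_value` evaluates to zero for every component.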

Figure 6 displays the solutions after $K$ iterations. When the LL problem is constrained, only BVFSM handles the constraint effectively. Hence, our method is applicable to a broader range of problems; in Section 5.2, we further report experiments on real learning tasks with UL constraints.

TABLE VI: Comparison among existing methods, BVFSM, and BVFSM with constraints (BVFSM-C) for data hyper-cleaning tasks on three datasets: MNIST, FashionMNIST and CIFAR10. F1 score is the harmonic mean of precision and recall.

Method	MNIST			FashionMNIST			CIFAR10
Method	Accuracy	F1 score	Time (S)	Accuracy	F1 score	Time (S)	Accuracy	F1 score	Time (S)
RHG	87.90 $\pm$ 0.27	89.36 $\pm$ 0.11	0.4131	81.91 $\pm$ 0.18	87.12 $\pm$ 0.19	0.4589	34.95 $\pm$ 0.47	68.27 $\pm$ 0.72	1.3374
TRHG	88.57 $\pm$ 0.18	89.77 $\pm$ 0.29	0.2623	81.85 $\pm$ 0.17	86.76 $\pm$ 0.14	0.2840	35.42 $\pm$ 0.49	68.06 $\pm$ 0.55	0.8409
BDA	87.15 $\pm$ 0.82	90.38 $\pm$ 0.76	0.6694	79.97 $\pm$ 0.71	88.24 $\pm$ 0.58	0.8571	36.41 $\pm$ 0.23	67.33 $\pm$ 0.31	1.4869
CG	89.19 $\pm$ 0.35	85.96 $\pm$ 0.48	0.1799	83.15 $\pm$ 0.24	85.13 $\pm$ 0.27	0.2041	34.16 $\pm$ 0.75	69.10 $\pm$ 0.93	0.4796
Neumann	87.54 $\pm$ 0.13	89.58 $\pm$ 0.34	0.1723	81.37 $\pm$ 0.18	87.28 $\pm$ 0.19	0.1958	33.45 $\pm$ 0.16	68.87 $\pm$ 0.11	0.4694
BVFSM	90.41 $\pm$ 0.32	91.19 $\pm$ 0.25	0.1480	84.31 $\pm$ 0.27	88.35 $\pm$ 0.13	0.1612	38.19 $\pm$ 0.62	69.55 $\pm$ 0.42	0.4092
BVFSM-C	90.94 $\pm$ 0.32	91.83 $\pm$ 0.30	0.1566	83.23 $\pm$ 0.34	89.74 $\pm$ 0.24	0.1514	37.33 $\pm$ 0.33	69.73 $\pm$ 0.51	0.4374

To compare the performance of different auxiliary functions, we try barrier and penalty functions for $P_{H,\sigma_{H}},P_{h,\sigma_{h}}$ and $P_{f,\sigma_{f}}$ . Although these can be selected arbitrarily and independently of one another, here we choose the same type for all three so that the comparison is more direct. Since all of these auxiliary functions guarantee convergence in theory, we mainly focus on their robustness under different settings. From Figure 7, the penalty function converges only under certain settings within a small region, while the barrier function is more robust, so we use barrier functions in the other experiments. In Section 5.2, we further investigate penalty and barrier functions on complex networks.

5.1.3 Pessimistic BLO

To study the performance of BVFSM on pessimistic BLO, we use an example similar to the optimistic one, changing Eq. (18) from $\min_{\mathbf{x}\in\mathbb{R},\mathbf{y}\in\mathbb{R}^{n}}$ to $\min_{\mathbf{x}\in\mathbb{R}}\max_{\mathbf{y}\in\mathbb{R}^{n}}$ and from $\|\mathbf{x}-a\|^{2}+\|\mathbf{y}-a-\mathbf{c}\|^{2}$ to $\|\mathbf{x}-a\|^{2}-\|\mathbf{y}-a-\mathbf{c}\|^{2}$ . Here we consider the 2-dimensional case (LL dimension $n=2$ ), and set $a=2$ and $[\mathbf{c}]_{i}=2\text{ for }i=1,2$ . In this case, the optimal solution is $(\mathbf{x}^{*},\mathbf{y}^{*})=(-2+\pi/2,4\pm\pi,4\pm\pi),$ and the optimal value is $F^{*}=-\frac{7}{4}\pi^{2}-4\pi+16.$ Derivation of the exact solution is provided in Appendix B. We select RHG and BDA as the representatives of gradient-based methods that do and do not require a unique LL solution, respectively. We make no adaptive modifications to these methods, which are not designed for the pessimistic BLO setting.
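The stated pessimistic solution can be cross-checked numerically with a brute-force grid search. The sketch below is only an illustration under the setting $a=2$ , $[\mathbf{c}]_{i}=2$ , $n=2$ ; the function name `pessimistic_value`, the grid range $[-10,10]$ and the grid step $0.001$ are assumptions made here. It uses the fact that every LL solution satisfies $\mathbf{x}+[\mathbf{y}]_{i}-[\mathbf{c}]_{i}=-\frac{\pi}{2}+2k\pi$ , so the inner maximization picks, for each component, the admissible value closest to $\mathbf{x}+a$ .

```python
import math

def pessimistic_value(x, a=2.0, c=(2.0, 2.0)):
    """max over y in S(x) of (x - a)^2 - ||y - a - c||^2 for the toy problem."""
    total = (x - a) ** 2
    u = x + a
    for _ in c:
        # y_i - a - c_i = C_k - x - a with C_k = -pi/2 + 2*k*pi; the inner
        # max selects the C_k nearest to u (c_i cancels out).
        k = round((u + math.pi / 2) / (2 * math.pi))
        nearest = -math.pi / 2 + 2 * k * math.pi
        total -= (nearest - u) ** 2
    return total

# Brute-force minimization over a uniform grid on [-10, 10].
x_grid = [-10.0 + 0.001 * i for i in range(20001)]
x_best = min(x_grid, key=pessimistic_value)
print(x_best, pessimistic_value(x_best))
print(-2 + math.pi / 2, -7 / 4 * math.pi ** 2 - 4 * math.pi + 16)
```

At $\mathbf{x}^{*}=-2+\pi/2$ the two nearest admissible values are equally far away, which is why both signs appear in $[\mathbf{y}^{*}]_{i}=4\pm\pi$ .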

Figure 8 shows the convergence curves of UL objective and how various methods choose $\mathbf{y}\in\mathcal{S}(\mathbf{x})$ when $\mathcal{S}(\mathbf{x})$ is not a singleton. From Figure 8(a), our method has significantly better convergence in the pessimistic case, while RHG and BDA cannot converge at all. Their distances to the optimal solution even increase because they fail to select the optimal LL solution $\mathbf{y}$ from multiple LL solutions $\mathbf{y}\in\mathcal{S}(\mathbf{x})$ , which is intuitively demonstrated in Figure 8(b) and 8(c).

TABLE VII: The effect of contamination rates for data hyper-cleaning. Accuracy and F1 scores of existing methods drop sharply as the contamination rate increases, while BVFSM shows only a slight decreasing trend, verifying the robustness of BVFSM to heavily contaminated data.

Contamination rate	0.6		0.7		0.8		0.9
Method	Accuracy	F1 score	Accuracy	F1 score	Accuracy	F1 score	Accuracy	F1 score
RHG	77.39±0.61	68.18±0.94	75.62±0.94	56.72±0.72	68.91±0.71	46.81±0.78	59.83±0.91	29.39±0.38
TRHG	77.37±0.52	76.76±0.13	75.60±0.84	65.30±0.10	68.89±0.30	55.39±0.97	59.81±0.38	37.97±0.48
BDA	75.44±0.44	78.24±0.34	73.67±0.59	66.78±0.82	66.96±0.69	56.87±0.79	57.88±0.87	39.45±0.08
CG	78.64±0.52	75.17±0.79	76.87±0.06	63.71±0.41	70.16±0.80	53.80±0.09	61.08±0.63	36.38±0.03
Neumann	76.85±0.95	77.29±0.29	75.08±0.15	65.83±0.22	68.37±0.40	55.92±0.46	59.29±0.26	38.50±0.63
BVFSM	81.49±0.22	85.51±0.70	81.34±0.42	82.55±0.33	80.06±0.97	73.51±0.83	79.73±0.20	55.97±0.73

5.2 Hyper-parameter Optimization

In this subsection, we use a specific hyper-parameter optimization task, called data hyper-cleaning, to evaluate the performance of BVFSM when the LL problem is non-convex. Assuming that some of the labels in our dataset are contaminated, the goal of data hyper-cleaning is to reduce the impact of incorrect samples by attaching hyper-parameters (weights) to them. In this experiment, we set $\mathbf{y}\in\mathbb{R}^{10\times 301}\times\mathbb{R}^{300\times d}$ as the parameter of a non-convex 2-layer linear network classifier, where $d$ is the dimension of the data, and $\mathbf{x}\in\mathbb{R}^{|\mathcal{D}_{\mathtt{tr}}|}$ as the weight of each sample in the training set. Therefore, the LL problem is to learn a classifier $\mathbf{y}$ with the cross-entropy loss $g$ weighted by the given $\mathbf{x}$ :

f(\mathbf{x},\mathbf{y})=\sum_{(\mathbf{u}_{i},\mathbf{v}_{i})\in\mathcal{D}_{\mathtt{tr}}}[\mathtt{sigmoid}(\mathbf{x})]_{i}\ g(\mathbf{y},\mathbf{u}_{i},\mathbf{v}_{i}),

where $(\mathbf{u}_{i},\mathbf{v}_{i})$ are the training samples, and $\mathtt{sigmoid}(\mathbf{x})$ is the sigmoid function to constrain the weights $\mathbf{x}$ into the range of $[0,1]$ . The UL problem is to find a weight $\mathbf{x}$ to reduce the cross-entropy loss $g$ of $\mathbf{y}$ on a cleanly labeled validation set:

F(\mathbf{x},\mathbf{y})=\sum_{(\mathbf{u}_{i},\mathbf{v}_{i})\in\mathcal{D}_{\mathtt{val}}}g(\mathbf{y},\mathbf{u}_{i},\mathbf{v}_{i}).

In addition, we also consider adding explicit constraints directly on $\mathbf{x}$ (as discussed in Section 3.3) instead of using the sigmoid function as indirect constraints. The constraint is carried out via $([\mathbf{x}]_{i}-0.5)^{2}-0.25\leq 0$ , for each component of $\mathbf{x}$ , such that $[\mathbf{x}]_{i}\in\left[0,1\right]$ .
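The two objectives of the data hyper-cleaning task can be sketched in plain Python as follows. This is a simplified illustration, not the experimental code: a single linear softmax classifier `W` stands in for the 2-layer network $\mathbf{y}$ , and all function names and the tiny data set are assumptions introduced here.

```python
import math

def cross_entropy(W, u, v):
    """Cross-entropy g of a linear softmax classifier W on one sample (u, v)."""
    logits = [sum(w_j * u_j for w_j, u_j in zip(row, u)) for row in W]
    top = max(logits)  # subtract the maximum for numerical stability
    log_norm = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_norm - logits[v]

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def lower_objective(x, W, train):
    """f(x, y): training loss with one sigmoid-squashed weight per sample."""
    return sum(sigmoid(x_i) * cross_entropy(W, u, v)
               for x_i, (u, v) in zip(x, train))

def upper_objective(W, val):
    """F(x, y): unweighted loss on the cleanly labeled validation set."""
    return sum(cross_entropy(W, u, v) for u, v in val)

# Tiny illustrative data: 2 features, 3 classes, an all-zero classifier.
train = [([1.0, 0.0], 0), ([0.0, 1.0], 1), ([1.0, 1.0], 2)]
val = [([0.5, 0.5], 0), ([1.0, -1.0], 1)]
W = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
print(lower_objective([0.0, 0.0, 0.0], W, train), upper_objective(W, val))
```

Driving a component of $\mathbf{x}$ towards large negative values pushes its sigmoid weight towards zero, which removes the corresponding training sample from the LL objective; the UL objective depends on $\mathbf{x}$ only through the learned classifier.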

Overall performance. Table VI shows the accuracy, F1 score and computation time on three different datasets. For each dataset, we randomly select 5000 samples as the training set $\mathcal{D}_{\mathtt{tr}}$ , 5000 samples as the validation set $\mathcal{D}_{\mathtt{val}}$ , and 10000 samples as the test set $\mathcal{D}_{\mathtt{test}}$ . After that, we contaminate half of the labels in $\mathcal{D}_{\mathtt{tr}}$ . From the result, BVFSM achieves the most competitive performance on all datasets. Furthermore, BVFSM is faster than EGBMs and IGBMs, and this advantage is more evident on CIFAR10 with larger LL dimension, consistent with the complexity analysis in Section 4.2. The UL objective value and F1 score during iterations on FashionMNIST are also plotted in Figure 9.

As for the performance of BLO with constraints, Table VI shows that BVFSM with constraints (denoted as BVFSM-C in the table) attains a higher F1 score on all three datasets, but slightly lower accuracy on FashionMNIST and CIFAR10, than BVFSM using the sigmoid function without explicit constraints. This is because, for BVFSM without constraints, composing the sigmoid function into the LL objective shrinks the gradient with respect to $\mathbf{x}$ , and the resulting small change rate of the UL variable $\mathbf{x}$ slows its convergence. Accuracy mainly reflects the convergence of the LL variable $\mathbf{y}$ , while the F1 score mainly reflects the convergence of the UL variable $\mathbf{x}$ .

Evaluations on the auxiliary functions, robustness, and network structures. Figure 10 compares the performance of different auxiliary functions. Consistent with the numerical experiment in Figure 7, the barrier function is more stable and requires less fine-tuning of parameters. Table VII compares the robustness under various data contamination rates. Figure 11 further shows the impact of network depth and width. For the LL variable $\mathbf{y}$ , we use fully connected networks with various numbers of layers and widths. It is worth noting that the computational burden is, overall, not very sensitive to the network width but very sensitive to the network depth. As the networks become deeper, other methods collapse to varying degrees because they occupy too much memory, while BVFSM always keeps the computation stable. Since there is no need to retain the LL iteration trajectory, our storage burden is much lower than that of EGBMs (RHG and BDA). Because BVFSM does not need to calculate Jacobian- and Hessian-vector products (which requires saving an additional computational graph in AD), our burden is also significantly lower than that of IGBMs (CG and Neumann).

TABLE VIII: Computation time (S) in each epoch for data hyper-cleaning in VGG series networks with different numbers of convolution layers (Conv.), batch sizes (B) and iteration numbers (K). N/A means exceeding the memory limit. Note that a smaller batch size may take more time because the batch switching time increases. BVFSM maintains the least burden and highest speed, especially with large-scale LL in real-world networks.

Conv.	(B,K)	RHG	CG	Neumann	BVFSM
2	(1,7)	7515	4730	3225	2252
2	(128,20)	N/A	N/A	415.4	60.81
13	(128,100)	N/A	N/A	472.9	171.9
13	(512,100)	N/A	N/A	N/A	121.8

Computational efficiency for large-scale networks. Next, we verify our computational burden on large-scale networks closer to real applications, such as VGG16 on the CIFAR10 dataset. Because VGG16 is too computationally expensive for existing methods, we change the experimental settings as follows to make the comparison feasible. For each dataset, we randomly select 4096 samples as the training set $\mathcal{D}_{\mathtt{tr}}$ , 4096 samples as the validation set $\mathcal{D}_{\mathtt{val}}$ , and 512 samples as the test set $\mathcal{D}_{\mathtt{test}}$ . Because the original network is too computationally intensive for EGBMs, we perform an additional experiment with a sufficiently small batch size and iteration number $K$ . We also simplify the convolution layers from the 13 layers in VGG16 to only the first two layers, and retain the last 3 linear layers. As shown in Table VIII, BVFSM always has the highest speed under various settings, and still works well with a large $K$ and batch size.

Additionally, we visualize how BVFSM scales to large networks by expanding the width of a 5-layer network, and compare the computational efficiency as the number of Multiply–Accumulate Operations (MACs) and the number of parameters increase. For networks of the same size, a fully-connected layer typically has more parameters, while a convolutional layer has more MACs, so we use fully-connected and convolutional layers to simulate the scale-up of parameters and MACs, respectively. From Figure 12, other methods are computationally inefficient and can only handle small-scale networks, while BVFSM, with much higher efficiency, is applicable to the larger-scale networks used in frontier tasks. Moreover, considering the effect of the number of layers on efficiency shown in Figure 11, we also run BVFSM on a more challenging 50-layer network to further demonstrate its efficiency. Specifically, existing methods usually cannot work at the scale of MobileNet (around 1 GMACs), while BVFSM remains applicable at the scale of StyleGAN (around 100 GMACs).

TABLE IX: Averaged accuracy using various methods (including model-based methods and gradient-based BLO methods) for the few-shot classification task (1- and 5-shot, i.e., $M=1,5$ , and $N=5,20,30,40$ ) on Omniglot.

Method	5-way		20-way		30-way		40-way
Method	1-shot	5-shot	1-shot	5-shot	1-shot	5-shot	1-shot	5-shot
MAML	98.70 $\pm$ 0.40	99.91 $\pm$ 0.10	95.80 $\pm$ 0.30	98.90 $\pm$ 0.20	86.86 $\pm$ 0.49	96.86 $\pm$ 0.19	85.98 $\pm$ 0.45	94.46 $\pm$ 0.43
Meta-SGD	97.97 $\pm$ 0.70	98.96 $\pm$ 0.20	93.98 $\pm$ 0.43	98.42 $\pm$ 0.11	89.91 $\pm$ 0.04	96.21 $\pm$ 0.15	87.39 $\pm$ 0.43	95.10 $\pm$ 0.15
Reptile	97.68 $\pm$ 0.04	99.48 $\pm$ 0.06	89.43 $\pm$ 0.14	97.12 $\pm$ 0.32	85.40 $\pm$ 0.30	95.28 $\pm$ 0.30	82.50 $\pm$ 0.30	92.79 $\pm$ 0.33
iMAML	99.16 $\pm$ 0.35	99.67 $\pm$ 0.12	94.46 $\pm$ 0.42	98.69 $\pm$ 0.10	89.52 $\pm$ 0.20	96.51 $\pm$ 0.08	87.28 $\pm$ 0.21	95.27 $\pm$ 0.08
RHG	98.64 $\pm$ 0.21	99.58 $\pm$ 0.12	96.13 $\pm$ 0.20	99.09 $\pm$ 0.08	93.92 $\pm$ 0.18	98.43 $\pm$ 0.08	90.78 $\pm$ 0.20	96.79 $\pm$ 0.10
TRHG	98.74 $\pm$ 0.21	99.71 $\pm$ 0.07	95.82 $\pm$ 0.20	98.95 $\pm$ 0.07	94.02 $\pm$ 0.18	98.39 $\pm$ 0.07	90.73 $\pm$ 0.20	96.79 $\pm$ 0.10
BDA	99.04 $\pm$ 0.18	99.74 $\pm$ 0.05	96.50 $\pm$ 0.16	99.19 $\pm$ 0.07	94.37 $\pm$ 0.18	98.53 $\pm$ 0.07	92.49 $\pm$ 0.18	97.12 $\pm$ 0.09
BVFSM	98.85 $\pm$ 0.12	99.21 $\pm$ 0.18	96.73 $\pm$ 0.30	98.95 $\pm$ 0.20	94.65 $\pm$ 0.20	98.56 $\pm$ 0.17	92.73 $\pm$ 0.12	97.61 $\pm$ 0.47

5.3 Few-shot Learning

We then conduct experiments on the few-shot learning task. Few-shot learning is one of the most popular applications of meta-learning, whose goal is to learn an algorithm that can also handle new tasks well. Specifically, each task is an $N$-way classification problem, and the aim is to learn the hyper-parameter $\mathbf{x}$ so that each task can be solved with only $M$ training samples (i.e., $N$-way $M$-shot). Similar to [8, 29, 30], we model the network in two parts: a four-layer convolutional network $\mathbf{x}$ as a feature extractor shared among tasks, and a logistic regression layer $\mathbf{y}=\{\mathbf{y}^{i}\}$ as a separate classifier for each task. We also set the dataset as $\mathcal{D}=\{\mathcal{D}^{i}\}$, where $\mathcal{D}^{i}=\mathcal{D}^{i}_{\mathtt{tr}}\cup\mathcal{D}^{i}_{\mathtt{val}}$ for the $i$-th task. Taking the loss of the $i$-th task in the LL problem to be the cross-entropy $g(\mathbf{x},\mathbf{y}^{i};\mathcal{D}^{i}_{\mathtt{tr}})$, the LL objective can be defined as

f(\mathbf{x},\mathbf{y})=\sum_{i}g(\mathbf{x},\mathbf{y}^{i};\mathcal{D}^{i}_{\mathtt{tr}}).

As for the UL objective, we also utilize the cross-entropy function but define it based on $\{\mathcal{D}^{i}_{\mathtt{val}}\}$ as

F(\mathbf{x},\mathbf{y})=\sum_{i}g(\mathbf{x},\mathbf{y}^{i};\mathcal{D}^{i}_{\mathtt{val}}).
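The two objectives above differ only in which split of each task they are evaluated on. The sketch below spells this out in plain Python. It is an illustration under stated assumptions, not the experimental code: the shared feature extractor $\mathbf{x}$ is replaced by a single linear layer with ReLU (a stand-in for the four-layer convolutional network), and the names `extract`, `cross_entropy`, `lower_level`, `upper_level` and the toy data are hypothetical.

```python
import math


def extract(x, sample):
    """Shared feature map: one linear layer + ReLU (stand-in for the conv net)."""
    return [max(0.0, sum(w * s for w, s in zip(row, sample))) for row in x]


def cross_entropy(x, y_i, data):
    """g(x, y^i; D): mean softmax cross-entropy over (sample, label) pairs."""
    total = 0.0
    for sample, label in data:
        feats = extract(x, sample)
        logits = [sum(w * f for w, f in zip(row, feats)) for row in y_i]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[label]
    return total / len(data)


def lower_level(x, ys, tasks):
    """f(x, y) = sum_i g(x, y^i; D^i_tr)."""
    return sum(cross_entropy(x, y_i, t["tr"]) for y_i, t in zip(ys, tasks))


def upper_level(x, ys, tasks):
    """F(x, y) = sum_i g(x, y^i; D^i_val)."""
    return sum(cross_entropy(x, y_i, t["val"]) for y_i, t in zip(ys, tasks))


# Toy instance: 2 tasks, 3-way, 1-shot, 2-dimensional inputs.
x = [[1.0, 0.0], [0.0, 1.0]]
ys = [[[0.0, 0.0] for _ in range(3)] for _ in range(2)]  # zero classifiers
task = {"tr": [([1.0, 0.0], 0), ([0.0, 1.0], 1), ([1.0, 1.0], 2)],
        "val": [([0.5, 0.5], 2)]}
tasks = [task, task]
f_value = lower_level(x, ys, tasks)
F_value = upper_level(x, ys, tasks)
print(f_value, F_value)
```

With all-zero classifiers every class is equally likely, so each task contributes $\log 3$ to both objectives; this gives a quick sanity check before any training.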

Our experiment is performed on the widely used benchmark dataset: Omniglot [49], which contains examples of 1623 handwritten characters from 50 alphabets.

We compare BVFSM with several approaches, namely MAML, Meta-SGD, Reptile, iMAML, RHG, TRHG and BDA [29, 30]. As shown in Table IX, BVFSM performs slightly worse than the best existing methods on the 5-way tasks, because when dealing with small-scale LLC problems, the convergence acceleration provided by the regularization term in BVFSM cannot fully counteract the offset it introduces in the solution. However, for larger-scale LL problems (20-way, 30-way and 40-way), thanks to the regularization term, BVFSM achieves the highest accuracy in every setting except 20-way 5-shot.

5.4 Generative Adversarial Networks

Next we perform intuitive experiments on GAN to illustrate the application of BVFSM to pessimistic BLO. A GAN is a network used for unsupervised machine learning that builds a min-max game between two players, i.e., the generator $\mathtt{Gen}(\mathbf{x};\cdot)$ with network parameter $\mathbf{x}$, and the discriminator $\mathtt{Dis}(\mathbf{y};\cdot)$ with network parameter $\mathbf{y}$. We denote the standard Gaussian distribution as $\mathcal{N}(0,1)$ and the real data distribution as $p_{\mathtt{data}}$. The generator $\mathtt{Gen}$ tries to fool the discriminator $\mathtt{Dis}$ by producing data from a random latent vector $\mathbf{v}\sim\mathcal{N}(0,1)$, while the discriminator $\mathtt{Dis}$ distinguishes between real data $\mathbf{u}\sim p_{\mathtt{data}}$ and generated data $\mathtt{Gen}(\mathbf{x};\mathbf{v})$ by outputting the probability that the samples are real. The goal of GAN is to solve $\min_{\mathbf{x}}\max_{\mathbf{y}}\log(\mathtt{Dis}(\mathbf{y};\mathbf{u}))+\log(1-\mathtt{Dis}(\mathbf{y};\mathtt{Gen}(\mathbf{x};\mathbf{v})))$ [50].

However, this traditional formulation treats $\mathtt{Dis}$ and $\mathtt{Gen}$ as having equal status, and does not characterize the leader-follower relationship in which $\mathtt{Gen}$ first generates data and $\mathtt{Dis}$ then judges the data. This relationship can be modeled as a Stackelberg game and captured through BLO problems. Specifically, from this perspective, generative adversarial learning corresponds to a pessimistic BLO problem: the UL objective $F$ of $\mathtt{Gen}$ tries to generate adversarial samples, and the LL objective $f$ of $\mathtt{Dis}$ aims to learn a robust classifier which can maximize the UL objective. Therefore, we reformulate GAN into the form in Eq. (15) discussed in Section 3.4 to model this relationship, and call it bi-level GAN. Concretely, for the follower $\mathtt{Dis}(\mathbf{y};\cdot)$, the LL objective is consistent with the original GAN:

f(\mathbf{x},\mathbf{y})=\log(\mathtt{Dis}(\mathbf{y};\mathbf{u}))+\log(1-\mathtt{Dis}(\mathbf{y};\mathtt{Gen}(\mathbf{x};\mathbf{v}))).

As for UL, considering the antagonistic goals of $\mathtt{Gen}$ and $\mathtt{Dis}$ , we model the UL problem as

\displaystyle F(\mathbf{x},\mathbf{y})=\log(\mathtt{Dis}(\mathbf{y};\mathtt{Gen}(\mathbf{x};\mathbf{v}))).
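As a small illustration of the two bi-level GAN objectives, the sketch below evaluates $f$ and $F$ from discriminator outputs. It assumes that the discriminator probabilities on a batch of real and generated samples are already available as lists of numbers in $(0,1)$, and that the objectives are estimated by averaging over the batch; the function names are hypothetical.

```python
import math


def ll_objective(dis_real, dis_fake):
    """f(x, y) = log Dis(y; u) + log(1 - Dis(y; Gen(x; v))), batch-averaged."""
    real_term = sum(math.log(p) for p in dis_real) / len(dis_real)
    fake_term = sum(math.log(1.0 - p) for p in dis_fake) / len(dis_fake)
    return real_term + fake_term


def ul_objective(dis_fake):
    """F(x, y) = log Dis(y; Gen(x; v)), batch-averaged."""
    return sum(math.log(p) for p in dis_fake) / len(dis_fake)


# A discriminator that cannot tell real from generated samples outputs 0.5.
dis_real = [0.5, 0.5, 0.5, 0.5]
dis_fake = [0.5, 0.5, 0.5, 0.5]
f_value = ll_objective(dis_real, dis_fake)
F_value = ul_objective(dis_fake)
print(f_value, F_value)
```

For an undecided discriminator the two values are $-2\log 2$ and $-\log 2$, respectively, which is the usual reference point for the vanilla GAN objective.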

Note that the popular WGAN [51] is a variation of the classic vanilla GAN [50] (or simply GAN), while unrolled GAN [11] and the GAN trained by our BVFSM are bi-level GANs, modeled from a BLO perspective. Our method has the following two advantages over other types of GAN. On the one hand, compared with vanilla GAN and WGAN, bi-level GAN can effectively model the leader-follower relationship between the generator and the discriminator, rather than treating them as having the same status. On the other hand, within bi-level GAN, our method handles the situation in which the objective has multiple solutions, from the viewpoint of pessimistic BLO, with a theoretical convergence guarantee, which unrolled GAN cannot achieve.

In this experiment we train a simple GAN architecture on a 2D mixture of 8 Gaussians arranged on a circle. The dataset is sampled from a mixture of 8 Gaussians with standard deviation 0.02, whose means are equally spaced around a circle of radius 2. The generator consists of a fully-connected network with 2 hidden layers of size 128 with ReLU activation, followed by a linear projection to 2 dimensions. The discriminator first scales its input down by a factor of 4 (to roughly scale it to $(-1,1)$), and is followed by a 1-layer fully-connected network with ReLU activation and a linear layer of size 1 that acts as the logit. Figure 13 presents a visual comparison of sample generation among GAN, WGAN, unrolled GAN, and our method. Vanilla GAN captures only one Gaussian at a time rather than all of them, because it ignores the leader-follower structure. WGAN benefits from the improved distance function and uses one distribution to approximate all Gaussians at the same time, but its performance is not satisfactory. Unrolled GAN is able to capture all Gaussians simultaneously thanks to the leader-follower modeling by BLO, but it misses finer details of the distribution. In contrast, the proper treatment of non-convex problems by BVFSM enables it to fit all Gaussians well, including their details. In addition, we report the KL divergence between the generated and target distributions in Table X. The traditional alternately optimized GAN and WGAN yield larger KL divergence, while unrolled GAN and our method, which treat GAN as a BLO model, produce smaller KL divergence, and our method achieves the best result.
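The toy dataset described above can be generated with a few lines of code. The following is a minimal sketch, assuming equal mixing weights for the 8 modes and an isotropic covariance; the function name `sample_ring_mixture` and the seed are illustrative choices rather than details given in the text.

```python
import math
import random


def sample_ring_mixture(num_samples, num_modes=8, radius=2.0, std=0.02, seed=0):
    """Sample 2D points from Gaussians whose means are equally spaced on a circle."""
    rng = random.Random(seed)
    points = []
    for _ in range(num_samples):
        k = rng.randrange(num_modes)  # pick a mode uniformly (assumed weights)
        angle = 2.0 * math.pi * k / num_modes
        cx, cy = radius * math.cos(angle), radius * math.sin(angle)
        points.append((rng.gauss(cx, std), rng.gauss(cy, std)))
    return points


samples = sample_ring_mixture(1000)
print(samples[:3])
```

Because the standard deviation (0.02) is tiny compared with the spacing between neighbouring means, the modes are well separated, which is what makes mode collapse easy to see in this benchmark.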

Figure 14 further validates that BVFSM scales to large GANs on real datasets. Specifically, we add BVFSM as a training strategy on top of StyleGAN2 [52] on the AFHQ dataset. It can be seen that our approach improves the generation quality as well as the Inception Score (IS) and Fréchet Inception Distance (FID) metrics.

TABLE X: KL divergence by various GAN. The divergence between the generated and target distribution by BVFSM is the smallest.

	GAN	WGAN	Unrolled GAN	BVFSM
KL Divergence	2.56	2.48	0.26	0.15

6 Conclusions

In this paper, we propose a novel bi-level algorithm BVFSM to provide an accessible path for large-scale problems with high dimensions from complex real-world tasks. With the help of value-function which breaks the traditional mindset in gradient-based methods, BVFSM can remove the LLC condition required by earlier works, and improve the efficiency of gradient-based methods, to overcome the bottleneck caused by high-dimensional non-convex LL problems. By transforming the regularized LL problem into UL objective by the value-function-based sequential minimization method, we obtain a sequence of single-level unconstrained differentiable problems to approximate the original problem. We prove the asymptotic convergence without LLC, and present our numerical superiority through complexity analysis and numerical evaluations for a variety of applications. We also extend our method to BLO problems with constraints, and pessimistic BLO problems.

Acknowledgments

This work is partially supported by the National Natural Science Foundation of China (Nos. U22B2052, 61922019, 12222106), the National Key R&D Program of China (2020YFB1313503, 2022YFA1004101), Shenzhen Science and Technology Program (No. RCYX20200714114700072), the Guangdong Basic and Applied Basic Research Foundation (No. 2022B1515020082), and Pacific Institute for the Mathematical Sciences (PIMS).

References

[1] R. Liu, X. Liu, X. Yuan, S. Zeng, and J. Zhang, “A value-function-based interior-point method for non-convex bi-level optimization,” in ICML, 2021.
[2] L. Franceschi, M. Donini, P. Frasconi, and M. Pontil, “Forward and reverse gradient-based hyperparameter optimization,” in ICML, 2017.
[3] T. Okuno, A. Takeda, A. Kawana, and M. Watanabe, “On $\ell_p$-hyperparameter learning via bilevel nonsmooth optimization,” JMLR, vol. 22, no. 245, pp. 1–47, 2021.
[4] M. Mackay, P. Vicol, J. Lorraine, D. Duvenaud, and R. Grosse, “Self-tuning networks: Bilevel optimization of hyperparameters using structured best-response functions,” in ICLR, 2018.
[5] H. Liu, K. Simonyan, and Y. Yang, “DARTS: differentiable architecture search,” in ICLR, 2019.
[6] H. Liang, S. Zhang, J. Sun, X. He, W. Huang, K. Zhuang, and Z. Li, “Darts+: Improved differentiable architecture search with early stopping,” arXiv preprint arXiv:1909.06035, 2019.
[7] X. Chen, L. Xie, J. Wu, and Q. Tian, “Progressive differentiable architecture search: Bridging the depth gap between search and evaluation,” in ICCV, 2019.
[8] L. Franceschi, P. Frasconi, S. Salzo, R. Grazzi, and M. Pontil, “Bilevel programming for hyperparameter optimization and meta-learning,” in ICML, 2018.
[9] A. Rajeswaran, C. Finn, S. M. Kakade, and S. Levine, “Meta-learning with implicit gradients,” in NeurIPS, 2019.
[10] D. Zügner and S. Günnemann, “Adversarial attacks on graph neural networks via meta learning,” in ICLR, 2019.
[11] L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein, “Unrolled generative adversarial networks,” arXiv preprint arXiv:1611.02163, 2016.
[12] D. Pfau and O. Vinyals, “Connecting generative adversarial networks and actor-critic methods,” arXiv preprint arXiv:1610.01945, 2016.
[13] Z. Yang, Y. Chen, M. Hong, and Z. Wang, “Provably global convergence of actor-critic: A case for linear quadratic regulator with ergodic cost,” in NeurIPS, 2019.
[14] R. Liu, S. Cheng, Y. He, X. Fan, Z. Lin, and Z. Luo, “On the convergence of learning-based iterative methods for nonconvex inverse problems,” IEEE TPAMI, 2019.
[15] R. Liu, Z. Li, Y. Zhang, X. Fan, and Z. Luo, “Bi-level probabilistic feature learning for deformable image registration,” in IJCAI, 2020.
[16] R. Liu, J. Liu, Z. Jiang, X. Fan, and Z. Luo, “A bilevel integrated model with data-driven layer ensemble for multi-modality image fusion,” IEEE TIP, 2020.
[17] R. Liu, P. Mu, J. Chen, X. Fan, and Z. Luo, “Investigating task-driven latent feasibility for nonconvex image modeling,” IEEE TIP, 2020.
[18] S. Dempe, N. Gadhi, and L. Lafhim, “Optimality conditions for pessimistic bilevel problems using convexificator,” Positivity, pp. 1–19, 2020.
[19] S. Dempe, Bilevel optimization: theory, algorithms and applications. TU Bergakademie Freiberg, Fakultät für Mathematik und Informatik, 2018.
[20] R. Liu, J. Gao, J. Zhang, D. Meng, and Z. Lin, “Investigating bi-level optimization for learning and vision from a unified perspective: A survey and beyond,” IEEE TPAMI, vol. 44, no. 12, pp. 10045–10067, 2021.
[21] M. J. Alves, C. H. Antunes, and J. P. Costa, “New concepts and an algorithm for multiobjective bilevel programming: optimistic, pessimistic and moderate solutions,” Operational Research, pp. 1–34, 2019.
[22] R. G. Jeroslow, “The polynomial hierarchy and a simple model for competitive analysis,” Mathematical programming, 1985.
[23] J. F. Bard and J. E. Falk, “An explicit solution to the multi-level programming problem,” Computers & Operations Research, vol. 9, no. 1, pp. 77–100, 1982.
[24] Z.-Q. Luo, J.-S. Pang, and D. Ralph, Mathematical programs with equilibrium constraints. Cambridge University Press, 1996.
[25] D. Maclaurin, D. Duvenaud, and R. P. Adams, “Gradient-based hyperparameter optimization through reversible learning,” in ICML, ser. JMLR Workshop and Conference Proceedings, 2015.
[26] A. Shaban, C. Cheng, N. Hatch, and B. Boots, “Truncated back-propagation for bilevel optimization,” in AISTATS, 2019.
[27] F. Pedregosa, “Hyperparameter optimization with approximate gradient,” in ICML, 2016.
[28] J. Lorraine, P. Vicol, and D. Duvenaud, “Optimizing millions of hyperparameters by implicit differentiation,” in AISTATS, 2020.
[29] R. Liu, P. Mu, X. Yuan, S. Zeng, and J. Zhang, “A generic first-order algorithmic framework for bi-level programming beyond lower-level singleton,” in ICML, 2020.
[30] ——, “A general descent aggregation framework for gradient-based bi-level optimization,” IEEE TPAMI, vol. 45, no. 1, pp. 38–57, 2022.
[31] J. V. Outrata, “On the numerical solution of a class of stackelberg problems,” ZOR-Methods and Models of Operations Research, 1990.
[32] J. J. Ye and D. L. Zhu, “Optimality conditions for bilevel programming problems,” Optimization, 1995.
[33] J. Bergstra and Y. Bengio, “Random search for hyper-parameter optimization,” JMLR, vol. 13, pp. 281–305, 2012.
[34] F. Hutter, H. H. Hoos, and K. Leyton-Brown, “Sequential model-based optimization for general algorithm configuration,” in International conference on learning and intelligent optimization, 2011.
[35] A. V. Fiacco and G. P. McCormick, Nonlinear programming: sequential unconstrained minimization techniques. SIAM, 1990.
[36] R. Grazzi, L. Franceschi, M. Pontil, and S. Salzo, “On the iteration complexity of hypergradient computation,” in ICML, 2020.
[37] L. S. Lasdon, “An efficient algorithm for minimizing barrier and penalty functions,” Mathematical Programming, vol. 2, no. 1, pp. 65–106, 1972.
[38] C. L. Byrne, “Alternating minimization as sequential unconstrained minimization: a survey,” Journal of Optimization Theory and Applications, vol. 156, no. 3, pp. 554–566, 2013.
[39] R. M. Freund, “Penalty and barrier methods for constrained optimization,” Lecture Notes, Massachusetts Institute of Technology, 2004.
[40] D. G. Luenberger, Y. Ye et al., Linear and nonlinear programming. Springer, 1984, vol. 2.
[41] D. Boukari and A. Fiacco, “Survey of penalty, exact-penalty and multiplier methods from 1968 to 1993,” Optimization, vol. 32, no. 4, pp. 301–334, 1995.
[42] A. Auslender, “Penalty and barrier methods: a unified framework,” SIAM Journal on Optimization, vol. 10, no. 1, pp. 211–230, 1999.
[43] J. F. Bonnans and A. Shapiro, Perturbation analysis of optimization problems. Springer Science & Business Media, 2013.
[44] B. S. Mordukhovich and N. M. Nam, “An easy path to convex analysis and applications,” Synthesis Lectures on Mathematics and Statistics, vol. 6, no. 2, pp. 1–218, 2013.
[45] P. Borges, C. Sagastizábal, and M. Solodov, “A regularized smoothing method for fully parameterized convex problems with applications to convex and nonconvex two-stage stochastic programming,” Mathematical Programming, 2020.
[46] R. Liu, Y. Liu, W. Yao, S. Zeng, and J. Zhang, “Averaged method of multipliers for bi-level optimization without lower-level strong convexity,” arXiv preprint arXiv:2302.03407, 2023.
[47] K. Ji and Y. Liang, “Lower bounds and accelerated algorithms for bilevel optimization,” JMLR, vol. 23, pp. 1–56, 2022.
[48] E. Grefenstette, B. Amos, D. Yarats, P. M. Htut, A. Molchanov, F. Meier, D. Kiela, K. Cho, and S. Chintala, “Generalized inner loop meta-learning,” arXiv preprint arXiv:1910.01727, 2019.
[49] B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum, “Human-level concept learning through probabilistic program induction,” Science, 2015.
[50] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” Advances in neural information processing systems, vol. 27, 2014.
[51] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein gan,” 2017.
[52] T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative adversarial networks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 4401–4410.

Appendix A Proofs of Lemmas in Section 4.1

A.1 Lemma 1

(1) $P_{k}(\omega)$ is continuous, differentiable and non-decreasing, and satisfies $P_{k}(\omega)\geq 0$.
(2) For any $\omega\leq 0$, $\lim_{k\rightarrow\infty}P_{k}(\omega)=0$.
(3) For any sequence $\{\omega_{k}\}$, $\lim_{k\rightarrow\infty}P_{k}(\omega_{k})<+\infty$ implies that $\limsup_{k\rightarrow\infty}\omega_{k}\leq 0$.

Proof.

From the definitions of penalty and barrier functions (see, e.g., Definition 1), the statement (1) follows immediately.

When $\rho$ is a penalty function, $\lim_{\sigma\rightarrow 0}\rho(\omega;\sigma)$ is equal to $+\infty$ for $\omega>0$ and $0$ for $\omega\leq 0$ . Hence, as $k\rightarrow\infty$ , $\sigma_{k}\rightarrow 0$ , and we have $P_{k}(\omega)\rightarrow 0$ , for $\omega\leq 0$ . For any sequence $\{\omega_{k}\}$ , if $\limsup_{k\rightarrow\infty}\omega_{k}>0$ , there exists a subsequence $\{\omega_{t}\}$ of $\{\omega_{k}\}$ and $\varepsilon>0$ such that $\omega_{t}\geq\varepsilon$ for all $t$ . Then, it follows from the monotonicity of $\rho$ that $\lim_{t\rightarrow\infty}P_{t}(\omega_{t})=\lim_{t\rightarrow\infty}\rho(\omega_{t};\sigma_{t})\geq\lim_{t\rightarrow\infty}\rho(\varepsilon;\sigma_{t})=+\infty$ . Thus, $\lim_{k\rightarrow\infty}P_{k}(\omega_{k})<+\infty$ implies that $\limsup_{k\rightarrow\infty}\omega_{k}\leq 0$ .

If $\rho$ is a modified barrier function, since $\rho$ is non-decreasing, we have $0\leq\rho(\omega-\sigma^{(2)}_{k};\sigma_{k}^{(1)})\leq\rho(-\sigma^{(2)}_{k};\sigma_{k}^{(1)})$ when $\omega\leq 0$. The assumption $\lim_{k\rightarrow\infty}\rho(-\sigma^{(2)}_{k};\sigma_{k}^{(1)})=0$ implies $\rho(\omega-\sigma^{(2)}_{k};\sigma_{k}^{(1)})\rightarrow 0$ when $\omega\leq 0$. Hence, as $k\rightarrow\infty$, we have $\sigma_{k}^{(1)}\rightarrow 0$, $\sigma^{(2)}_{k}\rightarrow 0$, and $P_{k}(\omega)\rightarrow 0$ for $\omega\leq 0$. For any sequence $\{\omega_{k}\}$, if $\lim_{k\rightarrow\infty}P_{k}(\omega_{k})<+\infty$, then it follows from the definition of the modified barrier function that $\omega_{k}\leq\sigma^{(2)}_{k}$, and (3) follows immediately from $\sigma^{(2)}_{k}\rightarrow 0$.

∎
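The three properties in Lemma 1 can be observed numerically on a concrete penalty. The sketch below uses the quadratic penalty $\rho(\omega;\sigma)=\max(0,\omega)^{2}/(2\sigma)$, which is a standard textbook choice and is assumed here to be an admissible instance of the penalty functions in Definition 1; the sequence $\sigma_{k}=10^{-k}$ is likewise an illustrative choice.

```python
def quadratic_penalty(omega, sigma):
    """Quadratic penalty: zero on omega <= 0, growing like 1/sigma on omega > 0."""
    return max(0.0, omega) ** 2 / (2.0 * sigma)


sigmas = [10.0 ** (-k) for k in range(1, 7)]  # sigma_k -> 0

# Property (2): for a feasible point (omega <= 0) the penalty tends to (here, is) zero.
feasible = [quadratic_penalty(-0.5, s) for s in sigmas]

# Property (3), contrapositive: a fixed violation omega = 0.1 > 0 is penalized
# without bound as sigma_k -> 0.
violated = [quadratic_penalty(0.1, s) for s in sigmas]

print(feasible)
print(violated)
```

The second list grows by a factor of 10 per step, so a bounded value of $P_{k}(\omega_{k})$ is only possible if the violation $\omega_{k}$ itself shrinks to zero, which is the content of statement (3).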

A.2 Lemma 2

\limsup_{k\rightarrow\infty}f_{k}^{*}(\mathbf{x}_{k})\leq f^{*}(\bar{\mathbf{x}}).

Proof.

Given any $\epsilon>0$, there exists $\bar{\mathbf{y}}\in\mathbb{R}^{n}$ such that $f(\bar{\mathbf{x}},\bar{\mathbf{y}})<f^{*}(\bar{\mathbf{x}})+\epsilon/2$, and $h(\bar{\mathbf{x}},\bar{\mathbf{y}})\leq 0$. If $h(\bar{\mathbf{x}},\bar{\mathbf{y}})=0$, by Assumption 1.(4), the minimum of $h$ w.r.t. $\mathbf{y}$ in any neighbourhood of $\bar{\mathbf{y}}$ is smaller than 0, so we can find a $\hat{\mathbf{y}}$ close enough to $\bar{\mathbf{y}}$ such that $f(\bar{\mathbf{x}},\hat{\mathbf{y}})<f^{*}(\bar{\mathbf{x}})+\epsilon$, and $h(\bar{\mathbf{x}},\hat{\mathbf{y}})\leq-\delta$ for some $\delta>0$. If $h(\bar{\mathbf{x}},\bar{\mathbf{y}})<0$, such $\hat{\mathbf{y}}$ exists obviously.

As $\{\mathbf{x}_{k}\}$ converges to $\bar{\mathbf{x}}$ , $h(\bar{\mathbf{x}},\hat{\mathbf{y}})\leq-\delta$ combining with the continuity of $h(\mathbf{x},\mathbf{y})$ implies the existence of $K_{1}>0$ such that $h(\mathbf{x}_{k},\hat{\mathbf{y}})\leq-\delta/2$ for all $k\geq K_{1}$ . Since the barrier function is non-decreasing, it follows that $P_{B,k}(h(\mathbf{x}_{k},\hat{\mathbf{y}}))\leq\rho(-\delta/2;\sigma_{k}^{(1)})$ . Then $\lim_{k\rightarrow\infty}\rho(-\delta/2;\sigma_{k}^{(1)})=0$ yields

\lim_{k\rightarrow\infty}P_{B,k}\!\left(h(\mathbf{x}_{k},\hat{\mathbf{y}})\right)=0.

Next, as $\{\mathbf{x}_{k}\}$ converges to $\bar{\mathbf{x}}$ , it follows from the continuity of $f(\mathbf{x},\mathbf{y})$ and $\mu_{k}\rightarrow 0$ that there exists $K_{2}\geq K_{1}$ , such that for any $k\geq K_{2}$ ,

\begin{aligned}
f_{k}^{*}(\mathbf{x}_{k})&\leq f(\mathbf{x}_{k},\hat{\mathbf{y}})+P_{B,k}\!\left(h(\mathbf{x}_{k},\hat{\mathbf{y}})\right)+\frac{\mu_{k}}{2}\|\hat{\mathbf{y}}\|^{2}\\
&\leq f(\bar{\mathbf{x}},\hat{\mathbf{y}})+\epsilon\\
&\leq f^{*}(\bar{\mathbf{x}})+2\epsilon.
\end{aligned}

By letting $k\rightarrow\infty$ , we obtain

\limsup_{k\rightarrow\infty}f_{k}^{*}(\mathbf{x}_{k})\leq f^{*}(\bar{\mathbf{x}})+2\epsilon,

and letting $\epsilon\rightarrow 0$ yields the conclusion. ∎

A.3 Lemma 3

\liminf\limits_{k\rightarrow\infty}\varphi_{k}(\mathbf{x}_{k})\geq\varphi(\bar{\mathbf{x}}).

Proof.

We assume by contradiction that there exists $\bar{\mathbf{x}}\in\mathcal{X}$ and a sequence $\{\mathbf{x}_{k}\}$ , satisfying $\mathbf{x}_{k}\to\bar{\mathbf{x}}$ as $k\to\infty$ with the following inequality

\lim_{k\rightarrow\infty}\varphi_{k}(\mathbf{x}_{k})<\varphi(\bar{\mathbf{x}}).

Then, there exist $\epsilon>0$ and a sequence $\{\mathbf{y}_{k}\}$ satisfying

\begin{aligned}
&F(\mathbf{x}_{k},\mathbf{y}_{k})+P_{H,k}\!\left(H(\mathbf{x}_{k},\mathbf{y}_{k})\right)+P_{h,k}\!\left(h(\mathbf{x}_{k},\mathbf{y}_{k})\right)\\
&\quad+P_{f,k}\!\left(f(\mathbf{x}_{k},\mathbf{y}_{k})-f_{k}^{*}(\mathbf{x}_{k})\right)+\frac{\theta_{k}}{2}\|\mathbf{y}_{k}\|^{2}<\varphi(\bar{\mathbf{x}})-\epsilon.
\end{aligned}
\tag{19}

Since $F(\mathbf{x},\mathbf{y})$ is level-bounded in $\mathbf{y}$ locally uniformly in $\bar{\mathbf{x}}$, the sequence $\{\mathbf{y}_{k}\}$ is bounded. Take a subsequence $\{\mathbf{y}_{t}\}$ of $\{\mathbf{y}_{k}\}$ that converges to some $\hat{\mathbf{y}}$, i.e., $\mathbf{y}_{t}\rightarrow\hat{\mathbf{y}}$.

The inequality Eq. (19) yields that

\displaystyle P_{f,t}\!\left(f(\mathbf{x}_{t},\mathbf{y}_{t})-f_{t}^{*}(\mathbf{x}_{t})\right)<\varphi(\bar{\mathbf{x}})-\epsilon-F(\mathbf{x}_{t},\mathbf{y}_{t}).

Taking $t\rightarrow\infty$, we obtain $\lim_{t\rightarrow\infty}P_{f,t}\!\left(f(\mathbf{x}_{t},\mathbf{y}_{t})-f_{t}^{*}(\mathbf{x}_{t})\right)<+\infty$. From Lemma 1, we have $\limsup_{t\rightarrow\infty}\left\{f(\mathbf{x}_{t},\mathbf{y}_{t})-f_{t}^{*}(\mathbf{x}_{t})\right\}\leq 0$, and hence by the continuity of $f$,

\lim_{t\rightarrow\infty}f(\mathbf{x}_{t},\mathbf{y}_{t})\leq\liminf_{t\rightarrow\infty}f_{t}^{*}(\mathbf{x}_{t}).

Then, by the continuity of $f$ and Lemma 2, we have

f(\bar{\mathbf{x}},\hat{\mathbf{y}})=\lim_{t\rightarrow\infty}f(\mathbf{x}_{t},\mathbf{y}_{t})\leq\limsup_{t\rightarrow\infty}f_{t}^{*}(\mathbf{x}_{t})\leq f^{*}(\bar{\mathbf{x}}).

By using similar arguments and the continuity of $h$ and $H$ , one can show $h(\bar{\mathbf{x}},\hat{\mathbf{y}})\leq 0$ and $H(\bar{\mathbf{x}},\hat{\mathbf{y}})\leq 0.$ Thus, we have $\hat{\mathbf{y}}\in S(\bar{\mathbf{x}}),$ and $\hat{\mathbf{y}}$ is a feasible point to problem Eq. (14) with $\mathbf{x}=\bar{\mathbf{x}}$ . Then Eq. (19) yields

\varphi(\bar{\mathbf{x}})\leq F(\bar{\mathbf{x}},\hat{\mathbf{y}})\leq\limsup_{k\rightarrow\infty}F(\mathbf{x}_{k},\mathbf{y}_{k})\leq\varphi(\bar{\mathbf{x}})-\epsilon,

which implies a contradiction. Thus we get the conclusion. ∎

A.4 Lemma 4

\limsup_{k\rightarrow\infty}\varphi_{k}(\mathbf{x})\leq\varphi(\mathbf{x}).

Proof.

Given any $\bar{\mathbf{x}}\in\mathcal{X}$ , for any $\epsilon>0$ , there exists $\bar{\mathbf{y}}\in\mathbb{R}^{n}$ satisfying $f(\bar{\mathbf{x}},\bar{\mathbf{y}})\leq f^{*}(\bar{\mathbf{x}})$ , $h(\bar{\mathbf{x}},\bar{\mathbf{y}})\leq 0$ , $H(\bar{\mathbf{x}},\bar{\mathbf{y}})\leq 0$ , and $F(\bar{\mathbf{x}},\bar{\mathbf{y}})\leq\varphi(\bar{\mathbf{x}})+\epsilon$ .

By the definition of $\varphi_{k}$ , we have

\begin{aligned}
\varphi_{k}(\bar{\mathbf{x}})\leq{}&F(\bar{\mathbf{x}},\bar{\mathbf{y}})+P_{H,k}\!\left(H(\bar{\mathbf{x}},\bar{\mathbf{y}})\right)+P_{h,k}\!\left(h(\bar{\mathbf{x}},\bar{\mathbf{y}})\right)\\
&+P_{f,k}\!\left(f(\bar{\mathbf{x}},\bar{\mathbf{y}})-f_{k}^{*}(\bar{\mathbf{x}})\right)+\frac{\theta_{k}}{2}\|\bar{\mathbf{y}}\|^{2}.
\end{aligned}
\tag{20}

From Lemma 1, as $k\rightarrow\infty$, we have $P_{H,k}\!\left(H(\bar{\mathbf{x}},\bar{\mathbf{y}})\right)\rightarrow 0$ and $P_{h,k}\!\left(h(\bar{\mathbf{x}},\bar{\mathbf{y}})\right)\rightarrow 0$. Since we choose the standard barrier function for $P_{B,k}$ in the definition of $f^{*}_{k}(\mathbf{x})$, the quantity $f(\mathbf{x},\mathbf{y})+P_{B,k}\!\left(h(\mathbf{x},\mathbf{y})\right)+\frac{\mu_{k}}{2}\|\mathbf{y}\|^{2}$ is never smaller than $f(\mathbf{x},\mathbf{y})$ for any $\mathbf{x}$ and $\mathbf{y}$ feasible to the LL problem, and hence $f_{k}^{*}(\bar{\mathbf{x}})\geq f^{*}(\bar{\mathbf{x}})$. Then we have $f(\bar{\mathbf{x}},\bar{\mathbf{y}})-f_{k}^{*}(\bar{\mathbf{x}})\leq f(\bar{\mathbf{x}},\bar{\mathbf{y}})-f^{*}(\bar{\mathbf{x}})\leq 0$. It then follows from the monotonicity of $P_{f,k}$ and Lemma 1 that $P_{f,k}\!\left(f(\bar{\mathbf{x}},\bar{\mathbf{y}})-f_{k}^{*}(\bar{\mathbf{x}})\right)\rightarrow 0$.

Therefore, as $\theta_{k}\rightarrow 0$ , by taking $k\rightarrow\infty$ in inequality Eq. (20), we have

\limsup_{k\rightarrow\infty}\varphi_{k}(\bar{\mathbf{x}})\leq\varphi(\bar{\mathbf{x}})+\epsilon.

Then, we get the conclusion by letting $\epsilon\rightarrow 0$ . ∎

Appendix B Closed-form Solution in Section 5.1

Here we provide the detailed derivation of the closed-form solutions for the numerical examples in Section 5.1.

B.1 Optimistic BLO in Section 5.1.1

We consider the following optimistic BLO:

\begin{aligned}
&\min_{\mathbf{x}\in\mathbb{R},\mathbf{y}\in\mathbb{R}^{n}}\|\mathbf{x}-a\|^{2}+\|\mathbf{y}-a-\mathbf{c}\|^{2}\\
&\text{ s.t. }\;[\mathbf{y}]_{i}\in\underset{[\mathbf{y}]_{i}\in\mathbb{R}}{\mathrm{argmin}}\;\sin(\mathbf{x}+[\mathbf{y}]_{i}-[\mathbf{c}]_{i}),\forall\ i,
\end{aligned}

where $[\mathbf{y}]_{i}$ denotes the $i$-th component of $\mathbf{y}$, while $a\in\mathbb{R}$ and $\mathbf{c}\in\mathbb{R}^{n}$ are given parameters. The optimal solution is

\mathbf{x}^{*}=\frac{(1-n)a+nC}{1+n},\ \text{ and }\ [\mathbf{y}^{*}]_{i}=C+[\mathbf{c}]_{i}-\mathbf{x}^{*},\forall\ i,

where

C=\underset{{k}}{\operatorname{argmin}}\left\{\|C_{k}-2a\|:C_{k}=-\frac{\pi}{2}+2k\pi,k\in\mathbb{Z}\right\},

and the optimal value is $F^{*}=\frac{n(C-2a)^{2}}{1+n}$ . Derivation of the optimal solution and optimal value is as follows.

From the LL problem

[\mathbf{y}]_{i}\in\underset{[\mathbf{y}]_{i}\in\mathbb{R}}{\mathrm{argmin}}\;\sin(\mathbf{x}+[\mathbf{y}]_{i}-[\mathbf{c}]_{i}),\forall\ i,

we have $[\mathbf{y}]_{i}\in\left\{-\mathbf{x}+[\mathbf{c}]_{i}-\frac{\pi}{2}+2k\pi:k\in\mathbb{Z}\right\},\forall\ i$ . Then the problem is to find the $\mathbf{x}$ and $k$ to minimize

\begin{aligned}
F&=\|\mathbf{x}-a\|^{2}+\|\mathbf{y}-a-\mathbf{c}\|^{2}\\
&=(\mathbf{x}-a)^{2}+\sum_{i=1}^{n}([\mathbf{y}]_{i}-a-[\mathbf{c}]_{i})^{2}\\
&=(\mathbf{x}-a)^{2}+n\left(-\mathbf{x}-\frac{\pi}{2}+2k\pi-a\right)^{2}\\
&=(n+1)\mathbf{x}^{2}+2\left[n\left(a+\frac{\pi}{2}-2k\pi\right)-a\right]\mathbf{x}+a^{2}+n\left(a+\frac{\pi}{2}-2k\pi\right)^{2}.
\end{aligned}

For a given $k$ , denote $C_{k}=-\frac{\pi}{2}+2k\pi$ , then

F_{k}=(n+1)\mathbf{x}^{2}+2\left[n(a-C_{k})-a\right]\mathbf{x}+a^{2}+n\left(a-C_{k}\right)^{2},

which is strongly convex and quadratic w.r.t. $\mathbf{x}$ . Thus, it is easy to obtain $\mathbf{x}_{k}^{*}=\frac{(1-n)a+nC_{k}}{1+n}$ , and

\begin{aligned}
F_{k}^{*}&=a^{2}+n(a-C_{k})^{2}-\frac{[n(a-C_{k})-a]^{2}}{n+1}\\
&=\frac{n}{n+1}(C_{k}-2a)^{2}.
\end{aligned}

Hence, by denoting

C=\underset{{k}}{\operatorname{argmin}}\left\{\|C_{k}-2a\|:C_{k}=-\frac{\pi}{2}+2k\pi,k\in\mathbb{Z}\right\},

we have the optimal value $F^{*}=\frac{n}{n+1}(C-2a)^{2}$ , and the corresponding solution

\mathbf{x}^{*}=\frac{(1-n)a+nC}{1+n},\ \text{ and }\ [\mathbf{y}^{*}]_{i}=C+[\mathbf{c}]_{i}-\mathbf{x}^{*},\forall\ i.
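The closed form above can be cross-checked numerically. The sketch below is only a sanity check under stated assumptions: it uses $a=2$ and $n=2$ as example values, restricts $k$ to a small range and $\mathbf{x}$ to a finite grid, and uses the fact from the derivation that $[\mathbf{c}]_{i}$ cancels once $[\mathbf{y}]_{i}=C_{k}+[\mathbf{c}]_{i}-\mathbf{x}$ is substituted. The function names are hypothetical.

```python
import math


def closed_form(a, n):
    """Closed-form x*, C and F* from the derivation above."""
    k = round((2.0 * a + math.pi / 2.0) / (2.0 * math.pi))  # C_k nearest to 2a
    C = -math.pi / 2.0 + 2.0 * math.pi * k
    x_star = ((1 - n) * a + n * C) / (1 + n)
    F_star = n * (C - 2.0 * a) ** 2 / (1 + n)
    return x_star, C, F_star


def upper_objective(x, C_k, a, n):
    """UL objective after substituting [y]_i = C_k + [c]_i - x (c cancels)."""
    return (x - a) ** 2 + n * (C_k - x - a) ** 2


def brute_force(a, n, k_range=range(-5, 6), grid=20001, half_width=20.0):
    """Minimize the UL objective over a grid in x and a finite range of k."""
    best = float("inf")
    for k in k_range:
        C_k = -math.pi / 2.0 + 2.0 * math.pi * k
        for j in range(grid):
            x = -half_width + 2.0 * half_width * j / (grid - 1)
            best = min(best, upper_objective(x, C_k, a, n))
    return best


x_star, C, F_star = closed_form(2.0, 2)
F_grid = brute_force(2.0, 2)
print(x_star, F_star, F_grid)
```

The grid value can only be slightly larger than the closed-form optimum, since the grid may miss the exact minimizer by at most half a grid step.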

B.2 BLO with constraints in Section 5.1.2

We consider the following constrained BLO problem with non-convex LL:

\begin{aligned}
&\min_{\mathbf{x}\in\mathbb{R},\mathbf{y}\in\mathbb{R}^{n}}\|\mathbf{x}-a\|^{2}+\left\|\mathbf{y}-a\right\|^{2}\\
&\text{ s.t. }[\mathbf{y}]_{i}\in\mathop{\arg\min}_{[\mathbf{y}]_{i}\in\mathbb{R}}\Big\{\sin\left(\mathbf{x}+[\mathbf{y}]_{i}-[\mathbf{c}]_{i}\right):\mathbf{x}+[\mathbf{y}]_{i}\in[0,1]\Big\},\forall\ i,
\end{aligned}

where $a\in\mathbb{R}$ and $\mathbf{c}\in\mathbb{R}^{n}$ are any fixed given constant and vector satisfying $[\mathbf{c}]_{i}\in[0,1]\text{ for any }i=1,\cdots,n$ . The optimal solution is

\mathbf{x}^{*}=\frac{1-n}{1+n}a,\text{ and }[\mathbf{y}^{*}]_{i}=-\mathbf{x}^{*},\forall\ i,

and the optimal value is $F^{*}=\frac{4n}{1+n}a^{2}$ . Derivation of the optimal solution and optimal value is as follows.

From $\mathbf{x}+[\mathbf{y}]_{i}\in[0,1],$ along with $[\mathbf{c}]_{i}\in[0,1],\forall\ i,$ it is easy to obtain

\mathbf{x}+[\mathbf{y}]_{i}-[\mathbf{c}]_{i}\in\big[-[\mathbf{c}]_{i},1-[\mathbf{c}]_{i}\big]\subset[-1,1]\subset\left[-\frac{\pi}{2},\frac{\pi}{2}\right].

Hence, $\sin\left(\mathbf{x}+[\mathbf{y}]_{i}-[\mathbf{c}]_{i}\right)$ is increasing w.r.t. $[\mathbf{y}]_{i}$ under the constraints for all $i$ . Thus, from the LL problem we have $\mathbf{x}+[\mathbf{y}]_{i}-[\mathbf{c}]_{i}=-[\mathbf{c}]_{i}$ , i.e., $[\mathbf{y}]_{i}=-\mathbf{x},\forall\ i.$ Then the problem is to find the $\mathbf{x}$ to minimize

\begin{aligned}
F&=\|\mathbf{x}-a\|^{2}+\left\|\mathbf{y}-a\right\|^{2}\\
&=(\mathbf{x}-a)^{2}+n(-\mathbf{x}-a)^{2}\\
&=(n+1)\mathbf{x}^{2}+2a(n-1)\mathbf{x}+(n+1)a^{2}.
\end{aligned}

Therefore, the optimal solution $\mathbf{x}^{*}=\frac{1-n}{1+n}a$ , $[\mathbf{y}^{*}]_{i}=-\mathbf{x}^{*},\forall\ i$ , and by substituting $\mathbf{x}^{*}$ into $F$ , we have the optimal value $F^{*}=\frac{4n}{1+n}a^{2}$ .
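Both steps of this derivation are easy to confirm numerically. The sketch below first solves the constrained LL problem by a grid search over $s=\mathbf{x}+[\mathbf{y}]_{i}\in[0,1]$, and then compares the reduced UL objective at the closed-form $\mathbf{x}^{*}$ with its neighbours. The values $a=2$, $n=3$, $[\mathbf{c}]_{i}=0.3$ and the function names are illustrative assumptions.

```python
import math


def ll_solution(x, c_i, grid=1001):
    """Grid search for [y]_i minimizing sin(x + y_i - c_i) s.t. x + y_i in [0, 1]."""
    candidates = (j / (grid - 1) for j in range(grid))  # s = x + y_i
    best_s = min(candidates, key=lambda s: math.sin(s - c_i))
    return best_s - x


def reduced_objective(x, a, n):
    """UL objective after substituting the LL solution [y]_i = -x."""
    return (x - a) ** 2 + n * (-x - a) ** 2


a, n = 2.0, 3
x_star = (1 - n) / (1 + n) * a
F_star = 4 * n / (1 + n) * a ** 2
y_i = ll_solution(x_star, c_i=0.3)
print(x_star, y_i, reduced_objective(x_star, a, n), F_star)
```

The LL search returns $[\mathbf{y}]_{i}=-\mathbf{x}^{*}$ because the sine is increasing on the feasible interval, so its minimum sits at the left endpoint $s=0$.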

B.3 Pessimistic BLO in Section 5.1.3

For the pessimistic BLO we use the following example:

\begin{aligned}
&\min_{\mathbf{x}\in\mathbb{R}}\max_{\mathbf{y}\in\mathbb{R}^{n}}\|\mathbf{x}-a\|^{2}-\|\mathbf{y}-a-\mathbf{c}\|^{2}\\
&\text{ s.t. }\;[\mathbf{y}]_{i}\in\underset{[\mathbf{y}]_{i}\in\mathbb{R}}{\mathrm{argmin}}\;\sin(\mathbf{x}+[\mathbf{y}]_{i}-[\mathbf{c}]_{i}),\forall\ i,
\end{aligned}
\tag{21}

where $[\mathbf{y}]_{i}$ denotes the $i$ -th component of $\mathbf{y}$ , while $a\in\mathbb{R}$ and $\mathbf{c}\in\mathbb{R}^{n}$ are adjustable parameters. In our experiment, we set $a=2$ and $[\mathbf{c}]_{i}=2\text{ for any }i=1,2,\cdots,n$ , and consider the 2-dimensional case (LL dimension $n=2$ ). The optimal solution to this problem is

(\mathbf{x}^{*},\mathbf{y}^{*})=\left(-2+\frac{\pi}{2},4\pm\pi,4\pm\pi\right),

and the optimal value is

F^{*}=\left(-4+\frac{\pi}{2}\right)^{2}-2\pi^{2}=-\frac{7}{4}\pi^{2}-4\pi+16.
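The stated optimum can be checked numerically before going through the derivation. The sketch below evaluates the pessimistic value function, i.e., the maximum of the UL objective over the LL solutions indexed by $k$, on a grid in $\mathbf{x}$ for $a=2$ and $n=2$. The finite range of $k$, the grid bounds and the name `phi` are assumptions made for this illustration.

```python
import math


def phi(x, a=2.0, n=2, k_range=range(-10, 11)):
    """Pessimistic value: max over LL solutions [y]_i = -x + [c]_i + C_k."""
    return max((x - a) ** 2 - n * (x + a + math.pi / 2.0 - 2.0 * math.pi * k) ** 2
               for k in k_range)


x_star = -2.0 + math.pi / 2.0
F_star = -7.0 / 4.0 * math.pi ** 2 - 4.0 * math.pi + 16.0

# Grid search over x in [-12, 12] with step 0.001.
grid = [-12.0 + 24.0 * j / 24000 for j in range(24001)]
x_best = min(grid, key=phi)
print(x_best, phi(x_best), x_star, F_star)
```

The minimizer sits at a kink of the value function, where two neighbouring values of $k$ give the same maximum; this matches the two-sided solution $4\pm\pi$ reported for $\mathbf{y}^{*}$.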

Derivation of the optimal solution and optimal value is as follows.

From the LL problem

[\mathbf{y}]_{i}\in\underset{[\mathbf{y}]_{i}\in\mathbb{R}}{\mathrm{argmin}}\;\sin(\mathbf{x}+[\mathbf{y}]_{i}-[\mathbf{c}]_{i}),\forall\ i,

we have $[\mathbf{y}]_{i}\in\left\{-\mathbf{x}+[\mathbf{c}]_{i}-\frac{\pi}{2}+2k\pi:k\in\mathbb{Z}\right\},\forall\ i$. Then the problem is transformed into

\min_{\mathbf{x}\in\mathbb{R}}\max_{k\in\mathbb{Z}}\ (\mathbf{x}-a)^{2}-n\left[\mathbf{x}+a-\left(-\frac{\pi}{2}+2k\pi\right)\right]^{2}.

Denote $C_{k}=-\frac{\pi}{2}+2k\pi$ . For a given $\mathbf{x}$ , suppose

\mathbf{x}+a\in[C_{\widehat{k}}-\pi,C_{\widehat{k}}+\pi]\ (\widehat{k}\in\mathbb{Z}).

Then maximizing $(\mathbf{x}-a)^{2}-n\left[\mathbf{x}+a-\left(-\frac{\pi}{2}+2k\pi\right)\right]^{2}$ w.r.t. $k$ amounts to minimizing $\left[\mathbf{x}+a-\left(-\frac{\pi}{2}+2k\pi\right)\right]^{2}=(\mathbf{x}+a-C_{k})^{2}$ w.r.t. $k$, and

\mathop{\arg\min}_{k}\ (\mathbf{x}+a-C_{k})^{2}=\widehat{k}.

Thus, the problem is transformed into

\min_{\mathbf{x}}\ (\mathbf{x}-a)^{2}-n\left(\mathbf{x}+a-C_{\widehat{k}}\right)^{2},\text{ if }\mathbf{x}+a\in[C_{\widehat{k}}-\pi,C_{\widehat{k}}+\pi],

i.e.,

\min_{\mathbf{x}\in\mathbb{R}}\varphi(\mathbf{x})

where

\varphi(\mathbf{x}):=(\mathbf{x}-a)^{2}-n\left(\mathbf{x}+a-C_{\widehat{k}}\right)^{2},\text{ if }\mathbf{x}\in[-a+C_{\widehat{k}}-\pi,-a+C_{\widehat{k}}+\pi]\ (\widehat{k}\in\mathbb{Z}).

It is easy to verify that $\varphi(\mathbf{x})$ is continuous on $\mathbb{R}$ (at the interval endpoints, $\varphi(\mathbf{x})=(\mathbf{x}-a)^{2}-n\pi^{2}$), and

\frac{1}{2}\varphi^{\prime}(\mathbf{x})=\mathbf{x}-a-n(\mathbf{x}+a-C_{\widehat{k}}),

when $\mathbf{x}\in[-a+C_{\widehat{k}}-\pi,-a+C_{\widehat{k}}+\pi]\ (\widehat{k}\in\mathbb{Z})$ .
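The piecewise definition of $\varphi$ and its derivative can be checked numerically. The sketch below is an illustration under assumptions: the helper names (`c_k`, `k_hat`, `phi`, `half_dphi`, `phi_bruteforce`) are hypothetical, and $\widehat{k}$ is computed by rounding to the nearest integer, which picks one of the two admissible indices at an interval endpoint.

```python
import math


def c_k(k):
    """C_k = -pi/2 + 2*k*pi."""
    return -math.pi / 2 + 2 * k * math.pi


def k_hat(x, a):
    """Index k whose C_k is nearest to x + a (either neighbor at a tie)."""
    return round((x + a + math.pi / 2) / (2 * math.pi))


def phi(x, a, n):
    """phi(x) = (x - a)^2 - n * (x + a - C_khat)^2."""
    return (x - a) ** 2 - n * (x + a - c_k(k_hat(x, a))) ** 2


def half_dphi(x, a, n):
    """(1/2) * phi'(x) on the interior of each interval."""
    return x - a - n * (x + a - c_k(k_hat(x, a)))


def phi_bruteforce(x, a, n, ks=range(-10, 11)):
    """Inner maximization over k on an assumed finite range of k."""
    return max((x - a) ** 2 - n * (x + a - c_k(k)) ** 2 for k in ks)


a, n = 2.0, 2
for x in (-7.3, -1.0, 0.4, 3.3, 9.9):
    print(x, phi(x, a, n), phi_bruteforce(x, a, n), half_dphi(x, a, n))
```

For moderate $\mathbf{x}$, `phi` should coincide with the brute-force maximum over $k$, and a finite-difference quotient of `phi` should match `half_dphi` away from the interval endpoints.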

Since we set $n=2$ and $a=2$, if $\mathbf{x}\in[-2+C_{\widehat{k}}-\pi,-2+C_{\widehat{k}}+\pi]$, then

\varphi(\mathbf{x}):=(\mathbf{x}-2)^{2}-2\left(\mathbf{x}+2-C_{\widehat{k}}\right)^{2},

\frac{1}{2}\varphi^{\prime}(\mathbf{x})=-\mathbf{x}-6+2C_{\widehat{k}}.

Hence,

\frac{1}{2}\varphi^{\prime}(\mathbf{x})\in\left[-4+C_{\widehat{k}}-\pi,-4+C_{\widehat{k}}+\pi\right].

Therefore,

(1) If $\widehat{k}\leq 0$, then

$\frac{1}{2}\varphi^{\prime}(\mathbf{x})\leq-4+C_{\widehat{k}}+\pi\leq-4+\frac{\pi}{2}<0,$

and $\varphi(\mathbf{x})$ is decreasing. In this case, $C_{\widehat{k}}\leq-\frac{\pi}{2}$, and $\mathbf{x}\leq-2+\frac{\pi}{2}$.

(2) If $\widehat{k}\geq 2$, then

$\frac{1}{2}\varphi^{\prime}(\mathbf{x})\geq-4+C_{\widehat{k}}-\pi\geq-4+\frac{5\pi}{2}>0,$

and $\varphi(\mathbf{x})$ is increasing. In this case, $C_{\widehat{k}}\geq\frac{7\pi}{2}$, and $\mathbf{x}\geq-2+\frac{5\pi}{2}$.

(3) When $\mathbf{x}\in\left[-2+\frac{\pi}{2},-2+\frac{5\pi}{2}\right]$, the corresponding $\widehat{k}=1$, and

$\varphi(\mathbf{x}):=(\mathbf{x}-2)^{2}-2\left(\mathbf{x}+2-\frac{3\pi}{2}\right)^{2},$

$\frac{1}{2}\varphi^{\prime}(\mathbf{x})=-\mathbf{x}-6+3\pi.$

Thus, when $\mathbf{x}\in\left(-2+\frac{\pi}{2},3\pi-6\right)$, $\varphi^{\prime}(\mathbf{x})>0$, and $\varphi(\mathbf{x})$ is increasing; when $\mathbf{x}\in\left(3\pi-6,-2+\frac{5\pi}{2}\right)$, $\varphi^{\prime}(\mathbf{x})<0$, and $\varphi(\mathbf{x})$ is decreasing.

To sum up, $\varphi(\mathbf{x})$ is increasing on $\left[-2+\frac{\pi}{2},3\pi-6\right]$ and $\left[-2+\frac{5\pi}{2},+\infty\right)$ , and decreasing on $\left(-\infty,-2+\frac{\pi}{2}\right]$ and $\left[3\pi-6,-2+\frac{5\pi}{2}\right]$ . Therefore, the optimal $\mathbf{x}$ occurs at either $-2+\frac{\pi}{2}$ or $-2+\frac{5\pi}{2}$ . Since at these two local minimizers,

\varphi\left(-2+\frac{\pi}{2}\right)=\left(-4+\frac{\pi}{2}\right)^{2}-2\pi^{2}<\varphi\left(-2+\frac{5\pi}{2}\right)=\left(-4+\frac{5\pi}{2}\right)^{2}-2\pi^{2},

we have $\mathbf{x}^{*}=-2+\frac{\pi}{2}$ , and

F^{*}=\left(-4+\frac{\pi}{2}\right)^{2}-2\pi^{2}=-\frac{7}{4}\pi^{2}-4\pi+16.
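As a cross-check of the monotonicity argument, $\varphi$ can be minimized by a plain grid search for $n=2$, $a=2$. This is a minimal sketch under assumptions: the search range $[-20,20]$, the step $10^{-3}$, and the function name `phi` are illustrative choices, not part of the original experiments.

```python
import math


def phi(x, a=2.0, n=2):
    """phi(x) = (x - a)^2 - n * (x + a - C_khat)^2 with C_k = -pi/2 + 2*k*pi."""
    k = round((x + a + math.pi / 2) / (2 * math.pi))  # nearest C_k to x + a
    c = -math.pi / 2 + 2 * k * math.pi
    return (x - a) ** 2 - n * (x + a - c) ** 2


# Closed-form minimizer and optimal value from the derivation.
x_star = -2 + math.pi / 2
f_star = -7 / 4 * math.pi ** 2 - 4 * math.pi + 16

# Grid search over an assumed bounded range.
grid = [-20 + 1e-3 * i for i in range(40001)]
x_grid = min(grid, key=phi)

# The second local minimizer identified in the derivation, for comparison.
x_other = -2 + 5 * math.pi / 2

print(x_star, f_star)
print(x_grid, phi(x_grid))
print(x_other, phi(x_other))
```

Because $\varphi$ has a kink at $\mathbf{x}^{*}$, the grid minimizer is only expected to agree with $\mathbf{x}^{*}$ up to the grid spacing.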

Since $\mathbf{x}^{*}=-2+\frac{\pi}{2}$ is the common endpoint of the intervals $\left[-2-\frac{\pi}{2}-\pi,-2-\frac{\pi}{2}+\pi\right]$ and $\left[-2+\frac{3\pi}{2}-\pi,-2+\frac{3\pi}{2}+\pi\right]$, we have $C_{\widehat{k}}=-\frac{\pi}{2}$ or $\frac{3\pi}{2}$, i.e., $\widehat{k}=0$ or $1$. Thus,

[\mathbf{y}^{*}]_{i}=-\mathbf{x}^{*}+[\mathbf{c}]_{i}-\frac{\pi}{2}=4-\pi

or

[\mathbf{y}^{*}]_{i}=-\mathbf{x}^{*}+[\mathbf{c}]_{i}+\frac{3\pi}{2}=4+\pi

for $i=1,2$ . Hence,

(\mathbf{x}^{*},\mathbf{y}^{*})=\left(-2+\frac{\pi}{2},4\pm\pi,4\pm\pi\right).
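The stated solution can also be substituted back into problem (21) directly. The sketch below is an illustration under assumptions: the names `ul_objective` and `ll_value` are hypothetical helpers, and the parameters are the ones used in the experiment ($a=2$, $[\mathbf{c}]_{i}=2$, $n=2$).

```python
import itertools
import math

a, c, n = 2.0, 2.0, 2
x_star = -2 + math.pi / 2
f_star = -7 / 4 * math.pi ** 2 - 4 * math.pi + 16


def ul_objective(x, y):
    """UL objective of (21): (x - a)^2 - ||y - a - c||^2."""
    return (x - a) ** 2 - sum((y_i - a - c) ** 2 for y_i in y)


def ll_value(x, y_i):
    """LL objective for one component: sin(x + [y]_i - [c]_i)."""
    return math.sin(x + y_i - c)


# Each component of y* may independently take either value 4 - pi or 4 + pi.
candidates = (4 - math.pi, 4 + math.pi)
for y in itertools.product(candidates, repeat=n):
    ll = [ll_value(x_star, y_i) for y_i in y]
    print(y, ll, ul_objective(x_star, y))
```

Each printed LL value should equal the global minimum $-1$ of the sine function, and each UL value should equal $F^{*}$ up to rounding.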
