This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

A New Simple Stochastic Gradient Descent Type Algorithm With Lower Computational Complexity for Bilevel Optimization

Haimei Huo, Risheng Liu, , Zhixun Su Haimei Huo is with the School of Mathematical Sciences, Dalian University of Technology, Dalian 116024, China (e-mail: ab1234@mail.dlut.edu.cn).Risheng Liu(Corresponding author) is with the DUT-RU International School of Information Science and Engineering, Dalian University of Technology, and also with the Key Laboratory for Ubiquitous Network and Service Software of Liaoning Province, Dalian 116024, China (e-mail: rsliu@dlut.edu.cn).Zhixun Su(Corresponding author) is with the School of Mathematical Sciences, Dalian University of Technology, with the State Key Laboratory of Structural Analysis for Industrial Equipment, and also with the Key Laboratory for Computational Mathematics and Data Intelligence of Liaoning Province, Dalian 116024, China (e-mail: zxsu@dlut.edu.cn).
Abstract

Bilevel optimization has been widely used in many machine learning applications such as hyperparameter optimization and meta learning. Recently, many simple stochastic gradient descent(SGD) type algorithms(without using momentum and variance techniques) have been proposed to solve the bilevel optimization problems. However, all the existing simple SGD type algorithms estimate the hypergradient via stochastic estimation of Neumann series. In the paper, we propose to estimate the hypergradient via SGD-based Estimation(i.e., solving the linear system with SGD). By using warm start initialization strategy, a new simple SGD type algorithm SSGD based on SGD-based Estimation is proposed. We provide the convergence rate guarantee for SSGD and show that SSGD outperforms the best known computational complexity achieved by the existing simple SGD type algorithms. Our experiments validate our theoretical results and demonstrate the efficiency of our proposed algorithm SSGD in hyperparameter optimization applications.

Index Terms:
Bilevel optimization, stochastic gradient descent(SGD)-based Estimation, simple SGD type algorithms, warm start.

I Introduction

Bilevel optimization has become a powerful tool for many machine learning applications such as meta learning[1, 2, 3, 4], hyperparameter optimization[5], and reinforcement learning[6, 7]. It has a nested structure which contains two levels of optimization tasks, and the feasible region of the upper level(UL) function is restricted by the optimal solutions of the lower level(LL) problem. Generally, bilevel optimization can be expressed as follows:

min𝒙pΦ(𝒙):=f(𝒙,𝒚(𝒙))\displaystyle\min\limits_{\boldsymbol{x}\in\mathbb{R}^{p}}\Phi(\boldsymbol{x}):=f(\boldsymbol{x},\boldsymbol{y}^{*}(\boldsymbol{x})) (1)
s.t.𝒚(𝒙)𝒮(𝒙):=argmin𝒚qg(𝒙,𝒚)\displaystyle s.t.~{}\boldsymbol{y}^{*}(\boldsymbol{x})\in\mathcal{S}(\boldsymbol{x}):={\underset{\boldsymbol{y}\in\mathbb{R}^{q}}{\operatorname{arg\,min}}\,g(\boldsymbol{x},\boldsymbol{y})}

where f(𝒙,𝒚)f(\boldsymbol{x},\boldsymbol{y}) is the UL function, g(𝒙,𝒚)g(\boldsymbol{x},\boldsymbol{y}) is the LL function, and 𝒮(𝒙)\mathcal{S}(\boldsymbol{x}) is the solution set of the LL problem.

Many deterministic algorithms have been proposed to solve the bilevel problem (1). In early works, the bilevel problem is reformulated into single level problem using the first order optimality conditions of the LL problem[8, 9, 10, 11]. However, such type of methods often involve a large number of constraints, and are not applicable in complex machine learning problems. To address the problem, gradient based algorithms have been revisited to solve problem (1). Depending on the way to approximate the hypergradient Φ(𝒙)\nabla\Phi(\boldsymbol{x}), these algorithms can be generally categorized into the iterative differentiation(ITD) based approach[1, 5, 12, 13, 14] and the approximate implicit differentiation(AID) based approach[12, 15, 16, 17].

ITD based approach treats the iterative optimization algorithm for the LL problem as a dynamical system, and approximates the hypergradient by back-propagation(BP) along the dynamical system [12, 18], this kind of hypergradient approximation method termed as BP in the paper. Since computing the gradient along the entire dynamical system is computationally challenging for high-dimensional problems, [5] proposed to truncate the path to compute the gradient. On the other hand, the convergence of ITD based approach for solving problem (1)(\ref{eq1}) has also been studied. [1] assumed that the LL function g(𝒙,)g(\boldsymbol{x},\cdot) is strongly convex, and provided the asymptotic convergence analysis. To address the case that the LL function g(𝒙,)g(\boldsymbol{x},\cdot) is convex, [19] proposed BDA and the asymptotic convergence analysis is provided. For the nonasymptotic convergence analysis of ITD based approach, see [14].

AID based approach applies implicit function theorem to the first order optimality conditions of the LL problem, and approximates the hypergradient by solving a linear system[15, 16]. There are mainly two methods to solve the linear system: conjugate gradient descent(CG)[12] and approximating the Hessian inverse with Neumann series[17], these two hypergradient approximation methods termed as CG and Neumann series(NS) in the paper. For the nonasymptotic convergence analysis of the AID based approach, see [14, 17].

In fact, in many machine learning applications, the bilevel optimization takes the following form:

min𝒙pΦ(𝒙):=f(𝒙,𝒚(𝒙)):=𝔼𝝃[F(𝒙,𝒚(𝒙);𝝃)]\displaystyle\min\limits_{\boldsymbol{x}\in\mathbb{R}^{p}}\Phi(\boldsymbol{x}):=f\left(\boldsymbol{x},\boldsymbol{y}^{*}(\boldsymbol{x})\right):=\mathbb{E}_{\boldsymbol{\xi}}\left[F(\boldsymbol{x},\boldsymbol{y}^{*}(\boldsymbol{x});\boldsymbol{\xi})\right] (2)
s.t.𝒚(𝒙):=argmin𝒚qg(𝒙,𝒚):=𝔼𝜻[G(𝒙,𝒚;𝜻)]\displaystyle s.t.~{}\boldsymbol{y}^{*}(\boldsymbol{x}):={\underset{\boldsymbol{y}\in\mathbb{R}^{q}}{\operatorname{arg\,min}}\,g(\boldsymbol{x},\boldsymbol{y})}:=\mathbb{E}_{\boldsymbol{\zeta}}[G(\boldsymbol{x},\boldsymbol{y};\boldsymbol{\zeta)}]

where UL function f(𝒙,𝒚)f(\boldsymbol{x},\boldsymbol{y}) and LL function g(𝒙,𝒚)g(\boldsymbol{x},\boldsymbol{y}) take the expected values with respect to(w.r.t.) random variables 𝝃\boldsymbol{\xi} and 𝜻\boldsymbol{\zeta}, respectively; g(𝒙,)g(\boldsymbol{x},\cdot) is strongly convex w.r.t. 𝒚\boldsymbol{y} and Φ(𝒙)\Phi(\boldsymbol{x}) is nonconvex w.r.t. 𝒙\boldsymbol{x}. Notice that if 𝝃\boldsymbol{\xi} and 𝜻\boldsymbol{\zeta} are discrete random variables, the values of 𝝃\boldsymbol{\xi} and 𝜻\boldsymbol{\zeta} are chosen from the given data sets 𝒟n:={𝝃1,,𝝃n}\mathcal{D}_{n}:=\{\boldsymbol{\xi}_{1},\ldots,\boldsymbol{\xi}_{n}\} and 𝒟m:={𝜻1,,𝜻m}\mathcal{D}_{m}:=\{\boldsymbol{\zeta}_{1},\ldots,\boldsymbol{\zeta}_{m}\}, respectively, and the probability to sample data from 𝒟n\mathcal{D}_{n} and 𝒟m\mathcal{D}_{m} is equal, respectively, then the objective functions in problem (2) can be written as f(𝒙,𝒚):=1ni=1nF(𝒙,𝒚;𝝃i)f(\boldsymbol{x},\boldsymbol{y}):=\frac{1}{n}\sum_{i=1}^{n}F(\boldsymbol{x},\boldsymbol{y};\boldsymbol{\xi}_{i}) and g(𝒙,𝒚):=1mi=1mG(𝒙,𝒚;𝜻i)g(\boldsymbol{x},\boldsymbol{y}):=\frac{1}{m}\sum_{i=1}^{m}G(\boldsymbol{x},\boldsymbol{y};\boldsymbol{\zeta}_{i}).

In order to achieve better efficiency than deterministic algorithms for solving problem (2), many stochastic algorithms have been proposed. Without considering the loop to estimate the Hessian inverse, these existing stochastic algorithms for problem (2) can be categorized into single loop algorithms and double loop algorithms according to the number of loops.

For single loop algorithms, the upper variable 𝒙\boldsymbol{x} and the LL variable 𝒚\boldsymbol{y} are updated simultaneously. As far as we know, the first single loop algorithm proposed for solving problem (2) is the two-timescale stochastic approximation(TTSA) algorithm[7], which is a simple SGD type algorithm. To ensure convergence, a larger step size for the LL problem and a smaller step size for the UL problem are required. Moreover, to reach an ϵ\epsilon-stationary point, the computational complexity is 𝒪~(ϵ5/2)\widetilde{\mathcal{O}}(\epsilon^{-5/2}) in terms of the target accuracy ϵ\epsilon. Recently, based on momentum and variance techniques, some single loop algorithms with improved computational complexity have been proposed, see e.g., [20, 21].

For a double loop algorithm, it contains two loops in which the inner loop is used to update the LL variable 𝒚\boldsymbol{y} to obtain an approximation to the optimal solution 𝒚(𝒙)\boldsymbol{y}^{*}(\boldsymbol{x}) of the LL problem, and the outer loop is used to update the UL variable 𝒙\boldsymbol{x}. Following the direction, [17] proposed a bilevel stochastic approximation(BSA) algorithm and the computational complexity is 𝒪~(ϵ3)\widetilde{\mathcal{O}}(\epsilon^{-3}) in terms of the target accuracy ϵ\epsilon. By introducing warm start initializaiton strategy, stocBiO[14] achieves the computational complexity of 𝒪~(ϵ2)\widetilde{\mathcal{O}}(\epsilon^{-2}) in terms of the target accuracy ϵ\epsilon. In addition, in order to improve the computational complexity of the simple SGD type double loop algorithms, variance reduction is used for both UL variable 𝒙\boldsymbol{x} and LL variable 𝒚\boldsymbol{y} in [22], and the computational complexity is 𝒪~(ϵ1.5)\widetilde{\mathcal{O}}(\epsilon^{-1.5}) in terms of the target accuracy ϵ\epsilon.

Although many stochastic algorithms have been proposed to solve problem (2), little attention is paid to the the hypergradient estimation process except a few attempts recently. [23] noticed that to estimate the hypergradient, the existing single loop algorithms and double loop algorithms require an additional loop to estimate Hessian inverse. By studying the well known hypergradient approximation methods such as BP[12], CG[12], and NS[17], they identified a general approximation formulation of hypergradient computation. Based on the formulation and the momentum based variance reduction technique used in [24], they proposed a fully single loop algorithm(FSLA) in which an additional loop to estimate the Hessian inverse is not required.

In the paper, our goal is to design a new simple stochastic gradient descent(SGD) type algorithm for problem (2), which can achieve lower computational complexity compared with the existing simple SGD type algorithms(simple SGD type algorithms refer to stochastic algorithms that do not use momentum and variance techniques). We notice that the existing simple SGD type algorithms(i.e., BSA, TTSA, and stocBiO) for solving problem (2) estimate the hypergradient via stochastic estimation of Neumann series, this kind of hypergradient estimation method termed as Stochastic NS in the paper. Inspired by the work in [23], we study some hypergradient estimation methods: Stochastic BP, Stochastic NS, and SGD-based Estimation, and find that the algorithm has the potential to converge faster by using SGD-based Estimation to estimate the hypergradient. Based on the finding, by using warm start initialization strategy, a new simple SGD type algorithm SSGD based on SGD-based Estimation is proposed. Through our theoretical analysis, SSGD can achieve lower complexity compared with the existing simple SGD type algorithms. The main contributions of this paper are:

  1. 1.

    We propose to use SGD-based Estimation to estimate the hypergradient.

  2. 2.

    We propose a new simple SGD type algorithm SSGD.

  3. 3.

    We prove that SSGD can achieve lower computational complexity compared with the existing simple SGD type algorithms. As shown in Table I, the computational complexity of SSGD to reach an ϵ\epsilon-stationary point outperforms that of BSA and stocBiO by an order of 𝒪(ϵ1log(1/ϵ))\mathcal{O}(\epsilon^{-1}\log({1}/{\epsilon})) and 𝒪(log(1/ϵ))\mathcal{O}(\log({1}/{\epsilon})), respectively. In addition, in terms of the target accuracy ϵ\epsilon, the computational complexity of SSGD outperforms that of TTSA by an order of 𝒪(ϵ0.5log(1/ϵ))\mathcal{O}(\epsilon^{-0.5}\log({1}/{\epsilon})).

  4. 4.

    We perform experiments to validate our theoretical results and demonstrate the application of our algorithm in hyperparameter optimization.

The rest of this paper is organized as follows. In Section II, we first make some settings for problem (2), and specify the problem to study in this paper. Then, for problem (2), we introduce some hypergradient estimation methods and analyze them. In Section III, we propose a new simple SGD type algorithm for problem (2). The convergence and complexity results of our proposed algorithm are provided in Section IV. Experimental results are provided in Section V. In Section VI, we conclude and summarize the paper.

Notations. We use \|\cdot\| to denote the l2l_{2} norm for vectors and spectral norm for matrices. For a twice differentiable function f(𝒙,𝒚):p×qf(\boldsymbol{x},\boldsymbol{y}):\mathbb{R}^{p}\times\mathbb{R}^{q}\rightarrow\mathbb{R}, f(𝒙,𝒚)\nabla f(\boldsymbol{x},\boldsymbol{y}) denotes the gradient of f(𝒙,𝒚)f(\boldsymbol{x},\boldsymbol{y}) taken w.r.t. all the variables (𝒙,𝒚)(\boldsymbol{x},\boldsymbol{y}), 𝒙f(𝒙,𝒚)\nabla_{\boldsymbol{x}}f(\boldsymbol{x},\boldsymbol{y})(resp. 𝒚f(𝒙,𝒚)\nabla_{\boldsymbol{y}}f(\boldsymbol{x},\boldsymbol{y})) denotes its partial derivate taken w.r.t. 𝒙\boldsymbol{x}(resp. 𝒚\boldsymbol{y}), 𝒙𝒚f(𝒙,𝒚)\nabla_{\boldsymbol{x}}\nabla_{\boldsymbol{y}}f(\boldsymbol{x},\boldsymbol{y}) denotes the Jacobian of 𝒙f(𝒙,𝒚)\nabla_{\boldsymbol{x}}f(\boldsymbol{x},\boldsymbol{y}) w.r.t. 𝒚\boldsymbol{y}, 𝒚2f(𝒙,𝒚)\nabla_{\boldsymbol{y}}^{2}f(\boldsymbol{x},\boldsymbol{y}) denotes the Hessian matrix of f(𝒙,𝒚)f(\boldsymbol{x},\boldsymbol{y}) w.r.t. 𝒚\boldsymbol{y}. Furthermore, for a twice differentiable function f(𝒙,𝒚):=𝔼𝝃[F(𝒙,𝒚;𝝃)]:p×qf(\boldsymbol{x},\boldsymbol{y}):=\mathbb{E}_{\boldsymbol{\xi}}[F(\boldsymbol{x},\boldsymbol{y};\boldsymbol{\xi})]:\mathbb{R}^{p}\times\mathbb{R}^{q}\rightarrow\mathbb{R}, which takes expected value w.r.t. random variable 𝝃\boldsymbol{\xi}, we define

𝒙F(𝒙,𝒚;𝒟)=1|𝒟|i=1|𝒟|𝒙F(𝒙,𝒚;𝝃i)\nabla_{\boldsymbol{x}}F(\boldsymbol{x},\boldsymbol{y};\mathcal{D})=\frac{1}{|\mathcal{D}|}\sum\limits_{i=1}^{|\mathcal{D}|}\nabla_{\boldsymbol{x}}F(\boldsymbol{x},\boldsymbol{y};\boldsymbol{\xi}_{i})

where 𝒟={𝝃1,,𝝃|𝒟|}\mathcal{D}=\{\boldsymbol{\xi}_{1},\ldots,\boldsymbol{\xi}_{|\mathcal{D}|}\} is a sample set with samples 𝝃i\boldsymbol{\xi}_{i}, i=1,,|𝒟|i=1,\ldots,|\mathcal{D}| sampled over the distribution of 𝝃\boldsymbol{\xi}, and |𝒟||\mathcal{D}| is the sample size of 𝒟\mathcal{D}. Similarly, the definitions of 𝒚F(𝒙,𝒚;𝒟)\nabla_{\boldsymbol{y}}F(\boldsymbol{x},\boldsymbol{y};\mathcal{D}), 𝒙𝒚F(𝒙,𝒚;𝒟)\nabla_{\boldsymbol{x}}\nabla_{\boldsymbol{y}}F(\boldsymbol{x},\boldsymbol{y};\mathcal{D}), and 𝒚2F(𝒙,𝒚;𝒟)\nabla_{\boldsymbol{y}}^{2}F(\boldsymbol{x},\boldsymbol{y};\mathcal{D}) can be obtained.

TABLE I: Computational complexity of the simple SGD type algorithms to reach an ϵ\epsilon-stationary point of Φ(𝒙)\Phi(\boldsymbol{x}) for problem (2)(the computational complexity of BSA, TTSA, and stocBiO can be obtained from Table 1 in [25]). Note that κ\kappa denotes the condition number; the notation κp\kappa^{p} denotes a polynomial function of κ\kappa since [7] does not provide the dependence on the condition number κ\kappa
Algorithm loop batch size computational complexity
BSA[17] double 𝒪(1)\mathcal{O}(1) 𝒪(κ9ϵ3log(1/ϵ))\mathcal{O}(\kappa^{9}\epsilon^{-3}\log(1/\epsilon))
TTSA [7] single 𝒪(1)\mathcal{O}(1) 𝒪(κpϵ5/2log(1/ϵ))\mathcal{O}(\kappa^{p}\epsilon^{-{5}/{2}}\log(1/\epsilon))
stocBiO [14] double 𝒪(ϵ1)\mathcal{O}(\epsilon^{-1}) 𝒪(κ9ϵ2log(1/ϵ))\mathcal{O}(\kappa^{9}\epsilon^{-2}\log(1/\epsilon))
SSGD(Ours) double 𝒪(ϵ1)\mathcal{O}(\epsilon^{-1}) 𝒪(κ9ϵ2)\mathcal{O}(\kappa^{9}\epsilon^{-2})

II Analysis

In this section, we first make some settings for problem (2), and then introduce some hypergradient estimation methods for problem (2). Finally, we analyze these hypergradient estimation methods.

II-A Preliminaries

In the following, we make some assumptions on the objective functions in problem (2).

Assumption 1.

The functions g(𝐱,𝐲)g(\boldsymbol{x},\boldsymbol{y}) and G(𝐱,𝐲;𝛇)G(\boldsymbol{x},\boldsymbol{y};\boldsymbol{\zeta}) are μ\mu-strongly convex w.r.t. 𝐲\boldsymbol{y} for any 𝐱\boldsymbol{x} and 𝛇\boldsymbol{\zeta}, and Φ(𝐱)\Phi(\boldsymbol{x}) is nonconvex w.r.t. 𝐱\boldsymbol{x}.

Assumption 2.

The functions f(𝐱,𝐲)f(\boldsymbol{x},\boldsymbol{y}) and g(𝐱,𝐲)g(\boldsymbol{x},\boldsymbol{y}) satisfy

  • f(𝒙,𝒚)f(\boldsymbol{x},\boldsymbol{y}) is MM-Lipschitz, i.e., for 𝒛1:=[𝒙1;𝒚1]\boldsymbol{z}_{1}:=[\boldsymbol{x}_{1};\boldsymbol{y}_{1}], 𝒛2:=[𝒙2;𝒚2]\boldsymbol{z}_{2}:=[\boldsymbol{x}_{2};\boldsymbol{y}_{2}], we have

    f(𝒙1,𝒚1)f(𝒙2,𝒚2)M𝒛1𝒛2.\|f(\boldsymbol{x}_{1},\boldsymbol{y}_{1})-f(\boldsymbol{x}_{2},\boldsymbol{y}_{2})\|\leq M\|\boldsymbol{z}_{1}-\boldsymbol{z}_{2}\|.
  • f(𝒙,𝒚)\nabla f(\boldsymbol{x},\boldsymbol{y}) and g(𝒙,𝒚)\nabla g(\boldsymbol{x},\boldsymbol{y}) are LL-Lipschitz, i.e., for 𝒛1:=[𝒙1;𝒚1]\boldsymbol{z}_{1}:=[\boldsymbol{x}_{1};\boldsymbol{y}_{1}], 𝒛2:=[𝒙2;𝒚2]\boldsymbol{z}_{2}:=[\boldsymbol{x}_{2};\boldsymbol{y}_{2}], we have

    f(𝒙1,𝒚1)f(𝒙2,𝒚2)L𝒛1𝒛2,\displaystyle\|\nabla f(\boldsymbol{x}_{1},\boldsymbol{y}_{1})-\nabla f(\boldsymbol{x}_{2},\boldsymbol{y}_{2})\|\leq L\|\boldsymbol{z}_{1}-\boldsymbol{z}_{2}\|,
    g(𝒙1,𝒚1)g(𝒙2,𝒚2)L𝒛1𝒛2.\displaystyle\|\nabla g(\boldsymbol{x}_{1},\boldsymbol{y}_{1})-\nabla g(\boldsymbol{x}_{2},\boldsymbol{y}_{2})\|\leq L\|\boldsymbol{z}_{1}-\boldsymbol{z}_{2}\|.
  • 𝒙𝒚g(𝒙,𝒚)\nabla_{\boldsymbol{x}}\nabla_{\boldsymbol{y}}g(\boldsymbol{x},\boldsymbol{y}) and 𝒚2g(𝒙,𝒚)\nabla_{\boldsymbol{y}}^{2}g(\boldsymbol{x},\boldsymbol{y}) are τ\tau- and ρ\rho-Lipschitz, i.e., for 𝒛1:=[𝒙1;𝒚1]\boldsymbol{z}_{1}:=[\boldsymbol{x}_{1};\boldsymbol{y}_{1}], 𝒛2:=[𝒙2;𝒚2]\boldsymbol{z}_{2}:=[\boldsymbol{x}_{2};\boldsymbol{y}_{2}], we have

    𝒙𝒚g(𝒙1,𝒚1)𝒙𝒚g(𝒙2,𝒚2)τ𝒛1𝒛2,\displaystyle\|\nabla_{\boldsymbol{x}}\nabla_{\boldsymbol{y}}g(\boldsymbol{x}_{1},\boldsymbol{y}_{1})-\nabla_{\boldsymbol{x}}\nabla_{\boldsymbol{y}}g(\boldsymbol{x}_{2},\boldsymbol{y}_{2})\|\leq\tau\|\boldsymbol{z}_{1}-\boldsymbol{z}_{2}\|,
    𝒚2g(𝒙1,𝒚1)𝒚2g(𝒙2,𝒚2)ρ𝒛1𝒛2.\displaystyle\|\nabla_{\boldsymbol{y}}^{2}g(\boldsymbol{x}_{1},\boldsymbol{y}_{1})-\nabla_{\boldsymbol{y}}^{2}g(\boldsymbol{x}_{2},\boldsymbol{y}_{2})\|\leq\rho\|\boldsymbol{z}_{1}-\boldsymbol{z}_{2}\|.

Furthermore, the same assumption holds for F(𝐱,𝐲;𝛏)F(\boldsymbol{x},\boldsymbol{y};\boldsymbol{\xi}) and G(𝐱,𝐲;𝛇)G(\boldsymbol{x},\boldsymbol{y};\boldsymbol{\zeta}) for any given 𝛏\boldsymbol{\xi} and 𝛇\boldsymbol{\zeta}.

In fact, for problem (2)(\ref{eq2}), Assumptions 1 and 2 are widely adopted to establish the convergence of the stochastic algorithms such as BSA, TTSA, and stocBiO. Based on Assumptions 1 and 2, the differentiability of Φ(𝒙)\Phi(\boldsymbol{x}), and the Lipschitz continuity of 𝒚(𝒙)\boldsymbol{y}^{*}(\boldsymbol{x}) and Φ(𝒙)\nabla\Phi(\boldsymbol{x}) can be ensured, as shown below.

Proposition 1.

For problem (2), suppose Assumptions 1 and 2 hold. Then Φ(𝐱)\Phi(\boldsymbol{x}) is differentiable, and the hypergradient Φ(𝐱)\nabla\Phi(\boldsymbol{x}) takes the form of

Φ(𝒙)=𝒙f(𝒙,𝒚(𝒙))𝒙𝒚g(𝒙,𝒚(𝒙))𝒗(𝒙,𝒚(𝒙))\nabla\Phi(\boldsymbol{x})=\nabla_{\boldsymbol{x}}f(\boldsymbol{x},\boldsymbol{y}^{*}(\boldsymbol{x}))-\nabla_{\boldsymbol{x}}\nabla_{\boldsymbol{y}}g(\boldsymbol{x},\boldsymbol{y}^{*}(\boldsymbol{x}))\boldsymbol{v}(\boldsymbol{x},\boldsymbol{y}^{*}(\boldsymbol{x})) (3)

where 𝐯(𝐱,𝐲(𝐱))=[𝐲2g(𝐱,𝐲(𝐱))]1𝐲f(𝐱,𝐲(𝐱))\boldsymbol{v}(\boldsymbol{x},\boldsymbol{y}^{*}(\boldsymbol{x}))=[\nabla_{\boldsymbol{y}}^{2}g(\boldsymbol{x},\boldsymbol{y}^{*}(\boldsymbol{x}))]^{-1}\nabla_{\boldsymbol{y}}f(\boldsymbol{x},\boldsymbol{y}^{*}(\boldsymbol{x})) is the solution of the following linear system:

𝒚2g(𝒙,𝒚(𝒙))𝒗=𝒚f(𝒙,𝒚(𝒙)).\nabla_{\boldsymbol{y}}^{2}g\left(\boldsymbol{x},\boldsymbol{y}^{*}(\boldsymbol{x})\right)\boldsymbol{v}=\nabla_{\boldsymbol{y}}f\left(\boldsymbol{x},\boldsymbol{y}^{*}(\boldsymbol{x})\right).

Furthermore, 𝐲(𝐱)\boldsymbol{y}^{*}(\boldsymbol{x}) and Φ(𝐱)\nabla\Phi(\boldsymbol{x}) are Lipschitz continuous, i.e., for any 𝐱1\boldsymbol{x}_{1}, 𝐱2p\boldsymbol{x}_{2}\in\mathbb{R}^{p}, we have

𝒚(𝒙1)𝒚(𝒙2)Lμ𝒙1𝒙2,\displaystyle\|\boldsymbol{y}^{*}(\boldsymbol{x}_{1})-\boldsymbol{y}^{*}(\boldsymbol{x}_{2})\|\leq\frac{L}{\mu}\|\boldsymbol{x}_{1}-\boldsymbol{x}_{2}\|,
Φ(𝒙1)Φ(𝒙2)LΦ𝒙1𝒙2,\displaystyle\|\nabla\Phi(\boldsymbol{x}_{1})-\nabla\Phi(\boldsymbol{x}_{2})\|\leq L_{\Phi}\|\boldsymbol{x}_{1}-\boldsymbol{x}_{2}\|,

where LΦ=L+(2L2+τM2)/μ+(ρLM+L3+τML)/μ2+(ρL2M)/μ3L_{\Phi}=L+(2L^{2}+\tau M^{2})/{\mu}+(\rho LM+L^{3}+\tau ML)/{\mu^{2}}+(\rho L^{2}M)/{\mu^{3}}.

The proof of Proposition 1 is in the supplementary material. From Assumption 1 and Proposition 1, we know that Φ(𝒙)\Phi(\boldsymbol{x}) is nonconvex and differentiable. Therefore, we want the algorithms to find an ϵ\epsilon-stationary point defined as follows.

Definition 1.

For problem (2), we call 𝐱¯\bar{\boldsymbol{x}} an ϵ\epsilon-stationary point of Φ(𝐱)\Phi(\boldsymbol{x}) if 𝔼Φ(𝐱¯)2ϵ\mathbb{E}\|\nabla\Phi(\bar{\boldsymbol{x}})\|^{2}\leq\epsilon.

To facilitate our analysis of the complexity of the algorithm when it reaches the ϵ\epsilon stationary point, we borrow the metrics of complexity as defined in Definition 2 of [14].

Definition 2.

For a stochastic function F(𝐱,𝐲;𝛏)F(\boldsymbol{x},\boldsymbol{y};\boldsymbol{\xi}) with 𝐱p\boldsymbol{x}\in\mathbb{R}^{p}, 𝐲q\boldsymbol{y}\in\mathbb{R}^{q}, and 𝛏\boldsymbol{\xi} being a random variable, and a vector 𝐯q\boldsymbol{v}\in\mathbb{R}^{q}, let Gc(F,ϵ)(F,\epsilon) be the number of partial derivatives 𝐱F(𝐱,𝐲;𝛏)\nabla_{\boldsymbol{x}}F(\boldsymbol{x},\boldsymbol{y};\boldsymbol{\xi}) or 𝐲F(𝐱,𝐲;𝛏)\nabla_{\boldsymbol{y}}F(\boldsymbol{x},\boldsymbol{y};\boldsymbol{\xi}), and let Jv(F,ϵ)(F,\epsilon) and Hv(F,ϵ)(F,\epsilon) be the number of Jacobian vector products 𝐱𝐲F(𝐱,𝐲;𝛏)𝐯\nabla_{\boldsymbol{x}}\nabla_{\boldsymbol{y}}F(\boldsymbol{x},\boldsymbol{y};\boldsymbol{\xi})\boldsymbol{v} and Hessian vector products 𝐲2F(𝐱,𝐲;𝛏)𝐯\nabla_{\boldsymbol{y}}^{2}F(\boldsymbol{x},\boldsymbol{y};\boldsymbol{\xi})\boldsymbol{v}, respectively.

To prove the convergence of the stochastic algorithms, as adopted in [7, 14, 17], we make the following assumptions on the stochastic derivatives.

Assumption 3.

For problem (2), the stochastic derivatives F(𝐱,𝐲;𝛏)\nabla F(\boldsymbol{x},\boldsymbol{y};\boldsymbol{\xi}), G(𝐱,𝐲;𝛇)\nabla G(\boldsymbol{x},\boldsymbol{y};\boldsymbol{\zeta}), 𝐱𝐲G(𝐱,𝐲;𝛇)\nabla_{\boldsymbol{x}}\nabla_{\boldsymbol{y}}G(\boldsymbol{x},\boldsymbol{y};\boldsymbol{\zeta}), and 𝐲2G(𝐱,𝐲;𝛇)\nabla_{\boldsymbol{y}}^{2}G(\boldsymbol{x},\boldsymbol{y};\boldsymbol{\zeta}) are unbiased estimates of f(𝐱,𝐲)\nabla f(\boldsymbol{x},\boldsymbol{y}), g(𝐱,𝐲)\nabla g(\boldsymbol{x},\boldsymbol{y}), 𝐱𝐲g(𝐱,𝐲)\nabla_{\boldsymbol{x}}\nabla_{\boldsymbol{y}}g(\boldsymbol{x},\boldsymbol{y}), and 𝐲2g(𝐱,𝐲)\nabla_{\boldsymbol{y}}^{2}g(\boldsymbol{x},\boldsymbol{y}), respectively. Furthermore, their variances satisfy

𝔼F(𝒙,𝒚;𝝃)f(𝒙,𝒚)2σf2,\displaystyle\mathbb{E}\|\nabla F(\boldsymbol{x},\boldsymbol{y};\boldsymbol{\xi})-\nabla f(\boldsymbol{x},\boldsymbol{y})\|^{2}\leq\sigma_{f}^{2},
𝔼G(𝒙,𝒚;𝜻)g(𝒙,𝒚)2σg2,\displaystyle\mathbb{E}\|\nabla G(\boldsymbol{x},\boldsymbol{y};\boldsymbol{\zeta})-\nabla g(\boldsymbol{x},\boldsymbol{y})\|^{2}\leq\sigma_{g}^{2},
𝔼𝒙𝒚G(𝒙,𝒚;𝜻)𝒙𝒚g(𝒙,𝒚)2σg,12,\displaystyle\mathbb{E}\|\nabla_{\boldsymbol{x}}\nabla_{\boldsymbol{y}}G(\boldsymbol{x},\boldsymbol{y};\boldsymbol{\zeta})-\nabla_{\boldsymbol{x}}\nabla_{\boldsymbol{y}}g(\boldsymbol{x},\boldsymbol{y})\|^{2}\leq\sigma_{g,1}^{2},
𝔼𝒚2G(𝒙,𝒚;𝜻)𝒚2g(𝒙,𝒚)2σg,22,\displaystyle\mathbb{E}\|\nabla_{\boldsymbol{y}}^{2}G(\boldsymbol{x},\boldsymbol{y};\boldsymbol{\zeta})-\nabla_{\boldsymbol{y}}^{2}g(\boldsymbol{x},\boldsymbol{y})\|^{2}\leq\sigma_{g,2}^{2},

where σf2\sigma_{f}^{2}, σg2\sigma_{g}^{2}, σg,12\sigma_{g,1}^{2}, and σg,22\sigma_{g,2}^{2} are positive constants.

II-B Introduction of Some Hypergradient Estimation Methods

In the paper, we study the hypergradient estimation methods: Stochastic BP, Stochastic NS, and SGD-based Estimation, where Stochastic NS is adopted to estimate the hypergradient for all the existing simple SGD type algorithms for solving problem (2), and Stochastic BP and SGD-based Estimation are developed in this paper for problem (2). Notice that Stochastic NS is actually a stochastic version of NS. In the similar way, we obtain Stochastic BP by introducing stochastic estimation into BP, and based on AID, we obtain SGD-based Estimation by solving linear system with SGD.

In the following, we introduce how each of these hypergradient estimation methods estimates the hypergradient in a general simple SGD type algorithm shown in Algorithm 1, with problem (2) satisfying Assumptions 1, 2, and 3.

As shown in Algorithm 1, given 𝒙k\boldsymbol{x}^{k}, first, an approximation 𝒚k,T\boldsymbol{y}^{k,T} to the optimal solution 𝒚(𝒙k)\boldsymbol{y}^{*}(\boldsymbol{x}^{k}) of the LL problem is obtained through the SGD updates in line 5 of Algorithm 1. Based on the output 𝒚k,T\boldsymbol{y}^{k,T}, these hypergradient estimation methods estimate the hypergradient as follows.

Algorithm 1 A General Simple SGD Type Algorithm for Problem (2)
0:  KK, TT, JJ, stepsizes α\alpha and β\beta, 0<η<1/L0<\eta<1/L, initialization 𝒙𝟎p\boldsymbol{x^{0}}\in\mathbb{R}^{p}
1:  for k=0,,K1k=0,\ldots,K-1 do
2:     Initialize 𝒚k,0q\boldsymbol{y}^{k,0}\in\mathbb{R}^{q}
3:     for t=0,,T1t=0,\ldots,T-1 do
4:        Draw a sample batch 𝒮t\mathcal{S}_{t} from the distribution of 𝜻\boldsymbol{\zeta}
5:        Update 𝒚k,t+1=𝒚k,tβ𝒚G(𝒙k,𝒚k,t;𝒮t)\boldsymbol{y}^{k,t+1}=\boldsymbol{y}^{k,t}-\beta\nabla_{\boldsymbol{y}}G(\boldsymbol{x}^{k},\boldsymbol{y}^{k,t};\mathcal{S}_{t})
6:     end for
7:     Hypergradient estimation via one of the following :
8:          Stochastic BP: automatic differentiation through the iterations in line 5 of Algorithm 1, and compute 𝒉fk\boldsymbol{h}_{f}^{k} via (6)
9:          Stochastic NS:
10:             1) solve (8) by approximating [𝒚2g(𝒙k,𝒚k,T)]1[\nabla_{\boldsymbol{y}}^{2}g(\boldsymbol{x}^{k},\boldsymbol{y}^{k,T})]^{-1} with the first JJ terms of the Neumann series in (9), and obtain an approximate solution 𝒗k,J\boldsymbol{v}^{k,J} in (10)
11:             2) compute 𝒉fk\boldsymbol{h}_{f}^{k} via (11)
12:          SGD-based Estimation:
13:             1) initialize 𝒗k,0q\boldsymbol{v}^{k,0}\in\mathbb{R}^{q}
14:             2) solve (12) with JJ steps of SGD, and obtain an approximate solution 𝒗k,J\boldsymbol{v}^{k,J}, as shown in (13)
15:             3) compute 𝒉fk\boldsymbol{h}_{f}^{k} via (14)
16:     Update 𝒙k+1=𝒙kα𝒉fk\boldsymbol{x}^{k+1}=\boldsymbol{x}^{k}-\alpha\boldsymbol{h}_{f}^{k}
17:  end for

Stochastic BP. Notice that Φ(𝒙)=f(𝒙,𝒚(𝒙))\Phi(\boldsymbol{x})=f(\boldsymbol{x},\boldsymbol{y}^{*}(\boldsymbol{x})), and that

Φ(𝒙k)\displaystyle\nabla\Phi(\boldsymbol{x}^{k}) =𝒙f(𝒙k,𝒚(𝒙k))\displaystyle=\nabla_{\boldsymbol{x}}f(\boldsymbol{x}^{k},\boldsymbol{y}^{*}(\boldsymbol{x}^{k})) (4)
+(𝒚(𝒙k))𝒚f(𝒙k,𝒚(𝒙k)).\displaystyle\qquad+(\nabla\boldsymbol{y}^{*}(\boldsymbol{x}^{k}))^{\top}\nabla_{\boldsymbol{y}}f(\boldsymbol{x}^{k},\boldsymbol{y}^{*}(\boldsymbol{x}^{k})).

For Φ(𝒙k)\nabla\Phi(\boldsymbol{x}^{k}) in the form of (4), Stochastic BP estimates the hypergradient by automatic differentiation. Specifically, since 𝒚k,T\boldsymbol{y}^{k,T} has a dependence on 𝒙k\boldsymbol{x}^{k} through the SGD updates in line 5 of Algorithm 1, Stochastic BP computes (𝒚k,T(𝒙k))(\nabla\boldsymbol{y}^{k,T}(\boldsymbol{x}^{k}))^{\top} by automatic differentiation along these SGD updates, and (𝒚k,T(𝒙k))(\nabla\boldsymbol{y}^{k,T}(\boldsymbol{x}^{k}))^{\top} can be obtained as follows:

(𝒚k,T(𝒙k))=βt=0T1𝒙𝒚G(𝒙k,𝒚k,t;𝒮t)𝐀T1t(\nabla\boldsymbol{y}^{k,T}(\boldsymbol{x}^{k}))^{\top}=-\beta\sum\limits_{t=0}^{T-1}\nabla_{\boldsymbol{x}}\nabla_{\boldsymbol{y}}G(\boldsymbol{x}^{k},\boldsymbol{y}^{k,t};\mathcal{S}_{t})\mathbf{A}_{T-1-t} (5)

where 𝐀T1t=i=t+1T1(𝐈β𝒚2G(𝒙k,𝒚k,i;𝒮i))\mathbf{A}_{T-1-t}=\prod_{i=t+1}^{T-1}(\mathbf{I}-\beta\nabla_{\boldsymbol{y}}^{2}G(\boldsymbol{x}^{k},\boldsymbol{y}^{k,i};\mathcal{S}_{i})) with 𝐈\mathbf{I} being the identity matrix, and 𝐀0=𝐈\mathbf{A}_{0}=\mathbf{I}(for the derivation, see Appendix D.2 in [14]). Then,

𝒉fk\displaystyle\boldsymbol{h}_{f}^{k} =𝒙F(𝒙k,𝒚k,T;𝒟F)\displaystyle=\nabla_{\boldsymbol{x}}F(\boldsymbol{x}^{k},\boldsymbol{y}^{k,T};\mathcal{D}_{F}) (6)
+(𝒚k,T(𝒙k))𝒚F(𝒙k,𝒚k,T;𝒟F)\displaystyle\qquad+(\nabla\boldsymbol{y}^{k,T}(\boldsymbol{x}^{k}))^{\top}\nabla_{\boldsymbol{y}}F(\boldsymbol{x}^{k},\boldsymbol{y}^{k,T};\mathcal{D}_{F})

is used to estimate the hypergradient Φ(𝒙k)\nabla\Phi(\boldsymbol{x}^{k}) in (4), where 𝒟F\mathcal{D}_{F} is a sample set sampled from the distribution of 𝝃\boldsymbol{\xi} and (𝒚k,T(𝒙k))(\nabla\boldsymbol{y}^{k,T}(\boldsymbol{x}^{k}))^{\top} is given in (5).

In fact, Stochastic BP is a variant of BP, and 𝒉fk\boldsymbol{h}_{f}^{k} in (6) can be directly obtained by introducing stochastic estimation into the hypergradient approximation formula of BP in Proposition 2 in [14].

Stochastic NS. From Proposition 1, we have

Φ(𝒙k)\displaystyle\nabla\Phi(\boldsymbol{x}^{k}) =𝒙f(𝒙k,𝒚(𝒙k))\displaystyle=\nabla_{\boldsymbol{x}}f(\boldsymbol{x}^{k},\boldsymbol{y}^{*}(\boldsymbol{x}^{k})) (7)
𝒙𝒚g(𝒙k,𝒚(𝒙k))𝒗(𝒙k,𝒚(𝒙k))\displaystyle\qquad-\nabla_{\boldsymbol{x}}\nabla_{\boldsymbol{y}}g(\boldsymbol{x}^{k},\boldsymbol{y}^{*}(\boldsymbol{x}^{k}))\boldsymbol{v}(\boldsymbol{x}^{k},\boldsymbol{y}^{*}(\boldsymbol{x}^{k}))

with 𝒗(𝒙k,𝒚(𝒙k))=[𝒚2g(𝒙k,𝒚(𝒙k))]1𝒚f(𝒙k,𝒚(𝒙k))\boldsymbol{v}(\boldsymbol{x}^{k},\boldsymbol{y}^{*}(\boldsymbol{x}^{k}))=[\nabla_{\boldsymbol{y}}^{2}g(\boldsymbol{x}^{k},\boldsymbol{y}^{*}(\boldsymbol{x}^{k}))]^{-1}\nabla_{\boldsymbol{y}}f(\boldsymbol{x}^{k},\boldsymbol{y}^{*}(\boldsymbol{x}^{k})). For Φ(𝒙k)\nabla\Phi(\boldsymbol{x}^{k}) in the form of (7), Stochastic NS estimates the hypergradient by solving the following linear system:

𝒚2g(𝒙k,𝒚k,T)𝒗=𝒚f(𝒙k,𝒚k,T).\nabla_{\boldsymbol{y}}^{2}g(\boldsymbol{x}^{k},\boldsymbol{y}^{k,T})\boldsymbol{v}=\nabla_{\boldsymbol{y}}f(\boldsymbol{x}^{k},\boldsymbol{y}^{k,T}). (8)

By approximating Hessian inverse [𝒚2g(𝒙k,𝒚k,T)]1[\nabla_{\boldsymbol{y}}^{2}g(\boldsymbol{x}^{k},\boldsymbol{y}^{k,T})]^{-1} with the first JJ terms of the following Neumann series:

j=0η(𝐈η𝒚2g(𝒙k,𝒚k,T))j\sum_{j=0}^{\infty}\eta(\mathbf{I}-\eta\nabla_{\boldsymbol{y}}^{2}g(\boldsymbol{x}^{k},\boldsymbol{y}^{k,T}))^{j} (9)

and computing 𝒚F(𝒙k,𝒚k,T;𝒟F)\nabla_{\boldsymbol{y}}F(\boldsymbol{x}^{k},\boldsymbol{y}^{k,T};\mathcal{D}_{F}) over a sample set 𝒟F\mathcal{D}_{F} sampled from the distribution of 𝝃\boldsymbol{\xi}, an approximate solution 𝒗k,J\boldsymbol{v}^{k,J} of (8) is obtained, and takes the form of

𝒗k,J=ηj=0J1i=JjJ1(𝐈η𝒚2G(𝒙k,𝒚k,T;i))𝒗0\boldsymbol{v}^{k,J}=\eta\sum\limits_{j=0}^{J-1}\prod_{i=J-j}^{J-1}(\mathbf{I}-\eta\nabla_{\boldsymbol{y}}^{2}G(\boldsymbol{x}^{k},\boldsymbol{y}^{k,T};\mathcal{B}_{i}))\boldsymbol{v}_{0} (10)

where i=JJ1()=𝐈\prod_{i=J}^{J-1}(\cdot)=\mathbf{I}, 𝒗0=𝒚F(𝒙k,𝒚k,T;𝒟F)\boldsymbol{v}_{0}=\nabla_{\boldsymbol{y}}F(\boldsymbol{x}^{k},\boldsymbol{y}^{k,T};\mathcal{D}_{F}), and sample sets i,i=1,,J1\mathcal{B}_{i},i=1,\ldots,J-1 are sampled from the distribution of 𝜻\boldsymbol{\zeta}. Then,

𝒉fk=𝒙F(𝒙k,𝒚k,T;𝒟F)𝒙𝒚G(𝒙k,𝒚k,T;𝒟G)𝒗k,J\boldsymbol{h}_{f}^{k}=\nabla_{\boldsymbol{x}}F(\boldsymbol{x}^{k},\boldsymbol{y}^{k,T};\mathcal{D}_{F})-\nabla_{\boldsymbol{x}}\nabla_{\boldsymbol{y}}G(\boldsymbol{x}^{k},\boldsymbol{y}^{k,T};\mathcal{D}_{G})\boldsymbol{v}^{k,J} (11)

is used to estimate the hypergradient Φ(𝒙k)\nabla\Phi(\boldsymbol{x}^{k}) in (7), where 𝒟F\mathcal{D}_{F} is the aforementioned sample set, 𝒟G\mathcal{D}_{G} is a sample set sampled from the distribution of 𝜻\boldsymbol{\zeta}, and 𝒗k,J\boldsymbol{v}^{k,J} is given in (10). Note that sample sets {𝒟F,𝒟G,i(i=1,,J1)}\{\mathcal{D}_{F},\mathcal{D}_{G},\mathcal{B}_{i}(i=1,\ldots,J-1)\} are mutually independent.

In fact, 𝒉fk\boldsymbol{h}_{f}^{k} in (11) can be directly obtained by introducing stochastic estimation into the hypergradient approximation formula of NS in eq.(3) in [22], which is the reason we call this hypergradient estimation method Stochastic NS. Furthermore, for more details about Stochastic NS, see subsection 2.2 in [14].

SGD-based Estimation. SGD-based Estimation estimates the hypergradient in the way similar to Stochastic NS except that the approximate solution to (8) is obtained by using SGD. Notice that the solution to (8) is equivalent to the optimal solution of the following optimization problem:

min𝒗12𝒗𝒚2g(𝒙k,𝒚k,T)𝒗𝒗𝒚f(𝒙k,𝒚k,T).\min\limits_{\boldsymbol{v}}\frac{1}{2}\boldsymbol{v}^{\top}\nabla_{\boldsymbol{y}}^{2}g(\boldsymbol{x}^{k},\boldsymbol{y}^{k,T})\boldsymbol{v}-\boldsymbol{v}^{\top}\nabla_{\boldsymbol{y}}f(\boldsymbol{x}^{k},\boldsymbol{y}^{k,T}). (12)

SGD-based Estimation obtains an approximate solution 𝒗k,J\boldsymbol{v}^{k,J} to (12) by solving (12) with JJ steps of SGD as follows:

𝒗k,j+1\displaystyle\boldsymbol{v}^{k,j+1} =(𝐈η𝒚2G(𝒙k,𝒚k,T;j))𝒗k,j\displaystyle=(\mathbf{I}-\eta\nabla_{\boldsymbol{y}}^{2}G(\boldsymbol{x}^{k},\boldsymbol{y}^{k,T};\mathcal{B}_{j}))\boldsymbol{v}^{k,j} (13)
+η𝒚F(𝒙k,𝒚k,T;𝒟F,j),j=0,,J1\displaystyle\qquad+\eta\nabla_{\boldsymbol{y}}F(\boldsymbol{x}^{k},\boldsymbol{y}^{k,T};\mathcal{D}_{F,j}),\quad j=0,\ldots,J-1

where η\eta is the step size, and for each j{0,,J1}j\in\{0,\ldots,J-1\}, 𝒟F,j\mathcal{D}_{F,j} and j\mathcal{B}_{j} are the sample sets sampled from the distributions of 𝝃\boldsymbol{\xi} and 𝜻\boldsymbol{\zeta}, respectively. Then,

𝒉fk=𝒙F(𝒙k,𝒚k,T;𝒟F)𝒙𝒚G(𝒙k,𝒚k,T;𝒟G)𝒗k,J\boldsymbol{h}_{f}^{k}=\nabla_{\boldsymbol{x}}F(\boldsymbol{x}^{k},\boldsymbol{y}^{k,T};\mathcal{D}_{F})-\nabla_{\boldsymbol{x}}\nabla_{\boldsymbol{y}}G(\boldsymbol{x}^{k},\boldsymbol{y}^{k,T};\mathcal{D}_{G})\boldsymbol{v}^{k,J} (14)

is used to estimate the hypergradient, where 𝒟F\mathcal{D}_{F} and 𝒟G\mathcal{D}_{G} are the sample sets sampled from the distributions of 𝝃\boldsymbol{\xi} and 𝜻\boldsymbol{\zeta}, respectively.

II-C Analysis of Hypergradient Estimation Methods

Define k,1=σ{𝒙0,,𝒚0,T,,𝒙k,,𝒚k,T}\mathcal{F}_{k,1}^{{}^{\prime}}=\sigma\{\boldsymbol{x}^{0},\ldots,\boldsymbol{y}^{0,T},\ldots,\boldsymbol{x}^{k},\ldots,\boldsymbol{y}^{k,T}\}, k,2=σ{𝒙0,,𝒚0,T,,𝒗0,J,,𝒙k,,𝒚k,T}\mathcal{F}_{k,2}^{{}^{\prime}}=\sigma\{\boldsymbol{x}^{0},\ldots,\boldsymbol{y}^{0,T},\ldots,\boldsymbol{v}^{0,J},\ldots,\boldsymbol{x}^{k},\ldots,\boldsymbol{y}^{k,T}\}, where σ{}\sigma\{\cdot\} denotes the σ\sigma-algebra generated by random variables.

Then, similar to the discussion in the proof of Lemma 3 in the supplementary material, from the iteration in line 16 of Algorithm 1, we can obtain formula (19) in the supplementary material in which 𝒉¯fk=𝔼[𝒉fk|k,1]\bar{\boldsymbol{h}}_{f}^{k}=\mathbb{E}[\boldsymbol{h}_{f}^{k}|\mathcal{F}_{k,1}^{{}^{\prime}}] for Stochastic BP and Stochastic NS, and 𝒉¯fk=𝔼[𝒉fk|k,2]\bar{\boldsymbol{h}}_{f}^{k}=\mathbb{E}[\boldsymbol{h}_{f}^{k}|\mathcal{F}_{k,2}^{{}^{\prime}}] for SGD-based Estimation. By shifting the terms, we have

𝔼Φ(𝒙k)2\displaystyle\mathbb{E}\|\nabla\Phi(\boldsymbol{x}^{k})\|^{2}
2α(𝔼[Φ(𝒙k)]𝔼[Φ(𝒙k+1)])(1αLΦ)𝔼𝒉¯fk2\displaystyle\leq\frac{2}{\alpha}\big{(}\mathbb{E}[\Phi(\boldsymbol{x}^{k})]-\mathbb{E}[\Phi(\boldsymbol{x}^{k+1})]\big{)}-(1-\alpha{L_{\Phi}})\mathbb{E}\|\bar{\boldsymbol{h}}_{f}^{k}\|^{2}
+𝔼Φ(𝒙k)𝒉¯fk2+αLΦ𝔼𝒉fk𝒉¯fk2\displaystyle\qquad+\mathbb{E}\|\nabla\Phi(\boldsymbol{x}^{k})-\bar{\boldsymbol{h}}_{f}^{k}\|^{2}+\alpha{L_{\Phi}}\mathbb{E}\|\boldsymbol{h}_{f}^{k}-\bar{\boldsymbol{h}}_{f}^{k}\|^{2}

where α\alpha is in Algorithm 1, LΦL_{\Phi} is in Proposition 1.

The upper bound of 𝔼Φ(𝒙k)2\mathbb{E}\|\nabla\Phi(\boldsymbol{x}^{k})\|^{2} involves the term Φ(𝒙k)𝒉¯fk\|\nabla\Phi(\boldsymbol{x}^{k})-\bar{\boldsymbol{h}}_{f}^{k}\|. In the following, the upper bound of Φ(𝒙k)𝒉¯fk\|\nabla\Phi(\boldsymbol{x}^{k})-\bar{\boldsymbol{h}}_{f}^{k}\| is provided. For the proof, see Lemma 10 in the supplementary material.

Proposition 2.

Apply Algorithm 1 to solve problem (2). Suppose Assumptions 1, 2, and 3 hold. let 𝐇0=𝟎\mathbf{H}_{0}=\mathbf{0}, 𝐯0=𝐯k,0\boldsymbol{v}_{0}=\boldsymbol{v}^{k,0}, and define 𝐇=𝐱𝐲g(𝐱k,𝐲(𝐱k))(𝐲2g(𝐱k,𝐲(𝐱k)))1\mathbf{H}^{*}=\nabla_{\boldsymbol{x}}\nabla_{\boldsymbol{y}}g(\boldsymbol{x}^{k},\boldsymbol{y}^{*}(\boldsymbol{x}^{k}))(\nabla_{\boldsymbol{y}}^{2}g(\boldsymbol{x}^{k},\boldsymbol{y}^{*}(\boldsymbol{x}^{k})))^{-1}, 𝐯=𝐲2g(𝐱k,𝐲(𝐱k))1𝐲f(𝐱k,𝐲(𝐱k))\boldsymbol{v}^{*}=\nabla_{\boldsymbol{y}}^{2}g(\boldsymbol{x}^{k},\boldsymbol{y}^{*}(\boldsymbol{x}^{k}))^{-1}\nabla_{\boldsymbol{y}}f(\boldsymbol{x}^{k},\boldsymbol{y}^{*}(\boldsymbol{x}^{k})), where 𝐯k,0\boldsymbol{v}^{k,0} is the initial value to solve problem (12)(see line 13 in Algorithm 1), and 𝐲(𝐱k)\boldsymbol{y}^{*}(\boldsymbol{x}^{k}) is the optimal solution to the LL problem of problem (2) with 𝐱=𝐱k\boldsymbol{x}=\boldsymbol{x}^{k}. For μ\mu , LL in Assumptions 1, 2 and TT, JJ, β\beta, η\eta in Algorithm 1, let μ<1\mu<1, L<1L<1, TT, JJ be any positive integers, and β=η\beta=\eta. Then,

  • For Stochastic BP, for 𝒉¯fk:=𝔼[𝒉fk|k,1]\bar{\boldsymbol{h}}_{f}^{k}:=\mathbb{E}[\boldsymbol{h}_{f}^{k}|\mathcal{F}_{k,1}^{{}^{\prime}}] with 𝒉fk\boldsymbol{h}_{f}^{k} in (6), we have

    Φ(𝒙k)𝒉¯fk\displaystyle\|\nabla\Phi(\boldsymbol{x}^{k})-\bar{\boldsymbol{h}}_{f}^{k}\|
    L(1+Lμ)𝒚k,T𝒚(𝒙k)+M(1ημ)T𝐇\displaystyle\leq L\bigg{(}1+\frac{L}{\mu}\bigg{)}\|\boldsymbol{y}^{k,T}-\boldsymbol{y}^{*}(\boldsymbol{x}^{k})\|+M(1-\eta\mu)^{T}\|\mathbf{H}^{*}\|
    +Mη(Lμρ+τ)t=0T1(1ημ)t𝒚k,T1t𝒚(𝒙k);\displaystyle+M\eta\bigg{(}\frac{L}{\mu}\rho+\tau\bigg{)}\sum\limits_{t=0}^{T-1}(1-\eta\mu)^{t}\|\boldsymbol{y}^{k,T-1-t}-\boldsymbol{y}^{*}(\boldsymbol{x}^{k})\|;
  • For Stochastic NS, for 𝒉¯fk:=𝔼[𝒉fk|k,1]\bar{\boldsymbol{h}}_{f}^{k}:=\mathbb{E}[\boldsymbol{h}_{f}^{k}|\mathcal{F}_{k,1}^{{}^{\prime}}] with 𝒉fk\boldsymbol{h}_{f}^{k} in (11), we have

    Φ(𝒙k)𝒉¯fk\displaystyle\|\nabla\Phi(\boldsymbol{x}^{k})-\bar{\boldsymbol{h}}_{f}^{k}\|
    L(1+Lμ)𝒚k,T𝒚(𝒙k)+M(1ημ)J𝐇\displaystyle\leq L\bigg{(}1+\frac{L}{\mu}\bigg{)}\|\boldsymbol{y}^{k,T}-\boldsymbol{y}^{*}(\boldsymbol{x}^{k})\|+M(1-\eta\mu)^{J}\|\mathbf{H}^{*}\|
    +Mη(Lμρ+τ)t=0J1(1ημ)t𝒚k,T𝒚(𝒙k);\displaystyle+M\eta\bigg{(}\frac{L}{\mu}\rho+\tau\bigg{)}\sum\limits_{t=0}^{J-1}(1-\eta\mu)^{t}\|\boldsymbol{y}^{k,T}-\boldsymbol{y}^{*}(\boldsymbol{x}^{k})\|;
  • For SGD-based Estimation, for 𝒉¯fk:=𝔼[𝒉fk|k,2]\bar{\boldsymbol{h}}_{f}^{k}:=\mathbb{E}[\boldsymbol{h}_{f}^{k}|\mathcal{F}_{k,2}^{{}^{\prime}}] with 𝒉fk\boldsymbol{h}_{f}^{k} in (14), we have

    Φ(𝒙k)𝒉¯fk\displaystyle\|\nabla\Phi(\boldsymbol{x}^{k})-\bar{\boldsymbol{h}}_{f}^{k}\|
    (L+Mμτ)𝒚k,T𝒚(𝒙k)\displaystyle\leq\bigg{(}L+\frac{M}{\mu}\tau\bigg{)}\|\boldsymbol{y}^{k,T}-\boldsymbol{y}^{*}(\boldsymbol{x}^{k})\|
    +L(1ημ)J𝒗k,0𝒗\displaystyle+L(1-\eta\mu)^{J}\|\boldsymbol{v}^{k,0}-\boldsymbol{v}^{*}\|
    +Lη(Mμρ+L)t=0J1(1ημ)t𝒚k,T𝒚(𝒙k);\displaystyle+L\eta\bigg{(}\frac{M}{\mu}\rho+L\bigg{)}\sum\limits_{t=0}^{J-1}(1-\eta\mu)^{t}\|\boldsymbol{y}^{k,T}-\boldsymbol{y}^{*}(\boldsymbol{x}^{k})\|;

Φ(𝒙k)\nabla\Phi(\boldsymbol{x}^{k}) is in (3) with 𝐱=𝐱k\boldsymbol{x}=\boldsymbol{x}^{k}, μ\mu, LL, MM, τ\tau, ρ\rho are given in Assumptions 1, 2.

From Proposition 2, it is easy to find that there are some differences for Stochastic BP, Stochastic NS, and SGD-based Estimation regarding the upper bound of Φ(𝒙k)𝒉¯fk\|\nabla\Phi(\boldsymbol{x}^{k})-\bar{\boldsymbol{h}}_{f}^{k}\|. For Stochastic BP and Stochastic NS, the upper bound for Φ(𝒙k)𝒉¯fk\|\nabla\Phi(\boldsymbol{x}^{k})-\bar{\boldsymbol{h}}_{f}^{k}\| involves M(1ημ)T𝐇M(1-\eta\mu)^{T}\|\mathbf{H}^{*}\| and M(1ημ)J𝐇M(1-\eta\mu)^{J}\|\mathbf{H}^{*}\|, respectively. While for SGD-based Estimation, the term L(1ημ)J𝒗k,0𝒗L(1-\eta\mu)^{J}\|\boldsymbol{v}^{k,0}-\boldsymbol{v}^{*}\| is involved in the upper bound of Φ(𝒙k)𝒉¯fk\|\nabla\Phi(\boldsymbol{x}^{k})-\bar{\boldsymbol{h}}_{f}^{k}\|.

Thus, to ensure that Φ(𝒙k)𝒉¯fk\|\nabla\Phi(\boldsymbol{x}^{k})-\bar{\boldsymbol{h}}_{f}^{k}\| is a sufficiently small number, a sufficiently large TT for Stochastic BP and a sufficiently large JJ for Stochastic NS may be required. In contrast, for SGD-based Estimation, we notice that 𝒗k,0\boldsymbol{v}^{k,0} in the term L(1ημ)J𝒗k,0𝒗L(1-\eta\mu)^{J}\|\boldsymbol{v}^{k,0}-\boldsymbol{v}^{*}\| is the initial value to solve problem (12). Apply warm start initialization strategy to 𝒗k,0\boldsymbol{v}^{k,0}(i.e., 𝒗k,0=𝒗k1,J\boldsymbol{v}^{k,0}=\boldsymbol{v}^{k-1,J}, where 𝒗k1,J\boldsymbol{v}^{k-1,J} is the output of the JJ-th iteration to solve problem (12) given 𝒙k1\boldsymbol{x}^{k-1}) can allow us to track the errors, such as 𝒗k1,0𝒗\|\boldsymbol{v}^{k-1,0}-\boldsymbol{v}^{*}\|, …, 𝒗0,0𝒗\|\boldsymbol{v}^{0,0}-\boldsymbol{v}^{*}\|, in the preceeding loops. Thus, a smaller number JJ may be able to ensure Φ(𝒙k)𝒉¯fk\|\nabla\Phi(\boldsymbol{x}^{k})-\bar{\boldsymbol{h}}_{f}^{k}\| to be a sufficiently small number, and there is no restriction on TT. Notice that warm start initialization strategy is used in stocBiO[14] to initialize the initial values for solving the LL problem, and the improved computational complexity is obtained.

The above discussion inspires us to think that SGD-based Estimation may be able to allow 𝔼Φ(𝒙k)2\mathbb{E}\|\nabla\Phi(\boldsymbol{x}^{k})\|^{2} to be sufficiently small under weaker restrictions on TT than Stochastic BP and JJ than Stochastic NS. Thus, SGD-based Estimation may be able to make Algorithm 1 converge faster than Stochastic BP and Stochastic NS.

Next, we perform experiments on a synthetic bilevel optimization problem, a special case of problem (2), to evaluate the performance of Algorithm 1 under different hypergradient estimation methods.

Synthetic Bilevel Optimization Problem. We consider the following synthetic bilevel optimization problem:

min𝒙pΦ(𝒙):=f(𝒙,𝒚(𝒙))s.t.𝒚(𝒙)=argmin𝒚pg(𝒙,𝒚)\min\limits_{\boldsymbol{x}\in\mathbb{R}^{p}}\Phi(\boldsymbol{x}):=f(\boldsymbol{x},\boldsymbol{y}^{*}(\boldsymbol{x}))\quad s.t.~{}\boldsymbol{y}^{*}(\boldsymbol{x})={\underset{\boldsymbol{y}\in\mathbb{R}^{p}}{\operatorname{arg\,min}}\,g(\boldsymbol{x},\boldsymbol{y})} (15)

with

f(𝒙,𝒚)=1|𝒟val|(𝒖i,vi)𝒟val12(𝒚𝒖ivi)2+𝒙3,\displaystyle f(\boldsymbol{x},\boldsymbol{y})=\frac{1}{|\mathcal{D}_{\text{val}}|}\sum\limits_{(\boldsymbol{u}_{i},v_{i})\in\mathcal{D}_{\text{val}}}\frac{1}{2}({\boldsymbol{y}}^{\top}\boldsymbol{u}_{i}-v_{i})^{2}+\|\boldsymbol{x}\|^{3},
g(𝒙,𝒚)=1|𝒟tr|(𝒖i,vi)𝒟tr12(𝒚𝒖ivi)2+r2𝒚𝒙2\displaystyle g(\boldsymbol{x},\boldsymbol{y})=\frac{1}{|\mathcal{D}_{\text{tr}}|}\sum\limits_{(\boldsymbol{u}_{i},v_{i})\in\mathcal{D}_{\text{tr}}}\frac{1}{2}(\boldsymbol{y}^{\top}\boldsymbol{u}_{i}-v_{i})^{2}+\frac{r}{2}\|\boldsymbol{y}-\boldsymbol{x}\|^{2}

where r=0.5r=0.5, and 𝒟tr\mathcal{D}_{\text{tr}} and 𝒟val\mathcal{D}_{\text{val}} are the data sets constructed as follows: given 𝒘0p\boldsymbol{w}_{0}\in\mathbb{R}^{p}, first randomly sample 2000020000 data points 𝒆ip1\boldsymbol{e}_{i}\in\mathbb{R}^{p-1}, i=1,,20000i=1,\ldots,20000 from normal distribution with a mean of 0 and a variance of 0.01; then, for each i{1,,20000}i\in\{1,\ldots,20000\}, we set 𝒖i=(𝒆i,1)\boldsymbol{u}_{i}=(\boldsymbol{e}_{i},1), and construct viv_{i}\in\mathbb{R} by vi=𝒘0𝒖i+σiv_{i}=\boldsymbol{w}_{0}^{\top}\boldsymbol{u}_{i}+\sigma_{i}, where σi\sigma_{i} is the Gaussian noise with mean 0 and variance 1; finally, dividing the dataset {(𝒖i,vi)}i=120000\{(\boldsymbol{u}_{i},v_{i})\}_{i=1}^{20000} equally into two parts, we obtain training set 𝒟tr\mathcal{D}_{\text{tr}} and validation set 𝒟val\mathcal{D}_{\text{val}}, with the training set size |𝒟tr||\mathcal{D}_{\text{tr}}| and validation set size |𝒟val||\mathcal{D}_{\text{val}}| being 10000, respectively.

It is easy to verify that g(𝒙,𝒚)g(\boldsymbol{x},\boldsymbol{y}) is strongly convex w.r.t. 𝒚\boldsymbol{y}. Moreover, for each 𝒙\boldsymbol{x}, the minimum solution of the LL problem is

𝒚(𝒙)=(𝐀tr+r𝐈)1(𝒃tr+r𝒙)\boldsymbol{y}^{*}(\boldsymbol{x})=(\mathbf{A}_{\text{tr}}+r\mathbf{I})^{-1}(\boldsymbol{b_{\text{tr}}}+r\boldsymbol{x})

and the hypergradient is

Φ(𝒙)=r(𝐀tr+r𝐈)1(𝐀val𝒚(𝒙)𝒃val)+3𝒙𝒙\nabla\Phi(\boldsymbol{x})=r(\mathbf{A}_{\text{tr}}+r\mathbf{I})^{-1}(\mathbf{A}_{\text{val}}\boldsymbol{y}^{*}(\boldsymbol{x})-\boldsymbol{b_{\text{val}}})+3\|\boldsymbol{x}\|\boldsymbol{x}

where

𝐀val=1|𝒟val|(𝒖i,vi)𝒟val𝒖i(𝒖i),\displaystyle\mathbf{A}_{\text{val}}=\frac{1}{|\mathcal{D}_{\text{val}}|}\sum\limits_{(\boldsymbol{u}_{i},v_{i})\in\mathcal{D}_{\text{val}}}\boldsymbol{u}_{i}(\boldsymbol{u}_{i})^{\top},
𝒃val=1|𝒟val|(𝒖i,vi)𝒟valvi𝒖i,\displaystyle\boldsymbol{b_{\text{val}}}=\frac{1}{|\mathcal{D}_{\text{val}}|}\sum\limits_{(\boldsymbol{u}_{i},v_{i})\in\mathcal{D}_{\text{val}}}v_{i}\boldsymbol{u}_{i},

and 𝐀tr\mathbf{A}_{\text{tr}}, 𝒃tr\boldsymbol{b_{\text{tr}}} can be obtained by replacing 𝒟val\mathcal{D}_{\text{val}} and |𝒟val||\mathcal{D}_{\text{val}}| in 𝐀val\mathbf{A}_{\text{val}} and 𝒃val\boldsymbol{b_{\text{val}}} with 𝒟tr\mathcal{D}_{\text{tr}} and |𝒟tr||\mathcal{D}_{\text{tr}}|.

In the following, we use Algorithm 1 to solve problem (15), and compare the performance of Algorithm 1 under different hypergradient estimation methods. To be specific, the hypergradient estimation methods that we consider are Stochastic BP, Stochastic NS, and SGD-based Estimation, where for SGD-based Estimation, we use two initialization strategies to initialize 𝒗k,0\boldsymbol{v}^{k,0} in line 13 of Algorithm 1, i.e., non-warm start initialization strategy and warm start initialization strategy. For non-warm start initialization strategy, we set 𝒗k,0\boldsymbol{v}^{k,0}, k=0,,K1k=0,\ldots,K-1 to be zero vectors. For warm-start initialization strategy, we set 𝒗k,0=𝒗k1,J\boldsymbol{v}^{k,0}=\boldsymbol{v}^{k-1,J}, k=1,,K1k=1,\ldots,K-1, where for each k{1,,K1}k\in\{1,\ldots,K-1\}, 𝒗k1,J\boldsymbol{v}^{k-1,J} is obtained by the iteration steps starting from 𝒗k1,0\boldsymbol{v}^{k-1,0} in (13). The experimental details are in the supplementary material.

To simplify the narrative, we refer to Algorithm 1 as Stochastic BP(resp. Stochastic NS) if Algorithm 1 estimates the hypergradient via Stochastic BP(resp. Stochastic NS). Similarly, if Algorithm 1 estimates the hypergradient via SGD-based Estimation and using the non-warm start initialization strategy(resp. warm start initialization strategy) to initialize 𝒗k,0\boldsymbol{v}^{k,0}, we refer to Algorithm 1 as SGD_W_Start_False(resp. SGD_W_Start_True). Furthermore, we use P_TT_ii(P_JJ_ii) to denote that the iteration step TT(JJ) is set to be ii for P.

In the experiments, data sets 𝒟tr\mathcal{D}_{\text{tr}} and 𝒟val\mathcal{D}_{\text{val}} are constructed by setting 𝒘0=(2,5,7)3\boldsymbol{w}_{0}=(2,5,7)\in\mathbb{R}^{3}. From Fig. 1LABEL:sub@fig_first_case, we observe that to converge to a stationary point with Φ(𝒙)=0\|\nabla\Phi(\boldsymbol{x})\|=0, T=1T=1 is sufficient for SGD_W_Start_True, and a larger TT(T=20T=20) is required for Stochastic BP. Furthermore, SGD_W_Start_True always converges faster than Stochastic BP for T{1,5,20}T\in\{1,5,20\}. From Fig. 1LABEL:sub@fig_second_case, we observe that to converge to a stationary point with Φ(𝒙)=0\|\nabla\Phi(\boldsymbol{x})\|=0, J=1J=1 is sufficient for SGD_W_Start_True, and a larger JJ(J=20J=20) is required for Stochastic NS. Furthermore, SGD_W_Start_True with J=1J=1 has a faster convergence speed than Stochastic NS with J=20J=20. From Fig. 1LABEL:sub@fig_third_case, we observe that to converge to a stationary point with Φ(𝒙)=0\|\nabla\Phi(\boldsymbol{x})\|=0, J=1J=1 is sufficient for SGD_W_Start_True, and a larger JJ(J=20J=20) is required for SGD_W_Start_False. Furthermore, SGD_W_Start_True with J=1J=1 has a faster convergence speed than SGD_W_Start_False with J=20J=20.

In conclusion, the above experimental results show that compared with Stochastic BP and Stochastic NS, Algorithm 1 can obtain a faster convergence speed by using SGD-based Estimation to estimate the hypergradient and using warm start initialization strategy for SGD-based Estimation.

Refer to caption
(a)
Refer to caption
(b)
Refer to caption
(c)
Figure 1: Comparing the performance of Algorithm 1 in solving problem (15), with data sets constructed by setting 𝒘0=(2,5,7)\boldsymbol{w}_{0}=(2,5,7), under differnt hypergradient estimation methods.

III Algorithm

In Section II, we find that compared with Stochastic BP and Stochastic NS, SGD-based Estimation can make Algorithm 1 achieve a faster convergence speed. Notice that all the existing simple SGD type algorithms for solving problem (2) use Stochastic NS to estimate the hypergradient. It is natural to think that using SGD-based Estimation to estimate the hypergradient may allow us to obtain an algorithm with lower computational complexity for problem (2).

Algorithm 2 A New Simple SGD Type Algorithm SSGD for Problem (2)
0:  K1K\geq 1, T1T\geq 1, J1J\geq 1, stepsizes α\alpha, β\beta, and η\eta, initializations 𝒙0p\boldsymbol{x}^{0}\in\mathbb{R}^{p}, 𝒚0q\boldsymbol{y}^{0}\in\mathbb{R}^{q}, 𝒗0q\boldsymbol{v}^{0}\in\mathbb{R}^{q}
1:  for k=0,,K1k=0,\ldots,K-1 do
2:     𝒙k+1=𝒙kα𝒉fk\boldsymbol{x}^{k+1}=\boldsymbol{x}^{k}-\alpha\boldsymbol{h}_{f}^{k}
3:     Set 𝒚k+1,0=𝒚k\boldsymbol{y}^{k+1,0}=\boldsymbol{y}^{k}
4:     for t=0,,T1t=0,\ldots,T-1 do
5:        Draw a sample batch 𝒮t\mathcal{S}_{t}
6:        𝒚k+1,t+1=𝒚k+1,tβ𝒚G(𝒙k+1,𝒚k+1,t;𝒮t)\boldsymbol{y}^{k+1,t+1}=\boldsymbol{y}^{k+1,t}-\beta\nabla_{\boldsymbol{y}}G(\boldsymbol{x}^{k+1},\boldsymbol{y}^{k+1,t};\mathcal{S}_{t})
7:     end for
8:     Set 𝒚k+1=𝒚k+1,T\boldsymbol{y}^{k+1}=\boldsymbol{y}^{k+1,T}, 𝒗k+1,0=𝒗k\boldsymbol{v}^{k+1,0}=\boldsymbol{v}^{k}
9:     for j=0,,J1j=0,\ldots,J-1 do
10:        Draw sample batches j\mathcal{B}_{j}, 𝒟F,j\mathcal{D}_{F,j}
11:        𝒗k+1,j+1=(𝐈η𝒚2G(𝒙k+1,𝒚k+1;j))𝒗k+1,j\boldsymbol{v}^{k+1,j+1}=(\mathbf{I}-\eta\nabla_{\boldsymbol{y}}^{2}G(\boldsymbol{x}^{k+1},\boldsymbol{y}^{k+1};\mathcal{B}_{j}))\boldsymbol{v}^{k+1,j}             +η𝒚F(𝒙k+1,𝒚k+1;𝒟F,j)\qquad+\eta\nabla_{\boldsymbol{y}}F(\boldsymbol{x}^{k+1},\boldsymbol{y}^{k+1};\mathcal{D}_{F,j})
12:     end for
13:     Set 𝒗k+1=𝒗k+1,J\boldsymbol{v}^{k+1}=\boldsymbol{v}^{k+1,J}
14:     Draw sample batches 𝒟F\mathcal{D}_{F}, 𝒟G\mathcal{D}_{G}
15:     Compute 𝒉fk+1\boldsymbol{h}_{f}^{k+1} via (16)(\ref{eq15})
16:  end for

As shown in Algorithm 2, by using warm start initialization strategy, a new simple SGD type algorithm SSGD based on SGD-based Estimation is proposed, in which warm start is not only used for SGD-based Estimation, but also for solving LL problem.

In Algorithm 2, given 𝒙k+1\boldsymbol{x}^{k+1}, SSGD runs TT steps of SGD to obtain an approximation 𝒚k+1,T\boldsymbol{y}^{k+1,T} to the optimal solution 𝒚(𝒙k+1)\boldsymbol{y}^{*}(\boldsymbol{x}^{k+1}) of the LL problem(see line 6 in Algorithm 2). Note that the initial value 𝒚k+1,0\boldsymbol{y}^{k+1,0} is initialized by using warm start and is set to be 𝒚k\boldsymbol{y}^{k}, where 𝒚k\boldsymbol{y}^{k} is equal to 𝒚k,T\boldsymbol{y}^{k,T}, which is the output of the TT-th iteration starting from 𝒚k,0\boldsymbol{y}^{k,0} in line 6 of Algorithm 2.

After setting 𝒚k+1=𝒚k+1,T\boldsymbol{y}^{k+1}=\boldsymbol{y}^{k+1,T}, the hypergradient is estimated via SGD-based Estimation. Specifically, SSGD first obtains an approximation 𝒗k+1,J\boldsymbol{v}^{k+1,J} to the optimal solution of the following optimization problem

min𝒗12𝒗𝒚2g(𝒙k+1,𝒚k+1)𝒗𝒗𝒚f(𝒙k+1,𝒚k+1)\min\limits_{\boldsymbol{v}}\frac{1}{2}\boldsymbol{v}^{\top}\nabla_{\boldsymbol{y}}^{2}g(\boldsymbol{x}^{k+1},\boldsymbol{y}^{k+1})\boldsymbol{v}-\boldsymbol{v}^{\top}\nabla_{\boldsymbol{y}}f(\boldsymbol{x}^{k+1},\boldsymbol{y}^{k+1})

using JJ steps of SGD starting from 𝒗k+1,0\boldsymbol{v}^{k+1,0}(see line 11 in Algorithm 2), where we also adopt a warm start with 𝒗k+1,0=𝒗k\boldsymbol{v}^{k+1,0}=\boldsymbol{v}^{k}. Then, by setting 𝒗k+1=𝒗k+1,J\boldsymbol{v}^{k+1}=\boldsymbol{v}^{k+1,J},

𝒉fk+1\displaystyle\boldsymbol{h}_{f}^{k+1} =𝒙F(𝒙k+1,𝒚k+1;𝒟F)\displaystyle=\nabla_{\boldsymbol{x}}F(\boldsymbol{x}^{k+1},\boldsymbol{y}^{k+1};\mathcal{D}_{F}) (16)
𝒙𝒚G(𝒙k+1,𝒚k+1;𝒟G)𝒗k+1\displaystyle\qquad-\nabla_{\boldsymbol{x}}\nabla_{\boldsymbol{y}}G(\boldsymbol{x}^{k+1},\boldsymbol{y}^{k+1};\mathcal{D}_{G})\boldsymbol{v}^{k+1}

is constructed as an estimate to Φ(𝒙k+1)\nabla\Phi(\boldsymbol{x}^{k+1}). Note that sample sets {𝒟F,𝒟G,j(j=0,,J1),𝒟F,j(j=0,,J1)}\{\mathcal{D}_{F},\mathcal{D}_{G},\mathcal{B}_{j}(j=0,\ldots,J-1),\mathcal{D}_{F,j}(j=0,\ldots,J-1)\} are mutually independent in the hypergradient estimation process.

For the sample sets in Algorithm (2), we suppose the sample sets 𝒮t\mathcal{S}_{t} for all t=0,,T1t=0,\ldots,T-1, j\mathcal{B}_{j} for all j=0,,J1j=0,\ldots,J-1, and 𝒟G\mathcal{D}_{G} have the sizes of SS, DD, and DgD_{g}, respectively. Furthermore, we suppose sample set 𝒟F\mathcal{D}_{F} and the sample sets 𝒟F,j\mathcal{D}_{F,j} for all j=0,,J1j=0,\ldots,J-1 have the same sample size DfD_{f}.

Remark 1.

From Algorithm 2, it is easy to observe that there are three loops. However, since in the classification of the existing stochastic algorithms for solving problem (2), the loop to estimate the hypergradient, i.e., approximate the Hessian inverse, is typically not counted. To be consistent with the existing literature, when performing classification, the loop to solve linear system in lines 9-12 of Algorithm 2 is not counted. Then, it is obvious that SSGD can be either a single loop algorithm(i.e., T=1T=1) or a double loop algorithm(i.e., T>1T>1). Although in the analysis that follows, it can be concluded that the convergence of SSGD is guaranteed for any T1T\geq 1. To obtain a lower computational complexity than the existing simple SGD type algorithms, T=𝒪(κ)T=\mathcal{O}(\kappa) is required for SSGD, where κ\kappa is the condition number. Therefore, we call SSGD a double loop algorithm, as shown in Table I.

IV Theoretical Results

In this section, we provide the convergence and complexity results of SSGD. Detailed theoretical analysis can be found in the supplementary material.

We first show that the convergence of SSGD can be guaranteed for any T1T\geq 1 and J1J\geq 1. Furthermore, the complexity result is provided.

Theorem 1.

Apply SSGD to solve problem (2). Suppose Assumptions 1, 2, and 3 hold. Define rv=2μL/(L2+μ2)=𝒪(κ1)r_{v}={2\mu L}/{(L^{2}+\mu^{2})}=\mathcal{O}(\kappa^{-1}), rw=ημ/(7(2ημ))r_{w}={\eta\mu}/{(7(2-\eta\mu))}, ρ2=𝒪(κ4)\rho_{2}=\mathcal{O}(\kappa^{-4}), ρ1=(2ρ2((1+rw)C23+2C24))/(ρy(1+rv)rv)\rho_{1}={(2\rho_{2}\left((1+r_{w})C_{23}+2C_{24}\right))}/{(\rho_{y}(1+r_{v})-r_{v})}, and

L31=LΦ2+2((ρ1+C24ρ2)2rvL2μ2+(1+rw)C22ρ2),L_{31}=\frac{L_{\Phi}}{2}+2\left((\rho_{1}+C_{24}\rho_{2})\frac{2}{r_{v}}\frac{L^{2}}{\mu^{2}}+(1+r_{w})C_{22}\rho_{2}\right),

where κ=L/μ\kappa={L}/{\mu} is the condition number, ρy=2βμL/(μ+L)\rho_{y}=2\beta\mu L/(\mu+L), LΦ=𝒪(κ3)L_{\Phi}=\mathcal{O}(\kappa^{3}) is defined in Proposition 1, C22C_{22} , C23C_{23}, and C24C_{24} are defined in Lemma 7 in the supplementary material, and η\eta, β\beta are the stepsizes. Let 𝐯0M/μ\|\boldsymbol{v}^{0}\|\leq M/\mu, and choose stepsizes η=1/(2L)\eta={1}/{(2L)}, β=3/(2(L+μ))=𝒪(1)\beta={3}/{(2(L+\mu))}=\mathcal{O}(1), and

α=min{12L31,14L2ρ2ημ,((1+rw)C23+2C24)ρ2C¯12},\alpha=\min\left\{\frac{1}{2L_{31}},~{}\frac{1}{4L^{2}}\rho_{2}\eta\mu,~{}\left((1+r_{w})C_{23}+2C_{24}\right)\frac{\rho_{2}}{\bar{C}_{1}^{2}}\right\},

where C¯1=L+Mτ/μ=𝒪(κ)\bar{C}_{1}=L+{M\tau}/{\mu}=\mathcal{O}({\kappa}) is defined in Lemma 4 in the supplementary material. Then, for any T1T\geq 1 and J1J\geq 1, we have

1Kk=0K1𝔼Φ(𝒙k)2\displaystyle\frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}\|\nabla\Phi(\boldsymbol{x}^{k})\|^{2}
𝒪(κ5K+κ2σf2Df+κ2σg,12Dg+κ81Sσg2+κ4σg,22D).\displaystyle\leq\mathcal{O}\left(\frac{\kappa^{5}}{K}+\kappa^{2}\frac{\sigma_{f}^{2}}{D_{f}}+\kappa^{2}\frac{\sigma_{g,1}^{2}}{D_{g}}+\kappa^{8}\frac{1}{S}\sigma_{g}^{2}+\kappa^{4}\frac{\sigma_{g,2}^{2}}{D}\right).

Furthermore, to achieve an ϵ\epsilon-accurate stationary point, the computational complexity is 𝒪(κ13ϵ2max{T,J})\mathcal{O}(\kappa^{13}\epsilon^{-2}\max\{T,J\}).

Theorem 1 shows that SSGD converges sublinearly w.r.t. the number KK of iterations, the batch sizes DfD_{f}, SS for gradient estimation, DgD_{g} for Jacobian estimation, and DD for Hessian matrix estimation. In addition, from Theorem 1, we know that when T=1T=1 and J=1J=1, SSGD can achieve the lowest computational complexity 𝒪(κ13ϵ2)\mathcal{O}(\kappa^{13}\epsilon^{-2}). Compared with the existing simple SGD type algorithms listed in Table I, this computational complexity of SSGD outperforms that of BSA, TTSA, and stocBiO by an order of 𝒪(ϵ1log(1/ϵ))\mathcal{O}(\epsilon^{-1}\log({1}/{\epsilon})), 𝒪(ϵ0.5log(1/ϵ))\mathcal{O}(\epsilon^{-0.5}\log({1}/{\epsilon})), and 𝒪(log(1/ϵ))\mathcal{O}(\log({1}/{\epsilon})) in terms of the target accuracy ϵ\epsilon, respectively. While, in terms of the condition number κ\kappa, it has a higher complexity compared with the algorithms listed in Table I.

Notice that the convergence result and the complexity result of Theorem 1 are established under the assumption that T1T\geq 1 and J1J\geq 1. In the following, we show that by imposing some restrictions on TT and JJ, SSGD can obtain a lower computational complexity.

Refer to caption
(a)
Refer to caption
(b)
Refer to caption
(c)
Figure 2: Comparison of the simple SGD type algorithms for solving problem (15) whose data sets are constructed by setting 𝒘0=(2,5,7)\boldsymbol{w}_{0}=(2,5,7).
Refer to caption
(a)
Refer to caption
(b)
Refer to caption
(c)
Figure 3: Comparison of the simple SGD type algorithms for solving problem (15) whose data sets are constructed by setting 𝒘0=(1,4,6,,6)500\boldsymbol{w}_{0}=(1,4,6,\ldots,6)\in\mathbb{R}^{500}.
Theorem 2.

Apply SSGD to solve problem (2). Suppose Assumptions 1, 2, and 3 hold. Define rv=1r_{v}=1, rw=1r_{w}=1, ρ2=𝒪(κ3)\rho_{2}=\mathcal{O}(\kappa^{-3}), ρ1=2ρ2(1+rw)C23\rho_{1}=2\rho_{2}(1+r_{w})C_{23}, and

L31=LΦ2+2(ρ14μ2L2+ρ2(1+rw)C22),L_{31}=\frac{L_{\Phi}}{2}+2\left(\frac{\rho_{1}}{4\mu^{2}}L^{2}+\rho_{2}(1+r_{w})C_{22}\right),

where κ=L/μ\kappa={L}/{\mu}, LΦ=𝒪(κ3)L_{\Phi}=\mathcal{O}(\kappa^{3}) is defined in Proposition 1, C22=(1+1/rw)(2+8L2/μ2)C12C_{22}=\left(1+{1}/{r_{w}}\right)\left(2+8{L^{2}}/{\mu^{2}}\right)C_{1}^{2}, C23=(1+1/rw)(5+rw)C12C_{23}=\left(1+{1}/{r_{w}}\right)(5+r_{w})C_{1}^{2}, and C1=M/μ2ρ+L/μC_{1}={M}/{\mu^{2}}\rho+{L}/{\mu}. Let 𝐯0M/μ\|\boldsymbol{v}^{0}\|\leq M/\mu, and select η=1/(2L)\eta={1}/{(2L)}, β=1/(2L)\beta={1}/{(2L)},

α=min{12L31,14L2ρ2,12ρ2(1+rw)C23C¯12},\alpha=\min\left\{\frac{1}{2L_{31}},~{}\frac{1}{4L^{2}}\rho_{2},~{}\frac{1}{2}\rho_{2}(1+r_{w})\frac{C_{23}}{\bar{C}_{1}^{2}}\right\},
Tlog(ρ1/(8(ρ1+C24ρ2)))log(L/(μ+L))=𝒪(κ),T\geq\frac{\log(\rho_{1}/(8(\rho_{1}+C_{24}\rho_{2})))}{\log(L/(\mu+L))}=\mathcal{O}(\kappa),
Jlog(1/(4(1+rw)2))log(1ημ)=𝒪(κ),J\geq\frac{\log(1/(4(1+r_{w})^{2}))}{\log(1-\eta\mu)}=\mathcal{O}(\kappa),

where C¯1=L+Mτ/μ=𝒪(κ)\bar{C}_{1}=L+{M\tau}/{\mu}=\mathcal{O}({\kappa}), and C24=(1+1/rw)(9+8rw)C12C_{24}=\left(1+{1}/{r_{w}}\right)(9+8r_{w})C_{1}^{2}. Then, we have

1Kk=0K1𝔼Φ(𝒙k)2\displaystyle\frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}\|\nabla\Phi(\boldsymbol{x}^{k})\|^{2}
𝒪(κ3K+κσf2Df+κ2σg,12Dg+κ51Sσg2+κ3σg,22D).\displaystyle\leq\mathcal{O}\left(\frac{\kappa^{3}}{K}+\kappa\frac{\sigma_{f}^{2}}{D_{f}}+\kappa^{2}\frac{\sigma_{g,1}^{2}}{D_{g}}+\kappa^{5}\frac{1}{S}\sigma_{g}^{2}+\kappa^{3}\frac{\sigma_{g,2}^{2}}{D}\right).

Furthermore, a computational complexity of the order of 𝒪(κ9ϵ2)\mathcal{O}(\kappa^{9}\epsilon^{-2}) is sufficient for SSGD to reach an ϵ\epsilon-accurate stationary point.

In Theorem 2, the convergence of SSGD is established under the assumption that T𝒪(κ)T\geq\mathcal{O}(\kappa) and J𝒪(κ)J\geq\mathcal{O}(\kappa). Furthermore, a computational complexity by an order of 𝒪(κ9ϵ2)\mathcal{O}(\kappa^{9}\epsilon^{-2}) is obtained in Theorem 2, which is lower than the lowest computational complexity 𝒪(κ13ϵ2)\mathcal{O}(\kappa^{13}\epsilon^{-2}) obtained in Theorem 1. Compared with the algorithms listed in Table I, SSGD achieves the lowest computational complexity.

V Experimental Results

In this section, we first validate our theoretical results on synthetic bilevel optimization problems. Then, we apply our proposed algorithm on a hyperparameter optimization problem.

V-A Synthetic Bilevel Optimization Problems

In the following, we perform experiments on the bilevel problem in (15). The experimental details are in the supplementary material.

Firstly, we compare our proposed algorithm SSGD with the existing simple SGD type algorithms BSA, TTSA, and stocBiO. In the experiments, we consider two datasets, i.e., datasets constructed by setting 𝒘0=(2,5,7)\boldsymbol{w}_{0}=(2,5,7) and datasets constructed by setting 𝒘0=(1,4,6,,6)500\boldsymbol{w}_{0}=(1,4,6,\ldots,6)\in\mathbb{R}^{500}.

From Fig. 2LABEL:sub@fig_third_case and Fig. 3LABEL:sub@fig_third_case, we observe that SSGD and stocBiO run almost the same number of iteration steps KK to converge to the stationary point with Φ(𝒙)=0\|\nabla\Phi(\boldsymbol{x})\|=0, less than that of BSA and TTSA. However, compared with SSGD, a larger JJ to estimate the hypergradient is required for stocBiO to converge to the stationary point with Φ(𝒙)=0\|\nabla\Phi(\boldsymbol{x})\|=0. As a result, Fig. 2LABEL:sub@fig_second_case and Fig. 3LABEL:sub@fig_second_case show that SSGD converges fastest among all the compared methods. Furthermore, from Fig. 2LABEL:sub@fig_first_case and Fig. 3LABEL:sub@fig_first_case, we observe that SSGD converges to the point with the smallest upper objective function value for both cases.

Then, we study the influence of iteration step TT, iteration step JJ, batch size, and stepsize α\alpha on the convergence behavior of SSGD in Algorithm 2. In the experiments, the datasets are constructed by setting 𝒘0=(2,5,7,,7)1000\boldsymbol{w}_{0}=(2,5,7,\ldots,7)\in\mathbb{R}^{1000}.

Refer to caption
(a)
Refer to caption
(b)
Figure 4: Illustrating the convergence behavior of SSGD under different iteration steps TT and iteration steps JJ. (a) shows the converence behavior under different TT. (b) shows the convergence behavior under different JJ.

Fig. 4LABEL:sub@fig_first_case shows the convergence behavior of SSGD under different choices of T{1,5,30,50}T\in\{1,5,30,50\}, where SSGD_TT_ii indicates that TT is set to be ii. It shows that beginning from 1, increasing TT can speed up the convergence of SSGD, but when TT reaches 55, increasing TT can slow down the convergence speed of SSGD. Fig. 4LABEL:sub@fig_second_case shows the convergence behavior of SSGD under different choices of J{1,5,20,30}J\in\{1,5,20,30\}, where SSGD_JJ_ii indicates that JJ is set to be ii. We observe that SSGD converges fastest when JJ is chosen to be 5.

Refer to caption
(a)
Refer to caption
(b)
Figure 5: Illustrating the convergence behavior of SSGD under different batch sizes and stepsizes α\alpha. (a) shows the convergence behavior under different batch sizes. (b) shows the convergence behavior under different stepsizes α\alpha.
Refer to caption
(a)
Refer to caption
(b)
Figure 6: Illustrating the convergence behavior of SSGD under different choices of TT and JJ.

Fig. 5LABEL:sub@fig_first_case shows the convergence behavior of SSGD under different batch sizes with SSGD_ii indicating that the batch size is set to be ii. It shows that SSGD converges fastest when batch size is set to be 1, and increasing batch size can slow down the convergence speed of SSGD. Fig. 5LABEL:sub@fig_second_case shows the convergence behavior of SSGD under different stepsizes α\alpha. It shows that choosing a larger stepsize α\alpha allows SSGD to converge faster.

Fig. 6 shows the convergence behavior of SSGD under different choices of TT and JJ, where SSGD_TT_tt_JJ_jj indicates that TT is set to be tt and JJ is set to be jj. It shows that SSGD converges faster in the double loop case(i.e., T>1T>1) than in the single loop case(i.e., T=1T=1).

V-B Hyper-parameter Optimization

In this section, we use a hyperparameter optimization problem, i.e., data hyper-cleaning[5], to evaluate the performance of our algorithm. Data hyper-cleaning is to train a classifier under the case that some data in the training set is corrupted. Following [5], when we train a linear classifier, data hyper-cleaning task can be expressed as a bilvel optimization problem whose UL function is given by

f(𝒙,𝒚)=1|𝒟val|(𝒖i,vi)𝒟valL(𝒚𝒖i,vi)f(\boldsymbol{x},\boldsymbol{y})=\frac{1}{|\mathcal{D}_{\text{val}}|}\sum\limits_{(\boldsymbol{u}_{i},v_{i})\in\mathcal{D}_{\text{val}}}L({\boldsymbol{y}}^{\top}\boldsymbol{u}_{i},v_{i})

and whose LL function is given by

g(𝒙,𝒚)=1|𝒟tr|(𝒖i,vi)𝒟trσ(xi)L(𝒚𝒖i,vi)+c𝒚2g(\boldsymbol{x},\boldsymbol{y})=\frac{1}{|\mathcal{D}_{\text{tr}}|}\sum\limits_{(\boldsymbol{u}_{i},v_{i})\in\mathcal{D}_{\text{tr}}}\sigma(x_{i})L(\boldsymbol{y}^{\top}\boldsymbol{u}_{i},v_{i})+c\|\boldsymbol{y}\|^{2}

where LL denotes the cross-entropy loss, σ()\sigma(\cdot) is the sigmoid function, cc is the coefficient of the regular term and is set to be 0.0010.001, and 𝒟val\mathcal{D}_{\text{val}} and 𝒟tr\mathcal{D}_{\text{tr}} denote the validation set and training set, respectively.

TABLE II: Comparison of the simple SGD type algorithms for data hyper-cleaning tasks on MNIST, FashionMNIST and CIFAR10 datasets. Acc.(%) and F1 score(%) denote the test accuracy and the harmonic mean of the precision and recall, respectively
Method MNIST FashionMNIST CIFAR10
Acc. F1 score Acc. F1 score Acc. F1 score
stocBiO 89.90 63.92 82.18 65.17 38.36 64.07
BSA 80.67 14.20 72.57 16.51 22.72 21.09
TTSA 88.88 33.94 80.62 32.54 35.69 40.03
SSGD 90.32 83.81 82.32 80.45 38.35 66.16

In the following experiments, we compare our proposed algorithm SSGD with BSA, TTSA, stocBiO on three datasets: MNIST [26], FashionMNIST [27] and CIFAR10. Each sample in the training set is corrupted with probability 0.30.3. More results and details are in the supplementary material.

Table II shows the experimental results on three datasets over 270s. We observe that SSGD achieves the highest test accuracy on MNIST and FashionMNIST datasets, and achieves the highest F1 score on all the datasets. It shows that SSGD can obtain competetive results on data hyper-cleaning problem.

VI Conclusion

In the paper, we propose to use SGD-based Estimation to estimate the hypergradient for the nonconvex-strongly-convex bilevel optimization problems, and propose a novel simple SGD type algorithm. We provide the convergence guarantee for our algorithm, and prove that our proposed algorithm can achieve lower computational complexity compared with all the existing simple SGD type algorithms for bilevel optimization. Moreover, the experiments validate our theoretical results.

Acknowledgments

The work is partially supported by National Key R&D Program of China(2020YFB1313503 and 2018AAA0100300), National Natural Science Foundation of China(Nos. 61922019, 61733002, 61672125, and 61976041), and LiaoNing Revitalization Talents Program(XLYC1807088).

References

  • [1] L. Franceschi, P. Frasconi, S. Salzo, R. Grazzi, and M. Pontil, “Bilevel programming for hyperparameter optimization and meta-learning,” in Proc. 35th Int. Conf. Mach. Learn. (ICML), Jul. 2018, pp. 1568–1577.
  • [2] L. Bertinetto, J. F. Henriques, P. Torr, and A. Vedaldi, “Meta-learning with differentiable closed-form solvers,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2019.
  • [3] A. Rajeswaran, C. Finn, S. Kakade, and S. Levine, “Meta-learning with implicit gradients,” in Advances in Neural Information Processing Systems, 2019.
  • [4] K. Ji, J. D. Lee, Y. Liang, and H. V. Poor, “Convergence of meta-learning with task-specific adaptation over partial parameters,” in Proc. 34th Conf. Neural Inf. Process. Syst. (NeurIPS), 2020.
  • [5] A. Shaban, C. A. Cheng, N. Hatch, and B. Boots, “Truncated back-propagation for bilevel optimization,” in Proc. 22th Int. Conf. Artif. Intell. Statist., vol. 89, 2019, pp. 1723–1732.
  • [6] V. Konda and J. Tsitsiklis, “Actor-critic algorithms,” in Advances in Neural Information Processing Systems, vol. 12, 1999.
  • [7] M. Hong, H. T. Wai, Z. Wang, and Z. Yang, “A two-timescale framework for bilevel optimization: Complexity analysis and application to actor-critic,” 2020, arXiv: 2007.05170. [Online]. Available: https://arxiv.org/abs/2007.05170
  • [8] P. Hansen, B. Jaumard, and G. Savard, “New branch-and-bound rules for linear bilevel programming,” SIAM J. Sci. Stat. Comput., vol. 13, no. 5, pp. 1194–1217, 1992.
  • [9] C. Shi, J. Lu, and G. Zhang, “An extended kuhn–tucker approach for linear bilevel programming,” Appl. Math. Comput., vol. 162, no. 1, pp. 51–63, 2005.
  • [10] Y. Lv, T. Hu, G. Wang, and Z. Wan, “A penalty function method based on kuhn–tucker condition for solving linear bilevel programming,” Appl. Math. Comput., vol. 188, no. 1, pp. 808–813, 2007.
  • [11] S. Qin, X. Le, and J. Wang, “A neurodynamic optimization approach to bilevel quadratic programming,” IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 11, pp. 2580–2591, 2017.
  • [12] J. Domke, “Generic methods for optimization-based modeling,” in Proc. 15th Int. Conf. Artif. Intell. Statist., Apr. 2012, pp. 318–326.
  • [13] L. Franceschi, M. Donini, P. Frasconi, and M. Pontil, “Forward and reverse gradient-based hyperparameter optimization,” in Proc. 34th Int. Conf. Mach. Learn. (ICML), vol. 70, Aug. 2017, pp. 1165–1173.
  • [14] Y. L. K. Ji, J. Yang, “Bilevel optimization: Convergence analysis and enhanced design,” in Proc. Int. Conf. Mach. Learn. (ICML), 2021.
  • [15] F. Pedregosa, “Hyperparameter optimization with approximate gradient,” in Proc. Int. Conf. Mach. Learn. (ICML), 2016, p. 737–746.
  • [16] J. Lorraine, P. Vicol, and D. Duvenaud, “Optimizing millions of hyperparameters by implicit differentiation,” in Proc. 23th Int. Conf. Artif. Intell. Statist., 2020, pp. 1540–1552.
  • [17] S. Ghadimi and M. Wang, “Approximation methods for bilevel programming,” 2018, arXiv: 1802.02246. [Online]. Available: https://arxiv.org/abs/1802.02246
  • [18] D. Maclaurin, D. Duvenaud, and R. P. Adams, “Gradient-based hyperparameter optimization through reversible learning,” in Proc. Int. Conf. Mach. Learn. (ICML), 2015.
  • [19] R. Liu, P. Mu, X. Yuan, S. Zeng, and J. Zhang, “A generic first-order algorithmic framework for bi-level programming beyond lower-level singleton,” in Proc. Int. Conf. Mach. Learn. (ICML), 2020.
  • [20] P. Khanduri, S. Zeng, M. Hong, H. T. Wai, Z. Wang, and Z. Yang, “A momentum-assisted single-timescale stochastic approximation algorithm for bilevel optimization,” 2021, arXiv: 2102.07367v1. [Online]. Available: https://arxiv.org/abs/2102.07367v1
  • [21] Z. Guo, Y. Xu, W. Yin, R. Jin, and T. Yang, “On stochastic moving-average estimators for non-convex optimization,” 2021, arXiv: 2104.14840. [Online]. Available: https://arxiv.org/abs/2104.14840
  • [22] J. Yang, K. Ji, and Y. Liang, “Provably faster algorithms for bilevel optimization,” in Advances in Neural Information Processing Systems, vol. 34, 2021, pp. 13 670–13 682.
  • [23] H. H. J. Li, B. Gu, “A fully single loop algorithm for bilevel optimization without hessian inverse,” in Proc. AAAI Conf. Artif. Intell., vol. 36, 2022, pp. 7426–7434.
  • [24] A. Cutkosky and F. Orabona, “Momentum-based variance reduction in non-convex sgd,” in Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [25] T. Chen, Y. Sun, and W. Yin, “Closing the gap: Tighter analysis of alternating stochastic gradient methods for bilevel problems,” in Advances in Neural Information Processing Systems, vol. 34, 2021, pp. 25 294–25 307.
  • [26] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proc. IEEE, vol. 86, pp. 2278–2324, 1998.
  • [27] H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms,” 2017, arXiv: 1708.07747. [Online]. Available: https://arxiv.org/abs/1708.07747
Haimei Huo received the B.S. degree in information and computing science from Changchun University of Science and Technology, China, in 2017. She is currently pursuing the PhD degree in computational mathematics from Dalian University of Technology, Dalian, China. Her research interests include machine learning and optimization.
Risheng Liu received his B.Sc. (2007) and Ph.D. (2012) from Dalian University of Technology, China. From 2010 to 2012, he was doing research as joint Ph.D. in robotics institute at Carnegie Mellon University. From 2016 to 2018, he was doing research as Hong Kong Scholar at the Hong Kong Polytechnic University. He is currently a full professor with the Digital Media Department at International School of Information Science & Engineering, Dalian University of Technology. He was awarded the ”Outstanding Youth Science Foundation” of the National Natural Science Foundation of China. His research interests include optimization, computer vision, and multimedia.
Zhixun Su received the B.S. degree in mathematics from Jilin University, Changchun, China in 1987, the M.S. degree in computer science from Nankai University, Tianjin, China in 1990, and the Ph.D. degree in computational mathematics from Dalian University of Technology, Dalian, China in 1993. He has been a Professor in the School of Mathematical Sciences, Dalian University of Technology, since 1999. He is currently the director of the Key Laboratory for Computational Mathematics and Data Intelligence of Liaoning Province and the vise director of Liaoning Center for Applied Mathematics. His current research interests include computer graphics, computer vision, computational geometry, and machine learning.