
Optimization Methods Rooted in Optimal Control

Huanshui Zhang, Hongxia Wang This work was supported by the Original Exploratory Program Project of National Natural Science Foundation of China (62250056), the Joint Funds of the National Natural Science Foundation of China (U23A20325), the Major Basic Research of Natural Science Foundation of Shandong Province (ZR2021ZD14), and the High-level Talent Team Project of Qingdao West Coast New Area (RCTD-JC-2019-05).
Abstract

In this paper, we propose solving optimization problems (OPs), and interpreting Newton's method, from the viewpoint of optimal control. We propose a new optimization algorithm based on an optimal control problem (OCP). The algorithm converges more rapidly than gradient descent; meanwhile, it is superior to Newton's method because it does not diverge in general and can be applied when the Hessian matrix is singular. These merits are supported by the convergence analysis of the algorithm given in the paper. We also point out that the convergence rate of the proposed algorithm can be accelerated by decreasing the magnitude of the control weight matrix or by increasing the control terminal time, both inherited from the OCP.

I Introduction

The research history of optimization problems is long and rich, spanning several centuries and evolving with the development of mathematics, engineering, operations research, and computer science. This stems from the fact that optimization problems are widespread in the modeling of real-world systems for a very broad range of applications. Such applications include economies of scale, fixed charges, finance, allocation and location problems, structural optimization, engineering design, network and transportation problems, chip design and database problems, mechanical design, chemical engineering design and control, molecular biology, and several combinatorial optimization problems such as integer programming and related graph problems. With the development and implementation of practical optimization algorithms, more and more scientists in diverse disciplines have been using optimization techniques to solve problems [1, 2, 3, 4].

Predominant optimization methods, such as gradient descent [5] and Newton's method [6], are based on function approximation. Despite being powerful and widely used, they also have limitations and potential drawbacks. The learning rate is a crucial hyperparameter in gradient descent [7]: the performance of gradient descent is sensitive to it (if the learning rate is too small, the algorithm may converge slowly; if it is too large, the algorithm may overshoot the minimum and fail to converge [8]), while hyperparameter tuning is challenging [9]. Newton's method may diverge if the Hessian matrix is singular, or encounter numerical instability when the Hessian matrix is ill-conditioned. Both gradient descent and Newton's method are sensitive to the choice of the initial value: if the initial guess is far from the actual root or minimum, Newton's method might converge to a different solution or fail to converge altogether, and gradient descent may get stuck at local minima or saddle points. To mitigate some of these issues, a body of variations and enhancements has been developed, such as stochastic gradient descent [10], momentum [11], adaptive learning rate methods [12, 13, 14], quasi-Newton methods [15], inexact Newton methods [16], and modified Newton methods [17]. However, tuning the parameters of these diverse methods still remains somewhat blind.

This paper is the first to propose solving OPs, and interpreting existing optimization methods, from the optimal control viewpoint. Every parameter or variable in an OCP has a clear physical meaning, which makes it easy to tune the parameters so as to shape the variables of interest [18]. Accordingly, the convergence rate of algorithms based on an OCP can be altered by adjusting the parameters inherited from the OCP. It should be emphasized that in this paper we do not apply the results of optimal control mechanically; rather, we propose a new optimization algorithm based on optimal control and carry out a convergence analysis. The algorithm converges more rapidly than gradient descent; meanwhile, it is superior to Newton's method because it remains applicable when the Hessian matrix is singular. We also point out that the convergence rate of the proposed algorithm can be accelerated by decreasing the magnitude of the control weight matrix or by increasing the control terminal time inherited from the OCP.

The remainder is arranged as follows. In Section II, we derive an update relation, i.e., an optimization method, for the OP from the solution of an OCP. In Section III, we explore the relationship between the existing methods (gradient descent and Newton's method) and the proposed one, and analyze the convergence of its explicit implementation. Conclusions are drawn in Section IV.

II Solving the optimization problem by the optimal control method

Assume that $f(x)$ is twice continuously differentiable. In most practical settings, one would almost surely need to resort to numerical techniques to address the optimization problem

$$\min_{x} f(x). \qquad (1)$$

Traditional numerical methods, for instance gradient descent or Newton's method, directly find zeros of $f'(x)$ by an update recursion

$$x_{k+1} = \Phi(x_k) \qquad (2)$$

with $\Phi(\cdot)$ a designed function; for instance, $\Phi(x) = x - \alpha f'(x)$ with learning rate $\alpha$ in gradient descent, and $\Phi(x) = x - f'(x)/f^{(2)}(x)$ in Newton's method.

Instead, we will propose an update relation with the aid of the following OCP:

$$\min_{u} \sum_{k=0}^{N}\big[f(x_k) + u_k' R u_k\big] + f(x_{N+1}), \qquad (3)$$
$$\text{subject to } x_{k+1} = x_k + u_k, \qquad (4)$$

where $x_k$ and $u_k$ are the state and control of system (4), respectively, the integer $N > 0$ is the control terminal time, the positive definite matrix $R$ is the control weight matrix, and (3) is the cost functional of the OCP, closely related to OP (1).
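As a concrete illustration (our addition, not part of the original formulation): for a quadratic objective, OCP (3)-(4) is exactly a discrete-time linear quadratic regulator (LQR) problem. Taking $f(x) = \frac{1}{2}x' Q x$ with $Q > 0$, (3)-(4) reads
$$\min_{u} \sum_{k=0}^{N}\Big[\tfrac{1}{2}x_k' Q x_k + u_k' R u_k\Big] + \tfrac{1}{2}x_{N+1}' Q x_{N+1}, \qquad x_{k+1} = x_k + u_k,$$
i.e., an LQR problem with state matrix $A = I$ and input matrix $B = I$ [18].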

Remark 1.

It can be seen from (3) that the optimization algorithm based on the OCP will be fast because $u_k$ in (4) is chosen to minimize $f(x_{k+1})$.

Remark 2.

According to OCP (3)-(4), the smaller the weight matrix $R$ is, the more control energy is allowed to be consumed for minimization, i.e., the larger the update step size $u_k$ in (4) is, and hence the more rapidly the iteration scheme (4) may converge to the extremal point of OP (1).

Remark 3.

Without the restriction on the consumed control energy (i.e., $R = 0$) and with $N = 0$, OCP (3)-(4) degenerates into $\min_{u_0} f(x_1)$ subject to (4).

Lemma 1.

The optimal control strategy of OCP (3)-(4) is given by
$$u_k = -R^{-1}\sum_{i=k+1}^{N+1} f'(x_i). \qquad (5)$$
Proof.

The conclusion follows directly from the maximum principle; we thus omit the detailed proof. ∎
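For completeness, here is a minimal sketch of the derivation (our reconstruction via the discrete maximum principle, adopting the scaling convention that absorbs the factor $2$ arising from $\partial(u_k' R u_k)/\partial u_k$ into $R$). With the Hamiltonian
$$H_k = f(x_k) + u_k' R u_k + \lambda_{k+1}'(x_k + u_k),$$
the costate equations are
$$\lambda_k = \frac{\partial H_k}{\partial x_k} = f'(x_k) + \lambda_{k+1}, \qquad \lambda_{N+1} = f'(x_{N+1}),$$
so $\lambda_{k+1} = \sum_{i=k+1}^{N+1} f'(x_i)$, and the stationarity condition $\partial H_k/\partial u_k = 0$ gives $R u_k + \lambda_{k+1} = 0$, i.e., (5).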

Taking (5) as the update direction, we obtain from (4) the update relation
$$x_{k+1} = x_k - R^{-1}\sum_{i=k+1}^{N+1} f'(x_i). \qquad (6)$$

Although the relation (6) is implicit, explicit approximate implementations are possible. We give an explicit approximating recursion later.

It should be stressed that proposing a new update scheme is not our ultimate goal. We are also interested in revealing the relationship between existing methods and ours.

III The proposed algorithm and convergence analysis

To explore the relationship between the existing methods and ours, we now reformulate the update relation (6). Expanding $f'(x_i)$ in (6) in a first-order Taylor series about $x_{i-1}$, it can be derived that
$$x_{k+1} = x_k - g_k, \qquad k = 0, \dots, N, \qquad (7)$$
$$g_k = \frac{1}{R + f^{(2)}_k}\Big(f'_k + \sum_{i=k+1}^{N}\big(f'_i - f^{(2)}_i g_i\big)\Big), \qquad g_N = \frac{f'_N}{R + f^{(2)}_N}, \qquad (8)$$
where $f'_k \doteq f'(x_k)$ and $f^{(2)}_k \doteq f^{(2)}(x_k)$.
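To make this step explicit (our reconstruction of the omitted derivation): the first-order expansion gives $f'(x_i) \approx f'(x_{i-1}) + f^{(2)}(x_{i-1})(x_i - x_{i-1}) = f'_{i-1} - f^{(2)}_{i-1} g_{i-1}$, so (6) becomes
$$g_k = R^{-1}\sum_{i=k+1}^{N+1}\big(f'_{i-1} - f^{(2)}_{i-1} g_{i-1}\big) = R^{-1}\Big(f'_k - f^{(2)}_k g_k + \sum_{i=k+1}^{N}\big(f'_i - f^{(2)}_i g_i\big)\Big),$$
and solving for $g_k$ yields (8).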

To facilitate calculation, we replace every $x_i$, $i \geq k+1$, by $x_k$ on the right-hand side of (8) and obtain the following recursion, evaluated at $x = x_k$:
$$x_{k+1} = x_k - \bar{g}_k(x_k), \qquad k = 0, \dots, N, \qquad (9)$$
$$\bar{g}_l(x) = \frac{1}{R + f^{(2)}(x)}\big(f'(x) + R\,\bar{g}_{l+1}(x)\big), \qquad \bar{g}_N(x) = \frac{f'(x)}{R + f^{(2)}(x)}, \qquad l = N-1, N-2, \dots, k. \qquad (10)$$

According to (9)-(10), the following algorithm is proposed.

Algorithm 1:
1). Fix an integer $N > 0$ and a positive definite matrix $R$, and prescribe any initial point $x_0$;
2). Carry out the update relation (9)-(10) and obtain $x_{N+1}$.
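The following is a minimal scalar-case sketch of Algorithm 1 (our illustration, not the authors' code; the helper names g_bar and algorithm1 and the toy objective $f(x) = x^4$ are assumptions made for demonstration):

```python
def g_bar(x, k, N, R, f1, f2):
    """Evaluate the step size g_bar_k(x) of (10) by the backward recursion
    starting from g_bar_N(x) = f'(x) / (R + f''(x))."""
    g = f1(x) / (R + f2(x))                 # \bar{g}_N(x)
    for _ in range(N - k):                  # l = N-1, ..., k
        g = (f1(x) + R * g) / (R + f2(x))
    return g

def algorithm1(x0, N, R, f1, f2):
    """Algorithm 1: iterate (9) with the varying step \bar{g}_k(x_k)."""
    x = x0
    for k in range(N + 1):                  # k = 0, ..., N; returns x_{N+1}
        x = x - g_bar(x, k, N, R, f1, f2)
    return x

# Toy problem f(x) = x^4 with minimizer x* = 0, where f''(0) = 0 and a pure
# Newton step would degenerate; here R + f''(x) stays bounded away from zero.
f1 = lambda x: 4 * x**3     # f'(x)
f2 = lambda x: 12 * x**2    # f''(x)
print(algorithm1(x0=2.0, N=20, R=0.1, f1=f1, f2=f2))
```

Note that $\bar{g}_k(x)$ depends on $k$ only through the recursion depth $N - k$, so early iterations use a larger step size than later ones, in line with Remark 8 below.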

Remark 4.

In the case of $R = 0$, the recursion (9)-(10) recovers the famous Newton method.
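Indeed (a one-line check added for clarity): setting $R = 0$ in (10) gives $\bar{g}_N(x) = f'(x)/f^{(2)}(x)$ and then $\bar{g}_l(x) = \big(f'(x) + 0\big)/f^{(2)}(x) = f'(x)/f^{(2)}(x)$ for every $l$, so (9) reduces to the Newton iteration
$$x_{k+1} = x_k - \frac{f'(x_k)}{f^{(2)}(x_k)}.$$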

In the sequel, we carry out a convergence analysis of Algorithm 1. To compare with Newton's method, we assume that $f(x)$ is three times differentiable.

Theorem 1.

Assume $f'(x_*) = 0$ and $f^{(2)}(x_*) > 0$. Consider the iteration scheme
$$x_{k+1} = x_k - \bar{g}_0(x_k), \qquad (11)$$
$$\bar{g}_l(x) = \frac{1}{R + f^{(2)}(x)}\big[f'(x) + R\,\bar{g}_{l+1}(x)\big], \qquad \bar{g}_N(x) = \frac{f'(x)}{R + f^{(2)}(x)}, \qquad l = N-1, \dots, 0. \qquad (12)$$

Then (11)-(12) converges linearly to $x_*$ with convergence constant $\big(\frac{R}{R + f^{(2)}(x_*)}\big)^{N+1}$.

Proof.

The iteration error evolves as
$$x_{k+1} - x_* = x_k - \bar{g}_0(x_k) - x_* = \phi(x_k) - \phi(x_*) = \phi'(x_*)(x_k - x_*) + \tfrac{1}{2}\phi^{(2)}(x_*)(x_k - x_*)^2 + o(|x_k - x_*|^2). \qquad (13)$$

In the above,
$$\phi(x) = x - \bar{g}_0(x), \qquad (14)$$
$$\phi'(x_*) = \Big(\frac{R}{R + f^{(2)}(x_*)}\Big)^{N+1}. \qquad (15)$$

$\phi'(x_*)$ in (15) can be derived by induction on $\bar{g}_k'(x_*)$. In detail, according to (10),

$$\bar{g}_k'(x) = \frac{1}{R + f^{(2)}(x)}\big(f^{(2)}(x) + R\,\bar{g}_{k+1}'(x)\big) - \frac{f^{(3)}(x)}{(R + f^{(2)}(x))^2}\big(f'(x) + R\,\bar{g}_{k+1}(x)\big), \qquad (16)$$
$$\bar{g}_N'(x) = \frac{f^{(2)}(x)}{R + f^{(2)}(x)} - \frac{f'(x) f^{(3)}(x)}{(R + f^{(2)}(x))^2}. \qquad (17)$$

(10) implies that every $\bar{g}_i(x)$ contains $f'(x)$ as a factor. Accordingly, if $f'(x_*) = 0$, then
$$\bar{g}_i(x_*) = 0, \qquad i = N, N-1, \dots, 0, \qquad (18)$$

which together with (16)-(17) yields
$$\bar{g}_k'(x_*) = \frac{1}{R + f^{(2)}(x_*)}\big(f^{(2)}(x_*) + R\,\bar{g}_{k+1}'(x_*)\big), \qquad (19)$$
$$\bar{g}_N'(x_*) = \frac{f^{(2)}(x_*)}{R + f^{(2)}(x_*)}. \qquad (20)$$

Based on (19) and (20), $\bar{g}_i'(x_*) = 1 - \big(\frac{R}{R + f^{(2)}(x_*)}\big)^{N+1-i}$ can be obtained by backward induction.

First, in the case $i = N$, it is immediate that
$$\bar{g}_N'(x_*) = \frac{f^{(2)}(x_*)}{R + f^{(2)}(x_*)} = 1 - \frac{R}{R + f^{(2)}(x_*)}. \qquad (21)$$

Second, assume that for all $i \geq k+1$,
$$\bar{g}_i'(x_*) = 1 - \Big(\frac{R}{R + f^{(2)}(x_*)}\Big)^{N+1-i}. \qquad (22)$$

Third, we show that (22) also holds in the case $i = k$:
$$\bar{g}_k'(x_*) = \frac{1}{R + f^{(2)}(x_*)}\big(f^{(2)}(x_*) + R\,\bar{g}_{k+1}'(x_*)\big) = \frac{1}{R + f^{(2)}(x_*)}\Big[f^{(2)}(x_*) + R\Big(1 - \Big(\frac{R}{R + f^{(2)}(x_*)}\Big)^{N-k}\Big)\Big] = 1 - \Big(\frac{R}{R + f^{(2)}(x_*)}\Big)^{N-k+1}. \qquad (23)$$

Now, from (11) and (14),
$$\phi'(x_*) = 1 - \bar{g}_0'(x_*) = \Big(\frac{R}{R + f^{(2)}(x_*)}\Big)^{N+1}. \qquad (24)$$

Substituting (24) into (13) establishes the claimed linear convergence with constant $\big(\frac{R}{R + f^{(2)}(x_*)}\big)^{N+1}$. ∎

Remark 5.

According to the convergence constant $\big(\frac{R}{R + f^{(2)}(x_*)}\big)^{N+1}$ provided in Theorem 1, the convergence rate of (11)-(12) depends on two adjustable parameters, $R$ and $N$. For a given $N$, the smaller $R$ is, the more rapidly (11)-(12) converges; for a given $R$, the larger $N$ is, the more rapidly (11)-(12) converges.
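As a numerical illustration (our example): taking $R = f^{(2)}(x_*)$ gives the convergence constant $2^{-(N+1)}$, so already for $N = 9$ the error contracts by a factor of $2^{-10} \approx 10^{-3}$ per iteration, i.e., roughly three decimal digits of accuracy are gained per step.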

Remark 6.

Theorem 1 states that the explicit approximation (11)-(12) converges linearly. However, the convergence constant $\big(\frac{R}{R + f^{(2)}(x_*)}\big)^{N+1}$ also implies that (11)-(12) can be made to converge rapidly, almost achieving quadratic convergence, by adjusting $R$ and $N$ to decrease the convergence constant.

By direct manipulation, $\phi^{(2)}(x_*)$ in (13) can be given as
$$\phi^{(2)}(x_*) = \bar{g}_0^{(2)}(x_*) = \frac{f^{(3)}(x_*)}{R + f^{(2)}(x_*)}\Big\{2(N+1)\Big(\frac{R}{R + f^{(2)}(x_*)}\Big)^{N+1} - \frac{R + f^{(2)}(x_*)}{f^{(2)}(x_*)}\Big[1 - \Big(\frac{R}{R + f^{(2)}(x_*)}\Big)^{N+1}\Big]\Big\}$$
$$= \frac{f^{(3)}(x_*)}{f^{(2)}(x_*)(R + f^{(2)}(x_*))}\Big\{\big[(2N+3) f^{(2)}(x_*) + R\big]\Big(\frac{R}{R + f^{(2)}(x_*)}\Big)^{N+1} - R - f^{(2)}(x_*)\Big\}$$
$$= \frac{f^{(3)}(x_*)}{f^{(2)}(x_*)} \cdot \frac{\big[(2N+3) f^{(2)}(x_*) + R\big] R^{N+1} - (R + f^{(2)}(x_*))^{N+2}}{(R + f^{(2)}(x_*))^{N+2}}. \qquad (25)$$

Since $\big|\big[(2N+3) f^{(2)}(x_*) + R\big] R^{N+1} - (R + f^{(2)}(x_*))^{N+2}\big| < (R + f^{(2)}(x_*))^{N+2}$, we have $|\phi^{(2)}(x_*)| < \big|\frac{f^{(3)}(x_*)}{f^{(2)}(x_*)}\big|$.

Remark 7.

$|\phi^{(2)}(x_*)| < \big|\frac{f^{(3)}(x_*)}{f^{(2)}(x_*)}\big|$, together with (13) and Remark 6, implies that the explicit approximation (11)-(12) can even converge more rapidly than Newton's method as $N \rightarrow +\infty$.

Remarks 5-7 are devoted to the convergence of (11)-(12). The scheme (11)-(12) can be regarded as using a uniform step size $\bar{g}_0(x_k)$ in each iteration, because of the uniform convergence rate. As in the gradient descent method [19, 20], a uniform step size may lead to slow convergence. For this reason, Algorithm 1, with the varying step size $\bar{g}_k(x_k)$, is a better alternative to the recursion (11)-(12).

To analyze the convergence rate of Algorithm 1, for $i = 0, \dots, N$, let
$$\psi_i(x) = x - \bar{g}_i(x). \qquad (26)$$

Then the following theorem holds.

Theorem 2.

Assume $f'(x_*) = 0$ and $f^{(2)}(x_*) > 0$. The iteration error of Algorithm 1 obeys
$$|x_{k+1} - x_*| \leq |\psi_k'(x_*)(x_k - x_*)| \leq \Big|\prod_{i=0}^{k} \psi_i'(x_*)\,(x_0 - x_*)\Big|, \qquad (27)$$

where
$$\psi_i'(x_*) = \Big[\frac{R}{R + f^{(2)}(x_*)}\Big]^{N-i+1}. \qquad (28)$$
Proof.

The proof follows straightforwardly from (26) together with (19)-(23). ∎

Remark 8.

Compared to (11)-(12), Algorithm 1 uses the varying step size $\bar{g}_k(x_k)$ at each iteration. From Theorem 1 and (23), the algorithm takes a larger step in the initial stages and a smaller one in the later stages, which guarantees rapid convergence and lower computational cost simultaneously.

Remark 9.

On the one hand, it follows from (27) and (28) that Algorithm 1 converges superlinearly to the minimizer of $f(x)$ when $f^{(2)}(x_*) > 0$. On the other hand, Algorithm 1 still works even if $f^{(2)}(x) = 0$ occasionally holds along the iterations, thanks to the positive definite matrix $R$.
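As a simple illustration (our example): for $f(x) = x^4$, the Newton step $f'(x)/f^{(2)}(x) = x/3$ is ill-defined at $x = 0$ since $f^{(2)}(0) = 0$, whereas $R + f^{(2)}(x) \geq R > 0$ keeps every $\bar{g}_l$ in (10) well defined along the whole iteration.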

IV Conclusion

In this paper, a novel algorithm has been developed based on an OCP. The algorithm converges more rapidly than gradient descent; meanwhile, unlike Newton's method, it still works even if a singular Hessian matrix appears during the iterations. We also pointed out that Newton's method is recovered when the control weight matrix vanishes in the developed algorithm. The convergence rate of the algorithm depends on two adjustable parameters; because of their clear physical meanings inherited from the OCP, it is very convenient to tune them to speed up convergence.

References

  • [1] S. S. Rao, Engineering optimization: theory and practice. John Wiley & Sons, 2019.
  • [2] A. D. Belegundu and T. R. Chandrupatla, Optimization concepts and applications in engineering. Cambridge University Press, 2019.
  • [3] N. V. Banichuk, Introduction to optimization of structures. Springer Science & Business Media, 2013.
  • [4] E. K. Chong, W.-S. Lu, and S. H. Żak, An introduction to optimization: with applications to machine learning. John Wiley & Sons, 2023.
  • [5] J. D. Faires and R. L. Burden, Numerical methods. Thomson, 2003.
  • [6] S. P. Boyd and L. Vandenberghe, Convex optimization. Cambridge University Press, 2004.
  • [7] A. G. Baydin, R. Cornish, D. M. Rubio, M. Schmidt, and F. Wood, “Online learning rate adaptation with hypergradient descent,” arXiv:1703.04782, 2017.
  • [8] Z. Fei, Z. Wu, Y. Xiao, J. Ma, and W. He, “A new short-arc fitting method with high precision using Adam optimization algorithm,” Optik, vol. 212, p. 164788, 2020.
  • [9] S. Ruder, “An overview of gradient descent optimization algorithms,” arXiv:1609.04747, 2016.
  • [10] L. Bottou, “Large-scale machine learning with stochastic gradient descent,” in Proceedings of COMPSTAT’2010: 19th International Conference on Computational Statistics (Paris, France, August 22-27, 2010), Keynote, Invited and Contributed Papers, pp. 177–186, Springer, 2010.
  • [11] N. Qian, “On the momentum term in gradient descent learning algorithms,” Neural Networks, vol. 12, no. 1, pp. 145–151, 1999.
  • [12] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” Journal of Machine Learning Research, vol. 12, no. 7, 2011.
  • [13] B. O’Donoghue and E. Candès, “Adaptive restart for accelerated gradient schemes,” Foundations of Computational Mathematics, vol. 15, pp. 715–732, 2015.
  • [14] M. D. Zeiler, “Adadelta: an adaptive learning rate method,” arXiv:1212.5701, 2012.
  • [15] C. G. Broyden, “Quasi-Newton methods and their application to function minimisation,” Mathematics of Computation, vol. 21, no. 99, pp. 368–381, 1967.
  • [16] R. S. Dembo, S. C. Eisenstat, and T. Steihaug, “Inexact Newton methods,” SIAM Journal on Numerical Analysis, vol. 19, no. 2, pp. 400–408, 1982.
  • [17] R. Fletcher and T. L. Freeman, “A modified Newton method for minimization,” Journal of Optimization Theory and Applications, vol. 23, pp. 357–372, 1977.
  • [18] F. L. Lewis, D. Vrabie, and V. L. Syrmos, Optimal control. John Wiley & Sons, 2012.
  • [19] H. Zhao, F. Liu, H. Zhang, and Z. Liang, “Research on a learning rate with energy index in deep learning,” Neural Networks, vol. 110, pp. 225–231, 2019.
  • [20] D. Li, C. Chen, Q. Lv, H. Gu, T. Lu, L. Shang, N. Gu, and S. M. Chu, “AdaError: An adaptive learning rate method for matrix approximation-based collaborative filtering,” in Proceedings of the 2018 World Wide Web Conference, pp. 741–751, 2018.