
ITSO: A novel Inverse Transform Sampling-based Optimization algorithm for stochastic search

Nikolaos P. Bakas, Vagelis Plevris, Andreas Langousis, Savvas A. Chatzichristofis

Computation-based Science and Technology Research Center, The Cyprus Institute, 20 Konstantinou Kavafi Street, 2121, Aglantzia Nicosia, Cyprus. e-mail: n.bakas@cyi.ac.cy
Department of Civil Engineering and Energy Technology, OsloMet—Oslo Metropolitan University, Pilestredet 35, Oslo 0166, Norway. e-mail: vageli@oslomet.no
Department of Civil Engineering, University of Patras, 265 04 Patras, Greece. e-mail: andlag@alum.mit.edu
Intelligent Systems Lab & Department of Computer Science, Neapolis University Pafos, 2 Danais Avenue, 8042 Pafos, Cyprus. e-mail: s.chatzichristofis@nup.ac.cy
Abstract

Optimization algorithms appear in the core calculations of numerous Artificial Intelligence (AI) and Machine Learning methods, as well as in Engineering and Business applications. Following recent works on the theoretical deficiencies of AI, a rigorous context for the optimization problem of a black-box objective function is developed. The algorithm stems directly from probability theory, instead of a presumed inspiration, and hence the convergence properties of the proposed methodology are inherently stable. In particular, the proposed optimizer utilizes an algorithmic implementation of $n$-dimensional inverse transform sampling as a search strategy. No control parameters need to be tuned, and the trade-off between exploration and exploitation is satisfied by construction. A theoretical proof is provided, concluding that any optimization algorithm converges in the fastest possible time only if it falls, directly or incidentally, within the proposed framework. The numerical experiments verify the theoretical results regarding the efficacy of the algorithm in reaching the optimum as fast as possible.


Keywords: Stochastic Optimization, Inverse Transform Sampling, Black-box Function, Global Convergence.

1 Introduction

Despite the numerous research works and industrial applications of Artificial Intelligence algorithms, they have been criticized for lacking a solid theoretical background [1]. The empirical results demonstrate impressive performance; however, their theoretical foundation and analysis are often vague [2]. Machine Learning (ML) models frequently aim at identifying an optimal solution [3, 4], which is computationally hard, and attempts to explain the procedure are often based on the evaluation of the function's gradients [5, 6]. Accordingly, the identification of whether a stochastic algorithm will work or not remains an open question for many research and real-world applications [7, 8, 9, 10]. A vast number of optimization algorithms have been developed and applied for the solution of the corresponding problems [11, 12, 13, 14], as well as mathematical proofs regarding their algorithmic convergence [15, 16, 17]; however, their basic formulation often stems from nature-inspired procedures [18, 19] rather than from solid mathematical frameworks. Accordingly, a vast number of research works have been published in order to investigate the performance of black-box algorithms [20, 21, 22].

In its elementary form, the purpose of an efficient optimization algorithm is to find the argument yielding the minimum value of a black-box function $f$ defined on a set $A$, $f:A\subseteq\mathbb{R}^{n}\to\mathbb{R}$. The inverse problem of maximization reduces to the minimization of the negated function $-f(x)$, while problems with multiple objective functions often utilize single-function optimization algorithms to attain the best possible solution. $A$ is assumed to be a compact subset of the Euclidean space $\mathbb{R}^{n}$, where $n$ is the number of dimensions of the set $A$; however, the proposed method applies similarly to discrete and continuous topological spaces. The unknown, black-box function $f$ returns values for the given input $\mathbf{x}_{i}=(x_{i1},x_{i2},\ldots,x_{in})$ at each discrete computational time step $i\in\{1,2,\ldots,f_{e}\}$, where $f_{e}$ is the maximum number of function evaluations. The sought solution is a vector $\mathbf{x}_{min}\in A$ such that $f(\mathbf{x}_{min})\leq f(\mathbf{x})\ \forall\mathbf{x}\in A$, which may be written as

\mathbf{x}_{min}=\operatorname{arg\,min}\,f(\mathbf{x}):=\{\mathbf{x}\in A\subseteq\mathbb{R}^{n}\mid\forall\mathbf{y}\in A:f(\mathbf{y})\geq f(\mathbf{x})\}   (1)
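As a concrete illustration of Equation 1, the following minimal sketch (not part of the paper's code) defines a toy two-dimensional black-box objective and approximates $\mathbf{x}_{min}$ by uniform random sampling over a compact set $A$; the objective, bounds, and function names are illustrative assumptions only.

```julia
# Toy illustration of Equation 1: approximate arg min of a black-box f over A = [-10, 10]^2
# by uniform random sampling (illustrative only; this is not the ITSO search strategy).
f(x) = (x[1] - 5.0)^2 + sin(x[2] + 0.7) + 0.01 * (x[2] + 0.7)^2   # example black-box objective

function brute_force_argmin(f, lb, ub, n, fe)
    xmin, fmin = fill(NaN, n), Inf
    for _ in 1:fe
        x = lb .+ (ub - lb) .* rand(n)    # candidate drawn uniformly from A
        fx = f(x)
        if fx < fmin                      # keep the best argument seen so far
            fmin, xmin = fx, x
        end
    end
    return xmin, fmin
end

xmin, fmin = brute_force_argmin(f, -10.0, 10.0, 2, 10_000)
println("approximate x_min = ", xmin, ",  f(x_min) = ", fmin)
```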

The purpose of this work is to provide a rigorous context for the optimization problem of a black-box function, by adhering to probability theory, aiming at identifying the best possible solution $\mathbf{x}_{min}$ within the given number of iterations $f_{e}$ during the execution of the algorithm.

The rest of the paper is organized as follows. In Section 2, the proposed Inverse Transform Sampling Optimizer (ITSO) is presented; the same section provides in detail the theoretical formulation of the algorithm as well as some programming and implementation techniques, and includes illustrative examples of the optimization history. In Section 3 the theoretical proof of convergence is provided, as well as Lemma 1, which derives that the suggested optimization framework is the fastest possible. Section 4 briefly presents the programming techniques that were investigated in order to implement the proposed method in computer code. Section 5 presents the numerical experiments, divided into three groups: comparisons with 17 optimization methods over 13 nonlinear loss functions, for 10 and 20 dimensions of the search space, and 5000 and 10000 function evaluations. Finally, the conclusions are drawn in Section 6.

2 Optimization by Inverse Transform Sampling

Let the probability distribution of the optimal values of $f$ be considered as the product of some monotonically decreasing kernel $k$ over $f$ at some time step $i$, and hence for the given input $\mathbf{x}_{i}$. This assumption lies at the foundation of the method and can be considered rational, as it is a straightforward application of probability theory to the problem, instead of an arbitrary selection strategy inspired by natural or other phenomena. The kernel may be considered parametric with respect to time $i$. The selection of the kernel $k$ should satisfy the condition that its limit is the Dirac delta function centered at $\operatorname{arg\,min}\,f$,

\lim_{i\to\infty}k_{i}(f)=\delta_{m}(x),   (2)

where $\delta_{m}$ denotes the Dirac function centered at $\operatorname{arg\,min}\,f$. Although the selection of the kernel $k$ is ambiguous, it can be chosen among a variety of functions satisfying Equation 2, such as the Gaussian:

k_{i}(f)=\exp\left(-f(\mathbf{x}_{i})^{2}\,g(i)\right),   (3)

where $g(i)$ is a function that increases with time. The function $g(i)$ controls the shape of the kernel $k$, approximating Equation 2 numerically. Additionally, we may use:

k_{i}(f)=\max f-f(\mathbf{x}_{i}),   (4)
k_{i}(f)=1-\frac{f(\mathbf{x}_{i})-\min f+e_{i}}{\max f-\min f+e_{i}},\quad e_{i}\to 0,   (5)

or any other non-negative Lebesgue-integrable function which reverses the order of the given set of all $f_{\hat{i}}$, where $\hat{i}$ is the permutation of the indices $1,2,\ldots,i$ such that the sequence of $f_{\hat{i}}$ is monotonically strictly decreasing. Duplicate values of $f_{\hat{i}}$ should be removed to avoid numerical instabilities; such duplicates often appeared in the empirical calculations, especially when the algorithm was close to a local or global stationary point. A variety of kernel functions $k$ were investigated, and the results were not affected significantly, even when distorting $k(f)$ with some random noise $X^{\prime}\sim\mathcal{U}(a,b)$ for all $k(f_{\hat{i}})\in(a,b)$.
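As an illustration, the candidate kernels of Equations 3 to 5 could be coded as follows; this is a minimal sketch in Julia, where the choice $g(i)=i$ and the small constant `e` are assumptions for demonstration purposes only.

```julia
# Candidate kernels satisfying Equation 2, applied to a vector of recorded objective values fvals.
# The choices g(i) = i and e = 1e-12 are illustrative, not prescribed by the framework.
gaussian_kernel(fvals, i) = exp.(-(fvals .^ 2) .* i)            # Equation 3 with g(i) = i
range_kernel(fvals)       = maximum(fvals) .- fvals             # Equation 4
function normalized_kernel(fvals; e = 1e-12)                    # Equation 5
    fmin, fmax = minimum(fvals), maximum(fvals)
    return 1 .- (fvals .- fmin .+ e) ./ (fmax - fmin + e)
end

# All three assign larger mass to smaller (better) objective values:
fvals = [4.0, 1.0, 0.2, 9.0]
@show range_kernel(fvals)
@show normalized_kernel(fvals)
```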

Accordingly, for each dimension $j$ of the vector space $A$, the marginal probability density function $P_{X_{j}}$ is obtained numerically from the kernel function:

P_{X_{j}}(x_{ij})=k(f(x_{ij}))\quad\forall j\in\{1,2,\ldots,n\}.   (6)

$P_{X_{j}}(x_{ij})\,dx$ is the probability that $\operatorname{arg\,min}\,f$ falls within the infinitesimal interval $[x_{ij}-dx/2,\,x_{ij}+dx/2]$. The corresponding cumulative distribution function (CDF) $F_{X_{j}}$ can be calculated by:

F_{X_{j}}(x_{ij})=\int_{lb_{j}}^{x_{ij}}P_{X_{j}}(\xi)\,d\xi,   (7)

where $lb_{j}$ is the lower bound of the $j^{th}$ dimension of the vector space $A$. The integral can be evaluated by some numerical integration rule, such as the Riemann sum $S_{j}=\sum_{i=1}^{n}P_{X_{j}}(x_{ij}^{*})\,\Delta x_{ij}$, with $P_{X_{j}}(x_{ij}^{*})$ computed by the application of the kernel $k$ on some $x_{ij}^{*}$ such that $f(x_{ij}^{*})=(f(x_{ij})+f(x_{i-1,j}))/2$, or by another approximation scheme. Algorithm 1 below demonstrates the algorithmic implementation of the inverse transform sampling method; the symbols are also listed in the Nomenclature section.

A graphical demonstration of the evolution of the probability density, as well as the corresponding cumulative distribution functions, is presented in Figure 1 for $f(x)=(x-5)^{2}$, a function with a single extremum, and in Figure 2 for $f(x)=\sin(x+0.7)+0.01(x+0.7)^{2}$, which has many extrema. Interestingly, as the function evaluations increase, the CDF is characterized by sharp slopes, positioned in regions where the PDF exhibits high values and hence where the function $f(x)$ attains its lows. Figures 1 and 2 offer an intuitive representation of the procedure for finding the minimum of the function within the suggested framework.

Data: $A$, $f_{e}$, $n$
Result: $\mathbf{x}^{*}=\operatorname{arg\,min}\,f$, $f^{*}=f(\mathbf{x}^{*})$
1  while $i\leq f_{e}$ do
2    $j\leftarrow\mathcal{U}\{1,n\}$;
3    $r_{i}\leftarrow\mathcal{U}(0,1)$;
4    sort $x_{ij}\ \forall i$;
5    $x_{ij}\leftarrow F_{j}^{-1}(r_{i})$;
6    $f_{i}=f(\mathbf{x}_{i})$;
7    if $f_{i}\leq f^{*}$ then
8      $f^{*}\leftarrow f_{i}$;
9      $x_{j}^{*}\leftarrow x_{ij}$;
10   else
11     $x_{ij}\leftarrow x_{j}^{*}$;
12   end if
13 end while
return $\mathbf{x}^{*}=\operatorname{arg\,min}\,f$
Algorithm 1: ITSO Mathematical Framework
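To make the sampling step concrete, the following sketch shows one way the per-dimension CDF of Equation 7 and the inversion $x\leftarrow F_{j}^{-1}(r)$ of Algorithm 1 might be realized numerically. The discretization, the kernel of Equation 4, and the linear interpolation of the inverse CDF are assumptions for illustration; this is not the reference implementation.

```julia
# Minimal sketch of one inverse-transform-sampling draw for a single dimension j.
# Inputs: xs = previously evaluated coordinates x_{1:i,j}, fs = corresponding values f(x_{1:i}).
# The kernel of Equation 4 (max f - f) is assumed; any kernel satisfying Eq. 2 could be used.
function its_draw(xs::Vector{Float64}, fs::Vector{Float64})
    p = sortperm(xs)                         # sort support points (line 4 of Algorithm 1)
    x, f = xs[p], fs[p]
    pdf = (maximum(f) .- f) .+ eps()         # kernel of Eq. 4, shifted to avoid a zero density
    cdf = cumsum(pdf)
    cdf ./= cdf[end]                         # discrete, normalized analogue of Equation 7
    r = rand()                               # r ~ U(0,1), line 3 of Algorithm 1
    k = searchsortedfirst(cdf, r)            # first index with F(x_k) >= r
    k == 1 && return x[1]
    # linear interpolation of the inverse CDF between the two neighbouring support points
    t = (r - cdf[k-1]) / (cdf[k] - cdf[k-1])
    return x[k-1] + t * (x[k] - x[k-1])
end

# Example: after a few evaluations of f(x) = (x - 5)^2, new draws tend to fall near x = 5.
xs = [0.0, 2.0, 4.0, 5.0, 7.0, 9.0]
fs = (xs .- 5.0) .^ 2
@show [its_draw(xs, fs) for _ in 1:5]
```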
Figure 1: Probability Density and Cumulative Distribution Functions, utilizing the ITSO search strategy, for the function $f(x)=(x-5)^{2}$; panels (a)-(h) show the PDF and CDF at $i=3$, $100$, $350$, and $500$. The shape sequentially tends to the Heaviside function, centered at $x_{m}=5$.
Figure 2: Probability Density and Cumulative Distribution Functions, utilizing the ITSO search strategy, for the function $f(x)=\sin(x+0.7)+0.01(x+0.7)^{2}$; panels (a)-(h) show the PDF and CDF at $i=3$, $150$, $250$, and $500$. The shape sequentially tends to the Heaviside function, centered at $x_{m}=-2.24$.

3 Convergence Properties

During the optimization process, the algorithm generates some input variables $x_{ij}$ as arguments for the black-box function $f$. By utilizing the values of $f$, we may compute the values of the distribution $P_{X_{j}}(x_{ij})$ by Equation 6, and $F_{X_{j}}(x_{ij})$ by Equation 7, for all $x_{ij}$.

Definition 1.

We define the argument of $f$ within $f_{e}$ iterations corresponding to the minimum of $f$ as $x_{m}=\operatorname{arg\,min}\,f$.

At iteration $i$, the algorithm will have searched the space $\hat{A}\subset A$, within the limits $lb_{j},ub_{j}$. By definition, $P$ is the probability density (likelihood) that the optimum occurs in a region $\mathbf{x}\pm\mathbf{dx}$. Hence,

E[\mathbf{X}]=\begin{Bmatrix}E[X_{1}]\\E[X_{2}]\\\vdots\\E[X_{n}]\end{Bmatrix}=\begin{Bmatrix}\int_{lb_{1}}^{ub_{1}}\xi\,P_{X_{1}}(\xi)\,d\xi\\\int_{lb_{2}}^{ub_{2}}\xi\,P_{X_{2}}(\xi)\,d\xi\\\vdots\\\int_{lb_{n}}^{ub_{n}}\xi\,P_{X_{n}}(\xi)\,d\xi\end{Bmatrix},   (8)

where $E[\cdot]$ denotes the expectation of a random variable or vector.

Theorem 1.

If $P_{X_{j}}$ tends to the Dirac function $\delta_{m}$, then ITSO converges to $x_{m}$.

Proof.

For each dimension $j$, we may write

E[X_{j}]=\int_{lb_{j}}^{ub_{j}}\xi\,P_{X_{j}}(\xi)\,d\xi=\int_{lb_{j}}^{ub_{j}}\xi\,F^{\prime}_{X_{j}}(\xi)\,d\xi,   (9)

and integrating by parts, we obtain

E[X_{j}]=\left[\xi\,F_{X_{j}}(\xi)\right]_{lb_{j}}^{ub_{j}}-\int_{lb_{j}}^{ub_{j}}F_{X_{j}}(\xi)\,d\xi=ub_{j}\cdot 1-lb_{j}\cdot 0-\int_{lb_{j}}^{ub_{j}}F_{X_{j}}(\xi)\,d\xi.   (10)

If we apply the theorem of the antiderivative of inverse functions [23], we obtain

\int_{0}^{1}F_{X_{j}}^{-1}(y)\,dy+\int_{lb_{j}}^{ub_{j}}F_{X_{j}}(x)\,dx=ub_{j}\cdot 1-lb_{j}\cdot 0.   (11)

Hence, from Equations 10 and 11 we deduce that

E[X_{j}]=\int_{0}^{1}F_{X_{j}}^{-1}(y)\,dy.   (12)

The subscript $m$ denotes that the Dirac function is centered at the argument that minimizes $f$, i.e.

\delta_{m}(x)=\begin{cases}+\infty, & x=\operatorname{arg\,min}\,f\\ 0, & x\neq\operatorname{arg\,min}\,f,\end{cases}   (13)

and

H_{m}(x):=\int_{-\infty}^{x}\delta_{m}(s)\,ds.   (14)

As $P_{X_{j}}$ tends to the Dirac function, $F_{X_{j}}$ tends to the Heaviside step function centered at $x_{m}$; hence, by Equation 12 we deduce that

E[X_{j}]\xrightarrow{\,i\to f_{e}\,}\int_{0}^{1}H_{m}^{-1}(y)\,dy,   (15)

and by Equation 11

E[X_{j}]\to ub_{j}-\int_{lb_{j}}^{ub_{j}}H_{m}(x)\,dx=ub_{j}-(ub_{j}-x_{m})\cdot 1=x_{m}.   (16)
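To illustrate Theorem 1 numerically, the following sketch compares the expectation computed directly from a discretized density (Equation 8) with the integral of the inverse CDF (Equation 12), for a density family that concentrates around an assumed minimizer $x_{m}=7$; the particular density family and discretization are illustrative assumptions only.

```julia
# Numerical check of Equation 12: E[X_j] equals the integral of the inverse CDF over (0,1),
# and it approaches x_m as the density concentrates (Theorem 1). The density family below
# (a discretized Gaussian bump of shrinking width around x_m) is only an illustration.
function expectation_two_ways(xm, width; lb = 0.0, ub = 10.0, npts = 10_001)
    xs  = range(lb, ub; length = npts)
    pdf = exp.(-((xs .- xm) ./ width) .^ 2)
    pdf ./= sum(pdf) * step(xs)                       # normalize to a proper density
    cdf = cumsum(pdf) .* step(xs)
    E_direct = sum(xs .* pdf) * step(xs)              # E[X] = integral of xi * P(xi)  (Equation 8)
    ys = range(1e-6, 1 - 1e-6; length = 1_000)
    Finv(y) = xs[searchsortedfirst(cdf, y)]           # numerical inverse CDF
    E_inverse = sum(Finv.(ys)) * step(ys)             # E[X] ~ integral of F^{-1}(y)   (Equation 12)
    return E_direct, E_inverse
end

@show expectation_two_ways(7.0, 3.0)    # broad density: E[X] is pulled away from x_m by truncation
@show expectation_two_ways(7.0, 0.05)   # density ~ Dirac at x_m = 7: both expectations are near 7
```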

Lemma 1.

(ITSO Convergence Speed) The ITSO framework is the fastest possible: any search strategy whose sampling distribution does not tend to $\delta_{m}$ converges, in expectation, to a point other than $x_{m}$ within the given $f_{e}$ evaluations.

Proof.

Any distribution that does not tend to the Dirac function can be considered as a Dirac function plus a positive function of $x$. In this case, the algorithmic framework would search through a strategy that produces a density $P^{\prime}$ over $\mathbf{x}$, given by

P^{\prime}=P^{*}\pm\delta_{m}.   (17)

Hence, with $F^{*}$ denoting the CDF corresponding to $P^{*}$, Equations 15 and 12 result in

E[X]\xrightarrow{\,i\to f_{e}\,}\int_{0}^{1}H_{m}^{-1}(y)\,dy+\int_{lb}^{ub}F^{*}(x)\,dx,   (18)

and thus, by Equations 17 and 16, we obtain

E[X]\to x_{m}\pm\epsilon.   (19)

Hence, the algorithm would converge to a point different from $x_{m}$.

4 Programming Techniques

A variety of programming techniques were investigated in order to implement the proposed method in computer code. To keep the algorithm simple and to reduce the computational time, we applied inverse transform sampling by keeping, at each iteration $i$, the best function evaluations and randomly sampling among them. This is equivalent to a kernel function that vanishes over the worst function evaluations and distributes the probability mass among the best-performing ones. Similar programming techniques may be investigated in future works within the suggested framework.

1  Initialize: $\mathbf{x}=\text{rand}(n)$, $\mathbf{opti\_x}=\mathbf{x}$, $opti\_f=f(\mathbf{opti\_x})$;
2  for $i=1:f_{e}$ do
3    for $j=1:n$ do
4      $\mathbf{indsBEST}=\text{sortperm}(\mathbf{opti\_evals})[1:\alpha]$;
5      $\mathbf{inp\_x}=\mathbf{x\_evals}[\mathbf{indsBEST},j]$;
6      $\mathbf{inp\_y}=\mathbf{opti\_evals}[\mathbf{indsBEST}]$;
7      $rr=\min\{\mathbf{inp\_x}\}+\text{rand}(0,1)\,(\max\{\mathbf{inp\_x}\}-\min\{\mathbf{inp\_x}\})$;
8      $\mathbf{x}[j]=rr$;
9      $f_{i}=f(\mathbf{x})$;
10     if $f_{i}\leq opti\_f$ then
11       $opti\_f=f_{i}$;
12       $\mathbf{opti\_x}=\mathbf{x}$;
13     else
14       $\mathbf{x}=\mathbf{opti\_x}$;
15     end if
16   end for
17 end for
return $opti\_f$, $\mathbf{opti\_x}$
Algorithm 2: ITSO Short Pseudocode

Algorithm 2 represents a simple version of the supplementary code in the appendix, which may easily be programmed. The variable $\mathbf{opti\_evals}$ is a dynamic vector containing all values of the objective function returned during the optimization history up to step $i$, and $\mathbf{x\_evals}$ is an $i\times n$ matrix containing all the design vectors $\mathbf{x}_{1:i}$ corresponding to $\mathbf{opti\_evals}$. The integer $\alpha$ is a parameter defining how many instances of the optimization history are kept in order to randomly sample among them; for example, if $f_{e}=10^{4}$, we may select $\alpha=10^{2}$. In this variation of the code, the inverse transform sampling is approximately implemented in line 7, by drawing the random variable $rr$ between the extrema of the vector $\mathbf{inp\_x}$, that is, within the range of the $j^{th}$ coordinates of those $\mathbf{x}_{1:i}$ for which the $\alpha$ best function values were returned by the black-box function $f$. The sought solution is the vector $\mathbf{opti\_x}$, together with the minimum attained value of the objective function $opti\_f$.
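For reference, a minimal runnable Julia sketch consistent with Algorithm 2 is given below; variable names follow the pseudocode, while the initial seeding with `ninit` random designs, the default bounds `lb`, `ub`, and the default `α` are illustrative choices rather than the repository implementation (file ITSO-short.jl).

```julia
# Minimal runnable sketch of Algorithm 2 (illustrative; see ITSO-short.jl in the repository
# for the reference implementation). lb and ub bound the search space A.
function itso_short(f, n, fe; lb = -10.0, ub = 10.0, α = 100, ninit = 20)
    # Seed the history with ninit random designs so that the sampled ranges are non-degenerate.
    x_evals    = [lb .+ (ub - lb) .* rand(n) for _ in 1:ninit]
    opti_evals = [f(xk) for xk in x_evals]
    opti_f, ibest = findmin(opti_evals)
    opti_x = copy(x_evals[ibest])
    x = copy(opti_x)
    for _ in 1:fe, j in 1:n
        indsBEST = sortperm(opti_evals)[1:min(α, length(opti_evals))]
        inp_x = [x_evals[k][j] for k in indsBEST]
        # Approximate inverse transform sampling: draw uniformly within the range of the
        # j-th coordinates of the α best designs found so far (line 7 of Algorithm 2).
        x[j] = minimum(inp_x) + rand() * (maximum(inp_x) - minimum(inp_x))
        fi = f(x)
        push!(x_evals, copy(x)); push!(opti_evals, fi)
        if fi <= opti_f
            opti_f, opti_x = fi, copy(x)
        else
            x = copy(opti_x)                    # revert to the best design found so far
        end
    end
    return opti_f, opti_x
end

# Example: minimize the shifted sphere of Equation 30 in 10 dimensions.
sphere(x) = sum((x .- 1.3) .^ 2)
@show itso_short(sphere, 10, 500)
```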

5 Numerical Experiments

In this section, we present the results obtained by running the ITSO algorithm, as well as Adaptive Differential Evolution (rand 1 bin), Differential Evolution (rand 1 bin), and Differential Evolution (rand 2 bin), each with and without radius limitation, Compass Coordinate Direct Search, Probabilistic Descent Direct Search, Random Search, Resampling Inheritance Memetic Search, Resampling Memetic Search, Separable Natural Evolution Strategies, Simultaneous Perturbation Stochastic Approximation, and Exponential Natural Evolution Strategies from the Julia package BlackBoxOptim.jl [24], and Nelder-Mead, Particle Swarm, and Simulated Annealing from Optim.jl [25]. To demonstrate the performance of each optimizer in attaining the minimum, we first run each algorithm $r=10$ times, obtain $f_{k}(\mathbf{x}_{i})$ for $k=\{1,2,\dots,r\}$ and all iterations $i=\{1,2,\ldots,f_{e}\}$, and average the results

\hat{f}(\mathbf{x}_{i})=\frac{1}{r}\sum_{k=1}^{r}f_{k}(\mathbf{x}_{i}).   (20)

Then, we normalize the vector of obtained function evaluations $\mathbf{v}=\{\hat{f}(\mathbf{x}_{1}),\hat{f}(\mathbf{x}_{2}),\dots,\hat{f}(\mathbf{x}_{f_{e}})\}$ to the domain $[0,1]$ through

\hat{f}_{n}(\mathbf{x}_{i})=\frac{\hat{f}(\mathbf{x}_{i})-\min\mathbf{v}}{\max\mathbf{v}-\min\mathbf{v}},   (21)

and finally use the optimization history $h$

h(\mathbf{x}_{i})=\left(\frac{1}{m}\sum_{l=1}^{m}\hat{f}_{n}^{l}(\mathbf{x}_{i})\right)^{1/10},   (22)

where $m$ indicates the number of black-box functions used for the evaluation. Equation 22 was selected as the performance metric in order to obtain a clear representation of the various optimizers utilized, as powers smaller than one (in our case $1/10$) magnify the attained values at the final steps of the optimization history. In Figure 3 the numerical experiments for $m=13$ functions are presented. Each line corresponds to the normalized average optimization history $h$ (Equation 22), over all functions, each repeated 10 times. We may see a clear prevalence of the proposed framework in terms of convergence performance, as expected from the theoretical investigation. The numerical experiments for the comparison with the other optimizers may be reproduced by running the file __run.jl.
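The aggregation of Equations 20 to 22 could be computed as in the following sketch, where `histories` is assumed to hold the raw optimization histories indexed by function, run, and iteration; the data layout and the synthetic example are illustrative only.

```julia
# Sketch of the performance metric of Equations 20-22.
# `histories[l][k]` is assumed to hold the raw optimization history (a vector of length fe)
# of run k = 1..r on black-box function l = 1..m; this layout is illustrative only.
function performance_metric(histories)
    m = length(histories)
    fnorm = map(histories) do runs                   # one entry per black-box function l
        fhat = reduce(+, runs) ./ length(runs)       # Equation 20: average over the r runs
        lo, hi = minimum(fhat), maximum(fhat)
        (fhat .- lo) ./ (hi - lo)                    # Equation 21: normalize to [0, 1]
    end
    return (reduce(+, fnorm) ./ m) .^ (1 / 10)       # Equation 22: average over functions, power 1/10
end

# Example with synthetic, monotonically decreasing histories (m = 2 functions, r = 3 runs, fe = 5):
histories = [[sort(rand(5); rev = true) for _ in 1:3] for _ in 1:2]
@show performance_metric(histories)
```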

Figure 3: Average optimization history $h$ (Equation 22) for all 13 functions, repeated 10 times; panels: (a) 10 variables and 5000 evaluations, (b) 20 variables and 5000 evaluations, (c) 20 variables and 10000 evaluations, (d) legend of the utilized optimizers.
Table 1: Average Minimum Values of h Attained by each Optimizer

Optimizer                                          | n=10, fe=5000 | n=20, fe=5000 | n=20, fe=10000
Adapt. Diff. Evol. (rand 1 bin)                    | 0.00991       | 0.02210       | 0.00852
Adapt. Diff. Evol. (rand 1 bin, radius limited)    | 0.00734       | 0.01762       | 0.00482
Diff. Evol. (rand 1 bin)                           | 0.01348       | 0.02196       | 0.01161
Diff. Evol. (rand 1 bin, radius limited)           | 0.00760       | 0.01377       | 0.00564
Diff. Evol. (rand 2 bin)                           | 0.02140       | 0.03712       | 0.02687
Diff. Evol. (rand 2 bin, radius limited)           | 0.02072       | 0.03110       | 0.02008
Compass Coordinate Direct Search                   | 0.00084       | 0.00118       | 0.00045
Probabilistic Descent Direct Search                | 0.00984       | 0.01424       | 0.01239
Random Search                                      | 0.05678       | 0.08307       | 0.07936
Resampling Inheritance Memetic Search              | 0.00238       | 0.00513       | 0.00243
Resampling Memetic Search                          | 0.00129       | 0.00236       | 0.00223
Separable Natural Evolution Strategies             | 0.02038       | 0.01165       | 0.01127
Simultaneous Perturbation Stochastic Approximation | 0.13967       | 0.13998       | 0.13227
Exponential Natural Evolution Strategies           | 0.01972       | 0.01415       | 0.01286
Nelder-Mead                                        | 0.01858       | 0.03307       | 0.03452
Particle Swarm                                     | 0.00254       | 0.00166       | 0.00136
Simulated Annealing                                | 0.03825       | 0.03812       | 0.03721
Proposed-ITSO                                      | 0.00001       | 0.00020       | 0.00017

6 Discussion and Conclusions

In this work a novel approach was presented for the well-known problem of finding the argument that minimizes a black-box function or system. A vast volume of approximation algorithms have been proposed, mainly heuristic, such as genetic, evolutionary, and particle swarm algorithms, as well as their variations. However, they stem from nature-inspired procedures, and hence their convergence is investigated a posteriori. Despite their efficiency, they are often deprecated by researchers due to the lack of a rigorous mathematical formulation, as well as the complexity of their implementation. On the contrary, the proposed algorithm initiates its formulation from well-established probabilistic definitions and theorems, and its implementation demands only a few lines of computer code. Furthermore, the convergence properties were found to be stable, with a proof that the suggested framework attains the best possible solution in the fewest possible iterations. The numerical examples validate the theoretical results and may be reproduced with the provided computer code. We consider the suggested method a powerful framework which may easily be adopted for the solution of any problem involving the minimization of a black-box function.

Appendix A Programming Code

The corresponding computer code is available on GitHub at https://github.com/nbakas/ITSO.jl. The examples of Figure 3 may be reproduced by running __run.jl. The short version of the algorithm (Algorithm 2) is available in Julia [26] (file ITSO-short.jl), Octave [27] (ITSOshort.m), and Python [28] (ITSO-short.py). The implementation of the framework is integrated in a few lines of computer code, which can be easily adapted for case-specific applications with high efficiency.

Appendix B Black-Box Functions

The following functions were used for the numerical experiments. Equations 23 and 24 (Elliptic, Cigar) were utilized from [29]; Cigtab (Eq. 25) and Griewank (Eq. 26) from [30]; Quartic (Eq. 27) from [31]; Schwefel (Eq. 28), Rastrigin (Eq. 29), Sphere (Eq. 30), and Ellipsoid (Eq. 31) from [32, 24]; and Alpine (Eq. 32) from [33]. Equations 33, 34, and 35 were developed by the authors. The code implementation for the selected equations appears in the file functions_opti.jl in the supplementary computer code.

The exact variations used in this work are as follows, adopting the notation presented in the Nomenclature section, where $i$ denotes the step of the optimization history and $j$ the dimension of the design variable $x_{ij}$.

f_{elliptic}(\mathbf{x}_{i})=\sum_{j=1}^{n}c_{j}\left(x_{ij}+\frac{3}{2}\right)^{2},\quad\text{where }\mathbf{c}=10^{3}\left\{0,\frac{1}{n-1},\dots,1\right\}.   (23)
f_{cigar}(\mathbf{x}_{i})=x_{i1}^{2}+\sum_{j=2}^{n}|x_{ij}|.   (24)
f_{cigtab}(\mathbf{x}_{i})=x_{i1}^{2}+\sum_{j=2}^{n-1}|x_{ij}|+x_{in}^{2}.   (25)
f_{griewank}(\mathbf{x}_{i})=1+\frac{1}{4000}\sum_{j=1}^{n}x_{ij}^{2}-\prod_{j=1}^{n}\cos\left(\frac{x_{ij}}{\sqrt{j}}\right).   (26)
f_{quartic}(\mathbf{x}_{i})=\sum_{j=1}^{n}j\,(x_{ij}-2)^{4}.   (27)
f_{schwefel}(\mathbf{x}_{i})=\sum_{j=1}^{n}c_{j}^{2},\quad\text{where }c_{j}=\sum_{k=1}^{j}(x_{ik}-9).   (28)
f_{rastrigin}(\mathbf{x}_{i})=10n+\sum_{j=1}^{n}\left(x_{ij}+\frac{7}{10}\right)^{2}-10\sum_{j=1}^{n}\cos\left(2\pi\left(x_{ij}+\frac{7}{10}\right)^{2}\right).   (29)
f_{sphere}(\mathbf{x}_{i})=\sum_{j=1}^{n}\left(x_{ij}-\frac{13}{10}\right)^{2}.   (30)
f_{ellipsoid}(\mathbf{x}_{i})=\sum_{j=1}^{n}\left(x_{ij}-\sqrt{2}\right)^{2}.   (31)
f_{alpine}(\mathbf{x}_{i})=\sum_{j=1}^{n}\left|x_{ij}\sin x_{ij}+\frac{1}{10}x_{ij}\right|.   (32)
f_{x\_j}(\mathbf{x}_{i})=\sum_{j=1}^{n}\left(x_{ij}-j-\frac{21}{10}\right)^{2}.   (33)
f_{x\_5}(\mathbf{x}_{i})=\sum_{j=1}^{n}(x_{ij}-5)^{2}-5.   (34)
f_{sin\_x}(\mathbf{x}_{i})=\sum_{j=1}^{n}\left(\sin\left(x_{ij}+\frac{7}{10}\right)+\frac{\left(x_{ij}+\frac{7}{10}\right)^{2}}{100}\right).   (35)
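As an illustration, two of the above benchmarks could be written as follows; these are short sketches consistent with Equations 26 and 29, not the definitions in functions_opti.jl.

```julia
# Illustrative Julia implementations of two of the benchmarks above (cf. functions_opti.jl).
griewank(x)  = 1 + sum(x .^ 2) / 4000 - prod(cos.(x ./ sqrt.(1:length(x))))          # Equation 26
rastrigin(x) = 10 * length(x) + sum((x .+ 0.7) .^ 2) -
               10 * sum(cos.(2π .* (x .+ 0.7) .^ 2))                                  # Equation 29

@show griewank(zeros(10))        # global minimum 0 at the origin
@show rastrigin(fill(-0.7, 10))  # equals 0 at x_j = -7/10
```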

References

  • [1] M. Hutson, “AI researchers allege that machine learning is alchemy,” Science, may 2018. [Online]. Available: http://www.sciencemag.org/news/2018/05/ai-researchers-allege-machine-learning-alchemy
  • [2] D. Sculley, J. Snoek, A. Rahimi, and A. Wiltschko, “Winner’s Curse? On Pace, Progress, and Empirical Rigor,” ICLR Workshop track, 2018.
  • [3] S. Sra, S. Nowozin, and S. J. Wright, Optimization for machine learning. Mit Press, 2012.
  • [4] L. Bottou, F. E. Curtis, and J. Nocedal, “Optimization methods for large-scale machine learning,” Siam Review, vol. 60, no. 2, pp. 223–311, 2018.
  • [5] C. Rudin, “Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead,” Nature Machine Intelligence, vol. 1, no. 5, pp. 206–215, 2019. [Online]. Available: https://doi.org/10.1038/s42256-019-0048-x
  • [6] J. Wu, M. Poloczek, A. G. Wilson, and P. Frazier, “Bayesian optimization with gradients,” in Advances in Neural Information Processing Systems, 2017, pp. 5267–5278.
  • [7] B. Doerr, “Probabilistic tools for the analysis of randomized optimization heuristics,” in Theory of Evolutionary Computation. Springer, 2020, pp. 1–87.
  • [8] C. Doerr, “Complexity theory for discrete black-box optimization heuristics,” in Theory of Evolutionary Computation. Springer, 2020, pp. 133–212.
  • [9] M. Papadrakakis, N. D. Lagaros, and V. Plevris, “Design optimization of steel structures considering uncertainties,” Engineering Structures, vol. 27, no. 9, pp. 1408–1418, 2005.
  • [10] ——, “Optimum design of space frames under seismic loading,” International Journal of Structural Stability and Dynamics, vol. 1, no. 01, pp. 105–123, 2001.
  • [11] N. Moayyeri, S. Gharehbaghi, and V. Plevris, “Cost-based optimum design of reinforced concrete retaining walls considering different methods of bearing capacity computation,” Mathematics, vol. 7, no. 12, p. 1232, 2019.
  • [12] V. Plevris and M. Papadrakakis, “A hybrid particle swarm—gradient algorithm for global structural optimization,” Computer-Aided Civil and Infrastructure Engineering, vol. 26, no. 1, pp. 48–68, 2011.
  • [13] N. D. Lagaros, N. Bakas, and M. Papadrakakis, “Optimum Design Approaches for Improving the Seismic Performance of 3D RC Buildings,” Journal of Earthquake Engineering, vol. 13, no. 3, pp. 345–363, mar 2009. [Online]. Available: http://www.tandfonline.com/doi/full/10.1080/13632460802598594
  • [14] N. D. Lagaros, M. Papadrakakis, and N. P. Bakas, “Automatic minimization of the rigidity eccentricity of 3D reinforced concrete buildings,” Journal of Earthquake Engineering, vol. 10, no. 4, pp. 533–564, jul 2006. [Online]. Available: http://www.tandfonline.com/doi/full/10.1080/13632460609350609
  • [15] A. D. Bull, “Convergence rates of efficient global optimization algorithms,” Journal of Machine Learning Research, vol. 12, no. Oct, pp. 2879–2904, 2011.
  • [16] G. Rudolph, “Convergence analysis of canonical genetic algorithms,” IEEE transactions on neural networks, vol. 5, no. 1, pp. 96–101, 1994.
  • [17] M. Clerc and J. Kennedy, “The particle swarm-explosion, stability, and convergence in a multidimensional complex space,” IEEE transactions on Evolutionary Computation, vol. 6, no. 1, pp. 58–73, 2002.
  • [18] K. R. Opara and J. Arabas, “Differential evolution: A survey of theoretical analyses,” Swarm and evolutionary computation, vol. 44, pp. 546–558, 2019.
  • [19] N. Razmjooy, V. V. Estrela, H. J. Loschi, and W. Fanfan, “A comprehensive survey of new meta-heuristic algorithms,” Recent Advances in Hybrid Metaheuristics for Data Clustering, Wiley Publishing, 2019.
  • [20] M. A. Muñoz and K. A. Smith-Miles, “Performance analysis of continuous black-box optimization algorithms via footprints in instance space,” Evolutionary computation, vol. 25, no. 4, pp. 529–554, 2017.
  • [21] C. Audet and W. Hare, Derivative-free and blackbox optimization. Springer, 2017.
  • [22] C. Audet and M. Kokkolaras, “Blackbox and derivative-free optimization: theory, algorithms and applications,” 2016.
  • [23] F. D. Parker, “Integrals of Inverse Functions,” The American Mathematical Monthly, vol. 62, no. 6, p. 439, jun 1955. [Online]. Available: http://www.jstor.org/stable/2307006?origin=crossref
  • [24] R. Feldt, “Blackboxoptim.jl,” 2013-2018. [Online]. Available: https://github.com/robertfeldt/BlackBoxOptim.jl
  • [25] P. K. Mogensen, A. N. Riseth, J. Myles White, T. Holy, and other contributors, “Optim.jl,” 2012–2017. [Online]. Available: https://github.com/JuliaNLSolvers/Optim.jl
  • [26] J. Bezanson, A. Edelman, S. Karpinski, and V. B. Shah, “Julia: A fresh approach to numerical computing,” SIAM review, vol. 59, no. 1, pp. 65–98, 2017.
  • [27] Contributors, “Gnu octave,” http://hg.savannah.gnu.org/hgweb/octave/file/tip/doc/interpreter/contributors.in, 28 Feb 2020.
  • [28] ——, “Python 3.8.2,” https://www.python.org/, 2020.
  • [29] J. Liang, B. Qu, P. Suganthan, and A. G. Hernández-Díaz, “Problem definitions and evaluation criteria for the cec 2013 special session on real-parameter optimization,” Computational Intelligence Laboratory, Zhengzhou University, Zhengzhou, China and Nanyang Technological University, Singapore, Technical Report, vol. 201212, no. 34, pp. 281–295, 2013.
  • [30] C.-K. Au and H.-F. Leung, “Eigenspace sampling in the mirrored variant of (1, λ\lambda)-cma-es,” in 2012 IEEE Congress on Evolutionary Computation. IEEE, 2012, pp. 1–8.
  • [31] M. Jamil and X.-S. Yang, “A literature survey of benchmark functions for global optimization problems,” arXiv preprint arXiv:1308.4008, 2013.
  • [32] S. Finck, N. Hansen, R. Ros, and A. Auger, “Real-parameter black-box optimization benchmarking 2009: Presentation of the noiseless functions,” Citeseer, Tech. Rep., 2010.
  • [33] K. Hussain, M. N. M. Salleh, S. Cheng, and R. Naseem, “Common benchmark functions for metaheuristic evaluation: A review,” JOIV: International Journal on Informatics Visualization, vol. 1, no. 4-2, pp. 218–223, 2017.