
Online Projected Gradient Descent for Stochastic Optimization with Decision-Dependent Distributions

Killian Wood    Gianluca Bianchin    and Emiliano Dall’Anese K. Wood is with the Dept. of Applied Mathematics, University of Colorado Boulder. G. Bianchin and E. Dall’Anese are with the Dept. of Electrical, Computer, and Energy Eng., University of Colorado Boulder. This work was supported in part by the National Science Foundation ERC ASPIRE and by the National Renewable Energy Laboratory through the subcontract UGA-0-41026-148.
Abstract

This paper investigates the problem of tracking solutions of stochastic optimization problems with time-varying costs that depend on random variables with decision-dependent distributions. In this context, we propose the use of an online stochastic gradient descent method to solve the optimization, and we provide explicit bounds in expectation and in high probability for the distance between the optimizers and the points generated by the algorithm. In particular, we show that when the gradient error due to sampling is modeled as a sub-Weibull random variable, then the tracking error is ultimately bounded in expectation and in high probability. The theoretical findings are validated via numerical simulations in the context of charging optimization of a fleet of electric vehicles.

Keywords: Optimization, optimization algorithms.

1 Introduction


This paper considers the problem of developing and analyzing online algorithms to track the solutions of time-varying stochastic optimization problems, where the distribution of the underlying random variables is decision-dependent. Formally, we consider problems of the form¹:

¹ Notation. We let $\mathbb{N}_{0}:=\mathbb{N}\cup\{0\}$, where $\mathbb{N}$ denotes the set of natural numbers. For a column vector $x\in\mathbb{R}^{n}$, $\|x\|$ is the Euclidean norm. Given a differentiable function $f:\mathbb{R}^{n}\rightarrow\mathbb{R}$, $\nabla f(x)$ denotes the gradient of $f$ at $x$ (taken to be a column vector). Given a closed convex set $C\subseteq\mathbb{R}^{n}$, $\operatorname{proj}_{C}:\mathbb{R}^{n}\to\mathbb{R}^{n}$ denotes the Euclidean projection onto $C$, namely $\operatorname{proj}_{C}(y):=\arg\min_{v\in C}\|y-v\|$. For a random variable $z\in\mathbb{R}$, $\mathbb{E}[z]$ denotes the expected value of $z$, and $\mathbb{P}(z\leq\epsilon)$ denotes the probability of $z$ taking values smaller than or equal to $\epsilon$; $\|z\|_{p}:=\mathbb{E}[|z|^{p}]^{1/p}$ for any $p\geq 1$. Finally, $e$ denotes Euler's number.

$x^{*}_{t}\in\arg\min_{x\in C_{t}}\,\mathbb{E}_{z\sim D_{t}(x)}\left[\ell_{t}(x,z)\right],$   (1)

where $t\in\mathbb{N}_{0}$ is a time index, $x\in\mathbb{R}^{d}$ is the decision variable, $D_{t}$ is a map from $\mathbb{R}^{d}$ to the space of distributions, $z\in\mathcal{Z}_{t}$ is a random variable (with $\mathcal{Z}_{t}$ the union of the supports of $D_{t}(x)$ over all $x\in C_{t}$), $\ell_{t}:\mathbb{R}^{d}\times\mathcal{Z}_{t}\rightarrow\mathbb{R}$ is the loss function, and $C_{t}\subseteq\mathbb{R}^{d}$ is a closed and convex set. Problems of this form arise in sequential learning and strategic classification [1], and in applications in power and energy systems [2, 3] to model uncertainty in pricing and human behavior. Moreover, the framework (1) can be used to solve control problems for dynamical systems whose dynamics are unknown, where the variable $z$ accounts for the lack of knowledge of the underlying system dynamics (similarly to problems in feedback-based optimization [4, 5]).

Since the distribution of $z$ in (1) depends on the decision variable $x$, the problem of finding $x_{t}^{*}$ is computationally burdensome in general, and intractable when $D_{t}$ is unknown, even when the loss function is convex in $x$ [6, 7]. For this reason, we focus on finding decisions that are optimal with respect to the distribution that they induce; we refer to these points as performatively stable [6], while we refer to solutions $x_{t}^{*}$ of the original problem (1) as performatively optimal. We obtain explicit error bounds between performatively optimal and performatively stable points by leveraging tools from [6, 7]. The main focus of this work is to propose and analyze online algorithms that can determine performatively stable points in contexts where the loss function and constraint set are revealed sequentially. Since the distributional map $D_{t}$ may be unknown in practice, we then extend our techniques to stochastic methods that only require samples of $z$.

Prior Work: Online (projected) gradient descent methods have been investigated extensively with tools from the controls community; we refer the reader to the representative works [8, 9, 10, 11, 12] and pertinent references therein. Convergence guarantees for online stochastic gradient methods where drift and noise terms satisfy sub-Gaussian assumptions were recently provided in [13]. Online stochastic optimization problems with time-varying distributions are studied in, e.g., [1, 14, 15]. On the other hand, time-varying costs are considered in [16], along with sampling strategies that satisfy regret guarantees. For static optimization problems, the notion of performatively stable points is introduced in [6], where error bounds for risk minimization and gradient descent methods applied to stochastic problems with decision-dependent distributions are provided. Stochastic gradient methods to identify performatively stable points for decision-dependent distributions are studied in [7, 17], with the latter also providing results for an online setting in expectation. A stochastic gradient method for time-invariant distributional maps is presented in [18].

Contributions: We offer the following main contributions. C1) First, we propose an online projected gradient descent (OPGD) method to solve (1), and we show that the tracking error (relative to the performatively stable points) is ultimately bounded by terms that account for the temporal drift of the optimizers. C2) Second, we propose an online stochastic projected gradient descent (OSPGD) and we provide error bounds in expectation and in high probability. Our bounds in high probability are derived by modeling the gradient error as a sub-Weibull random variable [19]: this allows us to capture a variety of sub-cases, including scenarios where the error follows sub-Gaussian and sub-Exponential distributions [20], or any distribution with finite support.

Relative to [1, 14, 15, 16], our distributions are decision-dependent; relative to [6, 7, 17, 18], our costs and distributional maps are time-varying. Moreover, our results do not rely on bias or variance assumptions on the gradient estimator. In the absence of distributional shift and of sub-Weibull errors, our upper bounds reduce to the results of [9]. Relative to [18], we seek performatively stable points rather than the performative optima; in doing so, we incur the error characterized in [6], but we do not restrict the distributional maps to continuous distributions, nor do we require finite-difference approximations. With respect to the available literature on stochastic optimization, we provide for the first time explicit bounds, in expectation and in high probability, for stochastic optimization with decision-dependent distributions and time-varying distributional maps.

The remainder of this paper is organized as follows. Section 2 introduces some preliminaries; Section 3 studies the OPGD, and Section 4 studies the OSPGD. Section 5 illustrates simulation results, and Section 6 concludes the paper.

2 Preliminaries

We first introduce preliminary definitions and results. We consider random variables $z$ that take values on a metric space $(M,d)$, where $M$ is equipped with the Borel $\sigma$-algebra induced by the metric $d$. We assume that $M$ is a complete and separable metric space (hence $M$ is a Polish space). We let $\mathcal{P}(M)$ denote the set of Radon probability measures on $M$ with finite first moment. Given $\nu\in\mathcal{P}(M)$, $z\sim\nu$ denotes that the random variable $z$ is distributed according to $\nu$. By Kantorovich-Rubinstein duality, the Wasserstein-1 distance between $\mu,\nu\in\mathcal{P}(M)$ can be defined as [21]:

$W_{1}(\mu,\nu)=\sup_{g\in\text{Lip}_{1}}\left\{\mathbb{E}_{z\sim\mu}\left[g(z)\right]-\mathbb{E}_{z\sim\nu}\left[g(z)\right]\right\},$   (2)

where $\text{Lip}_{1}$ is the set of 1-Lipschitz functions over $M$. We note that the pair $(\mathcal{P}(M),W_{1})$ is a metric space of probability measures.

Heavy-Tailed Distributions. In this paper, we will utilize the sub-Weibull model [19], introduced next.

Definition 1

(Sub-Weibull Random Variable) A random variable $z$ is sub-Weibull, denoted $z\sim\text{subW}(\theta,\nu)$, if there exist $\theta,\nu>0$ such that $\|z\|_{k}\leq\nu k^{\theta}$ for all $k\geq 1$.

The parameter $\theta$ measures the heaviness of the tail (higher values correspond to heavier tails) and the parameter $\nu$ measures the proxy-variance [19]. In what follows, we will also use the following equivalent characterization of a sub-Weibull random variable: $z\sim\text{subW}(\theta,\nu)$ if and only if there exist $\theta,\nu^{\prime}>0$ such that $\mathbb{P}(|z|\geq\epsilon)\leq 2\exp\left(-(\epsilon/\nu^{\prime})^{1/\theta}\right)$ for all $\epsilon\geq 0$. As shown in [22], the two characterizations are equivalent by choosing $\nu=\left(\frac{\theta}{2e}\right)^{\theta}\nu^{\prime}$. The class of sub-Weibull random variables enjoys the following properties.
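To make Definition 1 concrete, the following minimal Python sketch (ours, with illustrative parameters) draws a Weibull-tailed sample, which satisfies the moment condition of Definition 1 with $\theta=1$, and checks $\|z\|_{k}\leq\nu k^{\theta}$ empirically for a few values of $k$:

```python
import numpy as np

# Empirical check of Definition 1. A Weibull random variable with shape
# parameter 1/theta has tail P(|z| >= x) = exp(-x^(1/theta)), hence it is
# subW(theta, nu) for a suitable proxy-variance nu; theta and nu below
# are illustrative choices, not values from the paper.
rng = np.random.default_rng(0)
theta, nu = 1.0, 2.0                       # theta = 1: sub-Exponential-type tails
z = rng.weibull(1.0 / theta, size=200_000)

for k in (1, 2, 4, 8):
    moment_norm = np.mean(np.abs(z) ** k) ** (1.0 / k)  # ||z||_k
    print(f"k={k}: ||z||_k = {moment_norm:5.2f} <= nu*k^theta = {nu * k ** theta:5.2f}")
```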

Proposition 2.1

(Closure of Sub-Weibull) Let $z\sim\text{subW}(\theta_{1},\nu_{1})$ and $y\sim\text{subW}(\theta_{2},\nu_{2})$ be (possibly coupled) sub-Weibull random variables and let $c\in\mathbb{R}$. Then, the following hold:

  1. $z+y\sim\text{subW}(\max\{\theta_{1},\theta_{2}\},\,\nu_{1}+\nu_{2})$;

  2. $zy\sim\text{subW}(\theta_{1}+\theta_{2},\,\psi(\theta_{1},\theta_{2})\nu_{1}\nu_{2})$, where $\psi(\theta_{1},\theta_{2}):=(\theta_{1}+\theta_{2})^{\theta_{1}+\theta_{2}}/(\theta_{1}^{\theta_{1}}\theta_{2}^{\theta_{2}})$;

  3. $z+c\sim\text{subW}(\theta_{1},\,|c|+\nu_{1})$;

  4. $cz\sim\text{subW}(\theta_{1},\,|c|\nu_{1})$.

Proof 2.2.

Properties 1) and 4) are proved in [19]; property 2) is proved in [23]. To show 3), since $c\in\mathbb{R}$, for any $k\geq 1$ we have $\|c\|_{k}=|c|\leq|c|k^{\theta_{1}}$. It follows that $\|z+c\|_{k}\leq\|z\|_{k}+\|c\|_{k}\leq\nu_{1}k^{\theta_{1}}+|c|k^{\theta_{1}}=(\nu_{1}+|c|)k^{\theta_{1}}$.

3 Online Projected Gradient Descent

In this section, we propose and study an OPGD method to solve (1). In Section 4, we will leverage the results derived in this section to analyze its stochastic counterpart, the OSPGD.

We begin by outlining our main assumptions.

Assumption 1

(Strong Convexity) For each fixed $z\in\mathcal{Z}_{t}$, the map $x\mapsto\ell_{t}(x,z)$ is $\alpha_{t}$-strongly convex, with $\alpha_{t}>0$, for all $t\in\mathbb{N}_{0}$.

Assumption 2

(Joint Smoothness) For all $t\in\mathbb{N}_{0}$, $x\mapsto\nabla_{x}\ell_{t}(x,z)$ is $\beta_{t}$-Lipschitz continuous for all $z\in\mathcal{Z}_{t}$, and $z\mapsto\nabla_{x}\ell_{t}(x,z)$ is $\beta_{t}$-Lipschitz continuous for all $x\in\mathbb{R}^{d}$.

Assumption 3

(Distributional Sensitivity) For all $t\in\mathbb{N}_{0}$, there exists $\varepsilon_{t}>0$ such that

$W_{1}(D_{t}(x),D_{t}(x^{\prime}))\leq\varepsilon_{t}\|x-x^{\prime}\|$   (3)

for any $x,x^{\prime}\in\mathbb{R}^{d}$. $\square$

Assumption 4

(Convex Constraint Set) For all $t\in\mathbb{N}_{0}$, the set $C_{t}$ is closed and convex. $\square$

3.1 Performatively stable points

Since the objective function and the distribution in (1) both depend on the decision variable $x$, the problem (1) is intractable in general, even when the loss is convex. For this reason, we follow the approach of [6, 7] and seek optimization algorithms that can determine performatively stable points, defined as follows:

$\bar{x}_{t}\in\arg\min_{x\in C_{t}}\,\mathbb{E}_{z\sim D_{t}(\bar{x}_{t})}\left[\ell_{t}(x,z)\right].$   (4)

Convergence to a performatively stable point is desirable because it guarantees that $\bar{x}_{t}$ is optimal for the distribution that it induces on $z$. The following result, adapted from [7, Prop. 3.3], establishes existence and uniqueness of performatively stable points.

Lemma 3.1.

(Existence of Performatively Stable Points [7, Prop. 3.3]) Let Assumptions 1-4 hold, and suppose that $\frac{\varepsilon_{t}\beta_{t}}{\alpha_{t}}<1$ for all $t\in\mathbb{N}_{0}$. Then, the sequence of performatively stable points $\{\bar{x}_{t}\}_{t\in\mathbb{N}_{0}}$ exists and is unique.

In general, performatively stable points may not coincide with the optimizers of the original problem (1). However, an explicit error bound can be derived, as formally stated next.

Lemma 3.2.

(Error of Performatively Stable Points [6]) Suppose that the function $z\mapsto\ell_{t}(x,z)$ is $\gamma_{t}$-Lipschitz continuous for all $x\in\mathbb{R}^{d}$ and $t\in\mathbb{N}_{0}$. Then, under the same assumptions of Lemma 3.1, it holds that

$\|\bar{x}_{t}-x_{t}^{*}\|\leq 2\varepsilon_{t}\gamma_{t}\alpha_{t}^{-1}$ for all $t\in\mathbb{N}_{0}$.   (5)

The proof of Lemma 3.2 follows from [6, Thm. 3.5, Thm. 4.3]. In the remainder of this paper, we assume that the assumptions of Lemma 3.1 are satisfied, so that the sequence of performatively stable points is unique. We illustrate the difference between $\bar{x}_{t}$ and $x_{t}^{*}$ in the following example.

Example 3.3.

Consider an instance of (1) where $\ell(x,z)=x^{2}+z$, $C_{t}=\mathbb{R}$, and $D_{t}(x)=\mathcal{N}(\mu_{t}x,\sigma_{t}^{2})$, with $\mu_{t},\sigma_{t}>0$. In this case, the objective can be written in closed form as $\mathbb{E}_{z\sim D_{t}(x)}\left[x^{2}+z\right]=x^{2}+\mu_{t}x$, and thus the unique performatively optimal point is $x_{t}^{*}=-\mu_{t}/2$. To determine the performatively stable point, notice that $\nabla_{x}\ell(x,z)=2x$, and thus $\bar{x}_{t}$ satisfies $\mathbb{E}_{z\sim D_{t}(\bar{x}_{t})}\left[2\bar{x}_{t}\right]=0$, which implies $\bar{x}_{t}=0$. The bound in (5) then holds by noting that $\varepsilon_{t}=\mu_{t}$, $\gamma_{t}=1$, and $\alpha_{t}=2$. $\square$
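A short numerical companion to Example 3.3 (ours; the Monte Carlo grid search below is only an illustration, not part of the paper's method) estimates the performative risk on a grid and recovers $x_{t}^{*}=-\mu_{t}/2$, while the performatively stable point is $\bar{x}_{t}=0$:

```python
import numpy as np

# Monte Carlo companion to Example 3.3: l(x, z) = x^2 + z with
# D_t(x) = N(mu*x, 1). The performative risk is x^2 + mu*x, minimized at
# x* = -mu/2, while the performatively stable point is xbar = 0 (it solves
# E_{z ~ D(xbar)}[2*xbar] = 0). The value of mu below is illustrative.
rng = np.random.default_rng(1)
mu, n = 1.5, 20_000

xs = np.linspace(-2.0, 2.0, 81)
# Estimate the performative risk at each candidate decision x.
risk = [np.mean(x**2 + rng.normal(mu * x, 1.0, size=n)) for x in xs]
x_star = xs[int(np.argmin(risk))]

print(f"estimated performative optimum: {x_star:+.2f} (theory: {-mu/2:+.2f})")
print("performatively stable point: 0")
```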

3.2 Online projected gradient descent

We now propose an OPGD method that seeks to track the trajectory of the performatively stable points $\{\bar{x}_{t}\}_{t\in\mathbb{N}_{0}}$. To this end, in what follows we adopt the notation:

$f_{t}(x,\nu):=\mathbb{E}_{z\sim\nu}\left[\ell_{t}(x,z)\right],$   (6)

for any $x\in\mathbb{R}^{d}$, $\nu\in\mathcal{P}(M)$, and $t\in\mathbb{N}_{0}$. When $\nu$ is the distribution induced by a decision $y$, namely $\nu=D_{t}(y)$, we will use the notation $f_{t}(x,D_{t}(y))$. Moreover, we denote by $\nabla f_{t}(x,\nu)$ the gradient of $x\mapsto f_{t}(x,\nu)$ (we also note that, by the dominated convergence theorem, the expectation and gradient operators can be interchanged).

The OPGD amounts to the following step at each $t\in\mathbb{N}_{0}$:

$x_{t+1}=G_{t}(x_{t},D_{t}(x_{t})),$   (7)

where $G_{t}(x,\nu):=\operatorname{proj}_{C_{t}}\left(x-\eta_{t}\nabla f_{t}(x,\nu)\right)$, with $\eta_{t}>0$ denoting a stepsize.

First, we note that a performatively stable point is a fixed point of the algorithmic map (7), namely, $\bar{x}_{t}=G_{t}(\bar{x}_{t},D_{t}(\bar{x}_{t}))$. Next, we focus on characterizing the error between the updates (7) and the performatively stable points $\{\bar{x}_{t}\}_{t\in\mathbb{N}_{0}}$. To this aim, we denote by $\varphi_{t}:=\|\bar{x}_{t+1}-\bar{x}_{t}\|$ the temporal drift of the performatively stable points, and by $e_{t}:=\|x_{t}-\bar{x}_{t}\|$ the tracking error relative to the performatively stable points. Our error bound for OPGD is presented next.
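For concreteness, the following is a minimal Python sketch of the update (7), using Example 3.3 extended to $d$ dimensions; the ball constraint, the loss $\ell(x,z)=\|x\|^{2}+\mathbf{1}^{\top}z$, and all parameter values are our illustrative assumptions, with the exact gradient $\nabla f_{t}(x,D_{t}(x))=2x$ available in closed form:

```python
import numpy as np

# Minimal sketch of the OPGD update (7) for an illustrative problem:
# l(x, z) = ||x||^2 + <1, z>, D_t(x) = N(mu*x, I), so that the exact
# gradient of the performative risk at the deployed point is 2x and the
# performatively stable point is xbar_t = 0.

def proj_ball(y, radius):
    """Euclidean projection onto {x : ||x|| <= radius}."""
    norm = np.linalg.norm(y)
    return y if norm <= radius else (radius / norm) * y

d, radius, eta = 5, 10.0, 0.3
x = np.full(d, 3.0)                        # initial decision x_0

for t in range(50):
    grad = 2.0 * x                         # grad f_t(x_t, D_t(x_t))
    x = proj_ball(x - eta * grad, radius)  # update (7)

print("tracking error e_t = ||x_t - xbar_t||:", np.linalg.norm(x))
```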

Theorem 3.4.

(Tracking Error of OPGD) Let Assumptions 1-4 hold, suppose that $\frac{\varepsilon_{t}\beta_{t}}{\alpha_{t}}<1$ for all $t\in\mathbb{N}_{0}$, and let $\{x_{t}\}_{t\in\mathbb{N}_{0}}$ denote a sequence generated by (7). Then, for all $t\in\mathbb{N}_{0}$, the error $e_{t}=\|x_{t}-\bar{x}_{t}\|$ satisfies:

$e_{t+1}\leq a_{t}e_{0}+\sum_{i=0}^{t}b_{i}\varphi_{i},$   (8)

where $a_{t}:=\prod_{i=0}^{t}(\rho_{i}+\eta_{i}\beta_{i}\varepsilon_{i})$,

$b_{i}:=\begin{cases}1&\text{if }i=t,\\ \prod_{k=i+1}^{t}(\rho_{k}+\eta_{k}\beta_{k}\varepsilon_{k})&\text{if }i\neq t,\end{cases}$

and $\rho_{t}:=\max\{|1-\eta_{t}\alpha_{t}|,|1-\eta_{t}\beta_{t}|\}$. Moreover, if

$\eta_{t}\in\left[\frac{1-r}{\alpha_{t}+\beta_{t}\varepsilon_{t}},\ \frac{1+r}{\beta_{t}(1+\varepsilon_{t})}\right]\ \text{ for all }t\in\mathbb{N}_{0},$   (9)

for some $r\in(0,1)$, then $\tilde{\lambda}:=\sup_{t\geq 0}\{\rho_{t}+\eta_{t}\beta_{t}\varepsilon_{t}\}\leq r$ and

$\limsup_{t\rightarrow+\infty}e_{t}\leq(1-\tilde{\lambda})^{-1}\tilde{\varphi},$   (10)

where $\tilde{\varphi}:=\sup_{t\geq 0}\{\varphi_{t}\}$.

Before presenting the proof, some remarks are in order.

Remark 3.5.

By application of Lemma 3.2, OPGD guarantees that the error between the algorithmic updates and the performatively optimal points is bounded at all times. Precisely, the following estimate holds: $\limsup_{t\rightarrow+\infty}\|x_{t}-x_{t}^{*}\|\leq(1-\tilde{\lambda})^{-1}\tilde{\varphi}+2\sup_{t\geq 0}\{\varepsilon_{t}\gamma_{t}\alpha_{t}^{-1}\}$. $\square$

Remark 3.6.

When (9) holds, one can write the bound $e_{t+1}\leq a_{t}e_{0}+(1-\tilde{\lambda})^{-1}\sup_{i}\{\varphi_{i}\}$; this is an exponential input-to-state stability (E-ISS) result [24], where $\{\bar{x}_{t}\}$ are the equilibria of (7) and $\varphi_{i}$ is treated as a disturbance. ISS implies that $e_{t}$ is ultimately bounded as in (10). $\square$

Next, we present the proof of Theorem 3.4. The following lemmas are instrumental.

Lemma 3.7.

(Gradient Deviations) Under Assumption 2, for any $t\in\mathbb{N}_{0}$, $x\in\mathbb{R}^{d}$, and measures $\mu,\nu\in\mathcal{P}(M)$, the following bound holds:

$\|\nabla f_{t}(x,\mu)-\nabla f_{t}(x,\nu)\|\leq\beta_{t}W_{1}(\mu,\nu).$   (11)
Lemma 3.8.

(Contractive Map) Let Assumptions 1-2 and 4 hold. For any $\nu\in\mathcal{P}(M)$, the map $x\mapsto G_{t}(x,\nu)$ is Lipschitz continuous, namely, for any $x,y\in\mathbb{R}^{d}$:

$\|G_{t}(x,\nu)-G_{t}(y,\nu)\|\leq\rho_{t}\|x-y\|,$   (12)

where $\rho_{t}=\max\{|1-\eta_{t}\alpha_{t}|,|1-\eta_{t}\beta_{t}|\}$. Moreover, if $\rho_{t}<1$ for all $t\in\mathbb{N}_{0}$, then $\bar{x}_{t}$ is the unique fixed point of (7).

The proof of Lemma 3.7 follows by iterating the reasoning in [7, Lemma 2.1] for all $t\in\mathbb{N}_{0}$; the proof of Lemma 3.8 is standard and is omitted due to space limitations.

Proof of Theorem 3.4: Note that $x_{t}\in C_{t}$ for all $t\in\mathbb{N}_{0}$ follows directly from the definition of the Euclidean projection. By the triangle inequality, we find that

$e_{t+1} \leq \|x_{t+1}-\bar{x}_{t}\|+\|\bar{x}_{t}-\bar{x}_{t+1}\|$
$\qquad = \|G_{t}(x_{t},D_{t}(x_{t}))-G_{t}(\bar{x}_{t},D_{t}(\bar{x}_{t}))\|+\varphi_{t}$
$\qquad \leq \|G_{t}(x_{t},D_{t}(x_{t}))-G_{t}(x_{t},D_{t}(\bar{x}_{t}))\|$
$\qquad\quad +\|G_{t}(x_{t},D_{t}(\bar{x}_{t}))-G_{t}(\bar{x}_{t},D_{t}(\bar{x}_{t}))\|+\varphi_{t},$

where the identity follows from the definition of $G_{t}(\cdot,\cdot)$, and the second inequality follows by adding and subtracting $G_{t}(x_{t},D_{t}(\bar{x}_{t}))$. Applying (11) and Lemma 3.8 yields:

$e_{t+1} \leq \eta_{t}\|\nabla f_{t}(x_{t},D_{t}(x_{t}))-\nabla f_{t}(x_{t},D_{t}(\bar{x}_{t}))\|$
$\qquad\quad +\|G_{t}(x_{t},D_{t}(\bar{x}_{t}))-G_{t}(\bar{x}_{t},D_{t}(\bar{x}_{t}))\|+\varphi_{t}$
$\qquad \leq \eta_{t}\beta_{t}W_{1}(D_{t}(x_{t}),D_{t}(\bar{x}_{t}))+\rho_{t}e_{t}+\varphi_{t}$
$\qquad \leq \eta_{t}\beta_{t}\varepsilon_{t}e_{t}+\rho_{t}e_{t}+\varphi_{t}$
$\qquad = (\rho_{t}+\eta_{t}\beta_{t}\varepsilon_{t})e_{t}+\varphi_{t}.$   (13)

Thus we obtain the following by expanding the recursion:

$e_{t+1} \leq \left(\prod_{i=0}^{t}\lambda_{i}\right)e_{0}+\varphi_{t}+\sum_{i=0}^{t-1}\left(\prod_{k=i+1}^{t}\lambda_{k}\right)\varphi_{i},$

where we defined $\lambda_{t}:=\rho_{t}+\eta_{t}\beta_{t}\varepsilon_{t}$. The bound (8) then follows by the definition of the sequences $\{a_{t}\}$ and $\{b_{i}\}$.

To prove (10), we show that $\sup_{t}\lambda_{t}<1$ for an appropriate choice of $\eta_{t}$. Fix $r\in(0,1)$. Then, by the definition of $\rho_{t}$, $\lambda_{t}\leq r$ holds if the following two conditions are satisfied simultaneously:

$|1-\eta_{t}\alpha_{t}|+\eta_{t}\beta_{t}\varepsilon_{t}\leq r,\quad\text{and}\quad|1-\eta_{t}\beta_{t}|+\eta_{t}\beta_{t}\varepsilon_{t}\leq r.$   (14)

The first inequality holds if and only if $-r+\eta_{t}\beta_{t}\varepsilon_{t}\leq 1-\eta_{t}\alpha_{t}\leq r-\eta_{t}\beta_{t}\varepsilon_{t}$ or, equivalently, $1-r\leq\eta_{t}(\alpha_{t}+\beta_{t}\varepsilon_{t})\leq 1+r$. The second inequality holds if and only if $-r+\eta_{t}\beta_{t}\varepsilon_{t}\leq 1-\eta_{t}\beta_{t}\leq r-\eta_{t}\beta_{t}\varepsilon_{t}$. By using $\alpha_{t}\leq\beta_{t}$, both inequalities are satisfied when

$\frac{1-r}{\beta_{t}(1+\varepsilon_{t})}\leq\frac{1-r}{\alpha_{t}+\beta_{t}\varepsilon_{t}}\leq\eta_{t}\leq\frac{1+r}{\beta_{t}(1+\varepsilon_{t})}\leq\frac{1+r}{\alpha_{t}+\beta_{t}\varepsilon_{t}}.$

Thus, to satisfy both conditions, it is sufficient to enforce $\eta_{t}\in\left[\frac{1-r}{\alpha_{t}+\beta_{t}\varepsilon_{t}},\ \frac{1+r}{\beta_{t}(1+\varepsilon_{t})}\right]$. The result (10) then follows by summing the geometric series. $\blacksquare$
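As a side note, the admissible interval (9) is immediate to evaluate numerically; a minimal Python sketch with illustrative constants (our choices, constant in $t$):

```python
# Evaluate the admissible fixed-stepsize interval (9) for illustrative
# constants alpha_t = alpha, beta_t = beta, eps_t = eps, and target rate r.
alpha, beta, eps, r = 1.0, 2.0, 0.3, 0.9
assert eps * beta / alpha < 1.0              # standing assumption of Theorem 3.4
lo = (1.0 - r) / (alpha + beta * eps)
hi = (1.0 + r) / (beta * (1.0 + eps))
print(f"eta in [{lo:.3f}, {hi:.3f}]")
```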

Finally, we observe that when the objective and constraints are time-invariant, we recover the result of [7, Sec. 5] as formalized next.

Corollary 3.9.

(Tracking Error of OPGD for Time-Invariant Problems) If problem (1) is time-invariant and the assumptions in Theorem 3.4 hold, then OPGD with fixed stepsize $\eta\in(0,2/(\beta(1+\varepsilon)))$ converges linearly to the performatively stable point.

Proof 3.10.

When (1) is time-invariant, then for all $t\in\mathbb{N}_{0}$ we have $\alpha_{t}=\alpha$, $\beta_{t}=\beta$, $\varepsilon_{t}=\varepsilon$, $\rho_{t}=\rho$, and $\varphi_{t}=0$. Accordingly, the recursion (8) yields $e_{t+1}\leq\lambda e_{t}$ with $\lambda=\rho+\eta\beta\varepsilon$. By taking (14) with strict inequalities and $r=1$, we conclude that $\eta<2/(\beta(1+\varepsilon))$ implies $\lambda<1$. Hence $e_{t+1}/e_{t}\leq\lambda<1$.

4 Online Stochastic Gradient Descent

An exact expression for the distributional map $x\mapsto D_{t}(x)$ may not be available in general and, even when it is, evaluating the gradient may be computationally burdensome. We therefore consider the case where, at each time step $t$, we have access to a finite number of samples of $z_{t}$ to estimate the gradient $\nabla f_{t}(x_{t},D_{t}(x_{t}))$. For example, given a mini-batch of samples $\{\hat{z}_{t}^{i}\}_{i=1}^{N_{t}}$ of $z_{t}$, the approximate gradient is computed as $g_{t}(x_{t})=(1/N_{t})\sum_{i=1}^{N_{t}}\nabla\ell_{t}(x_{t},\hat{z}_{t}^{i})$; when $N_{t}=1$ we have a "greedy" estimate and when $N_{t}>1$ we have a "lazy" estimate [17]. Accordingly, we consider an OSPGD described by:

$x_{t+1}=\hat{G}_{t}(x_{t}),\qquad\hat{G}_{t}(x):=\operatorname{proj}_{C_{t}}\left(x-\eta_{t}g_{t}(x)\right),$   (15)

where $\eta_{t}>0$ is a stepsize. In the remainder, we focus on finding error bounds for OSPGD in the spirit of Theorem 3.4.
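A minimal sketch of the update (15) with the mini-batch estimator above follows; the quadratic loss $\ell(x,z)=\frac{1}{2}\|x-z\|^{2}$ and the Gaussian distributional map are our illustrative choices (selected so that the sampled gradient actually depends on $z$), not the paper's application:

```python
import numpy as np

# Minimal sketch of the OSPGD update (15). Illustrative problem:
# l(x, z) = 0.5 * ||x - z||^2 with D_t(x) = N(mu*x, I), so that
# alpha_t = beta_t = 1 and eps_t = mu; eps*beta/alpha < 1 requires mu < 1.
rng = np.random.default_rng(2)

def proj_ball(y, radius):
    """Euclidean projection onto {x : ||x|| <= radius}."""
    norm = np.linalg.norm(y)
    return y if norm <= radius else (radius / norm) * y

d, mu, eta, radius = 5, 0.5, 0.3, 10.0
N_t = 10                                   # N_t = 1: "greedy"; N_t > 1: "lazy"
x = np.full(d, 3.0)                        # initial decision x_0

for t in range(50):
    batch = rng.normal(mu * x, 1.0, size=(N_t, d))  # samples z ~ D_t(x_t)
    g = x - batch.mean(axis=0)                      # mini-batch estimate g_t(x_t)
    x = proj_ball(x - eta * g, radius)              # update (15)

print("tracking error e_t:", np.linalg.norm(x))     # xbar_t = 0 here
```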

4.1 Bounds in expectation and high-probability

Throughout our analysis, we interpret OSPGD as an inexact OPGD with gradient error given by the random variable:

$\xi_{t}:=\|g_{t}(x_{t})-\nabla f_{t}(x_{t},D_{t}(x_{t}))\|.$   (16)

We make the following assumption.

Assumption 5

(Sub-Weibull Error) The gradient error $\xi_{t}$ is sub-Weibull; i.e., $\xi_{t}\sim\text{subW}(\theta,\nu_{t})$ for some $\theta,\nu_{t}>0$.

Assumption 5 allows us to describe a variety of sub-cases, including scenarios where the error follows sub-Gaussian and sub-Exponential distributions [20], or any distribution with finite support. Further, notice that Assumption 5 does not require the random variables $\{\xi_{t}\}_{t\in\mathbb{N}_{0}}$ to be independent. Examples of random variables that satisfy Assumption 5 are described in Section 4.2. Our error bounds for OSPGD are presented next.

Theorem 4.1.

(Expected and High-Probability Bounds for OSPGD) Let Assumptions 1-4 hold, and suppose that $\frac{\varepsilon_{t}\beta_{t}}{\alpha_{t}}<1$ for all $t\in\mathbb{N}_{0}$. Recall that $e_{t}=\|x_{t}-\bar{x}_{t}\|$. Then, the following estimates hold for (15):

  1. For all $t\in\mathbb{N}$,

     $\mathbb{E}\left[e_{t+1}\right]\leq a_{t}e_{0}+\sum_{i=0}^{t}b_{i}(\varphi_{i}+\eta_{i}\mathbb{E}[\xi_{i}]).$   (17)

  2. If, additionally, Assumption 5 holds and $\delta\in(0,1)$, then with probability at least $1-\delta$:

     $e_{t+1}\leq\left(\frac{2e}{\theta}\right)^{\theta}\log^{\theta}\left(\frac{2}{\delta}\right)\big(a_{t}e_{0}+\sum_{i=0}^{t}b_{i}(\varphi_{i}+\eta_{i}\nu_{i})\big),$   (18)

where $\{a_{t}\}$ and $\{b_{i}\}$ are as in Theorem 3.4.

Proof 4.2.

Note that $x_{t}\in C_{t}$ for all $t\in\mathbb{N}$ follows directly from the definition of the Euclidean projection. To show (17), we first derive a stochastic recursion. By the triangle inequality:

$e_{t+1} \leq \|\hat{G}_{t}(x_{t})-G_{t}(\bar{x}_{t},D_{t}(\bar{x}_{t}))\|+\varphi_{t}$
$\qquad \leq \|\hat{G}_{t}(x_{t})-G_{t}(x_{t},D_{t}(x_{t}))\|$
$\qquad\quad +\|G_{t}(x_{t},D_{t}(x_{t}))-G_{t}(\bar{x}_{t},D_{t}(\bar{x}_{t}))\|+\varphi_{t},$

where the second inequality follows by adding and subtracting $G_{t}(x_{t},D_{t}(x_{t}))$. By repeating the steps leading to (13), we have $\|G_{t}(x_{t},D_{t}(x_{t}))-G_{t}(\bar{x}_{t},D_{t}(\bar{x}_{t}))\|\leq\lambda_{t}e_{t}$, where $\lambda_{t}:=\rho_{t}+\eta_{t}\beta_{t}\varepsilon_{t}$, and thus $e_{t+1}\leq\eta_{t}\|g_{t}(x_{t})-\nabla f_{t}(x_{t},D_{t}(x_{t}))\|+\lambda_{t}e_{t}+\varphi_{t}$. This yields the stochastic recursion $e_{t+1}\leq\lambda_{t}e_{t}+\varphi_{t}+\eta_{t}\xi_{t}$. Expanding the recursion yields

$e_{t+1} \leq \left(\prod_{i=0}^{t}\lambda_{i}\right)e_{0}+(\varphi_{t}+\eta_{t}\xi_{t})+\sum_{i=0}^{t-1}\left(\prod_{k=i+1}^{t}\lambda_{k}\right)(\varphi_{i}+\eta_{i}\xi_{i}),$

or, equivalently,

$e_{t+1}\leq a_{t}e_{0}+\sum_{i=0}^{t}b_{i}(\varphi_{i}+\eta_{i}\xi_{i}).$   (19)

Thus, (17) follows by taking the expectation on both sides.

To prove (18), we show that the right-hand side of (19) is sub-Weibull. Since $\xi_{i}\sim\text{subW}(\theta,\nu_{i})$, Proposition 2.1 implies that $b_{i}(\varphi_{i}+\eta_{i}\xi_{i})\sim\text{subW}\left(\theta,b_{i}(\varphi_{i}+\eta_{i}\nu_{i})\right)$. By summing over $i$, we obtain:

$\sum_{i=0}^{t}b_{i}(\varphi_{i}+\eta_{i}\xi_{i})\sim\text{subW}\left(\theta,\ \sum_{i=0}^{t}b_{i}(\varphi_{i}+\eta_{i}\nu_{i})\right).$

Thus, letting $\omega_{t}:=a_{t}e_{0}+\sum_{i=0}^{t}b_{i}(\varphi_{i}+\eta_{i}\xi_{i})$, we conclude that $\omega_{t}\sim\text{subW}\left(\theta,\upsilon_{t}\right)$, where $\upsilon_{t}=a_{t}e_{0}+\sum_{i=0}^{t}b_{i}(\varphi_{i}+\eta_{i}\nu_{i})$. From Definition 1, we have

$\mathbb{P}(|\omega_{t}|\geq\epsilon)\leq 2\exp\left(-\frac{\theta}{2e}\left(\frac{\epsilon}{\upsilon_{t}}\right)^{\frac{1}{\theta}}\right).$   (20)

Now fix $\delta\in(0,1)$ and set the right-hand side of (20) equal to $\delta$. Solving for $\epsilon$ yields $\epsilon=\left(\frac{2e}{\theta}\right)^{\theta}\log^{\theta}\left(\frac{2}{\delta}\right)\upsilon_{t}$. Hence, $\omega_{t}\leq\left(\frac{2e}{\theta}\right)^{\theta}\log^{\theta}\left(\frac{2}{\delta}\right)\upsilon_{t}$ with probability at least $1-\delta$. Finally, (18) follows by substitution.

The bound (17) generalizes the estimate in Theorem 3.4 by accounting for the gradient error. It is also worth pointing out that (17) and (18) have a similar structure; indeed, (18) differs only by a logarithmic factor and by the introduction of the tail parameters $\nu_{i}$ (which replace the expectation terms).

Remark 4.3.

An alternative high-probability bound can be obtained from (17) via Markov's inequality. For any $\delta\in(0,1)$, Markov's inequality guarantees that:

$e_{t+1}\leq\frac{1}{\delta}\big(a_{t}e_{0}+\sum_{i=0}^{t}b_{i}(\varphi_{i}+\eta_{i}\mathbb{E}[\xi_{i}])\big),$   (21)

with probability at least $1-\delta$. However, as we increase the confidence of the bound by letting $\delta\rightarrow 0$, the right-hand side of (21) grows more rapidly than that of (18). $\square$
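The contrast in Remark 4.3 is easy to quantify numerically; a small sketch (ours, with $\theta=1/2$ as an illustrative choice) compares the confidence multiplier $(2e/\theta)^{\theta}\log^{\theta}(2/\delta)$ in (18) with the factor $1/\delta$ in (21):

```python
import numpy as np

# Confidence multipliers in (18) vs. (21): the sub-Weibull factor grows
# polylogarithmically in 1/delta, while Markov's factor grows linearly.
theta = 0.5                                 # sub-Gaussian errors, for illustration
for delta in (1e-1, 1e-3, 1e-6):
    subw = (2.0 * np.e / theta) ** theta * np.log(2.0 / delta) ** theta
    print(f"delta={delta:.0e}: sub-Weibull factor {subw:7.2f}, Markov factor {1.0/delta:10.0f}")
```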

Note that the bounds in Theorem 4.1 are valid for any $t\in\mathbb{N}$. The asymptotic behavior is characterized in the next remark.

Remark 4.4.

If (9) holds, then $\limsup_{t\rightarrow+\infty}e_{t}\leq(1-\tilde{\lambda})^{-1}(\tilde{\varphi}+\tilde{\eta}\tilde{\xi})$ almost surely, where $\tilde{\eta}$ and $\tilde{\xi}$ are upper bounds on the stepsizes and on $\mathbb{E}[\xi_{t}]$, respectively; the proof is omitted due to space limits, but follows arguments similar to [23]. $\square$

4.2 Remarks on the error model

The class of sub-Weibull distributions allows one to consider a variety of error models. For instance, it includes sub-Gaussian and sub-Exponential distributions as sub-cases by setting $\theta=1/2$ and $\theta=1$, respectively. We notice that a sub-Gaussian assumption is typically utilized in prior works on stochastic gradient descent; for example, the assumption $\mathbb{E}[\exp(\xi^{2}/\sigma^{2})]\leq e$ in [25] corresponds to sub-Gaussian tail behavior. However, recent works suggest that stochastic gradient descent may exhibit errors with tails that are heavier than sub-Gaussian (see, e.g., [26]). To further illustrate the flexibility of the sub-Weibull model, we provide the following additional examples.

Example 4.5.

Suppose that each entry $i=1,\ldots,d$ of the gradient error $g_{t}(x_{t})-\nabla f_{t}(x_{t},D_{t}(x_{t}))$ follows a distribution $\text{subW}(\theta,\nu)$ for given $\theta,\nu>0$. Then $\xi_{t}\sim\text{subW}(\theta,2^{\theta}\sqrt{d}\nu)$ [23]. $\square$

Example 4.6.

Suppose that an entry of the gradient error $g_{t}(x_{t})-\nabla f_{t}(x_{t},D_{t}(x_{t}))$ is Gaussian with zero mean and variance $\varsigma^{2}$; then, it is sub-Gaussian with sub-Gaussian norm $C\varsigma$, with $C$ an absolute constant [20], and it is therefore sub-Weibull, $\text{subW}(1/2,C^{\prime}\varsigma)$, with $C^{\prime}$ an absolute constant. $\square$

Example 4.7.

Suppose that $\xi_{t}$ is a random variable with mean $\mu_{t}:=\mathbb{E}[\xi_{t}]$, such that $\xi_{t}\in[\underline{\xi},\bar{\xi}]$ almost surely. Then $\xi_{t}-\mu_{t}\sim\text{subW}(1/2,(\bar{\xi}-\underline{\xi})/\sqrt{2})$ [23]. $\square$

5 Application to Electric Vehicle Charging

This section illustrates the use of the proposed algorithms in an application inspired by [3], where the operator of a fleet of electric vehicles (EVs) seeks to determine an optimal charging policy that minimizes its charging costs. The region of interest is modeled as a graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$, where each node in $\mathcal{V}$ represents a charging station (or a group thereof), and an edge $(i,j)\in\mathcal{E}$ allows vehicles to transfer from node $i$ to node $j$. We assume that the graph is strongly connected, so that EVs can be redirected from any node to any other node. We let $x_{i}\in\mathbb{R}_{\geq 0}$ denote the energy requested by the fleet at node $i\in\mathcal{V}$. We assume that the net energy available is limited, and define the set $C_{t}:=\{x\in\mathbb{R}^{d}:\sum_{i\in\mathcal{V}}x_{i}\leq c_{t}\}$ for a given $c_{t}>0$. Given $\{x_{i}\}$, the operator of the power grid strategically chooses a price per unit of energy so as to optimize its revenue from selling energy; we let $z_{i}\in\mathbb{R}_{\geq 0}$ denote the selected price in region $i$, and we hypothesize, as an example, that $z_{i}\sim\mathcal{N}(\mu_{t}x_{i},\sigma_{t}^{2})$, with $\mu_{t},\sigma_{t}\in\mathbb{R}_{\geq 0}$. We note that, although the grid operator could choose an arbitrarily large price to maximize its revenue, large prices may compel the fleet operator to withdraw its demand, thus motivating the use of a model where the mean price grows linearly with the energy demand. Accordingly, we model the cost function of the EV operator as follows [3]:

$\ell_{t}(x,z)=\sum_{i\in\mathcal{V}}z_{i}x_{i}-\gamma_{i,t}x_{i}+\kappa_{i,t}x_{i}^{2},$   (22)

where $\gamma_{i,t}\in\mathbb{R}_{\geq 0}$ models the charging aggressiveness of the fleet operator, and $\kappa_{i,t}x_{i}^{2}$ quantifies the satisfaction the fleet operator achieves from consuming one unit of energy. In (22), the term $z_{i}x_{i}$ describes the charging cost at station $i$, and the quantity $\gamma_{i,t}x_{i}$ models the energy demand at the $i$-th station. Notice that, because the displacement of vehicles can change over time, we assume that the parameters $\gamma_{i,t}$ and $\kappa_{i,t}$ are time-dependent. We note that: (i) because of the capacity constraint $x_{t}\in C_{t}$, the decision variables $x_{i,t}$, $i\in\mathcal{V}$, are coupled; and (ii) although (22) is separable and the optimization could be solved in a distributed fashion, our focus is on solving it in a centralized way, since the EV operator is unique.

We apply the proposed methods to a system of 10 homogeneous charging stations over 100 time steps with fixed net energy ($c_{t}=10$). Namely, $\gamma_{i,t}=-|t-50|/100+1$ and $\kappa_{i,t}=2$ for $i\in\{1,\ldots,10\}$. The charging price distribution is parametrized by $\mu_{t}$ and $\sigma_{t}$; in our case, $\mu_{t}$ is the time-series data of CAISO real-time prices shown in Fig. 1 (taken from http://www.energyonline.com) and $\sigma_{t}=1$. Given these parameter values, the cost is $\alpha_{t}$-strongly convex and $\beta_{t}$-jointly smooth with $\alpha_{t}=\beta_{t}=2$. Following the results in [27], the distributional maps are $\varepsilon_{t}$-sensitive with $\varepsilon_{t}=\mu_{t}$. The sequence of performatively stable points is computed in closed form by solving the KKT conditions.
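A condensed sketch of this experiment follows; the synthetic price series standing in for the CAISO data, the projection routine, and all constants are our simplifying assumptions (for homogeneous stations, an interior stable point per station satisfies the stationarity condition $(\mu_{t}+2\kappa)\bar{x}_{i}=\gamma_{i,t}$):

```python
import numpy as np

# Condensed sketch of the EV charging experiment. The price-series stand-in
# for mu_t and all constants are illustrative (the paper uses CAISO data).

def proj_capped_simplex(y, c):
    """Euclidean projection onto C_t = {x >= 0, sum(x) <= c} (sort-based)."""
    p = np.maximum(y, 0.0)
    if p.sum() <= c:
        return p
    u = np.sort(y)[::-1]                    # otherwise, project onto the
    css = np.cumsum(u) - c                  # simplex {x >= 0, sum(x) = c}
    idx = np.nonzero(u - css / np.arange(1, y.size + 1) > 0)[0][-1]
    tau = css[idx] / (idx + 1.0)
    return np.maximum(y - tau, 0.0)

d, c, eta, kappa, T = 10, 10.0, 0.3, 2.0, 100
mu = 0.5 + 0.1 * np.sin(np.linspace(0.0, 2.0 * np.pi, T))  # stand-in prices

x = np.zeros(d)
for t in range(T):
    gamma = -abs(t - 50) / 100.0 + 1.0      # gamma_{i,t}, equal across stations
    # grad f_t(x, D_t(x)) per station: E_{z~D_t(x)}[z_i] - gamma + 2*kappa*x_i.
    grad = (mu[t] + 2.0 * kappa) * x - gamma
    x = proj_capped_simplex(x - eta * grad, c)
    xbar = gamma / (mu[t] + 2.0 * kappa)    # interior stable point per station
    e_t = np.linalg.norm(x - xbar * np.ones(d))

print("final tracking error:", e_t)
```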

Figure 1: Time-series data representing the price of energy in dollars per kilowatt-hour (kWh). Each time step represents 5 minutes.

Figure 2: Performance comparison of OPGD and OSPGD.

For each experiment, we run OPGD and OSPGD with fixed stepsize $\eta_{t}=0.3$, drawing the initial state $x_{0}$ uniformly from a sphere of radius 5. For OSPGD, we compute the mean tracking error for both greedy and lazy deployments. The mean tracking error is computed via Monte Carlo simulation using 1,000 realizations of the initial state.

In Fig. 2, we illustrate the tracking errors and the corresponding upper bounds from Theorems 3.4 and 4.1. "True" (i.e., true gradient) refers to OPGD, "greedy" to OSPGD with $N_{t}=1$, and "lazy" to OSPGD with $N_{t}=10$. We notice that the upper bound curves mimic the evolution of the tracking errors; however, for OSPGD the bounds are looser than for OPGD.

6 Conclusions

This paper considered online gradient and stochastic gradient methods for tracking solutions of time-varying stochastic optimization problems with decision-dependent distributions. Under a distributional sensitivity assumption, we derived explicit error bounds for the two methods. In particular, we derived convergence in expectation and in high probability for the OSPGD by assuming that the error in the gradient follows a sub-Weibull distribution. To the best of our knowledge, our convergence results for online gradient methods are the first in the literature for time-varying stochastic optimization problems with decision-dependent distributions.

References

  • [1] C. Wilson, Y. Bu, and V. V. Veeravalli, “Adaptive sequential machine learning,” Sequential Analysis, vol. 38, no. 4, pp. 545–568, 2019.
  • [2] J. L. Mathieu, D. S. Callaway, and S. Kiliccote, “Examining uncertainty in demand response baseline models and variability in automated responses to dynamic pricing,” in IEEE Conference on Decision and Control, 2011, pp. 4332–4339.
  • [3] W. Tushar, W. Saad, H. V. Poor, and D. B. Smith, “Economics of electric vehicle charging: A game theoretic approach,” IEEE Transactions on Smart Grid, vol. 3, no. 4, pp. 1767–1778, 2012.
  • [4] A. Hauswirth, S. Bolognani, G. Hug, and F. Dörfler, “Optimization algorithms as robust feedback controllers,” arXiv preprint arXiv:2103.11329, 2021.
  • [5] G. Bianchin, M. Vaquero, J. Cortes, and E. Dall’Anese, “Online stochastic optimization for unknown linear systems: Data-driven synthesis and controller analysis,” arXiv preprint arXiv:2108.13040, 2021.
  • [6] J. Perdomo, T. Zrnic, C. Mendler-Dünner, and M. Hardt, “Performative prediction,” in International Conference on Machine Learning. PMLR, 2020, pp. 7599–7609.
  • [7] D. Drusvyatskiy and L. Xiao, “Stochastic optimization with decision-dependent distributions,” arXiv preprint arXiv:2011.11173, 2020.
  • [8] A. Y. Popkov, “Gradient methods for nonstationary unconstrained optimization problems,” Automation and Remote Control, vol. 66, no. 6, pp. 883–891, 2005.
  • [9] D. D. Selvaratnam, I. Shames, J. H. Manton, and M. Zamani, “Numerical optimisation of time-varying strongly convex functions subject to time-varying constraints,” in IEEE Conference on Decision and Control, 2018, pp. 849–854.
  • [10] A. Mokhtari, S. Shahrampour, A. Jadbabaie, and A. Ribeiro, “Online optimization in dynamic environments: Improved regret rates for strongly convex problems,” in IEEE Conference on Decision and Control, 2016, pp. 7195–7201.
  • [11] A. Simonetto, “Time-varying convex optimization via time-varying averaged operators,” arXiv preprint arXiv:1704.07338, 2017.
  • [12] L. Madden, S. Becker, and E. Dall’Anese, “Bounds for the tracking error of first-order online optimization methods,” Journal of Optimization Theory and Applications, vol. 189, no. 2, pp. 437–457, 2021.
  • [13] J. Cutler, D. Drusvyatskiy, and Z. Harchaoui, “Stochastic optimization under time drift: iterate averaging, step decay, and high probability guarantees,” arXiv preprint arXiv:2108.07356, 2021.
  • [14] X. Cao, J. Zhang, and H. V. Poor, “Online stochastic optimization with time-varying distributions,” IEEE Tran. on Automatic Control, vol. 66, no. 4, pp. 1840–1847, 2021.
  • [15] I. Shames and F. Farokhi, “Online stochastic convex optimization: Wasserstein distance variation,” arXiv preprint arXiv:2006.01397, 2020.
  • [16] C. Wilson, V. V. Veeravalli, and A. Nedić, “Adaptive sequential stochastic optimization,” IEEE Tran. on Automatic Control, vol. 64, no. 2, pp. 496–509, 2018.
  • [17] C. Mendler-Dünner, J. C. Perdomo, T. Zrnic, and M. Hardt, “Stochastic optimization for performative prediction,” arXiv preprint arXiv:2006.06887, 2020.
  • [18] Z. Izzo, L. Ying, and J. Zou, “How to learn when data reacts to your model: performative gradient descent,” arXiv preprint arXiv:2102.07698, 2021.
  • [19] M. Vladimirova, S. Girard, H. Nguyen, and J. Arbel, “Sub-Weibull distributions: Generalizing sub-Gaussian and sub-Exponential properties to heavier tailed distributions,” Stat, vol. 9, no. 1, p. e318, 2020.
  • [20] R. Vershynin, High-dimensional probability: An introduction with applications in data science. Cambridge University Press, 2018.
  • [21] L. V. Kantorovich and G. Sh. Rubinshtein, “On a space of totally additive functions,” Vestnik of the St. Petersburg University: Mathematics, vol. 13, no. 7, pp. 52–59, 1958.
  • [22] K. C. Wong, Z. Li, and A. Tewari, “Lasso guarantees for $\beta$-mixing heavy-tailed time series,” The Annals of Statistics, vol. 48, no. 2, pp. 1124–1142, 2020.
  • [23] N. Bastianello, L. Madden, R. Carli, and E. Dall’Anese, “A stochastic operator framework for inexact static and online optimization,” arXiv preprint arXiv:2105.09884, 2021.
  • [24] Z.-P. Jiang and Y. Wang, “Input-to-state stability for discrete-time nonlinear systems,” Automatica, vol. 37, no. 6, pp. 857–869, 2001.
  • [25] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, “Robust stochastic approximation approach to stochastic programming,” SIAM Journal on optimization, vol. 19, no. 4, pp. 1574–1609, 2009.
  • [26] L. Hodgkinson and M. W. Mahoney, “Multiplicative noise and heavy tails in stochastic optimization,” arXiv preprint arXiv:2006.06293, 2020.
  • [27] C. R. Givens and R. M. Shortt, “A class of Wasserstein metrics for probability distributions.” Michigan Mathematical Journal, vol. 31, no. 2, pp. 231 – 240, 1984.