
Structured Policy Iteration for Linear Quadratic Regulator

Youngsuk Park    Ryan A. Rossi    Zheng Wen    Gang Wu    Handong Zhao
Abstract

The linear quadratic regulator (LQR) is one of the most popular frameworks for tackling continuous Markov decision process tasks. With its well-developed theory and tractable optimal policy, LQR has been revisited and analyzed in recent years from a reinforcement learning perspective, in both the model-based and model-free settings. In this paper, we introduce Structured Policy Iteration (S-PI) for LQR, a method that derives a structured linear policy. Such a structured policy, with (block) sparsity or low rank, can have significant advantages over the standard LQR policy: it is more interpretable, memory-efficient, and better suited for the distributed setting. To derive such a policy, we first formulate a regularized LQR problem when the model is known. Our Structured Policy Iteration (S-PI) algorithm, which alternates a policy evaluation step and a policy improvement step, solves this regularized LQR efficiently. We further extend S-PI to the model-free setting, where a smoothing procedure is adopted to estimate the gradient. In both the known-model and model-free settings, we prove convergence under a proper choice of parameters. Finally, experiments demonstrate the advantages of S-PI in balancing LQR performance and the level of structure by varying the weight parameter.

Linear Quadratic Regulator, Low-rank structure, Regularized LQR, Reinforcement Learning

1 Introduction

Stochastic control with the linear quadratic regulator (LQR) has been applied for decades in a wide variety of fields, including supply-chain optimization, advertising, dynamic resource allocation, and optimal control (Sarimveis et al., 2008; Nerlove & Arrow, 1962; Elmaghraby, 1993; Anderson & Moore, 2007).

This line of work has produced a wide range of fundamental machinery, from theoretical analysis to tractable algorithms, for the setting where the transition dynamics and cost function are known. On the other hand, when the transition dynamics are uncertain, reinforcement learning (RL) and data-driven approaches have achieved great empirical success in recent years, from simulated games (Mnih et al., 2015; Silver et al., 2016) to robot manipulation (Tassa et al., 2012; Al Borno et al., 2012; Kumar et al., 2016). Discrete-time LQR, in particular, has recently been revisited and analyzed under model uncertainty, both from a theoretical perspective, e.g., regret bounds and sample complexity (Ibrahimi et al., 2012; Fazel et al., 2018; Recht, 2019; Mania et al., 2019), and toward new real-world applications (Lewis et al., 2012; Lazic et al., 2018; Park et al., 2019).

Despite the importance and success of standard LQR, discrete-time LQR with a structured policy has been less explored, theoretically and practically, in both the known-model and unknown-model settings, even though such a policy can have significant advantages over the standard LQR policy: it is more interpretable, memory-efficient, and better suited for the distributed setting. In this work, we describe a methodology for learning a structured policy for LQR, along with a theoretical analysis of it.

Summary of contributions.
  • We formulate the regularized LQR problem for discrete-time systems, which captures a structured linear policy (in Section 2.1).

  • To solve the regularized LQR problem when the model is known, we develop the Structured Policy Iteration (S-PI) algorithm that consists of a policy evaluation and policy improvement step (in Section 2.2).

  • We further extend S-PI to the model-free setting, utilizing a gradient estimate from a smoothing procedure (in Section 3.1).

  • We prove the linear convergence of S-PI to its stationary point under a proper choice of parameters in both the known-model (in Section 2.3) and model-free settings (in Section 3.2).

  • We examine the properties of the S-PI algorithm and demonstrate its capability of balancing the LQR performance and level of structure by varying the weight parameter (in Section 4).

1.1 Preliminary and background

Notations.

For a symmetric matrix Q\in\mathbf{R}^{n\times n}, we write Q\succeq 0 and Q\succ 0 for positive semidefinite and positive definite matrices respectively. For a matrix A\in\mathbf{R}^{n\times n}, we denote by \rho(A) its spectral radius, i.e., the largest magnitude among the eigenvalues of A. For a matrix K\in\mathbf{R}^{n\times m}, \sigma_{1}(K)\geq\sigma_{2}(K)\geq\cdots\geq\sigma_{\min(n,m)}(K) are its ordered singular values, where \sigma_{\mathrm{min}}(K) is the smallest one. \left\|A\right\| denotes the \ell_{2} matrix norm, i.e., \max_{\left\|x\right\|_{2}=1}\left\|Ax\right\|_{2}. We also denote the Frobenius norm \left\|A\right\|_{F}=\sqrt{\sum_{i,j}A_{i,j}^{2}}, induced by the Frobenius inner product \langle A,B\rangle_{F}=\mathbf{Tr}(A^{T}B) where A and B are matrices of the same size. A ball with radius r and its surface are denoted \mathbb{B}_{r} and \mathbb{S}_{r} respectively.

1.1.1 Optimal control for infinite time horizon.

We formally define the class of problems we target here. Consider a Markov decision process (MDP) in continuous space where x_{t}\in\mathcal{X}\subset\mathbf{R}^{n}, u_{t}\in\mathcal{U}\subset\mathbf{R}^{m}, and w_{t}\in\mathcal{W}\subset\mathbf{R}^{n} denote a state, an action, and a random disturbance at time t, respectively. Further, let g:\mathcal{X}\times\mathcal{U}\times\mathcal{X}\rightarrow\mathbf{R} be a stage-cost function, and let f:\mathcal{X}\times\mathcal{U}\times\mathcal{W}\rightarrow\mathcal{X} denote the transition dynamics when action u_{t} is taken at state x_{t} with disturbance w_{t} (often denoted w_{t+1}, since it is fully revealed at time t+1).

Our goal is to find the optimal stationary policy \pi:\mathcal{X}\rightarrow\mathcal{U} that solves

\underset{\pi}{\text{minimize}}\quad \sum_{t=0}^{\infty}\gamma^{t}\mathbf{E}[g(x_{t},u_{t},x_{t+1})] \qquad (1)
\text{subject to}\quad x_{t+1}=f(x_{t},u_{t},w_{t}),\quad x_{0}\sim\mathcal{D}

where \gamma\in(0,1] is the discount factor and \mathcal{D} is the distribution of the initial state x_{0}.

1.1.2 Background on classic LQR

Infinite-horizon (discounted) LQR is a subclass of problem (1) in which the stage cost is quadratic, the dynamics are linear, \mathcal{X}=\mathbf{R}^{n}, \mathcal{U}=\mathbf{R}^{m}, \gamma=1, and w_{t}=0, i.e.,

\underset{\pi}{\text{minimize}}\quad \mathbf{E}\left(\sum_{t=0}^{\infty}x_{t}^{T}Qx_{t}+u_{t}^{T}Ru_{t}\right) \qquad (2)
\text{subject to}\quad x_{t+1}=Ax_{t}+Bu_{t},
\qquad\qquad\qquad u_{t}=\pi(x_{t}),\quad x_{0}\sim\mathcal{D},

where A\in\mathbf{R}^{n\times n}, B\in\mathbf{R}^{n\times m}, Q\succeq 0, and R\succ 0.

Optimal policy and value function.   The optimal policy (or control gain) and value function (optimal cost-to-go) are known to be linear and convex quadratic in the state, respectively,

\pi^{\star}(x)=Kx,\quad V^{\star}(x)=x^{T}Px

where

P =A^{T}PA+Q-A^{T}PB(B^{T}PB+R)^{-1}B^{T}PA, \qquad (3)
K =-(B^{T}PB+R)^{-1}B^{T}PA.

Note that Eq. (3) is known as the discrete-time algebraic Riccati equation (DARE). Here, we assume (A,B) is controllable (the linear system is controllable if the matrix [B,AB,\cdots,A^{n-1}B] has full row rank); then the solution is unique and can be computed efficiently via the Riccati recursion or alternatives (Hewer, 1971; Laub, 1979; Lancaster & Rodman, 1995; Balakrishnan & Vandenberghe, 2003).
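For reference, the optimal policy and value matrix in Eq. (3) can be computed in a few lines of numerical code. The sketch below uses SciPy's DARE solver; the system matrices shown are only illustrative placeholders, not taken from the paper.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def lqr_policy(A, B, Q, R):
    """Solve the DARE of Eq. (3) and return (P, K) with u = K x."""
    P = solve_discrete_are(A, B, Q, R)                    # P = A'PA + Q - A'PB(B'PB+R)^{-1}B'PA
    K = -np.linalg.solve(B.T @ P @ B + R, B.T @ P @ A)    # K = -(B'PB+R)^{-1} B'PA
    return P, K

# illustrative system (double-integrator-like dynamics)
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.eye(1)
P, K = lqr_policy(A, B, Q, R)
assert max(abs(np.linalg.eigvals(A + B @ K))) < 1         # closed loop is stable
```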

Variants and extensions of LQR.   There are several variants of LQR, including noisy, finite-horizon, time-varying, and trajectory LQR. Extensions include LQR with linear constraints, jump LQR, and random-model LQR. In these cases, pathologies such as infinite cost may occur (Bertsekas et al., 1995), but we do not focus on them.

1.1.3 Zeroth order optimization.

Zeroth-order optimization is a framework for optimizing a function f(x) while accessing only its function values (Conn et al., 2009; Nesterov & Spokoiny, 2017). It defines the perturbed function f_{\sigma^{2}}(x)=\mathbf{E}_{\epsilon\sim\mathcal{N}(0,\sigma^{2}I)}f(x+\epsilon), which is close to the original function f(x) for a small enough perturbation \sigma. Gaussian smoothing gives its gradient the form \nabla f_{\sigma^{2}}(x)=\frac{1}{\sigma^{2}}\mathbf{E}_{\epsilon\sim\mathcal{N}(0,\sigma^{2}I)}f(x+\epsilon)\epsilon. Similarly, following (Flaxman et al., 2004; Fazel et al., 2018), we can smooth over a ball with radius r in \mathbf{R}^{n}. The perturbed function and its gradient are f_{r}(x)=\mathbf{E}_{\epsilon\sim\mathbb{B}_{r}}f(x+\epsilon) and \nabla f_{r}(x)=\frac{n}{r^{2}}\mathbf{E}_{\epsilon\sim\mathbb{S}_{r}}f(x+\epsilon)\epsilon, where \epsilon is sampled uniformly at random from the ball \mathbb{B}_{r} and its surface \mathbb{S}_{r}, respectively. Note that these simple expected forms allow Monte Carlo simulation to obtain unbiased estimates of the gradient from function values alone, without explicit computation of the gradient.
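As a concrete illustration of the ball-smoothing estimator above, the sketch below forms the Monte Carlo estimate (n/r^2) * mean[f(x+\epsilon)\epsilon] with \epsilon drawn uniformly from the sphere \mathbb{S}_{r}; the quadratic test function is only an example, not from the paper.

```python
import numpy as np

def smoothed_grad(f, x, r=0.1, n_samples=1000, rng=np.random.default_rng(0)):
    """Estimate grad f_r(x) = (n / r^2) E_{eps ~ S_r}[ f(x + eps) eps ]."""
    n = x.size
    g = np.zeros(n)
    for _ in range(n_samples):
        eps = rng.normal(size=n)
        eps *= r / np.linalg.norm(eps)   # uniform sample on the sphere S_r
        g += f(x + eps) * eps
    return (n / r**2) * g / n_samples

f = lambda x: 0.5 * x @ x                # example: grad f(x) = x
x = np.array([1.0, -2.0, 3.0])
print(smoothed_grad(f, x))               # approximately [1, -2, 3]
```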

1.2 Related work

The general method for solving these problems is dynamic programming (DP) (Bellman et al., 1954; Bertsekas et al., 1995). To overcome the issues of DP such as intractability and computational cost, the common alternatives are approximate dynamic programming (ADP) (Bertsekas & Tsitsiklis, 1996; Bertsekas & Shreve, 2004; Powell, 2007; De Farias & Van Roy, 2003; O’Donoghue et al., 2011) and reinforcement learning (RL) (Sutton et al., 1998), including policy gradient (Kakade, 2002; Schulman et al., 2017) and Q-learning based methods (Watkins & Dayan, 1992; Mnih et al., 2015).

LQR.   LQR is a classical problem in control that captures problems with continuous state and action spaces, pioneered by Kalman in the late 1950s (Kalman, 1964). Since then, many variations have been suggested and solved, such as jump LQR, random LQR, and the (averaged) infinite-horizon objective (Florentin, 1961; Costa et al., 2006; Wonham, 1970). When the model is (assumed to be) known, the algebraic Riccati equation (ARE) (Hewer, 1971; Laub, 1979) can be used to efficiently compute the optimal value function and policy for generic LQR. Alternatively, one can use an eigendecomposition (Lancaster & Rodman, 1995) or transform the problem into a semidefinite program (SDP) (Balakrishnan & Vandenberghe, 2003).

LQR under model uncertainty.   When the model is unknown, one class of methods is model-based approaches, where the transition dynamics are estimated. For LQR in particular, (Abbasi-Yadkori & Szepesvári, 2011; Ibrahimi et al., 2012) developed online algorithms with regret bounds, where the linear dynamics (and cost function) are learned and sampled from a confidence set, but without computational demonstrations. Another line of work utilizes robust control with system identification (Dean et al., 2017; Recht, 2019), with sample complexity analysis (Tu & Recht, 2017). Certainty equivalent control for LQR has also been analyzed, with a suboptimality gap (Mania et al., 2019). The other class is model-free approaches, where the policy is learned directly without estimating the dynamics. For discrete-time LQR in particular, Q-learning (Bradtke et al., 1994) and policy gradient (Fazel et al., 2018) were developed, together with substantial mathematical machinery, but with little empirical demonstration.

LQR with structured policy.   For continuous-time LQR, much work in the control literature (Wang & Davison, 1973; Sandell et al., 1978; Lau et al., 1972) has studied sparse and decentralized feedback gains, supported by theoretical analysis. Recently, (Wytock & Kolter, 2013; Lin et al., 2012) developed fast algorithms for large-scale systems that induce sparse feedback gains, utilizing the alternating direction method of multipliers (ADMM), the iterative shrinkage-thresholding algorithm (ISTA), Newton's method, etc. Most of these works only consider sparse feedback gains for continuous-time LQR.

However, to the best of our knowledge, there are few algorithms for discrete-time LQR that account for various structured (linear) policies such as sparse, low-rank, or close to some reference policy. Furthermore, for learning such a structured policy, there is no theoretical work characterizing computational complexity, sample complexity, or the dependency of a stable stepsize on algorithm parameters, even in the known-model setting (let alone the model-free setting).

2 Structured Policy for LQR

2.1 Problem statement: regularized LQR

Starting from the standard LQR in Eq. (2), we restrict the policy to the linear class, i.e., u_{t}=Kx_{t}, and add a regularizer on the policy to induce structure. We formally state the regularized LQR problem as

\underset{K}{\text{minimize}}\quad \overbrace{\mathbf{E}\left(\sum_{t=0}^{\infty}x_{t}^{T}Qx_{t}+u_{t}^{T}Ru_{t}\right)}^{f(K)}+\lambda r(K) \qquad (4)
\text{subject to}\quad x_{t+1}=Ax_{t}+Bu_{t},
\qquad\qquad\qquad u_{t}=Kx_{t},\quad x_{0}\sim\mathcal{D},

for a nonnegative parameter \lambda\geq 0. Here f(K) is the (averaged) cost-to-go under policy K, and r:\mathbf{R}^{n\times m}\rightarrow\mathbf{R} is a nonnegative convex regularizer inducing the structure of the policy K.

Regularizer.

Different regularizers induce different types of structure on the policy K. We consider the lasso r(K)=\|K\|_{1}=\sum_{i,j}|K_{i,j}|, the group lasso r(K)=\|K\|_{\mathcal{G},2}=\sum_{g\in\mathcal{G}}\left\|K_{g}\right\|_{2} where K_{g}\in\mathbf{R}^{|g|} is the vector consisting of the entries in the index set g, and the nuclear norm r(K)=\|K\|_{*}=\sum_{i}\sigma_{i}(K) where \sigma_{i}(K) is the i-th largest singular value of K. These induce sparse, block-sparse, and low-rank structure, respectively. For a given reference policy K^{\mathrm{ref}}\in\mathbf{R}^{n\times m}, we can similarly consider \|K-K^{\mathrm{ref}}\|_{1}, \|K-K^{\mathrm{ref}}\|_{\mathcal{G},2}, and \|K-K^{\mathrm{ref}}\|_{*}, penalizing the proximity (in different metrics) to the reference policy K^{\mathrm{ref}}.

Non-convexity.

For standard (unregularized) LQR, the objective f(K) is known to be neither convex, quasi-convex, nor star-convex, but it is gradient dominated. Therefore, all stationary points are optimal as long as \mathbf{E}[x_{0}x_{0}^{T}]\succ 0 (see Lemmas 2 and 3 in (Fazel et al., 2018)). In regularized LQR, however, not all stationary points of Eq. (4) need be optimal if multiple stationary points exist. (See the supplementary material for details.)

2.2 Structured Policy Iteration (S-PI)

Eq. (4) can be simplified into

\underset{K}{\text{minimize}}\quad F(K):=f(K)+\lambda r(K). \qquad (5)

Here f(K)=\mathbf{Tr}(\Sigma_{0}P), where \Sigma_{0}=\mathbf{E}[x_{0}x_{0}^{T}] is the covariance matrix of the initial state and P is the quadratic value matrix satisfying the Lyapunov equation

(A+BK)^{T}P(A+BK)-P+Q+K^{T}RK=0. \qquad (6)

We introduce Structured Policy Iteration (S-PI) in Algorithm 1 to solve Eq. (5). The S-PI algorithm consists of two parts: (1) policy evaluation and (2) policy improvement. In the policy evaluation part, we solve Lyapunov equations to compute the quadratic value matrix P and the covariance matrix \Sigma. In the policy improvement part, we improve the policy while encouraging structure, via the proximal gradient method with a proper choice of initial stepsize and a backtracking linesearch strategy.

Algorithm 1 Structured Policy Iteration (S-PI)
1:  given initial stable policy K^{0}, initial state covariance matrix \Sigma_{0}=\mathbf{E}[x_{0}x_{0}^{T}], and linesearch factor \beta<1.
2:  repeat
3:     (1) Policy (and covariance) evaluation:
4:     compute (P^{i},\Sigma^{i}) satisfying the Lyapunov equations
(A+BK^{i})^{T}P^{i}(A+BK^{i})-P^{i}+Q+(K^{i})^{T}RK^{i}=0,
(A+BK^{i})\Sigma^{i}(A+BK^{i})^{T}-\Sigma^{i}+\Sigma_{0}=0.
return (P^{i},\Sigma^{i})
5:     (2) Policy improvement:
6:     initial stepsize \eta_{i}=\mathcal{O}(1/\lambda).
7:     compute the gradient at K^{i}
\nabla_{K}f(K^{i})=2\left(\left(R+B^{T}P^{i}B\right)K^{i}+B^{T}P^{i}A\right)\Sigma^{i}
8:     repeat
9:        \eta_{i}:=\beta\eta_{i}.
10:        K^{i+1}\leftarrow\mathrm{ProxGrad}(\nabla f(K^{i}),\eta_{i},r,\lambda) (in Alg. 2).
11:     until the linesearch criterion (7) is satisfied. return next iterate K^{i+1}.
12:  until the stopping criterion \|K^{i+1}-K^{i}\|\leq\epsilon is satisfied.
Algorithm 2 Subroutine: \mathrm{ProxGrad}(\nabla f(K),\eta,r,\lambda)
1:  input: gradient \nabla f(K), stepsize \eta, regularizer r and its parameter \lambda
2:  take gradient step
G\leftarrow K-\eta\nabla_{K}f(K)
3:  take proximal step
K^{+}\leftarrow \mathbf{prox}_{r(\cdot),\lambda\eta}(G):=\underset{K}{\text{argmin}}~r(K)+\frac{1}{2\lambda\eta}\|K-G\|_{F}^{2}
4:  return K^{+}
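A minimal sketch of the ProxGrad subroutine in Algorithm 2, written for a generic proximal operator passed in as a callable; the soft-thresholding operator of Lemma 1 appears only as a usage example, and the numerical values are placeholders.

```python
import numpy as np

def prox_grad(K, grad_K, eta, prox, lam):
    """One proximal gradient step: K+ = prox_{r, lam*eta}(K - eta * grad f(K))."""
    G = K - eta * grad_K          # gradient step
    return prox(G, lam * eta)     # proximal step

# example prox: soft-thresholding for r(K) = ||K||_1 (Lemma 1)
soft_threshold = lambda G, t: np.sign(G) * np.maximum(np.abs(G) - t, 0.0)

K = np.array([[0.5, -0.2], [0.0, 1.0]])
grad_K = np.array([[0.1, 0.0], [-0.3, 0.2]])
K_next = prox_grad(K, grad_K, eta=0.1, prox=soft_threshold, lam=1.0)
```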

Sensitivity to initial and fixed stepsize choice. Note that we use the initial stepsize \eta=\mathcal{O}(1/\lambda) as a rule of thumb, whereas typical initial stepsizes (Barzilai & Borwein, 1988; Wright et al., 2009; Park et al., 2020) do not depend on the regularization parameter \lambda. This choice is motivated by our theoretical analysis (see Lemma 3 in Section 2.3) as well as by empirical evidence (see Fig. 1 in Section 4). A stepsize of this order automatically scales well when experimenting over various values of \lambda, reducing iteration counts and leading to a faster algorithm in practice. It turns out that the proximal gradient step is very sensitive to the stepsize, often leading to an unstable policy K with \rho(A+BK)\geq 1 or requiring a large number of iterations to converge. Therefore, we use a linesearch rather than a fixed stepsize.

Linesearch.   We adopt a backtracking linesearch (see (Parikh et al., 2014)). Given \eta_{i}, K^{i}, \nabla f(K^{i}), and the candidate next iterate K^{i+1}, it checks whether the following criterion (a sufficient decrease of the objective and stability) is satisfied,

f(K^{i+1})\leq f(K^{i})-\eta_{i}\mathbf{Tr}(\nabla f(K^{i})^{T}G_{\eta_{i}}(K^{i}))+\frac{\eta_{i}}{2}\|G_{\eta_{i}}(K^{i})\|_{F}^{2}, \qquad (7)
\rho(A+BK^{i+1})<1,

where G_{\eta_{i}}(K)=\frac{1}{\eta_{i}}(K-\mathbf{prox}_{r,\lambda\eta_{i}}(K-\eta_{i}\nabla f(K))) and \rho(\cdot) is the spectral radius. Otherwise, it shrinks the stepsize \eta_{i} by a factor \beta<1 and checks again until Eq. (7) is satisfied.
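For concreteness, a sketch of the backtracking check in Eq. (7); the names f_of, grad_K, and prox are ours, and are assumed to be supplied by the policy evaluation step and Lemma 1.

```python
import numpy as np

def linesearch_ok(A, B, K, f_of, grad_K, prox, eta, lam):
    """Check the criterion of Eq. (7): sufficient decrease and spectral radius < 1."""
    K_plus = prox(K - eta * grad_K, lam * eta)
    G = (K - K_plus) / eta                             # G_eta(K)
    decrease = f_of(K_plus) <= (f_of(K)
                                - eta * np.trace(grad_K.T @ G)
                                + 0.5 * eta * np.linalg.norm(G, 'fro')**2)
    stable = max(abs(np.linalg.eigvals(A + B @ K_plus))) < 1
    return decrease and stable
```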

Stabilizing sequence of policies.   We start with a stable policy K^{0}, meaning \rho(A+BK^{0})<1. For example, under the standard assumptions on A,B,Q,R, the Riccati recursion provides a stable policy, the solution of the standard LQR in Eq. (2). Then, by satisfying the linesearch criterion in Eq. (7) at every iteration, the subsequent policies \{K^{i}\} form a stabilizing sequence.

Computational cost.   The major cost is incurred in solving the Lyapunov equations in the policy (and covariance) evaluation step. Note that if A+BK is stable, so is (A+BK)^{T}, since they share the same eigenvalues. Under stability, each Lyapunov equation in the evaluation step has a unique solution with computational cost O(n^{3}) (Jaimoukha & Kasenally, 1994; Li & White, 2002). Additionally, the sequence of Lyapunov equations can be solved at lower cost using iterative methods warm-started from the previous (or an approximate) solution.
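The two Lyapunov equations in the evaluation step can be solved directly with standard routines; a sketch using SciPy follows (the matrices are placeholders to be supplied by the caller), together with the gradient of step 7 in Algorithm 1.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def policy_evaluation(A, B, K, Q, R, Sigma0):
    """Solve the two discrete Lyapunov equations of Algorithm 1, step 4."""
    Acl = A + B @ K
    # (A+BK)^T P (A+BK) - P + Q + K^T R K = 0
    P = solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)
    # (A+BK) Sigma (A+BK)^T - Sigma + Sigma0 = 0
    Sigma = solve_discrete_lyapunov(Acl, Sigma0)
    # gradient of step 7: 2((R + B'PB)K + B'PA) Sigma
    grad = 2 * ((R + B.T @ P @ B) @ K + B.T @ P @ A) @ Sigma
    return P, Sigma, grad
```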

2.2.1 Regularizer and proximal operator

Each of the regularizers mentioned in Section 2.1 has a closed-form proximal operator (Rockafellar, 1976; Parikh et al., 2014; Park et al., 2020). Here we include only a few representative examples and refer to the supplementary material for more.

Lemma 1 (Examples of proximal operators).
  1. Lasso. For r(K)=\left\|K\right\|_{1},

    \left(\mathbf{prox}_{r,\lambda\eta}(K)\right)_{i,j}=\mathrm{sign}(K_{i,j})(|K_{i,j}|-\lambda\eta)_{+}.

    We denote \mathbf{prox}_{r,\lambda\eta}(K):=S_{\lambda\eta}(K), the soft-thresholding operator.

  2. Nuclear norm. For r(K)=\left\|K\right\|_{*},

    \mathbf{prox}_{r,\lambda\eta}(K)=U\mathbf{diag}(S_{\lambda\eta}(\sigma))V^{T},

    where K=U\mathbf{diag}(\sigma)V^{T} is the singular value decomposition with singular values \sigma\in\mathbf{R}^{\min(n,m)}.

  3. Proximity to K^{\mathrm{ref}}. For r(K)=\left\|K-K^{\mathrm{ref}}\right\|_{F}^{2},

    \mathbf{prox}_{r,\lambda\eta}(K)=\frac{2\lambda\eta K^{\mathrm{ref}}+K}{2\lambda\eta+1}.
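The three proximal operators in Lemma 1 have simple implementations; a minimal NumPy sketch (with t standing for \lambda\eta) is given below.

```python
import numpy as np

def prox_lasso(K, t):
    """Soft-thresholding S_t(K) for r(K) = ||K||_1."""
    return np.sign(K) * np.maximum(np.abs(K) - t, 0.0)

def prox_nuclear(K, t):
    """Singular-value soft-thresholding for r(K) = ||K||_*."""
    U, s, Vt = np.linalg.svd(K, full_matrices=False)
    return U @ np.diag(np.maximum(s - t, 0.0)) @ Vt

def prox_proximity(K, t, K_ref):
    """Closed form for r(K) = ||K - K_ref||_F^2."""
    return (2 * t * K_ref + K) / (2 * t + 1)
```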

2.3 Convergence analysis of S-PI

Recall that \left\|\cdot\right\|, \left\|\cdot\right\|_{F}, and \sigma_{\mathrm{min}}(\cdot) denote the \ell_{2} matrix norm, the Frobenius norm, and the smallest singular value, as defined in Section 1.1. We first establish (local) smoothness and strong convexity around a stable policy. Here we say a policy K is stable if \rho(A+BK)<1.

Assumptions.  \rho(A+BK^{0})<1, \Sigma_{0}=\mathbf{E}x_{0}x_{0}^{T}\succ 0, \left\|K^{0}-K^{\star}\right\|\leq\Delta, and \left\|K^{\mathrm{ref}}-K^{\star}\right\|\leq\Delta.

Lemma 2.

For a stable K, f(K) is (locally) smooth with

L_{K}=4\|\Sigma(K)\|\|R+B^{T}P(K)B\|<\infty,

within the local ball \mathcal{B}(K;\rho_{K}) around K, where the radius \rho_{K} is

\rho_{K}=\frac{\sigma_{\mathrm{min}}(\Sigma_{0})}{4\|\Sigma(K)\|\left(\|A+BK\|+1\right)\|B\|}>0.

And f(K) is (locally) strongly convex with

m=\sigma_{\mathrm{min}}(\Sigma_{0})\sigma_{\mathrm{min}}(R)>0.

Next, we provide a stepsize that guarantees one proximal gradient iteration stays inside the stable and (locally) smooth ball \mathcal{B}(K;\rho_{K}).

Lemma 3.

Let K^{+}=\mathbf{prox}_{r,\lambda\eta}(K-\eta\nabla f(K)). Then

K^{+}\in\mathcal{B}(K;\rho_{K})

holds for any 0<\eta<\eta_{K}^{\lambda,r}, where \eta_{K}^{\lambda,r} is given as

\eta_{K}^{\lambda,r}=\begin{cases}\frac{\rho_{K}}{\|\nabla f(K)\|+\lambda nm}&~{}r(K)=\|K\|_{1}\\ \frac{\rho_{K}}{\|\nabla f(K)\|+\lambda\min(n,m)}&~{}r(K)=\|K\|_{*}\\ \frac{\rho_{K}}{2\|\nabla f(K)\|+2\lambda\left\|K-K^{\mathrm{ref}}\right\|}&~{}r(K)=\|K-K^{\mathrm{ref}}\|_{F}^{2}\end{cases}.

The next proposition shows that, for a sufficiently small stepsize, the next iterate decreases the objective and remains stable.

Proposition 1.

Assume A+BK is stable. For any stepsize 0<\eta\leq\min(\frac{1}{L_{K}},\eta_{K}^{\lambda,r}) and next iterate K^{+}=\mathbf{prox}_{r,\eta\lambda}(K-\eta\nabla f(K)),

\rho(A+BK^{+})<1 \qquad (8)
F(K^{+})\leq F(K)-\frac{1}{2\eta}\|K-K^{+}\|_{F}^{2} \qquad (9)

hold.

We also derive, in Lemma 4 below, a bound on the stepsize that does not depend on the iteration number.

Lemma 4.

Assume that \{K^{i}\}_{i=0,\ldots} is a stabilizing sequence and that the associated \{f(K^{i})\}_{i=0,\ldots} and \{\|K^{i}-K^{\star}\|_{F}\}_{i=0,\ldots} are decreasing sequences. Then, the conclusion of Lemma 3 holds for

\eta^{\lambda,r}=\begin{cases}\frac{\rho^{L}}{\rho^{f}+\lambda nm}&~{}r(K)=\|K\|_{1}\\ \frac{\rho^{L}}{\rho^{f}+\lambda\min(n,m)}&~{}r(K)=\|K\|_{*}\\ \frac{\rho^{L}}{2\rho^{f}+4\lambda\Delta}&~{}r(K)=\|K-K^{\mathrm{ref}}\|_{F}^{2}\end{cases}. \qquad (10)

where

\rho^{f}=2\frac{F(K^{0})}{\sigma_{\min}(Q)}\bigg{(}\left\|B^{T}\right\|\frac{F(K^{0})}{\sigma_{\min}(\Sigma_{0})}\left\|A\right\|+\left(\left\|R\right\|+\left\|B^{T}\right\|\frac{F(K^{0})}{\sigma_{\min}(\Sigma_{0})}\left\|B\right\|\right)(\Delta+\left\|K^{\star}\right\|)\bigg{)},
\rho^{L}=\frac{\sigma_{\mathrm{min}}(\Sigma_{0})^{2}}{8F(K^{0})\|B\|}.

Now we prove that S-PI in Algorithm 1 converges linearly to a stationary point.

Theorem 1.

K^{i} from Algorithm 1 converges to the stationary point K^{\star}. Moreover, it converges linearly, i.e., after N iterations,

\left\|K^{N}-K^{\star}\right\|_{F}^{2}\leq\left(1-\frac{1}{\kappa}\right)^{N}\left\|K^{0}-K^{\star}\right\|_{F}^{2}.

Here, \kappa=1/\left(\eta_{\min}\sigma_{\mathrm{min}}(\Sigma_{0})\sigma_{\mathrm{min}}(R)\right)>1 where

\eta_{\min}=h_{\eta}\bigg{(}\sigma_{\mathrm{min}}(\Sigma_{0}),\sigma_{\mathrm{min}}(Q),\frac{1}{\lambda},\frac{1}{\left\|A\right\|},\frac{1}{\left\|B\right\|},\frac{1}{\left\|R\right\|},\frac{1}{\Delta},\frac{1}{F(K^{0})}\bigg{)}, \qquad (11)

for some function h_{\eta} that is non-decreasing in each argument.

Note that the (global bound on the) stepsize \eta_{\mathrm{min}} is inversely proportional to \lambda, which motivates the initial stepsize in the linesearch.

Corollary 1.

Let K^{\star} be the stationary point from Algorithm 1. Then, for any number of iterations

N\geq 2\kappa\log\left(\frac{\left\|K^{0}-K^{\star}\right\|_{F}}{\epsilon}\right),

\left\|K^{N}-K^{\star}\right\|_{F}\leq\epsilon

holds, where \kappa=1/\left(\eta_{\min}\sigma_{\mathrm{min}}(\Sigma_{0})\sigma_{\mathrm{min}}(R)\right)>1 and \eta_{\min} is given in Eq. (11).

3 Toward a Model-Free Framework

In this section, we consider the scenario where the cost function and transition dynamics are unknown. Specifically, in the model-free setting, the policy is learned directly from (trajectory) data without explicitly estimating the cost or transition model. We extend our Structured Policy Iteration (S-PI) to this setting and prove its convergence.

3.1 Model-free Structured Policy Iteration (S-PI)

Note that, in the model-free setting, the model parameters A, B, Q, and R cannot be accessed directly, which prevents the direct computation of P, \Sigma, and \nabla f(K). Instead, we adopt a smoothing procedure to estimate the gradient from samples.

Model-free S-PI in Algorithm 3 consists of two steps: (1) a policy evaluation step and (2) a policy improvement step. In the (perturbed) policy evaluation step, a perturbation U^{j} is drawn uniformly from the surface of the ball with radius r, \mathbb{S}_{r}\subset\mathbf{R}^{n\times m}. These data are used to estimate the gradient via a smoothing procedure for the policy improvement step. With this approximate gradient, the proximal gradient subroutine decreases the objective while inducing structure in the policy. Compared to the (known-model) S-PI in Algorithm 1, one important difference is the use of a fixed stepsize \eta rather than an adaptive stepsize from a backtracking linesearch, which would require access to the exact function value f(K)=\mathbf{Tr}(\Sigma_{0}P).

Algorithm 3 Model-free Structured Policy Iteration (Model-free S-PI)
1:  given initial stable policy K^{0}, number of trajectories N_{\mathrm{traj}}, roll-out horizon H, smoothing parameter r, and (fixed) stepsize \eta.
2:  repeat
3:     (1) (Perturbed) policy evaluation:
4:     for j=1,\ldots,N_{\mathrm{traj}} do
5:        sample a perturbed policy \hat{K}^{i}=K^{i}+U^{j} where U^{j}\sim\mathrm{Uniform}(\mathbb{S}_{r}).
6:        roll out \hat{K}^{i} from a sampled initial state x_{0}\sim\mathcal{D} over the horizon H to estimate the cost-to-go
\hat{f}^{j}=\sum_{t=0}^{H}g_{t}
where g_{t}:=g(x_{t},\hat{K}^{i}x_{t}) is the stage cost incurred at time t.
7:     end for return cost-to-go values and perturbations \{\hat{f}^{j},U^{j}\}_{j=1}^{N_{\mathrm{traj}}}.
8:     (2) Policy improvement:
9:     estimate the gradient
\widehat{\nabla_{K}f(K^{i})}=\frac{1}{N_{\mathrm{traj}}}\sum_{j=1}^{N_{\mathrm{traj}}}\frac{d}{r^{2}}\hat{f}^{j}U^{j} \qquad (12)
10:     K^{i+1}\leftarrow\mathrm{ProxGrad}(\widehat{\nabla_{K}f(K^{i})},\eta,r,\lambda) (in Alg. 2). return next iterate K^{i+1}.
11:  until stopping criterion \|K^{i+1}-K^{i}\|\leq\epsilon is satisfied.
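A sketch of the perturbed policy evaluation and the gradient estimate in Eq. (12); the callable rollout(K, H) is an assumed helper that returns the truncated cost-to-go of one trajectory under policy K.

```python
import numpy as np

def estimate_gradient(K, rollout, n_traj, H, r, rng=np.random.default_rng(0)):
    """Smoothed gradient estimate of Eq. (12): (d / r^2) * mean[ f_hat^j * U^j ]."""
    d = K.size                                 # dimension n * m
    grad_hat = np.zeros_like(K)
    for _ in range(n_traj):
        U = rng.normal(size=K.shape)
        U *= r / np.linalg.norm(U)             # U ~ Uniform(S_r), Frobenius radius r
        f_hat = rollout(K + U, H)              # truncated cost-to-go of perturbed policy
        grad_hat += (d / r**2) * f_hat * U
    return grad_hat / n_traj
```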

3.2 Convergence analysis of model-free S-PI

The outline of the proof is as follows. We first show that, for proper parameters (perturbation radius, horizon, and number of trajectories), the gradient estimate from the smoothing procedure is close enough to the actual gradient with high probability. We then show that the approximate proximal gradient method still converges linearly with high probability.

Theorem 2.

Suppose F(K^{0}) is finite, \Sigma_{0}\succ 0, and that x_{0}\sim\mathcal{D} has norm bounded by D almost surely. Suppose the parameters in Algorithm 3 are chosen as

(N_{\mathrm{traj}},H,1/r)=h\left(n,\frac{1}{\epsilon},\frac{1}{\sigma_{\mathrm{min}}(\Sigma_{0})\sigma_{\mathrm{min}}(R)},\frac{D^{2}}{\sigma_{\mathrm{min}}(\Sigma_{0})}\right)

for some polynomials h. Then, with the same stepsize as in Eq. (11), there exists an iteration N of at most 4\kappa\log\left(\frac{\left\|K^{0}-K^{\star}\right\|_{F}}{\epsilon}\right) such that \left\|K^{N}-K^{\star}\right\|\leq\epsilon with probability at least 1-{o}(\epsilon^{n-1}). Moreover, the iterates converge linearly,

\left\|K^{i}-K^{\star}\right\|^{2}\leq\left(1-\frac{1}{2\kappa}\right)^{i}\left\|K^{0}-K^{\star}\right\|^{2},

for the iterations i=1,\ldots,N, where \kappa=1/\left(\eta\,\sigma_{\mathrm{min}}(\Sigma_{0})\sigma_{\mathrm{min}}(R)\right)>1.

Remark.

For (unregularized) LQR, the model-free policy gradient method of (Fazel et al., 2018) was the first to adopt a smoothing procedure to estimate the gradient. Similarly, our model-free S-PI for regularized LQR faces several major challenges in practice. First, finding an initial policy with finite F(K^{0}) is non-trivial, especially when the open-loop system is unstable, i.e., \rho(A)\geq 1. Second, it is sensitive to the fixed stepsize \eta, as in the known-model setting (Section 2.2); a wrong choice easily causes divergence. Note that these two challenges affect most gradient-based methods for LQR, including policy gradient (Fazel et al., 2018), trust region policy optimization (TRPO) (Schulman et al., 2015), and proximal policy optimization (PPO) (Schulman et al., 2017). Third, there is joint sensitivity to the parameters N_{\mathrm{traj}},H,r, which may lead to high variance or a large sub-optimality gap. On the other hand, REINFORCE (Williams, 1992) may suffer less variance, but it comes with another potential difficulty: estimating the state-action value function Q(x,u). Moreover, it is not clear how to derive a structured policy with it. Here, we adopted a smoothing procedure because it enables us to theoretically analyze the convergence rate and parameter dependency.

4 Experiments

In the experiments, we consider an LQR system for the purpose of validating the theoretical results and basic properties of the S-PI algorithm. As mentioned in Section 3.2, even a simple example with an unstable open-loop system, i.e., \rho(A)\geq 1, is extremely sensitive to parameters, even in the known-model setting, which makes generic model-free RL approaches difficult to deploy. Note that our objective is the total LQR cost-to-go in Eq. (2), which is more difficult than the LQR cost-to-go averaged over the time horizon considered in some prior work (Mania et al., 2018). Under this difficulty, we demonstrate properties of S-PI such as parameter sensitivity, convergence behavior, and the capability of balancing LQR performance against policy structure. Finally, we illustrate the scalability of the algorithm over various system dimensions.

4.1 Synthetic systems

In these experiments, we use the unstable Laplacian system  (Recht, 2019).

Large Laplacian dynamics.   A\in\mathbf{R}^{n\times n} where

A_{ij}=\begin{cases}1.1,&i=j\\ 0.1,&i=j+1\text{ or }j=i+1\\ 0,&\text{otherwise}\end{cases}

B=Q=I_{n}\in\mathbf{R}^{n\times n} and R=1000\times I_{n}\in\mathbf{R}^{n\times n}.
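This Laplacian system is straightforward to construct; a sketch:

```python
import numpy as np

def laplacian_system(n):
    """Unstable Laplacian dynamics: 1.1 on the diagonal, 0.1 on the off-diagonals."""
    A = 1.1 * np.eye(n) + 0.1 * (np.eye(n, k=1) + np.eye(n, k=-1))
    B, Q, R = np.eye(n), np.eye(n), 1000.0 * np.eye(n)
    return A, B, Q, R

A, B, Q, R = laplacian_system(20)
print(max(abs(np.linalg.eigvals(A))))   # > 1, so the open loop is unstable
```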

Synthetic system parameters.   For the Laplacian system, we regard (n,m)=(3,3), (n,m)=(20,20), and (n,m)=(10^{3},10^{3}) as small, medium, and large systems, respectively. In addition, we experiment with the lasso regularizer over various \lambda=10^{-2}\sim 10^{6}.

4.2 Algorithm parameters

Based on our theoretical results (Lemma 3) and on sensitivity experiments that empirically show the dependency of the stepsize \eta on \lambda (Fig. 1), we set the initial stepsize \eta=\frac{1}{\lambda}. For the backtracking linesearch, we set \beta=\frac{1}{2} and the convergence tolerance \epsilon_{\mathrm{tol}}=10^{-6}. We start with an initial policy K^{0} from the Riccati recursion.

4.3 Results

We denote the stationary point from S-PI (at each \lambda>0) as K^{\star} and the solution of the (unregularized) LQR as K^{lqr}.

Figure 1: Largest fixed stepsize leading to a stable system as \lambda varies, demonstrating that \eta_{\mathrm{fixed}}\propto\frac{1}{\lambda}.

Dependency of stepsize \eta on \lambda.   For the same problem but with different choices of the weight \lambda, the largest fixed stepsize \eta that yields a sequence of stable systems, i.e., \rho(A+BK^{i})<1, indeed varies, as Lemma 3 implies. Fig. 1 shows that this largest stepsize diminishes as \lambda increases. This motivates choosing the initial stepsize (in the linesearch) to be inversely proportional to \lambda.

Figure 2: Convergence behavior of Structured Policy Iteration (S-PI) with a fixed stepsize for the Laplacian system with (n,m)=(3,3) and \lambda=3000.

Convergence behavior and stepsize sensitivity.   S-PI with a linesearch strategy converges very fast, within 2-3 iterations, for most Laplacian systems with dimension n=3\sim 10^{3} and weight \lambda=10^{-2}\sim 10^{6}. However, S-PI with a fixed stepsize can behave subtly, even though a fixed stepsize is common in typical optimization problems. In Fig. 2, S-PI with a fixed stepsize for the small Laplacian system (n,m)=(3,3) gets very close to optimal but begins to deviate after a certain number of iterations. Note that it was run for many iterations with a small enough stepsize \eta_{\mathrm{fixed}}=3\times 10^{-5}. Please refer to the supplementary material for more detailed results over various stepsizes. This illustrates that a linesearch strategy can be essential in S-PI (and even for the existing policy gradient method for LQR (Fazel et al., 2018)), even though the convergence analysis holds for some fixed stepsize (which is difficult to compute in practice).

Figure 3: As \lambda becomes larger, the LQR cost slightly increases (top) within the range \lambda<615, whereas sparsity is significantly improved, by 50\% (bottom), for a Laplacian system with (n,m)=(20,20).

Trade-off between LQR performance and the structure of K.   In Fig. 3, for a medium-size Laplacian system, we show that as \lambda increases, the LQR cost f(K^{\star}) increases whereas the cardinality decreases (sparsity improves). Note that the LQR performance barely changes (or is only slightly worse) for \lambda\leq 615, while the sparsity is improved by more than 50\%. In Fig. 4, we show the sparsity pattern (locations of nonzero elements) of the policy matrix for \lambda=600 and \lambda=620.

Figure 4: Sparsity pattern of the policy with \lambda=600 and \lambda=620, respectively, for a Laplacian system with n=20.

Scalability & runtime performance.   In Fig. 5, we report the runtime for Laplacian systems with n=10,\ldots,500. Notably, S-PI scales well: it takes less than 2 minutes to solve a large system with n=500 dimensions on a MacBook Air (with a 1.3 GHz Intel Core i5 CPU). These results demonstrate its applicability to large-scale problems in practice.

Figure 5: The elapsed time (sec.) until S-PI converges, over n=10,\ldots,500, for the Laplacian system.

5 Conclusion and Discussion

In this paper, we formulated a regularized LQR problem to derive a structured linear policy and provided an efficient algorithm, Structured Policy Iteration (S-PI). We proved that S-PI converges linearly under a proper choice of stepsize, keeping the iterates within the set of stable policies while decreasing the objective at each iteration. In addition, we extended S-PI to the model-free setting, utilizing a smoothing procedure, and proved convergence guarantees that hold with high probability under a proper choice of parameters, including the stepsize, horizon, and number of trajectories. In the experiments, we examined basic properties of S-PI, such as the sensitivity of its convergence behavior to the stepsize and regularization parameters, which turned out to be more critical than in typical optimization problems. Lastly, we demonstrated that our method effectively balances solution quality and structure.

We leave for future work the practical application of other penalty functions, such as low-rank and proximity regularization. There are also possible extensions using a proximal Newton or proximal natural gradient method as a subroutine, beyond what was developed for S-PI in Section 2, which could be further analyzed. Even though a model-free algorithm for regularized LQR was proposed with theoretical guarantees, it is extremely difficult to deploy in practice, like most model-free approaches for LQR. Finally, developing variance-reduced algorithms for (regularized) LQR, possibly along the lines of (Papini et al., 2018; Park & Ryu, 2020), as well as practical rules of thumb for choosing hyper-parameters, is another important class of problems toward the model-free setting.

We described a new class of discrete-time LQR problems that had yet to be studied theoretically and practically, and we discussed and demonstrated how such problems can be of practical importance despite not being well studied in the literature. While only a few applications were covered for this new class of problems, each of these contributions opens our framework up to new potential applications, providing additional benefits to future research on this topic.

References

  • Abbasi-Yadkori & Szepesvári (2011) Abbasi-Yadkori, Y. and Szepesvári, C. Regret bounds for the adaptive control of linear quadratic systems. In Proceedings of the 24th Annual Conference on Learning Theory, pp.  1–26, 2011.
  • Al Borno et al. (2012) Al Borno, M., De Lasa, M., and Hertzmann, A. Trajectory optimization for full-body movements with complex contacts. IEEE transactions on visualization and computer graphics, 19(8):1405–1414, 2012.
  • Anderson & Moore (2007) Anderson, B. D. and Moore, J. B. Optimal control: linear quadratic methods. Courier Corporation, 2007.
  • Balakrishnan & Vandenberghe (2003) Balakrishnan, V. and Vandenberghe, L. Semidefinite programming duality and linear time-invariant systems. IEEE Transactions on Automatic Control, 48(1):30–41, 2003.
  • Barzilai & Borwein (1988) Barzilai, J. and Borwein, J. M. Two-point step size gradient methods. IMA journal of numerical analysis, 8(1):141–148, 1988.
  • Bellman et al. (1954) Bellman, R. et al. The theory of dynamic programming. Bulletin of the American Mathematical Society, 60(6):503–515, 1954.
  • Bertsekas & Shreve (2004) Bertsekas, D. P. and Shreve, S. Stochastic optimal control: the discrete-time case. 2004.
  • Bertsekas & Tsitsiklis (1996) Bertsekas, D. P. and Tsitsiklis, J. N. Neuro-dynamic programming, volume 5. Athena Scientific Belmont, MA, 1996.
  • Bertsekas et al. (1995) Bertsekas, D. P. Dynamic programming and optimal control, volume 1. Athena Scientific, Belmont, MA, 1995.
  • Bradtke et al. (1994) Bradtke, S. J., Ydstie, B. E., and Barto, A. G. Adaptive linear quadratic control using policy iteration. In Proceedings of 1994 American Control Conference-ACC’94, volume 3, pp.  3475–3479. IEEE, 1994.
  • Conn et al. (2009) Conn, A. R., Scheinberg, K., and Vicente, L. N. Introduction to derivative-free optimization, volume 8. Siam, 2009.
  • Costa et al. (2006) Costa, O. L. V., Fragoso, M. D., and Marques, R. P. Discrete-time Markov jump linear systems. Springer Science & Business Media, 2006.
  • De Farias & Van Roy (2003) De Farias, D. P. and Van Roy, B. The linear programming approach to approximate dynamic programming. Operations research, 51(6):850–865, 2003.
  • Dean et al. (2017) Dean, S., Mania, H., Matni, N., Recht, B., and Tu, S. On the sample complexity of the linear quadratic regulator. Foundations of Computational Mathematics, pp.  1–47, 2017.
  • Elmaghraby (1993) Elmaghraby, S. E. Resource allocation via dynamic programming in activity networks. European Journal of Operational Research, 64(2):199–215, 1993.
  • Fazel et al. (2018) Fazel, M., Ge, R., Kakade, S. M., and Mesbahi, M. Global convergence of policy gradient methods for the linear quadratic regulator. arXiv preprint arXiv:1801.05039, 2018.
  • Flaxman et al. (2004) Flaxman, A. D., Kalai, A. T., and McMahan, H. B. Online convex optimization in the bandit setting: gradient descent without a gradient. arXiv preprint cs/0408007, 2004.
  • Florentin (1961) Florentin, J. J. Optimal control of continuous time, markov, stochastic systems. International Journal of Electronics, 10(6):473–488, 1961.
  • Hewer (1971) Hewer, G. An iterative technique for the computation of the steady state gains for the discrete optimal regulator. IEEE Transactions on Automatic Control, 16(4):382–384, 1971.
  • Ibrahimi et al. (2012) Ibrahimi, M., Javanmard, A., and Roy, B. V. Efficient reinforcement learning for high dimensional linear quadratic systems. In Advances in Neural Information Processing Systems, pp. 2636–2644, 2012.
  • Jaimoukha & Kasenally (1994) Jaimoukha, I. M. and Kasenally, E. M. Krylov subspace methods for solving large lyapunov equations. SIAM Journal on Numerical Analysis, 31(1):227–251, 1994.
  • Kakade (2002) Kakade, S. M. A natural policy gradient. In Advances in neural information processing systems, pp. 1531–1538, 2002.
  • Kalman (1964) Kalman, R. E. When is a linear control system optimal? Journal of Basic Engineering, 86(1):51–60, 1964.
  • Kumar et al. (2016) Kumar, V., Todorov, E., and Levine, S. Optimal control with learned local models: Application to dexterous manipulation. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pp.  378–383. IEEE, 2016.
  • Lancaster & Rodman (1995) Lancaster, P. and Rodman, L. Algebraic riccati equations. Clarendon press, 1995.
  • Lau et al. (1972) Lau, R., Persiano, R., and Varaiya, P. Decentralized information and control: A network flow example. IEEE Transactions on Automatic Control, 17(4):466–473, 1972.
  • Laub (1979) Laub, A. A schur method for solving algebraic riccati equations. IEEE Transactions on automatic control, 24(6):913–921, 1979.
  • Lazic et al. (2018) Lazic, N., Boutilier, C., Lu, T., Wong, E., Roy, B., Ryu, M. K., and Imwalle, G. Data center cooling using model-predictive control. In Advances in Neural Information Processing Systems (NeurIPS), pp.  3818–3827, 2018.
  • Lewis et al. (2012) Lewis, F. L., Vrabie, D., and Syrmos, V. L. Optimal control. 3rd ed. Hoboken, NJ: John Wiley & Sons, 2012.
  • Li & White (2002) Li, J.-R. and White, J. Low rank solution of lyapunov equations. SIAM Journal on Matrix Analysis and Applications, 24(1):260–280, 2002.
  • Lin et al. (2012) Lin, F., Fardad, M., and Jovanović, M. R. Sparse feedback synthesis via the alternating direction method of multipliers. In 2012 American Control Conference (ACC), pp.  4765–4770. IEEE, 2012.
  • Mania et al. (2018) Mania, H., Guy, A., and Recht, B. Simple random search provides a competitive approach to reinforcement learning. arXiv preprint arXiv:1803.07055, 2018.
  • Mania et al. (2019) Mania, H., Tu, S., and Recht, B. Certainty equivalent control of lqr is efficient. arXiv preprint arXiv:1902.07826, 2019.
  • Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M. A., Fidjeland, A., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • Nerlove & Arrow (1962) Nerlove, M. and Arrow, K. J. Optimal advertising policy under dynamic conditions. Economica, pp.  129–142, 1962.
  • Nesterov & Spokoiny (2017) Nesterov, Y. and Spokoiny, V. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2):527–566, 2017.
  • O’Donoghue et al. (2011) O’Donoghue, B., Wang, Y., and Boyd, S. Min-max approximate dynamic programming. In 2011 IEEE International Symposium on Computer-Aided Control System Design (CACSD), pp.  424–431. IEEE, 2011.
  • Papini et al. (2018) Papini, M., Binaghi, D., Canonaco, G., Pirotta, M., and Restelli, M. Stochastic variance-reduced policy gradient. arXiv preprint arXiv:1806.05618, 2018.
  • Parikh et al. (2014) Parikh, N., Boyd, S., et al. Proximal algorithms. Foundations and Trends® in Optimization, 1(3):127–239, 2014.
  • Park & Ryu (2020) Park, Y. and Ryu, E. K. Linear convergence of cyclic saga. Optimization Letters, pp.  1–16, 2020.
  • Park et al. (2019) Park, Y., Mahadik, K., Rossi, R. A., Wu, G., and Zhao, H. Linear quadratic regulator for resource-efficient cloud services. In Proceedings of the ACM Symposium on Cloud Computing, pp. 488–489, 2019.
  • Park et al. (2020) Park, Y., Dhar, S., Boyd, S., and Shah, M. Variable metric proximal gradient method with diagonal barzilai-borwein stepsize. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.  3597–3601. IEEE, 2020.
  • Powell (2007) Powell, W. B. Approximate Dynamic Programming: Solving the curses of dimensionality, volume 703. John Wiley & Sons, 2007.
  • Recht (2019) Recht, B. A tour of reinforcement learning: The view from continuous control. Annual Review of Control, Robotics, and Autonomous Systems, 2:253–279, 2019.
  • Rockafellar (1976) Rockafellar, R. T. Monotone operators and the proximal point algorithm. SIAM journal on control and optimization, 14(5):877–898, 1976.
  • Sandell et al. (1978) Sandell, N., Varaiya, P., Athans, M., and Safonov, M. Survey of decentralized control methods for large scale systems. IEEE Transactions on automatic Control, 23(2):108–128, 1978.
  • Sarimveis et al. (2008) Sarimveis, H., Patrinos, P., Tarantilis, C. D., and Kiranoudis, C. T. Dynamic modeling and control of supply chain systems: A review. Computers & Operations Research, 35(11):3530–3561, 2008.
  • Schulman et al. (2015) Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International conference on machine learning, pp. 1889–1897, 2015.
  • Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Silver et al. (2016) Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T. P., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
  • Sutton et al. (1998) Sutton, R. S., Barto, A. G., et al. Introduction to reinforcement learning, volume 2. MIT press Cambridge, 1998.
  • Tassa et al. (2012) Tassa, Y., Erez, T., and Todorov, E. Synthesis and stabilization of complex behaviors through online trajectory optimization. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp.  4906–4913. IEEE, 2012.
  • Tu & Recht (2017) Tu, S. and Recht, B. Least-squares temporal difference learning for the linear quadratic regulator. arXiv preprint arXiv:1712.08642, 2017.
  • Wang & Davison (1973) Wang, S.-H. and Davison, E. On the stabilization of decentralized control systems. IEEE Transactions on Automatic Control, 18(5):473–478, 1973.
  • Watkins & Dayan (1992) Watkins, C. J. and Dayan, P. Q-learning. Machine learning, 8(3-4):279–292, 1992.
  • Williams (1992) Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
  • Wonham (1970) Wonham, W. M. Random differential equations in control theory. 1970.
  • Wright et al. (2009) Wright, S. J., Nowak, R. D., and Figueiredo, M. A. Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57(7):2479–2493, 2009.
  • Wytock & Kolter (2013) Wytock, M. and Kolter, J. Z. A fast algorithm for sparse controller design. arXiv preprint arXiv:1312.4892, 2013.

Appendix A Discussion on Non-convexity of Regularized LQR.

From Lemma 2 and the appendix of (Fazel et al., 2018), the unregularized objective f(K) is known to be neither convex, quasi-convex, nor star-convex, but it is gradient dominated, which yields the claim that all stationary points are optimal as long as \mathbf{E}[x_{0}x_{0}^{T}]\succ 0. However, in regularized LQR, this claim may not hold.

To see that not every stationary point need be globally optimal, consider the regularized LQR with r(K)=\left\|K-K^{\mathrm{lqr}}\right\|, where K^{\mathrm{lqr}} is the solution of the Riccati equation and hence the global minimizer of f. Suppose there is another, distinct stationary point K^{\prime} (as in unregularized LQR). Then F(K^{\mathrm{lqr}})=f(K^{\mathrm{lqr}})+\lambda r(K^{\mathrm{lqr}})=f(K^{\mathrm{lqr}}) is always less than F(K^{\prime})=f(K^{\prime})+\lambda\left\|K^{\prime}-K^{\mathrm{lqr}}\right\|: if not, i.e., if f(K^{\mathrm{lqr}})\geq f(K^{\prime})+\lambda\left\|K^{\prime}-K^{\mathrm{lqr}}\right\|, then f(K^{\prime})<f(K^{\mathrm{lqr}}) holds, contradicting the optimality of K^{\mathrm{lqr}} for f. Hence, unlike in unregularized LQR, not all stationary points are globally optimal. Whether regularized LQR has only one stationary point is still an open question.

Appendix B Additional Examples of Proximal Operators

Assume \lambda,\lambda_{1},\lambda_{2}\in\mathbf{R}_{+} are positive numbers. We denote by (z)_{i}\in\mathbf{R} the i-th element of z, or by (z)_{j}\in\mathbf{R}^{n_{j}} its j-th block under an explicit block structure, and (z)_{+}=\max(z,0).

  • Group lasso. For a group lasso penalty r(x)=\sum_{j=1}^{N}\|x_{j}\|_{2} with x_{j}\in\mathbf{R}^{n_{j}},

    \left(\mathbf{prox}_{r,\lambda\eta}(x)\right)_{j}=\left(1-\frac{\lambda\eta}{\|x_{j}\|_{2}}\right)_{+}x_{j}
  • Elastic net. For an elastic net r(x)=\lambda_{1}\|x\|_{1}+\lambda_{2}\|x\|_{2}^{2},

    \left(\mathbf{prox}_{r,\eta}(x)\right)_{i}=\text{sign}(x_{i})\left(\frac{1}{\lambda_{2}\eta+1}|x_{i}|-\frac{\lambda_{1}\eta}{\lambda_{2}\eta+1}\right)_{+}
  • Nonnegative constraint. Let r(x)=\mathbf{1}(x\geq 0) be the nonnegative constraint. Then

    \mathbf{prox}_{r,\lambda\eta}(x)=(x)_{+}
  • Simplex constraint. Let r(x)=\mathbf{1}(x\geq 0,\mathbf{1}^{T}x=1) be the simplex constraint. Then

    \left(\mathbf{prox}_{r,\eta\lambda}(x)\right)_{i}=(x_{i}-\eta\lambda\nu)_{+},

    where \nu is the solution satisfying \sum_{i}(x_{i}-\eta\lambda\nu)_{+}=1, which can be found efficiently via bisection; a sketch is given below.
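A minimal sketch of the bisection for the simplex prox; the single threshold variable nu below stands for the product \eta\lambda\nu above, and the numerical input is only an example.

```python
import numpy as np

def prox_simplex(x, tol=1e-10):
    """Find nu such that sum_i (x_i - nu)_+ = 1 via bisection, then return (x - nu)_+."""
    lo, hi = x.min() - 1.0, x.max()       # sum >= 1 at lo, sum = 0 at hi; sum is decreasing in nu
    while hi - lo > tol:
        nu = 0.5 * (lo + hi)
        if np.maximum(x - nu, 0.0).sum() > 1.0:
            lo = nu
        else:
            hi = nu
    return np.maximum(x - nu, 0.0)

x = np.array([0.4, 1.2, -0.3])
p = prox_simplex(x)
print(p, p.sum())                          # nonnegative entries summing to 1
```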

Appendix C Proof for Convergence Analysis of S-PI

Let us define \Sigma(K)=\mathbf{E}_{x_{0}\sim\mathcal{D}}[\sum_{t=0}^{\infty}x_{t}x_{t}^{T}]. We often adopt and modify several technical lemmas, such as the perturbation analysis, from (Fazel et al., 2018).

Lemma 5 (modification of Lemma 16 in (Fazel et al., 2018)).

Suppose A+BK is stable and K^{\prime} is in the ball \mathcal{B}(K;\rho_{K}), i.e.,

K^{\prime}\in\mathcal{B}(K;\rho_{K}):=\left\{K+\Delta K\in\mathbf{R}^{m\times n}\mid\|\Delta K\|\leq\rho_{K}\right\}

where the radius \rho_{K} is

\rho_{K}=\frac{\sigma_{\mathrm{min}}(\Sigma_{0})}{4\|\Sigma(K)\|\left(\|A+BK\|+1\right)\|B\|}.

Then

\|\Sigma(K^{\prime})-\Sigma(K)\|\leq\|\Sigma(K)\| \qquad (13)
Lemma 6 (Lemma 2 restated).

For K with stable A+BK, f(K) is locally smooth with

L_{K}=4\|\Sigma(K)\|\|R+B^{T}P(K)B\|<\infty,

within the local ball \mathcal{B}(K;\rho_{K}) around K.

And f(K) is (globally) strongly convex with

m=\sigma_{\mathrm{min}}(\Sigma_{0})\sigma_{\mathrm{min}}(R)\geq 0.

In addition, A+BK^{\prime} is stable for all K^{\prime}\in\mathcal{B}(K;\rho_{K}).

Proof.

First, we expand the terms with a Taylor expansion:

f(K^{\prime})= f(K)-2\mathbf{Tr}\big{(}\Delta K^{T}\big{(}\left(R+B^{T}P(K)B\right)K+B^{T}P(K)A\big{)}\Sigma(K)\big{)}+\underbrace{\mathbf{Tr}\left(\Sigma(K^{\prime})\Delta K^{T}(R+B^{T}P(K)B)\Delta K\right)}_{①}

The second-order term is (locally) upper bounded by

① \leq\|\Sigma(K^{\prime})\|\|R+B^{T}P(K)B\|\|\Delta K\|_{F}^{2}\overset{(a)}{\leq}2\|\Sigma(K)\|\|R+B^{T}P(K)B\|\|\Delta K\|_{F}^{2}

where (a) holds due to Lemma 5,

\|\Sigma(K^{\prime})\|\leq\|\Sigma(K)\|+\|\Sigma(K)-\Sigma(K^{\prime})\|\leq 2\|\Sigma(K)\|

within the ball K^{\prime}\in\mathcal{B}(K;\rho_{K}).

On the other hand,

① \geq\sigma_{\min}(\Sigma(K^{\prime}))\sigma_{\min}(R+B^{T}P(K)B)\|\Delta K\|_{F}^{2}\overset{(b)}{\geq}\sigma_{\min}(\Sigma_{0})\sigma_{\min}(R)\|\Delta K\|_{F}^{2}

where (b) holds due to \Sigma_{0}\preceq\Sigma(K^{\prime}) and R\preceq R+B^{T}P(K)B.

Therefore, the second-order term is (locally) bounded by

\frac{m}{2}\|\Delta K\|_{F}^{2}\leq ① \leq\frac{L_{K}}{2}\|\Delta K\|_{F}^{2}

where

m= 2\sigma_{\min}(\Sigma_{0})\sigma_{\min}(R)\geq 0,\qquad L_{K}= 2\|\Sigma(K)\|\|R+B^{T}P(K)B\|<\infty.

Lemma 7 (Lemma 3 restated).

Let K^{+}=\mathbf{prox}_{r,\lambda\eta}(K-\eta\nabla f(K)). Then

K^{+}\in\mathcal{B}(K;\rho_{K})

holds for any 0<\eta<\eta_{K}^{\lambda,r}, where \eta_{K}^{\lambda,r} is given as

\eta_{K}^{\lambda,r}=\begin{cases}\frac{\rho_{K}}{\|\nabla f(K)\|+\lambda nm}&~{}r(K)=\|K\|_{1}\\ \frac{\rho_{K}}{\|\nabla f(K)\|+\lambda\min(n,m)}&~{}r(K)=\|K\|_{*}\\ \frac{\rho_{K}}{2\|\nabla f(K)\|+2\lambda\left\|K-K^{\mathrm{ref}}\right\|}&~{}r(K)=\|K-K^{\mathrm{ref}}\|_{F}^{2}\end{cases}.
Proof.

For the lasso, let S_{\lambda\eta} be the soft-thresholding operator. Then

\|K^{+}-K\|\leq\|(K-\eta\nabla f(K))-K\|+\|K^{+}-(K-\eta\nabla f(K))\|
\leq\eta\|\nabla f(K)\|+\|S_{\eta\lambda}(K-\eta\nabla f(K))-(K-\eta\nabla f(K))\|
\leq\eta\|\nabla f(K)\|+\eta\lambda nm
=\eta(\|\nabla f(K)\|+\lambda nm)
\leq\rho_{K}

where the last inequality holds iff

\eta\leq\frac{\rho_{K}}{\|\nabla f(K)\|+\lambda nm}.

For the nuclear norm,

\|K^{+}-(K-\eta\nabla f(K))\|
\leq\|\mathrm{TrunSVD}_{\eta\lambda}(K-\eta\nabla f(K))-(K-\eta\nabla f(K))\|
\leq\|U\left(\mathbf{diag}(S_{\lambda\eta}[\sigma_{1},\ldots,\sigma_{\min(n,m)}])-\mathbf{diag}(\sigma_{1},\ldots,\sigma_{\min(n,m)})\right)V^{T}\|
\leq\eta\lambda\min(n,m).

Therefore,

\|K^{+}-K\|\leq\|(K-\eta\nabla f(K))-K\|+\|K^{+}-(K-\eta\nabla f(K))\|
\leq\eta(\|\nabla f(K)\|+\lambda\min(n,m))
\leq\rho_{K}

where the last inequality holds iff

\eta\leq\frac{\rho_{K}}{\|\nabla f(K)\|+\lambda\min(n,m)}.

For the third regularizer,

\|K^{+}-(K-\eta\nabla f(K))\|
\overset{(a)}{\leq}\left\|\frac{2\eta\lambda K^{\mathrm{ref}}+K-\eta\nabla f(K)}{2\eta\lambda+1}-(K-\eta\nabla f(K))\right\|_{F}
=\left\|\frac{2\eta\lambda}{2\eta\lambda+1}(K^{\mathrm{ref}}-K)+\frac{2\eta\lambda}{2\eta\lambda+1}\eta\nabla f(K)\right\|_{F}
\overset{(b)}{\leq}2\eta\lambda\left\|K^{\mathrm{ref}}-K\right\|_{F}+\eta\left\|\nabla f(K)\right\|_{F},

where (a) holds from the closed-form solution of the proximal operator in Lemma 1 (in the main paper) and (b) holds due to \frac{2\eta\lambda}{2\eta\lambda+1}\leq 2\eta\lambda and \frac{2\eta\lambda}{2\eta\lambda+1}\leq 1. Therefore, using this inequality gives

\|K^{+}-K\|\leq\|(K-\eta\nabla f(K))-K\|+\|K^{+}-(K-\eta\nabla f(K))\|
\leq 2\eta(\left\|\nabla f(K)\right\|+\lambda\left\|K^{\mathrm{ref}}-K\right\|_{F})
\leq\rho_{K},

where the last inequality holds iff

\eta\leq\frac{\rho_{K}}{2(\left\|\nabla f(K)\right\|+\lambda\left\|K^{\mathrm{ref}}-K\right\|_{F})}.
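For concreteness, the stepsize caps derived above translate directly into code. The following is a minimal sketch, not from the paper, of a helper returning \eta_{K}^{\lambda,r} for the three regularizers; the inputs \rho_{K} and \|\nabla f(K)\| are assumed to be computed elsewhere, and the function and argument names are ours.

```python
import numpy as np

def stepsize_cap(rho_K, grad_norm, lam, n, m, reg="lasso", K=None, K_ref=None):
    """Upper bound eta_K^{lambda,r} on the stepsize from Lemma 7 (sketch).

    rho_K     : radius of the stability ball B(K; rho_K)
    grad_norm : ||grad f(K)||
    lam       : regularization weight lambda
    n, m      : state and input dimensions (K is m x n)
    """
    if reg == "lasso":                      # r(K) = ||K||_1
        return rho_K / (grad_norm + lam * n * m)
    if reg == "nuclear":                    # r(K) = ||K||_*
        return rho_K / (grad_norm + lam * min(n, m))
    if reg == "proximity":                  # r(K) = ||K - K_ref||_F^2
        dist = np.linalg.norm(K - K_ref, "fro")
        return rho_K / (2.0 * (grad_norm + lam * dist))
    raise ValueError(f"unknown regularizer: {reg}")
```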

Lemma 8.

For any 0<\eta\leq\min(\frac{1}{L_{K}},\eta_{K}^{\lambda,r}), let K^{+}=\mathbf{prox}_{r,\lambda\eta}(K-\eta\nabla f(K))=K-\eta G_{\eta}(K), where G_{\eta}(K)=\frac{1}{\eta}(K-\mathbf{prox}_{r,\lambda\eta}(K-\eta\nabla f(K))). Then, for any Z\in\mathbf{R}^{m\times n},

F(K^{+})\leq F(Z)+\mathbf{Tr}\left(G_{\eta}(K)^{T}(K-Z)\right)-\frac{m}{2}\left\|K-Z\right\|_{F}^{2}-\frac{\eta}{2}\left\|G_{\eta}(K)\right\|^{2}_{F} (14)

holds.

Proof.

For K^{+}=K-\eta G_{\eta}(K) with any 0<\eta and any Z\in\mathbf{R}^{m\times n}, we have

r(K-\eta G_{\eta}(K))
\overset{(a)}{\leq}r(Z)-\mathbf{Tr}\left(\partial r(K-\eta G_{\eta}(K))^{T}(Z-K+\eta G_{\eta}(K))\right)
\overset{(b)}{=}r(Z)-\mathbf{Tr}\left((G_{\eta}(K)-\nabla f(K))^{T}(Z-K+\eta G_{\eta}(K))\right)
=r(Z)+\mathbf{Tr}\left(G_{\eta}(K)^{T}(K-Z)\right)-\eta\left\|G_{\eta}(K)\right\|^{2}_{F}+\mathbf{Tr}\left(\nabla f(K)^{T}(Z-K+\eta G_{\eta}(K))\right)

where (a) holds due to the convexity of r and (b) holds due to the optimality condition of the proximal operator, i.e., G_{\eta}(K)-\nabla f(K)\in\partial r(K-\eta G_{\eta}(K)). Next, for any 0<\eta\leq\eta_{K}^{\lambda,r}, K^{+}\in\mathcal{B}(K;\rho_{K}) holds from Lemma 7 and thus f(K) is locally smooth. Therefore

f(K-\eta G_{\eta}(K))
\overset{(c)}{\leq}f(K)-\mathbf{Tr}\left(\nabla f(K)^{T}\eta G_{\eta}(K)\right)+\frac{L_{K}\eta^{2}}{2}\left\|G_{\eta}(K)\right\|_{F}^{2}
\overset{(d)}{\leq}f(K)-\mathbf{Tr}\left(\nabla f(K)^{T}\eta G_{\eta}(K)\right)+\frac{\eta}{2}\left\|G_{\eta}(K)\right\|_{F}^{2}
\overset{(e)}{\leq}f(Z)-\mathbf{Tr}\left(\nabla f(K)^{T}(Z-K)\right)-\frac{m}{2}\left\|Z-K\right\|_{F}^{2}-\mathbf{Tr}\left(\nabla f(K)^{T}\eta G_{\eta}(K)\right)+\frac{\eta}{2}\left\|G_{\eta}(K)\right\|_{F}^{2}
=f(Z)-\mathbf{Tr}\left(\nabla f(K)^{T}(Z-K+\eta G_{\eta}(K))\right)-\frac{m}{2}\left\|Z-K\right\|_{F}^{2}+\frac{\eta}{2}\left\|G_{\eta}(K)\right\|_{F}^{2} (15)

where (c) holds due to the local L_{K}-smoothness for K^{+}\in\mathcal{B}(K;\rho_{K}), (d) holds by \eta\leq\frac{1}{L_{K}}, and (e) holds due to the m-strong convexity of f at K. Note that substituting Z=K in (15) yields the linesearch criterion in Eq. (8) (in the main paper), which is satisfied for a small enough stepsize \eta after finitely many linesearch iterations.

Adding the two inequalities above gives

F(K^{+})=f(K-\eta G_{\eta}(K))+r(K-\eta G_{\eta}(K))
\leq F(Z)+\mathbf{Tr}\left(G_{\eta}(K)^{T}(K-Z)\right)-\frac{m}{2}\left\|Z-K\right\|_{F}^{2}-\frac{\eta}{2}\left\|G_{\eta}(K)\right\|^{2}_{F}. (16)

Proposition 2 (Proposition 1 restated).

Assume A+BK is stable. For any stepsize 0<\eta\leq\min(\frac{1}{L_{K}},\eta_{K}^{\lambda,r}) and next iterate K^{+}=\mathbf{prox}_{r(\cdot),\eta\lambda}(K-\eta\nabla f(K)),

\rho(A+BK^{+})<1 (17)
F(K^{+})\leq F(K)-\frac{1}{2\eta}\|K-K^{+}\|^{2}_{F} (18)

hold.

Proof.

Since K^{+}\in\mathcal{B}(K;\rho_{K}) by Lemma 7, (17) follows immediately from Lemma 8 in (Fazel et al., 2018). Applying Z=K in our Lemma 8 then gives (18). ∎
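To make Proposition 2 concrete, the following is a minimal sketch, our illustration rather than the paper's reference implementation, of one policy-improvement step with a lasso regularizer. It assumes the standard LQR evaluation f(K)=\mathbf{Tr}(P(K)\Sigma_{0}), with P(K) and \Sigma(K) obtained from discrete Lyapunov equations, and backtracks the stepsize (shrink factor beta, our notation) until both the stability condition (17) and the descent condition (18) hold.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def lqr_quantities(A, B, Q, R, Sigma0, K):
    """P(K), Sigma(K), f(K), grad f(K) for the closed loop A + B K (sketch)."""
    Acl = A + B @ K
    P = solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)   # P = Acl^T P Acl + Q + K^T R K
    Sigma = solve_discrete_lyapunov(Acl, Sigma0)          # Sigma = Acl Sigma Acl^T + Sigma0
    f = np.trace(P @ Sigma0)
    grad = 2.0 * ((R + B.T @ P @ B) @ K + B.T @ P @ A) @ Sigma
    return P, Sigma, f, grad

def soft_threshold(X, tau):
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def spi_step(A, B, Q, R, Sigma0, K, lam, eta0=1.0, beta=0.5, max_backtrack=50):
    """One S-PI policy-improvement step with backtracking (lasso regularizer)."""
    _, _, f, grad = lqr_quantities(A, B, Q, R, Sigma0, K)
    F = f + lam * np.abs(K).sum()
    eta = eta0
    for _ in range(max_backtrack):
        K_plus = soft_threshold(K - eta * grad, lam * eta)            # prox step
        if np.max(np.abs(np.linalg.eigvals(A + B @ K_plus))) < 1.0:   # stability (17)
            _, _, f_plus, _ = lqr_quantities(A, B, Q, R, Sigma0, K_plus)
            F_plus = f_plus + lam * np.abs(K_plus).sum()
            if F_plus <= F - np.linalg.norm(K - K_plus, "fro")**2 / (2 * eta):  # descent (18)
                return K_plus, eta
        eta *= beta
    return K, eta  # no acceptable step found; keep the current policy
```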

Lemma 9 (Lemma 4 restated).

Assume that \{K^{i}\}_{i=0,\ldots} is a stabilizing sequence and that the associated sequences \{f(K^{i})\}_{i=0,\ldots} and \{\|K^{i}-K^{\star}\|_{F}\}_{i=0,\ldots} are decreasing. Then, Lemma 7 holds for

\eta^{\lambda,r}=\begin{cases}\frac{\rho^{L}}{\rho^{f}+\lambda nm}&~{}r(K)=\|K\|_{1}\\ \frac{\rho^{L}}{\rho^{f}+\lambda\min(n,m)}&~{}r(K)=\|K\|_{*}\\ \frac{\rho^{L}}{2\rho^{f}+4\lambda\Delta}&~{}r(K)=\|K-K^{\mathrm{ref}}\|_{F}^{2}\end{cases}. (19)

where

\rho^{f}=2\frac{F(K^{0})}{\sigma_{\min}(Q)}\bigg{(}\left\|B^{T}\right\|\frac{F(K^{0})}{\sigma_{\min}(\Sigma_{0})}\left\|A\right\|+\left(\left\|R\right\|+\left\|B^{T}\right\|\frac{F(K^{0})}{\sigma_{\min}(\Sigma_{0})}\left\|B\right\|\right)(\Delta+\left\|K^{\star}\right\|)\bigg{)},
\rho^{L}=\frac{\sigma_{\mathrm{min}}(\Sigma_{0})\sigma_{\mathrm{min}}(Q)}{8F(K^{0})\|B\|}.
Proof.

For the proof, we derive the global bounds \|\nabla f(K^{i})\|\leq\rho^{f} and \rho_{K}\geq\rho^{L}, and then plug these into Lemma 7 to complete the claim. First, we utilize the derivation of the upper bounds on \|P(K^{i})\| and \|\Sigma(K^{i})\| in (Fazel et al., 2018), which hold under the assumption of a decreasing sequence:

\|P(K^{i})\|\leq\frac{F(K^{0})}{\sigma_{\min}(\Sigma_{0})},\qquad\|\Sigma(K^{i})\|\leq\frac{F(K^{0})}{\sigma_{\min}(Q)}.

From this, we have

\rho_{K}=\frac{\sigma_{\mathrm{min}}(\Sigma_{0})}{4\|\Sigma(K)\|\left(\|A+BK\|+1\right)\|B\|}\geq\frac{\sigma_{\mathrm{min}}(\Sigma_{0})\sigma_{\mathrm{min}}(Q)}{8F(K^{0})\|B\|}=\rho^{L},

where we used the facts that \|A+BK^{i}\|<1 and \|\Sigma(K^{i})\|\leq\frac{F(K^{0})}{\sigma_{\min}(Q)}. Now we complete the proof by also deriving \rho^{f}. The gradient norm \|\nabla f(K^{i})\| is bounded as

\|\nabla f(K^{i})\|\leq\left\|2\left(\left(R+B^{T}PB\right)K+B^{T}PA\right)\Sigma\right\|
\leq 2\left(\left(\left\|R\right\|+\left\|B^{T}\right\|\left\|P\right\|\left\|B\right\|\right)\left\|K\right\|+\left\|B^{T}\right\|\left\|P\right\|\left\|A\right\|\right)\left\|\Sigma\right\|
\leq 2\left(\left(\left\|R\right\|+\left\|B^{T}\right\|\frac{F(K^{0})}{\sigma_{\mathrm{min}}(\Sigma_{0})}\left\|B\right\|\right)\left\|K\right\|+\left\|B^{T}\right\|\frac{F(K^{0})}{\sigma_{\mathrm{min}}(\Sigma_{0})}\left\|A\right\|\right)\frac{F(K^{0})}{\sigma_{\mathrm{min}}(Q)}
\leq 2\left(\left(\left\|R\right\|+\left\|B^{T}\right\|\frac{F(K^{0})}{\sigma_{\mathrm{min}}(\Sigma_{0})}\left\|B\right\|\right)\left(\left\|K^{\star}\right\|+\Delta\right)+\left\|B^{T}\right\|\frac{F(K^{0})}{\sigma_{\mathrm{min}}(\Sigma_{0})}\left\|A\right\|\right)\frac{F(K^{0})}{\sigma_{\mathrm{min}}(Q)}=\rho^{f},

where the last inequality holds due to \left\|K\right\|\leq\left\|K^{\star}\right\|+\left\|K-K^{\star}\right\|\leq\left\|K^{\star}\right\|+\Delta.

Proposition 3.

Let \eta_{i} be the stepsize from the backtracking linesearch at the i-th iteration. After N iterations, Algorithm 1 approaches a stationary point K^{\star} in the sense that

\min_{i=1,\ldots,N}\|G_{\eta_{i}}(K^{i})\|^{2}_{F}\leq\frac{2(F(K^{0})-F^{\star})}{\eta_{\min}N}

where G_{\eta_{i}}(K^{i})\in\nabla f(K^{i})+\partial r(K^{i}-\eta_{i}\nabla f(K^{i})), and G_{\eta_{i}}(K^{i})=0 iff 0\in\partial F(K^{i}). Moreover,

\eta_{\min}=h_{\eta}\bigg{(}\sigma_{\mathrm{min}}(\Sigma_{0}),\sigma_{\mathrm{min}}(Q),\frac{1}{\lambda},\frac{1}{\left\|A\right\|},\frac{1}{\left\|B\right\|},\frac{1}{\left\|R\right\|},\frac{1}{\Delta},\frac{1}{F(K^{0})}\bigg{)}

where h_{\eta} is a function that is non-decreasing in each argument.

Proof.

From Proposition 2 (Eq. (18)),

F(K^{i+1})\leq F(K^{i})-\frac{\eta_{i}}{2}\|G_{\eta_{i}}(K^{i})\|_{F}^{2}\quad\forall i.

Reordering terms and averaging over iterations i=1,\ldots,N gives

\frac{1}{N}\sum_{i=1}^{N}\|G_{\eta_{i}}(K^{i})\|_{F}^{2}\leq\frac{2}{N}\sum_{i=1}^{N}\frac{1}{\eta_{i}}(F(K^{i})-F(K^{i+1}))\leq\frac{2(F(K^{0})-F(K^{\star}))}{\underset{i=1,\ldots,N}{\min}\eta_{i}\,N}.

The left-hand side is lower bounded by

\frac{1}{N}\sum_{i=1}^{N}\|G_{\eta_{i}}(K^{i})\|_{F}^{2}\geq\underset{{i=1,\ldots,N}}{\min}\|G_{\eta_{i}}(K^{i})\|_{F}^{2},

giving the desired result. Moreover, the iterates converge to a stationary point since \lim_{i\rightarrow\infty}G_{\eta_{i}}(K^{i})=0.

Now the remaining part is to bound the stepsize. Note that the stepsize \eta_{i} after the linesearch satisfies

\eta_{i}\geq\frac{1}{\beta}\min\left(\frac{1}{L_{K_{i}}},\eta_{K_{i}}^{\lambda,r}\right).

First, we bound \frac{1}{L_{K_{i}}} as follows:

\frac{1}{L_{K^{i}}}=\frac{1}{4\|\Sigma(K^{i})\|\|R+B^{T}P(K^{i})B\|}
\geq\frac{1}{4\|\Sigma(K^{i})\|(\|R\|+\|B^{T}\|\|P(K^{i})\|\|B\|)}
\geq\frac{\sigma_{\min}(\Sigma_{0})\sigma_{\min}(Q)}{4F(K^{0})(\sigma_{\min}(\Sigma_{0})\|R\|+F(K^{0})\|B^{T}\|\|B\|)}.

Next, for the bound on \eta_{K_{i}}^{\lambda,r}, we already have \eta_{K_{i}}^{\lambda,r}\geq\eta^{\lambda,r} from Lemma 9.

Note that both bounds increase with \sigma_{\mathrm{min}}(\Sigma_{0}) and \sigma_{\mathrm{min}}(Q), and decrease with \left\|A\right\|, \left\|B\right\|, \left\|R\right\|, \Delta, and F(K^{0}).

Therefore

\min_{i=1,\ldots,N}\eta_{i}\geq\eta_{\min}=h_{\eta}\bigg{(}\sigma_{\mathrm{min}}(\Sigma_{0}),\sigma_{\mathrm{min}}(Q),\frac{1}{\lambda},\frac{1}{\left\|A\right\|},\frac{1}{\left\|B\right\|},\frac{1}{\left\|R\right\|},\frac{1}{\Delta},\frac{1}{F(K^{0})}\bigg{)}

for some h_{\eta} that is non-decreasing in each argument.

Theorem 3 (Theorem 1 restated).

The iterate K^{i} from Algorithm 1 converges to the stationary point K^{\star}. Moreover, it converges linearly, i.e., after N iterations,

\left\|K^{N}-K^{\star}\right\|_{F}^{2}\leq\left(1-\frac{1}{\kappa}\right)^{N}\left\|K^{0}-K^{\star}\right\|_{F}^{2}.

Here, \kappa=1/\left(\eta_{\min}\sigma_{\mathrm{min}}(\Sigma_{0})\sigma_{\mathrm{min}}(R)\right)>1 where

\eta_{\min}=h_{\eta}\bigg{(}\sigma_{\mathrm{min}}(\Sigma_{0}),\sigma_{\mathrm{min}}(Q),\frac{1}{\lambda},\frac{1}{\left\|A\right\|},\frac{1}{\left\|B\right\|},\frac{1}{\left\|R\right\|},\frac{1}{\Delta},\frac{1}{F(K^{0})}\bigg{)}, (20)

for some function h_{\eta} that is non-decreasing in each argument.

Proof.

Substituting Z=K^{\star} in Lemma 8 gives

F(K^{+})-F^{\star}
\leq\mathbf{Tr}\left(G_{\eta}(K)^{T}(K-K^{\star})\right)-\frac{m}{2}\left\|K-K^{\star}\right\|_{F}^{2}-\frac{\eta}{2}\left\|G_{\eta}(K)\right\|_{F}^{2}
=\frac{1}{2\eta}\left(\left\|K-K^{\star}\right\|_{F}^{2}-\left\|K-K^{\star}-\eta G_{\eta}(K)\right\|_{F}^{2}\right)-\frac{m}{2}\left\|K-K^{\star}\right\|_{F}^{2}
=\frac{1}{2\eta}\left(\left\|K-K^{\star}\right\|_{F}^{2}-\left\|K^{+}-K^{\star}\right\|_{F}^{2}\right)-\frac{m}{2}\left\|K-K^{\star}\right\|_{F}^{2}.

Reordering terms gives

\left\|K^{+}-K^{\star}\right\|^{2}_{F}\leq\left\|K-K^{\star}\right\|^{2}_{F}-\left(2\eta(F(K^{+})-F^{\star})+m\eta\left\|K-K^{\star}\right\|^{2}_{F}\right)\leq(1-m\eta)\left\|K-K^{\star}\right\|^{2}_{F},

where the last inequality holds due to F(K^{+})-F^{\star}\geq 0.

Therefore, after N iterations,

\left\|K^{N}-K^{\star}\right\|^{2}_{F}\leq(1-m\eta_{N})\cdots(1-m\eta_{1})\left\|K^{0}-K^{\star}\right\|^{2}_{F}\leq(1-m\eta_{\min})^{N}\left\|K^{0}-K^{\star}\right\|^{2}_{F},

where \eta_{\min} is the same quantity as in Proposition 3. Since m\geq\sigma_{\mathrm{min}}(\Sigma_{0})\sigma_{\mathrm{min}}(R), this yields the claimed rate with \kappa=1/\left(\eta_{\min}\sigma_{\mathrm{min}}(\Sigma_{0})\sigma_{\mathrm{min}}(R)\right). ∎

Corollary 2.

Let K^{\star} be the stationary point from Algorithm 1. Then, for any number of iterations

N\geq 2\kappa\log\left(\frac{\left\|K^{0}-K^{\star}\right\|_{F}}{\epsilon}\right),

\left\|K^{N}-K^{\star}\right\|_{F}\leq\epsilon

holds, where \kappa=1/\left(\eta_{\min}\sigma_{\mathrm{min}}(\Sigma_{0})\sigma_{\mathrm{min}}(R)\right)>1 and \eta_{\mathrm{min}} is given in Eq. (20).

Proof.

This is immediate from Theorem 3, using the inequality (1-1/\kappa)^{N}\leq e^{-N/\kappa} and taking the logarithm. ∎
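As an illustrative calculation with hypothetical values (not from the paper), if \kappa=10^{2} and \left\|K^{0}-K^{\star}\right\|_{F}/\epsilon=10^{3}, then N\geq 2\cdot 10^{2}\cdot\log(10^{3})\approx 1382 iterations (natural logarithm) suffice to guarantee \left\|K^{N}-K^{\star}\right\|_{F}\leq\epsilon.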

Appendix D Proof for Convergence Analysis of Model-free S-PI

Lemma 10 (Lemma 30 in (Fazel et al., 2018)).

There exist polynomials h_{N_{\mathrm{traj}}}, h_{H}, h_{r} such that, when r<1/h_{r}(1/\epsilon), N_{\mathrm{traj}}\geq h_{N_{\mathrm{traj}}}(n,1/\epsilon,\frac{L^{2}}{\sigma_{\mathrm{min}}(\Sigma_{0})}), and H\geq h_{H}(n,1/\epsilon), the gradient estimate \widehat{\nabla f(K)} given in Eq. (13) of Algorithm 3 satisfies

\left\|\widehat{\nabla f(K)}-{\nabla f(K)}\right\|_{F}\leq\epsilon

with high probability (at least 1-{(\epsilon/n)}^{n}).
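Since Algorithm 3 and its Eq. (13) appear in the main paper, we include here only an illustrative sketch of the kind of smoothing-based (zeroth-order) estimator that Lemma 10 refers to, following the estimator used in (Fazel et al., 2018); the helper names, the rollout routine, and the x_{0} sampler are our assumptions.

```python
import numpy as np

def empirical_cost(A, B, Q, R, K, x0, H):
    """Finite-horizon empirical cost of u_t = K x_t from initial state x0 (sketch)."""
    x, cost = x0, 0.0
    for _ in range(H):
        cost += x @ (Q + K.T @ R @ K) @ x
        x = (A + B @ K) @ x
    return cost

def smoothed_gradient(A, B, Q, R, K, sample_x0, n_traj, H, radius):
    """Zeroth-order estimate of grad f(K) via random smoothing.

    Perturbs K uniformly on a Frobenius sphere of the given radius, rolls out
    the perturbed policy, and averages cost-weighted perturbation directions.
    """
    m, n = K.shape
    d = m * n
    grad_hat = np.zeros_like(K)
    for _ in range(n_traj):
        U = np.random.randn(m, n)
        U *= radius / np.linalg.norm(U, "fro")     # uniform direction, norm = radius
        c = empirical_cost(A, B, Q, R, K + U, sample_x0(), H)
        grad_hat += (d / (n_traj * radius ** 2)) * c * U
    return grad_hat
```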

Theorem 4 (Theorem 2 restated).

Suppose f(K^{0}) is finite, \Sigma_{0}\succ 0, and that x_{0}\sim\mathcal{D} has norm bounded by L almost surely. Suppose the parameters in Algorithm 3 are chosen as

(N_{\mathrm{traj}},H,1/r)=h\left(n,\frac{1}{\sigma_{\mathrm{min}}(\Sigma_{0})\sigma_{\mathrm{min}}(R)},\frac{L^{2}}{\sigma_{\mathrm{min}}(\Sigma_{0})}\right)

for some polynomials h. Then, with the stepsize \eta_{\min} of Eq. (20), Algorithm 3 converges to its stationary point K^{\star} with high probability. In particular, there exists an iteration N of at most 4\kappa\log\left(\frac{\left\|K^{0}-K^{\star}\right\|_{F}}{\epsilon}\right) such that \left\|K^{N}-K^{\star}\right\|\leq\epsilon with probability at least 1-{o}(\epsilon^{n-1}). Moreover, it converges linearly,

\left\|K^{i}-K^{\star}\right\|^{2}\leq\left(1-\frac{1}{2\kappa}\right)^{i}\left\|K^{0}-K^{\star}\right\|^{2},

for iterations i=1,\ldots,N, where \kappa=1/\left(\eta_{\min}\sigma_{\mathrm{min}}(\Sigma_{0})\sigma_{\mathrm{min}}(R)\right)>1.

Proof.

Let \epsilon be the error bound we want to obtain, i.e., \left\|K^{N}-K^{\star}\right\|\leq\epsilon, where K^{N} is the policy from Algorithm 3 after N iterations.

For notational simplicity, we denote K\leftarrow K^{i} and examine the contraction of the proximal step at the i-th iteration. First, we use Lemma 10 to claim that, with high probability, \left\|\widehat{\nabla f(K)}-{\nabla f(K)}\right\|_{F}\leq\alpha\epsilon for a sufficiently large number of trajectories N_{\mathrm{traj}} and horizon H, where \alpha is specified later.

Second, we bound the error after one iteration of the approximate proximal gradient step at the policy K, i.e., \left\|K^{\prime}-K^{+}\right\|_{F}. Here, let K^{\prime}=\mathbf{prox}_{r,\lambda\eta}(K-\eta\widehat{\nabla f(K)}) be the next iterate using the approximate gradient \widehat{\nabla f(K)} and K^{+}=\mathbf{prox}_{r,\lambda\eta}(K-\eta\nabla f(K)) be the one using the exact gradient \nabla f(K). Then

\left\|K^{\prime}-K^{+}\right\|_{F}=\left\|\mathbf{prox}_{r,\lambda\eta}(K-\eta\widehat{\nabla f(K)})-\mathbf{prox}_{r,\lambda\eta}(K-\eta\nabla f(K))\right\|_{F}
\leq\left\|(K-\eta\widehat{\nabla f(K)})-(K-\eta\nabla f(K))\right\|_{F}
=\eta\left\|\widehat{\nabla f(K)}-\nabla f(K)\right\|_{F}
\leq\eta\alpha\epsilon,

where we use the fact that the proximal operator is non-expansive and that \left\|\widehat{\nabla f(K)}-{\nabla f(K)}\right\|_{F}\leq\alpha\epsilon holds for proper parameter choices (the claim in the previous paragraph).

Third, we derive a contractive upper bound after one iteration of the approximate proximal gradient step:

\left\|K^{\prime}-K^{\star}\right\|_{F}\leq\left\|K^{\prime}-K^{+}\right\|_{F}+\left\|K^{+}-K^{\star}\right\|_{F}
\leq\left\|K^{\prime}-K^{+}\right\|_{F}+\sqrt{1-1/\kappa}\left\|K-K^{\star}\right\|_{F}
\leq\eta\alpha\epsilon+\sqrt{1-1/\kappa}\left\|K-K^{\star}\right\|_{F}.

Assume \left\|K-K^{\star}\right\|_{F}\geq\epsilon for the current policy. Then, squaring both sides gives

\left\|K^{\prime}-K^{\star}\right\|^{2}_{F}\leq\eta^{2}\alpha^{2}\epsilon^{2}+2\eta\alpha\epsilon\sqrt{1-1/\kappa}\left\|K-K^{\star}\right\|_{F}+(1-1/\kappa)\left\|K-K^{\star}\right\|^{2}_{F}
\leq\left(1-1/\kappa+2\alpha\eta+\alpha^{2}\eta^{2}\right)\left\|K-K^{\star}\right\|^{2}_{F},

where we used \sqrt{1-1/\kappa}\leq 1, 1\leq\kappa, and the assumption \epsilon\leq\left\|K-K^{\star}\right\|_{F}. Choosing \alpha=\frac{1}{5\eta\kappa}=\frac{\sigma_{\mathrm{min}}(\Sigma_{0})\sigma_{\mathrm{min}}(R)}{5} results in

\left\|K^{\prime}-K^{\star}\right\|^{2}\leq\left(1-1/(2\kappa)\right)\left\|K-K^{\star}\right\|^{2},

with probability at least 1-(\alpha\epsilon/n)^{n}. This shows that the approximate proximal gradient step is contractive, i.e., the error decreases after one iteration. Applying this inequality repeatedly, we get

\left\|K^{i}-K^{\star}\right\|^{2}\leq\left(1-1/(2\kappa)\right)^{i}\left\|K^{0}-K^{\star}\right\|^{2}

as long as \epsilon\leq\left\|K^{i-1}-K^{\star}\right\|_{F}.

Hence there must exist an iteration N>0 such that

\left\|K^{N}-K^{\star}\right\|_{F}\leq\epsilon\leq\left\|K^{N-1}-K^{\star}\right\|_{F}. (21)

Now we claim that this N is at most 4\kappa\log\left(\frac{\left\|K^{0}-K^{\star}\right\|_{F}}{\epsilon}\right). To prove this claim, suppose it is not, i.e., N\geq 4\kappa\log\left(\frac{\left\|K^{0}-K^{\star}\right\|_{F}}{\epsilon}\right)+1. Then, for N>1,

\left\|K^{N-1}-K^{\star}\right\|_{F}^{2}\leq\left(1-1/(2\kappa)\right)^{N-1}\left\|K^{0}-K^{\star}\right\|^{2}
<e^{-\frac{N-1}{2\kappa}}\left\|K^{0}-K^{\star}\right\|^{2}
\leq\left(\frac{\left\|K^{0}-K^{\star}\right\|_{F}}{\epsilon}\right)^{-2}\left\|K^{0}-K^{\star}\right\|^{2}
=\epsilon^{2},

which contradicts (21).

Finally, we bound the probability that this event occurs. Note that all randomness arises from estimating the N gradients within \alpha\epsilon error. By the union bound, the event occurs with probability at least 1-N(\alpha\epsilon/n)^{n}, which is bounded below by

1-N(\alpha\epsilon/n)^{n}\geq 1-\left(4\kappa\log\left(\frac{\left\|K^{0}-K^{\star}\right\|_{F}}{\epsilon}\right)+1\right)(\alpha\epsilon/n)^{n}
\geq 1-{o}\left(\epsilon^{n}\log\left(\frac{\left\|K^{0}-K^{\star}\right\|_{F}}{\epsilon}\right)\right)
=1-{o}(\epsilon^{n-1}). ∎

Appendix E Additional Experiments on Stepsize Sensitivity

In this section, we scrutinize the convergence behavior of S-PI under fixed stepsizes. For a very small Laplacian system with (n,m)=(3,3) and a lasso penalty with \lambda=3000, we run S-PI over a wide range of stepsizes. For stepsizes larger than 3.7e-4, S-PI diverges, so we only report runs with stepsizes smaller than 3.7e-4. Let K^{\min} be the policy attaining the minimum objective value over all iterates and K^{\star} be the policy from S-PI with linesearch (i.e., a non-fixed stepsize). Here, the cardinality of the optimal policy is 3. For fixed stepsizes in \{3.7e-5, 3.7e-6\}, S-PI converges to the optimal policy: in Figure 7, the objective value decreases monotonically and the policy converges to the optimal one in terms of both error and cardinality. However, for smaller stepsizes such as \{3.7e-7, 3.7e-8, 3.7e-9\}, Figure 8 shows that S-PI still converges but neither behaves monotonically nor converges to the optimal policy. These figures demonstrate the sensitivity of S-PI to the stepsize when it is run with a fixed stepsize rather than with linesearch. As in Figure 8, the algorithm can be unstable under a fixed stepsize because the next iterate K^{+} may not satisfy the stability condition \rho(A+BK^{+})<1, nor is a monotonic decrease guaranteed. Moreover, this instability may lead to a different stationary point even when the iterate falls into a stable policy region after a certain number of iterations. This not only demonstrates the importance of linesearch, given the sensitivity to the stepsize, but may also provide evidence for why other policy-gradient-type methods for LQR have not performed well in practice. A small driver sketch for such a fixed-stepsize experiment is given below.
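The following is a minimal, self-contained sketch of a fixed-stepsize run (our illustration; the exact Laplacian system, the choice of \lambda, and the plotting code of the paper are not reproduced here). It records the regularized objective, the spectral radius of A+BK, and the cardinality of K at each iteration, and stops if the iterate leaves the stabilizing region.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def spi_fixed_stepsize(A, B, Q, R, Sigma0, K0, lam, eta, n_iter=500):
    """Run S-PI with a fixed stepsize (lasso regularizer) and log diagnostics."""
    K, history = K0.copy(), []
    for _ in range(n_iter):
        Acl = A + B @ K
        rho = np.max(np.abs(np.linalg.eigvals(Acl)))
        if rho >= 1.0:
            break  # iterate left the stabilizing region; f(K) is no longer finite
        P = solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)
        Sigma = solve_discrete_lyapunov(Acl, Sigma0)
        grad = 2.0 * ((R + B.T @ P @ B) @ K + B.T @ P @ A) @ Sigma
        F = np.trace(P @ Sigma0) + lam * np.abs(K).sum()
        history.append((F, rho, int(np.count_nonzero(K))))
        V = K - eta * grad
        K = np.sign(V) * np.maximum(np.abs(V) - lam * eta, 0.0)  # fixed-stepsize proximal step
    return K, history
```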

Figure 6: Convergence behavior of the Structured Policy Iteration (S-PI) under linesearch for the Laplacian system with (n,m)=(3,3) over various values of \lambda.
Figure 7: Convergence behavior of the Structured Policy Iteration (S-PI) over fixed stepsizes \{3.7e-5, 3.7e-6\} for the Laplacian system with (n,m)=(3,3) and \lambda=3000.
Figure 8: Convergence behavior of the Structured Policy Iteration (S-PI) over fixed stepsizes \{3.7e-7, 3.7e-8, 3.7e-9\} for the Laplacian system with (n,m)=(3,3) and \lambda=3000.