Robust and Kernelized Data-Enabled Predictive Control for Nonlinear Systems
Abstract
This paper presents a robust and kernelized data-enabled predictive control (RoKDeePC) algorithm to perform model-free optimal control for nonlinear systems using only input and output data. The algorithm combines robust predictive control and a non-parametric representation of nonlinear systems enabled by regularized kernel methods. The latter is based on implicitly learning the nonlinear behavior of the system via the representer theorem. Instead of seeking a model and then performing control design, our method goes directly from data to control. This allows us to robustify the control inputs against the uncertainties in data by considering a min-max optimization problem to calculate the optimal control sequence. We show that by incorporating a proper uncertainty set, this min-max problem can be reformulated as a nonconvex but structured minimization problem. By exploiting its structure, we present a projected gradient descent algorithm to effectively solve this problem. Finally, we test the RoKDeePC on two nonlinear example systems – one academic case study and a grid-forming converter feeding a nonlinear load – and compare it with some existing nonlinear data-driven predictive control methods.
Index Terms:
Data-driven control, kernel methods, nonlinear control, predictive control, robust optimization.
I Introduction
Data-driven control has attracted extensive attention in recent years, as it enables optimal control in scenarios where data is readily available but the system models are too complex to obtain or maintain; this is often the case, for example, in large-scale energy systems or robotics [1, 2, 3].
One standard approach to data-driven control is indirect. Here one first uses data to identify a system model and then performs control design based on the identified model. This paradigm has a long history in control theory, leading to the developments of system identification (SysID) [4], model predictive control (MPC) [5], etc. Prediction error, maximum likelihood, and subspace methods are popular SysID approaches, and have been successfully combined with model-based control design in many industrial applications. Broadly recognized challenges of the indirect approach are incompatible uncertainty estimates and customizing the SysID (e.g., in terms of objective and model order selection) for the ultimate control objective. We refer to [6, 7, 8, 9] for different approaches. Moreover, recent advances in systems theory, statistics, machine learning, and deep learning have introduced many promising techniques to learn the behavior of general dynamical systems from data, e.g., by using Koopman operators, Gaussian processes, kernel methods, and neural networks [10, 11, 12, 13, 14, 15]. The learned models can also be combined with model-based control design (e.g., MPC) to perform optimal control [16, 17, 18, 19, 20, 21]. However, these methods still belong to indirect data-driven control, and one may not be able to provide deterministic guarantees on the control performance, especially when complicated structures, e.g., neural networks, are used to learn the system’s behaviors. We note that compared to using neural networks, function estimation using regularized kernel methods has a tractable formulation and solution, thanks to the well-posedness of the function classes in reproducing kernel Hilbert spaces (RKHSs) [12, 22].
An alternative formulation is direct data-driven control that aims to go directly from data to control, bypassing the identification step. In recent years, a result formulated by Willems and his co-authors in the context of behavioral system theory, known as the Fundamental Lemma [23], has been widely used in formulating direct data-driven control methods such as [24, 25, 26, 1]. The Fundamental Lemma shows that the subspace of input/output trajectories of a linear time-invariant (LTI) system can be obtained from the column span of a data Hankel matrix, which acts as a data-centric representation of the system dynamics. Unlike indirect data-driven control methods which usually involve a bi-level optimization problem (an inner problem for the SysID step and an outer problem for the control design step), these direct data-driven control methods formulate one single data-to-control optimization problem. In this paradigm, one can conveniently provide robust control against uncertainties in data via regularization and relaxation [8]. For instance, in the data-enabled predictive control (DeePC) algorithm [24], a proper regularization leads to end-to-end distributional robustness against uncertainties in data [27, 28]. Moreover, performance guarantees on the realized input/output cost can be provided given that the inherent uncertainty set is large enough [29]. Motivated in part by the desire to extend this direct data-driven approach beyond LTI systems, the Fundamental Lemma has been extended to certain classes of nonlinear systems, e.g., Hammerstein-Wiener systems [30], second order discrete Volterra systems [31], flat nonlinear systems [32], and polynomial time-invariant systems [33], which enables the control design for these classes of nonlinear systems. However, to the best of our knowledge, there has not been a version of the Fundamental Lemma for general nonlinear systems.
In this paper, we develop a direct data-driven method for nonlinear systems, referred to as robust and kernelized DeePC (RoKDeePC), which combines kernel methods with a robustified DeePC framework. We use regularized kernel methods [22] to implicitly learn the future behavior of nonlinear systems, then directly formulate a data-to-control min-max optimization problem to obtain robust and optimal control sequences. We prove that the realized cost of the system is upper bounded by the optimization cost of this optimization problem, leading to deterministic performance guarantees. Furthermore, we demonstrate how this min-max problem can be reformulated as a nonconvex yet structured minimization problem when appropriate uncertainty sets are considered. To enable a real-time implementation, we develop a projected gradient descent algorithm exploiting the problem structure to effectively solve the RoKDeePC problem. Unlike learning-based control using neural networks, our method does not require a time-consuming off-line learning process and has real-time plug-and-play ability to handle nonlinear systems. We test RoKDeePC on two nonlinear example systems – one academic case study and a grid-forming converter feeding a nonlinear load – and compare its performance with the Koopman-based MPC method in [17] and MPC that uses a kernel-based predictor as the system model. These methods are all based on linear predictors in the lifted high-dimensional space and thus fair comparisons can possibly be made.
The rest of this paper is organized as follows: in Section II we review the linear predictor and then introduce the kernel-based nonlinear predictor using input/output data. Section III proposes the RoKDeePC algorithm and demonstrates its performance guarantees. Section IV develops gradient descent algorithms to efficiently solve the RoKDeePC. Section V tests the RoKDeePC algorithm on two example nonlinear systems. We conclude the paper in Section VI.
Notation
Let $\mathbb{N}$ denote the set of positive integers, and $\mathbb{S}^n$ ($\mathbb{S}^n_{\succ 0}$) the set of real $n$-by-$n$ symmetric (positive definite) matrices. We denote the index set with cardinality $N$ as $[N]$, that is, $[N] = \{1, \dots, N\}$. We use $\|x\|$ ($\|A\|$) to denote the (induced) 2-norm of the vector $x$ (the matrix $A$); we use $\|x\|_1$ to denote the 1-norm of the vector $x$; for a vector $x$, we use $\|x\|_Q$ to denote $\sqrt{x^\top Q x}$; for a matrix $A$, we use $\|A\|_Q$ to denote $\|Q^{1/2} A\|$; we use $\|A\|_F$ to denote the Frobenius norm of the matrix $A$. We use $x_i$ to denote the $i$th element of $x$ and $x_{[i,j]}$ to denote the vector $\mathrm{col}(x_i, \dots, x_j)$; we use $A_{:,i}$ to denote the $i$th column of $A$, $A_{i,:}$ to denote the $i$th row of $A$, and $A_{i,j}$ to denote the $i$-row $j$-column element of $A$. We use $\mathbb{1}_n$ to denote a vector of ones of length $n$, $\mathbb{1}_{m \times n}$ to denote an $m$-by-$n$ matrix of ones, and $I_n$ to denote an $n$-by-$n$ identity matrix (abbreviated as $I$ when the dimensions can be inferred from the context). We use $\mathrm{col}(A, B)$ to denote the stacked matrix $[A^\top \; B^\top]^\top$.
II Kernel-Based Nonlinear Predictors
II-A Explicit Predictors for LTI Behavior
Consider a discrete-time linear time-invariant (LTI) system

$$x(k+1) = A x(k) + B u(k), \qquad y(k) = C x(k) + D u(k), \qquad (1)$$

where $A \in \mathbb{R}^{n \times n}$, $B \in \mathbb{R}^{n \times m}$, $C \in \mathbb{R}^{p \times n}$, $D \in \mathbb{R}^{p \times m}$, $x(k) \in \mathbb{R}^n$ is the state of the system at time $k$, $u(k) \in \mathbb{R}^m$ is the input vector, and $y(k) \in \mathbb{R}^p$ is the output vector. Recall the extended observability matrix $\mathscr{O}_\ell(A, C) := \mathrm{col}(C, CA, \dots, CA^{\ell - 1})$.
The lag $\ell$ of the system (1) is defined by the smallest integer such that the observability matrix $\mathscr{O}_\ell(A, C)$ has rank $n$, i.e., the state can be reconstructed from $\ell$ measurements. Consider $L, T \in \mathbb{N}$ with $T \ge L$ and length-$T$ input and output trajectories of (1): $u = \mathrm{col}(u_1, \dots, u_T) \in \mathbb{R}^{mT}$ and $y = \mathrm{col}(y_1, \dots, y_T) \in \mathbb{R}^{pT}$. For the inputs $u$, define the Hankel matrix of depth $L$ as

$$\mathscr{H}_L(u) := \begin{bmatrix} u_1 & u_2 & \cdots & u_{T-L+1} \\ u_2 & u_3 & \cdots & u_{T-L+2} \\ \vdots & \vdots & \ddots & \vdots \\ u_L & u_{L+1} & \cdots & u_T \end{bmatrix}. \qquad (2)$$
Accordingly, for the outputs $y$ define the Hankel matrix $\mathscr{H}_L(y)$. Consider the stacked matrix $\mathscr{H} := \mathrm{col}(\mathscr{H}_L(u), \mathscr{H}_L(y))$. By [34, Corollary 19], the restricted behavior (i.e., the set of input/output trajectories of length $L$) equals the image of $\mathscr{H}$ if and only if $\mathrm{rank}(\mathscr{H}) = mL + n$.
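As an illustration, a block-Hankel matrix of the form (2) can be built and rank-checked numerically; this is a minimal sketch in Python, and the helper name `hankel_matrix` is ours, not from the paper:

```python
import numpy as np

def hankel_matrix(w, L):
    """Depth-L block-Hankel matrix of a length-T signal w (a T x m array).

    Column j stacks the samples w[j], ..., w[j+L-1], so the columns
    enumerate all length-L windows of the recorded trajectory, as in (2).
    """
    T, m = w.shape
    cols = T - L + 1
    H = np.empty((L * m, cols))
    for j in range(cols):
        H[:, j] = w[j:j + L].reshape(-1)
    return H
```

The rank condition of the text can then be checked with `np.linalg.matrix_rank` on the stacked input/output Hankel matrix.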
This behavioral result can be leveraged for data-driven prediction and estimation as follows. Consider $T_{\rm ini}, N \in \mathbb{N}$ and $L = T_{\rm ini} + N$, as well as an input/output time series $\mathrm{col}(u^{\rm d}, y^{\rm d}) \in \mathbb{R}^{(m+p)T}$ so that $\mathrm{rank}(\mathrm{col}(\mathscr{H}_L(u^{\rm d}), \mathscr{H}_L(y^{\rm d}))) = mL + n$. Here the superscript “d” denotes data collected offline, and the rank condition is met by choosing $u^{\rm d}$ to be persistently exciting of sufficiently high order and by assuming that the system is controllable and observable. The Hankel matrix can be partitioned into
$$\begin{bmatrix} U_{\rm P} \\ U_{\rm F} \end{bmatrix} := \mathscr{H}_L(u^{\rm d}), \qquad \begin{bmatrix} Y_{\rm P} \\ Y_{\rm F} \end{bmatrix} := \mathscr{H}_L(y^{\rm d}),$$

where $U_{\rm P} \in \mathbb{R}^{m T_{\rm ini} \times (T-L+1)}$, $U_{\rm F} \in \mathbb{R}^{m N \times (T-L+1)}$, $Y_{\rm P} \in \mathbb{R}^{p T_{\rm ini} \times (T-L+1)}$, and $Y_{\rm F} \in \mathbb{R}^{p N \times (T-L+1)}$. In the sequel, the data in the partition with subscript P (for “past”) will be used to implicitly estimate the initial condition of the system, whereas the data with subscript F will be used to predict the “future” trajectories. In this case, $T_{\rm ini}$ is the length of an initial trajectory measured in the immediate past during on-line operation, and $N$ is the length of a predicted trajectory starting from the initial trajectory. The Fundamental Lemma shows that the image of $\mathscr{H}$ spans all length-$L$ trajectories [23], that is, $\mathrm{col}(u_{\rm ini}, y_{\rm ini}, u, y)$ is a trajectory of (1) if and only if there exists a vector $g \in \mathbb{R}^{T-L+1}$ so that

$$\begin{bmatrix} U_{\rm P} \\ Y_{\rm P} \\ U_{\rm F} \\ Y_{\rm F} \end{bmatrix} g = \begin{bmatrix} u_{\rm ini} \\ y_{\rm ini} \\ u \\ y \end{bmatrix}. \qquad (3)$$
The initial trajectory $\mathrm{col}(u_{\rm ini}, y_{\rm ini})$ can be thought of as setting the initial condition for the future (to be predicted) trajectory $\mathrm{col}(u, y)$. In particular, if $T_{\rm ini} \ge \ell$, for every given future input trajectory $u$, the future output trajectory $y$ is uniquely determined through (3) [35].
In this case, one can consider an explicit linear predictor as

$$y = M \, \mathrm{col}(u_{\rm ini}, y_{\rm ini}, u), \qquad (4)$$

where $M$ is a linear mapping and can be calculated from data as

$$M = Y_{\rm F} \begin{bmatrix} U_{\rm P} \\ Y_{\rm P} \\ U_{\rm F} \end{bmatrix}^{\dagger}, \qquad (5)$$

where $(\cdot)^{\dagger}$ denotes the pseudoinverse operator. Note that (5) is the solution of the following linear regression problem

$$\min_{M} \; \big\| Y_{\rm F} - M \, \mathrm{col}(U_{\rm P}, Y_{\rm P}, U_{\rm F}) \big\|_F^2, \qquad (6)$$

where every column of $\mathrm{col}(U_{\rm P}, Y_{\rm P}, U_{\rm F}, Y_{\rm F})$ can be considered as one sample trajectory of $\mathrm{col}(u_{\rm ini}, y_{\rm ini}, u, y)$.
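A minimal numerical sketch of the pseudoinverse-based predictor (4)-(5), assuming single-input single-output data partitioned as above (`linear_predictor` and `predict` are illustrative names, not from the paper):

```python
import numpy as np

def linear_predictor(Up, Yp, Uf, Yf):
    """Compute M = Yf @ pinv(col(Up, Yp, Uf)), the least-squares
    solution of the linear regression problem (6) (cf. (5))."""
    Z = np.vstack([Up, Yp, Uf])
    return Yf @ np.linalg.pinv(Z)

def predict(M, u_ini, y_ini, u):
    """Explicit linear prediction (4): y = M col(u_ini, y_ini, u)."""
    return M @ np.concatenate([u_ini, y_ini, u])
```

For noiseless LTI data with a persistently exciting input and $T_{\rm ini}$ at least equal to the lag, the fit is exact on the recorded trajectories.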
II-B Explicit Kernel-Based Nonlinear Predictors
We now consider nonlinear systems, where the future outputs are nonlinear functions of the initial trajectory and the future inputs, i.e.,

$$y = \mathrm{col}\big(f_1(z), f_2(z), \dots, f_{pN}(z)\big), \qquad z := \mathrm{col}(u_{\rm ini}, y_{\rm ini}, u). \qquad (7)$$
In what follows, we focus on how to find the nonlinear functions $f_i$ that best fit the observed nonlinear trajectories contained in $\mathrm{col}(U_{\rm P}, Y_{\rm P}, U_{\rm F}, Y_{\rm F})$.
To reconcile flexibility of the function class of the $f_i$ and well-posedness of the function estimation problem, we assume that the functions $f_i$ belong to an RKHS [38].
Definition II.1.
An RKHS over a non-empty set $\mathcal{Z}$ is a Hilbert space $\mathcal{H}$ of functions $f: \mathcal{Z} \to \mathbb{R}$ such that for each $z \in \mathcal{Z}$, the evaluation functional $f \mapsto f(z)$ is bounded.
An RKHS is associated with a positive semidefinite kernel, called reproducing kernel, which encodes the properties of the functions in this space.
Definition II.2.
A symmetric function $k: \mathcal{Z} \times \mathcal{Z} \to \mathbb{R}$ is called a positive semidefinite kernel if, for any $N \in \mathbb{N}$, any $z_1, \dots, z_N \in \mathcal{Z}$, and any $a_1, \dots, a_N \in \mathbb{R}$,

$$\sum_{i=1}^{N} \sum_{j=1}^{N} a_i a_j \, k(z_i, z_j) \ge 0.$$

The kernel section of $k$ centered at $z$ is $k_z(\cdot) := k(z, \cdot)$.
According to the Moore–Aronszajn theorem [39], an RKHS is fully characterized by its reproducing kernel. To be specific, if a function $f$ belongs to an RKHS $\mathcal{H}$ with kernel $k$, then it can be expressed as

$$f(z) = \sum_{i} a_i \, k(z_i, z) \qquad (8)$$

for some $a_i \in \mathbb{R}$ and $z_i \in \mathcal{Z}$, for all $z \in \mathcal{Z}$, possibly with infinitely many terms. Then, the inner product in the RKHS reduces to

$$\langle f, g \rangle_{\mathcal{H}} = \sum_{i} \sum_{j} a_i b_j \, k(z_i, \bar z_j),$$

where $f = \sum_i a_i k_{z_i}$ and $g = \sum_j b_j k_{\bar z_j}$, and the induced norm of $f$ to $\|f\|_{\mathcal{H}} = \sqrt{\sum_i \sum_j a_i a_j \, k(z_i, z_j)}$.
We consider now the function estimation problem for the nonlinear system in (7). Since every column in the data matrix $\mathrm{col}(U_{\rm P}, Y_{\rm P}, U_{\rm F}, Y_{\rm F})$ is a trajectory of the system and thus an input/output sample for (7), for each $i \in [pN]$, we search for the best (in a regularized least-square sense) $f_i$ in an RKHS $\mathcal{H}$ with kernel $k$

$$\min_{f_i \in \mathcal{H}} \; \sum_{j=1}^{T-L+1} \big( (Y_{\rm F})_{i,j} - f_i(z_j) \big)^2 + \lambda \, \|f_i\|_{\mathcal{H}}^2, \qquad (9)$$

where $z_j$ denotes the $j$th column of $\mathrm{col}(U_{\rm P}, Y_{\rm P}, U_{\rm F})$ and $\lambda > 0$ is a regularization parameter. Note that to simplify the forthcoming developments, the formulation in (9) does not consider temporal correlation between different indices and causality (i.e., an earlier predicted output is not affected by later inputs). One can further consider causality by restricting the arguments of each $f_i$ to those inputs that precede the corresponding output, which will lead to more complicated formulations than those in this paper, as different $f_i$ are then defined over sets with different dimensions.
The cost function in (9) includes the prediction errors and a regularization term, which is used to avoid over-fitting and penalize undesired behaviors. We note that since the functions in an RKHS can be parameterized as in (8), the representer theorem [22] ensures that the minimization problem in (9) can be solved in closed form. In our context, this leads to the following.
Lemma II.3.

Let $G \in \mathbb{S}^{T-L+1}$ be the Gram matrix with entries $G_{j,l} = k(z_j, z_l)$. Then, the minimizers $f_i^\star$ of (9) for $i \in [pN]$, stacked as in (7), yield the predictor

$$y = Y_{\rm F} (G + \lambda I)^{-1} k_Z(z), \qquad (10)$$

where $k_Z(z) := \mathrm{col}\big(k(z_1, z), \dots, k(z_{T-L+1}, z)\big)$ and $z = \mathrm{col}(u_{\rm ini}, y_{\rm ini}, u)$.
Eq. (10) serves as a data-driven nonlinear explicit predictor which can be used to predict future behaviors of the system (7) and embedded in predictive control algorithms. We note that the choice of kernel encodes the properties of the function space. Indeed, (10) shows that the synthesized functions are linear combinations of the kernel sections centered at the sampling points. Hence, one can choose the kernel based on a priori knowledge of the system. For example, if one knows that the functions $f_i$ are polynomials in the inputs and initial data, then a polynomial kernel can be used. When the functions are very high-order polynomials or take other complex forms, a Gaussian kernel can be used to span the set of continuous functions.
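A minimal sketch of such a predictor: fit the coefficient matrix of the regularized kernel regression via the Gram matrix and evaluate the learned map at a new trajectory. The Gaussian kernel, the regularization value, and the helper names below are illustrative choices, not the paper's exact settings:

```python
import numpy as np

def gaussian_kernel(Z1, Z2, sigma=1.0):
    """Pairwise Gaussian kernel between the columns of Z1 (d x M) and Z2 (d x K)."""
    d2 = (np.sum(Z1**2, 0)[:, None] + np.sum(Z2**2, 0)[None, :]
          - 2.0 * Z1.T @ Z2)
    return np.exp(-d2 / sigma**2)

def kernel_predictor(Z, Yf, lam, sigma=1.0):
    """Fit the coefficients Yf (G + lam I)^-1 of the regularized kernel
    regression (cf. (9)-(10)); returns a predictor z -> y_hat."""
    G = gaussian_kernel(Z, Z, sigma)
    A = Yf @ np.linalg.inv(G + lam * np.eye(G.shape[0]))
    return lambda z: A @ gaussian_kernel(Z, z.reshape(-1, 1), sigma).ravel()
```

Each row of the coefficient matrix parameterizes one synthesized function as a linear combination of kernel sections centered at the recorded sample trajectories.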
III DeePC with Kernel Methods
III-A Review of the DeePC algorithm
The DeePC algorithm proposed in [24] directly uses input/output data collected from the unknown LTI system to predict the future behaviour, and performs optimal and safe control without identifying a parametric system representation. More specifically, DeePC solves the following optimization problem to obtain the optimal future control inputs
$$\begin{aligned} \min_{g, u, y} \quad & c(u) + \|y - r\|_Q^2 \\ \text{s.t.} \quad & \begin{bmatrix} U_{\rm P} \\ Y_{\rm P} \\ U_{\rm F} \\ Y_{\rm F} \end{bmatrix} g = \begin{bmatrix} u_{\rm ini} \\ y_{\rm ini} \\ u \\ y \end{bmatrix}, \quad u \in \mathcal{U}, \; y \in \mathcal{Y}, \end{aligned} \qquad (11)$$

where $c(u)$ is the cost function of control actions, for instance, $c(u) = \|u\|_R^2$ or $c(u) = \|u - u_{\rm r}\|_R^2$. The sets of input/output constraints are defined as $\mathcal{U} \subseteq \mathbb{R}^{mN}$ and $\mathcal{Y} \subseteq \mathbb{R}^{pN}$. The positive definite matrix $R$ and positive semi-definite matrix $Q$ are the cost matrices. The vector $r \in \mathbb{R}^{pN}$ is a prescribed reference trajectory for the future outputs. DeePC involves solving the convex quadratic problem (11) in a receding horizon fashion, that is, after calculating the optimal control sequence $u^\star$, we apply the first $k \le N$ elements of $u^\star$ to the system for $k$ time steps, then, reinitialize the problem (11) by updating $\mathrm{col}(u_{\rm ini}, y_{\rm ini})$ to the most recent input and output measurements, and setting $r$ to the reference over the shifted horizon, to calculate the new optimal control for the next $k$ time steps. As in MPC, the control horizon $k$ is a design parameter.
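A compact sketch of one unconstrained DeePC solve in the spirit of (11): the initial-condition equalities are enforced softly through a large weight, so the whole problem reduces to a single linear least-squares solve. The soft-constraint treatment, weights, and function name are simplifications for illustration only:

```python
import numpy as np

def deepc_step(Up, Yp, Uf, Yf, u_ini, y_ini, r, R=0.1, Q=10.0, w_ini=1e4):
    """One unconstrained DeePC solve (cf. (11)) for scalar weights R, Q.

    The equality constraints Up g = u_ini and Yp g = y_ini are imposed
    softly with a large weight w_ini; the stacked system is solved in
    least squares for g, from which inputs and outputs are recovered.
    """
    A = np.vstack([w_ini * Up,
                   w_ini * Yp,
                   np.sqrt(R) * Uf,
                   np.sqrt(Q) * Yf])
    b = np.concatenate([w_ini * u_ini,
                        w_ini * y_ini,
                        np.zeros(Uf.shape[0]),
                        np.sqrt(Q) * r])
    g, *_ = np.linalg.lstsq(A, b, rcond=None)
    return Uf @ g, Yf @ g  # planned inputs and predicted outputs
```

With input/output constraints active, (11) is instead a constrained quadratic program and requires a QP solver.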
The standard DeePC algorithm in (11) assumes perfect output data generated from the unknown system (1). However, in practice, perfect data is not accessible to the controller due to measurement noise. We note that regularized DeePC algorithms can be used to provide robustness against disturbances on the data, which often amount to one-norm or two-norm regularization of $g$ in the cost function of (11) [27, 29]. In this paper, we assume no process noise, i.e., the input data is perfectly known.
Perfect Data and Noisy Data. Throughout the paper, we use $y^{\rm d}$, $Y_{\rm P}$, and $Y_{\rm F}$ to denote the perfect (noiseless) output data generated from the system, which accurately captures the system dynamics; we use $\hat y^{\rm d}$, $\hat Y_{\rm P}$, and $\hat Y_{\rm F}$ to denote the corresponding noisy output data. We use $G$ to denote the Gram matrix calculated from perfect output data, and $\hat G$ to denote the Gram matrix calculated from noisy output data.
III-B RoKDeePC algorithm
The DeePC algorithm above implicitly incorporates a linear predictor. In this section, we develop a RoKDeePC algorithm using an implicit and robustified version of the kernel-based nonlinear predictor (10) to perform data-driven optimal control for nonlinear systems. We first consider the following certainty-equivalence MPC problem using the kernel predictor in (10) (referred to as kernel-based MPC in this paper)
$$\min_{u \in \mathcal{U}, \; y \in \mathcal{Y}} \; c(u) + \|y - r\| \quad \text{s.t.} \;\; y = Y_{\rm F} (G + \lambda I)^{-1} k_Z\big(\mathrm{col}(u_{\rm ini}, y_{\rm ini}, u)\big). \qquad (12)$$
In what follows, we assume that when perfect data is available, the predictor (10) can faithfully predict the future behavior of the nonlinear dynamics (7).
Assumption 1.
When perfect data is available in the kernel-based nonlinear predictor (10), the prediction error is bounded, i.e.,

$$\|\hat y - y\| \le \varepsilon, \qquad (13)$$

where $\hat y$ is the predicted output trajectory based on (10), $u$ is a control sequence applied to the system, $y$ is the realized output trajectory of the system in response to $u$, and $\varepsilon \ge 0$ is the error bound.
Assumption 1 implicitly requires that sufficiently rich data is used to capture the system dynamics [40]. Moreover, one can possibly use more data to obtain a more accurate prediction and thus reduce $\varepsilon$. Since perfect data is assumed, this assumption can generally be satisfied if the unknown dynamics are contained in the RKHS of the chosen kernel $k$, e.g., if the functions $f_i$ are polynomial and a sufficiently high-order polynomial kernel is used. Otherwise, this assumption can be met by using kernels with the universal approximation property (e.g., a Gaussian kernel) and by implicitly assuming that each $f_i$ is continuous and that $\mathrm{col}(u_{\rm ini}, y_{\rm ini}, u)$ is contained in a compact set [41].
When noisy data is used, the prediction error may grow unbounded, and the obtained control sequence from (12) may not result in a robust closed loop. As a first step toward addressing this problem, we rewrite (10) implicitly as
$$(G + \lambda I) \, g = k_Z\big(\mathrm{col}(u_{\rm ini}, y_{\rm ini}, u)\big), \qquad y = Y_{\rm F} \, g, \qquad (14)$$

where $g$ is uniquely determined by $\mathrm{col}(u_{\rm ini}, y_{\rm ini}, u)$ since $G + \lambda I$ has full rank, and $y = Y_{\rm F} g$ predicts the future behavior of the system. We note that, given $\mathrm{col}(u_{\rm ini}, y_{\rm ini}, u)$, there exists a unique $g$ that satisfies (14). Unlike linear systems, however, given an arbitrary $g$, there may not exist a vector $\mathrm{col}(u_{\rm ini}, y_{\rm ini}, u)$ that satisfies (14); for linear systems such a vector exists, thanks to the Fundamental Lemma.
$$\min_{u \in \mathcal{U}, \; g} \; c(u) + \|Y_{\rm F} \, g - r\| \quad \text{s.t.} \;\; (G + \lambda I) g = k_Z\big(\mathrm{col}(u_{\rm ini}, y_{\rm ini}, u)\big), \;\; Y_{\rm F} \, g \in \mathcal{Y}. \qquad (15)$$
The equality constraint (14) assumes perfect data to obtain a satisfactory prediction. When perfect data is not available, one may need to relax this equality constraint and ensure feasibility, for instance, by penalizing its violation in the cost function (i.e., soft constraints)
$$\min_{u \in \mathcal{U}, \; g: \, Y_{\rm F} g \in \mathcal{Y}} \; c(u) + \|Y_{\rm F} \, g - r\| + \lambda_g \big\| (G + \lambda I) g - k_Z\big(\mathrm{col}(u_{\rm ini}, y_{\rm ini}, u)\big) \big\|, \qquad (16)$$

where $\lambda_g > 0$ is a penalty parameter. We note that with a sufficiently large $\lambda_g$, the solution to (16) is equal to the solution to (15) [42]. However, a robust solution may still not be obtained when noisy output data is used. Hence, we further robustify (16) and propose the following RoKDeePC formulation that leverages noisy output data for nonlinear, robust, and optimal control
$$\min_{u \in \mathcal{U}, \; g} \; \max_{\Delta_G \in \mathcal{D}_G, \; \Delta_Y \in \mathcal{D}_Y} \; c(u) + \big\| (\hat Y_{\rm F} + \Delta_Y) g - r \big\| + \lambda_g \big\| (\hat G + \Delta_G + \lambda I) g - k_Z\big(\mathrm{col}(u_{\rm ini}, y_{\rm ini}, u)\big) \big\| \quad \text{s.t.} \;\; (\hat Y_{\rm F} + \Delta_Y) g \in \mathcal{Y} \;\; \forall \Delta_Y \in \mathcal{D}_Y, \qquad (17)$$

where $\Delta_G$ and $\Delta_Y$ are disturbances added to the data matrices, which respectively reside in prescribed disturbance sets $\mathcal{D}_G$ and $\mathcal{D}_Y$, and the constraint $(\hat Y_{\rm F} + \Delta_Y) g \in \mathcal{Y}$ for all $\Delta_Y \in \mathcal{D}_Y$ robustifies the output constraints.
The above min-max formulation considers the worst-case scenario of the uncertainties added to the data matrices, thereby leading to a robust solution. We will later show that performance guarantees can also be provided by employing this formulation. Note that min-max optimization problems can in general be difficult to solve; however, (17) admits a tractable reformulation as a minimization problem if appropriate disturbance sets $\mathcal{D}_G$ and $\mathcal{D}_Y$ are considered. The following result shows that the inner max problem of (17) can be resolved analytically for such disturbance sets, resulting in regularized problem formulations.
Lemma III.1.
Consider $\mathcal{D}_G = \{\Delta_G : \|\Delta_G\| \le \rho_g\}$ and $\mathcal{D}_Y = \{\Delta_Y : \|\Delta_Y\| \le \rho_y\}$. Given a vector $g$ and a vector $b$ of appropriate dimensions, it holds that

$$\max_{\Delta_G \in \mathcal{D}_G} \; \big\| b - (\hat G + \Delta_G + \lambda I) g \big\| = \big\| b - (\hat G + \lambda I) g \big\| + \rho_g \|g\|, \qquad (18)$$

$$\max_{\Delta_Y \in \mathcal{D}_Y} \; \big\| b - (\hat Y_{\rm F} + \Delta_Y) g \big\| = \big\| b - \hat Y_{\rm F} \, g \big\| + \rho_y \|g\|. \qquad (19)$$
Proof.
The proof is adapted from [43, Theorem 3.1]. Consider a fixed $g$ and a fixed $b$ in (18). It follows from the triangle inequality that

$$\big\| b - (\hat G + \Delta_G + \lambda I) g \big\| \le \big\| b - (\hat G + \lambda I) g \big\| + \|\Delta_G g\| \le \big\| b - (\hat G + \lambda I) g \big\| + \rho_g \|g\|.$$

Choose $\Delta_G^\star$ such that

$$\Delta_G^\star = -\rho_g \frac{w g^\top}{\|w\| \|g\|}, \qquad w := b - (\hat G + \lambda I) g.$$

Since $\Delta_G^\star$ is a rank-one matrix, we have $\|\Delta_G^\star\| = \rho_g$, and

$$\big\| b - (\hat G + \Delta_G^\star + \lambda I) g \big\| = \Big\| w + \rho_g \frac{\|g\|}{\|w\|} w \Big\| = \|w\| + \rho_g \|g\|,$$

where the last equality follows from the triangle inequality and the fact that $-\Delta_G^\star g$ is a nonnegative scalar multiple of $w$. Since this lower bound coincides with the upper bound on the maximization problem (18) for any fixed $g$ and $b$, the equality in (18) holds. By following a similar process, one can prove the equality in (19). ∎
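The worst-case identity underlying the lemma can be checked numerically: for any perturbation with spectral norm at most $\rho$, the residual never exceeds the nominal residual plus $\rho \|g\|$, and a rank-one perturbation aligned with the nominal residual attains the bound. The function names below are ours:

```python
import numpy as np

def robust_residual_bound(A, g, b, rho):
    """Value of max_{||dA|| <= rho} ||b - (A + dA) g||: by the lemma,
    it equals the nominal residual plus rho * ||g||."""
    return np.linalg.norm(b - A @ g) + rho * np.linalg.norm(g)

def worst_case_perturbation(A, g, b, rho):
    """Rank-one perturbation achieving the bound (cf. the proof):
    dA = -rho * w g^T / (||w|| ||g||) with w the nominal residual."""
    w = b - A @ g
    return -rho * np.outer(w, g) / (np.linalg.norm(w) * np.linalg.norm(g))
```

Sampling random perturbations inside the norm ball and comparing against the bound reproduces the statement of the lemma.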
Corollary III.2 (A tractable formulation for RoKDeePC).
The above result shows that one can obtain robust control inputs by simply solving a regularized minimization problem. The proposed RoKDeePC algorithm involves solving (20) in a receding horizon manner, similar to the DeePC algorithm. The initial trajectory $\mathrm{col}(u_{\rm ini}, y_{\rm ini})$ should be updated every time before solving (20). We remark that $\mathcal{D}_G$ and $\mathcal{D}_Y$ are unstructured uncertainty sets, which may lead to conservativeness. For instance, $\hat G$ is a symmetric matrix, but $\mathcal{D}_G$ does not restrict $\Delta_G$ to be symmetric; $\hat Y_{\rm F}$ may be a Hankel data matrix, but a Hankel structure cannot be imposed on $\Delta_Y$ by considering $\mathcal{D}_Y$. As will be discussed in the simulation section, though the considered unstructured sets are conservative, they generally lead to satisfactory performance.
III-C Performance guarantees of RoKDeePC
In what follows, we show that the proposed RoKDeePC algorithm provides deterministic performance guarantees when applying a feasible optimal control sequence to the system, thanks to the min-max formulation in (17). We begin by making the following assumption.
Assumption 2.
For any $\mathrm{col}(u_{\rm ini}, y_{\rm ini}, u)$ (contained in a compact set), there exists $\Delta_G \in \mathcal{D}_G$ such that $\hat G + \Delta_G = G$, and there exists $\Delta_Y \in \mathcal{D}_Y$ such that $\hat Y_{\rm F} + \Delta_Y = Y_{\rm F}$.
Assumption 2 indicates that sufficiently large uncertainty sets $\mathcal{D}_G$ and $\mathcal{D}_Y$ are used to cover the deviations of the data matrices induced by noise. This is necessary for (17) to provide a robust solution against the realized uncertainty in practice. Since $\mathrm{col}(u_{\rm ini}, y_{\rm ini}, u)$ is contained in a compact set and $k$ is continuous (by choosing a continuous kernel), Assumption 2 can be easily satisfied when the output noise is bounded and sufficiently large disturbance sets $\mathcal{D}_G$ and $\mathcal{D}_Y$ are considered, which corresponds to choosing sufficiently large regularization parameters $\rho_g$ and $\rho_y$ in (20). We define the realized cost of the system as $c(u) + \|y - r\|$, where $y$ is the realized output trajectory in response to $u$ (and $\varepsilon$ is defined in Assumption 1). The following result shows that the realized cost is upper-bounded by the optimization cost in (21) plus the prediction error in (13).
Theorem III.3.
Proof.
According to the definition of , if Assumption 2 holds, we have
(23) |
where $u^\star$ minimizes (17) in which noisy data is used. Given the vector $u^\star$, it follows from (14) that there exists a $g$ such that
(24) |
where is the predicted output trajectory from the kernel-based nonlinear predictor. By defining , we have
(25) |
which implies that
(26) |
By substituting (25) and (26) into (23) we obtain
(27) |
where the second and the fourth inequalities hold thanks to the reverse triangle inequality, and the third inequality is satisfied if we take $\lambda_g$ large enough (by assumption) to ensure
Hence, the above condition holds, and the fifth inequality follows from Assumption 1 and the definition of the realized cost. This completes the proof. ∎
The above result links the optimization cost to the realized cost, namely, the realized cost is always bounded by the optimization cost plus the prediction error even when noisy data is used. This also justifies the design of the cost function of the RoKDeePC in (20), and shows that one should try to find not only a feasible solution but ideally a minimizer (or better the global minimizer) to reduce the optimization cost and possibly the realized cost.
IV Gradient Descent Methods to Solve RoKDeePC
The RoKDeePC problem in (20) is nonconvex in $u$ unless the kernel function is linear in $u$. One may use a generic nonlinear optimization toolbox (e.g., IPOPT) to solve this problem. However, solving a nonconvex problem is in general time-consuming, especially considering that the decision variable $g$ in (20) can be high-dimensional. We propose customized gradient descent methods exploiting the structure of (20) to solve the RoKDeePC problem more efficiently. Although a global minimum may still not be reached, the numerical study in Section V suggests that these methods generally lead to satisfactory performance with a reasonable running time.
IV-A Gradient descent algorithm for RoKDeePC
Notice that the optimization problem in (20) is nonconvex in $u$ but convex in $g$. Moreover, the dimension of $g$ is generally much higher than that of $u$. Due to this favorable structure, we apply a gradient descent method to update $u$, while at each iteration, we solve a convex optimization problem to update $g$. Note that the input/output constraints can be handled using projected gradient algorithms. We summarize the algorithm for solving RoKDeePC below.
Input: cost function

$$h(u, g) := c(u) + \|\hat Y_{\rm F} \, g - r\| + \rho_y \|g\| + \lambda_g \Big( \big\| (\hat G + \lambda I) g - k_Z\big(\mathrm{col}(u_{\rm ini}, y_{\rm ini}, u)\big) \big\| + \rho_g \|g\| \Big); \qquad (28)$$

initial vectors $u_0$, $g_0$; constraint sets $\mathcal{U}$, $\mathcal{Y}$; step size $\theta$; maximal iteration number $j_{\max}$; convergence threshold $\delta$.
Initialization: $j = 0$.
Iterate until $\|u_{j+1} - u_j\| \le \delta$ or $j = j_{\max}$:
1. $\bar u \leftarrow u_j - \theta \, \nabla_u h(u_j, g_j)$;
2. $u_{j+1} \leftarrow \mathrm{Proj}_{\mathcal{U}}(\bar u)$;
3. $g_{j+1} \leftarrow \arg\min_{g: \, \hat Y_{\rm F} g \in \mathcal{Y}} \; h(u_{j+1}, g)$, then $j \leftarrow j + 1$.

Output: (sub)optimal control sequence $u^\star = u_j$.
In Step 2, the projection of $\bar u$ into the set $\mathcal{U}$ admits a closed-form solution if upper and lower bounds for the elements in $u$ are considered. Step 3 requires solving a second-order cone program (SOCP), which could make the algorithm time-consuming as one may need to solve the SOCP up to $j_{\max}$ times. To reduce the computational burden, in what follows, we consider an equivalent cost function which is quadratic in $g$ and thus admits a closed-form solution for Step 3 when the output constraints are inactive.
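The alternating scheme can be sketched generically as follows, with a finite-difference gradient standing in for $\nabla_u h$ and a user-supplied convex solver for the $g$-update; all names, defaults, and the finite-difference choice are illustrative, not the paper's implementation:

```python
import numpy as np

def rokdeepc_pgd(h, solve_g, proj_u, u0, g0, theta=1e-2, max_iter=200, tol=1e-6):
    """Projected-gradient sketch of Algorithm 1.

    h(u, g): scalar cost; solve_g(u): convex update of g for fixed u;
    proj_u(u): projection onto the input constraint set.
    """
    u, g = u0.copy(), g0.copy()
    for _ in range(max_iter):
        # Step 1: central finite-difference gradient of h w.r.t. u
        grad = np.array([(h(u + 1e-6 * e, g) - h(u - 1e-6 * e, g)) / 2e-6
                         for e in np.eye(len(u))])
        # Step 2: gradient step followed by projection onto U
        u_new = proj_u(u - theta * grad)
        # Step 3: convex update of g for the new input sequence
        g = solve_g(u_new)
        if np.linalg.norm(u_new - u) <= tol:
            u = u_new
            break
        u = u_new
    return u, g
```

In practice one would use the analytic gradient of the cost and, for the quadratic reformulation below, a closed-form `solve_g` when the output constraints are inactive.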
IV-B Quadratic reformulation
To enable a fast calculation of Step 3, we consider the following cost function that is quadratic in $g$
(29) |
The following result shows the connection between the cost functions in (28) and (29).
Proposition IV.1.
For any $u$, if $g^\star$ is a minimizer of
(30) |
then $g^\star$ also minimizes
(31) |
with
(32) |
(33) |
Proof.
See Appendix A. ∎
The above result shows that using the quadratic cost function in (29) in the RoKDeePC is equivalent to setting different parameters ($\rho_g$, $\rho_y$, and $\lambda_g$) in the problem in (20) every time it is solved during the receding horizon implementation, as the equivalent values of $\rho_g$, $\rho_y$, and $\lambda_g$ are related to the minimizer $g^\star$. This may lead to conservativeness, as we need to choose a robustly large pair of quadratic weights. However, we generally observe excellent performance when using (29) as the cost function for RoKDeePC. During the online implementation, after solving (30), one can compute $\rho_g$, $\rho_y$, and $\lambda_g$ (by fixing a desired value of one of them and then computing the others) according to (32) and (33). If the obtained $\rho_g$, $\rho_y$, and $\lambda_g$ are not sufficiently large to satisfy Assumption 2 and Theorem III.3, then the quadratic weights in (29) should be increased to implicitly increase $\rho_g$, $\rho_y$, and $\lambda_g$.
Moreover, when the output constraints are inactive, (30) admits a closed-form solution as
(34) |
where the matrices involved depend only on the recorded data and the chosen weights. Notice that (34) requires simply a linear mapping to obtain $g$, which can greatly accelerate the gradient descent algorithm when the output constraints are inactive. In summary, we consider the following modified RoKDeePC optimization problem
(35) |
which is related to the original formulation in (20) via Proposition IV.1, and we implement the following modifications to Algorithm 1 to solve (35):
• Change the cost function from (28) to the quadratic cost in (29);
• Change Step 3 in the iteration to: compute $g_{j+1}$ from the closed-form solution (34); if $\hat Y_{\rm F} \, g_{j+1} \notin \mathcal{Y}$, update $g_{j+1}$ by solving (30) instead.
Step 3 becomes easier to solve if the output constraint set is a nonempty box constraint (i.e., upper and lower bounds for the outputs are considered), as the constraint set can be formulated as $\mathcal{Y} = \{y : y_{\min} \le y \le y_{\max}\}$ for some $y_{\min}$ and $y_{\max}$ [44, 29]. The computation cost can be further reduced by assuming that $g$ is bounded and (conservatively) restricting $g$ to a polyhedron. In our applications, we do not find these assumptions to be limiting.
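For box constraints, the Euclidean projection mentioned above reduces to an elementwise clip, e.g.:

```python
import numpy as np

def project_box(v, lo, hi):
    """Euclidean projection onto the box {v : lo <= v <= hi}:
    the closed-form solution is an elementwise clip."""
    return np.clip(v, lo, hi)
```

This is the closed-form projection used in Step 2 when upper and lower bounds on the inputs are considered.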
We remark that one can also use a kernel function that is nonlinear in $\mathrm{col}(u_{\rm ini}, y_{\rm ini})$ but linear in $u$, such that (29) becomes a convex quadratic program, which can be more efficiently solved using standard solvers. For instance, one may consider
(36) |
Note that (36) is a combination of two positive semidefinite kernels and thus also satisfies the positive semidefinite property. If the future outputs of the system are linear in $u$, such a kernel can be used to accurately capture the system's dynamics. Otherwise, the obtained kernel-based predictor is the linear-input model that best fits the observed data, which may not lead to an accurate prediction.
V Simulation Results
In this section, we test the RoKDeePC algorithm on two example nonlinear systems to illustrate its effectiveness. We will compare the performance when different kernel functions are used, and we will compare the RoKDeePC algorithm with the traditional DeePC algorithm, the kernel-based MPC in (12), and the Koopman-based MPC in [17].
V-A Example 1: a nonlinear second-order model
We start with an academic case study to compare different predictors. Consider the following discrete-time nonlinear model of a polynomial single-input-single-output system
(37) |
To excite the system and collect input and output data, we inject white noise (mean: 0; variance: 0.01) into the input channel and collect an input/output trajectory, which will be used in the different predictors below. The observed data may also be subject to additive white measurement noise.
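The excitation procedure can be sketched as follows, with a placeholder `step` function standing in for the true (unknown) dynamics (37); the default noise level matches the stated variance of 0.01, and all names are illustrative:

```python
import numpy as np

def collect_trajectory(step, T, sigma=0.1, seed=0):
    """Excite an unknown single-input single-output system with white
    noise (std sigma, i.e. variance sigma**2) and record the data.

    `step(state, u)` is a placeholder for the true dynamics: it returns
    the next state and the current output.
    """
    rng = np.random.default_rng(seed)
    u = sigma * rng.standard_normal(T)   # white-noise excitation
    y = np.zeros(T)
    state = np.zeros(2)                  # assumed second-order state
    for k in range(T):
        state, y[k] = step(state, u[k])
    return u, y
```

The recorded pair `(u, y)` is then arranged into the Hankel/data matrices used by the predictors.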
We first test the prediction quality of different predictors, including: i) the linear predictor in (4), ii) the kernel predictor in (10), and iii) the Koopman-based predictor in [17]. We consider three popular kernels for the kernel predictor: 1) the polynomial kernel $k(z_1, z_2) = (z_1^\top z_2 + c)^d$; 2) the Gaussian kernel $k(z_1, z_2) = e^{-\|z_1 - z_2\|^2 / \sigma^2}$; 3) the exponential kernel $k(z_1, z_2) = e^{-\|z_1 - z_2\| / \sigma}$. In this section, the parameters in these kernel functions (and those in (4) and (10)) are chosen to ensure that the kernel predictors have satisfactory performance when noiseless data is available. For instance, the Gaussian kernel has only one parameter $\sigma$ to tune, and thus an appropriate value can be easily found with a line search.
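The three kernels can be written down in their standard forms; the parameter values `c`, `d`, and `sigma` below are illustrative defaults, not the tuned values used in the experiments:

```python
import numpy as np

def polynomial_kernel(z1, z2, c=1.0, d=3):
    """Polynomial kernel (z1' z2 + c)^d."""
    return (z1 @ z2 + c) ** d

def gaussian_kernel(z1, z2, sigma=1.0):
    """Gaussian kernel exp(-||z1 - z2||^2 / sigma^2)."""
    return np.exp(-np.linalg.norm(z1 - z2) ** 2 / sigma ** 2)

def exponential_kernel(z1, z2, sigma=1.0):
    """Exponential kernel exp(-||z1 - z2|| / sigma)."""
    return np.exp(-np.linalg.norm(z1 - z2) / sigma)
```

All three are symmetric positive semidefinite kernels, so any of them can be plugged into the Gram matrix of the kernel predictor.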
For the Koopman-based predictor, one can consider the recent outputs as the state variables of the above system. Similar to [17], we consider the lifting functions in the form of
(38) |
where the lifting is in general nonlinear in the state but the resulting predictor is linear in the lifted state. The lifting functions are chosen to be the linear and quadratic terms of the state variables and 10 thin plate spline radial basis functions (defined by $\psi(x) = \|x - c\|^2 \log \|x - c\|$, whose center is $c$) with centers selected randomly. Under this setting, the input/output data can be used in the extended dynamic mode decomposition (EDMD) [10] to solve for the parametric system representation (Koopman-based predictor). The lifting functions in (38) lead to a quadratic program for the MPC formulation, which can be solved efficiently in practice.
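A sketch of such a lifting: linear and quadratic monomials of the state plus one thin plate spline feature per center (the center locations, counts, and helper names are illustrative):

```python
import numpy as np

def thin_plate_spline(x, c):
    """Thin plate spline radial basis function psi(x) = r^2 log(r),
    with r = ||x - c|| and psi(c) = 0 by convention."""
    r = np.linalg.norm(x - c)
    return 0.0 if r == 0.0 else r ** 2 * np.log(r)

def lift(x, centers):
    """Lifting in the spirit of (38): the state, its quadratic
    monomials, and one thin plate spline feature per center."""
    quad = np.outer(x, x)[np.triu_indices(len(x))]
    tps = np.array([thin_plate_spline(x, c) for c in centers])
    return np.concatenate([x, quad, tps])
```

EDMD then fits a linear map on pairs of consecutive lifted states (together with the input) by least squares.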
Table I: Open-loop prediction errors of the different predictors.

| Prediction Error | Without measurement noise | With white measurement noise |
| --- | --- | --- |
| Linear predictor | 0.2378 | 0.2384 |
| Koopman predictor (linear in $u$) | 0.1927 | 0.2175 |
| Kernel predictor (polynomial kernel) | 0.0008 | 0.0486 |
| Kernel predictor (Gaussian kernel) | 0.0051 | 0.0463 |
| Kernel predictor (exponential kernel) | 0.0030 | 0.0456 |
Table II: Mean and standard deviation of the realized closed-loop costs under increasing measurement-noise levels.

| Noise level | Realized costs | RoKDeePC (Gaussian) | Kernel-based MPC (Gaussian) | RoKDeePC (exponential) | Kernel-based MPC (exponential) | RoKDeePC (polynomial) | Kernel-based MPC (polynomial) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| low variance | mean | 8.92 | 8.51 | 6.07 | 6.16 | 7.31 | 7.36 |
| | standard deviation | 4.17 | 3.72 | 1.97 | 2.01 | 3.51 | 3.25 |
| medium variance | mean | 25.06 | 28.19 | 25.11 | 29.37 | 34.94 | 40.77 |
| | standard deviation | 17.99 | 19.24 | 15.31 | 17.66 | 24.97 | 31.43 |
| high variance | mean | 30.76 | 37.54 (+22%) | 31.89 | 40.71 (+28%) | 46.76 | 59.36 (+27%) |
| | standard deviation | 24.29 | 28.06 | 22.15 | 27.03 | 36.32 | 50.50 |
Table I compares the open-loop predictions of 50 time steps obtained by employing different predictors. For the linear predictor and the kernel predictor (whose prediction horizon is shorter than 50 steps), the remaining time steps are predicted based on the previous predictions. It can be seen that the linear predictor cannot make an accurate prediction due to the strong nonlinearity of the system. The Koopman-based predictor also cannot make an accurate prediction because (38) assumes independence between the state-dependent lifting and the input, which is not the case in (37). Moreover, we observe similar prediction errors even with more data in the EDMD (data not shown). The structure in (38) enables fast calculations of the associated MPC problem, but restricts the applicability to some classes of nonlinear systems. One can also assume a more general form of lifting functions to improve the prediction accuracy, but this is out of the scope of this paper.
By comparison, all kernel predictors can predict the system's behaviors with satisfactory accuracy. Moreover, when there is no measurement noise, the predictor with the polynomial kernel almost perfectly predicts the system's behaviors, as the future outputs are indeed polynomial functions of the initial data and future inputs due to the bilinear structure in (37).

We next compare the time-domain performance of different control methods: 1) DeePC with quadratic regularization [28]; 2) Koopman-based MPC [17]; 3) the nominal (i.e., non-robustified) kernel-based MPC in (12) (with the cost function modified to be ); 4) the RoKDeePC in (35). We consider the following input and output costs in all these methods: , where is the output reference over time. This cost function penalizes the control inputs, the tracking errors, and the rates of change of the control inputs. The other parameters of the RoKDeePC are: , . The control horizon is in all the methods. We assume no input/output constraints and focus on comparing the tracking performance; hence, no projection is needed in the gradient descent algorithm.
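The cost structure described here, penalizing the inputs, the tracking errors, and the input rates of change, can be written as a small helper. The weights Q, R, S below are placeholder values, not the ones used in the paper.

```python
import numpy as np

def control_cost(u, y, r, Q=100.0, R=0.1, S=1.0, u_prev=0.0):
    """Cost of the form sum_k ( Q*(y_k - r_k)^2 + R*u_k^2
    + S*(u_k - u_{k-1})^2 ) over a horizon, for scalar signals.
    Q, R, S are illustrative weights, not the paper's values."""
    u = np.asarray(u, float)
    y = np.asarray(y, float)
    r = np.asarray(r, float)
    du = np.diff(np.r_[u_prev, u])   # rates of change of the control input
    return Q * np.sum((y - r)**2) + R * np.sum(u**2) + S * np.sum(du**2)
```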
Fig. 1 shows the time-domain responses of the system when different control methods are applied, where the output reference steps from to at , and then to at . The sampling time of the discrete-time system is 2ms. It can be seen that when DeePC or Koopman-based MPC is applied, the output cannot accurately track the reference because the linear predictor and the Koopman predictor (linear in ) cannot make an accurate prediction (Table I). By comparison, the RoKDeePC and the kernel-based MPC both achieve satisfactory performance even under measurement noise.
We remark that with the gradient descent algorithm in Section IV, the RoKDeePC has a reasonable running time. On an Intel Core i7-9750H CPU with 16GB RAM, it requires approximately 0.2s to solve (35) (with any of the considered kernels) at each time step in the above simulations. The kernel-based MPC requires a similar time to solve (12) using a similar gradient descent algorithm. By comparison, the Koopman-based MPC and DeePC only require solving quadratic programs, which take about 2ms and 10ms respectively using OSQP [45] as the solver; however, their performance is vastly inferior.
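The solver described in Section IV can be sketched as a plain (projected) gradient descent loop; when there are no input/output constraints, as in these simulations, the projection step is simply skipped. The step size, iteration budget, and toy quadratic objective below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def gradient_descent(grad, g0, step=1e-2, iters=500, project=None, tol=1e-8):
    """(Projected) gradient descent sketch. `grad` returns the gradient of
    the (possibly nonconvex) objective; `project` maps onto the constraint
    set and is omitted when constraints are inactive."""
    g = np.asarray(g0, float)
    for _ in range(iters):
        g_new = g - step * grad(g)
        if project is not None:
            g_new = project(g_new)
        if np.linalg.norm(g_new - g) < tol:   # stop when iterates stall
            return g_new
        g = g_new
    return g

# Toy usage: minimize ||A g - b||^2, whose gradient is 2 A^T (A g - b)
A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([2.0, 3.0])
g_star = gradient_descent(lambda g: 2.0 * A.T @ (A @ g - b),
                          np.zeros(2), step=0.1, iters=2000)
```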
To compare the robustness of the RoKDeePC and the kernel-based MPC, the above simulations were repeated 100 times with different datasets to construct the Hankel matrices and different random seeds to generate the measurement noise. Table II displays the mean and standard deviation of the realized closed-loop costs, i.e., where and are measured from the system in the simulations. It shows that as the measurement noise level increases, the RoKDeePC achieves superior performance to the (certainty-equivalence) kernel-based MPC. For instance, when the variance of the measurement noise is , the average closed-loop costs of the RoKDeePC are respectively (Gaussian kernel), (Exponential kernel), and (Polynomial kernel) lower than those of the kernel-based MPC. We attribute this performance gap to the robustness of the RoKDeePC: the control sequence obtained from RoKDeePC is robust against the uncertainties in the data, whereas the kernel-based MPC implicitly assumes certainty equivalence and may not yield a robust solution. Moreover, we observe that the Gaussian and Exponential kernels outperform the Polynomial kernel under noisy measurements, even though the future behaviors of the system can be described by polynomial functions.
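The Monte Carlo protocol behind Table II (repeated closed-loop runs with fresh data and noise realizations, followed by the mean and standard deviation of the realized cost) can be expressed as a small harness. Here `run_closed_loop` is a hypothetical interface standing in for the full closed-loop simulation; it is not part of the paper.

```python
import numpy as np

def realized_cost_stats(run_closed_loop, n_runs=100, seed=0):
    """Repeat the closed-loop simulation n_runs times, each with a fresh
    random generator for dataset/noise, and return the mean and (sample)
    standard deviation of the realized cost, as reported in Table II."""
    rng = np.random.default_rng(seed)
    costs = np.array([run_closed_loop(rng) for _ in range(n_runs)])
    return costs.mean(), costs.std(ddof=1)
```

For example, `realized_cost_stats(my_simulation, n_runs=100)` would reproduce one row pair of Table II for a given controller and noise level.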
V-B Example 2: A grid-forming converter with nonlinear load
We now test the performance of the RoKDeePC in a grid-forming converter feeding a local nonlinear load, as shown in Fig. 2. The grid-forming converter employs virtual synchronous machine control for synchronization, which emulates the swing equation of synchronous generators to generate the frequency [46, 47, 48]. The converter is connected to a local nonlinear load, whose active and reactive power consumptions are (in per-unit values)
where is the terminal voltage magnitude of the load, and is the voltage deviation from its nominal value . The other system parameters are: , , , , , , and .

We apply the RoKDeePC algorithm to perform optimal voltage (or reactive power) control in the converter. As shown in Fig. 2, the input of the converter system is the internal voltage magnitude of the converter (i.e., ), and the output can either be the reactive power or the terminal voltage deviation of the load . Since the converter system is more complex than (37), we choose , , , and a control horizon of . The other settings of the RoKDeePC are the same as those in the previous subsection (cost function, excitation signals, etc.). We again assume no input/output constraints and focus on comparing the tracking performance. For the Koopman-based MPC, we consider as the state variables of the system, and the choices of the lifting functions are the same as those in the previous subsection.

Fig. 3 (a) shows the reactive power responses of the converter in grid-connected mode (i.e., the breaker in Fig. 2 is closed), where we choose the reactive power as the output of the system. The sampling time of the system is 2ms. The reference for steps from to at , and then to at . It can be seen that the converter has satisfactory performance when the RoKDeePC is applied (with Gaussian, Exponential, or Polynomial kernel). However, when the Koopman-based MPC or DeePC is applied, there are severe tracking errors; with the former we also observe oscillations that persist even with more data to solve the EDMD problem and a longer prediction horizon (e.g., and ). As in Section V-A, this is due to the strong nonlinearity of the system and the special structure of the lifting functions in (38), which are linear in . With the kernel-based MPC in (12) (with Gaussian, Exponential, or Polynomial kernel), the system becomes unstable, and thus the time-domain responses are not displayed in Fig. 3.
Although the RoKDeePC (with Gaussian, Exponential, or Polynomial kernel) has superior performance, it requires solving a nonlinear program in real time. In this instance, it took approximately 0.6s to solve (35). By comparison, the Koopman-based MPC and DeePC require about 2ms and 80ms respectively using OSQP as the solver.
To enable a real-time implementation of the RoKDeePC, we further consider a kernel that is nonlinear in but linear in , as shown in (36), where . With this kernel, (35) becomes a quadratic program, which admits a closed-form solution when the input/output constraints are inactive. Fig. 3 (a) also plots the reactive power response of the converter when this kernel is used, which shows better performance than the Koopman-based MPC and DeePC (with , the tracking error is less than ). Moreover, it only requires about 10ms to calculate the closed-form solution at each time step, and thus enables a possible real-time implementation with a control horizon .
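When the predictor is linear in the input sequence, the unconstrained problem reduces to a quadratic program with a closed-form minimizer, which is what makes the roughly 10ms-per-step implementation possible. The sketch below uses generic matrices (Gamma, phi, Q, R) for a predictor of the form y_pred = phi + Gamma u; this is a generic illustration, not the paper's exact formulation in (36).

```python
import numpy as np

def closed_form_input(Gamma, phi, r, Q, R):
    """Minimizer of (y_pred - r)^T Q (y_pred - r) + u^T R u with
    y_pred = phi + Gamma u (no active constraints):
        u* = (Gamma^T Q Gamma + R)^{-1} Gamma^T Q (r - phi)."""
    H = Gamma.T @ Q @ Gamma + R
    return np.linalg.solve(H, Gamma.T @ Q @ (r - phi))

# Toy usage: two predicted outputs driven by one input
Gamma = np.array([[1.0], [0.5]])
phi = np.zeros(2)
Q = np.eye(2)
R = np.zeros((1, 1))   # no input penalty, for this toy check
u = closed_form_input(Gamma, phi, np.array([1.0, 0.5]), Q, R)
```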
We next test the control performance when the converter is in an islanded mode (i.e., the breaker in Fig. 2 is open). We choose the voltage deviation to be the output of the system, that is, the RoKDeePC aims at regulating the voltage deviation of the load. Since the voltage dynamics are in general fast, we choose the sampling time to be 0.5ms. Fig. 3 (b) shows the time-domain responses of the voltage deviation when its reference steps from to at , and then to at . We observe that the RoKDeePC (with Gaussian, Exponential, or Polynomial kernel) achieves the best performance, whereas the Koopman-based MPC and DeePC lead to substantial tracking error. The tracking error is about when the kernel in (36) is used in the RoKDeePC to enable faster calculations.
VI Conclusions
We presented RoKDeePC, an algorithm that performs data-driven, robust, and optimal control for general nonlinear systems by implicitly predicting the future behaviors of the system via the representer theorem. RoKDeePC solves a data-to-control optimization problem in a receding-horizon manner and does not require a time-consuming offline learning step. Moreover, we demonstrated how to provide end-to-end robustification of the control sequence against measurement noise in the output data, leading to strong performance guarantees. To handle the nonconvexity of the optimization problem, we exploited its structure and developed projected gradient descent algorithms that enable fast calculation of (sub)optimal solutions. The RoKDeePC was tested on two example nonlinear systems (including a high-fidelity converter model feeding a nonlinear load) and showed excellent performance with running times compatible with real-time implementation in real-world applications. A comparison with related nonlinear data-driven MPC methods showed the favorable performance of our approach.
Acknowledgment
The authors would like to thank Liviu Aolaritei, Francesco Micheli, and Jianzhe Zhen for fruitful discussions.
Appendix A Proof of Proposition IV.1
Proof.
Since is a polyhedron, we can compactly rewrite as where for some , , and [44].
Consider the Lagrangian of (39)
where is the vector of the dual variables. Since is nonempty, there exists a solution to the Karush–Kuhn–Tucker (KKT) conditions of (39)
(41c)–(41f)
Consider the Lagrangian of (40)
where is the vector of the dual variables. By choosing , , and according to (32) and (33), it can be verified that , where
if , and otherwise, and as before, satisfy the KKT conditions of (40)
(42c)–(42m)
Next we prove the monotonic relationship between and . Let be the minimizer of (39) with (and the minimizer of (40) with ), and the minimizer of (39) with (and the minimizer of (40) with ). If , we directly obtain according to (32). If , according to the definitions of and , we have
which can be reorganized as
(43)
Moreover, we have
which can be reorganized as
(44)
By combining (43) and (44), we obtain
which indicates that as . Then, we have .
According to the definitions of and , we also have
leading to
It can then be deduced that . Hence, is increasing with . By following a similar process, one can also prove that () is increasing with . This completes the proof. ∎
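The comparison argument in (43)-(44) is an instance of a standard monotonicity lemma for penalized minimizers; in generic notation (not the paper's symbols), it reads:

```latex
% Let g_i \in \arg\min_g \, f(g) + \lambda_i h(g) for i = 1, 2,
% with 0 < \lambda_1 < \lambda_2. Optimality of each minimizer gives
f(g_1) + \lambda_1 h(g_1) \le f(g_2) + \lambda_1 h(g_2),
\qquad
f(g_2) + \lambda_2 h(g_2) \le f(g_1) + \lambda_2 h(g_1).
% Summing the two inequalities and cancelling the f terms yields
(\lambda_2 - \lambda_1)\bigl(h(g_2) - h(g_1)\bigr) \le 0
\quad\Longrightarrow\quad h(g_2) \le h(g_1),
% so the penalized quantity is monotone in the weight; substituting back
% into the first inequality then gives f(g_1) \le f(g_2), i.e., the
% unpenalized term is monotone in the opposite direction.
```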
References
- [1] I. Markovsky and F. Dörfler, “Behavioral systems theory in data-driven analysis, signal processing, and control,” Annual Reviews in Control, vol. 52, pp. 42–64, 2021.
- [2] L. Huang, J. Coulson, J. Lygeros, and F. Dörfler, “Decentralized data-enabled predictive control for power system oscillation damping,” IEEE Transactions on Control Systems Technology, 2021.
- [3] E. Elokda, J. Coulson, P. N. Beuchat, J. Lygeros, and F. Dörfler, “Data-enabled predictive control for quadcopters,” International Journal of Robust and Nonlinear Control, vol. 31, no. 18, pp. 8916–8936, 2021.
- [4] L. Ljung, “System identification,” Wiley Encyclopedia of Electrical and Electronics Engineering, pp. 1–19, 1999.
- [5] M. Morari and J. H. Lee, “Model predictive control: past, present and future,” Computers & Chemical Engineering, vol. 23, no. 4-5, pp. 667–682, 1999.
- [6] H. Hjalmarsson, M. Gevers, and F. De Bruyne, “For model-based control design, closed-loop identification gives better performance,” Automatica, vol. 32, no. 12, pp. 1659–1673, 1996.
- [7] S. Formentin and A. Chiuso, “Core: Control-oriented regularization for system identification,” in 2018 IEEE Conference on Decision and Control (CDC). IEEE, 2018, pp. 2253–2258.
- [8] F. Dörfler, J. Coulson, and I. Markovsky, “Bridging direct & indirect data-driven control formulations via regularizations and relaxations,” IEEE Transactions on Automatic Control, 2022.
- [9] L. Campestrini, D. Eckhard, A. S. Bazanella, and M. Gevers, “Data-driven model reference control design by prediction error identification,” Journal of the Franklin Institute, vol. 354, no. 6, pp. 2628–2647, 2017.
- [10] M. O. Williams, I. G. Kevrekidis, and C. W. Rowley, “A data-driven approximation of the Koopman operator: Extending dynamic mode decomposition,” Journal of Nonlinear Science, vol. 25, no. 6, pp. 1307–1346, 2015.
- [11] C. E. Rasmussen, “Gaussian processes in machine learning,” in Summer school on machine learning. Springer, 2003, pp. 63–71.
- [12] G. Pillonetto, F. Dinuzzo, T. Chen, G. De Nicolao, and L. Ljung, “Kernel methods in system identification, machine learning and function estimation: A survey,” Automatica, vol. 50, no. 3, pp. 657–682, 2014.
- [13] R. Pascanu, C. Gulcehre, K. Cho, and Y. Bengio, “How to construct deep recurrent neural networks,” arXiv preprint arXiv:1312.6026, 2013.
- [14] Y. Lian, R. Wang, and C. N. Jones, “Koopman based data-driven predictive control,” arXiv preprint arXiv:2102.05122, 2021.
- [15] H. J. van Waarde and R. Sepulchre, “Kernel-based models for system analysis,” arXiv preprint arXiv:2110.11735, 2021.
- [16] I. Lenz, R. A. Knepper, and A. Saxena, “DeepMPC: Learning deep latent features for model predictive control,” in Robotics: Science and Systems, vol. 10. Rome, Italy, 2015.
- [17] M. Korda and I. Mezić, “Linear predictors for nonlinear dynamical systems: Koopman operator meets model predictive control,” Automatica, vol. 93, pp. 149–160, 2018.
- [18] J. Kocijan, R. Murray-Smith, C. E. Rasmussen, and A. Girard, “Gaussian process model based predictive control,” in Proceedings of the 2004 American control conference, vol. 3. IEEE, 2004, pp. 2214–2219.
- [19] Y. Chen, Y. Shi, and B. Zhang, “Optimal control via neural networks: A convex approach,” in International Conference on Learning Representations, 2018.
- [20] L. Hewing, J. Kabzan, and M. N. Zeilinger, “Cautious model predictive control using gaussian process regression,” IEEE Transactions on Control Systems Technology, vol. 28, no. 6, pp. 2736–2743, 2019.
- [21] E. T. Maddalena, P. Scharnhorst, Y. Jiang, and C. N. Jones, “Kpc: Learning-based model predictive control with deterministic guarantees,” in Learning for Dynamics and Control. PMLR, 2021, pp. 1015–1026.
- [22] B. Schölkopf, R. Herbrich, and A. J. Smola, “A generalized representer theorem,” in International conference on computational learning theory. Springer, 2001, pp. 416–426.
- [23] J. C. Willems, P. Rapisarda, I. Markovsky, and B. L. De Moor, “A note on persistency of excitation,” Systems & Control Letters, vol. 54, no. 4, pp. 325–329, 2005.
- [24] J. Coulson, J. Lygeros, and F. Dörfler, “Data-enabled predictive control: In the shallows of the DeePC,” in 2019 18th European Control Conference (ECC). IEEE, 2019, pp. 307–312.
- [25] C. De Persis and P. Tesi, “Formulas for data-driven control: Stabilization, optimality, and robustness,” IEEE Trans. Autom. Control, vol. 65, no. 3, pp. 909–924, 2019.
- [26] J. Berberich, J. Köhler, M. A. Muller, and F. Allgower, “Data-driven model predictive control with stability and robustness guarantees,” IEEE Trans. Autom. Control, vol. 66, no. 4, pp. 1702–1717, 2021.
- [27] J. Coulson, J. Lygeros, and F. Dörfler, “Regularized and distributionally robust data-enabled predictive control,” in 2019 IEEE 58th Conference on Decision and Control (CDC). IEEE, 2019.
- [28] L. Huang, J. Zhen, J. Lygeros, and F. Dörfler, “Quadratic regularization of data-enabled predictive control: Theory and application to power converter experiments,” arXiv preprint arXiv:2012.04434, 2020.
- [29] ——, “Robust data-enabled predictive control: Tractable formulations and performance guarantees,” arXiv preprint arXiv:2105.07199, 2021.
- [30] J. Berberich and F. Allgöwer, “A trajectory-based framework for data-driven system analysis and control,” in 2020 European Control Conference (ECC). IEEE, 2020, pp. 1365–1370.
- [31] J. G. Rueda-Escobedo and J. Schiffer, “Data-driven internal model control of second-order discrete volterra systems,” in 2020 59th IEEE Conference on Decision and Control (CDC). IEEE, 2020, pp. 4572–4579.
- [32] M. Alsalti, J. Berberich, V. G. Lopez, F. Allgöwer, and M. A. Müller, “Data-based system analysis and control of flat nonlinear systems,” arXiv preprint arXiv:2103.02892, 2021.
- [33] I. Markovsky, “Data-driven simulation of nonlinear systems via linear time-invariant embedding,” Vrije Universiteit Brussel, Tech. Rep, 2021.
- [34] I. Markovsky and F. Dörfler, “Identifiability in the behavioral setting,” 2020, available online.
- [35] I. Markovsky and P. Rapisarda, “Data-driven simulation and control,” International Journal of Control, vol. 81, no. 12, pp. 1946–1959, 2008.
- [36] J. Coulson, J. Lygeros, and F. Dörfler, “Distributionally robust chance constrained data-enabled predictive control,” IEEE Transactions on Automatic Control, 2021.
- [37] H. J. van Waarde, C. De Persis, M. K. Camlibel, and P. Tesi, “Willems’ fundamental lemma for state-space systems and its extension to multiple datasets,” IEEE Control Systems Letters, vol. 4, no. 3, pp. 602–607, 2020.
- [38] S. Saitoh and Y. Sawano, Theory of reproducing kernels and applications. Springer, 2016.
- [39] N. Aronszajn, “Theory of reproducing kernels,” Transactions of the American Mathematical Society, vol. 68, no. 3, pp. 337–404, 1950.
- [40] E. T. Maddalena, P. Scharnhorst, and C. N. Jones, “Deterministic error bounds for kernel-based learning techniques under bounded noise,” Automatica, vol. 134, p. 109896, 2021.
- [41] C. A. Micchelli, Y. Xu, and H. Zhang, “Universal kernels.” Journal of Machine Learning Research, vol. 7, no. 12, 2006.
- [42] E. C. Kerrigan and J. M. Maciejowski, “Soft constraints and exact penalty functions in model predictive control,” in Control 2000 Conference, Cambridge. Citeseer, 2000, pp. 2319–2327.
- [43] L. El Ghaoui and H. Lebret, “Robust solutions to least-squares problems with uncertain data,” SIAM Journal on matrix analysis and applications, vol. 18, no. 4, pp. 1035–1064, 1997.
- [44] A. Ben-Tal, D. den Hertog, and J.-P. Vial, “Deriving robust counterparts of nonlinear uncertain inequalities,” Mathematical Programming, vol. 149, no. 1, pp. 265–299, 2015.
- [45] B. Stellato, G. Banjac, P. Goulart, A. Bemporad, and S. Boyd, “OSQP: An operator splitting solver for quadratic programs,” Mathematical Programming Computation, pp. 1–36, 2020.
- [46] F. Dörfler and F. Bullo, “Synchronization and transient stability in power networks and nonuniform Kuramoto oscillators,” SIAM Journal on Control and Optimization, vol. 50, no. 3, pp. 1616–1642, 2012.
- [47] S. D’Arco, J. A. Suul, and O. B. Fosso, “A virtual synchronous machine implementation for distributed control of power converters in smartgrids,” Elect. Power Syst. Res., vol. 122, pp. 180–197, 2015.
- [48] C. Yang, L. Huang, H. Xin, and P. Ju, “Placing grid-forming converters to enhance small signal stability of pll-integrated power systems,” IEEE Transactions on Power Systems, vol. 36, no. 4, pp. 3563–3573, 2021.