
Tuhin Sarkar (tsarkar@mit.edu), Alexander Rakhlin (rakhlin@mit.edu), Munther A. Dahleh (dahleh@mit.edu)

77 Massachusetts Ave, Cambridge, MA 02139

Nonparametric Finite Time LTI System Identification

Abstract

We address the problem of learning the parameters of a stable linear time invariant (LTI) system or linear dynamical system (LDS) with unknown latent space dimension, or order, from a single time–series of noisy input-output data. We focus on learning the best lower order approximation allowed by finite data. Motivated by subspace algorithms in systems theory, where the doubly infinite system Hankel matrix captures both order and good lower order approximations, we construct a Hankel-like matrix from noisy finite data using ordinary least squares. This circumvents the non-convexities that arise in system identification, and allows accurate estimation of the underlying LTI system. Our results rely on careful analysis of self-normalized martingale difference terms that helps bound identification error up to logarithmic factors of the lower bound. We provide a data-dependent scheme for order selection and find an accurate realization of system parameters, corresponding to that order, by an approach that is closely related to the Ho-Kalman subspace algorithm. We demonstrate that the proposed model order selection procedure is not overly conservative, i.e., for the given data length it is not possible to estimate higher order models or find higher order approximations with reasonable accuracy.

keywords:
Linear Dynamical Systems, System Identification, Nonparametric Statistics, Control Theory, Statistical Learning Theory

1 Introduction

Finite-time system identification—the problem of estimating the system parameters given a finite single time series of its output—is an important problem in the context of control theory, time series analysis, robotics, and economics, among many others. In this work, we focus on parameter estimation and model approximation of linear time invariant (LTI) systems or linear dynamical system (LDS), which are described by

$$X_{t+1} = AX_{t} + BU_{t} + \eta_{t+1}$$
$$Y_{t} = CX_{t} + w_{t}. \qquad (1)$$

Here $C\in\mathbb{R}^{p\times n}$, $A\in\mathbb{R}^{n\times n}$, $B\in\mathbb{R}^{n\times m}$; $\{\eta_t, w_t\}_{t=1}^{\infty}$ are process and output noise, $U_t$ is an external control input, $X_t$ is the latent state variable and $Y_t$ is the observed output. The goal here is parameter estimation, i.e., learning $(C,A,B)$ from a single finite time series of $\{Y_t, U_t\}_{t=1}^{T}$ when the order, $n$, is unknown. Since typically $p, m < n$, it becomes challenging to find suitable parametrizations of LTI systems for provably efficient learning. When $\{X_j\}_{j=1}^{\infty}$ are observed (or $C$ is known to be the identity matrix), identification of $(C,A,B)$ in Eq. (1) is significantly easier, and ordinary least squares (OLS) is a statistically optimal estimator. It is, in general, unclear how (or if) OLS can be employed in the case when the $X_t$'s are not observed.

To motivate the study of a lower-order approximation of a high-order system, consider the following example:

Example 1.1.

Consider $M_1 = (A_1, B_1, C_1)$ with

$$A_{1} = \begin{bmatrix}0&1&0&0&\ldots&0\\ 0&0&1&0&\ldots&0\\ \vdots&\vdots&\vdots&\vdots&\ddots&\vdots\\ 0&0&0&0&\ldots&1\\ -a&0&0&0&\ldots&0\end{bmatrix}_{n\times n}\quad B_{1} = \begin{bmatrix}0\\ 0\\ \vdots\\ 0\\ 1\end{bmatrix}_{n\times 1}\quad C_{1} = B_{1}^{\top} \qquad (2)$$

where $na \ll 1$ and $n > 20$. Here the order of $M_1$ is $n$. However, it can be approximated well by $M_2$, which is of a much lower order and given by

$$A_{2} = \begin{bmatrix}0&0\\ 1&0\end{bmatrix}\quad B_{2} = \begin{bmatrix}0\\ 1\end{bmatrix}\quad C_{2} = B_{2}^{\top}. \qquad (3)$$

For the same input $U_t$, if $Y^{(1)}_t, Y^{(2)}_t$ are the outputs generated by $M_1$ and $M_2$ respectively, then a simple computation shows that

$$\sup_{U}\sum_{t=1}^{\infty}\frac{(Y^{(1)}_{t}-Y^{(2)}_{t})^{2}}{U_{t}^{2}} \leq 4n^{2}a^{2} \ll 1.$$

This suggests that the actual value of $n$ is not important; rather, there exists an effective order, $r$ (which is $2$ in this case). This lower order model captures "most" of the LTI system.
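To make the gap concrete, a small numerical check (a sketch; the values of `n` and `a` below are illustrative choices satisfying $na \ll 1$, $n > 20$) can compare the impulse responses $C_iA_i^kB_i$ of $M_1$ and $M_2$:

```python
import numpy as np

n, a = 25, 1e-3 / 25                 # illustrative values with n*a << 1 and n > 20

# Build M1 = (A1, B1, C1) from Eq. (2)
A1 = np.zeros((n, n))
A1[np.arange(n - 1), np.arange(1, n)] = 1.0   # superdiagonal of ones
A1[-1, 0] = -a                                # -a in the bottom-left corner
B1 = np.zeros((n, 1)); B1[-1, 0] = 1.0
C1 = B1.T

# Build M2 = (A2, B2, C2) from Eq. (3)
A2 = np.array([[0.0, 0.0], [1.0, 0.0]])
B2 = np.array([[0.0], [1.0]])
C2 = B2.T

def markov(C, A, B, K):
    """Markov parameters [CB, CAB, ..., CA^{K-1}B] as a flat array."""
    out, X = [], B.copy()
    for _ in range(K):
        out.append((C @ X).item())
        X = A @ X
    return np.array(out)

K = 2000
diff = markov(C1, A1, B1, K) - markov(C2, A2, B2, K)
# l2 gap of the impulse responses; per the bound above it should be of order 2na
print("impulse-response gap:", np.linalg.norm(diff), " 2na =", 2 * n * a)
```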

Since the true model order is not known in many cases, we emphasize a nonparametric approach to identification: one which adaptively selects the best model order for the given data and approximates the underlying LTI system better as $T$ (the length of the data) grows. The key to this approach will be designing an estimator $\hat{M}$ from which we obtain a realization $(\hat{C}, \hat{A}, \hat{B})$ of the selected order.

1.1 Related Work

Linear time invariant systems are an extensively studied class of models in control and systems theory. These models are used in feedback control systems (for example in planetary soft landing systems for rockets (Açıkmeşe et al., 2013)) and as linear approximations to many non-linear systems that nevertheless work well in practice. In the absence of process and output noise, subspace-based system identification methods are known to learn $(C, A, B)$ (up to similarity transformation) (Ljung, 1987; Van Overschee and De Moor, 2012). These typically involve constructing a Hankel matrix from the input–output pairs and then obtaining system parameters by a singular value decomposition. Such methods are inspired by the celebrated Ho-Kalman realization algorithm (Ho and Kalman, 1966). The correctness of these methods is predicated on the knowledge of $n$ or the presence of infinite data. Other approaches include rank minimization-based methods for system identification (Fazel et al., 2013; Grussler et al., 2018), which further relax the rank constraint to a suitable convex formulation. However, there is a lack of statistical guarantees for these algorithms, and it is unclear how much data is required to obtain accurate estimates of system parameters from finite noisy data. Empirical methods such as the EM algorithm (Dempster et al., 1977) are also used in practice; however, these suffer from non-convexity in the problem formulation and can get trapped in local minima. Learning simpler approximations to complex models in the presence of finite noisy data was studied in Venkatesh and Dahleh (2001), where the identification error is decomposed into error due to approximation and error due to noise; however, the analysis assumes the knowledge of a "good" parametrization and does not provide statistical guarantees for learning the system parameters of such an approximation.

More recently, there has been a resurgence in the study of statistical identification of LTI systems from a single time series in the machine learning community. In cases when $C = I$, i.e., $X_t$ is observed directly, sharp finite time error bounds for identification of $A, B$ from a single time series are provided in Faradonbeh et al. (2017); Simchowitz et al. (2018); Sarkar and Rakhlin (2018). The approach to finding $A, B$ is based on standard ordinary least squares (OLS), given by

$$(\hat{A}, \hat{B}) = \arg\min_{A,B}\sum_{t=1}^{T}||X_{t+1} - [A, B][X_t^{\top}, U_t^{\top}]^{\top}||_2^2.$$
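For concreteness, a minimal numpy sketch of this fully observed OLS estimator (the dimensions, horizon, and example system below are our own illustrative choices) is:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, T = 3, 2, 500                      # illustrative dimensions and horizon

A = np.diag([0.9, 0.5, -0.3])            # an example Schur stable A
B = rng.standard_normal((n, m))

# Simulate X_{t+1} = A X_t + B U_t + eta_{t+1} with the state fully observed
X = np.zeros((T + 1, n))
U = rng.standard_normal((T, m))
for t in range(T):
    X[t + 1] = A @ X[t] + B @ U[t] + rng.standard_normal(n)

# OLS: regress X_{t+1} on [X_t, U_t]
Z = np.hstack([X[:-1], U])               # T x (n + m) regressor matrix
Theta, *_ = np.linalg.lstsq(Z, X[1:], rcond=None)
A_hat, B_hat = Theta[:n].T, Theta[n:].T  # [A, B] = Theta^T
print(np.linalg.norm(A_hat - A), np.linalg.norm(B_hat - B))
```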

Another closely related area is that of online prediction in time series Hazan et al. (2018); Agarwal et al. (2018). Finite time regret guarantees for prediction in linear time series are provided in Hazan et al. (2018). The approach there circumvents the need for system identification and instead uses a filtering technique that convolves the time series with eigenvectors of a specific Hankel matrix.

Closest to our work is that of Oymak and Ozay (2018). Their algorithm, which takes inspiration from the Ho-Kalman algorithm, assumes the knowledge of the model order $n$. This limits the applicability of the algorithm in two ways: first, it is unclear how the techniques can be extended to the case when $n$ is unknown, as is usually the case, and second, in many cases $n$ is very large and a much lower order LTI system can be a very good approximation of the original system. In such cases, constructing the order $n$ estimate might be unnecessarily conservative (see Example 1.1). Consequently, the error bounds do not reflect accurate dependence on the system parameters.

When $n$ is unknown, it is unclear when a singular value decomposition should be performed to obtain the parameter estimates via the Ho-Kalman algorithm. This leads to the question of model order selection from data. For subspace based methods, such problems have been addressed in Shibata (1976) and Bauer (2000), where order estimation is achieved by analyzing the information contained in the estimated singular values and/or the estimated innovation variance; these works provide guarantees for asymptotic consistency of the methods described. Another line of literature, studied in Ljung et al. (2015) for example, approaches the identification of systems with unknown order by first learning the largest possible model that fits the data and then performing model reduction to obtain the final system. Although one can show that asymptotically this method outputs the true model, we show that such a two step procedure may underperform in a finite time setting. A possible explanation for this could be that learning the largest possible model with finite data over-fits on the exogenous noise and therefore gives poor model estimates.

Other related work on identifying finite impulse response approximations includes Goldenshluger (1998) and Tu et al. (2017), but these do not discuss parameter estimation or reduced order modeling. Several authors, e.g., Campi and Weyer (2002); Shah et al. (2012); Hardt et al. (2016) and references therein, have studied the problem of system identification in different contexts. However, they fail to capture the correct dependence of the error rates on the system parameters. More importantly, they suffer from the same limitation as Oymak and Ozay (2018) in that they require the knowledge of $n$.

2 Mathematical Preliminaries

Throughout the paper, we will refer to an LTI system with dynamics as in Eq. (1) by $M = (C, A, B)$. For a matrix $A$, let $\sigma_i(A)$ be the $i^{\text{th}}$ singular value of $A$ with $\sigma_i(A) \geq \sigma_{i+1}(A)$. Further, $\sigma_{\max}(A) = \sigma_1(A) = \sigma(A)$. Similarly, we define $\rho_i(A) = |\lambda_i(A)|$, where $\lambda_i(A)$ is an eigenvalue of $A$ with $\rho_i(A) \geq \rho_{i+1}(A)$. Again, $\rho_{\max}(A) = \rho_1(A) = \rho(A)$.

Definition 2.1.

A matrix $A$ is Schur stable if $\rho_{\max}(A) < 1$.

We will only be interested in the class of LTI systems that are Schur stable. Fix $\gamma > 0$ (and possibly much greater than $1$). The model class $\mathcal{M}_r$ of LTI systems parametrized by $r\in\mathbb{Z}_+$ is defined as

$$\mathcal{M}_r = \{(C, A, B)\;|\;C\in\mathbb{R}^{p\times r}, A\in\mathbb{R}^{r\times r}, B\in\mathbb{R}^{r\times m}, \rho(A) < 1, \sigma(A)\leq\gamma\}. \qquad (4)$$
Definition 2.2.

The $(k, p, q)$-dimensional Hankel matrix for $M = (C, A, B)$ is defined as

$$\mathcal{H}_{k,p,q}(M) = \begin{bmatrix}CA^{k}B&CA^{k+1}B&\ldots&CA^{q+k-1}B\\ CA^{k+1}B&CA^{k+2}B&\ldots&CA^{q+k}B\\ \vdots&\vdots&\ddots&\vdots\\ CA^{p+k-1}B&\ldots&\ldots&CA^{p+q+k-2}B\end{bmatrix}$$

and its associated Toeplitz matrix as

$$\mathcal{T}_{k,d}(M) = \begin{bmatrix}0&0&\ldots&0&0\\ CA^{k}B&0&\ldots&0&0\\ \vdots&\ddots&\ddots&\vdots&0\\ CA^{d+k-3}B&\ldots&CA^{k}B&0&0\\ CA^{d+k-2}B&CA^{d+k-3}B&\ldots&CA^{k}B&0\end{bmatrix}.$$

We will slightly abuse notation by referring to $\mathcal{H}_{k,p,q}(M) = \mathcal{H}_{k,p,q}$, and similarly for the Toeplitz matrices $\mathcal{T}_{k,d}(M) = \mathcal{T}_{k,d}$. The matrix $\mathcal{H}_{0,\infty,\infty}(M)$ is known as the system Hankel matrix corresponding to $M$, and its rank is known as the model order (or simply order) of $M$. The system Hankel matrix has two well-known properties that make it useful for system identification. First, the rank of $\mathcal{H}_{0,\infty,\infty}$ is upper bounded by $n$. Second, it maps the "past" inputs to "future" outputs. These properties are discussed in detail in Section 9.2 of the appendix. For infinite matrices such as $\mathcal{H}_{0,\infty,\infty}$, we let $||\mathcal{H}_{0,\infty,\infty}||_2 \triangleq ||\mathcal{H}_{0,\infty,\infty}||_{\text{op}}$, i.e., the operator norm.
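As an illustration, small helpers (a sketch; the finite-dimensional system $(C, A, B)$ is whatever model one has at hand) that build the finite blocks $\mathcal{H}_{k,p,q}$ and $\mathcal{T}_{k,d}$ from the Markov parameters $CA^{j}B$ could look like:

```python
import numpy as np

def markov_params(C, A, B, num):
    """Return the list [CB, CAB, ..., CA^{num-1}B]."""
    out, Ak = [], np.eye(A.shape[0])
    for _ in range(num):
        out.append(C @ Ak @ B)
        Ak = A @ Ak
    return out

def hankel_block(C, A, B, k, p, q):
    """H_{k,p,q}: block (i, j) equals C A^{k+i+j} B."""
    G = markov_params(C, A, B, k + p + q)
    return np.block([[G[k + i + j] for j in range(q)] for i in range(p)])

def toeplitz_block(C, A, B, k, d):
    """T_{k,d}: strictly lower block-triangular with block (i, j) = C A^{k+i-j-1} B for i > j."""
    G = markov_params(C, A, B, k + d)
    Z = np.zeros((C.shape[0], B.shape[1]))
    return np.block([[G[k + i - j - 1] if i > j else Z for j in range(d)]
                     for i in range(d)])
```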

Definition 2.3.

The transfer function of $M = (C, A, B)$ is given by $G(z) = C(zI - A)^{-1}B$, where $z\in\mathbb{C}$.

The transfer function plays a critical role in control theory as it relates the input to the output. Succinctly, the transfer function of an LTI system is the Z-transform of the output in response to a unit impulse input. Since for any invertible $S$ the LTI systems $M_1 = (CS^{-1}, SAS^{-1}, SB)$ and $M_2 = (C, A, B)$ have identical transfer functions, identification may not be unique, but is equivalent up to a transformation $S$, i.e., $(C, A, B) \equiv (CS, S^{-1}AS, S^{-1}B)$. Next, we define a system norm that will be important from the perspective of model identification and approximation.

Definition 2.4.

The $\mathcal{H}_{\infty}$-system norm of a Schur stable LTI system $M$ is given by

$$||M||_{\infty} = \sup_{\omega\in\mathbb{R}}\sigma_{\max}(G(e^{j\omega})).$$

Here, $G(\cdot)$ is the transfer function of $M$. The $r$-truncation of the transfer function is defined as

$$G_r \coloneqq [CB, CAB, \ldots, CA^{r-1}B]. \qquad (5)$$

For a stable LTI system $M$ we have:

Proposition 1 (Lemma 2.2 Glover (1987)).

Let $M$ be an LTI system. Then

$$||M||_{H} = \sigma_1 \leq ||M||_{\infty} \leq 2(\sigma_1 + \ldots + \sigma_n)$$

where $\sigma_i$ are the singular values of $\mathcal{H}_{0,\infty,\infty}(M)$.
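To see Proposition 1 in action, a quick numerical check (a sketch; the example system, the Hankel truncation size, and the frequency grid are arbitrary illustrative choices) compares the Hankel singular values of a large finite block with a gridded estimate of $||M||_{\infty}$:

```python
import numpy as np

# An example Schur stable SISO system (illustrative choice)
A = np.diag([0.9, -0.5, 0.2])
B = np.ones((3, 1))
C = np.ones((1, 3))

# Large finite Hankel block as a proxy for H_{0,inf,inf}
K = 200
G = [C @ np.linalg.matrix_power(A, k) @ B for k in range(2 * K)]
H = np.block([[G[i + j] for j in range(K)] for i in range(K)])
svals = np.linalg.svd(H, compute_uv=False)

# H_infinity norm via a frequency grid of G(e^{jw}) = C (e^{jw} I - A)^{-1} B
ws = np.linspace(0, np.pi, 2000)
hinf = max(np.linalg.norm(C @ np.linalg.inv(np.exp(1j * w) * np.eye(3) - A) @ B, 2)
           for w in ws)

# sigma_1 <= ||M||_inf <= 2 * sum(sigma_i), as in Proposition 1
print(svals[0], "<=", hinf, "<=", 2 * svals.sum())
```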

For any matrix $Z$, define $Z_{m:n, p:q}$ as the submatrix including rows $m$ to $n$ and columns $p$ to $q$. Further, $Z_{m:n, :}$ is the submatrix including rows $m$ to $n$ and all columns, and a similar notation applies to $Z_{:, p:q}$. Finally, we define balanced truncated models, which will play an important role in our algorithm.

Definition 2.5 (Kung and Lin (1981)).

Let $\mathcal{H}_{0,\infty,\infty}(M) = U\Sigma V^{\top}$ where $\Sigma\in\mathbb{R}^{n\times n}$ ($n$ is the model order). Then for any $r\leq n$, the $r$-order balanced truncated model parameters are given by

$$C_r = [U\Sigma^{1/2}]_{1:p,1:r},\quad A_r = \Sigma_{1:r,1:r}^{-1/2}U_{:,1:r}^{\top}[U\Sigma^{1/2}]_{p+1:,1:r},\quad B_r = [\Sigma^{1/2}V^{\top}]_{1:r,1:m}.$$

For $r > n$, the $r$-order balanced truncated model parameters are simply the $n$-order truncated model parameters.

Definition 2.6.

We say a random vector $v\in\mathbb{R}^d$ is subgaussian with variance proxy $\tau^2$ if

$$\sup_{||\theta||_2 = 1}\sup_{p\geq 1}\left\{p^{-1/2}\left(\mathbb{E}[|\langle v, \theta\rangle|^{p}]\right)^{1/p}\right\} = \tau$$

and $\mathbb{E}[v] = \mathbf{0}$. We denote this by $v\sim\mathsf{subg}(\tau^2)$.

A fundamental result in model reduction from systems theory is the following

Theorem 2 (Theorem 21.26 Zhou et al. (1996)).

Let $M = (C, A, B)$ be the true model of order $n$ and $M_r = (C_r, A_r, B_r)$ be its balanced truncated model of order $r < n$. Assume that $\sigma_r \neq \sigma_{r+1}$. Then

$$||M - M_r||_{\infty} \leq 2(\sigma_{r+1} + \sigma_{r+2} + \ldots + \sigma_n)$$

where $\sigma_i$ are the Hankel singular values of $M$.

Critical to obtaining refined error rates will be a result from the theory of self-normalized martingales, an application of the pseudo-maximization technique in (Peña et al., 2008, Theorem 14.7):

Theorem 3.

Let $\{\bm{\mathcal{F}}_t\}_{t=0}^{\infty}$ be a filtration. Let $\{\eta_t\in\mathbb{R}^m, X_t\in\mathbb{R}^d\}_{t=1}^{\infty}$ be stochastic processes such that $\eta_t, X_t$ are $\bm{\mathcal{F}}_t$-measurable and $\eta_t$ is $\bm{\mathcal{F}}_{t-1}$-conditionally $\mathsf{subg}(L^2)$ for some $L > 0$. For any $t\geq 0$, define $V_t = \sum_{s=1}^{t}X_sX_s^{\prime}$ and $S_t = \sum_{s=1}^{t}\eta_{s+1}X_s$. Then for any $\delta > 0$, $V\succ 0$ and all $t\geq 0$ we have with probability at least $1-\delta$

$$S_t^{\top}(V + V_t)^{-1}S_t \leq 4L^2\left(\log\frac{1}{\delta} + \log\frac{\det(V + V_t)}{\det(V)} + m\right).$$

The proof of this result can be found as Theorem 7.

We denote by $c$ universal constants which can change from line to line. For numbers $a, b$, we define $a\wedge b \triangleq \min(a, b)$ and $a\vee b \triangleq \max(a, b)$.

Finally, for two matrices $M_1\in\mathbb{R}^{l_1\times l_1}, M_2\in\mathbb{R}^{l_2\times l_2}$ with $l_1 < l_2$, we define $M_1 - M_2 \triangleq \tilde{M}_1 - M_2$ where $\tilde{M}_1 = \begin{bmatrix}M_1 & 0_{l_1\times(l_2-l_1)}\\ 0_{(l_2-l_1)\times l_1} & 0_{(l_2-l_1)\times(l_2-l_1)}\end{bmatrix}$.

Proposition 4 (System Reduction).

Let $||S - P|| \leq \epsilon$ and let the singular values of $S$ be arranged as follows:

$$\sigma_1(S) > \ldots > \sigma_{r-1}(S) > \sigma_r(S) \geq \sigma_{r+1}(S) \geq \ldots \geq \sigma_s(S) > \sigma_{s+1}(S) > \ldots > \sigma_n(S) > \sigma_{n+1}(S) = 0$$

Furthermore, let $\epsilon$ be such that

$$\epsilon \leq \inf_{\{1\leq i\leq r-1\}\cup\{s+1\leq i\leq n\}}\Big(\frac{\sigma_i(P) - \sigma_{i+1}(P)}{2}\Big). \qquad (6)$$

Define $K_0 = [1, 2, \ldots, r-1]\cup[s+1, s+2, \ldots, n]$. Then

$$||U^S_{K_0}(\Sigma^S_{K_0})^{1/2} - U^P_{K_0}(\Sigma^P_{K_0})^{1/2}||_2 \leq 2\sqrt{\sum_{i=1}^{r-1}\frac{\sigma_i\epsilon^2}{(\sigma_i - \sigma_{i+1})^2\wedge(\sigma_{i-1} - \sigma_i)^2}} + 2\sqrt{\frac{\sigma_s\epsilon^2}{((\sigma_{r-1} - \sigma_s)\wedge(\sigma_r - \sigma_{s+1}))^2}} + \sup_{1\leq i\leq s}|\sqrt{\sigma_i} - \sqrt{\hat{\sigma}_i}|$$

where $\sigma_i = \sigma_i(S)$ and $\hat{\sigma}_i = \sigma_i(P)$.

The proof is provided as Proposition 4 in the appendix. This is an extension of Wedin's result that allows us to scale the recovery error of the $r^{th}$ singular vector by only the condition number associated with that singular vector. This is useful to represent the error of identifying an $r$-order approximation as a function of the $r^{th}$ singular value only.

We briefly summarize our contributions below.

3 Contributions

In this paper we provide a purely data-driven approach to system identification from a single time–series of finite noisy data. Drawing from tools in systems theory and the theory of self–normalized martingales, we offer a nearly optimal OLS-based algorithm to learn the system parameters. We summarize our contributions below:

  • The central theme of our approach is to estimate the infinite system Hankel matrix (to be defined below) with increasing accuracy as the length $T$ of the data grows. By utilizing a specific reformulation of the input–output relation in Eq. (1), we reduce the problem of Hankel matrix identification to that of regression between appropriately transformed versions of the output and input. The OLS solution is a matrix $\hat{\mathcal{H}}$ of size $\hat{d}$. More precisely, we show that with probability at least $1-\delta$,

    $$\Big|\Big|\hat{\mathcal{H}} - \mathcal{H}_{0,\hat{d},\hat{d}}\Big|\Big|_2 \lesssim \sqrt{\frac{\beta^2\hat{d}}{T}}\sqrt{p\hat{d} + \log\frac{T}{\delta}}$$

    for $T$ above a certain threshold, where $\mathcal{H}_{0,\hat{d},\hat{d}}$ is the $p\hat{d}\times m\hat{d}$ principal submatrix of the system Hankel matrix. Here $\beta$ is the $\mathcal{H}_{\infty}$-system norm.

  • We show that by growing $\hat{d}$ with $T$ in a specific fashion, $\hat{\mathcal{H}}$ becomes the minimax optimal estimator of the system Hankel matrix. The choice of $\hat{d}$ for a fixed $T$ is purely data-dependent and does not depend on the spectral radius of $A$ or on $n$.

  • It is well known in systems theory that an SVD of the doubly infinite system Hankel matrix gives us $A, B, C$. However, the presence of finite noisy data prevents learning these parameters accurately. We show that it is always possible to learn the parameters of a lower-order approximation of the underlying system. This is achieved by selecting the top $k$ singular vectors of $\hat{\mathcal{H}}$. The estimation guarantee corresponds to model selection in statistics. More precisely, for every $k\leq\hat{d}$, if $(A_k, B_k, C_k)$ are the parameters of a $k$-order balanced approximation of the original LTI system and $(\hat{A}_k, \hat{B}_k, \hat{C}_k)$ are the estimates of our algorithm, then for $T$ above a certain threshold we have

    $$||C_k - \hat{C}_k||_2 + ||A_k - \hat{A}_k||_2 + ||B_k - \hat{B}_k||_2 \lesssim \sqrt{\frac{\beta^2\hat{d}}{\hat{\sigma}_k^2 T}}\sqrt{p\hat{d} + \log\frac{T}{\delta}}$$

    with probability at least $1-\delta$, where $\hat{\sigma}_i$ is the $i^{\text{th}}$ largest singular value of $\hat{\mathcal{H}}$.

4 Problem Formulation and Discussion

4.1 Data Generation

Assume there exists an unknown $M = (C, A, B)\in\mathcal{M}_n$ for some unknown $n$. Let the transfer function of $M$ be $G(z)$. Suppose we observe the noisy output time series $\{Y_t\in\mathbb{R}^{p\times 1}\}_{t=1}^{T}$ in response to a user chosen input series $\{U_t\in\mathbb{R}^{m\times 1}\}_{t=1}^{T}$. We refer to this data generated by $M$ as $Z_T = \{(U_t, Y_t)\}_{t=1}^{T}$. We enforce the following assumptions on $M$.

Assumption 1

The noise processes $\{\eta_t, w_t\}_{t=1}^{\infty}$ in the dynamics of $M$ given by Eq. (1) are i.i.d., and $\eta_t, w_t$ are isotropic with subGaussian parameter $1$. Furthermore, $X_0 = 0$ almost surely. We will only select inputs $\{U_t\}_{t=1}^{T}$ that are isotropic subGaussian with subGaussian parameter $1$.

The input–output map of Eq. (1) can be represented in multiple alternate ways. One commonly used reformulation of the input–output map in systems and control theory is the following

$$\begin{bmatrix}Y_1\\ Y_2\\ \vdots\\ Y_T\end{bmatrix} = \mathcal{T}_{0,T}\begin{bmatrix}U_1\\ U_2\\ \vdots\\ U_T\end{bmatrix} + \mathcal{T}\mathcal{O}_{0,T}\begin{bmatrix}\eta_1\\ \eta_2\\ \vdots\\ \eta_T\end{bmatrix} + \begin{bmatrix}w_1\\ w_2\\ \vdots\\ w_T\end{bmatrix}$$

where $\mathcal{T}\mathcal{O}_{k,d}$ is defined as the Toeplitz matrix corresponding to the process noise $\eta_t$ (similar to Definition 2.2):

$$\mathcal{T}\mathcal{O}_{k,d} = \begin{bmatrix}0&0&\ldots&0&0\\ CA^{k}&0&\ldots&0&0\\ \vdots&\ddots&\ddots&\vdots&0\\ CA^{d+k-3}&\ldots&CA^{k}&0&0\\ CA^{d+k-2}&CA^{d+k-3}&\ldots&CA^{k}&0\end{bmatrix}.$$

$||\mathcal{T}_{0,T}||_2$ and $||\mathcal{T}\mathcal{O}_{0,T}||_2$ denote the observed amplifications of the control input and process noise respectively. Note that stability of $A$ ensures $||\mathcal{T}_{0,\infty}||_2, ||\mathcal{T}\mathcal{O}_{0,\infty}||_2 < \infty$. Suppose both $\eta_t, w_t = 0$ in Eq. (1). Then it is a well-known fact that

$$||M||_{\infty} = \sup_{U_t}\sqrt{\frac{\sum_{t=0}^{\infty}Y_t^{\top}Y_t}{\sum_{t=0}^{\infty}U_t^{\top}U_t}} \implies ||M||_{\infty} = ||\mathcal{T}_{0,\infty}||_2 \geq ||\mathcal{H}_{0,\infty,\infty}||_2. \qquad (7)$$

Assumption 2

There exist universal constants $\beta, R\geq 1$ such that $||\mathcal{T}_{0,\infty}||_2 \leq \beta$ and $\frac{||\mathcal{T}\mathcal{O}_{0,\infty}||_2}{||\mathcal{T}_{0,\infty}||_2} \leq R$.

Remark 4.1 (\mathcal{H}_{\infty}-norm estimation).

Assumption 2 imposes an upper bound on the $\mathcal{H}_{\infty}$-norm of the system. It is possible to estimate $||M||_{\infty}$ from data (see Tu et al. (2018a) and references therein). It is reasonable to expect that error rates for identification of the parameters $(C, A, B)$ depend on the noise-to-signal ratio $\frac{||\mathcal{T}\mathcal{O}_{0,\infty}||_2}{||\mathcal{T}_{0,\infty}||_2}$, i.e., identification is much harder when this ratio is large.

Remark 4.2 (RR estimation).

The noise-to-signal ratio hyperparameter can also be estimated from data by allowing the system to run with $U_t = 0$ and taking the average $\ell_2$ norm of the output $Y_t$, i.e., $(1/T)\sum_{t=1}^{T}\|Y_t\|_2^2$. For the purpose of the results of this paper we simply assume an upper bound on $R$. If $U_t$ were $\mathsf{subg}(L)$ instead of $\mathsf{subg}(1)$, the noise-to-signal ratio would be modified to $R/L$.

$m$: input dimension, $p$: output dimension
$\gamma$: known upper bound on $||A||_2$
$\delta$: error probability
$c, \mathcal{C}$: known absolute constants
$R$: known noise-to-signal ratio, i.e., $\frac{||\mathcal{T}\mathcal{O}_{0,\infty}||_2}{||\mathcal{T}_{0,\infty}||_2}$
$\beta$: known upper bound on the $\mathcal{H}_{\infty}$-norm of the LTI system
$\mathcal{D}(T) = \{d\;|\;T\geq cm^2 d\log^2(d)\log^2(m^2/\delta) + cd\log^3(2d)\}$
$\sigma_A = \sum_{l=1}^{d}||CA^lB||_2$, $\sigma_B = \sum_{l=1}^{d}||CA^l||_2$
$\sigma_C = \sqrt{\sigma\left(\sum_{k=1}^{d}\mathcal{T}_{d+k,T}^{\top}\mathcal{T}_{d+k,T}\right)}$, $\sigma_D = \sqrt{\sigma\left(\sum_{k=1}^{d}\mathcal{T}\mathcal{O}_{d+k,T}^{\top}\mathcal{T}\mathcal{O}_{d+k,T}\right)}$
$\alpha(l) = \sqrt{l}\left(\sqrt{\frac{lp + \log(T/\delta) + m}{T}}\right)$
Table 1: Summary of constants

5 Algorithmic Details

We will now represent the input–output relationship in terms of the Hankel and Toeplitz matrices defined before. Fix a $d$; then for any $l$ we have

$$\begin{bmatrix}Y_l\\ Y_{l+1}\\ \vdots\\ Y_{l+d-1}\end{bmatrix} = \mathcal{H}_{0,d,d}\begin{bmatrix}U_{l-1}\\ U_{l-2}\\ \vdots\\ U_{l-d}\end{bmatrix} + \mathcal{T}_{0,d}\begin{bmatrix}U_l\\ U_{l+1}\\ \vdots\\ U_{l+d-1}\end{bmatrix} + \mathcal{O}_{0,d,d}\begin{bmatrix}\eta_{l-1}\\ \eta_{l-2}\\ \vdots\\ \eta_{l-d+1}\end{bmatrix} + \mathcal{T}\mathcal{O}_{0,d}\begin{bmatrix}\eta_l\\ \eta_{l+1}\\ \vdots\\ \eta_{l+d-1}\end{bmatrix} + \mathcal{H}_{d,d,l-d-1}\begin{bmatrix}U_{l-d-1}\\ U_{l-d-2}\\ \vdots\\ U_1\end{bmatrix} + \mathcal{O}_{d,d,l-d-1}\begin{bmatrix}\eta_{l-d-1}\\ \eta_{l-d-2}\\ \vdots\\ \eta_1\end{bmatrix} + \begin{bmatrix}w_l\\ w_{l+1}\\ \vdots\\ w_{l+d-1}\end{bmatrix} \qquad (8)$$

or, succinctly,

$$\tilde{Y}^{+}_{l,d} = \mathcal{H}_{0,d,d}\tilde{U}^{-}_{l-1,d} + \mathcal{T}_{0,d}\tilde{U}^{+}_{l,d} + \mathcal{H}_{d,d,l-d-1}\tilde{U}^{-}_{l-d-1,l-d-1} + \mathcal{O}_{0,d,d}\tilde{\eta}^{-}_{l-1,d} + \mathcal{T}\mathcal{O}_{0,d}\tilde{\eta}^{+}_{l,d} + \mathcal{O}_{d,d,l-d-1}\tilde{\eta}^{-}_{l-d-1,l-d-1} + \tilde{w}^{+}_{l,d} \qquad (9)$$

Here

$$\mathcal{O}_{k,p,q} = \begin{bmatrix}CA^{k}&CA^{k+1}&\ldots&CA^{q+k-1}\\ CA^{k+1}&CA^{k+2}&\ldots&CA^{q+k}\\ \vdots&\vdots&\ddots&\vdots\\ CA^{p+k-1}&\ldots&\ldots&CA^{p+q+k-2}\end{bmatrix},\quad \tilde{Y}^{-}_{l,d} = \begin{bmatrix}Y_l\\ Y_{l-1}\\ \vdots\\ Y_{l-d+1}\end{bmatrix},\quad \tilde{Y}^{+}_{l,d} = \begin{bmatrix}Y_l\\ Y_{l+1}\\ \vdots\\ Y_{l+d-1}\end{bmatrix}.$$

Furthermore, $\tilde{U}^{-}_{l,d}, \tilde{\eta}^{-}_{l,d}$ are defined similarly to $\tilde{Y}^{-}_{l,d}$, and $\tilde{U}^{+}_{l,d}, \tilde{\eta}^{+}_{l,d}, \tilde{w}^{+}_{l,d}$ are defined similarly to $\tilde{Y}^{+}_{l,d}$. The $+$ and $-$ signs indicate moving forward and backward in time respectively. This representation will be at the center of our analysis.

There are three key steps in our algorithm which we describe in the following sections:

  • (a)

    Hankel submatrix estimation: Estimating $\mathcal{H}_{0,l,l}$ for every $1\leq l\leq T$. We refer to the estimators as $\{\hat{\mathcal{H}}_{0,l,l}\}_{l=1}^{T}$.

  • (b)

    Model Selection: From the estimators $\{\hat{\mathcal{H}}_{0,l,l}\}_{l=1}^{T}$, select $\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}$ in a data dependent way such that it "best" estimates $\mathcal{H}_{0,\infty,\infty}$.

  • (c)

    Parameter Recovery: For every $k\leq\hat{d}$, we do a singular value decomposition of $\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}$ to obtain parameter estimates for a "good" $k$-order approximation of the true model.

5.1 Hankel Submatrix Estimation

The goal of our system identification procedure is to estimate either $\mathcal{H}_{0,n,n}$ or $\mathcal{H}_{0,\infty,\infty}$. Since we only have finite data and no a priori knowledge of $n$, it is not possible to directly estimate these unknown matrices. The first step then is to estimate all possible Hankel submatrices that are "allowed" by the data, i.e., $\mathcal{H}_{0,d,d}$ for $d\leq T$. For a fixed $d$, Algorithm 1 estimates the $d\times d$ principal submatrix $\mathcal{H}_{0,d,d}$.

Algorithm 1 LearnSystem($T, d, m, p$)

Input: $T$ = horizon for learning, $d$ = Hankel size, $m$ = input dimension, $p$ = output dimension
Output: $\hat{\mathcal{H}}_{0,d,d}$

1: Generate $2T$ i.i.d. inputs $\{U_j\sim\mathcal{N}(0, I_{m\times m})\}_{j=1}^{2T}$.
2: Collect $2T$ input–output pairs $\{U_j, Y_j\}_{j=1}^{2T}$.
3: $\hat{\mathcal{H}}_{0,d,d} = \arg\min_{\mathcal{H}}\sum_{l=0}^{T-1}||\tilde{Y}^{+}_{l+d+1,d} - \mathcal{H}\tilde{U}^{-}_{l+d,d}||_2^2$
4: return $\hat{\mathcal{H}}_{0,d,d}$
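A minimal numpy sketch of this regression step (a sketch, not the authors' code; the arrays `U`, `Y` are assumed to hold the $2T$ collected input–output pairs, with the 0-indexed `U[t-1]` playing the role of $U_t$ in the paper) is:

```python
import numpy as np

def learn_hankel(U, Y, T, d):
    """OLS estimate of H_{0,d,d} from Algorithm 1 / Eq. (10).

    U: (2T, m) array of inputs, Y: (2T, p) array of outputs."""
    m, p = U.shape[1], Y.shape[1]
    Yplus = np.zeros((T, p * d))    # row l holds Ytilde^+_{l+d+1, d}
    Uminus = np.zeros((T, m * d))   # row l holds Utilde^-_{l+d, d}
    for l in range(T):
        # stacked future outputs Y_{l+d+1}, ..., Y_{l+2d}
        Yplus[l] = Y[l + d: l + 2 * d].reshape(-1)
        # stacked past inputs U_{l+d}, U_{l+d-1}, ..., U_{l+1}
        Uminus[l] = U[l + d - 1::-1][:d].reshape(-1)
    # The Hankel estimate solves  Yplus ~ Uminus @ H^T  in least squares
    Ht, *_ = np.linalg.lstsq(Uminus, Yplus, rcond=None)
    return Ht.T                     # (p*d) x (m*d) estimate of H_{0,d,d}
```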

It can be shown that

$$\hat{\mathcal{H}}_{0,d,d} = \Big(\sum_{l=0}^{T-1}\tilde{Y}^{+}_{l+d+1,d}(\tilde{U}^{-}_{l+d,d})^{\top}\Big)\Big(\sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}(\tilde{U}^{-}_{l+d,d})^{\top}\Big)^{+} \qquad (10)$$

and by running the algorithm $T$ times, we obtain $\{\hat{\mathcal{H}}_{0,d,d}\}_{d=1}^{T}$. A key step in showing that $\hat{\mathcal{H}}_{0,d,d}$ is a good estimator for $\mathcal{H}_{0,d,d}$ is to prove the finite time isometry of $V_T = \sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}(\tilde{U}^{-}_{l+d,d})^{\top}$, i.e., the sample covariance matrix.

Lemma 1.

Define

$$T_0(\delta, d) = cm^2 d\log^2(d)\log^2(m^2/\delta) + cd\log^3(2d)$$

where $c$ is some universal constant. Define the sample covariance matrix $V_T \coloneqq \sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}(\tilde{U}^{-}_{l+d,d})^{\top}$. Then, with probability at least $1-\delta$, for $T > T_0(\delta, d)$ we have

$$\frac{1}{2}TI \preceq V_T \preceq \frac{3}{2}TI \qquad (11)$$

Lemma 1 allows us to write Eq. (10) as $\hat{\mathcal{H}}_{0,d,d} = \Big(\sum_{l=0}^{T-1}\tilde{Y}^{+}_{l+d+1,d}(\tilde{U}^{-}_{l+d,d})^{\top}\Big)\Big(\sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}(\tilde{U}^{-}_{l+d,d})^{\top}\Big)^{-1}$ with high probability and to upper bound the estimation error for the $d\times d$ principal submatrix.

Theorem 2.

Fix $d$ and let $\hat{\mathcal{H}}_{0,d,d}$ be the output of Algorithm 1. Then for any $0 < \delta < 1$ and $T\geq T_0(\delta, d)$, we have with probability at least $1-\delta$

$$\Big|\Big|\hat{\mathcal{H}}_{0,d,d} - \mathcal{H}_{0,d,d}\Big|\Big|_2 \leq 4\sigma\sqrt{\frac{1}{T}}\sqrt{pd + \log\frac{1}{\delta} + m}.$$

Here $T_0(\delta, d) = cm^2 d\log^2(d)\log^2(m^2/\delta) + cd\log^3(2d)$, $c$ is a universal constant and $\sigma = \max(\sigma_A, \sigma_B, \sigma_C, \sigma_D)$ from Table 1.

Proof 5.1.

We outline the proof here. Recall Eqs. (8) and (9). Then for a fixed $d$,

$$\hat{\mathcal{H}}_{0,d,d} = \Big(\sum_{l=0}^{T-1}\tilde{Y}^{+}_{l+d+1,d}(\tilde{U}^{-}_{l+d,d})^{\top}\Big)V_T^{+}.$$

Then the identification error is

$$\Big|\Big|\hat{\mathcal{H}}_{0,d,d} - \mathcal{H}_{0,d,d}\Big|\Big|_2 = \Big|\Big|V_T^{+}\Big(\sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}\tilde{U}^{+\top}_{l+d+1,d}\mathcal{T}_{0,d}^{\top} + \tilde{U}^{-}_{l+d,d}\tilde{U}^{-\top}_{l,l}\mathcal{H}_{d,d,l}^{\top} + \tilde{U}^{-}_{l+d,d}\tilde{w}^{+\top}_{l+d+1,d} + \tilde{U}^{-}_{l+d,d}\tilde{\eta}^{-\top}_{l+d,d}\mathcal{O}_{0,d,d}^{\top} + \tilde{U}^{-}_{l+d,d}\tilde{\eta}^{+\top}_{l+d+1,d}\mathcal{T}\mathcal{O}^{\top}_{0,d} + \tilde{U}^{-}_{l+d,d}\tilde{\eta}^{-\top}_{l,l}\mathcal{O}_{d,d,l}^{\top}\Big)\Big|\Big|_2 = ||V_T^{+}E||_2 \qquad (12)$$

with

$$E = \sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}\tilde{U}^{+\top}_{l+d+1,d}\mathcal{T}_{0,d}^{\top} + \tilde{U}^{-}_{l+d,d}\tilde{U}^{-\top}_{l,l}\mathcal{H}_{d,d,l}^{\top} + \tilde{U}^{-}_{l+d,d}\tilde{w}^{+\top}_{l+d+1,d} + \tilde{U}^{-}_{l+d,d}\tilde{\eta}^{-\top}_{l+d,d}\mathcal{O}_{0,d,d}^{\top} + \tilde{U}^{-}_{l+d,d}\tilde{\eta}^{+\top}_{l+d+1,d}\mathcal{T}\mathcal{O}^{\top}_{0,d} + \tilde{U}^{-}_{l+d,d}\tilde{\eta}^{-\top}_{l,l}\mathcal{O}_{d,d,l}^{\top}.$$

By Lemma 1, whenever $T\geq T_0(\delta, d)$, we have with probability at least $1-\delta$

$$\frac{TI}{2} \preceq V_T \preceq \frac{3TI}{2}. \qquad (13)$$

This ensures, with high probability, that $V_T^{-1}$ exists and decays as $O(T^{-1})$. The next step involves showing that $||E||_2$ grows at most as $\sqrt{T}$ with high probability. This is reminiscent of Theorem 3 and the theory of self-normalized martingales. However, unlike that case, the conditional sub-Gaussianity requirements do not hold here. For example, let $\bm{\mathcal{F}}_l = \sigma(\eta_1, \ldots, \eta_l)$; then $\mathbb{E}[v^{\top}\tilde{\eta}^{-}_{l+1,l+1}|\bm{\mathcal{F}}_l] \neq 0$ for all $v$ since $\{\tilde{\eta}^{-}_{l+1,l+1}\}_{l=0}^{T-1}$ is not an independent sequence. As a result it is not immediately obvious how to apply Theorem 3 to our case. Under the event where Eq. (13) holds (which happens with high probability), a careful analysis of the normalized cross terms, i.e., $V_T^{-1/2}E$, shows that $||V_T^{-1/2}E||_2 = O(1)$ with high probability. This is summarized in Propositions 1-3. The idea is to decompose $E$ into a linear combination of independent subgaussians and reduce it to a form where we can apply Theorem 3. This comes at the cost of additional scaling in the form of system dependent constants, such as the $\mathcal{H}_{\infty}$-norm. Then we can conclude with high probability that $||\hat{\mathcal{H}} - \mathcal{H}_{0,d,d}||_2 \leq ||V_T^{-1/2}||_2||V_T^{-1/2}E||_2 \leq T^{-1/2}O(1)$. The full proof has been deferred to Section 11.1 in Appendix 11.

Remark 5.2.

Recall $\mathcal{D}(T)$ from Table 1. Since

$$d\in\mathcal{D}(T) \implies T\geq T_0(\delta, d),$$

we can restate Theorem 2 as follows: for a fixed $T$, we have with probability at least $1-\delta$ that

$$\Big|\Big|\hat{\mathcal{H}}_{0,d,d} - \mathcal{H}_{0,d,d}\Big|\Big|_2 \leq 4\sigma\sqrt{\frac{1}{T}}\sqrt{pd + \log\frac{1}{\delta} + m}$$

whenever $d\in\mathcal{D}(T)$.

We next present bounds on the $\sigma$ appearing in Theorem 2. From the perspective of model selection in later sections, we require that $\sigma$ be known. In the next proposition we present two bounds on $\sigma$: the first depends on unknown parameters and recovers the precise dependence on $d$; the second is an a priori known upper bound and incurs an additional factor of $\sqrt{d}$.

Proposition 3.

$\sigma$ upper bound independent of $d$:

$$\sigma \leq \frac{c_n}{(1-\rho(A))^2}$$

where $c_n$ depends only on $n$.

$\sigma$ upper bound dependent on $d$:

$$\sigma \leq \beta R\sqrt{d},$$

where $R$ is the noise-to-signal ratio as in Table 1.

Proof 5.3.

By Gelfand's formula, $||A^d||_2 \leq c(n)\rho_{\max}(A)^d$, where $\rho_{\max}(A) < 1$ and $c(n)$ is a constant that depends only on $n$. This implies that

$$\sigma_A = \sum_{l=0}^{d}||CA^lB||_2 \leq \sum_{l=0}^{\infty}||CA^lB||_2 \leq \sum_{l=0}^{\infty}c(n)\rho(A)^l = \frac{c(n)}{1-\rho(A)},$$

and

$$||\mathcal{T}_{d+k,T}||_2 \leq \sum_{l=0}^{T-1}||CA^{d+k+l}B||_2 \leq \frac{c(n)\rho(A)^{d+k}}{1-\rho(A)}.$$

Then

$$\sigma_C = \sqrt{\sigma\left(\sum_{k=1}^{d}\mathcal{T}_{d+k,T}^{\top}\mathcal{T}_{d+k,T}\right)} \leq \frac{c(n)\rho(A)^{d}}{(1-\rho(A))^2} \leq \frac{c(n)}{(1-\rho(A))^2}.$$

Similarly, there exists a finite upper bound on $\sigma_B, \sigma_D$, obtained by replacing $CA^lB$ and $\mathcal{T}_{d+k,T}$ with $CA^l$ and $\mathcal{T}\mathcal{O}_{d+k,T}$ respectively. For the $d$-dependent upper bound, we have

$$\sigma_A = \sum_{l=0}^{d}||CA^lB||_2 \leq \sqrt{d}\sqrt{\sum_{l=0}^{d}||CA^lB||_2^2} \leq \sqrt{d}||M||_H \leq \sqrt{d}\beta.$$

Since $\sigma\left(\mathcal{T}_{d+k,T}^{\top}\mathcal{T}_{d+k,T}\right) \leq \beta$, we also have

$$\sigma_C = \sqrt{\sigma\left(\sum_{k=1}^{d}\mathcal{T}_{d+k,T}^{\top}\mathcal{T}_{d+k,T}\right)} \leq \beta\sqrt{d}.$$

For $\sigma_B, \sigma_D$ we get an extra factor of $R$ because $||\mathcal{T}\mathcal{O}_{0,\infty}||_2 \leq \beta R$.

The key feature of the $d$-dependent upper bound is that it only depends on $\beta$ and $R$, which are known a priori.

Recall that $G_d = [CB, CAB, \ldots, CA^{d-1}B]$, i.e., the $d$-order FIR truncation of $G(z)$. Since the first $p$ rows of the $\mathcal{H}_{0,d,d}$ matrix correspond to $G_d$, we can obtain estimators for any $d$-order FIR approximation.

Corollary 4.

Let $\widehat{G}_d = \hat{\mathcal{H}}_{0,d,d}[1:p,:]$ denote the first $p$ rows of $\hat{\mathcal{H}}_{0,d,d}$. Then for any $0 < \delta < 1$ and $T\geq T_0(\delta, d)$, we have with probability at least $1-\delta$,

$$||\widehat{G}_d - G_d||_2 \leq 4\sigma\sqrt{\frac{1}{T}}\sqrt{pd + \log\frac{1}{\delta} + m}.$$

Proof 5.4.

The proof follows from $G_d = \mathcal{H}_{0,d,d}[1:p,:]$ and Theorem 2.

Next, we show that the error in Theorem 2 is minimax optimal (up to logarithmic factors) and cannot be improved by any estimation method.

Proposition 5.

Let $\sqrt{T}\geq c$ where $c$ is an absolute constant. Then for any estimator $\hat{\mathcal{H}}$ of $\mathcal{H}_{0,\infty,\infty}$ we have

$$\sup_{\hat{\mathcal{H}}}\mathbb{E}[||\hat{\mathcal{H}} - \mathcal{H}_{0,\infty,\infty}||_2] \geq c_n\cdot\sqrt{\frac{\log T}{T}}$$

where $c_n > 0$ is a constant that is independent of $T$ but can depend on system level parameters.

Proof 5.5.

Assume, to the contrary, that

$$\sup_{\hat{\mathcal{H}}}\mathbb{E}[||\hat{\mathcal{H}} - \mathcal{H}_{0,\infty,\infty}||_2] = o\left(\sqrt{\frac{\log T}{T}}\right).$$

Recall that $[\mathcal{H}_{0,\infty,\infty}]_{1:p,:} = [CB, CAB, \ldots]$ and $G(z) = z^{-1}CB + z^{-2}CAB + \ldots$. Similarly we have $\hat{G}(z)$. Define

$$||G - \hat{G}||_2 = \sqrt{\sum_{k=0}^{\infty}||CA^kB - \hat{C}\hat{A}^k\hat{B}||_2^2}.$$

If $\sup_{\hat{\mathcal{H}}}\mathbb{E}[||\hat{\mathcal{H}} - \mathcal{H}_{0,\infty,\infty}||_2] = o\Big(\sqrt{\frac{\log T}{T}}\Big)$, then since $||\hat{\mathcal{H}} - \mathcal{H}_{0,\infty,\infty}||_2 \geq ||G - \hat{G}||_2$ we can conclude that

$$\mathbb{E}[||G - \hat{G}||_2] = o\left(\sqrt{\frac{\log T}{T}}\right),$$

which contradicts Theorem 5 in Goldenshluger (1998). Thus, $\sup_{\hat{\mathcal{H}}}\mathbb{E}[||\hat{\mathcal{H}} - \mathcal{H}_{0,\infty,\infty}||_2] \geq c_n\cdot\sqrt{\frac{\log T}{T}}$.

5.2 Model Selection

At a high level, we want to choose $\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}$ from $\{\hat{\mathcal{H}}_{0,d,d}\}_{d=1}^{T}$ such that $\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}$ is a good estimator of $\mathcal{H}_{0,\infty,\infty}$. Our idea of model selection is motivated by Goldenshluger (1998). For any $\hat{\mathcal{H}}_{0,d,d}$, the error from $\mathcal{H}_{0,\infty,\infty}$ can be decomposed as

$$||\hat{\mathcal{H}}_{0,d,d} - \mathcal{H}_{0,\infty,\infty}||_2 \leq \underbrace{||\hat{\mathcal{H}}_{0,d,d} - \mathcal{H}_{0,d,d}||_2}_{\text{Estimation Error}} + \underbrace{||\mathcal{H}_{0,d,d} - \mathcal{H}_{0,\infty,\infty}||_2}_{\text{Truncation Error}}.$$

We would like to select $d = \hat{d}$ such that it balances the truncation and estimation errors in the following way:

$$c_2\cdot\text{Data dependent upper bound} \geq c_1\cdot\text{Estimation Error} \geq \text{Truncation Error}$$

where the $c_i$ are absolute constants. Such a balancing ensures that

$$||\hat{\mathcal{H}}_{0,\hat{d},\hat{d}} - \mathcal{H}_{0,\infty,\infty}||_2 \leq c_2\cdot(1/c_1 + 1)\cdot\text{Data dependent upper bound}. \qquad (14)$$

Note that such a balancing is possible because the estimation error increases as $d$ grows while the truncation error decreases with $d$. Furthermore, a data dependent upper bound on the estimation error can be obtained from Theorem 2. Unfortunately, $(C, A, B)$ are unknown and it is not immediately clear how to obtain such a bound on the truncation error.

To achieve this, we first define a truncation error proxy, i.e., a measure of how much we truncate if a specific $\hat{\mathcal{H}}_{0,d,d}$ is used. For a given $d$, we look at $||\hat{\mathcal{H}}_{0,d,d} - \hat{\mathcal{H}}_{0,l,l}||_2$ for $l\in\mathcal{D}(T)$ with $l\geq d$. This measures the additional error incurred if we choose $\hat{\mathcal{H}}_{0,d,d}$ as an estimator for $\mathcal{H}_{0,\infty,\infty}$ instead of $\hat{\mathcal{H}}_{0,l,l}$ for $l > d$. We then pick $\hat{d}$ as follows:

$$\hat{d} \coloneqq \inf\Bigg\{d\;\Bigg|\;||\hat{\mathcal{H}}_{0,d,d} - \hat{\mathcal{H}}_{0,l,l}||_2 \leq 16\beta R\cdot\alpha(l)\quad\forall\,l\in\mathcal{D}(T),\,l\geq d\Bigg\}. \qquad (15)$$

Recall that $\alpha(l) = \sqrt{\frac{pl^2 + l\log(T/\delta) + ml}{T}}$, where $\sqrt{\frac{pl + \log(T/\delta) + m}{T}}$ captures the estimation error incurred in learning the $l\times l$ Hankel submatrix; the extra factor $\beta\sqrt{l}$ is incurred because we need a data dependent, albeit coarse, upper bound on the estimation error.

A key step will be to show that, for any $l\geq d$, whenever

$$||\hat{\mathcal{H}}_{0,d,d} - \hat{\mathcal{H}}_{0,l,l}||_2 \leq c\beta R\cdot\alpha(l)$$

holds, it is ensured that

$$||\hat{\mathcal{H}}_{0,d,d} - \mathcal{H}_{0,\infty,\infty}||_2 \leq c\beta R\cdot\alpha(l)\quad\text{and}\quad||\hat{\mathcal{H}}_{0,l,l} - \mathcal{H}_{0,\infty,\infty}||_2 \leq c\beta R\cdot\alpha(l),$$

so there is no gain in choosing a larger Hankel submatrix estimate. By picking the smallest $d$ for which such a property holds for all larger Hankel submatrices, we ensure that a regularized model is estimated that "agrees" with the data.

Algorithm 2 Choice of $d$

Output: $\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}$, $\hat{d}$

1: $\mathcal{D}(T) = \Big\{d\;\Big|\;d\leq\frac{T}{cm^2\log^3(Tm/\delta)}\Big\}$, $\alpha(h) = \sqrt{h}\Big(\sqrt{\frac{m + hp + \log(T/\delta)}{T}}\Big)$.
2: $d_0(T, \delta) = \inf\Big\{l\;\Big|\;||\hat{\mathcal{H}}_{0,l,l} - \hat{\mathcal{H}}_{0,h,h}||_2 \leq 16\beta R(\alpha(h) + 2\alpha(l))\;\;\forall h\in\mathcal{D}(T), h\geq l\Big\}$.
3: $\hat{d} = \max\left(d_0(T, \delta), \log\left(\frac{T}{\delta}\right)\right)$
4: return $\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}$, $\hat{d}$
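A schematic implementation of this selection rule (a sketch; it reuses the hypothetical `learn_hankel` helper sketched after Algorithm 1, and `beta`, `R`, `delta` are the known constants of Table 1) could look like:

```python
import numpy as np

def choose_d(U, Y, T, beta, R, delta, D):
    """Data-dependent choice of d in the spirit of Algorithm 2. D is the candidate set D(T)."""
    m, p = U.shape[1], Y.shape[1]
    alpha = lambda h: np.sqrt(h) * np.sqrt((m + h * p + np.log(T / delta)) / T)
    H = {d: learn_hankel(U, Y, T, d) for d in D}      # all candidate Hankel estimates

    def pad(A, rows, cols):
        """Embed A in the top-left corner of a rows x cols zero matrix (Section 2 convention)."""
        Z = np.zeros((rows, cols))
        Z[:A.shape[0], :A.shape[1]] = A
        return Z

    d0 = max(D)
    for l in sorted(D):
        ok = True
        for h in sorted(D):
            if h < l:
                continue
            gap = np.linalg.norm(pad(H[l], p * h, m * h) - H[h], 2)
            if gap > 16 * beta * R * (alpha(h) + 2 * alpha(l)):
                ok = False
                break
        if ok:
            d0 = l
            break
    d_hat = max(d0, int(np.ceil(np.log(T / delta))))
    return d_hat, H.get(d_hat, learn_hankel(U, Y, T, d_hat))
```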

We now state the main estimation result for $\mathcal{H}_{0,\infty,\infty}$ with $d = \hat{d}$ as chosen in Algorithm 2. Define

$$T_*(\delta) = \inf\Big\{T\;\Big|\;d_*(T, \delta)\in\mathcal{D}(T),\;d_*(T, \delta)\leq 2d_*\left(\frac{T}{256}, \delta\right)\Big\} \qquad (16)$$

where

$$d_*(T, \delta) = \inf\Bigg\{d\;\Bigg|\;16\beta R\alpha(d) \geq ||\mathcal{H}_{0,d,d} - \mathcal{H}_{0,\infty,\infty}||_2\Bigg\}. \qquad (17)$$

A close look at Eq. (17) reveals that picking $d = d_*(T, \delta)$ ensures the balancing of Eq. (14). However, $d_*(T, \delta)$ depends on unknown quantities and cannot be computed. In that case, $\hat{d}$ in Eq. (15) becomes a proxy for $d_*(T, \delta)$. From an algorithmic standpoint, we no longer need any unknown information; the unknown parameters appear only in $T_*(\delta)$, which is required only to state the theoretical guarantee of Theorem 6 below.

Theorem 6.

Whenever $T\geq T_*(\delta)$, we have with probability at least $1-\delta$ that

$$||\hat{\mathcal{H}}_{0,\hat{d},\hat{d}} - \mathcal{H}_{0,\infty,\infty}||_2 \leq 12c\beta R\left(\sqrt{\frac{m\hat{d} + p\hat{d}^2 + \hat{d}\log\frac{T}{\delta}}{T}}\right).$$

The proof of Theorem 6 can be found as Proposition 9 in Appendix 13. We see that the error between $\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}$ and $\mathcal{H}_{0,\infty,\infty}$ can be upper bounded by a purely data dependent quantity. The next proposition shows that $\hat{d}$ does not grow more than logarithmically in $T$.

Proposition 7.

Let $T\geq T_*(\delta)$ and let $d_*(T, \delta)$ be as in Eq. (17). Then with probability at least $1-\delta$ we have

$$\hat{d} \leq d_*(T, \delta)\vee\log\Big(\frac{T}{\delta}\Big).$$

Furthermore,

$$d_*(T, \delta) \leq \frac{c\log(cT + \log\frac{1}{\delta}) - \log R + \log\beta}{\log\frac{1}{\rho(A)}}.$$

The effect of unknown quantities, such as the spectral radius, is subsumed in the finite time condition $T\geq T_*(\delta)$ and appears in an upper bound for $\hat{d}$; however, this information is not needed from an algorithmic perspective, as the selection of $\hat{d}$ is agnostic to the knowledge of $\rho(A)$. The proof of this proposition can be found as Propositions 7 and 4.

5.3 Parameter Recovery

Next we discuss finding the system parameters. To obtain the system parameters we use a balanced truncation algorithm on $\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}$, where $\hat{d}$ is the output of Algorithm 2. The details are summarized in Algorithm 3, where $\mathcal{H} = \hat{\mathcal{H}}_{0,\hat{d},\hat{d}}$.

Algorithm 3 Hankel2Sys($T, \hat{d}, k, m, p$)

Input: $T$ = horizon for learning, $\hat{d}$ = Hankel size, $m$ = input dimension, $p$ = output dimension
Output: System parameters $(\hat{C}_{\hat{d}}, \hat{A}_{\hat{d}}, \hat{B}_{\hat{d}})$

1: $\mathcal{H} = \hat{\mathcal{H}}_{0,\hat{d},\hat{d}}$
2: Pad $\mathcal{H}$ with zeros to make it of dimension $4p\hat{d}\times 4m\hat{d}$
3: $U, \Sigma, V \leftarrow$ SVD of $\mathcal{H}$
4: $U_{\hat{d}}, V_{\hat{d}} \leftarrow$ top $\hat{d}$ singular vectors
5: $\hat{C}_{\hat{d}} \leftarrow$ first $p$ rows of $U_{\hat{d}}\Sigma_{\hat{d}}^{1/2}$
6: $\hat{B}_{\hat{d}} \leftarrow$ first $m$ columns of $\Sigma_{\hat{d}}^{1/2}V^{\top}_{\hat{d}}$
7: $Z_0 = [U_{\hat{d}}\Sigma_{\hat{d}}^{1/2}]_{1:4p\hat{d}-p,:}$, $Z_1 = [U_{\hat{d}}\Sigma_{\hat{d}}^{1/2}]_{p+1:,:}$
8: $\hat{A}_{\hat{d}} \leftarrow (Z_0^{\top}Z_0)^{-1}Z_0^{\top}Z_1$
9: return $(\hat{C}_{\hat{d}}, \hat{A}_{\hat{d}}, \hat{B}_{\hat{d}})$
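A compact numpy rendering of Algorithm 3 (a sketch under the same conventions as the earlier snippets; `H_hat` is the $p\hat{d}\times m\hat{d}$ estimate returned by the model selection step) might be:

```python
import numpy as np

def hankel2sys(H_hat, d, m, p):
    """Balanced-truncation style parameter recovery in the spirit of Algorithm 3."""
    # Step 2: zero-pad to 4pd x 4md so the shifted block in step 7 is well defined
    H = np.zeros((4 * p * d, 4 * m * d))
    H[:H_hat.shape[0], :H_hat.shape[1]] = H_hat

    # Steps 3-4: SVD and truncation to the top d singular directions
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    U, s, Vt = U[:, :d], s[:d], Vt[:d, :]

    O = U * np.sqrt(s)                 # U_d Sigma_d^{1/2}  (observability-like factor)
    Ctr = np.sqrt(s)[:, None] * Vt     # Sigma_d^{1/2} V_d^T (controllability-like factor)

    C_hat = O[:p, :]                   # step 5: first p rows
    B_hat = Ctr[:, :m]                 # step 6: first m columns

    # Steps 7-8: A_hat solves the least-squares shift equation Z0 A = Z1
    Z0, Z1 = O[:-p, :], O[p:, :]
    A_hat, *_ = np.linalg.lstsq(Z0, Z1, rcond=None)
    return C_hat, A_hat, B_hat
```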

To state the main result we define a quantity that measures the singular value weighted subspace gap of a matrix $S$:

$$\Gamma(S, \epsilon) = \sqrt{\sigma^{1}_{\max}/\zeta_1^2 + \sigma^{2}_{\max}/\zeta_2^2 + \ldots + \sigma^{l}_{\max}/\zeta_l^2},$$

where $S = U\Sigma V^{\top}$ and $\Sigma$ is arranged into blocks of singular values such that in each block $i$ we have $\sup_j\sigma^{i}_{j} - \sigma^{i}_{j+1} \leq \epsilon$, i.e.,

$$\Sigma = \begin{bmatrix}\Lambda_1&0&\ldots&0\\ 0&\Lambda_2&\ldots&0\\ \vdots&\vdots&\ddots&0\\ 0&0&\ldots&\Lambda_l\end{bmatrix}$$

where the $\Lambda_i$ are diagonal matrices, $\sigma^{i}_{j}$ is the $j^{th}$ singular value in block $\Lambda_i$, and $\sigma^{i}_{\min}, \sigma^{i}_{\max}$ are the minimum and maximum singular values of block $i$ respectively. Furthermore,

$$\zeta_i = \min(\sigma^{i-1}_{\min} - \sigma^{i}_{\max},\;\sigma^{i}_{\min} - \sigma^{i+1}_{\max})$$

for $1 < i < l$, with $\zeta_1 = \sigma^{1}_{\min} - \sigma^{2}_{\max}$ and $\zeta_l = \min(\sigma^{l-1}_{\min} - \sigma^{l}_{\max},\;\sigma^{l}_{\min})$. Informally, the $\zeta_i$ measure the singular value gaps between blocks. It should be noted that $l$, the number of separated blocks, is a function of $\epsilon$ itself. For example, if $\epsilon = 0$ then the number of blocks corresponds to the number of distinct singular values; on the other hand, if $\epsilon$ is very large then $l = 1$.
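The following small routine (a sketch; the block boundaries are formed greedily by scanning consecutive singular value gaps, which is one natural reading of the definition above) computes $\Gamma(S, \epsilon)$ from the singular values of $S$:

```python
import numpy as np

def gamma(S, eps):
    """Singular-value-weighted subspace gap Gamma(S, eps)."""
    s = np.linalg.svd(S, compute_uv=False)      # singular values in descending order
    # Group consecutive singular values whose successive gaps are <= eps
    blocks, cur = [], [s[0]]
    for a, b in zip(s[:-1], s[1:]):
        if a - b <= eps:
            cur.append(b)
        else:
            blocks.append(cur)
            cur = [b]
    blocks.append(cur)

    smax = [blk[0] for blk in blocks]            # max singular value per block
    smin = [blk[-1] for blk in blocks]           # min singular value per block
    L, total = len(blocks), 0.0
    for i in range(L):
        gaps = []
        if i > 0:
            gaps.append(smin[i - 1] - smax[i])   # gap to the previous block
        if i < L - 1:
            gaps.append(smin[i] - smax[i + 1])   # gap to the next block
        else:
            gaps.append(smin[i])                 # last block: zeta_l includes sigma_min^l
        total += smax[i] / min(gaps) ** 2
    return np.sqrt(total)
```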

Theorem 8.

Let $M$ be the true unknown model and let

$$\epsilon = 12c\beta R\left(\sqrt{\frac{m\hat{d} + p\hat{d}^2 + \hat{d}\log\frac{T}{\delta}}{T}}\right).$$

Then whenever $T\geq T_*(\delta)$, we have with probability at least $1-\delta$:

$$\max\left(||C_{\hat{d}} - \hat{C}_{\hat{d}}||_2,\;||B_{\hat{d}} - \hat{B}_{\hat{d}}||_2,\;||A_{\hat{d}} - \hat{A}_{\hat{d}}||_2\right) \leq \bar{\gamma}\epsilon\Gamma(\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}, 2\epsilon) + \bar{\gamma}\sup_{1\leq i\leq\hat{d}}\left(\sqrt{\hat{\sigma}^{i}_{\max}} - \sqrt{\hat{\sigma}^{i}_{\min}}\right) + \bar{\gamma}\cdot\frac{\epsilon\wedge\sqrt{\hat{\sigma}_{\hat{d}}\epsilon}}{\sqrt{\hat{\sigma}_{\hat{d}}}}$$

where $\sup_{1\leq i\leq\hat{d}}\sqrt{\hat{\sigma}^{i}_{\max}} - \sqrt{\hat{\sigma}^{i}_{\min}} \leq \frac{2}{\sqrt{\hat{\sigma}_{\hat{d}}}}\epsilon\hat{d}\wedge\sqrt{2\hat{d}\epsilon}$ and $\bar{\gamma} = \max(4\gamma, 8)$.

The proof of Theorem 8 follows directly from Theorem 9, where we show that

$$||\hat{\mathcal{H}}_{0,\hat{d},\hat{d}} - \mathcal{H}_{0,\infty,\infty}||_2 \leq \epsilon,$$

and Proposition 2. Theorem 8 provides an error bound between parameters (of model order $\hat{d}$) when the true order is unknown. The subspace gap measure, $\Gamma(\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}, 2\epsilon)$, is bounded even when $\epsilon = 0$. To see this, note that when $\epsilon = 0$, $\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}$ corresponds exactly to $\mathcal{H}_{0,\hat{d},\hat{d}}$. In that case, the number of blocks corresponds to the number of distinct singular values of $\mathcal{H}_{0,\hat{d},\hat{d}}$, and $\zeta_{n_i}$ then corresponds to the singular value gap between the unequal singular values. As a result $\Gamma(\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}, 2\epsilon) = \Delta < \infty$. The bound then decays as $\epsilon = O\left(\sqrt{\hat{d}^2/T}\right)$ for singular values $\hat{\sigma}_{\hat{d}} > \hat{d}\epsilon$, but for much smaller singular values the bound decays as $\sqrt{\epsilon} = O\left(\left(d^2/T\right)^{1/4}\right)$.

To shed more light on the behavior of our bounds, we consider the special case of known order. If $n$ is the model order, then we can set $\hat{d} = n$. If $\sigma_i = \sigma_i(\mathcal{H}_{0,\infty,\infty})$, then for large enough $T$ one can ensure that

$$\min_{\sigma_i\neq\sigma_{i+1}}(\sigma_i - \sigma_{i+1})/2 > \epsilon,$$

i.e., $\epsilon$ is less than the singular value gap and small enough that the spectrum of $\hat{\mathcal{H}}_{0,n,n}$ is very close to that of $\mathcal{H}_{0,\infty,\infty}$. Consequently $\hat{\sigma}_n \geq \sigma_n/2$ and we have that

$$\max\left(||C_n - \hat{C}_n||_2,\;||B_n - \hat{B}_n||_2,\;||A_n - \hat{A}_n||_2\right) \leq \bar{\gamma}\epsilon\Delta + \bar{\gamma}\epsilon/\sqrt{\sigma_n} = c\beta\bar{\gamma}R\left(\sqrt{\frac{pn^2 + n\log\frac{T}{\delta}}{\sigma_n T}}\right). \qquad (18)$$

This upper bound is (nearly) identical to the bounds obtained in Oymak and Ozay (2018) for the known order case. The major advantage of our result is that we do not require any information/assumption on the LTI system besides $\beta$. Nonparametric approaches to estimating $\beta$ have been studied in Tu et al. (2017).

5.4 Order Estimation Lower Bound

In Theorem 8 it is shown that whenever $T = \Omega\left(\frac{1}{\sigma_{\hat{d}}^2}\right)$ we can find an accurate $\hat{d}$-order approximation. Now we show that if $T = O\left(\frac{1}{\sigma_{\hat{d}}^2}\right)$ then there is always some non-zero probability with which we cannot recover the singular vector corresponding to $\sigma_{\hat{d}+1}$. We prove the following lower bound for model order estimation when the inputs $\{U_t\}_{t=1}^{T}$ are active and bounded, as defined below.

Definition 5.6.

An input sequence $\{U_t\}_{t=1}^{T}$ is said to be active if $U_t$ is allowed to depend on the past history $\{U_l, Y_l\}_{l=1}^{t-1}$. The input sequence is bounded if $\mathbb{E}[U_t^{\top}U_t] \leq 1$ for all $t$.

Active inputs allow for the case when input selection can be adaptive due to feedback.

Theorem 9.

Fix $\delta > 0$ and $\zeta\in(0, 1/2)$. Let $M_1, M_2$ be two LTI systems and let $\sigma_i^{(1)}, \sigma_i^{(2)}$ be their respective $i^{th}$ Hankel singular values. Let $\frac{\sigma^{(1)}_1}{\sigma^{(1)}_2} \leq \frac{2}{\zeta}$ and $\sigma^{(2)}_2 = 0$. Then whenever $T \leq \frac{\mathcal{C}R^2}{\zeta^2}\log\frac{2}{\delta}$ we have

$$\sup_{M\in\{M_1, M_2\}}\mathbb{P}_{Z_T\sim M}(\text{order}(\hat{M}(Z_T)) \neq \text{order}(M)) \geq \delta$$

Here $Z_T = \{U_t, Y_t\}_{t=1}^{T}\sim M$ means that $M$ generates the $T$ data points $\{Y_t\}_{t=1}^{T}$ in response to active and bounded inputs $\{U_t\}_{t=1}^{T}$, and $\hat{M}(Z_T)$ is any estimator.

Proof 5.7.

The proof can be found in the appendix (Section 15) and involves using Fano's (or Birgé's) inequality to compute the minimax risk between the probability density functions generated by two different LTI systems:

A0\displaystyle A_{0} =[010000ζ00],A1=A0,B0=[00β/R],B1=[0β/Rβ/R],C0=[00βR],C1=C0.\displaystyle=\begin{bmatrix}0&1&0\\ 0&0&0\\ \zeta&0&0\end{bmatrix},A_{1}=A_{0},B_{0}=\begin{bmatrix}0\\ 0\\ \sqrt{\beta}/R\end{bmatrix},B_{1}=\begin{bmatrix}0\\ \sqrt{\beta}/R\\ \sqrt{\beta}/R\end{bmatrix},C_{0}=\begin{bmatrix}0&0&\sqrt{\beta}R\end{bmatrix},C_{1}=C_{0}. (19)

A0,A1A_{0},A_{1} are Schur stable whenever |ζ|<1|\zeta|<1.
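As a sanity check on this construction (our illustration, not part of the paper's algorithms; the values of \beta, R and \zeta below are arbitrary), the following Python sketch builds truncated Hankel matrices from the Markov parameters CA^{k}B of the two systems in Eq. (19) and prints their singular values: the first system has a rank-one Hankel matrix, while the smaller singular values of the second shrink with \zeta.

import numpy as np

def hankel_from_markov(C, A, B, d):
    # Truncated system Hankel matrix with block (i, j) equal to C A^(i+j) B, cf. Eq. (23)
    p, m = C.shape[0], B.shape[1]
    Apow = [np.linalg.matrix_power(A, k) for k in range(2 * d - 1)]
    H = np.zeros((d * p, d * m))
    for i in range(d):
        for j in range(d):
            H[i * p:(i + 1) * p, j * m:(j + 1) * m] = C @ Apow[i + j] @ B
    return H

beta, R, zeta = 1.0, 1.0, 0.1          # illustrative values; the theorem only needs |zeta| < 1
A0 = np.array([[0., 1., 0.], [0., 0., 0.], [zeta, 0., 0.]])
B0 = np.array([[0.], [0.], [np.sqrt(beta) / R]])
B1 = np.array([[0.], [np.sqrt(beta) / R], [np.sqrt(beta) / R]])
C0 = np.array([[0., 0., np.sqrt(beta) * R]])

for name, B in [("(C_0, A_0, B_0)", B0), ("(C_1, A_1, B_1)", B1)]:   # C_1 = C_0, A_1 = A_0
    s = np.linalg.svd(hankel_from_markov(C0, A0, B, d=5), compute_uv=False)
    print(name, "Hankel singular values:", np.round(s, 4))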

Theorem 9 shows that the time required to recover higher order models depends inversely on the condition number. Specifically, to correctly distinguish between an order 1 and an order 2 model we need T\geq\Omega(2/\zeta^{2}), where 2/\zeta upper bounds the condition number of the order-2 model. To compare this with our upper bound in Theorem 8 and Eq. (18), assume \Gamma(\hat{\mathcal{H}}_{0,\hat{d},\hat{d}},2\epsilon)\leq\Delta for all \epsilon\in[0,1] and \hat{d}\epsilon\leq\hat{\sigma}_{\hat{d}}. Then, since the parameter error, \mathcal{E}, is upper bounded as

cβΔR(md^+pd^2+d^log(Tδ)σd^T),\mathcal{E}\leq c\beta\Delta R\left(\sqrt{\frac{m\hat{d}+p\hat{d}^{2}+\hat{d}\log{\frac{T}{\delta}}}{\sigma_{\hat{d}}T}}\right),

we need

Tlog(Tδ)Ω(β2Δ2R2d^2σd^2)\frac{T}{\log{\frac{T}{\delta}}}\geq\Omega\left(\frac{\beta^{2}\Delta^{2}R^{2}\hat{d}^{2}}{\sigma_{\hat{d}}^{2}}\right)

to correctly identify the \hat{d}-order model. The ratio (\beta/\sigma_{\hat{d}}) is equal to the condition number of the Hankel matrix. In this sense, model selection followed by singular value thresholding is not too conservative in terms of R (the signal-to-noise ratio) and the conditioning of the Hankel matrix.

6 Experiments

The experiments in this paper are for the single trajectory case. A detailed analysis for system identification from multiple trajectories can be found in Tu et al. (2017). Suppose that the LTI system generating data, MM, has transfer function given by

G(z)=\alpha_{0}+\sum_{l=1}^{149}\alpha_{l}\rho^{l}z^{-l},\quad\rho<1 (20)

where \alpha_{i}\sim\mathcal{N}(0,1). M is a finite dimensional LTI system of order 150 with parameters M=(C\in\mathbb{R}^{1\times 150},A\in\mathbb{R}^{150\times 150},B\in\mathbb{R}^{150\times 1}). For these illustrations, we assume a balanced system and choose R=1,\delta=0.05. We estimate \beta_{0.6}=15,\beta_{0.9}=40,\beta_{0.99}=140, pick U_{t}\sim\mathcal{N}(0,1) and \{w_{t},\eta_{t}\}\sim\{\mathcal{N}(0,1),\mathcal{N}(0,I)\} respectively.
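For concreteness, a minimal data-generation sketch for this setup (our illustration; the random seed and \rho=0.9 are arbitrary choices, and the process noise \eta_{t} entering the state-space realization is omitted for brevity):

import numpy as np

rng = np.random.default_rng(0)
rho, T = 0.9, 2000                       # one of the rho values above; T is illustrative
alpha = rng.standard_normal(150)         # alpha_0, ..., alpha_149 ~ N(0, 1)
g = alpha * rho ** np.arange(150)        # impulse response coefficients of G(z) in Eq. (20)
u = rng.standard_normal(T)               # U_t ~ N(0, 1)
w = rng.standard_normal(T)               # output noise w_t ~ N(0, 1)
# y_t = sum_l g_l u_{t-l} + w_t  (FIR simulation of G(z))
y = np.convolve(u, g)[:T] + w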

Figure 1: Variation of the Hankel size \hat{d} with T for different values of \rho.

Fig. 1 shows how \hat{d} changes with the number of data points for different values of \rho. When \rho=0.6, i.e., small, \hat{d} does not grow large even as the number of data points increases; a small model order suffices to capture the system dynamics. On the other hand, when \rho=0.99, i.e., closer to instability, the required \hat{d} is much larger, indicating the need for a higher order model. Although \hat{d} implicitly captures the effect of the spectral radius, knowledge of \rho is not required for selecting \hat{d}.

In principle, our algorithm increases the Hankel size to the "appropriate" size as the data increases. We compare this to a deterministic growth policy d=\log(T) and the SSREGEST algorithm of Ljung et al. (2015). SSREGEST first learns a large model from data and then performs model reduction to obtain a final model. In contrast, we go to a reduced model directly by picking a small \hat{d}, which reduces sensitivity to noise.

Fig. 2 shows the model errors for the deterministic growth policy d=\log(T) and our algorithm. Although the difference is negligible when \rho=0.6 (small), our algorithm does better when \rho=0.99 due to its adaptive nature, i.e., \hat{d} responds faster for our algorithm.

Figure 2: Variation of ||M-\widehat{M}_{k}||_{\text{op}} for different values of \rho. Here k=\hat{d} for our algorithm and k=\log(T) for the deterministic policy; ||\cdot||_{\text{op}} is the Hankel norm.

Finally, for the case when ρ=0.9,β=40\rho=0.9,\beta=40, we show the model errors for SSREGEST and our algorithm as TT increases. Although asymptotically both algorithms perform the same, it is clear that for small TT our algorithm is more robust to the presence of noise.

T      | SSREGEST       | Our Algorithm
500    | 6.21 ± 1.35    | 13.37 ± 3.7
≈850   | 30.20 ± 7.55   | 11.25 ± 2.89
≈1200  | 26.80 ± 8.94   | 9.83 ± 2.60
1500   | 23.27 ± 10.65  | 9.17 ± 2.30
2000   | 26.38 ± 12.88  | 7.70 ± 1.60

7 Discussion

We propose a new approach to system identification when we observe only finite noisy data. Typically, the order of an LTI system is large and unknown, and a priori parametrizations may fail to yield accurate estimates of the underlying system. However, our results suggest that there always exists a lower order approximation of the original LTI system that can be learned with high probability. The central theme of our approach is to recover a good lower order approximation that can be accurately learned. Specifically, we show that identification of such approximations is closely related to the singular values of the system Hankel matrix. In fact, the time required to learn a \hat{d}–order approximation scales as T=\Omega(\frac{\beta^{2}}{\sigma_{\hat{d}}^{2}}), where \sigma_{\hat{d}} is the \hat{d}-th singular value of the system Hankel matrix. This means that system identification does not depend explicitly on the model order n, but rather on n through \sigma_{n}. As a result, in the presence of finite data it is preferable to learn only the "significant" (and perhaps much smaller) part of the system when n is very large and \sigma_{n}\ll 1. Algorithms 1 and 3 provide a guided mechanism for learning the parameters of such significant approximations, with optimal rules for hyperparameter selection given in Algorithm 2.

Future directions for our work include extending existing low–rank optimization-based identification techniques, such as Fazel et al. (2013); Grussler et al. (2018), which typically lack statistical guarantees. Since Hankel based operators occur quite naturally in general (not necessarily linear) dynamical systems, exploring whether our methods can be extended to the identification of such systems appears to be an exciting direction.

References

  • Abbasi-Yadkori et al. (2011) Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
  • Açıkmeşe et al. (2013) Behçet Açıkmeşe, John M Carson, and Lars Blackmore. Lossless convexification of nonconvex control bound and pointing constraints of the soft landing optimal control problem. IEEE Transactions on Control Systems Technology, 21(6):2104–2113, 2013.
  • Agarwal et al. (2018) Anish Agarwal, Muhammad Jehangir Amjad, Devavrat Shah, and Dennis Shen. Time series analysis via matrix estimation. arXiv preprint arXiv:1802.09064, 2018.
  • Allen-Zhu and Li (2016) Zeyuan Allen-Zhu and Yuanzhi Li. Lazysvd: Even faster svd decomposition yet without agonizing pain. In Advances in Neural Information Processing Systems, pages 974–982, 2016.
  • Bauer (2000) Dietmar Bauer. Order estimation for subspace methods. 2000.
  • Boucheron et al. (2013) Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford university press, 2013.
  • Campi and Weyer (2002) Marco C Campi and Erik Weyer. Finite sample properties of system identification methods. IEEE Transactions on Automatic Control, 47(8):1329–1334, 2002.
  • Dempster et al. (1977) Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the royal statistical society. Series B (methodological), pages 1–38, 1977.
  • Faradonbeh et al. (2017) Mohamad Kazem Shirani Faradonbeh, Ambuj Tewari, and George Michailidis. Finite time identification in unstable linear systems. arXiv preprint arXiv:1710.01852, 2017.
  • Fazel et al. (2013) Maryam Fazel, Ting Kei Pong, Defeng Sun, and Paul Tseng. Hankel matrix rank minimization with applications to system identification and realization. SIAM Journal on Matrix Analysis and Applications, 34(3):946–977, 2013.
  • Glover (1984) Keith Glover. All optimal hankel-norm approximations of linear multivariable systems and their ll_{\infty}-error bounds. International journal of control, 39(6):1115–1193, 1984.
  • Glover (1987) Keith Glover. Model reduction: a tutorial on hankel-norm methods and lower bounds on l2 errors. IFAC Proceedings Volumes, 20(5):293–298, 1987.
  • Goldenshluger (1998) Alexander Goldenshluger. Nonparametric estimation of transfer functions: rates of convergence and adaptation. IEEE Transactions on Information Theory, 44(2):644–658, 1998.
  • Grussler et al. (2018) Christian Grussler, Anders Rantzer, and Pontus Giselsson. Low-rank optimization with convex constraints. IEEE Transactions on Automatic Control, 2018.
  • Hardt et al. (2016) Moritz Hardt, Tengyu Ma, and Benjamin Recht. Gradient descent learns linear dynamical systems. arXiv preprint arXiv:1609.05191, 2016.
  • Hazan et al. (2018) Elad Hazan, Holden Lee, Karan Singh, Cyril Zhang, and Yi Zhang. Spectral filtering for general linear dynamical systems. arXiv preprint arXiv:1802.03981, 2018.
  • Ho and Kalman (1966) BL Ho and Rudolph E Kalman. Effective construction of linear state-variable models from input/output functions. at-Automatisierungstechnik, 14(1-12):545–548, 1966.
  • Kung and Lin (1981) S Kung and D Lin. Optimal hankel-norm model reductions: Multivariable systems. IEEE Transactions on Automatic Control, 26(4):832–852, 1981.
  • Ljung (1987) Lennart Ljung. System identification: theory for the user. Prentice-hall, 1987.
  • Ljung et al. (2015) Lennart Ljung, Rajiv Singh, and Tianshi Chen. Regularization features in the system identification toolbox. IFAC-PapersOnLine, 48(28):745–750, 2015.
  • Meckes et al. (2007) Mark Meckes et al. On the spectral norm of a random toeplitz matrix. Electronic Communications in Probability, 12:315–325, 2007.
  • Oymak and Ozay (2018) Samet Oymak and Necmiye Ozay. Non-asymptotic identification of lti systems from a single trajectory. arXiv preprint arXiv:1806.05722, 2018.
  • Peña et al. (2008) Victor H Peña, Tze Leung Lai, and Qi-Man Shao. Self-normalized processes: Limit theory and Statistical Applications. Springer Science & Business Media, 2008.
  • Sarkar and Rakhlin (2018) Tuhin Sarkar and Alexander Rakhlin. How fast can linear dynamical systems be learned? arXiv preprint arXiv:1812.0125, 2018.
  • Shah et al. (2012) Parikshit Shah, Badri Narayan Bhaskar, Gongguo Tang, and Benjamin Recht. Linear system identification via atomic norm regularization. In 2012 IEEE 51st IEEE Conference on Decision and Control (CDC), pages 6265–6270. IEEE, 2012.
  • Shibata (1976) Ritei Shibata. Selection of the order of an autoregressive model by akaike’s information criterion. Biometrika, 63(1):117–126, 1976.
  • Simchowitz et al. (2018) Max Simchowitz, Horia Mania, Stephen Tu, Michael I Jordan, and Benjamin Recht. Learning without mixing: Towards a sharp analysis of linear system identification. arXiv preprint arXiv:1802.08334, 2018.
  • Tu et al. (2017) Stephen Tu, Ross Boczar, Andrew Packard, and Benjamin Recht. Non-asymptotic analysis of robust control from coarse-grained identification. arXiv preprint arXiv:1707.04791, 2017.
  • Tu et al. (2018a) Stephen Tu, Ross Boczar, and Benjamin Recht. On the approximation of toeplitz operators for nonparametric \mathcal{H}_{\infty}–norm estimation. In 2018 Annual American Control Conference (ACC), pages 1867–1872. IEEE, 2018a.
  • Tu et al. (2018b) Stephen Tu, Ross Boczar, and Benjamin Recht. Minimax lower bounds for \mathcal{H}_{\infty}-norm estimation. arXiv preprint arXiv:1809.10855, 2018b.
  • Tyrtyshnikov (2012) Eugene E Tyrtyshnikov. A brief introduction to numerical analysis. Springer Science & Business Media, 2012.
  • van de Geer and Lederer (2013) Sara van de Geer and Johannes Lederer. The bernstein–orlicz norm and deviation inequalities. Probability theory and related fields, 157(1-2):225–250, 2013.
  • Van Der Vaart and Wellner (1996) Aad W Van Der Vaart and Jon A Wellner. Weak convergence. In Weak convergence and empirical processes, pages 16–28. Springer, 1996.
  • Van Overschee and De Moor (2012) Peter Van Overschee and BL De Moor. Subspace identification for linear systems: Theory—Implementation—Applications. Springer Science & Business Media, 2012.
  • Venkatesh and Dahleh (2001) Saligrama R Venkatesh and Munther A Dahleh. On system identification of complex systems from finite data. IEEE Transactions on Automatic Control, 46(2):235–257, 2001.
  • Vershynin (2010) Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.
  • Wedin (1972) Per-Åke Wedin. Perturbation bounds in connection with singular value decomposition. BIT Numerical Mathematics, 12(1):99–111, 1972.
  • Zhou et al. (1996) K Zhou, JC Doyle, and K Glover. Robust and optimal control, 1996.

8 Preliminaries

Theorem 1 (Theorem 5.39 Vershynin (2010)).

If E is a T\times md matrix with independent sub–Gaussian isotropic rows with sub–Gaussian parameter 1, then with probability at least 1-2e^{-ct^{2}} we have

TCmdtσmin(E)T+Cmd+t\sqrt{T}-C\sqrt{md}-t\leq\sigma_{\min}(E)\leq\sqrt{T}+C\sqrt{md}+t
Proposition 2 (Vershynin (2010)).

We have for any ϵ<1\epsilon<1 and any w𝒮d1w\in\mathcal{S}^{d-1} that

\mathbb{P}(||M||>z)\leq(1+2/\epsilon)^{d}\,\mathbb{P}\left(||Mw||>(1-\epsilon)z\right)
Theorem 3 (Theorem 1 Meckes et al. (2007)).

Suppose \{X_{i}\in\mathbb{R}^{m}\}_{i=1}^{\infty} are independent, \mathbb{E}[X_{j}]=\textbf{0} for all j, and the X_{ij} are independent \mathsf{subg}(1) random variables. Then \mathbb{P}(||T_{d}||\geq cm\sqrt{d\log{2d}}+t)\leq e^{-t^{2}/d} where

T_{d}=\begin{bmatrix}X_{0}&X_{1}&\ldots&X_{d-1}\\ X_{1}&X_{0}&\ldots&X_{d-2}\\ \vdots&\ddots&\ddots&\vdots\\ X_{d-1}&\ldots&\ldots&X_{0}\end{bmatrix}
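A quick numerical illustration of the scaling in Theorem 3 for the scalar case m=1 (our sketch; the dimension is arbitrary):

import numpy as np

rng = np.random.default_rng(1)
d = 400
x = rng.standard_normal(d)                                               # X_0, ..., X_{d-1}, scalar case
Td = np.array([[x[abs(i - j)] for j in range(d)] for i in range(d)])     # symmetric Toeplitz matrix
ratio = np.linalg.norm(Td, 2) / np.sqrt(d * np.log(2 * d))
print(ratio)   # stays O(1) as d grows, consistent with the c m sqrt(d log 2d) bound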
Theorem 4 (Hanson–Wright Inequality).

Let X=[X_{1},X_{2},\ldots,X_{n}]\in\mathbb{R}^{n} be a sub–Gaussian vector with independent entries and \sup_{i}\left\|X_{i}\right\|_{\psi_{2}}\leq K. Then for any B\in\mathbb{R}^{n\times n} and t\geq 0

\mathbb{P}\left(|XBX^{\top}-\mathbb{E}[XBX^{\top}]|\geq t\right)\leq 2\exp\left(\max\left(\frac{-ct}{K^{2}\|B\|},\frac{-ct^{2}}{K^{4}\|B\|^{2}_{HS}}\right)\right).
Proposition 5 (Lecture 2 Tyrtyshnikov (2012)).

Suppose that LL is the lower triangular part of a matrix Ad×dA\in\mathbb{R}^{d\times d}. Then

L2log2(2d)A2.\left\|L\right\|_{2}\leq\log_{2}{(2d)}\left\|A\right\|_{2}.
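A quick numerical check of Proposition 5 (our illustration, with an arbitrary random test matrix):

import numpy as np

rng = np.random.default_rng(2)
d = 200
A = rng.standard_normal((d, d))
L = np.tril(A)                                                           # lower triangular part of A
print(np.linalg.norm(L, 2) <= np.log2(2 * d) * np.linalg.norm(A, 2))     # True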

Let ψ\psi be a nondecreasing, convex function with ψ(0)=0\psi(0)=0 and XX a random variable. Then the Orlicz norm Xψ||X||_{\psi} is defined as

Xψ=inf{α>0:𝔼[ψ(|X|/α)]1}.||X||_{\psi}=\inf\Big{\{}\alpha>0:\mathbb{E}[\psi(|X|/\alpha)]\leq 1\Big{\}}.

Let (B,d) be an arbitrary semi–metric space. Denote by N(\epsilon,d) the minimal number of balls of radius \epsilon needed to cover B.

Theorem 6 (Corollary 2.2.5 in Van Der Vaart and Wellner (1996)).

The constant K can be chosen such that

sups,t|XsXt|ψK0diam(B)ψ1(N(ϵ/2,d))𝑑ϵ||\sup_{s,t}|X_{s}-X_{t}|||_{\psi}\leq K\int_{0}^{\text{diam}(B)}\psi^{-1}(N(\epsilon/2,d))d\epsilon

where diam(B)\text{diam}(B) is the diameter of BB and d(s,t)=XsXtψd(s,t)=||X_{s}-X_{t}||_{\psi}.

Theorem 7 (Theorem 1 in Abbasi-Yadkori et al. (2011)).

Let {𝓕t}t=0\{\bm{\mathcal{F}}_{t}\}_{t=0}^{\infty} be a filtration. Let {ηtm,Xtd}t=1\{\eta_{t}\in\mathbb{R}^{m},X_{t}\in\mathbb{R}^{d}\}_{t=1}^{\infty} be stochastic processes such that ηt,Xt\eta_{t},X_{t} are 𝓕t\bm{\mathcal{F}}_{t} measurable and ηt\eta_{t} is 𝓕t1\bm{\mathcal{F}}_{t-1}-conditionally 𝗌𝗎𝖻𝗀(L2)\mathsf{subg}(L^{2}) for some L>0L>0. For any t0t\geq 0, define Vt=s=1tXsXs,St=s=1tXsηs+1V_{t}=\sum_{s=1}^{t}X_{s}X_{s}^{\prime},S_{t}=\sum_{s=1}^{t}X_{s}\eta_{s+1}^{\top}. Then for any δ>0,V0\delta>0,V\succ 0 and all t0t\geq 0 we have with probability at least 1δ1-\delta

St(V+Vt)1St2L2(log(1δ)+log(det(V+Vt)det(V))+m).S_{t}^{\top}(V+V_{t})^{-1}S_{t}\leq 2L^{2}\left(\log{\frac{1}{\delta}}+\log{\frac{\text{det}(V+V_{t})}{\text{det}(V)}}+m\right).
Proof 8.1.

Define M=(V+V_{t})^{-1/2}S_{t}. Using Proposition 2 with \epsilon=1/2,

\mathbb{P}(||M||_{2}>z)\leq 5^{m}\mathbb{P}(||Mw||_{2}>z/2)

for w𝒮m1w\in\mathcal{S}^{m-1}. Then we can use Theorem 1 in Abbasi-Yadkori et al. (2011), and with probability at least 1δ1-\delta we have

Mw222L2(log(1δ)+log(det(V+Vt)det(V))).||Mw||^{2}_{2}\leq 2L^{2}\left(\log{\frac{1}{\delta}}+\log{\frac{\text{det}(V+V_{t})}{\text{det}(V)}}\right).

Replacing \delta\rightarrow 5^{-m}\delta, we have with probability at least 1-5^{-m}\delta

||Mw||_{2}\leq\sqrt{2}L\sqrt{\left(m\log{(5)}+\log{\frac{1}{\delta}}+\log{\frac{\text{det}(V+V_{t})}{\text{det}(V)}}\right)}.

Taking a union bound over the 5^{m} points of the net and using ||M||_{2}\leq 2\sup_{w}||Mw||_{2}, we have with probability at least 1-\delta

||M||_{2}\leq 2\sqrt{2\log{(5)}}\,L\sqrt{\left(m+\log{\frac{1}{\delta}}+\log{\frac{\text{det}(V+V_{t})}{\text{det}(V)}}\right)}.
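A small simulation of the bound in Theorem 7 with i.i.d. Gaussian covariates and noise (our illustration; the dimensions, horizon and \delta are arbitrary), comparing the operator norm of S_{t}^{\top}(V+V_{t})^{-1}S_{t} against the right-hand side:

import numpy as np

rng = np.random.default_rng(3)
d, m, T, L, delta = 5, 2, 5000, 1.0, 0.05
V = np.eye(d)
Vt, St = np.zeros((d, d)), np.zeros((d, m))
for _ in range(T):
    x = rng.standard_normal(d)               # X_s, measurable w.r.t. the past
    eta = L * rng.standard_normal(m)         # eta_{s+1}, conditionally subg(L^2)
    Vt += np.outer(x, x)
    St += np.outer(x, eta)
lhs = np.linalg.norm(St.T @ np.linalg.inv(V + Vt) @ St, 2)
rhs = 2 * L**2 * (np.log(1 / delta) + np.log(np.linalg.det(V + Vt) / np.linalg.det(V)) + m)
print(lhs <= rhs, lhs, rhs)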
Lemma 8.

For any M=(C,A,B)M=(C,A,B), we have that

T×mTv=σ(k=1d𝒯d+k,T𝒯d+k,T)||\mathcal{B}^{v}_{T\times mT}||=\sqrt{\sigma\Big{(}\sum_{k=1}^{d}\mathcal{T}_{d+k,T}^{\top}\mathcal{T}_{d+k,T}\Big{)}}

Here T×mTv\mathcal{B}^{v}_{T\times mT} is defined as follows: β=d,d,Tv=[β1,β2,,βT]\beta=\mathcal{H}_{d,d,T}^{\top}v=[\beta_{1}^{\top},\beta_{2}^{\top},\ldots,\beta_{T}^{\top}]^{\top}.

T×mTv=[β100β2β10βTβT1β1]\displaystyle\mathcal{B}^{v}_{T\times mT}=\begin{bmatrix}\beta_{1}^{\top}&0&0&\ldots\\ \beta^{\top}_{2}&\beta^{\top}_{1}&0&\ldots\\ \vdots&\vdots&\ddots&\vdots\\ \beta_{T}^{\top}&\beta_{T-1}^{\top}&\ldots&\beta^{\top}_{1}\end{bmatrix}

and v2=1||v||_{2}=1.

Proof 8.2.

For the matrix v\mathcal{B}^{v} we have

vu\displaystyle\mathcal{B}^{v}u =[β1u1β1u2+β2u1β1u3+β2u2+β3u1β1uT+β2uT1++βTu1]=[v[CAd+1Bu1CAd+2Bu1CA2dBu1]v[CAd+2Bu1+CAd+1Bu2CAd+3Bu1+CAd+2Bu2CA2d+1Bu1+CA2dBu2]v[CAT+dBu1++CAd+1BuTCAT+d+2Bu1++CAd+2BuTCAT+2d1Bu1++CA2dBuT]]\displaystyle=\begin{bmatrix}\beta_{1}^{\top}u_{1}\\ \beta_{1}^{\top}u_{2}+\beta_{2}^{\top}u_{1}\\ \beta_{1}^{\top}u_{3}+\beta_{2}^{\top}u_{2}+\beta_{3}^{\top}u_{1}\\ \vdots\\ \beta_{1}^{\top}u_{T}+\beta_{2}^{\top}u_{T-1}+\ldots+\beta_{T}^{\top}u_{1}\end{bmatrix}=\begin{bmatrix}v^{\top}\begin{bmatrix}CA^{d+1}Bu_{1}\\ CA^{d+2}Bu_{1}\\ \vdots\\ CA^{2d}Bu_{1}\end{bmatrix}\\ v^{\top}\begin{bmatrix}CA^{d+2}Bu_{1}+CA^{d+1}Bu_{2}\\ CA^{d+3}Bu_{1}+CA^{d+2}Bu_{2}\\ \vdots\\ CA^{2d+1}Bu_{1}+CA^{2d}Bu_{2}\end{bmatrix}\\ \vdots\\ v^{\top}\begin{bmatrix}CA^{T+d}Bu_{1}+\ldots+CA^{d+1}Bu_{T}\\ CA^{T+d+2}Bu_{1}+\ldots+CA^{d+2}Bu_{T}\\ \vdots\\ CA^{T+2d-1}Bu_{1}+\ldots+CA^{2d}Bu_{T}\end{bmatrix}\end{bmatrix}
=𝒱[[CAd+1Bu1CAd+2Bu1CA2dBu1][CAd+2Bu1+CAd+1Bu2CAd+3Bu1+CAd+2Bu2CA2d+1Bu1+CA2dBu2][CAT+dBu1++CAd+1BuTCAT+d+2Bu1++CAd+2BuTCAT+2d1Bu1++CA2dBuT]]\displaystyle=\mathcal{V}\begin{bmatrix}\begin{bmatrix}CA^{d+1}Bu_{1}\\ CA^{d+2}Bu_{1}\\ \vdots\\ CA^{2d}Bu_{1}\end{bmatrix}\\ \begin{bmatrix}CA^{d+2}Bu_{1}+CA^{d+1}Bu_{2}\\ CA^{d+3}Bu_{1}+CA^{d+2}Bu_{2}\\ \vdots\\ CA^{2d+1}Bu_{1}+CA^{2d}Bu_{2}\end{bmatrix}\\ \vdots\\ \begin{bmatrix}CA^{T+d}Bu_{1}+\ldots+CA^{d+1}Bu_{T}\\ CA^{T+d+2}Bu_{1}+\ldots+CA^{d+2}Bu_{T}\\ \vdots\\ CA^{T+2d-1}Bu_{1}+\ldots+CA^{2d}Bu_{T}\end{bmatrix}\end{bmatrix}
=𝒱[CAd+1B000CAd+2B000CA2dB000CAd+2BCAd+1B00CAd+3BCAd+2B00CA2d+1BCA2dB00CAT+d1BCAT+dBCAT+d1BCAd+1BCAT+d+2BCAT+d+1BCAT+dBCAd+2BCAT+2d1BCAT+2d1BCAT+2d2BCA2dB]=S[u1u2uT]\displaystyle=\mathcal{V}\underbrace{\begin{bmatrix}CA^{d+1}B&0&0&\ldots&0\\ CA^{d+2}B&0&0&\ldots&0\\ \vdots&\vdots&\vdots&\vdots&\vdots\\ CA^{2d}B&0&0&\ldots&0\\ CA^{d+2}B&CA^{d+1}B&0&\ldots&0\\ CA^{d+3}B&CA^{d+2}B&0&\ldots&0\\ \vdots&\vdots&\vdots&\vdots&\vdots\\ CA^{2d+1}B&CA^{2d}B&0&\ldots&0\\ \vdots&\vdots&\vdots&\vdots&\vdots\\ CA^{T+d-1}B&CA^{T+d}B&CA^{T+d-1}B&\ldots&CA^{d+1}B\\ CA^{T+d+2}B&CA^{T+d+1}B&CA^{T+d}B&\ldots&CA^{d+2}B\\ \vdots&\vdots&\vdots&\vdots&\vdots\\ CA^{T+2d-1}B&CA^{T+2d-1}B&CA^{T+2d-2}B&\ldots&CA^{2d}B\\ \end{bmatrix}}_{=S}\begin{bmatrix}u_{1}\\ u_{2}\\ \vdots\\ u_{T}\end{bmatrix}

It is clear that ||\mathcal{V}||_{2}=||u||_{2}=1 and, for any matrix S, ||S|| does not change if we interchange the rows of S. Then we have

S2\displaystyle||S||_{2} =σ([CAd+1B000CAd+2BCAd+1B00CAT+d+1BCAT+dBCAT+d1BCAd+1BCAd+2B000CAd+3BCAd+2B00CAT+d+2BCAT+d+1BCAT+dBCAd+2BCA2dB000CA2d+1BCA2dB00CAT+2d1BCAT+2d1BCAT+2d2BCA2dB])\displaystyle=\sigma\left(\begin{bmatrix}CA^{d+1}B&0&0&\ldots&0\\ CA^{d+2}B&CA^{d+1}B&0&\ldots&0\\ \vdots&\vdots&\vdots&\vdots&\vdots\\ CA^{T+d+1}B&CA^{T+d}B&CA^{T+d-1}B&\ldots&CA^{d+1}B\\ CA^{d+2}B&0&0&\ldots&0\\ CA^{d+3}B&CA^{d+2}B&0&\ldots&0\\ \vdots&\vdots&\vdots&\vdots&\vdots\\ CA^{T+d+2}B&CA^{T+d+1}B&CA^{T+d}B&\ldots&CA^{d+2}B\\ \vdots&\vdots&\vdots&\vdots&\vdots\\ CA^{2d}B&0&0&\ldots&0\\ CA^{2d+1}B&CA^{2d}B&0&\ldots&0\\ \vdots&\vdots&\vdots&\vdots&\vdots\\ CA^{T+2d-1}B&CA^{T+2d-1}B&CA^{T+2d-2}B&\ldots&CA^{2d}B\\ \end{bmatrix}\right)
=σ([𝒯d+1,T𝒯d+2,T𝒯2d,T])=σ(k=1d𝒯d+k,T𝒯d+k,T)\displaystyle=\sigma\left(\begin{bmatrix}\mathcal{T}_{d+1,T}\\ \mathcal{T}_{d+2,T}\\ \vdots\\ \mathcal{T}_{2d,T}\end{bmatrix}\right)=\sqrt{\sigma\Big{(}\sum_{k=1}^{d}\mathcal{T}_{d+k,T}^{\top}\mathcal{T}_{d+k,T}\Big{)}}
Proposition 9 (Lemma 4.1 Simchowitz et al. (2018)).

Let SS be an invertible matrix and κ(S)\kappa(S) be its condition number. Then for a 14κ\frac{1}{4\kappa}–net of 𝒮d1\mathcal{S}^{d-1} and an arbitrary matrix AA, we have

SA22supv𝒩14κvA2vS12||SA||_{2}\leq 2\sup_{v\in{\mathcal{N}}_{\frac{1}{4\kappa}}}\frac{||v^{\prime}A||_{2}}{||v^{\prime}S^{-1}||_{2}}
Proof 8.3.

Let w be such that ||SA||_{2}=\frac{||w^{\prime}A||_{2}}{||w^{\prime}S^{-1}||_{2}} and let v\in{\mathcal{N}}_{\frac{1}{4\kappa}} be a net point with ||v-w||_{2}\leq\frac{1}{4\kappa}. Then we have

SA2vA2vS12\displaystyle||SA||_{2}-\frac{||v^{\prime}A||_{2}}{||v^{\prime}S^{-1}||_{2}} |wA2wS12vA2vS12|\displaystyle\leq\Big{|}\frac{||w^{\prime}A||_{2}}{||w^{\prime}S^{-1}||_{2}}-\frac{||v^{\prime}A||_{2}}{||v^{\prime}S^{-1}||_{2}}\Big{|}
=|wA2wS12vA2wS12+vA2wS12vA2vS12|\displaystyle=\Big{|}\frac{||w^{\prime}A||_{2}}{||w^{\prime}S^{-1}||_{2}}-\frac{||v^{\prime}A||_{2}}{||w^{\prime}S^{-1}||_{2}}+\frac{||v^{\prime}A||_{2}}{||w^{\prime}S^{-1}||_{2}}-\frac{||v^{\prime}A||_{2}}{||v^{\prime}S^{-1}||_{2}}\Big{|}
SA214κS12wS12+SA2|vS12wS121|\displaystyle\leq||SA||_{2}\frac{\frac{1}{4\kappa}||S^{-1}||_{2}}{||w^{\prime}S^{-1}||_{2}}+||SA||_{2}\Big{|}\frac{||v^{\prime}S^{-1}||_{2}}{||w^{\prime}S^{-1}||_{2}}-1\Big{|}
SA22\displaystyle\leq\frac{||SA||_{2}}{2}

9 Control and Systems Theory Preliminaries

9.1 Sylvester Matrix Equation

Define the discrete time Sylvester operator SA,B:n×nn×nS_{A,B}:\mathbb{R}^{n\times n}\rightarrow\mathbb{R}^{n\times n}

S_{A,B}(X)=X-AXB (21)

Then we have the following properties for S_{A,B}(\cdot).

Proposition 1.

Let \lambda_{i},\mu_{j} be the eigenvalues of A and B respectively. Then S_{A,B} is invertible if and only if, for all i,j,

λiμj1\lambda_{i}\mu_{j}\neq 1

Define the discrete time Lyapunov operator for a matrix A as \mathcal{L}_{A,A^{\prime}}(\cdot)=S^{-1}_{A,A^{\prime}}(\cdot). It follows from Proposition 1 that S_{A,A^{\prime}}(\cdot) is an invertible operator whenever \rho(A)<1.

Now let Q0Q\succeq 0 then

\displaystyle S_{A,A^{\prime}}(X) =Q
\displaystyle\implies X =AXA^{\prime}+Q
\displaystyle\implies X =\sum_{k=0}^{\infty}A^{k}QA^{\prime k} (22)

Eq. (22) follows directly by substitution and, by Proposition 1, the solution is unique if \rho(A)<1. Further, let Q_{1}\succeq Q_{2}\succeq 0 and let X_{1},X_{2} be the corresponding solutions of the Lyapunov equation; then it follows from Eq. (22) that

X1,X2\displaystyle X_{1},X_{2} 0\displaystyle\succeq 0
X1\displaystyle X_{1} X2\displaystyle\succeq X_{2}
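A quick numerical check of the series representation in Eq. (22) against a direct Lyapunov solve (our sketch; the random stable A is illustrative):

import numpy as np
from scipy.linalg import solve_discrete_lyapunov

rng = np.random.default_rng(4)
n = 4
A = rng.standard_normal((n, n))
A *= 0.9 / np.abs(np.linalg.eigvals(A)).max()            # enforce rho(A) < 1
Q = np.eye(n)

# truncated series solution X = sum_k A^k Q (A')^k from Eq. (22)
X_series = sum(np.linalg.matrix_power(A, k) @ Q @ np.linalg.matrix_power(A.T, k)
               for k in range(200))
X_direct = solve_discrete_lyapunov(A, Q)                 # solves X = A X A' + Q
print(np.allclose(X_series, X_direct))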

9.2 Properties of System Hankel matrix

  • Rank of system Hankel matrix: For M=(C,A,B)nM=(C,A,B)\in\mathcal{M}_{n}, the system Hankel matrix, 0,,(M)\mathcal{H}_{0,\infty,\infty}(M), can be decomposed as follows:

    0,,(M)=[CCACAd]=𝒪[BABAdB]=\mathcal{H}_{0,\infty,\infty}(M)=\underbrace{\begin{bmatrix}C\\ CA\\ \vdots\\ CA^{d}\\ \vdots\end{bmatrix}}_{=\mathcal{O}}\underbrace{\begin{bmatrix}B&AB&\ldots&A^{d}B&\ldots\end{bmatrix}}_{=\mathcal{R}} (23)

It follows from definition that \text{rank}(\mathcal{O}),\text{rank}(\mathcal{R})\leq n and as a result \text{rank}(\mathcal{O}\mathcal{R})\leq n. The system Hankel matrix rank, or \text{rank}(\mathcal{O}\mathcal{R}), which is also the model order (or simply order), captures the complexity of M. If \text{SVD}(\mathcal{H}_{0,\infty,\infty})=U\Sigma V^{\top}, then \mathcal{O}=U\Sigma^{1/2}S,\mathcal{R}=S^{-1}\Sigma^{1/2}V^{\top}. By noting that

    CAlS=CS(S1AS)l,S1AlB=(S1AS)lS1BCA^{l}S=CS(S^{-1}AS)^{l},S^{-1}A^{l}B=(S^{-1}AS)^{l}S^{-1}B

    we have obtained a way of recovering the system parameters (up to similarity transformations). Furthermore, 0,,\mathcal{H}_{0,\infty,\infty} uniquely (up to similarity transformation) recovers (C,A,B)(C,A,B).

  • Mapping Past to Future: 0,,\mathcal{H}_{0,\infty,\infty} can also be viewed as an operator that maps “past” inputs to “future” outputs. In Eq. (1) assume that {ηt,wt}=0\{\eta_{t},w_{t}\}=0. Then consider the following class of inputs UtU_{t} such that Ut=0U_{t}=0 for all tTt\geq T but UtU_{t} may not be zero for t<Tt<T. Here TT is chosen arbitrarily. Then

    [YTYT+1YT+2]Future=0,,[UT1UT2UT3]Past\underbrace{\begin{bmatrix}Y_{T}\\ Y_{T+1}\\ Y_{T+2}\\ \vdots\end{bmatrix}}_{\text{Future}}=\mathcal{H}_{0,\infty,\infty}\underbrace{\begin{bmatrix}U_{T-1}\\ U_{T-2}\\ U_{T-3}\\ \vdots\end{bmatrix}}_{\text{Past}} (24)
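A small noise-free simulation illustrating Eq. (24) with finitely many past inputs (our sketch; the system and sizes are arbitrary):

import numpy as np

rng = np.random.default_rng(5)
n, d, Tp = 3, 8, 8                                        # state dim, future window, number of past inputs
A = rng.standard_normal((n, n)); A *= 0.7 / np.abs(np.linalg.eigvals(A)).max()
B = rng.standard_normal((n, 1)); C = rng.standard_normal((1, n))

u = rng.standard_normal(Tp)                               # U_0, ..., U_{Tp-1}; U_t = 0 for t >= Tp
x, ys = np.zeros(n), []
for t in range(Tp + d):                                   # noise-free simulation of Eq. (1)
    ys.append((C @ x).item())
    x = A @ x + (B[:, 0] * u[t] if t < Tp else 0.0)
future = np.array(ys[Tp:])                                # Y_Tp, ..., Y_{Tp+d-1}

H = np.block([[C @ np.linalg.matrix_power(A, i + j) @ B for j in range(Tp)] for i in range(d)])
past = u[::-1]                                            # U_{Tp-1}, U_{Tp-2}, ..., U_0
print(np.allclose(future, H @ past))                      # Eq. (24), truncated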

9.3 Model Reduction

Consider an LTI system M=(C,A,B) of order n with doubly infinite system Hankel matrix \mathcal{H}_{0,\infty,\infty}. We are interested in finding the best order-k lower dimensional approximation of M, i.e., for every k<n we would like to find M_{k} of model order k such that ||M-M_{k}||_{\infty} is minimized. Systems theory gives us a class of model approximations, known as balanced truncated approximations, that provide strong theoretical guarantees (see Glover (1984) and Section 21.6 in Zhou et al. (1996)). We summarize some of the basics of model reduction below. Assume that M has distinct Hankel singular values.

Recall that a model M=(C,A,B)M=(C,A,B) is equivalent to M~=(CS,S1AS,S1B)\tilde{M}=(CS,S^{-1}AS,S^{-1}B) with respect to its transfer function. Define

Q\displaystyle Q =AQA+CC\displaystyle=A^{\top}QA+C^{\top}C
P\displaystyle P =APA+BB\displaystyle=APA^{\top}+BB^{\top}

For two positive definite matrices P,Q it is a known fact that there exists a transformation S such that S^{\top}QS=S^{-1}PS^{-1\top}=\Sigma, where \Sigma is diagonal with decreasing diagonal elements. Further, \sigma_{i} is the i^{th} singular value of \mathcal{H}_{0,\infty,\infty}. Then let \tilde{A}=S^{-1}AS,\tilde{C}=CS,\tilde{B}=S^{-1}B. Clearly \widetilde{M}=(\tilde{A},\tilde{B},\tilde{C}) is equivalent to M and we have

Σ\displaystyle\Sigma =A~ΣA~+C~C~\displaystyle=\tilde{A}^{\top}\Sigma\tilde{A}+\tilde{C}^{\top}\tilde{C}
Σ\displaystyle\Sigma =A~ΣA~+B~B~\displaystyle=\tilde{A}\Sigma\tilde{A}^{\top}+\tilde{B}\tilde{B}^{\top} (25)

Here C~,A~,B~\tilde{C},\tilde{A},\tilde{B} is a balanced realization of MM.

Proposition 2.

Let 0,,=UΣV\mathcal{H}_{0,\infty,\infty}=U\Sigma V^{\top}. Here Σ0n×n\Sigma\succeq 0\in\mathbb{R}^{n\times n}. Then

C~\displaystyle\tilde{C} =[UΣ1/2]1:p,:\displaystyle=[U\Sigma^{1/2}]_{1:p,:}
A~\displaystyle\tilde{A} =Σ1/2U[UΣ1/2]p+1:,:\displaystyle=\Sigma^{-1/2}U^{\top}[U\Sigma^{1/2}]_{p+1:,:}
B~\displaystyle\tilde{B} =[Σ1/2V]:,1:m\displaystyle=[\Sigma^{1/2}V^{\top}]_{:,1:m}

The triple (C~,A~,B~)(\tilde{C},\tilde{A},\tilde{B}) is a balanced realization of MM. For any matrix LL, L:,m:nL_{:,m:n} (or Lm:n,:L_{m:n,:}) denotes the submatrix with only columns (or rows) mm through nn.

Proof 9.1.

Let the SVD of \mathcal{H}_{0,\infty,\infty} be U\Sigma V^{\top}. Then M can be constructed as follows: U\Sigma^{1/2},\Sigma^{1/2}V^{\top} are of the form

UΣ1/2=[CSCASCA2S],Σ1/2V=[S1BS1ABS1A2B]\displaystyle U\Sigma^{1/2}=\begin{bmatrix}CS\\ CAS\\ CA^{2}S\\ \vdots\\ \end{bmatrix},\Sigma^{1/2}V^{\top}=\begin{bmatrix}S^{-1}B&S^{-1}AB&S^{-1}A^{2}B\ldots\end{bmatrix}

where SS is the transformation which gives us Eq. (25). This follows because

Σ1/2UUΣ1/2\displaystyle\Sigma^{1/2}U^{\top}U\Sigma^{1/2} =k=0SAkCCAkS\displaystyle=\sum_{k=0}^{\infty}S^{\top}A^{k\top}C^{\top}CA^{k}S
=k=0SAkS1SCCSS1AkS\displaystyle=\sum_{k=0}^{\infty}S^{\top}A^{k\top}S^{-1\top}S^{\top}C^{\top}CSS^{-1}A^{k}S
=k=0A~kC~C~A~k=A~ΣA~+C~C~=Σ\displaystyle=\sum_{k=0}^{\infty}\tilde{A}^{k\top}\tilde{C}^{\top}\tilde{C}\tilde{A}^{k}=\tilde{A}^{\top}\Sigma\tilde{A}+\tilde{C}^{\top}\tilde{C}=\Sigma

Then C~=UΣ1:p,:1/2\tilde{C}=U\Sigma^{1/2}_{1:p,:} and

UΣ1/2A~\displaystyle U\Sigma^{1/2}\tilde{A} =[UΣ1/2]p+1:,:\displaystyle=[U\Sigma^{1/2}]_{p+1:,:}
A~\displaystyle\tilde{A} =Σ1/2U[UΣ1/2]p+1:,:\displaystyle=\Sigma^{-1/2}U^{\top}[U\Sigma^{1/2}]_{p+1:,:}

We do a similar computation for BB.
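As an illustration of Proposition 2 (our sketch; for a finite Hankel matrix we use the shift-based analogue \tilde{A}=(U\Sigma^{1/2})_{\text{up}}^{+}(U\Sigma^{1/2})_{\text{down}} of the formula above, and the random stable system is arbitrary):

import numpy as np

rng = np.random.default_rng(6)
n, p, m, d = 3, 1, 1, 10
A = rng.standard_normal((n, n)); A *= 0.8 / np.abs(np.linalg.eigvals(A)).max()
B = rng.standard_normal((n, m)); C = rng.standard_normal((p, n))

# truncated Hankel matrix with block (i, j) = C A^(i+j) B, cf. Eq. (23)
H = np.block([[C @ np.linalg.matrix_power(A, i + j) @ B for j in range(d)] for i in range(d)])
U, s, Vt = np.linalg.svd(H)
U, s, Vt = U[:, :n], s[:n], Vt[:n, :]                     # keep the n nonzero singular values

O = U * np.sqrt(s)                                        # U Sigma^{1/2}
Ct = O[:p, :]                                             # C tilde: first p rows
Bt = (np.sqrt(s)[:, None] * Vt)[:, :m]                    # B tilde: first m columns of Sigma^{1/2} V^T
At = np.linalg.pinv(O[:-p, :]) @ O[p:, :]                 # A tilde via the shift of U Sigma^{1/2}

# the recovered realization reproduces the Markov parameters C A^k B
print(all(np.allclose(Ct @ np.linalg.matrix_power(At, k) @ Bt,
                      C @ np.linalg.matrix_power(A, k) @ B) for k in range(5)))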

It should be noted that a balanced realization C~,A~,B~\tilde{C},\tilde{A},\tilde{B} is unique except when there are some Hankel singular values that are equal. To see this, assume that we have

σ1>>σr1>σr=σr+1==σs>σs+1>σn\sigma_{1}>\ldots>\sigma_{r-1}>\sigma_{r}=\sigma_{r+1}=\ldots=\sigma_{s}>\sigma_{s+1}>\ldots\sigma_{n}

where sr>0s-r>0. For any unitary matrix Q(sr+1)×(sr+1)Q\in\mathbb{R}^{(s-r+1)\times(s-r+1)}, define Q0Q_{0}

Q0=[I(r1)×(r1)000Q000I(ns)×(ns)]Q_{0}=\begin{bmatrix}I_{(r-1)\times(r-1)}&0&0\\ 0&Q&0\\ 0&0&I_{(n-s)\times(n-s)}\end{bmatrix} (26)

Then every triple (C~Q0,Q0A~Q0,Q0B~)(\tilde{C}Q_{0},Q_{0}^{\top}\tilde{A}Q_{0},Q_{0}^{\top}\tilde{B}) satisfies Eq. (25) and is a balanced realization. Let Mk=(C~k,A~kk,B~k)M_{k}=(\tilde{C}_{k},\tilde{A}_{kk},\tilde{B}_{k}) where

A~=[A~kkA~0kA~k0A~00],B~=[B~kB~0],C~=[C~kC~0]\displaystyle\tilde{A}=\begin{bmatrix}\tilde{A}_{kk}&\tilde{A}_{0k}\\ \tilde{A}_{k0}&\tilde{A}_{00}\end{bmatrix},\tilde{B}=\begin{bmatrix}\tilde{B}_{k}\\ \tilde{B}_{0}\end{bmatrix},\tilde{C}=\begin{bmatrix}\tilde{C}_{k}&\tilde{C}_{0}\end{bmatrix} (27)

Here \tilde{A}_{kk} is the top-left k\times k submatrix of \tilde{A}, with \tilde{B}_{k},\tilde{C}_{k} the corresponding partitions of \tilde{B},\tilde{C}. The realization M_{k}=(\tilde{C}_{k},\tilde{A}_{kk},\tilde{B}_{k}) is the k–order balanced truncated model. Clearly M\equiv M_{n}, which gives us \tilde{C}=\tilde{C}_{n},\tilde{A}=\tilde{A}_{nn},\tilde{B}=\tilde{B}_{n}, i.e., the balanced version of the true model. We will show that for the balanced truncated model we only need the top k singular vectors and not the entire model.

Proposition 3.

For the k-order balanced truncated model M_{k}, we only need the top k singular values and singular vectors of \mathcal{H}_{0,\infty,\infty}.

Proof 9.2.

From the preceding discussion in Proposition 2 and Eq. (27) it is clear that the first p×kp\times k block submatrix of UΣ1/2U\Sigma^{1/2} (corresponding to the top kk singular vectors) gives us C~k\tilde{C}_{k}. Since

A~=Σ1/2U[UΣ1/2]p+1:,:\tilde{A}=\Sigma^{-1/2}U^{\top}[U\Sigma^{1/2}]_{p+1:,:}

we observe that \tilde{A}_{kk} depends only on the top k singular vectors U_{k} and the corresponding singular values. This can be seen as follows: [U\Sigma^{1/2}]_{p+1:,:} denotes the submatrix of U\Sigma^{1/2} with the top p rows removed. In U\Sigma^{1/2} each column of U is scaled by the corresponding singular value. Then the \tilde{A}_{kk} submatrix depends only on the top k rows of \Sigma^{-1/2}U^{\top} and the top k columns of [U\Sigma^{1/2}]_{p+1:,:}, which correspond to the top k singular vectors.

10 Isometry of Input Matrix: Proof of Lemma 1

Theorem 10.1.

Define

U\displaystyle U [UdUd+1UT+d1Ud1UdUT+d2U1U2UT]\displaystyle\coloneqq\begin{bmatrix}U_{d}&U_{d+1}&\ldots&U_{T+d-1}\\ U_{d-1}&U_{d}&\ldots&U_{T+d-2}\\ \vdots&\vdots&\ddots&\vdots\\ U_{1}&U_{2}&\ldots&U_{T}\end{bmatrix}

where each Ui𝗌𝗎𝖻𝗀(1)U_{i}\sim\mathsf{subg}(1) and isotropic. Then there exists an absolute constant cc such that UU satisfies:

(1/2)Tσmin(UU)σmax(UU)(3/2)T(1/2)T\leq\sigma_{\min}(UU^{\top})\leq\sigma_{\max}(UU^{\top})\leq(3/2)T

whenever Tcm2d(log2(d)log2(m2/δ)+log3(2d))T\geq cm^{2}d(\log^{2}{(d)}\log^{2}{(m^{2}/\delta)}+\log^{3}{(2d)}) with probability at least 1δ1-\delta.

Proof 10.2.

Define

Amd×md\displaystyle A_{md\times md} [0000I0000I0000I0],Bmd×m[I00],U^kUd+k\displaystyle\coloneqq\begin{bmatrix}0&0&0&\ldots&0\\ I&0&0&\ldots&0\\ \vdots&\ddots&\ddots&\vdots&\vdots\\ 0&\ldots&I&0&0\\ 0&\ldots&0&I&0\end{bmatrix},B_{md\times m}\coloneqq\begin{bmatrix}I\\ 0\\ \vdots\\ 0\end{bmatrix},\widehat{U}_{k}\coloneqq U_{d+k}

Since

U=[UdUd+1UT+d1Ud1UdUT+d2U1U2UT]U=\begin{bmatrix}U_{d}&U_{d+1}&\ldots&U_{T+d-1}\\ U_{d-1}&U_{d}&\ldots&U_{T+d-2}\\ \vdots&\vdots&\ldots&\vdots\\ U_{1}&U_{2}&\ldots&U_{T}\end{bmatrix}

we can reformulate it so that each column is the output of an LTI system in the following sense:

xk+1\displaystyle x_{k+1} =Axk+BU^(k+1)\displaystyle=Ax_{k}+B\widehat{U}(k+1) (28)

where UU=k=0T1xkxkUU^{\top}=\sum_{k=0}^{T-1}x_{k}x_{k}^{\top} and x0=[UdUd1U1]x_{0}=\begin{bmatrix}U_{d}\\ U_{d-1}\\ \vdots\\ U_{1}\end{bmatrix}. From Theorem 1 we have that

34TIk=0T1U^kU^k54TI\frac{3}{4}TI\preceq\sum_{k=0}^{T-1}\widehat{U}_{k}\widehat{U}_{k}^{\top}\preceq\frac{5}{4}TI

with probability at least 1-\delta whenever T\geq c\Big{(}m+\log{\frac{2}{\delta}}\Big{)}. Define V_{t}=\sum_{k=0}^{t-1}x_{k}x_{k}^{\top}. Then,

VT\displaystyle V_{T} =AVT1A+B(k=0T1U^kU^k)B+k=0T2(AxkU^k+1B+BU^k+1xkA)\displaystyle=AV_{T-1}A^{\top}+B\left(\sum_{k=0}^{T-1}\widehat{U}_{k}\widehat{U}_{k}^{\top}\right)B^{\top}+\sum_{k=0}^{T-2}\left(Ax_{k}\widehat{U}^{\top}_{k+1}B^{\top}+B\widehat{U}_{k+1}x^{\top}_{k}A^{\top}\right) (29)

It can be easily checked that xk=[Ud+kUd+k1Uk+1]x_{k}=\begin{bmatrix}U_{d+k}\\ U_{d+k-1}\\ \vdots\\ U_{k+1}\end{bmatrix} and consequently

k=0T2AxkU^k+1B=k=0T2[0000Ud+kUd+k+1000Ud+k1Ud+k+1000Uk+2Ud+k+1000].\displaystyle\sum_{k=0}^{T-2}Ax_{k}\widehat{U}^{\top}_{k+1}B^{\top}=\sum_{k=0}^{T-2}\begin{bmatrix}0&0&\ldots&0&0\\ U_{d+k}U_{d+k+1}^{\top}&0&\ldots&0&0\\ U_{d+k-1}U_{d+k+1}^{\top}&0&\ldots&0&0\\ \vdots&\vdots&\ddots&\vdots&\vdots\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ U_{k+2}U_{d+k+1}^{\top}&0&\ldots&0&0\end{bmatrix}.

Define Ljk=0T2Ud+kj+1Ud+k+1L_{j}\coloneqq\sum_{k=0}^{T-2}U_{d+k-j+1}U_{d+k+1}^{\top} and LjL_{j} is a m×mm\times m block matrix. Then

Td=l=0d1Al(k=0T2AxkU^k+1B)Al=[00000L10000L2L1000Ld100L10].\displaystyle T_{d}=\sum_{l=0}^{d-1}A^{l}\left(\sum_{k=0}^{T-2}Ax_{k}\widehat{U}^{\top}_{k+1}B^{\top}\right)A^{l\top}=\begin{bmatrix}0&0&\ldots&0&0&0\\ L_{1}&0&\ldots&0&0&0\\ L_{2}&L_{1}&\ldots&0&0&0\\ \vdots&\vdots&\ddots&\vdots&\vdots&\vdots\\ \vdots&\vdots&\vdots&\ddots&\vdots&\vdots\\ L_{d-1}&0&\ldots&0&L_{1}&0\end{bmatrix}.

Use Lemma 1 to show that

TdcmTdlog((d))log((m2/δ))\left\|T_{d}\right\|\leq cm\sqrt{Td}\log{(d)}\log{(m^{2}/\delta)} (30)

with probability at least 1δ1-\delta. Then

VT=l=0d1AlB(k=0T1U^kU^k)BAl+Tdl=0d1AlxT1xT1Al.\displaystyle V_{T}=\sum_{l=0}^{d-1}A^{l}B\left(\sum_{k=0}^{T-1}\widehat{U}_{k}\widehat{U}_{k}^{\top}\right)B^{\top}A^{l\top}+T_{d}-\sum_{l=0}^{d-1}A^{l}x_{T-1}x_{T-1}^{\top}A^{l\top}.

From Theorem 1 we have with probability at least 1-\delta that

(3/4)TIl=0d1AlB(k=0T1U^kU^k)BAl(5/4)TI(3/4)TI\preceq\sum_{l=0}^{d-1}A^{l}B\Bigg{(}\sum_{k=0}^{T-1}\widehat{U}_{k}\widehat{U}_{k}^{\top}\Bigg{)}B^{\top}A^{l\top}\preceq(5/4)TI (31)

whenever Tc(m+log(2δ))T\geq c\Big{(}m+\log{\frac{2}{\delta}}\Big{)}. Observe that

l=1dAlxT1xT1Al\displaystyle\left\|\sum_{l=1}^{d}A^{l}x_{T-1}x_{T-1}^{\top}A^{l\top}\right\| =σ12([AxT1,A2xT1,,AdxT1])\displaystyle=\sigma_{1}^{2}([Ax_{T-1},A^{2}x_{T-1},\ldots,A^{d}x_{T-1}])

The matrix [AxT1,A2xT1,,AdxT1][Ax_{T-1},A^{2}x_{T-1},\ldots,A^{d}x_{T-1}] is the lower triangular submatrix of a random Toeplitz matrix with i.i.d 𝗌𝗎𝖻𝗀(1)\mathsf{subg}(1) entries as in Theorem 3. Then using Theorem 3 and Proposition 5 we get that with probability at least 1δ1-\delta we have

[AxT1,A2xT1,,AdxT1]cm(dlog((2d))log((2d))+dlog((1/δ))).\left\|[Ax_{T-1},A^{2}x_{T-1},\ldots,A^{d}x_{T-1}]\right\|\leq cm(\sqrt{d\log{(2d)}}\log{(2d)}+\sqrt{d\log{(1/\delta)}}). (32)

Then l=1dAlxT1xT1Alcm2d(log3(2d)+log((1/δ))+log((2d))log((2d))log((1/δ)))\left\|\sum_{l=1}^{d}A^{l}x_{T-1}x_{T-1}^{\top}A^{l\top}\right\|\leq cm^{2}d(\log^{3}{(2d)}+\log{(1/\delta)}+\log{(2d)}\sqrt{\log{(2d)}\log{(1/\delta)}}) with probability at least 1δ1-\delta. By ensuring that Eqs. (30), (31) and (32) hold simultaneously we can ensure that cmTdlog((d))log((m2/δ))T/8cm\sqrt{Td}\log{(d)}\log{(m^{2}/\delta)}\leq T/8 and cm2d(log3(2d)+log((1/δ))+log((2d))log((2d))log((1/δ)))T/8cm^{2}d(\log^{3}{(2d)}+\log{(1/\delta)}+\log{(2d)}\sqrt{\log{(2d)}\log{(1/\delta)}})\leq T/8 for large enough TT and absolute constant cc.
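A quick empirical check of this isometry (our sketch; scalar inputs and arbitrary sizes): the eigenvalues of UU^{\top}/T concentrate around one, well inside the interval [1/2, 3/2].

import numpy as np

rng = np.random.default_rng(7)
m, d, T = 1, 5, 20000                                     # scalar inputs for simplicity
u = rng.standard_normal(T + d - 1)                        # U_1, ..., U_{T+d-1}
U = np.array([[u[d - 1 - i + l] for l in range(T)] for i in range(d)])  # row i stacks U_{d-i+l}
eig = np.linalg.eigvalsh(U @ U.T) / T
print(eig.min() >= 0.5, eig.max() <= 1.5, eig.min(), eig.max())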

Lemma 1.

Let {Ujm×1}j=1T+d\{U_{j}\in\mathbb{R}^{m\times 1}\}_{j=1}^{T+d} be independent 𝗌𝗎𝖻𝗀(1)\mathsf{subg}(1) random vectors. Define Ljk=0T2Ud+kj+1Ud+k+1L_{j}\coloneqq\sum_{k=0}^{T-2}U_{d+k-j+1}U_{d+k+1}^{\top} for all j1j\geq 1 and

Td[00000L10000L2L1000Ld100L10].\displaystyle T_{d}\coloneqq\begin{bmatrix}0&0&\ldots&0&0&0\\ L_{1}&0&\ldots&0&0&0\\ L_{2}&L_{1}&\ldots&0&0&0\\ \vdots&\vdots&\ddots&\vdots&\vdots&\vdots\\ \vdots&\vdots&\vdots&\ddots&\vdots&\vdots\\ L_{d-1}&0&\ldots&0&L_{1}&0\end{bmatrix}.

Then with probability at least 1δ1-\delta we have

\left\|T_{d}\right\|\leq cm\sqrt{Td}\log{(d)}\log{(m^{2}/\delta)}.
Proof 10.3.

Since the L_{j} are block matrices, the techniques in Meckes et al. (2007) cannot be directly applied. However, by noting that T_{d} can be broken into a sum of m matrices, where the norm of each matrix can be bounded by that of a Toeplitz matrix, we can use the result from Meckes et al. (2007). For instance, if m=2 and \{u_{i}\}_{i=1}^{\infty} are independent \mathsf{subg}(1) random variables then we have

Td=[[0000][0000][u1u2u3u4][0000][u5u6u7u8][u1u2u3u4]].\displaystyle T_{d}=\begin{bmatrix}\begin{bmatrix}0&0\\ 0&0\end{bmatrix}&\begin{bmatrix}0&0\\ 0&0\end{bmatrix}&\ldots\\ \begin{bmatrix}u_{1}&u_{2}\\ u_{3}&u_{4}\end{bmatrix}&\begin{bmatrix}0&0\\ 0&0\end{bmatrix}&\ldots\\ \begin{bmatrix}u_{5}&u_{6}\\ u_{7}&u_{8}\end{bmatrix}&\begin{bmatrix}u_{1}&u_{2}\\ u_{3}&u_{4}\end{bmatrix}&\ldots\\ \vdots&\vdots&\ddots\end{bmatrix}.

Now,

Td=[[0000][0000][u10u30][0000][u50u70][u10u30]]=M1+[[0000][0000][0u20u4][0000][0u60u8][0u20u4]]=M2,\displaystyle T_{d}=\underbrace{\begin{bmatrix}\begin{bmatrix}0&0\\ 0&0\end{bmatrix}&\begin{bmatrix}0&0\\ 0&0\end{bmatrix}&\ldots\\ \begin{bmatrix}u_{1}&0\\ u_{3}&0\end{bmatrix}&\begin{bmatrix}0&0\\ 0&0\end{bmatrix}&\ldots\\ \begin{bmatrix}u_{5}&0\\ u_{7}&0\end{bmatrix}&\begin{bmatrix}u_{1}&0\\ u_{3}&0\end{bmatrix}&\ldots\\ \vdots&\vdots&\ddots\end{bmatrix}}_{=M_{1}}+\underbrace{\begin{bmatrix}\begin{bmatrix}0&0\\ 0&0\end{bmatrix}&\begin{bmatrix}0&0\\ 0&0\end{bmatrix}&\ldots\\ \begin{bmatrix}0&u_{2}\\ 0&u_{4}\end{bmatrix}&\begin{bmatrix}0&0\\ 0&0\end{bmatrix}&\ldots\\ \begin{bmatrix}0&u_{6}\\ 0&u_{8}\end{bmatrix}&\begin{bmatrix}0&u_{2}\\ 0&u_{4}\end{bmatrix}&\ldots\\ \vdots&\vdots&\ddots\end{bmatrix}}_{=M_{2}},

then Tdsup1i2Mi||T_{d}||\leq\sup_{1\leq i\leq 2}||M_{i}||. Furthermore for each MiM_{i} we have

M1=[[0000][0000][u1000][0000][u5000][u1000]]=M11+[[0000][0000][00u30][0000][00u70][00u30]]=M12,\displaystyle M_{1}=\underbrace{\begin{bmatrix}\begin{bmatrix}0&0\\ 0&0\end{bmatrix}&\begin{bmatrix}0&0\\ 0&0\end{bmatrix}&\ldots\\ \begin{bmatrix}u_{1}&0\\ 0&0\end{bmatrix}&\begin{bmatrix}0&0\\ 0&0\end{bmatrix}&\ldots\\ \begin{bmatrix}u_{5}&0\\ 0&0\end{bmatrix}&\begin{bmatrix}u_{1}&0\\ 0&0\end{bmatrix}&\ldots\\ \vdots&\vdots&\ddots\end{bmatrix}}_{=M_{11}}+\underbrace{\begin{bmatrix}\begin{bmatrix}0&0\\ 0&0\end{bmatrix}&\begin{bmatrix}0&0\\ 0&0\end{bmatrix}&\ldots\\ \begin{bmatrix}0&0\\ u_{3}&0\end{bmatrix}&\begin{bmatrix}0&0\\ 0&0\end{bmatrix}&\ldots\\ \begin{bmatrix}0&0\\ u_{7}&0\end{bmatrix}&\begin{bmatrix}0&0\\ u_{3}&0\end{bmatrix}&\ldots\\ \vdots&\vdots&\ddots\end{bmatrix}}_{=M_{12}},

and M1M11+M12||M_{1}||\leq||M_{11}||+||M_{12}||. The key idea is to show that Mi1M_{i1} are Toeplitz matrices (after removing the zeros in the blocks) and we can use the standard techniques described in proof of Theorem 1 in Meckes et al. (2007). Then we will show that each MijC||M_{ij}||\leq C with high probability and TdmC||T_{d}||\leq mC.

For brevity, we will assume for now that the U_{i} are scalars; at the end we will scale by m. By standard techniques described in the proof of Theorem 1 in Meckes et al. (2007), the finite Toeplitz matrix T_{d}+T_{d}^{\top} is a d\times d submatrix of the infinite Laurent matrix

M=[L|jk|1|jk|<d1]j,k.M=[L_{|j-k|}\textbf{1}_{|j-k|<d-1}]_{j,k\in\mathbb{Z}}.

Consider M as an operator on \ell^{2}(\mathbb{Z}) in the canonical way, and let \psi:\ell^{2}(\mathbb{Z})\rightarrow L^{2}[0,1] denote the usual linear trigonometric isometry \psi(e_{j})(x)=e^{2\pi ijx}. Then \psi M\psi^{-1}:L^{2}\rightarrow L^{2} is the operator corresponding to

f(x)=j=(d1)d1L|j|e2πijx=L0+2j=1d1cos((2πjx))Ljf(x)=\sum_{j=-(d-1)}^{d-1}L_{|j|}e^{2\pi ijx}=L_{0}+2\sum_{j=1}^{d-1}\cos{(2\pi jx)}L_{j}

Therefore,

Td+TdM=f=sup0x1|Yx|\left\|T_{d}+T_{d}^{\top}\right\|\leq\left\|M\right\|=\left\|f\right\|_{\infty}=\sup_{0\leq x\leq 1}|Y_{x}|

where Yx=2j=1d1cos((2πjx))LjY_{x}=2\sum_{j=1}^{d-1}\cos{(2\pi jx)}L_{j}. Furthermore note that YxY_{x} has the following form

Yx=U[0c1xc2xcd1x0000c1xcd1x000000c1xcd1x000000]=CxU.Y_{x}=U^{\top}\underbrace{\begin{bmatrix}0&c^{x}_{1}&c^{x}_{2}&\ldots&c^{x}_{d-1}&0&\ldots&0\\ 0&0&c^{x}_{1}&\ldots&c^{x}_{d-1}&0&\ldots&0\\ \vdots&\vdots&\ldots&\ddots&\ldots&\ddots&\vdots&\vdots\\ \vdots&\vdots&\vdots&\vdots&\ddots&\vdots&\ddots&\vdots\\ 0&0&\ldots&0&0&c^{x}_{1}&\ldots&c^{x}_{d-1}\\ 0&0&\ldots&0&0&0&\ldots&0\\ \vdots&\vdots&\vdots&\vdots&\vdots&\vdots&\vdots&\vdots\\ \end{bmatrix}}_{=C_{x}}U. (33)

Here U=[U1U2UT+d]U=\begin{bmatrix}U_{1}\\ U_{2}\\ \vdots\\ U_{T+d}\end{bmatrix} and cjx=2cos((2πjx))c^{x}_{j}=2\cos{(2\pi jx)}. For any xx and assuming Uj𝗌𝗎𝖻𝗀(1)U_{j}\sim\mathsf{subg}(1), we have from Theorem 4

\mathbb{P}\Big{(}\absolutevalue{Y_{x}/\sqrt{Td}}\geq t\Big{)}\leq 2\exp\left(-c(t\wedge t^{2})\right) (34)

The tail behavior of Y_{x}/\sqrt{Td} is not strictly subgaussian and we need to use Theorem 6. The function \psi can be taken as in Eq. 1 of van de Geer and Lederer (2013) (equivalent up to universal constants) with L=2, with inverse

ψ1(t)=log((1+t))+log((1+t)).\psi^{-1}(t)=\sqrt{\log{(1+t)}}+\log{(1+t)}.

We have that

supt|Yt|ψY0ψ+Td01ψ1(N(ϵ/2,d))𝑑ϵ,\left\|\sup_{t}|Y_{t}|\right\|_{\psi}\leq\left\|Y_{0}\right\|_{\psi}+\sqrt{Td}\int_{0}^{1}\psi^{-1}(N(\epsilon/2,d))d\epsilon,

where d(s,t)=(YsYt)/Tdψd(s,t)=\left\|(Y_{s}-Y_{t})/\sqrt{Td}\right\|_{\psi} and N(ϵ,d)N(\epsilon,d) is the minimal number of balls of radius ϵ\epsilon needed to cover [0,1][0,1] where d(,)d(\cdot,\cdot) is the pseudometric. Since YsY_{s} has distribution as in Eq. (34), it follows that d(s,t)c|st|d(s,t)\leq c|s-t| for some absolute constant cc. Then

01ψ1(N(ϵ/2,d))𝑑ϵc\int_{0}^{1}\psi^{-1}(N(\epsilon/2,d))d\epsilon\leq c

for some universal constant c>0c>0. This ensures that supt|Yt|ψcTd\left\|\sup_{t}|Y_{t}|\right\|_{\psi}\leq c\sqrt{Td}. Since 𝔼[X]Xψ\mathbb{E}[X]\leq\left\|X\right\|_{\psi} we have that 𝔼[sup0x1|Yx|]Td\mathbb{E}[\sup_{0\leq x\leq 1}|Y_{x}|]\leq\sqrt{Td}. This implies 𝔼[Td+Td]Td\mathbb{E}[\left\|T_{d}+T_{d}^{\top}\right\|]\leq\sqrt{Td}, and using Proposition 5 we have 𝔼[Td]cTdlog((d))\mathbb{E}[\left\|T_{d}\right\|]\leq c\sqrt{Td}\log{(d)}. Furthermore, we can make a stronger statement because supt|Yt|ψcTd\left\|\sup_{t}|Y_{t}|\right\|_{\psi}\leq c\sqrt{Td} which implies that

TdcTdlog((d))log((1/δ))\left\|T_{d}\right\|\leq c\sqrt{Td}\log{(d)}\log{(1/\delta)}

with probability at least 1-\delta. Then, recalling that in the general case the L_{j} blocks of T_{d} are m\times m matrices, we scale by m and get with probability at least 1-\delta

TdcmTdlog((d))log((m2/δ))\left\|T_{d}\right\|\leq cm\sqrt{Td}\log{(d)}\log{(m^{2}/\delta)}

where the union bound is over all m^{2} block entries, each being less than c\sqrt{Td}\log{(d)}\log{(m^{2}/\delta)}.

11 Error Analysis for Theorem 5.1

For this section we assume that Ut𝗌𝗎𝖻𝗀(L2)U_{t}\sim\mathsf{subg}(L^{2}).

11.1 Proof of Theorem 2

Recall Eq. (8) and (9), i.e.,

Y~l,d+\displaystyle\tilde{Y}^{+}_{l,d} =0,d,dU~l1,d+𝒯0,dU~l,d++d,d,ld1U~ld1,ld1\displaystyle=\mathcal{H}_{0,d,d}\tilde{U}^{-}_{l-1,d}+\mathcal{T}_{0,d}\tilde{U}^{+}_{l,d}+\mathcal{H}_{d,d,l-d-1}\tilde{U}^{-}_{l-d-1,l-d-1}
+𝒪0,d,dη~l1,d+𝒯𝒪0,dη~l,d++𝒪d,d,ld1η~ld1,ld1+w~l,d+\displaystyle+\mathcal{O}_{0,d,d}\tilde{\eta}^{-}_{l-1,d}+\mathcal{T}\mathcal{O}_{0,d}\tilde{\eta}^{+}_{l,d}+\mathcal{O}_{d,d,l-d-1}\tilde{\eta}^{-}_{l-d-1,l-d-1}+\tilde{w}^{+}_{l,d} (35)

Assume for now that we have T+2dT+2d data points instead of TT. It is clear that

^0,d,d=argminl=0T1Y~l+d+1,d+U~l+d,d22=(l=0T1Y~l+d+1,d+(U~l+d,d))VT+\hat{\mathcal{H}}_{0,d,d}=\arg\min_{\mathcal{H}}\sum_{l=0}^{T-1}||\tilde{Y}^{+}_{l+d+1,d}-\mathcal{H}\tilde{U}^{-}_{l+d,d}||_{2}^{2}=\left(\sum_{l=0}^{T-1}\tilde{Y}^{+}_{l+d+1,d}\left(\tilde{U}^{-}_{l+d,d}\right)^{\top}\right)V_{T}^{+}

where

VT=l=0T1U~l+d,dU~l+d,d,V_{T}=\sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}\tilde{U}^{-\prime}_{l+d,d}, (36)

or

VT=UUV_{T}=UU^{\prime}

where

U\displaystyle U [UdUd+1UT+d1Ud1UdUT+d2U1U2UT].\displaystyle\coloneqq\begin{bmatrix}U_{d}&U_{d+1}&\ldots&U_{T+d-1}\\ U_{d-1}&U_{d}&\ldots&U_{T+d-2}\\ \vdots&\vdots&\ddots&\vdots\\ U_{1}&U_{2}&\ldots&U_{T}\end{bmatrix}.

It is shown in Theorem 10.1 that V_{T} is invertible with probability at least 1-\delta. So in our analysis we can write this as

(l=0T1U~l+d,dU~l+d,d)+=(l=0T1U~l+d,dU~l+d,d)1\left(\sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}\tilde{U}^{-\top}_{l+d,d}\right)^{+}=\left(\sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}\tilde{U}^{-\top}_{l+d,d}\right)^{-1}
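Before turning to the error decomposition, here is a hedged sketch of this least-squares step for the SISO case (our reading of the index conventions in Eqs. (8)–(9); the demo system is arbitrary): regress the stacked future outputs on the stacked past inputs.

import numpy as np

def hankel_ls(y, u, d):
    # hat H_{0,d,d}: entry (i, j) estimates the lag-(i+j+1) Markov parameter C A^(i+j) B
    T = len(y) - 2 * d - 1
    Uminus = np.array([u[l + d:l:-1] for l in range(T)]).T               # stacks U_{l+d}, ..., U_{l+1}
    Yplus = np.array([y[l + d + 1:l + 2 * d + 1] for l in range(T)]).T   # stacks Y_{l+d+1}, ..., Y_{l+2d}
    return (Yplus @ Uminus.T) @ np.linalg.pinv(Uminus @ Uminus.T)

# demo on a noisy two-tap system (illustrative only): expect a Hankel pattern [[0.5, 0.25, 0], ...]
rng = np.random.default_rng(8)
u = rng.standard_normal(3000)
y = 0.5 * np.r_[0, u[:-1]] + 0.25 * np.r_[0, 0, u[:-2]] + 0.1 * rng.standard_normal(3000)
print(np.round(hankel_ls(y, u, d=3), 2))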

From this one can conclude that

^0,d,d2\displaystyle\Big{|}\Big{|}\hat{\mathcal{H}}-\mathcal{H}_{0,d,d}\Big{|}\Big{|}_{2} =||(l=0T1U~l+d,dU~l+d,d)1(l=0T1U~l+d,dU~l+d+1,d+𝒯0,d\displaystyle=\Big{|}\Big{|}\Big{(}\sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}\tilde{U}^{-\top}_{l+d,d}\Big{)}^{-1}\Big{(}\sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}\tilde{U}^{+\top}_{l+d+1,d}\mathcal{T}_{0,d}^{\top}
+U~l+d,dU~l,ld,d,l+U~l+d,dη~l+d,d𝒪0,d,d\displaystyle+\tilde{U}^{-}_{l+d,d}\tilde{U}^{-\top}_{l,l}\mathcal{H}_{d,d,l}^{\top}+\tilde{U}^{-}_{l+d,d}\tilde{\eta}^{-\top}_{l+d,d}\mathcal{O}_{0,d,d}^{\top}
+U~l+d,dη~l+d+1,d+𝒯𝒪0,d+U~l+d,dη~l,l𝒪d,d,l+U~l+d,dw~l+d+1,d+)||2\displaystyle+\tilde{U}^{-}_{l+d,d}\tilde{\eta}^{+\top}_{l+d+1,d}\mathcal{T}\mathcal{O}^{\top}_{0,d}+\tilde{U}^{-}_{l+d,d}\tilde{\eta}^{-\top}_{l,l}\mathcal{O}_{d,d,l}^{\top}+\tilde{U}^{-}_{l+d,d}\tilde{w}^{+\top}_{l+d+1,d}\Big{)}\Big{|}\Big{|}_{2} (37)

Here, as we can observe, \tilde{U}^{-\top}_{l,l},\tilde{\eta}^{-\top}_{l,l} grow in dimension with T. Based on this we divide our error terms into two parts:

E_{1}=\Big{(}\sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}\tilde{U}^{-\top}_{l+d,d}\Big{)}^{-1}\sum_{l=0}^{T-1}\Bigg{(}\tilde{U}^{-}_{l+d,d}\tilde{U}^{-\top}_{l,l}\mathcal{H}_{d,d,l}^{\top}+\tilde{U}^{-}_{l+d,d}\tilde{\eta}^{-\top}_{l,l}\mathcal{O}_{d,d,l}^{\top}\Bigg{)} (38)

and

E_{2}=\Big{(}\sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}\tilde{U}^{-\top}_{l+d,d}\Big{)}^{-1}\sum_{l=0}^{T-1}\Bigg{(}\tilde{U}^{-}_{l+d,d}\tilde{U}^{+\top}_{l+d+1,d}\mathcal{T}_{0,d}^{\top}+\tilde{U}^{-}_{l+d,d}\tilde{\eta}^{-\top}_{l+d,d}\mathcal{O}_{0,d,d}^{\top}+\tilde{U}^{-}_{l+d,d}\tilde{\eta}^{+\top}_{l+d+1,d}\mathcal{T}\mathcal{O}^{\top}_{0,d}+\tilde{U}^{-}_{l+d,d}\tilde{w}^{+\top}_{l+d+1,d}\Bigg{)} (39)

Then the proof of Theorem 5.1 reduces to Propositions 1–3. We first analyze

VT1/2(l=0T1U~l+d,dU~l,ld,d,l)2\Big{|}\Big{|}V^{-1/2}_{T}\Big{(}\sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}\tilde{U}^{-\top}_{l,l}\mathcal{H}_{d,d,l}^{\top}\Big{)}\Big{|}\Big{|}_{2}

The analysis of VT1/2(l=0T1U~l+d,dη~l,lOd,d,l)||V^{-1/2}_{T}(\sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}\tilde{\eta}^{-\top}_{l,l}O_{d,d,l}^{\top})|| will be almost identical and will only differ in constants.

Proposition 1.

For 0<δ<10<\delta<1, we have with probability at least 12δ1-2\delta

VT1/2(l=0T1U~l+d,dU~l,ld,d,l)24σlog(1δ)+pd+m\displaystyle\Big{|}\Big{|}V^{-1/2}_{T}\Big{(}\sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}\tilde{U}^{-\top}_{l,l}\mathcal{H}_{d,d,l}^{\top}\Big{)}\Big{|}\Big{|}_{2}\leq 4\sigma\sqrt{\log{\frac{1}{\delta}}+pd+m}

where σ=σ(k=1d𝒯d+k,T𝒯d+k,T)\sigma=\sqrt{\sigma(\sum_{k=1}^{d}\mathcal{T}_{d+k,T}^{\top}\mathcal{T}_{d+k,T})}.

Proof 11.1.

We proved that TI2VT3TI2\frac{TI}{2}\preceq V_{T}\preceq\frac{3TI}{2} with high probability, then

(VT1/2(l=0T1U~l+d,dU~l,ld,d,l)2a,TI2VT3TI2)\displaystyle\mathbb{P}\Big{(}\Big{|}\Big{|}V^{-1/2}_{T}\Big{(}\sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}\tilde{U}^{-\prime}_{l,l}\mathcal{H}_{d,d,l}^{\prime}\Big{)}\Big{|}\Big{|}_{2}\geq a,\frac{TI}{2}\preceq V_{T}\preceq\frac{3TI}{2}\Big{)}
(2T(l=0T1U~l+d,dU~l,ld,d,l)2a,TI2VT3TI2)\displaystyle\leq\mathbb{P}\Big{(}\Big{|}\Big{|}\sqrt{\frac{2}{T}}\Big{(}\sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}\tilde{U}^{-\prime}_{l,l}\mathcal{H}_{d,d,l}^{\prime}\Big{)}\Big{|}\Big{|}_{2}\geq a,\frac{TI}{2}\preceq V_{T}\preceq\frac{3TI}{2}\Big{)}
(2supv𝒩122T(l=0T1U~l+d,dU~l,ld,d,lv)2a)+(TI2VT3TI2)1\displaystyle\leq\mathbb{P}\Big{(}2\sup_{v\in{\mathcal{N}}_{\frac{1}{2}}}\Big{|}\Big{|}\sqrt{\frac{2}{T}}\Big{(}\sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}\tilde{U}^{-\prime}_{l,l}\mathcal{H}_{d,d,l}^{\prime}v\Big{)}\Big{|}\Big{|}_{2}\geq a\Big{)}+\mathbb{P}\Big{(}\frac{TI}{2}\preceq V_{T}\preceq\frac{3TI}{2}\Big{)}-1
5pd(22T(l=0T1U~l+d,dU~l,ld,d,lv)2a)δ.\displaystyle\leq 5^{pd}\mathbb{P}\Big{(}2\Big{|}\Big{|}\sqrt{\frac{2}{T}}\Big{(}\sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}\tilde{U}^{-\prime}_{l,l}\mathcal{H}_{d,d,l}^{\prime}v\Big{)}\Big{|}\Big{|}_{2}\geq a\Big{)}-\delta. (40)

Define the following ηl,d=U~l,ld,d,lv,Xl,d=2TU~l+d,d\eta_{l,d}=\tilde{U}^{-\top}_{l,l}\mathcal{H}_{d,d,l}^{\top}v,X_{l,d}=\sqrt{\frac{2}{T}}\tilde{U}^{-}_{l+d,d}. Observe that ηl,d,ηl+1,d\eta_{l,d},\eta_{l+1,d} have contributions from Ul1,Ul2U_{l-1},U_{l-2} etc. and do not immediately satisfy the conditions of Theorem 3. Instead we will use the fact that Xi,dX_{i,d} is independent of UjU_{j} for all jij\leq i.

VT1/2(l=0T1U~l+d,dU~l,ld,d,l)2\displaystyle\Big{|}\Big{|}V^{-1/2}_{T}\Big{(}\sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}\tilde{U}^{-\prime}_{l,l}\mathcal{H}_{d,d,l}^{\prime}\Big{)}\Big{|}\Big{|}_{2} 2supv𝒩122Tl=0T1U~l+d,dU~l,ld,d,lv\displaystyle\leq 2\sup_{v\in{\mathcal{N}}_{\frac{1}{2}}}{||\sqrt{\frac{2}{T}}\sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}\tilde{U}^{-\prime}_{l,l}\mathcal{H}_{d,d,l}^{\prime}v||}
2supv𝒩12l=0T1Xl,dηl,d.\displaystyle\leq 2\sup_{v\in{\mathcal{N}}_{\frac{1}{2}}}{||\sum_{l=0}^{T-1}X_{l,d}\eta_{l,d}||}.

Define d,d,lv=[β1,β2,,βl]\mathcal{H}_{d,d,l}^{\top}v=[\beta_{1}^{\top},\beta_{2}^{\top},\ldots,\beta_{l}^{\top}]^{\top}. βi\beta_{i} are m×1m\times 1 vectors when LTI system is MIMO. Then ηl,d=k=0l1Ulkβk+1\eta_{l,d}=\sum_{k=0}^{l-1}U^{\top}_{l-k}\beta_{k+1}. Let αl=Xl,d\alpha_{l}={X_{l,d}}. Then consider the matrix

T×mT=[β100β2β10βTβT1β1].\displaystyle\mathcal{B}_{T\times mT}=\begin{bmatrix}\beta_{1}^{\top}&0&0&\ldots\\ \beta^{\top}_{2}&\beta^{\top}_{1}&0&\ldots\\ \vdots&\vdots&\ddots&\vdots\\ \beta_{T}^{\top}&\beta_{T-1}^{\top}&\ldots&\beta^{\top}_{1}\end{bmatrix}.

Observe that ||\mathcal{B}_{T\times mT}||_{2}=\sqrt{\sigma(\sum_{k=1}^{d}\mathcal{T}_{d+k,T}^{\top}\mathcal{T}_{d+k,T})}\leq\sqrt{d}||\mathcal{T}_{d,\infty}||_{2}<\infty, which follows from Lemma 8. Then

l=0T1Xl,dηl,d\displaystyle\sum_{l=0}^{T-1}X_{l,d}\eta_{l,d} =[α1,,αT][U1U2UT]\displaystyle=[\alpha_{1},\ldots,\alpha_{T}]\mathcal{B}\begin{bmatrix}U_{1}\\ U_{2}\\ \vdots\\ U_{T}\end{bmatrix}
=[k=1Tαkβk,k=2Tαkβk1,,αTβ1][U1U2UT]\displaystyle=[\sum_{k=1}^{T}\alpha_{k}\beta^{\top}_{k},\sum_{k=2}^{T}\alpha_{k}\beta^{\top}_{k-1},\ldots,\alpha_{T}\beta^{\top}_{1}]\begin{bmatrix}U_{1}\\ U_{2}\\ \vdots\\ U_{T}\end{bmatrix}
=j=1T(k=jTαkβkUj).\displaystyle=\sum_{j=1}^{T}\Big{(}\sum_{k=j}^{T}\alpha_{k}\beta^{\top}_{k}U_{j}\Big{)}.

Here αi=Xi,d\alpha_{i}=X_{i,d} and recall that Xi,dX_{i,d} is independent of UjU_{j} for all iji\geq j. Let γ=α\gamma^{\prime}=\alpha^{\prime}\mathcal{B}. Define 𝒢T+dk=σ~({Uk+1,Uk+2,,UT+d})\mathcal{G}_{T+d-k}=\tilde{\sigma}(\{U_{k+1},U_{k+2},\ldots,U_{T+d}\}) where σ~(A)\tilde{\sigma}(A) is the sigma algebra containing the set AA with 𝒢0=ϕ\mathcal{G}_{0}=\phi. Then 𝒢k1𝒢k\mathcal{G}_{k-1}\subset\mathcal{G}_{k}. Furthermore, since γj1,Uj\gamma_{j-1},U_{j} are 𝒢T+d+1j\mathcal{G}_{T+d+1-j} measurable and UjU_{j} is conditionally (on 𝒢T+dj\mathcal{G}_{T+d-j}) subGaussian, we can use Theorem 3 on γU=αU\gamma^{\prime}U=\alpha^{\prime}\mathcal{B}U (where γj=XT+dj,Uj=ηT+dj+1\gamma_{j}=X_{T+d-j},U_{j}=\eta_{T+d-j+1} in the notation of Theorem 3). Then with probability at least 1δ1-\delta we have

(αα+V)1/2γUL(log(1δ)+log(det(αα+V)det(V))).\Big{|}\Big{|}\Big{(}\alpha^{\prime}\mathcal{B}\mathcal{B}^{\prime}\alpha+V\Big{)}^{-1/2}\gamma^{\prime}U\Big{|}\Big{|}\leq L\sqrt{\Big{(}\log{\frac{1}{\delta}}+\log{\frac{\det(\alpha^{\prime}\mathcal{B}\mathcal{B}^{\prime}\alpha+V)}{\det(V)}}\Big{)}}. (41)

This holds for any fixed V\succ 0. With probability at least 1-\delta, we know from Theorem 10.1 that \alpha^{\prime}\alpha\preceq\frac{3I}{2}\implies\alpha^{\prime}\mathcal{B}\mathcal{B}^{\prime}\alpha\preceq\frac{3\sigma_{1}^{2}(\mathcal{B})I}{2}. By combining this event with the event in Eq. (41) and setting V=\frac{3\sigma_{1}^{2}(\mathcal{B})I}{2}, we get with probability at least 1-2\delta that

αU2=γU23σ1()L(log(1δ)+pdlog(3)+m).\displaystyle||\alpha^{\prime}\mathcal{B}U||_{2}=||\gamma^{\prime}U||_{2}\leq\sqrt{3}\sigma_{1}(\mathcal{B})L\sqrt{\Big{(}\log{\frac{1}{\delta}}+pd\log{3}+m\Big{)}}. (42)

Replacing δ5pdδ2\delta\rightarrow 5^{-pd}\frac{\delta}{2}, we get from Eq. (40)

VT1/2(l=0T1U~l+d,dU~l,ld,d,l)26log((5))Lσ1()log(1δ)+pd+m\Big{|}\Big{|}V^{-1/2}_{T}\Big{(}\sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}\tilde{U}^{-\prime}_{l,l}\mathcal{H}_{d,d,l}^{\prime}\Big{)}\Big{|}\Big{|}_{2}\leq\sqrt{6}\log{(5)}L\sigma_{1}(\mathcal{B})\sqrt{\log{\frac{1}{\delta}}+pd+m}

with probability at least 1δ1-\delta. Since L=1L=1 we get our desired result.

Then similar to Proposition 1, we analyze VT1/2(l=0T1U~l+d,dU~l+d+1,d+𝒯0,d)2\Big{|}\Big{|}V^{-1/2}_{T}\Big{(}\sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}\tilde{U}^{+\top}_{l+d+1,d}\mathcal{T}_{0,d}^{\top}\Big{)}\Big{|}\Big{|}_{2}

Proposition 2.

For 0<δ<10<\delta<1 and large enough TT, we have with probability at least 1δ1-\delta

VT1/2(l=0T1U~l+d,dU~l+d+1,d+𝒯0,d)24σlog(1δ)+pd+m\displaystyle\Big{|}\Big{|}V^{-1/2}_{T}\Big{(}\sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}\tilde{U}^{+\top}_{l+d+1,d}\mathcal{T}_{0,d}^{\top}\Big{)}\Big{|}\Big{|}_{2}\leq 4\sigma\sqrt{\log{\frac{1}{\delta}}+pd+m}

where

σsupv2=1[vCAdBvCAd1BvCAd2BvCB00000vCAdBvCAd1BvCB]2j=0dCAjB2βd.\sigma\leq\sup_{||v||_{2}=1}\Big{|}\Big{|}\begin{bmatrix}v^{\top}CA^{d}B&v^{\top}CA^{d-1}B&v^{\top}CA^{d-2}B&\ldots&v^{\top}CB&0\\ 0&\ddots&\ddots&\ddots&\ddots&0\\ 0&\ddots&\ddots&\ddots&\ddots&\ddots\\ 0&v^{\top}CA^{d}B&v^{\top}CA^{d-1}B&\ldots&\ldots&v^{\top}CB\end{bmatrix}\Big{|}\Big{|}_{2}\leq\sum_{j=0}^{d}||CA^{j}B||_{2}\leq\beta\sqrt{d}.
Proof 11.2.

Note that \Big{|}\Big{|}V^{-1/2}_{T}\Big{(}\sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}\tilde{U}^{+\top}_{l+d+1,d}\mathcal{T}_{0,d}^{\top}\Big{)}\Big{|}\Big{|}_{2}\leq\Big{|}\Big{|}\sqrt{\frac{2}{T}}\Big{(}\sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}\tilde{U}^{+\top}_{l+d+1,d}\mathcal{T}_{0,d}^{\top}\Big{)}\Big{|}\Big{|}_{2} with probability at least 1-\delta for large enough T. Here \mathcal{T}_{0,d}^{\top} is an md\times pd matrix. Then define X_{l}=\sqrt{\frac{2}{T}}\tilde{U}^{-}_{l+d,d} and the vector M_{l}\in\mathbb{R}^{pd} as M_{l}^{\top}=\tilde{U}^{+\top}_{l+d+1,d}\mathcal{T}_{0,d}^{\top}. Then

(l=0T1XlMl2t)\displaystyle\mathbb{P}(||\sum_{l=0}^{T-1}X_{l}M_{l}^{\top}||_{2}\geq t) 12net5pd(l=0T1XlMlv2t/2)\displaystyle\underbrace{\leq}_{\frac{1}{2}-\text{net}}5^{pd}\mathbb{P}(||\sum_{l=0}^{T-1}X_{l}M_{l}^{\top}v||_{2}\geq t/2)

where MlvM_{l}^{\top}v is a real value. Let β𝒯0,dv\beta\coloneqq\mathcal{T}_{0,d}^{\top}v, then Mlv=U~l+d+1,d+βM_{l}^{\top}v=\tilde{U}^{+\top}_{l+d+1,d}\beta. This allows us to write XlMlvX_{l}M_{l}^{\top}v in a form that will enable us to apply Theorem 3.

l=0T1XlMlv\displaystyle\sum_{l=0}^{T-1}X_{l}M_{l}^{\top}v =[X0,X1,,XT1]=X[β1β2βd00β10000β1βd]=[Ud+1Ud+2UT+2d]=N\displaystyle=\underbrace{[X_{0},X_{1},\ldots,X_{T-1}]}_{=X}\underbrace{\begin{bmatrix}\beta_{1}^{\top}&\beta_{2}^{\top}&\ldots&\beta_{d}^{\top}&\ldots&0\\ 0&\beta_{1}^{\top}&\ddots&\ddots&\ddots&0\\ 0&\ddots&\ddots&\ddots&\ddots&\ddots\\ 0&\ldots&0&\beta_{1}^{\top}&\ldots&\beta_{d}^{\top}\end{bmatrix}}_{=\mathcal{I}}\underbrace{\begin{bmatrix}U_{d+1}\\ U_{d+2}\\ \vdots\\ U_{T+2d}\end{bmatrix}}_{=N} (43)

Here \mathcal{I}\in\mathbb{R}^{T\times(mT+md)}. It is known from Theorem 10.1 that XX^{\top}\preceq\frac{3I}{2} with high probability and consequently X\mathcal{I}\mathcal{I}^{\top}X^{\top}\preceq\frac{3\sigma^{2}_{1}(\mathcal{I})I}{2}. Define \bm{\mathcal{F}}_{l}=\tilde{\sigma}(\{U_{j}\}_{j=1}^{d+l}) as the sigma field generated by \{U_{j}\}_{j=1}^{d+l}. Furthermore, N_{l} is \bm{\mathcal{F}}_{l} measurable and [X\mathcal{I}]_{l} is \bm{\mathcal{F}}_{l-1} measurable, so we can apply Theorem 3. The rest of the proof is similar to that of Proposition 1. Following the same steps as before we get with probability at least 1-\delta

l=0T1XlMlv2=l=0T1[X]lNl23σ1()Llog(1δ)+pdlog(3)+m\displaystyle||\sum_{l=0}^{T-1}X_{l}M_{l}^{\top}v||_{2}=||\sum_{l=0}^{T-1}[X\mathcal{I}]_{l}N_{l}||_{2}\leq\sqrt{3}\sigma_{1}(\mathcal{I})L\sqrt{\log{\frac{1}{\delta}}+pd\log{3}+m}

and substituting δ5pdδ\delta\rightarrow 5^{-pd}\delta we get

l=0T1XlMl26log((5))σ1()Llog(1δ)+pd+m||\sum_{l=0}^{T-1}X_{l}M_{l}^{\top}||_{2}\leq\sqrt{6}\log{(5)}\sigma_{1}(\mathcal{I})L\sqrt{\log{\frac{1}{\delta}}+pd+m}

and

l=0T1XlMl24σ1()Llog(1δ)+pd+m.||\sum_{l=0}^{T-1}X_{l}M_{l}||_{2}\leq 4\sigma_{1}(\mathcal{I})L\sqrt{\log{\frac{1}{\delta}}+pd+m}. (44)

The proof for the noise and covariate cross terms is almost identical to that of Proposition 2, but easier because of independence. Finally, note that \sigma_{1}(\mathcal{I})\leq\sqrt{\sum_{i=1}^{d}\|\beta_{i}\|^{2}_{2}}\sqrt{d}=\sqrt{\|\mathcal{T}_{0,d}^{\top}v\|_{2}^{2}}\sqrt{d}\leq\beta\sqrt{d}.
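As a quick numerical sanity check (our own construction, not part of the proof), the bound \sigma_{1}(\mathcal{I})\leq\sqrt{\sum_{i=1}^{d}\|\beta_{i}\|_{2}^{2}}\sqrt{d} for the banded block-Toeplitz matrix \mathcal{I} of Eq. (43) can be verified directly; the helper names below are ours and purely illustrative.

import numpy as np

def banded_block_toeplitz(beta_blocks, T):
    # Row l (0-indexed) carries beta_1^T, ..., beta_d^T starting at block column l,
    # matching the structure of the matrix I in Eq. (43).
    d = len(beta_blocks)
    m = beta_blocks[0].size
    mat = np.zeros((T, m * (T + d)))
    for l in range(T):
        for j, b in enumerate(beta_blocks):
            mat[l, m * (l + j): m * (l + j + 1)] = b
    return mat

rng = np.random.default_rng(0)
d, m, T = 4, 3, 50
beta = [rng.standard_normal(m) for _ in range(d)]
I_mat = banded_block_toeplitz(beta, T)

lhs = np.linalg.norm(I_mat, 2)                                    # sigma_1(I)
rhs = np.sqrt(sum(np.linalg.norm(b) ** 2 for b in beta)) * np.sqrt(d)
print(lhs <= rhs + 1e-10)                                         # True

The check only illustrates the Cauchy-Schwarz step; the constant \beta in the paper comes from the norm of \mathcal{T}_{0,d}^{\top}v.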

Proposition 3.

For 0<δ<10<\delta<1, we have with probability at least 1δ1-\delta

\displaystyle\Big{|}\Big{|}V^{-1/2}_{T}\Big{(}\sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}\tilde{\eta}^{+\prime}_{l+1+d,d}\mathcal{T}\mathcal{O}^{\prime}_{0,d}\Big{)}\Big{|}\Big{|}_{2} \leq 4\sigma_{A}\sqrt{\log{\frac{1}{\delta}}+pd+m}
\displaystyle\Big{|}\Big{|}V^{-1/2}_{T}\Big{(}\sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}\tilde{\eta}^{-\prime}_{l,l}\mathcal{O}^{\prime}_{d,d,l}\Big{)}\Big{|}\Big{|}_{2} \leq 4\sigma_{B}\sqrt{\log{\frac{1}{\delta}}+pd+m}
\displaystyle\Big{|}\Big{|}V^{-1/2}_{T}\Big{(}\sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}\tilde{\eta}^{-\prime}_{l+d,d}\mathcal{O}^{\prime}_{0,d,d}\Big{)}\Big{|}\Big{|}_{2} \leq 4\sigma_{C}\sqrt{\log{\frac{1}{\delta}}+pd+m}
\displaystyle\Big{|}\Big{|}V^{-1/2}_{T}\Big{(}\sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}\tilde{w}^{+\prime}_{l+1+d,d}\Big{)}\Big{|}\Big{|}_{2} \leq 4\sigma_{D}\sqrt{\log{\frac{1}{\delta}}+pd+m}

Here σ=max(σA,σB,σC,σD)\sigma=\max{(\sigma_{A},\sigma_{B},\sigma_{C},\sigma_{D})} where

σAσCsupv2=1[vCAdvCAd1vCAd200000vCAdvC]2j=0dCAj2βRd\sigma_{A}\vee\sigma_{C}\leq\sup_{||v||_{2}=1}\Big{|}\Big{|}\begin{bmatrix}v^{\top}CA^{d}&v^{\top}CA^{d-1}&v^{\top}CA^{d-2}&\ldots&0\\ 0&\ddots&\ddots&\ddots&0\\ 0&\ddots&\ddots&\ddots&\ddots\\ 0&\ldots&v^{\top}CA^{d}&\ldots&v^{\top}C\end{bmatrix}\Big{|}\Big{|}_{2}\leq\sum_{j=0}^{d}||CA^{j}||_{2}\leq\beta R\sqrt{d}

σB=σ(k=1d𝒯𝒪d+k,T𝒯𝒪d+k,T)βRd,σDc\sigma_{B}=\sqrt{\sigma(\sum_{k=1}^{d}\mathcal{T}\mathcal{O}_{d+k,T}^{\top}\mathcal{T}\mathcal{O}_{d+k,T})}\leq\beta R\sqrt{d},\sigma_{D}\leq c.

By taking the intersection of all the aforementioned events for a fixed δ\delta we then have with probability at least 1δ1-\delta

^0,d,d0,d,d2\displaystyle\Big{|}\Big{|}\hat{\mathcal{H}}_{0,d,d}-\mathcal{H}_{0,d,d}\Big{|}\Big{|}_{2} 16σ1Tm+pd+log(dδ)\displaystyle\leq 16\sigma\sqrt{\frac{1}{T}}\sqrt{m+pd+\log{\frac{d}{\delta}}}

12 Subspace Perturbation Results

In this section we present variants of the famous Wedin’s theorem (Section 3 of Wedin (1972)) that depend on the distribution of Hankel singular values. These will be “sign free” generalizations of the gap-free Wedin theorem from Allen-Zhu and Li (2016). First we define the Hermitian dilation of a matrix.

(S)=[0SS0]\displaystyle\mathcal{H}(S)=\begin{bmatrix}0&S\\ S^{\prime}&0\end{bmatrix}

The Hermitian dilation has the property that S1S2ϵ(S1)(S2)ϵ||S_{1}-S_{2}||\leq\epsilon\Longleftrightarrow||\mathcal{H}(S_{1})-\mathcal{H}(S_{2})||\leq\epsilon. Hermitian dilations will be useful in applying Wedin’s theorem for general (not symmetric) matrices.
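For concreteness, here is a small numerical sketch (our own, purely illustrative) of the dilation and the two facts used below: the norm identity and the fact that the eigenvalues of \mathcal{H}(S) are plus/minus the singular values of S.

import numpy as np

def hermitian_dilation(S):
    # H(S) = [[0, S], [S', 0]]; works for rectangular S as well.
    m, n = S.shape
    return np.block([[np.zeros((m, m)), S],
                     [S.T, np.zeros((n, n))]])

rng = np.random.default_rng(0)
S1 = rng.standard_normal((4, 3))
S2 = S1 + 1e-3 * rng.standard_normal((4, 3))

# Spectral norms of the difference agree before and after dilation.
print(np.isclose(np.linalg.norm(S1 - S2, 2),
                 np.linalg.norm(hermitian_dilation(S1) - hermitian_dilation(S2), 2)))

# Eigenvalues of H(S1) are +/- the singular values of S1 (plus zeros).
sv = np.linalg.svd(S1, compute_uv=False)
ev = np.linalg.eigvalsh(hermitian_dilation(S1))
print(np.allclose(np.sort(np.concatenate([sv, -sv, np.zeros(1)])), np.sort(ev)))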

Proposition 1.

Let S,S^S,\hat{S} be symmetric matrices and SS^ϵ||S-\hat{S}||\leq\epsilon. Further, let vj,v^jv_{j},\hat{v}_{j} correspond to the jthj^{th} eigenvector of S,S^S,\hat{S} respectively such that λ1λ2λn\lambda_{1}\geq\lambda_{2}\geq\ldots\geq\lambda_{n} and λ^1λ^2λ^n\hat{\lambda}_{1}\geq\hat{\lambda}_{2}\geq\ldots\geq\hat{\lambda}_{n}. Then we have

|vj,v^k|ϵ|λjλ^k||\langle v_{j},\hat{v}_{k}\rangle|\leq\frac{\epsilon}{{|\lambda_{j}-\hat{\lambda}_{k}|}} (45)

if either λj\lambda_{j} or λ^k\hat{\lambda}_{k} is not zero.

Proof 12.1.

Let S=λjvjvj+VΛjVS=\lambda_{j}v_{j}v_{j}^{\prime}+V\Lambda_{-j}V^{\prime} and S^=λ^kv^kv^k+V^Λ^kV^\hat{S}=\hat{\lambda}_{k}\hat{v}_{k}\hat{v}_{k}^{\prime}+\hat{V}\hat{\Lambda}_{-k}\hat{V}^{\prime}, wlog assume |λj||λ^k||\lambda_{j}|\leq|\hat{\lambda}_{k}|. Define R=SS^R=S-\hat{S}

S\displaystyle S =S^+R\displaystyle=\hat{S}+R
vjSv^k\displaystyle v_{j}^{\prime}S\hat{v}_{k} =vjS^v^k+vjRv^k\displaystyle=v_{j}^{\prime}\hat{S}\hat{v}_{k}+v_{j}^{\prime}R\hat{v}_{k}

Since v_{j},\hat{v}_{k} are eigenvectors of S and \hat{S} respectively, we have

λjvjv^k\displaystyle\lambda_{j}v_{j}^{\prime}\hat{v}_{k} =λ^kvjv^k+vjRv^k\displaystyle=\hat{\lambda}_{k}v_{j}^{\prime}\hat{v}_{k}+v_{j}^{\prime}R\hat{v}_{k}
|λjλ^k||vjv^k|\displaystyle|\lambda_{j}-\hat{\lambda}_{k}||v_{j}^{\prime}\hat{v}_{k}| ϵ\displaystyle\leq\epsilon

Proposition 1 gives an eigenvector-wise version of Wedin’s theorem. Next, we show how to extend these results to arbitrary subsets of eigenvectors.

Proposition 2.

For ϵ>0\epsilon>0, let S,PS,P be two symmetric matrices such that SP2ϵ||S-P||_{2}\leq\epsilon. Let

S=UΣSU,P=VΣPVS=U\Sigma^{S}U^{\top},P=V\Sigma^{P}V^{\top}

Let V_{+} correspond to the eigenvectors whose singular values are \geq\beta, V_{-} to those whose singular values are \leq\alpha, and \bar{V} to the remaining ones. Define a similar partition for S, and let \alpha<\beta.

UV+\displaystyle||U_{-}^{\top}V_{+}|| ϵβα\displaystyle\leq\frac{\epsilon}{\beta-\alpha}
Proof 12.2.

The proof is similar to before. S,PS,P have a spectral decomposition of the form

S\displaystyle S =U+Σ+SU++UΣSU+U¯ΣS0U¯\displaystyle=U_{+}\Sigma^{S}_{+}U_{+}^{\prime}+U_{-}\Sigma^{S}_{-}U_{-}^{\prime}+\bar{U}{\Sigma^{S}}_{0}\bar{U}^{\prime}
P\displaystyle P =V+Σ+PV++VΣPV+V¯ΣP0V¯\displaystyle=V_{+}\Sigma^{P}_{+}V_{+}^{\prime}+V_{-}\Sigma^{P}_{-}V_{-}^{\prime}+\bar{V}{\Sigma^{P}}_{0}\bar{V}^{\prime}

Let R=SPR=S-P and since U+U_{+} is orthogonal to U,U¯U_{-},\bar{U} and similarly for VV

US\displaystyle U_{-}^{\prime}S =ΣSU=UP+UR\displaystyle=\Sigma^{S}_{-}U_{-}^{\prime}=U_{-}^{\prime}P+U_{-}^{\prime}R
ΣSUV+\displaystyle\Sigma^{S}_{-}U_{-}^{\prime}V_{+} =UV+Σ+P+URV+\displaystyle=U_{-}^{\prime}V_{+}\Sigma^{P}_{+}+U_{-}^{\prime}RV_{+}

Multiplying both sides on the right by (\Sigma^{P}_{+})^{-1},

ΣSUV+(Σ+P)1\displaystyle\Sigma^{S}_{-}U_{-}^{\prime}V_{+}(\Sigma^{P}_{+})^{-1} =UV++URV+(Σ+P)1\displaystyle=U_{-}^{\prime}V_{+}+U_{-}^{\prime}RV_{+}(\Sigma^{P}_{+})^{-1}
ΣSUV+(Σ+P)1\displaystyle||\Sigma^{S}_{-}U_{-}^{\prime}V_{+}(\Sigma^{P}_{+})^{-1}|| UV+URV+(Σ+P)1\displaystyle\geq||U_{-}^{\prime}V_{+}||-||U_{-}^{\prime}RV_{+}(\Sigma^{P}_{+})^{-1}||
αβUV+\displaystyle\frac{\alpha}{\beta}||U_{-}^{\prime}V_{+}|| UV+ϵβ\displaystyle\geq||U_{-}^{\prime}V_{+}||-\frac{\epsilon}{\beta}
UV+\displaystyle||U_{-}^{\prime}V_{+}|| ϵβα\displaystyle\leq\frac{\epsilon}{\beta-\alpha}

Let Sk,PkS_{k},P_{k} be the best rank kk approximations of S,PS,P respectively. We develop a sequence of results to see how SkPk||S_{k}-P_{k}|| varies when SPϵ||S-P||\leq\epsilon as a function of kk.

Proposition 3.

Let S,PS,P be such that

SPϵ||S-P||\leq\epsilon

Furthermore, let ϵ\epsilon be such that

ϵinf{1ir1}{s+1in}(σi(P)σi+1(P)2)\epsilon\leq\inf_{\{1\leq i\leq r-1\}\cup\{s+1\leq i\leq n\}}\Big{(}\frac{\sigma_{i}(P)-\sigma_{i+1}(P)}{2}\Big{)} (46)

and UjS,VjSU_{j}^{S},V^{S}_{j} be the left and right singular vectors of SS corresponding to σj(S)\sigma_{j}(S). There exists a unitary transformation QQ such that

σmax([UrP,,UsP]Q[UrS,,UsS])\displaystyle\sigma_{\max}([U_{r}^{P},\ldots,U_{s}^{P}]Q-[U_{r}^{S},\ldots,U_{s}^{S}]) 2ϵmin(σr1(P)σr(S),σs(S)σs+1(P))\displaystyle\leq\frac{2\epsilon}{\min{\Big{(}\sigma_{r-1}(P)-\sigma_{r}(S),\sigma_{s}(S)-\sigma_{s+1}(P)\Big{)}}}
σmax([VrP,,VsP]Q[VrS,,VsS])\displaystyle\sigma_{\max}([V_{r}^{P},\ldots,V_{s}^{P}]Q-[V_{r}^{S},\ldots,V_{s}^{S}]) 2ϵmin(σr1(P)σr(S),σs(S)σs+1(P)).\displaystyle\leq\frac{2\epsilon}{\min{\Big{(}\sigma_{r-1}(P)-\sigma_{r}(S),\sigma_{s}(S)-\sigma_{s+1}(P)\Big{)}}}.
Proof 12.3.

Let r\leq k\leq s. First divide the indices [1,n] into 3 parts: K_{1}=[1,r-1], K_{2}=[r,s], K_{3}=[s+1,n]. Although we focus on only three groups, the extension to the general case is a straightforward extension of this proof. Define the Hermitian dilation of S,P as \mathcal{H}(S),\mathcal{H}(P) respectively. Then we know that the eigenvalues of \mathcal{H}(S) are

i=1n{σi(S),σi(S)}\cup_{i=1}^{n}\{\sigma_{i}(S),-\sigma_{i}(S)\}

Further the eigenvectors corresponding to these are

i=1n{12[uiSviS],12[uiSviS]}\cup_{i=1}^{n}\Bigg{\{}\frac{1}{\sqrt{2}}\begin{bmatrix}u^{S}_{i}\\ v^{S}_{i}\end{bmatrix},\frac{1}{\sqrt{2}}\begin{bmatrix}u^{S}_{i}\\ -v^{S}_{i}\end{bmatrix}\Bigg{\}}

Similarly define the respective quantities for (P)\mathcal{H}(P). Now clearly, (S)(P)ϵ||\mathcal{H}(S)-\mathcal{H}(P)||\leq\epsilon since SPϵ||S-P||\leq\epsilon. Then by Weyl’s inequality we have that

|σi(S)σi(P)|ϵ|\sigma_{i}(S)-\sigma_{i}(P)|\leq\epsilon

Now we can use Proposition 1. To ease notation, define \lambda_{i}(\mathcal{H}(S))=\sigma_{i}(S) and \lambda_{-i}(\mathcal{H}(S))=-\sigma_{i}(S), and let the corresponding eigenvectors be a_{i},a_{-i} for S and b_{i},b_{-i} for P respectively. Note that we can assume \langle a_{i},b_{i}\rangle\geq 0 for every i. This does not change any of our results because a_{i},b_{i} are just stackings of left and right singular vectors, and u_{i}v_{i}^{\top} is identical for (u_{i},v_{i}) and (-u_{i},-v_{i}).

Then using Proposition 1 we get for every (i,j)K2×K2(i,j)\not\in K_{2}\times K_{2} and iji\neq j

|ai,bj|ϵ|σi(S)σj(P)||\langle a_{i},b_{j}\rangle|\leq\frac{\epsilon}{|\sigma_{i}(S)-\sigma_{j}(P)|} (47)

similarly

|ai,bj|ϵ|σi(S)+σj(P)||\langle a_{-i},b_{j}\rangle|\leq\frac{\epsilon}{|\sigma_{i}(S)+\sigma_{j}(P)|} (48)

Since

a_{i}=\frac{1}{\sqrt{2}}\begin{bmatrix}u^{S}_{i}\\ v^{S}_{i}\end{bmatrix},\quad a_{-i}=\frac{1}{\sqrt{2}}\begin{bmatrix}u^{S}_{i}\\ -v^{S}_{i}\end{bmatrix},\quad b_{j}=\frac{1}{\sqrt{2}}\begin{bmatrix}u^{P}_{j}\\ v^{P}_{j}\end{bmatrix}

and σi(S),σi(P)0\sigma_{i}(S),\sigma_{i}(P)\geq 0 we have by adding Eq. (47),(48) that

max(|uiS,ujP|,|viS,vjP|)ϵ|σi(S)σj(P)|\max{\Big{(}|\langle u^{S}_{i},u^{P}_{j}\rangle|,|\langle v^{S}_{i},v^{P}_{j}\rangle|\Big{)}}\leq\frac{\epsilon}{|\sigma_{i}(S)-\sigma_{j}(P)|}

Define U^{S}_{K_{i}} to be the matrix formed by the orthonormal vectors \{a_{j}\}_{j\in K_{i}} and U^{S}_{K_{-i}} to be the matrix formed by the orthonormal vectors \{a_{j}\}_{j\in-K_{i}}. Define similar quantities for P. Then

(UK2S)UK2P(UK2P)UK2S=(UK2S)(Ij2UKjP(UKjP))UK2S\displaystyle(U^{S}_{K_{2}})^{\top}U^{P}_{K_{2}}(U^{P}_{K_{2}})^{\top}U^{S}_{K_{2}}=(U^{S}_{K_{2}})^{\top}(I-\sum_{j\neq 2}U^{P}_{K_{j}}(U^{P}_{K_{j}})^{\top})U^{S}_{K_{2}}
=(UK2S)(I|j|2UKjP(UKjP)UK2P(UK2P))UK2S\displaystyle=(U^{S}_{K_{2}})^{\top}(I-\sum_{|j|\neq 2}U^{P}_{K_{j}}(U^{P}_{K_{j}})^{\top}-U^{P}_{K_{-2}}(U^{P}_{K_{-2}})^{\top})U^{S}_{K_{2}}
=I(UK2S)|j|2UKjP(UKjP)UK2S(UK2S)UK2P(UK2P)UK2S\displaystyle=I-(U^{S}_{K_{2}})^{\top}\sum_{|j|\neq 2}U^{P}_{K_{j}}(U^{P}_{K_{j}})^{\top}U^{S}_{K_{2}}-(U^{S}_{K_{2}})^{\top}U^{P}_{K_{-2}}(U^{P}_{K_{-2}})^{\top}U^{S}_{K_{2}} (49)

Now K_{1},K_{-1} correspond to eigenvectors whose singular values are \geq\sigma_{r-1}(P), and K_{3},K_{-3} correspond to eigenvectors whose singular values are \leq\sigma_{s+1}(P). We are in a position to use Proposition 2. Using it on Eq. (49) we get the following relation

(UK2P)UK2S(UK2S)UK2P\displaystyle(U^{P}_{K_{2}})^{\top}U^{S}_{K_{2}}(U^{S}_{K_{2}})^{\top}U^{P}_{K_{2}} I(1ϵ2(σr1(P)σs(S))2ϵ2(σs(S)σs+1(P))2)\displaystyle\succeq I\Bigg{(}1-\frac{\epsilon^{2}}{(\sigma_{r-1}(P)-\sigma_{s}(S))^{2}}-\frac{\epsilon^{2}}{(\sigma_{s}(S)-\sigma_{s+1}(P))^{2}}\Bigg{)}
(UK2S)UK2P(UK2P)UK2S\displaystyle-(U^{S}_{K_{2}})^{\top}U^{P}_{K_{-2}}(U^{P}_{K_{-2}})^{\top}U^{S}_{K_{2}} (50)

In Eq. (50) we need to upper bound (U^{S}_{K_{2}})^{\top}U^{P}_{K_{-2}}(U^{P}_{K_{-2}})^{\top}U^{S}_{K_{2}}. To this end we will exploit the fact that all singular values corresponding to U^{S}_{K_{2}} are the same. Recall that ||\mathcal{H}(S)-\mathcal{H}(P)||\leq\epsilon, and write the spectral decompositions

(S)\displaystyle\mathcal{H}(S) =UK2SΣK2S(UK2S)+UK2SΣK2S(UK2S)+UK0SΣK0S(UK0S)\displaystyle=U^{S}_{K_{2}}\Sigma^{S}_{K_{2}}(U^{S}_{K_{2}})^{\top}+U^{S}_{K_{-2}}\Sigma^{S}_{K_{-2}}(U^{S}_{K_{-2}})^{\top}+U^{S}_{K_{0}}\Sigma^{S}_{K_{0}}(U^{S}_{K_{0}})^{\top}
(P)\displaystyle\mathcal{H}(P) =UK2PΣK2P(UK2P)+UK2PΣK2P(UK2P)+UK0PΣK0P(UK0P)\displaystyle=U^{P}_{K_{2}}\Sigma^{P}_{K_{2}}(U^{P}_{K_{2}})^{\top}+U^{P}_{K_{-2}}\Sigma^{P}_{K_{-2}}(U^{P}_{K_{-2}})^{\top}+U^{P}_{K_{0}}\Sigma^{P}_{K_{0}}(U^{P}_{K_{0}})^{\top}

Then by pre–multiplying and post–multiplying we get

(UK2S)(S)UK2P\displaystyle(U^{S}_{K_{2}})^{\top}\mathcal{H}(S)U^{P}_{K_{-2}} =ΣK2S(UK2S)UK2P\displaystyle=\Sigma^{S}_{K_{2}}(U^{S}_{K_{2}})^{\top}U^{P}_{K_{-2}}
(UK2S)(P)UK2P\displaystyle(U^{S}_{K_{2}})^{\top}\mathcal{H}(P)U^{P}_{K_{-2}} =(UK2S)UK2PΣK2P\displaystyle=(U^{S}_{K_{2}})^{\top}U^{P}_{K_{-2}}\Sigma^{P}_{K_{-2}}

Let (S)(P)=R\mathcal{H}(S)-\mathcal{H}(P)=R then

(UK2S)((S)(P))UK2P\displaystyle(U^{S}_{K_{2}})^{\top}(\mathcal{H}(S)-\mathcal{H}(P))U^{P}_{K_{-2}} =(UK2S)RUK2P\displaystyle=(U^{S}_{K_{2}})^{\top}RU^{P}_{K_{-2}}
ΣK2S(UK2S)UK2P(UK2S)UK2PΣK2P\displaystyle\Sigma^{S}_{K_{2}}(U^{S}_{K_{2}})^{\top}U^{P}_{K_{-2}}-(U^{S}_{K_{2}})^{\top}U^{P}_{K_{-2}}\Sigma^{P}_{K_{-2}} =(UK2S)RUK2P\displaystyle=(U^{S}_{K_{2}})^{\top}RU^{P}_{K_{-2}}

Since \Sigma^{S}_{K_{2}}=\sigma_{s}(S)I, we get

(UK2S)UK2P(σs(S)IΣK2P)\displaystyle||(U^{S}_{K_{2}})^{\top}U^{P}_{K_{-2}}(\sigma_{s}(S)I-\Sigma^{P}_{K_{-2}})|| =(UK2S)RUK2P\displaystyle=||(U^{S}_{K_{2}})^{\top}RU^{P}_{K_{-2}}||
(UK2S)UK2P\displaystyle||(U^{S}_{K_{2}})^{\top}U^{P}_{K_{-2}}|| ϵσs(S)+σs(P)\displaystyle\leq\frac{\epsilon}{\sigma_{s}(S)+\sigma_{s}(P)}

Similarly

(UK2P)UK2Sϵσs(P)+σs(S)||(U^{P}_{K_{2}})^{\top}U^{S}_{K_{-2}}||\leq\frac{\epsilon}{\sigma_{s}(P)+\sigma_{s}(S)}

Since σs(P)+σs(S)σs(S)σs+1(P)\sigma_{s}(P)+\sigma_{s}(S)\geq\sigma_{s}(S)-\sigma_{s+1}(P) combining this with Eq. (50) we get

σmin((UK2S)UK2P)13ϵ2min(σr1(P)σs(S),σs(S)σs+1(P))2\sigma_{\min}((U^{S}_{K_{2}})^{\top}U^{P}_{K_{2}})\geq 1-\frac{3\epsilon^{2}}{\min{\Big{(}\sigma_{r-1}(P)-\sigma_{s}(S),\sigma_{s}(S)-\sigma_{s+1}(P)\Big{)}^{2}}} (51)

Since

ϵinfi(σi(P)σi+1(P)2),\epsilon\leq\inf_{i}\Big{(}\frac{\sigma_{i}(P)-\sigma_{i+1}(P)}{2}\Big{)},

for Eq. (51), we use the inequality \sqrt{1-x^{2}}\geq 1-x^{2} whenever x<1, which holds when Eq. (46) is true. This means that there exists a unitary transformation Q such that

UK2SUK2PQ2ϵmin(σr1(P)σs(S),σs(S)σs+1(P))||U_{K_{2}}^{S}-U_{K_{2}}^{P}Q||\leq\frac{2\epsilon}{\min{\Big{(}\sigma_{r-1}(P)-\sigma_{s}(S),\sigma_{s}(S)-\sigma_{s+1}(P)\Big{)}}}
Remark 12.4.

Note that S,PS,P will be Hermitian dilations of 0,,,^0,d^,d^\mathcal{H}_{0,\infty,\infty},\hat{\mathcal{H}}_{0,\hat{d},\hat{d}} respectively in our case. Since the singular vectors of SS (and PP) are simply stacked version of singular vectors of 0,,\mathcal{H}_{0,\infty,\infty} (and ^0,d^,d^\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}), our results hold directly for the singular vectors of 0,,\mathcal{H}_{0,\infty,\infty} (and ^0,d^,d^\hat{\mathcal{H}}_{0,\hat{d},\hat{d}})

Recall the partition of the indices [1,n] into three parts K_{1}=[1,r-1], K_{2}=[r,s], K_{3}=[s+1,n]; it will be used in the next proposition.

Proposition 4 (System Reduction).

Let SPϵ||S-P||\leq\epsilon and the singular values of SS be arranged as follows:

σ1(S)>>σr1(S)>σr(S)σr+1(S)σs(S)>σs+1(S)>σn(S)>σn+1(S)=0\sigma_{1}(S)>\ldots>\sigma_{r-1}(S)>\sigma_{r}(S)\geq\sigma_{r+1}(S)\geq\ldots\geq\sigma_{s}(S)>\sigma_{s+1}(S)>\ldots\sigma_{n}(S)>\sigma_{n+1}(S)=0

Furthermore, let ϵ\epsilon be such that

ϵinf{1ir1}{s+1in}(σi(P)σi+1(P)2).\epsilon\leq\inf_{\{1\leq i\leq r-1\}\cup\{s+1\leq i\leq n\}}\Big{(}\frac{\sigma_{i}(P)-\sigma_{i+1}(P)}{2}\Big{)}. (52)

Define K0=K1K2K_{0}=K_{1}\cup K_{2}, then

UK0S(ΣK0S)1/2UK0P(ΣK0P)1/22\displaystyle||U^{S}_{K_{0}}(\Sigma^{S}_{K_{0}})^{1/2}-U^{P}_{K_{0}}(\Sigma^{P}_{K_{0}})^{1/2}||_{2} 2ϵi=1r1σi/ζi2+σr/ζr2+sup1is|σiσ^i|\displaystyle\leq 2\epsilon\sqrt{\sum_{i=1}^{r-1}\sigma_{i}/\zeta_{i}^{2}+\sigma_{r}/\zeta_{r}^{2}}+\sup_{1\leq i\leq s}|\sqrt{\sigma_{i}}-\sqrt{\hat{\sigma}_{i}}|

and \sigma_{i}=\sigma_{i}(S),\hat{\sigma}_{i}=\sigma_{i}(P). Here \zeta_{i}=\min{(\sigma_{i-1}-\sigma_{i},\sigma_{i}-\sigma_{i+1})} and \zeta_{r}=\min{(\sigma_{r-1}-\sigma_{r},\sigma_{s}-\sigma_{s+1})}.

Proof 12.5.

Since U_{K_{0}}^{S}=[U_{K_{1}}^{S}\ U_{K_{2}}^{S}] and likewise for P, we can separate the analysis for K_{1},K_{2} as follows

UK0S(ΣK0S)1/2UK0P(ΣK0P)1/2\displaystyle||U^{S}_{K_{0}}(\Sigma^{S}_{K_{0}})^{1/2}-U^{P}_{K_{0}}(\Sigma^{P}_{K_{0}})^{1/2}|| (UK0SUK0P)(ΣK0S)1/2+UK0P((ΣK0S)1/2(ΣK0P)1/2)\displaystyle\leq||(U^{S}_{K_{0}}-U^{P}_{K_{0}})(\Sigma^{S}_{K_{0}})^{1/2}||+||U^{P}_{K_{0}}((\Sigma^{S}_{K_{0}})^{1/2}-(\Sigma^{P}_{K_{0}})^{1/2})||
(UK1SUK1P)(ΣK1S)1/222+(UK2SUK2P)(ΣK2S)1/222\displaystyle\leq\sqrt{||(U^{S}_{K_{1}}-U^{P}_{K_{1}})(\Sigma^{S}_{K_{1}})^{1/2}||_{2}^{2}+||(U^{S}_{K_{2}}-U^{P}_{K_{2}})(\Sigma^{S}_{K_{2}})^{1/2}||_{2}^{2}}
+(ΣK0S)1/2(ΣK0P)1/2\displaystyle+||(\Sigma^{S}_{K_{0}})^{1/2}-(\Sigma^{P}_{K_{0}})^{1/2}||

Now ||(\Sigma^{S}_{K_{0}})^{1/2}-(\Sigma^{P}_{K_{0}})^{1/2}||=\sup_{l}|\sqrt{\sigma_{l}(S)}-\sqrt{\sigma_{l}(P)}|. Recall that \sigma_{r}(S)=\ldots=\sigma_{k}(S)=\ldots=\sigma_{s}(S), and by the conditions on \epsilon we are guaranteed that \frac{\epsilon}{\sigma_{i}-\sigma_{j}}<1/2 for all 1\leq i\neq j\leq r. We will combine our previous results (Propositions 1--3) to prove this claim. Specifically, from Proposition 3 we have

(UK2SUK2P)(ΣK2S)1/2\displaystyle||(U^{S}_{K_{2}}-U^{P}_{K_{2}})(\Sigma^{S}_{K_{2}})^{1/2}|| 2ϵσr(S)min(σr1(P)σr(S),σr(S)σs+1(P))\displaystyle\leq\frac{2\epsilon\sqrt{\sigma_{r}(S)}}{\min{\Big{(}\sigma_{r-1}(P)-\sigma_{r}(S),\sigma_{r}(S)-\sigma_{s+1}(P)\Big{)}}}

On the remaining term we will use Proposition 3 on each column

\displaystyle||(U^{S}_{K_{1}}-U^{P}_{K_{1}})(\Sigma^{S}_{K_{1}})^{1/2}|| \leq||[\sqrt{\sigma_{1}(S)}c_{1},\ldots,\sqrt{\sigma_{|K_{1}|}(S)}c_{|K_{1}|}]||\leq\sqrt{\sum_{j=1}^{r-1}\sigma_{j}||c_{j}||^{2}}
ϵj=1r12σj(S)min(σj1(P)σj(S),σj(S)σj+1(P))2\displaystyle\leq\epsilon\sqrt{\sum_{j=1}^{r-1}\frac{2\sigma_{j}(S)}{\min{\Big{(}\sigma_{j-1}(P)-\sigma_{j}(S),\sigma_{j}(S)-\sigma_{j+1}(P)\Big{)}^{2}}}}

In the context of our system identification, S=\mathcal{H}_{0,\infty,\infty} and P=\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}. P will be made compatible by padding it with zeros to make it doubly infinite. Then U_{K_{0}}^{S},U_{K_{0}}^{P} (after padding) have infinitely many rows. Define Z_{0}=U_{K_{0}}^{S}(\Sigma^{S}_{K_{0}})^{1/2}(1:,:), Z_{1}=U_{K_{0}}^{S}(\Sigma^{S}_{K_{0}})^{1/2}(p+1:,:) (both of infinite length), and similarly we will have \hat{Z}_{0},\hat{Z}_{1}. Note that from a computational perspective we do not need to compute Z_{0},Z_{1}; we only need to work with \hat{Z}_{0}=U_{K_{0}}^{P}(\Sigma^{P}_{K_{0}})^{1/2}(1:,:), \hat{Z}_{1}=U_{K_{0}}^{P}(\Sigma^{P}_{K_{0}})^{1/2}(p+1:,:), and since most of each is just zero padding we can simply compute with \hat{Z}_{0}(1:pd,:),\hat{Z}_{1}(1:pd,:).
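A minimal computational sketch of this step (our own illustrative code; the function and variable names are ours, and the finite truncation replaces the zero-padded infinite matrices of the text): given a pd\times md estimate of the Hankel matrix and a chosen order k, form \hat{Z}_{0},\hat{Z}_{1} from the truncated SVD and recover a realization by least squares, in the spirit of Proposition 5 below.

import numpy as np

def realization_from_hankel(H_hat, p, m, k):
    # H_hat: (p*d x m*d) estimated Hankel matrix, p outputs, m inputs, k = chosen order.
    U, s, Vt = np.linalg.svd(H_hat, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    Z0 = U_k * np.sqrt(s_k)                 # observability-like factor U Sigma^{1/2}
    W0 = np.sqrt(s_k)[:, None] * Vt_k       # controllability-like factor Sigma^{1/2} V^T
    # Z1 is Z0 shifted down by one block of p rows; drop the last block row of Z0 so
    # the shapes match (the infinite-dimensional version in the text pads with zeros).
    A_hat = np.linalg.lstsq(Z0[:-p, :], Z0[p:, :], rcond=None)[0]
    C_hat = Z0[:p, :]                        # first block row of Z0
    B_hat = W0[:, :m]                        # first block column of W0
    return A_hat, B_hat, C_hat

The least-squares step is exactly the regression (\hat{Z}_{0}^{\prime}\hat{Z}_{0})^{-1}\hat{Z}_{0}^{\prime}\hat{Z}_{1} analyzed next, restricted to the nonzero rows.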

Proposition 5.

Assume Z1=Z0AZ_{1}=Z_{0}A. Furthermore, SP2ϵ||S-P||_{2}\leq\epsilon and let ϵ\epsilon be such that

ϵinf{1ir1}{s+1in}(σi(P)σi+1(P)2)\epsilon\leq\inf_{\{1\leq i\leq r-1\}\cup\{s+1\leq i\leq n\}}\Big{(}\frac{\sigma_{i}(P)-\sigma_{i+1}(P)}{2}\Big{)} (53)

then

(Z0Z0)1Z0Z1(Z^0Z^0)1Z^0Z^1\displaystyle||(Z_{0}^{\prime}Z_{0})^{-1}Z_{0}^{\prime}Z_{1}-(\hat{Z}_{0}^{\prime}\hat{Z}_{0})^{-1}\hat{Z}_{0}^{\prime}\hat{Z}_{1}|| 𝒞ϵ(γ+1)σs(σs2((σsσs+1)(σr1σs))2\displaystyle\leq\frac{\mathcal{C}\epsilon(\gamma+1)}{\sigma_{s}}\Bigg{(}\sqrt{\frac{\sigma^{2}_{s}}{((\sigma_{s}-\sigma_{s+1})\wedge(\sigma_{r-1}-\sigma_{s}))^{2}}}
+i=1r1σiσs(σiσi+1)2(σi1σi)2)\displaystyle+\sqrt{\sum_{i=1}^{r-1}\frac{\sigma_{i}\sigma_{s}}{(\sigma_{i}-\sigma_{i+1})^{2}\wedge(\sigma_{i-1}-\sigma_{i})^{2}}}\Bigg{)}

where σ1(A)γ\sigma_{1}(A)\leq\gamma.

Proof 12.6.

Note that Z1=Z0AZ_{1}=Z_{0}A, then

(Z0Z0)1Z0Z1(Z^0Z^0)1Z^0Z^12\displaystyle||(Z_{0}^{\prime}Z_{0})^{-1}Z_{0}^{\prime}Z_{1}-(\hat{Z}_{0}^{\prime}\hat{Z}_{0})^{-1}\hat{Z}_{0}^{\prime}\hat{Z}_{1}||_{2}
=\displaystyle= A(Z^0Z^0)1Z^0Z^12=(Z^0Z^0)1Z^0Z^0A(Z^0Z^0)1Z^0Z^12\displaystyle||A-(\hat{Z}_{0}^{\prime}\hat{Z}_{0})^{-1}\hat{Z}_{0}^{\prime}\hat{Z}_{1}||_{2}=||(\hat{Z}_{0}^{\prime}\hat{Z}_{0})^{-1}\hat{Z}_{0}^{\prime}\hat{Z}_{0}A-(\hat{Z}_{0}^{\prime}\hat{Z}_{0})^{-1}\hat{Z}_{0}^{\prime}\hat{Z}_{1}||_{2}
=\displaystyle= (Z^0Z^0)1Z^0Z^0A(Z^0Z^0)1Z^0Z0A+(Z^0Z^0)1Z^0Z0A(Z^0Z^0)1Z^0Z^12\displaystyle||(\hat{Z}_{0}^{\prime}\hat{Z}_{0})^{-1}\hat{Z}_{0}^{\prime}\hat{Z}_{0}A-(\hat{Z}_{0}^{\prime}\hat{Z}_{0})^{-1}\hat{Z}_{0}^{\prime}Z_{0}A+(\hat{Z}_{0}^{\prime}\hat{Z}_{0})^{-1}\hat{Z}_{0}^{\prime}Z_{0}A-(\hat{Z}_{0}^{\prime}\hat{Z}_{0})^{-1}\hat{Z}_{0}^{\prime}\hat{Z}_{1}||_{2}
\displaystyle\leq (Z^0Z^0)1Z^0Z^0A(Z^0Z^0)1Z^0Z0A2+(Z^0Z^0)1Z^0Z0A(Z^0Z^0)1Z^0Z^12\displaystyle||(\hat{Z}_{0}^{\prime}\hat{Z}_{0})^{-1}\hat{Z}_{0}^{\prime}\hat{Z}_{0}A-(\hat{Z}_{0}^{\prime}\hat{Z}_{0})^{-1}\hat{Z}_{0}^{\prime}Z_{0}A||_{2}+||(\hat{Z}_{0}^{\prime}\hat{Z}_{0})^{-1}\hat{Z}_{0}^{\prime}Z_{0}A-(\hat{Z}_{0}^{\prime}\hat{Z}_{0})^{-1}\hat{Z}_{0}^{\prime}\hat{Z}_{1}||_{2}
\displaystyle\leq (Z^0Z^0)1Z^02(Z0AZ^0A2+Z0AShifted version of Z0Z^12)\displaystyle||(\hat{Z}_{0}^{\prime}\hat{Z}_{0})^{-1}\hat{Z}_{0}^{\prime}||_{2}\Big{(}||Z_{0}A-\hat{Z}_{0}A||_{2}+||\underbrace{Z_{0}A}_{\text{Shifted version of }Z_{0}}-\hat{Z}_{1}||_{2}\Big{)}

Now ||(\hat{Z}_{0}^{\prime}\hat{Z}_{0})^{-1}\hat{Z}_{0}^{\prime}||_{2}\leq(\sqrt{\sigma_{s}-\epsilon})^{-1}. Since Z_{1}=Z_{0}A is a (shifted) submatrix of Z_{0} and \hat{Z}_{1} is the corresponding submatrix of \hat{Z}_{0}, we have ||Z_{0}A-\hat{Z}_{1}||_{2}\leq||Z_{0}-\hat{Z}_{0}||_{2}, and also ||Z_{0}A-\hat{Z}_{0}A||_{2}\leq||A||_{2}||Z_{0}-\hat{Z}_{0}||_{2}. Combining these bounds (using Proposition 4 to bound ||Z_{0}-\hat{Z}_{0}||_{2}) gives

\displaystyle\leq cϵ(γ+1)σs(σs2((σsσs+1)(σr1σs))2+i=1r1σiσs(σiσi+1)2(σi1σi)2)\displaystyle\frac{c\epsilon(\gamma+1)}{{\sigma_{s}}}\Bigg{(}\sqrt{\frac{\sigma^{2}_{s}}{((\sigma_{s}-\sigma_{s+1})\wedge(\sigma_{r-1}-\sigma_{s}))^{2}}}+\sqrt{\sum_{i=1}^{r-1}\frac{\sigma_{i}\sigma_{s}}{(\sigma_{i}-\sigma_{i+1})^{2}\wedge(\sigma_{i-1}-\sigma_{i})^{2}}}\Bigg{)}

13 Hankel Matrix Estimation Results

In this section we provide the proof for Theorem 6. For any matrix PP, we define its doubly infinite extension P¯\bar{P} as

P¯=[P000]\bar{P}=\begin{bmatrix}P&0&\ldots\\ 0&0&\ldots\\ \vdots&\vdots&\vdots\end{bmatrix} (54)
Proposition 1.

Fix d>0d>0. Then we have

d,,20,,¯0,d,d22d,,22𝒯d,2||\mathcal{H}_{d,\infty,\infty}||_{2}\leq||\mathcal{H}_{0,\infty,\infty}-\bar{\mathcal{H}}_{0,d,d}||_{2}\leq\sqrt{2}||\mathcal{H}_{d,\infty,\infty}||_{2}\leq\sqrt{2}||\mathcal{T}_{d,\infty}||_{2}
Proof 13.1.

Define C~d,B~d\tilde{C}_{d},\tilde{B}_{d} as follows

C~d\displaystyle\tilde{C}_{d} =[0md×nCCA]\displaystyle=\begin{bmatrix}0_{md\times n}\\ C\\ CA\\ \vdots\\ \end{bmatrix}
B~d\displaystyle\tilde{B}_{d} =[0n×pdBAB]\displaystyle=\begin{bmatrix}0_{n\times pd}&B&AB&\ldots\end{bmatrix}

Now pad 0,d,d\mathcal{H}_{0,d,d} with zeros to make it a doubly infinite matrix and call it ¯0,d,d\bar{\mathcal{H}}_{0,d,d} and we get that

\displaystyle\bar{\mathcal{H}}_{0,d,d}-\mathcal{H}_{0,\infty,\infty} =\begin{bmatrix}0&M_{12}\\ M_{21}&M_{22}\end{bmatrix}

Note here that M21M_{21} and M0=[M12M22]M_{0}=\begin{bmatrix}M_{12}\\ M_{22}\end{bmatrix} are infinite matrices. Further d,,2=M02M212||\mathcal{H}_{d,\infty,\infty}||_{2}=||M_{0}||_{2}\geq||M_{21}||_{2}. Then

¯0,d,d0,,\displaystyle||\bar{\mathcal{H}}_{0,d,d}-\mathcal{H}_{0,\infty,\infty}|| M1222+M0222d,,2\displaystyle\leq\sqrt{||M_{12}||_{2}^{2}+||M_{0}||_{2}^{2}}\leq\sqrt{2}||\mathcal{H}_{d,\infty,\infty}||_{2}

Further ¯0,d,d0,,M0=d,,2||\bar{\mathcal{H}}_{0,d,d}-\mathcal{H}_{0,\infty,\infty}||\geq||M_{0}||=||\mathcal{H}_{d,\infty,\infty}||_{2}.

Proposition 2.

For any d1d2d_{1}\geq d_{2}, we have

0,,¯0,d1,d1220,,¯0,d2,d22||\mathcal{H}_{0,\infty,\infty}-\bar{\mathcal{H}}_{0,d_{1},d_{1}}||_{2}\leq\sqrt{2}||\mathcal{H}_{0,\infty,\infty}-\bar{\mathcal{H}}_{0,d_{2},d_{2}}||_{2}
Proof 13.2.

From Proposition 1 we have ||\mathcal{H}_{d_{1},\infty,\infty}||_{2}\leq||\mathcal{H}_{0,\infty,\infty}-\bar{\mathcal{H}}_{0,d_{1},d_{1}}||_{2}\leq\sqrt{2}||\mathcal{H}_{d_{1},\infty,\infty}||_{2}. It is also clear that ||\mathcal{H}_{d_{1},\infty,\infty}||_{2}\leq||\mathcal{H}_{d_{2},\infty,\infty}||_{2}. Then

120,,¯0,d1,d12d1,,2d2,,20,,¯0,d2,d22\displaystyle\frac{1}{\sqrt{2}}||\mathcal{H}_{0,\infty,\infty}-\bar{\mathcal{H}}_{0,d_{1},d_{1}}||_{2}\leq||\mathcal{H}_{d_{1},\infty,\infty}||_{2}\leq||\mathcal{H}_{d_{2},\infty,\infty}||_{2}\leq||\mathcal{H}_{0,\infty,\infty}-\bar{\mathcal{H}}_{0,d_{2},d_{2}}||_{2}
Proposition 3.

Fix d>0d>0. Then

𝒯d,(M)2M~ρ(A)d1ρ(A)||\mathcal{T}_{d,\infty}(M)||_{2}\leq\frac{\tilde{M}\rho(A)^{d}}{1-\rho(A)}
Proof 13.3.

Recall that

𝒯d,(M)=[0000CAdB000CAd+1BCAdB00]\displaystyle\mathcal{T}_{d,\infty}(M)=\begin{bmatrix}0&0&0&\ldots&0\\ CA^{d}B&0&0&\ldots&0\\ CA^{d+1}B&CA^{d}B&0&\ldots&0\\ \vdots&\ddots&\ddots&\vdots&\vdots\end{bmatrix}

Then 𝒯d,(M)2j=dCAjB2||\mathcal{T}_{d,\infty}(M)||_{2}\leq\sum_{j=d}^{\infty}||CA^{j}B||_{2}. Now from Eq. 4.1 and Lemma 4.1 in Tu et al. (2017) we get that CAjB2M~ρ(A)j||CA^{j}B||_{2}\leq\tilde{M}\rho(A)^{j}. Then

j=dCAjB2M~ρ(A)d1ρ(A)\sum_{j=d}^{\infty}||CA^{j}B||_{2}\leq\frac{\tilde{M}\rho(A)^{d}}{1-\rho(A)}
Remark 13.4.

Proposition 3 is just needed to show exponential decay and is not precise. Please refer to Tu et al. (2017) for explicit rates.
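For intuition, here is a small numerical illustration (our own toy system, not from the paper) of the geometric decay that Proposition 3 formalizes: the tail sum of the Markov parameters ||CA^{j}B||_{2} is dominated, up to the constant \tilde{M}, by the envelope \rho(A)^{d}/(1-\rho(A)).

import numpy as np

rng = np.random.default_rng(1)
A = np.array([[0.8, 0.2],
              [0.0, 0.5]])                  # stable example: rho(A) = 0.8
B = rng.standard_normal((2, 1))
C = rng.standard_normal((1, 2))

rho = max(abs(np.linalg.eigvals(A)))
for d in (2, 5, 10, 20):
    tail = sum(np.linalg.norm(C @ np.linalg.matrix_power(A, j) @ B, 2)
               for j in range(d, 200))      # finite proxy for the infinite tail sum
    print(d, tail, rho ** d / (1 - rho))    # tail decays like rho^d / (1 - rho), up to M~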

Next, we recall T^{(\kappa)}_{*}(\delta) and d_{*}(T,\delta), defined in Eq. (16) and given by

d(T,δ)\displaystyle d_{*}(T,\delta) =inf{d|16βRdm+pd+log(Tδ)T0,d,d0,,2}\displaystyle=\inf\Bigg{\{}d\Bigg{|}16\beta R\sqrt{d}\sqrt{\frac{m+pd+\log{\frac{T}{\delta}}}{T}}\geq||\mathcal{H}_{0,d,d}-\mathcal{H}_{0,\infty,\infty}||_{2}\Bigg{\}}
T(κ)(δ)\displaystyle T^{(\kappa)}_{*}(\delta) =inf{T|Tcm2log3(Tm/δ)d(T,δ),d(T,δ)κd(Tκ2,δ)8}\displaystyle=\inf\Big{\{}T\Big{|}\frac{T}{cm^{2}\log^{3}{(Tm/\delta)}}\geq d_{*}(T,\delta),\hskip 5.69054ptd_{*}(T,\delta)\leq\frac{\kappa d_{*}(\frac{T}{\kappa^{2}},\delta)}{8}\Big{\}} (55)

The existence of d(T,δ)d_{*}(T,\delta) is predicated on the finiteness of T(κ)(δ)T^{(\kappa)}_{*}(\delta) which we discuss below.

13.1 Existence of T(κ)(δ)<T^{(\kappa)}_{*}(\delta)<\infty

Construct two sets

T1(δ)\displaystyle T_{1}(\delta) =inf{T|d(T,δ)𝒟(T)}\displaystyle=\inf\Big{\{}T\Big{|}d_{*}(T,\delta)\in\mathcal{D}(T)\Big{\}} (56)
T2(δ)\displaystyle T_{2}(\delta) =inf{T|d(t,δ)κd(tκ2,δ)8,tT}\displaystyle=\inf\Big{\{}T\Big{|}d_{*}(t,\delta)\leq\frac{\kappa d_{*}(\frac{t}{\kappa^{2}},\delta)}{8},\hskip 8.53581pt\forall t\geq T\Big{\}} (57)

Clearly, T(κ)(δ)<T1(δ)T2(δ)T^{(\kappa)}_{*}(\delta)<T_{1}(\delta)\vee T_{2}(\delta). A key assumption in the statement of our results is that T(κ)(δ)<T^{(\kappa)}_{*}(\delta)<\infty. We will show that it is indeed true. Let κ16\kappa\geq 16.

Proposition 4.

For a fixed δ>0\delta>0, T1(δ)<T_{1}(\delta)<\infty with d(T,δ)clog((cT+log(1δ)))log(R)+log((M~/β))log(1ρ)d_{*}(T,\delta)\leq\frac{c\log{(cT+\log{\frac{1}{\delta}})}-\log{R}+\log{(\tilde{M}/\beta)}}{\log{\frac{1}{\rho}}}. Here ρ=ρ(A)\rho=\rho(A).

Proof 13.5.

Note the form of d_{*}(T,\delta): it is the minimum d that satisfies

16βRdm+pd+log(Tδ)T0,d,d0,,216\beta R\sqrt{d}\sqrt{\frac{m+pd+\log{\frac{T}{\delta}}}{T}}\geq||\mathcal{H}_{0,d,d}-\mathcal{H}_{0,\infty,\infty}||_{2}

Since from Propositions 1 and 3 we have ||\mathcal{H}_{0,d,d}-\mathcal{H}_{0,\infty,\infty}||_{2}\leq\frac{3\tilde{M}\rho^{d}}{1-\rho(A)}, it follows that d_{*}(T,\delta)\leq d for any d that satisfies

16βRdm+pd+log(Tδ)T3M~ρd1ρ(A)16\beta R\sqrt{d}\sqrt{\frac{m+pd+\log{\frac{T}{\delta}}}{T}}\geq\frac{3\tilde{M}\rho^{d}}{1-\rho(A)}

which immediately implies d(T,δ)d=clog((cTlog(R)+log(1δ)))+log((M~/β))log(1ρ)d_{*}(T,\delta)\leq d=\frac{c\log{(cT-\log{R}+\log{\frac{1}{\delta}})}+\log{(\tilde{M}/\beta)}}{\log{\frac{1}{\rho}}}, i.e., d(T,δ)d_{*}(T,\delta) is at most logarithmic in TT. As a result, for a large enough TT

cm2dlog2(d)log2(m2/δ)+cdlog3(2d)clog((cT+log(1δ)))log(R)+log((M~/β))log(1ρ)cm^{2}d\log^{2}{(d)}\log^{2}{(m^{2}/\delta)}+cd\log^{3}{(2d)}\geq\frac{c\log{(cT+\log{\frac{1}{\delta}})}-\log{R}+\log{(\tilde{M}/\beta)}}{\log{\frac{1}{\rho}}}

The intuition behind T2(δ)T_{2}(\delta) is the following: d(T,δ)d_{*}(T,\delta) grows at most logarithmically in TT, as is clear from the previous proof. Then T2(δ)T_{2}(\delta) is the point where d(T,δ)d_{*}(T,\delta) is still growing as T\sqrt{T} (i.e., “mixing” has not happened) but at a slightly reduced rate.

Proposition 5.

For a fixed δ>0\delta>0, T2(δ)<T_{2}(\delta)<\infty.

Proof 13.6.

Recall from the proof of Proposition 1 that d,,0,,0,d,d2d,,||\mathcal{H}_{d,\infty,\infty}||\leq||\mathcal{H}_{0,\infty,\infty}-\mathcal{H}_{0,d,d}||\leq\sqrt{2}||\mathcal{H}_{d,\infty,\infty}||. Now d,,\mathcal{H}_{d,\infty,\infty} can be written as

d,,\displaystyle\mathcal{H}_{d,\infty,\infty} =[CCA]=C~Ad[B,AB,]=B~\displaystyle=\underbrace{\begin{bmatrix}C\\ CA\\ \vdots\\ \end{bmatrix}}_{=\tilde{C}}A^{d}\underbrace{[B,AB,\ldots]}_{=\tilde{B}}

Define Pd=AdB~B~(Ad)P_{d}=A^{d}\tilde{B}\tilde{B}^{\top}(A^{d})^{\top}. Let dκd_{\kappa} be such that for every ddκd\geq d_{\kappa} and κ16\kappa\geq 16

Pd14κP0P_{d}\preceq\frac{1}{4\kappa}P_{0} (58)

Clearly such a dκ<d_{\kappa}<\infty would exist because P00P_{0}\neq 0 but limdPd=0\lim_{d\rightarrow\infty}P_{d}=0. Then observe that P2d14κPdP_{2d}\preceq\frac{1}{4\kappa}P_{d}. Then for every ddκd\geq d_{\kappa} we have that

d,,4κ2d,,||\mathcal{H}_{d,\infty,\infty}||\geq 4\kappa||\mathcal{H}_{2d,\infty,\infty}||

Let

T4dκ(16)2β2R2σ02(dκp+log((T/δ)))T\geq\frac{4d_{\kappa}\cdot(16)^{2}\cdot\beta^{2}R^{2}}{\sigma_{0}^{2}}(d_{\kappa}p+\log{(T/\delta)}) (59)

where \sigma_{0}=||\mathcal{H}_{d_{\kappa},\infty,\infty}||. Assume that \sigma_{0}>0 (if not, then our condition is trivially true). Then a simple computation shows that

0,dκ,dκ0,,\displaystyle||\mathcal{H}_{0,d_{\kappa},d_{\kappa}}-\mathcal{H}_{0,\infty,\infty}|| dκ,,16βRdκm+pdκ+log(Tδ)T<σ02\displaystyle\geq||\mathcal{H}_{d_{\kappa},\infty,\infty}||\geq\underbrace{16\beta R\sqrt{d_{\kappa}}\sqrt{\frac{m+pd_{\kappa}+\log{\frac{T}{\delta}}}{T}}}_{<\frac{\sigma_{0}}{2}}

This implies that d=d(T,δ)dκd_{*}=d_{*}(T,\delta)\geq d_{\kappa} for TT prescribed as above (ensured by Proposition 2). But from our discussion above we also have

0,d,d0,,d,,4κ2d,,2κ0,2d,2d0,,\displaystyle||\mathcal{H}_{0,d_{*},d_{*}}-\mathcal{H}_{0,\infty,\infty}||\geq||\mathcal{H}_{d_{*},\infty,\infty}||\geq 4\kappa||\mathcal{H}_{2d_{*},\infty,\infty}||\geq 2\kappa||\mathcal{H}_{0,2d_{*},2d_{*}}-\mathcal{H}_{0,\infty,\infty}||

This means that if

0,d,d0,,\displaystyle||\mathcal{H}_{0,d_{*},d_{*}}-\mathcal{H}_{0,\infty,\infty}|| 16βRdm+pd+log(Tδ)T\displaystyle\leq 16\beta R\sqrt{d_{*}}\sqrt{\frac{m+pd_{*}+\log{\frac{T}{\delta}}}{T}}

then

0,2d,2d0,,\displaystyle||\mathcal{H}_{0,2d_{*},2d_{*}}-\mathcal{H}_{0,\infty,\infty}|| 162κβRdm+pd+log(Tδ)T16βR2dm+2pd+log(κ2Tδ)κ2T\displaystyle\leq\frac{16}{2\kappa}\beta R\sqrt{d_{*}}\sqrt{\frac{m+pd_{*}+\log{\frac{T}{\delta}}}{T}}\leq 16\beta R\sqrt{2d_{*}}\sqrt{\frac{m+2pd_{*}+\log{\frac{\kappa^{2}T}{\delta}}}{\kappa^{2}T}}

which implies that d_{*}(\kappa^{2}T,\delta)\leq 2d_{*}(T,\delta). The inequality follows from the definition of d_{*}(\kappa^{2}T,\delta). Furthermore, if \kappa\geq 16, then 2d_{*}(T,\delta)\leq\frac{\kappa}{8}d_{*}(T,\delta), so d_{*}(\kappa^{2}T,\delta)\leq\frac{\kappa}{8}d_{*}(T,\delta) whenever T is greater than the finite threshold of Eq. (59).

Eq. (58) happens when σ(Ad)214κdκ=𝒪(log(κ)log(1ρ))\sigma(A^{d})^{2}\leq\frac{1}{4\kappa}\implies d_{\kappa}=\mathcal{O}\Big{(}\frac{\log{\kappa}}{\log{\frac{1}{\rho}}}\Big{)} where ρ=ρ(A)\rho=\rho(A) and T2(δ)cT1(δ)T_{2}(\delta)\leq cT_{1}(\delta). It should be noted that the dependence of Ti(δ)T_{i}(\delta) on log(1ρ)\log{\frac{1}{\rho}} is worst case, i.e., there exists some “bad” LTI system that gives this dependence and it is quite likely Ti(δ)T_{i}(\delta) is much smaller. The condition TT1(δ)T2(δ)T\geq T_{1}(\delta)\vee T_{2}(\delta) simply requires that we capture some reasonable portion of the dynamics and not necessarily the entire dynamics.

13.2 Proof of Theorem 6

Proposition 6.

Let TT(κ)(δ)T\geq T^{(\kappa)}_{*}(\delta) and d=d(T,δ)d_{*}=d_{*}(T,\delta) then

0,,^0,d,d2cβRdTm+pd+log(Tδ)||\mathcal{H}_{0,\infty,\infty}-\hat{\mathcal{H}}_{0,d_{*},d_{*}}||\leq 2c\beta R\sqrt{\frac{d_{*}}{T}}\sqrt{m+pd_{*}+\log{\frac{T}{\delta}}}
Proof 13.7.

Consider the following error

0,,^0,d,d2\displaystyle||\mathcal{H}_{0,\infty,\infty}-\hat{\mathcal{H}}_{0,d_{*},d_{*}}||_{2} 0,d,d^0,d,d2+0,,0,d,d2\displaystyle\leq||\mathcal{H}_{0,d_{*},d_{*}}-\hat{\mathcal{H}}_{0,d_{*},d_{*}}||_{2}+||\mathcal{H}_{0,\infty,\infty}-{\mathcal{H}}_{0,d_{*},d_{*}}||_{2}

From Proposition 1 and Eq. (55) we get that

0,,0,d,d216βRdTm+pd+log(Tδ)||\mathcal{H}_{0,\infty,\infty}-{\mathcal{H}}_{0,d_{*},d_{*}}||_{2}\leq 16\beta R\sqrt{\frac{d_{*}}{T}}\sqrt{m+pd_{*}+\log{\frac{T}{\delta}}}

Since from Theorem 2

0,d,d^0,d,d2\displaystyle||\mathcal{H}_{0,d_{*},d_{*}}-\hat{\mathcal{H}}_{0,d_{*},d_{*}}||_{2} 16βRdTm+pd+log(Tδ)\displaystyle\leq 16\beta R\sqrt{\frac{d_{*}}{T}}\sqrt{m+pd_{*}+\log{\frac{T}{\delta}}}
0,,^0,d,d2\displaystyle||\mathcal{H}_{0,\infty,\infty}-\hat{\mathcal{H}}_{0,d_{*},d_{*}}||_{2} 32βRdTm+pd+log(Tδ)\displaystyle\leq 32\beta R\sqrt{\frac{d_{*}}{T}}\sqrt{m+pd_{*}+\log{\frac{T}{\delta}}} (60)

Recall the adaptive rule to choose dd in Algorithm 1. From Theorem 2 we know that for every d𝒟(T)d\in\mathcal{D}(T) we have with probability at least 1δ1-\delta.

||\mathcal{H}_{0,d,d}-\hat{\mathcal{H}}_{0,d,d}||_{2}\leq 16\beta R\sqrt{d}\sqrt{\frac{m+dp+\log{\frac{T}{\delta}}}{T}}

Let α(l)=l(lpT+log(Tδ)T)\alpha(l)=\sqrt{l}\Big{(}\sqrt{\frac{lp}{T}+\frac{\log{\frac{T}{\delta}}}{T}}\Big{)}. Then consider the following adaptive rule

d0(T,δ)\displaystyle d_{0}(T,\delta) =inf{l|^0,l,l^0,h,h216βR(2α(l)+α(h))h𝒟(T),hl}\displaystyle=\inf\Big{\{}l\Big{|}||\hat{\mathcal{H}}_{0,l,l}-\hat{\mathcal{H}}_{0,h,h}||_{2}\leq 16\beta R(2\alpha(l)+\alpha(h))\hskip 5.69054pt\forall h\in\mathcal{D}(T),h\geq l\Big{\}} (61)
d^=d^(T,δ)\displaystyle\hat{d}=\hat{d}(T,\delta) =d0(T,δ)log((Tδ))\displaystyle=d_{0}(T,\delta)\vee\log{\left(\frac{T}{\delta}\right)} (62)

for the same universal constant c as in Theorem 2. Let d_{*}(T,\delta) be as in Eq. (55). Recall that d_{*}=d_{*}(T,\delta) is the point where the estimation error dominates the finite truncation error. Unfortunately, we do not have a priori knowledge of d_{*}(T,\delta) to use in the algorithm. Therefore, we will simply use Eq. (62) as our proxy. The goal will be to bound ||\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}-\mathcal{H}_{0,\infty,\infty}||_{2}.
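A minimal sketch of this adaptive choice (illustrative Python of our own; H_hat, depths, and the zero-padding of the estimates to a common shape are our assumptions, not the paper's implementation):

import numpy as np

def alpha(l, p, T, delta):
    # alpha(l) = sqrt(l) * sqrt(l*p/T + log(T/delta)/T), as defined above.
    return np.sqrt(l) * np.sqrt(l * p / T + np.log(T / delta) / T)

def choose_depth(H_hat, depths, beta, R, p, T, delta):
    # depths: sorted candidate set D(T); H_hat(d) returns the estimate H_hat_{0,d,d},
    # zero-padded so that estimates at different depths share a common shape.
    for l in depths:
        ok = all(
            np.linalg.norm(H_hat(l) - H_hat(h), 2)
            <= 16 * beta * R * (2 * alpha(l, p, T, delta) + alpha(h, p, T, delta))
            for h in depths if h >= l
        )
        if ok:
            # d_hat = d_0 v log(T/delta), Eq. (62)
            return max(l, int(np.ceil(np.log(T / delta))))
    return depths[-1]   # fallback for the sketch; in the analysis d_0 exists in D(T)

The rule only compares estimated Hankel matrices at different depths against the statistical tolerance, so it requires no knowledge of d_{*}(T,\delta).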

Proposition 7.

Let TT(κ)(δ)T\geq T^{(\kappa)}_{*}(\delta), d(T,δ)d_{*}(T,\delta) be as in Eq. (55) and d^\hat{d} be as in Eq. (62). Then with probability at least 1δ1-\delta we have

d^d(T,δ)log((Tδ))\hat{d}\leq d_{*}(T,\delta)\vee\log{\Big{(}\frac{T}{\delta}\Big{)}}
Proof 13.8.

Let d_{*}=d_{*}(T,\delta). First, for all h\in\mathcal{D}(T) with h\geq d_{*}, we note

^0,d,d^0,h,h2\displaystyle||\hat{\mathcal{H}}_{0,d_{*},d_{*}}-\hat{\mathcal{H}}_{0,h,h}||_{2} ^0,d,d0,d,d+0,h,h^0,h,h2+0,h,h0,d,d2\displaystyle\leq||\hat{\mathcal{H}}_{0,d_{*},d_{*}}-\mathcal{H}_{0,d_{*},d_{*}}||+||\mathcal{H}_{0,h,h}-\hat{\mathcal{H}}_{0,h,h}||_{2}+||\mathcal{H}_{0,h,h}-{\mathcal{H}}_{0,d_{*},d_{*}}||_{2}
>hd^0,d,d0,d,d2+0,h,h^0,h,h2+0,,0,d,d2.\displaystyle\underbrace{\leq}_{\infty>h\geq d_{*}}||\hat{\mathcal{H}}_{0,d_{*},d_{*}}-\mathcal{H}_{0,d_{*},d_{*}}||_{2}+||\mathcal{H}_{0,h,h}-\hat{\mathcal{H}}_{0,h,h}||_{2}+||\mathcal{H}_{0,\infty,\infty}-\mathcal{H}_{0,d_{*},d_{*}}||_{2}.

We use the property that 0,,0,d,d20,h,h0,d,d2||\mathcal{H}_{0,\infty,\infty}-\mathcal{H}_{0,d_{*},d_{*}}||_{2}\geq||\mathcal{H}_{0,h,h}-\mathcal{H}_{0,d_{*},d_{*}}||_{2}. Furthermore, because of the properties of dd_{*} we have

0,,0,d,d216βRα(d)||\mathcal{H}_{0,\infty,\infty}-\mathcal{H}_{0,d_{*},d_{*}}||_{2}\leq 16\beta R\alpha(d_{*})

and

^0,d,d0,d,d2\displaystyle||\hat{\mathcal{H}}_{0,d_{*},d_{*}}-\mathcal{H}_{0,d_{*},d_{*}}||_{2} 16βRα(d),0,h,h^0,h,h216βRα(h).\displaystyle\leq 16\beta R\alpha(d_{*}),\quad{}||\mathcal{H}_{0,h,h}-\hat{\mathcal{H}}_{0,h,h}||_{2}\leq 16\beta R\alpha(h). (63)

and

^0,d,d^0,h,h216βR(2α(d)+α(h)).||\hat{\mathcal{H}}_{0,d_{*},d_{*}}-\hat{\mathcal{H}}_{0,h,h}||_{2}\leq 16\beta R(2\alpha(d_{*})+\alpha(h)).

This implies that d0(T,δ)dd_{0}(T,\delta)\leq d_{*} and the assertion follows.

We have the following key lemma about the behavior of ^0,d^,d^\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}.

Lemma 8.

For a fixed κ20\kappa\geq 20, whenever TT(κ)(δ)T\geq T^{(\kappa)}_{*}(\delta) we have with probability at least 1δ1-\delta

0,,^0,d^,d^23cβRα(max(d(T,δ),log((Tδ))))||\mathcal{H}_{0,\infty,\infty}-\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}||_{2}\leq 3c\beta R\alpha\left(\max{\left(d_{*}(T,\delta),\log{\left(\frac{T}{\delta}\right)}\right)}\right) (64)

Furthermore, d^=O(log(Tδ))\hat{d}=O(\log{\frac{T}{\delta}}).

Proof 13.9.

Let d>d^d_{*}>\hat{d} then

0,,^0,d^,d^2\displaystyle||\mathcal{H}_{0,\infty,\infty}-\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}||_{2} 0,,0,d,d2+^0,d^,d^0,d^,d^2+^0,d,d0,d,d2\displaystyle\leq||\mathcal{H}_{0,\infty,\infty}-\mathcal{H}_{0,d_{*},d_{*}}||_{2}+||\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}-\mathcal{H}_{0,\hat{d},\hat{d}}||_{2}+||\hat{\mathcal{H}}_{0,d_{*},d_{*}}-\mathcal{H}_{0,d_{*},d_{*}}||_{2}
3cβRα(d)\displaystyle\leq 3c\beta R\alpha(d_{*})

If d^>d\hat{d}>d_{*} then

0,,^0,d^,d^2\displaystyle||\mathcal{H}_{0,\infty,\infty}-\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}||_{2} 0,,0,d^,d^2+^0,d^,d^0,d^,d^2=2^0,d^,d^0,d^,d^2\displaystyle\leq||\mathcal{H}_{0,\infty,\infty}-\mathcal{H}_{0,\hat{d},\hat{d}}||_{2}+||\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}-\mathcal{H}_{0,\hat{d},\hat{d}}||_{2}=2||\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}-\mathcal{H}_{0,\hat{d},\hat{d}}||_{2}
2cβRα(d^)=2cβRα(log((Tδ)))\displaystyle\leq 2c\beta R\alpha(\hat{d})=2c\beta R\alpha\left(\log{\Big{(}\frac{T}{\delta}\Big{)}}\right)

where the equality follows from Proposition 7. The fact that \hat{d}=O(\log{\frac{T}{\delta}}) follows from Propositions 4 and 7.

In the following we will use l=0,l,l\mathcal{H}_{l}=\mathcal{H}_{0,l,l} for shorthand.

Proposition 9.

Fix κ16\kappa\geq 16, and TT(κ)(δ)T\geq T_{*}^{(\kappa)}(\delta). Then

^0,d^(T,δ),d^(T,δ)0,,212cβRd^(T,δ)m+pd^(T,δ)+log(Tδ)T||\hat{\mathcal{H}}_{0,\hat{d}(T,\delta),\hat{d}(T,\delta)}-\mathcal{H}_{0,\infty,\infty}||_{2}\leq 12c\beta R\sqrt{\hat{d}(T,\delta)}\sqrt{\frac{m+p\hat{d}(T,\delta)+\log{\frac{T}{\delta}}}{T}}

with probability at least 1δ1-\delta.

Proof 13.10.

Assume that log((Tδ))d(T,δ)\log{\Big{(}\frac{T}{\delta}\Big{)}}\leq d_{*}(T,\delta). Recall the following functions

d(T,δ)\displaystyle d_{*}(T,\delta) =inf{d|cβRdm+pd+log(Tδ)Td2}\displaystyle=\inf{\Big{\{}d\Big{|}c\beta R\sqrt{d}\sqrt{\frac{m+pd+\log{\frac{T}{\delta}}}{T}}\geq||\mathcal{H}_{d}-\mathcal{H}_{\infty}||_{2}\Big{\}}}
d0(T,δ)\displaystyle d_{0}(T,\delta) =inf{l|^l^h2cβR(α(h)+2α(l))hl,h𝒟(T)}\displaystyle=\inf{\Big{\{}l\Big{|}||\hat{\mathcal{H}}_{l}-\hat{\mathcal{H}}_{h}||_{2}\leq c\beta R(\alpha(h)+2\alpha(l))\hskip 5.69054pt\forall h\geq l,\hskip 5.69054pth\in\mathcal{D}(T)\Big{\}}}
d^(T,δ)\displaystyle\hat{d}(T,\delta) =d0(T,δ)log((Tδ))\displaystyle=d_{0}(T,\delta)\vee\log{\Big{(}\frac{T}{\delta}\Big{)}}

It is clear that d(κ2T,δ)(1+12p)κd(T,δ)d_{*}(\kappa^{2}T,\delta)\leq(1+\frac{1}{2p})\kappa d_{*}(T,\delta) for any κ16\kappa\geq 16. Assume the following

  • d(T,δ)κ8d(κ2T,δ)d_{*}(T,\delta)\leq\frac{\kappa}{8}d_{*}(\kappa^{-2}T,\delta) (This relation is true whenever TT(κ)(δ)T\geq T^{(\kappa)}_{*}(\delta)),

  • d^(T,δ)26cβRd^(T,δ)m+pd^(T,δ)+log(Tδ)T||\mathcal{H}_{\hat{d}(T,\delta)}-\mathcal{H}_{\infty}||_{2}\geq 6c\beta R\sqrt{\hat{d}(T,\delta)}\sqrt{\frac{m+p\hat{d}(T,\delta)+\log{\frac{T}{\delta}}}{T}},

  • d^(T,δ)<d(κ2T,δ)1\hat{d}(T,\delta)<d_{*}(\kappa^{-2}T,\delta)-1.

The key will be to show that, with high probability, all three assumptions cannot hold simultaneously. For shorthand we define d_{*}^{(1)}=d_{*}(T,\delta),d_{*}^{(\kappa^{2})}=d_{*}(\kappa^{-2}T,\delta),\hat{d}^{(1)}=\hat{d}(T,\delta),\hat{d}^{(\kappa^{2})}=\hat{d}(\kappa^{-2}T,\delta) and \mathcal{H}_{l}=\mathcal{H}_{0,l,l},\hat{\mathcal{H}}_{l}=\hat{\mathcal{H}}_{0,l,l}. Let \tilde{T}=\kappa^{-2}T. Then this implies that

cβR(d(1)+2d^(1))κm+pd(1)+log(κ2T~δ)T~\displaystyle\frac{c\beta R(\sqrt{d_{*}^{(1)}}+2\sqrt{\hat{d}^{(1)}})}{\kappa}\sqrt{\frac{m+pd_{*}^{(1)}+\log{\frac{\kappa^{2}\tilde{T}}{\delta}}}{\tilde{T}}} ^d^(1)^d(1)2\displaystyle\geq||\hat{\mathcal{H}}_{\hat{d}^{(1)}}-\hat{\mathcal{H}}_{d_{*}^{(1)}}||_{2}
^d^(1)^d(1)2\displaystyle||\hat{\mathcal{H}}_{\hat{d}^{(1)}}-\hat{\mathcal{H}}_{d_{*}^{(1)}}||_{2} ^d^(1)2^d(1)2\displaystyle\geq||\hat{\mathcal{H}}_{\hat{d}^{(1)}}-\mathcal{H}_{\infty}||_{2}-||\hat{\mathcal{H}}_{d_{*}^{(1)}}-\mathcal{H}_{\infty}||_{2}
^d(1)2+^d^(1)^d(1)2\displaystyle||\hat{\mathcal{H}}_{d_{*}^{(1)}}-\mathcal{H}_{\infty}||_{2}+||\hat{\mathcal{H}}_{\hat{d}^{(1)}}-\hat{\mathcal{H}}_{d_{*}^{(1)}}||_{2} ^d^(1)2\displaystyle\geq||\hat{\mathcal{H}}_{\hat{d}^{(1)}}-\mathcal{H}_{\infty}||_{2}
^d(1)d(1)2+d(1)2+^d^(1)^d(1)2\displaystyle||\hat{\mathcal{H}}_{d_{*}^{(1)}}-\mathcal{H}_{d_{*}^{(1)}}||_{2}+||\mathcal{H}_{d_{*}^{(1)}}-\mathcal{H}_{\infty}||_{2}+||\hat{\mathcal{H}}_{\hat{d}^{(1)}}-\hat{\mathcal{H}}_{d_{*}^{(1)}}||_{2} ^d^(1)2\displaystyle\geq||\hat{\mathcal{H}}_{\hat{d}^{(1)}}-\mathcal{H}_{\infty}||_{2}

Since by definition of d(,)d_{*}(\cdot,\cdot) we have

^d(1)d(1)2+d(1)22cβRκd(1)m+pd(1)+log(κ2T~δ)T~||\hat{\mathcal{H}}_{d_{*}^{(1)}}-\mathcal{H}_{d_{*}^{(1)}}||_{2}+||\mathcal{H}_{d_{*}^{(1)}}-\mathcal{H}_{\infty}||_{2}\leq\frac{2c\beta R}{\kappa}\sqrt{d_{*}^{(1)}}\sqrt{\frac{m+pd_{*}^{(1)}+\log{\frac{\kappa^{2}\tilde{T}}{\delta}}}{\tilde{T}}}

and by the assumptions d_{*}^{(1)}\leq\frac{\kappa}{8}d_{*}^{(\kappa^{2})} and \hat{d}^{(1)}\leq d_{*}^{(\kappa^{2})}, as a result (\sqrt{d_{*}^{(1)}}+2\sqrt{\hat{d}^{(1)}})\sqrt{d_{*}^{(1)}}\leq(\frac{2\kappa}{8}+1)d_{*}^{(\kappa^{2})}

^d^(1)2^d(1)d(1)2+d(1)2+^d^(1)^d(1)2\displaystyle||\hat{\mathcal{H}}_{\hat{d}^{(1)}}-\mathcal{H}_{\infty}||_{2}\leq||\hat{\mathcal{H}}_{d_{*}^{(1)}}-\mathcal{H}_{d_{*}^{(1)}}||_{2}+||\mathcal{H}_{d_{*}^{(1)}}-\mathcal{H}_{\infty}||_{2}+\underbrace{||\hat{\mathcal{H}}_{\hat{d}^{(1)}}-\hat{\mathcal{H}}_{d_{*}^{(1)}}||_{2}}_{\Downarrow}
2cβRd(1)κm+pd(1)+log(κ2T~δ)T~Prop.6+cβR(d(1)+2d^(1))κm+pd(1)+log(κ2T~δ)T~Definition of d^(1)\displaystyle\leq\underbrace{\frac{2c\beta R\sqrt{d_{*}^{(1)}}}{\kappa}\sqrt{\frac{m+pd_{*}^{(1)}+\log{\frac{\kappa^{2}\tilde{T}}{\delta}}}{\tilde{T}}}}_{\text{Prop.}\leavevmode\nobreak\ \ref{ds_error}}+\underbrace{\frac{c\beta R(\sqrt{d_{*}^{(1)}}+2\sqrt{\hat{d}^{(1)}})}{\kappa}\sqrt{\frac{m+pd_{*}^{(1)}+\log{\frac{\kappa^{2}\tilde{T}}{\delta}}}{\tilde{T}}}}_{\text{Definition of }\hat{d}^{(1)}}
^d^(1)2(12+1κ)cβRd(κ2)m+pd(κ2)+log(T~δ)T~\displaystyle||\hat{\mathcal{H}}_{\hat{d}^{(1)}}-\mathcal{H}_{\infty}||_{2}\leq\Big{(}\frac{1}{2}+\frac{1}{\kappa}\Big{)}c\beta R\sqrt{d_{*}^{(\kappa^{2})}}\sqrt{\frac{m+pd_{*}^{(\kappa^{2})}+\log{\frac{\tilde{T}}{\delta}}}{\tilde{T}}}

where the last inequality follows from (d(1)+2d^(1))d(1)(2κ8+1)d(κ2)(\sqrt{d_{*}^{(1)}}+2\sqrt{\hat{d}^{(1)}})\sqrt{d_{*}^{(1)}}\leq(\frac{2\kappa}{8}+1)d_{*}^{(\kappa^{2})}. Now by assumption

d^(1)26cβRd^(1)m+pd^(1)+log(κ2T~δ)κ2T~||\mathcal{H}_{\hat{d}^{(1)}}-\mathcal{H}_{\infty}||_{2}\geq 6c\beta R\sqrt{\hat{d}^{(1)}}\sqrt{\frac{m+p\hat{d}^{(1)}+\log{\frac{\kappa^{2}\tilde{T}}{\delta}}}{\kappa^{2}\tilde{T}}}

it is clear that

^d^(1)256d^(1)2||\hat{\mathcal{H}}_{\hat{d}^{(1)}}-\mathcal{H}_{\infty}||_{2}\geq\frac{5}{6}||\mathcal{H}_{\hat{d}^{(1)}}-\mathcal{H}_{\infty}||_{2}

and we can conclude that, since 65(12+1κ)<12\frac{6}{5}\Big{(}\frac{1}{2}+\frac{1}{\kappa}\Big{)}<\frac{1}{\sqrt{2}},

d^(1)2<cβRd(κ2)2m+pd(κ2)+log(T~δ)T~||\mathcal{H}_{\hat{d}^{(1)}}-\mathcal{H}_{\infty}||_{2}<c\beta R\sqrt{\frac{d_{*}^{(\kappa^{2})}}{2}}\sqrt{\frac{m+pd_{*}^{(\kappa^{2})}+\log{\frac{\tilde{T}}{\delta}}}{\tilde{T}}}

which implies that d^(1)d(κ2)1\hat{d}^{(1)}\geq d_{*}^{(\kappa^{2})}-1. This is because by definition of d(κ2)d_{*}^{(\kappa^{2})} we know that d(κ2)d_{*}^{(\kappa^{2})} is the minimum such that

d(κ2)2cβRd(κ2)2m+pd(κ2)+log(T~δ)T~||\mathcal{H}_{d_{*}^{(\kappa^{2})}}-\mathcal{H}_{\infty}||_{2}\leq c\beta R\sqrt{\frac{d_{*}^{(\kappa^{2})}}{2}}\sqrt{\frac{m+pd_{*}^{(\kappa^{2})}+\log{\frac{\tilde{T}}{\delta}}}{\tilde{T}}}

and furthermore from Proposition 2 we have for any d1d2d_{1}\leq d_{2}

0,,0,d1,d1120,,0,d2,d2.||\mathcal{H}_{0,\infty,\infty}-\mathcal{H}_{0,d_{1},d_{1}}||\geq\frac{1}{\sqrt{2}}||\mathcal{H}_{0,\infty,\infty}-\mathcal{H}_{0,d_{2},d_{2}}||.

This contradicts Assumption 3. Therefore, at least one of the three assumptions does not hold. Clearly, if Assumption 3 is invalid then we have a suitable lower bound on the chosen \hat{d}(\cdot,\cdot), i.e., since d_{*}(\kappa^{-2}T,\delta)\leq d_{*}(T,\delta)\leq\frac{\kappa}{8}d_{*}(\kappa^{-2}T,\delta) we get

κ8d^(κ2T~,δ)κ8d(T~,δ)κ8d(κ2T~,δ)κ8d^(κ2T~,δ)κ8d(T~,δ)κ8\frac{\kappa}{8}\hat{d}(\kappa^{2}\tilde{T},\delta)\geq\frac{\kappa}{8}d_{*}(\tilde{T},\delta)-\frac{\kappa}{8}\geq d_{*}(\kappa^{2}\tilde{T},\delta)-\frac{\kappa}{8}\geq\hat{d}(\kappa^{2}\tilde{T},\delta)-\frac{\kappa}{8}\geq d_{*}(\tilde{T},\delta)-\frac{\kappa}{8}

which, by Lemma 8 (since we pick \kappa=16 and, for large enough T, d_{*}(\tilde{T},\delta)\geq 4), implies that

^d^(κ2T~,δ)2\displaystyle||\hat{\mathcal{H}}_{\hat{d}(\kappa^{2}\tilde{T},\delta)}-\mathcal{H}_{\infty}||_{2} 3cβRd(κ2T~,δ)pd(κ2T~,δ)+log(κ2T~δ)κ2T~\displaystyle\leq 3c\beta R\sqrt{d_{*}(\kappa^{2}\tilde{T},\delta)}\sqrt{\frac{pd_{*}(\kappa^{2}\tilde{T},\delta)+\log{\frac{\kappa^{2}\tilde{T}}{\delta}}}{\kappa^{2}\tilde{T}}}
3κ8cβRd^(κ2T~,δ)pd^(κ2T~,δ)+log(κ2T~δ)κ2T~\displaystyle\leq\frac{3\kappa}{8}c\beta R\sqrt{\hat{d}(\kappa^{2}\tilde{T},\delta)}\sqrt{\frac{p\hat{d}(\kappa^{2}\tilde{T},\delta)+\log{\frac{\kappa^{2}\tilde{T}}{\delta}}}{\kappa^{2}\tilde{T}}}

Similarly, if assumption 22 is invalid then we get that

d^(κ2T~,δ)2<6cβRd^(κ2T~,δ)pd^(κ2T~,δ)+log(κ2T~δ)κ2T~||\mathcal{H}_{\hat{d}(\kappa^{2}\tilde{T},\delta)}-\mathcal{H}_{\infty}||_{2}<6c\beta R\sqrt{\hat{d}(\kappa^{2}\tilde{T},\delta)}\sqrt{\frac{p\hat{d}(\kappa^{2}\tilde{T},\delta)+\log{\frac{\kappa^{2}\tilde{T}}{\delta}}}{\kappa^{2}\tilde{T}}}

and because \hat{d}(\kappa^{2}\tilde{T},\delta)\leq d_{*}(\kappa^{2}\tilde{T},\delta) and ||\hat{\mathcal{H}}_{\hat{d}(\kappa^{2}\tilde{T},\delta)}-\mathcal{H}_{\infty}||_{2}\leq||\mathcal{H}_{\hat{d}(\kappa^{2}\tilde{T},\delta)}-\mathcal{H}_{\infty}||_{2}+||\hat{\mathcal{H}}_{\hat{d}(\kappa^{2}\tilde{T},\delta)}-\mathcal{H}_{\hat{d}(\kappa^{2}\tilde{T},\delta)}||_{2}, we get, in a similar fashion to Proposition 6,

^d^(κ2T~,δ)212cβRd^(κ2T~,δ)pd^(κ2T~,δ)+log(κ2T~δ)κ2T~||\hat{\mathcal{H}}_{\hat{d}(\kappa^{2}\tilde{T},\delta)}-\mathcal{H}_{\infty}||_{2}\leq 12c\beta R\sqrt{\hat{d}(\kappa^{2}\tilde{T},\delta)}\sqrt{\frac{p\hat{d}(\kappa^{2}\tilde{T},\delta)+\log{\frac{\kappa^{2}\tilde{T}}{\delta}}}{\kappa^{2}\tilde{T}}}

Replacing κ2T~=T\kappa^{2}\tilde{T}=T it is clear that for any κ16\kappa\geq 16

^d^(T,δ)212cβRd^(T,δ)pd^(T,δ)+log(Tδ)T||\hat{\mathcal{H}}_{\hat{d}(T,\delta)}-\mathcal{H}_{\infty}||_{2}\leq 12c\beta R\sqrt{\hat{d}(T,\delta)}\sqrt{\frac{p\hat{d}(T,\delta)+\log{\frac{T}{\delta}}}{T}} (65)

If d(T,δ)log((Tδ))d_{*}(T,\delta)\leq\log{\left(\frac{T}{\delta}\right)} then we can simply apply Lemma 8 and our assertion holds.

14 Model Selection Results

Proposition 1.

Let 0,,=UΣV,^0,d^,d^=U^Σ^V^\mathcal{H}_{0,\infty,\infty}=U\Sigma V^{\top},\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}=\hat{U}\hat{\Sigma}\hat{V}^{\top} and

0,,^0,d^,d^ϵ.||\mathcal{H}_{0,\infty,\infty}-\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}||\leq\epsilon.

Let Σ^\hat{\Sigma} be arranged into blocks of singular values such that in each block ii we have

supjσ^jiσ^j+1iχϵ\sup_{j}\hat{\sigma}^{i}_{j}-\hat{\sigma}^{i}_{j+1}\leq\chi\epsilon

for some χ2\chi\geq 2, i.e.,

Σ^=[Λ1000Λ20000Λl]\hat{\Sigma}=\begin{bmatrix}\Lambda_{1}&0&\ldots&0\\ 0&\Lambda_{2}&\ldots&0\\ \vdots&\vdots&\ddots&0\\ 0&0&\ldots&\Lambda_{l}\\ \end{bmatrix}

where Λi\Lambda_{i} are diagonal matrices and σ^ji\hat{\sigma}^{i}_{j} is the jthj^{th} singular value in the block Λi\Lambda_{i}. Then there exists an orthogonal transformation, QQ, such that

U^Σ^1/2QUΣ1/22\displaystyle||\hat{U}\hat{\Sigma}^{1/2}Q-U\Sigma^{1/2}||_{2} 2ϵσ^1/ζn12+σ^n1+1/ζn22++σ^i=1l1ni+1/ζnl2\displaystyle\leq 2\epsilon\sqrt{\hat{\sigma}_{1}/\zeta_{n_{1}}^{2}+\hat{\sigma}_{n_{1}+1}/\zeta_{n_{2}}^{2}+\ldots+\hat{\sigma}_{\sum_{i=1}^{l-1}n_{i}+1}/\zeta_{n_{l}}^{2}}
+2sup1ilσ^maxiσ^mini+ϵσ^d^ϵ.\displaystyle+2\sup_{1\leq i\leq l}\sqrt{\hat{\sigma}^{i}_{\max}}-\sqrt{\hat{\sigma}^{i}_{\min}}+\frac{\epsilon}{\sqrt{\hat{\sigma}_{\hat{d}}}}\wedge\sqrt{\epsilon}.

Here sup1ilσ^maxiσ^miniχσ^maxiϵd^χd^ϵ\sup_{1\leq i\leq l}\sqrt{\hat{\sigma}^{i}_{\max}}-\sqrt{\hat{\sigma}^{i}_{\min}}\leq\frac{\chi}{\sqrt{\hat{\sigma}^{i}_{\max}}}\epsilon\hat{d}\wedge\sqrt{\chi\hat{d}\epsilon} and

ζni=min(σ^minni1σ^maxni,σ^minniσ^maxni+1)\zeta_{n_{i}}=\min{({\hat{\sigma}}^{n_{i-1}}_{\min}-{\hat{\sigma}}^{n_{i}}_{\max},{\hat{\sigma}}^{n_{i}}_{\min}-{\hat{\sigma}}^{n_{i+1}}_{\max})}

for 1<i<l1<i<l, ζn1=σ^minn1σ^maxn2\zeta_{n_{1}}={\hat{\sigma}}^{n_{1}}_{\min}-{\hat{\sigma}}^{n_{2}}_{\max} and ζnl=min(σ^minnl1σ^maxnl,σ^minnl)\zeta_{n_{l}}=\min{({\hat{\sigma}}^{n_{l-1}}_{\min}-{\hat{\sigma}}^{n_{l}}_{\max},{\hat{\sigma}}^{n_{l}}_{\min})}.

Proof 14.1.

Let U^Σ^V^=SVD(^0,d^,d^)\hat{U}\hat{\Sigma}\hat{V}^{\top}=\text{SVD}(\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}) and UΣV=SVD(0,,){U}{\Sigma}{V}^{\top}=\text{SVD}(\mathcal{H}_{0,\infty,\infty}) where ^0,d^,d^0,,2ϵ||\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}-\mathcal{H}_{0,\infty,\infty}||_{2}\leq\epsilon. Σ^\hat{\Sigma} is arranged into blocks of singular values such that in each block ii we have σ^jiσ^j+1iχϵ\hat{\sigma}^{i}_{j}-\hat{\sigma}^{i}_{j+1}\leq\chi\epsilon, i.e.,

Σ^=[Λ1000Λ20000Λl]\hat{\Sigma}=\begin{bmatrix}\Lambda_{1}&0&\ldots&0\\ 0&\Lambda_{2}&\ldots&0\\ \vdots&\vdots&\ddots&0\\ 0&0&\ldots&\Lambda_{l}\\ \end{bmatrix}

where Λi\Lambda_{i} are diagonal matrices and σ^ji\hat{\sigma}^{i}_{j} is the jthj^{th} singular value in the block Λi\Lambda_{i}. Furthermore, σ^mini1σ^maxi>χϵ\hat{\sigma}^{i-1}_{\min}-\hat{\sigma}^{i}_{\max}>\chi\epsilon. From Σ^\hat{\Sigma} define Σ^¯\bar{\hat{\Sigma}} as follows:

Σ^¯=[σ^¯1In1×n1000σ^¯2In2×n20000σ^¯lInl×nl]\bar{\hat{\Sigma}}=\begin{bmatrix}\bar{\hat{\sigma}}_{1}I_{n_{1}\times n_{1}}&0&\ldots&0\\ 0&\bar{\hat{\sigma}}_{2}I_{n_{2}\times n_{2}}&\ldots&0\\ \vdots&\vdots&\ddots&0\\ 0&0&\ldots&\bar{\hat{\sigma}}_{l}I_{n_{l}\times n_{l}}\\ \end{bmatrix} (66)

where \Lambda_{i} is an n_{i}\times n_{i} matrix and \bar{\hat{\sigma}}_{i}=\frac{1}{n_{i}}\sum_{j}\hat{\sigma}^{i}_{j}. The key idea of the proof is the following: (A,B,C)\equiv(QAQ^{\top},QB,CQ^{\top}) where Q is an orthogonal transformation, and we will show that there exists a block diagonal unitary matrix Q of the form

Q=[Qn1×n1000Qn2×n20000Qnl×nl]Q={\begin{bmatrix}Q_{n_{1}\times n_{1}}&0&\ldots&0\\ 0&Q_{n_{2}\times n_{2}}&\ldots&0\\ \vdots&\vdots&\ddots&0\\ 0&0&\ldots&Q_{n_{l}\times n_{l}}\\ \end{bmatrix}} (67)

such that each block Q_{n_{i}\times n_{i}} corresponds to an orthogonal matrix of dimensions n_{i}\times n_{i}, and that ||\hat{U}\hat{\Sigma}^{1/2}Q-U\Sigma^{1/2}||_{2} is small if ||\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}-\mathcal{H}_{0,\infty,\infty}||_{2} is small. Each of the blocks corresponds to a set of singular values where the inter-singular-value distance is “small”. To start off, note that from Proposition 4 there must exist a Q that is block diagonal with orthogonal entries such that

U^QΣ^1/2UΣ1/22\displaystyle||\hat{U}Q\hat{\Sigma}^{1/2}-U\Sigma^{1/2}||_{2} cϵσ^1/ζn12+σ^n1+1/ζn22++σ^i=1l1ni+1/ζnl2+sup1id^|σiσ^i|\displaystyle\leq c\epsilon\sqrt{\hat{\sigma}_{1}/\zeta_{n_{1}}^{2}+\hat{\sigma}_{n_{1}+1}/\zeta_{n_{2}}^{2}+\ldots+\hat{\sigma}_{\sum_{i=1}^{l-1}n_{i}+1}/\zeta_{n_{l}}^{2}}+\sup_{1\leq i\leq\hat{d}}|\sqrt{\sigma_{i}}-\sqrt{\hat{\sigma}_{i}}| (68)

Here

ζni=min(σ^minni1σ^maxni,σ^minniσ^maxni+1)\zeta_{n_{i}}=\min{({\hat{\sigma}}^{n_{i-1}}_{\min}-{\hat{\sigma}}^{n_{i}}_{\max},{\hat{\sigma}}^{n_{i}}_{\min}-{\hat{\sigma}}^{n_{i+1}}_{\max})}

for 1<i<l, \zeta_{n_{1}}={\hat{\sigma}}^{n_{1}}_{\min}-{\hat{\sigma}}^{n_{2}}_{\max} and \zeta_{n_{l}}=\min{({\hat{\sigma}}^{n_{l-1}}_{\min}-{\hat{\sigma}}^{n_{l}}_{\max},{\hat{\sigma}}^{n_{l}}_{\min})}. Informally, the \zeta_{i} measure the singular value gaps between consecutive blocks.

Furthermore, it can be shown that for any QQ of the form in Eq. (67)

U^QΣ^1/2U^Σ^1/2Q2\displaystyle||\hat{U}Q\hat{\Sigma}^{1/2}-\hat{U}\hat{\Sigma}^{1/2}Q||_{2} U^QΣ^¯1/2U^QΣ^1/22+U^Σ^1/2QU^Σ^¯1/2Q22Σ^1/2Σ^¯1/22\displaystyle\leq||\hat{U}Q{\bar{\hat{\Sigma}}}^{1/2}-\hat{U}Q\hat{\Sigma}^{1/2}||_{2}+||\hat{U}\hat{\Sigma}^{1/2}Q-\hat{U}{\bar{\hat{\Sigma}}}^{1/2}Q||_{2}\leq 2||\hat{\Sigma}^{1/2}-\bar{\hat{\Sigma}}^{1/2}||_{2}

because $\hat{U}Q{\bar{\hat{\Sigma}}}^{1/2}=\hat{U}{\bar{\hat{\Sigma}}}^{1/2}Q$. Note that $||\hat{\Sigma}^{1/2}-\bar{\hat{\Sigma}}^{1/2}||_{2}\leq\sup_{1\leq i\leq l}\sqrt{\hat{\sigma}^{i}_{\max}}-\sqrt{\hat{\sigma}^{i}_{\min}}$. Now, since consecutive singular values within block $i$ differ by at most $\chi\epsilon$, we have $\hat{\sigma}^{i}_{\max}-\hat{\sigma}^{i}_{\min}\leq\chi n_{i}\epsilon$. Thus, when $\hat{\sigma}^{i}_{\max}\geq\chi n_{i}\epsilon$, then $\sqrt{\hat{\sigma}^{i}_{\max}}-\sqrt{\hat{\sigma}^{i}_{\min}}\leq\frac{\chi n_{i}\epsilon}{\sqrt{\hat{\sigma}^{i}_{\max}}}$; on the other hand, when $\hat{\sigma}^{i}_{\max}<\chi n_{i}\epsilon$, then $\sqrt{\hat{\sigma}^{i}_{\max}}-\sqrt{\hat{\sigma}^{i}_{\min}}\leq\sqrt{\hat{\sigma}^{i}_{\max}}\leq\sqrt{\chi n_{i}\epsilon}$. This implies that

sup1ilσ^maxiσ^miniχniσ^maxiϵχniϵ.\sup_{1\leq i\leq l}\sqrt{\hat{\sigma}^{i}_{\max}}-\sqrt{\hat{\sigma}^{i}_{\min}}\leq\frac{\chi n_{i}}{\sqrt{\hat{\sigma}^{i}_{\max}}}\epsilon\wedge\sqrt{\chi n_{i}\epsilon}.

Finally,

\displaystyle||\hat{U}\hat{\Sigma}^{1/2}Q-U\Sigma^{1/2}||_{2}\leq||\hat{U}Q\hat{\Sigma}^{1/2}-U\Sigma^{1/2}||_{2}+||\hat{U}Q\hat{\Sigma}^{1/2}-\hat{U}\hat{\Sigma}^{1/2}Q||_{2}
\displaystyle\leq 2\epsilon\sqrt{\hat{\sigma}_{1}/\zeta_{n_{1}}^{2}+\hat{\sigma}_{n_{1}+1}/\zeta_{n_{2}}^{2}+\ldots+\hat{\sigma}_{\sum_{i=1}^{l-1}n_{i}+1}/\zeta_{n_{l}}^{2}}+\sup_{1\leq i\leq\hat{d}}|\sqrt{\sigma_{i}}-\sqrt{\hat{\sigma}_{i}}|
\displaystyle\quad+2\sup_{1\leq i\leq l}\Big{(}\frac{\chi n_{i}\epsilon}{\sqrt{\hat{\sigma}^{i}_{\max}}}\wedge\sqrt{\chi n_{i}\epsilon}\Big{)}.

Our assertion follows since $|\sigma_{i}-\hat{\sigma}_{i}|\leq\epsilon$ for every $i$ by Weyl's inequality, which gives $\sup_{1\leq i\leq\hat{d}}|\sqrt{\sigma_{i}}-\sqrt{\hat{\sigma}_{i}}|\leq\frac{\epsilon}{\sqrt{\hat{\sigma}_{\hat{d}}}}\wedge\sqrt{\epsilon}$.

Proposition 2.

Let 0,,=UΣV,^0,d^,d^=U^Σ^V^\mathcal{H}_{0,\infty,\infty}=U\Sigma V^{\top},\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}=\hat{U}\hat{\Sigma}\hat{V}^{\top} and

0,,^0,d^,d^ϵ.||\mathcal{H}_{0,\infty,\infty}-\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}||\leq\epsilon.

Let Σ^\hat{\Sigma} be arranged into blocks of singular values such that in each block ii we have

supjσ^jiσ^j+1iχϵ\sup_{j}\hat{\sigma}^{i}_{j}-\hat{\sigma}^{i}_{j+1}\leq\chi\epsilon

for some χ2\chi\geq 2, i.e.,

Σ^=[Λ1000Λ20000Λl]\hat{\Sigma}=\begin{bmatrix}\Lambda_{1}&0&\ldots&0\\ 0&\Lambda_{2}&\ldots&0\\ \vdots&\vdots&\ddots&0\\ 0&0&\ldots&\Lambda_{l}\\ \end{bmatrix}

where Λi\Lambda_{i} are diagonal matrices and σ^ji\hat{\sigma}^{i}_{j} is the jthj^{th} singular value in the block Λi\Lambda_{i}. Then there exists an orthogonal transformation, QQ, such that

max(C^C2,B^B2)\displaystyle\max{\left(||\hat{C}-C||_{2},||\hat{B}-B||_{2}\right)} 2ϵσ^1/ζn12+σ^n1+1/ζn22++σ^i=1l1ni+1/ζnl2\displaystyle\leq 2\epsilon\sqrt{\hat{\sigma}_{1}/\zeta_{n_{1}}^{2}+\hat{\sigma}_{n_{1}+1}/\zeta_{n_{2}}^{2}+\ldots+\hat{\sigma}_{\sum_{i=1}^{l-1}n_{i}+1}/\zeta_{n_{l}}^{2}}
+2sup1ilσ^maxiσ^mini+ϵσ^d^ϵ=ζ,\displaystyle+2\sup_{1\leq i\leq l}\sqrt{\hat{\sigma}^{i}_{\max}}-\sqrt{\hat{\sigma}^{i}_{\min}}+\frac{\epsilon}{\sqrt{\hat{\sigma}_{\hat{d}}}}\wedge\sqrt{\epsilon}=\zeta,
AA^2\displaystyle||A-\hat{A}||_{2} 4γζ/σ^d^.\displaystyle\leq 4\gamma\cdot\zeta/\sqrt{\hat{\sigma}_{\hat{d}}}.

Here sup1ilσ^maxiσ^miniχσ^maxiϵd^χd^ϵ\sup_{1\leq i\leq l}\sqrt{\hat{\sigma}^{i}_{\max}}-\sqrt{\hat{\sigma}^{i}_{\min}}\leq\frac{\chi}{\sqrt{\hat{\sigma}^{i}_{\max}}}\epsilon\hat{d}\wedge\sqrt{\chi\hat{d}\epsilon} and

ζni=min(σ^minni1σ^maxni,σ^minniσ^maxni+1)\zeta_{n_{i}}=\min{({\hat{\sigma}}^{n_{i-1}}_{\min}-{\hat{\sigma}}^{n_{i}}_{\max},{\hat{\sigma}}^{n_{i}}_{\min}-{\hat{\sigma}}^{n_{i+1}}_{\max})}

for 1<i<l1<i<l, ζn1=σ^minn1σ^maxn2\zeta_{n_{1}}={\hat{\sigma}}^{n_{1}}_{\min}-{\hat{\sigma}}^{n_{2}}_{\max} and ζnl=min(σ^minnl1σ^maxnl,σ^minnl)\zeta_{n_{l}}=\min{({\hat{\sigma}}^{n_{l-1}}_{\min}-{\hat{\sigma}}^{n_{l}}_{\max},{\hat{\sigma}}^{n_{l}}_{\min})}.

Proof 14.2.

The proof follows because all parameters are equivalent up to an orthogonal transformation (see the discussion preceding Proposition 2). We then apply Propositions 4 and 5.
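To make the statement concrete, the sketch below shows one standard Ho-Kalman-style realization step that produces $(\hat{C},\hat{A},\hat{B})$ from the truncated SVD of an estimated block-Hankel matrix. It is only illustrative: the variable names and the least-squares shift step used for $\hat{A}$ are our choices and need not coincide exactly with the construction analyzed above; we assume H_hat is a $(p\hat{d})\times(m\hat{d})$ block-Hankel estimate.

\begin{verbatim}
import numpy as np

def realize_from_hankel(H_hat, p, m, order):
    # Truncated SVD of the estimated block-Hankel matrix.
    U, s, Vt = np.linalg.svd(H_hat, full_matrices=False)
    U, s, Vt = U[:, :order], s[:order], Vt[:order, :]

    O = U * np.sqrt(s)                # estimated observability matrix, U Sigma^{1/2}
    R = np.sqrt(s)[:, None] * Vt      # estimated controllability matrix, Sigma^{1/2} V^T

    C_hat = O[:p, :]                  # first block row of O
    B_hat = R[:, :m]                  # first block column of R

    # A_hat from the shift structure of O: O[:-p, :] @ A  ~=  O[p:, :].
    A_hat, *_ = np.linalg.lstsq(O[:-p, :], O[p:, :], rcond=None)
    return C_hat, A_hat, B_hat
\end{verbatim}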

15 Order Estimation Lower Bound

Lemma 15.1 (Theorem 4.21 in Boucheron et al. (2013)).

Let {i}i=0N\{\mathbb{P}_{i}\}_{i=0}^{N} be probability laws over (Σ,𝒜)(\Sigma,\mathcal{A}) and let {Ai𝒜}i=0N\{A_{i}\in\mathcal{A}\}_{i=0}^{N} be disjoint events. If a=mini=0,,Ni(Ai)1/(N+1)a=\min_{i=0,\ldots,N}\mathbb{P}_{i}(A_{i})\geq 1/(N+1),

aalog((Na1a))+(1a)log((1a11aN))1Ni=1NKL(Pi||P0)\displaystyle a\leq a\log{\Big{(}\frac{Na}{1-a}\Big{)}}+(1-a)\log{\Big{(}\frac{1-a}{1-\frac{1-a}{N}}\Big{)}}\leq\frac{1}{N}\sum_{i=1}^{N}KL(P_{i}||P_{0}) (69)
Lemma 15.2 (Le Cam’s Method).

Let $P_{0},P_{1}$ be two probability laws. Then, for any estimator $\hat{M}$ of the true model $M$,

supθ{0,1}θ[MM^]121212KL(P0||P1)\displaystyle\sup_{\theta\in\{0,1\}}\mathbb{P}_{\theta}[M\neq\hat{M}]\geq\frac{1}{2}-\frac{1}{2}\sqrt{\frac{1}{2}KL(P_{0}||P_{1})}
Proposition 1.

Let ${\mathcal{N}}_{0},{\mathcal{N}}_{1}$ be two multivariate Gaussians with means $\mu_{0}\in\mathbb{R}^{T},\mu_{1}\in\mathbb{R}^{T}$ and covariance matrices $\Sigma_{0}\in\mathbb{R}^{T\times T},\Sigma_{1}\in\mathbb{R}^{T\times T}$, respectively. Then $\text{KL}({\mathcal{N}}_{0},{\mathcal{N}}_{1})=\frac{1}{2}\Big{(}\text{tr}(\Sigma_{1}^{-1}\Sigma_{0})-T+\log{\frac{\text{det}(\Sigma_{1})}{\text{det}(\Sigma_{0})}}+\mathbb{E}_{\mu_{1},\mu_{0}}[(\mu_{1}-\mu_{0})^{\top}\Sigma_{1}^{-1}(\mu_{1}-\mu_{0})]\Big{)}$.
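For deterministic means the expectation in Proposition 1 reduces to a quadratic form, and the expression can be evaluated directly; the sketch below (the function name is ours) is a plain numerical transcription of the formula. In the application that follows, the means $\mu_{i}=\Pi_{i}U$ may depend on the inputs, which is why the expectation is retained in the statement.

\begin{verbatim}
import numpy as np

def gaussian_kl(mu0, Sigma0, mu1, Sigma1):
    # KL(N_0 || N_1) between multivariate Gaussians (Proposition 1, fixed means).
    T = mu0.shape[0]
    Sigma1_inv = np.linalg.inv(Sigma1)
    _, logdet0 = np.linalg.slogdet(Sigma0)
    _, logdet1 = np.linalg.slogdet(Sigma1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(Sigma1_inv @ Sigma0) - T
                  + (logdet1 - logdet0)
                  + diff @ Sigma1_inv @ diff)
\end{verbatim}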

In this section we prove a lower bound on the finite time error for model approximation. In systems theory, subspace-based methods are useful for estimating the true system parameters. Intuitively, it should be harder to correctly estimate the subspace corresponding to the lower Hankel singular values, or “energy”, due to the presence of noise. However, owing to the strong structural constraints on the Hankel matrix, finding a minimax lower bound is a much harder proposition for LTI systems. Specifically, it is not clear whether standard subspace identification lower bounds yield reasonable estimates in a structured and non–i.i.d. setting such as ours. To alleviate some of the technical difficulties that arise in obtaining the lower bounds, we focus on a small set of LTI systems parametrized by a single number $\zeta$. Consider the following canonical-form LTI systems of order $1$ and $2$, respectively, with $m=p=1$, and let $R$ be the noise-to-signal ratio bound.

A0\displaystyle A_{0} =[010000ζ00],A1=A0,B0=[00β/R],B1=[0β/Rβ/R],C0=[00βR],C1=C0\displaystyle=\begin{bmatrix}0&1&0\\ 0&0&0\\ \zeta&0&0\end{bmatrix},A_{1}=A_{0},B_{0}=\begin{bmatrix}0\\ 0\\ \sqrt{\beta}/R\end{bmatrix},B_{1}=\begin{bmatrix}0\\ \sqrt{\beta}/R\\ \sqrt{\beta}/R\end{bmatrix},C_{0}=\begin{bmatrix}0&0&\sqrt{\beta}R\end{bmatrix},C_{1}=C_{0} (70)

A0,A1A_{0},A_{1} are Schur stable whenever |ζ|<1|\zeta|<1.

ζ,0\displaystyle\mathcal{H}_{\zeta,0} =β[1000000000000000000000000]\displaystyle=\beta\begin{bmatrix}1&0&0&0&0&\ldots\\ 0&0&0&0&0&\ldots\\ 0&0&0&0&0&\ldots\\ 0&0&0&0&0&\ldots\\ 0&0&0&0&0&\ldots\\ \vdots&\vdots&\vdots&\vdots&\vdots&\vdots\end{bmatrix}
ζ,1\displaystyle\mathcal{H}_{\zeta,1} =β[10ζ000ζ000ζ00000000000000]\displaystyle=\beta\begin{bmatrix}1&0&\zeta&0&0&\ldots\\ 0&\zeta&0&0&0&\ldots\\ \zeta&0&0&0&0&\ldots\\ 0&0&0&0&0&\ldots\\ 0&0&0&0&0&\ldots\\ \vdots&\vdots&\vdots&\vdots&\vdots&\vdots\end{bmatrix} (71)

Here $\mathcal{H}_{\zeta,0},\mathcal{H}_{\zeta,1}$ are the Hankel matrices generated by $(C_{0},A_{0},B_{0})$ and $(C_{1},A_{1},B_{1})$, respectively. It is easy to check that for $\mathcal{H}_{\zeta,1}$ we have $\frac{1}{\zeta}\leq\frac{\sigma_{1}}{\sigma_{2}}\leq\frac{1+\zeta}{\zeta}$, where $\sigma_{i}$ are the Hankel singular values. Further, the rank of $\mathcal{H}_{\zeta,0}$ is $1$ and that of $\mathcal{H}_{\zeta,1}$ is at least $2$. Also, $\frac{||\mathcal{T}\mathcal{O}_{0,\infty}((C_{i},A_{i},B_{i}))||_{2}}{||\mathcal{T}_{0,\infty}((C_{i},A_{i},B_{i}))||_{2}}\leq R$.
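These properties are easy to verify numerically on finite truncations of the Hankel matrices; the sketch below does so for illustrative (arbitrarily chosen) values of $\beta$, $R$, and $\zeta$.

\begin{verbatim}
import numpy as np

def hankel_from_markov(C, A, B, d=20):
    # d x d Hankel matrix of the scalar Markov parameters C A^k B, k = 0, 1, ...
    h = [(C @ np.linalg.matrix_power(A, k) @ B).item() for k in range(2 * d - 1)]
    return np.array([[h[i + j] for j in range(d)] for i in range(d)])

beta, R, zeta = 1.0, 10.0, 0.5        # illustrative values with |zeta| < 1
A = np.array([[0, 1, 0], [0, 0, 0], [zeta, 0, 0]], dtype=float)
B0 = np.array([[0.0], [0.0], [np.sqrt(beta) / R]])
B1 = np.array([[0.0], [np.sqrt(beta) / R], [np.sqrt(beta) / R]])
C = np.array([[0.0, 0.0, np.sqrt(beta) * R]])

H0 = hankel_from_markov(C, A, B0)
H1 = hankel_from_markov(C, A, B1)
s1 = np.linalg.svd(H1, compute_uv=False)
print(np.linalg.matrix_rank(H0))      # rank 1
print(np.linalg.matrix_rank(H1))      # at least 2
print(s1[0] / s1[1])                  # lies in [1/zeta, (1+zeta)/zeta]
\end{verbatim}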

This construction will be key to showing that identification of a particular rank realization depends on the condition number of the Hankel matrix. An alternate representation of the input–output behavior is

[yTyT1y1]\displaystyle\begin{bmatrix}y_{T}\\ y_{T-1}\\ \vdots\\ y_{1}\end{bmatrix} =[CBCAiBCAiT1B0CBCAiT2B00CB]Πi[uT+1uTu2]U\displaystyle=\underbrace{\begin{bmatrix}CB&CA_{i}B&\ldots&CA_{i}^{T-1}B\\ 0&CB&\ldots&CA_{i}^{T-2}B\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\ldots&CB\end{bmatrix}}_{\Pi_{i}}\underbrace{\begin{bmatrix}u_{T+1}\\ u_{T}\\ \vdots\\ u_{2}\end{bmatrix}}_{U}
+[CCAiCAiT10CCAiT200C]Oi[ηT+1ηTη2]+[wTwT1w1]\displaystyle+\underbrace{\begin{bmatrix}C&CA_{i}&\ldots&CA_{i}^{T-1}\\ 0&C&\ldots&CA_{i}^{T-2}\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\ldots&C\end{bmatrix}}_{O_{i}}\begin{bmatrix}\eta_{T+1}\\ \eta_{T}\\ \vdots\\ \eta_{2}\end{bmatrix}+\begin{bmatrix}w_{T}\\ w_{T-1}\\ \vdots\\ w_{1}\end{bmatrix} (72)

where $A_{i}\in\{A_{0},A_{1}\}$. We will prove this result for a general class of inputs, i.e., active inputs. We then follow the same steps as in the proof of Theorem 2 in Tu et al. (2018b).
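To connect Eq. (72) with the Gaussian likelihoods used next, the following illustrative sketch builds the upper-triangular block-Toeplitz maps $\Pi_{i}$ and $O_{i}$ for the SISO case $p=m=1$; the time ordering of the stacked vectors is generic here, and the function name is our own. Conditional on the inputs, the stacked output is then Gaussian with mean $\Pi_{i}U$ and covariance $O_{i}O_{i}^{\top}+I$, which is exactly the form fed into Proposition 1 below.

\begin{verbatim}
import numpy as np

def io_maps(C, A, B, T):
    # Upper-triangular block-Toeplitz maps: Pi sends stacked inputs to outputs,
    # O sends stacked process noise to outputs (SISO specialization of Eq. (72)).
    n = A.shape[0]
    markov_u = [(C @ np.linalg.matrix_power(A, k) @ B).item() for k in range(T)]
    markov_e = [(C @ np.linalg.matrix_power(A, k)).ravel() for k in range(T)]
    Pi = np.zeros((T, T))
    O = np.zeros((T, T * n))
    for i in range(T):
        for j in range(i, T):
            Pi[i, j] = markov_u[j - i]
            O[i, j * n:(j + 1) * n] = markov_e[j - i]
    return Pi, O
\end{verbatim}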

KL(P0||P1)\displaystyle\text{KL}(P_{0}||P_{1}) =𝔼P0[log(t=1Tγt(ut|{ul,yl}l=1t1)P0(yt|{ul}l=1t1)γt(ut|{ul,yl}l=1t1)P1(yt|{ul}l=1t1))]\displaystyle=\mathbb{E}_{P_{0}}\Bigg{[}\log{\prod_{t=1}^{T}\frac{\gamma_{t}(u_{t}|\{u_{l},y_{l}\}_{l=1}^{t-1})P_{0}(y_{t}|\{u_{l}\}_{l=1}^{t-1})}{\gamma_{t}(u_{t}|\{u_{l},y_{l}\}_{l=1}^{t-1})P_{1}(y_{t}|\{u_{l}\}_{l=1}^{t-1})}}\Bigg{]}
=𝔼P0[log(t=1TP0(yt|{ul}l=1t1)P1(yt|{ul}l=1t1))]\displaystyle=\mathbb{E}_{P_{0}}\Bigg{[}\log{\prod_{t=1}^{T}\frac{P_{0}(y_{t}|\{u_{l}\}_{l=1}^{t-1})}{P_{1}(y_{t}|\{u_{l}\}_{l=1}^{t-1})}}\Bigg{]}

Here $\gamma_{t}(\cdot|\cdot)$ is the active rule for choosing $u_{t}$ from past data. From Eq. (72) it is clear that, conditional on $\{u_{l}\}_{l=1}^{T}$, $\{y_{l}\}_{l=1}^{T}$ is Gaussian with mean $\Pi_{i}U$. We then use Birgé's inequality (Lemma 15.1). In our case $\Sigma_{0}=O_{0}O_{0}^{\top}+I,\Sigma_{1}=O_{1}O_{1}^{\top}+I$ where $O_{i}$ is given in Eq. (72). We apply a combination of Lemma 15.1 and Proposition 1, and assume the $\eta_{i}$ are i.i.d. Gaussian, to obtain the desired result. Note that $O_{1}=O_{0}$ but $\Pi_{1}\neq\Pi_{0}$. Therefore, from Proposition 1, $KL({\mathcal{N}}_{0},{\mathcal{N}}_{1})=\frac{1}{2}\mathbb{E}_{\mu_{1},\mu_{0}}[(\mu_{1}-\mu_{0})^{\top}\Sigma_{1}^{-1}(\mu_{1}-\mu_{0})]\leq T\frac{\zeta^{2}}{R^{2}}$ where $\mu_{i}=\Pi_{i}U$. For any $\delta\in(0,1/4)$, set $a=1-\delta$ in Lemma 15.1; then, whenever

δlog((δ1δ))+(1δ)log((1δδ))Tζ2R2\displaystyle\delta\log{\Big{(}\frac{\delta}{1-\delta}\Big{)}}+(1-\delta)\log{\Big{(}\frac{1-\delta}{\delta}\Big{)}}\geq\frac{T\zeta^{2}}{R^{2}} (73)

we have $\sup_{i\neq j}\mathbb{P}_{A_{i}}(A_{j})\geq\delta$. For $\delta\in[1/4,1)$ we use Le Cam's method (Lemma 15.2) and show that if $8\delta^{2}\geq\frac{T\zeta^{2}}{R^{2}}$ then $\sup_{i\neq j}\mathbb{P}_{A_{i}}(A_{j})\geq\delta$. Since $\delta^{2}\geq c\log{\frac{1}{\delta}}$ when $\delta\in[1/4,1)$ for an absolute constant $c$, our assertion holds.
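As a simple numerical illustration of the final step (under the same assumptions as above, with all constants chosen arbitrarily), the threshold in Eq. (73) can be evaluated directly: whenever $T\zeta^{2}/R^{2}$ falls below it, no procedure can distinguish the two models with failure probability smaller than $\delta$.

\begin{verbatim}
import numpy as np

def birge_threshold(delta, N=1):
    # Left-hand side of Eq. (73): kl(1 - delta, delta), i.e. Birge's bound with
    # N = 1 and a = 1 - delta.
    a = 1.0 - delta
    return a * np.log(N * a / (1 - a)) + (1 - a) * np.log((1 - a) / (1 - (1 - a) / N))

R, zeta, delta = 10.0, 0.1, 0.1       # illustrative values
T_required = birge_threshold(delta) * R**2 / zeta**2
print(T_required)   # data lengths T below this cannot resolve the model order at level delta
\end{verbatim}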