
Tuhin Sarkar (tsarkar@mit.edu), Alexander Rakhlin (rakhlin@mit.edu), Munther A. Dahleh (dahleh@mit.edu)

77 Massachusetts Ave, Cambridge, MA 02139

Nonparametric Finite Time LTI System Identification

Abstract

We address the problem of learning the parameters of a stable linear time invariant (LTI) system or linear dynamical system (LDS) with unknown latent space dimension, or order, from a single time–series of noisy input-output data. We focus on learning the best lower order approximation allowed by finite data. Motivated by subspace algorithms in systems theory, where the doubly infinite system Hankel matrix captures both order and good lower order approximations, we construct a Hankel-like matrix from noisy finite data using ordinary least squares. This circumvents the non-convexities that arise in system identification, and allows accurate estimation of the underlying LTI system. Our results rely on careful analysis of self-normalized martingale difference terms that helps bound identification error up to logarithmic factors of the lower bound. We provide a data-dependent scheme for order selection and find an accurate realization of system parameters, corresponding to that order, by an approach that is closely related to the Ho-Kalman subspace algorithm. We demonstrate that the proposed model order selection procedure is not overly conservative, i.e., for the given data length it is not possible to estimate higher order models or find higher order approximations with reasonable accuracy.

keywords:
Linear Dynamical Systems, System Identification, Nonparametric Statistics, Control Theory, Statistical Learning Theory

1 Introduction

Finite-time system identification—the problem of estimating the system parameters given a finite single time series of its output—is an important problem in the context of control theory, time series analysis, robotics, and economics, among many others. In this work, we focus on parameter estimation and model approximation of linear time invariant (LTI) systems or linear dynamical system (LDS), which are described by

$$X_{t+1} = AX_{t} + BU_{t} + \eta_{t+1}$$
$$Y_{t} = CX_{t} + w_{t}. \qquad (1)$$

Here $C\in\mathbb{R}^{p\times n}$, $A\in\mathbb{R}^{n\times n}$, $B\in\mathbb{R}^{n\times m}$; $\{\eta_t, w_t\}_{t=1}^{\infty}$ are process and output noise, $U_t$ is an external control input, $X_t$ is the latent state variable and $Y_t$ is the observed output. The goal here is parameter estimation, i.e., learning $(C,A,B)$ from a single finite time series of $\{Y_t, U_t\}_{t=1}^{T}$ when the order, $n$, is unknown. Since typically $p, m < n$, it becomes challenging to find suitable parametrizations of LTI systems for provably efficient learning. When $\{X_j\}_{j=1}^{\infty}$ are observed (or $C$ is known to be the identity matrix), identification of $(C,A,B)$ in Eq. (1) is significantly easier, and ordinary least squares (OLS) is a statistically optimal estimator. It is, in general, unclear how (or if) OLS can be employed in the case when the $X_t$'s are not observed.

To motivate the study of a lower-order approximation of a high-order system, consider the following example:

Example 1.1.

Consider $M_1 = (A_1, B_1, C_1)$ with

$$A_{1} = \begin{bmatrix}0&1&0&0&\ldots&0\\ 0&0&1&0&\ldots&0\\ \vdots&\vdots&\vdots&\vdots&\ddots&\vdots\\ 0&0&0&0&\ldots&1\\ -a&0&0&0&\ldots&0\end{bmatrix}_{n\times n}\quad B_{1} = \begin{bmatrix}0\\ 0\\ \vdots\\ 0\\ 1\end{bmatrix}_{n\times 1}\quad C_{1} = B_{1}^{\top} \qquad (2)$$

where $na \ll 1$ and $n > 20$. Here the order of $M_1$ is $n$. However, it can be approximated well by $M_2$, which is of a much lower order and given by

$$A_{2} = \begin{bmatrix}0&0\\ 1&0\end{bmatrix}\quad B_{2} = \begin{bmatrix}0\\ 1\end{bmatrix}\quad C_{2} = B_{2}^{\top}. \qquad (3)$$

For the same input $U_t$, if $Y^{(1)}_t, Y^{(2)}_t$ are the outputs generated by $M_1$ and $M_2$ respectively, then a simple computation shows that

$$\sup_{U}\sum_{t=1}^{\infty}\frac{(Y^{(1)}_{t}-Y^{(2)}_{t})^{2}}{U_{t}^{2}} \leq 4n^{2}a^{2} \ll 1.$$

This suggests that the actual value of $n$ is not important; rather, there exists an effective order, $r$ (which is $2$ in this case). This lower order model captures "most" of the LTI system.
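To make the gap concrete, a small numerical check (a sketch; the values of `n` and `a` below are illustrative choices satisfying $na \ll 1$, $n > 20$) can compare the impulse responses $C_iA_i^kB_i$ of $M_1$ and $M_2$:

```python
import numpy as np

n, a = 25, 1e-3 / 25                 # illustrative values with n*a << 1 and n > 20

# Build M1 = (A1, B1, C1) from Eq. (2)
A1 = np.zeros((n, n))
A1[np.arange(n - 1), np.arange(1, n)] = 1.0   # superdiagonal of ones
A1[-1, 0] = -a                                # -a in the bottom-left corner
B1 = np.zeros((n, 1)); B1[-1, 0] = 1.0
C1 = B1.T

# Build M2 = (A2, B2, C2) from Eq. (3)
A2 = np.array([[0.0, 0.0], [1.0, 0.0]])
B2 = np.array([[0.0], [1.0]])
C2 = B2.T

def markov(C, A, B, K):
    """Markov parameters [CB, CAB, ..., CA^{K-1}B] as a flat array."""
    out, X = [], B.copy()
    for _ in range(K):
        out.append((C @ X).item())
        X = A @ X
    return np.array(out)

K = 2000
diff = markov(C1, A1, B1, K) - markov(C2, A2, B2, K)
# l2 gap of the impulse responses; per the bound above it should be of order 2na
print("impulse-response gap:", np.linalg.norm(diff), " 2na =", 2 * n * a)
```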

Since the true model order is not known in many cases, we emphasize a nonparametric approach to identification: one which adaptively selects the best model order for the given data and approximates the underlying LTI system better as $T$ (the length of the data) grows. The key to this approach will be designing an estimator $\hat{M}$ from which we obtain a realization $(\hat{C}, \hat{A}, \hat{B})$ of the selected order.

1.1 Related Work

Linear time invariant systems are an extensively studied class of models in control and systems theory. These models are used in feedback control systems (for example in planetary soft landing systems for rockets (Açıkmeşe et al., 2013)) and as linear approximations to many non-linear systems that nevertheless work well in practice. In the absence of process and output noise, subspace-based system identification methods are known to learn $(C, A, B)$ (up to similarity transformation) (Ljung, 1987; Van Overschee and De Moor, 2012). These typically involve constructing a Hankel matrix from the input–output pairs and then obtaining system parameters by a singular value decomposition. Such methods are inspired by the celebrated Ho-Kalman realization algorithm (Ho and Kalman, 1966). The correctness of these methods is predicated on the knowledge of $n$ or the presence of infinite data. Other approaches include rank minimization-based methods for system identification (Fazel et al., 2013; Grussler et al., 2018), which further relax the rank constraint to a suitable convex formulation. However, there is a lack of statistical guarantees for these algorithms, and it is unclear how much data is required to obtain accurate estimates of system parameters from finite noisy data. Empirical methods such as the EM algorithm (Dempster et al., 1977) are also used in practice; however, these suffer from non-convexity in the problem formulation and can get trapped in local minima. Learning simpler approximations to complex models in the presence of finite noisy data was studied in Venkatesh and Dahleh (2001), where the identification error is decomposed into error due to approximation and error due to noise; however, the analysis assumes the knowledge of a "good" parametrization and does not provide statistical guarantees for learning the system parameters of such an approximation.

More recently, there has been a resurgence in the study of statistical identification of LTI systems from a single time series in the machine learning community. In cases when $C = I$, i.e., $X_t$ is observed directly, sharp finite time error bounds for identification of $A, B$ from a single time series are provided in Faradonbeh et al. (2017); Simchowitz et al. (2018); Sarkar and Rakhlin (2018). The approach to finding $A, B$ is based on standard ordinary least squares (OLS), given by

$$(\hat{A}, \hat{B}) = \arg\min_{A,B}\sum_{t=1}^{T}||X_{t+1} - [A, B][X_t^{\top}, U_t^{\top}]^{\top}||_2^2.$$
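For concreteness, a minimal numpy sketch of this fully observed OLS estimator (the dimensions, horizon, and example system below are our own illustrative choices) is:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, T = 3, 2, 500                      # illustrative dimensions and horizon

A = np.diag([0.9, 0.5, -0.3])            # an example Schur stable A
B = rng.standard_normal((n, m))

# Simulate X_{t+1} = A X_t + B U_t + eta_{t+1} with the state fully observed
X = np.zeros((T + 1, n))
U = rng.standard_normal((T, m))
for t in range(T):
    X[t + 1] = A @ X[t] + B @ U[t] + rng.standard_normal(n)

# OLS: regress X_{t+1} on [X_t, U_t]
Z = np.hstack([X[:-1], U])               # T x (n + m) regressor matrix
Theta, *_ = np.linalg.lstsq(Z, X[1:], rcond=None)
A_hat, B_hat = Theta[:n].T, Theta[n:].T  # [A, B] = Theta^T
print(np.linalg.norm(A_hat - A), np.linalg.norm(B_hat - B))
```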

Another closely related area is that of online prediction in time series Hazan et al. (2018); Agarwal et al. (2018). Finite time regret guarantees for prediction in linear time series are provided in Hazan et al. (2018). The approach there circumvents the need for system identification and instead uses a filtering technique that convolves the time series with eigenvectors of a specific Hankel matrix.

Closest to our work is that of Oymak and Ozay (2018). Their algorithm, which takes inspiration from the Ho-Kalman algorithm, assumes the knowledge of the model order $n$. This limits the applicability of the algorithm in two ways: first, it is unclear how the techniques can be extended to the case when $n$ is unknown, as is usually the case, and second, in many cases $n$ is very large and a much lower order LTI system can be a very good approximation of the original system. In such cases, constructing the order $n$ estimate might be unnecessarily conservative (see Example 1.1). Consequently, the error bounds do not reflect accurate dependence on the system parameters.

When $n$ is unknown, it is unclear when a singular value decomposition should be performed to obtain the parameter estimates via the Ho-Kalman algorithm. This leads to the question of model order selection from data. For subspace based methods, such problems have been addressed in Shibata (1976) and Bauer (2000), where order estimation is achieved by analyzing the information contained in the estimated singular values and/or the estimated innovation variance; these works provide guarantees for asymptotic consistency of the methods described. Another line of literature, studied in Ljung et al. (2015) for example, approaches the identification of systems with unknown order by first learning the largest possible model that fits the data and then performing model reduction to obtain the final system. Although one can show that asymptotically this method outputs the true model, we show that such a two step procedure may underperform in a finite time setting. A possible explanation for this could be that learning the largest possible model with finite data over-fits on the exogenous noise and therefore gives poor model estimates.

Other related work on identifying finite impulse response approximations includes Goldenshluger (1998) and Tu et al. (2017), but these do not discuss parameter estimation or reduced order modeling. Several authors, e.g., Campi and Weyer (2002); Shah et al. (2012); Hardt et al. (2016) and references therein, have studied the problem of system identification in different contexts. However, they fail to capture the correct dependence of the error rates on the system parameters. More importantly, they suffer from the same limitation as Oymak and Ozay (2018) in that they require the knowledge of $n$.

2 Mathematical Preliminaries

Throughout the paper, we will refer to an LTI system with dynamics as in Eq. (1) by $M = (C, A, B)$. For a matrix $A$, let $\sigma_i(A)$ be the $i^{\text{th}}$ singular value of $A$ with $\sigma_i(A) \geq \sigma_{i+1}(A)$. Further, $\sigma_{\max}(A) = \sigma_1(A) = \sigma(A)$. Similarly, we define $\rho_i(A) = |\lambda_i(A)|$, where $\lambda_i(A)$ is an eigenvalue of $A$ with $\rho_i(A) \geq \rho_{i+1}(A)$. Again, $\rho_{\max}(A) = \rho_1(A) = \rho(A)$.

Definition 2.1.

A matrix $A$ is Schur stable if $\rho_{\max}(A) < 1$.

We will only be interested in the class of LTI systems that are Schur stable. Fix $\gamma > 0$ (and possibly much greater than $1$). The model class $\mathcal{M}_r$ of LTI systems parametrized by $r\in\mathbb{Z}_+$ is defined as

$$\mathcal{M}_r = \{(C, A, B)\;|\;C\in\mathbb{R}^{p\times r}, A\in\mathbb{R}^{r\times r}, B\in\mathbb{R}^{r\times m}, \rho(A) < 1, \sigma(A)\leq\gamma\}. \qquad (4)$$
Definition 2.2.

The $(k, p, q)$-dimensional Hankel matrix for $M = (C, A, B)$ is defined as

$$\mathcal{H}_{k,p,q}(M) = \begin{bmatrix}CA^{k}B&CA^{k+1}B&\ldots&CA^{q+k-1}B\\ CA^{k+1}B&CA^{k+2}B&\ldots&CA^{q+k}B\\ \vdots&\vdots&\ddots&\vdots\\ CA^{p+k-1}B&\ldots&\ldots&CA^{p+q+k-2}B\end{bmatrix}$$

and its associated Toeplitz matrix as

$$\mathcal{T}_{k,d}(M) = \begin{bmatrix}0&0&\ldots&0&0\\ CA^{k}B&0&\ldots&0&0\\ \vdots&\ddots&\ddots&\vdots&0\\ CA^{d+k-3}B&\ldots&CA^{k}B&0&0\\ CA^{d+k-2}B&CA^{d+k-3}B&\ldots&CA^{k}B&0\end{bmatrix}.$$

We will slightly abuse notation by referring to $\mathcal{H}_{k,p,q}(M) = \mathcal{H}_{k,p,q}$, and similarly for the Toeplitz matrices $\mathcal{T}_{k,d}(M) = \mathcal{T}_{k,d}$. The matrix $\mathcal{H}_{0,\infty,\infty}(M)$ is known as the system Hankel matrix corresponding to $M$, and its rank is known as the model order (or simply order) of $M$. The system Hankel matrix has two well-known properties that make it useful for system identification. First, the rank of $\mathcal{H}_{0,\infty,\infty}$ is upper bounded by $n$. Second, it maps the "past" inputs to "future" outputs. These properties are discussed in detail in Section 9.2 of the appendix. For infinite matrices such as $\mathcal{H}_{0,\infty,\infty}$, we let $||\mathcal{H}_{0,\infty,\infty}||_2 \triangleq ||\mathcal{H}_{0,\infty,\infty}||_{\text{op}}$, i.e., the operator norm.
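As an illustration, small helpers (a sketch; the finite-dimensional system $(C, A, B)$ is whatever model one has at hand) that build the finite blocks $\mathcal{H}_{k,p,q}$ and $\mathcal{T}_{k,d}$ from the Markov parameters $CA^{j}B$ could look like:

```python
import numpy as np

def markov_params(C, A, B, num):
    """Return the list [CB, CAB, ..., CA^{num-1}B]."""
    out, Ak = [], np.eye(A.shape[0])
    for _ in range(num):
        out.append(C @ Ak @ B)
        Ak = A @ Ak
    return out

def hankel_block(C, A, B, k, p, q):
    """H_{k,p,q}: block (i, j) equals C A^{k+i+j} B."""
    G = markov_params(C, A, B, k + p + q)
    return np.block([[G[k + i + j] for j in range(q)] for i in range(p)])

def toeplitz_block(C, A, B, k, d):
    """T_{k,d}: strictly lower block-triangular with block (i, j) = C A^{k+i-j-1} B for i > j."""
    G = markov_params(C, A, B, k + d)
    Z = np.zeros((C.shape[0], B.shape[1]))
    return np.block([[G[k + i - j - 1] if i > j else Z for j in range(d)]
                     for i in range(d)])
```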

Definition 2.3.

The transfer function of $M = (C, A, B)$ is given by $G(z) = C(zI - A)^{-1}B$, where $z\in\mathbb{C}$.

The transfer function plays a critical role in control theory as it relates the input to the output. Succinctly, the transfer function of an LTI system is the Z-transform of the output in response to a unit impulse input. Since for any invertible $S$ the LTI systems $M_1 = (CS^{-1}, SAS^{-1}, SB)$ and $M_2 = (C, A, B)$ have identical transfer functions, identification may not be unique, but is equivalent up to a transformation $S$, i.e., $(C, A, B) \equiv (CS, S^{-1}AS, S^{-1}B)$. Next, we define a system norm that will be important from the perspective of model identification and approximation.

Definition 2.4.

The $\mathcal{H}_{\infty}$-system norm of a Schur stable LTI system $M$ is given by

$$||M||_{\infty} = \sup_{\omega\in\mathbb{R}}\sigma_{\max}(G(e^{j\omega})).$$

Here, $G(\cdot)$ is the transfer function of $M$. The $r$-truncation of the transfer function is defined as

$$G_r \coloneqq [CB, CAB, \ldots, CA^{r-1}B]. \qquad (5)$$

For a stable LTI system $M$ we have:

Proposition 1 (Lemma 2.2 Glover (1987)).

Let $M$ be an LTI system. Then

$$||M||_{H} = \sigma_1 \leq ||M||_{\infty} \leq 2(\sigma_1 + \ldots + \sigma_n)$$

where $\sigma_i$ are the singular values of $\mathcal{H}_{0,\infty,\infty}(M)$.
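To see Proposition 1 in action, a quick numerical check (a sketch; the example system, the Hankel truncation size, and the frequency grid are arbitrary illustrative choices) compares the Hankel singular values of a large finite block with a gridded estimate of $||M||_{\infty}$:

```python
import numpy as np

# An example Schur stable SISO system (illustrative choice)
A = np.diag([0.9, -0.5, 0.2])
B = np.ones((3, 1))
C = np.ones((1, 3))

# Large finite Hankel block as a proxy for H_{0,inf,inf}
K = 200
G = [C @ np.linalg.matrix_power(A, k) @ B for k in range(2 * K)]
H = np.block([[G[i + j] for j in range(K)] for i in range(K)])
svals = np.linalg.svd(H, compute_uv=False)

# H_infinity norm via a frequency grid of G(e^{jw}) = C (e^{jw} I - A)^{-1} B
ws = np.linspace(0, np.pi, 2000)
hinf = max(np.linalg.norm(C @ np.linalg.inv(np.exp(1j * w) * np.eye(3) - A) @ B, 2)
           for w in ws)

# sigma_1 <= ||M||_inf <= 2 * sum(sigma_i), as in Proposition 1
print(svals[0], "<=", hinf, "<=", 2 * svals.sum())
```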

For any matrix $Z$, define $Z_{m:n, p:q}$ as the submatrix including rows $m$ to $n$ and columns $p$ to $q$. Further, $Z_{m:n, :}$ is the submatrix including rows $m$ to $n$ and all columns, and a similar notation applies to $Z_{:, p:q}$. Finally, we define balanced truncated models, which will play an important role in our algorithm.

Definition 2.5 (Kung and Lin (1981)).

Let $\mathcal{H}_{0,\infty,\infty}(M) = U\Sigma V^{\top}$ where $\Sigma\in\mathbb{R}^{n\times n}$ ($n$ is the model order). Then for any $r\leq n$, the $r$-order balanced truncated model parameters are given by

$$C_r = [U\Sigma^{1/2}]_{1:p,1:r},\quad A_r = \Sigma_{1:r,1:r}^{-1/2}U_{:,1:r}^{\top}[U\Sigma^{1/2}]_{p+1:,1:r},\quad B_r = [\Sigma^{1/2}V^{\top}]_{1:r,1:m}.$$

For $r > n$, the $r$-order balanced truncated model parameters are simply the $n$-order truncated model parameters.

Definition 2.6.

We say a random vector $v\in\mathbb{R}^d$ is subgaussian with variance proxy $\tau^2$ if

$$\sup_{||\theta||_2 = 1}\sup_{p\geq 1}\left\{p^{-1/2}\left(\mathbb{E}[|\langle v, \theta\rangle|^{p}]\right)^{1/p}\right\} = \tau$$

and $\mathbb{E}[v] = \mathbf{0}$. We denote this by $v\sim\mathsf{subg}(\tau^2)$.

A fundamental result in model reduction from systems theory is the following

Theorem 2 (Theorem 21.26 Zhou et al. (1996)).

Let $M = (C, A, B)$ be the true model of order $n$ and $M_r = (C_r, A_r, B_r)$ be its balanced truncated model of order $r < n$. Assume that $\sigma_r \neq \sigma_{r+1}$. Then

$$||M - M_r||_{\infty} \leq 2(\sigma_{r+1} + \sigma_{r+2} + \ldots + \sigma_n)$$

where $\sigma_i$ are the Hankel singular values of $M$.

Critical to obtaining refined error rates will be a result from the theory of self-normalized martingales, an application of the pseudo-maximization technique in (Peña et al., 2008, Theorem 14.7):

Theorem 3.

Let $\{\bm{\mathcal{F}}_t\}_{t=0}^{\infty}$ be a filtration. Let $\{\eta_t\in\mathbb{R}^m, X_t\in\mathbb{R}^d\}_{t=1}^{\infty}$ be stochastic processes such that $\eta_t, X_t$ are $\bm{\mathcal{F}}_t$-measurable and $\eta_t$ is $\bm{\mathcal{F}}_{t-1}$-conditionally $\mathsf{subg}(L^2)$ for some $L > 0$. For any $t\geq 0$, define $V_t = \sum_{s=1}^{t}X_sX_s^{\prime}$ and $S_t = \sum_{s=1}^{t}\eta_{s+1}X_s$. Then for any $\delta > 0$, $V\succ 0$ and all $t\geq 0$ we have with probability at least $1-\delta$

$$S_t^{\top}(V + V_t)^{-1}S_t \leq 4L^2\left(\log\frac{1}{\delta} + \log\frac{\det(V + V_t)}{\det(V)} + m\right).$$

The proof of this result can be found as Theorem 7.

We denote by $c$ universal constants which can change from line to line. For numbers $a, b$, we define $a\wedge b \triangleq \min(a, b)$ and $a\vee b \triangleq \max(a, b)$.

Finally, for two matrices $M_1\in\mathbb{R}^{l_1\times l_1}, M_2\in\mathbb{R}^{l_2\times l_2}$ with $l_1 < l_2$, we define $M_1 - M_2 \triangleq \tilde{M}_1 - M_2$ where $\tilde{M}_1 = \begin{bmatrix}M_1 & 0_{l_1\times(l_2-l_1)}\\ 0_{(l_2-l_1)\times l_1} & 0_{(l_2-l_1)\times(l_2-l_1)}\end{bmatrix}$.

Proposition 4 (System Reduction).

Let $||S - P|| \leq \epsilon$ and let the singular values of $S$ be arranged as follows:

$$\sigma_1(S) > \ldots > \sigma_{r-1}(S) > \sigma_r(S) \geq \sigma_{r+1}(S) \geq \ldots \geq \sigma_s(S) > \sigma_{s+1}(S) > \ldots > \sigma_n(S) > \sigma_{n+1}(S) = 0$$

Furthermore, let $\epsilon$ be such that

$$\epsilon \leq \inf_{\{1\leq i\leq r-1\}\cup\{s+1\leq i\leq n\}}\Big(\frac{\sigma_i(P) - \sigma_{i+1}(P)}{2}\Big). \qquad (6)$$

Define $K_0 = [1, 2, \ldots, r-1]\cup[s+1, s+2, \ldots, n]$. Then

$$||U^S_{K_0}(\Sigma^S_{K_0})^{1/2} - U^P_{K_0}(\Sigma^P_{K_0})^{1/2}||_2 \leq 2\sqrt{\sum_{i=1}^{r-1}\frac{\sigma_i\epsilon^2}{(\sigma_i - \sigma_{i+1})^2\wedge(\sigma_{i-1} - \sigma_i)^2}} + 2\sqrt{\frac{\sigma_s\epsilon^2}{((\sigma_{r-1} - \sigma_s)\wedge(\sigma_r - \sigma_{s+1}))^2}} + \sup_{1\leq i\leq s}|\sqrt{\sigma_i} - \sqrt{\hat{\sigma}_i}|$$

where $\sigma_i = \sigma_i(S)$ and $\hat{\sigma}_i = \sigma_i(P)$.

The proof is provided as Proposition 4 in the appendix. This is an extension of Wedin's result that allows us to scale the recovery error of the $r^{th}$ singular vector by only the condition number associated with that singular vector. This is useful to represent the error of identifying an $r$-order approximation as a function of the $r^{th}$ singular value only.

We briefly summarize our contributions below.

3 Contributions

In this paper we provide a purely data-driven approach to system identification from a single time–series of finite noisy data. Drawing from tools in systems theory and the theory of self–normalized martingales, we offer a nearly optimal OLS-based algorithm to learn the system parameters. We summarize our contributions below:

  • The central theme of our approach is to estimate the infinite system Hankel matrix (to be defined below) with increasing accuracy as the length $T$ of the data grows. By utilizing a specific reformulation of the input–output relation in Eq. (1), we reduce the problem of Hankel matrix identification to that of regression between appropriately transformed versions of the output and input. The OLS solution is a matrix $\hat{\mathcal{H}}$ of size $\hat{d}$. More precisely, we show that with probability at least $1-\delta$,

    $$\Big|\Big|\hat{\mathcal{H}} - \mathcal{H}_{0,\hat{d},\hat{d}}\Big|\Big|_2 \lesssim \sqrt{\frac{\beta^2\hat{d}}{T}}\sqrt{p\hat{d} + \log\frac{T}{\delta}}$$

    for $T$ above a certain threshold, where $\mathcal{H}_{0,\hat{d},\hat{d}}$ is the $p\hat{d}\times m\hat{d}$ principal submatrix of the system Hankel matrix. Here $\beta$ is the $\mathcal{H}_{\infty}$-system norm.

  • We show that by growing $\hat{d}$ with $T$ in a specific fashion, $\hat{\mathcal{H}}$ becomes the minimax optimal estimator of the system Hankel matrix. The choice of $\hat{d}$ for a fixed $T$ is purely data-dependent and does not depend on the spectral radius of $A$ or on $n$.

  • It is well known in systems theory that an SVD of the doubly infinite system Hankel matrix gives us $A, B, C$. However, the presence of finite noisy data prevents learning these parameters accurately. We show that it is always possible to learn the parameters of a lower-order approximation of the underlying system. This is achieved by selecting the top $k$ singular vectors of $\hat{\mathcal{H}}$. The estimation guarantee corresponds to model selection in statistics. More precisely, for every $k\leq\hat{d}$, if $(A_k, B_k, C_k)$ are the parameters of a $k$-order balanced approximation of the original LTI system and $(\hat{A}_k, \hat{B}_k, \hat{C}_k)$ are the estimates of our algorithm, then for $T$ above a certain threshold we have

    $$||C_k - \hat{C}_k||_2 + ||A_k - \hat{A}_k||_2 + ||B_k - \hat{B}_k||_2 \lesssim \sqrt{\frac{\beta^2\hat{d}}{\hat{\sigma}_k^2 T}}\sqrt{p\hat{d} + \log\frac{T}{\delta}}$$

    with probability at least $1-\delta$, where $\hat{\sigma}_i$ is the $i^{\text{th}}$ largest singular value of $\hat{\mathcal{H}}$.

4 Problem Formulation and Discussion

4.1 Data Generation

Assume there exists an unknown $M = (C, A, B)\in\mathcal{M}_n$ for some unknown $n$. Let the transfer function of $M$ be $G(z)$. Suppose we observe the noisy output time series $\{Y_t\in\mathbb{R}^{p\times 1}\}_{t=1}^{T}$ in response to a user chosen input series $\{U_t\in\mathbb{R}^{m\times 1}\}_{t=1}^{T}$. We refer to this data generated by $M$ as $Z_T = \{(U_t, Y_t)\}_{t=1}^{T}$. We enforce the following assumptions on $M$.

Assumption 1

The noise processes $\{\eta_t, w_t\}_{t=1}^{\infty}$ in the dynamics of $M$ given by Eq. (1) are i.i.d., and $\eta_t, w_t$ are isotropic with subGaussian parameter $1$. Furthermore, $X_0 = 0$ almost surely. We will only select inputs $\{U_t\}_{t=1}^{T}$ that are isotropic subGaussian with subGaussian parameter $1$.

The input–output map of Eq. (1) can be represented in multiple alternate ways. One commonly used reformulation of the input–output map in systems and control theory is the following

$$\begin{bmatrix}Y_1\\ Y_2\\ \vdots\\ Y_T\end{bmatrix} = \mathcal{T}_{0,T}\begin{bmatrix}U_1\\ U_2\\ \vdots\\ U_T\end{bmatrix} + \mathcal{T}\mathcal{O}_{0,T}\begin{bmatrix}\eta_1\\ \eta_2\\ \vdots\\ \eta_T\end{bmatrix} + \begin{bmatrix}w_1\\ w_2\\ \vdots\\ w_T\end{bmatrix}$$

where $\mathcal{T}\mathcal{O}_{k,d}$ is defined as the Toeplitz matrix corresponding to the process noise $\eta_t$ (similar to Definition 2.2):

$$\mathcal{T}\mathcal{O}_{k,d} = \begin{bmatrix}0&0&\ldots&0&0\\ CA^{k}&0&\ldots&0&0\\ \vdots&\ddots&\ddots&\vdots&0\\ CA^{d+k-3}&\ldots&CA^{k}&0&0\\ CA^{d+k-2}&CA^{d+k-3}&\ldots&CA^{k}&0\end{bmatrix}.$$

$||\mathcal{T}_{0,T}||_2$ and $||\mathcal{T}\mathcal{O}_{0,T}||_2$ denote the observed amplifications of the control input and process noise respectively. Note that stability of $A$ ensures $||\mathcal{T}_{0,\infty}||_2, ||\mathcal{T}\mathcal{O}_{0,\infty}||_2 < \infty$. Suppose both $\eta_t, w_t = 0$ in Eq. (1). Then it is a well-known fact that

$$||M||_{\infty} = \sup_{U_t}\sqrt{\frac{\sum_{t=0}^{\infty}Y_t^{\top}Y_t}{\sum_{t=0}^{\infty}U_t^{\top}U_t}} \implies ||M||_{\infty} = ||\mathcal{T}_{0,\infty}||_2 \geq ||\mathcal{H}_{0,\infty,\infty}||_2. \qquad (7)$$

Assumption 2

There exist universal constants $\beta, R\geq 1$ such that $||\mathcal{T}_{0,\infty}||_2 \leq \beta$ and $\frac{||\mathcal{T}\mathcal{O}_{0,\infty}||_2}{||\mathcal{T}_{0,\infty}||_2} \leq R$.

Remark 4.1 (\mathcal{H}_{\infty}-norm estimation).

Assumption 2 imposes an upper bound on the $\mathcal{H}_{\infty}$-norm of the system. It is possible to estimate $||M||_{\infty}$ from data (see Tu et al. (2018a) and references therein). It is reasonable to expect that error rates for identification of the parameters $(C, A, B)$ depend on the noise-to-signal ratio $\frac{||\mathcal{T}\mathcal{O}_{0,\infty}||_2}{||\mathcal{T}_{0,\infty}||_2}$, i.e., identification is much harder when this ratio is large.

Remark 4.2 (RR estimation).

The noise-to-signal ratio hyperparameter can also be estimated from data by allowing the system to run with $U_t = 0$ and taking the average $\ell_2$ norm of the output $Y_t$, i.e., $(1/T)\sum_{t=1}^{T}\|Y_t\|_2^2$. For the purpose of the results of this paper we simply assume an upper bound on $R$. If $U_t$ were $\mathsf{subg}(L)$ instead of $\mathsf{subg}(1)$, the noise-to-signal ratio would be modified to $R/L$.

$m$: input dimension, $p$: output dimension
$\gamma$: known upper bound on $||A||_2$
$\delta$: error probability
$c, \mathcal{C}$: known absolute constants
$R$: known noise-to-signal ratio, i.e., $\frac{||\mathcal{T}\mathcal{O}_{0,\infty}||_2}{||\mathcal{T}_{0,\infty}||_2}$
$\beta$: known upper bound on the $\mathcal{H}_{\infty}$-norm of the LTI system
$\mathcal{D}(T) = \{d\;|\;T\geq cm^2 d\log^2(d)\log^2(m^2/\delta) + cd\log^3(2d)\}$
$\sigma_A = \sum_{l=1}^{d}||CA^lB||_2$, $\sigma_B = \sum_{l=1}^{d}||CA^l||_2$
$\sigma_C = \sqrt{\sigma\left(\sum_{k=1}^{d}\mathcal{T}_{d+k,T}^{\top}\mathcal{T}_{d+k,T}\right)}$, $\sigma_D = \sqrt{\sigma\left(\sum_{k=1}^{d}\mathcal{T}\mathcal{O}_{d+k,T}^{\top}\mathcal{T}\mathcal{O}_{d+k,T}\right)}$
$\alpha(l) = \sqrt{l}\left(\sqrt{\frac{lp + \log(T/\delta) + m}{T}}\right)$
Table 1: Summary of constants

5 Algorithmic Details

We will now represent the input–output relationship in terms of the Hankel and Toeplitz matrices defined before. Fix a $d$; then for any $l$ we have

$$\begin{bmatrix}Y_l\\ Y_{l+1}\\ \vdots\\ Y_{l+d-1}\end{bmatrix} = \mathcal{H}_{0,d,d}\begin{bmatrix}U_{l-1}\\ U_{l-2}\\ \vdots\\ U_{l-d}\end{bmatrix} + \mathcal{T}_{0,d}\begin{bmatrix}U_l\\ U_{l+1}\\ \vdots\\ U_{l+d-1}\end{bmatrix} + \mathcal{O}_{0,d,d}\begin{bmatrix}\eta_{l-1}\\ \eta_{l-2}\\ \vdots\\ \eta_{l-d+1}\end{bmatrix} + \mathcal{T}\mathcal{O}_{0,d}\begin{bmatrix}\eta_l\\ \eta_{l+1}\\ \vdots\\ \eta_{l+d-1}\end{bmatrix} + \mathcal{H}_{d,d,l-d-1}\begin{bmatrix}U_{l-d-1}\\ U_{l-d-2}\\ \vdots\\ U_1\end{bmatrix} + \mathcal{O}_{d,d,l-d-1}\begin{bmatrix}\eta_{l-d-1}\\ \eta_{l-d-2}\\ \vdots\\ \eta_1\end{bmatrix} + \begin{bmatrix}w_l\\ w_{l+1}\\ \vdots\\ w_{l+d-1}\end{bmatrix} \qquad (8)$$

or, succinctly,

$$\tilde{Y}^{+}_{l,d} = \mathcal{H}_{0,d,d}\tilde{U}^{-}_{l-1,d} + \mathcal{T}_{0,d}\tilde{U}^{+}_{l,d} + \mathcal{H}_{d,d,l-d-1}\tilde{U}^{-}_{l-d-1,l-d-1} + \mathcal{O}_{0,d,d}\tilde{\eta}^{-}_{l-1,d} + \mathcal{T}\mathcal{O}_{0,d}\tilde{\eta}^{+}_{l,d} + \mathcal{O}_{d,d,l-d-1}\tilde{\eta}^{-}_{l-d-1,l-d-1} + \tilde{w}^{+}_{l,d} \qquad (9)$$

Here

$$\mathcal{O}_{k,p,q} = \begin{bmatrix}CA^{k}&CA^{k+1}&\ldots&CA^{q+k-1}\\ CA^{k+1}&CA^{k+2}&\ldots&CA^{q+k}\\ \vdots&\vdots&\ddots&\vdots\\ CA^{p+k-1}&\ldots&\ldots&CA^{p+q+k-2}\end{bmatrix},\quad \tilde{Y}^{-}_{l,d} = \begin{bmatrix}Y_l\\ Y_{l-1}\\ \vdots\\ Y_{l-d+1}\end{bmatrix},\quad \tilde{Y}^{+}_{l,d} = \begin{bmatrix}Y_l\\ Y_{l+1}\\ \vdots\\ Y_{l+d-1}\end{bmatrix}.$$

Furthermore, $\tilde{U}^{-}_{l,d}, \tilde{\eta}^{-}_{l,d}$ are defined similarly to $\tilde{Y}^{-}_{l,d}$, and $\tilde{U}^{+}_{l,d}, \tilde{\eta}^{+}_{l,d}, \tilde{w}^{+}_{l,d}$ are defined similarly to $\tilde{Y}^{+}_{l,d}$. The $+$ and $-$ signs indicate moving forward and backward in time respectively. This representation will be at the center of our analysis.

There are three key steps in our algorithm which we describe in the following sections:

  • (a)

    Hankel submatrix estimation: Estimating $\mathcal{H}_{0,l,l}$ for every $1\leq l\leq T$. We refer to the estimators as $\{\hat{\mathcal{H}}_{0,l,l}\}_{l=1}^{T}$.

  • (b)

    Model Selection: From the estimators $\{\hat{\mathcal{H}}_{0,l,l}\}_{l=1}^{T}$, select $\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}$ in a data dependent way such that it "best" estimates $\mathcal{H}_{0,\infty,\infty}$.

  • (c)

    Parameter Recovery: For every $k\leq\hat{d}$, we do a singular value decomposition of $\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}$ to obtain parameter estimates for a "good" $k$-order approximation of the true model.

5.1 Hankel Submatrix Estimation

The goal of our system identification procedure is to estimate either $\mathcal{H}_{0,n,n}$ or $\mathcal{H}_{0,\infty,\infty}$. Since we only have finite data and no a priori knowledge of $n$, it is not possible to directly estimate these unknown matrices. The first step then is to estimate all possible Hankel submatrices that are "allowed" by the data, i.e., $\mathcal{H}_{0,d,d}$ for $d\leq T$. For a fixed $d$, Algorithm 1 estimates the $d\times d$ principal submatrix $\mathcal{H}_{0,d,d}$.

Algorithm 1 LearnSystem($T, d, m, p$)

Input: $T$ = horizon for learning, $d$ = Hankel size, $m$ = input dimension, $p$ = output dimension
Output: $\hat{\mathcal{H}}_{0,d,d}$

1: Generate $2T$ i.i.d. inputs $\{U_j\sim\mathcal{N}(0, I_{m\times m})\}_{j=1}^{2T}$.
2: Collect $2T$ input–output pairs $\{U_j, Y_j\}_{j=1}^{2T}$.
3: $\hat{\mathcal{H}}_{0,d,d} = \arg\min_{\mathcal{H}}\sum_{l=0}^{T-1}||\tilde{Y}^{+}_{l+d+1,d} - \mathcal{H}\tilde{U}^{-}_{l+d,d}||_2^2$
4: return $\hat{\mathcal{H}}_{0,d,d}$
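A minimal numpy sketch of this regression step (a sketch, not the authors' code; the arrays `U`, `Y` are assumed to hold the $2T$ collected input–output pairs, with the 0-indexed `U[t-1]` playing the role of $U_t$ in the paper) is:

```python
import numpy as np

def learn_hankel(U, Y, T, d):
    """OLS estimate of H_{0,d,d} from Algorithm 1 / Eq. (10).

    U: (2T, m) array of inputs, Y: (2T, p) array of outputs."""
    m, p = U.shape[1], Y.shape[1]
    Yplus = np.zeros((T, p * d))    # row l holds Ytilde^+_{l+d+1, d}
    Uminus = np.zeros((T, m * d))   # row l holds Utilde^-_{l+d, d}
    for l in range(T):
        # stacked future outputs Y_{l+d+1}, ..., Y_{l+2d}
        Yplus[l] = Y[l + d: l + 2 * d].reshape(-1)
        # stacked past inputs U_{l+d}, U_{l+d-1}, ..., U_{l+1}
        Uminus[l] = U[l + d - 1::-1][:d].reshape(-1)
    # The Hankel estimate solves  Yplus ~ Uminus @ H^T  in least squares
    Ht, *_ = np.linalg.lstsq(Uminus, Yplus, rcond=None)
    return Ht.T                     # (p*d) x (m*d) estimate of H_{0,d,d}
```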

It can be shown that

$$\hat{\mathcal{H}}_{0,d,d} = \Big(\sum_{l=0}^{T-1}\tilde{Y}^{+}_{l+d+1,d}(\tilde{U}^{-}_{l+d,d})^{\top}\Big)\Big(\sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}(\tilde{U}^{-}_{l+d,d})^{\top}\Big)^{+} \qquad (10)$$

and by running the algorithm $T$ times, we obtain $\{\hat{\mathcal{H}}_{0,d,d}\}_{d=1}^{T}$. A key step in showing that $\hat{\mathcal{H}}_{0,d,d}$ is a good estimator for $\mathcal{H}_{0,d,d}$ is to prove the finite time isometry of $V_T = \sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}(\tilde{U}^{-}_{l+d,d})^{\top}$, i.e., the sample covariance matrix.

Lemma 1.

Define

$$T_0(\delta, d) = cm^2 d\log^2(d)\log^2(m^2/\delta) + cd\log^3(2d)$$

where $c$ is some universal constant. Define the sample covariance matrix $V_T \coloneqq \sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}(\tilde{U}^{-}_{l+d,d})^{\top}$. Then, with probability at least $1-\delta$, for $T > T_0(\delta, d)$ we have

$$\frac{1}{2}TI \preceq V_T \preceq \frac{3}{2}TI \qquad (11)$$

Lemma 1 allows us to write Eq. (10) as $\hat{\mathcal{H}}_{0,d,d} = \Big(\sum_{l=0}^{T-1}\tilde{Y}^{+}_{l+d+1,d}(\tilde{U}^{-}_{l+d,d})^{\top}\Big)\Big(\sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}(\tilde{U}^{-}_{l+d,d})^{\top}\Big)^{-1}$ with high probability and to upper bound the estimation error for the $d\times d$ principal submatrix.

Theorem 2.

Fix $d$ and let $\hat{\mathcal{H}}_{0,d,d}$ be the output of Algorithm 1. Then for any $0 < \delta < 1$ and $T\geq T_0(\delta, d)$, we have with probability at least $1-\delta$

$$\Big|\Big|\hat{\mathcal{H}}_{0,d,d} - \mathcal{H}_{0,d,d}\Big|\Big|_2 \leq 4\sigma\sqrt{\frac{1}{T}}\sqrt{pd + \log\frac{1}{\delta} + m}.$$

Here $T_0(\delta, d) = cm^2 d\log^2(d)\log^2(m^2/\delta) + cd\log^3(2d)$, $c$ is a universal constant and $\sigma = \max(\sigma_A, \sigma_B, \sigma_C, \sigma_D)$ from Table 1.

Proof 5.1.

We outline the proof here. Recall Eqs. (8) and (9). Then for a fixed $d$,

$$\hat{\mathcal{H}}_{0,d,d} = \Big(\sum_{l=0}^{T-1}\tilde{Y}^{+}_{l+d+1,d}(\tilde{U}^{-}_{l+d,d})^{\top}\Big)V_T^{+}.$$

Then the identification error is

$$\Big|\Big|\hat{\mathcal{H}}_{0,d,d} - \mathcal{H}_{0,d,d}\Big|\Big|_2 = \Big|\Big|V_T^{+}\Big(\sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}\tilde{U}^{+\top}_{l+d+1,d}\mathcal{T}_{0,d}^{\top} + \tilde{U}^{-}_{l+d,d}\tilde{U}^{-\top}_{l,l}\mathcal{H}_{d,d,l}^{\top} + \tilde{U}^{-}_{l+d,d}\tilde{w}^{+\top}_{l+d+1,d} + \tilde{U}^{-}_{l+d,d}\tilde{\eta}^{-\top}_{l+d,d}\mathcal{O}_{0,d,d}^{\top} + \tilde{U}^{-}_{l+d,d}\tilde{\eta}^{+\top}_{l+d+1,d}\mathcal{T}\mathcal{O}^{\top}_{0,d} + \tilde{U}^{-}_{l+d,d}\tilde{\eta}^{-\top}_{l,l}\mathcal{O}_{d,d,l}^{\top}\Big)\Big|\Big|_2 = ||V_T^{+}E||_2 \qquad (12)$$

with

$$E = \sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}\tilde{U}^{+\top}_{l+d+1,d}\mathcal{T}_{0,d}^{\top} + \tilde{U}^{-}_{l+d,d}\tilde{U}^{-\top}_{l,l}\mathcal{H}_{d,d,l}^{\top} + \tilde{U}^{-}_{l+d,d}\tilde{w}^{+\top}_{l+d+1,d} + \tilde{U}^{-}_{l+d,d}\tilde{\eta}^{-\top}_{l+d,d}\mathcal{O}_{0,d,d}^{\top} + \tilde{U}^{-}_{l+d,d}\tilde{\eta}^{+\top}_{l+d+1,d}\mathcal{T}\mathcal{O}^{\top}_{0,d} + \tilde{U}^{-}_{l+d,d}\tilde{\eta}^{-\top}_{l,l}\mathcal{O}_{d,d,l}^{\top}.$$

By Lemma 1, whenever $T\geq T_0(\delta, d)$, we have with probability at least $1-\delta$

$$\frac{TI}{2} \preceq V_T \preceq \frac{3TI}{2}. \qquad (13)$$

This ensures, with high probability, that $V_T^{-1}$ exists and decays as $O(T^{-1})$. The next step involves showing that $||E||_2$ grows at most as $\sqrt{T}$ with high probability. This is reminiscent of Theorem 3 and the theory of self-normalized martingales. However, unlike that case, the conditional sub-Gaussianity requirements do not hold here. For example, let $\bm{\mathcal{F}}_l = \sigma(\eta_1, \ldots, \eta_l)$; then $\mathbb{E}[v^{\top}\tilde{\eta}^{-}_{l+1,l+1}|\bm{\mathcal{F}}_l] \neq 0$ for all $v$ since $\{\tilde{\eta}^{-}_{l+1,l+1}\}_{l=0}^{T-1}$ is not an independent sequence. As a result it is not immediately obvious how to apply Theorem 3 to our case. Under the event where Eq. (13) holds (which happens with high probability), a careful analysis of the normalized cross terms, i.e., $V_T^{-1/2}E$, shows that $||V_T^{-1/2}E||_2 = O(1)$ with high probability. This is summarized in Propositions 1-3. The idea is to decompose $E$ into a linear combination of independent subgaussians and reduce it to a form where we can apply Theorem 3. This comes at the cost of additional scaling in the form of system dependent constants, such as the $\mathcal{H}_{\infty}$-norm. Then we can conclude with high probability that $||\hat{\mathcal{H}} - \mathcal{H}_{0,d,d}||_2 \leq ||V_T^{-1/2}||_2||V_T^{-1/2}E||_2 \leq T^{-1/2}O(1)$. The full proof has been deferred to Section 11.1 in Appendix 11.

Remark 5.2.

Recall $\mathcal{D}(T)$ from Table 1. Since

$$d\in\mathcal{D}(T) \implies T\geq T_0(\delta, d),$$

we can restate Theorem 2 as follows: for a fixed $T$, we have with probability at least $1-\delta$ that

$$\Big|\Big|\hat{\mathcal{H}}_{0,d,d} - \mathcal{H}_{0,d,d}\Big|\Big|_2 \leq 4\sigma\sqrt{\frac{1}{T}}\sqrt{pd + \log\frac{1}{\delta} + m}$$

whenever $d\in\mathcal{D}(T)$.

We next present bounds on the $\sigma$ appearing in Theorem 2. From the perspective of model selection in later sections, we require that $\sigma$ be known. In the next proposition we present two bounds on $\sigma$: the first depends on unknown parameters and recovers the precise dependence on $d$; the second is an a priori known upper bound and incurs an additional factor of $\sqrt{d}$.

Proposition 3.

$\sigma$ upper bound independent of $d$:

$$\sigma \leq \frac{c_n}{(1-\rho(A))^2}$$

where $c_n$ depends only on $n$.

$\sigma$ upper bound dependent on $d$:

$$\sigma \leq \beta R\sqrt{d},$$

where $R$ is the noise-to-signal ratio as in Table 1.

Proof 5.3.

By Gelfand's formula, $||A^d||_2 \leq c(n)\rho_{\max}(A)^d$, where $\rho_{\max}(A) < 1$ and $c(n)$ is a constant that depends only on $n$. This implies that

$$\sigma_A = \sum_{l=0}^{d}||CA^lB||_2 \leq \sum_{l=0}^{\infty}||CA^lB||_2 \leq \sum_{l=0}^{\infty}c(n)\rho(A)^l = \frac{c(n)}{1-\rho(A)},$$

and

$$||\mathcal{T}_{d+k,T}||_2 \leq \sum_{l=0}^{T-1}||CA^{d+k+l}B||_2 \leq \frac{c(n)\rho(A)^{d+k}}{1-\rho(A)}.$$

Then

$$\sigma_C = \sqrt{\sigma\left(\sum_{k=1}^{d}\mathcal{T}_{d+k,T}^{\top}\mathcal{T}_{d+k,T}\right)} \leq \frac{c(n)\rho(A)^{d}}{(1-\rho(A))^2} \leq \frac{c(n)}{(1-\rho(A))^2}.$$

Similarly, there exists a finite upper bound on $\sigma_B, \sigma_D$, obtained by replacing $CA^lB$ and $\mathcal{T}_{d+k,T}$ with $CA^l$ and $\mathcal{T}\mathcal{O}_{d+k,T}$ respectively. For the $d$-dependent upper bound, we have

$$\sigma_A = \sum_{l=0}^{d}||CA^lB||_2 \leq \sqrt{d}\sqrt{\sum_{l=0}^{d}||CA^lB||_2^2} \leq \sqrt{d}||M||_H \leq \sqrt{d}\beta.$$

Since $\sigma\left(\mathcal{T}_{d+k,T}^{\top}\mathcal{T}_{d+k,T}\right) \leq \beta$, we also have

$$\sigma_C = \sqrt{\sigma\left(\sum_{k=1}^{d}\mathcal{T}_{d+k,T}^{\top}\mathcal{T}_{d+k,T}\right)} \leq \beta\sqrt{d}.$$

For $\sigma_B, \sigma_D$ we get an extra factor of $R$ because $||\mathcal{T}\mathcal{O}_{0,\infty}||_2 \leq \beta R$.

The key feature of the $d$-dependent upper bound is that it only depends on $\beta$ and $R$, which are known a priori.

Recall that $G_d = [CB, CAB, \ldots, CA^{d-1}B]$, i.e., the $d$-order FIR truncation of $G(z)$. Since the first $p$ rows of the $\mathcal{H}_{0,d,d}$ matrix correspond to $G_d$, we can obtain estimators for any $d$-order FIR approximation.

Corollary 4.

Let $\widehat{G}_d = \hat{\mathcal{H}}_{0,d,d}[1:p,:]$ denote the first $p$ rows of $\hat{\mathcal{H}}_{0,d,d}$. Then for any $0 < \delta < 1$ and $T\geq T_0(\delta, d)$, we have with probability at least $1-\delta$,

$$||\widehat{G}_d - G_d||_2 \leq 4\sigma\sqrt{\frac{1}{T}}\sqrt{pd + \log\frac{1}{\delta} + m}.$$

Proof 5.4.

The proof follows from $G_d = \mathcal{H}_{0,d,d}[1:p,:]$ and Theorem 2.

Next, we show that the error in Theorem 2 is minimax optimal (up to logarithmic factors) and cannot be improved by any estimation method.

Proposition 5.

Let $\sqrt{T}\geq c$ where $c$ is an absolute constant. Then for any estimator $\hat{\mathcal{H}}$ of $\mathcal{H}_{0,\infty,\infty}$ we have

$$\sup_{\hat{\mathcal{H}}}\mathbb{E}[||\hat{\mathcal{H}} - \mathcal{H}_{0,\infty,\infty}||_2] \geq c_n\cdot\sqrt{\frac{\log T}{T}}$$

where $c_n > 0$ is a constant that is independent of $T$ but can depend on system level parameters.

Proof 5.5.

Assume, to the contrary, that

$$\sup_{\hat{\mathcal{H}}}\mathbb{E}[||\hat{\mathcal{H}} - \mathcal{H}_{0,\infty,\infty}||_2] = o\left(\sqrt{\frac{\log T}{T}}\right).$$

Recall that $[\mathcal{H}_{0,\infty,\infty}]_{1:p,:} = [CB, CAB, \ldots]$ and $G(z) = z^{-1}CB + z^{-2}CAB + \ldots$. Similarly we have $\hat{G}(z)$. Define

$$||G - \hat{G}||_2 = \sqrt{\sum_{k=0}^{\infty}||CA^kB - \hat{C}\hat{A}^k\hat{B}||_2^2}.$$

If $\sup_{\hat{\mathcal{H}}}\mathbb{E}[||\hat{\mathcal{H}} - \mathcal{H}_{0,\infty,\infty}||_2] = o\Big(\sqrt{\frac{\log T}{T}}\Big)$, then since $||\hat{\mathcal{H}} - \mathcal{H}_{0,\infty,\infty}||_2 \geq ||G - \hat{G}||_2$ we can conclude that

$$\mathbb{E}[||G - \hat{G}||_2] = o\left(\sqrt{\frac{\log T}{T}}\right),$$

which contradicts Theorem 5 in Goldenshluger (1998). Thus, $\sup_{\hat{\mathcal{H}}}\mathbb{E}[||\hat{\mathcal{H}} - \mathcal{H}_{0,\infty,\infty}||_2] \geq c_n\cdot\sqrt{\frac{\log T}{T}}$.

5.2 Model Selection

At a high level, we want to choose $\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}$ from $\{\hat{\mathcal{H}}_{0,d,d}\}_{d=1}^{T}$ such that $\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}$ is a good estimator of $\mathcal{H}_{0,\infty,\infty}$. Our idea of model selection is motivated by Goldenshluger (1998). For any $\hat{\mathcal{H}}_{0,d,d}$, the error from $\mathcal{H}_{0,\infty,\infty}$ can be decomposed as

$$||\hat{\mathcal{H}}_{0,d,d} - \mathcal{H}_{0,\infty,\infty}||_2 \leq \underbrace{||\hat{\mathcal{H}}_{0,d,d} - \mathcal{H}_{0,d,d}||_2}_{\text{Estimation Error}} + \underbrace{||\mathcal{H}_{0,d,d} - \mathcal{H}_{0,\infty,\infty}||_2}_{\text{Truncation Error}}.$$

We would like to select $d = \hat{d}$ such that it balances the truncation and estimation errors in the following way:

$$c_2\cdot\text{Data dependent upper bound} \geq c_1\cdot\text{Estimation Error} \geq \text{Truncation Error}$$

where the $c_i$ are absolute constants. Such a balancing ensures that

$$||\hat{\mathcal{H}}_{0,\hat{d},\hat{d}} - \mathcal{H}_{0,\infty,\infty}||_2 \leq c_2\cdot(1/c_1 + 1)\cdot\text{Data dependent upper bound}. \qquad (14)$$

Note that such a balancing is possible because the estimation error increases as $d$ grows while the truncation error decreases with $d$. Furthermore, a data dependent upper bound on the estimation error can be obtained from Theorem 2. Unfortunately, $(C, A, B)$ are unknown and it is not immediately clear how to obtain such a bound on the truncation error.

To achieve this, we first define a truncation error proxy, i.e., a measure of how much we truncate if a specific $\hat{\mathcal{H}}_{0,d,d}$ is used. For a given $d$, we look at $||\hat{\mathcal{H}}_{0,d,d} - \hat{\mathcal{H}}_{0,l,l}||_2$ for $l\in\mathcal{D}(T)$ with $l\geq d$. This measures the additional error incurred if we choose $\hat{\mathcal{H}}_{0,d,d}$ as an estimator for $\mathcal{H}_{0,\infty,\infty}$ instead of $\hat{\mathcal{H}}_{0,l,l}$ for $l > d$. We then pick $\hat{d}$ as follows:

$$\hat{d} \coloneqq \inf\Bigg\{d\;\Bigg|\;||\hat{\mathcal{H}}_{0,d,d} - \hat{\mathcal{H}}_{0,l,l}||_2 \leq 16\beta R\cdot\alpha(l)\quad\forall\,l\in\mathcal{D}(T),\,l\geq d\Bigg\}. \qquad (15)$$

Recall that $\alpha(l) = \sqrt{\frac{pl^2 + l\log(T/\delta) + ml}{T}}$, where $\sqrt{\frac{pl + \log(T/\delta) + m}{T}}$ captures the estimation error incurred in learning the $l\times l$ Hankel submatrix; the extra factor $\beta\sqrt{l}$ is incurred because we need a data dependent, albeit coarse, upper bound on the estimation error.

A key step will be to show that, for any $l\geq d$, whenever

$$||\hat{\mathcal{H}}_{0,d,d} - \hat{\mathcal{H}}_{0,l,l}||_2 \leq c\beta R\cdot\alpha(l)$$

holds, it is ensured that

$$||\hat{\mathcal{H}}_{0,d,d} - \mathcal{H}_{0,\infty,\infty}||_2 \leq c\beta R\cdot\alpha(l)\quad\text{and}\quad||\hat{\mathcal{H}}_{0,l,l} - \mathcal{H}_{0,\infty,\infty}||_2 \leq c\beta R\cdot\alpha(l),$$

so there is no gain in choosing a larger Hankel submatrix estimate. By picking the smallest $d$ for which such a property holds for all larger Hankel submatrices, we ensure that a regularized model is estimated that "agrees" with the data.

Algorithm 2 Choice of $d$

Output: $\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}$, $\hat{d}$

1: $\mathcal{D}(T) = \Big\{d\;\Big|\;d\leq\frac{T}{cm^2\log^3(Tm/\delta)}\Big\}$, $\alpha(h) = \sqrt{h}\Big(\sqrt{\frac{m + hp + \log(T/\delta)}{T}}\Big)$.
2: $d_0(T, \delta) = \inf\Big\{l\;\Big|\;||\hat{\mathcal{H}}_{0,l,l} - \hat{\mathcal{H}}_{0,h,h}||_2 \leq 16\beta R(\alpha(h) + 2\alpha(l))\;\;\forall h\in\mathcal{D}(T), h\geq l\Big\}$.
3: $\hat{d} = \max\left(d_0(T, \delta), \log\left(\frac{T}{\delta}\right)\right)$
4: return $\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}$, $\hat{d}$
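A schematic implementation of this selection rule (a sketch; it reuses the hypothetical `learn_hankel` helper sketched after Algorithm 1, and `beta`, `R`, `delta` are the known constants of Table 1) could look like:

```python
import numpy as np

def choose_d(U, Y, T, beta, R, delta, D):
    """Data-dependent choice of d in the spirit of Algorithm 2. D is the candidate set D(T)."""
    m, p = U.shape[1], Y.shape[1]
    alpha = lambda h: np.sqrt(h) * np.sqrt((m + h * p + np.log(T / delta)) / T)
    H = {d: learn_hankel(U, Y, T, d) for d in D}      # all candidate Hankel estimates

    def pad(A, rows, cols):
        """Embed A in the top-left corner of a rows x cols zero matrix (Section 2 convention)."""
        Z = np.zeros((rows, cols))
        Z[:A.shape[0], :A.shape[1]] = A
        return Z

    d0 = max(D)
    for l in sorted(D):
        ok = True
        for h in sorted(D):
            if h < l:
                continue
            gap = np.linalg.norm(pad(H[l], p * h, m * h) - H[h], 2)
            if gap > 16 * beta * R * (alpha(h) + 2 * alpha(l)):
                ok = False
                break
        if ok:
            d0 = l
            break
    d_hat = max(d0, int(np.ceil(np.log(T / delta))))
    return d_hat, H.get(d_hat, learn_hankel(U, Y, T, d_hat))
```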

We now state the main estimation result for $\mathcal{H}_{0,\infty,\infty}$ with $d = \hat{d}$ as chosen in Algorithm 2. Define

$$T_*(\delta) = \inf\Big\{T\;\Big|\;d_*(T, \delta)\in\mathcal{D}(T),\;d_*(T, \delta)\leq 2d_*\left(\frac{T}{256}, \delta\right)\Big\} \qquad (16)$$

where

$$d_*(T, \delta) = \inf\Bigg\{d\;\Bigg|\;16\beta R\alpha(d) \geq ||\mathcal{H}_{0,d,d} - \mathcal{H}_{0,\infty,\infty}||_2\Bigg\}. \qquad (17)$$

A close look at Eq. (17) reveals that picking $d = d_*(T, \delta)$ ensures the balancing of Eq. (14). However, $d_*(T, \delta)$ depends on unknown quantities and cannot be computed. In that case, $\hat{d}$ in Eq. (15) becomes a proxy for $d_*(T, \delta)$. From an algorithmic standpoint, we no longer need any unknown information; the unknown parameters appear only in $T_*(\delta)$, which is required only to state the theoretical guarantee of Theorem 6 below.

Theorem 6.

Whenever $T\geq T_*(\delta)$, we have with probability at least $1-\delta$ that

$$||\hat{\mathcal{H}}_{0,\hat{d},\hat{d}} - \mathcal{H}_{0,\infty,\infty}||_2 \leq 12c\beta R\left(\sqrt{\frac{m\hat{d} + p\hat{d}^2 + \hat{d}\log\frac{T}{\delta}}{T}}\right).$$

The proof of Theorem 6 can be found as Proposition 9 in Appendix 13. We see that the error between $\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}$ and $\mathcal{H}_{0,\infty,\infty}$ can be upper bounded by a purely data dependent quantity. The next proposition shows that $\hat{d}$ does not grow more than logarithmically in $T$.

Proposition 7.

Let $T\geq T_*(\delta)$ and let $d_*(T, \delta)$ be as in Eq. (17). Then with probability at least $1-\delta$ we have

$$\hat{d} \leq d_*(T, \delta)\vee\log\Big(\frac{T}{\delta}\Big).$$

Furthermore,

$$d_*(T, \delta) \leq \frac{c\log(cT + \log\frac{1}{\delta}) - \log R + \log\beta}{\log\frac{1}{\rho(A)}}.$$

The effect of unknown quantities, such as the spectral radius, is subsumed in the finite time condition $T\geq T_*(\delta)$ and appears in an upper bound for $\hat{d}$; however, this information is not needed from an algorithmic perspective, as the selection of $\hat{d}$ is agnostic to the knowledge of $\rho(A)$. The proof of this proposition can be found as Propositions 7 and 4.

5.3 Parameter Recovery

Next we discuss finding the system parameters. To obtain the system parameters we use a balanced truncation algorithm on $\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}$, where $\hat{d}$ is the output of Algorithm 2. The details are summarized in Algorithm 3, where $\mathcal{H} = \hat{\mathcal{H}}_{0,\hat{d},\hat{d}}$.

Algorithm 3 Hankel2Sys($T, \hat{d}, k, m, p$)

Input: $T$ = horizon for learning, $\hat{d}$ = Hankel size, $m$ = input dimension, $p$ = output dimension
Output: System parameters $(\hat{C}_{\hat{d}}, \hat{A}_{\hat{d}}, \hat{B}_{\hat{d}})$

1: $\mathcal{H} = \hat{\mathcal{H}}_{0,\hat{d},\hat{d}}$
2: Pad $\mathcal{H}$ with zeros to make it of dimension $4p\hat{d}\times 4m\hat{d}$
3: $U, \Sigma, V \leftarrow$ SVD of $\mathcal{H}$
4: $U_{\hat{d}}, V_{\hat{d}} \leftarrow$ top $\hat{d}$ singular vectors
5: $\hat{C}_{\hat{d}} \leftarrow$ first $p$ rows of $U_{\hat{d}}\Sigma_{\hat{d}}^{1/2}$
6: $\hat{B}_{\hat{d}} \leftarrow$ first $m$ columns of $\Sigma_{\hat{d}}^{1/2}V^{\top}_{\hat{d}}$
7: $Z_0 = [U_{\hat{d}}\Sigma_{\hat{d}}^{1/2}]_{1:4p\hat{d}-p,:}$, $Z_1 = [U_{\hat{d}}\Sigma_{\hat{d}}^{1/2}]_{p+1:,:}$
8: $\hat{A}_{\hat{d}} \leftarrow (Z_0^{\top}Z_0)^{-1}Z_0^{\top}Z_1$
9: return $(\hat{C}_{\hat{d}}, \hat{A}_{\hat{d}}, \hat{B}_{\hat{d}})$
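A compact numpy rendering of Algorithm 3 (a sketch under the same conventions as the earlier snippets; `H_hat` is the $p\hat{d}\times m\hat{d}$ estimate returned by the model selection step) might be:

```python
import numpy as np

def hankel2sys(H_hat, d, m, p):
    """Balanced-truncation style parameter recovery in the spirit of Algorithm 3."""
    # Step 2: zero-pad to 4pd x 4md so the shifted block in step 7 is well defined
    H = np.zeros((4 * p * d, 4 * m * d))
    H[:H_hat.shape[0], :H_hat.shape[1]] = H_hat

    # Steps 3-4: SVD and truncation to the top d singular directions
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    U, s, Vt = U[:, :d], s[:d], Vt[:d, :]

    O = U * np.sqrt(s)                 # U_d Sigma_d^{1/2}  (observability-like factor)
    Ctr = np.sqrt(s)[:, None] * Vt     # Sigma_d^{1/2} V_d^T (controllability-like factor)

    C_hat = O[:p, :]                   # step 5: first p rows
    B_hat = Ctr[:, :m]                 # step 6: first m columns

    # Steps 7-8: A_hat solves the least-squares shift equation Z0 A = Z1
    Z0, Z1 = O[:-p, :], O[p:, :]
    A_hat, *_ = np.linalg.lstsq(Z0, Z1, rcond=None)
    return C_hat, A_hat, B_hat
```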

To state the main result we define a quantity that measures the singular value weighted subspace gap of a matrix $S$:

$$\Gamma(S, \epsilon) = \sqrt{\sigma^{1}_{\max}/\zeta_1^2 + \sigma^{2}_{\max}/\zeta_2^2 + \ldots + \sigma^{l}_{\max}/\zeta_l^2},$$

where $S = U\Sigma V^{\top}$ and $\Sigma$ is arranged into blocks of singular values such that in each block $i$ we have $\sup_j\sigma^{i}_{j} - \sigma^{i}_{j+1} \leq \epsilon$, i.e.,

$$\Sigma = \begin{bmatrix}\Lambda_1&0&\ldots&0\\ 0&\Lambda_2&\ldots&0\\ \vdots&\vdots&\ddots&0\\ 0&0&\ldots&\Lambda_l\end{bmatrix}$$

where the $\Lambda_i$ are diagonal matrices, $\sigma^{i}_{j}$ is the $j^{th}$ singular value in block $\Lambda_i$, and $\sigma^{i}_{\min}, \sigma^{i}_{\max}$ are the minimum and maximum singular values of block $i$ respectively. Furthermore,

$$\zeta_i = \min(\sigma^{i-1}_{\min} - \sigma^{i}_{\max},\;\sigma^{i}_{\min} - \sigma^{i+1}_{\max})$$

for $1 < i < l$, with $\zeta_1 = \sigma^{1}_{\min} - \sigma^{2}_{\max}$ and $\zeta_l = \min(\sigma^{l-1}_{\min} - \sigma^{l}_{\max},\;\sigma^{l}_{\min})$. Informally, the $\zeta_i$ measure the singular value gaps between blocks. It should be noted that $l$, the number of separated blocks, is a function of $\epsilon$ itself. For example, if $\epsilon = 0$ then the number of blocks corresponds to the number of distinct singular values; on the other hand, if $\epsilon$ is very large then $l = 1$.
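The following small routine (a sketch; the block boundaries are formed greedily by scanning consecutive singular value gaps, which is one natural reading of the definition above) computes $\Gamma(S, \epsilon)$ from the singular values of $S$:

```python
import numpy as np

def gamma(S, eps):
    """Singular-value-weighted subspace gap Gamma(S, eps)."""
    s = np.linalg.svd(S, compute_uv=False)      # singular values in descending order
    # Group consecutive singular values whose successive gaps are <= eps
    blocks, cur = [], [s[0]]
    for a, b in zip(s[:-1], s[1:]):
        if a - b <= eps:
            cur.append(b)
        else:
            blocks.append(cur)
            cur = [b]
    blocks.append(cur)

    smax = [blk[0] for blk in blocks]            # max singular value per block
    smin = [blk[-1] for blk in blocks]           # min singular value per block
    L, total = len(blocks), 0.0
    for i in range(L):
        gaps = []
        if i > 0:
            gaps.append(smin[i - 1] - smax[i])   # gap to the previous block
        if i < L - 1:
            gaps.append(smin[i] - smax[i + 1])   # gap to the next block
        else:
            gaps.append(smin[i])                 # last block: zeta_l includes sigma_min^l
        total += smax[i] / min(gaps) ** 2
    return np.sqrt(total)
```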

Theorem 8.

Let $M$ be the true unknown model and let

$$\epsilon = 12c\beta R\left(\sqrt{\frac{m\hat{d} + p\hat{d}^2 + \hat{d}\log\frac{T}{\delta}}{T}}\right).$$

Then whenever $T\geq T_*(\delta)$, we have with probability at least $1-\delta$:

$$\max\left(||C_{\hat{d}} - \hat{C}_{\hat{d}}||_2,\;||B_{\hat{d}} - \hat{B}_{\hat{d}}||_2,\;||A_{\hat{d}} - \hat{A}_{\hat{d}}||_2\right) \leq \bar{\gamma}\epsilon\Gamma(\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}, 2\epsilon) + \bar{\gamma}\sup_{1\leq i\leq\hat{d}}\left(\sqrt{\hat{\sigma}^{i}_{\max}} - \sqrt{\hat{\sigma}^{i}_{\min}}\right) + \bar{\gamma}\cdot\frac{\epsilon\wedge\sqrt{\hat{\sigma}_{\hat{d}}\epsilon}}{\sqrt{\hat{\sigma}_{\hat{d}}}}$$

where $\sup_{1\leq i\leq\hat{d}}\sqrt{\hat{\sigma}^{i}_{\max}} - \sqrt{\hat{\sigma}^{i}_{\min}} \leq \frac{2}{\sqrt{\hat{\sigma}_{\hat{d}}}}\epsilon\hat{d}\wedge\sqrt{2\hat{d}\epsilon}$ and $\bar{\gamma} = \max(4\gamma, 8)$.

The proof of Theorem 8 follows directly from Theorem 9, where we show that

$$||\hat{\mathcal{H}}_{0,\hat{d},\hat{d}} - \mathcal{H}_{0,\infty,\infty}||_2 \leq \epsilon,$$

and Proposition 2. Theorem 8 provides an error bound between parameters (of model order $\hat{d}$) when the true order is unknown. The subspace gap measure, $\Gamma(\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}, 2\epsilon)$, is bounded even when $\epsilon = 0$. To see this, note that when $\epsilon = 0$, $\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}$ corresponds exactly to $\mathcal{H}_{0,\hat{d},\hat{d}}$. In that case, the number of blocks corresponds to the number of distinct singular values of $\mathcal{H}_{0,\hat{d},\hat{d}}$, and $\zeta_{n_i}$ then corresponds to the singular value gap between the unequal singular values. As a result $\Gamma(\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}, 2\epsilon) = \Delta < \infty$. The bound then decays as $\epsilon = O\left(\sqrt{\hat{d}^2/T}\right)$ for singular values $\hat{\sigma}_{\hat{d}} > \hat{d}\epsilon$, but for much smaller singular values the bound decays as $\sqrt{\epsilon} = O\left(\left(d^2/T\right)^{1/4}\right)$.

To shed more light on the behavior of our bounds, we consider the special case of known order. If $n$ is the model order, then we can set $\hat{d} = n$. If $\sigma_i = \sigma_i(\mathcal{H}_{0,\infty,\infty})$, then for large enough $T$ one can ensure that

$$\min_{\sigma_i\neq\sigma_{i+1}}(\sigma_i - \sigma_{i+1})/2 > \epsilon,$$

i.e., $\epsilon$ is less than the singular value gap and small enough that the spectrum of $\hat{\mathcal{H}}_{0,n,n}$ is very close to that of $\mathcal{H}_{0,\infty,\infty}$. Consequently $\hat{\sigma}_n \geq \sigma_n/2$ and we have that

$$\max\left(||C_n - \hat{C}_n||_2,\;||B_n - \hat{B}_n||_2,\;||A_n - \hat{A}_n||_2\right) \leq \bar{\gamma}\epsilon\Delta + \bar{\gamma}\epsilon/\sqrt{\sigma_n} = c\beta\bar{\gamma}R\left(\sqrt{\frac{pn^2 + n\log\frac{T}{\delta}}{\sigma_n T}}\right). \qquad (18)$$

This upper bound is (nearly) identical to the bounds obtained in Oymak and Ozay (2018) for the known order case. The major advantage of our result is that we do not require any information/assumption on the LTI system besides $\beta$. Nonparametric approaches to estimating $\beta$ have been studied in Tu et al. (2017).

5.4 Order Estimation Lower Bound

In Theorem 8 it is shown that whenever $T = \Omega\left(\frac{1}{\sigma_{\hat{d}}^2}\right)$ we can find an accurate $\hat{d}$-order approximation. Now we show that if $T = O\left(\frac{1}{\sigma_{\hat{d}}^2}\right)$ then there is always some non-zero probability with which we cannot recover the singular vector corresponding to $\sigma_{\hat{d}+1}$. We prove the following lower bound for model order estimation when the inputs $\{U_t\}_{t=1}^{T}$ are active and bounded, as defined below.

Definition 5.6.

An input sequence $\{U_t\}_{t=1}^{T}$ is said to be active if $U_t$ is allowed to depend on the past history $\{U_l, Y_l\}_{l=1}^{t-1}$. The input sequence is bounded if $\mathbb{E}[U_t^{\top}U_t] \leq 1$ for all $t$.

Active inputs allow for the case when input selection can be adaptive due to feedback.

Theorem 9.

Fix $\delta > 0$ and $\zeta\in(0, 1/2)$. Let $M_1, M_2$ be two LTI systems and let $\sigma_i^{(1)}, \sigma_i^{(2)}$ be their respective $i^{th}$ Hankel singular values. Let $\frac{\sigma^{(1)}_1}{\sigma^{(1)}_2} \leq \frac{2}{\zeta}$ and $\sigma^{(2)}_2 = 0$. Then whenever $T \leq \frac{\mathcal{C}R^2}{\zeta^2}\log\frac{2}{\delta}$ we have

$$\sup_{M\in\{M_1, M_2\}}\mathbb{P}_{Z_T\sim M}(\text{order}(\hat{M}(Z_T)) \neq \text{order}(M)) \geq \delta$$

Here $Z_T = \{U_t, Y_t\}_{t=1}^{T}\sim M$ means that $M$ generates the $T$ data points $\{Y_t\}_{t=1}^{T}$ in response to active and bounded inputs $\{U_t\}_{t=1}^{T}$, and $\hat{M}(Z_T)$ is any estimator.

Proof 5.7.

The proof can be found in the appendix (Section 15) and involves using Fano's (or Birgé's) inequality to compute the minimax risk between the probability density functions generated by two different LTI systems:

A0\displaystyle A_{0} =[010000ζ00],A1=A0,B0=[00β/R],B1=[0β/Rβ/R],C0=[00βR],C1=C0.\displaystyle=\begin{bmatrix}0&1&0\\ 0&0&0\\ \zeta&0&0\end{bmatrix},A_{1}=A_{0},B_{0}=\begin{bmatrix}0\\ 0\\ \sqrt{\beta}/R\end{bmatrix},B_{1}=\begin{bmatrix}0\\ \sqrt{\beta}/R\\ \sqrt{\beta}/R\end{bmatrix},C_{0}=\begin{bmatrix}0&0&\sqrt{\beta}R\end{bmatrix},C_{1}=C_{0}. (19)

A0,A1A_{0},A_{1} are Schur stable whenever |ζ|<1|\zeta|<1.
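As a sanity check on this construction (our illustration, not part of the paper's algorithms; the values of \beta, R and \zeta below are arbitrary), the following Python sketch builds truncated Hankel matrices from the Markov parameters CA^{k}B of the two systems in Eq. (19) and prints their singular values: the first system has a rank-one Hankel matrix, while the smaller singular values of the second shrink with \zeta.

import numpy as np

def hankel_from_markov(C, A, B, d):
    # Truncated system Hankel matrix with block (i, j) equal to C A^(i+j) B, cf. Eq. (23)
    p, m = C.shape[0], B.shape[1]
    Apow = [np.linalg.matrix_power(A, k) for k in range(2 * d - 1)]
    H = np.zeros((d * p, d * m))
    for i in range(d):
        for j in range(d):
            H[i * p:(i + 1) * p, j * m:(j + 1) * m] = C @ Apow[i + j] @ B
    return H

beta, R, zeta = 1.0, 1.0, 0.1          # illustrative values; the theorem only needs |zeta| < 1
A0 = np.array([[0., 1., 0.], [0., 0., 0.], [zeta, 0., 0.]])
B0 = np.array([[0.], [0.], [np.sqrt(beta) / R]])
B1 = np.array([[0.], [np.sqrt(beta) / R], [np.sqrt(beta) / R]])
C0 = np.array([[0., 0., np.sqrt(beta) * R]])

for name, B in [("(C_0, A_0, B_0)", B0), ("(C_1, A_1, B_1)", B1)]:   # C_1 = C_0, A_1 = A_0
    s = np.linalg.svd(hankel_from_markov(C0, A0, B, d=5), compute_uv=False)
    print(name, "Hankel singular values:", np.round(s, 4))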

Theorem 9 shows that the time required to recover higher order models depends inversely on the condition number. Specifically, to correctly distinguish between an order 1 and an order 2 model we need T\geq\Omega(2/\zeta^{2}), where 2/\zeta upper bounds the condition number of the order-2 model. To compare this with our upper bound in Theorem 8 and Eq. (18), assume \Gamma(\hat{\mathcal{H}}_{0,\hat{d},\hat{d}},2\epsilon)\leq\Delta for all \epsilon\in[0,1] and \hat{d}\epsilon\leq\hat{\sigma}_{\hat{d}}. Then, since the parameter error, \mathcal{E}, is upper bounded as

cβΔR(md^+pd^2+d^log(Tδ)σd^T),\mathcal{E}\leq c\beta\Delta R\left(\sqrt{\frac{m\hat{d}+p\hat{d}^{2}+\hat{d}\log{\frac{T}{\delta}}}{\sigma_{\hat{d}}T}}\right),

we need

Tlog(Tδ)Ω(β2Δ2R2d^2σd^2)\frac{T}{\log{\frac{T}{\delta}}}\geq\Omega\left(\frac{\beta^{2}\Delta^{2}R^{2}\hat{d}^{2}}{\sigma_{\hat{d}}^{2}}\right)

to correctly identify the \hat{d}-order model. The ratio (\beta/\sigma_{\hat{d}}) is equal to the condition number of the Hankel matrix. In this sense, model selection followed by singular value thresholding is not too conservative in terms of R (the signal-to-noise ratio) and the conditioning of the Hankel matrix.

6 Experiments

The experiments in this paper are for the single trajectory case. A detailed analysis for system identification from multiple trajectories can be found in Tu et al. (2017). Suppose that the LTI system generating data, MM, has transfer function given by

G(z)=\alpha_{0}+\sum_{l=1}^{149}\alpha_{l}\rho^{l}z^{-l},\quad\rho<1 (20)

where \alpha_{i}\sim\mathcal{N}(0,1). M is a finite dimensional LTI system of order 150 with parameters M=(C\in\mathbb{R}^{1\times 150},A\in\mathbb{R}^{150\times 150},B\in\mathbb{R}^{150\times 1}). For these illustrations, we assume a balanced system and choose R=1,\delta=0.05. We estimate \beta_{0.6}=15,\beta_{0.9}=40,\beta_{0.99}=140, pick U_{t}\sim\mathcal{N}(0,1) and \{w_{t},\eta_{t}\}\sim\{\mathcal{N}(0,1),\mathcal{N}(0,I)\} respectively.
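For concreteness, a minimal data-generation sketch for this setup (our illustration; the random seed and \rho=0.9 are arbitrary choices, and the process noise \eta_{t} entering the state-space realization is omitted for brevity):

import numpy as np

rng = np.random.default_rng(0)
rho, T = 0.9, 2000                       # one of the rho values above; T is illustrative
alpha = rng.standard_normal(150)         # alpha_0, ..., alpha_149 ~ N(0, 1)
g = alpha * rho ** np.arange(150)        # impulse response coefficients of G(z) in Eq. (20)
u = rng.standard_normal(T)               # U_t ~ N(0, 1)
w = rng.standard_normal(T)               # output noise w_t ~ N(0, 1)
# y_t = sum_l g_l u_{t-l} + w_t  (FIR simulation of G(z))
y = np.convolve(u, g)[:T] + w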

Figure 1: Variation of the Hankel size \hat{d} with T for different values of \rho.

Fig. 1 shows how \hat{d} changes with the number of data points for different values of \rho. When \rho=0.6, i.e., small, \hat{d} does not grow large even as the number of data points increases; a small model order suffices to capture the system dynamics. On the other hand, when \rho=0.99, i.e., closer to instability, the required \hat{d} is much larger, indicating the need for a higher order model. Although \hat{d} implicitly captures the effect of the spectral radius, knowledge of \rho is not required for selecting \hat{d}.

In principle, our algorithm increases the Hankel size to the "appropriate" size as the data increases. We compare this to a deterministic growth policy d=\log(T) and the SSREGEST algorithm of Ljung et al. (2015). SSREGEST first learns a large model from data and then performs model reduction to obtain a final model. In contrast, we go to a reduced model directly by picking a small \hat{d}, which reduces sensitivity to noise.

Fig. 2 shows the model errors for the deterministic growth policy d=\log(T) and our algorithm. Although the difference is negligible when \rho=0.6 (small), our algorithm does better when \rho=0.99 due to its adaptive nature, i.e., \hat{d} responds faster for our algorithm.

Figure 2: Variation of ||M-\widehat{M}_{k}||_{\text{op}} for different values of \rho. Here k=\hat{d} for our algorithm and k=\log(T) for the deterministic policy; ||\cdot||_{\text{op}} is the Hankel norm.

Finally, for the case when ρ=0.9,β=40\rho=0.9,\beta=40, we show the model errors for SSREGEST and our algorithm as TT increases. Although asymptotically both algorithms perform the same, it is clear that for small TT our algorithm is more robust to the presence of noise.

T      | SSREGEST       | Our Algorithm
500    | 6.21 ± 1.35    | 13.37 ± 3.7
≈850   | 30.20 ± 7.55   | 11.25 ± 2.89
≈1200  | 26.80 ± 8.94   | 9.83 ± 2.60
1500   | 23.27 ± 10.65  | 9.17 ± 2.30
2000   | 26.38 ± 12.88  | 7.70 ± 1.60

7 Discussion

We propose a new approach to system identification when we observe only finite noisy data. Typically, the order of an LTI system is large and unknown, and a priori parametrizations may fail to yield accurate estimates of the underlying system. However, our results suggest that there always exists a lower order approximation of the original LTI system that can be learned with high probability. The central theme of our approach is to recover a good lower order approximation that can be accurately learned. Specifically, we show that identification of such approximations is closely related to the singular values of the system Hankel matrix. In fact, the time required to learn a \hat{d}–order approximation scales as T=\Omega(\frac{\beta^{2}}{\sigma_{\hat{d}}^{2}}), where \sigma_{\hat{d}} is the \hat{d}-th singular value of the system Hankel matrix. This means that system identification does not depend explicitly on the model order n, but rather on n through \sigma_{n}. As a result, in the presence of finite data it is preferable to learn only the "significant" (and perhaps much smaller) part of the system when n is very large and \sigma_{n}\ll 1. Algorithms 1 and 3 provide a guided mechanism for learning the parameters of such significant approximations, with optimal rules for hyperparameter selection given in Algorithm 2.

Future directions for our work include extending existing low–rank optimization-based identification techniques, such as Fazel et al. (2013); Grussler et al. (2018), which typically lack statistical guarantees. Since Hankel based operators occur quite naturally in general (not necessarily linear) dynamical systems, exploring whether our methods can be extended to the identification of such systems appears to be an exciting direction.

References

  • Abbasi-Yadkori et al. (2011) Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
  • Açıkmeşe et al. (2013) Behçet Açıkmeşe, John M Carson, and Lars Blackmore. Lossless convexification of nonconvex control bound and pointing constraints of the soft landing optimal control problem. IEEE Transactions on Control Systems Technology, 21(6):2104–2113, 2013.
  • Agarwal et al. (2018) Anish Agarwal, Muhammad Jehangir Amjad, Devavrat Shah, and Dennis Shen. Time series analysis via matrix estimation. arXiv preprint arXiv:1802.09064, 2018.
  • Allen-Zhu and Li (2016) Zeyuan Allen-Zhu and Yuanzhi Li. Lazysvd: Even faster svd decomposition yet without agonizing pain. In Advances in Neural Information Processing Systems, pages 974–982, 2016.
  • Bauer (2000) Dietmar Bauer. Order estimation for subspace methods. 2000.
  • Boucheron et al. (2013) Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford university press, 2013.
  • Campi and Weyer (2002) Marco C Campi and Erik Weyer. Finite sample properties of system identification methods. IEEE Transactions on Automatic Control, 47(8):1329–1334, 2002.
  • Dempster et al. (1977) Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the royal statistical society. Series B (methodological), pages 1–38, 1977.
  • Faradonbeh et al. (2017) Mohamad Kazem Shirani Faradonbeh, Ambuj Tewari, and George Michailidis. Finite time identification in unstable linear systems. arXiv preprint arXiv:1710.01852, 2017.
  • Fazel et al. (2013) Maryam Fazel, Ting Kei Pong, Defeng Sun, and Paul Tseng. Hankel matrix rank minimization with applications to system identification and realization. SIAM Journal on Matrix Analysis and Applications, 34(3):946–977, 2013.
  • Glover (1984) Keith Glover. All optimal hankel-norm approximations of linear multivariable systems and their ll_{\infty}-error bounds. International journal of control, 39(6):1115–1193, 1984.
  • Glover (1987) Keith Glover. Model reduction: a tutorial on hankel-norm methods and lower bounds on l2 errors. IFAC Proceedings Volumes, 20(5):293–298, 1987.
  • Goldenshluger (1998) Alexander Goldenshluger. Nonparametric estimation of transfer functions: rates of convergence and adaptation. IEEE Transactions on Information Theory, 44(2):644–658, 1998.
  • Grussler et al. (2018) Christian Grussler, Anders Rantzer, and Pontus Giselsson. Low-rank optimization with convex constraints. IEEE Transactions on Automatic Control, 2018.
  • Hardt et al. (2016) Moritz Hardt, Tengyu Ma, and Benjamin Recht. Gradient descent learns linear dynamical systems. arXiv preprint arXiv:1609.05191, 2016.
  • Hazan et al. (2018) Elad Hazan, Holden Lee, Karan Singh, Cyril Zhang, and Yi Zhang. Spectral filtering for general linear dynamical systems. arXiv preprint arXiv:1802.03981, 2018.
  • Ho and Kalman (1966) BL Ho and Rudolph E Kalman. Effective construction of linear state-variable models from input/output functions. at-Automatisierungstechnik, 14(1-12):545–548, 1966.
  • Kung and Lin (1981) S Kung and D Lin. Optimal hankel-norm model reductions: Multivariable systems. IEEE Transactions on Automatic Control, 26(4):832–852, 1981.
  • Ljung (1987) Lennart Ljung. System identification: theory for the user. Prentice-hall, 1987.
  • Ljung et al. (2015) Lennart Ljung, Rajiv Singh, and Tianshi Chen. Regularization features in the system identification toolbox. IFAC-PapersOnLine, 48(28):745–750, 2015.
  • Meckes et al. (2007) Mark Meckes et al. On the spectral norm of a random toeplitz matrix. Electronic Communications in Probability, 12:315–325, 2007.
  • Oymak and Ozay (2018) Samet Oymak and Necmiye Ozay. Non-asymptotic identification of lti systems from a single trajectory. arXiv preprint arXiv:1806.05722, 2018.
  • Peña et al. (2008) Victor H Peña, Tze Leung Lai, and Qi-Man Shao. Self-normalized processes: Limit theory and Statistical Applications. Springer Science & Business Media, 2008.
  • Sarkar and Rakhlin (2018) Tuhin Sarkar and Alexander Rakhlin. How fast can linear dynamical systems be learned? arXiv preprint arXiv:1812.0125, 2018.
  • Shah et al. (2012) Parikshit Shah, Badri Narayan Bhaskar, Gongguo Tang, and Benjamin Recht. Linear system identification via atomic norm regularization. In 2012 IEEE 51st IEEE Conference on Decision and Control (CDC), pages 6265–6270. IEEE, 2012.
  • Shibata (1976) Ritei Shibata. Selection of the order of an autoregressive model by akaike’s information criterion. Biometrika, 63(1):117–126, 1976.
  • Simchowitz et al. (2018) Max Simchowitz, Horia Mania, Stephen Tu, Michael I Jordan, and Benjamin Recht. Learning without mixing: Towards a sharp analysis of linear system identification. arXiv preprint arXiv:1802.08334, 2018.
  • Tu et al. (2017) Stephen Tu, Ross Boczar, Andrew Packard, and Benjamin Recht. Non-asymptotic analysis of robust control from coarse-grained identification. arXiv preprint arXiv:1707.04791, 2017.
  • Tu et al. (2018a) Stephen Tu, Ross Boczar, and Benjamin Recht. On the approximation of toeplitz operators for nonparametric \mathcal{H}_{\infty}–norm estimation. In 2018 Annual American Control Conference (ACC), pages 1867–1872. IEEE, 2018a.
  • Tu et al. (2018b) Stephen Tu, Ross Boczar, and Benjamin Recht. Minimax lower bounds for \mathcal{H}_{\infty}-norm estimation. arXiv preprint arXiv:1809.10855, 2018b.
  • Tyrtyshnikov (2012) Eugene E Tyrtyshnikov. A brief introduction to numerical analysis. Springer Science & Business Media, 2012.
  • van de Geer and Lederer (2013) Sara van de Geer and Johannes Lederer. The bernstein–orlicz norm and deviation inequalities. Probability theory and related fields, 157(1-2):225–250, 2013.
  • Van Der Vaart and Wellner (1996) Aad W Van Der Vaart and Jon A Wellner. Weak convergence. In Weak convergence and empirical processes, pages 16–28. Springer, 1996.
  • Van Overschee and De Moor (2012) Peter Van Overschee and BL De Moor. Subspace identification for linear systems: Theory—Implementation—Applications. Springer Science & Business Media, 2012.
  • Venkatesh and Dahleh (2001) Saligrama R Venkatesh and Munther A Dahleh. On system identification of complex systems from finite data. IEEE Transactions on Automatic Control, 46(2):235–257, 2001.
  • Vershynin (2010) Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.
  • Wedin (1972) Per-Åke Wedin. Perturbation bounds in connection with singular value decomposition. BIT Numerical Mathematics, 12(1):99–111, 1972.
  • Zhou et al. (1996) K Zhou, JC Doyle, and K Glover. Robust and optimal control, 1996.

8 Preliminaries

Theorem 1 (Theorem 5.39 Vershynin (2010)).

If E is a T\times md matrix with independent sub–Gaussian isotropic rows with sub–Gaussian parameter 1, then with probability at least 1-2e^{-ct^{2}} we have

TCmdtσmin(E)T+Cmd+t\sqrt{T}-C\sqrt{md}-t\leq\sigma_{\min}(E)\leq\sqrt{T}+C\sqrt{md}+t
Proposition 2 (Vershynin (2010)).

We have for any ϵ<1\epsilon<1 and any w𝒮d1w\in\mathcal{S}^{d-1} that

\mathbb{P}(||M||>z)\leq(1+2/\epsilon)^{d}\,\mathbb{P}\left(||Mw||>(1-\epsilon)z\right)
Theorem 3 (Theorem 1 Meckes et al. (2007)).

Suppose \{X_{i}\in\mathbb{R}^{m}\}_{i=1}^{\infty} are independent, \mathbb{E}[X_{j}]=\textbf{0} for all j, and the X_{ij} are independent \mathsf{subg}(1) random variables. Then \mathbb{P}(||T_{d}||\geq cm\sqrt{d\log{2d}}+t)\leq e^{-t^{2}/d} where

T_{d}=\begin{bmatrix}X_{0}&X_{1}&\ldots&X_{d-1}\\ X_{1}&X_{0}&\ldots&X_{d-2}\\ \vdots&\ddots&\ddots&\vdots\\ X_{d-1}&\ldots&\ldots&X_{0}\end{bmatrix}
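A quick numerical illustration of the scaling in Theorem 3 for the scalar case m=1 (our sketch; the dimension is arbitrary):

import numpy as np

rng = np.random.default_rng(1)
d = 400
x = rng.standard_normal(d)                                               # X_0, ..., X_{d-1}, scalar case
Td = np.array([[x[abs(i - j)] for j in range(d)] for i in range(d)])     # symmetric Toeplitz matrix
ratio = np.linalg.norm(Td, 2) / np.sqrt(d * np.log(2 * d))
print(ratio)   # stays O(1) as d grows, consistent with the c m sqrt(d log 2d) bound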
Theorem 4 (Hanson–Wright Inequality).

Let X=[X_{1},X_{2},\ldots,X_{n}]\in\mathbb{R}^{n} be a sub–Gaussian vector with independent entries and \sup_{i}\left\|X_{i}\right\|_{\psi_{2}}\leq K. Then for any B\in\mathbb{R}^{n\times n} and t\geq 0

\mathbb{P}\left(|XBX^{\top}-\mathbb{E}[XBX^{\top}]|\geq t\right)\leq 2\exp\left(\max\left(\frac{-ct}{K^{2}\|B\|},\frac{-ct^{2}}{K^{4}\|B\|^{2}_{HS}}\right)\right).
Proposition 5 (Lecture 2 Tyrtyshnikov (2012)).

Suppose that LL is the lower triangular part of a matrix Ad×dA\in\mathbb{R}^{d\times d}. Then

L2log2(2d)A2.\left\|L\right\|_{2}\leq\log_{2}{(2d)}\left\|A\right\|_{2}.
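A quick numerical check of Proposition 5 (our illustration, with an arbitrary random test matrix):

import numpy as np

rng = np.random.default_rng(2)
d = 200
A = rng.standard_normal((d, d))
L = np.tril(A)                                                           # lower triangular part of A
print(np.linalg.norm(L, 2) <= np.log2(2 * d) * np.linalg.norm(A, 2))     # True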

Let ψ\psi be a nondecreasing, convex function with ψ(0)=0\psi(0)=0 and XX a random variable. Then the Orlicz norm Xψ||X||_{\psi} is defined as

Xψ=inf{α>0:𝔼[ψ(|X|/α)]1}.||X||_{\psi}=\inf\Big{\{}\alpha>0:\mathbb{E}[\psi(|X|/\alpha)]\leq 1\Big{\}}.

Let (B,d) be an arbitrary semi–metric space. Denote by N(\epsilon,d) the minimal number of balls of radius \epsilon needed to cover B.

Theorem 6 (Corollary 2.2.5 in Van Der Vaart and Wellner (1996)).

The constant K can be chosen such that

sups,t|XsXt|ψK0diam(B)ψ1(N(ϵ/2,d))𝑑ϵ||\sup_{s,t}|X_{s}-X_{t}|||_{\psi}\leq K\int_{0}^{\text{diam}(B)}\psi^{-1}(N(\epsilon/2,d))d\epsilon

where diam(B)\text{diam}(B) is the diameter of BB and d(s,t)=XsXtψd(s,t)=||X_{s}-X_{t}||_{\psi}.

Theorem 7 (Theorem 1 in Abbasi-Yadkori et al. (2011)).

Let {𝓕t}t=0\{\bm{\mathcal{F}}_{t}\}_{t=0}^{\infty} be a filtration. Let {ηtm,Xtd}t=1\{\eta_{t}\in\mathbb{R}^{m},X_{t}\in\mathbb{R}^{d}\}_{t=1}^{\infty} be stochastic processes such that ηt,Xt\eta_{t},X_{t} are 𝓕t\bm{\mathcal{F}}_{t} measurable and ηt\eta_{t} is 𝓕t1\bm{\mathcal{F}}_{t-1}-conditionally 𝗌𝗎𝖻𝗀(L2)\mathsf{subg}(L^{2}) for some L>0L>0. For any t0t\geq 0, define Vt=s=1tXsXs,St=s=1tXsηs+1V_{t}=\sum_{s=1}^{t}X_{s}X_{s}^{\prime},S_{t}=\sum_{s=1}^{t}X_{s}\eta_{s+1}^{\top}. Then for any δ>0,V0\delta>0,V\succ 0 and all t0t\geq 0 we have with probability at least 1δ1-\delta

St(V+Vt)1St2L2(log(1δ)+log(det(V+Vt)det(V))+m).S_{t}^{\top}(V+V_{t})^{-1}S_{t}\leq 2L^{2}\left(\log{\frac{1}{\delta}}+\log{\frac{\text{det}(V+V_{t})}{\text{det}(V)}}+m\right).
Proof 8.1.

Define M=(V+V_{t})^{-1/2}S_{t}. Using Proposition 2 with \epsilon=1/2,

\mathbb{P}(||M||_{2}>z)\leq 5^{m}\mathbb{P}(||Mw||_{2}>z/2)

for w𝒮m1w\in\mathcal{S}^{m-1}. Then we can use Theorem 1 in Abbasi-Yadkori et al. (2011), and with probability at least 1δ1-\delta we have

Mw222L2(log(1δ)+log(det(V+Vt)det(V))).||Mw||^{2}_{2}\leq 2L^{2}\left(\log{\frac{1}{\delta}}+\log{\frac{\text{det}(V+V_{t})}{\text{det}(V)}}\right).

Replacing \delta\rightarrow 5^{-m}\delta, we have with probability at least 1-5^{-m}\delta

||Mw||_{2}\leq\sqrt{2}L\sqrt{\left(m\log{(5)}+\log{\frac{1}{\delta}}+\log{\frac{\text{det}(V+V_{t})}{\text{det}(V)}}\right)}.

Taking a union bound over the 5^{m} points of the net and using ||M||_{2}\leq 2\sup_{w}||Mw||_{2}, we have with probability at least 1-\delta

||M||_{2}\leq 2\sqrt{2\log{(5)}}\,L\sqrt{\left(m+\log{\frac{1}{\delta}}+\log{\frac{\text{det}(V+V_{t})}{\text{det}(V)}}\right)}.
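A small simulation of the bound in Theorem 7 with i.i.d. Gaussian covariates and noise (our illustration; the dimensions, horizon and \delta are arbitrary), comparing the operator norm of S_{t}^{\top}(V+V_{t})^{-1}S_{t} against the right-hand side:

import numpy as np

rng = np.random.default_rng(3)
d, m, T, L, delta = 5, 2, 5000, 1.0, 0.05
V = np.eye(d)
Vt, St = np.zeros((d, d)), np.zeros((d, m))
for _ in range(T):
    x = rng.standard_normal(d)               # X_s, measurable w.r.t. the past
    eta = L * rng.standard_normal(m)         # eta_{s+1}, conditionally subg(L^2)
    Vt += np.outer(x, x)
    St += np.outer(x, eta)
lhs = np.linalg.norm(St.T @ np.linalg.inv(V + Vt) @ St, 2)
rhs = 2 * L**2 * (np.log(1 / delta) + np.log(np.linalg.det(V + Vt) / np.linalg.det(V)) + m)
print(lhs <= rhs, lhs, rhs)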
Lemma 8.

For any M=(C,A,B)M=(C,A,B), we have that

T×mTv=σ(k=1d𝒯d+k,T𝒯d+k,T)||\mathcal{B}^{v}_{T\times mT}||=\sqrt{\sigma\Big{(}\sum_{k=1}^{d}\mathcal{T}_{d+k,T}^{\top}\mathcal{T}_{d+k,T}\Big{)}}

Here T×mTv\mathcal{B}^{v}_{T\times mT} is defined as follows: β=d,d,Tv=[β1,β2,,βT]\beta=\mathcal{H}_{d,d,T}^{\top}v=[\beta_{1}^{\top},\beta_{2}^{\top},\ldots,\beta_{T}^{\top}]^{\top}.

T×mTv=[β100β2β10βTβT1β1]\displaystyle\mathcal{B}^{v}_{T\times mT}=\begin{bmatrix}\beta_{1}^{\top}&0&0&\ldots\\ \beta^{\top}_{2}&\beta^{\top}_{1}&0&\ldots\\ \vdots&\vdots&\ddots&\vdots\\ \beta_{T}^{\top}&\beta_{T-1}^{\top}&\ldots&\beta^{\top}_{1}\end{bmatrix}

and v2=1||v||_{2}=1.

Proof 8.2.

For the matrix v\mathcal{B}^{v} we have

vu\displaystyle\mathcal{B}^{v}u =[β1u1β1u2+β2u1β1u3+β2u2+β3u1β1uT+β2uT1++βTu1]=[v[CAd+1Bu1CAd+2Bu1CA2dBu1]v[CAd+2Bu1+CAd+1Bu2CAd+3Bu1+CAd+2Bu2CA2d+1Bu1+CA2dBu2]v[CAT+dBu1++CAd+1BuTCAT+d+2Bu1++CAd+2BuTCAT+2d1Bu1++CA2dBuT]]\displaystyle=\begin{bmatrix}\beta_{1}^{\top}u_{1}\\ \beta_{1}^{\top}u_{2}+\beta_{2}^{\top}u_{1}\\ \beta_{1}^{\top}u_{3}+\beta_{2}^{\top}u_{2}+\beta_{3}^{\top}u_{1}\\ \vdots\\ \beta_{1}^{\top}u_{T}+\beta_{2}^{\top}u_{T-1}+\ldots+\beta_{T}^{\top}u_{1}\end{bmatrix}=\begin{bmatrix}v^{\top}\begin{bmatrix}CA^{d+1}Bu_{1}\\ CA^{d+2}Bu_{1}\\ \vdots\\ CA^{2d}Bu_{1}\end{bmatrix}\\ v^{\top}\begin{bmatrix}CA^{d+2}Bu_{1}+CA^{d+1}Bu_{2}\\ CA^{d+3}Bu_{1}+CA^{d+2}Bu_{2}\\ \vdots\\ CA^{2d+1}Bu_{1}+CA^{2d}Bu_{2}\end{bmatrix}\\ \vdots\\ v^{\top}\begin{bmatrix}CA^{T+d}Bu_{1}+\ldots+CA^{d+1}Bu_{T}\\ CA^{T+d+2}Bu_{1}+\ldots+CA^{d+2}Bu_{T}\\ \vdots\\ CA^{T+2d-1}Bu_{1}+\ldots+CA^{2d}Bu_{T}\end{bmatrix}\end{bmatrix}
=𝒱[[CAd+1Bu1CAd+2Bu1CA2dBu1][CAd+2Bu1+CAd+1Bu2CAd+3Bu1+CAd+2Bu2CA2d+1Bu1+CA2dBu2][CAT+dBu1++CAd+1BuTCAT+d+2Bu1++CAd+2BuTCAT+2d1Bu1++CA2dBuT]]\displaystyle=\mathcal{V}\begin{bmatrix}\begin{bmatrix}CA^{d+1}Bu_{1}\\ CA^{d+2}Bu_{1}\\ \vdots\\ CA^{2d}Bu_{1}\end{bmatrix}\\ \begin{bmatrix}CA^{d+2}Bu_{1}+CA^{d+1}Bu_{2}\\ CA^{d+3}Bu_{1}+CA^{d+2}Bu_{2}\\ \vdots\\ CA^{2d+1}Bu_{1}+CA^{2d}Bu_{2}\end{bmatrix}\\ \vdots\\ \begin{bmatrix}CA^{T+d}Bu_{1}+\ldots+CA^{d+1}Bu_{T}\\ CA^{T+d+2}Bu_{1}+\ldots+CA^{d+2}Bu_{T}\\ \vdots\\ CA^{T+2d-1}Bu_{1}+\ldots+CA^{2d}Bu_{T}\end{bmatrix}\end{bmatrix}
=𝒱[CAd+1B000CAd+2B000CA2dB000CAd+2BCAd+1B00CAd+3BCAd+2B00CA2d+1BCA2dB00CAT+d1BCAT+dBCAT+d1BCAd+1BCAT+d+2BCAT+d+1BCAT+dBCAd+2BCAT+2d1BCAT+2d1BCAT+2d2BCA2dB]=S[u1u2uT]\displaystyle=\mathcal{V}\underbrace{\begin{bmatrix}CA^{d+1}B&0&0&\ldots&0\\ CA^{d+2}B&0&0&\ldots&0\\ \vdots&\vdots&\vdots&\vdots&\vdots\\ CA^{2d}B&0&0&\ldots&0\\ CA^{d+2}B&CA^{d+1}B&0&\ldots&0\\ CA^{d+3}B&CA^{d+2}B&0&\ldots&0\\ \vdots&\vdots&\vdots&\vdots&\vdots\\ CA^{2d+1}B&CA^{2d}B&0&\ldots&0\\ \vdots&\vdots&\vdots&\vdots&\vdots\\ CA^{T+d-1}B&CA^{T+d}B&CA^{T+d-1}B&\ldots&CA^{d+1}B\\ CA^{T+d+2}B&CA^{T+d+1}B&CA^{T+d}B&\ldots&CA^{d+2}B\\ \vdots&\vdots&\vdots&\vdots&\vdots\\ CA^{T+2d-1}B&CA^{T+2d-1}B&CA^{T+2d-2}B&\ldots&CA^{2d}B\\ \end{bmatrix}}_{=S}\begin{bmatrix}u_{1}\\ u_{2}\\ \vdots\\ u_{T}\end{bmatrix}

It is clear that ||\mathcal{V}||_{2}=||u||_{2}=1 and, for any matrix S, ||S|| does not change if we interchange the rows of S. Then we have

S2\displaystyle||S||_{2} =σ([CAd+1B000CAd+2BCAd+1B00CAT+d+1BCAT+dBCAT+d1BCAd+1BCAd+2B000CAd+3BCAd+2B00CAT+d+2BCAT+d+1BCAT+dBCAd+2BCA2dB000CA2d+1BCA2dB00CAT+2d1BCAT+2d1BCAT+2d2BCA2dB])\displaystyle=\sigma\left(\begin{bmatrix}CA^{d+1}B&0&0&\ldots&0\\ CA^{d+2}B&CA^{d+1}B&0&\ldots&0\\ \vdots&\vdots&\vdots&\vdots&\vdots\\ CA^{T+d+1}B&CA^{T+d}B&CA^{T+d-1}B&\ldots&CA^{d+1}B\\ CA^{d+2}B&0&0&\ldots&0\\ CA^{d+3}B&CA^{d+2}B&0&\ldots&0\\ \vdots&\vdots&\vdots&\vdots&\vdots\\ CA^{T+d+2}B&CA^{T+d+1}B&CA^{T+d}B&\ldots&CA^{d+2}B\\ \vdots&\vdots&\vdots&\vdots&\vdots\\ CA^{2d}B&0&0&\ldots&0\\ CA^{2d+1}B&CA^{2d}B&0&\ldots&0\\ \vdots&\vdots&\vdots&\vdots&\vdots\\ CA^{T+2d-1}B&CA^{T+2d-1}B&CA^{T+2d-2}B&\ldots&CA^{2d}B\\ \end{bmatrix}\right)
=σ([𝒯d+1,T𝒯d+2,T𝒯2d,T])=σ(k=1d𝒯d+k,T𝒯d+k,T)\displaystyle=\sigma\left(\begin{bmatrix}\mathcal{T}_{d+1,T}\\ \mathcal{T}_{d+2,T}\\ \vdots\\ \mathcal{T}_{2d,T}\end{bmatrix}\right)=\sqrt{\sigma\Big{(}\sum_{k=1}^{d}\mathcal{T}_{d+k,T}^{\top}\mathcal{T}_{d+k,T}\Big{)}}
Proposition 9 (Lemma 4.1 Simchowitz et al. (2018)).

Let SS be an invertible matrix and κ(S)\kappa(S) be its condition number. Then for a 14κ\frac{1}{4\kappa}–net of 𝒮d1\mathcal{S}^{d-1} and an arbitrary matrix AA, we have

SA22supv𝒩14κvA2vS12||SA||_{2}\leq 2\sup_{v\in{\mathcal{N}}_{\frac{1}{4\kappa}}}\frac{||v^{\prime}A||_{2}}{||v^{\prime}S^{-1}||_{2}}
Proof 8.3.

Let w be such that ||SA||_{2}=\frac{||w^{\prime}A||_{2}}{||w^{\prime}S^{-1}||_{2}} and let v\in{\mathcal{N}}_{\frac{1}{4\kappa}} be a net point with ||v-w||_{2}\leq\frac{1}{4\kappa}. Then we have

SA2vA2vS12\displaystyle||SA||_{2}-\frac{||v^{\prime}A||_{2}}{||v^{\prime}S^{-1}||_{2}} |wA2wS12vA2vS12|\displaystyle\leq\Big{|}\frac{||w^{\prime}A||_{2}}{||w^{\prime}S^{-1}||_{2}}-\frac{||v^{\prime}A||_{2}}{||v^{\prime}S^{-1}||_{2}}\Big{|}
=|wA2wS12vA2wS12+vA2wS12vA2vS12|\displaystyle=\Big{|}\frac{||w^{\prime}A||_{2}}{||w^{\prime}S^{-1}||_{2}}-\frac{||v^{\prime}A||_{2}}{||w^{\prime}S^{-1}||_{2}}+\frac{||v^{\prime}A||_{2}}{||w^{\prime}S^{-1}||_{2}}-\frac{||v^{\prime}A||_{2}}{||v^{\prime}S^{-1}||_{2}}\Big{|}
SA214κS12wS12+SA2|vS12wS121|\displaystyle\leq||SA||_{2}\frac{\frac{1}{4\kappa}||S^{-1}||_{2}}{||w^{\prime}S^{-1}||_{2}}+||SA||_{2}\Big{|}\frac{||v^{\prime}S^{-1}||_{2}}{||w^{\prime}S^{-1}||_{2}}-1\Big{|}
SA22\displaystyle\leq\frac{||SA||_{2}}{2}

9 Control and Systems Theory Preliminaries

9.1 Sylvester Matrix Equation

Define the discrete time Sylvester operator SA,B:n×nn×nS_{A,B}:\mathbb{R}^{n\times n}\rightarrow\mathbb{R}^{n\times n}

S_{A,B}(X)=X-AXB (21)

Then we have the following properties for S_{A,B}(\cdot).

Proposition 1.

Let \lambda_{i},\mu_{j} be the eigenvalues of A and B respectively. Then S_{A,B} is invertible if and only if, for all i,j,

λiμj1\lambda_{i}\mu_{j}\neq 1

Define the discrete time Lyapunov operator for a matrix A as \mathcal{L}_{A,A^{\prime}}(\cdot)=S^{-1}_{A,A^{\prime}}(\cdot). It follows from Proposition 1 that S_{A,A^{\prime}}(\cdot) is an invertible operator whenever \rho(A)<1.

Now let Q0Q\succeq 0 then

\displaystyle S_{A,A^{\prime}}(X) =Q
\displaystyle\implies X =AXA^{\prime}+Q
\displaystyle\implies X =\sum_{k=0}^{\infty}A^{k}QA^{\prime k} (22)

Eq. (22) follows directly by substitution and, by Proposition 1, the solution is unique if \rho(A)<1. Further, let Q_{1}\succeq Q_{2}\succeq 0 and let X_{1},X_{2} be the corresponding solutions of the Lyapunov equation; then it follows from Eq. (22) that

X1,X2\displaystyle X_{1},X_{2} 0\displaystyle\succeq 0
X1\displaystyle X_{1} X2\displaystyle\succeq X_{2}
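A quick numerical check of the series representation in Eq. (22) against a direct Lyapunov solve (our sketch; the random stable A is illustrative):

import numpy as np
from scipy.linalg import solve_discrete_lyapunov

rng = np.random.default_rng(4)
n = 4
A = rng.standard_normal((n, n))
A *= 0.9 / np.abs(np.linalg.eigvals(A)).max()            # enforce rho(A) < 1
Q = np.eye(n)

# truncated series solution X = sum_k A^k Q (A')^k from Eq. (22)
X_series = sum(np.linalg.matrix_power(A, k) @ Q @ np.linalg.matrix_power(A.T, k)
               for k in range(200))
X_direct = solve_discrete_lyapunov(A, Q)                 # solves X = A X A' + Q
print(np.allclose(X_series, X_direct))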

9.2 Properties of System Hankel matrix

  • Rank of system Hankel matrix: For M=(C,A,B)nM=(C,A,B)\in\mathcal{M}_{n}, the system Hankel matrix, 0,,(M)\mathcal{H}_{0,\infty,\infty}(M), can be decomposed as follows:

    0,,(M)=[CCACAd]=𝒪[BABAdB]=\mathcal{H}_{0,\infty,\infty}(M)=\underbrace{\begin{bmatrix}C\\ CA\\ \vdots\\ CA^{d}\\ \vdots\end{bmatrix}}_{=\mathcal{O}}\underbrace{\begin{bmatrix}B&AB&\ldots&A^{d}B&\ldots\end{bmatrix}}_{=\mathcal{R}} (23)

It follows from definition that \text{rank}(\mathcal{O}),\text{rank}(\mathcal{R})\leq n and as a result \text{rank}(\mathcal{O}\mathcal{R})\leq n. The system Hankel matrix rank, or \text{rank}(\mathcal{O}\mathcal{R}), which is also the model order (or simply order), captures the complexity of M. If \text{SVD}(\mathcal{H}_{0,\infty,\infty})=U\Sigma V^{\top}, then \mathcal{O}=U\Sigma^{1/2}S,\mathcal{R}=S^{-1}\Sigma^{1/2}V^{\top}. By noting that

    CAlS=CS(S1AS)l,S1AlB=(S1AS)lS1BCA^{l}S=CS(S^{-1}AS)^{l},S^{-1}A^{l}B=(S^{-1}AS)^{l}S^{-1}B

    we have obtained a way of recovering the system parameters (up to similarity transformations). Furthermore, 0,,\mathcal{H}_{0,\infty,\infty} uniquely (up to similarity transformation) recovers (C,A,B)(C,A,B).

  • Mapping Past to Future: 0,,\mathcal{H}_{0,\infty,\infty} can also be viewed as an operator that maps “past” inputs to “future” outputs. In Eq. (1) assume that {ηt,wt}=0\{\eta_{t},w_{t}\}=0. Then consider the following class of inputs UtU_{t} such that Ut=0U_{t}=0 for all tTt\geq T but UtU_{t} may not be zero for t<Tt<T. Here TT is chosen arbitrarily. Then

    [YTYT+1YT+2]Future=0,,[UT1UT2UT3]Past\underbrace{\begin{bmatrix}Y_{T}\\ Y_{T+1}\\ Y_{T+2}\\ \vdots\end{bmatrix}}_{\text{Future}}=\mathcal{H}_{0,\infty,\infty}\underbrace{\begin{bmatrix}U_{T-1}\\ U_{T-2}\\ U_{T-3}\\ \vdots\end{bmatrix}}_{\text{Past}} (24)
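A small noise-free simulation illustrating Eq. (24) with finitely many past inputs (our sketch; the system and sizes are arbitrary):

import numpy as np

rng = np.random.default_rng(5)
n, d, Tp = 3, 8, 8                                        # state dim, future window, number of past inputs
A = rng.standard_normal((n, n)); A *= 0.7 / np.abs(np.linalg.eigvals(A)).max()
B = rng.standard_normal((n, 1)); C = rng.standard_normal((1, n))

u = rng.standard_normal(Tp)                               # U_0, ..., U_{Tp-1}; U_t = 0 for t >= Tp
x, ys = np.zeros(n), []
for t in range(Tp + d):                                   # noise-free simulation of Eq. (1)
    ys.append((C @ x).item())
    x = A @ x + (B[:, 0] * u[t] if t < Tp else 0.0)
future = np.array(ys[Tp:])                                # Y_Tp, ..., Y_{Tp+d-1}

H = np.block([[C @ np.linalg.matrix_power(A, i + j) @ B for j in range(Tp)] for i in range(d)])
past = u[::-1]                                            # U_{Tp-1}, U_{Tp-2}, ..., U_0
print(np.allclose(future, H @ past))                      # Eq. (24), truncated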

9.3 Model Reduction

Consider an LTI system M=(C,A,B) of order n with doubly infinite system Hankel matrix \mathcal{H}_{0,\infty,\infty}. We are interested in finding the best order-k lower dimensional approximation of M, i.e., for every k<n we would like to find M_{k} of model order k such that ||M-M_{k}||_{\infty} is minimized. Systems theory gives us a class of model approximations, known as balanced truncated approximations, that provide strong theoretical guarantees (see Glover (1984) and Section 21.6 in Zhou et al. (1996)). We summarize some of the basics of model reduction below. Assume that M has distinct Hankel singular values.

Recall that a model M=(C,A,B)M=(C,A,B) is equivalent to M~=(CS,S1AS,S1B)\tilde{M}=(CS,S^{-1}AS,S^{-1}B) with respect to its transfer function. Define

Q\displaystyle Q =AQA+CC\displaystyle=A^{\top}QA+C^{\top}C
P\displaystyle P =APA+BB\displaystyle=APA^{\top}+BB^{\top}

For two positive definite matrices P,Q it is a known fact that there exists a transformation S such that S^{\top}QS=S^{-1}PS^{-1\top}=\Sigma, where \Sigma is diagonal with decreasing diagonal elements. Further, \sigma_{i} is the i^{th} singular value of \mathcal{H}_{0,\infty,\infty}. Then let \tilde{A}=S^{-1}AS,\tilde{C}=CS,\tilde{B}=S^{-1}B. Clearly \widetilde{M}=(\tilde{A},\tilde{B},\tilde{C}) is equivalent to M and we have

Σ\displaystyle\Sigma =A~ΣA~+C~C~\displaystyle=\tilde{A}^{\top}\Sigma\tilde{A}+\tilde{C}^{\top}\tilde{C}
Σ\displaystyle\Sigma =A~ΣA~+B~B~\displaystyle=\tilde{A}\Sigma\tilde{A}^{\top}+\tilde{B}\tilde{B}^{\top} (25)

Here C~,A~,B~\tilde{C},\tilde{A},\tilde{B} is a balanced realization of MM.

Proposition 2.

Let 0,,=UΣV\mathcal{H}_{0,\infty,\infty}=U\Sigma V^{\top}. Here Σ0n×n\Sigma\succeq 0\in\mathbb{R}^{n\times n}. Then

C~\displaystyle\tilde{C} =[UΣ1/2]1:p,:\displaystyle=[U\Sigma^{1/2}]_{1:p,:}
A~\displaystyle\tilde{A} =Σ1/2U[UΣ1/2]p+1:,:\displaystyle=\Sigma^{-1/2}U^{\top}[U\Sigma^{1/2}]_{p+1:,:}
B~\displaystyle\tilde{B} =[Σ1/2V]:,1:m\displaystyle=[\Sigma^{1/2}V^{\top}]_{:,1:m}

The triple (C~,A~,B~)(\tilde{C},\tilde{A},\tilde{B}) is a balanced realization of MM. For any matrix LL, L:,m:nL_{:,m:n} (or Lm:n,:L_{m:n,:}) denotes the submatrix with only columns (or rows) mm through nn.

Proof 9.1.

Let the SVD of \mathcal{H}_{0,\infty,\infty} be U\Sigma V^{\top}. Then M can be constructed as follows: U\Sigma^{1/2},\Sigma^{1/2}V^{\top} are of the form

UΣ1/2=[CSCASCA2S],Σ1/2V=[S1BS1ABS1A2B]\displaystyle U\Sigma^{1/2}=\begin{bmatrix}CS\\ CAS\\ CA^{2}S\\ \vdots\\ \end{bmatrix},\Sigma^{1/2}V^{\top}=\begin{bmatrix}S^{-1}B&S^{-1}AB&S^{-1}A^{2}B\ldots\end{bmatrix}

where SS is the transformation which gives us Eq. (25). This follows because

Σ1/2UUΣ1/2\displaystyle\Sigma^{1/2}U^{\top}U\Sigma^{1/2} =k=0SAkCCAkS\displaystyle=\sum_{k=0}^{\infty}S^{\top}A^{k\top}C^{\top}CA^{k}S
=k=0SAkS1SCCSS1AkS\displaystyle=\sum_{k=0}^{\infty}S^{\top}A^{k\top}S^{-1\top}S^{\top}C^{\top}CSS^{-1}A^{k}S
=k=0A~kC~C~A~k=A~ΣA~+C~C~=Σ\displaystyle=\sum_{k=0}^{\infty}\tilde{A}^{k\top}\tilde{C}^{\top}\tilde{C}\tilde{A}^{k}=\tilde{A}^{\top}\Sigma\tilde{A}+\tilde{C}^{\top}\tilde{C}=\Sigma

Then C~=UΣ1:p,:1/2\tilde{C}=U\Sigma^{1/2}_{1:p,:} and

UΣ1/2A~\displaystyle U\Sigma^{1/2}\tilde{A} =[UΣ1/2]p+1:,:\displaystyle=[U\Sigma^{1/2}]_{p+1:,:}
A~\displaystyle\tilde{A} =Σ1/2U[UΣ1/2]p+1:,:\displaystyle=\Sigma^{-1/2}U^{\top}[U\Sigma^{1/2}]_{p+1:,:}

We do a similar computation for BB.
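As an illustration of Proposition 2 (our sketch; for a finite Hankel matrix we use the shift-based analogue \tilde{A}=(U\Sigma^{1/2})_{\text{up}}^{+}(U\Sigma^{1/2})_{\text{down}} of the formula above, and the random stable system is arbitrary):

import numpy as np

rng = np.random.default_rng(6)
n, p, m, d = 3, 1, 1, 10
A = rng.standard_normal((n, n)); A *= 0.8 / np.abs(np.linalg.eigvals(A)).max()
B = rng.standard_normal((n, m)); C = rng.standard_normal((p, n))

# truncated Hankel matrix with block (i, j) = C A^(i+j) B, cf. Eq. (23)
H = np.block([[C @ np.linalg.matrix_power(A, i + j) @ B for j in range(d)] for i in range(d)])
U, s, Vt = np.linalg.svd(H)
U, s, Vt = U[:, :n], s[:n], Vt[:n, :]                     # keep the n nonzero singular values

O = U * np.sqrt(s)                                        # U Sigma^{1/2}
Ct = O[:p, :]                                             # C tilde: first p rows
Bt = (np.sqrt(s)[:, None] * Vt)[:, :m]                    # B tilde: first m columns of Sigma^{1/2} V^T
At = np.linalg.pinv(O[:-p, :]) @ O[p:, :]                 # A tilde via the shift of U Sigma^{1/2}

# the recovered realization reproduces the Markov parameters C A^k B
print(all(np.allclose(Ct @ np.linalg.matrix_power(At, k) @ Bt,
                      C @ np.linalg.matrix_power(A, k) @ B) for k in range(5)))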

It should be noted that a balanced realization C~,A~,B~\tilde{C},\tilde{A},\tilde{B} is unique except when there are some Hankel singular values that are equal. To see this, assume that we have

σ1>>σr1>σr=σr+1==σs>σs+1>σn\sigma_{1}>\ldots>\sigma_{r-1}>\sigma_{r}=\sigma_{r+1}=\ldots=\sigma_{s}>\sigma_{s+1}>\ldots\sigma_{n}

where sr>0s-r>0. For any unitary matrix Q(sr+1)×(sr+1)Q\in\mathbb{R}^{(s-r+1)\times(s-r+1)}, define Q0Q_{0}

Q0=[I(r1)×(r1)000Q000I(ns)×(ns)]Q_{0}=\begin{bmatrix}I_{(r-1)\times(r-1)}&0&0\\ 0&Q&0\\ 0&0&I_{(n-s)\times(n-s)}\end{bmatrix} (26)

Then every triple (C~Q0,Q0A~Q0,Q0B~)(\tilde{C}Q_{0},Q_{0}^{\top}\tilde{A}Q_{0},Q_{0}^{\top}\tilde{B}) satisfies Eq. (25) and is a balanced realization. Let Mk=(C~k,A~kk,B~k)M_{k}=(\tilde{C}_{k},\tilde{A}_{kk},\tilde{B}_{k}) where

A~=[A~kkA~0kA~k0A~00],B~=[B~kB~0],C~=[C~kC~0]\displaystyle\tilde{A}=\begin{bmatrix}\tilde{A}_{kk}&\tilde{A}_{0k}\\ \tilde{A}_{k0}&\tilde{A}_{00}\end{bmatrix},\tilde{B}=\begin{bmatrix}\tilde{B}_{k}\\ \tilde{B}_{0}\end{bmatrix},\tilde{C}=\begin{bmatrix}\tilde{C}_{k}&\tilde{C}_{0}\end{bmatrix} (27)

Here \tilde{A}_{kk} is the top-left k\times k submatrix of \tilde{A}, with \tilde{B}_{k},\tilde{C}_{k} the corresponding partitions of \tilde{B},\tilde{C}. The realization M_{k}=(\tilde{C}_{k},\tilde{A}_{kk},\tilde{B}_{k}) is the k–order balanced truncated model. Clearly M\equiv M_{n}, which gives us \tilde{C}=\tilde{C}_{n},\tilde{A}=\tilde{A}_{nn},\tilde{B}=\tilde{B}_{n}, i.e., the balanced version of the true model. We will show that for the balanced truncated model we only need the top k singular vectors and not the entire model.

Proposition 3.

For the k-order balanced truncated model M_{k}, we only need the top k singular values and singular vectors of \mathcal{H}_{0,\infty,\infty}.

Proof 9.2.

From the preceding discussion in Proposition 2 and Eq. (27) it is clear that the first p×kp\times k block submatrix of UΣ1/2U\Sigma^{1/2} (corresponding to the top kk singular vectors) gives us C~k\tilde{C}_{k}. Since

A~=Σ1/2U[UΣ1/2]p+1:,:\tilde{A}=\Sigma^{-1/2}U^{\top}[U\Sigma^{1/2}]_{p+1:,:}

we observe that \tilde{A}_{kk} depends only on the top k singular vectors U_{k} and the corresponding singular values. This can be seen as follows: [U\Sigma^{1/2}]_{p+1:,:} denotes the submatrix of U\Sigma^{1/2} with the top p rows removed. In U\Sigma^{1/2} each column of U is scaled by the corresponding singular value. Then the \tilde{A}_{kk} submatrix depends only on the top k rows of \Sigma^{-1/2}U^{\top} and the top k columns of [U\Sigma^{1/2}]_{p+1:,:}, which correspond to the top k singular vectors.

10 Isometry of Input Matrix: Proof of Lemma 1

Theorem 10.1.

Define

U\displaystyle U [UdUd+1UT+d1Ud1UdUT+d2U1U2UT]\displaystyle\coloneqq\begin{bmatrix}U_{d}&U_{d+1}&\ldots&U_{T+d-1}\\ U_{d-1}&U_{d}&\ldots&U_{T+d-2}\\ \vdots&\vdots&\ddots&\vdots\\ U_{1}&U_{2}&\ldots&U_{T}\end{bmatrix}

where each Ui𝗌𝗎𝖻𝗀(1)U_{i}\sim\mathsf{subg}(1) and isotropic. Then there exists an absolute constant cc such that UU satisfies:

(1/2)Tσmin(UU)σmax(UU)(3/2)T(1/2)T\leq\sigma_{\min}(UU^{\top})\leq\sigma_{\max}(UU^{\top})\leq(3/2)T

whenever Tcm2d(log2(d)log2(m2/δ)+log3(2d))T\geq cm^{2}d(\log^{2}{(d)}\log^{2}{(m^{2}/\delta)}+\log^{3}{(2d)}) with probability at least 1δ1-\delta.

Proof 10.2.

Define

Amd×md\displaystyle A_{md\times md} [0000I0000I0000I0],Bmd×m[I00],U^kUd+k\displaystyle\coloneqq\begin{bmatrix}0&0&0&\ldots&0\\ I&0&0&\ldots&0\\ \vdots&\ddots&\ddots&\vdots&\vdots\\ 0&\ldots&I&0&0\\ 0&\ldots&0&I&0\end{bmatrix},B_{md\times m}\coloneqq\begin{bmatrix}I\\ 0\\ \vdots\\ 0\end{bmatrix},\widehat{U}_{k}\coloneqq U_{d+k}

Since

U=[UdUd+1UT+d1Ud1UdUT+d2U1U2UT]U=\begin{bmatrix}U_{d}&U_{d+1}&\ldots&U_{T+d-1}\\ U_{d-1}&U_{d}&\ldots&U_{T+d-2}\\ \vdots&\vdots&\ldots&\vdots\\ U_{1}&U_{2}&\ldots&U_{T}\end{bmatrix}

we can reformulate it so that each column is the output of an LTI system in the following sense:

xk+1\displaystyle x_{k+1} =Axk+BU^(k+1)\displaystyle=Ax_{k}+B\widehat{U}(k+1) (28)

where UU=k=0T1xkxkUU^{\top}=\sum_{k=0}^{T-1}x_{k}x_{k}^{\top} and x0=[UdUd1U1]x_{0}=\begin{bmatrix}U_{d}\\ U_{d-1}\\ \vdots\\ U_{1}\end{bmatrix}. From Theorem 1 we have that

34TIk=0T1U^kU^k54TI\frac{3}{4}TI\preceq\sum_{k=0}^{T-1}\widehat{U}_{k}\widehat{U}_{k}^{\top}\preceq\frac{5}{4}TI

with probability at least 1-\delta whenever T\geq c\Big{(}m+\log{\frac{2}{\delta}}\Big{)}. Define V_{t}=\sum_{k=0}^{t-1}x_{k}x_{k}^{\top}. Then,

VT\displaystyle V_{T} =AVT1A+B(k=0T1U^kU^k)B+k=0T2(AxkU^k+1B+BU^k+1xkA)\displaystyle=AV_{T-1}A^{\top}+B\left(\sum_{k=0}^{T-1}\widehat{U}_{k}\widehat{U}_{k}^{\top}\right)B^{\top}+\sum_{k=0}^{T-2}\left(Ax_{k}\widehat{U}^{\top}_{k+1}B^{\top}+B\widehat{U}_{k+1}x^{\top}_{k}A^{\top}\right) (29)

It can be easily checked that xk=[Ud+kUd+k1Uk+1]x_{k}=\begin{bmatrix}U_{d+k}\\ U_{d+k-1}\\ \vdots\\ U_{k+1}\end{bmatrix} and consequently

k=0T2AxkU^k+1B=k=0T2[0000Ud+kUd+k+1000Ud+k1Ud+k+1000Uk+2Ud+k+1000].\displaystyle\sum_{k=0}^{T-2}Ax_{k}\widehat{U}^{\top}_{k+1}B^{\top}=\sum_{k=0}^{T-2}\begin{bmatrix}0&0&\ldots&0&0\\ U_{d+k}U_{d+k+1}^{\top}&0&\ldots&0&0\\ U_{d+k-1}U_{d+k+1}^{\top}&0&\ldots&0&0\\ \vdots&\vdots&\ddots&\vdots&\vdots\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ U_{k+2}U_{d+k+1}^{\top}&0&\ldots&0&0\end{bmatrix}.

Define Ljk=0T2Ud+kj+1Ud+k+1L_{j}\coloneqq\sum_{k=0}^{T-2}U_{d+k-j+1}U_{d+k+1}^{\top} and LjL_{j} is a m×mm\times m block matrix. Then

Td=l=0d1Al(k=0T2AxkU^k+1B)Al=[00000L10000L2L1000Ld100L10].\displaystyle T_{d}=\sum_{l=0}^{d-1}A^{l}\left(\sum_{k=0}^{T-2}Ax_{k}\widehat{U}^{\top}_{k+1}B^{\top}\right)A^{l\top}=\begin{bmatrix}0&0&\ldots&0&0&0\\ L_{1}&0&\ldots&0&0&0\\ L_{2}&L_{1}&\ldots&0&0&0\\ \vdots&\vdots&\ddots&\vdots&\vdots&\vdots\\ \vdots&\vdots&\vdots&\ddots&\vdots&\vdots\\ L_{d-1}&0&\ldots&0&L_{1}&0\end{bmatrix}.

Use Lemma 1 to show that

TdcmTdlog((d))log((m2/δ))\left\|T_{d}\right\|\leq cm\sqrt{Td}\log{(d)}\log{(m^{2}/\delta)} (30)

with probability at least 1δ1-\delta. Then

VT=l=0d1AlB(k=0T1U^kU^k)BAl+Tdl=0d1AlxT1xT1Al.\displaystyle V_{T}=\sum_{l=0}^{d-1}A^{l}B\left(\sum_{k=0}^{T-1}\widehat{U}_{k}\widehat{U}_{k}^{\top}\right)B^{\top}A^{l\top}+T_{d}-\sum_{l=0}^{d-1}A^{l}x_{T-1}x_{T-1}^{\top}A^{l\top}.

From Theorem 1 we have with probability at least 1-\delta that

(3/4)TIl=0d1AlB(k=0T1U^kU^k)BAl(5/4)TI(3/4)TI\preceq\sum_{l=0}^{d-1}A^{l}B\Bigg{(}\sum_{k=0}^{T-1}\widehat{U}_{k}\widehat{U}_{k}^{\top}\Bigg{)}B^{\top}A^{l\top}\preceq(5/4)TI (31)

whenever Tc(m+log(2δ))T\geq c\Big{(}m+\log{\frac{2}{\delta}}\Big{)}. Observe that

l=1dAlxT1xT1Al\displaystyle\left\|\sum_{l=1}^{d}A^{l}x_{T-1}x_{T-1}^{\top}A^{l\top}\right\| =σ12([AxT1,A2xT1,,AdxT1])\displaystyle=\sigma_{1}^{2}([Ax_{T-1},A^{2}x_{T-1},\ldots,A^{d}x_{T-1}])

The matrix [AxT1,A2xT1,,AdxT1][Ax_{T-1},A^{2}x_{T-1},\ldots,A^{d}x_{T-1}] is the lower triangular submatrix of a random Toeplitz matrix with i.i.d 𝗌𝗎𝖻𝗀(1)\mathsf{subg}(1) entries as in Theorem 3. Then using Theorem 3 and Proposition 5 we get that with probability at least 1δ1-\delta we have

[AxT1,A2xT1,,AdxT1]cm(dlog((2d))log((2d))+dlog((1/δ))).\left\|[Ax_{T-1},A^{2}x_{T-1},\ldots,A^{d}x_{T-1}]\right\|\leq cm(\sqrt{d\log{(2d)}}\log{(2d)}+\sqrt{d\log{(1/\delta)}}). (32)

Then l=1dAlxT1xT1Alcm2d(log3(2d)+log((1/δ))+log((2d))log((2d))log((1/δ)))\left\|\sum_{l=1}^{d}A^{l}x_{T-1}x_{T-1}^{\top}A^{l\top}\right\|\leq cm^{2}d(\log^{3}{(2d)}+\log{(1/\delta)}+\log{(2d)}\sqrt{\log{(2d)}\log{(1/\delta)}}) with probability at least 1δ1-\delta. By ensuring that Eqs. (30), (31) and (32) hold simultaneously we can ensure that cmTdlog((d))log((m2/δ))T/8cm\sqrt{Td}\log{(d)}\log{(m^{2}/\delta)}\leq T/8 and cm2d(log3(2d)+log((1/δ))+log((2d))log((2d))log((1/δ)))T/8cm^{2}d(\log^{3}{(2d)}+\log{(1/\delta)}+\log{(2d)}\sqrt{\log{(2d)}\log{(1/\delta)}})\leq T/8 for large enough TT and absolute constant cc.
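A quick empirical check of this isometry (our sketch; scalar inputs and arbitrary sizes): the eigenvalues of UU^{\top}/T concentrate around one, well inside the interval [1/2, 3/2].

import numpy as np

rng = np.random.default_rng(7)
m, d, T = 1, 5, 20000                                     # scalar inputs for simplicity
u = rng.standard_normal(T + d - 1)                        # U_1, ..., U_{T+d-1}
U = np.array([[u[d - 1 - i + l] for l in range(T)] for i in range(d)])  # row i stacks U_{d-i+l}
eig = np.linalg.eigvalsh(U @ U.T) / T
print(eig.min() >= 0.5, eig.max() <= 1.5, eig.min(), eig.max())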

Lemma 1.

Let {Ujm×1}j=1T+d\{U_{j}\in\mathbb{R}^{m\times 1}\}_{j=1}^{T+d} be independent 𝗌𝗎𝖻𝗀(1)\mathsf{subg}(1) random vectors. Define Ljk=0T2Ud+kj+1Ud+k+1L_{j}\coloneqq\sum_{k=0}^{T-2}U_{d+k-j+1}U_{d+k+1}^{\top} for all j1j\geq 1 and

Td[00000L10000L2L1000Ld100L10].\displaystyle T_{d}\coloneqq\begin{bmatrix}0&0&\ldots&0&0&0\\ L_{1}&0&\ldots&0&0&0\\ L_{2}&L_{1}&\ldots&0&0&0\\ \vdots&\vdots&\ddots&\vdots&\vdots&\vdots\\ \vdots&\vdots&\vdots&\ddots&\vdots&\vdots\\ L_{d-1}&0&\ldots&0&L_{1}&0\end{bmatrix}.

Then with probability at least 1δ1-\delta we have

\left\|T_{d}\right\|\leq cm\sqrt{Td}\log{(d)}\log{(m^{2}/\delta)}.
Proof 10.3.

Since the L_{j} are block matrices, the techniques in Meckes et al. (2007) cannot be directly applied. However, by noting that T_{d} can be broken into a sum of m matrices, where the norm of each matrix can be bounded by that of a Toeplitz matrix, we can use the result from Meckes et al. (2007). For instance, if m=2 and \{u_{i}\}_{i=1}^{\infty} are independent \mathsf{subg}(1) random variables then we have

Td=[[0000][0000][u1u2u3u4][0000][u5u6u7u8][u1u2u3u4]].\displaystyle T_{d}=\begin{bmatrix}\begin{bmatrix}0&0\\ 0&0\end{bmatrix}&\begin{bmatrix}0&0\\ 0&0\end{bmatrix}&\ldots\\ \begin{bmatrix}u_{1}&u_{2}\\ u_{3}&u_{4}\end{bmatrix}&\begin{bmatrix}0&0\\ 0&0\end{bmatrix}&\ldots\\ \begin{bmatrix}u_{5}&u_{6}\\ u_{7}&u_{8}\end{bmatrix}&\begin{bmatrix}u_{1}&u_{2}\\ u_{3}&u_{4}\end{bmatrix}&\ldots\\ \vdots&\vdots&\ddots\end{bmatrix}.

Now,

Td=[[0000][0000][u10u30][0000][u50u70][u10u30]]=M1+[[0000][0000][0u20u4][0000][0u60u8][0u20u4]]=M2,\displaystyle T_{d}=\underbrace{\begin{bmatrix}\begin{bmatrix}0&0\\ 0&0\end{bmatrix}&\begin{bmatrix}0&0\\ 0&0\end{bmatrix}&\ldots\\ \begin{bmatrix}u_{1}&0\\ u_{3}&0\end{bmatrix}&\begin{bmatrix}0&0\\ 0&0\end{bmatrix}&\ldots\\ \begin{bmatrix}u_{5}&0\\ u_{7}&0\end{bmatrix}&\begin{bmatrix}u_{1}&0\\ u_{3}&0\end{bmatrix}&\ldots\\ \vdots&\vdots&\ddots\end{bmatrix}}_{=M_{1}}+\underbrace{\begin{bmatrix}\begin{bmatrix}0&0\\ 0&0\end{bmatrix}&\begin{bmatrix}0&0\\ 0&0\end{bmatrix}&\ldots\\ \begin{bmatrix}0&u_{2}\\ 0&u_{4}\end{bmatrix}&\begin{bmatrix}0&0\\ 0&0\end{bmatrix}&\ldots\\ \begin{bmatrix}0&u_{6}\\ 0&u_{8}\end{bmatrix}&\begin{bmatrix}0&u_{2}\\ 0&u_{4}\end{bmatrix}&\ldots\\ \vdots&\vdots&\ddots\end{bmatrix}}_{=M_{2}},

then Tdsup1i2Mi||T_{d}||\leq\sup_{1\leq i\leq 2}||M_{i}||. Furthermore for each MiM_{i} we have

M1=[[0000][0000][u1000][0000][u5000][u1000]]=M11+[[0000][0000][00u30][0000][00u70][00u30]]=M12,\displaystyle M_{1}=\underbrace{\begin{bmatrix}\begin{bmatrix}0&0\\ 0&0\end{bmatrix}&\begin{bmatrix}0&0\\ 0&0\end{bmatrix}&\ldots\\ \begin{bmatrix}u_{1}&0\\ 0&0\end{bmatrix}&\begin{bmatrix}0&0\\ 0&0\end{bmatrix}&\ldots\\ \begin{bmatrix}u_{5}&0\\ 0&0\end{bmatrix}&\begin{bmatrix}u_{1}&0\\ 0&0\end{bmatrix}&\ldots\\ \vdots&\vdots&\ddots\end{bmatrix}}_{=M_{11}}+\underbrace{\begin{bmatrix}\begin{bmatrix}0&0\\ 0&0\end{bmatrix}&\begin{bmatrix}0&0\\ 0&0\end{bmatrix}&\ldots\\ \begin{bmatrix}0&0\\ u_{3}&0\end{bmatrix}&\begin{bmatrix}0&0\\ 0&0\end{bmatrix}&\ldots\\ \begin{bmatrix}0&0\\ u_{7}&0\end{bmatrix}&\begin{bmatrix}0&0\\ u_{3}&0\end{bmatrix}&\ldots\\ \vdots&\vdots&\ddots\end{bmatrix}}_{=M_{12}},

and M1M11+M12||M_{1}||\leq||M_{11}||+||M_{12}||. The key idea is to show that Mi1M_{i1} are Toeplitz matrices (after removing the zeros in the blocks) and we can use the standard techniques described in proof of Theorem 1 in Meckes et al. (2007). Then we will show that each MijC||M_{ij}||\leq C with high probability and TdmC||T_{d}||\leq mC.

For brevity, we will assume for now that the U_{i} are scalars; at the end we will scale by m. By standard techniques described in the proof of Theorem 1 in Meckes et al. (2007), the finite Toeplitz matrix T_{d}+T_{d}^{\top} is a d\times d submatrix of the infinite Laurent matrix

M=[L|jk|1|jk|<d1]j,k.M=[L_{|j-k|}\textbf{1}_{|j-k|<d-1}]_{j,k\in\mathbb{Z}}.

Consider M as an operator on \ell^{2}(\mathbb{Z}) in the canonical way, and let \psi:\ell^{2}(\mathbb{Z})\rightarrow L^{2}[0,1] denote the usual linear trigonometric isometry \psi(e_{j})(x)=e^{2\pi ijx}. Then \psi M\psi^{-1}:L^{2}\rightarrow L^{2} is the operator corresponding to

f(x)=j=(d1)d1L|j|e2πijx=L0+2j=1d1cos((2πjx))Ljf(x)=\sum_{j=-(d-1)}^{d-1}L_{|j|}e^{2\pi ijx}=L_{0}+2\sum_{j=1}^{d-1}\cos{(2\pi jx)}L_{j}

Therefore,

Td+TdM=f=sup0x1|Yx|\left\|T_{d}+T_{d}^{\top}\right\|\leq\left\|M\right\|=\left\|f\right\|_{\infty}=\sup_{0\leq x\leq 1}|Y_{x}|

where Yx=2j=1d1cos((2πjx))LjY_{x}=2\sum_{j=1}^{d-1}\cos{(2\pi jx)}L_{j}. Furthermore note that YxY_{x} has the following form

Yx=U[0c1xc2xcd1x0000c1xcd1x000000c1xcd1x000000]=CxU.Y_{x}=U^{\top}\underbrace{\begin{bmatrix}0&c^{x}_{1}&c^{x}_{2}&\ldots&c^{x}_{d-1}&0&\ldots&0\\ 0&0&c^{x}_{1}&\ldots&c^{x}_{d-1}&0&\ldots&0\\ \vdots&\vdots&\ldots&\ddots&\ldots&\ddots&\vdots&\vdots\\ \vdots&\vdots&\vdots&\vdots&\ddots&\vdots&\ddots&\vdots\\ 0&0&\ldots&0&0&c^{x}_{1}&\ldots&c^{x}_{d-1}\\ 0&0&\ldots&0&0&0&\ldots&0\\ \vdots&\vdots&\vdots&\vdots&\vdots&\vdots&\vdots&\vdots\\ \end{bmatrix}}_{=C_{x}}U. (33)

Here U=[U1U2UT+d]U=\begin{bmatrix}U_{1}\\ U_{2}\\ \vdots\\ U_{T+d}\end{bmatrix} and cjx=2cos((2πjx))c^{x}_{j}=2\cos{(2\pi jx)}. For any xx and assuming Uj𝗌𝗎𝖻𝗀(1)U_{j}\sim\mathsf{subg}(1), we have from Theorem 4

\mathbb{P}\Big{(}\absolutevalue{Y_{x}/\sqrt{Td}}\geq t\Big{)}\leq 2\exp\left(-c(t\wedge t^{2})\right) (34)

The tail behavior of Y_{x}/\sqrt{Td} is not strictly subgaussian and we need to use Theorem 6. The function \psi can be taken as in Eq. 1 of van de Geer and Lederer (2013) (equivalent up to universal constants) with L=2, with inverse

ψ1(t)=log((1+t))+log((1+t)).\psi^{-1}(t)=\sqrt{\log{(1+t)}}+\log{(1+t)}.

We have that

supt|Yt|ψY0ψ+Td01ψ1(N(ϵ/2,d))𝑑ϵ,\left\|\sup_{t}|Y_{t}|\right\|_{\psi}\leq\left\|Y_{0}\right\|_{\psi}+\sqrt{Td}\int_{0}^{1}\psi^{-1}(N(\epsilon/2,d))d\epsilon,

where d(s,t)=(YsYt)/Tdψd(s,t)=\left\|(Y_{s}-Y_{t})/\sqrt{Td}\right\|_{\psi} and N(ϵ,d)N(\epsilon,d) is the minimal number of balls of radius ϵ\epsilon needed to cover [0,1][0,1] where d(,)d(\cdot,\cdot) is the pseudometric. Since YsY_{s} has distribution as in Eq. (34), it follows that d(s,t)c|st|d(s,t)\leq c|s-t| for some absolute constant cc. Then

01ψ1(N(ϵ/2,d))𝑑ϵc\int_{0}^{1}\psi^{-1}(N(\epsilon/2,d))d\epsilon\leq c

for some universal constant c>0c>0. This ensures that supt|Yt|ψcTd\left\|\sup_{t}|Y_{t}|\right\|_{\psi}\leq c\sqrt{Td}. Since 𝔼[X]Xψ\mathbb{E}[X]\leq\left\|X\right\|_{\psi} we have that 𝔼[sup0x1|Yx|]Td\mathbb{E}[\sup_{0\leq x\leq 1}|Y_{x}|]\leq\sqrt{Td}. This implies 𝔼[Td+Td]Td\mathbb{E}[\left\|T_{d}+T_{d}^{\top}\right\|]\leq\sqrt{Td}, and using Proposition 5 we have 𝔼[Td]cTdlog((d))\mathbb{E}[\left\|T_{d}\right\|]\leq c\sqrt{Td}\log{(d)}. Furthermore, we can make a stronger statement because supt|Yt|ψcTd\left\|\sup_{t}|Y_{t}|\right\|_{\psi}\leq c\sqrt{Td} which implies that

TdcTdlog((d))log((1/δ))\left\|T_{d}\right\|\leq c\sqrt{Td}\log{(d)}\log{(1/\delta)}

with probability at least 1-\delta. Then, recalling that in the general case the L_{j} blocks of T_{d} are m\times m matrices, we scale by m and get with probability at least 1-\delta

TdcmTdlog((d))log((m2/δ))\left\|T_{d}\right\|\leq cm\sqrt{Td}\log{(d)}\log{(m^{2}/\delta)}

where the union bound is over all m^{2} block entries, each being less than c\sqrt{Td}\log{(d)}\log{(m^{2}/\delta)}.

11 Error Analysis for Theorem 5.1

For this section we assume that Ut𝗌𝗎𝖻𝗀(L2)U_{t}\sim\mathsf{subg}(L^{2}).

11.1 Proof of Theorem 2

Recall Eq. (8) and (9), i.e.,

Y~l,d+\displaystyle\tilde{Y}^{+}_{l,d} =0,d,dU~l1,d+𝒯0,dU~l,d++d,d,ld1U~ld1,ld1\displaystyle=\mathcal{H}_{0,d,d}\tilde{U}^{-}_{l-1,d}+\mathcal{T}_{0,d}\tilde{U}^{+}_{l,d}+\mathcal{H}_{d,d,l-d-1}\tilde{U}^{-}_{l-d-1,l-d-1}
+𝒪0,d,dη~l1,d+𝒯𝒪0,dη~l,d++𝒪d,d,ld1η~ld1,ld1+w~l,d+\displaystyle+\mathcal{O}_{0,d,d}\tilde{\eta}^{-}_{l-1,d}+\mathcal{T}\mathcal{O}_{0,d}\tilde{\eta}^{+}_{l,d}+\mathcal{O}_{d,d,l-d-1}\tilde{\eta}^{-}_{l-d-1,l-d-1}+\tilde{w}^{+}_{l,d} (35)

Assume for now that we have T+2dT+2d data points instead of TT. It is clear that

^0,d,d=argminl=0T1Y~l+d+1,d+U~l+d,d22=(l=0T1Y~l+d+1,d+(U~l+d,d))VT+\hat{\mathcal{H}}_{0,d,d}=\arg\min_{\mathcal{H}}\sum_{l=0}^{T-1}||\tilde{Y}^{+}_{l+d+1,d}-\mathcal{H}\tilde{U}^{-}_{l+d,d}||_{2}^{2}=\left(\sum_{l=0}^{T-1}\tilde{Y}^{+}_{l+d+1,d}\left(\tilde{U}^{-}_{l+d,d}\right)^{\top}\right)V_{T}^{+}

where

VT=l=0T1U~l+d,dU~l+d,d,V_{T}=\sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}\tilde{U}^{-\prime}_{l+d,d}, (36)

or

VT=UUV_{T}=UU^{\prime}

where

U\displaystyle U [UdUd+1UT+d1Ud1UdUT+d2U1U2UT].\displaystyle\coloneqq\begin{bmatrix}U_{d}&U_{d+1}&\ldots&U_{T+d-1}\\ U_{d-1}&U_{d}&\ldots&U_{T+d-2}\\ \vdots&\vdots&\ddots&\vdots\\ U_{1}&U_{2}&\ldots&U_{T}\end{bmatrix}.

It is shown in Theorem 10.1 that V_{T} is invertible with probability at least 1-\delta. So in our analysis we can write this as

(l=0T1U~l+d,dU~l+d,d)+=(l=0T1U~l+d,dU~l+d,d)1\left(\sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}\tilde{U}^{-\top}_{l+d,d}\right)^{+}=\left(\sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}\tilde{U}^{-\top}_{l+d,d}\right)^{-1}
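Before turning to the error decomposition, here is a hedged sketch of this least-squares step for the SISO case (our reading of the index conventions in Eqs. (8)–(9); the demo system is arbitrary): regress the stacked future outputs on the stacked past inputs.

import numpy as np

def hankel_ls(y, u, d):
    # hat H_{0,d,d}: entry (i, j) estimates the lag-(i+j+1) Markov parameter C A^(i+j) B
    T = len(y) - 2 * d - 1
    Uminus = np.array([u[l + d:l:-1] for l in range(T)]).T               # stacks U_{l+d}, ..., U_{l+1}
    Yplus = np.array([y[l + d + 1:l + 2 * d + 1] for l in range(T)]).T   # stacks Y_{l+d+1}, ..., Y_{l+2d}
    return (Yplus @ Uminus.T) @ np.linalg.pinv(Uminus @ Uminus.T)

# demo on a noisy two-tap system (illustrative only): expect a Hankel pattern [[0.5, 0.25, 0], ...]
rng = np.random.default_rng(8)
u = rng.standard_normal(3000)
y = 0.5 * np.r_[0, u[:-1]] + 0.25 * np.r_[0, 0, u[:-2]] + 0.1 * rng.standard_normal(3000)
print(np.round(hankel_ls(y, u, d=3), 2))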

From this one can conclude that

^0,d,d2\displaystyle\Big{|}\Big{|}\hat{\mathcal{H}}-\mathcal{H}_{0,d,d}\Big{|}\Big{|}_{2} =||(l=0T1U~l+d,dU~l+d,d)1(l=0T1U~l+d,dU~l+d+1,d+𝒯0,d\displaystyle=\Big{|}\Big{|}\Big{(}\sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}\tilde{U}^{-\top}_{l+d,d}\Big{)}^{-1}\Big{(}\sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}\tilde{U}^{+\top}_{l+d+1,d}\mathcal{T}_{0,d}^{\top}
+U~l+d,dU~l,ld,d,l+U~l+d,dη~l+d,d𝒪0,d,d\displaystyle+\tilde{U}^{-}_{l+d,d}\tilde{U}^{-\top}_{l,l}\mathcal{H}_{d,d,l}^{\top}+\tilde{U}^{-}_{l+d,d}\tilde{\eta}^{-\top}_{l+d,d}\mathcal{O}_{0,d,d}^{\top}
+U~l+d,dη~l+d+1,d+𝒯𝒪0,d+U~l+d,dη~l,l𝒪d,d,l+U~l+d,dw~l+d+1,d+)||2\displaystyle+\tilde{U}^{-}_{l+d,d}\tilde{\eta}^{+\top}_{l+d+1,d}\mathcal{T}\mathcal{O}^{\top}_{0,d}+\tilde{U}^{-}_{l+d,d}\tilde{\eta}^{-\top}_{l,l}\mathcal{O}_{d,d,l}^{\top}+\tilde{U}^{-}_{l+d,d}\tilde{w}^{+\top}_{l+d+1,d}\Big{)}\Big{|}\Big{|}_{2} (37)

Here, as we can observe, \tilde{U}^{-\top}_{l,l},\tilde{\eta}^{-\top}_{l,l} grow in dimension with T. Based on this we divide our error terms into two parts:

E_{1}=\Big{(}\sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}\tilde{U}^{-\top}_{l+d,d}\Big{)}^{-1}\sum_{l=0}^{T-1}\Bigg{(}\tilde{U}^{-}_{l+d,d}\tilde{U}^{-\top}_{l,l}\mathcal{H}_{d,d,l}^{\top}+\tilde{U}^{-}_{l+d,d}\tilde{\eta}^{-\top}_{l,l}\mathcal{O}_{d,d,l}^{\top}\Bigg{)} (38)

and

E_{2}=\Big{(}\sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}\tilde{U}^{-\top}_{l+d,d}\Big{)}^{-1}\sum_{l=0}^{T-1}\Bigg{(}\tilde{U}^{-}_{l+d,d}\tilde{U}^{+\top}_{l+d+1,d}\mathcal{T}_{0,d}^{\top}+\tilde{U}^{-}_{l+d,d}\tilde{\eta}^{-\top}_{l+d,d}\mathcal{O}_{0,d,d}^{\top}+\tilde{U}^{-}_{l+d,d}\tilde{\eta}^{+\top}_{l+d+1,d}\mathcal{T}\mathcal{O}^{\top}_{0,d}+\tilde{U}^{-}_{l+d,d}\tilde{w}^{+\top}_{l+d+1,d}\Bigg{)} (39)

Then the proof of Theorem 5.1 reduces to Propositions 1–3. We first analyze

VT1/2(l=0T1U~l+d,dU~l,ld,d,l)2\Big{|}\Big{|}V^{-1/2}_{T}\Big{(}\sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}\tilde{U}^{-\top}_{l,l}\mathcal{H}_{d,d,l}^{\top}\Big{)}\Big{|}\Big{|}_{2}

The analysis of VT1/2(l=0T1U~l+d,dη~l,lOd,d,l)||V^{-1/2}_{T}(\sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}\tilde{\eta}^{-\top}_{l,l}O_{d,d,l}^{\top})|| will be almost identical and will only differ in constants.

Proposition 1.

For 0<δ<10<\delta<1, we have with probability at least 12δ1-2\delta

VT1/2(l=0T1U~l+d,dU~l,ld,d,l)24σlog(1δ)+pd+m\displaystyle\Big{|}\Big{|}V^{-1/2}_{T}\Big{(}\sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}\tilde{U}^{-\top}_{l,l}\mathcal{H}_{d,d,l}^{\top}\Big{)}\Big{|}\Big{|}_{2}\leq 4\sigma\sqrt{\log{\frac{1}{\delta}}+pd+m}

where σ=σ(k=1d𝒯d+k,T𝒯d+k,T)\sigma=\sqrt{\sigma(\sum_{k=1}^{d}\mathcal{T}_{d+k,T}^{\top}\mathcal{T}_{d+k,T})}.

Proof 11.1.

We proved that TI2VT3TI2\frac{TI}{2}\preceq V_{T}\preceq\frac{3TI}{2} with high probability, then

(VT1/2(l=0T1U~l+d,dU~l,ld,d,l)2a,TI2VT3TI2)\displaystyle\mathbb{P}\Big{(}\Big{|}\Big{|}V^{-1/2}_{T}\Big{(}\sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}\tilde{U}^{-\prime}_{l,l}\mathcal{H}_{d,d,l}^{\prime}\Big{)}\Big{|}\Big{|}_{2}\geq a,\frac{TI}{2}\preceq V_{T}\preceq\frac{3TI}{2}\Big{)}
(2T(l=0T1U~l+d,dU~l,ld,d,l)2a,TI2VT3TI2)\displaystyle\leq\mathbb{P}\Big{(}\Big{|}\Big{|}\sqrt{\frac{2}{T}}\Big{(}\sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}\tilde{U}^{-\prime}_{l,l}\mathcal{H}_{d,d,l}^{\prime}\Big{)}\Big{|}\Big{|}_{2}\geq a,\frac{TI}{2}\preceq V_{T}\preceq\frac{3TI}{2}\Big{)}
(2supv𝒩122T(l=0T1U~l+d,dU~l,ld,d,lv)2a)+(TI2VT3TI2)1\displaystyle\leq\mathbb{P}\Big{(}2\sup_{v\in{\mathcal{N}}_{\frac{1}{2}}}\Big{|}\Big{|}\sqrt{\frac{2}{T}}\Big{(}\sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}\tilde{U}^{-\prime}_{l,l}\mathcal{H}_{d,d,l}^{\prime}v\Big{)}\Big{|}\Big{|}_{2}\geq a\Big{)}+\mathbb{P}\Big{(}\frac{TI}{2}\preceq V_{T}\preceq\frac{3TI}{2}\Big{)}-1
5pd(22T(l=0T1U~l+d,dU~l,ld,d,lv)2a)δ.\displaystyle\leq 5^{pd}\mathbb{P}\Big{(}2\Big{|}\Big{|}\sqrt{\frac{2}{T}}\Big{(}\sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}\tilde{U}^{-\prime}_{l,l}\mathcal{H}_{d,d,l}^{\prime}v\Big{)}\Big{|}\Big{|}_{2}\geq a\Big{)}-\delta. (40)

Define the following ηl,d=U~l,ld,d,lv,Xl,d=2TU~l+d,d\eta_{l,d}=\tilde{U}^{-\top}_{l,l}\mathcal{H}_{d,d,l}^{\top}v,X_{l,d}=\sqrt{\frac{2}{T}}\tilde{U}^{-}_{l+d,d}. Observe that ηl,d,ηl+1,d\eta_{l,d},\eta_{l+1,d} have contributions from Ul1,Ul2U_{l-1},U_{l-2} etc. and do not immediately satisfy the conditions of Theorem 3. Instead we will use the fact that Xi,dX_{i,d} is independent of UjU_{j} for all jij\leq i.

VT1/2(l=0T1U~l+d,dU~l,ld,d,l)2\displaystyle\Big{|}\Big{|}V^{-1/2}_{T}\Big{(}\sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}\tilde{U}^{-\prime}_{l,l}\mathcal{H}_{d,d,l}^{\prime}\Big{)}\Big{|}\Big{|}_{2} 2supv𝒩122Tl=0T1U~l+d,dU~l,ld,d,lv\displaystyle\leq 2\sup_{v\in{\mathcal{N}}_{\frac{1}{2}}}{||\sqrt{\frac{2}{T}}\sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}\tilde{U}^{-\prime}_{l,l}\mathcal{H}_{d,d,l}^{\prime}v||}
2supv𝒩12l=0T1Xl,dηl,d.\displaystyle\leq 2\sup_{v\in{\mathcal{N}}_{\frac{1}{2}}}{||\sum_{l=0}^{T-1}X_{l,d}\eta_{l,d}||}.

Define d,d,lv=[β1,β2,,βl]\mathcal{H}_{d,d,l}^{\top}v=[\beta_{1}^{\top},\beta_{2}^{\top},\ldots,\beta_{l}^{\top}]^{\top}. βi\beta_{i} are m×1m\times 1 vectors when LTI system is MIMO. Then ηl,d=k=0l1Ulkβk+1\eta_{l,d}=\sum_{k=0}^{l-1}U^{\top}_{l-k}\beta_{k+1}. Let αl=Xl,d\alpha_{l}={X_{l,d}}. Then consider the matrix

T×mT=[β100β2β10βTβT1β1].\displaystyle\mathcal{B}_{T\times mT}=\begin{bmatrix}\beta_{1}^{\top}&0&0&\ldots\\ \beta^{\top}_{2}&\beta^{\top}_{1}&0&\ldots\\ \vdots&\vdots&\ddots&\vdots\\ \beta_{T}^{\top}&\beta_{T-1}^{\top}&\ldots&\beta^{\top}_{1}\end{bmatrix}.

Observe that ||\mathcal{B}_{T\times mT}||_{2}=\sqrt{\sigma(\sum_{k=1}^{d}\mathcal{T}_{d+k,T}^{\top}\mathcal{T}_{d+k,T})}\leq\sqrt{d}||\mathcal{T}_{d,\infty}||_{2}<\infty, which follows from Lemma 8. Then

l=0T1Xl,dηl,d\displaystyle\sum_{l=0}^{T-1}X_{l,d}\eta_{l,d} =[α1,,αT][U1U2UT]\displaystyle=[\alpha_{1},\ldots,\alpha_{T}]\mathcal{B}\begin{bmatrix}U_{1}\\ U_{2}\\ \vdots\\ U_{T}\end{bmatrix}
=[k=1Tαkβk,k=2Tαkβk1,,αTβ1][U1U2UT]\displaystyle=[\sum_{k=1}^{T}\alpha_{k}\beta^{\top}_{k},\sum_{k=2}^{T}\alpha_{k}\beta^{\top}_{k-1},\ldots,\alpha_{T}\beta^{\top}_{1}]\begin{bmatrix}U_{1}\\ U_{2}\\ \vdots\\ U_{T}\end{bmatrix}
=j=1T(k=jTαkβkUj).\displaystyle=\sum_{j=1}^{T}\Big{(}\sum_{k=j}^{T}\alpha_{k}\beta^{\top}_{k}U_{j}\Big{)}.

Here αi=Xi,d\alpha_{i}=X_{i,d} and recall that Xi,dX_{i,d} is independent of UjU_{j} for all iji\geq j. Let γ=α\gamma^{\prime}=\alpha^{\prime}\mathcal{B}. Define 𝒢T+dk=σ~({Uk+1,Uk+2,,UT+d})\mathcal{G}_{T+d-k}=\tilde{\sigma}(\{U_{k+1},U_{k+2},\ldots,U_{T+d}\}) where σ~(A)\tilde{\sigma}(A) is the sigma algebra containing the set AA with 𝒢0=ϕ\mathcal{G}_{0}=\phi. Then 𝒢k1𝒢k\mathcal{G}_{k-1}\subset\mathcal{G}_{k}. Furthermore, since γj1,Uj\gamma_{j-1},U_{j} are 𝒢T+d+1j\mathcal{G}_{T+d+1-j} measurable and UjU_{j} is conditionally (on 𝒢T+dj\mathcal{G}_{T+d-j}) subGaussian, we can use Theorem 3 on γU=αU\gamma^{\prime}U=\alpha^{\prime}\mathcal{B}U (where γj=XT+dj,Uj=ηT+dj+1\gamma_{j}=X_{T+d-j},U_{j}=\eta_{T+d-j+1} in the notation of Theorem 3). Then with probability at least 1δ1-\delta we have

(αα+V)1/2γUL(log(1δ)+log(det(αα+V)det(V))).\Big{|}\Big{|}\Big{(}\alpha^{\prime}\mathcal{B}\mathcal{B}^{\prime}\alpha+V\Big{)}^{-1/2}\gamma^{\prime}U\Big{|}\Big{|}\leq L\sqrt{\Big{(}\log{\frac{1}{\delta}}+\log{\frac{\det(\alpha^{\prime}\mathcal{B}\mathcal{B}^{\prime}\alpha+V)}{\det(V)}}\Big{)}}. (41)

This holds for any fixed V\succ 0. With probability at least 1-\delta, we know from Theorem 10.1 that \alpha^{\prime}\alpha\preceq\frac{3I}{2}\implies\alpha^{\prime}\mathcal{B}\mathcal{B}^{\prime}\alpha\preceq\frac{3\sigma_{1}^{2}(\mathcal{B})I}{2}. By combining this event with the event in Eq. (41) and setting V=\frac{3\sigma_{1}^{2}(\mathcal{B})I}{2}, we get with probability at least 1-2\delta that

αU2=γU23σ1()L(log(1δ)+pdlog(3)+m).\displaystyle||\alpha^{\prime}\mathcal{B}U||_{2}=||\gamma^{\prime}U||_{2}\leq\sqrt{3}\sigma_{1}(\mathcal{B})L\sqrt{\Big{(}\log{\frac{1}{\delta}}+pd\log{3}+m\Big{)}}. (42)

Replacing δ5pdδ2\delta\rightarrow 5^{-pd}\frac{\delta}{2}, we get from Eq. (40)

VT1/2(l=0T1U~l+d,dU~l,ld,d,l)26log((5))Lσ1()log(1δ)+pd+m\Big{|}\Big{|}V^{-1/2}_{T}\Big{(}\sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}\tilde{U}^{-\prime}_{l,l}\mathcal{H}_{d,d,l}^{\prime}\Big{)}\Big{|}\Big{|}_{2}\leq\sqrt{6}\log{(5)}L\sigma_{1}(\mathcal{B})\sqrt{\log{\frac{1}{\delta}}+pd+m}

with probability at least 1δ1-\delta. Since L=1L=1 we get our desired result.

Then similar to Proposition 1, we analyze VT1/2(l=0T1U~l+d,dU~l+d+1,d+𝒯0,d)2\Big{|}\Big{|}V^{-1/2}_{T}\Big{(}\sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}\tilde{U}^{+\top}_{l+d+1,d}\mathcal{T}_{0,d}^{\top}\Big{)}\Big{|}\Big{|}_{2}

Proposition 2.

For 0<δ<10<\delta<1 and large enough TT, we have with probability at least 1δ1-\delta

VT1/2(l=0T1U~l+d,dU~l+d+1,d+𝒯0,d)24σlog(1δ)+pd+m\displaystyle\Big{|}\Big{|}V^{-1/2}_{T}\Big{(}\sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}\tilde{U}^{+\top}_{l+d+1,d}\mathcal{T}_{0,d}^{\top}\Big{)}\Big{|}\Big{|}_{2}\leq 4\sigma\sqrt{\log{\frac{1}{\delta}}+pd+m}

where

σsupv2=1[vCAdBvCAd1BvCAd2BvCB00000vCAdBvCAd1BvCB]2j=0dCAjB2βd.\sigma\leq\sup_{||v||_{2}=1}\Big{|}\Big{|}\begin{bmatrix}v^{\top}CA^{d}B&v^{\top}CA^{d-1}B&v^{\top}CA^{d-2}B&\ldots&v^{\top}CB&0\\ 0&\ddots&\ddots&\ddots&\ddots&0\\ 0&\ddots&\ddots&\ddots&\ddots&\ddots\\ 0&v^{\top}CA^{d}B&v^{\top}CA^{d-1}B&\ldots&\ldots&v^{\top}CB\end{bmatrix}\Big{|}\Big{|}_{2}\leq\sum_{j=0}^{d}||CA^{j}B||_{2}\leq\beta\sqrt{d}.
Proof 11.2.

Note that \Big{|}\Big{|}V^{-1/2}_{T}\Big{(}\sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}\tilde{U}^{+\top}_{l+d+1,d}\mathcal{T}_{0,d}^{\top}\Big{)}\Big{|}\Big{|}_{2}\leq\Big{|}\Big{|}\sqrt{\frac{2}{T}}\Big{(}\sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}\tilde{U}^{+\top}_{l+d+1,d}\mathcal{T}_{0,d}^{\top}\Big{)}\Big{|}\Big{|}_{2} with probability at least 1-\delta for large enough T. Here \mathcal{T}_{0,d}^{\top} is an md\times pd matrix. Then define X_{l}=\sqrt{\frac{2}{T}}\tilde{U}^{-}_{l+d,d} and the vector M_{l}\in\mathbb{R}^{pd} as M_{l}^{\top}=\tilde{U}^{+\top}_{l+d+1,d}\mathcal{T}_{0,d}^{\top}. Then

(l=0T1XlMl2t)\displaystyle\mathbb{P}(||\sum_{l=0}^{T-1}X_{l}M_{l}^{\top}||_{2}\geq t) 12net5pd(l=0T1XlMlv2t/2)\displaystyle\underbrace{\leq}_{\frac{1}{2}-\text{net}}5^{pd}\mathbb{P}(||\sum_{l=0}^{T-1}X_{l}M_{l}^{\top}v||_{2}\geq t/2)

where MlvM_{l}^{\top}v is a real value. Let β𝒯0,dv\beta\coloneqq\mathcal{T}_{0,d}^{\top}v, then Mlv=U~l+d+1,d+βM_{l}^{\top}v=\tilde{U}^{+\top}_{l+d+1,d}\beta. This allows us to write XlMlvX_{l}M_{l}^{\top}v in a form that will enable us to apply Theorem 3.

l=0T1XlMlv\displaystyle\sum_{l=0}^{T-1}X_{l}M_{l}^{\top}v =[X0,X1,,XT1]=X[β1β2βd00β10000β1βd]=[Ud+1Ud+2UT+2d]=N\displaystyle=\underbrace{[X_{0},X_{1},\ldots,X_{T-1}]}_{=X}\underbrace{\begin{bmatrix}\beta_{1}^{\top}&\beta_{2}^{\top}&\ldots&\beta_{d}^{\top}&\ldots&0\\ 0&\beta_{1}^{\top}&\ddots&\ddots&\ddots&0\\ 0&\ddots&\ddots&\ddots&\ddots&\ddots\\ 0&\ldots&0&\beta_{1}^{\top}&\ldots&\beta_{d}^{\top}\end{bmatrix}}_{=\mathcal{I}}\underbrace{\begin{bmatrix}U_{d+1}\\ U_{d+2}\\ \vdots\\ U_{T+2d}\end{bmatrix}}_{=N} (43)

Here \mathcal{I}\in\mathbb{R}^{T\times(mT+md)}. It is known from Theorem 10.1 that XX^{\top}\preceq\frac{3I}{2} with high probability and consequently X\mathcal{I}\mathcal{I}^{\top}X^{\top}\preceq\frac{3\sigma^{2}_{1}(\mathcal{I})I}{2}. Define \bm{\mathcal{F}}_{l}=\tilde{\sigma}(\{U_{j}\}_{j=1}^{d+l}) as the sigma field generated by \{U_{j}\}_{j=1}^{d+l}. Furthermore, N_{l} is \bm{\mathcal{F}}_{l} measurable and [X\mathcal{I}]_{l} is \bm{\mathcal{F}}_{l-1} measurable, so we can apply Theorem 3. The rest of the proof is similar to that of Proposition 1. Following the same steps as before we get with probability at least 1-\delta

l=0T1XlMlv2=l=0T1[X]lNl23σ1()Llog(1δ)+pdlog(3)+m\displaystyle||\sum_{l=0}^{T-1}X_{l}M_{l}^{\top}v||_{2}=||\sum_{l=0}^{T-1}[X\mathcal{I}]_{l}N_{l}||_{2}\leq\sqrt{3}\sigma_{1}(\mathcal{I})L\sqrt{\log{\frac{1}{\delta}}+pd\log{3}+m}

and substituting δ5pdδ\delta\rightarrow 5^{-pd}\delta we get

l=0T1XlMl26log((5))σ1()Llog(1δ)+pd+m||\sum_{l=0}^{T-1}X_{l}M_{l}^{\top}||_{2}\leq\sqrt{6}\log{(5)}\sigma_{1}(\mathcal{I})L\sqrt{\log{\frac{1}{\delta}}+pd+m}

and

l=0T1XlMl24σ1()Llog(1δ)+pd+m.||\sum_{l=0}^{T-1}X_{l}M_{l}||_{2}\leq 4\sigma_{1}(\mathcal{I})L\sqrt{\log{\frac{1}{\delta}}+pd+m}. (44)

The proof for the noise and covariate cross terms is almost identical to that of Proposition 2, but easier because of independence. Finally, note that \sigma_{1}(\mathcal{I})\leq\sqrt{\sum_{i=1}^{d}\|\beta_{i}\|^{2}_{2}}\sqrt{d}=\sqrt{\|\mathcal{T}_{0,d}^{\top}v\|_{2}^{2}}\sqrt{d}\leq\beta\sqrt{d}.
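As a quick numerical sanity check (our own construction, not part of the proof), the bound \sigma_{1}(\mathcal{I})\leq\sqrt{\sum_{i=1}^{d}\|\beta_{i}\|_{2}^{2}}\sqrt{d} for the banded block-Toeplitz matrix \mathcal{I} of Eq. (43) can be verified directly; the helper names below are ours and purely illustrative.

import numpy as np

def banded_block_toeplitz(beta_blocks, T):
    # Row l (0-indexed) carries beta_1^T, ..., beta_d^T starting at block column l,
    # matching the structure of the matrix I in Eq. (43).
    d = len(beta_blocks)
    m = beta_blocks[0].size
    mat = np.zeros((T, m * (T + d)))
    for l in range(T):
        for j, b in enumerate(beta_blocks):
            mat[l, m * (l + j): m * (l + j + 1)] = b
    return mat

rng = np.random.default_rng(0)
d, m, T = 4, 3, 50
beta = [rng.standard_normal(m) for _ in range(d)]
I_mat = banded_block_toeplitz(beta, T)

lhs = np.linalg.norm(I_mat, 2)                                    # sigma_1(I)
rhs = np.sqrt(sum(np.linalg.norm(b) ** 2 for b in beta)) * np.sqrt(d)
print(lhs <= rhs + 1e-10)                                         # True

The check only illustrates the Cauchy-Schwarz step; the constant \beta in the paper comes from the norm of \mathcal{T}_{0,d}^{\top}v.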

Proposition 3.

For 0<δ<10<\delta<1, we have with probability at least 1δ1-\delta

\displaystyle\Big{|}\Big{|}V^{-1/2}_{T}\Big{(}\sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}\tilde{\eta}^{+\prime}_{l+1+d,d}\mathcal{T}\mathcal{O}^{\prime}_{0,d}\Big{)}\Big{|}\Big{|}_{2} \leq 4\sigma_{A}\sqrt{\log{\frac{1}{\delta}}+pd+m}
\displaystyle\Big{|}\Big{|}V^{-1/2}_{T}\Big{(}\sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}\tilde{\eta}^{-\prime}_{l,l}\mathcal{O}^{\prime}_{d,d,l}\Big{)}\Big{|}\Big{|}_{2} \leq 4\sigma_{B}\sqrt{\log{\frac{1}{\delta}}+pd+m}
\displaystyle\Big{|}\Big{|}V^{-1/2}_{T}\Big{(}\sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}\tilde{\eta}^{-\prime}_{l+d,d}\mathcal{O}^{\prime}_{0,d,d}\Big{)}\Big{|}\Big{|}_{2} \leq 4\sigma_{C}\sqrt{\log{\frac{1}{\delta}}+pd+m}
\displaystyle\Big{|}\Big{|}V^{-1/2}_{T}\Big{(}\sum_{l=0}^{T-1}\tilde{U}^{-}_{l+d,d}\tilde{w}^{+\prime}_{l+1+d,d}\Big{)}\Big{|}\Big{|}_{2} \leq 4\sigma_{D}\sqrt{\log{\frac{1}{\delta}}+pd+m}

Here σ=max(σA,σB,σC,σD)\sigma=\max{(\sigma_{A},\sigma_{B},\sigma_{C},\sigma_{D})} where

σAσCsupv2=1[vCAdvCAd1vCAd200000vCAdvC]2j=0dCAj2βRd\sigma_{A}\vee\sigma_{C}\leq\sup_{||v||_{2}=1}\Big{|}\Big{|}\begin{bmatrix}v^{\top}CA^{d}&v^{\top}CA^{d-1}&v^{\top}CA^{d-2}&\ldots&0\\ 0&\ddots&\ddots&\ddots&0\\ 0&\ddots&\ddots&\ddots&\ddots\\ 0&\ldots&v^{\top}CA^{d}&\ldots&v^{\top}C\end{bmatrix}\Big{|}\Big{|}_{2}\leq\sum_{j=0}^{d}||CA^{j}||_{2}\leq\beta R\sqrt{d}

σB=σ(k=1d𝒯𝒪d+k,T𝒯𝒪d+k,T)βRd,σDc\sigma_{B}=\sqrt{\sigma(\sum_{k=1}^{d}\mathcal{T}\mathcal{O}_{d+k,T}^{\top}\mathcal{T}\mathcal{O}_{d+k,T})}\leq\beta R\sqrt{d},\sigma_{D}\leq c.

By taking the intersection of all the aforementioned events for a fixed δ\delta we then have with probability at least 1δ1-\delta

^0,d,d0,d,d2\displaystyle\Big{|}\Big{|}\hat{\mathcal{H}}_{0,d,d}-\mathcal{H}_{0,d,d}\Big{|}\Big{|}_{2} 16σ1Tm+pd+log(dδ)\displaystyle\leq 16\sigma\sqrt{\frac{1}{T}}\sqrt{m+pd+\log{\frac{d}{\delta}}}

12 Subspace Perturbation Results

In this section we present variants of the famous Wedin’s theorem (Section 3 of Wedin (1972)) that depend on the distribution of Hankel singular values. These will be “sign free” generalizations of the gap-free Wedin theorem from Allen-Zhu and Li (2016). First we define the Hermitian dilation of a matrix.

(S)=[0SS0]\displaystyle\mathcal{H}(S)=\begin{bmatrix}0&S\\ S^{\prime}&0\end{bmatrix}

The Hermitian dilation has the property that S1S2ϵ(S1)(S2)ϵ||S_{1}-S_{2}||\leq\epsilon\Longleftrightarrow||\mathcal{H}(S_{1})-\mathcal{H}(S_{2})||\leq\epsilon. Hermitian dilations will be useful in applying Wedin’s theorem for general (not symmetric) matrices.
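For concreteness, here is a small numerical sketch (our own, purely illustrative) of the dilation and the two facts used below: the norm identity and the fact that the eigenvalues of \mathcal{H}(S) are plus/minus the singular values of S.

import numpy as np

def hermitian_dilation(S):
    # H(S) = [[0, S], [S', 0]]; works for rectangular S as well.
    m, n = S.shape
    return np.block([[np.zeros((m, m)), S],
                     [S.T, np.zeros((n, n))]])

rng = np.random.default_rng(0)
S1 = rng.standard_normal((4, 3))
S2 = S1 + 1e-3 * rng.standard_normal((4, 3))

# Spectral norms of the difference agree before and after dilation.
print(np.isclose(np.linalg.norm(S1 - S2, 2),
                 np.linalg.norm(hermitian_dilation(S1) - hermitian_dilation(S2), 2)))

# Eigenvalues of H(S1) are +/- the singular values of S1 (plus zeros).
sv = np.linalg.svd(S1, compute_uv=False)
ev = np.linalg.eigvalsh(hermitian_dilation(S1))
print(np.allclose(np.sort(np.concatenate([sv, -sv, np.zeros(1)])), np.sort(ev)))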

Proposition 1.

Let S,S^S,\hat{S} be symmetric matrices and SS^ϵ||S-\hat{S}||\leq\epsilon. Further, let vj,v^jv_{j},\hat{v}_{j} correspond to the jthj^{th} eigenvector of S,S^S,\hat{S} respectively such that λ1λ2λn\lambda_{1}\geq\lambda_{2}\geq\ldots\geq\lambda_{n} and λ^1λ^2λ^n\hat{\lambda}_{1}\geq\hat{\lambda}_{2}\geq\ldots\geq\hat{\lambda}_{n}. Then we have

|vj,v^k|ϵ|λjλ^k||\langle v_{j},\hat{v}_{k}\rangle|\leq\frac{\epsilon}{{|\lambda_{j}-\hat{\lambda}_{k}|}} (45)

if either λj\lambda_{j} or λ^k\hat{\lambda}_{k} is not zero.

Proof 12.1.

Let S=λjvjvj+VΛjVS=\lambda_{j}v_{j}v_{j}^{\prime}+V\Lambda_{-j}V^{\prime} and S^=λ^kv^kv^k+V^Λ^kV^\hat{S}=\hat{\lambda}_{k}\hat{v}_{k}\hat{v}_{k}^{\prime}+\hat{V}\hat{\Lambda}_{-k}\hat{V}^{\prime}, wlog assume |λj||λ^k||\lambda_{j}|\leq|\hat{\lambda}_{k}|. Define R=SS^R=S-\hat{S}

S\displaystyle S =S^+R\displaystyle=\hat{S}+R
vjSv^k\displaystyle v_{j}^{\prime}S\hat{v}_{k} =vjS^v^k+vjRv^k\displaystyle=v_{j}^{\prime}\hat{S}\hat{v}_{k}+v_{j}^{\prime}R\hat{v}_{k}

Since v_{j},\hat{v}_{k} are eigenvectors of S and \hat{S} respectively, we have

λjvjv^k\displaystyle\lambda_{j}v_{j}^{\prime}\hat{v}_{k} =λ^kvjv^k+vjRv^k\displaystyle=\hat{\lambda}_{k}v_{j}^{\prime}\hat{v}_{k}+v_{j}^{\prime}R\hat{v}_{k}
|λjλ^k||vjv^k|\displaystyle|\lambda_{j}-\hat{\lambda}_{k}||v_{j}^{\prime}\hat{v}_{k}| ϵ\displaystyle\leq\epsilon

Proposition 1 gives an eigenvector-wise version of Wedin’s theorem. Next, we show how to extend these results to arbitrary subsets of eigenvectors.

Proposition 2.

For ϵ>0\epsilon>0, let S,PS,P be two symmetric matrices such that SP2ϵ||S-P||_{2}\leq\epsilon. Let

S=UΣSU,P=VΣPVS=U\Sigma^{S}U^{\top},P=V\Sigma^{P}V^{\top}

Let V_{+} correspond to the eigenvectors whose singular values are \geq\beta, V_{-} to those whose singular values are \leq\alpha, and \bar{V} to the remaining ones. Define a similar partition for S, and let \alpha<\beta.

UV+\displaystyle||U_{-}^{\top}V_{+}|| ϵβα\displaystyle\leq\frac{\epsilon}{\beta-\alpha}
Proof 12.2.

The proof is similar to before. S,PS,P have a spectral decomposition of the form

S\displaystyle S =U+Σ+SU++UΣSU+U¯ΣS0U¯\displaystyle=U_{+}\Sigma^{S}_{+}U_{+}^{\prime}+U_{-}\Sigma^{S}_{-}U_{-}^{\prime}+\bar{U}{\Sigma^{S}}_{0}\bar{U}^{\prime}
P\displaystyle P =V+Σ+PV++VΣPV+V¯ΣP0V¯\displaystyle=V_{+}\Sigma^{P}_{+}V_{+}^{\prime}+V_{-}\Sigma^{P}_{-}V_{-}^{\prime}+\bar{V}{\Sigma^{P}}_{0}\bar{V}^{\prime}

Let R=SPR=S-P and since U+U_{+} is orthogonal to U,U¯U_{-},\bar{U} and similarly for VV

US\displaystyle U_{-}^{\prime}S =ΣSU=UP+UR\displaystyle=\Sigma^{S}_{-}U_{-}^{\prime}=U_{-}^{\prime}P+U_{-}^{\prime}R
ΣSUV+\displaystyle\Sigma^{S}_{-}U_{-}^{\prime}V_{+} =UV+Σ+P+URV+\displaystyle=U_{-}^{\prime}V_{+}\Sigma^{P}_{+}+U_{-}^{\prime}RV_{+}

Multiplying both sides on the right by (\Sigma^{P}_{+})^{-1},

ΣSUV+(Σ+P)1\displaystyle\Sigma^{S}_{-}U_{-}^{\prime}V_{+}(\Sigma^{P}_{+})^{-1} =UV++URV+(Σ+P)1\displaystyle=U_{-}^{\prime}V_{+}+U_{-}^{\prime}RV_{+}(\Sigma^{P}_{+})^{-1}
ΣSUV+(Σ+P)1\displaystyle||\Sigma^{S}_{-}U_{-}^{\prime}V_{+}(\Sigma^{P}_{+})^{-1}|| UV+URV+(Σ+P)1\displaystyle\geq||U_{-}^{\prime}V_{+}||-||U_{-}^{\prime}RV_{+}(\Sigma^{P}_{+})^{-1}||
αβUV+\displaystyle\frac{\alpha}{\beta}||U_{-}^{\prime}V_{+}|| UV+ϵβ\displaystyle\geq||U_{-}^{\prime}V_{+}||-\frac{\epsilon}{\beta}
UV+\displaystyle||U_{-}^{\prime}V_{+}|| ϵβα\displaystyle\leq\frac{\epsilon}{\beta-\alpha}

Let Sk,PkS_{k},P_{k} be the best rank kk approximations of S,PS,P respectively. We develop a sequence of results to see how SkPk||S_{k}-P_{k}|| varies when SPϵ||S-P||\leq\epsilon as a function of kk.

Proposition 3.

Let S,PS,P be such that

SPϵ||S-P||\leq\epsilon

Furthermore, let ϵ\epsilon be such that

ϵinf{1ir1}{s+1in}(σi(P)σi+1(P)2)\epsilon\leq\inf_{\{1\leq i\leq r-1\}\cup\{s+1\leq i\leq n\}}\Big{(}\frac{\sigma_{i}(P)-\sigma_{i+1}(P)}{2}\Big{)} (46)

and UjS,VjSU_{j}^{S},V^{S}_{j} be the left and right singular vectors of SS corresponding to σj(S)\sigma_{j}(S). There exists a unitary transformation QQ such that

σmax([UrP,,UsP]Q[UrS,,UsS])\displaystyle\sigma_{\max}([U_{r}^{P},\ldots,U_{s}^{P}]Q-[U_{r}^{S},\ldots,U_{s}^{S}]) 2ϵmin(σr1(P)σr(S),σs(S)σs+1(P))\displaystyle\leq\frac{2\epsilon}{\min{\Big{(}\sigma_{r-1}(P)-\sigma_{r}(S),\sigma_{s}(S)-\sigma_{s+1}(P)\Big{)}}}
σmax([VrP,,VsP]Q[VrS,,VsS])\displaystyle\sigma_{\max}([V_{r}^{P},\ldots,V_{s}^{P}]Q-[V_{r}^{S},\ldots,V_{s}^{S}]) 2ϵmin(σr1(P)σr(S),σs(S)σs+1(P)).\displaystyle\leq\frac{2\epsilon}{\min{\Big{(}\sigma_{r-1}(P)-\sigma_{r}(S),\sigma_{s}(S)-\sigma_{s+1}(P)\Big{)}}}.
Proof 12.3.

Let r\leq k\leq s. First divide the indices [1,n] into 3 parts: K_{1}=[1,r-1], K_{2}=[r,s], K_{3}=[s+1,n]. Although we focus on only three groups, the extension to the general case is a straightforward extension of this proof. Define the Hermitian dilation of S,P as \mathcal{H}(S),\mathcal{H}(P) respectively. Then we know that the eigenvalues of \mathcal{H}(S) are

i=1n{σi(S),σi(S)}\cup_{i=1}^{n}\{\sigma_{i}(S),-\sigma_{i}(S)\}

Further the eigenvectors corresponding to these are

i=1n{12[uiSviS],12[uiSviS]}\cup_{i=1}^{n}\Bigg{\{}\frac{1}{\sqrt{2}}\begin{bmatrix}u^{S}_{i}\\ v^{S}_{i}\end{bmatrix},\frac{1}{\sqrt{2}}\begin{bmatrix}u^{S}_{i}\\ -v^{S}_{i}\end{bmatrix}\Bigg{\}}

Similarly define the respective quantities for (P)\mathcal{H}(P). Now clearly, (S)(P)ϵ||\mathcal{H}(S)-\mathcal{H}(P)||\leq\epsilon since SPϵ||S-P||\leq\epsilon. Then by Weyl’s inequality we have that

|σi(S)σi(P)|ϵ|\sigma_{i}(S)-\sigma_{i}(P)|\leq\epsilon

Now we can use Proposition 1. To ease notation, define \lambda_{i}(\mathcal{H}(S))=\sigma_{i}(S) and \lambda_{-i}(\mathcal{H}(S))=-\sigma_{i}(S), and let the corresponding eigenvectors be a_{i},a_{-i} for S and b_{i},b_{-i} for P respectively. Note that we can assume \langle a_{i},b_{i}\rangle\geq 0 for every i. This does not change any of our results because a_{i},b_{i} are just stackings of left and right singular vectors, and u_{i}v_{i}^{\top} is identical for (u_{i},v_{i}) and (-u_{i},-v_{i}).

Then using Proposition 1 we get for every (i,j)K2×K2(i,j)\not\in K_{2}\times K_{2} and iji\neq j

|ai,bj|ϵ|σi(S)σj(P)||\langle a_{i},b_{j}\rangle|\leq\frac{\epsilon}{|\sigma_{i}(S)-\sigma_{j}(P)|} (47)

similarly

|ai,bj|ϵ|σi(S)+σj(P)||\langle a_{-i},b_{j}\rangle|\leq\frac{\epsilon}{|\sigma_{i}(S)+\sigma_{j}(P)|} (48)

Since

a_{i}=\frac{1}{\sqrt{2}}\begin{bmatrix}u^{S}_{i}\\ v^{S}_{i}\end{bmatrix},\quad a_{-i}=\frac{1}{\sqrt{2}}\begin{bmatrix}u^{S}_{i}\\ -v^{S}_{i}\end{bmatrix},\quad b_{j}=\frac{1}{\sqrt{2}}\begin{bmatrix}u^{P}_{j}\\ v^{P}_{j}\end{bmatrix}

and σi(S),σi(P)0\sigma_{i}(S),\sigma_{i}(P)\geq 0 we have by adding Eq. (47),(48) that

max(|uiS,ujP|,|viS,vjP|)ϵ|σi(S)σj(P)|\max{\Big{(}|\langle u^{S}_{i},u^{P}_{j}\rangle|,|\langle v^{S}_{i},v^{P}_{j}\rangle|\Big{)}}\leq\frac{\epsilon}{|\sigma_{i}(S)-\sigma_{j}(P)|}

Define U^{S}_{K_{i}} to be the matrix formed by the orthonormal vectors \{a_{j}\}_{j\in K_{i}} and U^{S}_{K_{-i}} to be the matrix formed by the orthonormal vectors \{a_{j}\}_{j\in-K_{i}}. Define similar quantities for P. Then

(UK2S)UK2P(UK2P)UK2S=(UK2S)(Ij2UKjP(UKjP))UK2S\displaystyle(U^{S}_{K_{2}})^{\top}U^{P}_{K_{2}}(U^{P}_{K_{2}})^{\top}U^{S}_{K_{2}}=(U^{S}_{K_{2}})^{\top}(I-\sum_{j\neq 2}U^{P}_{K_{j}}(U^{P}_{K_{j}})^{\top})U^{S}_{K_{2}}
=(UK2S)(I|j|2UKjP(UKjP)UK2P(UK2P))UK2S\displaystyle=(U^{S}_{K_{2}})^{\top}(I-\sum_{|j|\neq 2}U^{P}_{K_{j}}(U^{P}_{K_{j}})^{\top}-U^{P}_{K_{-2}}(U^{P}_{K_{-2}})^{\top})U^{S}_{K_{2}}
=I(UK2S)|j|2UKjP(UKjP)UK2S(UK2S)UK2P(UK2P)UK2S\displaystyle=I-(U^{S}_{K_{2}})^{\top}\sum_{|j|\neq 2}U^{P}_{K_{j}}(U^{P}_{K_{j}})^{\top}U^{S}_{K_{2}}-(U^{S}_{K_{2}})^{\top}U^{P}_{K_{-2}}(U^{P}_{K_{-2}})^{\top}U^{S}_{K_{2}} (49)

Now K_{1},K_{-1} correspond to eigenvectors whose singular values are \geq\sigma_{r-1}(P), and K_{3},K_{-3} correspond to eigenvectors whose singular values are \leq\sigma_{s+1}(P). We are in a position to use Proposition 2. Using it on Eq. (49) we get the following relation

(UK2P)UK2S(UK2S)UK2P\displaystyle(U^{P}_{K_{2}})^{\top}U^{S}_{K_{2}}(U^{S}_{K_{2}})^{\top}U^{P}_{K_{2}} I(1ϵ2(σr1(P)σs(S))2ϵ2(σs(S)σs+1(P))2)\displaystyle\succeq I\Bigg{(}1-\frac{\epsilon^{2}}{(\sigma_{r-1}(P)-\sigma_{s}(S))^{2}}-\frac{\epsilon^{2}}{(\sigma_{s}(S)-\sigma_{s+1}(P))^{2}}\Bigg{)}
(UK2S)UK2P(UK2P)UK2S\displaystyle-(U^{S}_{K_{2}})^{\top}U^{P}_{K_{-2}}(U^{P}_{K_{-2}})^{\top}U^{S}_{K_{2}} (50)

In Eq. (50) we need to upper bound (U^{S}_{K_{2}})^{\top}U^{P}_{K_{-2}}(U^{P}_{K_{-2}})^{\top}U^{S}_{K_{2}}. To this end we will exploit the fact that all singular values corresponding to U^{S}_{K_{2}} are the same. Recall that ||\mathcal{H}(S)-\mathcal{H}(P)||\leq\epsilon, and write the spectral decompositions

(S)\displaystyle\mathcal{H}(S) =UK2SΣK2S(UK2S)+UK2SΣK2S(UK2S)+UK0SΣK0S(UK0S)\displaystyle=U^{S}_{K_{2}}\Sigma^{S}_{K_{2}}(U^{S}_{K_{2}})^{\top}+U^{S}_{K_{-2}}\Sigma^{S}_{K_{-2}}(U^{S}_{K_{-2}})^{\top}+U^{S}_{K_{0}}\Sigma^{S}_{K_{0}}(U^{S}_{K_{0}})^{\top}
(P)\displaystyle\mathcal{H}(P) =UK2PΣK2P(UK2P)+UK2PΣK2P(UK2P)+UK0PΣK0P(UK0P)\displaystyle=U^{P}_{K_{2}}\Sigma^{P}_{K_{2}}(U^{P}_{K_{2}})^{\top}+U^{P}_{K_{-2}}\Sigma^{P}_{K_{-2}}(U^{P}_{K_{-2}})^{\top}+U^{P}_{K_{0}}\Sigma^{P}_{K_{0}}(U^{P}_{K_{0}})^{\top}

Then by pre–multiplying and post–multiplying we get

(UK2S)(S)UK2P\displaystyle(U^{S}_{K_{2}})^{\top}\mathcal{H}(S)U^{P}_{K_{-2}} =ΣK2S(UK2S)UK2P\displaystyle=\Sigma^{S}_{K_{2}}(U^{S}_{K_{2}})^{\top}U^{P}_{K_{-2}}
(UK2S)(P)UK2P\displaystyle(U^{S}_{K_{2}})^{\top}\mathcal{H}(P)U^{P}_{K_{-2}} =(UK2S)UK2PΣK2P\displaystyle=(U^{S}_{K_{2}})^{\top}U^{P}_{K_{-2}}\Sigma^{P}_{K_{-2}}

Let (S)(P)=R\mathcal{H}(S)-\mathcal{H}(P)=R then

(UK2S)((S)(P))UK2P\displaystyle(U^{S}_{K_{2}})^{\top}(\mathcal{H}(S)-\mathcal{H}(P))U^{P}_{K_{-2}} =(UK2S)RUK2P\displaystyle=(U^{S}_{K_{2}})^{\top}RU^{P}_{K_{-2}}
ΣK2S(UK2S)UK2P(UK2S)UK2PΣK2P\displaystyle\Sigma^{S}_{K_{2}}(U^{S}_{K_{2}})^{\top}U^{P}_{K_{-2}}-(U^{S}_{K_{2}})^{\top}U^{P}_{K_{-2}}\Sigma^{P}_{K_{-2}} =(UK2S)RUK2P\displaystyle=(U^{S}_{K_{2}})^{\top}RU^{P}_{K_{-2}}

Since \Sigma^{S}_{K_{2}}=\sigma_{s}(S)I, we get

(UK2S)UK2P(σs(S)IΣK2P)\displaystyle||(U^{S}_{K_{2}})^{\top}U^{P}_{K_{-2}}(\sigma_{s}(S)I-\Sigma^{P}_{K_{-2}})|| =(UK2S)RUK2P\displaystyle=||(U^{S}_{K_{2}})^{\top}RU^{P}_{K_{-2}}||
(UK2S)UK2P\displaystyle||(U^{S}_{K_{2}})^{\top}U^{P}_{K_{-2}}|| ϵσs(S)+σs(P)\displaystyle\leq\frac{\epsilon}{\sigma_{s}(S)+\sigma_{s}(P)}

Similarly

(UK2P)UK2Sϵσs(P)+σs(S)||(U^{P}_{K_{2}})^{\top}U^{S}_{K_{-2}}||\leq\frac{\epsilon}{\sigma_{s}(P)+\sigma_{s}(S)}

Since σs(P)+σs(S)σs(S)σs+1(P)\sigma_{s}(P)+\sigma_{s}(S)\geq\sigma_{s}(S)-\sigma_{s+1}(P) combining this with Eq. (50) we get

σmin((UK2S)UK2P)13ϵ2min(σr1(P)σs(S),σs(S)σs+1(P))2\sigma_{\min}((U^{S}_{K_{2}})^{\top}U^{P}_{K_{2}})\geq 1-\frac{3\epsilon^{2}}{\min{\Big{(}\sigma_{r-1}(P)-\sigma_{s}(S),\sigma_{s}(S)-\sigma_{s+1}(P)\Big{)}^{2}}} (51)

Since

ϵinfi(σi(P)σi+1(P)2),\epsilon\leq\inf_{i}\Big{(}\frac{\sigma_{i}(P)-\sigma_{i+1}(P)}{2}\Big{)},

for Eq. (51), we use the inequality \sqrt{1-x^{2}}\geq 1-x^{2} whenever x<1, which holds when Eq. (46) is true. This means that there exists a unitary transformation Q such that

UK2SUK2PQ2ϵmin(σr1(P)σs(S),σs(S)σs+1(P))||U_{K_{2}}^{S}-U_{K_{2}}^{P}Q||\leq\frac{2\epsilon}{\min{\Big{(}\sigma_{r-1}(P)-\sigma_{s}(S),\sigma_{s}(S)-\sigma_{s+1}(P)\Big{)}}}
Remark 12.4.

Note that S,PS,P will be Hermitian dilations of 0,,,^0,d^,d^\mathcal{H}_{0,\infty,\infty},\hat{\mathcal{H}}_{0,\hat{d},\hat{d}} respectively in our case. Since the singular vectors of SS (and PP) are simply stacked version of singular vectors of 0,,\mathcal{H}_{0,\infty,\infty} (and ^0,d^,d^\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}), our results hold directly for the singular vectors of 0,,\mathcal{H}_{0,\infty,\infty} (and ^0,d^,d^\hat{\mathcal{H}}_{0,\hat{d},\hat{d}})

Recall the partition of the indices [1,n] into three parts K_{1}=[1,r-1], K_{2}=[r,s], K_{3}=[s+1,n]; it will be used in the next proposition.

Proposition 4 (System Reduction).

Let SPϵ||S-P||\leq\epsilon and the singular values of SS be arranged as follows:

σ1(S)>>σr1(S)>σr(S)σr+1(S)σs(S)>σs+1(S)>σn(S)>σn+1(S)=0\sigma_{1}(S)>\ldots>\sigma_{r-1}(S)>\sigma_{r}(S)\geq\sigma_{r+1}(S)\geq\ldots\geq\sigma_{s}(S)>\sigma_{s+1}(S)>\ldots\sigma_{n}(S)>\sigma_{n+1}(S)=0

Furthermore, let ϵ\epsilon be such that

ϵinf{1ir1}{s+1in}(σi(P)σi+1(P)2).\epsilon\leq\inf_{\{1\leq i\leq r-1\}\cup\{s+1\leq i\leq n\}}\Big{(}\frac{\sigma_{i}(P)-\sigma_{i+1}(P)}{2}\Big{)}. (52)

Define K0=K1K2K_{0}=K_{1}\cup K_{2}, then

UK0S(ΣK0S)1/2UK0P(ΣK0P)1/22\displaystyle||U^{S}_{K_{0}}(\Sigma^{S}_{K_{0}})^{1/2}-U^{P}_{K_{0}}(\Sigma^{P}_{K_{0}})^{1/2}||_{2} 2ϵi=1r1σi/ζi2+σr/ζr2+sup1is|σiσ^i|\displaystyle\leq 2\epsilon\sqrt{\sum_{i=1}^{r-1}\sigma_{i}/\zeta_{i}^{2}+\sigma_{r}/\zeta_{r}^{2}}+\sup_{1\leq i\leq s}|\sqrt{\sigma_{i}}-\sqrt{\hat{\sigma}_{i}}|

and \sigma_{i}=\sigma_{i}(S),\hat{\sigma}_{i}=\sigma_{i}(P). Here \zeta_{i}=\min{(\sigma_{i-1}-\sigma_{i},\sigma_{i}-\sigma_{i+1})} and \zeta_{r}=\min{(\sigma_{r-1}-\sigma_{r},\sigma_{s}-\sigma_{s+1})}.

Proof 12.5.

Since U_{K_{0}}^{S}=[U_{K_{1}}^{S}\ U_{K_{2}}^{S}] and likewise for P, we can separate the analysis for K_{1},K_{2} as follows

UK0S(ΣK0S)1/2UK0P(ΣK0P)1/2\displaystyle||U^{S}_{K_{0}}(\Sigma^{S}_{K_{0}})^{1/2}-U^{P}_{K_{0}}(\Sigma^{P}_{K_{0}})^{1/2}|| (UK0SUK0P)(ΣK0S)1/2+UK0P((ΣK0S)1/2(ΣK0P)1/2)\displaystyle\leq||(U^{S}_{K_{0}}-U^{P}_{K_{0}})(\Sigma^{S}_{K_{0}})^{1/2}||+||U^{P}_{K_{0}}((\Sigma^{S}_{K_{0}})^{1/2}-(\Sigma^{P}_{K_{0}})^{1/2})||
(UK1SUK1P)(ΣK1S)1/222+(UK2SUK2P)(ΣK2S)1/222\displaystyle\leq\sqrt{||(U^{S}_{K_{1}}-U^{P}_{K_{1}})(\Sigma^{S}_{K_{1}})^{1/2}||_{2}^{2}+||(U^{S}_{K_{2}}-U^{P}_{K_{2}})(\Sigma^{S}_{K_{2}})^{1/2}||_{2}^{2}}
+(ΣK0S)1/2(ΣK0P)1/2\displaystyle+||(\Sigma^{S}_{K_{0}})^{1/2}-(\Sigma^{P}_{K_{0}})^{1/2}||

Now ||(\Sigma^{S}_{K_{0}})^{1/2}-(\Sigma^{P}_{K_{0}})^{1/2}||=\sup_{l}|\sqrt{\sigma_{l}(S)}-\sqrt{\sigma_{l}(P)}|. Recall that \sigma_{r}(S)=\ldots=\sigma_{k}(S)=\ldots=\sigma_{s}(S), and by the conditions on \epsilon we are guaranteed that \frac{\epsilon}{\sigma_{i}-\sigma_{j}}<1/2 for all 1\leq i\neq j\leq r. We will combine our previous results (Propositions 1--3) to prove this claim. Specifically, from Proposition 3 we have

(UK2SUK2P)(ΣK2S)1/2\displaystyle||(U^{S}_{K_{2}}-U^{P}_{K_{2}})(\Sigma^{S}_{K_{2}})^{1/2}|| 2ϵσr(S)min(σr1(P)σr(S),σr(S)σs+1(P))\displaystyle\leq\frac{2\epsilon\sqrt{\sigma_{r}(S)}}{\min{\Big{(}\sigma_{r-1}(P)-\sigma_{r}(S),\sigma_{r}(S)-\sigma_{s+1}(P)\Big{)}}}

On the remaining term we will use Proposition 3 on each column

\displaystyle||(U^{S}_{K_{1}}-U^{P}_{K_{1}})(\Sigma^{S}_{K_{1}})^{1/2}|| \leq||[\sqrt{\sigma_{1}(S)}c_{1},\ldots,\sqrt{\sigma_{|K_{1}|}(S)}c_{|K_{1}|}]||\leq\sqrt{\sum_{j=1}^{r-1}\sigma_{j}||c_{j}||^{2}}
ϵj=1r12σj(S)min(σj1(P)σj(S),σj(S)σj+1(P))2\displaystyle\leq\epsilon\sqrt{\sum_{j=1}^{r-1}\frac{2\sigma_{j}(S)}{\min{\Big{(}\sigma_{j-1}(P)-\sigma_{j}(S),\sigma_{j}(S)-\sigma_{j+1}(P)\Big{)}^{2}}}}

In the context of our system identification, S=\mathcal{H}_{0,\infty,\infty} and P=\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}. P will be made compatible by padding it with zeros to make it doubly infinite. Then U_{K_{0}}^{S},U_{K_{0}}^{P} (after padding) have infinitely many rows. Define Z_{0}=U_{K_{0}}^{S}(\Sigma^{S}_{K_{0}})^{1/2}(1:,:), Z_{1}=U_{K_{0}}^{S}(\Sigma^{S}_{K_{0}})^{1/2}(p+1:,:) (both of infinite length), and similarly we will have \hat{Z}_{0},\hat{Z}_{1}. Note that from a computational perspective we do not need to compute Z_{0},Z_{1}; we only need to work with \hat{Z}_{0}=U_{K_{0}}^{P}(\Sigma^{P}_{K_{0}})^{1/2}(1:,:), \hat{Z}_{1}=U_{K_{0}}^{P}(\Sigma^{P}_{K_{0}})^{1/2}(p+1:,:), and since most of each is just zero padding we can simply compute with \hat{Z}_{0}(1:pd,:),\hat{Z}_{1}(1:pd,:).
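A minimal computational sketch of this step (our own illustrative code; the function and variable names are ours, and the finite truncation replaces the zero-padded infinite matrices of the text): given a pd\times md estimate of the Hankel matrix and a chosen order k, form \hat{Z}_{0},\hat{Z}_{1} from the truncated SVD and recover a realization by least squares, in the spirit of Proposition 5 below.

import numpy as np

def realization_from_hankel(H_hat, p, m, k):
    # H_hat: (p*d x m*d) estimated Hankel matrix, p outputs, m inputs, k = chosen order.
    U, s, Vt = np.linalg.svd(H_hat, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    Z0 = U_k * np.sqrt(s_k)                 # observability-like factor U Sigma^{1/2}
    W0 = np.sqrt(s_k)[:, None] * Vt_k       # controllability-like factor Sigma^{1/2} V^T
    # Z1 is Z0 shifted down by one block of p rows; drop the last block row of Z0 so
    # the shapes match (the infinite-dimensional version in the text pads with zeros).
    A_hat = np.linalg.lstsq(Z0[:-p, :], Z0[p:, :], rcond=None)[0]
    C_hat = Z0[:p, :]                        # first block row of Z0
    B_hat = W0[:, :m]                        # first block column of W0
    return A_hat, B_hat, C_hat

The least-squares step is exactly the regression (\hat{Z}_{0}^{\prime}\hat{Z}_{0})^{-1}\hat{Z}_{0}^{\prime}\hat{Z}_{1} analyzed next, restricted to the nonzero rows.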

Proposition 5.

Assume Z1=Z0AZ_{1}=Z_{0}A. Furthermore, SP2ϵ||S-P||_{2}\leq\epsilon and let ϵ\epsilon be such that

ϵinf{1ir1}{s+1in}(σi(P)σi+1(P)2)\epsilon\leq\inf_{\{1\leq i\leq r-1\}\cup\{s+1\leq i\leq n\}}\Big{(}\frac{\sigma_{i}(P)-\sigma_{i+1}(P)}{2}\Big{)} (53)

then

(Z0Z0)1Z0Z1(Z^0Z^0)1Z^0Z^1\displaystyle||(Z_{0}^{\prime}Z_{0})^{-1}Z_{0}^{\prime}Z_{1}-(\hat{Z}_{0}^{\prime}\hat{Z}_{0})^{-1}\hat{Z}_{0}^{\prime}\hat{Z}_{1}|| 𝒞ϵ(γ+1)σs(σs2((σsσs+1)(σr1σs))2\displaystyle\leq\frac{\mathcal{C}\epsilon(\gamma+1)}{\sigma_{s}}\Bigg{(}\sqrt{\frac{\sigma^{2}_{s}}{((\sigma_{s}-\sigma_{s+1})\wedge(\sigma_{r-1}-\sigma_{s}))^{2}}}
+i=1r1σiσs(σiσi+1)2(σi1σi)2)\displaystyle+\sqrt{\sum_{i=1}^{r-1}\frac{\sigma_{i}\sigma_{s}}{(\sigma_{i}-\sigma_{i+1})^{2}\wedge(\sigma_{i-1}-\sigma_{i})^{2}}}\Bigg{)}

where σ1(A)γ\sigma_{1}(A)\leq\gamma.

Proof 12.6.

Note that Z1=Z0AZ_{1}=Z_{0}A, then

(Z0Z0)1Z0Z1(Z^0Z^0)1Z^0Z^12\displaystyle||(Z_{0}^{\prime}Z_{0})^{-1}Z_{0}^{\prime}Z_{1}-(\hat{Z}_{0}^{\prime}\hat{Z}_{0})^{-1}\hat{Z}_{0}^{\prime}\hat{Z}_{1}||_{2}
=\displaystyle= A(Z^0Z^0)1Z^0Z^12=(Z^0Z^0)1Z^0Z^0A(Z^0Z^0)1Z^0Z^12\displaystyle||A-(\hat{Z}_{0}^{\prime}\hat{Z}_{0})^{-1}\hat{Z}_{0}^{\prime}\hat{Z}_{1}||_{2}=||(\hat{Z}_{0}^{\prime}\hat{Z}_{0})^{-1}\hat{Z}_{0}^{\prime}\hat{Z}_{0}A-(\hat{Z}_{0}^{\prime}\hat{Z}_{0})^{-1}\hat{Z}_{0}^{\prime}\hat{Z}_{1}||_{2}
=\displaystyle= (Z^0Z^0)1Z^0Z^0A(Z^0Z^0)1Z^0Z0A+(Z^0Z^0)1Z^0Z0A(Z^0Z^0)1Z^0Z^12\displaystyle||(\hat{Z}_{0}^{\prime}\hat{Z}_{0})^{-1}\hat{Z}_{0}^{\prime}\hat{Z}_{0}A-(\hat{Z}_{0}^{\prime}\hat{Z}_{0})^{-1}\hat{Z}_{0}^{\prime}Z_{0}A+(\hat{Z}_{0}^{\prime}\hat{Z}_{0})^{-1}\hat{Z}_{0}^{\prime}Z_{0}A-(\hat{Z}_{0}^{\prime}\hat{Z}_{0})^{-1}\hat{Z}_{0}^{\prime}\hat{Z}_{1}||_{2}
\displaystyle\leq (Z^0Z^0)1Z^0Z^0A(Z^0Z^0)1Z^0Z0A2+(Z^0Z^0)1Z^0Z0A(Z^0Z^0)1Z^0Z^12\displaystyle||(\hat{Z}_{0}^{\prime}\hat{Z}_{0})^{-1}\hat{Z}_{0}^{\prime}\hat{Z}_{0}A-(\hat{Z}_{0}^{\prime}\hat{Z}_{0})^{-1}\hat{Z}_{0}^{\prime}Z_{0}A||_{2}+||(\hat{Z}_{0}^{\prime}\hat{Z}_{0})^{-1}\hat{Z}_{0}^{\prime}Z_{0}A-(\hat{Z}_{0}^{\prime}\hat{Z}_{0})^{-1}\hat{Z}_{0}^{\prime}\hat{Z}_{1}||_{2}
\displaystyle\leq (Z^0Z^0)1Z^02(Z0AZ^0A2+Z0AShifted version of Z0Z^12)\displaystyle||(\hat{Z}_{0}^{\prime}\hat{Z}_{0})^{-1}\hat{Z}_{0}^{\prime}||_{2}\Big{(}||Z_{0}A-\hat{Z}_{0}A||_{2}+||\underbrace{Z_{0}A}_{\text{Shifted version of }Z_{0}}-\hat{Z}_{1}||_{2}\Big{)}

Now ||(\hat{Z}_{0}^{\prime}\hat{Z}_{0})^{-1}\hat{Z}_{0}^{\prime}||_{2}\leq(\sqrt{\sigma_{s}-\epsilon})^{-1}. Since Z_{1}=Z_{0}A is a (shifted) submatrix of Z_{0} and \hat{Z}_{1} is the corresponding submatrix of \hat{Z}_{0}, we have ||Z_{0}A-\hat{Z}_{1}||_{2}\leq||Z_{0}-\hat{Z}_{0}||_{2}, and also ||Z_{0}A-\hat{Z}_{0}A||_{2}\leq||A||_{2}||Z_{0}-\hat{Z}_{0}||_{2}. Combining these bounds (using Proposition 4 to bound ||Z_{0}-\hat{Z}_{0}||_{2}) gives

\displaystyle\leq cϵ(γ+1)σs(σs2((σsσs+1)(σr1σs))2+i=1r1σiσs(σiσi+1)2(σi1σi)2)\displaystyle\frac{c\epsilon(\gamma+1)}{{\sigma_{s}}}\Bigg{(}\sqrt{\frac{\sigma^{2}_{s}}{((\sigma_{s}-\sigma_{s+1})\wedge(\sigma_{r-1}-\sigma_{s}))^{2}}}+\sqrt{\sum_{i=1}^{r-1}\frac{\sigma_{i}\sigma_{s}}{(\sigma_{i}-\sigma_{i+1})^{2}\wedge(\sigma_{i-1}-\sigma_{i})^{2}}}\Bigg{)}

13 Hankel Matrix Estimation Results

In this section we provide the proof for Theorem 6. For any matrix PP, we define its doubly infinite extension P¯\bar{P} as

P¯=[P000]\bar{P}=\begin{bmatrix}P&0&\ldots\\ 0&0&\ldots\\ \vdots&\vdots&\vdots\end{bmatrix} (54)
Proposition 1.

Fix d>0d>0. Then we have

d,,20,,¯0,d,d22d,,22𝒯d,2||\mathcal{H}_{d,\infty,\infty}||_{2}\leq||\mathcal{H}_{0,\infty,\infty}-\bar{\mathcal{H}}_{0,d,d}||_{2}\leq\sqrt{2}||\mathcal{H}_{d,\infty,\infty}||_{2}\leq\sqrt{2}||\mathcal{T}_{d,\infty}||_{2}
Proof 13.1.

Define C~d,B~d\tilde{C}_{d},\tilde{B}_{d} as follows

C~d\displaystyle\tilde{C}_{d} =[0md×nCCA]\displaystyle=\begin{bmatrix}0_{md\times n}\\ C\\ CA\\ \vdots\\ \end{bmatrix}
B~d\displaystyle\tilde{B}_{d} =[0n×pdBAB]\displaystyle=\begin{bmatrix}0_{n\times pd}&B&AB&\ldots\end{bmatrix}

Now pad 0,d,d\mathcal{H}_{0,d,d} with zeros to make it a doubly infinite matrix and call it ¯0,d,d\bar{\mathcal{H}}_{0,d,d} and we get that

\displaystyle\bar{\mathcal{H}}_{0,d,d}-\mathcal{H}_{0,\infty,\infty} =\begin{bmatrix}0&M_{12}\\ M_{21}&M_{22}\end{bmatrix}

Note here that M21M_{21} and M0=[M12M22]M_{0}=\begin{bmatrix}M_{12}\\ M_{22}\end{bmatrix} are infinite matrices. Further d,,2=M02M212||\mathcal{H}_{d,\infty,\infty}||_{2}=||M_{0}||_{2}\geq||M_{21}||_{2}. Then

¯0,d,d0,,\displaystyle||\bar{\mathcal{H}}_{0,d,d}-\mathcal{H}_{0,\infty,\infty}|| M1222+M0222d,,2\displaystyle\leq\sqrt{||M_{12}||_{2}^{2}+||M_{0}||_{2}^{2}}\leq\sqrt{2}||\mathcal{H}_{d,\infty,\infty}||_{2}

Further ¯0,d,d0,,M0=d,,2||\bar{\mathcal{H}}_{0,d,d}-\mathcal{H}_{0,\infty,\infty}||\geq||M_{0}||=||\mathcal{H}_{d,\infty,\infty}||_{2}.

Proposition 2.

For any d1d2d_{1}\geq d_{2}, we have

0,,¯0,d1,d1220,,¯0,d2,d22||\mathcal{H}_{0,\infty,\infty}-\bar{\mathcal{H}}_{0,d_{1},d_{1}}||_{2}\leq\sqrt{2}||\mathcal{H}_{0,\infty,\infty}-\bar{\mathcal{H}}_{0,d_{2},d_{2}}||_{2}
Proof 13.2.

From Proposition 1 we have ||\mathcal{H}_{d_{1},\infty,\infty}||_{2}\leq||\mathcal{H}_{0,\infty,\infty}-\bar{\mathcal{H}}_{0,d_{1},d_{1}}||_{2}\leq\sqrt{2}||\mathcal{H}_{d_{1},\infty,\infty}||_{2}. It is also clear that ||\mathcal{H}_{d_{1},\infty,\infty}||_{2}\leq||\mathcal{H}_{d_{2},\infty,\infty}||_{2}. Then

120,,¯0,d1,d12d1,,2d2,,20,,¯0,d2,d22\displaystyle\frac{1}{\sqrt{2}}||\mathcal{H}_{0,\infty,\infty}-\bar{\mathcal{H}}_{0,d_{1},d_{1}}||_{2}\leq||\mathcal{H}_{d_{1},\infty,\infty}||_{2}\leq||\mathcal{H}_{d_{2},\infty,\infty}||_{2}\leq||\mathcal{H}_{0,\infty,\infty}-\bar{\mathcal{H}}_{0,d_{2},d_{2}}||_{2}
Proposition 3.

Fix d>0d>0. Then

𝒯d,(M)2M~ρ(A)d1ρ(A)||\mathcal{T}_{d,\infty}(M)||_{2}\leq\frac{\tilde{M}\rho(A)^{d}}{1-\rho(A)}
Proof 13.3.

Recall that

𝒯d,(M)=[0000CAdB000CAd+1BCAdB00]\displaystyle\mathcal{T}_{d,\infty}(M)=\begin{bmatrix}0&0&0&\ldots&0\\ CA^{d}B&0&0&\ldots&0\\ CA^{d+1}B&CA^{d}B&0&\ldots&0\\ \vdots&\ddots&\ddots&\vdots&\vdots\end{bmatrix}

Then 𝒯d,(M)2j=dCAjB2||\mathcal{T}_{d,\infty}(M)||_{2}\leq\sum_{j=d}^{\infty}||CA^{j}B||_{2}. Now from Eq. 4.1 and Lemma 4.1 in Tu et al. (2017) we get that CAjB2M~ρ(A)j||CA^{j}B||_{2}\leq\tilde{M}\rho(A)^{j}. Then

j=dCAjB2M~ρ(A)d1ρ(A)\sum_{j=d}^{\infty}||CA^{j}B||_{2}\leq\frac{\tilde{M}\rho(A)^{d}}{1-\rho(A)}
Remark 13.4.

Proposition 3 is just needed to show exponential decay and is not precise. Please refer to Tu et al. (2017) for explicit rates.
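For intuition, here is a small numerical illustration (our own toy system, not from the paper) of the geometric decay that Proposition 3 formalizes: the tail sum of the Markov parameters ||CA^{j}B||_{2} is dominated, up to the constant \tilde{M}, by the envelope \rho(A)^{d}/(1-\rho(A)).

import numpy as np

rng = np.random.default_rng(1)
A = np.array([[0.8, 0.2],
              [0.0, 0.5]])                  # stable example: rho(A) = 0.8
B = rng.standard_normal((2, 1))
C = rng.standard_normal((1, 2))

rho = max(abs(np.linalg.eigvals(A)))
for d in (2, 5, 10, 20):
    tail = sum(np.linalg.norm(C @ np.linalg.matrix_power(A, j) @ B, 2)
               for j in range(d, 200))      # finite proxy for the infinite tail sum
    print(d, tail, rho ** d / (1 - rho))    # tail decays like rho^d / (1 - rho), up to M~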

Next, we recall T^{(\kappa)}_{*}(\delta) and d_{*}(T,\delta), defined in Eq. (16) and given by

d(T,δ)\displaystyle d_{*}(T,\delta) =inf{d|16βRdm+pd+log(Tδ)T0,d,d0,,2}\displaystyle=\inf\Bigg{\{}d\Bigg{|}16\beta R\sqrt{d}\sqrt{\frac{m+pd+\log{\frac{T}{\delta}}}{T}}\geq||\mathcal{H}_{0,d,d}-\mathcal{H}_{0,\infty,\infty}||_{2}\Bigg{\}}
T(κ)(δ)\displaystyle T^{(\kappa)}_{*}(\delta) =inf{T|Tcm2log3(Tm/δ)d(T,δ),d(T,δ)κd(Tκ2,δ)8}\displaystyle=\inf\Big{\{}T\Big{|}\frac{T}{cm^{2}\log^{3}{(Tm/\delta)}}\geq d_{*}(T,\delta),\hskip 5.69054ptd_{*}(T,\delta)\leq\frac{\kappa d_{*}(\frac{T}{\kappa^{2}},\delta)}{8}\Big{\}} (55)

The existence of d(T,δ)d_{*}(T,\delta) is predicated on the finiteness of T(κ)(δ)T^{(\kappa)}_{*}(\delta) which we discuss below.

13.1 Existence of T(κ)(δ)<T^{(\kappa)}_{*}(\delta)<\infty

Construct two sets

T1(δ)\displaystyle T_{1}(\delta) =inf{T|d(T,δ)𝒟(T)}\displaystyle=\inf\Big{\{}T\Big{|}d_{*}(T,\delta)\in\mathcal{D}(T)\Big{\}} (56)
T2(δ)\displaystyle T_{2}(\delta) =inf{T|d(t,δ)κd(tκ2,δ)8,tT}\displaystyle=\inf\Big{\{}T\Big{|}d_{*}(t,\delta)\leq\frac{\kappa d_{*}(\frac{t}{\kappa^{2}},\delta)}{8},\hskip 8.53581pt\forall t\geq T\Big{\}} (57)

Clearly, T(κ)(δ)<T1(δ)T2(δ)T^{(\kappa)}_{*}(\delta)<T_{1}(\delta)\vee T_{2}(\delta). A key assumption in the statement of our results is that T(κ)(δ)<T^{(\kappa)}_{*}(\delta)<\infty. We will show that it is indeed true. Let κ16\kappa\geq 16.

Proposition 4.

For a fixed δ>0\delta>0, T1(δ)<T_{1}(\delta)<\infty with d(T,δ)clog((cT+log(1δ)))log(R)+log((M~/β))log(1ρ)d_{*}(T,\delta)\leq\frac{c\log{(cT+\log{\frac{1}{\delta}})}-\log{R}+\log{(\tilde{M}/\beta)}}{\log{\frac{1}{\rho}}}. Here ρ=ρ(A)\rho=\rho(A).

Proof 13.5.

Note the form of d_{*}(T,\delta): it is the minimum d that satisfies

16βRdm+pd+log(Tδ)T0,d,d0,,216\beta R\sqrt{d}\sqrt{\frac{m+pd+\log{\frac{T}{\delta}}}{T}}\geq||\mathcal{H}_{0,d,d}-\mathcal{H}_{0,\infty,\infty}||_{2}

Since from Propositions 1 and 3 we have ||\mathcal{H}_{0,d,d}-\mathcal{H}_{0,\infty,\infty}||_{2}\leq\frac{3\tilde{M}\rho^{d}}{1-\rho(A)}, it follows that d_{*}(T,\delta)\leq d for any d that satisfies

16βRdm+pd+log(Tδ)T3M~ρd1ρ(A)16\beta R\sqrt{d}\sqrt{\frac{m+pd+\log{\frac{T}{\delta}}}{T}}\geq\frac{3\tilde{M}\rho^{d}}{1-\rho(A)}

which immediately implies d(T,δ)d=clog((cTlog(R)+log(1δ)))+log((M~/β))log(1ρ)d_{*}(T,\delta)\leq d=\frac{c\log{(cT-\log{R}+\log{\frac{1}{\delta}})}+\log{(\tilde{M}/\beta)}}{\log{\frac{1}{\rho}}}, i.e., d(T,δ)d_{*}(T,\delta) is at most logarithmic in TT. As a result, for a large enough TT

cm2dlog2(d)log2(m2/δ)+cdlog3(2d)clog((cT+log(1δ)))log(R)+log((M~/β))log(1ρ)cm^{2}d\log^{2}{(d)}\log^{2}{(m^{2}/\delta)}+cd\log^{3}{(2d)}\geq\frac{c\log{(cT+\log{\frac{1}{\delta}})}-\log{R}+\log{(\tilde{M}/\beta)}}{\log{\frac{1}{\rho}}}

The intuition behind T2(δ)T_{2}(\delta) is the following: d(T,δ)d_{*}(T,\delta) grows at most logarithmically in TT, as is clear from the previous proof. Then T2(δ)T_{2}(\delta) is the point where d(T,δ)d_{*}(T,\delta) is still growing as T\sqrt{T} (i.e., “mixing” has not happened) but at a slightly reduced rate.

Proposition 5.

For a fixed δ>0\delta>0, T2(δ)<T_{2}(\delta)<\infty.

Proof 13.6.

Recall from the proof of Proposition 1 that d,,0,,0,d,d2d,,||\mathcal{H}_{d,\infty,\infty}||\leq||\mathcal{H}_{0,\infty,\infty}-\mathcal{H}_{0,d,d}||\leq\sqrt{2}||\mathcal{H}_{d,\infty,\infty}||. Now d,,\mathcal{H}_{d,\infty,\infty} can be written as

d,,\displaystyle\mathcal{H}_{d,\infty,\infty} =[CCA]=C~Ad[B,AB,]=B~\displaystyle=\underbrace{\begin{bmatrix}C\\ CA\\ \vdots\\ \end{bmatrix}}_{=\tilde{C}}A^{d}\underbrace{[B,AB,\ldots]}_{=\tilde{B}}

Define Pd=AdB~B~(Ad)P_{d}=A^{d}\tilde{B}\tilde{B}^{\top}(A^{d})^{\top}. Let dκd_{\kappa} be such that for every ddκd\geq d_{\kappa} and κ16\kappa\geq 16

Pd14κP0P_{d}\preceq\frac{1}{4\kappa}P_{0} (58)

Clearly such a dκ<d_{\kappa}<\infty would exist because P00P_{0}\neq 0 but limdPd=0\lim_{d\rightarrow\infty}P_{d}=0. Then observe that P2d14κPdP_{2d}\preceq\frac{1}{4\kappa}P_{d}. Then for every ddκd\geq d_{\kappa} we have that

d,,4κ2d,,||\mathcal{H}_{d,\infty,\infty}||\geq 4\kappa||\mathcal{H}_{2d,\infty,\infty}||

Let

T4dκ(16)2β2R2σ02(dκp+log((T/δ)))T\geq\frac{4d_{\kappa}\cdot(16)^{2}\cdot\beta^{2}R^{2}}{\sigma_{0}^{2}}(d_{\kappa}p+\log{(T/\delta)}) (59)

where \sigma_{0}=||\mathcal{H}_{d_{\kappa},\infty,\infty}||. Assume that \sigma_{0}>0 (if not, then our condition is trivially true). Then a simple computation shows that

0,dκ,dκ0,,\displaystyle||\mathcal{H}_{0,d_{\kappa},d_{\kappa}}-\mathcal{H}_{0,\infty,\infty}|| dκ,,16βRdκm+pdκ+log(Tδ)T<σ02\displaystyle\geq||\mathcal{H}_{d_{\kappa},\infty,\infty}||\geq\underbrace{16\beta R\sqrt{d_{\kappa}}\sqrt{\frac{m+pd_{\kappa}+\log{\frac{T}{\delta}}}{T}}}_{<\frac{\sigma_{0}}{2}}

This implies that d=d(T,δ)dκd_{*}=d_{*}(T,\delta)\geq d_{\kappa} for TT prescribed as above (ensured by Proposition 2). But from our discussion above we also have

0,d,d0,,d,,4κ2d,,2κ0,2d,2d0,,\displaystyle||\mathcal{H}_{0,d_{*},d_{*}}-\mathcal{H}_{0,\infty,\infty}||\geq||\mathcal{H}_{d_{*},\infty,\infty}||\geq 4\kappa||\mathcal{H}_{2d_{*},\infty,\infty}||\geq 2\kappa||\mathcal{H}_{0,2d_{*},2d_{*}}-\mathcal{H}_{0,\infty,\infty}||

This means that if

0,d,d0,,\displaystyle||\mathcal{H}_{0,d_{*},d_{*}}-\mathcal{H}_{0,\infty,\infty}|| 16βRdm+pd+log(Tδ)T\displaystyle\leq 16\beta R\sqrt{d_{*}}\sqrt{\frac{m+pd_{*}+\log{\frac{T}{\delta}}}{T}}

then

0,2d,2d0,,\displaystyle||\mathcal{H}_{0,2d_{*},2d_{*}}-\mathcal{H}_{0,\infty,\infty}|| 162κβRdm+pd+log(Tδ)T16βR2dm+2pd+log(κ2Tδ)κ2T\displaystyle\leq\frac{16}{2\kappa}\beta R\sqrt{d_{*}}\sqrt{\frac{m+pd_{*}+\log{\frac{T}{\delta}}}{T}}\leq 16\beta R\sqrt{2d_{*}}\sqrt{\frac{m+2pd_{*}+\log{\frac{\kappa^{2}T}{\delta}}}{\kappa^{2}T}}

which implies that d_{*}(\kappa^{2}T,\delta)\leq 2d_{*}(T,\delta). The inequality follows from the definition of d_{*}(\kappa^{2}T,\delta). Furthermore, if \kappa\geq 16, then 2d_{*}(T,\delta)\leq\frac{\kappa}{8}d_{*}(T,\delta), so d_{*}(\kappa^{2}T,\delta)\leq\frac{\kappa}{8}d_{*}(T,\delta) whenever T is greater than the finite threshold of Eq. (59).

Eq. (58) happens when σ(Ad)214κdκ=𝒪(log(κ)log(1ρ))\sigma(A^{d})^{2}\leq\frac{1}{4\kappa}\implies d_{\kappa}=\mathcal{O}\Big{(}\frac{\log{\kappa}}{\log{\frac{1}{\rho}}}\Big{)} where ρ=ρ(A)\rho=\rho(A) and T2(δ)cT1(δ)T_{2}(\delta)\leq cT_{1}(\delta). It should be noted that the dependence of Ti(δ)T_{i}(\delta) on log(1ρ)\log{\frac{1}{\rho}} is worst case, i.e., there exists some “bad” LTI system that gives this dependence and it is quite likely Ti(δ)T_{i}(\delta) is much smaller. The condition TT1(δ)T2(δ)T\geq T_{1}(\delta)\vee T_{2}(\delta) simply requires that we capture some reasonable portion of the dynamics and not necessarily the entire dynamics.

13.2 Proof of Theorem 6

Proposition 6.

Let TT(κ)(δ)T\geq T^{(\kappa)}_{*}(\delta) and d=d(T,δ)d_{*}=d_{*}(T,\delta) then

0,,^0,d,d2cβRdTm+pd+log(Tδ)||\mathcal{H}_{0,\infty,\infty}-\hat{\mathcal{H}}_{0,d_{*},d_{*}}||\leq 2c\beta R\sqrt{\frac{d_{*}}{T}}\sqrt{m+pd_{*}+\log{\frac{T}{\delta}}}
Proof 13.7.

Consider the following error

0,,^0,d,d2\displaystyle||\mathcal{H}_{0,\infty,\infty}-\hat{\mathcal{H}}_{0,d_{*},d_{*}}||_{2} 0,d,d^0,d,d2+0,,0,d,d2\displaystyle\leq||\mathcal{H}_{0,d_{*},d_{*}}-\hat{\mathcal{H}}_{0,d_{*},d_{*}}||_{2}+||\mathcal{H}_{0,\infty,\infty}-{\mathcal{H}}_{0,d_{*},d_{*}}||_{2}

From Proposition 1 and Eq. (55) we get that

0,,0,d,d216βRdTm+pd+log(Tδ)||\mathcal{H}_{0,\infty,\infty}-{\mathcal{H}}_{0,d_{*},d_{*}}||_{2}\leq 16\beta R\sqrt{\frac{d_{*}}{T}}\sqrt{m+pd_{*}+\log{\frac{T}{\delta}}}

Since from Theorem 2

0,d,d^0,d,d2\displaystyle||\mathcal{H}_{0,d_{*},d_{*}}-\hat{\mathcal{H}}_{0,d_{*},d_{*}}||_{2} 16βRdTm+pd+log(Tδ)\displaystyle\leq 16\beta R\sqrt{\frac{d_{*}}{T}}\sqrt{m+pd_{*}+\log{\frac{T}{\delta}}}
0,,^0,d,d2\displaystyle||\mathcal{H}_{0,\infty,\infty}-\hat{\mathcal{H}}_{0,d_{*},d_{*}}||_{2} 32βRdTm+pd+log(Tδ)\displaystyle\leq 32\beta R\sqrt{\frac{d_{*}}{T}}\sqrt{m+pd_{*}+\log{\frac{T}{\delta}}} (60)

Recall the adaptive rule to choose dd in Algorithm 1. From Theorem 2 we know that for every d𝒟(T)d\in\mathcal{D}(T) we have with probability at least 1δ1-\delta.

||\mathcal{H}_{0,d,d}-\hat{\mathcal{H}}_{0,d,d}||_{2}\leq 16\beta R\sqrt{d}\sqrt{\frac{m+dp+\log{\frac{T}{\delta}}}{T}}

Let α(l)=l(lpT+log(Tδ)T)\alpha(l)=\sqrt{l}\Big{(}\sqrt{\frac{lp}{T}+\frac{\log{\frac{T}{\delta}}}{T}}\Big{)}. Then consider the following adaptive rule

d0(T,δ)\displaystyle d_{0}(T,\delta) =inf{l|^0,l,l^0,h,h216βR(2α(l)+α(h))h𝒟(T),hl}\displaystyle=\inf\Big{\{}l\Big{|}||\hat{\mathcal{H}}_{0,l,l}-\hat{\mathcal{H}}_{0,h,h}||_{2}\leq 16\beta R(2\alpha(l)+\alpha(h))\hskip 5.69054pt\forall h\in\mathcal{D}(T),h\geq l\Big{\}} (61)
d^=d^(T,δ)\displaystyle\hat{d}=\hat{d}(T,\delta) =d0(T,δ)log((Tδ))\displaystyle=d_{0}(T,\delta)\vee\log{\left(\frac{T}{\delta}\right)} (62)

for the same universal constant c as in Theorem 2. Let d_{*}(T,\delta) be as in Eq. (55). Recall that d_{*}=d_{*}(T,\delta) is the point where the estimation error dominates the finite truncation error. Unfortunately, we do not have a priori knowledge of d_{*}(T,\delta) to use in the algorithm. Therefore, we will simply use Eq. (62) as our proxy. The goal will be to bound ||\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}-\mathcal{H}_{0,\infty,\infty}||_{2}.
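A minimal sketch of this adaptive choice (illustrative Python of our own; H_hat, depths, and the zero-padding of the estimates to a common shape are our assumptions, not the paper's implementation):

import numpy as np

def alpha(l, p, T, delta):
    # alpha(l) = sqrt(l) * sqrt(l*p/T + log(T/delta)/T), as defined above.
    return np.sqrt(l) * np.sqrt(l * p / T + np.log(T / delta) / T)

def choose_depth(H_hat, depths, beta, R, p, T, delta):
    # depths: sorted candidate set D(T); H_hat(d) returns the estimate H_hat_{0,d,d},
    # zero-padded so that estimates at different depths share a common shape.
    for l in depths:
        ok = all(
            np.linalg.norm(H_hat(l) - H_hat(h), 2)
            <= 16 * beta * R * (2 * alpha(l, p, T, delta) + alpha(h, p, T, delta))
            for h in depths if h >= l
        )
        if ok:
            # d_hat = d_0 v log(T/delta), Eq. (62)
            return max(l, int(np.ceil(np.log(T / delta))))
    return depths[-1]   # fallback for the sketch; in the analysis d_0 exists in D(T)

The rule only compares estimated Hankel matrices at different depths against the statistical tolerance, so it requires no knowledge of d_{*}(T,\delta).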

Proposition 7.

Let TT(κ)(δ)T\geq T^{(\kappa)}_{*}(\delta), d(T,δ)d_{*}(T,\delta) be as in Eq. (55) and d^\hat{d} be as in Eq. (62). Then with probability at least 1δ1-\delta we have

d^d(T,δ)log((Tδ))\hat{d}\leq d_{*}(T,\delta)\vee\log{\Big{(}\frac{T}{\delta}\Big{)}}
Proof 13.8.

Let d_{*}=d_{*}(T,\delta). First, for all h\in\mathcal{D}(T) with h\geq d_{*}, we note

^0,d,d^0,h,h2\displaystyle||\hat{\mathcal{H}}_{0,d_{*},d_{*}}-\hat{\mathcal{H}}_{0,h,h}||_{2} ^0,d,d0,d,d+0,h,h^0,h,h2+0,h,h0,d,d2\displaystyle\leq||\hat{\mathcal{H}}_{0,d_{*},d_{*}}-\mathcal{H}_{0,d_{*},d_{*}}||+||\mathcal{H}_{0,h,h}-\hat{\mathcal{H}}_{0,h,h}||_{2}+||\mathcal{H}_{0,h,h}-{\mathcal{H}}_{0,d_{*},d_{*}}||_{2}
>hd^0,d,d0,d,d2+0,h,h^0,h,h2+0,,0,d,d2.\displaystyle\underbrace{\leq}_{\infty>h\geq d_{*}}||\hat{\mathcal{H}}_{0,d_{*},d_{*}}-\mathcal{H}_{0,d_{*},d_{*}}||_{2}+||\mathcal{H}_{0,h,h}-\hat{\mathcal{H}}_{0,h,h}||_{2}+||\mathcal{H}_{0,\infty,\infty}-\mathcal{H}_{0,d_{*},d_{*}}||_{2}.

We use the property that 0,,0,d,d20,h,h0,d,d2||\mathcal{H}_{0,\infty,\infty}-\mathcal{H}_{0,d_{*},d_{*}}||_{2}\geq||\mathcal{H}_{0,h,h}-\mathcal{H}_{0,d_{*},d_{*}}||_{2}. Furthermore, because of the properties of dd_{*} we have

0,,0,d,d216βRα(d)||\mathcal{H}_{0,\infty,\infty}-\mathcal{H}_{0,d_{*},d_{*}}||_{2}\leq 16\beta R\alpha(d_{*})

and

^0,d,d0,d,d2\displaystyle||\hat{\mathcal{H}}_{0,d_{*},d_{*}}-\mathcal{H}_{0,d_{*},d_{*}}||_{2} 16βRα(d),0,h,h^0,h,h216βRα(h).\displaystyle\leq 16\beta R\alpha(d_{*}),\quad{}||\mathcal{H}_{0,h,h}-\hat{\mathcal{H}}_{0,h,h}||_{2}\leq 16\beta R\alpha(h). (63)

and

^0,d,d^0,h,h216βR(2α(d)+α(h)).||\hat{\mathcal{H}}_{0,d_{*},d_{*}}-\hat{\mathcal{H}}_{0,h,h}||_{2}\leq 16\beta R(2\alpha(d_{*})+\alpha(h)).

This implies that d0(T,δ)dd_{0}(T,\delta)\leq d_{*} and the assertion follows.

We have the following key lemma about the behavior of ^0,d^,d^\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}.

Lemma 8.

For a fixed κ20\kappa\geq 20, whenever TT(κ)(δ)T\geq T^{(\kappa)}_{*}(\delta) we have with probability at least 1δ1-\delta

0,,^0,d^,d^23cβRα(max(d(T,δ),log((Tδ))))||\mathcal{H}_{0,\infty,\infty}-\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}||_{2}\leq 3c\beta R\alpha\left(\max{\left(d_{*}(T,\delta),\log{\left(\frac{T}{\delta}\right)}\right)}\right) (64)

Furthermore, d^=O(log(Tδ))\hat{d}=O(\log{\frac{T}{\delta}}).

Proof 13.9.

Let d>d^d_{*}>\hat{d} then

0,,^0,d^,d^2\displaystyle||\mathcal{H}_{0,\infty,\infty}-\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}||_{2} 0,,0,d,d2+^0,d^,d^0,d^,d^2+^0,d,d0,d,d2\displaystyle\leq||\mathcal{H}_{0,\infty,\infty}-\mathcal{H}_{0,d_{*},d_{*}}||_{2}+||\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}-\mathcal{H}_{0,\hat{d},\hat{d}}||_{2}+||\hat{\mathcal{H}}_{0,d_{*},d_{*}}-\mathcal{H}_{0,d_{*},d_{*}}||_{2}
3cβRα(d)\displaystyle\leq 3c\beta R\alpha(d_{*})

If d^>d\hat{d}>d_{*} then

0,,^0,d^,d^2\displaystyle||\mathcal{H}_{0,\infty,\infty}-\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}||_{2} 0,,0,d^,d^2+^0,d^,d^0,d^,d^2=2^0,d^,d^0,d^,d^2\displaystyle\leq||\mathcal{H}_{0,\infty,\infty}-\mathcal{H}_{0,\hat{d},\hat{d}}||_{2}+||\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}-\mathcal{H}_{0,\hat{d},\hat{d}}||_{2}=2||\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}-\mathcal{H}_{0,\hat{d},\hat{d}}||_{2}
2cβRα(d^)=2cβRα(log((Tδ)))\displaystyle\leq 2c\beta R\alpha(\hat{d})=2c\beta R\alpha\left(\log{\Big{(}\frac{T}{\delta}\Big{)}}\right)

where the equality follows from Proposition 7. The fact that \hat{d}=O(\log{\frac{T}{\delta}}) follows from Propositions 4 and 7.

In the following we will use l=0,l,l\mathcal{H}_{l}=\mathcal{H}_{0,l,l} for shorthand.

Proposition 9.

Fix κ16\kappa\geq 16, and TT(κ)(δ)T\geq T_{*}^{(\kappa)}(\delta). Then

^0,d^(T,δ),d^(T,δ)0,,212cβRd^(T,δ)m+pd^(T,δ)+log(Tδ)T||\hat{\mathcal{H}}_{0,\hat{d}(T,\delta),\hat{d}(T,\delta)}-\mathcal{H}_{0,\infty,\infty}||_{2}\leq 12c\beta R\sqrt{\hat{d}(T,\delta)}\sqrt{\frac{m+p\hat{d}(T,\delta)+\log{\frac{T}{\delta}}}{T}}

with probability at least 1δ1-\delta.

Proof 13.10.

Assume that log((Tδ))d(T,δ)\log{\Big{(}\frac{T}{\delta}\Big{)}}\leq d_{*}(T,\delta). Recall the following functions

d(T,δ)\displaystyle d_{*}(T,\delta) =inf{d|cβRdm+pd+log(Tδ)Td2}\displaystyle=\inf{\Big{\{}d\Big{|}c\beta R\sqrt{d}\sqrt{\frac{m+pd+\log{\frac{T}{\delta}}}{T}}\geq||\mathcal{H}_{d}-\mathcal{H}_{\infty}||_{2}\Big{\}}}
d0(T,δ)\displaystyle d_{0}(T,\delta) =inf{l|^l^h2cβR(α(h)+2α(l))hl,h𝒟(T)}\displaystyle=\inf{\Big{\{}l\Big{|}||\hat{\mathcal{H}}_{l}-\hat{\mathcal{H}}_{h}||_{2}\leq c\beta R(\alpha(h)+2\alpha(l))\hskip 5.69054pt\forall h\geq l,\hskip 5.69054pth\in\mathcal{D}(T)\Big{\}}}
d^(T,δ)\displaystyle\hat{d}(T,\delta) =d0(T,δ)log((Tδ))\displaystyle=d_{0}(T,\delta)\vee\log{\Big{(}\frac{T}{\delta}\Big{)}}

It is clear that d(κ2T,δ)(1+12p)κd(T,δ)d_{*}(\kappa^{2}T,\delta)\leq(1+\frac{1}{2p})\kappa d_{*}(T,\delta) for any κ16\kappa\geq 16. Assume the following

  • d(T,δ)κ8d(κ2T,δ)d_{*}(T,\delta)\leq\frac{\kappa}{8}d_{*}(\kappa^{-2}T,\delta) (This relation is true whenever TT(κ)(δ)T\geq T^{(\kappa)}_{*}(\delta)),

  • d^(T,δ)26cβRd^(T,δ)m+pd^(T,δ)+log(Tδ)T||\mathcal{H}_{\hat{d}(T,\delta)}-\mathcal{H}_{\infty}||_{2}\geq 6c\beta R\sqrt{\hat{d}(T,\delta)}\sqrt{\frac{m+p\hat{d}(T,\delta)+\log{\frac{T}{\delta}}}{T}},

  • d^(T,δ)<d(κ2T,δ)1\hat{d}(T,\delta)<d_{*}(\kappa^{-2}T,\delta)-1.

The key will be to show that, with high probability, all three assumptions cannot hold simultaneously. For shorthand we define d_{*}^{(1)}=d_{*}(T,\delta),d_{*}^{(\kappa^{2})}=d_{*}(\kappa^{-2}T,\delta),\hat{d}^{(1)}=\hat{d}(T,\delta),\hat{d}^{(\kappa^{2})}=\hat{d}(\kappa^{-2}T,\delta) and \mathcal{H}_{l}=\mathcal{H}_{0,l,l},\hat{\mathcal{H}}_{l}=\hat{\mathcal{H}}_{0,l,l}. Let \tilde{T}=\kappa^{-2}T. Then this implies that

cβR(d(1)+2d^(1))κm+pd(1)+log(κ2T~δ)T~\displaystyle\frac{c\beta R(\sqrt{d_{*}^{(1)}}+2\sqrt{\hat{d}^{(1)}})}{\kappa}\sqrt{\frac{m+pd_{*}^{(1)}+\log{\frac{\kappa^{2}\tilde{T}}{\delta}}}{\tilde{T}}} ^d^(1)^d(1)2\displaystyle\geq||\hat{\mathcal{H}}_{\hat{d}^{(1)}}-\hat{\mathcal{H}}_{d_{*}^{(1)}}||_{2}
^d^(1)^d(1)2\displaystyle||\hat{\mathcal{H}}_{\hat{d}^{(1)}}-\hat{\mathcal{H}}_{d_{*}^{(1)}}||_{2} ^d^(1)2^d(1)2\displaystyle\geq||\hat{\mathcal{H}}_{\hat{d}^{(1)}}-\mathcal{H}_{\infty}||_{2}-||\hat{\mathcal{H}}_{d_{*}^{(1)}}-\mathcal{H}_{\infty}||_{2}
^d(1)2+^d^(1)^d(1)2\displaystyle||\hat{\mathcal{H}}_{d_{*}^{(1)}}-\mathcal{H}_{\infty}||_{2}+||\hat{\mathcal{H}}_{\hat{d}^{(1)}}-\hat{\mathcal{H}}_{d_{*}^{(1)}}||_{2} ^d^(1)2\displaystyle\geq||\hat{\mathcal{H}}_{\hat{d}^{(1)}}-\mathcal{H}_{\infty}||_{2}
^d(1)d(1)2+d(1)2+^d^(1)^d(1)2\displaystyle||\hat{\mathcal{H}}_{d_{*}^{(1)}}-\mathcal{H}_{d_{*}^{(1)}}||_{2}+||\mathcal{H}_{d_{*}^{(1)}}-\mathcal{H}_{\infty}||_{2}+||\hat{\mathcal{H}}_{\hat{d}^{(1)}}-\hat{\mathcal{H}}_{d_{*}^{(1)}}||_{2} ^d^(1)2\displaystyle\geq||\hat{\mathcal{H}}_{\hat{d}^{(1)}}-\mathcal{H}_{\infty}||_{2}

Since by definition of d(,)d_{*}(\cdot,\cdot) we have

^d(1)d(1)2+d(1)22cβRκd(1)m+pd(1)+log(κ2T~δ)T~||\hat{\mathcal{H}}_{d_{*}^{(1)}}-\mathcal{H}_{d_{*}^{(1)}}||_{2}+||\mathcal{H}_{d_{*}^{(1)}}-\mathcal{H}_{\infty}||_{2}\leq\frac{2c\beta R}{\kappa}\sqrt{d_{*}^{(1)}}\sqrt{\frac{m+pd_{*}^{(1)}+\log{\frac{\kappa^{2}\tilde{T}}{\delta}}}{\tilde{T}}}

and by the assumptions d_{*}^{(1)}\leq\frac{\kappa}{8}d_{*}^{(\kappa^{2})} and \hat{d}^{(1)}\leq d_{*}^{(\kappa^{2})}, as a result (\sqrt{d_{*}^{(1)}}+2\sqrt{\hat{d}^{(1)}})\sqrt{d_{*}^{(1)}}\leq(\frac{2\kappa}{8}+1)d_{*}^{(\kappa^{2})}

^d^(1)2^d(1)d(1)2+d(1)2+^d^(1)^d(1)2\displaystyle||\hat{\mathcal{H}}_{\hat{d}^{(1)}}-\mathcal{H}_{\infty}||_{2}\leq||\hat{\mathcal{H}}_{d_{*}^{(1)}}-\mathcal{H}_{d_{*}^{(1)}}||_{2}+||\mathcal{H}_{d_{*}^{(1)}}-\mathcal{H}_{\infty}||_{2}+\underbrace{||\hat{\mathcal{H}}_{\hat{d}^{(1)}}-\hat{\mathcal{H}}_{d_{*}^{(1)}}||_{2}}_{\Downarrow}
2cβRd(1)κm+pd(1)+log(κ2T~δ)T~Prop.6+cβR(d(1)+2d^(1))κm+pd(1)+log(κ2T~δ)T~Definition of d^(1)\displaystyle\leq\underbrace{\frac{2c\beta R\sqrt{d_{*}^{(1)}}}{\kappa}\sqrt{\frac{m+pd_{*}^{(1)}+\log{\frac{\kappa^{2}\tilde{T}}{\delta}}}{\tilde{T}}}}_{\text{Prop.}\leavevmode\nobreak\ \ref{ds_error}}+\underbrace{\frac{c\beta R(\sqrt{d_{*}^{(1)}}+2\sqrt{\hat{d}^{(1)}})}{\kappa}\sqrt{\frac{m+pd_{*}^{(1)}+\log{\frac{\kappa^{2}\tilde{T}}{\delta}}}{\tilde{T}}}}_{\text{Definition of }\hat{d}^{(1)}}
^d^(1)2(12+1κ)cβRd(κ2)m+pd(κ2)+log(T~δ)T~\displaystyle||\hat{\mathcal{H}}_{\hat{d}^{(1)}}-\mathcal{H}_{\infty}||_{2}\leq\Big{(}\frac{1}{2}+\frac{1}{\kappa}\Big{)}c\beta R\sqrt{d_{*}^{(\kappa^{2})}}\sqrt{\frac{m+pd_{*}^{(\kappa^{2})}+\log{\frac{\tilde{T}}{\delta}}}{\tilde{T}}}

where the last inequality follows from (d(1)+2d^(1))d(1)(2κ8+1)d(κ2)(\sqrt{d_{*}^{(1)}}+2\sqrt{\hat{d}^{(1)}})\sqrt{d_{*}^{(1)}}\leq(\frac{2\kappa}{8}+1)d_{*}^{(\kappa^{2})}. Now by assumption

d^(1)26cβRd^(1)m+pd^(1)+log(κ2T~δ)κ2T~||\mathcal{H}_{\hat{d}^{(1)}}-\mathcal{H}_{\infty}||_{2}\geq 6c\beta R\sqrt{\hat{d}^{(1)}}\sqrt{\frac{m+p\hat{d}^{(1)}+\log{\frac{\kappa^{2}\tilde{T}}{\delta}}}{\kappa^{2}\tilde{T}}}

it is clear that

^d^(1)256d^(1)2||\hat{\mathcal{H}}_{\hat{d}^{(1)}}-\mathcal{H}_{\infty}||_{2}\geq\frac{5}{6}||\mathcal{H}_{\hat{d}^{(1)}}-\mathcal{H}_{\infty}||_{2}

and we can conclude that, since 65(12+1κ)<12\frac{6}{5}\Big{(}\frac{1}{2}+\frac{1}{\kappa}\Big{)}<\frac{1}{\sqrt{2}},

d^(1)2<cβRd(κ2)2m+pd(κ2)+log(T~δ)T~||\mathcal{H}_{\hat{d}^{(1)}}-\mathcal{H}_{\infty}||_{2}<c\beta R\sqrt{\frac{d_{*}^{(\kappa^{2})}}{2}}\sqrt{\frac{m+pd_{*}^{(\kappa^{2})}+\log{\frac{\tilde{T}}{\delta}}}{\tilde{T}}}

which implies that d^(1)d(κ2)1\hat{d}^{(1)}\geq d_{*}^{(\kappa^{2})}-1. This is because by definition of d(κ2)d_{*}^{(\kappa^{2})} we know that d(κ2)d_{*}^{(\kappa^{2})} is the minimum such that

d(κ2)2cβRd(κ2)2m+pd(κ2)+log(T~δ)T~||\mathcal{H}_{d_{*}^{(\kappa^{2})}}-\mathcal{H}_{\infty}||_{2}\leq c\beta R\sqrt{\frac{d_{*}^{(\kappa^{2})}}{2}}\sqrt{\frac{m+pd_{*}^{(\kappa^{2})}+\log{\frac{\tilde{T}}{\delta}}}{\tilde{T}}}

and furthermore from Proposition 2 we have for any d1d2d_{1}\leq d_{2}

0,,0,d1,d1120,,0,d2,d2.||\mathcal{H}_{0,\infty,\infty}-\mathcal{H}_{0,d_{1},d_{1}}||\geq\frac{1}{\sqrt{2}}||\mathcal{H}_{0,\infty,\infty}-\mathcal{H}_{0,d_{2},d_{2}}||.

This contradicts Assumption 3. Therefore, at least one of the three assumptions does not hold. Clearly, if Assumption 3 is invalid then we have a suitable lower bound on the chosen \hat{d}(\cdot,\cdot), i.e., since d_{*}(\kappa^{-2}T,\delta)\leq d_{*}(T,\delta)\leq\frac{\kappa}{8}d_{*}(\kappa^{-2}T,\delta) we get

κ8d^(κ2T~,δ)κ8d(T~,δ)κ8d(κ2T~,δ)κ8d^(κ2T~,δ)κ8d(T~,δ)κ8\frac{\kappa}{8}\hat{d}(\kappa^{2}\tilde{T},\delta)\geq\frac{\kappa}{8}d_{*}(\tilde{T},\delta)-\frac{\kappa}{8}\geq d_{*}(\kappa^{2}\tilde{T},\delta)-\frac{\kappa}{8}\geq\hat{d}(\kappa^{2}\tilde{T},\delta)-\frac{\kappa}{8}\geq d_{*}(\tilde{T},\delta)-\frac{\kappa}{8}

which, by Lemma 8 (since we pick \kappa=16 and, for large enough T, d_{*}(\tilde{T},\delta)\geq 4), implies that

^d^(κ2T~,δ)2\displaystyle||\hat{\mathcal{H}}_{\hat{d}(\kappa^{2}\tilde{T},\delta)}-\mathcal{H}_{\infty}||_{2} 3cβRd(κ2T~,δ)pd(κ2T~,δ)+log(κ2T~δ)κ2T~\displaystyle\leq 3c\beta R\sqrt{d_{*}(\kappa^{2}\tilde{T},\delta)}\sqrt{\frac{pd_{*}(\kappa^{2}\tilde{T},\delta)+\log{\frac{\kappa^{2}\tilde{T}}{\delta}}}{\kappa^{2}\tilde{T}}}
3κ8cβRd^(κ2T~,δ)pd^(κ2T~,δ)+log(κ2T~δ)κ2T~\displaystyle\leq\frac{3\kappa}{8}c\beta R\sqrt{\hat{d}(\kappa^{2}\tilde{T},\delta)}\sqrt{\frac{p\hat{d}(\kappa^{2}\tilde{T},\delta)+\log{\frac{\kappa^{2}\tilde{T}}{\delta}}}{\kappa^{2}\tilde{T}}}

Similarly, if assumption 22 is invalid then we get that

d^(κ2T~,δ)2<6cβRd^(κ2T~,δ)pd^(κ2T~,δ)+log(κ2T~δ)κ2T~||\mathcal{H}_{\hat{d}(\kappa^{2}\tilde{T},\delta)}-\mathcal{H}_{\infty}||_{2}<6c\beta R\sqrt{\hat{d}(\kappa^{2}\tilde{T},\delta)}\sqrt{\frac{p\hat{d}(\kappa^{2}\tilde{T},\delta)+\log{\frac{\kappa^{2}\tilde{T}}{\delta}}}{\kappa^{2}\tilde{T}}}

and because \hat{d}(\kappa^{2}\tilde{T},\delta)\leq d_{*}(\kappa^{2}\tilde{T},\delta) and ||\hat{\mathcal{H}}_{\hat{d}(\kappa^{2}\tilde{T},\delta)}-\mathcal{H}_{\infty}||_{2}\leq||\mathcal{H}_{\hat{d}(\kappa^{2}\tilde{T},\delta)}-\mathcal{H}_{\infty}||_{2}+||\hat{\mathcal{H}}_{\hat{d}(\kappa^{2}\tilde{T},\delta)}-\mathcal{H}_{\hat{d}(\kappa^{2}\tilde{T},\delta)}||_{2}, we get, in a similar fashion to Proposition 6,

^d^(κ2T~,δ)212cβRd^(κ2T~,δ)pd^(κ2T~,δ)+log(κ2T~δ)κ2T~||\hat{\mathcal{H}}_{\hat{d}(\kappa^{2}\tilde{T},\delta)}-\mathcal{H}_{\infty}||_{2}\leq 12c\beta R\sqrt{\hat{d}(\kappa^{2}\tilde{T},\delta)}\sqrt{\frac{p\hat{d}(\kappa^{2}\tilde{T},\delta)+\log{\frac{\kappa^{2}\tilde{T}}{\delta}}}{\kappa^{2}\tilde{T}}}

Replacing κ2T~=T\kappa^{2}\tilde{T}=T it is clear that for any κ16\kappa\geq 16

^d^(T,δ)212cβRd^(T,δ)pd^(T,δ)+log(Tδ)T||\hat{\mathcal{H}}_{\hat{d}(T,\delta)}-\mathcal{H}_{\infty}||_{2}\leq 12c\beta R\sqrt{\hat{d}(T,\delta)}\sqrt{\frac{p\hat{d}(T,\delta)+\log{\frac{T}{\delta}}}{T}} (65)

If d(T,δ)log((Tδ))d_{*}(T,\delta)\leq\log{\left(\frac{T}{\delta}\right)} then we can simply apply Lemma 8 and our assertion holds.

14 Model Selection Results

Proposition 1.

Let 0,,=UΣV,^0,d^,d^=U^Σ^V^\mathcal{H}_{0,\infty,\infty}=U\Sigma V^{\top},\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}=\hat{U}\hat{\Sigma}\hat{V}^{\top} and

0,,^0,d^,d^ϵ.||\mathcal{H}_{0,\infty,\infty}-\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}||\leq\epsilon.

Let Σ^\hat{\Sigma} be arranged into blocks of singular values such that in each block ii we have

supjσ^jiσ^j+1iχϵ\sup_{j}\hat{\sigma}^{i}_{j}-\hat{\sigma}^{i}_{j+1}\leq\chi\epsilon

for some χ2\chi\geq 2, i.e.,

Σ^=[Λ1000Λ20000Λl]\hat{\Sigma}=\begin{bmatrix}\Lambda_{1}&0&\ldots&0\\ 0&\Lambda_{2}&\ldots&0\\ \vdots&\vdots&\ddots&0\\ 0&0&\ldots&\Lambda_{l}\\ \end{bmatrix}

where Λi\Lambda_{i} are diagonal matrices and σ^ji\hat{\sigma}^{i}_{j} is the jthj^{th} singular value in the block Λi\Lambda_{i}. Then there exists an orthogonal transformation, QQ, such that

U^Σ^1/2QUΣ1/22\displaystyle||\hat{U}\hat{\Sigma}^{1/2}Q-U\Sigma^{1/2}||_{2} 2ϵσ^1/ζn12+σ^n1+1/ζn22++σ^i=1l1ni+1/ζnl2\displaystyle\leq 2\epsilon\sqrt{\hat{\sigma}_{1}/\zeta_{n_{1}}^{2}+\hat{\sigma}_{n_{1}+1}/\zeta_{n_{2}}^{2}+\ldots+\hat{\sigma}_{\sum_{i=1}^{l-1}n_{i}+1}/\zeta_{n_{l}}^{2}}
+2sup1ilσ^maxiσ^mini+ϵσ^d^ϵ.\displaystyle+2\sup_{1\leq i\leq l}\sqrt{\hat{\sigma}^{i}_{\max}}-\sqrt{\hat{\sigma}^{i}_{\min}}+\frac{\epsilon}{\sqrt{\hat{\sigma}_{\hat{d}}}}\wedge\sqrt{\epsilon}.

Here sup1ilσ^maxiσ^miniχσ^maxiϵd^χd^ϵ\sup_{1\leq i\leq l}\sqrt{\hat{\sigma}^{i}_{\max}}-\sqrt{\hat{\sigma}^{i}_{\min}}\leq\frac{\chi}{\sqrt{\hat{\sigma}^{i}_{\max}}}\epsilon\hat{d}\wedge\sqrt{\chi\hat{d}\epsilon} and

ζni=min(σ^minni1σ^maxni,σ^minniσ^maxni+1)\zeta_{n_{i}}=\min{({\hat{\sigma}}^{n_{i-1}}_{\min}-{\hat{\sigma}}^{n_{i}}_{\max},{\hat{\sigma}}^{n_{i}}_{\min}-{\hat{\sigma}}^{n_{i+1}}_{\max})}

for 1<i<l1<i<l, ζn1=σ^minn1σ^maxn2\zeta_{n_{1}}={\hat{\sigma}}^{n_{1}}_{\min}-{\hat{\sigma}}^{n_{2}}_{\max} and ζnl=min(σ^minnl1σ^maxnl,σ^minnl)\zeta_{n_{l}}=\min{({\hat{\sigma}}^{n_{l-1}}_{\min}-{\hat{\sigma}}^{n_{l}}_{\max},{\hat{\sigma}}^{n_{l}}_{\min})}.

Proof 14.1.

Let U^Σ^V^=SVD(^0,d^,d^)\hat{U}\hat{\Sigma}\hat{V}^{\top}=\text{SVD}(\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}) and UΣV=SVD(0,,){U}{\Sigma}{V}^{\top}=\text{SVD}(\mathcal{H}_{0,\infty,\infty}) where ^0,d^,d^0,,2ϵ||\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}-\mathcal{H}_{0,\infty,\infty}||_{2}\leq\epsilon. Σ^\hat{\Sigma} is arranged into blocks of singular values such that in each block ii we have σ^jiσ^j+1iχϵ\hat{\sigma}^{i}_{j}-\hat{\sigma}^{i}_{j+1}\leq\chi\epsilon, i.e.,

Σ^=[Λ1000Λ20000Λl]\hat{\Sigma}=\begin{bmatrix}\Lambda_{1}&0&\ldots&0\\ 0&\Lambda_{2}&\ldots&0\\ \vdots&\vdots&\ddots&0\\ 0&0&\ldots&\Lambda_{l}\\ \end{bmatrix}

where Λi\Lambda_{i} are diagonal matrices and σ^ji\hat{\sigma}^{i}_{j} is the jthj^{th} singular value in the block Λi\Lambda_{i}. Furthermore, σ^mini1σ^maxi>χϵ\hat{\sigma}^{i-1}_{\min}-\hat{\sigma}^{i}_{\max}>\chi\epsilon. From Σ^\hat{\Sigma} define Σ^¯\bar{\hat{\Sigma}} as follows:

Σ^¯=[σ^¯1In1×n1000σ^¯2In2×n20000σ^¯lInl×nl]\bar{\hat{\Sigma}}=\begin{bmatrix}\bar{\hat{\sigma}}_{1}I_{n_{1}\times n_{1}}&0&\ldots&0\\ 0&\bar{\hat{\sigma}}_{2}I_{n_{2}\times n_{2}}&\ldots&0\\ \vdots&\vdots&\ddots&0\\ 0&0&\ldots&\bar{\hat{\sigma}}_{l}I_{n_{l}\times n_{l}}\\ \end{bmatrix} (66)

where \Lambda_{i} is an n_{i}\times n_{i} matrix and \bar{\hat{\sigma}}_{i}=\frac{1}{n_{i}}\sum_{j}\hat{\sigma}^{i}_{j}. The key idea of the proof is the following: (A,B,C)\equiv(QAQ^{\top},QB,CQ^{\top}) where Q is an orthogonal transformation, and we will show that there exists a block diagonal unitary matrix Q of the form

Q=[Qn1×n1000Qn2×n20000Qnl×nl]Q={\begin{bmatrix}Q_{n_{1}\times n_{1}}&0&\ldots&0\\ 0&Q_{n_{2}\times n_{2}}&\ldots&0\\ \vdots&\vdots&\ddots&0\\ 0&0&\ldots&Q_{n_{l}\times n_{l}}\\ \end{bmatrix}} (67)

such that each block Q_{n_{i}\times n_{i}} corresponds to an orthogonal matrix of dimensions n_{i}\times n_{i}, and that ||\hat{U}\hat{\Sigma}^{1/2}Q-U\Sigma^{1/2}||_{2} is small if ||\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}-\mathcal{H}_{0,\infty,\infty}||_{2} is small. Each of the blocks corresponds to a set of singular values where the inter-singular-value distance is “small”. To start off, note that from Proposition 4 there must exist a Q that is block diagonal with orthogonal entries such that

U^QΣ^1/2UΣ1/22\displaystyle||\hat{U}Q\hat{\Sigma}^{1/2}-U\Sigma^{1/2}||_{2} cϵσ^1/ζn12+σ^n1+1/ζn22++σ^i=1l1ni+1/ζnl2+sup1id^|σiσ^i|\displaystyle\leq c\epsilon\sqrt{\hat{\sigma}_{1}/\zeta_{n_{1}}^{2}+\hat{\sigma}_{n_{1}+1}/\zeta_{n_{2}}^{2}+\ldots+\hat{\sigma}_{\sum_{i=1}^{l-1}n_{i}+1}/\zeta_{n_{l}}^{2}}+\sup_{1\leq i\leq\hat{d}}|\sqrt{\sigma_{i}}-\sqrt{\hat{\sigma}_{i}}| (68)

Here

ζni=min(σ^minni1σ^maxni,σ^minniσ^maxni+1)\zeta_{n_{i}}=\min{({\hat{\sigma}}^{n_{i-1}}_{\min}-{\hat{\sigma}}^{n_{i}}_{\max},{\hat{\sigma}}^{n_{i}}_{\min}-{\hat{\sigma}}^{n_{i+1}}_{\max})}

for 1<i<l, \zeta_{n_{1}}={\hat{\sigma}}^{n_{1}}_{\min}-{\hat{\sigma}}^{n_{2}}_{\max} and \zeta_{n_{l}}=\min{({\hat{\sigma}}^{n_{l-1}}_{\min}-{\hat{\sigma}}^{n_{l}}_{\max},{\hat{\sigma}}^{n_{l}}_{\min})}. Informally, the \zeta_{i} measure the singular value gaps between consecutive blocks.

Furthermore, it can be shown that for any QQ of the form in Eq. (67)

U^QΣ^1/2U^Σ^1/2Q2\displaystyle||\hat{U}Q\hat{\Sigma}^{1/2}-\hat{U}\hat{\Sigma}^{1/2}Q||_{2} U^QΣ^¯1/2U^QΣ^1/22+U^Σ^1/2QU^Σ^¯1/2Q22Σ^1/2Σ^¯1/22\displaystyle\leq||\hat{U}Q{\bar{\hat{\Sigma}}}^{1/2}-\hat{U}Q\hat{\Sigma}^{1/2}||_{2}+||\hat{U}\hat{\Sigma}^{1/2}Q-\hat{U}{\bar{\hat{\Sigma}}}^{1/2}Q||_{2}\leq 2||\hat{\Sigma}^{1/2}-\bar{\hat{\Sigma}}^{1/2}||_{2}

because $\hat{U}Q{\bar{\hat{\Sigma}}}^{1/2}=\hat{U}{\bar{\hat{\Sigma}}}^{1/2}Q$. Note that $||\hat{\Sigma}^{1/2}-\bar{\hat{\Sigma}}^{1/2}||_{2}\leq\sup_{1\leq i\leq l}\sqrt{\hat{\sigma}^{i}_{\max}}-\sqrt{\hat{\sigma}^{i}_{\min}}$. Now, since consecutive singular values within block $i$ differ by at most $\chi\epsilon$, we have $\hat{\sigma}^{i}_{\max}-\hat{\sigma}^{i}_{\min}\leq\chi n_{i}\epsilon$. Thus, when $\hat{\sigma}^{i}_{\max}\geq\chi n_{i}\epsilon$, then $\sqrt{\hat{\sigma}^{i}_{\max}}-\sqrt{\hat{\sigma}^{i}_{\min}}\leq\frac{\chi n_{i}\epsilon}{\sqrt{\hat{\sigma}^{i}_{\max}}}$; on the other hand, when $\hat{\sigma}^{i}_{\max}<\chi n_{i}\epsilon$, then $\sqrt{\hat{\sigma}^{i}_{\max}}-\sqrt{\hat{\sigma}^{i}_{\min}}\leq\sqrt{\hat{\sigma}^{i}_{\max}}\leq\sqrt{\chi n_{i}\epsilon}$. This implies that

sup1ilσ^maxiσ^miniχniσ^maxiϵχniϵ.\sup_{1\leq i\leq l}\sqrt{\hat{\sigma}^{i}_{\max}}-\sqrt{\hat{\sigma}^{i}_{\min}}\leq\frac{\chi n_{i}}{\sqrt{\hat{\sigma}^{i}_{\max}}}\epsilon\wedge\sqrt{\chi n_{i}\epsilon}.

Finally,

\displaystyle||\hat{U}\hat{\Sigma}^{1/2}Q-U\Sigma^{1/2}||_{2}\leq||\hat{U}Q\hat{\Sigma}^{1/2}-U\Sigma^{1/2}||_{2}+||\hat{U}Q\hat{\Sigma}^{1/2}-\hat{U}\hat{\Sigma}^{1/2}Q||_{2}
\displaystyle\leq 2\epsilon\sqrt{\hat{\sigma}_{1}/\zeta_{n_{1}}^{2}+\hat{\sigma}_{n_{1}+1}/\zeta_{n_{2}}^{2}+\ldots+\hat{\sigma}_{\sum_{i=1}^{l-1}n_{i}+1}/\zeta_{n_{l}}^{2}}+\sup_{1\leq i\leq\hat{d}}|\sqrt{\sigma_{i}}-\sqrt{\hat{\sigma}_{i}}|
\displaystyle\quad+2\sup_{1\leq i\leq l}\Big{(}\frac{\chi n_{i}\epsilon}{\sqrt{\hat{\sigma}^{i}_{\max}}}\wedge\sqrt{\chi n_{i}\epsilon}\Big{)}.

Our assertion follows since $|\sigma_{i}-\hat{\sigma}_{i}|\leq\epsilon$ for every $i$ by Weyl's inequality, which gives $\sup_{1\leq i\leq\hat{d}}|\sqrt{\sigma_{i}}-\sqrt{\hat{\sigma}_{i}}|\leq\frac{\epsilon}{\sqrt{\hat{\sigma}_{\hat{d}}}}\wedge\sqrt{\epsilon}$.

Proposition 2.

Let 0,,=UΣV,^0,d^,d^=U^Σ^V^\mathcal{H}_{0,\infty,\infty}=U\Sigma V^{\top},\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}=\hat{U}\hat{\Sigma}\hat{V}^{\top} and

0,,^0,d^,d^ϵ.||\mathcal{H}_{0,\infty,\infty}-\hat{\mathcal{H}}_{0,\hat{d},\hat{d}}||\leq\epsilon.

Let Σ^\hat{\Sigma} be arranged into blocks of singular values such that in each block ii we have

supjσ^jiσ^j+1iχϵ\sup_{j}\hat{\sigma}^{i}_{j}-\hat{\sigma}^{i}_{j+1}\leq\chi\epsilon

for some χ2\chi\geq 2, i.e.,

Σ^=[Λ1000Λ20000Λl]\hat{\Sigma}=\begin{bmatrix}\Lambda_{1}&0&\ldots&0\\ 0&\Lambda_{2}&\ldots&0\\ \vdots&\vdots&\ddots&0\\ 0&0&\ldots&\Lambda_{l}\\ \end{bmatrix}

where Λi\Lambda_{i} are diagonal matrices and σ^ji\hat{\sigma}^{i}_{j} is the jthj^{th} singular value in the block Λi\Lambda_{i}. Then there exists an orthogonal transformation, QQ, such that

max(C^C2,B^B2)\displaystyle\max{\left(||\hat{C}-C||_{2},||\hat{B}-B||_{2}\right)} 2ϵσ^1/ζn12+σ^n1+1/ζn22++σ^i=1l1ni+1/ζnl2\displaystyle\leq 2\epsilon\sqrt{\hat{\sigma}_{1}/\zeta_{n_{1}}^{2}+\hat{\sigma}_{n_{1}+1}/\zeta_{n_{2}}^{2}+\ldots+\hat{\sigma}_{\sum_{i=1}^{l-1}n_{i}+1}/\zeta_{n_{l}}^{2}}
+2sup1ilσ^maxiσ^mini+ϵσ^d^ϵ=ζ,\displaystyle+2\sup_{1\leq i\leq l}\sqrt{\hat{\sigma}^{i}_{\max}}-\sqrt{\hat{\sigma}^{i}_{\min}}+\frac{\epsilon}{\sqrt{\hat{\sigma}_{\hat{d}}}}\wedge\sqrt{\epsilon}=\zeta,
AA^2\displaystyle||A-\hat{A}||_{2} 4γζ/σ^d^.\displaystyle\leq 4\gamma\cdot\zeta/\sqrt{\hat{\sigma}_{\hat{d}}}.

Here sup1ilσ^maxiσ^miniχσ^maxiϵd^χd^ϵ\sup_{1\leq i\leq l}\sqrt{\hat{\sigma}^{i}_{\max}}-\sqrt{\hat{\sigma}^{i}_{\min}}\leq\frac{\chi}{\sqrt{\hat{\sigma}^{i}_{\max}}}\epsilon\hat{d}\wedge\sqrt{\chi\hat{d}\epsilon} and

ζni=min(σ^minni1σ^maxni,σ^minniσ^maxni+1)\zeta_{n_{i}}=\min{({\hat{\sigma}}^{n_{i-1}}_{\min}-{\hat{\sigma}}^{n_{i}}_{\max},{\hat{\sigma}}^{n_{i}}_{\min}-{\hat{\sigma}}^{n_{i+1}}_{\max})}

for 1<i<l1<i<l, ζn1=σ^minn1σ^maxn2\zeta_{n_{1}}={\hat{\sigma}}^{n_{1}}_{\min}-{\hat{\sigma}}^{n_{2}}_{\max} and ζnl=min(σ^minnl1σ^maxnl,σ^minnl)\zeta_{n_{l}}=\min{({\hat{\sigma}}^{n_{l-1}}_{\min}-{\hat{\sigma}}^{n_{l}}_{\max},{\hat{\sigma}}^{n_{l}}_{\min})}.

Proof 14.2.

The proof follows because all parameters are equivalent up to an orthogonal transformation (see the discussion preceding Proposition 2). We then apply Propositions 4 and 5.
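To make the statement concrete, the sketch below shows one standard Ho-Kalman-style realization step that produces $(\hat{C},\hat{A},\hat{B})$ from the truncated SVD of an estimated block-Hankel matrix. It is only illustrative: the variable names and the least-squares shift step used for $\hat{A}$ are our choices and need not coincide exactly with the construction analyzed above; we assume H_hat is a $(p\hat{d})\times(m\hat{d})$ block-Hankel estimate.

\begin{verbatim}
import numpy as np

def realize_from_hankel(H_hat, p, m, order):
    # Truncated SVD of the estimated block-Hankel matrix.
    U, s, Vt = np.linalg.svd(H_hat, full_matrices=False)
    U, s, Vt = U[:, :order], s[:order], Vt[:order, :]

    O = U * np.sqrt(s)                # estimated observability matrix, U Sigma^{1/2}
    R = np.sqrt(s)[:, None] * Vt      # estimated controllability matrix, Sigma^{1/2} V^T

    C_hat = O[:p, :]                  # first block row of O
    B_hat = R[:, :m]                  # first block column of R

    # A_hat from the shift structure of O: O[:-p, :] @ A  ~=  O[p:, :].
    A_hat, *_ = np.linalg.lstsq(O[:-p, :], O[p:, :], rcond=None)
    return C_hat, A_hat, B_hat
\end{verbatim}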

15 Order Estimation Lower Bound

Lemma 15.1 (Theorem 4.21 in Boucheron et al. (2013)).

Let {i}i=0N\{\mathbb{P}_{i}\}_{i=0}^{N} be probability laws over (Σ,𝒜)(\Sigma,\mathcal{A}) and let {Ai𝒜}i=0N\{A_{i}\in\mathcal{A}\}_{i=0}^{N} be disjoint events. If a=mini=0,,Ni(Ai)1/(N+1)a=\min_{i=0,\ldots,N}\mathbb{P}_{i}(A_{i})\geq 1/(N+1),

aalog((Na1a))+(1a)log((1a11aN))1Ni=1NKL(Pi||P0)\displaystyle a\leq a\log{\Big{(}\frac{Na}{1-a}\Big{)}}+(1-a)\log{\Big{(}\frac{1-a}{1-\frac{1-a}{N}}\Big{)}}\leq\frac{1}{N}\sum_{i=1}^{N}KL(P_{i}||P_{0}) (69)
Lemma 15.2 (Le Cam’s Method).

Let $P_{0},P_{1}$ be two probability laws. Then, for any estimator $\hat{M}$ of the true model $M$,

supθ{0,1}θ[MM^]121212KL(P0||P1)\displaystyle\sup_{\theta\in\{0,1\}}\mathbb{P}_{\theta}[M\neq\hat{M}]\geq\frac{1}{2}-\frac{1}{2}\sqrt{\frac{1}{2}KL(P_{0}||P_{1})}
Proposition 1.

Let ${\mathcal{N}}_{0},{\mathcal{N}}_{1}$ be two multivariate Gaussians with means $\mu_{0}\in\mathbb{R}^{T},\mu_{1}\in\mathbb{R}^{T}$ and covariance matrices $\Sigma_{0}\in\mathbb{R}^{T\times T},\Sigma_{1}\in\mathbb{R}^{T\times T}$, respectively. Then $\text{KL}({\mathcal{N}}_{0},{\mathcal{N}}_{1})=\frac{1}{2}\Big{(}\text{tr}(\Sigma_{1}^{-1}\Sigma_{0})-T+\log{\frac{\text{det}(\Sigma_{1})}{\text{det}(\Sigma_{0})}}+\mathbb{E}_{\mu_{1},\mu_{0}}[(\mu_{1}-\mu_{0})^{\top}\Sigma_{1}^{-1}(\mu_{1}-\mu_{0})]\Big{)}$.
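For deterministic means the expectation in Proposition 1 reduces to a quadratic form, and the expression can be evaluated directly; the sketch below (the function name is ours) is a plain numerical transcription of the formula. In the application that follows, the means $\mu_{i}=\Pi_{i}U$ may depend on the inputs, which is why the expectation is retained in the statement.

\begin{verbatim}
import numpy as np

def gaussian_kl(mu0, Sigma0, mu1, Sigma1):
    # KL(N_0 || N_1) between multivariate Gaussians (Proposition 1, fixed means).
    T = mu0.shape[0]
    Sigma1_inv = np.linalg.inv(Sigma1)
    _, logdet0 = np.linalg.slogdet(Sigma0)
    _, logdet1 = np.linalg.slogdet(Sigma1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(Sigma1_inv @ Sigma0) - T
                  + (logdet1 - logdet0)
                  + diff @ Sigma1_inv @ diff)
\end{verbatim}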

In this section we prove a lower bound on the finite time error for model approximation. In systems theory, subspace-based methods are useful for estimating the true system parameters. Intuitively, it should be harder to correctly estimate the subspace corresponding to the lower Hankel singular values, or “energy”, due to the presence of noise. However, owing to the strong structural constraints on the Hankel matrix, finding a minimax lower bound is a much harder proposition for LTI systems. Specifically, it is not clear whether standard subspace identification lower bounds yield reasonable estimates in a structured and non–i.i.d. setting such as ours. To alleviate some of the technical difficulties that arise in obtaining the lower bounds, we focus on a small set of LTI systems parametrized by a single number $\zeta$. Consider the following canonical-form LTI systems of order $1$ and $2$, respectively, with $m=p=1$, and let $R$ be the noise-to-signal ratio bound.

A0\displaystyle A_{0} =[010000ζ00],A1=A0,B0=[00β/R],B1=[0β/Rβ/R],C0=[00βR],C1=C0\displaystyle=\begin{bmatrix}0&1&0\\ 0&0&0\\ \zeta&0&0\end{bmatrix},A_{1}=A_{0},B_{0}=\begin{bmatrix}0\\ 0\\ \sqrt{\beta}/R\end{bmatrix},B_{1}=\begin{bmatrix}0\\ \sqrt{\beta}/R\\ \sqrt{\beta}/R\end{bmatrix},C_{0}=\begin{bmatrix}0&0&\sqrt{\beta}R\end{bmatrix},C_{1}=C_{0} (70)

A0,A1A_{0},A_{1} are Schur stable whenever |ζ|<1|\zeta|<1.

ζ,0\displaystyle\mathcal{H}_{\zeta,0} =β[1000000000000000000000000]\displaystyle=\beta\begin{bmatrix}1&0&0&0&0&\ldots\\ 0&0&0&0&0&\ldots\\ 0&0&0&0&0&\ldots\\ 0&0&0&0&0&\ldots\\ 0&0&0&0&0&\ldots\\ \vdots&\vdots&\vdots&\vdots&\vdots&\vdots\end{bmatrix}
ζ,1\displaystyle\mathcal{H}_{\zeta,1} =β[10ζ000ζ000ζ00000000000000]\displaystyle=\beta\begin{bmatrix}1&0&\zeta&0&0&\ldots\\ 0&\zeta&0&0&0&\ldots\\ \zeta&0&0&0&0&\ldots\\ 0&0&0&0&0&\ldots\\ 0&0&0&0&0&\ldots\\ \vdots&\vdots&\vdots&\vdots&\vdots&\vdots\end{bmatrix} (71)

Here $\mathcal{H}_{\zeta,0},\mathcal{H}_{\zeta,1}$ are the Hankel matrices generated by $(C_{0},A_{0},B_{0})$ and $(C_{1},A_{1},B_{1})$, respectively. It is easy to check that for $\mathcal{H}_{\zeta,1}$ we have $\frac{1}{\zeta}\leq\frac{\sigma_{1}}{\sigma_{2}}\leq\frac{1+\zeta}{\zeta}$, where $\sigma_{i}$ are the Hankel singular values. Further, the rank of $\mathcal{H}_{\zeta,0}$ is $1$ and that of $\mathcal{H}_{\zeta,1}$ is at least $2$. Also, $\frac{||\mathcal{T}\mathcal{O}_{0,\infty}((C_{i},A_{i},B_{i}))||_{2}}{||\mathcal{T}_{0,\infty}((C_{i},A_{i},B_{i}))||_{2}}\leq R$.
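These properties are easy to verify numerically on finite truncations of the Hankel matrices; the sketch below does so for illustrative (arbitrarily chosen) values of $\beta$, $R$, and $\zeta$.

\begin{verbatim}
import numpy as np

def hankel_from_markov(C, A, B, d=20):
    # d x d Hankel matrix of the scalar Markov parameters C A^k B, k = 0, 1, ...
    h = [(C @ np.linalg.matrix_power(A, k) @ B).item() for k in range(2 * d - 1)]
    return np.array([[h[i + j] for j in range(d)] for i in range(d)])

beta, R, zeta = 1.0, 10.0, 0.5        # illustrative values with |zeta| < 1
A = np.array([[0, 1, 0], [0, 0, 0], [zeta, 0, 0]], dtype=float)
B0 = np.array([[0.0], [0.0], [np.sqrt(beta) / R]])
B1 = np.array([[0.0], [np.sqrt(beta) / R], [np.sqrt(beta) / R]])
C = np.array([[0.0, 0.0, np.sqrt(beta) * R]])

H0 = hankel_from_markov(C, A, B0)
H1 = hankel_from_markov(C, A, B1)
s1 = np.linalg.svd(H1, compute_uv=False)
print(np.linalg.matrix_rank(H0))      # rank 1
print(np.linalg.matrix_rank(H1))      # at least 2
print(s1[0] / s1[1])                  # lies in [1/zeta, (1+zeta)/zeta]
\end{verbatim}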

This construction will be key to showing that identification of a particular rank realization depends on the condition number of the Hankel matrix. An alternate representation of the input–output behavior is

[yTyT1y1]\displaystyle\begin{bmatrix}y_{T}\\ y_{T-1}\\ \vdots\\ y_{1}\end{bmatrix} =[CBCAiBCAiT1B0CBCAiT2B00CB]Πi[uT+1uTu2]U\displaystyle=\underbrace{\begin{bmatrix}CB&CA_{i}B&\ldots&CA_{i}^{T-1}B\\ 0&CB&\ldots&CA_{i}^{T-2}B\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\ldots&CB\end{bmatrix}}_{\Pi_{i}}\underbrace{\begin{bmatrix}u_{T+1}\\ u_{T}\\ \vdots\\ u_{2}\end{bmatrix}}_{U}
+[CCAiCAiT10CCAiT200C]Oi[ηT+1ηTη2]+[wTwT1w1]\displaystyle+\underbrace{\begin{bmatrix}C&CA_{i}&\ldots&CA_{i}^{T-1}\\ 0&C&\ldots&CA_{i}^{T-2}\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\ldots&C\end{bmatrix}}_{O_{i}}\begin{bmatrix}\eta_{T+1}\\ \eta_{T}\\ \vdots\\ \eta_{2}\end{bmatrix}+\begin{bmatrix}w_{T}\\ w_{T-1}\\ \vdots\\ w_{1}\end{bmatrix} (72)

where $A_{i}\in\{A_{0},A_{1}\}$. We will prove this result for a general class of inputs, i.e., active inputs. We then follow the same steps as in the proof of Theorem 2 in Tu et al. (2018b).
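To connect Eq. (72) with the Gaussian likelihoods used next, the following illustrative sketch builds the upper-triangular block-Toeplitz maps $\Pi_{i}$ and $O_{i}$ for the SISO case $p=m=1$; the time ordering of the stacked vectors is generic here, and the function name is our own. Conditional on the inputs, the stacked output is then Gaussian with mean $\Pi_{i}U$ and covariance $O_{i}O_{i}^{\top}+I$, which is exactly the form fed into Proposition 1 below.

\begin{verbatim}
import numpy as np

def io_maps(C, A, B, T):
    # Upper-triangular block-Toeplitz maps: Pi sends stacked inputs to outputs,
    # O sends stacked process noise to outputs (SISO specialization of Eq. (72)).
    n = A.shape[0]
    markov_u = [(C @ np.linalg.matrix_power(A, k) @ B).item() for k in range(T)]
    markov_e = [(C @ np.linalg.matrix_power(A, k)).ravel() for k in range(T)]
    Pi = np.zeros((T, T))
    O = np.zeros((T, T * n))
    for i in range(T):
        for j in range(i, T):
            Pi[i, j] = markov_u[j - i]
            O[i, j * n:(j + 1) * n] = markov_e[j - i]
    return Pi, O
\end{verbatim}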

KL(P0||P1)\displaystyle\text{KL}(P_{0}||P_{1}) =𝔼P0[log(t=1Tγt(ut|{ul,yl}l=1t1)P0(yt|{ul}l=1t1)γt(ut|{ul,yl}l=1t1)P1(yt|{ul}l=1t1))]\displaystyle=\mathbb{E}_{P_{0}}\Bigg{[}\log{\prod_{t=1}^{T}\frac{\gamma_{t}(u_{t}|\{u_{l},y_{l}\}_{l=1}^{t-1})P_{0}(y_{t}|\{u_{l}\}_{l=1}^{t-1})}{\gamma_{t}(u_{t}|\{u_{l},y_{l}\}_{l=1}^{t-1})P_{1}(y_{t}|\{u_{l}\}_{l=1}^{t-1})}}\Bigg{]}
=𝔼P0[log(t=1TP0(yt|{ul}l=1t1)P1(yt|{ul}l=1t1))]\displaystyle=\mathbb{E}_{P_{0}}\Bigg{[}\log{\prod_{t=1}^{T}\frac{P_{0}(y_{t}|\{u_{l}\}_{l=1}^{t-1})}{P_{1}(y_{t}|\{u_{l}\}_{l=1}^{t-1})}}\Bigg{]}

Here $\gamma_{t}(\cdot|\cdot)$ is the active rule for choosing $u_{t}$ from past data. From Eq. (72) it is clear that, conditional on $\{u_{l}\}_{l=1}^{T}$, $\{y_{l}\}_{l=1}^{T}$ is Gaussian with mean $\Pi_{i}U$. We then use Birgé's inequality (Lemma 15.1). In our case $\Sigma_{0}=O_{0}O_{0}^{\top}+I,\Sigma_{1}=O_{1}O_{1}^{\top}+I$ where $O_{i}$ is given in Eq. (72). We apply a combination of Lemma 15.1 and Proposition 1, and assume the $\eta_{i}$ are i.i.d. Gaussian, to obtain the desired result. Note that $O_{1}=O_{0}$ but $\Pi_{1}\neq\Pi_{0}$. Therefore, from Proposition 1, $KL({\mathcal{N}}_{0},{\mathcal{N}}_{1})=\frac{1}{2}\mathbb{E}_{\mu_{1},\mu_{0}}[(\mu_{1}-\mu_{0})^{\top}\Sigma_{1}^{-1}(\mu_{1}-\mu_{0})]\leq T\frac{\zeta^{2}}{R^{2}}$ where $\mu_{i}=\Pi_{i}U$. For any $\delta\in(0,1/4)$, set $a=1-\delta$ in Lemma 15.1; then, whenever

δlog((δ1δ))+(1δ)log((1δδ))Tζ2R2\displaystyle\delta\log{\Big{(}\frac{\delta}{1-\delta}\Big{)}}+(1-\delta)\log{\Big{(}\frac{1-\delta}{\delta}\Big{)}}\geq\frac{T\zeta^{2}}{R^{2}} (73)

we have $\sup_{i\neq j}\mathbb{P}_{A_{i}}(A_{j})\geq\delta$. For $\delta\in[1/4,1)$ we use Le Cam's method (Lemma 15.2) and show that if $8\delta^{2}\geq\frac{T\zeta^{2}}{R^{2}}$ then $\sup_{i\neq j}\mathbb{P}_{A_{i}}(A_{j})\geq\delta$. Since $\delta^{2}\geq c\log{\frac{1}{\delta}}$ when $\delta\in[1/4,1)$ for an absolute constant $c$, our assertion holds.
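As a simple numerical illustration of the final step (under the same assumptions as above, with all constants chosen arbitrarily), the threshold in Eq. (73) can be evaluated directly: whenever $T\zeta^{2}/R^{2}$ falls below it, no procedure can distinguish the two models with failure probability smaller than $\delta$.

\begin{verbatim}
import numpy as np

def birge_threshold(delta, N=1):
    # Left-hand side of Eq. (73): kl(1 - delta, delta), i.e. Birge's bound with
    # N = 1 and a = 1 - delta.
    a = 1.0 - delta
    return a * np.log(N * a / (1 - a)) + (1 - a) * np.log((1 - a) / (1 - (1 - a) / N))

R, zeta, delta = 10.0, 0.1, 0.1       # illustrative values
T_required = birge_threshold(delta) * R**2 / zeta**2
print(T_required)   # data lengths T below this cannot resolve the model order at level delta
\end{verbatim}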