
Adversarial Tradeoffs in Robust State Estimation

Thomas T.C.K. Zhang*, Bruce D. Lee*, Hamed Hassani, and Nikolai Matni (*equal contribution)
(Department of Electrical and Systems Engineering, University of Pennsylvania)
Abstract

Adversarially robust training has been shown to reduce the susceptibility of learned models to targeted input data perturbations. However, it has also been observed that such adversarially robust models suffer a degradation in accuracy when applied to unperturbed data sets, leading to a robustness-accuracy tradeoff. Inspired by recent progress in the adversarial machine learning literature that characterizes such tradeoffs in simple settings, we develop tools to quantitatively study the performance-robustness tradeoff between nominal and robust state estimation. In particular, we define and analyze a novel adversarially robust Kalman Filtering problem. We show that, in contrast to most problem instances in adversarial machine learning, we can precisely derive the adversarial perturbation in the Kalman Filtering setting. We provide an algorithm to find this perturbation given data realizations, and develop upper and lower bounds on the adversarial state estimation error in terms of the standard (non-adversarial) estimation error and the spectral properties of the resulting observer. Through these results, we show a natural connection between a filter's robustness to adversarial perturbation and underlying control theoretic properties of the system being observed, namely the spectral properties of its observability gramian.

1 Introduction

It has been demonstrated across various application areas that contemporary learning-based models, despite their impressive nominal performance, can be extremely susceptible to small, adversarially designed input perturbations (Carlini and Wagner, 2016; Goodfellow et al., 2014; Szegedy et al., 2013; Huang et al., 2017). In order to mitigate the effects of such attacks, various adversarially robust training algorithms (Carlini and Wagner, 2016, 2017; Madry et al., 2017; Xie et al., 2020; Deka et al., 2020) have been developed. However, it was soon noticed that while adversarial training can improve model robustness, it often comes with a corresponding decrease in accuracy on nominal (unperturbed) data. Further, various simplified theoretical models (Tsipras et al., 2018; Zhang et al., 2019; Nakkiran, 2019; Raghunathan et al., 2019; Chen et al., 2020; Javanmard et al., 2020) have been used to explain this phenomenon, and to argue that such robustness-accuracy tradeoffs are unavoidable.

In this paper, we extend the study of such robustness-accuracy tradeoffs to data generated by a dynamical system via the setting of adversarially robust Kalman Filtering. Adversarial robustness is an appealing model of robust filtering, as it captures measurement disturbances that are composed of both stochastic and worst-case components. Motivated by applications of adversarial robustness in the reinforcement learning literature (Lutter et al., 2021; Pinto et al., 2017; Mandlekar et al., 2017), we provide the first theoretical analysis of robustness-accuracy tradeoffs for state estimation of a dynamical system, and in doing so establish connections to natural control theoretic properties of the underlying system.

Our specific contributions can be summarized as follows:

  • We propose a simple and computationally efficient algorithm that provably finds optimal worst-case $\ell^{2}$ norm-bounded adversarial perturbations of the measurements for a given observer and trajectory data. This allows us to efficiently compute and explore the Pareto-optimal robustness-accuracy tradeoff curve.

  • We analyze the adversarially robust Kalman Filtering problem, and show that upper and lower bounds on the gap between the adversarial and standard (unperturbed) state estimation error can be controlled in terms of the spectral properties of the observability gramian (Zhou and Doyle, 1998) of the underlying system.

  • As an intermediate step to deriving the aforementioned bounds, we bound the gap between adversarial and standard (unperturbed) risks for general linear inverse problems in terms of the spectral properties of a given linear model. We also show that our bounds are tight in the one-dimensional setting, recovering the results of Javanmard et al. (2020), and for matrices with full column rank and orthogonal columns. These results may be of independent interest.

  • We empirically demonstrate through numerical simulations that our results predict robustness-accuracy tradeoffs in Kalman Filtering as a function of underlying spectral properties of the observability gramian.

The rest of this paper is organized as follows: in Section 2, we pose the adversarially robust Kalman Filtering problem. In Section 3, we first present tight upper and lower bounds on the adversarial risk in a general linear inverse problem, then refine these bounds for the Kalman filtering setting. The results reveal that for the setting where data is generated by a dynamical system, robustness-accuracy tradeoffs are dictated by natural control theoretic properties of the underlying system (namely, the observability gramian). In Section 4, we provide empirical evidence to support the trends predicted by our bounds, and demonstrate the efficacy of adversarially robust Kalman Filtering against sensor drift. We end with conclusions and a discussion of future work in Section 5.

1.1 Related Work

Our work makes connections between adversarial robustness and robust estimation and control. We now provide a brief overview of work related to ours from these areas.

Robustness-accuracy tradeoffs: We draw inspiration from recent work offering theoretical characterizations of robustness-accuracy tradeoffs: Tsipras et al. (2018) and Zhang et al. (2019) posit that high standard accuracy is fundamentally at odds with high robust accuracy by considering classification problems, whereas Nakkiran (2019) suggests an alternative explanation that classifiers that are simultaneously robust and accurate are complex, and may not be contained in current function classes. However, Raghunathan et al. (2019) shows that the tradeoff is not due to optimization or representation issues by showing that such tradeoffs exist even for a problem with a convex loss where the optimal predictor achieves 100% standard and robust accuracy. In contrast to previous work, we provide sharp and interpretable characterizations of the robustness-accuracy tradeoffs that may arise in inverse problems, albeit restricted to linear models. Most closely related to our work, Javanmard et al. (2020) derive a formula for the exact tradeoff between standard and robust accuracy in the linear regression setting. We derive similar results for a matrix-valued linear inverse problem, and apply these tools to the adversarially robust Kalman Filtering problem, wherein data is generated by a dynamical system.

Robust estimation and control: Robustness in estimation and control has traditionally been studied from a worst-case induced gain perspective (Hassibi et al., 1999; Zhou and Doyle, 1998). When perturbations are restricted to be $\ell^{2}$-bounded, this gives rise to $\mathcal{H}_{\infty}$ estimation and control problems. Although widely known, and celebrated for their applications in robust control, $\mathcal{H}_{\infty}$-based methods are often overly conservative. This conservatism can be reduced by using mixed $\mathcal{H}_{2}/\mathcal{H}_{\infty}$ methods (Khargonekar et al., 1996), which blend Gaussian and worst-case disturbance assumptions. While such an approach is related to the adversarially robust Kalman Filtering problem that we pose, we note that mixed $\mathcal{H}_{2}/\mathcal{H}_{\infty}$ decouples the worst-case and stochastic inputs during design, leading to a fundamentally different tradeoff. On the other hand, our method considers the stochastic and worst-case components jointly, finding the optimal filter robust to $\ell^{2}$-bounded disturbances given the realizations of stochastic noise. We note the decoupling of worst-case and stochastic inputs allows mixed $\mathcal{H}_{2}/\mathcal{H}_{\infty}$ methods to guarantee stability under dynamic uncertainties. We leave similar guarantees for the filtering problem that we pose, and further characterizing connections between traditional and adversarial robustness, to future work. More recently, Al Makdah et al. (2020) have considered the robustness-accuracy tradeoff in data-driven perception-based control. However, the adversary in Al Makdah et al. (2020) perturbs the covariance of the noise distribution, which is assumed to remain Gaussian, whereas our adversary additively attacks each measurement, which is more aligned to the perturbations considered in machine learning and robust control contexts. Additionally, Al Makdah et al. (2020) does not quantitatively analyze the severity of tradeoffs, but rather proves the existence of tradeoffs. Our prior work Lee et al. (2022) studies analogous tradeoffs to the ones in this paper arising in the setting of adversarially robust LQR, and bounds the severity of these tradeoffs in terms of the spectral properties of the controllability gramian.

2 Adversarially Robust Kalman Filtering

We consider a modification to the standard Kalman Filtering problem to incorporate adversarial robustness. We then bound the inflation of the state estimation error caused by the adversary, with the goal of relating control theoretic properties of the underlying linear dynamical system to the robustness-accuracy tradeoffs that it induces.

2.1 State Estimation and Observability

The Kalman filter is designed for the setting where the underlying dynamical system is linear, and disturbances are Gaussian. In particular, consider a linear time-invariant (LTI) autonomous system with state and measurement disturbances: let $x_{t}\in\mathbb{R}^{n}$ be the system state, $w_{t}\in\mathbb{R}^{n}$ the process noise, $y_{t}\in\mathbb{R}^{p}$ the measurement, and $v_{t}\in\mathbb{R}^{p}$ the measurement noise. The initial condition, process noise, and measurement noise are assumed to be i.i.d. zero-mean Gaussians: $x_{0}\sim\mathcal{N}(0,\Sigma_{0})$, $w_{t}\overset{\mathrm{i.i.d.}}{\sim}\mathcal{N}(0,\Sigma_{w})$, $v_{t}\overset{\mathrm{i.i.d.}}{\sim}\mathcal{N}(0,\Sigma_{v})$. The LTI system is then defined by:

$$\begin{aligned} x_{t+1} &= Ax_{t}+w_{t}, \\ y_{t} &= Cx_{t}+v_{t}. \end{aligned} \tag{1}$$

Finite horizon state estimation determines an estimate for the state of the system at time $k$ given some sequence of measurements $y_{0},\dots,y_{N}$. This problem encompasses smoothing ($k<N$), filtering ($k=N$), and prediction ($k>N$). When the measurement and process noise satisfy the assumptions above, the optimal state estimator is the celebrated Kalman filter (or smoother/predictor), which produces state estimates that are a linear function of the observations. State-space representations for the Kalman filter (see Appendix A) also exist (Hassibi et al., 1999), but for our purposes it is more convenient to view it as a linear map. The optimal estimate $\hat{x}_{k}$ of the state $x_{k}$ at time $k$ can therefore be written as $\hat{x}_{k}:=LY_{N}$, where $L\in\mathbb{R}^{n\times p(N+1)}$ is some matrix and $Y_{N}$ is a vector of stacked observations

$$Y_{N}:=\begin{bmatrix}y_{0}&\ldots&y_{N}\end{bmatrix}^{\top}.$$

We similarly define the stacked process and measurement noise vectors as

$$W_{N}:=\begin{bmatrix}w_{0}&\ldots&w_{N-1}\end{bmatrix}^{\top}\quad\text{and}\quad V_{N}:=\begin{bmatrix}v_{0}&\ldots&v_{N}\end{bmatrix}^{\top}.$$

Furthermore, suppose $k\leq N$ and let

$$\mathcal{O}_{N}=\begin{bmatrix}C\\ CA\\ \vdots\\ CA^{N}\end{bmatrix},\quad \tau_{N}=\begin{bmatrix}0&&&&\\ C&0&&&\\ CA&C&&&\\ \vdots&&\ddots&0&\\ CA^{N-1}&\ldots&&C&0\end{bmatrix},\quad \Gamma_{k}=\begin{bmatrix}A^{k-1}&A^{k-2}&\ldots&I&0&\ldots&0\end{bmatrix}$$

so that $Y_{N}=\mathcal{O}_{N}x_{0}+\tau_{N}W_{N}+V_{N}$ and $x_{k}=A^{k}x_{0}+\Gamma_{k}W_{N}$. Here $\mathcal{O}_{N}$ is the $N$-step observability matrix, which is a quantity of interest in our analysis. Recall that a system of the form (1) is observable if and only if the observability matrix $\mathcal{O}_{n-1}$ has rank $n$.
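The stacked representation above can be checked numerically. The following is a minimal NumPy sketch, not code from the paper: the helper name `stacked_matrices`, the example system, and the random seed are illustrative assumptions. It also assumes that $\tau_{N}$ has $N$ block columns, so that it conforms with $W_{N}=(w_{0},\dots,w_{N-1})$.

```python
import numpy as np

def stacked_matrices(A, C, N, k):
    """Return (O_N, tau_N, Gamma_k) for x_{t+1} = A x_t + w_t, y_t = C x_t + v_t."""
    n, p = A.shape[0], C.shape[0]
    powers = [np.linalg.matrix_power(A, t) for t in range(N + 1)]
    # Observability matrix: block row t is C A^t for t = 0, ..., N.
    O = np.vstack([C @ powers[t] for t in range(N + 1)])
    # tau_N: block (t, s) is C A^(t-1-s) when s < t, since w_s only reaches y_t for t > s.
    tau = np.zeros((p * (N + 1), n * N))
    for t in range(1, N + 1):
        for s in range(t):
            tau[p * t:p * (t + 1), n * s:n * (s + 1)] = C @ powers[t - 1 - s]
    # Gamma_k: block s is A^(k-1-s) when s < k, zero otherwise.
    Gamma = np.zeros((n, n * N))
    for s in range(k):
        Gamma[:, n * s:n * (s + 1)] = powers[k - 1 - s]
    return O, tau, Gamma

# Illustrative system (an assumption of this sketch): double integrator, position measured.
rng = np.random.default_rng(0)
A = np.array([[1.0, 1.0], [0.0, 1.0]])
C = np.array([[1.0, 0.0]])
N, k = 4, 3
O, tau, Gamma = stacked_matrices(A, C, N, k)

# Simulate one trajectory step by step.
x0 = rng.standard_normal(2)
W = rng.standard_normal((N, 2))       # rows are w_0, ..., w_{N-1}
V = rng.standard_normal((N + 1, 1))   # rows are v_0, ..., v_N
x, xs, ys = x0, [], []
for t in range(N + 1):
    xs.append(x)
    ys.append(C @ x + V[t])
    if t < N:
        x = A @ x + W[t]

# Compare the simulation with the stacked formulas.
Y_sim = np.concatenate(ys)
Y_stacked = O @ x0 + tau @ W.ravel() + V.ravel()
xk_stacked = np.linalg.matrix_power(A, k) @ x0 + Gamma @ W.ravel()
print(np.allclose(Y_sim, Y_stacked), np.allclose(xs[k], xk_stacked))
```

Building the matrices once and reusing them for many noise realizations is what makes the linear-map view of the estimator convenient in the rest of the section.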

Throughout the remainder of the paper we make the following assumption:

Assumption 2.1.

System (1) is observable and $N\geq n-1$.

As stated, observability is a binary notion that determines whether consistent state estimation is possible. However, observability does not capture the conditioning of the problem defining the Kalman filter.

A more refined, non-binary notion of observability can be defined in terms of the observability gramian of a system.

Definition 2.1.

The $N$-step observability gramian is defined as $W_{o}(N):=\mathcal{O}_{N}^{\top}\mathcal{O}_{N}$. If the spectral radius of $A$ is less than one, then taking the limit as $N\to\infty$ results in the observability gramian $W_{o}(\infty)=\sum_{t=0}^{\infty}\left(A^{t}\right)^{\top}C^{\top}CA^{t}$.

The observability gramian provides significantly more information about the difficulty of state estimation than the rank condition on the observability matrix. In particular, the ellipsoid $\left\{x\,|\,x^{\top}W_{o}(\infty)x\leq 1\right\}$ contains the initial states $x$ that lead to measurement signals with $\ell^{2}$ norm bounded by $1$ in the absence of process and measurement noise. To see this, let $x_{0}=x$. Then we have $x_{0}^{\top}W_{o}(\infty)x_{0}=\sum_{t=0}^{\infty}x_{0}^{\top}\left(A^{t}\right)^{\top}C^{\top}CA^{t}x_{0}=\sum_{t=0}^{\infty}x_{t}^{\top}C^{\top}Cx_{t}=\sum_{t=0}^{\infty}\left\|y_{t}\right\|_{2}^{2}$. As such, small eigenvalues of the observability gramian imply that a large subset of the state space leads to relatively small impacts on future measurements. This makes it difficult to use measurements to distinguish states in this region in the presence of process and measurement noise, requiring high-gain estimators. This in turn suggests that such estimators may be more susceptible to small adversarial perturbations.
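The output-energy interpretation of the gramian can be illustrated with a short NumPy sketch. The function name `observability_gramian` and the example matrices are assumptions made for illustration. The sketch uses the finite-horizon gramian $W_{o}(N)$, for which the same identity holds with the sum truncated at $t=N$.

```python
import numpy as np

def observability_gramian(A, C, N):
    """N-step gramian W_o(N) = sum_{t=0}^{N} (A^t)^T C^T C A^t."""
    n = A.shape[0]
    W = np.zeros((n, n))
    At = np.eye(n)
    for _ in range(N + 1):
        CAt = C @ At
        W += CAt.T @ CAt
        At = A @ At
    return W

# Illustrative stable, observable pair (an assumption of this sketch).
A = np.array([[0.9, 0.5], [0.0, 0.2]])
C = np.array([[1.0, 0.0]])
N = 50
W = observability_gramian(A, C, N)

# Noise-free output energy of the trajectory started at x0.
x0 = np.array([0.3, -1.0])
x, energy = x0.copy(), 0.0
for _ in range(N + 1):
    y = C @ x
    energy += float(y @ y)
    x = A @ x

# The eigenvalues quantify how strongly each state direction shows up in the outputs.
eigvals = np.linalg.eigvalsh(W)
print(x0 @ W @ x0, energy, eigvals)
```

A direction associated with a small eigenvalue of `W` produces little output energy, which is the sense in which the gramian refines the binary rank test.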

2.2 Kalman Filtering and Smoothing

We begin by reviewing relevant results from standard Kalman Filtering and Smoothing.

Standard State Estimation

Under Assumption 2.1, we define the minimum mean square estimator for the state $x_{k}$ as

$$\hat{L}_{k}=\operatorname*{argmin}_{L\in\mathbb{R}^{n\times p(N+1)}}\mathbb{E}\left[\left\|x_{k}-LY_{N}\right\|_{2}^{2}\right]. \tag{2}$$

We note that the optimal solution to this problem is precisely the Kalman filter ($k=N$) or smoother ($k<N$). We explicitly solve for the minimum mean square estimator $\hat{L}_{k}$ in the following standard lemma, included for completeness.

Lemma 2.1.

Suppose $k\leq N$. The finite horizon Kalman state estimator is the solution to optimization problem (2), and is given by

$$\hat{L}_{k}=\left(A^{k}\Sigma_{0}\mathcal{O}_{N}^{\top}+\Gamma_{k}\Sigma_{w}\tau_{N}^{\top}\right)\cdot\left(\mathcal{O}_{N}\Sigma_{0}\mathcal{O}_{N}^{\top}+\tau_{N}\Sigma_{w}\tau_{N}^{\top}+\Sigma_{v}\right)^{-1}.$$
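As a sanity check of Lemma 2.1, the closed-form estimator can be compared against the usual recursive Kalman filter in the filtering case $k=N$. The sketch below is illustrative only: the function names, the example system, and the seed are assumptions. It also reads $\Sigma_{w}$ and $\Sigma_{v}$ inside the stacked formula as the block-diagonal covariances $I_{N}\otimes\Sigma_{w}$ and $I_{N+1}\otimes\Sigma_{v}$ of $W_{N}$ and $V_{N}$.

```python
import numpy as np

def stacked_matrices(A, C, N, k):
    """Return (O_N, tau_N, Gamma_k); tau_N is taken with N block columns."""
    n, p = A.shape[0], C.shape[0]
    powers = [np.linalg.matrix_power(A, t) for t in range(N + 1)]
    O = np.vstack([C @ powers[t] for t in range(N + 1)])
    tau = np.zeros((p * (N + 1), n * N))
    for t in range(1, N + 1):
        for s in range(t):
            tau[p * t:p * (t + 1), n * s:n * (s + 1)] = C @ powers[t - 1 - s]
    Gamma = np.zeros((n, n * N))
    for s in range(k):
        Gamma[:, n * s:n * (s + 1)] = powers[k - 1 - s]
    return O, tau, Gamma

def kalman_estimator(A, C, S0, Sw, Sv, N, k):
    """Batch estimator of Lemma 2.1: Cov(x_k, Y_N) Cov(Y_N)^{-1}."""
    O, tau, Gamma = stacked_matrices(A, C, N, k)
    SW = np.kron(np.eye(N), Sw)        # covariance of stacked process noise
    SV = np.kron(np.eye(N + 1), Sv)    # covariance of stacked measurement noise
    cross = np.linalg.matrix_power(A, k) @ S0 @ O.T + Gamma @ SW @ tau.T
    cov_Y = O @ S0 @ O.T + tau @ SW @ tau.T + SV
    # cross @ inv(cov_Y), using the symmetry of cov_Y.
    return np.linalg.solve(cov_Y, cross.T).T

def kalman_filter(A, C, S0, Sw, Sv, ys):
    """Recursive filter returning the estimate of x_N given y_0, ..., y_N."""
    n = A.shape[0]
    xhat, P = np.zeros(n), S0.copy()
    for t, y in enumerate(ys):
        K = P @ C.T @ np.linalg.inv(C @ P @ C.T + Sv)   # measurement update
        xhat = xhat + K @ (y - C @ xhat)
        P = (np.eye(n) - K @ C) @ P
        if t < len(ys) - 1:                              # time update
            xhat, P = A @ xhat, A @ P @ A.T + Sw
    return xhat

# Illustrative problem data (assumptions of this sketch).
A = np.array([[1.0, 1.0], [0.0, 1.0]])
C = np.array([[1.0, 0.0]])
S0, Sw, Sv = np.eye(2), np.eye(2), np.array([[1.0]])
N = 4
L_hat = kalman_estimator(A, C, S0, Sw, Sv, N, k=N)

rng = np.random.default_rng(0)
ys = rng.standard_normal((N + 1, 1))   # any measurement sequence works for this check
batch = L_hat @ ys.ravel()
recursive = kalman_filter(A, C, S0, Sw, Sv, ys)
print(batch, recursive)
```

Because both estimators are linear in the measurements, agreement on an arbitrary measurement sequence is enough to compare them; the batch form is the one used in the remainder of the paper.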

Adversarially Robust State Estimation

We now modify the standard filtering problem (2) to allow adversarial perturbations to enter through sensor measurements. We choose to restrict our attention to adversarial sensor measurements because it is a more direct analog to the traditional adversarial robustness literature, which considers perturbations to image data, and not to the image data-generating distribution (Szegedy et al., 2013; Goodfellow et al., 2014; Carlini and Wagner, 2016). In particular, for some $\varepsilon>0$, the adversarially robust state estimator is defined by

$$\hat{L}_{k}(\varepsilon):=\operatorname*{argmin}_{L\in\mathbb{R}^{n\times p(N+1)}}\mathbb{E}\left[\max_{\left\|\delta\right\|_{2}\leq\varepsilon}\left\|x_{k}-L(Y_{N}+\delta)\right\|_{2}^{2}\right]. \tag{3}$$

In contrast to the nominal state estimation problem, no closed form expression exists for the adversarially robust estimation problem, due to the inner maximization in (3). We show next that despite the non-convexity of the inner maximization problem, it can be solved efficiently. This allows us to apply stochastic gradient descent to solve for $\hat{L}_{k}(\varepsilon)$. In particular, note that the objective to the minimization problem is the expectation of a point-wise supremum of convex functions in $L$, and hence convex in $L$ itself (Boyd and Vandenberghe, 2004). Next observe that we can draw samples of $x_{0}\sim\mathcal{N}(0,\Sigma_{0})$, $W_{N}\sim\mathcal{N}(0,I_{N}\otimes\Sigma_{w})$, $V_{N}\sim\mathcal{N}(0,I_{N+1}\otimes\Sigma_{v})$, and apply the solution to the inner maximization problem to solve for realizations of $\max_{\left\|\delta\right\|_{2}\leq\varepsilon}\left\|x_{k}-L(Y_{N}+\delta)\right\|_{2}^{2}$. Taking the gradient of these realizations with respect to $L$ provides a stochastic descent direction. As the overall expression is convex in $L$, stochastic gradient descent with an appropriately decaying stepsize converges to the optimal solution (Bottou et al., 2018).
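To make the stochastic descent direction concrete, the sketch below evaluates the gradient of one realization with the maximizing perturbation held fixed (Danskin's theorem) and compares it with finite differences. To stay self-contained it assumes a scalar state, so that $L$ is a single row and the inner maximizer has a closed form; the general case needs the inner-maximization procedure of Section 2.3. All function names, dimensions, and the step size are illustrative assumptions.

```python
import numpy as np

def worst_case_delta(L, x, Y, eps):
    """Closed-form inner maximizer when the state is scalar (L has a single row)."""
    b = (x - L @ Y)[0]
    # Push L @ delta against the sign of the residual, using the full budget eps.
    return -eps * np.sign(b) * L[0] / np.linalg.norm(L[0])

def adversarial_loss(L, x, Y, eps):
    """Realization of max_{||delta|| <= eps} ||x - L (Y + delta)||^2 (scalar state)."""
    r = x - L @ (Y + worst_case_delta(L, x, Y, eps))
    return float(r @ r)

def adversarial_grad(L, x, Y, delta):
    """Gradient in L of ||x - L (Y + delta)||^2 with delta fixed at the maximizer."""
    z = Y + delta
    r = x - L @ z
    return -2.0 * np.outer(r, z)

# One synthetic realization (an assumption of this sketch, not data from the paper).
rng = np.random.default_rng(0)
m, eps = 5, 0.5
L = rng.standard_normal((1, m))
x = rng.standard_normal(1)
Y = rng.standard_normal(m)

grad = adversarial_grad(L, x, Y, worst_case_delta(L, x, Y, eps))

# Central finite differences of the max-loss itself, for comparison.
fd, h = np.zeros_like(L), 1e-6
for j in range(m):
    E = np.zeros_like(L)
    E[0, j] = h
    fd[0, j] = (adversarial_loss(L + E, x, Y, eps)
                - adversarial_loss(L - E, x, Y, eps)) / (2 * h)

# A single stochastic gradient step with a hypothetical step size.
L_next = L - 0.01 * grad
print(np.max(np.abs(grad - fd)))
```

In an actual training loop the step size would decay over iterations, and the gradient would be averaged over a batch of sampled trajectories.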

2.3 Solving the Inner Maximization

As stated earlier, no closed-form expression exists for the adversarially robust estimation problem: indeed, even in the scalar linear regression setting studied in Javanmard et al. (2020), it is characterized by a recursive relationship. Furthermore, the techniques used to derive that recursion do not extend to the multi-variable case.

To address this challenge, we show how to efficiently compute solutions to the inner maximization in (3). We observe that the maximization $\max_{\left\|\delta\right\|_{2}\leq\varepsilon}\left\|x_{k}-L(Y_{N}+\delta)\right\|_{2}^{2}$ can be expanded and re-written as the following (non-convex) quadratically-constrained quadratic maximization problem:

$$\begin{aligned}\operatorname*{maximize}_{\delta\in\mathbb{R}^{(N+1)p}}&\quad\delta^{\top}L^{\top}L\delta-2\delta^{\top}L^{\top}b\\ \operatorname{subject\ to}&\quad\delta^{\top}\delta\leq\varepsilon^{2},\end{aligned} \tag{P}$$

where we set $b:=x_{k}-LY_{N}$. Let $L=U\begin{bmatrix}\Sigma&0\end{bmatrix}V^{\top}\in\mathbb{R}^{n\times(N+1)p}$ be the full singular-value decomposition of $L$, with $U\in\mathbb{R}^{n\times n}$, $\Sigma\in\mathbb{R}^{n\times n}$, $V\in\mathbb{R}^{(N+1)p\times(N+1)p}$, and $\Sigma=\mathrm{diag}(\sigma_{1},\dots,\sigma_{n})$, where $\sigma_{1}\geq\cdots\geq\sigma_{n}\geq 0$ are the singular values of $L$. We also denote the columns of $U$ and $V$ by $u_{i}$ and $v_{i}$, respectively. It is known that (P) satisfies strong duality (Boyd and Vandenberghe, 2004) and the optimal primal-dual pair $(\delta^{*},\lambda^{*})$ can be characterized by the KKT conditions:

$$\begin{aligned}2(\lambda^{*}I-L^{\top}L)\delta^{*}+2L^{\top}b&=0\\ \lambda^{*}({\delta^{*}}^{\top}\delta^{*}-\varepsilon^{2})&=0\\ (\lambda^{*}I-L^{\top}L)&\succeq 0.\end{aligned}$$

The KKT conditions can then be leveraged to solve for the optimal dual solution $\lambda^{*}$ and subsequently the optimal perturbation $\delta^{*}$. The full procedure is summarized in Algorithm 1.

We note that Boyd and Vandenberghe (2004) shows how to solve (P) via semidefinite programming. Algorithm 1, however, allows us to recycle the SVD of $L$ to solve (P) for different values of $b:=x_{k}-LY_{N}$ simply by solving a root finding problem for each $b$. This enables efficient batching when applying SGD to the outer minimization problem.

Algorithm 1 Inner Maximization Solution
given $L=U\Sigma V^{\top}\in\mathbb{R}^{n\times(N+1)p}$ (here $\Sigma$ denotes the full $n\times(N+1)p$ matrix of singular values), $b\in\mathbb{R}^{n}$, perturbation bound $\varepsilon>0$
if $\sum_{i:\sigma_{i}<\sigma_{1}}\frac{(b^{\top}u_{i})^{2}\sigma_{i}^{2}}{(\sigma_{1}^{2}-\sigma_{i}^{2})^{2}}<\varepsilon^{2}$ then
     $c=\sqrt{\varepsilon^{2}-\sum_{i:\sigma_{i}<\sigma_{1}}\frac{(b^{\top}u_{i})^{2}\sigma_{i}^{2}}{(\sigma_{1}^{2}-\sigma_{i}^{2})^{2}}}$
     Set $v$ as any unit vector lying in the null-space of $\left(\sigma_{1}^{2}I-\Sigma^{\top}\Sigma\right)V^{\top}$, i.e. $v\in\mathrm{span}\left\{v_{i}:\sigma_{i}=\sigma_{1}\right\}$
     $\delta^{*}=-V(\sigma_{1}^{2}I-\Sigma^{\top}\Sigma)^{\dagger}\Sigma^{\top}U^{\top}b+cv$
else
     solve $\sum_{i=1}^{n}\frac{(b^{\top}u_{i})^{2}\sigma_{i}^{2}}{(\lambda-\sigma_{i}^{2})^{2}}=\varepsilon^{2}$ for $\lambda$, e.g. by Newton's method
     $\delta^{*}=-V(\lambda^{*}I-\Sigma^{\top}\Sigma)^{-1}\Sigma^{\top}U^{\top}b$
end if
return $\delta^{*}$

The proof of correctness for Algorithm 1 is detailed in Section B.2.
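A possible NumPy rendering of the inner-maximization procedure is sketched below; it is an illustration under stated assumptions rather than the authors' implementation. Two choices are ours: the root of the secular equation is found by bisection instead of Newton's method, and the $\lambda^{*}=\sigma_{1}^{2}$ branch is taken only when $b$ additionally has no component along the top left-singular vectors, since otherwise the stationarity condition cannot hold at $\lambda^{*}=\sigma_{1}^{2}$. That extra check is our reading and is not spelled out in Algorithm 1. The function name `inner_max` and the test data are hypothetical.

```python
import numpy as np

def inner_max(L, b, eps):
    """Return a maximizer of ||b - L @ d||_2^2 over ||d||_2 <= eps."""
    m = L.shape[1]
    U, s, Vt = np.linalg.svd(L, full_matrices=True)
    r = len(s)
    g = s * (U.T @ b)[:r]            # g_i = sigma_i * (u_i^T b)
    s2 = s ** 2
    top = np.isclose(s, s[0])        # indices with sigma_i = sigma_1
    rest = ~top
    z = np.zeros(m)                  # coordinates of the perturbation in the basis V
    budget = np.sum(g[rest] ** 2 / (s2[0] - s2[rest]) ** 2)
    if np.allclose(g[top], 0.0) and budget < eps ** 2:
        # lambda* = sigma_1^2: pseudo-inverse solution plus a top singular direction
        # scaled so that the norm constraint is active.
        z[:r][rest] = -g[rest] / (s2[0] - s2[rest])
        z[0] = np.sqrt(eps ** 2 - budget)
    else:
        # lambda* > sigma_1^2 solves sum_i g_i^2 / (lambda - sigma_i^2)^2 = eps^2.
        # The left side is decreasing in lambda, so bisection on this bracket suffices.
        lo, hi = s2[0], s2[0] + np.linalg.norm(g) / eps
        for _ in range(200):
            lam = 0.5 * (lo + hi)
            if np.sum(g ** 2 / (lam - s2) ** 2) > eps ** 2:
                lo = lam
            else:
                hi = lam
        z[:r] = -g / (hi - s2)
    return Vt.T @ z

# Illustrative instance (an assumption of this sketch).
rng = np.random.default_rng(1)
L = rng.standard_normal((2, 5))
b = rng.standard_normal(2)
eps = 0.7
delta = inner_max(L, b, eps)
best = np.sum((b - L @ delta) ** 2)

# Compare against many random perturbations on the boundary of the eps-ball.
D = rng.standard_normal((20000, 5))
D = eps * D / np.linalg.norm(D, axis=1, keepdims=True)
sampled = np.sum((b - D @ L.T) ** 2, axis=1)
print(best, sampled.max())
```

Since the SVD depends only on `L`, a batched variant could compute it once and repeat only the root-finding step for each residual `b`, which is the reuse that the text describes.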

3 Robustness-Accuracy Tradeoffs in Kalman Filtering

The Kalman state estimation problem and adversarial state estimation problem can be viewed as standard and adversarially robust risk minimization problems by defining

$$\begin{aligned}\operatorname{SR}(L)&:=\mathbb{E}\left[\left\|x_{k}-LY_{N}\right\|_{2}^{2}\right],\\ \operatorname{AR}(L)&:=\mathbb{E}\left[\max_{\left\|\delta\right\|_{2}\leq\varepsilon}\left\|x_{k}-L(Y_{N}+\delta)\right\|_{2}^{2}\right].\end{aligned}$$

Our goal is to characterize robustness-accuracy tradeoffs for this linear inverse problem. We refer to the set of points $(\operatorname{SR}(L),\operatorname{AR}(L))$ over all $L\in\mathbb{R}^{n\times(N+1)p}$ as the $(\operatorname{SR},\operatorname{AR})$ region. The optimal tradeoff between standard and adversarial risks is characterized via the so-called Pareto boundary of this region, which we denote $\left\{(\operatorname{SR}(L_{\lambda}),\operatorname{AR}(L_{\lambda})):\lambda\geq 0\right\}$. Using standard results in multi-objective optimization, $L_{\lambda}$ are computed by solving the regularized optimization problem

$$L_{\lambda}:=\operatorname*{argmin}_{L}\;\operatorname{SR}(L)+\lambda\operatorname{AR}(L). \tag{4}$$

Varying the regularization parameter $\lambda$ in problem (4) thus allows us to characterize the aforementioned Pareto boundary by interpolating between the solution to the standard (i.e. $L_{0}$) and adversarial (i.e. $L_{\infty}$) problems. Via our results from Section 2.3, each solution $L_{\lambda}$ to the regularized optimization problem (4) can be computed efficiently using stochastic optimization. We use these results to trace out the optimal tradeoff curves for specific examples in Section 4.

In this section, we show that the gap $\operatorname{AR}(L)-\operatorname{SR}(L)$ can be bounded in terms of the spectral properties of the observability gramian of the system, establishing a natural connection to the robust control and estimation literature (Hassibi et al., 1999; Zhou and Doyle, 1998). In particular, our results indicate the robustness-accuracy tradeoff is more severe for systems with uniformly low observability, as characterized by the Frobenius norm of the observability gramian.

3.1 Tradeoffs for Linear Inverse Problems

We note that the Kalman state estimation problem can be posed as a general linear inverse problem, where $x\sim\mathcal{N}(0,\Sigma_{x})$, $w\sim\mathcal{N}(0,\Sigma_{w})$, $y=Mx+w$, and our goal is to minimize one of the following risks

$$\begin{aligned}\operatorname{SR}(L)&=\mathbb{E}\left[\left\|x-Ly\right\|_{2}^{2}\right],\\ \operatorname{AR}(L)&=\mathbb{E}\left[\max_{\left\|\delta\right\|_{2}\leq\varepsilon}\left\|x-L(y+\delta)\right\|_{2}^{2}\right].\end{aligned}$$

Although no closed-form expression exists for the adversarial risk $\operatorname{AR}(L)$, we now show that interpretable upper and lower bounds on the robustness-accuracy tradeoff, as characterized by the gap $\operatorname{AR}(L)-\operatorname{SR}(L)$, can be derived. Such bounds predict the severity of the robustness-accuracy tradeoff based upon underlying properties of specific linear inverse problems. We further show that these bounds are tight in the sense that they are exact for certain classes of matrices $L$, and strong in the sense that the lower and upper bounds differ only in higher-order terms with respect to the adversarial budget $\varepsilon$.

Theorem 3.1.

Given any $L\in\mathbb{R}^{p\times n}$, we have the following lower bound on $\operatorname{AR}(L)-\operatorname{SR}(L)$:

$$\operatorname{AR}(L)-\operatorname{SR}(L)\geq 2\varepsilon\;\mathbb{E}_{x,w}\left[\left\|L^{\top}(x-Ly)\right\|_{2}\right]+\varepsilon^{2}\lambda_{\min}(L^{\top}L), \tag{5}$$

and a corresponding upper bound

$$\operatorname{AR}(L)-\operatorname{SR}(L)\leq 2\varepsilon\;\mathbb{E}_{x,w}\left[\left\|L^{\top}(x-Ly)\right\|_{2}\right]+\varepsilon^{2}\lambda_{\max}(L^{\top}L), \tag{6}$$

where $\lambda_{\min}(L^{\top}L)$ and $\lambda_{\max}(L^{\top}L)$ are the minimum and maximum eigenvalues of $L^{\top}L$, respectively.

The proof of these bounds relies on turning the inner maximization of the adversarial risk into various equivalent optimization problems, and utilizing the properties of Schur complements and the S-lemma. See Section C.1 for details.
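For intuition about where the two terms in (5) and (6) come from, the following elementary per-realization argument recovers the same bounds. It is a sketch that is separate from the Schur-complement and S-lemma proof in Section C.1. The shorthand $b:=x-Ly$ and the test perturbation $\delta_{0}$ are introduced here only for illustration, and the sketch assumes $L^{\top}b\neq 0$.

```latex
\begin{align*}
\max_{\|\delta\|_2\le\varepsilon}\|b-L\delta\|_2^2-\|b\|_2^2
  &= \max_{\|\delta\|_2\le\varepsilon}
     \left(\delta^\top L^\top L\delta-2\delta^\top L^\top b\right).\\
\text{Upper bound, for every feasible } \delta:\quad
\delta^\top L^\top L\delta &\le \varepsilon^2\lambda_{\max}(L^\top L),
\qquad
-2\delta^\top L^\top b \le 2\varepsilon\|L^\top b\|_2 .\\
\text{Lower bound, evaluating at }
\delta_0 &= -\varepsilon\,\frac{L^\top b}{\|L^\top b\|_2}:\\
\delta_0^\top L^\top L\delta_0-2\delta_0^\top L^\top b
  &\ge \varepsilon^2\lambda_{\min}(L^\top L)+2\varepsilon\|L^\top b\|_2 .
\end{align*}
```

Taking expectations over $x$ and $w$ on both sides then gives bounds of the form (5) and (6), with the two bounds differing only through the eigenvalue multiplying $\varepsilon^{2}$.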

We note that when $L^{\top}\in\mathbb{R}^{n}$, inequality (6) recovers the exact characterization of the gap $\operatorname{AR}(L)-\operatorname{SR}(L)$ provided in Javanmard et al. (2020); thus when $p=1$, the inequality (6) is in fact an equality. We also note that the upper and lower bounds differ only in the $\mathcal{O}(\varepsilon^{2})$ terms. This leads immediately to the following result.

Corollary 3.1.

If $p\geq n$ and $L$ has orthogonal columns, then bounds (5) and (6) match.

Proof.

If $p\geq n$ and $L$ has orthogonal columns, then $\lambda_{\min}(L^{\top}L)=\lambda_{\max}(L^{\top}L)$. ∎

The terms involving the eigenvalues of $L^{\top}L$ in our bounds also support the intuition that adversarial robustness is a form of implicit regularization, which is visualized in Figure 1. In the one-dimensional linear classification setting, this phenomenon is well-understood (Tsipras et al., 2018; Dobriban et al., 2020), where robustness to adversarial perturbations prevents a robust feature vector from relying on an aggregate of small features. We note that in the case of state estimation, we have in general $p<n$, and thus the quadratic factor in $\varepsilon$ is $0$ in the lower bound (5).

Figure 1: Optimizing for adversarial robustness induces an implicit regularization, which is visualized in this heatmap of a $2\times 5$ nominal initial state estimator $L_{\star}$ (top) and adversarially robust solution $\hat{L}(\varepsilon)$ (bottom), where $\varepsilon=5$, $x_{0}\sim\mathcal{N}(0,I)$, $w_{t}\sim\mathcal{N}(0,I)$, $v_{t}\sim\mathcal{N}(0,1)$, $A=\begin{bmatrix}1&1\\ 0&1\end{bmatrix}$, $C=\begin{bmatrix}0&1\end{bmatrix}$, and $N=4$.

In the subsequent section, we will leverage bounds (5) and (6) from Theorem 3.1 to bound both the susceptibility and robustness of the Kalman Filter.

3.2 Bounding $\operatorname{AR}-\operatorname{SR}$ for State Estimation

The adversarial risk does not admit a closed-form solution in general. Upper and lower bounds on the gap between the adversarial risk and standard risk, however, still highlight the role control theoretic quantities play in robustness-accuracy tradeoffs. We make the following simplifying assumption for presentation purposes going forward.

Assumption 3.1.

$\Sigma_{0}=\sigma_{0}^{2}I$, $\Sigma_{w}=\sigma_{w}^{2}I$, $\Sigma_{v}=\sigma_{v}^{2}I$. We further assume that the system matrix $A=\rho Q$, $\rho\in[0,1]$, is a scaled orthogonal matrix, such that $\rho$ controls the stability of the system.

As stated in Assumption 2.1, $(A,C)$ is always assumed to be observable. Generalizations of our subsequent results to generic dynamics $A$ and positive definite covariance matrices are stated and proven in Appendix C: although more notationally cumbersome, they nevertheless convey the same overall trends.

We first present a closed form for the standard risk $\operatorname{SR}(L)$ which indicates the role that observability plays in robustness-accuracy tradeoffs.

Lemma 3.1.

The standard risk may be expressed as

$$\operatorname{SR}(L)=\mathbb{E}\left[\left\|x_{k}-LY_{N}\right\|_{2}^{2}\right]=\sigma_{0}^{2}\left\|A^{k}-L\mathcal{O}_{N}\right\|_{F}^{2}+\sigma_{w}^{2}\left\|\Gamma_{k}-L\tau_{N}\right\|_{F}^{2}+\sigma_{v}^{2}\left\|L\right\|_{F}^{2}.$$

Lemma 3.1 makes clear that the noise terms act as a regularizer: if $\sigma_{w}^{2}=\sigma_{v}^{2}=0$, then $\min_{L}\operatorname{SR}(L)=0$ and is achieved by $L=A^{k}(W_{o}(N))^{-1}\mathcal{O}_{N}^{\top}$. The gain of this filter has clear dependence upon the spectral properties of $W_{o}(N)$, indicating the key role that $W_{o}(N)$ plays in the robustness-accuracy tradeoffs satisfied by an LTI system (1). We formalize this intuition next. As a first step, we specialize the lower bound in Theorem 3.1 to the dynamical system setting.
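Both claims around Lemma 3.1 can be checked numerically under the isotropic covariances of Assumption 3.1. The sketch below evaluates the closed form, confirms that the noise-free minimizer $L=A^{k}(W_{o}(N))^{-1}\mathcal{O}_{N}^{\top}$ attains zero risk, and compares the formula with a Monte Carlo estimate for an arbitrary $L$. The example system, noise levels, sample size, and function names are assumptions made for illustration, and $\tau_{N}$ is taken with $N$ block columns.

```python
import numpy as np

def stacked_matrices(A, C, N, k):
    """Return (O_N, tau_N, Gamma_k); tau_N is taken with N block columns."""
    n, p = A.shape[0], C.shape[0]
    powers = [np.linalg.matrix_power(A, t) for t in range(N + 1)]
    O = np.vstack([C @ powers[t] for t in range(N + 1)])
    tau = np.zeros((p * (N + 1), n * N))
    for t in range(1, N + 1):
        for s in range(t):
            tau[p * t:p * (t + 1), n * s:n * (s + 1)] = C @ powers[t - 1 - s]
    Gamma = np.zeros((n, n * N))
    for s in range(k):
        Gamma[:, n * s:n * (s + 1)] = powers[k - 1 - s]
    return O, tau, Gamma

def standard_risk(L, O, tau, Gamma, Ak, s0, sw, sv):
    """Closed form of Lemma 3.1 with standard deviations s0, sw, sv."""
    fro2 = lambda M: np.linalg.norm(M, "fro") ** 2
    return s0**2 * fro2(Ak - L @ O) + sw**2 * fro2(Gamma - L @ tau) + sv**2 * fro2(L)

# Scaled rotation A = rho * Q, as in Assumption 3.1 (values chosen for illustration).
rho, theta = 0.9, 0.5
Q = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
A, C = rho * Q, np.array([[1.0, 0.0]])
N = k = 4
n, p = 2, 1
O, tau, Gamma = stacked_matrices(A, C, N, k)
Ak = np.linalg.matrix_power(A, k)

# Noise-free case: L = A^k W_o(N)^{-1} O_N^T reconstructs x_k exactly.
L0 = Ak @ np.linalg.solve(O.T @ O, O.T)
sr_noise_free = standard_risk(L0, O, tau, Gamma, Ak, 1.0, 0.0, 0.0)

# Monte Carlo estimate of E ||x_k - L Y_N||^2 for an arbitrary L.
rng = np.random.default_rng(0)
s0, sw, sv, M = 1.0, 0.5, 0.3, 200_000
L = rng.standard_normal((n, p * (N + 1)))
X0 = s0 * rng.standard_normal((M, n))
W = sw * rng.standard_normal((M, n * N))
V = sv * rng.standard_normal((M, p * (N + 1)))
Y = X0 @ O.T + W @ tau.T + V
Xk = X0 @ Ak.T + W @ Gamma.T
sr_mc = np.mean(np.sum((Xk - Y @ L.T) ** 2, axis=1))
sr_formula = standard_risk(L, O, tau, Gamma, Ak, s0, sw, sv)
print(sr_noise_free, sr_mc, sr_formula)
```

The noise-free minimizer inverts $W_{o}(N)$, so its gain grows as the smallest eigenvalue of the gramian shrinks, which is the dependence the following results quantify.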

Lemma 3.2.

For any $L\in\mathbb{R}^{n\times p(N+1)}$, the gap between $\operatorname{AR}(L)$ and $\operatorname{SR}(L)$ admits the following lower bound:

$$\operatorname{AR}(L)-\operatorname{SR}(L)\geq 2\sqrt{\frac{2}{\pi}}\frac{\varepsilon}{\sqrt{n}}\sigma_{v}\left\|L\right\|_{F}^{2}. \tag{7}$$

We now turn our attention to studying the tradeoffs enjoyed by the Kalman Filter/Smoother $L=\hat{L}_{k}$ defined in Lemma 2.1. Since the Kalman estimator is the optimal estimator in the nominal setting and is commonly used in practice, instantiating Lemma 3.2 for $L=\hat{L}_{k}$ captures the susceptibility of a nominal estimator to small adversarial perturbations. To simplify notation in the subsequent results, we will denote $\sigma_{\vee}^{2}=\max\left\{\sigma_{0}^{2},\sigma_{w}^{2}\right\}$, $\sigma_{\wedge}^{2}=\min\left\{\sigma_{0}^{2},\sigma_{w}^{2}\right\}$ and

$$r_{k}(\rho)=\begin{cases}k,\quad&\rho=1\\ \frac{1-\rho^{2(k+1)}}{1-\rho^{2}},\quad&\rho\neq 1.\end{cases} \tag{8}$$

With these definitions, we have the following theorem.

Theorem 3.2.

Suppose that $\hat{L}_{k}$ is the Kalman estimator from Lemma 2.1. We have the following bound on the gap between $\operatorname{AR}$ and $\operatorname{SR}$:

$$\operatorname{AR}(\hat{L}_{k})-\operatorname{SR}(\hat{L}_{k})\geq 2\sqrt{\frac{2}{\pi}}\frac{\varepsilon}{\sqrt{n}}\sigma_{v}\left\|C\right\|_{F}^{2}\left(\frac{\rho^{2k}\sigma_{0}^{2}+r_{k}(\rho)\;\sigma_{w}^{2}}{(N+1)\sigma_{\vee}^{2}\left\|W_{o}(N)\right\|_{F}+\sigma_{v}^{2}}\right)^{2}. \tag{9}$$

We see that the lower bound increases as the Frobenius norm of the observability gramian decreases. This indicates that as observability becomes uniformly low, i.e., if all eigenvalues of $W_{o}(N)$ are small, then a nominal state estimator $\hat{L}_{k}$ will have a large gap $\operatorname{AR}(\hat{L}_{k})-\operatorname{SR}(\hat{L}_{k})$. Observe that increasing $\sigma_{w}$ will increase the lower bound shown above when $\sigma_{w}\leq\sigma_{0}$.

We now derive an upper bound on the gap between the standard and adversarial risk for any given $L$. This bound follows from the upper bound in Theorem 3.1.

Lemma 3.3.

For any $L\in\mathbb{R}^{n\times p(N+1)}$, the following bound holds:

$$\operatorname{AR}(L)-\operatorname{SR}(L)\leq 2\varepsilon\left\|L\right\|_{2}\left\|\Sigma^{1/2}\right\|_{F}+\varepsilon^{2}\left\|L\right\|_{2}^{2},$$

where $\Sigma^{1/2}$ is the symmetric square root of the covariance of $x_{k}-LY_{N}$.
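As a small illustration, the bound in Lemma 3.3 can be evaluated directly once $L$ and the error covariance $\Sigma$ are available. The sketch below assumes the caller has already computed $\Sigma$ (it depends on $L$ and on the system), and uses the identity $\|\Sigma^{1/2}\|_F^2=\operatorname{tr}(\Sigma)$, which holds for any symmetric positive semidefinite $\Sigma$. The function name `gap_upper_bound` is hypothetical.

```python
import numpy as np


def gap_upper_bound(L: np.ndarray, Sigma: np.ndarray, eps: float) -> float:
    """Evaluate 2*eps*||L||_2*||Sigma^{1/2}||_F + eps^2*||L||_2^2 from Lemma 3.3.

    Sigma is assumed to be the (symmetric PSD) covariance of x_k - L Y_N.
    """
    spectral_norm = np.linalg.norm(L, 2)  # largest singular value of L
    # For symmetric PSD Sigma, ||Sigma^{1/2}||_F^2 = trace(Sigma).
    root_fro = np.sqrt(np.trace(Sigma))
    return float(2.0 * eps * spectral_norm * root_fro + eps ** 2 * spectral_norm ** 2)
```

Using the trace avoids forming the matrix square root explicitly, which keeps the computation cheap and numerically stable.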

Again, we consider how this upper bound looks for the Kalman estimator $\hat{L}_{k}$.

Figure 2: Pareto boundaries of $(\operatorname{SR},\operatorname{AR})$ for initial and final state estimation. The observability of the system is determined by $\alpha$. When $\alpha$ approaches one, observability decreases and the tradeoff between $\operatorname{SR}$ and $\operatorname{AR}$ becomes more severe.

Theorem 3.3.

Suppose that $\hat{L}_{k}$ is the Kalman state estimator from Lemma 2.1. Then

$$\begin{aligned}
\operatorname{AR}(\hat{L}_{k})-\operatorname{SR}(\hat{L}_{k})\leq{}&\varepsilon\left(\frac{\rho^{2k}\sigma_{0}^{2}+r_{k}(\rho)\,\sigma_{w}^{2}}{\sigma_{\wedge}^{2}\lambda_{\min}(W_{o}(N))^{1/2}}\right)\\
&\cdot\Bigg[2\sqrt{n}\left(\sigma_{\vee}^{2}+\left(\frac{\sigma_{v}}{\sigma_{\wedge}^{2}\lambda_{\min}(W_{o}(N))^{1/2}}\right)^{2}\right)^{1/2}+\varepsilon\left(\frac{1}{\sigma_{\wedge}^{2}\lambda_{\min}(W_{o}(N))^{1/2}}\right)\Bigg].
\end{aligned}$$

Furthermore, when $\lambda_{\min}(W_{o}(N))^{1/2}\geq\frac{\sigma_{v}}{\sigma_{\wedge}^{2}}$, we have

$$\begin{aligned}
\operatorname{AR}(\hat{L}_{k})-\operatorname{SR}(\hat{L}_{k})\leq{}&\varepsilon\left(\frac{\sigma_{\wedge}^{2}\lambda_{\min}(W_{o}(N))^{1/2}}{\sigma_{\wedge}^{4}\lambda_{\min}(W_{o}(N))+\sigma_{v}^{2}}\left(\rho^{2k}\sigma_{0}^{2}+r_{k}(\rho)\,\sigma_{w}^{2}\right)\right)\\
&\cdot\Bigg[2\sqrt{n}\left(\sigma_{\vee}^{2}+\sigma_{v}^{2}\left(\frac{\sigma_{\wedge}^{2}\lambda_{\min}(W_{o}(N))^{1/2}}{\sigma_{\wedge}^{4}\lambda_{\min}(W_{o}(N))+\sigma_{v}^{2}}\right)^{2}\right)^{1/2}+\varepsilon\left(\frac{\sigma_{\wedge}^{2}\lambda_{\min}(W_{o}(N))^{1/2}}{\sigma_{\wedge}^{4}\lambda_{\min}(W_{o}(N))+\sigma_{v}^{2}}\right)\Bigg].
\end{aligned}$$

The upper bound on the gap decreases as the minimum eigenvalue of the observability gramian increases. This indicates that as the observability of the system becomes uniformly large, the gap between standard and adversarial risk for the nominal Kalman estimator will decrease. Perhaps counter-intuitively, when observability is poor in some direction, i.e., when $\lambda_{\min}\left(W_{o}(N)\right)$ is small, increasing the sensor noise $\sigma_{v}$ will actually decrease the above upper bound, as long as $\lambda_{\min}\left(W_{o}(N)\right)^{1/2}\geq\frac{\sigma_{v}}{\sigma_{\wedge}^{2}}$, which is the condition under which the second bound of Theorem 3.3 applies. This aligns with results demonstrating that injecting artificial noise can improve the robustness of state observers (Doyle and Stein, 1979), and is further consistent with our interpretation of noise as a regularizer following Lemma 3.1.

We note that since the properties of the observability gramian $W_{o}(N)$ are tied to $\rho$, it is not immediately clear how to extract the role of stability $\rho\in[0,1]$ in either Equation 9 or Theorem 3.3. This is to be expected, as the fragility of the Kalman Filter has more to do with the observability of the system than with its autonomous stability. For example, when $C^{\top}C=I$, corresponding to maximal observability $W_{o}(N)=r_{k}(\rho)$, and $k\leq\mathcal{O}(N)$, the dominant terms in the lower and upper bounds are essentially independent of $\rho$ when $N$ is large.

4 Numerical Results

We now demonstrate that the theoretical results shown in the previous section predict the tradeoffs arising in Kalman Filtering problems.

Figure 3: $\operatorname{SR}$ vs $\operatorname{AR}$ for a nominal Kalman smoother and an adversarially robust smoother, where observability of the underlying system increases in the direction of the arrows. For any fixed value of the standard risk, the adversarially robust smoother achieves a lower adversarial risk than the nominal smoother. On the right side of the plot, when observability is low, the adversarial risk of the adversarially robust smoother is much lower than that of the nominal smoother. This difference shrinks as we move to the left and observability increases.

Pareto Curves for Adversarial Kalman Filtering

In this experiment we compute the Pareto-optimal frontier for adversarially robust Kalman filtering on systems with varying observability. We consider the system defined by the tuple $\left(A,C,\Sigma_{0},\Sigma_{w},\Sigma_{v},N\right)=\left(\begin{bmatrix}\alpha&\beta\\ -\beta&\alpha\end{bmatrix},\begin{bmatrix}1&0\end{bmatrix},I,0.1I,0.1,5\right)$, where $\alpha^{2}+\beta^{2}=1$, and we vary $\alpha$. As $\alpha$ approaches one, the minimum eigenvalue of the observability gramian becomes small. In particular, for $\alpha=0.95$, $\alpha=0.98$, and $\alpha=0.99$, the minimum eigenvalues of the observability gramian are $1.22$, $0.81$, and $0.58$, respectively. The adversarial budget is set to $\varepsilon=0.5$. Figure 2 shows the resulting tradeoff curves, which demonstrate that as observability decreases, both $\operatorname{SR}$ and $\operatorname{AR}$ increase, as does the distance between the extremes of the Pareto curve. The results therefore support Section 3.2, where we showed that shrinking the eigenvalues of $W_{o}(N)$ increases the severity of the tradeoff between $\operatorname{SR}$ and $\operatorname{AR}$.
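The qualitative trend in this experiment (the minimum eigenvalue of the gramian shrinks as $\alpha$ approaches one) can be explored with a short script. This is a sketch under an assumed convention, $W_o(N)=\sum_{t=0}^{N}(A^{t})^{\top}C^{\top}CA^{t}$; the paper's exact definition and any weighting are not restated in this section, so the printed values are not claimed to reproduce the numbers reported above. The helper names are hypothetical.

```python
import numpy as np


def observability_gramian(A: np.ndarray, C: np.ndarray, N: int) -> np.ndarray:
    """Finite-horizon gramian sum_{t=0}^{N} (A^t)^T C^T C A^t (assumed convention)."""
    n = A.shape[0]
    W = np.zeros((n, n))
    A_power = np.eye(n)  # holds A^t
    for _ in range(N + 1):
        CA = C @ A_power
        W += CA.T @ CA
        A_power = A @ A_power
    return W


def rotation_system(alpha: float):
    """Build (A, C) for the experiment, with beta chosen so alpha^2 + beta^2 = 1."""
    beta = np.sqrt(1.0 - alpha ** 2)
    A = np.array([[alpha, beta], [-beta, alpha]])
    C = np.array([[1.0, 0.0]])
    return A, C


if __name__ == "__main__":
    for alpha in (0.95, 0.98, 0.99):
        A, C = rotation_system(alpha)
        lam_min = np.linalg.eigvalsh(observability_gramian(A, C, N=5)).min()
        print(f"alpha={alpha}: lambda_min={lam_min:.3f}")
```

Because $A$ is a rotation, each row vector $CA^{t}$ has unit norm, so the trace of the gramian is fixed at $N+1$ under this convention and only the split between the two eigenvalues changes with $\alpha$.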

Figure 4: Pareto boundaries of $(\operatorname{SR},\operatorname{AR})$ for initial state estimation for a variety of measurement matrices $C$ by both a linear state estimator and a neural network (a lower bound on $\operatorname{AR}$ is plotted for the neural network). As the first entry of $C$ decreases, the tradeoff curve becomes more severe. The tradeoffs are not alleviated by a nonlinear estimator.

Tradeoffs of Kalman Smoother versus Adversarially Robust Kalman Smoother

In Figure 3, we demonstrate the impact of adversaries on the risk incurred by an estimator optimized for $\operatorname{SR}$ versus $\operatorname{AR}$. We consider initial state estimation of a system defined by the tuple $(A,C,\Sigma_{0},\Sigma_{w},\Sigma_{v},N)=\left(\begin{bmatrix}1&\rho\\ 0&1\end{bmatrix},\begin{bmatrix}1&0\end{bmatrix},I,0.1I,0.1,5\right)$, where we vary $\rho$ from $0.1$ to $\sqrt{10}$, such that the eigenvalues of $W_{o}(N)$ increase as $\rho$ increases. The adversarial budget is fixed at $\varepsilon=0.5$. Evaluating the nominal and robust estimators on this class of systems, we see that the adversarially robust smoother has significantly smaller adversarial risk compared to the nominal Kalman smoother when observability is low. As observability increases, this advantage shrinks. This suggests that considering the tradeoffs from adversarially robust state estimation is most important when observability is low.

Figure 5: Simulations of the performance of the Kalman/$\mathcal{H}_{2}$ filter, our adversarially robust Kalman filter, a mixed $\mathcal{H}_{2}/\mathcal{H}_{\infty}$ filter, and the $\mathcal{H}_{\infty}$ filter on a linear system. The average filtering error is computed as $\frac{1}{t}\sum_{k=1}^{t}\left\|x_{k}-\hat{x}_{k}\right\|_{2}^{2}$.

Tradeoffs with Nonlinear State Estimators

In Figure 4, we demonstrate empirically that the fundamental tradeoffs are not overcome by using nonlinear state estimators. In particular, we consider the system defined by $(A,\Sigma_{0},\Sigma_{w},\Sigma_{v},N)=\left(\begin{bmatrix}1&1\\ 0&1\end{bmatrix},I,0.1I,0.1,5\right)$, and $C$ as in the legend of the plots. We set the adversarial budget to $\varepsilon=0.5$. We solve for the Pareto boundary for the linear estimator as in Figure 2. We also consider a two-layer network with ten neurons per layer and ReLU activation functions. To train this network, we perform an SGD procedure similar to that used in the linear case, except that we do not solve for the exact adversary corresponding to each data point, but rather apply 100 steps of projected gradient ascent to find the adversarial perturbation. Once the neural network is trained, the same approach to finding the adversary is used again to estimate the adversarial risk of the resultant estimator. As a result, the tradeoff curves shown for the neural network are an under-approximation to the true values of $\operatorname{AR}$. Even so, the neural network appears to achieve roughly the same tradeoff curves as the linear estimator, which is expected because linear estimators are optimal in the $\mathcal{H}_{2}$ and $\mathcal{H}_{\infty}$ settings (Zhou and Doyle, 1998).

Comparison with Kalman Filter, mixed $\mathcal{H}_{2}/\mathcal{H}_{\infty}$, and $\mathcal{H}_{\infty}$ Filters

In Figure 5, we compare the performance of the adversarially robust Kalman filter with the optimal $\mathcal{H}_{\infty}$ filter, a mixed $\mathcal{H}_{2}/\mathcal{H}_{\infty}$ filter which minimizes the $\mathcal{H}_{2}$ norm subject to an $\mathcal{H}_{\infty}$ norm bound of $1.7$, and the nominal Kalman or $\mathcal{H}_{2}$ filter. In particular, we consider the filtering problem for a trajectory generated by the system $(A,C,\Sigma_{0},\Sigma_{w},\Sigma_{v})=\left(\begin{bmatrix}0.9&1\\ 0&0.9\end{bmatrix},\begin{bmatrix}1&0\end{bmatrix},I,0.1I,0.1\right)$. For ease of computation, we use the stationary mixed, $\mathcal{H}_{\infty}$, and Kalman filters (Lewis et al., 2008; Caverly and Forbes, 2019), and the adversarially robust Kalman Filter for a horizon of 10 with adversarial budget $\varepsilon=0.75$. We consider a nominal setting, where process and measurement disturbances are Gaussian with covariances defined by $\Sigma_{w}$ and $\Sigma_{v}$, and a setting to simulate sensor drift, where we have the white noise disturbances in addition to a sinusoidally varying measurement disturbance $\sin\left(\frac{2\pi k}{10}\right)$. The $\mathcal{H}_{\infty}$ norm bound for the mixed filter was tuned to the value of $1.7$ for good performance on the sensor drift setting. As expected, the nominal Kalman filter performs best in the nominal setting with zero-mean disturbances. In the setting with sensor drift, however, the adversarially robust Kalman filter performs substantially better than the nominal and $\mathcal{H}_{\infty}$ filters, and slightly better than the mixed $\mathcal{H}_{2}/\mathcal{H}_{\infty}$ filter. In both settings, the $\mathcal{H}_{\infty}$ filter is overly conservative. Note that a key advantage of the adversarially robust filter is the interpretability of the robustness level determined by the parameter $\varepsilon$, which directly corresponds to the power of the adversarial disturbance. This feature is not shared by other filter designs, such as the $\mathcal{H}_{\infty}$ norm bound in the mixed $\mathcal{H}_{2}/\mathcal{H}_{\infty}$ filter.
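To make the two simulation settings concrete, the following sketch generates a trajectory of the system above with and without the sinusoidal measurement drift, and computes the running average filtering error used in Figure 5. It is an illustration only: the function names, the seed handling, and the choice to treat $\Sigma_{v}=0.1$ as a variance are assumptions, and no filter is implemented here.

```python
import numpy as np


def simulate(T: int, drift: bool = False, seed: int = 0):
    """Simulate T steps; returns states (T, 2) and measurements (T, 1)."""
    rng = np.random.default_rng(seed)
    A = np.array([[0.9, 1.0], [0.0, 0.9]])
    C = np.array([[1.0, 0.0]])
    Sigma_0, Sigma_w, Sigma_v = np.eye(2), 0.1 * np.eye(2), 0.1
    x = rng.multivariate_normal(np.zeros(2), Sigma_0)
    xs, ys = [], []
    for k in range(T):
        v = np.sqrt(Sigma_v) * rng.standard_normal()  # white sensor noise
        y = C @ x + v
        if drift:
            y = y + np.sin(2.0 * np.pi * k / 10.0)  # sinusoidal sensor drift
        xs.append(x)
        ys.append(y)
        x = A @ x + rng.multivariate_normal(np.zeros(2), Sigma_w)
    return np.array(xs), np.array(ys)


def average_filtering_error(xs: np.ndarray, xhats: np.ndarray) -> np.ndarray:
    """Running average (1/t) * sum_{k=1}^{t} ||x_k - xhat_k||_2^2 for t = 1..T."""
    sq_err = np.sum((xs - xhats) ** 2, axis=1)
    return np.cumsum(sq_err) / np.arange(1, len(sq_err) + 1)
```

With a common seed, the nominal and drift runs share the same states and white noise, so the measurements differ only by the drift term. This makes comparisons between filters easier to interpret.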

5 Conclusion

We analyzed the robustness-accuracy tradeoffs arising in Kalman Filtering. We did this in two parts. We first provided an algorithm to solve for the optimal adversarial perturbation, which can be used to trace out the Pareto boundary. We then bounded the gap between the adversarial and standard state estimation error in terms of the spectral properties of the observability gramian. These bounds extend the robustness-accuracy tradeoff results arising in the classification and linear regression settings. An interesting avenue of future work is to extend these results to infinite horizon filtering, and combine them with tradeoff analyses in LQR control (Lee et al., 2022) to provide an understanding of the adversarial tradeoffs in the control of partially observed linear systems. It would also be interesting to consider the adversarial tradeoffs of state estimation when adversarial state perturbations are also present.

Acknowledgements

Bruce Lee is supported by the Department of Defense through the National Defense Science & Engineering Graduate Fellowship Program. The research of Hamed Hassani is supported by NSF Grants 1837253, 1943064, 1934876, AFOSR Grant FA9550-20-1-0111, and DCIST-CRA. Nikolai Matni is funded by NSF awards CPS-2038873, CAREER award ECCS-2045834, and a Google Research Scholar award.

References

  • Al Makdah et al. (2020) A. A. Al Makdah, V. Katewa, and F. Pasqualetti. Accuracy prevents robustness in perception-based control. In 2020 American Control Conference (ACC), pages 3940–3946. IEEE, 2020.
  • Bottou et al. (2018) L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018. doi: 10.1137/16M1080173.
  • Boyd and Vandenberghe (2004) S. P. Boyd and L. Vandenberghe. Convex optimization. Cambridge university press, 2004.
  • Carlini and Wagner (2016) N. Carlini and D. Wagner. Defensive distillation is not robust to adversarial examples. arXiv preprint arXiv:1607.04311, 2016.
  • Carlini and Wagner (2017) N. Carlini and D. Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE symposium on security and privacy, pages 39–57. IEEE, 2017.
  • Caverly and Forbes (2019) R. J. Caverly and J. R. Forbes. LMI properties and applications in systems, stability, and control theory. arXiv preprint arXiv:1903.08599, 2019.
  • Chen et al. (2020) L. Chen, Y. Min, M. Zhang, and A. Karbasi. More data can expand the generalization gap between adversarially robust and standard models. In International Conference on Machine Learning, pages 1670–1680. PMLR, 2020.
  • Deka et al. (2020) S. A. Deka, D. M. Stipanović, and C. J. Tomlin. Dynamically computing adversarial perturbations for recurrent neural networks. arXiv preprint arXiv:2009.02874, 2020.
  • Dobriban et al. (2020) E. Dobriban, H. Hassani, D. Hong, and A. Robey. Provable tradeoffs in adversarially robust classification. arXiv preprint arXiv:2006.05161, 2020.
  • Doyle and Stein (1979) J. Doyle and G. Stein. Robustness with observers. IEEE Transactions on Automatic Control, 24(4):607–611, 1979. doi: 10.1109/TAC.1979.1102095.
  • Goodfellow et al. (2014) I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
  • Hassibi et al. (1999) B. Hassibi, T. Kailath, and A. H. Sayed. Indefinite-quadratic estimation and control: a unified approach to H2 and H∞ theories. SIAM studies in applied and numerical mathematics, 1999.
  • Huang et al. (2017) S. Huang, N. Papernot, I. Goodfellow, Y. Duan, and P. Abbeel. Adversarial attacks on neural network policies. arXiv preprint arXiv:1702.02284, 2017.
  • Javanmard et al. (2020) A. Javanmard, M. Soltanolkotabi, and H. Hassani. Precise tradeoffs in adversarial training for linear regression. In Conference on Learning Theory, pages 2034–2078. PMLR, 2020.
  • Khargonekar et al. (1996) P. P. Khargonekar, M. A. Rotea, and E. Baeyens. Mixed H2/H∞ filtering. International Journal of Robust and Nonlinear Control, 6(4):313–330, 1996.
  • Lee et al. (2022) B. D. Lee, T. T. C. K. Zhang, H. Hassani, and N. Matni. Performance-robustness tradeoffs in adversarially robust linear-quadratic control, 2022.
  • Lewis et al. (2008) F. L. Lewis, L. Xie, and D. Popa. Optimal and Robust Estimation: With an Introduction to Stochastic Control Theory. CRC Press, 2008.
  • Lutter et al. (2021) M. Lutter, S. Mannor, J. Peters, D. Fox, and A. Garg. Robust value iteration for continuous control tasks. arXiv preprint arXiv:2105.12189, 2021.
  • Madry et al. (2017) A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
  • Mandlekar et al. (2017) A. Mandlekar, Y. Zhu, A. Garg, L. Fei-Fei, and S. Savarese. Adversarially robust policy learning: Active construction of physically-plausible perturbations. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3932–3939. IEEE, 2017.
  • Nakkiran (2019) P. Nakkiran. Adversarial robustness may be at odds with simplicity. arXiv preprint arXiv:1901.00532, 2019.
  • Petersen et al. (2008) K. B. Petersen, M. S. Pedersen, et al. The matrix cookbook. Technical University of Denmark, 7(15):510, 2008.
  • Pinto et al. (2017) L. Pinto, J. Davidson, R. Sukthankar, and A. Gupta. Robust adversarial reinforcement learning. In International Conference on Machine Learning, pages 2817–2826. PMLR, 2017.
  • Raghunathan et al. (2019) A. Raghunathan, S. M. Xie, F. Yang, J. C. Duchi, and P. Liang. Adversarial training can hurt generalization. arXiv preprint arXiv:1906.06032, 2019.
  • Szegedy et al. (2013) C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
  • Tsipras et al. (2018) D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, and A. Madry. Robustness may be at odds with accuracy. arXiv preprint arXiv:1805.12152, 2018.
  • Xie et al. (2020) C. Xie, M. Tan, B. Gong, A. Yuille, and Q. V. Le. Smooth adversarial training. arXiv preprint arXiv:2006.14536, 2020.
  • Zhang et al. (2019) H. Zhang, Y. Yu, J. Jiao, E. P. Xing, L. E. Ghaoui, and M. I. Jordan. Theoretically principled trade-off between robustness and accuracy. arXiv preprint arXiv:1901.08573, 2019.
  • Zhou and Doyle (1998) K. Zhou and J. C. Doyle. Essentials of Robust Control. Prentice-Hall, 1998.

Appendix A Kalman Filtering State Space Solution

Consider the setting defined in Section 2.1, i.e., we have a dynamical system which evolves according to

$$\begin{aligned}
x_{t+1}&=Ax_{t}+w_{t}\\
y_{t}&=Cx_{t}+v_{t}
\end{aligned}$$

with $x_{0}\sim\mathcal{N}(0,\Sigma_{0})$, $w_{t}\overset{\mathrm{i.i.d.}}{\sim}\mathcal{N}(0,\Sigma_{w})$, $v_{t}\overset{\mathrm{i.i.d.}}{\sim}\mathcal{N}(0,\Sigma_{v})$. Consider estimating the state $x_{k}$ given measurements $y_{0},\dots,y_{k}$. The minimum mean square error estimator is given by

$$\hat{x}_{k}=\operatorname*{argmin}_{z}\,\mathbb{E}\left[\left\|z-x_{k}\right\|_{2}^{2}\,\middle|\,y_{0},\dots,y_{k}\right].$$

The objective can be written as an integral,

$$\min_{z}\int_{\mathbb{R}^{n}}\left\|z-x_{k}\right\|_{2}^{2}f(x_{k}\mid y_{0},\dots,y_{k})\,dx_{k},$$

where $f$ denotes the conditional density of $x_{k}$ given the measurements. The objective is convex in $z$, and thus we can find the minimizer by setting the gradient with respect to $z$ to zero. In particular, we have

$$\begin{aligned}
\frac{d}{dz}\int_{\mathbb{R}^{n}}\left\|z-x_{k}\right\|_{2}^{2}f(x_{k}\mid y_{0},\dots,y_{k})\,dx_{k}&=\int_{\mathbb{R}^{n}}\frac{d}{dz}\left\|z-x_{k}\right\|_{2}^{2}f(x_{k}\mid y_{0},\dots,y_{k})\,dx_{k}\\
&=\int_{\mathbb{R}^{n}}2(z-x_{k})f(x_{k}\mid y_{0},\dots,y_{k})\,dx_{k}\\
&=0,
\end{aligned}$$

where the dominated convergence theorem permits the exchange of integration and differentiation. Therefore, the state estimate may be expressed as

$$\hat{x}_{k}=\int_{\mathbb{R}^{n}}x_{k}f(x_{k}\mid y_{0},\dots,y_{k})\,dx_{k}=\mathbb{E}\left[x_{k}\mid y_{0},\dots,y_{k}\right].$$

Thus we can determine the state estimate $\hat{x}_{k}$ as the mean of the conditional distribution $f(x_{k}\mid y_{0},\dots,y_{k})$. This may be computed recursively. In particular, let $x_{k}\mid y_{0:k}$ be the random variable with probability density function $f(x_{k}\mid y_{0},\dots,y_{k})$. Then for all $k$, we have that $x_{k}\mid y_{0:k}\sim\mathcal{N}(\hat{x}_{k}^{+},P_{k}^{+})$, where

$$\begin{aligned}
\hat{x}_{0}^{-}&=0\\
P_{0}^{-}&=\Sigma_{0}\\
\hat{x}_{k}^{+}&=\hat{x}_{k}^{-}+P_{k}^{-}C^{\top}\left(CP_{k}^{-}C^{\top}+\Sigma_{v}\right)^{-1}\left(y_{k}-C\hat{x}_{k}^{-}\right)\\
P_{k}^{+}&=P_{k}^{-}-P_{k}^{-}C^{\top}\left(CP_{k}^{-}C^{\top}+\Sigma_{v}\right)^{-1}CP_{k}^{-}\\
\hat{x}_{k+1}^{-}&=A\hat{x}_{k}^{+}\\
P_{k+1}^{-}&=AP_{k}^{+}A^{\top}+\Sigma_{w}.
\end{aligned} \tag{10}$$

To see that this is the case, recall that $x_{0}\sim\mathcal{N}(0,\Sigma_{0})$ by assumption. Now suppose that $x_{k}\mid y_{0:k-1}\sim\mathcal{N}(\hat{x}_{k}^{-},P_{k}^{-})$. Observe that

$$\begin{bmatrix}x_{k}\mid y_{0:k-1}\\ y_{k}\end{bmatrix}\sim\mathcal{N}\left(\begin{bmatrix}\hat{x}_{k}^{-}\\ C\hat{x}_{k}^{-}\end{bmatrix},\begin{bmatrix}P_{k}^{-}&P_{k}^{-}C^{\top}\\ CP_{k}^{-}&CP_{k}^{-}C^{\top}+\Sigma_{v}\end{bmatrix}\right).$$

Thus, conditioning on $y_{k}$,

$$\begin{aligned}
x_{k}\mid y_{0:k}&\sim\mathcal{N}\left(\hat{x}_{k}^{-}+P_{k}^{-}C^{\top}\left(CP_{k}^{-}C^{\top}+\Sigma_{v}\right)^{-1}\left(y_{k}-C\hat{x}_{k}^{-}\right),\;P_{k}^{-}-P_{k}^{-}C^{\top}\left(CP_{k}^{-}C^{\top}+\Sigma_{v}\right)^{-1}CP_{k}^{-}\right)\\
&=\mathcal{N}(\hat{x}_{k}^{+},P_{k}^{+}).
\end{aligned}$$

Now, given $x_{k}\mid y_{0:k}\sim\mathcal{N}(\hat{x}_{k}^{+},P_{k}^{+})$, observe that $x_{k+1}\mid y_{0:k}=Ax_{k}\mid y_{0:k}+w_{k}\sim\mathcal{N}(A\hat{x}_{k}^{+},AP_{k}^{+}A^{\top}+\Sigma_{w})=\mathcal{N}(\hat{x}_{k+1}^{-},P_{k+1}^{-})$.

Using the equations in (10), we can write the Kalman filter as a state space system with inputs $y_{t}$. In particular, if $P_{k}^{-}$ and $P_{k}^{+}$ are defined as in (10), we can let $K_{k}=P_{k}^{-}C^{\top}\left(CP_{k}^{-}C^{\top}+\Sigma_{v}\right)^{-1}$. Substituting the prediction $\hat{x}_{k+1}^{-}=A\hat{x}_{k}^{+}$ into the measurement update at time $k+1$, we may express our state estimates using the following time-varying system.

$$\hat{x}_{k+1}^{+}=(A-K_{k+1}CA)\hat{x}_{k}^{+}+K_{k+1}y_{k+1}.$$
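The recursion in this appendix can be sketched in a few lines of NumPy. The code below is a minimal illustration of the standard measurement-update and time-update steps (initial prior $\mathcal{N}(0,\Sigma_{0})$, gain $K_{k}=P_{k}^{-}C^{\top}(CP_{k}^{-}C^{\top}+\Sigma_{v})^{-1}$), not the authors' implementation; the function name and array layout are assumptions.

```python
import numpy as np


def kalman_filter(A, C, Sigma_0, Sigma_w, Sigma_v, ys):
    """Return posterior means (T, n) and covariances (T, n, n) for measurements ys (T, p)."""
    n = A.shape[0]
    x_prior = np.zeros(n)       # prior mean at time 0
    P_prior = Sigma_0.copy()    # prior covariance at time 0
    means, covs = [], []
    for y in ys:
        # Measurement update: condition the prior on y_k.
        S = C @ P_prior @ C.T + Sigma_v
        # K = P C^T S^{-1}, computed with a linear solve (S and P are symmetric).
        K = np.linalg.solve(S, C @ P_prior).T
        x_post = x_prior + K @ (y - C @ x_prior)
        P_post = P_prior - K @ C @ P_prior
        means.append(x_post)
        covs.append(P_post)
        # Time update: propagate the posterior through the dynamics.
        x_prior = A @ x_post
        P_prior = A @ P_post @ A.T + Sigma_w
    return np.array(means), np.array(covs)
```

For example, with scalar $A=C=1$, $\Sigma_{0}=\Sigma_{v}=1$, and a first measurement $y_{0}=2$, the gain is $1/2$, so the posterior mean is $1$ and the posterior variance is $1/2$.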

Appendix B Proofs from Section 2

B.1 Proof of Lemma 2.1

Lemma 2.1:  Suppose $k\leq N$. The finite horizon Kalman state estimator is the solution to optimization problem (2), and is given by

$$\hat{L}_{k}=\left(A^{k}\Sigma_{0}\mathcal{O}_{N}^{\top}+\Gamma_{k}\Sigma_{w}\tau_{N}^{\top}\right)\left(\mathcal{O}_{N}\Sigma_{0}\mathcal{O}_{N}^{\top}+\tau_{N}\left(I_{N}\otimes\Sigma_{w}\right)\tau_{N}^{\top}+\left(I_{N+1}\otimes\Sigma_{v}\right)\right)^{-1}.$$

Proof: We know $\operatorname{SR}(L)$ is convex in $L$, thus we may take the derivative of $\operatorname{SR}(L)$ with respect to $L$ and set it to $0$ to solve for $\hat{L}_{k}$. Matrix derivatives can be found in Petersen et al. (2008).

B.2 Proof of Correctness for Algorithm 1

Recall the optimization problem

$$\begin{aligned}
\operatorname*{maximize}_{\delta\in\mathbb{R}^{n}}\quad&\delta^{\top}L^{\top}L\delta-2\delta^{\top}L^{\top}b\\
\operatorname{subject\ to}\quad&\delta^{\top}\delta\leq\varepsilon^{2},
\end{aligned} \tag{P}$$

and the corresponding KKT conditions:

$$\begin{aligned}
2(\lambda^{*}I-L^{\top}L)\delta^{*}+2L^{\top}b&=0\\
\lambda^{*}({\delta^{*}}^{\top}\delta^{*}-\varepsilon^{2})&=0\\
\lambda^{*}I-L^{\top}L&\succeq 0.
\end{aligned}$$

The third condition implies that $\lambda^{*}\geq\sigma_{1}^{2}$. We assume $\sigma_{1}>0$, otherwise the problem is trivial. Then using the SVD $L=U\Sigma V^{\top}$ to rearrange the stationarity condition, we get

$$(\lambda^{*}I-\Sigma^{\top}\Sigma)V^{\top}\delta^{*}=-\Sigma^{\top}U^{\top}b.$$

A convex function maximized over a compact convex set attains its maximum on the boundary, so it suffices to search over $\delta^{\top}\delta=\varepsilon^{2}$. We now consider two cases: $\lambda^{*}>\sigma_{1}^{2}$ and $\lambda^{*}=\sigma_{1}^{2}$.

  • In the first case $\lambda^{*}>\sigma_{1}^{2}$, we know $\lambda^{*}I-\Sigma^{\top}\Sigma$ must be invertible, and thus

    $$\begin{aligned}
    \delta^{*}&=-V(\lambda^{*}I-\Sigma^{\top}\Sigma)^{-1}\Sigma^{\top}U^{\top}b\\
    \varepsilon^{2}={\delta^{*}}^{\top}\delta^{*}&=b^{\top}U\Sigma(\lambda^{*}I-\Sigma^{\top}\Sigma)^{-2}\Sigma^{\top}U^{\top}b\\
    &=\sum_{i=1}^{\min\left\{n,(N+1)p\right\}}\frac{(b^{\top}u_{i})^{2}\sigma_{i}^{2}}{(\lambda^{*}-\sigma_{i}^{2})^{2}},
    \end{aligned}$$

    where $u_{i}$ are the columns of $U$. By 2.1, we know $n<(N+1)p$. Observe that

    $$f(\lambda):=\sum_{i=1}^{n}\frac{(b^{\top}u_{i})^{2}\sigma_{i}^{2}}{(\lambda-\sigma_{i}^{2})^{2}}$$

    is a strictly monotonically decreasing function when $\lambda>\sigma_{1}^{2}$, and converges to $0$ when $\lambda\to\infty$. This implies there is a unique $\lambda^{*}$ such that $f(\lambda^{*})=\varepsilon^{2}$, which can be solved for numerically in various ways, such as bisection. Such methods can be initialized by setting the left boundary to $\lambda_{\ell}:=\sigma_{1}^{2}$, corresponding to $f(\lambda_{\ell})=\infty$, and the right boundary to a precomputable over-estimate $\lambda_{r}$ such that $f(\lambda_{r})<\varepsilon^{2}$. As an example, one such crude over-estimate can be derived by observing

    $$\begin{aligned}
    \sum_{i=1}^{n}\frac{(b^{\top}u_{i})^{2}\sigma_{i}^{2}}{(\lambda-\sigma_{i}^{2})^{2}}&\leq\left(\sum_{i=1}^{n}\left(b^{\top}u_{i}\right)^{2}\right)\frac{\sigma_{1}^{2}}{\left(\lambda-\sigma_{1}^{2}\right)^{2}}&&\text{Hölder's inequality}\\
    &=\frac{\left\|b\right\|^{2}}{c^{2}\sigma_{1}^{2}}\overset{\text{set}}{=}\varepsilon^{2}&&U\text{ orthogonal; set }\lambda=(1+c)\sigma_{1}^{2}\\
    \implies\lambda_{r}&:=\left(1+\frac{\left\|b\right\|}{\varepsilon\sigma_{1}}\right)\sigma_{1}^{2}.
    \end{aligned}$$

    Therefore, bisection can solve for $\lambda^{*}$ up to a desired tolerance $\delta$ in at most $\left\lceil\log_{2}\left(\frac{\lambda_{r}-\lambda_{\ell}}{\delta}\right)\right\rceil=\left\lceil\log_{2}\left(\frac{\left\|b\right\|\sigma_{1}}{\varepsilon\delta}\right)\right\rceil$ iterations, which is linear bit-complexity in the problem parameters. Since $f(\lambda)$ enjoys favorable regularity properties such as monotonicity, strict convexity, and smoothness on an open interval around $\lambda^{*}$, more advanced root-finding methods such as variants of Newton's method or the secant method can be employed for superlinear convergence.

  • Now we consider the case where $\lambda^{*}=\sigma_{1}^{2}$. In this case, $\delta^{*}$ will no longer be unique, and will take the form

    $$\delta^{*}=-V(\sigma_{1}^{2}I-\Sigma^{\top}\Sigma)^{\dagger}\Sigma^{\top}U^{\top}b+cv,$$

    where $\dagger$ denotes the Moore-Penrose pseudoinverse, and $v$ is any unit vector lying in the null space of $\left(\sigma_{1}^{2}I-\Sigma^{2}\right)V^{\top}$, which is precisely characterized in this case by

    $$\ker\left(\left(\sigma_{1}^{2}I-\Sigma^{2}\right)V^{\top}\right)=\operatorname{span}\left(\left\{v_{i}:\sigma_{i}^{2}=\sigma_{1}^{2}\right\}\right),$$

    with $v_{i}$ denoting the $i$th column of $V$. To find the appropriate scaling $c$, we observe

    $$\begin{aligned}
    {\delta^{*}}^{\top}\delta^{*}&=b^{\top}U\Sigma\left((\sigma_{1}^{2}I-\Sigma^{\top}\Sigma)^{\dagger}\right)^{2}\Sigma^{\top}U^{\top}b+c^{2}\\
    &=\sum_{i:\sigma_{i}<\sigma_{1}}\frac{(b^{\top}u_{i})^{2}\sigma_{i}^{2}}{(\sigma_{1}^{2}-\sigma_{i}^{2})^{2}}+c^{2}=\varepsilon^{2}\\
    \implies c&=\sqrt{\varepsilon^{2}-\sum_{i:\sigma_{i}<\sigma_{1}}\frac{(b^{\top}u_{i})^{2}\sigma_{i}^{2}}{(\sigma_{1}^{2}-\sigma_{i}^{2})^{2}}}.
    \end{aligned}$$

    Combining our precise characterization of $\ker\left(\left(\sigma_{1}^{2}I-\Sigma^{\top}\Sigma\right)V^{\top}\right)$ using the columns of $V$, and the formula for $c$, we can extract an optimal perturbation vector $\delta^{*}$. Therefore, we have demonstrated that (P) can be solved, and its optimal solution extracted, to arbitrary precision.
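The two cases above can be sketched as follows. This is an illustrative NumPy implementation written from the argument in this section, not a transcription of Algorithm 1. The function name, the tolerances, and the use of a relative stopping rule are assumptions. It maximizes $\|b-L\delta\|_{2}^{2}$, which differs from the objective of (P) only by the constant $\|b\|^{2}$, and it assumes $L\neq 0$ and $\varepsilon>0$.

```python
import numpy as np


def worst_case_perturbation(L, b, eps, tol=1e-12, max_iter=200):
    """Maximize ||b - L d||_2^2 over ||d||_2 <= eps via the SVD of L and bisection."""
    U, s, Vt = np.linalg.svd(L, full_matrices=False)  # L = U diag(s) Vt, s descending
    g = s * (U.T @ b)            # g_i = sigma_i * (b^T u_i)
    s1_sq = s[0] ** 2
    top = np.isclose(s, s[0])    # indices with sigma_i = sigma_1

    # Case lambda* = sigma_1^2: only possible if b^T u_i = 0 on the top directions.
    if np.allclose(g[top], 0.0):
        z = np.zeros_like(s)
        z[~top] = -g[~top] / (s1_sq - s[~top] ** 2)  # pseudoinverse solution
        slack = eps ** 2 - z @ z
        if slack >= 0.0:
            z[0] = np.sqrt(slack)  # scaling c along v_1, a kernel direction
            return Vt.T @ z

    # Case lambda* > sigma_1^2: bisection on f(lambda) = eps^2.
    def f(lam):
        return np.sum(g ** 2 / (lam - s ** 2) ** 2)

    lo = s1_sq
    hi = (1.0 + np.linalg.norm(b) / (eps * s[0])) * s1_sq  # crude over-estimate
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if f(mid) > eps ** 2:
            lo = mid
        else:
            hi = mid
        if hi - lo <= tol * hi:
            break
    # Using hi guarantees f(hi) <= eps^2, so the returned vector is feasible.
    return Vt.T @ (-g / (hi - s ** 2))
```

Returning the solution at the upper bracket `hi` keeps the perturbation inside the $\varepsilon$-ball at the cost of a negligible loss in objective value.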

Appendix C General Statements and Proofs for Section 3

C.1 Proof of Theorem 3.1

Theorem 3.1  Given any $L\in\mathbb{R}^{p\times n}$, we have the following lower bound on $\operatorname{AR}(L)-\operatorname{SR}(L)$:

$$\operatorname{AR}(L)-\operatorname{SR}(L)\geq 2\varepsilon\,\mathbb{E}_{x,w}\left[\left\|L^{\top}(x-Ly)\right\|_{2}\right]+\varepsilon^{2}\lambda_{\min}(L^{\top}L),$$

and a corresponding upper bound

$$\operatorname{AR}(L)-\operatorname{SR}(L)\leq 2\varepsilon\,\mathbb{E}_{x,w}\left[\left\|L^{\top}(x-Ly)\right\|_{2}\right]+\varepsilon^{2}\lambda_{\max}(L^{\top}L),$$

where $\lambda_{\min}(L^{\top}L)$ and $\lambda_{\max}(L^{\top}L)$ are the minimum and maximum eigenvalues of $L^{\top}L$, respectively.

Proof: First recall the definitions of $\operatorname{SR}$ and $\operatorname{AR}$:

$$\begin{aligned}
\operatorname{SR}(L)&=\mathbb{E}\left[\left\|x-Ly\right\|_{2}^{2}\right]\\
\operatorname{AR}(L)&=\mathbb{E}\left[\max_{\left\|\delta\right\|_{2}\leq\varepsilon}\left\|x-L(y+\delta)\right\|_{2}^{2}\right].
\end{aligned}$$

Given fixed $x,y$, let us define the quantity $d(L):=x-Ly$. Writing out the inner maximization of $\operatorname{AR}$, we have:

$$\max_{\left\|\delta\right\|_{2}\leq\varepsilon}\left\|x-L(y+\delta)\right\|_{2}^{2}=\max_{\left\|\delta\right\|_{2}\leq\varepsilon}\left\|d(L)-L\delta\right\|_{2}^{2}.$$

Observe that this is equivalent to the problem

$$\begin{aligned}
\operatorname*{minimize}_{s}\quad&s\\
\operatorname{subject\ to}\quad&s-\left\|d(L)-L\delta\right\|_{2}^{2}\geq 0\text{ for all }\delta^{\top}\delta\leq\varepsilon^{2}.
\end{aligned} \tag{P1}$$

We now recall the S-lemma for quadratic functions.

Lemma C.1 (S-lemma).

Given quadratic functions $p(x),q(x):\mathbb{R}^{n}\to\mathbb{R}$, suppose there exists $x$ such that $p(x)>0$. Then,

$$p(x)\geq 0\implies q(x)\geq 0\text{ for all }x$$

if and only if

$$\exists\, t\geq 0\text{ such that }q(x)\geq tp(x)\text{ for all }x.$$

Using this lemma, we set $p(\delta)=\varepsilon^{2}-\delta^{\top}\delta$ and $q(\delta)=s-\|d(L)-L\delta\|_{2}^{2}$. Trivially, $\delta=0$ satisfies $p(\delta)>0$. Given a feasible $s$ for (P1), the constraint states exactly that $p(\delta)\geq 0$ implies $q(\delta)\geq 0$. By the S-lemma, this is equivalent to the existence of some $t\geq 0$ such that

\[
q(\delta)-tp(\delta)=s-\|d(L)-L\delta\|_{2}^{2}-t(\varepsilon^{2}-\delta^{\top}\delta)\geq 0
\]

for all $\delta$. Therefore, we can rewrite the optimization problem (P1) as

\begin{align*}
\operatorname*{minimize}_{s}\quad & s \tag{P2}\\
\operatorname{subject\ to}\quad & \exists\, t\geq 0\ \text{ s.t. }\ s-\|d(L)-L\delta\|_{2}^{2}-t(\varepsilon^{2}-\delta^{\top}\delta)\geq 0\ \text{ for all } \delta.
\end{align*}

Rearranging the terms in the quadratic expression, we get:

\begin{align*}
s-\|d(L)-L\delta\|_{2}^{2}-t(\varepsilon^{2}-\delta^{\top}\delta)
&= s-\left(\delta^{\top}L^{\top}L\delta-2d(L)^{\top}L\delta+\|d(L)\|_{2}^{2}\right)-t(\varepsilon^{2}-\delta^{\top}\delta)\\
&= \delta^{\top}\left(tI-L^{\top}L\right)\delta+2d(L)^{\top}L\delta+\left(s-t\varepsilon^{2}-\|d(L)\|_{2}^{2}\right).
\end{align*}

Now we recall a property of Schur complements.

Lemma C.2 (Schur Complement).

Given $p(x)=x^{\top}Px+2b^{\top}x+c$, we have

\begin{align*}
p(x)\geq 0\;\;\forall\, x &\iff \begin{bmatrix}P&b\\ b^{\top}&c\end{bmatrix}\succeq 0\\
&\iff P\succeq 0,\;\; c-b^{\top}P^{\dagger}b\geq 0,
\end{align*}

where, when $P$ is singular, the last condition is understood together with $b\in\operatorname{range}(P)$.
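The equivalence in Lemma C.2 can be sanity-checked numerically. The sketch below is not part of the original proof: it assumes the convention $p(x)=x^{\top}Px+2b^{\top}x+c$ (the one that matches the block matrix $\begin{bmatrix}P&b\\ b^{\top}&c\end{bmatrix}$), restricts to a positive definite $P$ so that the pseudo-inverse is an ordinary inverse, and uses illustrative names (`quadratic_min`, `block_matrix`) and random data.

```python
import numpy as np

def quadratic_min(P, b, c):
    """Minimum over x of x'Px + 2 b'x + c, assuming P is positive definite."""
    return c - b @ np.linalg.solve(P, b)

def block_matrix(P, b, c):
    """Assemble the symmetric block matrix [[P, b], [b', c]]."""
    return np.block([[P, b[:, None]], [b[None, :], np.array([[c]])]])

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 4))
P = G @ G.T + np.eye(4)               # positive definite by construction
b = rng.standard_normal(4)
c_star = b @ np.linalg.solve(P, b)    # threshold value b' P^{-1} b

results = []
for c in (c_star + 0.5, c_star - 0.5):
    min_value = quadratic_min(P, b, c)                          # sign decides p >= 0
    min_eig = np.linalg.eigvalsh(block_matrix(P, b, c))[0]      # sign decides PSD
    results.append((min_value, min_eig))
    print(f"min p = {min_value:+.3f}, smallest block eigenvalue = {min_eig:+.3f}")
```

In both cases the sign of the minimum of $p$ agrees with the sign of the smallest eigenvalue of the block matrix, which is what the lemma asserts.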

Applying this to (P2), we see the constraints can be rewritten as

\begin{align*}
&\exists\, t\geq 0\ \text{ s.t. }\ s-\|d(L)-L\delta\|_{2}^{2}-t(\varepsilon^{2}-\delta^{\top}\delta)\geq 0\ \text{ for all } \delta\\
\iff\quad &\exists\, t\geq 0,\;\; tI-L^{\top}L\succeq 0,\;\; s-t\varepsilon^{2}-\|d(L)\|_{2}^{2}-d(L)^{\top}L(tI-L^{\top}L)^{\dagger}L^{\top}d(L)\geq 0\\
\iff\quad &\exists\, t\geq\lambda_{\max}(L^{\top}L),\;\; s-t\varepsilon^{2}-\|d(L)\|_{2}^{2}-d(L)^{\top}L(tI-L^{\top}L)^{\dagger}L^{\top}d(L)\geq 0.
\end{align*}

Therefore, we get the optimization problem

\begin{align*}
\operatorname*{minimize}_{s,t}\quad & s\\
\operatorname{subject\ to}\quad & t\geq\lambda_{\max}(L^{\top}L)\\
& s-t\varepsilon^{2}-\|d(L)\|_{2}^{2}-d(L)^{\top}L(tI-L^{\top}L)^{\dagger}L^{\top}d(L)\geq 0.
\end{align*}

This is clearly equivalent to, and has the same optimal value as, the following problem

\begin{align*}
\operatorname*{minimize}_{t}\quad & t\varepsilon^{2}+\|d(L)\|_{2}^{2}+d(L)^{\top}L(tI-L^{\top}L)^{\dagger}L^{\top}d(L)\\
\operatorname{subject\ to}\quad & t\geq\lambda_{\max}(L^{\top}L).
\end{align*}

Notice that so far we are simply considering equivalent formulations of the original optimization. The ensuing step is where the lower and upper bounds (5) and (6) arise. Recall the Neumann series: for $t>\lambda_{\max}(L^{\top}L)$, we have

\begin{align*}
(tI-L^{\top}L)^{-1} &= \frac{1}{t}\left(I-\frac{1}{t}L^{\top}L\right)^{-1}\\
&= \frac{1}{t}\left(I+\frac{1}{t}L^{\top}L+\frac{1}{t^{2}}\left(L^{\top}L\right)^{2}+\cdots\right).
\end{align*}

From the Neumann series, we see that we can upper and lower bound the inverse using geometric series of the largest and smallest eigenvalues of $L^{\top}L$, respectively,

\[
\frac{1}{t-\lambda_{\min}(L^{\top}L)}I\preceq(tI-L^{\top}L)^{-1}\preceq\frac{1}{t-\lambda_{\max}(L^{\top}L)}I.
\]

From now on, we work with the inverse rather than the pseudo-inverse: we restrict to $t>\lambda_{\max}(L^{\top}L)$ and take the infimum of the above problem, which is bounded from below. Let us consider the lower bound first. The upper bound follows using the exact same analysis. We have

\begin{align*}
t\varepsilon^{2}+\|d(L)\|_{2}^{2}+d(L)^{\top}L(tI-L^{\top}L)^{-1}L^{\top}d(L)
&\geq t\varepsilon^{2}+\|d(L)\|_{2}^{2}+d(L)^{\top}L\left(\frac{1}{t-\lambda_{\min}(L^{\top}L)}I\right)L^{\top}d(L)\\
&= t\varepsilon^{2}+\|d(L)\|_{2}^{2}+\frac{1}{t-\lambda_{\min}(L^{\top}L)}\|L^{\top}d(L)\|_{2}^{2}.
\end{align*}

Therefore, we have

\[
\max_{\|\delta\|_{2}\leq\varepsilon}\|x-L(y+\delta)\|_{2}^{2}\geq\|d(L)\|_{2}^{2}+\min_{t>\lambda_{\max}(L^{\top}L)}\left\{t\varepsilon^{2}+\frac{1}{t-\lambda_{\min}(L^{\top}L)}\|L^{\top}d(L)\|_{2}^{2}\right\}.
\]

We now make a second relaxation, enlarging the feasible set from $t>\lambda_{\max}(L^{\top}L)$ to $t>\lambda_{\min}(L^{\top}L)$:

\begin{align*}
\|d(L)\|_{2}^{2}+\min_{t>\lambda_{\max}(L^{\top}L)}\left\{t\varepsilon^{2}+\frac{1}{t-\lambda_{\min}(L^{\top}L)}\|L^{\top}d(L)\|_{2}^{2}\right\}
&\geq \|d(L)\|_{2}^{2}+\min_{t>\lambda_{\min}(L^{\top}L)}\left\{t\varepsilon^{2}+\frac{1}{t-\lambda_{\min}(L^{\top}L)}\|L^{\top}d(L)\|_{2}^{2}\right\}\\
&= \|d(L)\|_{2}^{2}+2\varepsilon\|L^{\top}d(L)\|_{2}+\varepsilon^{2}\lambda_{\min}(L^{\top}L),
\end{align*}

which we get by deriving the unconstrained minimizer $t^{*}=\frac{\|L^{\top}d(L)\|_{2}}{\varepsilon}+\lambda_{\min}(L^{\top}L)$. Taking expectations on both sides of the inequality, we get

\[
\operatorname{AR}(L)\geq\operatorname{SR}(L)+2\varepsilon\,\mathbb{E}\left[\|L^{\top}(x-Ly)\|_{2}\right]+\varepsilon^{2}\lambda_{\min}(L^{\top}L).
\]

\blacksquare
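For a single realization of $d(L)$, the sandwich derived in this proof can be checked numerically. The following is only an illustrative sketch with random data and hypothetical helper names (`inner_max_bounds`, `dual_objective`). It compares the lower bound $\|d\|_2^2+2\varepsilon\|L^{\top}d\|_2+\varepsilon^{2}\lambda_{\min}(L^{\top}L)$ and the analogous upper bound with $\lambda_{\max}(L^{\top}L)$ against two quantities: the attack value achieved by the specific perturbation $\delta_0=-\varepsilon L^{\top}d/\|L^{\top}d\|_2$, and a coarse grid minimum of the one-dimensional objective in $t$. Neither is the exact maximum: the first under-estimates it and the second over-estimates it.

```python
import numpy as np

def inner_max_bounds(L, d, eps):
    """Lower/upper bounds on max over ||delta||_2 <= eps of ||d - L delta||_2^2."""
    eigs = np.linalg.eigvalsh(L.T @ L)                 # ascending eigenvalues of L'L
    base = d @ d + 2.0 * eps * np.linalg.norm(L.T @ d)
    return base + eps**2 * eigs[0], base + eps**2 * eigs[-1]

def dual_objective(L, d, eps, t):
    """Objective of the problem in t, evaluated for t > lambda_max(L'L)."""
    v = L.T @ d
    K = t * np.eye(L.shape[1]) - L.T @ L
    return t * eps**2 + d @ d + v @ np.linalg.solve(K, v)

rng = np.random.default_rng(0)
n, m, eps = 3, 5, 0.1                                  # illustrative sizes and budget
L = rng.standard_normal((n, m))
d = rng.standard_normal(n)                             # plays the role of d(L) = x - L y

lower, upper = inner_max_bounds(L, d, eps)

# One feasible perturbation: spend the full budget along -L'd.
delta0 = -eps * (L.T @ d) / np.linalg.norm(L.T @ d)
primal = np.sum((d - L @ delta0) ** 2)

# Coarse grid over t; every grid value upper-bounds the true maximum.
lam_max = np.linalg.eigvalsh(L.T @ L)[-1]
dual = min(dual_objective(L, d, eps, lam_max + g) for g in np.logspace(-3, 3, 400))

print(lower, primal, dual, upper)
```

The printed values should satisfy `lower <= primal <= dual` and `primal <= upper`, up to floating-point error. A finer search over $t$ would tighten `dual` toward the exact inner maximum.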

In the subsequent full statements of the corresponding results in the paper, we do not make Assumption 3.1, and instead consider general observable $(A,C)$, where $\rho(A)\leq 1$, and positive definite noise covariances $\Sigma_{0},\Sigma_{w},\Sigma_{v}\succ 0$.

C.2 General Statement of Lemma 3.1

Lemma 3.1.  The standard risk may be expressed as

\begin{align*}
\operatorname{SR}(L) &:= \mathbb{E}\left[\|x_{k}-LY_{N}\|_{2}^{2}\right]\\
&= \left\|\left(A^{k}-L\mathcal{O}_{N}\right)\Sigma_{0}^{1/2}\right\|_{F}^{2}+\left\|\left(\Gamma_{k}-L\tau_{N}\right)\left(I_{N}\otimes\Sigma_{w}\right)^{1/2}\right\|_{F}^{2}+\left\|L\left(I_{N+1}\otimes\Sigma_{v}\right)^{1/2}\right\|_{F}^{2}.
\end{align*}

Proof: This follows by expanding the norm inside the expectation and noticing that, since $x_{0}$, $W_{N}$, $V_{N}$ are zero-mean, mutually independent Gaussian random vectors, their cross terms vanish in expectation. More precisely, we have

\begin{align*}
\mathbb{E}\left[\|x_{k}-LY_{N}\|_{2}^{2}\right]
&= \mathbb{E}\left[\operatorname{tr}\left((x_{k}-LY_{N})(x_{k}-LY_{N})^{\top}\right)\right]\\
&= \mathbb{E}\Big[\operatorname{tr}\Big(\left(A^{k}-L\mathcal{O}_{N}\right)x_{0}x_{0}^{\top}\left(A^{k}-L\mathcal{O}_{N}\right)^{\top}+\left(\Gamma_{k}-L\tau_{N}\right)W_{N}W_{N}^{\top}\left(\Gamma_{k}-L\tau_{N}\right)^{\top}\\
&\qquad\qquad+LV_{N}V_{N}^{\top}L^{\top}\Big)\Big]+0\\
&= \left\|\left(A^{k}-L\mathcal{O}_{N}\right)\Sigma_{0}^{1/2}\right\|_{F}^{2}+\left\|\left(\Gamma_{k}-L\tau_{N}\right)\left(I_{N}\otimes\Sigma_{w}\right)^{1/2}\right\|_{F}^{2}+\left\|L\left(I_{N+1}\otimes\Sigma_{v}\right)^{1/2}\right\|_{F}^{2}.
\end{align*}

\blacksquare
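The closed form in Lemma 3.1 can be compared against a Monte Carlo estimate. The sketch below rests on assumptions that are not taken from the paper's code. It reads the model $x_{k}=A^{k}x_{0}+\Gamma_{k}W_{N}$ and $Y_{N}=\mathcal{O}_{N}x_{0}+\tau_{N}W_{N}+V_{N}$ off the expansion in the proof, draws $x_0$, $W_N$, $V_N$ independently, and replaces $A^{k}$, $\Gamma_{k}$, $\mathcal{O}_{N}$, $\tau_{N}$ and $L$ with random stand-in matrices of compatible size, since the identity does not depend on their structure. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, N = 2, 1, 3                                  # state dim, output dim, horizon

def random_cov(dim):
    """Random positive definite covariance (illustrative)."""
    G = rng.standard_normal((dim, dim))
    return G @ G.T + 0.1 * np.eye(dim)

Sigma0, Sigma_w, Sigma_v = random_cov(n), random_cov(n), random_cov(p)
Ak = rng.standard_normal((n, n))                   # stand-in for A^k
Gamma = rng.standard_normal((n, n * N))            # stand-in for Gamma_k
O = rng.standard_normal((p * (N + 1), n))          # stand-in for O_N
tau = rng.standard_normal((p * (N + 1), n * N))    # stand-in for tau_N
L = rng.standard_normal((n, p * (N + 1)))          # arbitrary linear estimator

Cw = np.kron(np.eye(N), Sigma_w)                   # I_N (x) Sigma_w
Cv = np.kron(np.eye(N + 1), Sigma_v)               # I_{N+1} (x) Sigma_v

def fro2_with_cov(B, C):
    """||B C^{1/2}||_F^2, computed as tr(B C B')."""
    return np.trace(B @ C @ B.T)

sr_formula = (fro2_with_cov(Ak - L @ O, Sigma0)
              + fro2_with_cov(Gamma - L @ tau, Cw)
              + fro2_with_cov(L, Cv))

# Monte Carlo estimate of E||x_k - L Y_N||^2 with independent x0, W, V.
samples = 200_000
x0 = rng.multivariate_normal(np.zeros(n), Sigma0, size=samples)
W = rng.multivariate_normal(np.zeros(n * N), Cw, size=samples)
V = rng.multivariate_normal(np.zeros(p * (N + 1)), Cv, size=samples)
xk = x0 @ Ak.T + W @ Gamma.T
Y = x0 @ O.T + W @ tau.T + V
sr_mc = np.mean(np.sum((xk - Y @ L.T) ** 2, axis=1))

print(sr_formula, sr_mc)
```

The two printed numbers should agree up to sampling error, which shrinks as `samples` grows.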

C.3 General Statement of Lemma 3.2

Lemma 3.2.  For any $L\in\mathbb{R}^{n\times p(N+1)}$, the gap between $\operatorname{AR}(L)$ and $\operatorname{SR}(L)$ admits the following lower bound:

\begin{align*}
\operatorname{AR}(L)-\operatorname{SR}(L)
&\geq 2\sqrt{\frac{2}{\pi}}\frac{\varepsilon}{\sqrt{n}}\operatorname{tr}\bigg(\Big(L^{\top}\big(S\Sigma_{0}S^{\top}+T\left(I_{N}\otimes\Sigma_{w}\right)T^{\top}+L\left(I_{N+1}\otimes\Sigma_{v}\right)L^{\top}\big)L\Big)^{1/2}\bigg)\\
&\geq 2\sqrt{\frac{2}{\pi}}\frac{\varepsilon}{\sqrt{n}}\sigma_{\min}(\Sigma_{v})^{1/2}\|L\|_{F}^{2}, \tag{11}
\end{align*}

where $S:=A^{k}-L\mathcal{O}_{N}$ and $T:=\Gamma_{k}-L\tau_{N}$.

Proof: Applying the lower bound (5) and dropping the nonnegative term $\varepsilon^{2}\lambda_{\min}(L^{\top}L)$, we have

\[
\operatorname{AR}(L)\geq\operatorname{SR}(L)+2\varepsilon\,\mathbb{E}\left[\|L^{\top}(x_{k}-LY_{N})\|_{2}\right].
\]

Then to derive the lower bound (C.3), we observe that the random vector

\begin{align*}
z &= L^{\top}(x_{k}-LY_{N})\\
&= L^{\top}\left(Sx_{0}+TW_{N}-LV_{N}\right)
\end{align*}

is a zero-mean Gaussian with covariance

\[
\Sigma=L^{\top}\left(S\Sigma_{0}S^{\top}+T\left(I_{N}\otimes\Sigma_{w}\right)T^{\top}+L\left(I_{N+1}\otimes\Sigma_{v}\right)L^{\top}\right)L.
\]

We can also write $z=\Sigma^{1/2}w$ where $w\sim\mathcal{N}(0,I)$. Consider the diagonalization $\Sigma^{1/2}=Q\operatorname{diag}(s_{1},s_{2},\dots)Q^{\top}$, where $q_{i}$ is the $i$th column of the orthogonal matrix $Q$ and $s_{1}\geq s_{2}\geq\dots\geq 0$ are the singular values of $\Sigma^{1/2}$. Writing $\tilde{w}:=Q^{\top}w$, which is again distributed as $\mathcal{N}(0,I)$, we have

\begin{align*}
\mathbb{E}\left[\|z\|_{2}\right] &= \mathbb{E}\left[\left\|Q\operatorname{diag}(s_{1},s_{2},\dots)Q^{\top}w\right\|_{2}\right]\\
&= \mathbb{E}\left[\left\|\sum_{i}s_{i}\tilde{w}_{i}q_{i}\right\|_{2}\right].
\end{align*}

Since the $q_{i}$ are orthonormal,

\[
\left\|\sum_{i}s_{i}\tilde{w}_{i}q_{i}\right\|_{2}^{2}=\sum_{i}s_{i}^{2}\tilde{w}_{i}^{2}q_{i}^{\top}q_{i}=\sum_{i}s_{i}^{2}\tilde{w}_{i}^{2}.
\]

Because $L$ has $n$ rows, $\Sigma$ has rank at most $n$, so $s_{i}=0$ for $i>n$ and the sums run over $i=1,\dots,n$. By equivalence of norms, $\sqrt{\sum_{i=1}^{n}x_{i}^{2}}\geq n^{-1/2}\sum_{i=1}^{n}|x_{i}|$. Therefore,

\[
\sqrt{\sum_{i=1}^{n}s_{i}^{2}\tilde{w}_{i}^{2}}\geq\frac{1}{\sqrt{n}}\sum_{i=1}^{n}s_{i}|\tilde{w}_{i}|,
\]

and thus

\[
\mathbb{E}\left[\|z\|_{2}\right]\geq\frac{1}{\sqrt{n}}\mathbb{E}\left[\sum_{i=1}^{n}s_{i}|\tilde{w}_{i}|\right]=\frac{1}{\sqrt{n}}\mathbb{E}\left[|g|\right]\sum_{i=1}^{n}s_{i},
\]

where $g\sim\mathcal{N}(0,1)$ is a scalar standard normal. The quantity $\mathbb{E}\left[|g|\right]$ is the expected value of a folded standard normal, which is $\sqrt{\frac{2}{\pi}}$, while $\sum_{i=1}^{n}s_{i}=\operatorname{tr}\left(\Sigma^{1/2}\right)$. Putting this together, we have that

\[
\operatorname{AR}(L)-\operatorname{SR}(L)\geq 2\sqrt{\frac{2}{\pi}}\frac{\varepsilon}{\sqrt{n}}\operatorname{tr}\bigg(\Big(L^{\top}\big(S\Sigma_{0}S^{\top}+T\left(I_{N}\otimes\Sigma_{w}\right)T^{\top}+L\left(I_{N+1}\otimes\Sigma_{v}\right)L^{\top}\big)L\Big)^{1/2}\bigg).
\]

From the above bound, we may now derive a cruder lower bound from which we can observe a dependence on the singular values of the observability Gramian, $W_{o}(N)$. In particular, begin with the expression above, and note that the terms involving $\Sigma_{0}$ and $\Sigma_{w}$ are positive semidefinite. Dropping them, and using $I_{N+1}\otimes\Sigma_{v}\succeq\sigma_{\min}(\Sigma_{v})I$ together with the monotonicity of the matrix square root, gives a lower bound in terms of $L$:

\begin{align*}
& 2\sqrt{\frac{2}{\pi}}\frac{\varepsilon}{\sqrt{n}}\operatorname{tr}\left(\left(L^{\top}\left(S\Sigma_{0}S^{\top}+T\left(I_{N}\otimes\Sigma_{w}\right)T^{\top}+L\left(I_{N+1}\otimes\Sigma_{v}\right)L^{\top}\right)L\right)^{1/2}\right)\\
\geq\;& 2\sqrt{\frac{2}{\pi}}\frac{\varepsilon}{\sqrt{n}}\sigma_{\min}\left(\Sigma_{v}\right)^{1/2}\operatorname{tr}\left(\left(L^{\top}LL^{\top}L\right)^{1/2}\right)\\
=\;& 2\sqrt{\frac{2}{\pi}}\frac{\varepsilon}{\sqrt{n}}\sigma_{\min}\left(\Sigma_{v}\right)^{1/2}\|L\|_{F}^{2},
\end{align*}

which completes the proof of inequality (C.3). \blacksquare
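The key probabilistic step of this proof, $\mathbb{E}\|z\|_{2}\geq\sqrt{2/\pi}\,n^{-1/2}\operatorname{tr}(\Sigma^{1/2})$ for a zero-mean Gaussian $z$ whose covariance $\Sigma=L^{\top}(\cdot)L$ has rank at most $n$, can be illustrated by simulation. This is a sketch under stated assumptions: the inner covariance is a random positive definite stand-in `Q`, `q` stands in for $p(N+1)$, and `sym_sqrt` is an illustrative helper.

```python
import numpy as np

def sym_sqrt(S):
    """Symmetric PSD square root via an eigendecomposition."""
    vals, vecs = np.linalg.eigh(S)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

rng = np.random.default_rng(2)
n, q = 3, 6                                   # q stands in for p(N+1)
L = rng.standard_normal((n, q))
G = rng.standard_normal((n, n))
Q = G @ G.T + 0.1 * np.eye(n)                 # stand-in for the covariance of x_k - L Y_N
Sigma = L.T @ Q @ L                           # covariance of z, rank at most n
root = sym_sqrt(Sigma)

w = rng.standard_normal((100_000, q))         # rows are standard normal draws
mean_norm = np.mean(np.linalg.norm(w @ root, axis=1))   # Monte Carlo estimate of E||z||_2
lower = np.sqrt(2.0 / np.pi) / np.sqrt(n) * np.trace(root)

print(lower, mean_norm)
```

The estimate `mean_norm` should sit above `lower`. The gap reflects the slack introduced by the norm-equivalence step.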

Introducing additional notation to express the Kalman estimator will be helpful in subsequent sections. Let

\begin{align*}
\bar{\Sigma} &:= \begin{bmatrix}\Sigma_{0}&\\ &I_{N}\otimes\Sigma_{w}\end{bmatrix} \tag{12}\\
H &:= \begin{bmatrix}I&&&\\ A&I&&\\ \vdots&&\ddots&\\ A^{N}&A^{N-1}&\ldots&I\end{bmatrix}\bar{\Sigma}^{1/2}\\
H_{k} &:= E_{k}^{\top}H=\begin{bmatrix}A^{k}&A^{k-1}&\dots&I&0&\dots&0\end{bmatrix}\bar{\Sigma}^{1/2}\\
M &:= H^{\top}\left(C^{\top}\otimes I\right)=\bar{\Sigma}^{1/2}\begin{bmatrix}\mathcal{O}_{N}^{\top}\\ \left(\mathcal{Z}\mathcal{O}_{N}\right)^{\top}\\ \vdots\\ \left(\mathcal{Z}^{N}\mathcal{O}_{N}\right)^{\top}\end{bmatrix},
\end{align*}

where $\mathcal{Z}\in\mathbb{R}^{p(N+1)\times p(N+1)}$ is a block downshift operator, with blocks of size $p$. With this notation, the Kalman estimator given in Lemma 2.1 may be rewritten more compactly as

\[
\hat{L}_{k}=H_{k}M\left(M^{\top}M+I_{N+1}\otimes\Sigma_{v}\right)^{-1}.
\]

C.4 General Statement of Theorem 9

Theorem 9.  Suppose that $\hat{L}_{k}$ is the Kalman estimator from Lemma 2.1. Then we have the following bound on the gap between $\operatorname{AR}$ and $\operatorname{SR}$:

\[
\operatorname{AR}(\hat{L}_{k})-\operatorname{SR}(\hat{L}_{k})\geq 2\sqrt{\frac{2}{\pi}}\frac{\varepsilon}{\sqrt{n}}\sigma_{\min}\left(\Sigma_{v}\right)\|C\|_{F}^{2}\left(\frac{\sigma_{\min}\left(A^{k}\Sigma_{0}(A^{k})^{\top}+\sum_{i=1}^{k}A^{k-i}\Sigma_{w}(A^{k-i})^{\top}\right)}{(N+1)\left\|\bar{\Sigma}\right\|_{2}^{2}\left\|W_{o}(N)\right\|_{F}+\left\|\Sigma_{v}\right\|_{2}}\right)^{2}.
\]

Proof: We begin by writing the Kalman estimator using the notation defined in (12):

\[
L_{k}=H_{k}M\left(M^{\top}M+I_{N+1}\otimes\Sigma_{v}\right)^{-1}.
\]

Suppose the rank of $M$ is $m$. Then the singular value decomposition of $M$ can be taken to be

\[
M=U\begin{bmatrix}S&0\\ 0&0\end{bmatrix}V^{\top},
\]

where $S=\operatorname{diag}\left(\begin{bmatrix}s_{1}&s_{2}&\dots&s_{m}\end{bmatrix}\right)$ with $s_{1}\geq s_{2}\geq\dots\geq s_{m}>0$, while $U\in\mathbb{R}^{n(N+1)\times n(N+1)}$ and $V\in\mathbb{R}^{p(N+1)\times p(N+1)}$. We can now lower bound the Frobenius norm of $L_{k}$ as follows:

\[
\|L_{k}\|_{F}^{2}\geq\|H_{k}M\|_{F}^{2}\,\sigma_{\min}\left\{\left(M^{\top}M+I_{N+1}\otimes\Sigma_{v}\right)^{-1}\right\}^{2}.
\]

Note that $\sigma_{\min}\left\{\left(M^{\top}M+I_{N+1}\otimes\Sigma_{v}\right)^{-1}\right\}\geq\sigma_{\min}\left\{\left(M^{\top}M+\|\Sigma_{v}\|_{2}I\right)^{-1}\right\}$. Therefore

\begin{align*}
\|L_{k}\|_{F}^{2}
&\geq \sigma_{\min}\left\{\left(\begin{bmatrix}S^{2}&\\ &0\end{bmatrix}+\|\Sigma_{v}\|_{2}I\right)^{-2}\right\}\|H_{k}M\|_{F}^{2} \tag{13}\\
&= \sigma_{\min}\left\{\left(\begin{bmatrix}S^{2}&\\ &0\end{bmatrix}+\|\Sigma_{v}\|_{2}I\right)^{-2}\right\}\left\|H_{k}H^{\top}\left(C^{\top}\otimes I\right)\right\|_{F}^{2}\\
&= \sigma_{\min}\left\{\left(\begin{bmatrix}S^{2}&\\ &0\end{bmatrix}+\|\Sigma_{v}\|_{2}I\right)^{-2}\right\}\left\|\begin{bmatrix}A^{k}\Sigma_{0}C^{\top}&\ldots&A^{k}\Sigma_{0}\left(A^{N}\right)^{\top}C^{\top}+\sum_{i=1}^{k}A^{k-i}\Sigma_{w}\left(A^{N-i}\right)^{\top}C^{\top}\end{bmatrix}\right\|_{F}^{2}.
\end{align*}

Now observe that

\[
\sigma_{\min}\left\{\left(\begin{bmatrix}S^{2}&\\ &0\end{bmatrix}+\|\Sigma_{v}\|_{2}I\right)^{-2}\right\}=\frac{1}{\left(s_{1}^{2}+\|\Sigma_{v}\|_{2}\right)^{2}}. \tag{14}
\]

Also note that $s_{1}=\|M\|_{2}\leq\left\|\bar{\Sigma}^{1/2}\right\|_{2}\left\|\bar{\Sigma}^{-1/2}M\right\|_{F}$. We have that

\[
\left\|\bar{\Sigma}^{-1/2}M\right\|_{F}^{2}=\left\|\begin{bmatrix}\mathcal{O}_{N}^{\top}\\ \left(\mathcal{Z}\mathcal{O}_{N}\right)^{\top}\\ \vdots\\ \left(\mathcal{Z}^{N}\mathcal{O}_{N}\right)^{\top}\end{bmatrix}\right\|_{F}^{2}\leq\sum_{i=0}^{N}\left\|\mathcal{Z}^{i}\mathcal{O}_{N}\right\|_{F}^{2}\leq(N+1)\left\|\mathcal{O}_{N}\right\|_{F}^{2}.
\]

Then

\[
s_{1}\leq\left\|\bar{\Sigma}^{1/2}\right\|_{2}\sqrt{N+1}\left\|\mathcal{O}_{N}\right\|_{F}. \tag{15}
\]

When $k\geq 0$, keeping only the block column of index $k$ gives

\begin{align*}
&\left\|\begin{bmatrix}A^{k}\Sigma_{0}C^{\top}&\ldots&A^{k}\Sigma_{0}\left(A^{N}\right)^{\top}C^{\top}+\sum_{i=1}^{k}A^{k-i}\Sigma_{w}\left(A^{N-i}\right)^{\top}C^{\top}\end{bmatrix}\right\|_{F}^{2}\\
&\quad\geq\left\|\left(A^{k}\Sigma_{0}\left(A^{k}\right)^{\top}+\sum_{i=1}^{k}A^{k-i}\Sigma_{w}\left(A^{k-i}\right)^{\top}\right)C^{\top}\right\|_{F}^{2}\\
&\quad\geq\sigma_{\min}\left(A^{k}\Sigma_{0}\left(A^{k}\right)^{\top}+\sum_{i=1}^{k}A^{k-i}\Sigma_{w}\left(A^{k-i}\right)^{\top}\right)^{2}\|C\|_{F}^{2}.
\end{align*}

In conjunction with (13), (14), and (15), this leads to (9). \blacksquare
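The two linear-algebra facts behind (13) and (14), namely $\|XB\|_{F}\geq\|X\|_{F}\,\sigma_{\min}(B)$ and $\sigma_{\min}\{(M^{\top}M+I\otimes\Sigma_{v})^{-1}\}\geq(s_{1}^{2}+\|\Sigma_{v}\|_{2})^{-1}$, can be checked on random data. The sketch below uses random stand-ins for $H_{k}$ and $M$ rather than the structured matrices from (12), and all sizes and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, N = 2, 2, 1                                        # illustrative sizes only
Hk = rng.standard_normal((n, n * (N + 1)))               # stand-in for H_k
M = rng.standard_normal((n * (N + 1), p * (N + 1)))      # stand-in for M
G = rng.standard_normal((p, p))
Sigma_v = G @ G.T + 0.1 * np.eye(p)                      # positive definite noise covariance
noise = np.kron(np.eye(N + 1), Sigma_v)                  # I_{N+1} (x) Sigma_v

# Estimator of the form H_k M (M'M + I (x) Sigma_v)^{-1}.
Lk = Hk @ M @ np.linalg.inv(M.T @ M + noise)

s1 = np.linalg.norm(M, 2)                                # largest singular value of M
sv = np.linalg.norm(Sigma_v, 2)                          # spectral norm of Sigma_v
lhs = np.linalg.norm(Lk, "fro") ** 2
rhs = np.linalg.norm(Hk @ M, "fro") ** 2 / (s1**2 + sv) ** 2

print(lhs, rhs)
```

Here `lhs` is $\|L_{k}\|_{F}^{2}$ and `rhs` is the lower bound obtained by combining (13) with (14), so `lhs >= rhs` should hold for any draw.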

C.5 Proof of Lemma 3.3

Lemma 3.3.  For any $L\in\mathbb{R}^{n\times p(N+1)}$, the following bound holds:

\[
\operatorname{AR}(L)-\operatorname{SR}(L)\leq 2\varepsilon\|L\|_{2}\left\|\Sigma^{1/2}\right\|_{F}+\varepsilon^{2}\|L\|_{2}^{2},
\]

where $\Sigma^{1/2}$ is the symmetric square root of the covariance of $x_{k}-LY_{N}$.

Proof: By Theorem 3.1,

\[
\operatorname{AR}(L)-\operatorname{SR}(L)\leq 2\varepsilon\,\mathbb{E}\left[\|L^{\top}(x_{k}-LY_{N})\|_{2}\right]+\varepsilon^{2}\lambda_{\max}\left(L^{\top}L\right)\leq 2\varepsilon\|L\|_{2}\,\mathbb{E}\left[\|x_{k}-LY_{N}\|_{2}\right]+\varepsilon^{2}\|L\|_{2}^{2}.
\]

We can upper bound $\mathbb{E}\left[\|x_{k}-LY_{N}\|_{2}\right]$ by bounding the expectation of the Euclidean norm of a normal random variable. In particular, let $w$ be an $n$-dimensional standard normal random variable, so that $\mathbb{E}\left[\|x_{k}-LY_{N}\|_{2}\right]=\mathbb{E}\left[\left\|\Sigma^{1/2}w\right\|_{2}\right]$, where $\Sigma$ is the covariance of $x_{k}-LY_{N}$ and $\Sigma^{1/2}$ is its symmetric square root. Let $US^{1/2}U^{\top}:=\Sigma^{1/2}$ be the eigenvalue decomposition of $\Sigma^{1/2}$, so that

\[
\left\|\Sigma^{1/2}w\right\|_{2}=\left\|US^{1/2}U^{\top}w\right\|_{2}.
\]

Now define $z=U^{\top}w$, so that $z\sim\mathcal{N}(0,I)$. Then the above quantity equals $\sqrt{\left\|US^{1/2}z\right\|_{2}^{2}}$. Jensen's inequality tells us that

\[
\mathbb{E}\left[\sqrt{\left\|US^{1/2}z\right\|_{2}^{2}}\right]\leq\sqrt{\mathbb{E}\left[\left\|US^{1/2}z\right\|_{2}^{2}\right]}=\sqrt{\mathbb{E}\left[z^{\top}Sz\right]}=\sqrt{\operatorname{tr}(S)}=\left\|S^{1/2}\right\|_{F}=\left\|\Sigma^{1/2}\right\|_{F},
\]

from which the lemma follows. \blacksquare
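The Jensen step, $\mathbb{E}\|\Sigma^{1/2}w\|_{2}\leq\|\Sigma^{1/2}\|_{F}=\sqrt{\operatorname{tr}(\Sigma)}$, is easy to see in simulation. The following sketch uses a random positive definite matrix as a stand-in for the covariance of $x_{k}-LY_{N}$. The names and sample size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 3
G = rng.standard_normal((n, n))
Sigma = G @ G.T + 0.1 * np.eye(n)            # stand-in covariance of x_k - L Y_N

# Symmetric square root Sigma^{1/2} = U diag(sqrt(eigenvalues)) U'.
vals, U = np.linalg.eigh(Sigma)
root = (U * np.sqrt(vals)) @ U.T

w = rng.standard_normal((100_000, n))        # rows are standard normal draws
expected_norm = np.mean(np.linalg.norm(w @ root, axis=1))   # estimate of E||Sigma^{1/2} w||_2
jensen_bound = np.linalg.norm(root, "fro")                  # equals sqrt(trace(Sigma))

print(expected_norm, jensen_bound)
```

The estimate should fall below `jensen_bound`, and `jensen_bound**2` should match `np.trace(Sigma)` up to rounding.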

C.6 General Statement of Theorem 3.3

Theorem 3.3.  Suppose that $\hat{L}_{k}$ is the Kalman state estimator given by Lemma 2.1. Then the gap between $\operatorname{AR}(\hat{L}_{k})$ and $\operatorname{SR}(\hat{L}_{k})$ is upper bounded by

\begin{align*}
\operatorname{AR}(\hat{L}_{k})-\operatorname{SR}(\hat{L}_{k})
&\leq \varepsilon\left(\frac{\left\|A^{k}\Sigma_{0}\left(A^{k}\right)^{\top}+\sum_{i=1}^{k}A^{k-i}\Sigma_{w}\left(A^{k-i}\right)^{\top}\right\|_{2}}{\sigma_{\min}(W_{o}(N))^{1/2}\sigma_{\min}\left(\bar{\Sigma}\right)}\right)\\
&\quad\times\left(2\sqrt{n}\left(\left\|\bar{\Sigma}\right\|_{2}+\left(\frac{\sqrt{\left\|\Sigma_{v}\right\|_{2}}}{\sigma_{\min}(W_{o}(N))^{1/2}\sigma_{\min}\left(\bar{\Sigma}\right)}\right)^{2}\right)^{1/2}+\varepsilon\left(\frac{1}{\sigma_{\min}(W_{o}(N))^{1/2}\sigma_{\min}\left(\bar{\Sigma}\right)}\right)\right).
\end{align*}

Furthermore, if $\lambda_{\min}(W_{o}(N))^{1/2}\geq\sigma_{v}/\sigma_{\min}\left(\bar{\Sigma}\right)$, and defining $\kappa=\frac{\lambda_{\min}(W_{o}(N))^{1/2}\sigma_{\min}\left(\bar{\Sigma}\right)}{\lambda_{\min}(W_{o}(N))\sigma_{\min}\left(\bar{\Sigma}\right)^{2}+\sigma_{v}}$, we get the bound

\begin{align*}
\operatorname{AR}(\hat{L}_{k})-\operatorname{SR}(\hat{L}_{k})
&\leq \varepsilon\left(\kappa\sqrt{\left\|A^{k}\Sigma_{0}\left(A^{k}\right)^{\top}+\sum_{i=1}^{k}A^{k-i}\Sigma_{w}\left(A^{k-i}\right)^{\top}\right\|_{2}}\right)\\
&\quad\times\left(2\sqrt{n}\left(\sigma_{\max}(\bar{\Sigma})^{2}+\sigma_{\min}(\Sigma_{v})\kappa^{2}\right)^{1/2}+\varepsilon\kappa\right).
\end{align*}

Proof: By Lemma 3.3, upper bounding the gap between $\operatorname{AR}(L_{k})$ and $\operatorname{SR}(L_{k})$ reduces to upper bounding $\left\|\Sigma^{1/2}\right\|_{F}$ and $\|L_{k}\|_{2}$. First consider $\left\|\Sigma^{1/2}\right\|_{F}$. Equivalence of norms tells us that

\[
\left\|\Sigma^{1/2}\right\|_{F}\leq\sqrt{n}\left\|\Sigma^{1/2}\right\|_{2}=\sqrt{n}\left\|\Sigma\right\|_{2}^{1/2}. \tag{16}
\]

Recalling the notation defined in (12), the Kalman filter may be expressed as $L_{k}=H_{k}M\left(M^{\top}M+I_{N+1}\otimes\Sigma_{v}\right)^{-1}$. We may also express $\Sigma$ in terms of this notation: $\Sigma=\left(H_{k}-L_{k}M^{\top}\right)\bar{\Sigma}\left(H_{k}-L_{k}M^{\top}\right)^{\top}+L_{k}\left(I\otimes\Sigma_{v}\right)L_{k}^{\top}$. To upper bound the spectral norm of this matrix, we can leverage the triangle inequality and submultiplicativity:

\begin{align*}
\|\Sigma\|_{2}
&\leq \left\|\left(H_{k}-L_{k}M^{\top}\right)\bar{\Sigma}\left(H_{k}-L_{k}M^{\top}\right)^{\top}\right\|_{2}+\left\|L_{k}\left(I\otimes\Sigma_{v}\right)L_{k}^{\top}\right\|_{2}\\
&\leq \left\|\bar{\Sigma}\right\|_{2}\left\|H_{k}-L_{k}M^{\top}\right\|_{2}^{2}+\|\Sigma_{v}\|_{2}\|L_{k}\|_{2}^{2}.
\end{align*}

Note that $H_{k}-L_{k}M^{\top}=H_{k}-H_{k}M\left(M^{\top}M+I_{N+1}\otimes\Sigma_{v}\right)^{-1}M^{\top}=H_{k}\left(I-M\left(M^{\top}M+I_{N+1}\otimes\Sigma_{v}\right)^{-1}M^{\top}\right)$. Then by submultiplicativity,

\[
\left\|H_{k}\left(I-M\left(M^{\top}M+I_{N+1}\otimes\Sigma_{v}\right)^{-1}M^{\top}\right)\right\|_{2}\leq\|H_{k}\|_{2}\left\|I-M\left(M^{\top}M+I_{N+1}\otimes\Sigma_{v}\right)^{-1}M^{\top}\right\|_{2}\leq\|H_{k}\|_{2}.
\]

We can further express $\|H_{k}\|_{2}$ in terms of system properties. In particular, we have

\[
\|H_{k}\|_{2}=\sqrt{\|H_{k}\|_{2}^{2}}=\sqrt{\left\|H_{k}H_{k}^{\top}\right\|_{2}}=\sqrt{\left\|A^{k}\Sigma_{0}\left(A^{k}\right)^{\top}+\sum_{i=1}^{k}A^{k-i}\Sigma_{w}\left(A^{k-i}\right)^{\top}\right\|_{2}}.
\]

Thus

\[
\|\Sigma\|_{2}\leq\left\|\bar{\Sigma}\right\|_{2}\|H_{k}\|_{2}^{2}+\|\Sigma_{v}\|_{2}\|L_{k}\|_{2}^{2}\leq\left\|\bar{\Sigma}\right\|_{2}\left\|A^{k}\Sigma_{0}\left(A^{k}\right)^{\top}+\sum_{i=1}^{k}A^{k-i}\Sigma_{w}\left(A^{k-i}\right)^{\top}\right\|_{2}+\|\Sigma_{v}\|_{2}\|L_{k}\|_{2}^{2}. \tag{17}
\]

Next we obtain a bound on $\|L_{k}\|_{2}$. As in the proof of Theorem 9, we will assign $m:=\operatorname{rank}(M)$ and take the singular value decomposition of $M$ to be $U\begin{bmatrix}S&0\\ 0&0\end{bmatrix}V^{\top}$, where $S=\operatorname{diag}\left(\begin{bmatrix}s_{1}&\dots&s_{m}\end{bmatrix}\right)$ with $s_{1}\geq s_{2}\geq\dots\geq s_{m}>0$, while $U\in\mathbb{R}^{n(N+1)\times n(N+1)}$ and $V\in\mathbb{R}^{p(N+1)\times p(N+1)}$. Then

\begin{align*}
\|L_{k}\|_{2}
&= \left\|H_{k}M\left(M^{\top}M+I_{N+1}\otimes\Sigma_{v}\right)^{-1}\right\|_{2}\leq\|H_{k}\|_{2}\left\|M\left(M^{\top}M+\sigma_{\min}\left(\Sigma_{v}\right)I\right)^{-1}\right\|_{2}\\
&= \|H_{k}\|_{2}\left\|U\begin{bmatrix}S&\\ &0\end{bmatrix}V^{\top}\left(V\left(\begin{bmatrix}S^{2}&\\ &0\end{bmatrix}+\sigma_{\min}(\Sigma_{v})I\right)V^{\top}\right)^{-1}\right\|_{2}\\
&= \|H_{k}\|_{2}\left\|U\begin{bmatrix}S&\\ &0\end{bmatrix}\left(\begin{bmatrix}S^{2}&\\ &0\end{bmatrix}+\sigma_{\min}(\Sigma_{v})I\right)^{-1}V^{\top}\right\|_{2}\\
&= \|H_{k}\|_{2}\left\|S\left(S^{2}+\sigma_{\min}(\Sigma_{v})I\right)^{-1}\right\|_{2}\\
&\leq \|H_{k}\|_{2}\max_{1\leq i\leq m}\frac{s_{i}}{s_{i}^{2}+\sigma_{\min}(\Sigma_{v})}.
\end{align*}

A simple bound on the last maximization is

\[
\max_{1\leq i\leq m}\frac{s_{i}}{s_{i}^{2}+\sigma_{\min}(\Sigma_{v})}\leq\max_{1\leq i\leq m}\frac{s_{i}}{s_{i}^{2}}\leq\frac{1}{s_{m}}. \tag{18}
\]

Note that $s_{m}\geq\lambda_{\min}(W_{o}(N))^{1/2}\sigma_{\min}\left(\bar{\Sigma}\right)$, so $\frac{1}{s_{m}}\leq\frac{1}{\lambda_{\min}(W_{o}(N))^{1/2}\sigma_{\min}\left(\bar{\Sigma}\right)}$. Then

\[
\|L_{k}\|_{2}\leq\frac{\|H_{k}\|_{2}}{\lambda_{\min}(W_{o}(N))^{1/2}\sigma_{\min}\left(\bar{\Sigma}\right)}\leq\frac{\sqrt{\left\|A^{k}\Sigma_{0}\left(A^{k}\right)^{\top}+\sum_{i=1}^{k}A^{k-i}\Sigma_{w}\left(A^{k-i}\right)^{\top}\right\|_{2}}}{\lambda_{\min}(W_{o}(N))^{1/2}\sigma_{\min}\left(\bar{\Sigma}\right)}. \tag{19}
\]

Then the first half of the theorem follows by combining the result of Lemma 3.3 with (16), (17) and (19).

However, if $\lambda_{\min}(W_{o}(N))^{1/2}\sigma_{\min}\left(\bar{\Sigma}\right)\geq\sigma_{v}$, then the maximum in (18) is attained at $i=m$:

\begin{align*}
\max_{1\leq i\leq m}\frac{s_{i}}{s_{i}^{2}+\sigma_{\min}(\Sigma_{v})}
&= \frac{s_{m}}{s_{m}^{2}+\sigma_{v}}\\
&\leq \frac{\lambda_{\min}(W_{o}(N))^{1/2}\sigma_{\min}\left(\bar{\Sigma}\right)}{\lambda_{\min}(W_{o}(N))\sigma_{\min}\left(\bar{\Sigma}\right)^{2}+\sigma_{v}}.
\end{align*}

Therefore, when $\lambda_{\min}(W_{o}(N))^{1/2}\geq\sigma_{v}/\sigma_{\min}\left(\bar{\Sigma}\right)$, we have the more precise bound

\begin{align*}
\|L_{k}\|_{2}
&\leq \|H_{k}\|_{2}\frac{\lambda_{\min}(W_{o}(N))^{1/2}\sigma_{\min}\left(\bar{\Sigma}\right)}{\lambda_{\min}(W_{o}(N))\sigma_{\min}\left(\bar{\Sigma}\right)^{2}+\sigma_{v}}\\
&\leq \sqrt{\left\|A^{k}\Sigma_{0}\left(A^{k}\right)^{\top}+\sum_{i=1}^{k}A^{k-i}\Sigma_{w}\left(A^{k-i}\right)^{\top}\right\|_{2}}\,\frac{\lambda_{\min}(W_{o}(N))^{1/2}\sigma_{\min}\left(\bar{\Sigma}\right)}{\lambda_{\min}(W_{o}(N))\sigma_{\min}\left(\bar{\Sigma}\right)^{2}+\sigma_{v}},
\end{align*}

which leads to the second half of the theorem.

\blacksquare
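The refinement in the second half of the proof rests on the shape of the scalar map $s\mapsto s/(s^{2}+c)$, which increases up to $s=\sqrt{c}$, peaks at the value $1/(2\sqrt{c})$, and decreases afterwards. So once all singular values lie to the right of the peak, the maximum over $s_{1}\geq\dots\geq s_{m}$ sits at the smallest one. The sketch below illustrates this on a grid. The constant `c` is an arbitrary stand-in for the noise level in the denominator, and `gain` is an illustrative name.

```python
import numpy as np

def gain(s, c):
    """Scalar map s / (s^2 + c) that appears in the bound on the estimator norm."""
    return s / (s**2 + c)

c = 0.25                                   # arbitrary stand-in noise level
s = np.linspace(1e-3, 5.0, 5000)
vals = gain(s, c)

s_peak = s[np.argmax(vals)]                # grid point closest to the maximizer sqrt(c)
tail = vals[s >= np.sqrt(c)]               # values to the right of the peak

print(s_peak, vals.max(), 1.0 / (2.0 * np.sqrt(c)))
```

On the tail the values are non-increasing, which is why a lower bound on $s_{m}$ that itself lies to the right of the peak translates into an upper bound on the maximum.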