
Robustness against Adversarial Attacks in Neural Networks using Incremental Dissipativity

Bernardo Aquino, Arash Rahnama, Peter Seiler, Lizhen Lin, and Vijay Gupta. B. Aquino and V. Gupta are with the Department of Electrical Engineering, University of Notre Dame, Notre Dame, IN 46556 USA. Email: {bcruz2,vgupta}@nd.edu. A. Rahnama is with Amazon Inc. Email: arashrahnama@gmail.com. P. Seiler is with Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48109 USA. Email: pseiler@umich.edu. L. Lin is with the Department of Applied and Computational Mathematics and Statistics, University of Notre Dame, Notre Dame, IN 46556 USA. Email: lizhen.lin@nd.edu.
Abstract

Adversarial examples can easily degrade the classification performance in neural networks. Empirical methods for promoting robustness to such examples have been proposed, but often lack both analytical insights and formal guarantees. Recently, some robustness certificates have appeared in the literature based on system theoretic notions. This work proposes an incremental dissipativity-based robustness certificate for neural networks in the form of a linear matrix inequality for each layer. We also propose a sufficient spectral norm bound for this certificate which is scalable to neural networks with multiple layers. We demonstrate the improved performance against adversarial attacks on a feed-forward neural network trained on MNIST and an Alexnet trained using CIFAR-10.

Index Terms:
Adversarial Attacks, Deep Neural Networks, Robust Design, Passivity Theory, Spectral Regularization

I Introduction

Neural networks are powerful structures that, with appropriate training, can represent essentially any non-linear function for classification and regression tasks. However, neural networks still lack formal performance guarantees, limiting their application in safety-critical systems [1]. Furthermore, many studies have shown the susceptibility of neural networks to small perturbations from carefully crafted adversarial attacks, which, in the case of imaging systems, may be imperceptible to the human eye [1], [2], [3], [4].

Different types of defenses have emerged trying to address this shortcoming, with perhaps the most successful of them being adversarial training [5], [6], [7], [8] and defensive distillation [9], [10]. However, even though neural networks with these defenses empirically show superior performance against adversarial attacks than those without such approaches, these methods do not broadly provide either design insights or formal guarantees on robustness.

In the search for guarantees, other studies have proposed systems-theoretic robustness certificates enforced during training. For instance, Lyapunov stability certificates based on incremental quadratic constraints were proposed in [11] and applied to imitation learning in [12]. Lipschitz constant-based certificates have been a particularly fruitful approach, of which we highlight [13], which uses a convex optimization approach, [14], which uses sparse polynomial optimization, [15], which uses averaged activation operators, and [16], which proposes global Lipschitz training using the Alternating Direction Method of Multipliers (ADMM).

In this work, we take an incremental dissipativity-based approach and derive a robustness certificate for neural network architectures in the form of a Linear Matrix Inequality (LMI). This LMI utilizes incremental Quadratic Constraints (iQCs) to describe the non-linearities in each neuron, as proposed in [17]. Recently, [18] used iQCs to define a new class of Recurrent (Equilibrium) Neural Networks for applications such as system identification and robust feedback control. From this LMI we derive a sufficient condition, in the form of an easily implementable bound on the spectral norm of the weight matrix of each layer. There are three advantages of this approach. First, it generalizes the Lipschitz condition imposed on neural networks in previous works to a condition guaranteeing that the input-output response of the neural network is sector bounded. The Lipschitz condition defines a particular sector that can be expressed as a special case in our framework; providing more degrees of freedom to the designer in choosing the sector can lead to a better point on the performance-robustness tradeoff. Second, the approach scales with the number of layers of the neural network. Consideration of deep structures and convolutional layers is a known issue with existing optimization-based paradigms [19] and has been explicitly pointed out as a limitation in works such as [16]. Third, our condition provides insight into the high empirical effectiveness of spectral norm regularization for improved generalizability in deep network structures [20] and for training stability of Generative Adversarial Networks (GANs) [21].

Passivity analysis for neural networks has been considered for systems with time delay [22], [23], but no training methodology was offered, and the resulting LMIs can quickly become computationally burdensome. [24] provided a passivity approach for robust neural networks, but no certificate was presented. Further, our theoretical development uses the more general notion of incrementally sector-bounded neural network layers, which allows for negative passivity indices.

II Background

In this work, we propose an incremental dissipativity-based approach to quantify and engineer the robustness of neural networks against adversarial attacks. We focus on image classification networks, since the design of adversarial perturbations is perhaps most developed for such systems [25, 26]. Further, the presence of convolutional layers in most such architectures poses challenges to existing approaches for guaranteeing robustness. In these applications, our main goal is to reduce the classification error rate on adversarial image test sets. We note that the proposed approach can be applied to other domains as well.

Robustness against adversarial attacks

Adversarial attacks on neural networks seek to produce a significant change in the output when the input is perturbed slightly. Thus, designing a network that limits the change in the output as a function of the change in the input can mitigate the effects of adversarial attacks. We claim that enforcing neural network systems to be incrementally dissipative (specifically, sector bounded) can limit the variation in the output, given some perturbation in the input.

Consider a neural network $f_{NN}:\mathcal{R}^{n}\to\mathcal{R}^{m}$. In the image classification example, the input may be the vectorized version of the image and the output may be a vector of probabilities corresponding to the various classes. We are interested in the deviation between $f_{NN}(x)$ and $f_{NN}(x+\delta_{x})$, where $x$ and $x+\delta_{x}$ are the actual and adversarial inputs, respectively. Recently, a number of works have pointed out that by enforcing the condition

$$\left\|f_{NN}(x+\delta_{x})-f_{NN}(x)\right\|_{2}\leq\gamma\left\|\delta_{x}\right\|_{2} \qquad (1)$$

for some $\gamma>0$, we can guarantee that a norm-bounded adversarial perturbation can shift the output of the neural network only by a bounded amount. Intuitively, this leads to a certain robustness in the classification performance of the neural network. While the Lipschitz constant based approach enforces a symmetric sector in the sense of equation (1) above, the incremental sector boundedness constraint that we consider in this paper generalizes it by allowing the sector slopes to be independent of each other and to have arbitrary signs. Further, we can consider convolutional layers, as well as compose certificates for individual layers to obtain a robustness certificate for the entire network. Finally, we can obtain a computationally efficient training approach that ensures this certificate is met using a spectral norm based condition. In this sense, our work also sheds light on the empirically observed effectiveness of spectral norm based regularization in promoting robustness in neural networks. However, this approach cannot guarantee robustness for all attacks and data sets. As stated in [27], depending on the specific data set, the margin between decision boundaries can be very narrow, and even a tiny perturbation in the output can change the classification label. Therefore, our method implicitly assumes there is some margin between classes.

Sector boundedness

Dissipativity has been widely used in control systems theory due to its close relation to stability analysis and compositional properties [28], [29], [30].

Definition II.1.

([31]) A discrete-time system with input $\mathbf{x}(k)$ and corresponding output $\mathbf{y}(k)$ at time $k$ is $(Q,S,R)$-dissipative if, $\forall k$, the condition $0\leq s(\mathbf{x}(k),\mathbf{y}(k))$ holds for all admissible initial conditions of the system and all admissible input sequences $\{\mathbf{x}(j)\}_{j=0}^{k}$, where $s(\mathbf{x}(k),\mathbf{y}(k))$ is the supply rate given by

$$s(\mathbf{x},\mathbf{y})=\mathbf{y}^{T}\mathbf{Q}\mathbf{y}+\mathbf{x}^{T}\mathbf{S}\mathbf{y}+\mathbf{y}^{T}\mathbf{S}^{T}\mathbf{x}+\mathbf{x}^{T}\mathbf{R}\mathbf{x}, \qquad (2)$$

for matrices $\mathbf{Q}$, $\mathbf{S}$, and $\mathbf{R}$ of appropriate dimensions.

Depending on the matrices $\mathbf{Q}$, $\mathbf{S}$, and $\mathbf{R}$, the system will exhibit different dynamical properties. We are particularly interested in the following three cases:

Definition II.2.

([29], [32]) A $(Q,S,R)$-dissipative discrete-time system is:

  1. passive, if $\mathbf{Q}=\mathbf{0}$, $\mathbf{S}=\frac{1}{2}\mathbf{I}$, $\mathbf{R}=\mathbf{0}$;

  2. strictly passive, if $\mathbf{Q}=-\delta\mathbf{I}$, $\mathbf{S}=\frac{1}{2}\mathbf{I}$, $\mathbf{R}=-\nu\mathbf{I}$, for some $\delta>0$ and $\nu>0$;

  3. sector bounded with slopes $\frac{1-\sqrt{1-4\delta\nu}}{2\delta}$ and $\frac{1+\sqrt{1-4\delta\nu}}{2\delta}$, if $\mathbf{Q}=-\delta\mathbf{I}$, $\mathbf{S}=\frac{1}{2}\mathbf{I}$, $\mathbf{R}=-\nu\mathbf{I}$, for some $\delta$ and $\nu$.

For sector bounded systems, the constants $\delta$ and $\nu$ are called passivity indices. We will call a sector bounded system output strictly passive (OSP) when $\delta>0$ and input strictly passive (ISP) when $\nu>0$. Finally, all of the above definitions extend to incrementally $QSR$-dissipative, incrementally passive, incrementally strictly passive, and incrementally sector bounded systems by defining the supply rate as

$$s(\mathbf{\Delta_{x}},\mathbf{\Delta_{y}})=\mathbf{\Delta_{y}}^{T}\mathbf{Q}\mathbf{\Delta_{y}}+\mathbf{\Delta_{x}}^{T}\mathbf{S}\mathbf{\Delta_{y}}+\mathbf{\Delta_{y}}^{T}\mathbf{S}^{T}\mathbf{\Delta_{x}}+\mathbf{\Delta_{x}}^{T}\mathbf{R}\mathbf{\Delta_{x}}, \qquad (3)$$

where $\mathbf{\Delta_{x}}=\mathbf{x_{1}}-\mathbf{x_{2}}$ and $\mathbf{\Delta_{y}}=\mathbf{y_{1}}-\mathbf{y_{2}}$ for two inputs $\mathbf{x_{1}}$ and $\mathbf{x_{2}}$, and the corresponding outputs $\mathbf{y_{1}}$ and $\mathbf{y_{2}}$.
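
As an illustration (our worked special case, not stated in this form in the text), the Lipschitz condition (1) is itself an incremental supply-rate condition: squaring both sides of (1) and rearranging gives

$$0\leq\gamma^{2}\mathbf{\Delta_{x}}^{T}\mathbf{\Delta_{x}}-\mathbf{\Delta_{y}}^{T}\mathbf{\Delta_{y}},$$

which is (3) with $\mathbf{Q}=-\mathbf{I}$, $\mathbf{S}=\mathbf{0}$, and $\mathbf{R}=\gamma^{2}\mathbf{I}$, corresponding to the symmetric sector $[-\gamma,\gamma]$; general choices of $\mathbf{Q}$, $\mathbf{S}$, and $\mathbf{R}$ relax exactly this symmetry.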

III Enforcing Incremental Sector Boundedness

Incremental QSR-dissipativity for a single layer

Consider a neural network layer that receives the vector $\mathbf{x}$ as an input and yields the output $\mathbf{y}$. For a non-convolutional layer, $\mathbf{y}=\phi(\mathbf{W}\mathbf{x}+\mathbf{b})$, where $\phi(\cdot)$ is an element-wise nonlinear activation function. In this paper, we assume that this function is incrementally sector bounded by $[\alpha,\beta]$ for some $\alpha$ and $\beta$. This assumption is satisfied by most commonly used functions, including tanh, ReLU, and leaky ReLU, and is commonly made in works such as [11]. We also assume that the same activation function $\phi(\cdot)$ is used in every element of the layer, which is the case for most neural network layer architectures. We can then state the following result.
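
As a quick numerical illustration of this assumption (a minimal sketch of ours, using the leaky ReLU slope $a=0.1$ from Section V, so that $[\alpha,\beta]=[0.1,1]$), the incremental sector bound simply says that every chord of the activation function has slope in $[\alpha,\beta]$:

    import numpy as np

    def leaky_relu(v, a=0.1):
        # Element-wise leaky ReLU with negative-side slope a.
        return np.where(v >= 0, v, a * v)

    # Incremental sector check: every chord slope (phi(v1) - phi(v2)) / (v1 - v2)
    # must lie in [alpha, beta] = [0.1, 1].
    v1, v2 = np.meshgrid(np.linspace(-3, 3, 101), np.linspace(-3, 3, 101))
    mask = v1 != v2
    slopes = (leaky_relu(v1[mask]) - leaky_relu(v2[mask])) / (v1[mask] - v2[mask])
    assert slopes.min() >= 0.1 - 1e-9 and slopes.max() <= 1.0 + 1e-9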

Theorem 1.

Consider a non-convolutional neural network layer defined by $\mathbf{y}=\phi(\mathbf{Wx+b})$, where $\phi(\cdot)$ is incrementally sector bounded by $[\alpha,\beta]$. Define $m=\frac{\alpha+\beta}{2}$ and $p=\alpha\beta$. The layer is incrementally QSR-dissipative if the LMI

$$\mathbf{M}=\begin{bmatrix}\mathbf{Q}&\mathbf{S}\\ \mathbf{S}^{\top}&\mathbf{R}\end{bmatrix}+\begin{bmatrix}\mathbf{\Lambda}&-m\mathbf{\Lambda}\mathbf{W}\\ -m\mathbf{W}^{\top}\mathbf{\Lambda}&p\mathbf{W}^{\top}\mathbf{\Lambda}\mathbf{W}\end{bmatrix}\succcurlyeq 0 \qquad (4)$$

is feasible for some $\mathbf{\Lambda}$ defined as

$$\mathbf{\Lambda}=\sum_{1\leq i\leq n}\lambda_{ii}\,\mathbf{e}_{i}\mathbf{e}_{i}^{\top}\geq 0, \qquad (5)$$

where $\mathbf{e}_{i}$ is the $i$-th standard basis vector.

Proof.

The proof follows by using an S-procedure, following the arguments in [17]. First, for notational ease, for any input $\mathbf{x_{i}}$ of the neural network layer, denote the output by $\mathbf{y_{i}}$ and define $\mathbf{v_{i}}=\mathbf{Wx_{i}}+\mathbf{b}$. With this notation, we can write:

$$\begin{bmatrix}\mathbf{0}&\mathbf{W}\\ \mathbf{I}&\mathbf{0}\end{bmatrix}\begin{bmatrix}\mathbf{y_{1}}-\mathbf{y_{2}}\\ \mathbf{x_{1}}-\mathbf{x_{2}}\end{bmatrix}=\begin{bmatrix}\mathbf{v_{1}}-\mathbf{v_{2}}\\ \mathbf{y_{1}}-\mathbf{y_{2}}\end{bmatrix}, \qquad (6)$$

for any inputs $\mathbf{x}_{1}$ and $\mathbf{x}_{2}$.

The first quadratic form for the S-procedure can be obtained using the iQC approach outlined in [17]. Since the nonlinear function $\phi(\cdot)$ is an element-wise function that is incrementally sector bounded by $[\alpha,\beta]$, we have

$$\alpha\leq\frac{\mathbf{y}_{1}^{j}-\mathbf{y}_{2}^{j}}{\mathbf{v}_{1}^{j}-\mathbf{v}_{2}^{j}}\leq\beta, \qquad (7)$$

for all $j=1,\ldots,n$, where $n$ is the number of neurons in the layer under consideration, $\mathbf{v}_{1}^{j}$ denotes the input to the nonlinear activation function of the $j$-th neuron, and $\mathbf{y}_{1}^{j}$ denotes its output. The collection of these inequalities can be written as

$$\begin{bmatrix}\mathbf{v_{1}-v_{2}}\\ \mathbf{y_{1}-y_{2}}\end{bmatrix}^{T}\begin{bmatrix}p\mathbf{\Gamma}&-m\mathbf{\Gamma}\\ -m\mathbf{\Gamma}&\mathbf{\Gamma}\end{bmatrix}\begin{bmatrix}\mathbf{v_{1}-v_{2}}\\ \mathbf{y_{1}-y_{2}}\end{bmatrix}\leq 0, \qquad (8)$$

for a matrix $\mathbf{\Gamma}$ of Lagrange multipliers with the structure defined in (5), and where $m=\frac{\alpha+\beta}{2}$ and $p=\alpha\beta$ [13]. Finally, substituting (6) yields:

$$\begin{bmatrix}\mathbf{y_{1}-y_{2}}\\ \mathbf{x_{1}-x_{2}}\end{bmatrix}^{T}\begin{bmatrix}\mathbf{\Gamma}&-m\mathbf{\Gamma}\mathbf{W}\\ -m\mathbf{W}^{T}\mathbf{\Gamma}&p\mathbf{W}^{T}\mathbf{\Gamma}\mathbf{W}\end{bmatrix}\begin{bmatrix}\mathbf{y_{1}-y_{2}}\\ \mathbf{x_{1}-x_{2}}\end{bmatrix}\leq 0. \qquad (9)$$

The second quadratic form is obtained from the definition of incremental $(Q,S,R)$-dissipativity as

$$\begin{bmatrix}\mathbf{y}_{1}-\mathbf{y}_{2}\\ \mathbf{x}_{1}-\mathbf{x}_{2}\end{bmatrix}^{T}\begin{bmatrix}-\mathbf{Q}&-\mathbf{S}\\ -\mathbf{S}^{T}&-\mathbf{R}\end{bmatrix}\begin{bmatrix}\mathbf{y}_{1}-\mathbf{y}_{2}\\ \mathbf{x}_{1}-\mathbf{x}_{2}\end{bmatrix}\leq 0. \qquad (10)$$

Thus, using an S-procedure on (9) to enforce (10), we obtain:

$$\lambda\begin{bmatrix}\mathbf{\Gamma}&-m\mathbf{\Gamma}\mathbf{W}\\ -m\mathbf{W}^{T}\mathbf{\Gamma}&p\mathbf{W}^{T}\mathbf{\Gamma}\mathbf{W}\end{bmatrix}-\begin{bmatrix}-\mathbf{Q}&-\mathbf{S}\\ -\mathbf{S}^{T}&-\mathbf{R}\end{bmatrix}\succcurlyeq 0, \qquad (11)$$

where $\lambda\geq 0$. The result now follows by defining $\mathbf{\Lambda}=\lambda\mathbf{\Gamma}$. ∎
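
To make the certificate concrete, the following minimal sketch (ours; it restricts $\mathbf{\Lambda}$ to scalar multiples of the identity, a special case of (5), assumes a square $\mathbf{W}$, and uses the sector-bounded choice $\mathbf{Q}=-\delta\mathbf{I}$, $\mathbf{S}=0.5\mathbf{I}$, $\mathbf{R}=-\nu\mathbf{I}$) searches numerically for a certificate satisfying (4); a general diagonal $\mathbf{\Lambda}$ would instead call for an SDP solver:

    import numpy as np

    def layer_lmi_certificate(W, alpha, beta, delta, nu,
                              lam_grid=np.logspace(-2, 3, 200)):
        # Search over Lambda = lam * I for a value making M in (4) positive
        # semidefinite. Returns a certifying lam, or None if the grid fails.
        n = W.shape[0]
        I = np.eye(n)
        m, p = (alpha + beta) / 2.0, alpha * beta
        for lam in lam_grid:
            M = np.block([[-delta * I + lam * I,     0.5 * I - m * lam * W],
                          [0.5 * I - m * lam * W.T,  -nu * I + p * lam * W.T @ W]])
            if np.linalg.eigvalsh(M).min() >= -1e-9:
                return lam
        return None

For the leaky ReLU layers of Section V, one would call, e.g., layer_lmi_certificate(W, 0.1, 1.0, delta=0.4, nu=-2).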

For a convolutional layer, we can derive a similar result by using the result from [33] that convolution with a filter is equivalent to matrix multiplication by a doubly block circulant matrix composed of the filter coefficients. Specifically, if the input image $\mathbf{X}$ is convolved with a filter with impulse response $\mathbf{F}\in\mathcal{R}^{n\times n}$, then define the doubly block circulant matrix $\mathbf{C}$ as

$$\begin{bmatrix}\text{circ}(\mathbf{F}(0,:))&\text{circ}(\mathbf{F}(1,:))&\cdots&\text{circ}(\mathbf{F}(n-1,:))\\ \text{circ}(\mathbf{F}(n-1,:))&\text{circ}(\mathbf{F}(0,:))&\cdots&\text{circ}(\mathbf{F}(n-2,:))\\ \vdots&\vdots&\ddots&\vdots\\ \text{circ}(\mathbf{F}(1,:))&\text{circ}(\mathbf{F}(2,:))&\cdots&\text{circ}(\mathbf{F}(0,:))\end{bmatrix}, \qquad (12)$$

where $\mathbf{F}(p,:)$ denotes the $p$-th row of the matrix $\mathbf{F}$ and $\text{circ}(\mathbf{v})$, for a vector $\mathbf{v}\in\mathcal{R}^{n\times 1}$, produces an $n\times n$ circulant matrix whose first row is the vector $\mathbf{v}$. Then, [24, 33] showed that the output $\mathbf{O}$ of the convolution can be expressed as

$$\text{vec}(\mathbf{O})=\mathbf{C}\,\text{vec}(\mathbf{X}), \qquad (13)$$

where $\text{vec}(\cdot)$ is the standard vectorization operation. Thus, we can extend Theorem 1 to a convolutional layer as follows.

Corollary 2.

Consider the setting of Theorem 1, but with a convolutional neural network layer in which the input is convolved with the filter $\mathbf{F}$, vectorized, and passed through an element-wise non-linear activation function $\phi(\cdot)$ that is sector bounded by $[\alpha,\beta]$. Define $m=\frac{\alpha+\beta}{2}$ and $p=\alpha\beta$. The layer is incrementally QSR-dissipative if the LMI

$$\mathbf{M}=\begin{bmatrix}\mathbf{Q}&\mathbf{S}\\ \mathbf{S}^{\top}&\mathbf{R}\end{bmatrix}+\begin{bmatrix}\mathbf{\Lambda}&-m\mathbf{\Lambda}\mathbf{C}\\ -m\mathbf{C}^{\top}\mathbf{\Lambda}&p\mathbf{C}^{\top}\mathbf{\Lambda}\mathbf{C}\end{bmatrix}\succcurlyeq 0 \qquad (14)$$

is feasible for some $\mathbf{\Lambda}$ defined as

$$\mathbf{\Lambda}=\sum_{1\leq i\leq n}\lambda_{ii}\,\mathbf{e}_{i}\mathbf{e}_{i}^{\top}\geq 0, \qquad (15)$$

where $\mathbf{e}_{i}$ is the $i$-th standard basis vector and $\mathbf{C}$ is defined as in (12).

We note that, by construction, a matrix $\mathbf{\Lambda}$ with the structure in (5) is positive semidefinite.
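
The construction (12)–(13) can be checked numerically. The following sketch (ours; the circular boundary handling, cross-correlation convention, and row-major vectorization are assumptions we make to pin down one consistent convention) assembles $\mathbf{C}$ and verifies (13) against an FFT-based computation:

    import numpy as np

    def circ(v):
        # Circulant matrix whose first row is v (row k is v shifted right by k).
        return np.stack([np.roll(v, k) for k in range(len(v))])

    def doubly_block_circulant(F):
        # Assemble C as in (12): block (b, q) is circ(F((q - b) mod n, :)).
        n = F.shape[0]
        blocks = [circ(F[q]) for q in range(n)]
        return np.vstack([np.hstack([blocks[(q - b) % n] for q in range(n)])
                          for b in range(n)])

    rng = np.random.default_rng(0)
    F, X = rng.standard_normal((4, 4)), rng.standard_normal((4, 4))
    C = doubly_block_circulant(F)
    O = np.fft.ifft2(np.fft.fft2(X) * np.conj(np.fft.fft2(F))).real
    assert np.allclose(C @ X.ravel(), O.ravel())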

Extension for a multi-layered neural network

The argument given above can be extended to consider the entire neural network instead of one layer. However, as was noted in [16], even for the LMIs resulting from the simpler Lipschitz constraint, this approach quickly becomes computationally cumbersome. Instead, we can utilize the compositional property of $(Q,S,R)$-dissipativity to ensure that a multi-layered neural network is sector bounded by imposing constraints on each layer separately. We have the following result.

Theorem 3.

Consider a neural network with $n$ layers, where each layer $i$ is incrementally $QSR$-dissipative with $\mathbf{Q}=-\delta_{i}\mathbf{I}$, $\mathbf{S}=0.5\mathbf{I}$, and $\mathbf{R}=-\nu_{i}\mathbf{I}$. Then, the neural network is incrementally sector bounded with parameters $\mathbf{Q}=-\delta\mathbf{I}$, $\mathbf{S}=0.5\mathbf{I}$, and $\mathbf{R}=-\nu\mathbf{I}$ if the matrix $\mathbf{A}\preceq 0$, where

$$\mathbf{A}\triangleq\begin{bmatrix}-\nu_{1}+\nu&\frac{1}{2}&0&\cdots&-\frac{1}{2}\\ \frac{1}{2}&-\nu_{2}+\delta_{1}&\frac{1}{2}&\cdots&0\\ \vdots&\ddots&\ddots&\ddots&\vdots\\ 0&\cdots&\frac{1}{2}&-\nu_{n}+\delta_{n-1}&\frac{1}{2}\\ -\frac{1}{2}&0&\cdots&\frac{1}{2}&-\delta_{n}+\delta\end{bmatrix}.$$
Proof.

The proof follows directly by viewing the neural network as a cascade of layers and applying [34, Theorem 5]. ∎
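
A minimal numerical version of this test (our sketch; nus and deltas hold the per-layer indices $(\nu_{i},\delta_{i})$, and at least two layers are assumed):

    import numpy as np

    def network_indices_certified(nus, deltas, nu, delta, tol=1e-12):
        # Assemble A from Theorem 3 and check that it is negative semidefinite,
        # certifying the network-level indices (nu, delta).
        n = len(nus)
        A = np.zeros((n + 1, n + 1))
        A[0, 0] = -nus[0] + nu
        for i in range(1, n):
            A[i, i] = -nus[i] + deltas[i - 1]
        A[n, n] = -deltas[n - 1] + delta
        for i in range(n):
            A[i, i + 1] = A[i + 1, i] = 0.5    # coupling of adjacent layers
        A[0, n] = A[n, 0] = -0.5               # network input/output coupling
        return bool(np.linalg.eigvalsh(A).max() <= tol)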

Theorem 3 thus provides one way of ensuring that the neural network is sector bounded, and hence robust. Specifically, we can impose the constraint (4) during the training of the network. However, this requires solving an SDP at each gradient descent step or, at best, after a certain number of epochs, as described in [16]. While ensuring $\mathbf{M}_{i}\succcurlyeq 0$ for each layer $i=1,\ldots,n$ separately is computationally more tractable than optimizing over the entire neural network through one large SDP, it nonetheless makes training much slower. This makes it desirable to further reduce the computational complexity, as discussed next.

IV Training Robust Neural Networks

In this section, we derive a sufficient condition on the spectral norm of the weight matrix that provides a feasible solution for the LMI (4) for sector bounded systems. For ease of notation, we focus once again on a single neural network layer in the setting of Theorem 1. Denote the spectral norm of a matrix $\mathbf{A}$ by $\|\mathbf{A}\|_{2}$. Further, define the infimum seminorm $\|\mathbf{A}\|_{i}$ as the square root of the minimum eigenvalue of $\mathbf{A}^{\top}\mathbf{A}$. While the spectral norm is sub-multiplicative, the infimum seminorm is super-multiplicative. Further, the two are related for invertible matrices through the relation $\|\mathbf{A}^{-1}\|_{2}=\|\mathbf{A}\|_{i}^{-1}$ [35, Equation 2.5]. Finally, for a normal matrix $\mathbf{\Gamma}$, we have $\|\mathbf{\Gamma}-\mathbf{I}\|_{i}=\|\mathbf{\Gamma}\|_{i}-1$ [36, Theorem 2.5].
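
Both norms can be read off the singular values of $\mathbf{A}$: $\|\mathbf{A}\|_{2}$ is the largest and $\|\mathbf{A}\|_{i}$ the smallest singular value. A short numerical check of the inverse relation (our sketch):

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.standard_normal((5, 5))            # invertible with probability 1
    s = np.linalg.svd(A, compute_uv=False)     # singular values, descending
    spec_norm, inf_seminorm = s[0], s[-1]      # ||A||_2 and ||A||_i
    assert np.isclose(np.linalg.norm(np.linalg.inv(A), 2), 1.0 / inf_seminorm)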

Note that for a sector bound on a neural network to lead to a bounded perturbation of the output given a perturbation in the input, it is natural to assume that the slope of the upper bound of the sector, given by $\delta$, is positive. The slope of the lower bound, given by $\nu$, can have arbitrary sign but should be bounded. We make these assumptions in presenting the following result.

Theorem 4.

Consider the setting of Theorem 1 with $\mathbf{Q}=-\delta\mathbf{I}$, $\mathbf{S}=0.5\mathbf{I}$, and $\mathbf{R}=-\nu\mathbf{I}$, with $\delta>0$. If the following inequalities hold:

$$\left\|\mathbf{W}\right\|_{2}\leq\frac{1}{|m|}\left(1-\frac{(\delta+0.5)\left(1-|p|\left\|\mathbf{W}\right\|_{i}^{2}\right)}{\delta-\nu}\right) \qquad (16)$$
$$|p|\left\|\mathbf{W}\right\|_{i}^{2}\leq 1, \qquad (17)$$

then (4) is feasible with a matrix $\mathbf{\Lambda}$ of the form (5) with

$$\left\|\mathbf{\Lambda}\right\|_{2}=\left\|\mathbf{\Lambda}\right\|_{i}\geq\frac{\delta-\nu}{1-|p|\left\|\mathbf{W}\right\|_{i}^{2}}. \qquad (18)$$
Proof.

A sufficient condition for the feasibility of the LMI (4) is that the matrix $\mathbf{M}$ be block diagonally dominant with positive semidefinite diagonal blocks [35]. In other words, if there exists a matrix $\mathbf{\Lambda}$ of the form (5) that ensures that the following four conditions are satisfied, then the LMI (4) is feasible for that $\mathbf{\Lambda}$:

$$\mathbf{\Lambda}+\mathbf{Q}\succ 0 \qquad (19)$$
$$p\mathbf{W}^{\top}\mathbf{\Lambda}\mathbf{W}+\mathbf{R}\succ 0 \qquad (20)$$
$$\left\|(\mathbf{Q}+\mathbf{\Lambda})^{-1}(\mathbf{S}-m\mathbf{\Lambda}\mathbf{W})\right\|_{2}\leq 1 \qquad (21)$$
$$\left\|(p\mathbf{W}^{\top}\mathbf{\Lambda}\mathbf{W}+\mathbf{R})^{-1}(\mathbf{S}^{\top}-m\mathbf{W}^{\top}\mathbf{\Lambda})\right\|_{2}\leq 1. \qquad (22)$$

We now consider these conditions one by one.

  1. Claim 1: Any $\mathbf{\Lambda}$ that satisfies

    $$\|\mathbf{\Lambda}\|_{i}\geq\delta \qquad (23)$$

    will ensure that (19) is satisfied. This claim follows by noting that $\mathbf{Q}=-\delta\mathbf{I}$, $\delta>0$, and $\mathbf{\Lambda}\succcurlyeq 0$.

  2. Claim 2: Any $\mathbf{\Lambda}$ that satisfies the condition

    $$|p|\|\mathbf{W}\|_{i}^{2}\|\mathbf{\Lambda}\|_{i}>\nu \qquad (24)$$

    will ensure that (20) is satisfied. This is because

    $$\begin{aligned}
    (20) &\Longleftrightarrow |p|\|\mathbf{W}^{\top}\mathbf{\Lambda}\mathbf{W}\|_{i}>\nu\\
    &\overset{(a)}{\Longleftarrow} |p|\|\mathbf{W}^{\top}\|_{i}\,\|\mathbf{\Lambda}\|_{i}\,\|\mathbf{W}\|_{i}>\nu\\
    &\Longleftrightarrow |p|\|\mathbf{W}\|_{i}^{2}\,\|\mathbf{\Lambda}\|_{i}>\nu,
    \end{aligned}$$

    where (a) follows from the super-multiplicativity of the infimum seminorm.

  3. Claim 3: Any $\mathbf{\Lambda}$ that satisfies the condition

    $$\delta+0.5+|m|\left\|\mathbf{\Lambda}\right\|_{2}\left\|\mathbf{W}\right\|_{2}\leq\|\mathbf{\Lambda}\|_{i} \qquad (25)$$

    will ensure that (21) is satisfied. This claim follows by noting that

    $$\begin{aligned}
    (21) &\overset{(a)}{\Longleftarrow} \left\|(\mathbf{Q}+\mathbf{\Lambda})^{-1}\right\|_{2}\left\|\mathbf{S}-m\mathbf{\Lambda}\mathbf{W}\right\|_{2}\leq 1\\
    &\overset{(b)}{\Longleftrightarrow} \left\|\mathbf{S}-m\mathbf{\Lambda}\mathbf{W}\right\|_{2}\leq\|\mathbf{Q}+\mathbf{\Lambda}\|_{i}\\
    &\Longleftrightarrow \left\|0.5\mathbf{I}-m\mathbf{\Lambda}\mathbf{W}\right\|_{2}\leq\|\mathbf{\Lambda}-\delta\mathbf{I}\|_{i}\\
    &\overset{(c)}{\Longleftarrow} 0.5+|m|\left\|\mathbf{\Lambda}\right\|_{2}\left\|\mathbf{W}\right\|_{2}\leq\|\mathbf{\Lambda}-\delta\mathbf{I}\|_{i}\\
    &\Longleftrightarrow 0.5+|m|\left\|\mathbf{\Lambda}\right\|_{2}\left\|\mathbf{W}\right\|_{2}\leq\|\mathbf{\Lambda}\|_{i}-\delta\\
    &\Longleftrightarrow \delta+0.5+|m|\left\|\mathbf{\Lambda}\right\|_{2}\left\|\mathbf{W}\right\|_{2}\leq\|\mathbf{\Lambda}\|_{i},
    \end{aligned}$$

    where (a) and (c) follow from submultiplicativity and subadditivity, respectively, of the spectral norm, while (b) follows from the relation $\|\mathbf{A}^{-1}\|_{2}=\|\mathbf{A}\|_{i}^{-1}$.

  4. Claim 4: Any $\mathbf{\Lambda}$ that satisfies the condition

    $$0.5+|m|\left\|\mathbf{\Lambda}\right\|_{2}\left\|\mathbf{W}\right\|_{2}\leq|p|\left\|\mathbf{W}\right\|_{i}^{2}\|\mathbf{\Lambda}\|_{i}-\nu \qquad (26)$$

    will ensure that (22) is satisfied. This claim follows by noting that

    $$\begin{aligned}
    (22) &\Longleftarrow \left\|\left(p\mathbf{W}^{\top}\mathbf{\Lambda}\mathbf{W}+\mathbf{R}\right)^{-1}\right\|_{2}\left\|\mathbf{S}-m\mathbf{\Lambda}\mathbf{W}\right\|_{2}\leq 1\\
    &\Longleftrightarrow \left\|\mathbf{S}\right\|_{2}+|m|\left\|\mathbf{\Lambda}\right\|_{2}\left\|\mathbf{W}\right\|_{2}\leq\left\|p\mathbf{W}^{\top}\mathbf{\Lambda}\mathbf{W}+\mathbf{R}\right\|_{i}\\
    &\Longleftrightarrow 0.5+|m|\left\|\mathbf{\Lambda}\right\|_{2}\left\|\mathbf{W}\right\|_{2}\leq\left\|p\mathbf{W}^{\top}\mathbf{\Lambda}\mathbf{W}-\nu\mathbf{I}\right\|_{i}\\
    &\Longleftrightarrow 0.5+|m|\left\|\mathbf{\Lambda}\right\|_{2}\left\|\mathbf{W}\right\|_{2}\leq\left\|p\mathbf{W}^{\top}\mathbf{\Lambda}\mathbf{W}\right\|_{i}-\nu\\
    &\Longleftarrow 0.5+|m|\left\|\mathbf{\Lambda}\right\|_{2}\left\|\mathbf{W}\right\|_{2}\leq|p|\left\|\mathbf{W}\right\|_{i}^{2}\|\mathbf{\Lambda}\|_{i}-\nu.
    \end{aligned}$$

Since (25) is a sufficient condition for (23) and (26) is a sufficient condition for (24), we have shown that if there exists a matrix $\mathbf{\Lambda}$ of the form (5) that satisfies (25) and (26), then the LMI (4) is feasible for that $\mathbf{\Lambda}$. The proof now follows by noting that if (16), (17), and (18) hold, then (25) and (26) are satisfied. ∎
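
In practice, conditions (16)–(17) reduce to a cheap per-layer test on the extreme singular values of $\mathbf{W}$; a minimal sketch (ours, assuming $m\neq 0$ and $\delta>\nu$ so that (16) is well defined):

    import numpy as np

    def spectral_conditions_hold(W, alpha, beta, delta, nu):
        # Evaluate the sufficient conditions (16)-(17) of Theorem 4 for
        # Q = -delta*I, S = 0.5*I, R = -nu*I with delta > 0.
        m, p = (alpha + beta) / 2.0, alpha * beta
        s = np.linalg.svd(W, compute_uv=False)
        w_spec, w_inf = s[0], s[-1]            # ||W||_2 and ||W||_i
        cond_17 = abs(p) * w_inf**2 <= 1.0
        rhs_16 = (1.0 - (delta + 0.5) * (1.0 - abs(p) * w_inf**2)
                  / (delta - nu)) / abs(m)
        return cond_17 and w_spec <= rhs_16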

This result provides a condition, computationally easy to impose, that guarantees sector boundedness of the mapping defined by each neural network layer in terms of the spectral norm of its weight matrix. The conditions for the individual layers can be combined to guarantee sector boundedness of the entire neural network following Theorem 3. While the theorem provides only a sufficient condition, a few observations can be made from how easy it is to satisfy condition (16).

  • If $p=0$ (as is the case with ReLU, for instance), then the bound on the right-hand side of condition (16) is smaller. This implies that using leaky ReLU (for which $p\neq 0$) may lead to superior robustness compared to ReLU activation functions, as has been observed empirically in the literature [37].

  • Similarly, if $\nu>0$, then the condition becomes harder to satisfy. Thus, imposing strict passivity on the neural network ($\nu>0$) may be overly conservative and lead to lower performance compared to simply sector bounding it and allowing $\nu<0$.

  • We emphasize that, although the result aligns with the empirical observation in the literature that regularizing the loss function with the spectral norm of the weight matrix leads to superior robustness against adversarial attacks, our result provides an analytical justification of such a procedure and further identifies the region in which the spectral norm should be bounded to obtain such robustness. Furthermore, by not using an SDP approach, we can extend our technique to deep neural network structures, which is a shortcoming of SDP methods, as discussed in [38]. However, the trade-off is an increased conservatism in the spectral norm bound.

  • Finally, although our motivation in this paper was to ensure robustness against adversarial perturbations, imposing passivity and sector boundedness on neural networks is of independent interest. For instance, this result can be used to guarantee stability of a system whose controller is implemented as a neural network, through standard results in passivity-based control.

V Experimental Results

We now present the experimental setup and the performance improvement obtained when the results above are utilized while training a neural network. An implementation is provided in Python using TensorFlow 1.15.4 at https://github.com/beaquino/Robust-Design-of-Neural-Networks-using-Dissipativity. A MATLAB script to obtain sector bounds for each layer satisfying Theorem 3 is also available.

We use two commonly known data sets for image classification, MNIST [39] and CIFAR-10 [40]. For the MNIST data set, we use a 3-layer feed-forward network with a leaky ReLU activation function (with $a=0.1$, and therefore $\alpha=0.1$ and $\beta=1$) on the first two layers. For the CIFAR-10 data set, we use an Alexnet [41], which is composed of 2 convolution layers, each followed by a max pooling layer, and 3 fully connected layers. Leaky ReLU activation functions (with $a=0.1$, and therefore $\alpha=0.1$ and $\beta=1$) are used on the first four layers. In some implementations, a final Softmax layer may be utilized for conversion into probabilities. Since Softmax is only an exponential average, we do not consider such a layer, without loss of generality. The adversarial attacks chosen for testing are the Fast Gradient Sign Method (FGSM) attack [25] and the Projected Gradient Descent Method (PGDM) attack [26], using a range of strengths $\epsilon$ in the interval $(0.1,0.5)$. Strength $0$ represents no attack.

We split the training and test sets as 86%-14% of the data set for MNIST and 80%-20% for CIFAR-10, and we train the model for 200 epochs using Adam optimization. The parameters $\nu$ and $\delta$ must be chosen before the training procedure. We select the pair $(\nu,\delta)=(-2,0.4)$ as the indices for the entire network, which restricts the neural network to a sector approximately between $-0.215$ and $0.465$. These values, tuned as hyperparameters on the MNIST model, presented the best results among the different choices we compared. Given these values, we select the individual indices $(\nu_{i},\delta_{i})$ for each layer using Theorem 3, which are then used to calculate the spectral norm bound for each layer.

Figure 1 presents the results (Sp Norm) for MNIST data, with the accuracy on a test set generated using the FGSM attack in the left panel and the accuracy on a test set generated using the PGDM attack in the right. Figure 2 presents the results (Sp Norm) for CIFAR-10 data, with the left panel presenting the accuracy on a test set generated using the FGSM attack and the right panel presenting the accuracy on a test set generated using the PGDM attack. Also plotted are the accuracies for the commonly used $L_{2}$-regularization as a comparison point (L2 Norm) and for a regularly trained neural network (Vanilla). Both figures show an improvement in classification accuracy for both attacks, especially as the attack strength increases, demonstrating the effectiveness of the proposed approach.

Figure 1: Accuracy under the Fast Gradient Sign Method attack and the Projected Gradient Descent Method attack, comparing a Vanilla model and a spectral-norm-regularized model with passivity indices $(-2,0.4)$. The network was trained on the MNIST dataset.
Figure 2: Accuracy under the Fast Gradient Sign Method attack and the Projected Gradient Descent Method attack, comparing a Vanilla model and a spectral-norm-regularized model with passivity indices $(-2,0.4)$. The network was trained on the CIFAR-10 dataset.

A remark can be made about the computational tractability of the proposed approach. Considering the entire neural network at once to impose sector boundedness (or even the simpler condition of a Lipschitz constant) is computationally tractable only for shallow networks. Our approach of considering each layer separately, and crucially imposing a spectral norm constraint, is more scalable. Note that while using the spectral norm forces us to calculate the largest singular value of a matrix at every gradient descent step, this can be performed efficiently using the power iteration method.
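
For completeness, a minimal power-iteration sketch (ours) for estimating the spectral norm without a full SVD:

    import numpy as np

    def spectral_norm_power_iteration(W, iters=50):
        # Power iteration on W^T W: v converges to the top right singular
        # vector, so ||W v|| converges to the largest singular value ||W||_2.
        rng = np.random.default_rng(0)
        v = rng.standard_normal(W.shape[1])
        v /= np.linalg.norm(v)
        for _ in range(iters):
            v = W.T @ (W @ v)
            v /= np.linalg.norm(v)
        return np.linalg.norm(W @ v)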

VI Conclusion

We proposed a robustness certificate for neural networks based on QSR-dissipativity. This method guarantees that a change in the input produces a bounded change in the output; that is, the neural network function is incrementally sector bounded. We first expressed the certificate in the form of a linear matrix inequality. Using the compositional properties of dissipativity, we then decomposed the certificate into one for each individual layer of the neural network. We also proposed a sufficient condition based on a spectral norm bound, offering a more computationally tractable problem for deep neural network structures. We presented results for experiments using a 3-layer feed-forward network and an Alexnet structure, trained on MNIST and CIFAR-10, respectively. The results showed superior performance compared to vanilla training and $L_{2}$ regularization.

Acknowledgment

The authors acknowledge comments from Julien Béguinot (Télécom Paris - Institut Polytechnique de Paris) and Léo Monbroussou (École Normale Supérieure Paris-Saclay - Institut Polytechnique de Paris) to update the stated assumptions in earlier versions of the draft.

References

  • [1] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” arXiv preprint arXiv:1312.6199, 2013.
  • [2] N. Carlini and D. Wagner, “Adversarial examples are not easily detected: Bypassing ten detection methods,” in Proceedings of the 10th ACM workshop on artificial intelligence and security, 2017, pp. 3–14.
  • [3] A. Athalye, N. Carlini, and D. Wagner, “Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples,” in International conference on machine learning.   PMLR, 2018, pp. 274–283.
  • [4] J. Su, D. V. Vargas, and K. Sakurai, “One pixel attack for fooling deep neural networks,” IEEE Transactions on Evolutionary Computation, vol. 23, no. 5, pp. 828–841, 2019.
  • [5] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, “Domain-adversarial training of neural networks,” The journal of machine learning research, vol. 17, no. 1, pp. 2096–2030, 2016.
  • [6] A. Shafahi, M. Najibi, A. Ghiasi, Z. Xu, J. Dickerson, C. Studer, L. S. Davis, G. Taylor, and T. Goldstein, “Adversarial training for free!” arXiv preprint arXiv:1904.12843, 2019.
  • [7] F. Tramèr, A. Kurakin, N. Papernot, I. Goodfellow, D. Boneh, and P. McDaniel, “Ensemble adversarial training: Attacks and defenses,” arXiv:1705.07204, 2017.
  • [8] A. Robey, L. F. O. Chamon, G. J. Pappas, H. Hassani, and A. Ribeiro, “Adversarial robustness with semi-infinite constrained learning,” arXiv preprint arXiv:2110.15767, 2021.
  • [9] N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami, “Distillation as a defense to adversarial perturbations against deep neural networks,” in 2016 IEEE symposium on security and privacy, 2016, pp. 582–597.
  • [10] N. Papernot and P. McDaniel, “On the effectiveness of defensive distillation,” arXiv preprint arXiv:1607.05113, 2016.
  • [11] H. Yin, P. Seiler, and M. Arcak, “Stability analysis using quadratic constraints for systems with neural network controllers,” IEEE Transactions on Automatic Control, pp. 1–1, 2021.
  • [12] H. Yin, P. Seiler, M. Jin, and M. Arcak, “Imitation learning with stability and safety guarantees,” IEEE Control Systems Letters, vol. 6, pp. 409–414, May 2021.
  • [13] M. Fazlyab, A. Robey, H. Hassani, M. Morari, and G. J. Pappas, “Efficient and accurate estimation of Lipschitz constants for deep neural networks,” arXiv preprint arXiv:1906.04893, 2019.
  • [14] F. Latorre, P. Rolland, and V. Cevher, “Lipschitz constant estimation of neural networks via sparse polynomial optimization,” arXiv preprint arXiv:2004.08688, 2020.
  • [15] P. L. Combettes and J.-C. Pesquet, “Lipschitz certificates for layered network structures driven by averaged activation operators,” SIAM Journal on Mathematics of Data Science, vol. 2, no. 2, pp. 529–557, 2020.
  • [16] P. Pauli, A. Koch, J. Berberich, P. Kohler, and F. Allgower, “Training robust neural networks using Lipschitz bounds,” IEEE Control Systems Letters, vol. 6, pp. 121–126, 2021.
  • [17] M. Fazlyab, M. Morari, and G. J. Pappas, “Safety verification and robustness analysis of neural networks via quadratic constraints and semidefinite programming,” IEEE Transactions on Automatic Control, 2020, DOI: 10.1109/TAC.2020.3046193.
  • [18] M. Revay, R. Wang, and I. R. Manchester, “Recurrent equilibrium networks: Flexible dynamic models with guaranteed stability and robustness,” arXiv preprint arXiv:2104.05942, 2021.
  • [19] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., “Tensorflow: A system for large-scale machine learning,” in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016, pp. 265–283.
  • [20] Y. Yoshida and T. Miyato, “Spectral norm regularization for improving the generalizability of deep learning,” arXiv preprint arXiv:1705.10941, 2017.
  • [21] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, “Spectral normalization for generative adversarial networks,” arXiv preprint arXiv:1802.05957, 2018.
  • [22] C. Li and X. Liao, “Passivity analysis of neural networks with time delay,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 52, no. 8, pp. 471–475, 2005.
  • [23] S. Xu, W. X. Zheng, and Y. Zou, “Passivity analysis of neural networks with time-varying delays,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 56, no. 4, pp. 325–329, 2009.
  • [24] A. Rahnama, A. T. Nguyen, and E. Raff, “Robust design of deep neural networks against adversarial attacks based on Lyapunov theory,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
  • [25] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” arXiv preprint arXiv:1412.6572, 2014.
  • [26] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” arXiv preprint arXiv:1706.06083, 2017.
  • [27] K. Nar, O. Ocal, S. S. Sastry, and K. Ramchandran, “Cross-entropy loss and low-rank features have responsibility for adversarial examples,” arXiv preprint arXiv:1901.08360, 2019.
  • [28] A. van der Schaft, L2-Gain and Passivity Techniques in Nonlinear Control.   Springer International Publishing, 2017.
  • [29] D. Hill and P. Moylan, “The stability of nonlinear dissipative systems,” IEEE Transactions on Automatic Control, vol. 21, no. 5, pp. 708–711, 1976.
  • [30] J. Bao and P. L. Lee, Process Control: The Passive Systems Approach.   Springer-Verlag London, 2007.
  • [31] C. Byrnes and W. Lin, “Losslessness, feedback equivalence, and the global stabilization of discrete-time nonlinear systems,” IEEE Transactions on Automatic Control, vol. 39, no. 1, pp. 83–98, 1994.
  • [32] J. C. Willems, “Dissipative dynamical systems, Part II: Linear systems with quadratic supply rates,” Archive for Rational Mechanics and Analysis, vol. 45, no. 5, pp. 352–393, 1972.
  • [33] H. Sedghi, V. Gupta, and P. M. Long, “The singular values of convolutional layers,” arXiv preprint arXiv:1805.10408, 2018.
  • [34] H. Yu and P. J. Antsaklis, “A passivity measure of systems in cascade based on passivity indices,” in 49th IEEE Conference on Decision and Control (CDC).   IEEE, 2010, pp. 2186–2191.
  • [35] D. G. Feingold and R. S. Varga, “Block diagonally dominant matrices and generalizations of the Gerschgorin circle theorem.” Pacific Journal of Mathematics, vol. 12, no. 4, pp. 1241–1250, 1962.
  • [36] W. H. Bennett, “Block diagonal dominance and the design of decentralized compensation,” Master’s thesis, University of Maryland, 1979, https://user.eng.umd.edu/~baras/publications/dissertations/1975-1995/79-MS-Bennett.pdf.
  • [37] B. Xu, N. Wang, T. Chen, and M. Li, “Empirical evaluation of rectified activations in convolutional network,” arXiv:1505.00853, 2015.
  • [38] P. Pauli, N. Funcke, D. Gramlich, M. A. Msalmi, and F. Allgöwer, “Neural network training under semidefinite constraints,” arXiv preprint arXiv:2201.00632, 2022.
  • [39] L. Deng, “The MNIST database of handwritten digit images for machine learning research,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 141–142, 2012.
  • [40] A. Krizhevsky, “Learning multiple layers of features from tiny images,” University of Toronto, Tech. Rep., 2009. [Online]. Available: https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf
  • [41] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, ser. NIPS’12.   Red Hook, NY, USA: Curran Associates Inc., 2012, pp. 1097–1105.