
Archived Results

1. Projections

Lemma 1.1.

Let $X^{(i-1)}$, $\widetilde{X}^{(i-1)}$ be as in (LABEL:eq:layer-input) and let $w\in\mathbb{R}^{N_{i-1}}$ be a neuron in the $i$-th layer. Applying the data alignment procedure in (LABEL:eq:quantization-algorithm-step1), for $t=1,2,\ldots,N_{i-1}$, we have

\hat{u}_{t}=\sum_{j=1}^{t}w_{j}P_{\widetilde{X}_{t}^{(i-1)\perp}}\ldots P_{\widetilde{X}_{j+1}^{(i-1)\perp}}P_{\widetilde{X}_{j}^{(i-1)\perp}}(X^{(i-1)}_{j}).

Moreover, if $d:=\lfloor\frac{N_{i-1}}{m}\rfloor\in\mathbb{N}$ and $A_{j}^{(i-1)}:=P_{\widetilde{X}_{(j+1)m}^{(i-1)\perp}}\ldots P_{\widetilde{X}_{jm+2}^{(i-1)\perp}}P_{\widetilde{X}_{jm+1}^{(i-1)\perp}}$ for $1\leq j\leq d-1$, then

(1) \|\hat{u}_{N_{i-1}}\|_{2}\leq m\|w\|_{\infty}\Bigl(2+\sum_{k=1}^{d-1}\|A^{(i-1)}_{d-1}\ldots A^{(i-1)}_{k+1}A^{(i-1)}_{k}\|_{2}\Bigr)\max_{1\leq j\leq N_{i-1}}\|X^{(i-1)}_{j}-\widetilde{X}_{j}^{(i-1)}\|_{2}.
Proof.

We prove the claim by induction on $t$. By (LABEL:eq:quantization-algorithm-step1), the case $t=1$ is straightforward, since we have

\hat{u}_{1} = w_{1}X^{(i-1)}_{1}-\widetilde{w}_{1}\widetilde{X}^{(i-1)}_{1}
= w_{1}X^{(i-1)}_{1}-\frac{\langle\widetilde{X}_{1}^{(i-1)},w_{1}X_{1}^{(i-1)}\rangle}{\|\widetilde{X}^{(i-1)}_{1}\|_{2}^{2}}\widetilde{X}^{(i-1)}_{1}
= w_{1}X^{(i-1)}_{1}-P_{\widetilde{X}_{1}^{(i-1)}}(w_{1}X^{(i-1)}_{1})
= w_{1}P_{\widetilde{X}_{1}^{(i-1)\perp}}(X^{(i-1)}_{1}),

where we apply the properties of orthogonal projections in (LABEL:eq:orth-proj) and (LABEL:eq:orth-proj-mat). For $t\geq 2$, assume that

\hat{u}_{t-1}=\sum_{j=1}^{t-1}w_{j}P_{\widetilde{X}_{t-1}^{(i-1)\perp}}\ldots P_{\widetilde{X}_{j+1}^{(i-1)\perp}}P_{\widetilde{X}_{j}^{(i-1)\perp}}(X^{(i-1)}_{j}).

Then, by (LABEL:eq:quantization-algorithm-step1), one gets

\hat{u}_{t} = \hat{u}_{t-1}+w_{t}X^{(i-1)}_{t}-\widetilde{w}_{t}\widetilde{X}^{(i-1)}_{t}
= \hat{u}_{t-1}+w_{t}X^{(i-1)}_{t}-\frac{\langle\widetilde{X}_{t}^{(i-1)},\hat{u}_{t-1}+w_{t}X_{t}^{(i-1)}\rangle}{\|\widetilde{X}^{(i-1)}_{t}\|_{2}^{2}}\widetilde{X}^{(i-1)}_{t}
= \hat{u}_{t-1}+w_{t}X^{(i-1)}_{t}-P_{\widetilde{X}_{t}^{(i-1)}}(\hat{u}_{t-1}+w_{t}X^{(i-1)}_{t})
= P_{\widetilde{X}_{t}^{(i-1)\perp}}(\hat{u}_{t-1}+w_{t}X^{(i-1)}_{t}).

Applying the induction hypothesis, we obtain

\hat{u}_{t} = P_{\widetilde{X}_{t}^{(i-1)\perp}}(\hat{u}_{t-1})+w_{t}P_{\widetilde{X}_{t}^{(i-1)\perp}}(X^{(i-1)}_{t})
= \sum_{j=1}^{t-1}w_{j}P_{\widetilde{X}_{t}^{(i-1)\perp}}\ldots P_{\widetilde{X}_{j+1}^{(i-1)\perp}}P_{\widetilde{X}_{j}^{(i-1)\perp}}(X^{(i-1)}_{j})+w_{t}P_{\widetilde{X}_{t}^{(i-1)\perp}}(X^{(i-1)}_{t})
= \sum_{j=1}^{t}w_{j}P_{\widetilde{X}_{t}^{(i-1)\perp}}\ldots P_{\widetilde{X}_{j+1}^{(i-1)\perp}}P_{\widetilde{X}_{j}^{(i-1)\perp}}(X^{(i-1)}_{j}).

This completes the induction. In particular, taking $t=N_{i-1}$, we have

\hat{u}_{N_{i-1}}=\sum_{j=1}^{N_{i-1}}w_{j}P_{\widetilde{X}_{N_{i-1}}^{(i-1)\perp}}\ldots P_{\widetilde{X}_{j+1}^{(i-1)\perp}}P_{\widetilde{X}_{j}^{(i-1)\perp}}(X^{(i-1)}_{j}).

It follows from the triangle inequality and the definition of $A_{j}^{(i-1)}$ that

\|\hat{u}_{N_{i-1}}\|_{2} = \Bigl\|\sum_{j=1}^{N_{i-1}}w_{j}P_{\widetilde{X}_{N_{i-1}}^{(i-1)\perp}}\ldots P_{\widetilde{X}_{j+1}^{(i-1)\perp}}P_{\widetilde{X}_{j}^{(i-1)\perp}}(X^{(i-1)}_{j})\Bigr\|_{2}
\leq \|w\|_{\infty}\sum_{j=1}^{N_{i-1}}\|P_{\widetilde{X}_{N_{i-1}}^{(i-1)\perp}}\ldots P_{\widetilde{X}_{j+1}^{(i-1)\perp}}P_{\widetilde{X}_{j}^{(i-1)\perp}}(X^{(i-1)}_{j})\|_{2}
= \|w\|_{\infty}\sum_{k=1}^{d-1}\sum_{j=(k-1)m+1}^{km}\|P_{\widetilde{X}_{N_{i-1}}^{(i-1)\perp}}\ldots P_{\widetilde{X}_{j+1}^{(i-1)\perp}}P_{\widetilde{X}_{j}^{(i-1)\perp}}(X^{(i-1)}_{j})\|_{2}
+ \|w\|_{\infty}\sum_{j=(d-1)m+1}^{N_{i-1}}\|P_{\widetilde{X}_{N_{i-1}}^{(i-1)\perp}}\ldots P_{\widetilde{X}_{j+1}^{(i-1)\perp}}P_{\widetilde{X}_{j}^{(i-1)\perp}}(X^{(i-1)}_{j})\|_{2}.

Since $\|P\|_{2}=1$ for any nonzero orthogonal projection $P$, $\frac{N_{i-1}}{m}\leq d+1$, and $A_{j}^{(i-1)}=P_{\widetilde{X}_{(j+1)m}^{(i-1)\perp}}\ldots P_{\widetilde{X}_{jm+2}^{(i-1)\perp}}P_{\widetilde{X}_{jm+1}^{(i-1)\perp}}$ for $1\leq j\leq d-1$, we deduce that

\|\hat{u}_{N_{i-1}}\|_{2} \leq \|w\|_{\infty}\sum_{k=1}^{d-1}\sum_{j=(k-1)m+1}^{km}\|A^{(i-1)}_{d-1}\ldots A^{(i-1)}_{k+1}A^{(i-1)}_{k}\|_{2}\|P_{\widetilde{X}_{j}^{(i-1)\perp}}(X^{(i-1)}_{j})\|_{2}
+ \|w\|_{\infty}\sum_{j=(d-1)m+1}^{N_{i-1}}\|P_{\widetilde{X}_{j}^{(i-1)\perp}}(X^{(i-1)}_{j})\|_{2}
\leq m\|w\|_{\infty}\Bigl(\sum_{k=1}^{d-1}\|A^{(i-1)}_{d-1}\ldots A^{(i-1)}_{k+1}A^{(i-1)}_{k}\|_{2}+\frac{N_{i-1}}{m}-(d-1)\Bigr)\max_{1\leq j\leq N_{i-1}}\|P_{\widetilde{X}_{j}^{(i-1)\perp}}(X^{(i-1)}_{j})\|_{2}
\leq m\|w\|_{\infty}\Bigl(2+\sum_{k=1}^{d-1}\|A^{(i-1)}_{d-1}\ldots A^{(i-1)}_{k+1}A^{(i-1)}_{k}\|_{2}\Bigr)\max_{1\leq j\leq N_{i-1}}\|P_{\widetilde{X}_{j}^{(i-1)\perp}}(X^{(i-1)}_{j}-\widetilde{X}_{j}^{(i-1)})\|_{2}
\leq m\|w\|_{\infty}\Bigl(2+\sum_{k=1}^{d-1}\|A^{(i-1)}_{d-1}\ldots A^{(i-1)}_{k+1}A^{(i-1)}_{k}\|_{2}\Bigr)\max_{1\leq j\leq N_{i-1}}\|X^{(i-1)}_{j}-\widetilde{X}_{j}^{(i-1)}\|_{2}. ∎
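For intuition, the following minimal numpy sketch (variable names and problem sizes are illustrative, not from the paper) runs the recursion in (LABEL:eq:quantization-algorithm-step1) with the unquantized aligned weight $\widetilde{w}_{t}=\langle\widetilde{X}_{t}^{(i-1)},\hat{u}_{t-1}+w_{t}X_{t}^{(i-1)}\rangle/\|\widetilde{X}_{t}^{(i-1)}\|_{2}^{2}$, exactly as in the proof above, and checks it against the closed-form expression of Lemma 1.1.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
m, N = 8, 30                                  # illustrative sizes
X = rng.standard_normal((m, N))               # columns X_j^{(i-1)}
Xt = X + 0.01 * rng.standard_normal((m, N))   # perturbed columns ~X_j^{(i-1)}
w = rng.standard_normal(N)                    # neuron weights in layer i

def proj_perp(v, u):
    # orthogonal projection of u onto span(v)^perp
    return u - (v @ u) / (v @ v) * v

# recursion with the (unquantized) aligned weight ~w_t
u_hat = np.zeros(m)
for t in range(N):
    h = u_hat + w[t] * X[:, t]
    w_aligned = (Xt[:, t] @ h) / (Xt[:, t] @ Xt[:, t])
    u_hat = h - w_aligned * Xt[:, t]

# closed form of Lemma 1.1: sum_j w_j P_{~X_N} ... P_{~X_j} (X_j)
u_closed = np.zeros(m)
for j in range(N):
    v = w[j] * X[:, j]
    for t in range(j, N):
        v = proj_perp(Xt[:, t], v)
    u_closed += v

print(np.linalg.norm(u_hat - u_closed))       # agrees up to floating-point error
\end{verbatim}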

2. Minimum $\ell_{\infty}$ solutions for a linear system

Let $A\in\mathbb{R}^{m\times n}$ be a matrix with $\mathrm{rank}(A)=m<n$ and let $b\in\mathbb{R}^{m}$ be a nonzero vector. Then the Rouché-Capelli theorem implies that the linear system $Ax=b$ admits infinitely many solutions. An intriguing problem that has important applications is to find the solutions of $Ax=b$ whose $\ell_{\infty}$ norm is the smallest possible. Specifically, we aim to solve the following primal problem:

(2) \min_{x\in\mathbb{R}^{n}}\ \|x\|_{\infty}\quad\text{s.t.}\quad Ax=b.

Apart from the linear programming formulation [abdelmalek1977minimum], two powerful tools, namely, Cadzow’s method [cadzow1973finite, cadzow1974efficient] and the Shim-Yoon method [shim1998stabilized], are widely used to solve (2). To perform the perturbation analysis, we will focus on Cadzow’s method throughout this section, which applies the duality principle to get

(3) \min_{Ax=b}\|x\|_{\infty}=\max_{\|A^{\top}y\|_{1}=1}b^{\top}y.

Moreover, suppose that $a_{1},a_{2},\ldots,a_{n}\in\mathbb{R}^{m}$ are the column vectors of $A\in\mathbb{R}^{m\times n}$. By (LABEL:eq:orth-proj), we have $a_{j}=P_{b}(a_{j})+P_{b^{\perp}}(a_{j})$ with $P_{b}(a_{j})=\frac{\langle a_{j},b\rangle}{\|b\|_{2}^{2}}b$. Then one can uniquely decompose $A$ as

(4) A=A_{1}+A_{2},

where

A_{1}:=[\xi_{1}b,\xi_{2}b,\ldots,\xi_{n}b]\in\mathbb{R}^{m\times n}\quad\text{with}\quad\xi:=(\xi_{1},\xi_{2},\ldots,\xi_{n})^{\top}=\frac{A^{\top}b}{\|b\|_{2}^{2}},

and

A_{2}:=[P_{b^{\perp}}(a_{1}),P_{b^{\perp}}(a_{2}),\ldots,P_{b^{\perp}}(a_{n})]\in\mathbb{R}^{m\times n}.

According to the transformation technique used in section 2 of [cadzow1974efficient], the dual problem in (3) can be reformulated as

(5) \max_{\|A^{\top}y\|_{1}=1}b^{\top}y=\|b\|_{2}^{2}\biggl(\min_{y\in\mathbb{R}^{m}}\|A_{1}^{\top}b+A_{2}^{\top}y\|_{1}\biggr)^{-1}.

It follows immediately from (3) and (5) that

(6) \|x^{*}\|_{\infty}=\min_{Ax=b}\|x\|_{\infty}=\frac{\|b\|_{2}^{2}}{\|A_{1}^{\top}b+A_{2}^{\top}y^{*}\|_{1}},

where $x^{*}$ and $y^{*}$ are solutions of the primal problem and the dual problem, respectively.
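As a concrete check of the primal problem (2), the sketch below solves it through the standard linear-programming reformulation [abdelmalek1977minimum] with scipy; the data and variable names are illustrative assumptions and this is not a description of Cadzow’s method itself.

\begin{verbatim}
import numpy as np
from scipy.optimize import linprog

# min ||x||_inf  s.t.  Ax = b, rewritten over variables (x, s):
#   minimize s   subject to   Ax = b,   -s <= x_i <= s.
rng = np.random.default_rng(1)
m, n = 3, 6
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

c = np.zeros(n + 1)
c[-1] = 1.0                                        # objective: the bound s
A_eq = np.hstack([A, np.zeros((m, 1))])            # Ax = b
A_ub = np.block([[ np.eye(n), -np.ones((n, 1))],   #  x_i - s <= 0
                 [-np.eye(n), -np.ones((n, 1))]])  # -x_i - s <= 0
b_ub = np.zeros(2 * n)
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b,
              bounds=[(None, None)] * n + [(0, None)])
x_star = res.x[:n]
print(res.x[-1], np.max(np.abs(x_star)))           # optimal value = ||x*||_inf
\end{verbatim}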

Now we evaluate the change of the optimal value of (2) under a small perturbation of $A$. Suppose that $\widetilde{A}:=A+\Delta A$ with $\mathrm{rank}(\widetilde{A})=m$ and $\Delta A:=[\Delta a_{1},\Delta a_{2},\ldots,\Delta a_{n}]\in\mathbb{R}^{m\times n}$. Then $\widetilde{A}=[a_{1}+\Delta a_{1},a_{2}+\Delta a_{2},\ldots,a_{n}+\Delta a_{n}]$. Let $\widetilde{x},\widetilde{y}$ be primal and dual solutions for the perturbed problem $\min_{\widetilde{A}x=b}\|x\|_{\infty}$. Then, similar to (6), we deduce that

(7) \|\widetilde{x}\|_{\infty}=\frac{\|b\|_{2}^{2}}{\|\widetilde{A}_{1}^{\top}b+\widetilde{A}_{2}^{\top}\widetilde{y}\|_{1}}

where

\widetilde{A}_{1}:=A_{1}+\Delta A_{1}\quad\text{with}\quad\Delta A_{1}:=[\zeta_{1}b,\zeta_{2}b,\ldots,\zeta_{n}b]\in\mathbb{R}^{m\times n},\quad\zeta:=(\zeta_{1},\zeta_{2},\ldots,\zeta_{n})^{\top}=\frac{\Delta A^{\top}b}{\|b\|_{2}^{2}},

and

\widetilde{A}_{2}:=A_{2}+\Delta A_{2}\quad\text{with}\quad\Delta A_{2}:=[P_{b^{\perp}}(\Delta a_{1}),P_{b^{\perp}}(\Delta a_{2}),\ldots,P_{b^{\perp}}(\Delta a_{n})]\in\mathbb{R}^{m\times n}.
Lemma 2.1.

Let $\Delta y:=\widetilde{y}-y^{*}$. Suppose that there exist positive constants $c_{1}$, $c_{2}$, and $c_{3}$ such that

(8) \|\Delta y\|_{2}\leq c_{1},\quad\|\widetilde{A}_{1}^{\top}b+\widetilde{A}_{2}^{\top}\widetilde{y}\|_{1}\|A_{1}^{\top}b+A_{2}^{\top}y^{*}\|_{1}\geq c_{2},\quad\text{and}\quad\|\Delta a_{j}\|_{\infty}\leq c_{3}\|a_{j}\|_{\infty}\quad\text{for all }j.

Then we have

\Bigl|\|\widetilde{x}\|_{\infty}-\|x^{*}\|_{\infty}\Bigr|\lesssim\sqrt{m}\sum_{j=1}^{n}\|a_{j}\|_{\infty}.
Proof.

By the triangle inequality, we have

\Bigl|\|\widetilde{A}_{1}^{\top}b+\widetilde{A}_{2}^{\top}\widetilde{y}\|_{1}-\|A_{1}^{\top}b+A_{2}^{\top}y^{*}\|_{1}\Bigr| \leq \|(\widetilde{A}_{1}-A_{1})^{\top}b+\widetilde{A}_{2}^{\top}\widetilde{y}-A_{2}^{\top}y^{*}\|_{1}
= \|\Delta A_{1}^{\top}b+A_{2}^{\top}(y^{*}+\Delta y)+\Delta A_{2}^{\top}\widetilde{y}-A_{2}^{\top}y^{*}\|_{1}
= \|\Delta A_{1}^{\top}b+A_{2}^{\top}\Delta y+\Delta A_{2}^{\top}\widetilde{y}\|_{1}
(9) \leq \|\Delta A_{1}^{\top}b\|_{1}+\|A_{2}^{\top}\Delta y\|_{1}+\|\Delta A_{2}^{\top}\widetilde{y}\|_{1}.

Applying Hölder’s inequality and the fact that $\|P_{b^{\perp}}(x)\|_{2}\leq\|x\|_{2}$ holds for all $x\in\mathbb{R}^{m}$, we get

\|\Delta A_{1}^{\top}b\|_{1}=\|b\|_{2}^{2}\|\zeta\|_{1}=\|\Delta A^{\top}b\|_{1}=\sum_{j=1}^{n}|\langle\Delta a_{j},b\rangle|\leq\|b\|_{1}\sum_{j=1}^{n}\|\Delta a_{j}\|_{\infty},
\|A_{2}^{\top}\Delta y\|_{1}=\sum_{j=1}^{n}|\langle P_{b^{\perp}}(a_{j}),\Delta y\rangle|\leq\|\Delta y\|_{2}\sum_{j=1}^{n}\|a_{j}\|_{2}\leq\sqrt{m}\|\Delta y\|_{2}\sum_{j=1}^{n}\|a_{j}\|_{\infty},

and

\|\Delta A_{2}^{\top}\widetilde{y}\|_{1}=\sum_{j=1}^{n}|\langle P_{b^{\perp}}(\Delta a_{j}),\widetilde{y}\rangle|\leq\|\widetilde{y}\|_{2}\sum_{j=1}^{n}\|\Delta a_{j}\|_{2}\leq\sqrt{m}\|\widetilde{y}\|_{2}\sum_{j=1}^{n}\|\Delta a_{j}\|_{\infty}.

Plugging the three inequalities above into (9) and applying (8), we obtain

\Bigl|\|\widetilde{A}_{1}^{\top}b+\widetilde{A}_{2}^{\top}\widetilde{y}\|_{1}-\|A_{1}^{\top}b+A_{2}^{\top}y^{*}\|_{1}\Bigr| \leq (\|b\|_{1}+\sqrt{m}\|\widetilde{y}\|_{2})\sum_{j=1}^{n}\|\Delta a_{j}\|_{\infty}+\sqrt{m}\|\Delta y\|_{2}\sum_{j=1}^{n}\|a_{j}\|_{\infty}
(10) \leq \Bigl(c_{3}\|b\|_{1}+c_{3}\sqrt{m}\|\widetilde{y}\|_{2}+c_{1}\sqrt{m}\Bigr)\sum_{j=1}^{n}\|a_{j}\|_{\infty}.

It follows from (6) and (7) that

\Bigl|\|\widetilde{x}\|_{\infty}-\|x^{*}\|_{\infty}\Bigr| = \Biggl|\frac{\|b\|_{2}^{2}}{\|\widetilde{A}_{1}^{\top}b+\widetilde{A}_{2}^{\top}\widetilde{y}\|_{1}}-\frac{\|b\|_{2}^{2}}{\|A_{1}^{\top}b+A_{2}^{\top}y^{*}\|_{1}}\Biggr|
= \frac{\|b\|_{2}^{2}}{\|\widetilde{A}_{1}^{\top}b+\widetilde{A}_{2}^{\top}\widetilde{y}\|_{1}\|A_{1}^{\top}b+A_{2}^{\top}y^{*}\|_{1}}\Bigl|\|\widetilde{A}_{1}^{\top}b+\widetilde{A}_{2}^{\top}\widetilde{y}\|_{1}-\|A_{1}^{\top}b+A_{2}^{\top}y^{*}\|_{1}\Bigr|
\leq \frac{\|b\|_{2}^{2}}{c_{2}}\Bigl(c_{3}\|b\|_{1}+c_{3}\sqrt{m}\|\widetilde{y}\|_{2}+c_{1}\sqrt{m}\Bigr)\sum_{j=1}^{n}\|a_{j}\|_{\infty}
\lesssim \sqrt{m}\sum_{j=1}^{n}\|a_{j}\|_{\infty}.

In the first inequality above, we used (8) and (10). ∎

In general, to evaluate the error bounds for the $i$-th layer, we need to approximate $\|\mu_{t}\|_{2}$ by considering the small distance $\|X_{t}^{(i-1)}-\widetilde{X}_{t}^{(i-1)}\|_{2}$ and the effect of consecutive orthogonal projections onto $\widetilde{X}_{t}^{(i-1)\perp}$. Note that $X_{t}^{(i-1)}=\varphi^{(i-1)}(X^{(i-2)}W^{(i-1)}_{t})$ and $\widetilde{X}_{t}^{(i-1)}=\varphi^{(i-1)}(\widetilde{X}^{(i-2)}Q^{(i-1)}_{t})$, where the $t$-th neuron of the $(i-1)$-th layer, denoted by $W_{t}^{(i-1)}\in\mathbb{R}^{N_{i-2}}$, is quantized as $Q_{t}^{(i-1)}\in\mathcal{A}^{N_{i-2}}$. Since all neurons are quantized separately using a stochastic approach with independent random variables, $\{\widetilde{X}_{t}^{(i-1)}-X_{t}^{(i-1)}\}_{t=1}^{N_{i-1}}$ are independent.

Corollary 2.2.

Let $X^{(i-1)}$, $\widetilde{X}^{(i-1)}$ be as in (LABEL:eq:layer-input) such that, for $1\leq t\leq N_{i-1}$, the input discrepancy defined by $\Delta_{t}:=\widetilde{X}_{t}^{(i-1)}-X_{t}^{(i-1)}$ satisfies $\Delta_{t}\leq_{\mathrm{cx}}\mathcal{N}(0,\alpha^{2}I)$, where $\alpha>0$ is a constant. Suppose that $\Delta_{1},\Delta_{2},\ldots,\Delta_{N_{i-1}}$ are independent. Let $w\in\mathbb{R}^{N_{i-1}}$ be the weights associated with a neuron in the $i$-th layer. Quantizing $w$ using (LABEL:eq:quantization-expression) over alphabets $\mathcal{A}=\mathcal{A}_{\infty}^{\delta}$ with step size $\delta>0$,

\|u_{N_{i-1}}\|_{2}\lesssim(\alpha\|w\|_{2}+\sigma_{N_{i-1}})\sqrt{m\log N_{i-1}}

holds with probability exceeding $1-\frac{2}{N_{i-1}}$.

Proof.

Recall that

\mu_{t}=P_{\widetilde{X}_{t}^{(i-1)\perp}}(\mu_{t-1}+w_{t}X_{t}^{(i-1)})\quad\text{with}\quad\mu_{0}=0.

Since $\Delta_{t}:=\widetilde{X}_{t}^{(i-1)}-X_{t}^{(i-1)}$ and $P_{\widetilde{X}_{t}^{(i-1)\perp}}(\widetilde{X}_{t}^{(i-1)})=0$, we have

(11) \mu_{t}=P_{\widetilde{X}_{t}^{(i-1)\perp}}(\mu_{t-1}+w_{t}X_{t}^{(i-1)})=P_{\widetilde{X}_{t}^{(i-1)\perp}}(\mu_{t-1})-w_{t}P_{\widetilde{X}_{t}^{(i-1)\perp}}(\Delta_{t}).

Since $\Delta_{1}\leq_{\mathrm{cx}}\mathcal{N}(0,\alpha^{2}I)$ and $\mu_{0}=0$, we get $\mu_{1}=-w_{1}P_{\widetilde{X}_{1}^{(i-1)\perp}}(\Delta_{1})\leq_{\mathrm{cx}}\mathcal{N}(0,\alpha^{2}w_{1}^{2}P_{\widetilde{X}_{1}^{(i-1)\perp}})$, where we used LABEL:lemma:cx-afine to get the inequality. Assume that the following inequality holds:

\mu_{t-1}\leq_{\mathrm{cx}}\mathcal{N}\biggl(0,\alpha^{2}\sum_{j=1}^{t-1}w_{j}^{2}P_{\widetilde{X}_{t-1}^{(i-1)\perp}}P_{\widetilde{X}_{t-2}^{(i-1)\perp}}\ldots P_{\widetilde{X}_{j}^{(i-1)\perp}}\ldots P_{\widetilde{X}_{t-2}^{(i-1)\perp}}P_{\widetilde{X}_{t-1}^{(i-1)\perp}}\biggr).

Then applying LABEL:lemma:cx-afine and LABEL:lemma:cx-independent-sum to (11) yields

\mu_{t} = P_{\widetilde{X}_{t}^{(i-1)\perp}}(\mu_{t-1})-w_{t}P_{\widetilde{X}_{t}^{(i-1)\perp}}(\Delta_{t})
\leq_{\mathrm{cx}} \mathcal{N}\biggl(0,\alpha^{2}\sum_{j=1}^{t-1}w_{j}^{2}P_{\widetilde{X}_{t}^{(i-1)\perp}}P_{\widetilde{X}_{t-1}^{(i-1)\perp}}\ldots P_{\widetilde{X}_{j}^{(i-1)\perp}}\ldots P_{\widetilde{X}_{t-1}^{(i-1)\perp}}P_{\widetilde{X}_{t}^{(i-1)\perp}}\biggr)+\mathcal{N}\biggl(0,\alpha^{2}w_{t}^{2}P_{\widetilde{X}_{t}^{(i-1)\perp}}\biggr)
= \mathcal{N}\biggl(0,\alpha^{2}\sum_{j=1}^{t}w_{j}^{2}P_{\widetilde{X}_{t}^{(i-1)\perp}}P_{\widetilde{X}_{t-1}^{(i-1)\perp}}\ldots P_{\widetilde{X}_{j}^{(i-1)\perp}}\ldots P_{\widetilde{X}_{t-1}^{(i-1)\perp}}P_{\widetilde{X}_{t}^{(i-1)\perp}}\biggr).

Hence, by induction, we have proved that, for $1\leq t\leq N_{i-1}$,

\mu_{t}\leq_{\mathrm{cx}}\mathcal{N}\biggl(0,\alpha^{2}\sum_{j=1}^{t}w_{j}^{2}P_{\widetilde{X}_{t}^{(i-1)\perp}}P_{\widetilde{X}_{t-1}^{(i-1)\perp}}\ldots P_{\widetilde{X}_{j}^{(i-1)\perp}}\ldots P_{\widetilde{X}_{t-1}^{(i-1)\perp}}P_{\widetilde{X}_{t}^{(i-1)\perp}}\biggr).

Since $P_{\widetilde{X}_{t}^{(i-1)\perp}}P_{\widetilde{X}_{t-1}^{(i-1)\perp}}\ldots P_{\widetilde{X}_{j}^{(i-1)\perp}}\ldots P_{\widetilde{X}_{t-1}^{(i-1)\perp}}P_{\widetilde{X}_{t}^{(i-1)\perp}}\preceq I$ for all $j$, by LABEL:lemma:cx-normal, we have $\mu_{t}\leq_{\mathrm{cx}}\mathcal{N}(0,\alpha^{2}(\sum_{j=1}^{t}w_{j}^{2})I)$. In particular, we get $\mu_{N_{i-1}}\leq_{\mathrm{cx}}\mathcal{N}(0,\alpha^{2}\|w\|_{2}^{2}I)$. It follows from LABEL:lemma:cx-gaussian-tail that

\|\mu_{N_{i-1}}\|_{2}\leq\sqrt{m}\|\mu_{N_{i-1}}\|_{\infty}\lesssim\alpha\|w\|_{2}\sqrt{m\log N_{i-1}}

holds with probability at least $1-\frac{1}{N_{i-1}}$. Additionally, (LABEL:eq:inf-alphabet-tails) implies that with probability exceeding $1-\frac{1}{N_{i-1}}$, we have

\|u_{N_{i-1}}-\mu_{N_{i-1}}\|_{2}\lesssim\sigma_{N_{i-1}}\sqrt{m\log N_{i-1}}

where $\sigma_{N_{i-1}}=\delta\sqrt{\frac{\pi}{2}}\max_{1\leq j\leq N_{i-1}}\|\widetilde{X}_{j}^{(i-1)}\|_{2}$. Thus the union bound yields

\|u_{N_{i-1}}\|_{2}\lesssim(\alpha\|w\|_{2}+\sigma_{N_{i-1}})\sqrt{m\log N_{i-1}}

with probability at least $1-\frac{2}{N_{i-1}}$. ∎

[**JZ: Will consider the special case for orthogonal $\widetilde{X}^{(i-1)}_{t}$.] [**JZ: Previous loose bound: Suppose that $\|\Delta_{t}\|_{2}\leq\alpha\|X_{t}^{(i-1)}\|_{2}$ with $\alpha\in(0,1)$. If $t=1$, then

\|\mu_{1}\|_{2}=\|w_{1}P_{\widetilde{X}_{1}^{(i-1)\perp}}(\Delta_{1}+\widetilde{X}_{1}^{(i-1)})\|_{2}=\|w_{1}P_{\widetilde{X}_{1}^{(i-1)\perp}}\Delta_{1}\|_{2}\leq|w_{1}|\|\Delta_{1}\|_{2}\leq\alpha|w_{1}|\|X_{1}^{(i-1)}\|_{2}.

Assume that $\|\mu_{t-1}\|_{2}\leq\alpha\sum_{j=1}^{t-1}|w_{j}|\|X_{j}^{(i-1)}\|_{2}$. Using the triangle inequality and the induction hypothesis, one can get

\|\mu_{t}\|_{2} = \|P_{\widetilde{X}_{t}^{(i-1)\perp}}(\mu_{t-1}+w_{t}X_{t}^{(i-1)})\|_{2}
\leq \|P_{\widetilde{X}_{t}^{(i-1)\perp}}\mu_{t-1}\|_{2}+\|w_{t}P_{\widetilde{X}_{t}^{(i-1)\perp}}X_{t}^{(i-1)}\|_{2}
\leq \|\mu_{t-1}\|_{2}+\|w_{t}P_{\widetilde{X}_{t}^{(i-1)\perp}}\Delta_{t}\|_{2}
\leq \alpha\sum_{j=1}^{t-1}|w_{j}|\|X_{j}^{(i-1)}\|_{2}+\alpha|w_{t}|\|X_{t}^{(i-1)}\|_{2}
= \alpha\sum_{j=1}^{t}|w_{j}|\|X_{j}^{(i-1)}\|_{2}.

Therefore, by induction, we have

(12) \|\mu_{t}\|_{2}\leq\alpha\sum_{j=1}^{t}|w_{j}|\|X_{j}^{(i-1)}\|_{2}\leq\alpha t\|w\|_{\infty}\max_{1\leq j\leq t}\|X_{j}^{(i-1)}\|_{2}.

]

Recall from (LABEL:eq:ht), (LABEL:eq:ut-bound-infinite-eq2), and (LABEL:eq:ut-bound-infinite-eq4) that

u_{t}=P_{\widetilde{X}_{t}^{(i-1)\perp}}(h_{t})+(v_{t}-q_{t})\widetilde{X}_{t}^{(i-1)}

where

h_{t}=u_{t-1}+w_{t}X^{(i-1)}_{t},\quad v_{t}=\frac{\langle\widetilde{X}_{t}^{(i-1)},h_{t}\rangle}{\|\widetilde{X}_{t}^{(i-1)}\|_{2}^{2}},\quad\text{and}\quad q_{t}=\mathcal{Q}_{\mathrm{StocQ}}(v_{t}).

Let $\Delta_{t}=\widetilde{X}_{t}^{(i-1)}-X_{t}^{(i-1)}$ for all $t$. Then we obtain

\|P_{\widetilde{X}_{t}^{(i-1)\perp}}(h_{t})\|_{2}^{2} = \|P_{\widetilde{X}_{t}^{(i-1)\perp}}(u_{t-1})+w_{t}P_{\widetilde{X}_{t}^{(i-1)\perp}}(X_{t}^{(i-1)})\|_{2}^{2}
= \|P_{\widetilde{X}_{t}^{(i-1)\perp}}(u_{t-1})-w_{t}P_{\widetilde{X}_{t}^{(i-1)\perp}}(\Delta_{t})\|_{2}^{2}
= \|P_{\widetilde{X}_{t}^{(i-1)\perp}}(u_{t-1})\|_{2}^{2}+w_{t}^{2}\|P_{\widetilde{X}_{t}^{(i-1)\perp}}(\Delta_{t})\|_{2}^{2}-2w_{t}\langle P_{\widetilde{X}_{t}^{(i-1)\perp}}(u_{t-1}),P_{\widetilde{X}_{t}^{(i-1)\perp}}(\Delta_{t})\rangle.

It follows that

\|u_{t}\|_{2}^{2}-\|u_{t-1}\|^{2}_{2} = \|P_{\widetilde{X}_{t}^{(i-1)\perp}}(h_{t})+(v_{t}-q_{t})\widetilde{X}_{t}^{(i-1)}\|_{2}^{2}-\|u_{t-1}\|_{2}^{2}
= \|P_{\widetilde{X}_{t}^{(i-1)\perp}}(h_{t})\|_{2}^{2}+(v_{t}-q_{t})^{2}\|\widetilde{X}_{t}^{(i-1)}\|_{2}^{2}-\|u_{t-1}\|_{2}^{2}
= (v_{t}-q_{t})^{2}\|\widetilde{X}_{t}^{(i-1)}\|_{2}^{2}+\|P_{\widetilde{X}_{t}^{(i-1)\perp}}(u_{t-1})\|_{2}^{2}+w_{t}^{2}\|P_{\widetilde{X}_{t}^{(i-1)\perp}}(\Delta_{t})\|_{2}^{2}
-2w_{t}\langle P_{\widetilde{X}_{t}^{(i-1)\perp}}(u_{t-1}),P_{\widetilde{X}_{t}^{(i-1)\perp}}(\Delta_{t})\rangle-\|u_{t-1}\|_{2}^{2}
(13) = (v_{t}-q_{t})^{2}\|\widetilde{X}_{t}^{(i-1)}\|_{2}^{2}-\|P_{\widetilde{X}_{t}^{(i-1)}}(u_{t-1})\|_{2}^{2}+e_{t}

where $e_{t}:=w_{t}^{2}\|P_{\widetilde{X}_{t}^{(i-1)\perp}}(\Delta_{t})\|_{2}^{2}-2w_{t}\langle P_{\widetilde{X}_{t}^{(i-1)\perp}}(u_{t-1}),P_{\widetilde{X}_{t}^{(i-1)\perp}}(\Delta_{t})\rangle$.

[**JZ: The proof for the general $i$-th layer]

Proof.

In general, to evaluate the error bounds for the $i$-th layer (with $i>1$) using (LABEL:eq:inf-alphabet-tails), we need to approximate $\mu_{t}$ by considering recursive orthogonal projections of $X_{j}^{(i-1)}$ onto $\mathrm{span}(\widetilde{X}_{j}^{(i-1)})^{\perp}$ for $1\leq j\leq t$. Specifically, according to (LABEL:eq:def-mu), we have

\mu_{t}=\sum_{j=1}^{t}w_{j}P_{\widetilde{X}^{(i-1)\perp}_{t}}\ldots P_{\widetilde{X}^{(i-1)\perp}_{j+1}}P_{\widetilde{X}^{(i-1)\perp}_{j}}X^{(i-1)}_{j},\quad 1\leq t\leq N_{i-1}.

To control $\|\mu_{t}\|_{2}$ using concentration inequalities, we impose randomness on the weights. In particular, by assuming $w\sim\mathcal{N}(0,I)$, we get $\mu_{t}\sim\mathcal{N}(0,S_{t})$ with $S_{t}$ defined as follows:

(14) S_{t}:=\sum_{j=1}^{t}P_{\widetilde{X}^{(i-1)\perp}_{t}}\ldots P_{\widetilde{X}^{(i-1)\perp}_{j+1}}P_{\widetilde{X}^{(i-1)\perp}_{j}}X^{(i-1)}_{j}X^{(i-1)\top}_{j}P_{\widetilde{X}^{(i-1)\perp}_{j}}P_{\widetilde{X}^{(i-1)\perp}_{j+1}}\ldots P_{\widetilde{X}^{(i-1)\perp}_{t}}\in\mathbb{R}^{m\times m}.

Then $S_{t}^{-\frac{1}{2}}\mu_{t}\sim\mathcal{N}(0,I)$. Applying the Hanson-Wright inequality, see e.g. [rudelson2013hanson] and [vershynin2018high], we obtain for all $\alpha\geq 0$ that

(15) \mathrm{P}\biggl(\Bigl|\|\mu_{t}\|_{2}-\|S_{t}^{1/2}\|_{F}\Bigr|\leq\alpha\biggr)\geq 1-2\exp\biggl(-\frac{c\alpha^{2}}{\|S_{t}^{1/2}\|_{2}^{2}}\biggr).

It remains to evaluate $\|S_{t}^{1/2}\|_{F}$ and $\|S_{t}^{1/2}\|_{2}$. Note that the error bounds for the $(i-1)$-th layer guarantee [**JZ: We may need to discuss the effect of activation functions. Consider two cases: one for large $\|X\|_{2}$, another for small $\|X\|_{2}$.]

\frac{\|X^{(i-1)}_{j}-\widetilde{X}^{(i-1)}_{j}\|_{2}}{\|X^{(i-1)}_{j}\|_{2}}\lesssim\sqrt{\frac{m\log N_{i-2}}{N_{i-2}}}.

It follows from the inequality above and Lemma 2.5 that

(16) \|P_{\widetilde{X}^{(i-1)\perp}_{j}}(X^{(i-1)}_{j})\|_{2}\leq\|X^{(i-1)}_{j}\|_{2}\sqrt{\frac{m\log N_{i-2}}{2N_{i-2}}\biggl(1+\frac{\|X^{(i-1)}_{j}\|_{2}^{2}}{\|\widetilde{X}^{(i-1)}_{j}\|_{2}^{2}}\biggr)}.

Moreover, since $\|A\|_{F}^{2}=\operatorname{tr}(A^{\top}A)$ and $\operatorname{tr}(AB)=\operatorname{tr}(BA)$, we have

\|S_{t}^{1/2}\|_{F}^{2} = \operatorname{tr}(S_{t})
= \sum_{j=1}^{t}\operatorname{tr}(P_{\widetilde{X}^{(i-1)\perp}_{t}}\ldots P_{\widetilde{X}^{(i-1)\perp}_{j+1}}P_{\widetilde{X}^{(i-1)\perp}_{j}}X^{(i-1)}_{j}X^{(i-1)\top}_{j}P_{\widetilde{X}^{(i-1)\perp}_{j}}P_{\widetilde{X}^{(i-1)\perp}_{j+1}}\ldots P_{\widetilde{X}^{(i-1)\perp}_{t}})
= \sum_{j=1}^{t}\operatorname{tr}(X^{(i-1)\top}_{j}P_{\widetilde{X}^{(i-1)\perp}_{j}}P_{\widetilde{X}^{(i-1)\perp}_{j+1}}\ldots P_{\widetilde{X}^{(i-1)\perp}_{t}}\ldots P_{\widetilde{X}^{(i-1)\perp}_{j+1}}P_{\widetilde{X}^{(i-1)\perp}_{j}}X^{(i-1)}_{j})
= \sum_{j=1}^{t}X^{(i-1)\top}_{j}P_{\widetilde{X}^{(i-1)\perp}_{j}}P_{\widetilde{X}^{(i-1)\perp}_{j+1}}\ldots P_{\widetilde{X}^{(i-1)\perp}_{t}}\ldots P_{\widetilde{X}^{(i-1)\perp}_{j+1}}P_{\widetilde{X}^{(i-1)\perp}_{j}}X^{(i-1)}_{j}
= \sum_{j=1}^{t}(P_{\widetilde{X}^{(i-1)\perp}_{j}}X^{(i-1)}_{j})^{\top}P_{\widetilde{X}^{(i-1)\perp}_{j+1}}\ldots P_{\widetilde{X}^{(i-1)\perp}_{t}}\ldots P_{\widetilde{X}^{(i-1)\perp}_{j+1}}(P_{\widetilde{X}^{(i-1)\perp}_{j}}X^{(i-1)}_{j})
\leq \sum_{j=1}^{t}\|P_{\widetilde{X}^{(i-1)\perp}_{j+1}}\ldots P_{\widetilde{X}^{(i-1)\perp}_{t}}\ldots P_{\widetilde{X}^{(i-1)\perp}_{j+1}}\|_{2}\|P_{\widetilde{X}^{(i-1)\perp}_{j}}(X^{(i-1)}_{j})\|_{2}^{2}
\leq \sum_{j=1}^{t}\|P_{\widetilde{X}^{(i-1)\perp}_{j}}(X^{(i-1)}_{j})\|_{2}^{2}.

Here, the first inequality holds because $\max_{\|x\|_{2}=1}x^{\top}Ax=\|A\|_{2}$ for any positive semidefinite matrix $A$. The second inequality is due to $\|P\|_{2}\leq 1$ for any orthogonal projection $P$. Plugging (16) into the inequality above, we get

(17) \|S_{t}^{1/2}\|_{F}^{2}\leq\sum_{j=1}^{t}\frac{m\log N_{i-2}}{2N_{i-2}}\biggl(1+\frac{\|X^{(i-1)}_{j}\|_{2}^{2}}{\|\widetilde{X}^{(i-1)}_{j}\|_{2}^{2}}\biggr)\|X^{(i-1)}_{j}\|_{2}^{2}.

Next, since $\|P\|_{2}\leq 1$ for all orthogonal projections $P$ and $\|aa^{\top}\|_{2}=\|a\|_{2}^{2}$ for all vectors $a\in\mathbb{R}^{m}$, we obtain

\|S_{t}^{1/2}\|_{2}^{2} = \|S_{t}\|_{2}
\leq \sum_{j=1}^{t}\|P_{\widetilde{X}^{(i-1)\perp}_{t}}\ldots P_{\widetilde{X}^{(i-1)\perp}_{j+1}}P_{\widetilde{X}^{(i-1)\perp}_{j}}X^{(i-1)}_{j}X^{(i-1)\top}_{j}P_{\widetilde{X}^{(i-1)\perp}_{j}}P_{\widetilde{X}^{(i-1)\perp}_{j+1}}\ldots P_{\widetilde{X}^{(i-1)\perp}_{t}}\|_{2}
\leq \sum_{j=1}^{t}\|P_{\widetilde{X}^{(i-1)\perp}_{j}}X^{(i-1)}_{j}X^{(i-1)\top}_{j}P_{\widetilde{X}^{(i-1)\perp}_{j}}\|_{2}
= \sum_{j=1}^{t}\|P_{\widetilde{X}^{(i-1)\perp}_{j}}(X^{(i-1)}_{j})\|_{2}^{2}.

Again, by (16), the inequality above becomes

(18) \|S_{t}^{1/2}\|_{2}^{2}\leq\sum_{j=1}^{t}\frac{m\log N_{i-2}}{2N_{i-2}}\biggl(1+\frac{\|X^{(i-1)}_{j}\|_{2}^{2}}{\|\widetilde{X}^{(i-1)}_{j}\|_{2}^{2}}\biggr)\|X^{(i-1)}_{j}\|_{2}^{2}.
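The two norm bounds on $S_{t}$ are easy to verify numerically. The sketch below (illustrative names and sizes; $\widetilde{X}$ is a small perturbation of $X$) builds $S_{t}$ from (14) and checks that $\operatorname{tr}(S_{t})=\|S_{t}^{1/2}\|_{F}^{2}$ and $\|S_{t}\|_{2}=\|S_{t}^{1/2}\|_{2}^{2}$ are both dominated by $\sum_{j}\|P_{\widetilde{X}_{j}^{\perp}}(X_{j})\|_{2}^{2}$.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
m, t = 6, 10
X = rng.standard_normal((m, t))                # columns X_j^{(i-1)}
Xt = X + 0.05 * rng.standard_normal((m, t))    # columns ~X_j^{(i-1)}

def P_perp(v):
    # matrix of the orthogonal projection onto span(v)^perp
    return np.eye(len(v)) - np.outer(v, v) / (v @ v)

S = np.zeros((m, m))
proj_norm_sq = 0.0
for j in range(t):
    B = np.eye(m)
    for k in range(j, t):
        B = P_perp(Xt[:, k]) @ B               # B = P_t ... P_{j+1} P_j
    v = B @ X[:, j]
    S += np.outer(v, v)                        # j-th summand of S_t in (14)
    proj_norm_sq += np.linalg.norm(P_perp(Xt[:, j]) @ X[:, j]) ** 2

print(np.trace(S), np.linalg.norm(S, 2), proj_norm_sq)
# both tr(S) and ||S||_2 are at most sum_j ||P_{~X_j^perp}(X_j)||^2
\end{verbatim}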

[**JZ: Old results by solving the optimization problem:]

Our strategy is to align $X_{k}^{(i-1)}$ with $\widetilde{X}_{k}^{(i-1)}$ for each $k$, which leads to $\mu_{t}=0$. Specifically, given a neuron $w\in\mathbb{R}^{N_{i-1}}$ in layer $i$, recall that our quantization algorithm generates $q\in\mathcal{A}^{N_{i-1}}$ such that $\widetilde{X}^{(i-1)}q$ can track $X^{(i-1)}w$ in the sense of $\ell_{2}$ distance. If one can find a proper vector $\widetilde{w}\in\mathbb{R}^{N_{i-1}}$ satisfying $X^{(i-1)}w=\widetilde{X}^{(i-1)}\widetilde{w}$, then quantizing the new weights $\widetilde{w}$ using data $\widetilde{X}^{(i-1)}$ amounts to solving for $q\in\mathcal{A}^{N_{i-1}}$ such that

(19) \widetilde{X}^{(i-1)}q\approx\widetilde{X}^{(i-1)}\widetilde{w}=X^{(i-1)}w,

which indeed does not change our initial target. Therefore, it suffices to quantize $\widetilde{w}$ using (LABEL:eq:quantization-expression) in which we set $X^{(i-1)}=\widetilde{X}^{(i-1)}$ and $w=\widetilde{w}$. In this case, the modified iterations of quantization are given by

(20) \begin{cases}u_{0}=0\in\mathbb{R}^{m},\\ q_{t}=\mathcal{Q}_{\mathrm{StocQ}}\Bigl(\frac{\langle\widetilde{X}_{t}^{(i-1)},u_{t-1}+\widetilde{w}_{t}\widetilde{X}_{t}^{(i-1)}\rangle}{\|\widetilde{X}^{(i-1)}_{t}\|_{2}^{2}}\Bigr),\\ u_{t}=u_{t-1}+\widetilde{w}_{t}\widetilde{X}^{(i-1)}_{t}-q_{t}\widetilde{X}^{(i-1)}_{t}.\end{cases}
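The following numpy sketch traces the modified iteration (20). The helper stocq is only an assumed form of the stochastic quantizer $\mathcal{Q}_{\mathrm{StocQ}}$, rounding to one of the two nearest alphabet points with probabilities proportional to the opposite distances (the actual definition is (LABEL:eq:quantization-expression)), and the infinite alphabet $\mathcal{A}_{\infty}^{\delta}$ is truncated for illustration.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(3)

def stocq(v, alphabet, rng):
    # assumed stochastic quantizer: round v to one of its two nearest alphabet
    # points, with probabilities proportional to the opposite distances
    a = np.sort(np.asarray(alphabet))
    v = float(np.clip(v, a[0], a[-1]))
    idx = int(np.searchsorted(a, v))
    lo, hi = a[max(idx - 1, 0)], a[min(idx, len(a) - 1)]
    if hi == lo:
        return lo
    return hi if rng.random() < (v - lo) / (hi - lo) else lo

m, N = 8, 50
Xt = rng.standard_normal((m, N))              # columns ~X_t^{(i-1)}
w_tilde = rng.standard_normal(N)              # aligned weights
delta = 0.1
alphabet = delta * np.arange(-100, 101)       # truncated stand-in for A_inf^delta

u = np.zeros(m)
q = np.zeros(N)
for t in range(N):
    h = u + w_tilde[t] * Xt[:, t]
    v_t = (Xt[:, t] @ h) / (Xt[:, t] @ Xt[:, t])
    q[t] = stocq(v_t, alphabet, rng)
    u = h - q[t] * Xt[:, t]                   # u_t = u_{t-1} + ~w_t ~X_t - q_t ~X_t

print(np.linalg.norm(Xt @ w_tilde - Xt @ q))  # residual ||~X w_tilde - ~X q||_2 = ||u_N||_2
\end{verbatim}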

Moreover, the corresponding error bound for quantizing the $i$-th layer with $i>1$ can be derived as follows.

Corollary 2.3.

Let $X^{(i-1)}$, $\widetilde{X}^{(i-1)}$ be as in (LABEL:eq:layer-input) and suppose that $\mathrm{rank}(\widetilde{X}^{(i-1)})=m$. Let $w\in\mathbb{R}^{N_{i-1}}$ be the weights associated with a neuron in the $i$-th layer and let $\widetilde{w}\in\mathbb{R}^{N_{i-1}}$ be any solution of the linear system $\widetilde{X}^{(i-1)}\widetilde{w}=X^{(i-1)}w$. Quantizing $\widetilde{w}$ using (20) over alphabets $\mathcal{A}=\mathcal{A}_{\infty}^{\delta}$ with step size $\delta>0$, the inequality

\|u_{t}\|_{\infty}\leq 2\sigma_{t}\sqrt{\log(\sqrt{2}m/\gamma)}

holds with probability exceeding $1-\gamma$, where $\sigma_{t}^{2}=\frac{\pi\delta^{2}}{2}\max_{1\leq j\leq t}\|\widetilde{X}_{j}^{(i-1)}\|_{2}^{2}$. In particular,

\|X^{(i-1)}w-\widetilde{X}^{(i-1)}q\|_{\infty}\leq 2\sigma_{N_{i-1}}\sqrt{\log(\sqrt{2}m/\gamma)}

holds with probability at least $1-\gamma$.

Proof.

Because it suffices to use $\widetilde{X}^{(i-1)}$ to quantize $\widetilde{w}$, we have $X^{(i-1)}=\widetilde{X}^{(i-1)}$ in (LABEL:eq:quantization-expression). So LABEL:thm:ut-bound-infinite and LABEL:corollary:ut-bound-infinite still hold. It follows from (LABEL:eq:def-mu) that $\mu_{t}=0$ for $1\leq t\leq N_{i-1}$. Additionally, in this case, (LABEL:eq:inf-alphabet-tails) becomes

\|u_{t}\|_{\infty}\leq 2\sigma_{t}\sqrt{\log(\sqrt{2}m/\gamma)},

which holds with probability at least $1-\gamma$. Further, if $t=N_{i-1}$, then, by (19), we have

u_{N_{i-1}}=\widetilde{X}^{(i-1)}q-\widetilde{X}^{(i-1)}\widetilde{w}=\widetilde{X}^{(i-1)}q-X^{(i-1)}w,

and thus

\|X^{(i-1)}w-\widetilde{X}^{(i-1)}q\|_{\infty}=\|u_{N_{i-1}}\|_{\infty}\leq 2\sigma_{N_{i-1}}\sqrt{\log(\sqrt{2}m/\gamma)}

holds with probability at least $1-\gamma$. ∎

Further, if we assume $w\sim\mathcal{N}(0,I)$, then $\mathbb{E}\|X^{(i-1)}w\|_{2}^{2}=\|X^{(i-1)}\|_{F}^{2}$. As a consequence of the Hanson-Wright inequality, see e.g. [rudelson2013hanson], one can obtain

\mathrm{P}\biggl(\Bigl|\|X^{(i-1)}w\|_{2}-\|X^{(i-1)}\|_{F}\Bigr|\geq t\biggr)\leq 2\exp\Biggl(-\frac{ct^{2}}{\|X^{(i-1)}\|_{2}^{2}}\Biggr)

for all $t\geq 0$. Combining this with Corollary 2.3, we deduce the relative error

\frac{\|X^{(i-1)}w-\widetilde{X}^{(i-1)}q\|_{2}}{\|X^{(i-1)}w\|_{2}}\lesssim\frac{\delta\sqrt{m\log m}\max_{1\leq j\leq N_{i-1}}\|\widetilde{X}_{j}^{(i-1)}\|_{2}}{\|X^{(i-1)}\|_{F}}\approx\delta\sqrt{\frac{m\log m}{N_{i-1}}}.

To quantize a neuron $w\in\mathbb{R}^{N_{i-1}}$ in the $i$-th layer using finite alphabets $\mathcal{A}=\mathcal{A}_{K}^{\delta}$ in (LABEL:eq:alphabet-midtread), one can align the input $X^{(i-1)}$ with its analogue $\widetilde{X}^{(i-1)}$ by solving for $\widetilde{w}$ in $\widetilde{X}^{(i-1)}\widetilde{w}=X^{(i-1)}w$. Then it suffices to quantize $\widetilde{w}$ merely based on the input $\widetilde{X}^{(i-1)}$. However, unlike the case of using infinite alphabets, the choice of the solution $\widetilde{w}$ is not arbitrary, since we need to bound $\|\widetilde{w}\|_{\infty}$ such that $\|\widetilde{w}\|_{\infty}\lesssim K\delta$.

Now we pass to a detailed procedure for finding a proper $\widetilde{w}$, assuming that $\mathrm{rank}(\widetilde{X}^{(i-1)})=m$. We will use $A_{S}$ to denote the submatrix of a matrix $A$ with columns indexed by $S$, and $x_{S}$ to denote the restriction of a vector $x$ to the entries indexed by $S$. By permuting columns if necessary, we can assume $\widetilde{X}^{(i-1)}=[\widetilde{X}^{(i-1)}_{T},\widetilde{X}^{(i-1)}_{T^{c}}]$ with $T=\{1,2,\ldots,m\}$ and $\mathrm{rank}(\widetilde{X}^{(i-1)}_{T})=m$. Additionally, we set

\widetilde{w}=\begin{bmatrix}\widetilde{w}_{T}\\ \widetilde{w}_{T^{c}}\end{bmatrix},\quad w=\begin{bmatrix}w_{T}\\ w_{T^{c}}\end{bmatrix}.

Then the linear system $\widetilde{X}^{(i-1)}\widetilde{w}=X^{(i-1)}w$ is equivalent to

(21) (X^{(i-1)}-\widetilde{X}^{(i-1)})w=\widetilde{X}^{(i-1)}(\widetilde{w}-w)=\widetilde{X}^{(i-1)}_{T}(\widetilde{w}_{T}-w_{T})+\widetilde{X}^{(i-1)}_{T^{c}}(\widetilde{w}_{T^{c}}-w_{T^{c}}).

To simplify the problem, we let $\widetilde{w}_{T^{c}}=w_{T^{c}}$. Then the original linear system (21) becomes

(22) \begin{cases}\widetilde{w}_{T^{c}}=w_{T^{c}},\\ \widetilde{X}^{(i-1)}_{T}(\widetilde{w}_{T}-w_{T})=E^{(i-1)}w\end{cases}

where $E^{(i-1)}:=X^{(i-1)}-\widetilde{X}^{(i-1)}$. Since $\widetilde{X}^{(i-1)}_{T}$ is invertible, the solution $\widetilde{w}$ of (22) is unique. Moreover, we have

(23) \|\widetilde{w}-w\|_{2}=\|\widetilde{w}_{T}-w_{T}\|_{2}\leq\sigma_{m}(\widetilde{X}^{(i-1)}_{T})^{-1}\|E^{(i-1)}w\|_{2}\leq\sqrt{m}\,\sigma_{m}(\widetilde{X}^{(i-1)}_{T})^{-1}\|E^{(i-1)}w\|_{\infty}.

Note that $E^{(i-1)}=X^{(i-1)}-\widetilde{X}^{(i-1)}$ has independent columns. Further, assume that the row vectors $e_{1},e_{2},\ldots,e_{m}$ of $E^{(i-1)}$ are isotropic sub-gaussian vectors with $\max_{1\leq i\leq m}\|e_{i}\|_{\psi_{2}}\leq J(N_{i-1})$. It follows that

(24) \|E^{(i-1)}w\|_{\infty}\lesssim J\sqrt{\log m}\|w\|_{2}

holds with high probability. By the triangle inequality, (23), and (24),

(25) \|\widetilde{w}\|_{\infty}\leq\|\widetilde{w}\|_{2}\leq\|w\|_{2}+\|\widetilde{w}-w\|_{2}\leq\|w\|_{2}\biggl(1+\frac{J\sqrt{m\log m}}{\sigma_{m}(\widetilde{X}^{(i-1)}_{T})}\biggr)

holds with high probability.
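A minimal numpy sketch of this construction of $\widetilde{w}$ (illustrative names and sizes; $T$ is taken to be the first $m$ columns, which we assume form a full-rank block):

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(4)
m, N = 6, 20
X = rng.standard_normal((m, N))                  # X^{(i-1)}
Xt = X + 0.01 * rng.standard_normal((m, N))      # ~X^{(i-1)}
w = rng.standard_normal(N)

T = np.arange(m)                                 # assumed full-rank column block

# Solve (22): keep w on T^c and correct the T-block so that ~X w_tilde = X w
E = X - Xt                                       # E^{(i-1)}
w_tilde = w.copy()
w_tilde[T] = w[T] + np.linalg.solve(Xt[:, T], E @ w)

print(np.linalg.norm(Xt @ w_tilde - X @ w))      # ~0: the alignment constraint holds
print(np.linalg.norm(w_tilde - w))               # small when E and 1/sigma_m(~X_T) are small
\end{verbatim}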

Remark 2.4.

According to [rudelson2009smallest], if $A$ is an $m\times m$ random matrix whose entries are independent copies of a mean zero sub-gaussian random variable $B$ with unit variance, then, for every $\epsilon>0$, we have

\mathrm{P}\Bigl(\sigma_{m}(A)\leq\frac{\epsilon}{\sqrt{m}}\Bigr)\leq C\epsilon+e^{-cm}

where $C,c>0$ depend (polynomially) only on $\|B\|_{\psi_{2}}$.

Lemma 2.5.

Let $\epsilon\in(0,\frac{1}{2}]$ and let $x,y\in\mathbb{R}^{m}$ be nonzero vectors such that

(26) \frac{\|x-y\|_{2}}{\|x\|_{2}}\leq\epsilon.

Then we have

\frac{\|P_{y^{\perp}}(x)\|_{2}}{\|x\|_{2}}\leq\frac{\epsilon\sqrt{10}}{2}

where the orthogonal projection $P_{y^{\perp}}=I-\frac{yy^{\top}}{\|y\|_{2}^{2}}$ is given by (LABEL:eq:orth-proj-mat).

Proof.

(26) implies that

\|x-y\|_{2}^{2}=\|x\|_{2}^{2}+\|y\|_{2}^{2}-2\langle x,y\rangle\leq\epsilon^{2}\|x\|_{2}^{2}.

Then we have

(27) \langle x,y\rangle\geq\frac{1}{2}\Bigl((1-\epsilon^{2})\|x\|_{2}^{2}+\|y\|_{2}^{2}\Bigr).

Let $\alpha:=\frac{\|x\|_{2}}{\|y\|_{2}}$. It follows from (27) and the definition of $\alpha$ that

\|P_{y^{\perp}}(x)\|_{2}^{2} = \|x\|_{2}^{2}-\frac{\langle x,y\rangle^{2}}{\|y\|_{2}^{2}}
\leq \|x\|_{2}^{2}-\frac{1}{4\|y\|_{2}^{2}}\Bigl((1-\epsilon^{2})\|x\|_{2}^{2}+\|y\|_{2}^{2}\Bigr)^{2}
= \frac{1+\epsilon^{2}}{2}\|x\|_{2}^{2}-\frac{(1-\epsilon^{2})^{2}}{4}\frac{\|x\|_{2}^{4}}{\|y\|_{2}^{2}}-\frac{1}{4}\|y\|_{2}^{2}
= \frac{1+\epsilon^{2}}{2}\|x\|_{2}^{2}-\frac{(1-\epsilon^{2})^{2}\alpha^{2}}{4}\|x\|_{2}^{2}-\frac{1}{4\alpha^{2}}\|x\|_{2}^{2}
\leq \|x\|_{2}^{2}\Bigl(\frac{(1+\alpha^{2})\epsilon^{2}+1}{2}-\frac{1}{4}\Bigl(\alpha^{2}+\frac{1}{\alpha^{2}}\Bigr)\Bigr)
\leq \frac{(1+\alpha^{2})\epsilon^{2}}{2}\|x\|_{2}^{2}.

In the second-to-last inequality, we used $(1-\epsilon^{2})^{2}\geq 1-2\epsilon^{2}$; in the last inequality, we used the numerical inequality $\alpha^{2}+\frac{1}{\alpha^{2}}\geq 2$ for all $\alpha>0$. Then the result above becomes

(28) \frac{\|P_{y^{\perp}}(x)\|_{2}}{\|x\|_{2}}\leq\epsilon\sqrt{\frac{1+\alpha^{2}}{2}}=\epsilon\sqrt{\frac{1}{2}\Bigl(1+\frac{\|x\|_{2}^{2}}{\|y\|_{2}^{2}}\Bigr)}.

Moreover, by the triangle inequality and (26), we have

\|x\|_{2}\leq\|x-y\|_{2}+\|y\|_{2}\leq\epsilon\|x\|_{2}+\|y\|_{2}.

This implies that

(29) \frac{\|x\|_{2}}{\|y\|_{2}}\leq\frac{1}{1-\epsilon}\leq 2,

where the last inequality is due to $\epsilon\in(0,\frac{1}{2}]$. Plugging (29) into (28), we obtain

\frac{\|P_{y^{\perp}}(x)\|_{2}}{\|x\|_{2}}\leq\frac{\epsilon\sqrt{10}}{2}. ∎
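A quick numerical sanity check of Lemma 2.5 (an illustrative sketch; names and sizes are arbitrary):

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(5)
m = 10
for _ in range(1000):
    x = rng.standard_normal(m)
    eps = rng.uniform(0.01, 0.5)                  # eps in (0, 1/2]
    d = rng.standard_normal(m)
    d *= rng.uniform(0, 1) * eps * np.linalg.norm(x) / np.linalg.norm(d)
    y = x - d                                     # so ||x - y|| <= eps ||x||
    p = x - (y @ x) / (y @ y) * y                 # P_{y^perp}(x)
    assert np.linalg.norm(p) / np.linalg.norm(x) <= eps * np.sqrt(10) / 2 + 1e-12
print("bound of Lemma 2.5 holds on all sampled pairs")
\end{verbatim}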

3. Archived results

Theorem 3.1.

Let $\Phi$ be an $L$-layer neural network as in (LABEL:eq:mlp) where the activation function is $\varphi^{(i)}(x)=\rho(x):=\max\{0,x\}$ for $1\leq i\leq L$. Suppose that each $W^{(i)}\in\mathbb{R}^{N_{i-1}\times N_{i}}$ has i.i.d. $\mathcal{N}(0,\frac{1}{N_{i-1}})$ entries and $\{W^{(i)}\}_{i=1}^{L}$ are independent. Sample data $X\in\mathbb{R}^{m\times N_{0}}$ and quantize $W^{(i)}$ using (LABEL:eq:quantization-algorithm) with alphabet $\mathcal{A}=\mathcal{A}_{\infty}^{\delta^{(i)}}$ where $\delta^{(i)}\in(0,1]$ and $m\leq N_{i}$ for all $1\leq i\leq L$. Fix $p\in\mathbb{N}$ with $p\geq 3$. Then

\max_{1\leq j\leq N_{L}}\|\Phi(X)_{j}-\widetilde{\Phi}(X)_{j}\|_{2}=\max_{1\leq j\leq N_{L}}\|X^{(L)}_{j}-\widetilde{X}^{(L)}_{j}\|_{2}
(30) \leq (2\pi pm)^{\frac{L}{2}}\Bigl(\prod_{i=1}^{L}\log N_{i-1}\Bigr)^{\frac{1}{2}}\prod_{i=1}^{L}\Bigl(4+\sum_{k=1}^{d_{i}-1}\|A^{(i-1)}_{d_{i}-1}\ldots A^{(i-1)}_{k+1}A^{(i-1)}_{k}\|_{2}\Bigr)\max_{1\leq j\leq N_{0}}\|X_{j}\|_{2}

holds with probability at least $1-\sqrt{2}\sum_{i=1}^{L}\frac{mN_{i}}{N_{i-1}^{p}}-\sqrt{2}\sum_{i=2}^{L}\frac{N_{i}}{N_{i-1}^{p-1}}-\sum_{i=1}^{L-1}\frac{N_{i}}{N_{i-1}^{p}}$. Here, $d_{i}$ and $A^{(i-1)}_{j}$ are defined as in Lemma 1.1.

Proof.

To prove (30), we will proceed inductively over the layer indices $i$. The case $i=1$ is trivial, since the error bound for quantizing $W^{(1)}$ is given by part (a) of LABEL:thm:error-bound-single-layer-infinite. Additionally, by a union bound, the quantization of $W^{(1)}$ yields

\max_{1\leq j\leq N_{1}}\|X^{(1)}_{j}-\widetilde{X}^{(1)}_{j}\|_{2} = \max_{1\leq j\leq N_{1}}\|\rho(XW_{j}^{(1)})-\rho(XQ_{j}^{(1)})\|_{2}
\leq \max_{1\leq j\leq N_{1}}\|XW_{j}^{(1)}-XQ_{j}^{(1)}\|_{2}
(31) \leq \delta^{(1)}\sqrt{2\pi pm\log N_{0}}\max_{1\leq j\leq N_{0}}\|X_{j}\|_{2}

with probability at least $1-\frac{\sqrt{2}mN_{1}}{N_{0}^{p}}$. Note that the function $f(z):=\|\rho(Xz)\|_{2}$ is Lipschitz with Lipschitz constant $L_{f}=\|X\|_{2}$ and that $\sqrt{N_{0}}\|X^{(1)}_{j}\|_{2}=\sqrt{N_{0}}\|\rho(XW^{(1)}_{j})\|_{2}=f(\sqrt{N_{0}}W_{j}^{(1)})$ with $\sqrt{N_{0}}W_{j}^{(1)}\sim\mathcal{N}(0,I)$. Applying LABEL:lemma:Lip-concentration to $f$ with $X=\sqrt{N_{0}}W_{j}^{(1)}$ and $\alpha=\sqrt{2p\log N_{0}}\|X\|_{2}$, we obtain

(32) \mathrm{P}\Bigl(\|X_{j}^{(1)}\|_{2}-\mathbb{E}\|X_{j}^{(1)}\|_{2}\leq\sqrt{\frac{2p\log N_{0}}{N_{0}}}\|X\|_{2}\Bigr)\geq 1-\frac{1}{N_{0}^{p}}.

Using Jensen’s inequality and the identity $\mathbb{E}\|\rho(Xg)\|_{2}^{2}=\frac{1}{2}\|X\|_{F}^{2}$ where $g\sim\mathcal{N}(0,I)$, we have

\mathbb{E}\|X_{j}^{(1)}\|_{2}\leq\sqrt{\mathbb{E}\|X_{j}^{(1)}\|_{2}^{2}}=\sqrt{\mathbb{E}\|\rho(XW^{(1)}_{j})\|_{2}^{2}}=\frac{1}{\sqrt{2N_{0}}}\|X\|_{F}.

Applying the inequalities above to (32) and taking the union bound over all $N_{1}$ neurons in $W^{(1)}$, we obtain that

(33) \max_{1\leq j\leq N_{1}}\|X_{j}^{(1)}\|_{2}\leq\frac{1}{\sqrt{2N_{0}}}\bigl(\|X\|_{F}+2\sqrt{p\log N_{0}}\|X\|_{2}\bigr)\leq\sqrt{\frac{2\pi p\log N_{0}}{N_{0}}}\|X\|_{F}

holds with probability exceeding $1-\frac{N_{1}}{N_{0}^{p}}$.

Now, we consider $i=2$. Let $\mathcal{E}$ be the event that both (31) and (33) hold. Conditioning on $\mathcal{E}$, we quantize $W^{(2)}\in\mathbb{R}^{N_{1}\times N_{2}}$. Since $W^{(2)}_{j}\sim\mathcal{N}(0,\frac{1}{N_{1}}I)$ and $m\leq N_{1}$, LABEL:lemma:cx-gaussian-tail yields

(34) \mathrm{P}\Bigl(\|W^{(2)}_{j}\|_{\infty}\leq\sqrt{\frac{2\pi p\log N_{1}}{m}}\Bigr)\geq\mathrm{P}\Bigl(\|W^{(2)}_{j}\|_{\infty}\leq 2\sqrt{\frac{p\log N_{1}}{N_{1}}}\Bigr)\geq 1-\frac{\sqrt{2}}{N_{1}^{p-1}},\quad 1\leq j\leq N_{2}.

Using LABEL:thm:error-bound-single-layer-infinite with $i=2$, $\delta=\delta^{(2)}$, and applying (34), we have that, conditioning on $\mathcal{E}$,

\|X^{(1)}W^{(2)}_{j}-\widetilde{X}^{(1)}Q^{(2)}_{j}\|_{2}
\leq \biggl(m\|W^{(2)}_{j}\|_{\infty}\Bigl(2+\sum_{k=1}^{d_{2}-1}\|A^{(1)}_{d_{2}-1}\ldots A^{(1)}_{k+1}A^{(1)}_{k}\|_{2}\Bigr)+\delta^{(2)}\sqrt{2\pi pm\log N_{1}}\biggr)\max_{1\leq j\leq N_{1}}\|X^{(1)}_{j}-\widetilde{X}_{j}^{(1)}\|_{2}+\delta^{(2)}\sqrt{2\pi pm\log N_{1}}\max_{1\leq j\leq N_{1}}\|X^{(1)}_{j}\|_{2}
\leq \Bigl(2+\sum_{k=1}^{d_{2}-1}\|A^{(1)}_{d_{2}-1}\ldots A^{(1)}_{k+1}A^{(1)}_{k}\|_{2}+\delta^{(2)}\Bigr)\sqrt{2\pi pm\log N_{1}}\max_{1\leq j\leq N_{1}}\|X^{(1)}_{j}-\widetilde{X}_{j}^{(1)}\|_{2}
(35) +\delta^{(2)}\sqrt{2\pi pm\log N_{1}}\max_{1\leq j\leq N_{1}}\|X^{(1)}_{j}\|_{2}

holds with probability exceeding $1-\sqrt{2}mN_{1}^{-p}-\sqrt{2}N_{1}^{-p+1}$. Moreover, taking a union bound over (31), (33), and (35), we obtain

\max_{1\leq j\leq N_{2}}\|X^{(2)}_{j}-\widetilde{X}^{(2)}_{j}\|_{2}
\leq \max_{1\leq j\leq N_{2}}\|X^{(1)}W^{(2)}_{j}-\widetilde{X}^{(1)}Q^{(2)}_{j}\|_{2}
\leq 2\pi\delta^{(1)}pm\sqrt{\log N_{0}\log N_{1}}\Bigl(2+\sum_{k=1}^{d_{2}-1}\|A^{(1)}_{d_{2}-1}\ldots A^{(1)}_{k+1}A^{(1)}_{k}\|_{2}+\delta^{(2)}\Bigr)\max_{1\leq j\leq N_{0}}\|X_{j}\|_{2}+2\pi\delta^{(2)}p\sqrt{\frac{m\log N_{0}\log N_{1}}{N_{0}}}\|X\|_{F}
\leq 2\pi mp\sqrt{\log N_{0}\log N_{1}}\Bigl(4+\sum_{k=1}^{d_{2}-1}\|A^{(1)}_{d_{2}-1}\ldots A^{(1)}_{k+1}A^{(1)}_{k}\|_{2}\Bigr)\max_{1\leq j\leq N_{0}}\|X_{j}\|_{2}

with probability at least $1-\frac{\sqrt{2}mN_{2}}{N_{1}^{p}}-\frac{\sqrt{2}mN_{1}}{N_{0}^{p}}-\frac{\sqrt{2}N_{2}}{N_{1}^{p-1}}-\frac{N_{1}}{N_{0}^{p}}$. In the last inequality, we used $\|X\|_{F}\leq\sqrt{N_{0}}\max_{1\leq j\leq N_{0}}\|X_{j}\|_{2}$ and the assumption that $\delta^{(i)}\leq 1$.

Finally, (30) follows inductively by using the same proof technique as above. ∎

Now we pass to the case of finite alphabets using a similar proof technique except that weights are assumed to be Gaussian.

Lemma 3.2 (Finite alphabets).

Suppose that the weight matrix $W^{(1)}\in\mathbb{R}^{N_{0}\times N_{1}}$ has i.i.d. $\mathcal{N}(0,\frac{1}{N_{1}})$ entries. If we quantize $W^{(1)}$ using (LABEL:eq:quantization-algorithm) with alphabet $\mathcal{A}=\mathcal{A}^{\delta}_{K}$ defined by (LABEL:eq:alphabet-midtread) and input data $X\in\mathbb{R}^{m\times N_{0}}$, then for every column (neuron) $w\in\mathbb{R}^{N_{0}}$ of $W^{(1)}$,

(36) \|Xw-Xq\|_{2}\leq\delta\eta_{N_{0}}\sqrt{2\pi mp\log N_{1}}

holds with probability at least $1-\frac{\sqrt{2}m}{N_{1}^{p}}-\sqrt{2}\exp\Bigl(-\frac{K^{2}\delta^{2}N_{1}}{4}\Bigr)-\sqrt{2}\sum_{t=2}^{N_{0}}\exp\Bigl(-\frac{K^{2}\delta^{2}}{4\beta_{t}^{2}}\Bigr)$. Here, $\eta_{t}:=\max_{1\leq j\leq t}\|X_{j}\|_{2}$ and $\beta_{t}^{2}:=\frac{1}{N_{1}}+\frac{\pi\delta^{2}\eta_{t-1}^{2}}{2\|X_{t}\|_{2}^{2}}$.

Proof.

Fix a neuron $w=W^{(1)}_{j}\in\mathbb{R}^{N_{0}}$ for some $1\leq j\leq N_{1}$. Additionally, we have $X^{(0)}=\widetilde{X}^{(0)}=X$ in (LABEL:eq:quantization-algorithm) when $i=1$. At the $t$-th iteration of quantizing $w$, similar to (LABEL:eq:error-bound-step2-infinite-eq1), (LABEL:eq:error-bound-step2-infinite-eq3), and (LABEL:eq:error-bound-step2-infinite-eq5), we have

u_{t}=P_{X_{t}^{\perp}}(h_{t})+(v_{t}-q_{t})X_{t}

where

(37) h_{t}=u_{t-1}+w_{t}X_{t},\quad v_{t}=\frac{\langle X_{t},h_{t}\rangle}{\|X_{t}\|_{2}^{2}},\quad\text{and}\quad q_{t}=\mathcal{Q}_{\mathrm{StocQ}}(v_{t}).

To prove (36), we proceed by induction on $t$. If $t=1$, then $h_{1}=w_{1}X_{1}$, $v_{1}=w_{1}$, and $q_{1}=\mathcal{Q}_{\mathrm{StocQ}}(v_{1})$. Since $w_{1}\sim\mathcal{N}(0,\frac{1}{N_{1}})$, LABEL:lemma:cx-gaussian-tail indicates

\mathrm{P}\Bigl(|w_{1}|\leq K\delta\Bigr)\geq 1-\sqrt{2}\exp\Bigl(-\frac{K^{2}\delta^{2}N_{1}}{4}\Bigr).

Conditioning on the event $\{|w_{1}|\leq K\delta\}$, we get $v_{1}=w_{1}\in[-K\delta,K\delta]$. Hence, $|v_{1}-q_{1}|\leq\delta$ and the proof technique used for the case $t=1$ in LABEL:thm:ut-bound-infinite can be applied here. It implies that $u_{1}\leq_{\mathrm{cx}}\mathcal{N}(0,\Sigma_{1})$ with $\Sigma_{1}=\frac{\pi\delta^{2}}{2}X_{1}X_{1}^{\top}$. By LABEL:corollary:ut-bound-infinite, we obtain $u_{1}\leq_{\mathrm{cx}}\mathcal{N}(0,\sigma_{1}^{2}I)$ with $\sigma_{1}^{2}=\frac{\pi}{2}\delta^{2}\eta_{1}^{2}$.

Next, for $t\geq 2$, assume that $u_{t-1}\leq_{\mathrm{cx}}\mathcal{N}(0,\sigma_{t-1}^{2}I)$ holds where $\sigma_{t-1}^{2}=\frac{\pi\delta^{2}}{2}\eta_{t-1}^{2}$. Since $w_{t}\sim\mathcal{N}(0,\frac{1}{N_{1}})$ and $w_{t}$ is independent of $u_{t-1}$, it follows from (37), LABEL:lemma:cx-afine, and LABEL:lemma:cx-independent-sum that

v_{t}=\frac{\langle X_{t},u_{t-1}\rangle}{\|X_{t}\|_{2}^{2}}+w_{t}\leq_{\mathrm{cx}}\mathcal{N}(0,\beta^{2}_{t})

where $\beta^{2}_{t}:=\frac{1}{N_{1}}+\frac{\pi\delta^{2}\eta_{t-1}^{2}}{2\|X_{t}\|_{2}^{2}}$. It follows from LABEL:lemma:cx-gaussian-tail that

\mathrm{P}(|v_{t}|\leq K\delta)\geq 1-\sqrt{2}\exp\Bigl(-\frac{K^{2}\delta^{2}}{4\beta_{t}^{2}}\Bigr).

On the event $\{|v_{t}|\leq K\delta\}$, we can quantize $v_{t}$ as if the quantizer $\mathcal{Q}_{\mathrm{StocQ}}$ were over the infinite alphabet $\mathcal{A}^{\delta}_{\infty}$. So $u_{t}\leq_{\mathrm{cx}}\mathcal{N}(0,\sigma^{2}_{t}I)$ with $\sigma_{t}^{2}=\frac{\pi}{2}\delta^{2}\eta_{t}^{2}$.

Therefore the induction steps above indicate that

(38) \mathrm{P}\Bigl(u_{N_{0}}\leq_{\mathrm{cx}}\mathcal{N}(0,\sigma_{N_{0}}^{2}I)\Bigr)\geq 1-\sqrt{2}\exp\Bigl(-\frac{K^{2}\delta^{2}N_{1}}{4}\Bigr)-\sqrt{2}\sum_{t=2}^{N_{0}}\exp\Bigl(-\frac{K^{2}\delta^{2}}{4\beta_{t}^{2}}\Bigr),

where $\sigma_{N_{0}}^{2}=\frac{\pi}{2}\delta^{2}\eta_{N_{0}}^{2}$. Conditioning on $u_{N_{0}}\leq_{\mathrm{cx}}\mathcal{N}(0,\sigma_{N_{0}}^{2}I)$, LABEL:corollary:ut-bound-infinite leads to

\mathrm{P}\Bigl(\|u_{N_{0}}\|_{\infty}\leq 2\sigma_{N_{0}}\sqrt{\log(\sqrt{2}m/\gamma)}\Bigr)\geq 1-\gamma.

So $\|u_{N_{0}}\|_{\infty}\leq 2\sigma_{N_{0}}\sqrt{\log(\sqrt{2}m/\gamma)}$ holds with probability exceeding

1-\gamma-\sqrt{2}\exp\Bigl(-\frac{K^{2}\delta^{2}N_{1}}{4}\Bigr)-\sqrt{2}\sum_{t=2}^{N_{0}}\exp\Bigl(-\frac{K^{2}\delta^{2}}{4\beta_{t}^{2}}\Bigr).

Setting $\gamma=\sqrt{2}mN_{1}^{-p}$, we obtain

\|Xw-Xq\|_{2}=\|u_{N_{0}}\|_{2}\leq\sqrt{m}\|u_{N_{0}}\|_{\infty}\leq 2\sigma_{N_{0}}\sqrt{m\log N_{1}^{p}}=2\sigma_{N_{0}}\sqrt{mp\log N_{1}}

holds with probability at least $1-\frac{\sqrt{2}m}{N_{1}^{p}}-\sqrt{2}\exp\Bigl(-\frac{K^{2}\delta^{2}N_{1}}{4}\Bigr)-\sqrt{2}\sum_{t=2}^{N_{0}}\exp\Bigl(-\frac{K^{2}\delta^{2}}{4\beta_{t}^{2}}\Bigr)$. This completes the proof. ∎

Theorem 3.3.

Let $\Phi$ be a two-layer neural network as in (LABEL:eq:mlp) where $L=2$ and the activation function is $\varphi^{(i)}(x)=\rho(x):=\max\{0,x\}$ for all $1\leq i\leq L$. Suppose that each $W^{(i)}\in\mathbb{R}^{N_{i-1}\times N_{i}}$ has i.i.d. $\mathcal{N}(0,\frac{1}{N_{i}})$ entries and $\{W^{(i)}\}_{i=1}^{L}$ are independent. Let $m,p\in\mathbb{N}^{+}$ with $p\geq 2$,

(39) \delta^{(1)}=\frac{1}{\sqrt{mN_{1}}},\quad\delta^{(2)}=\frac{1}{\sqrt{N_{2}}},\quad K\gtrsim\sqrt{m\log(N_{0}N_{1}N_{2})}.

Suppose the input data $X\in\mathbb{R}^{m\times N_{0}}$ satisfies

(40) \max_{1\leq j\leq N_{0}}\|X_{j}\|_{2}\lesssim\frac{1}{\sqrt{d_{1}}}\|X\|_{F},\quad\text{and}\quad\|X\|_{2}\lesssim\frac{1}{\sqrt{d_{2}}}\|X\|_{F}

for $pN_{1}\log N_{1}\lesssim d_{1}\leq N_{0}$ and $p\log N_{1}\lesssim d_{2}\leq m$. If we quantize $W^{(i)}$ using (LABEL:eq:quantization-algorithm) with alphabet $\mathcal{A}=\mathcal{A}_{K}^{\delta^{(i)}}$ and data $X\in\mathbb{R}^{m\times N_{0}}$, then

(41) \max_{1\leq j\leq N_{2}}\|X^{(1)}W^{(2)}_{j}-\widetilde{X}^{(1)}Q^{(2)}_{j}\|_{2}\leq 2\|X\|_{F}\sqrt{\frac{pm\log N_{2}}{N_{1}N_{2}}}

holds with high probability.

Proof.

The proof is organized into four steps. In step 1, we will use the randomness of the weights in the first layer, as well as LABEL:thm:error-bound-finite-first-layer, to control the norm of the difference between $X^{(1)}_{j}$ and $\widetilde{X}^{(1)}_{j}$ in (43), as well as the deviation of the norm of $X_{j}^{(1)}$ from its expectation, in (44). By subsequently controlling the expectation, we obtain upper and lower bounds on $\|X_{j}^{(1)}\|_{2}$ that hold with high probability in (45). In step 2, we condition on the event that the above derived bounds hold, and control the magnitude of the quantizer’s argument for quantizing the first weight of the second layer. This results in (48), which in turn leads to (49) showing that $u_{1}$ is dominated in the convex order by an appropriate Gaussian. This forms the base case for an induction argument to control the norm of the error in the second layer. In step 3, we complete the induction argument by dealing with indices $t\geq 2$, resulting in (57) showing that $u_{t}$ is also convexly dominated by an appropriate Gaussian. Finally, in step 4, we use these results to obtain an error bound (41) that holds with high probability.

Step 1: Following , we define $\eta_{0}:=0$ and

(42) \eta_{t}:=\max_{1\leq j\leq t}\|X_{j}\|_{2},\quad\beta_{t}^{2}:=\frac{1}{N_{1}}+\frac{\pi(\delta^{(1)}\eta_{t-1})^{2}}{2\|X_{t}\|_{2}^{2}},\quad 1\leq t\leq N_{0}.

Given step size $\delta^{(1)}$, by LABEL:thm:error-bound-finite-first-layer and a union bound, the quantization of $W^{(1)}$ yields

\max_{1\leq j\leq N_{1}}\|X^{(1)}_{j}-\widetilde{X}^{(1)}_{j}\|_{2} = \max_{1\leq j\leq N_{1}}\|\rho(XW_{j}^{(1)})-\rho(XQ_{j}^{(1)})\|_{2}
\leq \max_{1\leq j\leq N_{1}}\|XW_{j}^{(1)}-XQ_{j}^{(1)}\|_{2}
(43) \leq \eta_{N_{0}}\delta^{(1)}\sqrt{2\pi mp\log N_{1}}

with probability at least

1-\frac{\sqrt{2}m}{N_{1}^{p-1}}-\sqrt{2}N_{1}\exp\Bigl(-\frac{(K\delta^{(1)})^{2}N_{1}}{4}\Bigr)-\sqrt{2}N_{1}\sum_{t=2}^{N_{0}}\exp\Bigl(-\frac{(K\delta^{(1)})^{2}}{4\beta_{t}^{2}}\Bigr).

Note that the function $f(z):=\|\rho(Xz)\|_{2}$ is Lipschitz with Lipschitz constant $L_{f}=\|X\|_{2}$ and that $\sqrt{N_{1}}\|X^{(1)}_{j}\|_{2}=\sqrt{N_{1}}\|\rho(XW^{(1)}_{j})\|_{2}=f(\sqrt{N_{1}}W_{j}^{(1)})$ with $\sqrt{N_{1}}W_{j}^{(1)}\sim\mathcal{N}(0,I)$. Applying LABEL:lemma:Lip-concentration to $f$ with $X=\sqrt{N_{1}}W_{j}^{(1)}$ and $\alpha=\sqrt{2p\log N_{1}}\|X\|_{2}$, we obtain

(44) \mathrm{P}\Bigl(\bigl|\|X_{j}^{(1)}\|_{2}-\mathbb{E}\|X_{j}^{(1)}\|_{2}\bigr|\leq\sqrt{\frac{2p\log N_{1}}{N_{1}}}\|X\|_{2}\Bigr)\geq 1-\frac{2}{N_{1}^{p}}.

Using Jensen’s inequality and the identity $\mathbb{E}\|\rho(Xg)\|_{2}^{2}=\frac{1}{2}\|X\|_{F}^{2}$ where $g\sim\mathcal{N}(0,I)$, we have

\mathbb{E}\|X_{j}^{(1)}\|_{2}\leq\sqrt{\mathbb{E}\|X_{j}^{(1)}\|_{2}^{2}}=\sqrt{\mathbb{E}\|\rho(XW^{(1)}_{j})\|_{2}^{2}}=\frac{1}{\sqrt{2N_{1}}}\|X\|_{F}.

Additionally, LABEL:prop:expect-relu-gaussian implies that

\mathbb{E}\|X_{j}^{(1)}\|_{2}=\mathbb{E}\|\rho(XW^{(1)}_{j})\|_{2}\geq\frac{1}{\sqrt{2\pi N_{1}}}\|X\|_{F}.

Applying the inequalities above to (44) and taking the union bound over all $N_{1}$ neurons in $W^{(1)}$, we obtain that the inequality

(45) \frac{1}{\sqrt{2\pi N_{1}}}\bigl(\|X\|_{F}-2\sqrt{\pi p\log N_{1}}\|X\|_{2}\bigr)\leq\|X_{j}^{(1)}\|_{2}\leq\frac{1}{\sqrt{2N_{1}}}\bigl(\|X\|_{F}+2\sqrt{p\log N_{1}}\|X\|_{2}\bigr)

holds uniformly for all $1\leq j\leq N_{1}$ with probability exceeding $1-\frac{2}{N_{1}^{p-1}}$.

Step 2: Let \mathcal{E} be the event that both (43) and (45) hold. Conditioning on \mathcal{E}, we quantize the second layer W(2)N1×N2W^{(2)}\in\mathbb{R}^{N_{1}\times N_{2}}. Fix a neuron w=Wj(2)N1w=W^{(2)}_{j}\in\mathbb{R}^{N_{1}} for some 1jN21\leq j\leq N_{2}. At the tt-th iteration of quantizing ww, similar to (LABEL:eq:error-bound-step2-infinite-eq1), (LABEL:eq:error-bound-step2-infinite-eq2), and (LABEL:eq:error-bound-step2-infinite-eq3), we have

(46) ut=ut1+wtXt(1)qtX~t(1),vt=X~t(1),ut1+wtXt(1)X~t(1)22,andqt=𝒬StocQ(vt).u_{t}=u_{t-1}+w_{t}X^{(1)}_{t}-q_{t}\widetilde{X}^{(1)}_{t},\quad v_{t}=\frac{\langle\widetilde{X}_{t}^{(1)},u_{t-1}+w_{t}X^{(1)}_{t}\rangle}{\|\widetilde{X}_{t}^{(1)}\|_{2}^{2}},\quad\text{and}\quad q_{t}=\mathcal{Q}_{\mathrm{StocQ}}(v_{t}).
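
To make the recursion in (46) concrete, the following Python sketch spells out one pass of the data-alignment quantization for a single neuron of the second layer. It is an illustration only: the function names stoc_quant and quantize_neuron, the use of NumPy, and the particular form of the stochastic rounding are assumptions made for this sketch, not part of the formal definition of the algorithm in (LABEL:eq:quantization-algorithm).

    import numpy as np

    def stoc_quant(v, delta, K):
        # Unbiased stochastic rounding of the scalar v onto the grid
        # {k * delta : |k| <= K} (an assumed model of the quantizer Q_StocQ).
        v = np.clip(v, -K * delta, K * delta)
        lo = np.floor(v / delta) * delta
        p = (v - lo) / delta                  # probability of rounding up
        return lo + delta * float(np.random.rand() < p)

    def quantize_neuron(w, X1, X1_tilde, delta, K):
        # Quantize one neuron w of the second layer following (46).
        # Columns X1[:, t] and X1_tilde[:, t] play the roles of X_t^{(1)} and
        # tilde{X}_t^{(1)} (assumed nonzero); u carries the running error u_t.
        m, N1 = X1.shape
        u = np.zeros(m)
        q = np.zeros(N1)
        for t in range(N1):
            xt, xt_q = X1[:, t], X1_tilde[:, t]
            v = xt_q @ (u + w[t] * xt) / (xt_q @ xt_q)   # v_t in (46)
            q[t] = stoc_quant(v, delta, K)               # q_t in (46)
            u = u + w[t] * xt - q[t] * xt_q              # u_t in (46)
        # On exit, u equals X1 @ w - X1_tilde @ q, the error bounded in (57).
        return q, u

When |v_{t}|\leq K\delta, the rounding in this sketch is conditionally unbiased, i.e. the conditional expectation of q_{t} equals v_{t}; this is the property exploited by the convex-order comparisons below.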

To prove (41), we proceed by induction on tt. Let t=1t=1. In this case, due to w1𝒩(0,1N2)w_{1}\sim\mathcal{N}(0,\frac{1}{N_{2}}), we have

v1=X~1(1),X1(1)X~1(1)22w1𝒩(0,X~1(1),X1(1)2N2X~1(1)24)v_{1}=\frac{\langle\widetilde{X}_{1}^{(1)},X^{(1)}_{1}\rangle}{\|\widetilde{X}_{1}^{(1)}\|_{2}^{2}}\cdot w_{1}\sim\mathcal{N}\Biggl{(}0,\frac{\langle\widetilde{X}_{1}^{(1)},X^{(1)}_{1}\rangle^{2}}{N_{2}\|\widetilde{X}_{1}^{(1)}\|_{2}^{4}}\Biggr{)}

and q1=𝒬StocQ(v1)q_{1}=\mathcal{Q}_{\mathrm{StocQ}}(v_{1}). Additionally, (43), (45), and (40) imply that

X1(1)2N2X~1(1)2\displaystyle\frac{\|X^{(1)}_{1}\|_{2}}{\sqrt{N_{2}}\|\widetilde{X}_{1}^{(1)}\|_{2}} 1N2X1(1)2X1(1)2X1(1)X~1(1)2\displaystyle\leq\frac{1}{\sqrt{N_{2}}}\cdot\frac{\|X^{(1)}_{1}\|_{2}}{\|X_{1}^{(1)}\|_{2}-\|X^{(1)}_{1}-\widetilde{X}_{1}^{(1)}\|_{2}}
πN2XF+2plogN1X2XF2πplogN1X22πδ(1)ηN0mpN1logN1\displaystyle\leq\sqrt{\frac{\pi}{N_{2}}}\cdot\frac{\|X\|_{F}+2\sqrt{p\log N_{1}}\|X\|_{2}}{\|X\|_{F}-2\sqrt{\pi p\log N_{1}}\|X\|_{2}-2\pi\delta^{(1)}\eta_{N_{0}}\sqrt{mpN_{1}\log N_{1}}}
πN21+2plogN1d212πplogN1d22πδ(1)mpN1logN1d1\displaystyle\leq{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}\sqrt{\frac{\pi}{N_{2}}}\cdot\frac{1+2\sqrt{\frac{p\log N_{1}}{d_{2}}}}{1-2\sqrt{\frac{\pi p\log N_{1}}{d_{2}}}-2\pi\delta^{(1)}\sqrt{\frac{mpN_{1}\log N_{1}}{d_{1}}}}}
(47) =:c1\displaystyle=:c_{1}

Using the Cauchy-Schwarz inequality and (47), we obtain |X~1(1),X1(1)|N2X~1(1)22X1(1)2N2X~1(1)2c1\frac{|\langle\widetilde{X}_{1}^{(1)},X^{(1)}_{1}\rangle|}{\sqrt{N_{2}}\|\widetilde{X}_{1}^{(1)}\|_{2}^{2}}\leq\frac{\|X^{(1)}_{1}\|_{2}}{\sqrt{N_{2}}\|\widetilde{X}_{1}^{(1)}\|_{2}}\leq c_{1}. Thus we can apply LABEL:lemma:cx-normal and deduce v1cx𝒩(0,c12)v_{1}\leq_{\mathrm{cx}}\mathcal{N}(0,c_{1}^{2}). By LABEL:lemma:cx-gaussian-tail, we get

(48) P(|v1|Kδ(2))12exp((Kδ(2))24c12).\mathrm{P}(|v_{1}|\leq K\delta^{(2)})\geq 1-\sqrt{2}\exp\Bigl{(}-\frac{(K\delta^{(2)})^{2}}{4c_{1}^{2}}\Bigr{)}.

Conditioning on the event {|v1|Kδ(2)}\{|v_{1}|\leq K\delta^{(2)}\}, we have |v1q1|δ(2)|v_{1}-q_{1}|\leq\delta^{(2)} and the proof technique used for the case t=1t=1 in LABEL:thm:ut-bound-infinite still works here. LABEL:corollary:ut-bound-infinite implies that, conditioning on w1w_{1}, we have

(49) u1|w1cx𝒩(μ1,σ12I) with μ1=w1PX~1(1)(X1(1)) and σ12=π2(δ(2))2X~1(1)22.u_{1}\,|\,w_{1}\leq_{\mathrm{cx}}\mathcal{N}(\mu_{1},\sigma_{1}^{2}I)\text{\quad with \quad}\mu_{1}=w_{1}P_{\widetilde{X}^{(1)\perp}_{1}}(X_{1}^{(1)})\text{\quad and \quad}\sigma_{1}^{2}=\frac{\pi}{2}(\delta^{(2)})^{2}\|\widetilde{X}_{1}^{(1)}\|_{2}^{2}.

Step 3: Next, for t2t\geq 2, we assume

(50) ut1|{wi}i=1t1cx𝒩(μt1,σt12I)u_{t-1}\,|\,\{w_{i}\}_{i=1}^{t-1}\leq_{\mathrm{cx}}\mathcal{N}(\mu_{t-1},\sigma_{t-1}^{2}I)

where

μt1=j=1t1wjPX~t1(1)PX~j+1(1)PX~j(1)Xj(1),σt12=π2(δ(2))2max1jt1X~j(1)22.\mu_{t-1}=\sum_{j=1}^{t-1}w_{j}P_{\widetilde{X}^{(1)\perp}_{t-1}}\ldots P_{\widetilde{X}^{(1)\perp}_{j+1}}P_{\widetilde{X}^{(1)\perp}_{j}}X^{(1)}_{j},\quad\sigma_{t-1}^{2}=\frac{\pi}{2}(\delta^{(2)})^{2}\max_{1\leq j\leq t-1}\|\widetilde{X}^{(1)}_{j}\|_{2}^{2}.

Note that the randomness in (50) comes from the stochastic quantizer 𝒬StocQ\mathcal{Q}_{\mathrm{StocQ}}. Due to the independence of the weights wj𝒩(0,1N2)w_{j}\sim\mathcal{N}(0,\frac{1}{N_{2}}), we have

(51) μt1𝒩(0,St1)\mu_{t-1}\sim\mathcal{N}(0,S_{t-1})

with

St1:=1N2j=1t1PX~t1(1)PX~j+1(1)PX~j(1)Xj(1)Xj(1)PX~j(1)PX~j+1(1)PX~t1(1).S_{t-1}:=\frac{1}{N_{2}}\sum_{j=1}^{t-1}P_{\widetilde{X}^{(1)\perp}_{t-1}}\ldots P_{\widetilde{X}^{(1)\perp}_{j+1}}P_{\widetilde{X}^{(1)\perp}_{j}}X^{(1)}_{j}X^{(1)\top}_{j}P_{\widetilde{X}^{(1)\perp}_{j}}P_{\widetilde{X}^{(1)\perp}_{j+1}}\ldots P_{\widetilde{X}^{(1)\perp}_{t-1}}.

Applying LABEL:lemma:cx-sum to (50) and (51), we obtain

ut1cx𝒩(0,St1+σt12I).u_{t-1}\leq_{\mathrm{cx}}\mathcal{N}\Bigl{(}0,S_{t-1}+\sigma_{t-1}^{2}I\Bigr{)}.

Additionally, it follows from LABEL:lemma:cx-afine and LABEL:lemma:cx-independent-sum that

ut1+wtXt(1)cx𝒩(0,St1+σt12I+1N2Xt(1)Xt(1)).u_{t-1}+w_{t}X_{t}^{(1)}\leq_{\mathrm{cx}}\mathcal{N}\Bigl{(}0,S_{t-1}+\sigma_{t-1}^{2}I+\frac{1}{N_{2}}X^{(1)}_{t}X^{(1)\top}_{t}\Bigr{)}.

Since P21\|P\|_{2}\leq 1 for all orthogonal projections PP and aa2=a22\|aa^{\top}\|_{2}=\|a\|_{2}^{2} for all vectors ama\in\mathbb{R}^{m}, we have

St1\displaystyle S_{t-1} St12I\displaystyle\preceq\|S_{t-1}\|_{2}I
1N2j=1t1PX~t1(1)PX~j+1(1)PX~j(1)Xj(1)Xj(1)PX~j(1)PX~j+1(1)PX~t1(1)2I\displaystyle\preceq\frac{1}{N_{2}}\sum_{j=1}^{t-1}\|P_{\widetilde{X}^{(1)\perp}_{t-1}}\ldots P_{\widetilde{X}^{(1)\perp}_{j+1}}P_{\widetilde{X}^{(1)\perp}_{j}}X^{(1)}_{j}X^{(1)\top}_{j}P_{\widetilde{X}^{(1)\perp}_{j}}P_{\widetilde{X}^{(1)\perp}_{j+1}}\ldots P_{\widetilde{X}^{(1)\perp}_{t-1}}\|_{2}I
1N2j=1t1PX~j(1)Xj(1)Xj(1)PX~j(1)2I\displaystyle\preceq\frac{1}{N_{2}}\sum_{j=1}^{t-1}\|P_{\widetilde{X}^{(1)\perp}_{j}}X^{(1)}_{j}X^{(1)\top}_{j}P_{\widetilde{X}^{(1)\perp}_{j}}\|_{2}I
=1N2j=1t1PX~j(1)(Xj(1))22I\displaystyle=\frac{1}{N_{2}}\sum_{j=1}^{t-1}\|P_{\widetilde{X}^{(1)\perp}_{j}}(X^{(1)}_{j})\|_{2}^{2}I
=1N2j=1t1PX~j(1)(Xj(1)X~j(1))22I\displaystyle=\frac{1}{N_{2}}\sum_{j=1}^{t-1}\|P_{\widetilde{X}^{(1)\perp}_{j}}(X^{(1)}_{j}-\widetilde{X}^{(1)}_{j})\|_{2}^{2}I
t1N2max1jN1Xj(1)X~j(1)22I\displaystyle\preceq\frac{t-1}{N_{2}}\max_{1\leq j\leq N_{1}}\|X_{j}^{(1)}-\widetilde{X}^{(1)}_{j}\|_{2}^{2}I
2π(t1)mp(δ(1))2logN1d1N2XF2I\displaystyle\preceq{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}\frac{2\pi(t-1)mp(\delta^{(1)})^{2}\log N_{1}}{d_{1}N_{2}}}\|X\|_{F}^{2}I
(52) =:c2XF2I.\displaystyle=:c_{2}\|X\|_{F}^{2}I.

In the last inequality, we applied (43) and (40). Moreover, using (43), (45), and (40), we get

σt12\displaystyle\sigma_{t-1}^{2} =π2(δ(2))2max1jt1X~j(1)22\displaystyle=\frac{\pi}{2}(\delta^{(2)})^{2}\max_{1\leq j\leq t-1}\|\widetilde{X}^{(1)}_{j}\|_{2}^{2}
π2(δ(2))2(max1jN1Xj(1)X~j(1)2+max1jN1Xj(1)2)2\displaystyle\leq\frac{\pi}{2}(\delta^{(2)})^{2}\Bigl{(}\max_{1\leq j\leq N_{1}}\|X^{(1)}_{j}-\widetilde{X}^{(1)}_{j}\|_{2}+\max_{1\leq j\leq N_{1}}\|X^{(1)}_{j}\|_{2}\Bigr{)}^{2}
π2(δ(2))2(δ(1)ηN02πmplogN1+12N1XF+2plogN1N1X2)2\displaystyle\leq\frac{\pi}{2}(\delta^{(2)})^{2}\Bigl{(}\delta^{(1)}\eta_{N_{0}}\sqrt{2\pi mp\log N_{1}}+\frac{1}{\sqrt{2N_{1}}}\|X\|_{F}+\sqrt{\frac{2p\log N_{1}}{N_{1}}}\|X\|_{2}\Bigr{)}^{2}
3π2(δ(2))2(2π(δ(1)ηN0)2mplogN1+12N1XF2+2plogN1N1X22)\displaystyle\leq\frac{3\pi}{2}(\delta^{(2)})^{2}\Bigl{(}2\pi(\delta^{(1)}\eta_{N_{0}})^{2}mp\log N_{1}+\frac{1}{2N_{1}}\|X\|_{F}^{2}+\frac{2p\log N_{1}}{N_{1}}\|X\|_{2}^{2}\Bigr{)}
3π(δ(2))2(πmp(δ(1))2logN1d1+14N1+plogN1N1d2)XF2\displaystyle\leq{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}3\pi(\delta^{(2)})^{2}\Bigl{(}\frac{\pi mp(\delta^{(1)})^{2}\log N_{1}}{d_{1}}+\frac{1}{4N_{1}}+\frac{p\log N_{1}}{N_{1}d_{2}}\Bigr{)}}\|X\|_{F}^{2}
(53) =:c3XF2.\displaystyle=:c_{3}\|X\|_{F}^{2}.

Combining (52) and (53), we obtain

(54) ut1cx𝒩(0,(c2+c3)XF2I).u_{t-1}\leq_{\mathrm{cx}}\mathcal{N}\Bigl{(}0,\bigl{(}c_{2}+c_{3}\bigr{)}\|X\|_{F}^{2}I\Bigr{)}.

Further, using (45) and (40), we have

1N2Xt(1)Xt(1)\displaystyle\frac{1}{N_{2}}X^{(1)}_{t}X^{(1)\top}_{t} 1N2Xt(1)22I\displaystyle\preceq\frac{1}{N_{2}}\|X_{t}^{(1)}\|_{2}^{2}I
12N1N2(XF+2plogN1X2)2I\displaystyle\preceq\frac{1}{2N_{1}N_{2}}\Bigl{(}\|X\|_{F}+2\sqrt{p\log N_{1}}\|X\|_{2}\Bigr{)}^{2}I
1N1N2(XF2+4plogN1X22)I\displaystyle\preceq\frac{1}{N_{1}N_{2}}\Bigl{(}\|X\|_{F}^{2}+4p\log N_{1}\|X\|_{2}^{2}\Bigr{)}I
1N1N2(1+4plogN1d2)XF2I\displaystyle\preceq{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}\frac{1}{N_{1}N_{2}}\Bigl{(}1+\frac{4p\log N_{1}}{d_{2}}\Bigr{)}}\|X\|_{F}^{2}I
(55) =:c4XF2I.\displaystyle=:c_{4}\|X\|_{F}^{2}I.

Combining (52), (53), and (55), we obtain

ut1+wtXt(1)cx𝒩(0,(c2+c3+c4)XF2I).u_{t-1}+w_{t}X_{t}^{(1)}\leq_{\mathrm{cx}}\mathcal{N}\Bigl{(}0,\bigl{(}c_{2}+c_{3}+c_{4}\bigr{)}\|X\|_{F}^{2}I\Bigr{)}.

Then it follows from (46) that

(56) vt=X~t(1),ut1+wtXt(1)X~t(1)22cx𝒩(0,(c2+c3+c4)XF2X~t(1)22).v_{t}=\frac{\langle\widetilde{X}_{t}^{(1)},u_{t-1}+w_{t}X^{(1)}_{t}\rangle}{\|\widetilde{X}_{t}^{(1)}\|_{2}^{2}}\leq_{\mathrm{cx}}\mathcal{N}\Bigl{(}0,\bigl{(}c_{2}+c_{3}+c_{4}\bigr{)}\frac{\|X\|_{F}^{2}}{\|\widetilde{X}_{t}^{(1)}\|_{2}^{2}}\Bigr{)}.

By (43), (45), and (40), we have

XFX~t(1)2\displaystyle\frac{\|X\|_{F}}{\|\widetilde{X}_{t}^{(1)}\|_{2}} XFXt(1)2Xt(1)X~t(1)2\displaystyle\leq\frac{\|X\|_{F}}{\|X_{t}^{(1)}\|_{2}-\|X^{(1)}_{t}-\widetilde{X}_{t}^{(1)}\|_{2}}
2πN112πplogN1d22πδ(1)mpN1logN1d1\displaystyle\leq{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}\frac{\sqrt{2\pi N_{1}}}{1-2\sqrt{\frac{\pi p\log N_{1}}{d_{2}}}-2\pi\delta^{(1)}\sqrt{\frac{mpN_{1}\log N_{1}}{d_{1}}}}}
=:c5.\displaystyle=:c_{5}.

Plugging the result above into (56), we get

vt\displaystyle v_{t} cx𝒩(0,(c2+c3+c4)c52).\displaystyle\leq_{\mathrm{cx}}\mathcal{N}\Bigl{(}0,(c_{2}+c_{3}+c_{4})c_{5}^{2}\Bigr{)}.

One can deduce from LABEL:lemma:cx-gaussian-tail that

P(|vt|Kδ(2))12exp((Kδ(2))24(c2+c3+c4)c52).\mathrm{P}(|v_{t}|\leq K\delta^{(2)})\geq 1-\sqrt{2}\exp\Bigl{(}-\frac{(K\delta^{(2)})^{2}}{4(c_{2}+c_{3}+c_{4})c_{5}^{2}}\Bigr{)}.

On the event {|vt|Kδ(2)}\{|v_{t}|\leq K\delta^{(2)}\}, we can quantize vtv_{t} as if the quantizer 𝒬StocQ\mathcal{Q}_{\mathrm{StocQ}} were over the infinite alphabet 𝒜δ\mathcal{A}^{\delta}_{\infty}. Thus, conditioning on this event, utcx𝒩(μt,σt2I)u_{t}\leq_{\mathrm{cx}}\mathcal{N}(\mu_{t},\sigma^{2}_{t}I). Hence, conditioning on \mathcal{E}, namely the event that both (43) and (45) hold, the induction steps above indicate that

(57) X(1)wX~(1)q=uN1cx𝒩(0,(c2+c3)XF2I)X^{(1)}w-\widetilde{X}^{(1)}q=u_{N_{1}}\leq_{\mathrm{cx}}\mathcal{N}\Bigl{(}0,(c_{2}+c_{3})\|X\|_{F}^{2}I\Bigr{)}

holds with probability at least

12exp((Kδ(2))24c12)2(N11)exp((Kδ(2))24(c2+c3+c4)c52).1-\sqrt{2}\exp\Bigl{(}-\frac{(K\delta^{(2)})^{2}}{4c_{1}^{2}}\Bigr{)}-\sqrt{2}(N_{1}-1)\exp\Bigl{(}-\frac{(K\delta^{(2)})^{2}}{4(c_{2}+c_{3}+c_{4})c_{5}^{2}}\Bigr{)}.

Step 4: Conditioning on \mathcal{E}, applying LABEL:lemma:cx-gaussian-tail with γ=2mN2p\gamma=\sqrt{2}mN_{2}^{-p}, and taking a union bound over all neurons in W(2)W^{(2)},

(58) max1jN2X(1)Wj(2)X~(1)Qj(2)22(c2+c3)pmlogN2XF\max_{1\leq j\leq N_{2}}\|X^{(1)}W^{(2)}_{j}-\widetilde{X}^{(1)}Q^{(2)}_{j}\|_{2}\leq 2\sqrt{(c_{2}+c_{3})pm\log N_{2}}\|X\|_{F}

holds with probability at least

12mN2p12N2exp((Kδ(2))24c12)2(N11)N2exp((Kδ(2))24(c2+c3+c4)c52).1-\frac{\sqrt{2}m}{N_{2}^{p-1}}-\sqrt{2}N_{2}\exp\Bigl{(}-\frac{(K\delta^{(2)})^{2}}{4c_{1}^{2}}\Bigr{)}-\sqrt{2}(N_{1}-1)N_{2}\exp\Bigl{(}-\frac{(K\delta^{(2)})^{2}}{4(c_{2}+c_{3}+c_{4})c_{5}^{2}}\Bigr{)}.

Taking into account the probability of the event \mathcal{E}, (58) holds unconditionally with probability exceeding

1\displaystyle 1 2mN1p12N1exp((Kδ(1))2N14)2N1t=2N0exp((Kδ(1))24βt2)2N1p1\displaystyle-\frac{\sqrt{2}m}{N_{1}^{p-1}}-\sqrt{2}N_{1}\exp\Bigl{(}-\frac{(K\delta^{(1)})^{2}N_{1}}{4}\Bigr{)}-\sqrt{2}N_{1}\sum_{t=2}^{N_{0}}\exp\Bigl{(}{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}-\frac{(K\delta^{(1)})^{2}}{4\beta_{t}^{2}}}\Bigr{)}-\frac{2}{N_{1}^{p-1}}
2mN2p12N2exp((Kδ(2))24c12)2(N11)N2exp((Kδ(2))24(c2+c3+c4)c52)\displaystyle-\frac{\sqrt{2}m}{N_{2}^{p-1}}-\sqrt{2}N_{2}\exp\Bigl{(}-\frac{(K\delta^{(2)})^{2}}{4c_{1}^{2}}\Bigr{)}-\sqrt{2}(N_{1}-1)N_{2}\exp\Bigl{(}-\frac{(K\delta^{(2)})^{2}}{4(c_{2}+c_{3}+c_{4})c_{5}^{2}}\Bigr{)}

where βt2=1N1+π(δ(1)ηt1)22Xt22\beta_{t}^{2}=\frac{1}{N_{1}}+\frac{\pi(\delta^{(1)}\eta_{t-1})^{2}}{2\|X_{t}\|_{2}^{2}}. Note that, bounding t1t-1 above by N1N_{1},

c2+c3=2πN1mp(δ(1))2logN1d1N2+3π(δ(2))2(πmp(δ(1))2logN1d1+14N1+plogN1N1d2)c_{2}+c_{3}={\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}\frac{2\pi N_{1}mp(\delta^{(1)})^{2}\log N_{1}}{d_{1}N_{2}}}+3\pi(\delta^{(2)})^{2}\Bigl{(}\frac{\pi mp(\delta^{(1)})^{2}\log N_{1}}{d_{1}}+\frac{1}{4N_{1}}+\frac{p\log N_{1}}{N_{1}d_{2}}\Bigr{)}

and

c2+c3+c4\displaystyle c_{2}+c_{3}+c_{4} =2πN1mp(δ(1))2logN1d1N2+3π(δ(2))2(πmp(δ(1))2logN1d1+14N1+plogN1N1d2)\displaystyle={\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}\frac{2\pi N_{1}mp(\delta^{(1)})^{2}\log N_{1}}{d_{1}N_{2}}}+3\pi(\delta^{(2)})^{2}\Bigl{(}\frac{\pi mp(\delta^{(1)})^{2}\log N_{1}}{d_{1}}+\frac{1}{4N_{1}}+\frac{p\log N_{1}}{N_{1}d_{2}}\Bigr{)}
+1N1N2(1+4plogN1d2).\displaystyle+\frac{1}{N_{1}N_{2}}\Bigl{(}1+\frac{4p\log N_{1}}{d_{2}}\Bigr{)}.

[**JZ: Bottleneck terms are highlighted in red.] We can set δ(1)=1mN1\delta^{(1)}=\frac{1}{\sqrt{mN_{1}}}. The other bottleneck term 2πN1mp(δ(1))2logN1d1N2\frac{2\pi N_{1}mp(\delta^{(1)})^{2}\log N_{1}}{d_{1}N_{2}} then requires N0d1pN1logN1N_{0}\geq d_{1}\gtrsim pN_{1}\log N_{1}. Additionally, we choose δ(2)=1N2\delta^{(2)}=\frac{1}{\sqrt{N_{2}}}, d2plogN1d_{2}\gtrsim p\log N_{1}, and Kmlog(N0N1N2)K\gtrsim\sqrt{m\log(N_{0}N_{1}N_{2})}. With these choices, c2+c31N1N2c_{2}+c_{3}\lesssim\frac{1}{N_{1}N_{2}} and (c2+c3+c4)c521N2(c_{2}+c_{3}+c_{4})c_{5}^{2}\lesssim\frac{1}{N_{2}} (a sketch of this bookkeeping is given after the proof). Therefore, we have

max1jN2X(1)Wj(2)X~(1)Qj(2)22pmlogN2N1N2XF\max_{1\leq j\leq N_{2}}\|X^{(1)}W^{(2)}_{j}-\widetilde{X}^{(1)}Q^{(2)}_{j}\|_{2}\leq 2\sqrt{\frac{pm\log N_{2}}{N_{1}N_{2}}}\|X\|_{F}

with high probability. ∎
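
For the reader's convenience, we sketch the bookkeeping behind the parameter choices made at the end of the proof. Constants are not tracked, and we assume the implicit constants in d_{1}\gtrsim pN_{1}\log N_{1} and d_{2}\gtrsim p\log N_{1} are large enough that the denominators defining c_{1} and c_{5} stay bounded away from zero. With \delta^{(1)}=\frac{1}{\sqrt{mN_{1}}} and \delta^{(2)}=\frac{1}{\sqrt{N_{2}}},

\frac{2\pi N_{1}mp(\delta^{(1)})^{2}\log N_{1}}{d_{1}N_{2}}=\frac{2\pi p\log N_{1}}{d_{1}N_{2}}\lesssim\frac{1}{N_{1}N_{2}},\qquad\frac{3\pi^{2}mp(\delta^{(1)})^{2}(\delta^{(2)})^{2}\log N_{1}}{d_{1}}=\frac{3\pi^{2}p\log N_{1}}{N_{1}N_{2}d_{1}}\lesssim\frac{1}{N_{1}^{2}N_{2}},

and

\frac{3\pi(\delta^{(2)})^{2}}{4N_{1}}=\frac{3\pi}{4N_{1}N_{2}},\qquad\frac{3\pi(\delta^{(2)})^{2}p\log N_{1}}{N_{1}d_{2}}=\frac{3\pi p\log N_{1}}{N_{1}N_{2}d_{2}}\lesssim\frac{1}{N_{1}N_{2}},

so that c_{2}+c_{3}\lesssim\frac{1}{N_{1}N_{2}}. Likewise c_{4}\lesssim\frac{1}{N_{1}N_{2}}, while c_{5}^{2}\lesssim N_{1}, whence (c_{2}+c_{3}+c_{4})c_{5}^{2}\lesssim\frac{1}{N_{2}}, as used in Step 4.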

Since Xj(1)21N1XF\|X^{(1)}_{j}\|_{2}\approx\frac{1}{\sqrt{N_{1}}}\|X\|_{F} for all 1jN11\leq j\leq N_{1}, we have X(1)F2XF2\|X^{(1)}\|_{F}^{2}\approx\|X\|_{F}^{2}. It follows that

𝔼X(1)Wj(2)22=1N2X(1)F21N2XF2.\mathbb{E}\|X^{(1)}W^{(2)}_{j}\|_{2}^{2}=\frac{1}{N_{2}}\|X^{(1)}\|_{F}^{2}\approx\frac{1}{N_{2}}\|X\|_{F}^{2}.

Thus, the relative error of quantizing the second layer is given by

X(1)Wj(2)X~(1)Qj(2)2X(1)Wj(2)2pmlogN2N1.\frac{\|X^{(1)}W^{(2)}_{j}-\widetilde{X}^{(1)}Q^{(2)}_{j}\|_{2}}{\|X^{(1)}W^{(2)}_{j}\|_{2}}\approx\sqrt{\frac{pm\log N_{2}}{N_{1}}}.
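
To spell out the arithmetic behind this approximation: dividing the error bound (41) by the typical size \frac{1}{\sqrt{N_{2}}}\|X\|_{F} of \|X^{(1)}W^{(2)}_{j}\|_{2} gives, up to the constant factor,

\frac{\|X^{(1)}W^{(2)}_{j}-\widetilde{X}^{(1)}Q^{(2)}_{j}\|_{2}}{\|X^{(1)}W^{(2)}_{j}\|_{2}}\approx\frac{2\|X\|_{F}\sqrt{\frac{pm\log N_{2}}{N_{1}N_{2}}}}{\frac{1}{\sqrt{N_{2}}}\|X\|_{F}}=2\sqrt{\frac{pm\log N_{2}}{N_{1}}}.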

[**JZ: All results in this section serve as references which will be modified.] Given the result (LABEL:eq:inf-alphabet-tails), one can bound the quantization error for the first layer without approximating μt\mu_{t} because μt=0\mu_{t}=0 in this case. See the following theorem for details.

Theorem 3.4 (Quantization error bound for the first layer).

Suppose we quantize the weights W(1)N0×N1W^{(1)}\in\mathbb{R}^{N_{0}\times N_{1}} in the first layer using (LABEL:eq:quantization-algorithm) with alphabet 𝒜=𝒜δ\mathcal{A}=\mathcal{A}^{\delta}_{\infty}, step size δ>0\delta>0, and input data Xm×N0X\in\mathbb{R}^{m\times N_{0}}.

  (1)

    For every neuron wN0w\in\mathbb{R}^{N_{0}} that is a column of W(1)W^{(1)}, the quantization error satisfies

    (59) XwXq2σN0mlogN0\|Xw-Xq\|_{2}\lesssim\sigma_{N_{0}}\sqrt{m\log N_{0}}

    with probability exceeding 1N021-N_{0}^{-2}.

  (2)

    Let Q(1)𝒜N0×N1Q^{(1)}\in\mathcal{A}^{N_{0}\times N_{1}} denote the quantized output for all neurons. Then

    (60) XW(1)XQ(1)FσN0mN1logN0\|XW^{(1)}-XQ^{(1)}\|_{F}\lesssim\sigma_{N_{0}}\sqrt{mN_{1}\log N_{0}}

    holds with probability at least 1N1N021-\frac{N_{1}}{N_{0}^{2}}.

Here, σN0=δπ2max1jN0Xj2\sigma_{N_{0}}=\delta\sqrt{\frac{\pi}{2}}\max_{1\leq j\leq N_{0}}\|X_{j}\|_{2} is defined by (LABEL:eq:def-sigma).

Proof.

(1) Since we perform parallel and independent quantization for all neurons in the same layer, it suffices to consider w=Wj(1)w=W^{(1)}_{j} for some 1jN11\leq j\leq N_{1}. Additionally, by (LABEL:eq:layer-input), we have X(0)=X~(0)=Xm×N0X^{(0)}=\widetilde{X}^{(0)}=X\in\mathbb{R}^{m\times N_{0}}. According to LABEL:corollary:ut-bound-infinite, quantization of ww guarantees that

utμt2σtlogm+log2logγ\|u_{t}-\mu_{t}\|_{\infty}\leq 2\sigma_{t}\sqrt{\log m+\log\sqrt{2}-\log\gamma}

holds with probability exceeding 1γ1-\gamma. Here, μt=PX~t(0)(μt1+wtXt(0))\mu_{t}=P_{\widetilde{X}_{t}^{(0)\perp}}(\mu_{t-1}+w_{t}X_{t}^{(0)}) with μ0=0\mu_{0}=0, and σt2=πδ22max1jtXj22\sigma_{t}^{2}=\frac{\pi\delta^{2}}{2}\max_{1\leq j\leq t}\|X_{j}\|_{2}^{2}. Since X(0)=X~(0)X^{(0)}=\widetilde{X}^{(0)}, by induction on tt, it is easy to verify that μt=0\mu_{t}=0 for all tt. It follows that

ut2σtlogm+log2logγ\|u_{t}\|_{\infty}\leq 2\sigma_{t}\sqrt{\log m+\log\sqrt{2}-\log\gamma}

holds with probability at least 1γ1-\gamma. In particular, let t=N0mt=N_{0}\gg m and γ=1N02\gamma=\frac{1}{N_{0}^{2}}. Then

XwXq2=uN02muN0σN0mlogN0\|Xw-Xq\|_{2}=\|u_{N_{0}}\|_{2}\leq\sqrt{m}\|u_{N_{0}}\|_{\infty}\lesssim\sigma_{N_{0}}\sqrt{m\log N_{0}}

holds with probability at least 1N021-N_{0}^{-2}.

(2) Note that (59) is valid for every neuron ww. By taking a union bound over all neurons,

XW(1)XQ(1)F2=j=1N1XWj(1)XQj(1)22σN02mN1logN0\|XW^{(1)}-XQ^{(1)}\|_{F}^{2}=\sum_{j=1}^{N_{1}}\|XW^{(1)}_{j}-XQ^{(1)}_{j}\|_{2}^{2}\lesssim\sigma_{N_{0}}^{2}mN_{1}\log N_{0}

holds with probability at least 1N1N021-\frac{N_{1}}{N_{0}^{2}}. ∎
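
As a usage note for the Python sketch given after (46) (and under the same assumptions made there): the setting of this theorem corresponds to running the same loop with the quantized and unquantized data equal, so that \mu_{t}=0, and with K taken so large that the finite alphabet behaves like \mathcal{A}^{\delta}_{\infty}.

    # Hypothetical first-layer call of the earlier quantize_neuron sketch:
    # X is the m x N0 data matrix and w a column of W^(1); a very large K
    # emulates the infinite alphabet A^delta_infty of Theorem 3.4.
    q, u = quantize_neuron(w, X, X, delta, K=10**9)
    # Then ||u||_2 = ||X @ w - X @ q||_2 is exactly the quantity bounded in (59).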

Moreover, we can obtain the relative error of quantization by estimating Xw2\|Xw\|_{2}. For example, assume that neuron wN0w\in\mathbb{R}^{N_{0}} is an isotropic random vector and data Xm×N0X\in\mathbb{R}^{m\times N_{0}} is generic in the sense that XFσN0N0\|X\|_{F}\gtrsim\sigma_{N_{0}}\sqrt{N_{0}}. Then combining the fact that 𝔼Xw22=XF2\mathbb{E}\|Xw\|_{2}^{2}=\|X\|_{F}^{2} with (59), we deduce that the following inequality holds with high probability.

XwXq2Xw2σN0mlogN0XFmlogN0N0.\frac{\|Xw-Xq\|_{2}}{\|Xw\|_{2}}\lesssim\frac{\sigma_{N_{0}}\sqrt{m\log N_{0}}}{\|X\|_{F}}\lesssim\sqrt{\frac{m\log N_{0}}{N_{0}}}.

Further, if all neurons are isotropic, then 𝔼XW(1)F2=j=1N1𝔼XWj(1)22=N1XF2.\mathbb{E}\|XW^{(1)}\|_{F}^{2}=\sum_{j=1}^{N_{1}}\mathbb{E}\|XW^{(1)}_{j}\|_{2}^{2}=N_{1}\|X\|_{F}^{2}. It follows from (60) that

XW(1)XQ(1)FXW(1)FσN0mN1logN0N1XFmlogN0N0.\frac{\|XW^{(1)}-XQ^{(1)}\|_{F}}{\|XW^{(1)}\|_{F}}\lesssim\frac{\sigma_{N_{0}}\sqrt{mN_{1}\log N_{0}}}{\sqrt{N_{1}}\|X\|_{F}}\lesssim\sqrt{\frac{m\log N_{0}}{N_{0}}}.

Note that the error bounds above are identical to the error bounds derived in [zhang2022post] where XX was assumed random and ww was deterministic.

Theorem 3.5.

Let Φ\Phi be a two-layer neural network as in (LABEL:eq:mlp) where L=2L=2 and the activation function is φ(i)(x)=ρ(x):=max{0,x}\varphi^{(i)}(x)=\rho(x):=\max\{0,x\} for all 1iL1\leq i\leq L. Suppose that each W(i)Ni1×NiW^{(i)}\in\mathbb{R}^{N_{i-1}\times N_{i}} has i.i.d. 𝒩(0,1Ni1)\mathcal{N}(0,\frac{1}{N_{i-1}}) entries and {W(i)}i=1L\{W^{(i)}\}_{i=1}^{L} are independent. If we quantize Φ\Phi using (LABEL:eq:quantization-algorithm) with alphabet 𝒜=𝒜δ\mathcal{A}=\mathcal{A}_{\infty}^{\delta}, and input data Xm×N0X\in\mathbb{R}^{m\times N_{0}}, then for 1kN21\leq k\leq N_{2},

X(1)Wk(2)X~(1)Qk(2)2\displaystyle\|X^{(1)}W^{(2)}_{k}-\widetilde{X}^{(1)}Q^{(2)}_{k}\|_{2} (δ2mlogN0logN1+δmlogN0)max1jN0Xj2\displaystyle\lesssim(\delta^{2}m\sqrt{\log N_{0}\log N_{1}}+\delta\sqrt{m}\log N_{0})\max_{1\leq j\leq N_{0}}\|X_{j}\|_{2}
+δmlogN1N0(logN0X2+XF)\displaystyle+\delta\sqrt{\frac{m\log N_{1}}{N_{0}}}\Bigl{(}\sqrt{\log N_{0}}\|X\|_{2}+\|X\|_{F}\Bigr{)}

holds with probability at least 1N1N03N0cN131-\frac{N_{1}}{N_{0}^{3}}-N_{0}^{-c}-N_{1}^{-3}.

Proof.

By quantizing the weights W(1)W^{(1)} of the first layer, one can deduce the following error bound as in Theorem 3.4:

(61) P(XWj(1)XQj(1)2δmlogN0max1jN0Xj2)112N03,1jN1.\mathrm{P}\Bigl{(}\|XW^{(1)}_{j}-XQ^{(1)}_{j}\|_{2}\lesssim\delta\sqrt{m\log N_{0}}\max_{1\leq j\leq N_{0}}\|X_{j}\|_{2}\Bigr{)}\geq 1-\frac{1}{2N_{0}^{3}},\quad 1\leq j\leq N_{1}.

Since Xj(1)=ρ(XWj(1))X^{(1)}_{j}=\rho(XW^{(1)}_{j}), X~j(1)=ρ(XQj(1))\widetilde{X}^{(1)}_{j}=\rho(XQ^{(1)}_{j}), and ρ(x)ρ(y)2xy2\|\rho(x)-\rho(y)\|_{2}\leq\|x-y\|_{2} for any x,ymx,y\in\mathbb{R}^{m}, it follows from (61) that, for 1jN11\leq j\leq N_{1},

(62) Xj(1)X~j(1)2δmlogN0max1jN0Xj2\|X^{(1)}_{j}-\widetilde{X}^{(1)}_{j}\|_{2}\lesssim\delta\sqrt{m\log N_{0}}\max_{1\leq j\leq N_{0}}\|X_{j}\|_{2}

holds with probability at least 112N031-\frac{1}{2N_{0}^{3}}. Additionally, one can show that, for each jj,

(63) Xj(1)2=ρ(XWj(1))21N0(logN0X2+XF)\|X^{(1)}_{j}\|_{2}=\|\rho(XW^{(1)}_{j})\|_{2}\lesssim\frac{1}{\sqrt{N_{0}}}\Bigl{(}\sqrt{\log N_{0}}\|X\|_{2}+\|X\|_{F}\Bigr{)}

holds with probability exceeding 112N031-\frac{1}{2N_{0}^{3}}. [**JZ: Will add a lemma to prove the above inequality later.] Let \mathcal{E} denote the event that (62) and (63) hold uniformly for all jj. By taking a union bound, we have P()1N1N03\mathrm{P}(\mathcal{E})\geq 1-\frac{N_{1}}{N_{0}^{3}}.
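
For completeness, here is one plausible route to (63), patterned after Step 1 in the proof of the preceding theorem; it is only a sketch and is meant to be replaced by the lemma mentioned in the note above. The map z\mapsto\|\rho(Xz)\|_{2} is \|X\|_{2}-Lipschitz and \sqrt{N_{0}}W^{(1)}_{j}\sim\mathcal{N}(0,I), so LABEL:lemma:Lip-concentration with deviation \alpha\asymp\sqrt{\log N_{0}}\|X\|_{2}, combined with \mathbb{E}\|\rho(XW^{(1)}_{j})\|_{2}\leq\sqrt{\mathbb{E}\|\rho(XW^{(1)}_{j})\|_{2}^{2}}=\frac{1}{\sqrt{2N_{0}}}\|X\|_{F}, would give

\|X^{(1)}_{j}\|_{2}=\|\rho(XW^{(1)}_{j})\|_{2}\leq\mathbb{E}\|\rho(XW^{(1)}_{j})\|_{2}+\frac{\alpha}{\sqrt{N_{0}}}\lesssim\frac{1}{\sqrt{N_{0}}}\Bigl{(}\sqrt{\log N_{0}}\|X\|_{2}+\|X\|_{F}\Bigr{)}

with probability at least 1-\frac{1}{2N_{0}^{3}}, which matches (63).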

Next, conditioning on \mathcal{E}, we quantize the second layer W(2)N1×N2W^{(2)}\in\mathbb{R}^{N_{1}\times N_{2}} using (LABEL:eq:quantization-algorithm) with data X(1),X~(1)m×N1X^{(1)},\widetilde{X}^{(1)}\in\mathbb{R}^{m\times N_{1}}. Applying LABEL:corollary:ut-bound-infinite with i=2i=2 and γ=N13\gamma=N_{1}^{-3}, for each neuron Wk(2)N1W^{(2)}_{k}\in\mathbb{R}^{N_{1}} and its quantized counterpart Qk(2)𝒜N1Q^{(2)}_{k}\in\mathcal{A}^{N_{1}}, we have

(64) X(1)Wk(2)X~(1)Qk(2)μk2δmlogN1max1jN1X~j(1)2\|X^{(1)}W^{(2)}_{k}-\widetilde{X}^{(1)}Q^{(2)}_{k}-\mu_{k}\|_{2}\lesssim\delta\sqrt{m\log N_{1}}\max_{1\leq j\leq N_{1}}\|\widetilde{X}^{(1)}_{j}\|_{2}

holds with probability exceeding 1N131-N_{1}^{-3}. Here, according to (LABEL:eq:def-mu), we have

μk:=j=1N1Wk(2)(j)PX~N1(1)PX~j+1(1)PX~j(1)Xj(1).\mu_{k}:=\sum_{j=1}^{N_{1}}W^{(2)}_{k}(j)P_{\widetilde{X}^{(1)\perp}_{N_{1}}}\ldots P_{\widetilde{X}^{(1)\perp}_{j+1}}P_{\widetilde{X}^{(1)\perp}_{j}}X^{(1)}_{j}.

On the event \mathcal{E}, the triangle inequality, (62), and (63) yield

X~j(1)2\displaystyle\|\widetilde{X}^{(1)}_{j}\|_{2} X~j(1)Xj(1)2+Xj(1)2\displaystyle\leq\|\widetilde{X}^{(1)}_{j}-X^{(1)}_{j}\|_{2}+\|X^{(1)}_{j}\|_{2}
δmlogN0max1jN0Xj2+1N0(logN0X2+XF)\displaystyle\lesssim\delta\sqrt{m\log N_{0}}\max_{1\leq j\leq N_{0}}\|X_{j}\|_{2}+\frac{1}{\sqrt{N_{0}}}\Bigl{(}\sqrt{\log N_{0}}\|X\|_{2}+\|X\|_{F}\Bigr{)}

holds uniformly for all 1jN11\leq j\leq N_{1}. Hence, (64) becomes

X(1)Wk(2)X~(1)Qk(2)μk2\displaystyle\|X^{(1)}W^{(2)}_{k}-\widetilde{X}^{(1)}Q^{(2)}_{k}-\mu_{k}\|_{2} δmlogN1max1jN1X~j(1)2\displaystyle\lesssim\delta\sqrt{m\log N_{1}}\max_{1\leq j\leq N_{1}}\|\widetilde{X}^{(1)}_{j}\|_{2}
δ2mlogN0logN1max1jN0Xj2\displaystyle\lesssim\delta^{2}m\sqrt{\log N_{0}\log N_{1}}\max_{1\leq j\leq N_{0}}\|X_{j}\|_{2}
(65) +δmlogN1N0(logN0X2+XF)\displaystyle+\delta\sqrt{\frac{m\log N_{1}}{N_{0}}}\Bigl{(}\sqrt{\log N_{0}}\|X\|_{2}+\|X\|_{F}\Bigr{)}

with probability (conditioning on \mathcal{E}) at least 1N131-N_{1}^{-3}.

Now, to bound the quantization error X(1)Wk(2)X~(1)Qk(2)2\|X^{(1)}W^{(2)}_{k}-\widetilde{X}^{(1)}Q^{(2)}_{k}\|_{2} using triangle inequality, it suffices to control μk2\|\mu_{k}\|_{2}. Since Wk(2)𝒩(0,1N1I)W^{(2)}_{k}\sim\mathcal{N}(0,\frac{1}{N_{1}}I), we get μk𝒩(0,S)\mu_{k}\sim\mathcal{N}(0,S) with SS defined as follows

(66) S:=1N1j=1N1PX~N1(1)PX~j+1(1)PX~j(1)Xj(1)Xj(1)PX~j(1)PX~j+1(1)PX~N1(1)m×m.S:=\frac{1}{N_{1}}\sum_{j=1}^{N_{1}}P_{\widetilde{X}^{(1)\perp}_{N_{1}}}\ldots P_{\widetilde{X}^{(1)\perp}_{j+1}}P_{\widetilde{X}^{(1)\perp}_{j}}X^{(1)}_{j}X^{(1)\top}_{j}P_{\widetilde{X}^{(1)\perp}_{j}}P_{\widetilde{X}^{(1)\perp}_{j+1}}\ldots P_{\widetilde{X}^{(1)\perp}_{N_{1}}}\in\mathbb{R}^{m\times m}.

Then S12μk𝒩(0,I)S^{-\frac{1}{2}}\mu_{k}\sim\mathcal{N}(0,I). Applying the Hanson-Wright inequality, see e.g. [rudelson2013hanson] and [vershynin2018high], we obtain for all α0\alpha\geq 0 that

(67) P(μk2S1/2Fα)1exp(cα2S1/222).\mathrm{P}\biggl{(}\|\mu_{k}\|_{2}-\|S^{1/2}\|_{F}\leq\alpha\biggr{)}\geq 1-\exp\biggl{(}-\frac{c\alpha^{2}}{\|S^{1/2}\|_{2}^{2}}\biggr{)}.

It remains to evaluate S1/2F\|S^{1/2}\|_{F} and S1/22\|S^{1/2}\|_{2}. On the event \mathcal{E},

(68) PX~j(1)(Xj(1))2=PX~j(1)(Xj(1)X~j(1))2Xj(1)X~j(1)2δmlogN0max1jN0Xj2\|P_{\widetilde{X}^{(1)\perp}_{j}}(X^{(1)}_{j})\|_{2}=\|P_{\widetilde{X}^{(1)\perp}_{j}}(X^{(1)}_{j}-\widetilde{X}^{(1)}_{j})\|_{2}\leq\|X^{(1)}_{j}-\widetilde{X}^{(1)}_{j}\|_{2}\lesssim\delta\sqrt{m\log N_{0}}\max_{1\leq j\leq N_{0}}\|X_{j}\|_{2}

holds uniformly for all 1jN11\leq j\leq N_{1}. Moreover, since AF2=tr(AA)\|A\|_{F}^{2}=\operatorname{tr}(A^{\top}A) and tr(AB)=tr(BA)\operatorname{tr}(AB)=\operatorname{tr}(BA), we have

S1/2F2\displaystyle\|S^{1/2}\|_{F}^{2} =tr(S)\displaystyle=\operatorname{tr}(S)
=1N1j=1N1tr(PX~N1(1)PX~j+1(1)PX~j(1)Xj(1)Xj(1)PX~j(1)PX~j+1(1)PX~N1(1))\displaystyle=\frac{1}{N_{1}}\sum_{j=1}^{N_{1}}\operatorname{tr}(P_{\widetilde{X}^{(1)\perp}_{N_{1}}}\ldots P_{\widetilde{X}^{(1)\perp}_{j+1}}P_{\widetilde{X}^{(1)\perp}_{j}}X^{(1)}_{j}X^{(1)\top}_{j}P_{\widetilde{X}^{(1)\perp}_{j}}P_{\widetilde{X}^{(1)\perp}_{j+1}}\ldots P_{\widetilde{X}^{(1)\perp}_{N_{1}}})
=1N1j=1N1tr(Xj(1)PX~j(1)PX~j+1(1)PX~N1(1)PX~j+1(1)PX~j(1)Xj(1))\displaystyle=\frac{1}{N_{1}}\sum_{j=1}^{N_{1}}\operatorname{tr}(X^{(1)\top}_{j}P_{\widetilde{X}^{(1)\perp}_{j}}P_{\widetilde{X}^{(1)\perp}_{j+1}}\ldots P_{\widetilde{X}^{(1)\perp}_{N_{1}}}\ldots P_{\widetilde{X}^{(1)\perp}_{j+1}}P_{\widetilde{X}^{(1)\perp}_{j}}X^{(1)}_{j})
=1N1j=1N1Xj(1)PX~j(1)PX~j+1(1)PX~N1(1)PX~j+1(1)PX~j(1)Xj(1)\displaystyle=\frac{1}{N_{1}}\sum_{j=1}^{N_{1}}X^{(1)\top}_{j}P_{\widetilde{X}^{(1)\perp}_{j}}P_{\widetilde{X}^{(1)\perp}_{j+1}}\ldots P_{\widetilde{X}^{(1)\perp}_{N_{1}}}\ldots P_{\widetilde{X}^{(1)\perp}_{j+1}}P_{\widetilde{X}^{(1)\perp}_{j}}X^{(1)}_{j}
=1N1j=1N1(PX~j(1)Xj(1))PX~j+1(1)PX~N1(1)PX~j+1(1)(PX~j(1)Xj(1))\displaystyle=\frac{1}{N_{1}}\sum_{j=1}^{N_{1}}(P_{\widetilde{X}^{(1)\perp}_{j}}X^{(1)}_{j})^{\top}P_{\widetilde{X}^{(1)\perp}_{j+1}}\ldots P_{\widetilde{X}^{(1)\perp}_{N_{1}}}\ldots P_{\widetilde{X}^{(1)\perp}_{j+1}}(P_{\widetilde{X}^{(1)\perp}_{j}}X^{(1)}_{j})
1N1j=1N1PX~j+1(1)PX~N1(1)PX~j+1(1)2PX~j(1)(Xj(1))22\displaystyle\leq\frac{1}{N_{1}}\sum_{j=1}^{N_{1}}\|P_{\widetilde{X}^{(1)\perp}_{j+1}}\ldots P_{\widetilde{X}^{(1)\perp}_{N_{1}}}\ldots P_{\widetilde{X}^{(1)\perp}_{j+1}}\|_{2}\|P_{\widetilde{X}^{(1)\perp}_{j}}(X^{(1)}_{j})\|_{2}^{2}
1N1j=1N1PX~j(1)(Xj(1))22.\displaystyle\leq\frac{1}{N_{1}}\sum_{j=1}^{N_{1}}\|P_{\widetilde{X}^{(1)\perp}_{j}}(X^{(1)}_{j})\|_{2}^{2}.

Here, the first inequality holds because maxx2=1xAx=A2\max_{\|x\|_{2}=1}x^{\top}Ax=\|A\|_{2} for any positive semidefinite matrix AA and the second inequality is due to P21\|P\|_{2}\leq 1 for any orthogonal projection PP. Plugging (68) into the result above, we get

(69) S1/2F2δ2mlogN0max1jN0Xj22.\|S^{1/2}\|_{F}^{2}\lesssim\delta^{2}m\log N_{0}\max_{1\leq j\leq N_{0}}\|X_{j}\|_{2}^{2}.

Further, since P21\|P\|_{2}\leq 1 for all orthogonal projections PP and aa2=a22\|aa^{\top}\|_{2}=\|a\|_{2}^{2} for all vectors ama\in\mathbb{R}^{m}, we obtain

S1/222\displaystyle\|S^{1/2}\|_{2}^{2} =S2\displaystyle=\|S\|_{2}
1N1j=1N1PX~N1(1)PX~j+1(1)PX~j(1)Xj(1)Xj(1)PX~j(1)PX~j+1(1)PX~N1(1)2\displaystyle\leq\frac{1}{N_{1}}\sum_{j=1}^{N_{1}}\|P_{\widetilde{X}^{(1)\perp}_{N_{1}}}\ldots P_{\widetilde{X}^{(1)\perp}_{j+1}}P_{\widetilde{X}^{(1)\perp}_{j}}X^{(1)}_{j}X^{(1)\top}_{j}P_{\widetilde{X}^{(1)\perp}_{j}}P_{\widetilde{X}^{(1)\perp}_{j+1}}\ldots P_{\widetilde{X}^{(1)\perp}_{N_{1}}}\|_{2}
1N1j=1N1PX~j(1)Xj(1)Xj(1)PX~j(1)2\displaystyle\leq\frac{1}{N_{1}}\sum_{j=1}^{N_{1}}\|P_{\widetilde{X}^{(1)\perp}_{j}}X^{(1)}_{j}X^{(1)\top}_{j}P_{\widetilde{X}^{(1)\perp}_{j}}\|_{2}
=1N1j=1N1PX~j(1)(Xj(1))22.\displaystyle=\frac{1}{N_{1}}\sum_{j=1}^{N_{1}}\|P_{\widetilde{X}^{(1)\perp}_{j}}(X^{(1)}_{j})\|_{2}^{2}.

Again, by (68), the inequality above becomes

(70) S1/222δ2mlogN0max1jN0Xj22.\|S^{1/2}\|_{2}^{2}\lesssim\delta^{2}m\log N_{0}\max_{1\leq j\leq N_{0}}\|X_{j}\|_{2}^{2}.

Combining (67), (69), and (70), and choosing α=δmlogN0max1jN0Xj2\alpha=\delta\sqrt{m}\log N_{0}\max_{1\leq j\leq N_{0}}\|X_{j}\|_{2}, so that the exponent in (67) satisfies cα2/S1/222logN0c\alpha^{2}/\|S^{1/2}\|_{2}^{2}\gtrsim\log N_{0}, we get

(71) μk2δmlogN0max1jN0Xj2\|\mu_{k}\|_{2}\lesssim\delta\sqrt{m}\log N_{0}\max_{1\leq j\leq N_{0}}\|X_{j}\|_{2}

with probability (conditioning on \mathcal{E}) at least 1N0c1-N_{0}^{-c}. It follows from (65), (71), and P()1N1N03\mathrm{P}(\mathcal{E})\geq 1-\frac{N_{1}}{N_{0}^{3}} that

X(1)Wk(2)X~(1)Qk(2)2\displaystyle\|X^{(1)}W^{(2)}_{k}-\widetilde{X}^{(1)}Q^{(2)}_{k}\|_{2} (δ2mlogN0logN1+δmlogN0)max1jN0Xj2\displaystyle\lesssim(\delta^{2}m\sqrt{\log N_{0}\log N_{1}}+\delta\sqrt{m}\log N_{0})\max_{1\leq j\leq N_{0}}\|X_{j}\|_{2}
+δmlogN1N0(logN0X2+XF)\displaystyle+\delta\sqrt{\frac{m\log N_{1}}{N_{0}}}\Bigl{(}\sqrt{\log N_{0}}\|X\|_{2}+\|X\|_{F}\Bigr{)}

holds with probability at least 1N1N03N0cN131-\frac{N_{1}}{N_{0}^{3}}-N_{0}^{-c}-N_{1}^{-3}. ∎