
Archived Results

1. Projections

Lemma 1.1.

Let $X^{(i-1)}$, $\widetilde{X}^{(i-1)}$ be as in (LABEL:eq:layer-input) and let $w\in\mathbb{R}^{N_{i-1}}$ be a neuron in the $i$-th layer. Applying the data alignment procedure in (LABEL:eq:quantization-algorithm-step1), for $t=1,2,\ldots,N_{i-1}$, we have

\hat{u}_{t}=\sum_{j=1}^{t}w_{j}P_{\widetilde{X}_{t}^{(i-1)\perp}}\ldots P_{\widetilde{X}_{j+1}^{(i-1)\perp}}P_{\widetilde{X}_{j}^{(i-1)\perp}}(X^{(i-1)}_{j}).

Moreover, if $d:=\lfloor\frac{N_{i-1}}{m}\rfloor\in\mathbb{N}$ and $A_{j}^{(i-1)}:=P_{\widetilde{X}_{(j+1)m}^{(i-1)\perp}}\ldots P_{\widetilde{X}_{jm+2}^{(i-1)\perp}}P_{\widetilde{X}_{jm+1}^{(i-1)\perp}}$ for $1\leq j\leq d-1$, then

(1) \|\hat{u}_{N_{i-1}}\|_{2}\leq m\|w\|_{\infty}\Bigl(2+\sum_{k=1}^{d-1}\|A^{(i-1)}_{d-1}\ldots A^{(i-1)}_{k+1}A^{(i-1)}_{k}\|_{2}\Bigr)\max_{1\leq j\leq N_{i-1}}\|X^{(i-1)}_{j}-\widetilde{X}_{j}^{(i-1)}\|_{2}.
Proof.

We prove the claim by induction on $t$. By (LABEL:eq:quantization-algorithm-step1), the case $t=1$ is straightforward, since we have

\hat{u}_{1} = w_{1}X^{(i-1)}_{1}-\widetilde{w}_{1}\widetilde{X}^{(i-1)}_{1}
= w_{1}X^{(i-1)}_{1}-\frac{\langle\widetilde{X}_{1}^{(i-1)},w_{1}X_{1}^{(i-1)}\rangle}{\|\widetilde{X}^{(i-1)}_{1}\|_{2}^{2}}\widetilde{X}^{(i-1)}_{1}
= w_{1}X^{(i-1)}_{1}-P_{\widetilde{X}_{1}^{(i-1)}}(w_{1}X^{(i-1)}_{1})
= w_{1}P_{\widetilde{X}_{1}^{(i-1)\perp}}(X^{(i-1)}_{1}),

where we apply the properties of orthogonal projections in (LABEL:eq:orth-proj) and (LABEL:eq:orth-proj-mat). For $t\geq 2$, assume that

\hat{u}_{t-1}=\sum_{j=1}^{t-1}w_{j}P_{\widetilde{X}_{t-1}^{(i-1)\perp}}\ldots P_{\widetilde{X}_{j+1}^{(i-1)\perp}}P_{\widetilde{X}_{j}^{(i-1)\perp}}(X^{(i-1)}_{j}).

Then, by (LABEL:eq:quantization-algorithm-step1), one gets

\hat{u}_{t} = \hat{u}_{t-1}+w_{t}X^{(i-1)}_{t}-\widetilde{w}_{t}\widetilde{X}^{(i-1)}_{t}
= \hat{u}_{t-1}+w_{t}X^{(i-1)}_{t}-\frac{\langle\widetilde{X}_{t}^{(i-1)},\hat{u}_{t-1}+w_{t}X_{t}^{(i-1)}\rangle}{\|\widetilde{X}^{(i-1)}_{t}\|_{2}^{2}}\widetilde{X}^{(i-1)}_{t}
= \hat{u}_{t-1}+w_{t}X^{(i-1)}_{t}-P_{\widetilde{X}_{t}^{(i-1)}}(\hat{u}_{t-1}+w_{t}X^{(i-1)}_{t})
= P_{\widetilde{X}_{t}^{(i-1)\perp}}(\hat{u}_{t-1}+w_{t}X^{(i-1)}_{t}).

Applying the induction hypothesis, we obtain

\hat{u}_{t} = P_{\widetilde{X}_{t}^{(i-1)\perp}}(\hat{u}_{t-1})+w_{t}P_{\widetilde{X}_{t}^{(i-1)\perp}}(X^{(i-1)}_{t})
= \sum_{j=1}^{t-1}w_{j}P_{\widetilde{X}_{t}^{(i-1)\perp}}\ldots P_{\widetilde{X}_{j+1}^{(i-1)\perp}}P_{\widetilde{X}_{j}^{(i-1)\perp}}(X^{(i-1)}_{j})+w_{t}P_{\widetilde{X}_{t}^{(i-1)\perp}}(X^{(i-1)}_{t})
= \sum_{j=1}^{t}w_{j}P_{\widetilde{X}_{t}^{(i-1)\perp}}\ldots P_{\widetilde{X}_{j+1}^{(i-1)\perp}}P_{\widetilde{X}_{j}^{(i-1)\perp}}(X^{(i-1)}_{j}).

This completes the induction. In particular, taking $t=N_{i-1}$, we have

\hat{u}_{N_{i-1}}=\sum_{j=1}^{N_{i-1}}w_{j}P_{\widetilde{X}_{N_{i-1}}^{(i-1)\perp}}\ldots P_{\widetilde{X}_{j+1}^{(i-1)\perp}}P_{\widetilde{X}_{j}^{(i-1)\perp}}(X^{(i-1)}_{j}).

It follows from the triangle inequality and the definition of $A_{j}^{(i-1)}$ that

\|\hat{u}_{N_{i-1}}\|_{2} = \Bigl\|\sum_{j=1}^{N_{i-1}}w_{j}P_{\widetilde{X}_{N_{i-1}}^{(i-1)\perp}}\ldots P_{\widetilde{X}_{j+1}^{(i-1)\perp}}P_{\widetilde{X}_{j}^{(i-1)\perp}}(X^{(i-1)}_{j})\Bigr\|_{2}
\leq \|w\|_{\infty}\sum_{j=1}^{N_{i-1}}\|P_{\widetilde{X}_{N_{i-1}}^{(i-1)\perp}}\ldots P_{\widetilde{X}_{j+1}^{(i-1)\perp}}P_{\widetilde{X}_{j}^{(i-1)\perp}}(X^{(i-1)}_{j})\|_{2}
= \|w\|_{\infty}\sum_{k=1}^{d-1}\sum_{j=(k-1)m+1}^{km}\|P_{\widetilde{X}_{N_{i-1}}^{(i-1)\perp}}\ldots P_{\widetilde{X}_{j+1}^{(i-1)\perp}}P_{\widetilde{X}_{j}^{(i-1)\perp}}(X^{(i-1)}_{j})\|_{2}
+ \|w\|_{\infty}\sum_{j=(d-1)m+1}^{N_{i-1}}\|P_{\widetilde{X}_{N_{i-1}}^{(i-1)\perp}}\ldots P_{\widetilde{X}_{j+1}^{(i-1)\perp}}P_{\widetilde{X}_{j}^{(i-1)\perp}}(X^{(i-1)}_{j})\|_{2}.

Since $\|P\|_{2}=1$ for any nonzero orthogonal projection $P$, $\frac{N_{i-1}}{m}\leq d+1$, and $A_{j}^{(i-1)}=P_{\widetilde{X}_{(j+1)m}^{(i-1)\perp}}\ldots P_{\widetilde{X}_{jm+2}^{(i-1)\perp}}P_{\widetilde{X}_{jm+1}^{(i-1)\perp}}$ for $1\leq j\leq d-1$, we deduce that

\|\hat{u}_{N_{i-1}}\|_{2} \leq \|w\|_{\infty}\sum_{k=1}^{d-1}\sum_{j=(k-1)m+1}^{km}\|A^{(i-1)}_{d-1}\ldots A^{(i-1)}_{k+1}A^{(i-1)}_{k}\|_{2}\|P_{\widetilde{X}_{j}^{(i-1)\perp}}(X^{(i-1)}_{j})\|_{2}
+ \|w\|_{\infty}\sum_{j=(d-1)m+1}^{N_{i-1}}\|P_{\widetilde{X}_{j}^{(i-1)\perp}}(X^{(i-1)}_{j})\|_{2}
\leq m\|w\|_{\infty}\Bigl(\sum_{k=1}^{d-1}\|A^{(i-1)}_{d-1}\ldots A^{(i-1)}_{k+1}A^{(i-1)}_{k}\|_{2}+\frac{N_{i-1}}{m}-(d-1)\Bigr)\max_{1\leq j\leq N_{i-1}}\|P_{\widetilde{X}_{j}^{(i-1)\perp}}(X^{(i-1)}_{j})\|_{2}
\leq m\|w\|_{\infty}\Bigl(2+\sum_{k=1}^{d-1}\|A^{(i-1)}_{d-1}\ldots A^{(i-1)}_{k+1}A^{(i-1)}_{k}\|_{2}\Bigr)\max_{1\leq j\leq N_{i-1}}\|P_{\widetilde{X}_{j}^{(i-1)\perp}}(X^{(i-1)}_{j}-\widetilde{X}_{j}^{(i-1)})\|_{2}
\leq m\|w\|_{\infty}\Bigl(2+\sum_{k=1}^{d-1}\|A^{(i-1)}_{d-1}\ldots A^{(i-1)}_{k+1}A^{(i-1)}_{k}\|_{2}\Bigr)\max_{1\leq j\leq N_{i-1}}\|X^{(i-1)}_{j}-\widetilde{X}_{j}^{(i-1)}\|_{2}. ∎
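For intuition, the following minimal numpy sketch (variable names and problem sizes are illustrative, not from the paper) runs the recursion in (LABEL:eq:quantization-algorithm-step1) with the unquantized aligned weight $\widetilde{w}_{t}=\langle\widetilde{X}_{t}^{(i-1)},\hat{u}_{t-1}+w_{t}X_{t}^{(i-1)}\rangle/\|\widetilde{X}_{t}^{(i-1)}\|_{2}^{2}$, exactly as in the proof above, and checks it against the closed-form expression of Lemma 1.1.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
m, N = 8, 30                                  # illustrative sizes
X = rng.standard_normal((m, N))               # columns X_j^{(i-1)}
Xt = X + 0.01 * rng.standard_normal((m, N))   # perturbed columns ~X_j^{(i-1)}
w = rng.standard_normal(N)                    # neuron weights in layer i

def proj_perp(v, u):
    # orthogonal projection of u onto span(v)^perp
    return u - (v @ u) / (v @ v) * v

# recursion with the (unquantized) aligned weight ~w_t
u_hat = np.zeros(m)
for t in range(N):
    h = u_hat + w[t] * X[:, t]
    w_aligned = (Xt[:, t] @ h) / (Xt[:, t] @ Xt[:, t])
    u_hat = h - w_aligned * Xt[:, t]

# closed form of Lemma 1.1: sum_j w_j P_{~X_N} ... P_{~X_j} (X_j)
u_closed = np.zeros(m)
for j in range(N):
    v = w[j] * X[:, j]
    for t in range(j, N):
        v = proj_perp(Xt[:, t], v)
    u_closed += v

print(np.linalg.norm(u_hat - u_closed))       # agrees up to floating-point error
\end{verbatim}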

2. Minimum $\ell_{\infty}$ solutions for a linear system

Let $A\in\mathbb{R}^{m\times n}$ be a matrix with $\mathrm{rank}(A)=m<n$ and let $b\in\mathbb{R}^{m}$ be a nonzero vector. Then the Rouché-Capelli theorem implies that the linear system $Ax=b$ admits infinitely many solutions. An intriguing problem that has important applications is to find the solutions of $Ax=b$ whose $\ell_{\infty}$ norm is the smallest possible. Specifically, we aim to solve the following primal problem:

(2) \min_{x\in\mathbb{R}^{n}}\ \|x\|_{\infty}\quad\text{s.t.}\quad Ax=b.

Apart from the linear programming formulation [abdelmalek1977minimum], two powerful tools, namely, Cadzow’s method [cadzow1973finite, cadzow1974efficient] and the Shim-Yoon method [shim1998stabilized], are widely used to solve (2). To perform the perturbation analysis, we will focus on Cadzow’s method throughout this section, which applies the duality principle to get

(3) \min_{Ax=b}\|x\|_{\infty}=\max_{\|A^{\top}y\|_{1}=1}b^{\top}y.

Moreover, suppose that $a_{1},a_{2},\ldots,a_{n}\in\mathbb{R}^{m}$ are the column vectors of $A\in\mathbb{R}^{m\times n}$. By (LABEL:eq:orth-proj), we have $a_{j}=P_{b}(a_{j})+P_{b^{\perp}}(a_{j})$ with $P_{b}(a_{j})=\frac{\langle a_{j},b\rangle}{\|b\|_{2}^{2}}b$. Then one can uniquely decompose $A$ as

(4) A=A_{1}+A_{2},

where

A_{1}:=[\xi_{1}b,\xi_{2}b,\ldots,\xi_{n}b]\in\mathbb{R}^{m\times n}\quad\text{with}\quad\xi:=(\xi_{1},\xi_{2},\ldots,\xi_{n})^{\top}=\frac{A^{\top}b}{\|b\|_{2}^{2}},

and

A_{2}:=[P_{b^{\perp}}(a_{1}),P_{b^{\perp}}(a_{2}),\ldots,P_{b^{\perp}}(a_{n})]\in\mathbb{R}^{m\times n}.

According to the transformation technique used in section 2 of [cadzow1974efficient], the dual problem in (3) can be reformulated as

(5) \max_{\|A^{\top}y\|_{1}=1}b^{\top}y=\|b\|_{2}^{2}\biggl(\min_{y\in\mathbb{R}^{m}}\|A_{1}^{\top}b+A_{2}^{\top}y\|_{1}\biggr)^{-1}.

It follows immediately from (3) and (5) that

(6) \|x^{*}\|_{\infty}=\min_{Ax=b}\|x\|_{\infty}=\frac{\|b\|_{2}^{2}}{\|A_{1}^{\top}b+A_{2}^{\top}y^{*}\|_{1}},

where $x^{*}$ and $y^{*}$ are solutions of the primal problem and the dual problem, respectively.
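As a concrete check of the primal problem (2), the sketch below solves it through the standard linear-programming reformulation [abdelmalek1977minimum] with scipy; the data and variable names are illustrative assumptions and this is not a description of Cadzow’s method itself.

\begin{verbatim}
import numpy as np
from scipy.optimize import linprog

# min ||x||_inf  s.t.  Ax = b, rewritten over variables (x, s):
#   minimize s   subject to   Ax = b,   -s <= x_i <= s.
rng = np.random.default_rng(1)
m, n = 3, 6
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

c = np.zeros(n + 1)
c[-1] = 1.0                                        # objective: the bound s
A_eq = np.hstack([A, np.zeros((m, 1))])            # Ax = b
A_ub = np.block([[ np.eye(n), -np.ones((n, 1))],   #  x_i - s <= 0
                 [-np.eye(n), -np.ones((n, 1))]])  # -x_i - s <= 0
b_ub = np.zeros(2 * n)
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b,
              bounds=[(None, None)] * n + [(0, None)])
x_star = res.x[:n]
print(res.x[-1], np.max(np.abs(x_star)))           # optimal value = ||x*||_inf
\end{verbatim}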

Now we evaluate the change of the optimal value of (2) under a small perturbation of $A$. Suppose that $\widetilde{A}:=A+\Delta A$ with $\mathrm{rank}(\widetilde{A})=m$ and $\Delta A:=[\Delta a_{1},\Delta a_{2},\ldots,\Delta a_{n}]\in\mathbb{R}^{m\times n}$. Then $\widetilde{A}=[a_{1}+\Delta a_{1},a_{2}+\Delta a_{2},\ldots,a_{n}+\Delta a_{n}]$. Let $\widetilde{x},\widetilde{y}$ be primal and dual solutions for the perturbed problem $\min_{\widetilde{A}x=b}\|x\|_{\infty}$. Then, similar to (6), we deduce that

(7) \|\widetilde{x}\|_{\infty}=\frac{\|b\|_{2}^{2}}{\|\widetilde{A}_{1}^{\top}b+\widetilde{A}_{2}^{\top}\widetilde{y}\|_{1}}

where

\widetilde{A}_{1}:=A_{1}+\Delta A_{1}\quad\text{with}\quad\Delta A_{1}:=[\zeta_{1}b,\zeta_{2}b,\ldots,\zeta_{n}b]\in\mathbb{R}^{m\times n},\quad\zeta:=(\zeta_{1},\zeta_{2},\ldots,\zeta_{n})^{\top}=\frac{\Delta A^{\top}b}{\|b\|_{2}^{2}},

and

\widetilde{A}_{2}:=A_{2}+\Delta A_{2}\quad\text{with}\quad\Delta A_{2}:=[P_{b^{\perp}}(\Delta a_{1}),P_{b^{\perp}}(\Delta a_{2}),\ldots,P_{b^{\perp}}(\Delta a_{n})]\in\mathbb{R}^{m\times n}.
Lemma 2.1.

Let $\Delta y:=\widetilde{y}-y^{*}$. Suppose that there exist positive constants $c_{1}$, $c_{2}$, and $c_{3}$ such that

(8) \|\Delta y\|_{2}\leq c_{1},\quad\|\widetilde{A}_{1}^{\top}b+\widetilde{A}_{2}^{\top}\widetilde{y}\|_{1}\|A_{1}^{\top}b+A_{2}^{\top}y^{*}\|_{1}\geq c_{2},\quad\text{and}\quad\|\Delta a_{j}\|_{\infty}\leq c_{3}\|a_{j}\|_{\infty}\quad\text{for all }j.

Then we have

\Bigl|\|\widetilde{x}\|_{\infty}-\|x^{*}\|_{\infty}\Bigr|\lesssim\sqrt{m}\sum_{j=1}^{n}\|a_{j}\|_{\infty}.
Proof.

By the triangle inequality, we have

\Bigl|\|\widetilde{A}_{1}^{\top}b+\widetilde{A}_{2}^{\top}\widetilde{y}\|_{1}-\|A_{1}^{\top}b+A_{2}^{\top}y^{*}\|_{1}\Bigr| \leq \|(\widetilde{A}_{1}-A_{1})^{\top}b+\widetilde{A}_{2}^{\top}\widetilde{y}-A_{2}^{\top}y^{*}\|_{1}
= \|\Delta A_{1}^{\top}b+A_{2}^{\top}(y^{*}+\Delta y)+\Delta A_{2}^{\top}\widetilde{y}-A_{2}^{\top}y^{*}\|_{1}
= \|\Delta A_{1}^{\top}b+A_{2}^{\top}\Delta y+\Delta A_{2}^{\top}\widetilde{y}\|_{1}
(9) \leq \|\Delta A_{1}^{\top}b\|_{1}+\|A_{2}^{\top}\Delta y\|_{1}+\|\Delta A_{2}^{\top}\widetilde{y}\|_{1}.

Applying Hölder’s inequality and the fact that $\|P_{b^{\perp}}(x)\|_{2}\leq\|x\|_{2}$ holds for all $x\in\mathbb{R}^{m}$, we get

\|\Delta A_{1}^{\top}b\|_{1}=\|b\|_{2}^{2}\|\zeta\|_{1}=\|\Delta A^{\top}b\|_{1}=\sum_{j=1}^{n}|\langle\Delta a_{j},b\rangle|\leq\|b\|_{1}\sum_{j=1}^{n}\|\Delta a_{j}\|_{\infty},
\|A_{2}^{\top}\Delta y\|_{1}=\sum_{j=1}^{n}|\langle P_{b^{\perp}}(a_{j}),\Delta y\rangle|\leq\|\Delta y\|_{2}\sum_{j=1}^{n}\|a_{j}\|_{2}\leq\sqrt{m}\|\Delta y\|_{2}\sum_{j=1}^{n}\|a_{j}\|_{\infty},

and

\|\Delta A_{2}^{\top}\widetilde{y}\|_{1}=\sum_{j=1}^{n}|\langle P_{b^{\perp}}(\Delta a_{j}),\widetilde{y}\rangle|\leq\|\widetilde{y}\|_{2}\sum_{j=1}^{n}\|\Delta a_{j}\|_{2}\leq\sqrt{m}\|\widetilde{y}\|_{2}\sum_{j=1}^{n}\|\Delta a_{j}\|_{\infty}.

Plugging the three inequalities above into (9) and applying (8), we obtain

\Bigl|\|\widetilde{A}_{1}^{\top}b+\widetilde{A}_{2}^{\top}\widetilde{y}\|_{1}-\|A_{1}^{\top}b+A_{2}^{\top}y^{*}\|_{1}\Bigr| \leq (\|b\|_{1}+\sqrt{m}\|\widetilde{y}\|_{2})\sum_{j=1}^{n}\|\Delta a_{j}\|_{\infty}+\sqrt{m}\|\Delta y\|_{2}\sum_{j=1}^{n}\|a_{j}\|_{\infty}
(10) \leq \Bigl(c_{3}\|b\|_{1}+c_{3}\sqrt{m}\|\widetilde{y}\|_{2}+c_{1}\sqrt{m}\Bigr)\sum_{j=1}^{n}\|a_{j}\|_{\infty}.

It follows from (6) and (7) that

\Bigl|\|\widetilde{x}\|_{\infty}-\|x^{*}\|_{\infty}\Bigr| = \Biggl|\frac{\|b\|_{2}^{2}}{\|\widetilde{A}_{1}^{\top}b+\widetilde{A}_{2}^{\top}\widetilde{y}\|_{1}}-\frac{\|b\|_{2}^{2}}{\|A_{1}^{\top}b+A_{2}^{\top}y^{*}\|_{1}}\Biggr|
= \frac{\|b\|_{2}^{2}}{\|\widetilde{A}_{1}^{\top}b+\widetilde{A}_{2}^{\top}\widetilde{y}\|_{1}\|A_{1}^{\top}b+A_{2}^{\top}y^{*}\|_{1}}\Bigl|\|\widetilde{A}_{1}^{\top}b+\widetilde{A}_{2}^{\top}\widetilde{y}\|_{1}-\|A_{1}^{\top}b+A_{2}^{\top}y^{*}\|_{1}\Bigr|
\leq \frac{\|b\|_{2}^{2}}{c_{2}}\Bigl(c_{3}\|b\|_{1}+c_{3}\sqrt{m}\|\widetilde{y}\|_{2}+c_{1}\sqrt{m}\Bigr)\sum_{j=1}^{n}\|a_{j}\|_{\infty}
\lesssim \sqrt{m}\sum_{j=1}^{n}\|a_{j}\|_{\infty}.

In the first inequality above, we used (8) and (10). ∎

In general, to evaluate the error bounds for the $i$-th layer, we need to approximate $\|\mu_{t}\|_{2}$ by considering the small distance $\|X_{t}^{(i-1)}-\widetilde{X}_{t}^{(i-1)}\|_{2}$ and the effect of consecutive orthogonal projections onto $\widetilde{X}_{t}^{(i-1)\perp}$. Note that $X_{t}^{(i-1)}=\varphi^{(i-1)}(X^{(i-2)}W^{(i-1)}_{t})$ and $\widetilde{X}_{t}^{(i-1)}=\varphi^{(i-1)}(\widetilde{X}^{(i-2)}Q^{(i-1)}_{t})$, where the $t$-th neuron of the $(i-1)$-th layer, denoted by $W_{t}^{(i-1)}\in\mathbb{R}^{N_{i-2}}$, is quantized as $Q_{t}^{(i-1)}\in\mathcal{A}^{N_{i-2}}$. Since all neurons are quantized separately using a stochastic approach with independent random variables, $\{\widetilde{X}_{t}^{(i-1)}-X_{t}^{(i-1)}\}_{t=1}^{N_{i-1}}$ are independent.

Corollary 2.2.

Let $X^{(i-1)}$, $\widetilde{X}^{(i-1)}$ be as in (LABEL:eq:layer-input) such that, for $1\leq t\leq N_{i-1}$, the input discrepancy defined by $\Delta_{t}:=\widetilde{X}_{t}^{(i-1)}-X_{t}^{(i-1)}$ satisfies $\Delta_{t}\leq_{\mathrm{cx}}\mathcal{N}(0,\alpha^{2}I)$, where $\alpha>0$ is a constant. Suppose that $\Delta_{1},\Delta_{2},\ldots,\Delta_{N_{i-1}}$ are independent. Let $w\in\mathbb{R}^{N_{i-1}}$ be the weights associated with a neuron in the $i$-th layer. Quantizing $w$ using (LABEL:eq:quantization-expression) over alphabets $\mathcal{A}=\mathcal{A}_{\infty}^{\delta}$ with step size $\delta>0$,

\|u_{N_{i-1}}\|_{2}\lesssim(\alpha\|w\|_{2}+\sigma_{N_{i-1}})\sqrt{m\log N_{i-1}}

holds with probability exceeding $1-\frac{2}{N_{i-1}}$.

Proof.

Recall that

\mu_{t}=P_{\widetilde{X}_{t}^{(i-1)\perp}}(\mu_{t-1}+w_{t}X_{t}^{(i-1)})\quad\text{with}\quad\mu_{0}=0.

Since $\Delta_{t}:=\widetilde{X}_{t}^{(i-1)}-X_{t}^{(i-1)}$ and $P_{\widetilde{X}_{t}^{(i-1)\perp}}(\widetilde{X}_{t}^{(i-1)})=0$, we have

(11) \mu_{t}=P_{\widetilde{X}_{t}^{(i-1)\perp}}(\mu_{t-1}+w_{t}X_{t}^{(i-1)})=P_{\widetilde{X}_{t}^{(i-1)\perp}}(\mu_{t-1})-w_{t}P_{\widetilde{X}_{t}^{(i-1)\perp}}(\Delta_{t}).

Since $\Delta_{1}\leq_{\mathrm{cx}}\mathcal{N}(0,\alpha^{2}I)$ and $\mu_{0}=0$, we get $\mu_{1}=-w_{1}P_{\widetilde{X}_{1}^{(i-1)\perp}}(\Delta_{1})\leq_{\mathrm{cx}}\mathcal{N}(0,\alpha^{2}w_{1}^{2}P_{\widetilde{X}_{1}^{(i-1)\perp}})$, where we used LABEL:lemma:cx-afine to get the inequality. Assume that the following inequality holds:

\mu_{t-1}\leq_{\mathrm{cx}}\mathcal{N}\biggl(0,\alpha^{2}\sum_{j=1}^{t-1}w_{j}^{2}P_{\widetilde{X}_{t-1}^{(i-1)\perp}}P_{\widetilde{X}_{t-2}^{(i-1)\perp}}\ldots P_{\widetilde{X}_{j}^{(i-1)\perp}}\ldots P_{\widetilde{X}_{t-2}^{(i-1)\perp}}P_{\widetilde{X}_{t-1}^{(i-1)\perp}}\biggr).

Then applying LABEL:lemma:cx-afine and LABEL:lemma:cx-independent-sum to (11) yields

\mu_{t} = P_{\widetilde{X}_{t}^{(i-1)\perp}}(\mu_{t-1})-w_{t}P_{\widetilde{X}_{t}^{(i-1)\perp}}(\Delta_{t})
\leq_{\mathrm{cx}} \mathcal{N}\biggl(0,\alpha^{2}\sum_{j=1}^{t-1}w_{j}^{2}P_{\widetilde{X}_{t}^{(i-1)\perp}}P_{\widetilde{X}_{t-1}^{(i-1)\perp}}\ldots P_{\widetilde{X}_{j}^{(i-1)\perp}}\ldots P_{\widetilde{X}_{t-1}^{(i-1)\perp}}P_{\widetilde{X}_{t}^{(i-1)\perp}}\biggr)+\mathcal{N}\biggl(0,\alpha^{2}w_{t}^{2}P_{\widetilde{X}_{t}^{(i-1)\perp}}\biggr)
= \mathcal{N}\biggl(0,\alpha^{2}\sum_{j=1}^{t}w_{j}^{2}P_{\widetilde{X}_{t}^{(i-1)\perp}}P_{\widetilde{X}_{t-1}^{(i-1)\perp}}\ldots P_{\widetilde{X}_{j}^{(i-1)\perp}}\ldots P_{\widetilde{X}_{t-1}^{(i-1)\perp}}P_{\widetilde{X}_{t}^{(i-1)\perp}}\biggr).

Hence, by induction, we have proved that, for $1\leq t\leq N_{i-1}$,

\mu_{t}\leq_{\mathrm{cx}}\mathcal{N}\biggl(0,\alpha^{2}\sum_{j=1}^{t}w_{j}^{2}P_{\widetilde{X}_{t}^{(i-1)\perp}}P_{\widetilde{X}_{t-1}^{(i-1)\perp}}\ldots P_{\widetilde{X}_{j}^{(i-1)\perp}}\ldots P_{\widetilde{X}_{t-1}^{(i-1)\perp}}P_{\widetilde{X}_{t}^{(i-1)\perp}}\biggr).

Since $P_{\widetilde{X}_{t}^{(i-1)\perp}}P_{\widetilde{X}_{t-1}^{(i-1)\perp}}\ldots P_{\widetilde{X}_{j}^{(i-1)\perp}}\ldots P_{\widetilde{X}_{t-1}^{(i-1)\perp}}P_{\widetilde{X}_{t}^{(i-1)\perp}}\preceq I$ for all $j$, by LABEL:lemma:cx-normal, we have $\mu_{t}\leq_{\mathrm{cx}}\mathcal{N}(0,\alpha^{2}(\sum_{j=1}^{t}w_{j}^{2})I)$. In particular, we get $\mu_{N_{i-1}}\leq_{\mathrm{cx}}\mathcal{N}(0,\alpha^{2}\|w\|_{2}^{2}I)$. It follows from LABEL:lemma:cx-gaussian-tail that

\|\mu_{N_{i-1}}\|_{2}\leq\sqrt{m}\|\mu_{N_{i-1}}\|_{\infty}\lesssim\alpha\|w\|_{2}\sqrt{m\log N_{i-1}}

holds with probability at least $1-\frac{1}{N_{i-1}}$. Additionally, (LABEL:eq:inf-alphabet-tails) implies that with probability exceeding $1-\frac{1}{N_{i-1}}$, we have

\|u_{N_{i-1}}-\mu_{N_{i-1}}\|_{2}\lesssim\sigma_{N_{i-1}}\sqrt{m\log N_{i-1}}

where $\sigma_{N_{i-1}}=\delta\sqrt{\frac{\pi}{2}}\max_{1\leq j\leq N_{i-1}}\|\widetilde{X}_{j}^{(i-1)}\|_{2}$. Thus the union bound yields

\|u_{N_{i-1}}\|_{2}\lesssim(\alpha\|w\|_{2}+\sigma_{N_{i-1}})\sqrt{m\log N_{i-1}}

with probability at least $1-\frac{2}{N_{i-1}}$. ∎

[**JZ: Will consider the special case for orthogonal $\widetilde{X}^{(i-1)}_{t}$.] [**JZ: Previous loose bound: Suppose that $\|\Delta_{t}\|_{2}\leq\alpha\|X_{t}^{(i-1)}\|_{2}$ with $\alpha\in(0,1)$. If $t=1$, then

\|\mu_{1}\|_{2}=\|w_{1}P_{\widetilde{X}_{1}^{(i-1)\perp}}(\Delta_{1}+\widetilde{X}_{1}^{(i-1)})\|_{2}=\|w_{1}P_{\widetilde{X}_{1}^{(i-1)\perp}}\Delta_{1}\|_{2}\leq|w_{1}|\|\Delta_{1}\|_{2}\leq\alpha|w_{1}|\|X_{1}^{(i-1)}\|_{2}.

Assume that $\|\mu_{t-1}\|_{2}\leq\alpha\sum_{j=1}^{t-1}|w_{j}|\|X_{j}^{(i-1)}\|_{2}$. Using the triangle inequality and the induction hypothesis, one can get

\|\mu_{t}\|_{2} = \|P_{\widetilde{X}_{t}^{(i-1)\perp}}(\mu_{t-1}+w_{t}X_{t}^{(i-1)})\|_{2}
\leq \|P_{\widetilde{X}_{t}^{(i-1)\perp}}\mu_{t-1}\|_{2}+\|w_{t}P_{\widetilde{X}_{t}^{(i-1)\perp}}X_{t}^{(i-1)}\|_{2}
\leq \|\mu_{t-1}\|_{2}+\|w_{t}P_{\widetilde{X}_{t}^{(i-1)\perp}}\Delta_{t}\|_{2}
\leq \alpha\sum_{j=1}^{t-1}|w_{j}|\|X_{j}^{(i-1)}\|_{2}+\alpha|w_{t}|\|X_{t}^{(i-1)}\|_{2}
= \alpha\sum_{j=1}^{t}|w_{j}|\|X_{j}^{(i-1)}\|_{2}.

Therefore, by induction, we have

(12) \|\mu_{t}\|_{2}\leq\alpha\sum_{j=1}^{t}|w_{j}|\|X_{j}^{(i-1)}\|_{2}\leq\alpha t\|w\|_{\infty}\max_{1\leq j\leq t}\|X_{j}^{(i-1)}\|_{2}.

]

Recall from (LABEL:eq:ht), (LABEL:eq:ut-bound-infinite-eq2), and (LABEL:eq:ut-bound-infinite-eq4) that

u_{t}=P_{\widetilde{X}_{t}^{(i-1)\perp}}(h_{t})+(v_{t}-q_{t})\widetilde{X}_{t}^{(i-1)}

where

h_{t}=u_{t-1}+w_{t}X^{(i-1)}_{t},\quad v_{t}=\frac{\langle\widetilde{X}_{t}^{(i-1)},h_{t}\rangle}{\|\widetilde{X}_{t}^{(i-1)}\|_{2}^{2}},\quad\text{and}\quad q_{t}=\mathcal{Q}_{\mathrm{StocQ}}(v_{t}).

Let $\Delta_{t}=\widetilde{X}_{t}^{(i-1)}-X_{t}^{(i-1)}$ for all $t$. Then we obtain

\|P_{\widetilde{X}_{t}^{(i-1)\perp}}(h_{t})\|_{2}^{2} = \|P_{\widetilde{X}_{t}^{(i-1)\perp}}(u_{t-1})+w_{t}P_{\widetilde{X}_{t}^{(i-1)\perp}}(X_{t}^{(i-1)})\|_{2}^{2}
= \|P_{\widetilde{X}_{t}^{(i-1)\perp}}(u_{t-1})-w_{t}P_{\widetilde{X}_{t}^{(i-1)\perp}}(\Delta_{t})\|_{2}^{2}
= \|P_{\widetilde{X}_{t}^{(i-1)\perp}}(u_{t-1})\|_{2}^{2}+w_{t}^{2}\|P_{\widetilde{X}_{t}^{(i-1)\perp}}(\Delta_{t})\|_{2}^{2}-2w_{t}\langle P_{\widetilde{X}_{t}^{(i-1)\perp}}(u_{t-1}),P_{\widetilde{X}_{t}^{(i-1)\perp}}(\Delta_{t})\rangle.

It follows that

\|u_{t}\|_{2}^{2}-\|u_{t-1}\|^{2}_{2} = \|P_{\widetilde{X}_{t}^{(i-1)\perp}}(h_{t})+(v_{t}-q_{t})\widetilde{X}_{t}^{(i-1)}\|_{2}^{2}-\|u_{t-1}\|_{2}^{2}
= \|P_{\widetilde{X}_{t}^{(i-1)\perp}}(h_{t})\|_{2}^{2}+(v_{t}-q_{t})^{2}\|\widetilde{X}_{t}^{(i-1)}\|_{2}^{2}-\|u_{t-1}\|_{2}^{2}
= (v_{t}-q_{t})^{2}\|\widetilde{X}_{t}^{(i-1)}\|_{2}^{2}+\|P_{\widetilde{X}_{t}^{(i-1)\perp}}(u_{t-1})\|_{2}^{2}+w_{t}^{2}\|P_{\widetilde{X}_{t}^{(i-1)\perp}}(\Delta_{t})\|_{2}^{2}
-2w_{t}\langle P_{\widetilde{X}_{t}^{(i-1)\perp}}(u_{t-1}),P_{\widetilde{X}_{t}^{(i-1)\perp}}(\Delta_{t})\rangle-\|u_{t-1}\|_{2}^{2}
(13) = (v_{t}-q_{t})^{2}\|\widetilde{X}_{t}^{(i-1)}\|_{2}^{2}-\|P_{\widetilde{X}_{t}^{(i-1)}}(u_{t-1})\|_{2}^{2}+e_{t}

where $e_{t}:=w_{t}^{2}\|P_{\widetilde{X}_{t}^{(i-1)\perp}}(\Delta_{t})\|_{2}^{2}-2w_{t}\langle P_{\widetilde{X}_{t}^{(i-1)\perp}}(u_{t-1}),P_{\widetilde{X}_{t}^{(i-1)\perp}}(\Delta_{t})\rangle$.

[**JZ: The proof for the general $i$-th layer]

Proof.

In general, to evaluate the error bounds for the $i$-th layer (with $i>1$) using (LABEL:eq:inf-alphabet-tails), we need to approximate $\mu_{t}$ by considering recursive orthogonal projections of $X_{j}^{(i-1)}$ onto $\mathrm{span}(\widetilde{X}_{j}^{(i-1)})^{\perp}$ for $1\leq j\leq t$. Specifically, according to (LABEL:eq:def-mu), we have

\mu_{t}=\sum_{j=1}^{t}w_{j}P_{\widetilde{X}^{(i-1)\perp}_{t}}\ldots P_{\widetilde{X}^{(i-1)\perp}_{j+1}}P_{\widetilde{X}^{(i-1)\perp}_{j}}X^{(i-1)}_{j},\quad 1\leq t\leq N_{i-1}.

To control $\|\mu_{t}\|_{2}$ using concentration inequalities, we impose randomness on the weights. In particular, by assuming $w\sim\mathcal{N}(0,I)$, we get $\mu_{t}\sim\mathcal{N}(0,S_{t})$ with $S_{t}$ defined as follows:

(14) S_{t}:=\sum_{j=1}^{t}P_{\widetilde{X}^{(i-1)\perp}_{t}}\ldots P_{\widetilde{X}^{(i-1)\perp}_{j+1}}P_{\widetilde{X}^{(i-1)\perp}_{j}}X^{(i-1)}_{j}X^{(i-1)\top}_{j}P_{\widetilde{X}^{(i-1)\perp}_{j}}P_{\widetilde{X}^{(i-1)\perp}_{j+1}}\ldots P_{\widetilde{X}^{(i-1)\perp}_{t}}\in\mathbb{R}^{m\times m}.

Then $S_{t}^{-\frac{1}{2}}\mu_{t}\sim\mathcal{N}(0,I)$. Applying the Hanson-Wright inequality, see e.g. [rudelson2013hanson] and [vershynin2018high], we obtain for all $\alpha\geq 0$ that

(15) \mathrm{P}\biggl(\Bigl|\|\mu_{t}\|_{2}-\|S_{t}^{1/2}\|_{F}\Bigr|\leq\alpha\biggr)\geq 1-2\exp\biggl(-\frac{c\alpha^{2}}{\|S_{t}^{1/2}\|_{2}^{2}}\biggr).

It remains to evaluate $\|S_{t}^{1/2}\|_{F}$ and $\|S_{t}^{1/2}\|_{2}$. Note that the error bounds for the $(i-1)$-th layer guarantee [**JZ: We may need to discuss the effect of activation functions. Consider two cases: one for large $\|X\|_{2}$, another for small $\|X\|_{2}$.]

\frac{\|X^{(i-1)}_{j}-\widetilde{X}^{(i-1)}_{j}\|_{2}}{\|X^{(i-1)}_{j}\|_{2}}\lesssim\sqrt{\frac{m\log N_{i-2}}{N_{i-2}}}.

It follows from the inequality above and Lemma 2.5 that

(16) \|P_{\widetilde{X}^{(i-1)\perp}_{j}}(X^{(i-1)}_{j})\|_{2}\leq\|X^{(i-1)}_{j}\|_{2}\sqrt{\frac{m\log N_{i-2}}{2N_{i-2}}\biggl(1+\frac{\|X^{(i-1)}_{j}\|_{2}^{2}}{\|\widetilde{X}^{(i-1)}_{j}\|_{2}^{2}}\biggr)}.

Moreover, since $\|A\|_{F}^{2}=\operatorname{tr}(A^{\top}A)$ and $\operatorname{tr}(AB)=\operatorname{tr}(BA)$, we have

\|S_{t}^{1/2}\|_{F}^{2} = \operatorname{tr}(S_{t})
= \sum_{j=1}^{t}\operatorname{tr}(P_{\widetilde{X}^{(i-1)\perp}_{t}}\ldots P_{\widetilde{X}^{(i-1)\perp}_{j+1}}P_{\widetilde{X}^{(i-1)\perp}_{j}}X^{(i-1)}_{j}X^{(i-1)\top}_{j}P_{\widetilde{X}^{(i-1)\perp}_{j}}P_{\widetilde{X}^{(i-1)\perp}_{j+1}}\ldots P_{\widetilde{X}^{(i-1)\perp}_{t}})
= \sum_{j=1}^{t}\operatorname{tr}(X^{(i-1)\top}_{j}P_{\widetilde{X}^{(i-1)\perp}_{j}}P_{\widetilde{X}^{(i-1)\perp}_{j+1}}\ldots P_{\widetilde{X}^{(i-1)\perp}_{t}}\ldots P_{\widetilde{X}^{(i-1)\perp}_{j+1}}P_{\widetilde{X}^{(i-1)\perp}_{j}}X^{(i-1)}_{j})
= \sum_{j=1}^{t}X^{(i-1)\top}_{j}P_{\widetilde{X}^{(i-1)\perp}_{j}}P_{\widetilde{X}^{(i-1)\perp}_{j+1}}\ldots P_{\widetilde{X}^{(i-1)\perp}_{t}}\ldots P_{\widetilde{X}^{(i-1)\perp}_{j+1}}P_{\widetilde{X}^{(i-1)\perp}_{j}}X^{(i-1)}_{j}
= \sum_{j=1}^{t}(P_{\widetilde{X}^{(i-1)\perp}_{j}}X^{(i-1)}_{j})^{\top}P_{\widetilde{X}^{(i-1)\perp}_{j+1}}\ldots P_{\widetilde{X}^{(i-1)\perp}_{t}}\ldots P_{\widetilde{X}^{(i-1)\perp}_{j+1}}(P_{\widetilde{X}^{(i-1)\perp}_{j}}X^{(i-1)}_{j})
\leq \sum_{j=1}^{t}\|P_{\widetilde{X}^{(i-1)\perp}_{j+1}}\ldots P_{\widetilde{X}^{(i-1)\perp}_{t}}\ldots P_{\widetilde{X}^{(i-1)\perp}_{j+1}}\|_{2}\|P_{\widetilde{X}^{(i-1)\perp}_{j}}(X^{(i-1)}_{j})\|_{2}^{2}
\leq \sum_{j=1}^{t}\|P_{\widetilde{X}^{(i-1)\perp}_{j}}(X^{(i-1)}_{j})\|_{2}^{2}.

Here, the first inequality holds because $\max_{\|x\|_{2}=1}x^{\top}Ax=\|A\|_{2}$ for any positive semidefinite matrix $A$. The second inequality is due to $\|P\|_{2}\leq 1$ for any orthogonal projection $P$. Plugging (16) into the inequality above, we get

(17) \|S_{t}^{1/2}\|_{F}^{2}\leq\sum_{j=1}^{t}\frac{m\log N_{i-2}}{2N_{i-2}}\biggl(1+\frac{\|X^{(i-1)}_{j}\|_{2}^{2}}{\|\widetilde{X}^{(i-1)}_{j}\|_{2}^{2}}\biggr)\|X^{(i-1)}_{j}\|_{2}^{2}.

Next, since $\|P\|_{2}\leq 1$ for all orthogonal projections $P$ and $\|aa^{\top}\|_{2}=\|a\|_{2}^{2}$ for all vectors $a\in\mathbb{R}^{m}$, we obtain

\|S_{t}^{1/2}\|_{2}^{2} = \|S_{t}\|_{2}
\leq \sum_{j=1}^{t}\|P_{\widetilde{X}^{(i-1)\perp}_{t}}\ldots P_{\widetilde{X}^{(i-1)\perp}_{j+1}}P_{\widetilde{X}^{(i-1)\perp}_{j}}X^{(i-1)}_{j}X^{(i-1)\top}_{j}P_{\widetilde{X}^{(i-1)\perp}_{j}}P_{\widetilde{X}^{(i-1)\perp}_{j+1}}\ldots P_{\widetilde{X}^{(i-1)\perp}_{t}}\|_{2}
\leq \sum_{j=1}^{t}\|P_{\widetilde{X}^{(i-1)\perp}_{j}}X^{(i-1)}_{j}X^{(i-1)\top}_{j}P_{\widetilde{X}^{(i-1)\perp}_{j}}\|_{2}
= \sum_{j=1}^{t}\|P_{\widetilde{X}^{(i-1)\perp}_{j}}(X^{(i-1)}_{j})\|_{2}^{2}.

Again, by (16), the inequality above becomes

(18) \|S_{t}^{1/2}\|_{2}^{2}\leq\sum_{j=1}^{t}\frac{m\log N_{i-2}}{2N_{i-2}}\biggl(1+\frac{\|X^{(i-1)}_{j}\|_{2}^{2}}{\|\widetilde{X}^{(i-1)}_{j}\|_{2}^{2}}\biggr)\|X^{(i-1)}_{j}\|_{2}^{2}.
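The two norm bounds on $S_{t}$ are easy to verify numerically. The sketch below (illustrative names and sizes; $\widetilde{X}$ is a small perturbation of $X$) builds $S_{t}$ from (14) and checks that $\operatorname{tr}(S_{t})=\|S_{t}^{1/2}\|_{F}^{2}$ and $\|S_{t}\|_{2}=\|S_{t}^{1/2}\|_{2}^{2}$ are both dominated by $\sum_{j}\|P_{\widetilde{X}_{j}^{\perp}}(X_{j})\|_{2}^{2}$.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
m, t = 6, 10
X = rng.standard_normal((m, t))                # columns X_j^{(i-1)}
Xt = X + 0.05 * rng.standard_normal((m, t))    # columns ~X_j^{(i-1)}

def P_perp(v):
    # matrix of the orthogonal projection onto span(v)^perp
    return np.eye(len(v)) - np.outer(v, v) / (v @ v)

S = np.zeros((m, m))
proj_norm_sq = 0.0
for j in range(t):
    B = np.eye(m)
    for k in range(j, t):
        B = P_perp(Xt[:, k]) @ B               # B = P_t ... P_{j+1} P_j
    v = B @ X[:, j]
    S += np.outer(v, v)                        # j-th summand of S_t in (14)
    proj_norm_sq += np.linalg.norm(P_perp(Xt[:, j]) @ X[:, j]) ** 2

print(np.trace(S), np.linalg.norm(S, 2), proj_norm_sq)
# both tr(S) and ||S||_2 are at most sum_j ||P_{~X_j^perp}(X_j)||^2
\end{verbatim}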

[**JZ: Old results by solving the optimization problem:]

Our strategy is to align $X_{k}^{(i-1)}$ with $\widetilde{X}_{k}^{(i-1)}$ for each $k$, which leads to $\mu_{t}=0$. Specifically, given a neuron $w\in\mathbb{R}^{N_{i-1}}$ in layer $i$, recall that our quantization algorithm generates $q\in\mathcal{A}^{N_{i-1}}$ such that $\widetilde{X}^{(i-1)}q$ can track $X^{(i-1)}w$ in the sense of $\ell_{2}$ distance. If one can find a proper vector $\widetilde{w}\in\mathbb{R}^{N_{i-1}}$ satisfying $X^{(i-1)}w=\widetilde{X}^{(i-1)}\widetilde{w}$, then quantizing the new weights $\widetilde{w}$ using data $\widetilde{X}^{(i-1)}$ amounts to solving for $q\in\mathcal{A}^{N_{i-1}}$ such that

(19) \widetilde{X}^{(i-1)}q\approx\widetilde{X}^{(i-1)}\widetilde{w}=X^{(i-1)}w,

which indeed does not change our initial target. Therefore, it suffices to quantize $\widetilde{w}$ using (LABEL:eq:quantization-expression) in which we set $X^{(i-1)}=\widetilde{X}^{(i-1)}$ and $w=\widetilde{w}$. In this case, the modified iterations of quantization are given by

(20) \begin{cases}u_{0}=0\in\mathbb{R}^{m},\\ q_{t}=\mathcal{Q}_{\mathrm{StocQ}}\Bigl(\frac{\langle\widetilde{X}_{t}^{(i-1)},u_{t-1}+\widetilde{w}_{t}\widetilde{X}_{t}^{(i-1)}\rangle}{\|\widetilde{X}^{(i-1)}_{t}\|_{2}^{2}}\Bigr),\\ u_{t}=u_{t-1}+\widetilde{w}_{t}\widetilde{X}^{(i-1)}_{t}-q_{t}\widetilde{X}^{(i-1)}_{t}.\end{cases}
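The following numpy sketch traces the modified iteration (20). The helper stocq is only an assumed form of the stochastic quantizer $\mathcal{Q}_{\mathrm{StocQ}}$, rounding to one of the two nearest alphabet points with probabilities proportional to the opposite distances (the actual definition is (LABEL:eq:quantization-expression)), and the infinite alphabet $\mathcal{A}_{\infty}^{\delta}$ is truncated for illustration.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(3)

def stocq(v, alphabet, rng):
    # assumed stochastic quantizer: round v to one of its two nearest alphabet
    # points, with probabilities proportional to the opposite distances
    a = np.sort(np.asarray(alphabet))
    v = float(np.clip(v, a[0], a[-1]))
    idx = int(np.searchsorted(a, v))
    lo, hi = a[max(idx - 1, 0)], a[min(idx, len(a) - 1)]
    if hi == lo:
        return lo
    return hi if rng.random() < (v - lo) / (hi - lo) else lo

m, N = 8, 50
Xt = rng.standard_normal((m, N))              # columns ~X_t^{(i-1)}
w_tilde = rng.standard_normal(N)              # aligned weights
delta = 0.1
alphabet = delta * np.arange(-100, 101)       # truncated stand-in for A_inf^delta

u = np.zeros(m)
q = np.zeros(N)
for t in range(N):
    h = u + w_tilde[t] * Xt[:, t]
    v_t = (Xt[:, t] @ h) / (Xt[:, t] @ Xt[:, t])
    q[t] = stocq(v_t, alphabet, rng)
    u = h - q[t] * Xt[:, t]                   # u_t = u_{t-1} + ~w_t ~X_t - q_t ~X_t

print(np.linalg.norm(Xt @ w_tilde - Xt @ q))  # residual ||~X w_tilde - ~X q||_2 = ||u_N||_2
\end{verbatim}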

Moreover, the corresponding error bound for quantizing the $i$-th layer with $i>1$ can be derived as follows.

Corollary 2.3.

Let $X^{(i-1)}$, $\widetilde{X}^{(i-1)}$ be as in (LABEL:eq:layer-input) and suppose that $\mathrm{rank}(\widetilde{X}^{(i-1)})=m$. Let $w\in\mathbb{R}^{N_{i-1}}$ be the weights associated with a neuron in the $i$-th layer and let $\widetilde{w}\in\mathbb{R}^{N_{i-1}}$ be any solution of the linear system $\widetilde{X}^{(i-1)}\widetilde{w}=X^{(i-1)}w$. Quantizing $\widetilde{w}$ using (20) over alphabets $\mathcal{A}=\mathcal{A}_{\infty}^{\delta}$ with step size $\delta>0$, the inequality

\|u_{t}\|_{\infty}\leq 2\sigma_{t}\sqrt{\log(\sqrt{2}m/\gamma)}

holds with probability exceeding $1-\gamma$, where $\sigma_{t}^{2}=\frac{\pi\delta^{2}}{2}\max_{1\leq j\leq t}\|\widetilde{X}_{j}^{(i-1)}\|_{2}^{2}$. In particular,

\|X^{(i-1)}w-\widetilde{X}^{(i-1)}q\|_{\infty}\leq 2\sigma_{N_{i-1}}\sqrt{\log(\sqrt{2}m/\gamma)}

holds with probability at least $1-\gamma$.

Proof.

Because it suffices to use $\widetilde{X}^{(i-1)}$ to quantize $\widetilde{w}$, we have $X^{(i-1)}=\widetilde{X}^{(i-1)}$ in (LABEL:eq:quantization-expression). So LABEL:thm:ut-bound-infinite and LABEL:corollary:ut-bound-infinite still hold. It follows from (LABEL:eq:def-mu) that $\mu_{t}=0$ for $1\leq t\leq N_{i-1}$. Additionally, in this case, (LABEL:eq:inf-alphabet-tails) becomes

\|u_{t}\|_{\infty}\leq 2\sigma_{t}\sqrt{\log(\sqrt{2}m/\gamma)},

which holds with probability at least $1-\gamma$. Further, if $t=N_{i-1}$, then, by (19), we have

u_{N_{i-1}}=\widetilde{X}^{(i-1)}q-\widetilde{X}^{(i-1)}\widetilde{w}=\widetilde{X}^{(i-1)}q-X^{(i-1)}w,

and thus

\|X^{(i-1)}w-\widetilde{X}^{(i-1)}q\|_{\infty}=\|u_{N_{i-1}}\|_{\infty}\leq 2\sigma_{N_{i-1}}\sqrt{\log(\sqrt{2}m/\gamma)}

holds with probability at least $1-\gamma$. ∎

Further, if we assume $w\sim\mathcal{N}(0,I)$, then $\mathbb{E}\|X^{(i-1)}w\|_{2}^{2}=\|X^{(i-1)}\|_{F}^{2}$. As a consequence of the Hanson-Wright inequality, see e.g. [rudelson2013hanson], one can obtain

\mathrm{P}\biggl(\Bigl|\|X^{(i-1)}w\|_{2}-\|X^{(i-1)}\|_{F}\Bigr|\geq t\biggr)\leq 2\exp\Biggl(-\frac{ct^{2}}{\|X^{(i-1)}\|_{2}^{2}}\Biggr)

for all $t\geq 0$. Combining this with Corollary 2.3, we deduce the relative error

\frac{\|X^{(i-1)}w-\widetilde{X}^{(i-1)}q\|_{2}}{\|X^{(i-1)}w\|_{2}}\lesssim\frac{\delta\sqrt{m\log m}\max_{1\leq j\leq N_{i-1}}\|\widetilde{X}_{j}^{(i-1)}\|_{2}}{\|X^{(i-1)}\|_{F}}\approx\delta\sqrt{\frac{m\log m}{N_{i-1}}}.

To quantize a neuron $w\in\mathbb{R}^{N_{i-1}}$ in the $i$-th layer using finite alphabets $\mathcal{A}=\mathcal{A}_{K}^{\delta}$ in (LABEL:eq:alphabet-midtread), one can align the input $X^{(i-1)}$ with its analogue $\widetilde{X}^{(i-1)}$ by solving for $\widetilde{w}$ in $\widetilde{X}^{(i-1)}\widetilde{w}=X^{(i-1)}w$. Then it suffices to quantize $\widetilde{w}$ merely based on the input $\widetilde{X}^{(i-1)}$. However, unlike the case of using infinite alphabets, the choice of the solution $\widetilde{w}$ is not arbitrary, since we need to bound $\|\widetilde{w}\|_{\infty}$ such that $\|\widetilde{w}\|_{\infty}\lesssim K\delta$.

Now we pass to a detailed procedure for finding a proper $\widetilde{w}$, assuming that $\mathrm{rank}(\widetilde{X}^{(i-1)})=m$. We will use $A_{S}$ to denote the submatrix of a matrix $A$ with columns indexed by $S$, and $x_{S}$ to denote the restriction of a vector $x$ to the entries indexed by $S$. By permuting columns if necessary, we can assume $\widetilde{X}^{(i-1)}=[\widetilde{X}^{(i-1)}_{T},\widetilde{X}^{(i-1)}_{T^{c}}]$ with $T=\{1,2,\ldots,m\}$ and $\mathrm{rank}(\widetilde{X}^{(i-1)}_{T})=m$. Additionally, we set

\widetilde{w}=\begin{bmatrix}\widetilde{w}_{T}\\ \widetilde{w}_{T^{c}}\end{bmatrix},\quad w=\begin{bmatrix}w_{T}\\ w_{T^{c}}\end{bmatrix}.

Then the linear system $\widetilde{X}^{(i-1)}\widetilde{w}=X^{(i-1)}w$ is equivalent to

(21) (X^{(i-1)}-\widetilde{X}^{(i-1)})w=\widetilde{X}^{(i-1)}(\widetilde{w}-w)=\widetilde{X}^{(i-1)}_{T}(\widetilde{w}_{T}-w_{T})+\widetilde{X}^{(i-1)}_{T^{c}}(\widetilde{w}_{T^{c}}-w_{T^{c}}).

To simplify the problem, we let $\widetilde{w}_{T^{c}}=w_{T^{c}}$. Then the original linear system (21) becomes

(22) \begin{cases}\widetilde{w}_{T^{c}}=w_{T^{c}},\\ \widetilde{X}^{(i-1)}_{T}(\widetilde{w}_{T}-w_{T})=E^{(i-1)}w\end{cases}

where $E^{(i-1)}:=X^{(i-1)}-\widetilde{X}^{(i-1)}$. Since $\widetilde{X}^{(i-1)}_{T}$ is invertible, the solution $\widetilde{w}$ of (22) is unique. Moreover, we have

(23) \|\widetilde{w}-w\|_{2}=\|\widetilde{w}_{T}-w_{T}\|_{2}\leq\sigma_{m}(\widetilde{X}^{(i-1)}_{T})^{-1}\|E^{(i-1)}w\|_{2}\leq\sqrt{m}\,\sigma_{m}(\widetilde{X}^{(i-1)}_{T})^{-1}\|E^{(i-1)}w\|_{\infty}.

Note that $E^{(i-1)}=X^{(i-1)}-\widetilde{X}^{(i-1)}$ has independent columns. Further, assume that the row vectors $e_{1},e_{2},\ldots,e_{m}$ of $E^{(i-1)}$ are isotropic sub-gaussian vectors with $\max_{1\leq i\leq m}\|e_{i}\|_{\psi_{2}}\leq J(N_{i-1})$. It follows that

(24) \|E^{(i-1)}w\|_{\infty}\lesssim J\sqrt{\log m}\|w\|_{2}

holds with high probability. By the triangle inequality, (23), and (24),

(25) \|\widetilde{w}\|_{\infty}\leq\|\widetilde{w}\|_{2}\leq\|w\|_{2}+\|\widetilde{w}-w\|_{2}\leq\|w\|_{2}\biggl(1+\frac{J\sqrt{m\log m}}{\sigma_{m}(\widetilde{X}^{(i-1)}_{T})}\biggr)

holds with high probability.
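A minimal numpy sketch of this construction of $\widetilde{w}$ (illustrative names and sizes; $T$ is taken to be the first $m$ columns, which we assume form a full-rank block):

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(4)
m, N = 6, 20
X = rng.standard_normal((m, N))                  # X^{(i-1)}
Xt = X + 0.01 * rng.standard_normal((m, N))      # ~X^{(i-1)}
w = rng.standard_normal(N)

T = np.arange(m)                                 # assumed full-rank column block

# Solve (22): keep w on T^c and correct the T-block so that ~X w_tilde = X w
E = X - Xt                                       # E^{(i-1)}
w_tilde = w.copy()
w_tilde[T] = w[T] + np.linalg.solve(Xt[:, T], E @ w)

print(np.linalg.norm(Xt @ w_tilde - X @ w))      # ~0: the alignment constraint holds
print(np.linalg.norm(w_tilde - w))               # small when E and 1/sigma_m(~X_T) are small
\end{verbatim}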

Remark 2.4.

According to [rudelson2009smallest], if $A$ is an $m\times m$ random matrix whose entries are independent copies of a mean zero sub-gaussian random variable $B$ with unit variance, then, for every $\epsilon>0$, we have

\mathrm{P}\Bigl(\sigma_{m}(A)\leq\frac{\epsilon}{\sqrt{m}}\Bigr)\leq C\epsilon+e^{-cm}

where $C,c>0$ depend (polynomially) only on $\|B\|_{\psi_{2}}$.

Lemma 2.5.

Let $\epsilon\in(0,\frac{1}{2}]$ and let $x,y\in\mathbb{R}^{m}$ be nonzero vectors such that

(26) \frac{\|x-y\|_{2}}{\|x\|_{2}}\leq\epsilon.

Then we have

\frac{\|P_{y^{\perp}}(x)\|_{2}}{\|x\|_{2}}\leq\frac{\epsilon\sqrt{10}}{2}

where the orthogonal projection $P_{y^{\perp}}=I-\frac{yy^{\top}}{\|y\|_{2}^{2}}$ is given by (LABEL:eq:orth-proj-mat).

Proof.

(26) implies that

\|x-y\|_{2}^{2}=\|x\|_{2}^{2}+\|y\|_{2}^{2}-2\langle x,y\rangle\leq\epsilon^{2}\|x\|_{2}^{2}.

Then we have

(27) \langle x,y\rangle\geq\frac{1}{2}\Bigl((1-\epsilon^{2})\|x\|_{2}^{2}+\|y\|_{2}^{2}\Bigr).

Let $\alpha:=\frac{\|x\|_{2}}{\|y\|_{2}}$. It follows from (27) and the definition of $\alpha$ that

\|P_{y^{\perp}}(x)\|_{2}^{2} = \|x\|_{2}^{2}-\frac{\langle x,y\rangle^{2}}{\|y\|_{2}^{2}}
\leq \|x\|_{2}^{2}-\frac{1}{4\|y\|_{2}^{2}}\Bigl((1-\epsilon^{2})\|x\|_{2}^{2}+\|y\|_{2}^{2}\Bigr)^{2}
= \frac{1+\epsilon^{2}}{2}\|x\|_{2}^{2}-\frac{(1-\epsilon^{2})^{2}}{4}\frac{\|x\|_{2}^{4}}{\|y\|_{2}^{2}}-\frac{1}{4}\|y\|_{2}^{2}
= \frac{1+\epsilon^{2}}{2}\|x\|_{2}^{2}-\frac{(1-\epsilon^{2})^{2}\alpha^{2}}{4}\|x\|_{2}^{2}-\frac{1}{4\alpha^{2}}\|x\|_{2}^{2}
\leq \|x\|_{2}^{2}\Bigl(\frac{(1+\alpha^{2})\epsilon^{2}+1}{2}-\frac{1}{4}\Bigl(\alpha^{2}+\frac{1}{\alpha^{2}}\Bigr)\Bigr)
\leq \frac{(1+\alpha^{2})\epsilon^{2}}{2}\|x\|_{2}^{2}.

In the second-to-last inequality, we used $(1-\epsilon^{2})^{2}\geq 1-2\epsilon^{2}$; in the last inequality, we used the numerical inequality $\alpha^{2}+\frac{1}{\alpha^{2}}\geq 2$ for all $\alpha>0$. Then the result above becomes

(28) \frac{\|P_{y^{\perp}}(x)\|_{2}}{\|x\|_{2}}\leq\epsilon\sqrt{\frac{1+\alpha^{2}}{2}}=\epsilon\sqrt{\frac{1}{2}\Bigl(1+\frac{\|x\|_{2}^{2}}{\|y\|_{2}^{2}}\Bigr)}.

Moreover, by the triangle inequality and (26), we have

\|x\|_{2}\leq\|x-y\|_{2}+\|y\|_{2}\leq\epsilon\|x\|_{2}+\|y\|_{2}.

This implies that

(29) \frac{\|x\|_{2}}{\|y\|_{2}}\leq\frac{1}{1-\epsilon}\leq 2,

where the last inequality is due to $\epsilon\in(0,\frac{1}{2}]$. Plugging (29) into (28), we obtain

\frac{\|P_{y^{\perp}}(x)\|_{2}}{\|x\|_{2}}\leq\frac{\epsilon\sqrt{10}}{2}. ∎
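A quick numerical sanity check of Lemma 2.5 (an illustrative sketch; names and sizes are arbitrary):

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(5)
m = 10
for _ in range(1000):
    x = rng.standard_normal(m)
    eps = rng.uniform(0.01, 0.5)                  # eps in (0, 1/2]
    d = rng.standard_normal(m)
    d *= rng.uniform(0, 1) * eps * np.linalg.norm(x) / np.linalg.norm(d)
    y = x - d                                     # so ||x - y|| <= eps ||x||
    p = x - (y @ x) / (y @ y) * y                 # P_{y^perp}(x)
    assert np.linalg.norm(p) / np.linalg.norm(x) <= eps * np.sqrt(10) / 2 + 1e-12
print("bound of Lemma 2.5 holds on all sampled pairs")
\end{verbatim}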

3. Archived results

Theorem 3.1.

Let $\Phi$ be an $L$-layer neural network as in (LABEL:eq:mlp) where the activation function is $\varphi^{(i)}(x)=\rho(x):=\max\{0,x\}$ for $1\leq i\leq L$. Suppose that each $W^{(i)}\in\mathbb{R}^{N_{i-1}\times N_{i}}$ has i.i.d. $\mathcal{N}(0,\frac{1}{N_{i-1}})$ entries and $\{W^{(i)}\}_{i=1}^{L}$ are independent. Sample data $X\in\mathbb{R}^{m\times N_{0}}$ and quantize $W^{(i)}$ using (LABEL:eq:quantization-algorithm) with alphabet $\mathcal{A}=\mathcal{A}_{\infty}^{\delta^{(i)}}$ where $\delta^{(i)}\in(0,1]$ and $m\leq N_{i}$ for all $1\leq i\leq L$. Fix $p\in\mathbb{N}$ with $p\geq 3$. Then

\max_{1\leq j\leq N_{L}}\|\Phi(X)_{j}-\widetilde{\Phi}(X)_{j}\|_{2}=\max_{1\leq j\leq N_{L}}\|X^{(L)}_{j}-\widetilde{X}^{(L)}_{j}\|_{2}
(30) \leq (2\pi pm)^{\frac{L}{2}}\Bigl(\prod_{i=1}^{L}\log N_{i-1}\Bigr)^{\frac{1}{2}}\prod_{i=1}^{L}\Bigl(4+\sum_{k=1}^{d_{i}-1}\|A^{(i-1)}_{d_{i}-1}\ldots A^{(i-1)}_{k+1}A^{(i-1)}_{k}\|_{2}\Bigr)\max_{1\leq j\leq N_{0}}\|X_{j}\|_{2}

holds with probability at least $1-\sqrt{2}\sum_{i=1}^{L}\frac{mN_{i}}{N_{i-1}^{p}}-\sqrt{2}\sum_{i=2}^{L}\frac{N_{i}}{N_{i-1}^{p-1}}-\sum_{i=1}^{L-1}\frac{N_{i}}{N_{i-1}^{p}}$. Here, $d_{i}$ and $A^{(i-1)}_{j}$ are defined as in Lemma 1.1.

Proof.

To prove (30), we will proceed inductively over the layer indices $i$. The case $i=1$ is trivial, since the error bound for quantizing $W^{(1)}$ is given by part (a) of LABEL:thm:error-bound-single-layer-infinite. Additionally, by a union bound, the quantization of $W^{(1)}$ yields

\max_{1\leq j\leq N_{1}}\|X^{(1)}_{j}-\widetilde{X}^{(1)}_{j}\|_{2} = \max_{1\leq j\leq N_{1}}\|\rho(XW_{j}^{(1)})-\rho(XQ_{j}^{(1)})\|_{2}
\leq \max_{1\leq j\leq N_{1}}\|XW_{j}^{(1)}-XQ_{j}^{(1)}\|_{2}
(31) \leq \delta^{(1)}\sqrt{2\pi pm\log N_{0}}\max_{1\leq j\leq N_{0}}\|X_{j}\|_{2}

with probability at least $1-\frac{\sqrt{2}mN_{1}}{N_{0}^{p}}$. Note that the function $f(z):=\|\rho(Xz)\|_{2}$ is Lipschitz with Lipschitz constant $L_{f}=\|X\|_{2}$ and that $\sqrt{N_{0}}\|X^{(1)}_{j}\|_{2}=\sqrt{N_{0}}\|\rho(XW^{(1)}_{j})\|_{2}=f(\sqrt{N_{0}}W_{j}^{(1)})$ with $\sqrt{N_{0}}W_{j}^{(1)}\sim\mathcal{N}(0,I)$. Applying LABEL:lemma:Lip-concentration to $f$ with $X=\sqrt{N_{0}}W_{j}^{(1)}$ and $\alpha=\sqrt{2p\log N_{0}}\|X\|_{2}$, we obtain

(32) \mathrm{P}\Bigl(\|X_{j}^{(1)}\|_{2}-\mathbb{E}\|X_{j}^{(1)}\|_{2}\leq\sqrt{\frac{2p\log N_{0}}{N_{0}}}\|X\|_{2}\Bigr)\geq 1-\frac{1}{N_{0}^{p}}.

Using Jensen’s inequality and the identity $\mathbb{E}\|\rho(Xg)\|_{2}^{2}=\frac{1}{2}\|X\|_{F}^{2}$ where $g\sim\mathcal{N}(0,I)$, we have

\mathbb{E}\|X_{j}^{(1)}\|_{2}\leq\sqrt{\mathbb{E}\|X_{j}^{(1)}\|_{2}^{2}}=\sqrt{\mathbb{E}\|\rho(XW^{(1)}_{j})\|_{2}^{2}}=\frac{1}{\sqrt{2N_{0}}}\|X\|_{F}.

Applying the inequalities above to (32) and taking the union bound over all $N_{1}$ neurons in $W^{(1)}$, we obtain that

(33) \max_{1\leq j\leq N_{1}}\|X_{j}^{(1)}\|_{2}\leq\frac{1}{\sqrt{2N_{0}}}\bigl(\|X\|_{F}+2\sqrt{p\log N_{0}}\|X\|_{2}\bigr)\leq\sqrt{\frac{2\pi p\log N_{0}}{N_{0}}}\|X\|_{F}

holds with probability exceeding $1-\frac{N_{1}}{N_{0}^{p}}$.

Now, we consider $i=2$. Let $\mathcal{E}$ be the event that both (31) and (33) hold. Conditioning on $\mathcal{E}$, we quantize $W^{(2)}\in\mathbb{R}^{N_{1}\times N_{2}}$. Since $W^{(2)}_{j}\sim\mathcal{N}(0,\frac{1}{N_{1}}I)$ and $m\leq N_{1}$, LABEL:lemma:cx-gaussian-tail yields

(34) \mathrm{P}\Bigl(\|W^{(2)}_{j}\|_{\infty}\leq\sqrt{\frac{2\pi p\log N_{1}}{m}}\Bigr)\geq\mathrm{P}\Bigl(\|W^{(2)}_{j}\|_{\infty}\leq 2\sqrt{\frac{p\log N_{1}}{N_{1}}}\Bigr)\geq 1-\frac{\sqrt{2}}{N_{1}^{p-1}},\quad 1\leq j\leq N_{2}.

Using LABEL:thm:error-bound-single-layer-infinite with $i=2$, $\delta=\delta^{(2)}$, and applying (34), we have that, conditioning on $\mathcal{E}$,

\|X^{(1)}W^{(2)}_{j}-\widetilde{X}^{(1)}Q^{(2)}_{j}\|_{2}
\leq \biggl(m\|W^{(2)}_{j}\|_{\infty}\Bigl(2+\sum_{k=1}^{d_{2}-1}\|A^{(1)}_{d_{2}-1}\ldots A^{(1)}_{k+1}A^{(1)}_{k}\|_{2}\Bigr)+\delta^{(2)}\sqrt{2\pi pm\log N_{1}}\biggr)\max_{1\leq j\leq N_{1}}\|X^{(1)}_{j}-\widetilde{X}_{j}^{(1)}\|_{2}+\delta^{(2)}\sqrt{2\pi pm\log N_{1}}\max_{1\leq j\leq N_{1}}\|X^{(1)}_{j}\|_{2}
\leq \Bigl(2+\sum_{k=1}^{d_{2}-1}\|A^{(1)}_{d_{2}-1}\ldots A^{(1)}_{k+1}A^{(1)}_{k}\|_{2}+\delta^{(2)}\Bigr)\sqrt{2\pi pm\log N_{1}}\max_{1\leq j\leq N_{1}}\|X^{(1)}_{j}-\widetilde{X}_{j}^{(1)}\|_{2}
(35) +\delta^{(2)}\sqrt{2\pi pm\log N_{1}}\max_{1\leq j\leq N_{1}}\|X^{(1)}_{j}\|_{2}

holds with probability exceeding $1-\sqrt{2}mN_{1}^{-p}-\sqrt{2}N_{1}^{-p+1}$. Moreover, taking a union bound over (31), (33), and (35), we obtain

\max_{1\leq j\leq N_{2}}\|X^{(2)}_{j}-\widetilde{X}^{(2)}_{j}\|_{2}
\leq \max_{1\leq j\leq N_{2}}\|X^{(1)}W^{(2)}_{j}-\widetilde{X}^{(1)}Q^{(2)}_{j}\|_{2}
\leq 2\pi\delta^{(1)}pm\sqrt{\log N_{0}\log N_{1}}\Bigl(2+\sum_{k=1}^{d_{2}-1}\|A^{(1)}_{d_{2}-1}\ldots A^{(1)}_{k+1}A^{(1)}_{k}\|_{2}+\delta^{(2)}\Bigr)\max_{1\leq j\leq N_{0}}\|X_{j}\|_{2}+2\pi\delta^{(2)}p\sqrt{\frac{m\log N_{0}\log N_{1}}{N_{0}}}\|X\|_{F}
\leq 2\pi mp\sqrt{\log N_{0}\log N_{1}}\Bigl(4+\sum_{k=1}^{d_{2}-1}\|A^{(1)}_{d_{2}-1}\ldots A^{(1)}_{k+1}A^{(1)}_{k}\|_{2}\Bigr)\max_{1\leq j\leq N_{0}}\|X_{j}\|_{2}

with probability at least $1-\frac{\sqrt{2}mN_{2}}{N_{1}^{p}}-\frac{\sqrt{2}mN_{1}}{N_{0}^{p}}-\frac{\sqrt{2}N_{2}}{N_{1}^{p-1}}-\frac{N_{1}}{N_{0}^{p}}$. In the last inequality, we used $\|X\|_{F}\leq\sqrt{N_{0}}\max_{1\leq j\leq N_{0}}\|X_{j}\|_{2}$ and the assumption that $\delta^{(i)}\leq 1$.

Finally, (30) follows inductively by using the same proof technique as above. ∎

Now we pass to the case of finite alphabets using a similar proof technique except that weights are assumed to be Gaussian.

Lemma 3.2 (Finite alphabets).

Suppose that the weight matrix $W^{(1)}\in\mathbb{R}^{N_{0}\times N_{1}}$ has i.i.d. $\mathcal{N}(0,\frac{1}{N_{1}})$ entries. If we quantize $W^{(1)}$ using (LABEL:eq:quantization-algorithm) with alphabet $\mathcal{A}=\mathcal{A}^{\delta}_{K}$ defined by (LABEL:eq:alphabet-midtread) and input data $X\in\mathbb{R}^{m\times N_{0}}$, then for every column (neuron) $w\in\mathbb{R}^{N_{0}}$ of $W^{(1)}$,

(36) \|Xw-Xq\|_{2}\leq\delta\eta_{N_{0}}\sqrt{2\pi mp\log N_{1}}

holds with probability at least $1-\frac{\sqrt{2}m}{N_{1}^{p}}-\sqrt{2}\exp\Bigl(-\frac{K^{2}\delta^{2}N_{1}}{4}\Bigr)-\sqrt{2}\sum_{t=2}^{N_{0}}\exp\Bigl(-\frac{K^{2}\delta^{2}}{4\beta_{t}^{2}}\Bigr)$. Here, $\eta_{t}:=\max_{1\leq j\leq t}\|X_{j}\|_{2}$ and $\beta_{t}^{2}:=\frac{1}{N_{1}}+\frac{\pi\delta^{2}\eta_{t-1}^{2}}{2\|X_{t}\|_{2}^{2}}$.

Proof.

Fix a neuron $w=W^{(1)}_{j}\in\mathbb{R}^{N_{0}}$ for some $1\leq j\leq N_{1}$. Additionally, we have $X^{(0)}=\widetilde{X}^{(0)}=X$ in (LABEL:eq:quantization-algorithm) when $i=1$. At the $t$-th iteration of quantizing $w$, similar to (LABEL:eq:error-bound-step2-infinite-eq1), (LABEL:eq:error-bound-step2-infinite-eq3), and (LABEL:eq:error-bound-step2-infinite-eq5), we have

u_{t}=P_{X_{t}^{\perp}}(h_{t})+(v_{t}-q_{t})X_{t}

where

(37) h_{t}=u_{t-1}+w_{t}X_{t},\quad v_{t}=\frac{\langle X_{t},h_{t}\rangle}{\|X_{t}\|_{2}^{2}},\quad\text{and}\quad q_{t}=\mathcal{Q}_{\mathrm{StocQ}}(v_{t}).

To prove (36), we proceed by induction on $t$. If $t=1$, then $h_{1}=w_{1}X_{1}$, $v_{1}=w_{1}$, and $q_{1}=\mathcal{Q}_{\mathrm{StocQ}}(v_{1})$. Since $w_{1}\sim\mathcal{N}(0,\frac{1}{N_{1}})$, LABEL:lemma:cx-gaussian-tail indicates

\mathrm{P}\Bigl(|w_{1}|\leq K\delta\Bigr)\geq 1-\sqrt{2}\exp\Bigl(-\frac{K^{2}\delta^{2}N_{1}}{4}\Bigr).

Conditioning on the event $\{|w_{1}|\leq K\delta\}$, we get $v_{1}=w_{1}\in[-K\delta,K\delta]$. Hence, $|v_{1}-q_{1}|\leq\delta$ and the proof technique used for the case $t=1$ in LABEL:thm:ut-bound-infinite can be applied here. It implies that $u_{1}\leq_{\mathrm{cx}}\mathcal{N}(0,\Sigma_{1})$ with $\Sigma_{1}=\frac{\pi\delta^{2}}{2}X_{1}X_{1}^{\top}$. By LABEL:corollary:ut-bound-infinite, we obtain $u_{1}\leq_{\mathrm{cx}}\mathcal{N}(0,\sigma_{1}^{2}I)$ with $\sigma_{1}^{2}=\frac{\pi}{2}\delta^{2}\eta_{1}^{2}$.

Next, for $t\geq 2$, assume that $u_{t-1}\leq_{\mathrm{cx}}\mathcal{N}(0,\sigma_{t-1}^{2}I)$ holds where $\sigma_{t-1}^{2}=\frac{\pi\delta^{2}}{2}\eta_{t-1}^{2}$. Since $w_{t}\sim\mathcal{N}(0,\frac{1}{N_{1}})$ and $w_{t}$ is independent of $u_{t-1}$, it follows from (37), LABEL:lemma:cx-afine, and LABEL:lemma:cx-independent-sum that

v_{t}=\frac{\langle X_{t},u_{t-1}\rangle}{\|X_{t}\|_{2}^{2}}+w_{t}\leq_{\mathrm{cx}}\mathcal{N}(0,\beta^{2}_{t})

where $\beta^{2}_{t}:=\frac{1}{N_{1}}+\frac{\pi\delta^{2}\eta_{t-1}^{2}}{2\|X_{t}\|_{2}^{2}}$. It follows from LABEL:lemma:cx-gaussian-tail that

\mathrm{P}(|v_{t}|\leq K\delta)\geq 1-\sqrt{2}\exp\Bigl(-\frac{K^{2}\delta^{2}}{4\beta_{t}^{2}}\Bigr).

On the event $\{|v_{t}|\leq K\delta\}$, we can quantize $v_{t}$ as if the quantizer $\mathcal{Q}_{\mathrm{StocQ}}$ were over the infinite alphabet $\mathcal{A}^{\delta}_{\infty}$. So $u_{t}\leq_{\mathrm{cx}}\mathcal{N}(0,\sigma^{2}_{t}I)$ with $\sigma_{t}^{2}=\frac{\pi}{2}\delta^{2}\eta_{t}^{2}$.

Therefore the induction steps above indicate that

(38) \mathrm{P}\Bigl(u_{N_{0}}\leq_{\mathrm{cx}}\mathcal{N}(0,\sigma_{N_{0}}^{2}I)\Bigr)\geq 1-\sqrt{2}\exp\Bigl(-\frac{K^{2}\delta^{2}N_{1}}{4}\Bigr)-\sqrt{2}\sum_{t=2}^{N_{0}}\exp\Bigl(-\frac{K^{2}\delta^{2}}{4\beta_{t}^{2}}\Bigr),

where $\sigma_{N_{0}}^{2}=\frac{\pi}{2}\delta^{2}\eta_{N_{0}}^{2}$. Conditioning on $u_{N_{0}}\leq_{\mathrm{cx}}\mathcal{N}(0,\sigma_{N_{0}}^{2}I)$, LABEL:corollary:ut-bound-infinite leads to

\mathrm{P}\Bigl(\|u_{N_{0}}\|_{\infty}\leq 2\sigma_{N_{0}}\sqrt{\log(\sqrt{2}m/\gamma)}\Bigr)\geq 1-\gamma.

So $\|u_{N_{0}}\|_{\infty}\leq 2\sigma_{N_{0}}\sqrt{\log(\sqrt{2}m/\gamma)}$ holds with probability exceeding

1-\gamma-\sqrt{2}\exp\Bigl(-\frac{K^{2}\delta^{2}N_{1}}{4}\Bigr)-\sqrt{2}\sum_{t=2}^{N_{0}}\exp\Bigl(-\frac{K^{2}\delta^{2}}{4\beta_{t}^{2}}\Bigr).

Setting $\gamma=\sqrt{2}mN_{1}^{-p}$, we obtain

\|Xw-Xq\|_{2}=\|u_{N_{0}}\|_{2}\leq\sqrt{m}\|u_{N_{0}}\|_{\infty}\leq 2\sigma_{N_{0}}\sqrt{m\log N_{1}^{p}}=2\sigma_{N_{0}}\sqrt{mp\log N_{1}}

holds with probability at least $1-\frac{\sqrt{2}m}{N_{1}^{p}}-\sqrt{2}\exp\Bigl(-\frac{K^{2}\delta^{2}N_{1}}{4}\Bigr)-\sqrt{2}\sum_{t=2}^{N_{0}}\exp\Bigl(-\frac{K^{2}\delta^{2}}{4\beta_{t}^{2}}\Bigr)$. This completes the proof. ∎

Theorem 3.3.

Let $\Phi$ be a two-layer neural network as in (LABEL:eq:mlp) where $L=2$ and the activation function is $\varphi^{(i)}(x)=\rho(x):=\max\{0,x\}$ for all $1\leq i\leq L$. Suppose that each $W^{(i)}\in\mathbb{R}^{N_{i-1}\times N_{i}}$ has i.i.d. $\mathcal{N}(0,\frac{1}{N_{i}})$ entries and $\{W^{(i)}\}_{i=1}^{L}$ are independent. Let $m,p\in\mathbb{N}^{+}$ with $p\geq 2$,

(39) \delta^{(1)}=\frac{1}{\sqrt{mN_{1}}},\quad\delta^{(2)}=\frac{1}{\sqrt{N_{2}}},\quad K\gtrsim\sqrt{m\log(N_{0}N_{1}N_{2})}.

Suppose the input data $X\in\mathbb{R}^{m\times N_{0}}$ satisfies

(40) \max_{1\leq j\leq N_{0}}\|X_{j}\|_{2}\lesssim\frac{1}{\sqrt{d_{1}}}\|X\|_{F},\quad\text{and}\quad\|X\|_{2}\lesssim\frac{1}{\sqrt{d_{2}}}\|X\|_{F}

for $pN_{1}\log N_{1}\lesssim d_{1}\leq N_{0}$ and $p\log N_{1}\lesssim d_{2}\leq m$. If we quantize $W^{(i)}$ using (LABEL:eq:quantization-algorithm) with alphabet $\mathcal{A}=\mathcal{A}_{K}^{\delta^{(i)}}$ and data $X\in\mathbb{R}^{m\times N_{0}}$, then

(41) \max_{1\leq j\leq N_{2}}\|X^{(1)}W^{(2)}_{j}-\widetilde{X}^{(1)}Q^{(2)}_{j}\|_{2}\leq 2\|X\|_{F}\sqrt{\frac{pm\log N_{2}}{N_{1}N_{2}}}

holds with high probability.

Proof.

The proof is organized into four steps. In step 1, we will use the randomness of the weights in the first layer, as well as LABEL:thm:error-bound-finite-first-layer, to control the norm of the difference between $X^{(1)}_{j}$ and $\widetilde{X}^{(1)}_{j}$ in (43), as well as the deviation of the norm of $X_{j}^{(1)}$ from its expectation, in (44). By subsequently controlling the expectation, we obtain upper and lower bounds on $\|X_{j}^{(1)}\|_{2}$ that hold with high probability in (45). In step 2, we condition on the event that the above derived bounds hold, and control the magnitude of the quantizer’s argument for quantizing the first weight of the second layer. This results in (48), which in turn leads to (49) showing that $u_{1}$ is dominated in the convex order by an appropriate Gaussian. This forms the base case for an induction argument to control the norm of the error in the second layer. In step 3, we complete the induction argument by dealing with indices $t\geq 2$, resulting in (57) showing that $u_{t}$ is also convexly dominated by an appropriate Gaussian. Finally, in step 4, we use these results to obtain an error bound (41) that holds with high probability.

Step 1: Following , we define $\eta_{0}:=0$ and

(42) \eta_{t}:=\max_{1\leq j\leq t}\|X_{j}\|_{2},\quad\beta_{t}^{2}:=\frac{1}{N_{1}}+\frac{\pi(\delta^{(1)}\eta_{t-1})^{2}}{2\|X_{t}\|_{2}^{2}},\quad 1\leq t\leq N_{0}.

Given step size $\delta^{(1)}$, by LABEL:thm:error-bound-finite-first-layer and a union bound, the quantization of $W^{(1)}$ yields

\max_{1\leq j\leq N_{1}}\|X^{(1)}_{j}-\widetilde{X}^{(1)}_{j}\|_{2} = \max_{1\leq j\leq N_{1}}\|\rho(XW_{j}^{(1)})-\rho(XQ_{j}^{(1)})\|_{2}
\leq \max_{1\leq j\leq N_{1}}\|XW_{j}^{(1)}-XQ_{j}^{(1)}\|_{2}
(43) \leq \eta_{N_{0}}\delta^{(1)}\sqrt{2\pi mp\log N_{1}}

with probability at least

1-\frac{\sqrt{2}m}{N_{1}^{p-1}}-\sqrt{2}N_{1}\exp\Bigl(-\frac{(K\delta^{(1)})^{2}N_{1}}{4}\Bigr)-\sqrt{2}N_{1}\sum_{t=2}^{N_{0}}\exp\Bigl(-\frac{(K\delta^{(1)})^{2}}{4\beta_{t}^{2}}\Bigr).

Note that the function $f(z):=\|\rho(Xz)\|_{2}$ is Lipschitz with Lipschitz constant $L_{f}=\|X\|_{2}$ and that $\sqrt{N_{1}}\|X^{(1)}_{j}\|_{2}=\sqrt{N_{1}}\|\rho(XW^{(1)}_{j})\|_{2}=f(\sqrt{N_{1}}W_{j}^{(1)})$ with $\sqrt{N_{1}}W_{j}^{(1)}\sim\mathcal{N}(0,I)$. Applying LABEL:lemma:Lip-concentration to $f$ with $X=\sqrt{N_{1}}W_{j}^{(1)}$ and $\alpha=\sqrt{2p\log N_{1}}\|X\|_{2}$, we obtain

(44) \mathrm{P}\Bigl(\bigl|\|X_{j}^{(1)}\|_{2}-\mathbb{E}\|X_{j}^{(1)}\|_{2}\bigr|\leq\sqrt{\frac{2p\log N_{1}}{N_{1}}}\|X\|_{2}\Bigr)\geq 1-\frac{2}{N_{1}^{p}}.

Using Jensen’s inequality and the identity $\mathbb{E}\|\rho(Xg)\|_{2}^{2}=\frac{1}{2}\|X\|_{F}^{2}$ where $g\sim\mathcal{N}(0,I)$, we have

\mathbb{E}\|X_{j}^{(1)}\|_{2}\leq\sqrt{\mathbb{E}\|X_{j}^{(1)}\|_{2}^{2}}=\sqrt{\mathbb{E}\|\rho(XW^{(1)}_{j})\|_{2}^{2}}=\frac{1}{\sqrt{2N_{1}}}\|X\|_{F}.

Additionally, LABEL:prop:expect-relu-gaussian implies that

\mathbb{E}\|X_{j}^{(1)}\|_{2}=\mathbb{E}\|\rho(XW^{(1)}_{j})\|_{2}\geq\frac{1}{\sqrt{2\pi N_{1}}}\|X\|_{F}.

Applying the inequalities above to (44) and taking the union bound over all $N_{1}$ neurons in $W^{(1)}$, we obtain that the inequality

(45) \frac{1}{\sqrt{2\pi N_{1}}}\bigl(\|X\|_{F}-2\sqrt{\pi p\log N_{1}}\|X\|_{2}\bigr)\leq\|X_{j}^{(1)}\|_{2}\leq\frac{1}{\sqrt{2N_{1}}}\bigl(\|X\|_{F}+2\sqrt{p\log N_{1}}\|X\|_{2}\bigr)

holds uniformly for all $1\leq j\leq N_{1}$ with probability exceeding $1-\frac{2}{N_{1}^{p-1}}$.

Step 2: Let \mathcal{E} be the event that both (43) and (45) hold. Conditioning on \mathcal{E}, we quantize the second layer W(2)N1×N2W^{(2)}\in\mathbb{R}^{N_{1}\times N_{2}}. Fix a neuron w=Wj(2)N1w=W^{(2)}_{j}\in\mathbb{R}^{N_{1}} for some 1jN21\leq j\leq N_{2}. At the tt-th iteration of quantizing ww, similar to (LABEL:eq:error-bound-step2-infinite-eq1), (LABEL:eq:error-bound-step2-infinite-eq2), and (LABEL:eq:error-bound-step2-infinite-eq3), we have

(46) ut=ut1+wtXt(1)qtX~t(1),vt=X~t(1),ut1+wtXt(1)X~t(1)22,andqt=𝒬StocQ(vt).u_{t}=u_{t-1}+w_{t}X^{(1)}_{t}-q_{t}\widetilde{X}^{(1)}_{t},\quad v_{t}=\frac{\langle\widetilde{X}_{t}^{(1)},u_{t-1}+w_{t}X^{(1)}_{t}\rangle}{\|\widetilde{X}_{t}^{(1)}\|_{2}^{2}},\quad\text{and}\quad q_{t}=\mathcal{Q}_{\mathrm{StocQ}}(v_{t}).
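
To make the recursion in (46) concrete, the following Python sketch spells out one pass of the data-alignment quantization for a single neuron of the second layer. It is an illustration only: the function names stoc_quant and quantize_neuron, the use of NumPy, and the particular form of the stochastic rounding are assumptions made for this sketch, not part of the formal definition of the algorithm in (LABEL:eq:quantization-algorithm).

    import numpy as np

    def stoc_quant(v, delta, K):
        # Unbiased stochastic rounding of the scalar v onto the grid
        # {k * delta : |k| <= K} (an assumed model of the quantizer Q_StocQ).
        v = np.clip(v, -K * delta, K * delta)
        lo = np.floor(v / delta) * delta
        p = (v - lo) / delta                  # probability of rounding up
        return lo + delta * float(np.random.rand() < p)

    def quantize_neuron(w, X1, X1_tilde, delta, K):
        # Quantize one neuron w of the second layer following (46).
        # Columns X1[:, t] and X1_tilde[:, t] play the roles of X_t^{(1)} and
        # tilde{X}_t^{(1)} (assumed nonzero); u carries the running error u_t.
        m, N1 = X1.shape
        u = np.zeros(m)
        q = np.zeros(N1)
        for t in range(N1):
            xt, xt_q = X1[:, t], X1_tilde[:, t]
            v = xt_q @ (u + w[t] * xt) / (xt_q @ xt_q)   # v_t in (46)
            q[t] = stoc_quant(v, delta, K)               # q_t in (46)
            u = u + w[t] * xt - q[t] * xt_q              # u_t in (46)
        # On exit, u equals X1 @ w - X1_tilde @ q, the error bounded in (57).
        return q, u

When |v_{t}|\leq K\delta, the rounding in this sketch is conditionally unbiased, i.e. the conditional expectation of q_{t} equals v_{t}; this is the property exploited by the convex-order comparisons below.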

To prove (41), we proceed by induction on tt. Let t=1t=1. In this case, due to w1𝒩(0,1N2)w_{1}\sim\mathcal{N}(0,\frac{1}{N_{2}}), we have

v1=X~1(1),X1(1)X~1(1)22w1𝒩(0,X~1(1),X1(1)2N2X~1(1)24)v_{1}=\frac{\langle\widetilde{X}_{1}^{(1)},X^{(1)}_{1}\rangle}{\|\widetilde{X}_{1}^{(1)}\|_{2}^{2}}\cdot w_{1}\sim\mathcal{N}\Biggl{(}0,\frac{\langle\widetilde{X}_{1}^{(1)},X^{(1)}_{1}\rangle^{2}}{N_{2}\|\widetilde{X}_{1}^{(1)}\|_{2}^{4}}\Biggr{)}

and q1=𝒬StocQ(v1)q_{1}=\mathcal{Q}_{\mathrm{StocQ}}(v_{1}). Additionally, (43), (45), and (40) imply that

X1(1)2N2X~1(1)2\displaystyle\frac{\|X^{(1)}_{1}\|_{2}}{\sqrt{N_{2}}\|\widetilde{X}_{1}^{(1)}\|_{2}} 1N2X1(1)2X1(1)2X1(1)X~1(1)2\displaystyle\leq\frac{1}{\sqrt{N_{2}}}\cdot\frac{\|X^{(1)}_{1}\|_{2}}{\|X_{1}^{(1)}\|_{2}-\|X^{(1)}_{1}-\widetilde{X}_{1}^{(1)}\|_{2}}
πN2XF+2plogN1X2XF2πplogN1X22πδ(1)ηN0mpN1logN1\displaystyle\leq\sqrt{\frac{\pi}{N_{2}}}\cdot\frac{\|X\|_{F}+2\sqrt{p\log N_{1}}\|X\|_{2}}{\|X\|_{F}-2\sqrt{\pi p\log N_{1}}\|X\|_{2}-2\pi\delta^{(1)}\eta_{N_{0}}\sqrt{mpN_{1}\log N_{1}}}
πN21+2plogN1d212πplogN1d22πδ(1)mpN1logN1d1\displaystyle\leq{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}\sqrt{\frac{\pi}{N_{2}}}\cdot\frac{1+2\sqrt{\frac{p\log N_{1}}{d_{2}}}}{1-2\sqrt{\frac{\pi p\log N_{1}}{d_{2}}}-2\pi\delta^{(1)}\sqrt{\frac{mpN_{1}\log N_{1}}{d_{1}}}}}
(47) =:c1\displaystyle=:c_{1}

Using the Cauchy-Schwarz inequality and (47), we obtain |X~1(1),X1(1)|N2X~1(1)22X1(1)2N2X~1(1)2c1\frac{|\langle\widetilde{X}_{1}^{(1)},X^{(1)}_{1}\rangle|}{\sqrt{N_{2}}\|\widetilde{X}_{1}^{(1)}\|_{2}^{2}}\leq\frac{\|X^{(1)}_{1}\|_{2}}{\sqrt{N_{2}}\|\widetilde{X}_{1}^{(1)}\|_{2}}\leq c_{1}. Thus we can apply LABEL:lemma:cx-normal and deduce v1cx𝒩(0,c12)v_{1}\leq_{\mathrm{cx}}\mathcal{N}(0,c_{1}^{2}). By LABEL:lemma:cx-gaussian-tail, we get

(48) P(|v1|Kδ(2))12exp((Kδ(2))24c12).\mathrm{P}(|v_{1}|\leq K\delta^{(2)})\geq 1-\sqrt{2}\exp\Bigl{(}-\frac{(K\delta^{(2)})^{2}}{4c_{1}^{2}}\Bigr{)}.

Conditioning on the event {|v1|Kδ(2)}\{|v_{1}|\leq K\delta^{(2)}\}, we have |v1q1|δ(2)|v_{1}-q_{1}|\leq\delta^{(2)} and the proof technique used for the case t=1t=1 in LABEL:thm:ut-bound-infinite still works here. LABEL:corollary:ut-bound-infinite implies that, conditioning on w1w_{1}, we have

(49) u1|w1cx𝒩(μ1,σ12I) with μ1=w1PX~1(1)(X1(1)) and σ12=π2(δ(2))2X~1(1)22.u_{1}\,|\,w_{1}\leq_{\mathrm{cx}}\mathcal{N}(\mu_{1},\sigma_{1}^{2}I)\text{\quad with \quad}\mu_{1}=w_{1}P_{\widetilde{X}^{(1)\perp}_{1}}(X_{1}^{(1)})\text{\quad and \quad}\sigma_{1}^{2}=\frac{\pi}{2}(\delta^{(2)})^{2}\|\widetilde{X}_{1}^{(1)}\|_{2}^{2}.

Step 3: Next, for t2t\geq 2, we assume

(50) ut1|{wi}i=1t1cx𝒩(μt1,σt12I)u_{t-1}\,|\,\{w_{i}\}_{i=1}^{t-1}\leq_{\mathrm{cx}}\mathcal{N}(\mu_{t-1},\sigma_{t-1}^{2}I)

where

μt1=j=1t1wjPX~t1(1)PX~j+1(1)PX~j(1)Xj(1),σt12=π2(δ(2))2max1jt1X~j(1)22.\mu_{t-1}=\sum_{j=1}^{t-1}w_{j}P_{\widetilde{X}^{(1)\perp}_{t-1}}\ldots P_{\widetilde{X}^{(1)\perp}_{j+1}}P_{\widetilde{X}^{(1)\perp}_{j}}X^{(1)}_{j},\quad\sigma_{t-1}^{2}=\frac{\pi}{2}(\delta^{(2)})^{2}\max_{1\leq j\leq t-1}\|\widetilde{X}^{(1)}_{j}\|_{2}^{2}.

Note that the randomness in (50) comes from the stochastic quantizer 𝒬StocQ\mathcal{Q}_{\mathrm{StocQ}}. Due to the independence of the weights wj𝒩(0,1N2)w_{j}\sim\mathcal{N}(0,\frac{1}{N_{2}}), we have

(51) μt1𝒩(0,St1)\mu_{t-1}\sim\mathcal{N}(0,S_{t-1})

with

St1:=1N2j=1t1PX~t1(1)PX~j+1(1)PX~j(1)Xj(1)Xj(1)PX~j(1)PX~j+1(1)PX~t1(1).S_{t-1}:=\frac{1}{N_{2}}\sum_{j=1}^{t-1}P_{\widetilde{X}^{(1)\perp}_{t-1}}\ldots P_{\widetilde{X}^{(1)\perp}_{j+1}}P_{\widetilde{X}^{(1)\perp}_{j}}X^{(1)}_{j}X^{(1)\top}_{j}P_{\widetilde{X}^{(1)\perp}_{j}}P_{\widetilde{X}^{(1)\perp}_{j+1}}\ldots P_{\widetilde{X}^{(1)\perp}_{t-1}}.

Applying LABEL:lemma:cx-sum to (50) and (51), we obtain

ut1cx𝒩(0,St1+σt12I).u_{t-1}\leq_{\mathrm{cx}}\mathcal{N}\Bigl{(}0,S_{t-1}+\sigma_{t-1}^{2}I\Bigr{)}.

Additionally, it follows from LABEL:lemma:cx-afine and LABEL:lemma:cx-independent-sum that

ut1+wtXt(1)cx𝒩(0,St1+σt12I+1N2Xt(1)Xt(1)).u_{t-1}+w_{t}X_{t}^{(1)}\leq_{\mathrm{cx}}\mathcal{N}\Bigl{(}0,S_{t-1}+\sigma_{t-1}^{2}I+\frac{1}{N_{2}}X^{(1)}_{t}X^{(1)\top}_{t}\Bigr{)}.

Since P21\|P\|_{2}\leq 1 for all orthogonal projections PP and aa2=a22\|aa^{\top}\|_{2}=\|a\|_{2}^{2} for all vectors ama\in\mathbb{R}^{m}, we have

St1\displaystyle S_{t-1} St12I\displaystyle\preceq\|S_{t-1}\|_{2}I
1N2j=1t1PX~t1(1)PX~j+1(1)PX~j(1)Xj(1)Xj(1)PX~j(1)PX~j+1(1)PX~t1(1)2I\displaystyle\preceq\frac{1}{N_{2}}\sum_{j=1}^{t-1}\|P_{\widetilde{X}^{(1)\perp}_{t-1}}\ldots P_{\widetilde{X}^{(1)\perp}_{j+1}}P_{\widetilde{X}^{(1)\perp}_{j}}X^{(1)}_{j}X^{(1)\top}_{j}P_{\widetilde{X}^{(1)\perp}_{j}}P_{\widetilde{X}^{(1)\perp}_{j+1}}\ldots P_{\widetilde{X}^{(1)\perp}_{t-1}}\|_{2}I
1N2j=1t1PX~j(1)Xj(1)Xj(1)PX~j(1)2I\displaystyle\preceq\frac{1}{N_{2}}\sum_{j=1}^{t-1}\|P_{\widetilde{X}^{(1)\perp}_{j}}X^{(1)}_{j}X^{(1)\top}_{j}P_{\widetilde{X}^{(1)\perp}_{j}}\|_{2}I
=1N2j=1t1PX~j(1)(Xj(1))22I\displaystyle=\frac{1}{N_{2}}\sum_{j=1}^{t-1}\|P_{\widetilde{X}^{(1)\perp}_{j}}(X^{(1)}_{j})\|_{2}^{2}I
=1N2j=1t1PX~j(1)(Xj(1)X~j(1))22I\displaystyle=\frac{1}{N_{2}}\sum_{j=1}^{t-1}\|P_{\widetilde{X}^{(1)\perp}_{j}}(X^{(1)}_{j}-\widetilde{X}^{(1)}_{j})\|_{2}^{2}I
t1N2max1jN1Xj(1)X~j(1)22I\displaystyle\preceq\frac{t-1}{N_{2}}\max_{1\leq j\leq N_{1}}\|X_{j}^{(1)}-\widetilde{X}^{(1)}_{j}\|_{2}^{2}I
2π(t1)mp(δ(1))2logN1d1N2XF2I\displaystyle\preceq{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}\frac{2\pi(t-1)mp(\delta^{(1)})^{2}\log N_{1}}{d_{1}N_{2}}}\|X\|_{F}^{2}I
(52) =:c2XF2I.\displaystyle=:c_{2}\|X\|_{F}^{2}I.

In the last inequality, we applied (43) and (40). Moreover, using (43), (45), and (40), we get

σt12\displaystyle\sigma_{t-1}^{2} =π2(δ(2))2max1jt1X~j(1)22\displaystyle=\frac{\pi}{2}(\delta^{(2)})^{2}\max_{1\leq j\leq t-1}\|\widetilde{X}^{(1)}_{j}\|_{2}^{2}
π2(δ(2))2(max1jN1Xj(1)X~j(1)2+max1jN1Xj(1)2)2\displaystyle\leq\frac{\pi}{2}(\delta^{(2)})^{2}\Bigl{(}\max_{1\leq j\leq N_{1}}\|X^{(1)}_{j}-\widetilde{X}^{(1)}_{j}\|_{2}+\max_{1\leq j\leq N_{1}}\|X^{(1)}_{j}\|_{2}\Bigr{)}^{2}
π2(δ(2))2(δ(1)ηN02πmplogN1+12N1XF+2plogN1N1X2)2\displaystyle\leq\frac{\pi}{2}(\delta^{(2)})^{2}\Bigl{(}\delta^{(1)}\eta_{N_{0}}\sqrt{2\pi mp\log N_{1}}+\frac{1}{\sqrt{2N_{1}}}\|X\|_{F}+\sqrt{\frac{2p\log N_{1}}{N_{1}}}\|X\|_{2}\Bigr{)}^{2}
3π2(δ(2))2(2π(δ(1)ηN0)2mplogN1+12N1XF2+2plogN1N1X22)\displaystyle\leq\frac{3\pi}{2}(\delta^{(2)})^{2}\Bigl{(}2\pi(\delta^{(1)}\eta_{N_{0}})^{2}mp\log N_{1}+\frac{1}{2N_{1}}\|X\|_{F}^{2}+\frac{2p\log N_{1}}{N_{1}}\|X\|_{2}^{2}\Bigr{)}
3π(δ(2))2(πmp(δ(1))2logN1d1+14N1+plogN1N1d2)XF2\displaystyle\leq{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}3\pi(\delta^{(2)})^{2}\Bigl{(}\frac{\pi mp(\delta^{(1)})^{2}\log N_{1}}{d_{1}}+\frac{1}{4N_{1}}+\frac{p\log N_{1}}{N_{1}d_{2}}\Bigr{)}}\|X\|_{F}^{2}
(53) =:c3XF2.\displaystyle=:c_{3}\|X\|_{F}^{2}.

Combining (52) and (53), we obtain

(54) ut1cx𝒩(0,(c2+c3)XF2I).u_{t-1}\leq_{\mathrm{cx}}\mathcal{N}\Bigl{(}0,\bigl{(}c_{2}+c_{3}\bigr{)}\|X\|_{F}^{2}I\Bigr{)}.

Further, using (45) and (40), we have

1N2Xt(1)Xt(1)\displaystyle\frac{1}{N_{2}}X^{(1)}_{t}X^{(1)\top}_{t} 1N2Xt(1)22I\displaystyle\preceq\frac{1}{N_{2}}\|X_{t}^{(1)}\|_{2}^{2}I
12N1N2(XF+2plogN1X2)2I\displaystyle\preceq\frac{1}{2N_{1}N_{2}}\Bigl{(}\|X\|_{F}+2\sqrt{p\log N_{1}}\|X\|_{2}\Bigr{)}^{2}I
1N1N2(XF2+4plogN1X22)I\displaystyle\preceq\frac{1}{N_{1}N_{2}}\Bigl{(}\|X\|_{F}^{2}+4p\log N_{1}\|X\|_{2}^{2}\Bigr{)}I
1N1N2(1+4plogN1d2)XF2I\displaystyle\preceq{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}\frac{1}{N_{1}N_{2}}\Bigl{(}1+\frac{4p\log N_{1}}{d_{2}}\Bigr{)}}\|X\|_{F}^{2}I
(55) =:c4XF2I.\displaystyle=:c_{4}\|X\|_{F}^{2}I.

Combining (52), (53), and (55), we obtain

ut1+wtXt(1)cx𝒩(0,(c2+c3+c4)XF2I).u_{t-1}+w_{t}X_{t}^{(1)}\leq_{\mathrm{cx}}\mathcal{N}\Bigl{(}0,\bigl{(}c_{2}+c_{3}+c_{4}\bigr{)}\|X\|_{F}^{2}I\Bigr{)}.

Then it follows from (46) that

(56) vt=X~t(1),ut1+wtXt(1)X~t(1)22cx𝒩(0,(c2+c3+c4)XF2X~t(1)22).v_{t}=\frac{\langle\widetilde{X}_{t}^{(1)},u_{t-1}+w_{t}X^{(1)}_{t}\rangle}{\|\widetilde{X}_{t}^{(1)}\|_{2}^{2}}\leq_{\mathrm{cx}}\mathcal{N}\Bigl{(}0,\bigl{(}c_{2}+c_{3}+c_{4}\bigr{)}\frac{\|X\|_{F}^{2}}{\|\widetilde{X}_{t}^{(1)}\|_{2}^{2}}\Bigr{)}.

By (43), (45), and (40), we have

XFX~t(1)2\displaystyle\frac{\|X\|_{F}}{\|\widetilde{X}_{t}^{(1)}\|_{2}} XFXt(1)2Xt(1)X~t(1)2\displaystyle\leq\frac{\|X\|_{F}}{\|X_{t}^{(1)}\|_{2}-\|X^{(1)}_{t}-\widetilde{X}_{t}^{(1)}\|_{2}}
2πN112πplogN1d22πδ(1)mpN1logN1d1\displaystyle\leq{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}\frac{\sqrt{2\pi N_{1}}}{1-2\sqrt{\frac{\pi p\log N_{1}}{d_{2}}}-2\pi\delta^{(1)}\sqrt{\frac{mpN_{1}\log N_{1}}{d_{1}}}}}
=:c5.\displaystyle=:c_{5}.

Plugging the result above into (56), we get

vt\displaystyle v_{t} cx𝒩(0,(c2+c3+c4)c52).\displaystyle\leq_{\mathrm{cx}}\mathcal{N}\Bigl{(}0,(c_{2}+c_{3}+c_{4})c_{5}^{2}\Bigr{)}.

One can deduce from LABEL:lemma:cx-gaussian-tail that

P(|vt|Kδ(2))12exp((Kδ(2))24(c2+c3+c4)c52).\mathrm{P}(|v_{t}|\leq K\delta^{(2)})\geq 1-\sqrt{2}\exp\Bigl{(}-\frac{(K\delta^{(2)})^{2}}{4(c_{2}+c_{3}+c_{4})c_{5}^{2}}\Bigr{)}.

On the event {|vt|Kδ(2)}\{|v_{t}|\leq K\delta^{(2)}\}, we can quantize vtv_{t} as if the quantizer 𝒬StocQ\mathcal{Q}_{\mathrm{StocQ}} were over the infinite alphabet 𝒜δ\mathcal{A}^{\delta}_{\infty}. Thus, conditioning on this event, utcx𝒩(μt,σt2I)u_{t}\leq_{\mathrm{cx}}\mathcal{N}(\mu_{t},\sigma^{2}_{t}I). Hence, conditioning on \mathcal{E}, namely the event that both (43) and (45) hold, the induction steps above indicate that

(57) X(1)wX~(1)q=uN1cx𝒩(0,(c2+c3)XF2I)X^{(1)}w-\widetilde{X}^{(1)}q=u_{N_{1}}\leq_{\mathrm{cx}}\mathcal{N}\Bigl{(}0,(c_{2}+c_{3})\|X\|_{F}^{2}I\Bigr{)}

holds with probability at least

12exp((Kδ(2))24c12)2(N11)exp((Kδ(2))24(c2+c3+c4)c52).1-\sqrt{2}\exp\Bigl{(}-\frac{(K\delta^{(2)})^{2}}{4c_{1}^{2}}\Bigr{)}-\sqrt{2}(N_{1}-1)\exp\Bigl{(}-\frac{(K\delta^{(2)})^{2}}{4(c_{2}+c_{3}+c_{4})c_{5}^{2}}\Bigr{)}.

Step 4: Conditioning on \mathcal{E}, applying LABEL:lemma:cx-gaussian-tail with γ=2mN2p\gamma=\sqrt{2}mN_{2}^{-p}, and taking a union bound over all neurons in W(2)W^{(2)},

(58) max1jN2X(1)Wj(2)X~(1)Qj(2)22(c2+c3)pmlogN2XF\max_{1\leq j\leq N_{2}}\|X^{(1)}W^{(2)}_{j}-\widetilde{X}^{(1)}Q^{(2)}_{j}\|_{2}\leq 2\sqrt{(c_{2}+c_{3})pm\log N_{2}}\|X\|_{F}

holds with probability at least

12mN2p12N2exp((Kδ(2))24c12)2(N11)N2exp((Kδ(2))24(c2+c3+c4)c52).1-\frac{\sqrt{2}m}{N_{2}^{p-1}}-\sqrt{2}N_{2}\exp\Bigl{(}-\frac{(K\delta^{(2)})^{2}}{4c_{1}^{2}}\Bigr{)}-\sqrt{2}(N_{1}-1)N_{2}\exp\Bigl{(}-\frac{(K\delta^{(2)})^{2}}{4(c_{2}+c_{3}+c_{4})c_{5}^{2}}\Bigr{)}.

Taking into account the probability of the event \mathcal{E}, (58) holds unconditionally with probability exceeding

1\displaystyle 1 2mN1p12N1exp((Kδ(1))2N14)2N1t=2N0exp((Kδ(1))24βt2)2N1p1\displaystyle-\frac{\sqrt{2}m}{N_{1}^{p-1}}-\sqrt{2}N_{1}\exp\Bigl{(}-\frac{(K\delta^{(1)})^{2}N_{1}}{4}\Bigr{)}-\sqrt{2}N_{1}\sum_{t=2}^{N_{0}}\exp\Bigl{(}{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}-\frac{(K\delta^{(1)})^{2}}{4\beta_{t}^{2}}}\Bigr{)}-\frac{2}{N_{1}^{p-1}}
2mN2p12N2exp((Kδ(2))24c12)2(N11)N2exp((Kδ(2))24(c2+c3+c4)c52)\displaystyle-\frac{\sqrt{2}m}{N_{2}^{p-1}}-\sqrt{2}N_{2}\exp\Bigl{(}-\frac{(K\delta^{(2)})^{2}}{4c_{1}^{2}}\Bigr{)}-\sqrt{2}(N_{1}-1)N_{2}\exp\Bigl{(}-\frac{(K\delta^{(2)})^{2}}{4(c_{2}+c_{3}+c_{4})c_{5}^{2}}\Bigr{)}

where βt2=1N1+π(δ(1)ηt1)22Xt22\beta_{t}^{2}=\frac{1}{N_{1}}+\frac{\pi(\delta^{(1)}\eta_{t-1})^{2}}{2\|X_{t}\|_{2}^{2}}. Note that, bounding t1t-1 above by N1N_{1},

c2+c3=2πN1mp(δ(1))2logN1d1N2+3π(δ(2))2(πmp(δ(1))2logN1d1+14N1+plogN1N1d2)c_{2}+c_{3}={\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}\frac{2\pi N_{1}mp(\delta^{(1)})^{2}\log N_{1}}{d_{1}N_{2}}}+3\pi(\delta^{(2)})^{2}\Bigl{(}\frac{\pi mp(\delta^{(1)})^{2}\log N_{1}}{d_{1}}+\frac{1}{4N_{1}}+\frac{p\log N_{1}}{N_{1}d_{2}}\Bigr{)}

and

c2+c3+c4\displaystyle c_{2}+c_{3}+c_{4} =2πN1mp(δ(1))2logN1d1N2+3π(δ(2))2(πmp(δ(1))2logN1d1+14N1+plogN1N1d2)\displaystyle={\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}\frac{2\pi N_{1}mp(\delta^{(1)})^{2}\log N_{1}}{d_{1}N_{2}}}+3\pi(\delta^{(2)})^{2}\Bigl{(}\frac{\pi mp(\delta^{(1)})^{2}\log N_{1}}{d_{1}}+\frac{1}{4N_{1}}+\frac{p\log N_{1}}{N_{1}d_{2}}\Bigr{)}
+1N1N2(1+4plogN1d2).\displaystyle+\frac{1}{N_{1}N_{2}}\Bigl{(}1+\frac{4p\log N_{1}}{d_{2}}\Bigr{)}.

[**JZ: Bottleneck terms are highlighted in red.] We can set δ(1)=1mN1\delta^{(1)}=\frac{1}{\sqrt{mN_{1}}}. The other bottleneck term 2πN1mp(δ(1))2logN1d1N2\frac{2\pi N_{1}mp(\delta^{(1)})^{2}\log N_{1}}{d_{1}N_{2}} then requires N0d1pN1logN1N_{0}\geq d_{1}\gtrsim pN_{1}\log N_{1}. Additionally, we choose δ(2)=1N2\delta^{(2)}=\frac{1}{\sqrt{N_{2}}}, d2plogN1d_{2}\gtrsim p\log N_{1}, and Kmlog(N0N1N2)K\gtrsim\sqrt{m\log(N_{0}N_{1}N_{2})}. With these choices, c2+c31N1N2c_{2}+c_{3}\lesssim\frac{1}{N_{1}N_{2}} and (c2+c3+c4)c521N2(c_{2}+c_{3}+c_{4})c_{5}^{2}\lesssim\frac{1}{N_{2}} (a sketch of this bookkeeping is given after the proof). Therefore, we have

max1jN2X(1)Wj(2)X~(1)Qj(2)22pmlogN2N1N2XF\max_{1\leq j\leq N_{2}}\|X^{(1)}W^{(2)}_{j}-\widetilde{X}^{(1)}Q^{(2)}_{j}\|_{2}\leq 2\sqrt{\frac{pm\log N_{2}}{N_{1}N_{2}}}\|X\|_{F}

with high probability. ∎
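
For the reader's convenience, we sketch the bookkeeping behind the parameter choices made at the end of the proof. Constants are not tracked, and we assume the implicit constants in d_{1}\gtrsim pN_{1}\log N_{1} and d_{2}\gtrsim p\log N_{1} are large enough that the denominators defining c_{1} and c_{5} stay bounded away from zero. With \delta^{(1)}=\frac{1}{\sqrt{mN_{1}}} and \delta^{(2)}=\frac{1}{\sqrt{N_{2}}},

\frac{2\pi N_{1}mp(\delta^{(1)})^{2}\log N_{1}}{d_{1}N_{2}}=\frac{2\pi p\log N_{1}}{d_{1}N_{2}}\lesssim\frac{1}{N_{1}N_{2}},\qquad\frac{3\pi^{2}mp(\delta^{(1)})^{2}(\delta^{(2)})^{2}\log N_{1}}{d_{1}}=\frac{3\pi^{2}p\log N_{1}}{N_{1}N_{2}d_{1}}\lesssim\frac{1}{N_{1}^{2}N_{2}},

and

\frac{3\pi(\delta^{(2)})^{2}}{4N_{1}}=\frac{3\pi}{4N_{1}N_{2}},\qquad\frac{3\pi(\delta^{(2)})^{2}p\log N_{1}}{N_{1}d_{2}}=\frac{3\pi p\log N_{1}}{N_{1}N_{2}d_{2}}\lesssim\frac{1}{N_{1}N_{2}},

so that c_{2}+c_{3}\lesssim\frac{1}{N_{1}N_{2}}. Likewise c_{4}\lesssim\frac{1}{N_{1}N_{2}}, while c_{5}^{2}\lesssim N_{1}, whence (c_{2}+c_{3}+c_{4})c_{5}^{2}\lesssim\frac{1}{N_{2}}, as used in Step 4.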

Since Xj(1)21N1XF\|X^{(1)}_{j}\|_{2}\approx\frac{1}{\sqrt{N_{1}}}\|X\|_{F} for all 1jN11\leq j\leq N_{1}, we have X(1)F2XF2\|X^{(1)}\|_{F}^{2}\approx\|X\|_{F}^{2}. It follows that

𝔼X(1)Wj(2)22=1N2X(1)F21N2XF2.\mathbb{E}\|X^{(1)}W^{(2)}_{j}\|_{2}^{2}=\frac{1}{N_{2}}\|X^{(1)}\|_{F}^{2}\approx\frac{1}{N_{2}}\|X\|_{F}^{2}.

Thus, the relative error of quantizing the second layer is given by

X(1)Wj(2)X~(1)Qj(2)2X(1)Wj(2)2pmlogN2N1.\frac{\|X^{(1)}W^{(2)}_{j}-\widetilde{X}^{(1)}Q^{(2)}_{j}\|_{2}}{\|X^{(1)}W^{(2)}_{j}\|_{2}}\approx\sqrt{\frac{pm\log N_{2}}{N_{1}}}.
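
To spell out the arithmetic behind this approximation: dividing the error bound (41) by the typical size \frac{1}{\sqrt{N_{2}}}\|X\|_{F} of \|X^{(1)}W^{(2)}_{j}\|_{2} gives, up to the constant factor,

\frac{\|X^{(1)}W^{(2)}_{j}-\widetilde{X}^{(1)}Q^{(2)}_{j}\|_{2}}{\|X^{(1)}W^{(2)}_{j}\|_{2}}\approx\frac{2\|X\|_{F}\sqrt{\frac{pm\log N_{2}}{N_{1}N_{2}}}}{\frac{1}{\sqrt{N_{2}}}\|X\|_{F}}=2\sqrt{\frac{pm\log N_{2}}{N_{1}}}.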

[**JZ: All results in this section serve as references which will be modified.] Given the result (LABEL:eq:inf-alphabet-tails), one can bound the quantization error for the first layer without approximating μt\mu_{t} because μt=0\mu_{t}=0 in this case. See the following theorem for details.

Theorem 3.4 (Quantization error bound for the first layer).

Suppose we quantize the weights W(1)N0×N1W^{(1)}\in\mathbb{R}^{N_{0}\times N_{1}} in the first layer using (LABEL:eq:quantization-algorithm) with alphabet 𝒜=𝒜δ\mathcal{A}=\mathcal{A}^{\delta}_{\infty}, step size δ>0\delta>0, and input data Xm×N0X\in\mathbb{R}^{m\times N_{0}}.

  (1)

    For every neuron wN0w\in\mathbb{R}^{N_{0}} that is a column of W(1)W^{(1)}, the quantization error satisfies

    (59) XwXq2σN0mlogN0\|Xw-Xq\|_{2}\lesssim\sigma_{N_{0}}\sqrt{m\log N_{0}}

    with probability exceeding 1N021-N_{0}^{-2}.

  (2)

    Let Q(1)𝒜N0×N1Q^{(1)}\in\mathcal{A}^{N_{0}\times N_{1}} denote the quantized output for all neurons. Then

    (60) XW(1)XQ(1)FσN0mN1logN0\|XW^{(1)}-XQ^{(1)}\|_{F}\lesssim\sigma_{N_{0}}\sqrt{mN_{1}\log N_{0}}

    holds with probability at least 1N1N021-\frac{N_{1}}{N_{0}^{2}}.

Here, σN0=δπ2max1jN0Xj2\sigma_{N_{0}}=\delta\sqrt{\frac{\pi}{2}}\max_{1\leq j\leq N_{0}}\|X_{j}\|_{2} is defined by (LABEL:eq:def-sigma).

Proof.

(1) Since we perform parallel and independent quantization for all neurons in the same layer, it suffices to consider w=Wj(1)w=W^{(1)}_{j} for some 1jN11\leq j\leq N_{1}. Additionally, by (LABEL:eq:layer-input), we have X(0)=X~(0)=Xm×N0X^{(0)}=\widetilde{X}^{(0)}=X\in\mathbb{R}^{m\times N_{0}}. According to LABEL:corollary:ut-bound-infinite, quantization of ww guarantees that

utμt2σtlogm+log2logγ\|u_{t}-\mu_{t}\|_{\infty}\leq 2\sigma_{t}\sqrt{\log m+\log\sqrt{2}-\log\gamma}

holds with probability exceeding 1γ1-\gamma. Here, μt=PX~t(0)(μt1+wtXt(0))\mu_{t}=P_{\widetilde{X}_{t}^{(0)\perp}}(\mu_{t-1}+w_{t}X_{t}^{(0)}) with μ0=0\mu_{0}=0, and σt2=πδ22max1jtXj22\sigma_{t}^{2}=\frac{\pi\delta^{2}}{2}\max_{1\leq j\leq t}\|X_{j}\|_{2}^{2}. Since X(0)=X~(0)X^{(0)}=\widetilde{X}^{(0)}, by induction on tt, it is easy to verify that μt=0\mu_{t}=0 for all tt. It follows that

ut2σtlogm+log2logγ\|u_{t}\|_{\infty}\leq 2\sigma_{t}\sqrt{\log m+\log\sqrt{2}-\log\gamma}

holds with probability at least 1γ1-\gamma. In particular, let t=N0mt=N_{0}\gg m and γ=1N02\gamma=\frac{1}{N_{0}^{2}}. Then

XwXq2=uN02muN0σN0mlogN0\|Xw-Xq\|_{2}=\|u_{N_{0}}\|_{2}\leq\sqrt{m}\|u_{N_{0}}\|_{\infty}\lesssim\sigma_{N_{0}}\sqrt{m\log N_{0}}

holds with probability at least 1N021-N_{0}^{-2}.

(2) Note that (59) is valid for every neuron ww. By taking a union bound over all neurons,

XW(1)XQ(1)F2=j=1N1XWj(1)XQj(1)22σN02mN1logN0\|XW^{(1)}-XQ^{(1)}\|_{F}^{2}=\sum_{j=1}^{N_{1}}\|XW^{(1)}_{j}-XQ^{(1)}_{j}\|_{2}^{2}\lesssim\sigma_{N_{0}}^{2}mN_{1}\log N_{0}

holds with probability at least 1N1N021-\frac{N_{1}}{N_{0}^{2}}. ∎
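
As a usage note for the Python sketch given after (46) (and under the same assumptions made there): the setting of this theorem corresponds to running the same loop with the quantized and unquantized data equal, so that \mu_{t}=0, and with K taken so large that the finite alphabet behaves like \mathcal{A}^{\delta}_{\infty}.

    # Hypothetical first-layer call of the earlier quantize_neuron sketch:
    # X is the m x N0 data matrix and w a column of W^(1); a very large K
    # emulates the infinite alphabet A^delta_infty of Theorem 3.4.
    q, u = quantize_neuron(w, X, X, delta, K=10**9)
    # Then ||u||_2 = ||X @ w - X @ q||_2 is exactly the quantity bounded in (59).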

Moreover, we can obtain the relative error of quantization by estimating Xw2\|Xw\|_{2}. For example, assume that neuron wN0w\in\mathbb{R}^{N_{0}} is an isotropic random vector and data Xm×N0X\in\mathbb{R}^{m\times N_{0}} is generic in the sense that XFσN0N0\|X\|_{F}\gtrsim\sigma_{N_{0}}\sqrt{N_{0}}. Then combining the fact that 𝔼Xw22=XF2\mathbb{E}\|Xw\|_{2}^{2}=\|X\|_{F}^{2} with (59), we deduce that the following inequality holds with high probability.

XwXq2Xw2σN0mlogN0XFmlogN0N0.\frac{\|Xw-Xq\|_{2}}{\|Xw\|_{2}}\lesssim\frac{\sigma_{N_{0}}\sqrt{m\log N_{0}}}{\|X\|_{F}}\lesssim\sqrt{\frac{m\log N_{0}}{N_{0}}}.

Further, if all neurons are isotropic, then 𝔼XW(1)F2=j=1N1𝔼XWj(1)22=N1XF2.\mathbb{E}\|XW^{(1)}\|_{F}^{2}=\sum_{j=1}^{N_{1}}\mathbb{E}\|XW^{(1)}_{j}\|_{2}^{2}=N_{1}\|X\|_{F}^{2}. It follows from (60) that

XW(1)XQ(1)FXW(1)FσN0mN1logN0N1XFmlogN0N0.\frac{\|XW^{(1)}-XQ^{(1)}\|_{F}}{\|XW^{(1)}\|_{F}}\lesssim\frac{\sigma_{N_{0}}\sqrt{mN_{1}\log N_{0}}}{\sqrt{N_{1}}\|X\|_{F}}\lesssim\sqrt{\frac{m\log N_{0}}{N_{0}}}.

Note that the error bounds above are identical to the error bounds derived in [zhang2022post] where XX was assumed random and ww was deterministic.

Theorem 3.5.

Let Φ\Phi be a two-layer neural network as in (LABEL:eq:mlp) where L=2L=2 and the activation function is φ(i)(x)=ρ(x):=max{0,x}\varphi^{(i)}(x)=\rho(x):=\max\{0,x\} for all 1iL1\leq i\leq L. Suppose that each W(i)Ni1×NiW^{(i)}\in\mathbb{R}^{N_{i-1}\times N_{i}} has i.i.d. 𝒩(0,1Ni1)\mathcal{N}(0,\frac{1}{N_{i-1}}) entries and {W(i)}i=1L\{W^{(i)}\}_{i=1}^{L} are independent. If we quantize Φ\Phi using (LABEL:eq:quantization-algorithm) with alphabet 𝒜=𝒜δ\mathcal{A}=\mathcal{A}_{\infty}^{\delta}, and input data Xm×N0X\in\mathbb{R}^{m\times N_{0}}, then for 1kN21\leq k\leq N_{2},

X(1)Wk(2)X~(1)Qk(2)2\displaystyle\|X^{(1)}W^{(2)}_{k}-\widetilde{X}^{(1)}Q^{(2)}_{k}\|_{2} (δ2mlogN0logN1+δmlogN0)max1jN0Xj2\displaystyle\lesssim(\delta^{2}m\sqrt{\log N_{0}\log N_{1}}+\delta\sqrt{m}\log N_{0})\max_{1\leq j\leq N_{0}}\|X_{j}\|_{2}
+δmlogN1N0(logN0X2+XF)\displaystyle+\delta\sqrt{\frac{m\log N_{1}}{N_{0}}}\Bigl{(}\sqrt{\log N_{0}}\|X\|_{2}+\|X\|_{F}\Bigr{)}

holds with probability at least 1N1N03N0cN131-\frac{N_{1}}{N_{0}^{3}}-N_{0}^{-c}-N_{1}^{-3}.

Proof.

By quantizing the weights W(1)W^{(1)} of the first layer, one can deduce the following error bound as in Theorem 3.4:

(61) P(XWj(1)XQj(1)2δmlogN0max1jN0Xj2)112N03,1jN1.\mathrm{P}\Bigl{(}\|XW^{(1)}_{j}-XQ^{(1)}_{j}\|_{2}\lesssim\delta\sqrt{m\log N_{0}}\max_{1\leq j\leq N_{0}}\|X_{j}\|_{2}\Bigr{)}\geq 1-\frac{1}{2N_{0}^{3}},\quad 1\leq j\leq N_{1}.

Since Xj(1)=ρ(XWj(1))X^{(1)}_{j}=\rho(XW^{(1)}_{j}), X~j(1)=ρ(XQj(1))\widetilde{X}^{(1)}_{j}=\rho(XQ^{(1)}_{j}), and ρ(x)ρ(y)2xy2\|\rho(x)-\rho(y)\|_{2}\leq\|x-y\|_{2} for any x,ymx,y\in\mathbb{R}^{m}, it follows from (61) that, for 1jN11\leq j\leq N_{1},

(62) Xj(1)X~j(1)2δmlogN0max1jN0Xj2\|X^{(1)}_{j}-\widetilde{X}^{(1)}_{j}\|_{2}\lesssim\delta\sqrt{m\log N_{0}}\max_{1\leq j\leq N_{0}}\|X_{j}\|_{2}

holds with probability at least 112N031-\frac{1}{2N_{0}^{3}}. Additionally, one can show that, for each jj,

(63) Xj(1)2=ρ(XWj(1))21N0(logN0X2+XF)\|X^{(1)}_{j}\|_{2}=\|\rho(XW^{(1)}_{j})\|_{2}\lesssim\frac{1}{\sqrt{N_{0}}}\Bigl{(}\sqrt{\log N_{0}}\|X\|_{2}+\|X\|_{F}\Bigr{)}

holds with probability exceeding 112N031-\frac{1}{2N_{0}^{3}}. [**JZ: Will add a lemma to prove the above inequality later.] Let \mathcal{E} denote the event that (62) and (63) hold uniformly for all jj. By taking a union bound, we have P()1N1N03\mathrm{P}(\mathcal{E})\geq 1-\frac{N_{1}}{N_{0}^{3}}.
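
For completeness, here is one plausible route to (63), patterned after Step 1 in the proof of the preceding theorem; it is only a sketch and is meant to be replaced by the lemma mentioned in the note above. The map z\mapsto\|\rho(Xz)\|_{2} is \|X\|_{2}-Lipschitz and \sqrt{N_{0}}W^{(1)}_{j}\sim\mathcal{N}(0,I), so LABEL:lemma:Lip-concentration with deviation \alpha\asymp\sqrt{\log N_{0}}\|X\|_{2}, combined with \mathbb{E}\|\rho(XW^{(1)}_{j})\|_{2}\leq\sqrt{\mathbb{E}\|\rho(XW^{(1)}_{j})\|_{2}^{2}}=\frac{1}{\sqrt{2N_{0}}}\|X\|_{F}, would give

\|X^{(1)}_{j}\|_{2}=\|\rho(XW^{(1)}_{j})\|_{2}\leq\mathbb{E}\|\rho(XW^{(1)}_{j})\|_{2}+\frac{\alpha}{\sqrt{N_{0}}}\lesssim\frac{1}{\sqrt{N_{0}}}\Bigl{(}\sqrt{\log N_{0}}\|X\|_{2}+\|X\|_{F}\Bigr{)}

with probability at least 1-\frac{1}{2N_{0}^{3}}, which matches (63).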

Next, conditioning on \mathcal{E}, we quantize the second layer W(2)N1×N2W^{(2)}\in\mathbb{R}^{N_{1}\times N_{2}} using (LABEL:eq:quantization-algorithm) with data X(1),X~(1)m×N1X^{(1)},\widetilde{X}^{(1)}\in\mathbb{R}^{m\times N_{1}}. Applying LABEL:corollary:ut-bound-infinite with i=2i=2 and γ=N13\gamma=N_{1}^{-3}, for each neuron Wk(2)N1W^{(2)}_{k}\in\mathbb{R}^{N_{1}} and its quantized counterpart Qk(2)𝒜N1Q^{(2)}_{k}\in\mathcal{A}^{N_{1}}, we have

(64) X(1)Wk(2)X~(1)Qk(2)μk2δmlogN1max1jN1X~j(1)2\|X^{(1)}W^{(2)}_{k}-\widetilde{X}^{(1)}Q^{(2)}_{k}-\mu_{k}\|_{2}\lesssim\delta\sqrt{m\log N_{1}}\max_{1\leq j\leq N_{1}}\|\widetilde{X}^{(1)}_{j}\|_{2}

holds with probability exceeding 1N131-N_{1}^{-3}. Here, according to (LABEL:eq:def-mu), we have

μk:=j=1N1Wk(2)(j)PX~N1(1)PX~j+1(1)PX~j(1)Xj(1).\mu_{k}:=\sum_{j=1}^{N_{1}}W^{(2)}_{k}(j)P_{\widetilde{X}^{(1)\perp}_{N_{1}}}\ldots P_{\widetilde{X}^{(1)\perp}_{j+1}}P_{\widetilde{X}^{(1)\perp}_{j}}X^{(1)}_{j}.

On the event \mathcal{E}, the triangle inequality, (62), and (63) yield

X~j(1)2\displaystyle\|\widetilde{X}^{(1)}_{j}\|_{2} X~j(1)Xj(1)2+Xj(1)2\displaystyle\leq\|\widetilde{X}^{(1)}_{j}-X^{(1)}_{j}\|_{2}+\|X^{(1)}_{j}\|_{2}
δmlogN0max1jN0Xj2+1N0(logN0X2+XF)\displaystyle\lesssim\delta\sqrt{m\log N_{0}}\max_{1\leq j\leq N_{0}}\|X_{j}\|_{2}+\frac{1}{\sqrt{N_{0}}}\Bigl{(}\sqrt{\log N_{0}}\|X\|_{2}+\|X\|_{F}\Bigr{)}

holds uniformly for all 1jN11\leq j\leq N_{1}. Hence, (64) becomes

X(1)Wk(2)X~(1)Qk(2)μk2\displaystyle\|X^{(1)}W^{(2)}_{k}-\widetilde{X}^{(1)}Q^{(2)}_{k}-\mu_{k}\|_{2} δmlogN1max1jN1X~j(1)2\displaystyle\lesssim\delta\sqrt{m\log N_{1}}\max_{1\leq j\leq N_{1}}\|\widetilde{X}^{(1)}_{j}\|_{2}
δ2mlogN0logN1max1jN0Xj2\displaystyle\lesssim\delta^{2}m\sqrt{\log N_{0}\log N_{1}}\max_{1\leq j\leq N_{0}}\|X_{j}\|_{2}
(65) +δmlogN1N0(logN0X2+XF)\displaystyle+\delta\sqrt{\frac{m\log N_{1}}{N_{0}}}\Bigl{(}\sqrt{\log N_{0}}\|X\|_{2}+\|X\|_{F}\Bigr{)}

with probability (conditioning on \mathcal{E}) at least 1N131-N_{1}^{-3}.

Now, to bound the quantization error X(1)Wk(2)X~(1)Qk(2)2\|X^{(1)}W^{(2)}_{k}-\widetilde{X}^{(1)}Q^{(2)}_{k}\|_{2} using triangle inequality, it suffices to control μk2\|\mu_{k}\|_{2}. Since Wk(2)𝒩(0,1N1I)W^{(2)}_{k}\sim\mathcal{N}(0,\frac{1}{N_{1}}I), we get μk𝒩(0,S)\mu_{k}\sim\mathcal{N}(0,S) with SS defined as follows

(66) S:=1N1j=1N1PX~N1(1)PX~j+1(1)PX~j(1)Xj(1)Xj(1)PX~j(1)PX~j+1(1)PX~N1(1)m×m.S:=\frac{1}{N_{1}}\sum_{j=1}^{N_{1}}P_{\widetilde{X}^{(1)\perp}_{N_{1}}}\ldots P_{\widetilde{X}^{(1)\perp}_{j+1}}P_{\widetilde{X}^{(1)\perp}_{j}}X^{(1)}_{j}X^{(1)\top}_{j}P_{\widetilde{X}^{(1)\perp}_{j}}P_{\widetilde{X}^{(1)\perp}_{j+1}}\ldots P_{\widetilde{X}^{(1)\perp}_{N_{1}}}\in\mathbb{R}^{m\times m}.

Then S12μk𝒩(0,I)S^{-\frac{1}{2}}\mu_{k}\sim\mathcal{N}(0,I). Applying the Hanson-Wright inequality, see e.g. [rudelson2013hanson] and [vershynin2018high], we obtain for all α0\alpha\geq 0 that

(67) P(μk2S1/2Fα)1exp(cα2S1/222).\mathrm{P}\biggl{(}\|\mu_{k}\|_{2}-\|S^{1/2}\|_{F}\leq\alpha\biggr{)}\geq 1-\exp\biggl{(}-\frac{c\alpha^{2}}{\|S^{1/2}\|_{2}^{2}}\biggr{)}.

It remains to evaluate S1/2F\|S^{1/2}\|_{F} and S1/22\|S^{1/2}\|_{2}. On the event \mathcal{E},

(68) PX~j(1)(Xj(1))2=PX~j(1)(Xj(1)X~j(1))2Xj(1)X~j(1)2δmlogN0max1jN0Xj2\|P_{\widetilde{X}^{(1)\perp}_{j}}(X^{(1)}_{j})\|_{2}=\|P_{\widetilde{X}^{(1)\perp}_{j}}(X^{(1)}_{j}-\widetilde{X}^{(1)}_{j})\|_{2}\leq\|X^{(1)}_{j}-\widetilde{X}^{(1)}_{j}\|_{2}\lesssim\delta\sqrt{m\log N_{0}}\max_{1\leq j\leq N_{0}}\|X_{j}\|_{2}

holds uniformly for all 1jN11\leq j\leq N_{1}. Moreover, since AF2=tr(AA)\|A\|_{F}^{2}=\operatorname{tr}(A^{\top}A) and tr(AB)=tr(BA)\operatorname{tr}(AB)=\operatorname{tr}(BA), we have

S1/2F2\displaystyle\|S^{1/2}\|_{F}^{2} =tr(S)\displaystyle=\operatorname{tr}(S)
=1N1j=1N1tr(PX~N1(1)PX~j+1(1)PX~j(1)Xj(1)Xj(1)PX~j(1)PX~j+1(1)PX~N1(1))\displaystyle=\frac{1}{N_{1}}\sum_{j=1}^{N_{1}}\operatorname{tr}(P_{\widetilde{X}^{(1)\perp}_{N_{1}}}\ldots P_{\widetilde{X}^{(1)\perp}_{j+1}}P_{\widetilde{X}^{(1)\perp}_{j}}X^{(1)}_{j}X^{(1)\top}_{j}P_{\widetilde{X}^{(1)\perp}_{j}}P_{\widetilde{X}^{(1)\perp}_{j+1}}\ldots P_{\widetilde{X}^{(1)\perp}_{N_{1}}})
=1N1j=1N1tr(Xj(1)PX~j(1)PX~j+1(1)PX~N1(1)PX~j+1(1)PX~j(1)Xj(1))\displaystyle=\frac{1}{N_{1}}\sum_{j=1}^{N_{1}}\operatorname{tr}(X^{(1)\top}_{j}P_{\widetilde{X}^{(1)\perp}_{j}}P_{\widetilde{X}^{(1)\perp}_{j+1}}\ldots P_{\widetilde{X}^{(1)\perp}_{N_{1}}}\ldots P_{\widetilde{X}^{(1)\perp}_{j+1}}P_{\widetilde{X}^{(1)\perp}_{j}}X^{(1)}_{j})
=1N1j=1N1Xj(1)PX~j(1)PX~j+1(1)PX~N1(1)PX~j+1(1)PX~j(1)Xj(1)\displaystyle=\frac{1}{N_{1}}\sum_{j=1}^{N_{1}}X^{(1)\top}_{j}P_{\widetilde{X}^{(1)\perp}_{j}}P_{\widetilde{X}^{(1)\perp}_{j+1}}\ldots P_{\widetilde{X}^{(1)\perp}_{N_{1}}}\ldots P_{\widetilde{X}^{(1)\perp}_{j+1}}P_{\widetilde{X}^{(1)\perp}_{j}}X^{(1)}_{j}
=1N1j=1N1(PX~j(1)Xj(1))PX~j+1(1)PX~N1(1)PX~j+1(1)(PX~j(1)Xj(1))\displaystyle=\frac{1}{N_{1}}\sum_{j=1}^{N_{1}}(P_{\widetilde{X}^{(1)\perp}_{j}}X^{(1)}_{j})^{\top}P_{\widetilde{X}^{(1)\perp}_{j+1}}\ldots P_{\widetilde{X}^{(1)\perp}_{N_{1}}}\ldots P_{\widetilde{X}^{(1)\perp}_{j+1}}(P_{\widetilde{X}^{(1)\perp}_{j}}X^{(1)}_{j})
1N1j=1N1PX~j+1(1)PX~N1(1)PX~j+1(1)2PX~j(1)(Xj(1))22\displaystyle\leq\frac{1}{N_{1}}\sum_{j=1}^{N_{1}}\|P_{\widetilde{X}^{(1)\perp}_{j+1}}\ldots P_{\widetilde{X}^{(1)\perp}_{N_{1}}}\ldots P_{\widetilde{X}^{(1)\perp}_{j+1}}\|_{2}\|P_{\widetilde{X}^{(1)\perp}_{j}}(X^{(1)}_{j})\|_{2}^{2}
1N1j=1N1PX~j(1)(Xj(1))22.\displaystyle\leq\frac{1}{N_{1}}\sum_{j=1}^{N_{1}}\|P_{\widetilde{X}^{(1)\perp}_{j}}(X^{(1)}_{j})\|_{2}^{2}.

Here, the first inequality holds because maxx2=1xAx=A2\max_{\|x\|_{2}=1}x^{\top}Ax=\|A\|_{2} for any positive semidefinite matrix AA and the second inequality is due to P21\|P\|_{2}\leq 1 for any orthogonal projection PP. Plugging (68) into the result above, we get

(69) S1/2F2δ2mlogN0max1jN0Xj22.\|S^{1/2}\|_{F}^{2}\lesssim\delta^{2}m\log N_{0}\max_{1\leq j\leq N_{0}}\|X_{j}\|_{2}^{2}.

Further, since P21\|P\|_{2}\leq 1 for all orthogonal projections PP and aa2=a22\|aa^{\top}\|_{2}=\|a\|_{2}^{2} for all vectors ama\in\mathbb{R}^{m}, we obtain

S1/222\displaystyle\|S^{1/2}\|_{2}^{2} =S2\displaystyle=\|S\|_{2}
1N1j=1N1PX~N1(1)PX~j+1(1)PX~j(1)Xj(1)Xj(1)PX~j(1)PX~j+1(1)PX~N1(1)2\displaystyle\leq\frac{1}{N_{1}}\sum_{j=1}^{N_{1}}\|P_{\widetilde{X}^{(1)\perp}_{N_{1}}}\ldots P_{\widetilde{X}^{(1)\perp}_{j+1}}P_{\widetilde{X}^{(1)\perp}_{j}}X^{(1)}_{j}X^{(1)\top}_{j}P_{\widetilde{X}^{(1)\perp}_{j}}P_{\widetilde{X}^{(1)\perp}_{j+1}}\ldots P_{\widetilde{X}^{(1)\perp}_{N_{1}}}\|_{2}
1N1j=1N1PX~j(1)Xj(1)Xj(1)PX~j(1)2\displaystyle\leq\frac{1}{N_{1}}\sum_{j=1}^{N_{1}}\|P_{\widetilde{X}^{(1)\perp}_{j}}X^{(1)}_{j}X^{(1)\top}_{j}P_{\widetilde{X}^{(1)\perp}_{j}}\|_{2}
=1N1j=1N1PX~j(1)(Xj(1))22.\displaystyle=\frac{1}{N_{1}}\sum_{j=1}^{N_{1}}\|P_{\widetilde{X}^{(1)\perp}_{j}}(X^{(1)}_{j})\|_{2}^{2}.

Again, by (68), the inequality above becomes

(70) S1/222δ2mlogN0max1jN0Xj22.\|S^{1/2}\|_{2}^{2}\lesssim\delta^{2}m\log N_{0}\max_{1\leq j\leq N_{0}}\|X_{j}\|_{2}^{2}.

Combining (67), (69), and (70), and choosing α=δmlogN0max1jN0Xj2\alpha=\delta\sqrt{m}\log N_{0}\max_{1\leq j\leq N_{0}}\|X_{j}\|_{2}, so that the exponent in (67) satisfies cα2/S1/222logN0c\alpha^{2}/\|S^{1/2}\|_{2}^{2}\gtrsim\log N_{0}, we get

(71) μk2δmlogN0max1jN0Xj2\|\mu_{k}\|_{2}\lesssim\delta\sqrt{m}\log N_{0}\max_{1\leq j\leq N_{0}}\|X_{j}\|_{2}

with probability (conditioning on \mathcal{E}) at least 1N0c1-N_{0}^{-c}. It follows from (65), (71), and P()1N1N03\mathrm{P}(\mathcal{E})\geq 1-\frac{N_{1}}{N_{0}^{3}} that

X(1)Wk(2)X~(1)Qk(2)2\displaystyle\|X^{(1)}W^{(2)}_{k}-\widetilde{X}^{(1)}Q^{(2)}_{k}\|_{2} (δ2mlogN0logN1+δmlogN0)max1jN0Xj2\displaystyle\lesssim(\delta^{2}m\sqrt{\log N_{0}\log N_{1}}+\delta\sqrt{m}\log N_{0})\max_{1\leq j\leq N_{0}}\|X_{j}\|_{2}
+δmlogN1N0(logN0X2+XF)\displaystyle+\delta\sqrt{\frac{m\log N_{1}}{N_{0}}}\Bigl{(}\sqrt{\log N_{0}}\|X\|_{2}+\|X\|_{F}\Bigr{)}

holds with probability at least 1N1N03N0cN131-\frac{N_{1}}{N_{0}^{3}}-N_{0}^{-c}-N_{1}^{-3}. ∎