Archived Results
1. Projections
Lemma 1.1.
Let , be as in (LABEL:eq:layer-input) and let be a neuron in the -th layer. Applying the data alignment procedure in (LABEL:eq:quantization-algorithm-step1), for , we have
Moreover, if and for , then
(1)
Proof.
We consider the proof by induction on . By (LABEL:eq:quantization-algorithm-step1), the case is straightforward, since we have
where we apply the properties of orthogonal projections in (LABEL:eq:orth-proj) and (LABEL:eq:orth-proj-mat). For , assume that
Then, by (LABEL:eq:quantization-algorithm-step1), one gets
Applying the induction hypothesis, we obtain
This completes the proof by induction. In particular, if , then we have
It follows from the triangle inequality and the definition of that
Since for any nonzero orthogonal projection , , and for , we deduce that
∎
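The projection facts used in the last two steps are easy to check numerically. The following is a minimal numpy sketch, assuming the projection in (LABEL:eq:orth-proj-mat) is the rank-one orthogonal projection onto the span of a single nonzero vector; the helper name proj and the test vectors are illustrative only.

```python
import numpy as np

def proj(x):
    """Rank-one orthogonal projection onto span{x} (assumed form of (LABEL:eq:orth-proj-mat))."""
    return np.outer(x, x) / np.dot(x, x)

rng = np.random.default_rng(0)
x, v = rng.standard_normal(8), rng.standard_normal(8)
P = proj(x)

# idempotence and symmetry: P^2 = P = P^T
assert np.allclose(P @ P, P) and np.allclose(P, P.T)
# contraction: ||P v|| <= ||v||, and a nonzero orthogonal projection has spectral norm 1
assert np.linalg.norm(P @ v) <= np.linalg.norm(v) + 1e-12
print(np.linalg.norm(P, 2))  # prints ~1.0
```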
2. Minimum solutions for a linear system
Let be a matrix with and let be a nonzero vector. Then the Rouché-Capelli theorem implies that the linear system admits infinitely many solutions. An intriguing problem with important applications is to find a solution of whose norm is as small as possible. Specifically, we aim to solve the following primal problem:
(2)
s.t.
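The extracted statement of (2) does not show which norm is minimized; the cited works [abdelmalek1977minimum, cadzow1973finite] concern the minimum l-infinity-norm solution, so the sketch below makes that assumption and casts (2) as a linear program. The helper min_inf_norm_solution is only an illustration of the primal problem, not of Cadzow's method.

```python
import numpy as np
from scipy.optimize import linprog

def min_inf_norm_solution(A, b):
    """Minimum l_inf-norm solution of the underdetermined system A x = b (assumed norm),
    cast as the LP: minimize t subject to A x = b and -t <= x_i <= t."""
    m, n = A.shape
    c = np.zeros(n + 1)
    c[-1] = 1.0                                    # objective: minimize t
    A_eq = np.hstack([A, np.zeros((m, 1))])        # equality constraints: A x = b
    # inequalities:  x_i - t <= 0  and  -x_i - t <= 0
    A_ub = np.vstack([np.hstack([np.eye(n), -np.ones((n, 1))]),
                      np.hstack([-np.eye(n), -np.ones((n, 1))])])
    b_ub = np.zeros(2 * n)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(None, None)] * n + [(0, None)])
    assert res.success
    return res.x[:n]

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 12))                   # wide matrix, m < n
b = rng.standard_normal(5)
x = min_inf_norm_solution(A, b)
print(np.max(np.abs(x)), np.linalg.norm(A @ x - b))  # optimal value and (near-zero) residual
```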
Apart from the linear programming formulation [abdelmalek1977minimum], two powerful tools, namely, Cadzow’s method [cadzow1973finite, cadzow1974efficient] and the Shim-Yoon method [shim1998stabilized], are widely used to solve (2). To perform the perturbation analysis, we will focus on Cadzow’s method throughout this section, which applies the duality principle to get
(3)
Moreover, suppose that are column vectors of . By (LABEL:eq:orth-proj), we have with . Then one can uniquely decompose as
(4)
where
and
According to the transformation technique used in section 2 of [cadzow1974efficient], the dual problem in (3) can be reformulated as
(5)
It follows immediately from (3) and (5) that
(6)
where and are solutions of the primal problem and the dual problem, respectively.
Now we evaluate the change of the optimal value of (2) under a small perturbation of . Suppose that with and . Then . Let be primal and dual solutions for the perturbed problem . Then, similar to (6), we deduce that
(7)
where
and
Lemma 2.1.
Let . Suppose that there exist positive constants , , and such that
(8)
Then we have
Proof.
In general, to evaluate the error bounds for the -th layer, we need to approximate by considering the small distance and the effect of consecutive orthogonal projections onto . Note that and where the -th neuron of the -th layer, denoted by , is quantized as . Since all neurons are quantized separately using a stochastic approach with independent random variables, are independent.
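For concreteness, here is a minimal sketch of an unbiased stochastic rounding quantizer of the kind alluded to above, assuming the alphabet is the infinite grid of step size delta; this is not a verbatim transcription of the quantizer in (LABEL:eq:quantization-expression), and the helper name stoc_quantize is illustrative. Using fresh, independent uniform draws for each neuron is what makes the per-neuron quantizations independent.

```python
import numpy as np

rng = np.random.default_rng(0)

def stoc_quantize(z, delta):
    """Unbiased stochastic rounding onto the grid delta * Z (an assumed form of the
    stochastic quantizer; the paper's exact definition is not reproduced here)."""
    z = np.asarray(z, dtype=float) / delta
    low = np.floor(z)
    # round up with probability equal to the fractional part, so that E[Q(z)] = z
    up = rng.random(z.shape) < (z - low)
    return delta * (low + up)

z = np.array([0.30, -1.25, 2.70])
samples = np.stack([stoc_quantize(z, delta=0.5) for _ in range(20000)])
print(samples.mean(axis=0))   # close to z: the quantizer is unbiased
```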
Corollary 2.2.
Let , be as in (LABEL:eq:layer-input) such that, for , the input discrepancy defined by satisfies where is a constant. Suppose that are independent. Let be the weights associated with a neuron in the -th layer. Quantizing using (LABEL:eq:quantization-expression) over alphabets with step size ,
holds with probability exceeding .
Proof.
Recall that
Due to , we have
(11)
Since and , we get , where we used LABEL:lemma:cx-afine to get the inequality. Assume that the following inequality holds:
Then applying LABEL:lemma:cx-afine and LABEL:lemma:cx-independent-sum to (11) yields
Hence, by induction, we have proved that, for ,
Since for all , by LABEL:lemma:cx-normal, we have . In particular, we get . It follows from LABEL:lemma:cx-gaussian-tail that
holds with probability at least . Additionally, (LABEL:eq:inf-alphabet-tails) implies that with probability exceeding , we have
where . Thus the union bound yields
with probability at least . ∎
[**JZ: Will consider the special case for orthogonal .] [**JZ: Previous loose bound: Suppose that with . If , then
Assume that . Using the triangle inequality and the induction hypothesis, one can get
Therefore, by induction, we have
(12)
]
Recall from (LABEL:eq:ht), (LABEL:eq:ut-bound-infinite-eq2), and (LABEL:eq:ut-bound-infinite-eq4) that
where
Let for all . Then we obtain
It follows that
(13)
where .
[**JZ: The proof for the general -th layer]
Proof.
In general, to evaluate the error bounds for the -th layer (with ) using (LABEL:eq:inf-alphabet-tails), we need to approximate by considering recursive orthogonal projections of onto for . Specifically, according to (LABEL:eq:def-mu), we have
To control using concentration inequalities, we impose randomness on the weights. In particular, by assuming , we get with defined as follows
(14)
Then . Applying the Hanson-Wright inequality, see e.g. [rudelson2013hanson] and [vershynin2018high], we obtain for all that
(15)
It remains to evaluate and . Note that the error bounds for the -th layer guarantee [**JZ: We may need to discuss the effect of activation functions. Consider two cases: one for large and another for small .]
It follows from the inequality above and Lemma 2.5 that
(16)
Moreover, since and , we have
Here, the first inequality holds because for any positive semidefinite matrix . The second inequality is due to for any orthogonal projection . Plugging (16) into the inequality above, we get
(17)
Next, since for all orthogonal projections and for all vectors , we obtain
Again, by (16), the inequality above becomes
(18)
∎
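The Hanson-Wright step in (15) controls a Gaussian quadratic form. In the simplest instance, for w with i.i.d. standard normal entries, ||Xw||^2 = w^T X^T X w has mean ||X||_F^2 and concentrates around it; the Monte Carlo sketch below only illustrates that concentration empirically, and does not reproduce the trace and operator-norm quantities appearing in (16)-(18).

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 50
X = rng.standard_normal((m, n)) / np.sqrt(m)

# Hanson-Wright controls the quadratic form ||X w||^2 = w^T (X^T X) w around its mean.
w = rng.standard_normal((n, 10000))                   # 10k independent Gaussian weight vectors
vals = np.sum((X @ w) ** 2, axis=0)                   # ||X w||^2 for each sample
mean_theory = np.linalg.norm(X, 'fro') ** 2           # E||Xw||^2 = ||X||_F^2 for w ~ N(0, I_n)
print(vals.mean(), mean_theory)                       # the two numbers are close
print(np.quantile(np.abs(vals - mean_theory), 0.95))  # typical deviation, of order ||X^T X||_F
```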
[**JZ: Old results by solving the optimization problem:]
Our strategy is to align with for each which leads to . Specifically, given a neuron in layer , recall that our quantization algorithm generates such that can track in the sense of distance. If one can find a proper vector satisfying , then quantizing the new weights using data amounts to solving for such that
(19)
which indeed does not change our initial target. Therefore, it suffices to quantize using (LABEL:eq:quantization-expression) in which we set and . In this case, the modified iterations of quantization are given by
(20)
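A sketch of an error-feedback ("greedy path following") quantization loop with the shape suggested by (20): the aligned data enters both the quantizer's argument and the running error update. The update rule, the stochastic rounding step, and all names below are assumptions based on the surrounding discussion, not a transcription of (LABEL:eq:quantization-expression).

```python
import numpy as np

rng = np.random.default_rng(0)

def stoc_round(z, delta):
    """Unbiased stochastic rounding of a scalar onto the grid delta * Z."""
    low = np.floor(z / delta)
    return delta * (low + (rng.random() < z / delta - low))

def quantize_neuron(X, X_tilde, w, delta):
    """Error-feedback quantization of one neuron: after step t the running error u equals
    sum_{s<=t} (w_s X_s - q_s X_tilde_s), and q_t is chosen to keep it small.
    (A GPFQ-style sketch; the iteration (20) may differ in details.)"""
    m, n = X.shape
    u = np.zeros(m)
    q = np.zeros(n)
    for t in range(n):
        xt, xt_q = X[:, t], X_tilde[:, t]
        arg = np.dot(xt_q, u + w[t] * xt) / np.dot(xt_q, xt_q)
        q[t] = stoc_round(arg, delta)
        u = u + w[t] * xt - q[t] * xt_q
    return q, u

m, n = 64, 32
X = rng.standard_normal((m, n))
X_tilde = X + 0.01 * rng.standard_normal((m, n))   # slightly perturbed (quantized-path) data
w = rng.standard_normal(n)
q, u = quantize_neuron(X, X_tilde, w, delta=0.1)
print(np.linalg.norm(X @ w - X_tilde @ q), np.linalg.norm(u))  # the two coincide: u tracks the error
```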
Moreover, the corresponding error bound for quantizing the -th layer with can be derived as follows.
Corollary 2.3.
Let , be as in (LABEL:eq:layer-input) and suppose that . Let be the weights associated with a neuron in the -th layer and let be any solution of the linear system . Quantizing using (20) over alphabets with step size , the following inequality holds with probability exceeding ,
where . In particular,
holds with probability at least .
Proof.
Because it suffices to use to quantize , we have in (LABEL:eq:quantization-expression). So LABEL:thm:ut-bound-infinite and LABEL:corollary:ut-bound-infinite still hold. It follows from (LABEL:eq:def-mu) that for . Additionally, in this case, (LABEL:eq:inf-alphabet-tails) becomes
which holds with probability at least . Further, if , then, by (19), we have
and thus
holds with probability at least . ∎
Further, if we assume , then . As a consequence of the Hanson-Wright inequality, see, e.g., [rudelson2013hanson], one can obtain
for all . Combining this with Corollary 2.3, we deduce the relative error
To quantize a neuron in the -th layer using finite alphabets in (LABEL:eq:alphabet-midtread), one can align the input with its analogue by solving for in . Then it suffices to quantize based only on the input . However, unlike the case of infinite alphabets, the choice of the solution is not arbitrary, since we need to bound such that .
Now we describe a detailed procedure for finding a proper , and suppose that . We will use to denote the submatrix of a matrix with columns indexed by and to denote the restriction of a vector to the entries indexed by . By permuting columns if necessary, we can assume with and . Additionally, we set
Then the linear system is equivalent to
(21)
To simplify the problem, we let . Then the original linear system (21) becomes
(22)
where . Since is invertible, the solution of (22) is unique. Moreover, we have
(23)
Note that has independent columns. Further, assume that the row vectors of are isotropic sub-gaussian vectors with . It follows that
(24)
holds with high probability. By the triangle inequality, (23), and (24),
(25)
holds with high probability.
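A minimal numerical sketch of the construction above: split the columns of the data matrix into a square block (invertible for generic data) and the rest, fix the free entries, and solve the square system; the resulting vector then solves the underdetermined system, and its size is governed by the inverse of the chosen block, as in (23)-(25). Setting the free entries to zero is an assumed convenient choice, since the text's exact choice is not recoverable from the extraction.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 20, 60
X = rng.standard_normal((m, n))   # wide data matrix, m < n
b = rng.standard_normal(m)        # right-hand side of the alignment system

# Split the columns: a square block S (assumed invertible, which holds almost surely
# for generic random data) and the remaining columns S^c.
S, Sc = np.arange(m), np.arange(m, n)

v = np.zeros(n)
v[Sc] = 0.0                                             # free entries: an assumed simple choice
v[S] = np.linalg.solve(X[:, S], b - X[:, Sc] @ v[Sc])   # unique solution of the square system, cf. (22)

print(np.linalg.norm(X @ v - b))                        # ~0: v solves the underdetermined system
print(np.max(np.abs(v)), np.linalg.norm(np.linalg.inv(X[:, S]), 2))  # size of v vs. the inverse block, cf. (23)-(25)
```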
Remark 2.4.
According to [rudelson2009smallest], if is an random matrix whose entries are independent copies of a mean-zero sub-gaussian random variable with unit variance, then, for every , we have
where depend (polynomially) only on .
Lemma 2.5.
Let and be nonzero vectors such that
(26)
Then we have
where the orthogonal projection is given by (LABEL:eq:orth-proj-mat).
Proof.
Inequality (26) implies that
Then we have
(27)
Let . It follows from (27) and the definition of that
In the last inequality, we used the numerical inequality for all . Then the result above becomes
(28)
Moreover, by the triangle inequality and (26), we have
This implies that
(29)
where the last inequality is due to . Plugging (29) into (28), we obtain
∎
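The exact constants in Lemma 2.5 are not recoverable from the extracted statement; the following sketch only checks numerically that a small relative perturbation of a vector moves the associated rank-one projection (LABEL:eq:orth-proj-mat) by an amount of the same order, which is the phenomenon the lemma quantifies.

```python
import numpy as np

rng = np.random.default_rng(0)

def proj(x):
    """Rank-one orthogonal projection onto span{x}."""
    return np.outer(x, x) / np.dot(x, x)

x = rng.standard_normal(50)
y = x + 1e-3 * rng.standard_normal(50)       # a small relative perturbation, as in (26)

eps = np.linalg.norm(x - y) / np.linalg.norm(x)
gap = np.linalg.norm(proj(x) - proj(y), 2)   # spectral norm of the projection difference
print(eps, gap)                              # gap is of the same order as eps
```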
3. Archived results
Theorem 3.1.
Let be an -layer neural network as in (LABEL:eq:mlp) where the activation function is for . Suppose that each has i.i.d. entries and are independent. Sample data and quantize using (LABEL:eq:quantization-algorithm) with alphabet where and for all . Fix with . Then
(30)
holds with probability at least . Here, and are defined as in Lemma 1.1.
Proof.
To prove (3.1), we will proceed inductively over layer indices . The case is trivial, since the error bound of quantizing is given by part (a) of LABEL:thm:error-bound-single-layer-infinite. Additionally, by a union bound, the quantization of yields
(31)
with probability at least . Note that function is Lipschitz with Lipschitz constant and that with . Applying LABEL:lemma:Lip-concentration to with and , we obtain
(32) |
Using Jensen’s inequality and the identity where , we have
Applying the inequalities above to (32) and taking the union bound over all neurons in , we obtain that
(33)
holds with probability exceeding .
Now, we consider . Let be the event that both (3) and (33) hold. Conditioning on , we quantize . Since and , LABEL:lemma:cx-gaussian-tail yields
(34)
Using LABEL:thm:error-bound-single-layer-infinite with , , and applying (34), we have that, conditioning on ,
(35)
holds with probability exceeding . Moreover, taking a union bound over (3), (33), and (3), we obtain
with probability at least . In the last inequality, we used and the assumption that .
Finally, (3.1) follows inductively by using the same proof technique we showed above. ∎
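A toy experiment in the spirit of Theorem 3.1 for a two-layer instance, assuming ReLU activations and substituting a plain unbiased stochastic rounding quantizer for the data-driven algorithm (LABEL:eq:quantization-algorithm); it only measures the relative layer-output error that (30) bounds and says nothing about the stated rate.

```python
import numpy as np

rng = np.random.default_rng(0)

def stoc_round(W, delta):
    """Unbiased stochastic rounding of a weight matrix onto the grid delta * Z
    (a stand-in for the data-driven quantizer analyzed in the text)."""
    low = np.floor(W / delta)
    return delta * (low + (rng.random(W.shape) < W / delta - low))

relu = lambda z: np.maximum(z, 0.0)

m, n0, n1, n2 = 256, 64, 128, 32
X = rng.standard_normal((m, n0))
W1 = rng.standard_normal((n0, n1))
W2 = rng.standard_normal((n1, n2))
delta = 0.05

Z_full = relu(relu(X @ W1) @ W2)
Z_quant = relu(relu(X @ stoc_round(W1, delta)) @ stoc_round(W2, delta))

rel_err = np.linalg.norm(Z_full - Z_quant) / np.linalg.norm(Z_full)
print(rel_err)   # the relative layer-output error that (30) controls
```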
Now we pass to the case of finite alphabets using a similar proof technique, except that the weights are assumed to be Gaussian.
Lemma 3.2 (Finite alphabets).
Suppose the weights have i.i.d. entries. If we quantize using (LABEL:eq:quantization-algorithm) with the alphabet defined by (LABEL:eq:alphabet-midtread) and input data , then for every column (neuron) , of ,
(36)
holds with probability at least . Here, and .
Proof.
Fix a neuron for some . Additionally, we have in (LABEL:eq:quantization-algorithm) when . At the -th iteration of quantizing , similar to (LABEL:eq:error-bound-step2-infinite-eq1), (LABEL:eq:error-bound-step2-infinite-eq3), and (LABEL:eq:error-bound-step2-infinite-eq5), we have
where
(37)
To prove (36), we proceed by induction on . If , then , , and . Since , LABEL:lemma:cx-gaussian-tail indicates
Conditioning on the event , we get . Hence, and the proof technique used for the case in LABEL:thm:ut-bound-infinite can be applied here. It implies that with . By LABEL:corollary:ut-bound-infinite, we obtain with .
Next, for , assume that holds where . Since and is independent of , it follows from (37), LABEL:lemma:cx-afine, and LABEL:lemma:cx-independent-sum that
where . It follows from LABEL:lemma:cx-gaussian-tail that
On the event , we can quantize as if the quantizer were over infinite alphabets . So with .
Therefore the induction steps above indicate that
(38)
where . Conditioning on , LABEL:corollary:ut-bound-infinite leads to
So holds with probability exceeding
Setting , we obtain
holds with probability at least . This completes the proof. ∎
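Lemma 3.2 replaces the infinite grid by a finite midtread alphabet. The sketch below assumes the standard form {k*delta : |k| <= K} for (LABEL:eq:alphabet-midtread); rounding then coincides with the infinite-alphabet quantizer as long as the argument stays below the clipping level K*delta, which is exactly the event the proof controls.

```python
import numpy as np

def midtread_alphabet(delta, K):
    """The finite midtread alphabet {k*delta : k = -K, ..., K} (assumed form of
    (LABEL:eq:alphabet-midtread))."""
    return delta * np.arange(-K, K + 1)

def quantize_nearest(z, delta, K):
    """Round to the nearest alphabet element; arguments larger than K*delta in magnitude
    are clipped, which is the event the proof of Lemma 3.2 has to rule out."""
    k = np.clip(np.round(z / delta), -K, K)
    return delta * k

delta, K = 0.25, 4
print(midtread_alphabet(delta, K))
print(quantize_nearest(np.array([0.1, 0.9, 3.0]), delta, K))  # 3.0 is clipped to K*delta = 1.0
```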
Theorem 3.3.
Let be a two-layer neural network as in (LABEL:eq:mlp) where and the activation function is for all . Suppose that each has i.i.d. entries and are independent. Let with ,
(39)
Suppose the input data satisfies
(40)
for and . If we quantize using (LABEL:eq:quantization-algorithm) with alphabet and data , then
(41)
holds with high probability.
Proof.
The proof is organized into four steps. In step 1, we will use the randomness of the weights in the first layer, as well as LABEL:thm:error-bound-finite-first-layer, to control the norm of the difference between and in (3), as well as the deviation of the norm of from its expectation, in (44). By subsequently controlling the expectation, we obtain upper and lower bounds on that hold with high probability in (45). In step 2, we condition on the event that the bounds derived above hold, and control the magnitude of the quantizer’s argument for quantizing the first weight of the second layer. This results in (48), which in turn leads to (49), showing that is dominated in the convex order by an appropriate Gaussian. This forms the base case for an induction argument to control the norm of the error in the second layer. In step 3, we complete the induction argument by dealing with indices , resulting in (57), showing that is also convexly dominated by an appropriate Gaussian. Finally, in step 4, we use these results to obtain an error bound (41) that holds with high probability.
Step 1: Following , we define and
(42)
Given step size , by LABEL:thm:error-bound-finite-first-layer and a union bound, the quantization of yields
(43)
with probability at least
Note that the function is Lipschitz with Lipschitz constant and that with . Applying LABEL:lemma:Lip-concentration to with and , we obtain
(44)
Using Jensen’s inequality and the identity where , we have
Additionally, LABEL:prop:expect-relu-gaussian implies that
Applying the inequalities above to (44) and taking the union bound over all neurons in , we obtain that the inequality
(45)
holds uniformly for all with probability exceeding .
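Step 1 relies on the expectation identity for the ReLU of a Gaussian (LABEL:prop:expect-relu-gaussian), whose extracted statement is incomplete here. Assuming ReLU activations and standard Gaussian weights, it reduces to E||ReLU(Xw)||^2 = ||X||_F^2 / 2, which the following Monte Carlo sketch checks.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 40
X = rng.standard_normal((m, n)) / np.sqrt(n)

# For w ~ N(0, I_n), each coordinate (Xw)_i is N(0, ||X_i||^2), and
# E[relu(g)^2] = sigma^2 / 2 for g ~ N(0, sigma^2), so E||relu(Xw)||^2 = ||X||_F^2 / 2.
w = rng.standard_normal((n, 20000))
vals = np.sum(np.maximum(X @ w, 0.0) ** 2, axis=0)
print(vals.mean(), 0.5 * np.linalg.norm(X, 'fro') ** 2)   # the two numbers agree closely
```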
Step 2: Let be the event that both (3) and (45) hold. Conditioning on , we quantize the second layer . Fix a neuron for some . At the -th iteration of quantizing , similar to (LABEL:eq:error-bound-step2-infinite-eq1), (LABEL:eq:error-bound-step2-infinite-eq2), and (LABEL:eq:error-bound-step2-infinite-eq3), we have
(46)
To prove (41), we proceed by induction on . Let . In this case, due to , we have
and . Additionally, (3), (45), and (40) imply that
(47)
Using the Cauchy-Schwarz inequality and (3), we obtain . Thus we can apply LABEL:lemma:cx-normal and deduce . By LABEL:lemma:cx-gaussian-tail, we get
(48)
Conditioning on the event , we have and the proof technique used for the case in LABEL:thm:ut-bound-infinite still works here. LABEL:corollary:ut-bound-infinite implies that, conditioning on , we have
(49)
Step 3: Next, for , we assume
(50)
where
Note that the randomness in (50) comes from the stochastic quantizer . Due to the independence of the weights , we have
(51)
with
Applying LABEL:lemma:cx-sum to (50) and (51), we obtain
Additionally, it follows from LABEL:lemma:cx-afine and LABEL:lemma:cx-independent-sum that
Since for all orthogonal projections and for all vectors , we have
(52)
In the last inequality, we applied (3) and (40). Moreover, using (3), (45), and (40), we get
(53)
Combining (3) with (3), we obtain
(54)
Further, using (45) and (40), we have
(55)
Combining (3), (3), and (3), we obtain
Then it follows from (46) that
(56)
By (3), (45), and (40), we have
Plugging the result above into (56), we get
One can deduce from LABEL:lemma:cx-gaussian-tail that
On the event , we can quantize as if the quantizer were over infinite alphabets . So, conditioning on this event, . Hence conditioning on , namely the event that both (3) and (45) hold, the induction steps above indicate that
(57)
holds with probability at least
Step 4: Conditioning on , applying LABEL:lemma:cx-gaussian-tail with , and taking a union bound over all neurons in ,
(58)
holds with probability at least
Considering the conditioned event , (58) holds with probability exceeding
where . Note that
and
[**JZ: Bottleneck terms are highlighted in red.] We can set . Another bottleneck term implies that . Additionally, we choose , , and . Then we deduce that and . Therefore, we have
with high probability. ∎
Since for all , we have . It follows that
Thus, the relative error of quantizing the second layer is given by
[**JZ: All results in this section serve as references which will be modified.] Given the result (LABEL:eq:inf-alphabet-tails), one can bound the quantization error for the first layer without approximating because in this case. See the following example for details.
Theorem 3.4 (Quantization error bound for the first layer).
Suppose we quantize the weights in the first layer using (LABEL:eq:quantization-algorithm) with alphabet , step size , and input data .
(1) For every neuron that is a column of , the quantization error satisfies
(59)
with probability exceeding .
(2) Let denote the quantized output for all neurons. Then
(60)
holds with probability at least .
Here, is defined by (LABEL:eq:def-sigma).
Proof.
(1) Since we perform parallel and independent quantization for all neurons in the same layer, it suffices to consider for some . Additionally, by (LABEL:eq:layer-input), we have . According to LABEL:corollary:ut-bound-infinite, quantization of guarantees that
holds with probability exceeding . Here, with , and . Since , by induction on , it is easy to verify that for all . It follows that
holds with probability at least . In particular, let and . Then
holds with probability at least .
(2) Note that (59) is valid for every neuron . By taking a union bound over all neurons,
holds with probability at least . ∎
Moreover, we can obtain the relative error of quantization by estimating . For example, assume that neuron is an isotropic random vector and data is generic in the sense that . Then combining the fact that with (59), we deduce that the following inequality holds with high probability.
Further, if all neurons are isotropic, then . It follows from (60) that
Note that the error bounds above are identical to the error bounds derived in [zhang2022post], where was assumed random and was deterministic.
Theorem 3.5.
Let be a two-layer neural network as in (LABEL:eq:mlp) where and the activation function is for all . Suppose that each has i.i.d. entries and are independent. If we quantize using (LABEL:eq:quantization-algorithm) with alphabet , and input data , then for ,
holds with probability at least .
Proof.
By quantizing the weights of the first layer, one can deduce the following error bound as in Theorem 3.4:
(61)
Since , , and for any , it follows from (61) that, for ,
(62)
holds with probability at least . Additionally, one can show that, for each ,
(63)
holds with probability exceeding . [**JZ: Will add a lemma to prove the above inequality later.] Let denote the event that (62) and (63) hold uniformly for all . By taking a union bound, we have .
Next, conditioning on , we quantize the second layer using (LABEL:eq:quantization-algorithm) with data . Applying LABEL:corollary:ut-bound-infinite with and , for each neuron and its quantized counterpart , we have that
(64)
holds with probability exceeding . Here, according to (LABEL:eq:def-mu), we have
On the event , the triangle inequality, (62), and (63) yield
hold uniformly for all . Hence, (64) becomes
(65)
with probability (conditioning on ) at least .
Now, to bound the quantization error using the triangle inequality, it suffices to control . Since , we get with defined as follows
(66)
Then . Applying the Hanson-Wright inequality, see e.g. [rudelson2013hanson] and [vershynin2018high], we obtain for all that
(67)
It remains to evaluate and . On the event ,
(68)
holds uniformly for all . Moreover, since and , we have
Here, the first inequality holds because for any positive semidefinite matrix and the second inequality is due to for any orthogonal projection . Plugging (68) into the result above, we get
(69)
Further, since for all orthogonal projections and for all vectors , we obtain
Again, by (68), the inequality above becomes
(70)
Combining (67), (69), and (70), and choosing , one can get
(71)
with probability (conditioning on ) at least . It follows from (3), (71), and that
holds with probability at least . ∎