
Supplementary Materials for MST-compression: Compressing and Accelerating Binary Neural Networks

Appendix A Proof of Eq. (2)

According to the definition of convolution, the output of channel i is computed as

Y_{i}=\left(\sum_{j=1}^{C_{in}MM}\mathcal{A}^{b}_{ij}*\mathcal{W}^{b}_{ij}\right)\odot\alpha (8)

There are C_{in}\times M\times M multiplications, and each multiplication outputs -1 or 1. Assuming that there are A multiplications with output 1 and B multiplications with output -1, we have A+B=C_{in}\times M\times M. Thus, Y_{i} can be derived as

Y_{i}=A-B=2A-C_{in}\times M\times M. (A1)

In addition, because \mathcal{A}^{b}_{ij} and \mathcal{W}^{b}_{ij} are binarized, XNOR(\mathcal{A}^{b}_{ij},\mathcal{W}^{b}_{ij}) equals 1 exactly when the two operands agree, i.e., when their product is 1. Hence, A can be calculated as

A=\sum_{j=1}^{C_{in}MM}\operatorname*{XNOR}(\mathcal{A}^{b}_{ij},\mathcal{W}^{b}_{ij}). (A2)

Finally, we obtain Eq. (2) by substituting Eq. (A2) for A in Eq. (A1):

Y_{i}=\left(2\sum_{j=1}^{C_{in}MM}\operatorname*{XNOR}(\mathcal{A}^{b}_{ij},\mathcal{W}^{b}_{ij})-C_{in}\times M\times M\right)\odot\alpha. (2)
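This identity is easy to check numerically. Below is a minimal sketch (ours, not from the paper) that compares the direct ±1 dot product against the XNOR-popcount form of Eq. (2) on random data; the layer size, the value of \alpha, and the use of NumPy are illustrative assumptions.

```python
import numpy as np

# Sanity check of Eq. (2): for +/-1 activations and weights, the scaled
# dot product equals alpha * (2 * popcount(XNOR) - N), where the XNOR
# popcount counts positions where activation and weight agree.
rng = np.random.default_rng(0)
N = 64 * 3 * 3                    # C_in * M * M (hypothetical layer size)
a = rng.choice([-1, 1], size=N)   # binarized activations A^b
w = rng.choice([-1, 1], size=N)   # binarized weights W^b
alpha = 0.37                      # per-channel scaling factor (arbitrary)

direct = alpha * np.sum(a * w)         # left-hand side: +/-1 convolution sum
A_match = np.sum(a == w)               # A in Eq. (A2): number of +1 products
xnor_form = alpha * (2 * A_match - N)  # right-hand side: Eq. (2)
assert np.isclose(direct, xnor_form)
```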

Appendix B Proof of Eq. (3)

Assume that \mathcal{S} contains the weight values of channel j that are identical to the corresponding weights of channel i (compared one-to-one), and that \mathcal{D} contains the weight values of channel j that differ from the corresponding weights of channel i, with |\mathcal{D}|=d_{ij}. Let \mathcal{A}^{b}_{s} and \mathcal{A}^{b}_{d} denote the input activations associated with \mathcal{S} and \mathcal{D}, respectively. Then P_{j} can be written as

P_{j}=\sum_{\mathcal{W}_{s}\in\mathcal{S}}\operatorname*{XNOR}(\mathcal{A}^{b}_{s},\mathcal{W}_{s})+\sum_{\mathcal{W}_{d}\in\mathcal{D}}\operatorname*{XNOR}(\mathcal{A}^{b}_{d},\mathcal{W}_{d}). (A3)

Because the input activations of channel i are the same as those of channel j, the first sum above also appears in P_{i}. Let \bar{\mathcal{D}} contain the weights of channel i that differ from those of channel j, so that |\mathcal{D}|=|\bar{\mathcal{D}}|=d_{ij}. We can then express P_{i} as

P_{i}=\sum_{\mathcal{W}_{s}\in\mathcal{S}}\operatorname*{XNOR}(\mathcal{A}^{b}_{s},\mathcal{W}_{s})+\sum_{\bar{\mathcal{W}}_{d}\in\bar{\mathcal{D}}}\operatorname*{XNOR}(\mathcal{A}^{b}_{d},\bar{\mathcal{W}}_{d}). (A4)

In consequence, \sum_{\mathcal{W}_{s}\in\mathcal{S}}\operatorname*{XNOR}(\mathcal{A}^{b}_{s},\mathcal{W}_{s}) can be calculated as

\sum_{\mathcal{W}_{s}\in\mathcal{S}}\operatorname*{XNOR}(\mathcal{A}^{b}_{s},\mathcal{W}_{s})=P_{i}-\sum_{\bar{\mathcal{W}}_{d}\in\bar{\mathcal{D}}}\operatorname*{XNOR}(\mathcal{A}^{b}_{d},\bar{\mathcal{W}}_{d}), (A5)

and, for all input activations, based on the characteristics of the XNOR operation (each \bar{\mathcal{W}}_{d} is the sign-flipped counterpart of the corresponding \mathcal{W}_{d}, so exactly one XNOR in each of the d_{ij} pairs equals 1), we have

\sum_{\mathcal{W}_{d}\in\mathcal{D}}\operatorname*{XNOR}(\mathcal{A}^{b}_{d},\mathcal{W}_{d})+\sum_{\bar{\mathcal{W}}_{d}\in\bar{\mathcal{D}}}\operatorname*{XNOR}(\mathcal{A}^{b}_{d},\bar{\mathcal{W}}_{d})=d_{ij}. (A6)

Using Eq. (A5) and Eq. (A6), we can reformulate Eq. (A3) as

P_{j}=P_{i}-d_{ij}+2\sum_{\mathcal{W}_{d}\in\mathcal{D}}\operatorname*{XNOR}(\mathcal{A}^{b}_{d},\mathcal{W}_{d}). (A7)

In Sec. 2, we define P_{ij}=\sum_{\mathcal{W}_{d}\in\mathcal{D}}\operatorname*{XNOR}(\mathcal{A}^{b}_{d},\mathcal{W}_{d}), so Eq. (A7) gives P_{j}=P_{i}-d_{ij}+2P_{ij}. Substituting this into the XNOR-based output of Eq. (2) (with the scaling factor omitted), we finally have the following equation.

Y_{j}=2(P_{i}-d_{ij}+2P_{ij})-C_{in}\times M\times M. (3)
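As a further sanity check, the sketch below (again ours, with the same illustrative NumPy setup) builds channel j by flipping d_{ij} randomly chosen weights of channel i and confirms that the incremental formula of Eq. (3) reproduces the direct ±1 dot product.

```python
import numpy as np

# Sanity check of Eq. (3): compute P_i once for channel i, then obtain
# Y_j for a similar channel j from d_ij and P_ij alone.
rng = np.random.default_rng(1)
N = 64 * 3 * 3                          # C_in * M * M (hypothetical)
a = rng.choice([-1, 1], size=N)         # shared binarized input activations
w_i = rng.choice([-1, 1], size=N)       # weights of channel i
w_j = w_i.copy()
diff = rng.choice(N, size=40, replace=False)
w_j[diff] *= -1                         # channel j differs at d_ij positions

d_ij = int(np.sum(w_i != w_j))            # |D| = |D-bar| = d_ij
P_i = int(np.sum(a == w_i))               # XNOR popcount of channel i
P_ij = int(np.sum(a[diff] == w_j[diff]))  # XNOR over the differing weights D

Y_j_direct = int(np.sum(a * w_j))             # direct +/-1 dot product
Y_j_eq3 = 2 * (P_i - d_ij + 2 * P_ij) - N     # Eq. (3)
assert Y_j_direct == Y_j_eq3
```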

Appendix C Additional results

Effect of the number of centers. In this section, we provide additional experimental results on the effect of the number of initial centers used for training. In particular, we train the VGG-small model on the CIFAR-10 dataset with different numbers of centers, while \lambda is fixed at 4e-6. For each number of centers, we run the training three times and report the mean value.

Table A1 reports the MST depth, number of parameters, bit-ops, and accuracy w.r.t. the number of centers. The MST depth, number of parameters, and bit-ops all tend to increase as the number of centers increases. Specifically, when the number of centers grows from 1 to 8, the MST depth increases by 3.27\times, while the number of parameters and bit-ops increase by 1.11\times and 1.14\times, respectively. Meanwhile, accuracy barely changes across different numbers of centers. As shown in Figure A1, the same trend holds per layer: for each binary convolution layer, both the MST depth and the number of parameters grow with the number of centers. These findings suggest that opting for a single center is the most effective strategy to minimize MST depth, parameters, and bit-ops while preserving accuracy.

#centers   MST depth   #Params (Mbit)   #Bit-Ops (GOps)   Top-1 Acc. mean ± std (%)
1          22.3        0.545            0.119             91.49 ± 0.04
2          30.3        0.550            0.118             91.45 ± 0.08
4          47.7        0.574            0.125             91.42 ± 0.06
6          60.3        0.581            0.130             91.53 ± 0.07
8          73.0        0.607            0.136             91.49 ± 0.04
Table A1: Accuracy, MST depth, number of parameters, and bit-ops w.r.t. different numbers of centers for the VGG-small model on CIFAR-10.
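For reference, the multipliers quoted above follow directly from the 1-center and 8-center rows of Table A1; a quick check:

```python
# Ratios of the 8-center row to the 1-center row in Table A1.
print(f"MST depth: {73.0 / 22.3:.2f}x")    # 3.27x
print(f"#Params:   {0.607 / 0.545:.2f}x")  # 1.11x
print(f"#Bit-Ops:  {0.136 / 0.119:.2f}x")  # 1.14x
```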
(Figure A1 panels: Params (Mbits) and MST depth vs. binary convolution layer, with curves for the 1-, 2-, 4-, 6-, and 8-center settings.)
Figure A1: Number of parameters and MST depth on each convolution layer w.r.t. different numbers of centers.