
Logit Standardization in Knowledge Distillation

First Author
Institution1
Institution1 address
firstauthor@i1.org

Second Author
Institution2
First line of institution2 address
secondauthor@i2.org
Given a training sample $\mathbf{x}_n$, the teacher and student produce logit vectors

\[ \mathbf{v}_n = f_T(\mathbf{x}_n), \qquad \mathbf{z}_n = f_S(\mathbf{x}_n). \]

The shift and scale used for standardization are the per-sample mean and standard deviation of each logit vector,

\[ a_T = \overline{\mathbf{v}}_n = \frac{1}{K}\sum_{k=1}^{K}\mathbf{v}_n^{(k)}, \qquad a_S = \overline{\mathbf{z}}_n = \frac{1}{K}\sum_{k=1}^{K}\mathbf{z}_n^{(k)}, \]
\[ b_T = \sigma(\mathbf{v}_n) = \Big[\frac{1}{K}\sum_{k=1}^{K}\big(\mathbf{v}_n^{(k)} - \overline{\mathbf{v}}_n\big)^2\Big]^{1/2}, \qquad b_S = \sigma(\mathbf{z}_n) = \Big[\frac{1}{K}\sum_{k=1}^{K}\big(\mathbf{z}_n^{(k)} - \overline{\mathbf{z}}_n\big)^2\Big]^{1/2}, \]

yielding the standardized soft predictions

\[ q(\mathbf{v}_n) = \mathrm{softmax}\big[(\mathbf{v}_n - a_T)/b_T/\tau\big], \qquad q(\mathbf{z}_n) = \mathrm{softmax}\big[(\mathbf{z}_n - a_S)/b_S/\tau\big], \]

while the cross-entropy term uses the unstandardized student prediction

\[ q'(\mathbf{z}_n) = \mathrm{softmax}(\mathbf{z}_n). \]

The student $f_S$ is then updated to minimize the weighted sum of the cross-entropy term and the distillation term $\lambda_{KD}\tau^{2}\mathcal{L}\big(q(\mathbf{v}_n), q(\mathbf{z}_n)\big)$:

\[ \lambda_{CE}\,\mathcal{L}_{CE}\big(y_n, q'(\mathbf{z}_n)\big) + \lambda_{KD}\,\tau^{2}\,\mathcal{L}\big(q(\mathbf{v}_n), q(\mathbf{z}_n)\big). \]
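As a quick sanity check on the pre-process (an observation added here, not text from the excerpt): writing $\widetilde{\mathbf{z}}_n = (\mathbf{z}_n - a_S)/b_S$ for the standardized student logits, the Z-score guarantees zero mean and unit variance by construction, so the base temperature $\tau$ alone controls the softness of $q(\mathbf{z}_n)$:

\[ \frac{1}{K}\sum_{k=1}^{K}\widetilde{\mathbf{z}}_n^{(k)} = \frac{\overline{\mathbf{z}}_n - a_S}{b_S} = 0, \qquad \frac{1}{K}\sum_{k=1}^{K}\big(\widetilde{\mathbf{z}}_n^{(k)}\big)^{2} = \frac{1}{K}\sum_{k=1}^{K}\frac{\big(\mathbf{z}_n^{(k)} - \overline{\mathbf{z}}_n\big)^{2}}{\sigma(\mathbf{z}_n)^{2}} = 1. \]

The same holds for the teacher logits with $a_T$ and $b_T$.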

Input: Transfer set $\mathcal{D}$ of image-label pairs $\{\mathbf{x}_n, y_n\}_{n=1}^{N}$, number of classes $K$, base temperature $\tau$, teacher $f_T$, student $f_S$, loss $\mathcal{L}$ (e.g., KL divergence $\mathcal{L}_{\mathrm{KL}}$)
Output: Trained student model $f_S$

foreach $(\mathbf{x}_n, y_n)$ in $\mathcal{D}$ do
    1. $\mathbf{v}_n = f_T(\mathbf{x}_n)$,  $\overline{\mathbf{v}}_n = \frac{1}{K}\sum_{k=1}^{K}\mathbf{v}_n^{(k)}$
    2. $\mathbf{z}_n = f_S(\mathbf{x}_n)$,  $\overline{\mathbf{z}}_n = \frac{1}{K}\sum_{k=1}^{K}\mathbf{z}_n^{(k)}$
    3. $\sigma(\mathbf{v}_n) = \big[\frac{1}{K}\sum_{k=1}^{K}\big(\mathbf{v}_n^{(k)} - \overline{\mathbf{v}}_n\big)^{2}\big]^{1/2}$
    4. $\sigma(\mathbf{z}_n) = \big[\frac{1}{K}\sum_{k=1}^{K}\big(\mathbf{z}_n^{(k)} - \overline{\mathbf{z}}_n\big)^{2}\big]^{1/2}$
    5. $q(\mathbf{v}_n) = \mathrm{softmax}\big[(\mathbf{v}_n - \overline{\mathbf{v}}_n)/\sigma(\mathbf{v}_n)/\tau\big]$
    6. $q(\mathbf{z}_n) = \mathrm{softmax}\big[(\mathbf{z}_n - \overline{\mathbf{z}}_n)/\sigma(\mathbf{z}_n)/\tau\big]$
    7. $q'(\mathbf{z}_n) = \mathrm{softmax}(\mathbf{z}_n)$
    8. Update $f_S$ towards minimizing $\lambda_{CE}\mathcal{L}_{CE}\big(y_n, q'(\mathbf{z}_n)\big) + \lambda_{KD}\tau^{2}\mathcal{L}\big(q(\mathbf{v}_n), q(\mathbf{z}_n)\big)$
end foreach

Algorithm 1: $\mathcal{Z}$-score logit standardization pre-process in knowledge distillation.
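For concreteness, the following is a minimal PyTorch sketch of the loss computation in Algorithm 1, using KL divergence as $\mathcal{L}$. It is an illustration under stated assumptions, not the paper's released implementation: the function names (`standardize`, `kd_step_loss`), the `eps` guard against zero variance, and the default hyper-parameter values are all choices made here.

```python
import torch
import torch.nn.functional as F

def standardize(logits: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Z-score over the class dimension: (logits - mean) / std, per sample."""
    mean = logits.mean(dim=-1, keepdim=True)
    # unbiased=False matches the 1/K normalization in Algorithm 1
    std = logits.std(dim=-1, keepdim=True, unbiased=False)
    return (logits - mean) / (std + eps)  # eps is a numerical guard added here

def kd_step_loss(v: torch.Tensor, z: torch.Tensor, y: torch.Tensor,
                 tau: float = 2.0, lambda_ce: float = 1.0,
                 lambda_kd: float = 1.0) -> torch.Tensor:
    """One step's objective: lambda_CE * CE(y, q'(z)) + lambda_KD * tau^2 * KL(q(v) || q(z)).

    v: teacher logits (N, K), assumed computed under torch.no_grad();
    z: student logits (N, K); y: integer labels (N,).
    The hyper-parameter defaults are placeholders, not the paper's settings.
    """
    # q(v_n), q(z_n): softmax of standardized logits at base temperature tau
    q_v = F.softmax(standardize(v) / tau, dim=-1)
    log_q_z = F.log_softmax(standardize(z) / tau, dim=-1)
    # q'(z_n) = softmax(z_n): the cross-entropy term uses the raw student logits
    ce = F.cross_entropy(z, y)
    # F.kl_div expects log-probabilities as input and probabilities as target
    kl = F.kl_div(log_q_z, q_v, reduction="batchmean")
    return lambda_ce * ce + lambda_kd * tau ** 2 * kl
```

In a training loop one would compute the teacher logits under `torch.no_grad()` (so only the student receives gradients), evaluate `kd_step_loss(v, z, y)`, and backpropagate through the student, matching step 8 of Algorithm 1.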