
A Study of Neural Collapse Phenomenon: Grassmannian Frame, Symmetry and Generalization

Peifeng Gao 1, Qianqian Xu 2, Peisong Wen 1, 2,
Huiyang Shao 1, 2, Zhiyong Yang 1, Qingming Huang 1, 2, 3, 4
1 School of Computer Science and Tech., University of Chinese Academy of Sciences
2 Key Lab of Intell. Info. Process., Inst. of Comput. Tech., CAS
3 BDKM, University of Chinese Academy of Sciences
4 Peng Cheng Laboratory
Corresponding author.
Abstract

In this paper, we extend the original Neural Collapse phenomenon by proving the Generalized Neural Collapse hypothesis. We derive the Grassmannian Frame structure from the optimization and generalization of classification. This structure maximally separates the features of every two classes on a sphere and does not require the feature dimension to exceed the number of classes. Out of curiosity about the symmetry of the Grassmannian Frame, we conduct experiments to explore whether models with different Grassmannian Frames have different performance. As a result, we discover the Symmetric Generalization phenomenon. We provide a theorem explaining the Symmetric Generalization of permutation. However, the question of why different directions of features lead to such different generalization remains open for future investigation.

1 Introduction

Consider classification problems. Researchers with good statistical and linear-algebra training may believe that the features learned by deep neural networks, which are flowing tensors in deep models, are very different and vary randomly depending on the classification task. However, Papyan et al. (2020) discovered a phenomenon called Neural Collapse (NeurCol) that challenges this expectation. NeurCol arises in the standard training paradigm for classification: minimizing the cross-entropy loss with stochastic gradient descent (SGD) on deep models. During the Terminal Phase of Training (TPT), NeurCol shows that the last-layer features and the linear classifier belonging to the same class collapse to one point and form a symmetrical geometric structure called the Simplex ETF. This phenomenon is beautiful due to its surprising symmetry. However, existing conclusions on NeurCol Papyan et al. (2020); Ji et al. (2022); Fang et al. (2021); Zhu et al. (2021); Zhou et al. (2022a); Yaras et al. (2022a); Han et al. (2022); Tirer and Bruna (2022); Zhou et al. (2022b); Mixon et al. (2020); Poggio and Liao (2020); Graf et al. (2021); Yang et al. (2022) are not universal enough. Specifically, they require the feature dimension to be larger than the number of classes, as the Simplex ETF only exists in this case. In the other case, where the feature dimension is smaller than the number of classes, the structure learned by deep models is still unclear.

Lu and Steinerberger (2022) and Liu et al. (2023) provide a preliminary answer to this question. Lu and Steinerberger (2022) prove that when the number of classes tends to infinity, the features of different classes distribute uniformly on the hypersphere. Further, Liu et al. (2023) propose the Generalized Neural Collapse hypothesis, which states that if the number of classes is larger than the feature dimension, the inter-class features and classifiers will be maximally distant on a hypersphere, a property they refer to as Hyperspherical Uniformity.

First Contribution Our first contribution is the theoretical confirmation of the Generalized Neural Collapse hypothesis. We derive the Grassmannian Frame from three perspectives: optimal codes in coding theory, and optimization and generalization in classification problems. As a more general version of the ETF, the Grassmannian Frame exists for any combination of class number and feature dimension. Additionally, it exhibits the minimized maximal correlation property, which is precisely the Hyperspherical Uniformity property.

The study conducted by Yang et al. (2022) is relevant to the next part of our work. They argue that deep models can learn features with any direction, and thus fixing the last layer as an ETF during training can lead to satisfactory performance. However, we disprove this argument by discovering a new phenomenon.

Second Contribution Our second contribution is the discovery of a new phenomenon called Symmetric Generalization. Our motivation for it stems from the two invariances of the Grassmannian Frame, namely, rotation invariance and permutation invariance. We observe that models that have learned the Grassmannian Frame with different rotations and permutations exhibit very different generalization performances, even if they have achieved the best performance on the training set.

2 Preliminary

2.1 Neural Collapse

Papyan et al. (2020) conducted extensive experiments to reveal the NeurCol phenomenon. This phenomenon occurs during the Terminal Phase of Training (TPT), which starts from the epoch at which the training accuracy reaches $100\%$. During TPT, the training error is zero, but the cross-entropy loss keeps decreasing. To describe this phenomenon more clearly, we first introduce several necessary notations. We denote the number of classes as $C$ and the feature dimension as $d$. Here, we consider classifiers of the form $\mathrm{logit}=\boldsymbol{M}^{T}\boldsymbol{z}=[\langle M_{1},\boldsymbol{z}\rangle,\dots,\langle M_{C},\boldsymbol{z}\rangle]^{T}$, where $\boldsymbol{M}\in\mathbb{R}^{d\times C}$ is the linear classifier and $\boldsymbol{z}$ is the feature of a sample obtained from a deep feature extractor. The classification result is given by selecting the maximum entry of the $\mathrm{logit}$. Given a balanced dataset, we denote the feature of the $i$-th sample in the $y$-th category as $\boldsymbol{z}_{y,i}$. Specifically, when the model is trained on a balanced dataset, its last layer converges to the following manifestations (a small diagnostic sketch for measuring them follows the list):

  • NC1

    Variability Collapse All samples belonging to the same class converge to their class mean: $\|\boldsymbol{z}_{y,i}-\bar{\boldsymbol{z}}_{y}\|\rightarrow 0,\ \forall y,\forall i$, where $\bar{\boldsymbol{z}}_{y}=\text{Ave}_{i}\left(\boldsymbol{z}_{y,i}\right)$ denotes the class center of the $y$-th class;

  • NC2

    Convergence to Self Duality The features and the classifier belonging to the same class converge to the same point: $\|\boldsymbol{z}_{y,i}-M_{y}\|\rightarrow 0,\ \forall y,\forall i$;

  • NC3

    Convergence to Simplex ETF The classifier weights converge to the vertices of a Simplex ETF;

  • NC4

    Nearest Classification The learned classifier behaves like a nearest class-center classifier, i.e., $\arg\max_{y}\langle M_{y},\boldsymbol{z}\rangle\rightarrow\arg\min_{y}\|\boldsymbol{z}-\bar{\boldsymbol{z}}_{y}\|$.
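For concreteness, the following minimal NumPy sketch shows how NC1, NC2, and NC4 could be measured from extracted last-layer features; the array layout and function name are our own illustrative choices, and NC3 would additionally require comparing the Gram matrix of $\boldsymbol{M}$ against the Simplex ETF structure.

```python
import numpy as np

def neurcol_diagnostics(Z, M):
    """Measure NC1, NC2 and NC4 from last-layer features and the linear classifier.
    Z: (C, n, d) features grouped by class, Z[y, i] = z_{y,i};  M: (d, C) with columns M_y."""
    C, n, d = Z.shape
    means = Z.mean(axis=1)                                        # class means z_bar_y, shape (C, d)
    nc1 = np.linalg.norm(Z - means[:, None, :], axis=-1).mean()   # NC1: within-class variability
    nc2 = np.linalg.norm(means - M.T, axis=-1).mean()             # NC2: distance to the dual classifier
    logits = Z @ M                                                # logits[y, i, c] = <M_c, z_{y,i}>
    dists = np.linalg.norm(Z[:, :, None, :] - means[None, None], axis=-1)
    nc4 = (logits.argmax(-1) == dists.argmin(-1)).mean()          # NC4: agreement with nearest class mean
    return nc1, nc2, nc4
```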

In NC3, the Simplex ETF is an interesting structure. Note that there exist two different objects: the Simplex ETF and the ETF. The ETF is rooted in Frame Theory (see the next subsection), while Papyan et al. (2020) introduced the Simplex ETF as a new definition in the context of the NeurCol phenomenon. They extended the original definition of the ETF by introducing an orthogonal projection matrix and a scale factor. Here, we provide the definition of the Simplex ETF:

Definition 2.1 (Simplex Equiangular Tight Frame Papyan et al. (2020)).

A Simplex ETF is a collection of points in $\mathbb{R}^{d}$ specified by the columns of

$$\boldsymbol{M}^{\star}=\alpha R\sqrt{\frac{C}{C-1}}\left(I-\frac{1}{C}\mathbb{I}\mathbb{I}^{T}\right)$$

where $I\in\mathbb{R}^{C\times C}$ is the identity matrix, $\mathbb{I}\in\mathbb{R}^{C}$ is the all-one vector, $R\in\mathbb{R}^{d\times C}\ (d\geq C)$ is an orthogonal projection matrix, and $\alpha\in\mathbb{R}$ is a scale factor.
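As an illustration of Definition 2.1, the following NumPy sketch constructs a Simplex ETF for a given $d\geq C$; the particular choice of $R$ (the orthonormal factor of a random Gaussian matrix) and $\alpha=1$ are our own assumptions, since any orthogonal projection and scale are allowed.

```python
import numpy as np

def simplex_etf(d, C, alpha=1.0, seed=0):
    """Columns of a Simplex ETF in R^d following Definition 2.1 (requires d >= C)."""
    assert d >= C, "a Simplex ETF needs the feature dimension to be at least the class number"
    rng = np.random.default_rng(seed)
    R, _ = np.linalg.qr(rng.standard_normal((d, C)))      # d x C matrix with orthonormal columns
    core = np.sqrt(C / (C - 1)) * (np.eye(C) - np.ones((C, C)) / C)
    return alpha * R @ core

M = simplex_etf(d=8, C=5)
G = M.T @ M                                               # Gram matrix of the C columns
off_diag = G[~np.eye(5, dtype=bool)]
print(np.allclose(off_diag, off_diag[0]))                 # True: the Equiangular property holds
```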

The Simplex ETF is a highly symmetric structure, as every pair of vectors has the same angle (the Equiangular property). However, it has a limitation: it only exists when the feature dimension $d$ is larger than the number of classes $C$. Recently, Liu et al. (2023) removed this restriction by proposing the Generalized Neural Collapse hypothesis. In their hypothesis, Hyperspherical Uniformity is introduced to generalize the Equiangular property. Hyperspherical Uniformity means that the features of every class are distributed uniformly on a hypersphere with maximal distance.

2.2 Frame Theory

We have found that the Grassmannian Frame, a mathematical object from Frame Theory, is a suitable candidate for meeting Hyperspherical Uniformity. Frame Theory is a fundamental research area in mathematics Casazza and Kutyniok (2012) that provides a framework for the analysis of signal transmission and reconstruction. In the communication field, certain frame properties have been shown to be optimal configurations in various transmission problems. For instance, Uniform and Tight frames are optimal codes in the erasure problem Casazza and Kovačević (2003) and in non-orthogonal communication schemes Ambrus et al. (2021). The Grassmannian Frame Holmes and Paulsen (2004); Strohmer and Heath Jr (2003), as a more specialized example, not only satisfies the Uniform and Tight properties but also the minimized maximal correlation property. This property makes us confident that the Grassmannian Frame satisfies Hyperspherical Uniformity.

Here, we provide a series of definitions in Frame Theory (Frame Theory has more general definitions in Hilbert space; the definitions provided here are based on Euclidean space. See Strohmer and Heath Jr (2003) for the general definitions):

Definition 2.2 (Frame).

In Euclidean space $\mathbb{R}^{d}$, a frame is a sequence of bounded vectors $\{\zeta_{i}\}_{i=1}^{C}$.

Definition 2.3 (Uniform Property and Unit Norm).

Given a frame $\{\zeta_{i}\}_{i=1}^{C}$ in $\mathbb{R}^{d}$, it is a uniform frame if the norm of every vector is equal. Further, it is a unit norm frame if the norm of every vector equals 1.

Definition 2.4 (Tight Property).

Given a frame $\{\zeta_{i}\}_{i=1}^{C}$ in $\mathbb{R}^{d}$, it is a tight frame if the rank of its analysis matrix $\left[\zeta_{1},\dots,\zeta_{C}\right]$ is $d$.

Definition 2.5 (Maximal Frame Correlation).

Given a unit norm frame $\{\zeta_{i}\}_{i=1}^{C}$, the maximal correlation is defined as $\mathcal{M}(\{\zeta_{i}\}_{i=1}^{C})=\max_{i,j,i\neq j}\{|\langle\zeta_{i},\zeta_{j}\rangle|\}$.

We can now define the Grassmannian frame:

Definition 2.6 (Grassmannian Frame).

A frame $\{\zeta_{i}\}_{i=1}^{C}$ in $\mathbb{R}^{d}$ is a Grassmannian frame if it is a solution of $\min\{\mathcal{M}(\{\zeta_{i}\}_{i=1}^{C})\}$, where the minimum is taken over all unit norm frames in $\mathbb{R}^{d}$.

We also introduce Equiangular Tight Frame (ETF).

Definition 2.7 (Equiangular Property).

Given a unit norm frame $\{\zeta_{i}\}_{i=1}^{C}$, it is an equiangular frame if $|\langle\zeta_{i},\zeta_{j}\rangle|=c,\ \forall i,j\ \text{with}\ i\neq j$, for some constant $c\geq 0$.

Definition 2.8 (Equiangular Tight Frame).

An ETF is a frame that is both Equiangular and Tight.

The following theorem relates ETF and Grassmannian Frame:

Theorem 2.9 (Welch Bound Welch (1974)).

Given any unit norm frame $\{\zeta_{i}\}_{i=1}^{C}$ in $\mathbb{R}^{d}$, we have

$$\mathcal{M}(\{\zeta_{i}\}_{i=1}^{C})\geq\sqrt{\frac{C-d}{d(C-1)}}.$$

Further, $\mathcal{M}(\{\zeta_{i}\}_{i=1}^{C})$ reaches the right-hand side if and only if $\{\zeta_{i}\}_{i=1}^{C}$ is an Equiangular Tight Frame, which can exist only if $C\leq\frac{d(d+1)}{2}$.

This theorem shows that the ETF is a special case of the Grassmannian Frame and tells us when a Grassmannian frame $\{\zeta_{i}\}_{i=1}^{C}$ can be Equiangular: if and only if it achieves the Welch Bound. Intuitively, if $d$ is large enough, the correlation between every two vectors in $\{\zeta_{i}\}_{i=1}^{C}$ can be minimized equally so as to achieve the Equiangular property. Otherwise, the Equiangular property cannot be guaranteed.
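As a small numerical illustration of Theorem 2.9 (our own example, with $d=2$ and $C=4$ so that $C>\frac{d(d+1)}{2}$), the sketch below computes the maximal frame correlation of four unit vectors spaced $45^{\circ}$ apart and compares it with the Welch Bound, which cannot be attained in this regime:

```python
import numpy as np

def max_correlation(F):
    """Maximal frame correlation: max_{i != j} |<zeta_i, zeta_j>| for unit-norm columns of F."""
    G = np.abs(F.T @ F)
    np.fill_diagonal(G, 0.0)
    return G.max()

def welch_bound(d, C):
    return np.sqrt((C - d) / (d * (C - 1)))

# Four unit vectors in R^2, spaced 45 degrees apart (C = 4 > d(d+1)/2 = 3).
angles = np.arange(4) * np.pi / 4
F = np.vstack([np.cos(angles), np.sin(angles)])   # shape (d, C) = (2, 4)

print(max_correlation(F))   # ~0.707 (cos 45 degrees)
print(welch_bound(2, 4))    # ~0.577: strictly smaller, so the bound is not attained and no ETF exists here
```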

Motivation Our motivation for considering the Grassmannian Frame as a candidate structure for the Generalized Neural Collapse hypothesis is that it satisfies two important properties: it has minimized maximal correlation (i.e., Hyperspherical Uniformity), and it exists for any number of vectors $C$ and dimension $d$. We provide theoretical support for this argument in the next section.

3 Main Results

As discussed in the previous section, minimized maximal correlation is a key characteristic of the Grassmannian frame. Therefore, in this section, all of our findings are based on this insight:

$$\min_{\|M_{y}\|=1}\max_{y\neq y^{\prime}}\langle M_{y},M_{y^{\prime}}\rangle$$

All proofs can be found in the Appendix.

3.1 Warmup: Optimal Code Perspective

In the communication field, the Grassmannian frame is not only the optimal $2$-erasure code Holmes and Paulsen (2004), but also the optimal code for the Gaussian Channel Papyan et al. (2020); Shannon (1959):

Theorem 3.1.

Consider the following communication problem: a number $c$ ($c\in[C]$) is encoded as a vector $M_{c}$ in $\mathbb{R}^{d}$, which we call a code, and then transmitted over a noisy channel. A receiver needs to recover $c$ from the noisy signal $\boldsymbol{h}=M_{c}+\boldsymbol{g}$, where $\boldsymbol{g}$ is additive noise. If $\boldsymbol{g}\sim\mathcal{N}(0,\sigma^{2}I)$, the Grassmannian Frame is the optimal code, enjoying the minimal error probability.

This theorem is essentially adopted from Corollary 4 of Papyan et al. (2020), which is the first study to identify the NeurCol phenomenon. However, they only validated this result for the Simplex ETF and did not further investigate the structure.

3.2 Optimization Perspective

We now consider classification from the optimization perspective, starting from the cross-entropy minimization problem.

Notations Denote the feature dimension of the last layer as $d$ and the number of classes as $C$. The linear classifier is $\boldsymbol{M}=[M_{1},\cdots,M_{C}]\in\mathbb{R}^{d\times C}$. Given a balanced dataset $\boldsymbol{Z}=\{\{\boldsymbol{z}_{y,i}\in\mathbb{R}^{d}\}_{i=1}^{N/C}\}_{y=1}^{C}$ where every class has $N/C$ samples, we use $\boldsymbol{z}_{y,i}$ to represent the feature of the $i$-th sample in the $y$-th class.

Optimization Objective Since modern deep models are highly overparameterized, we assume that the models have infinite capacity and can fit any dataset. Therefore, we directly optimize the sample features to simplify the analysis Yang et al. (2022); Zhu et al. (2021):

$$\min_{\boldsymbol{Z},\boldsymbol{M}}\ \text{CELoss}(\boldsymbol{M},\boldsymbol{Z}):=-\sum_{y=1}^{C}\sum_{i=1}^{N/C}\log\frac{\exp\big(\langle M_{y},\boldsymbol{z}_{y,i}\rangle\big)}{\sum_{y^{\prime}}\exp\big(\langle M_{y^{\prime}},\boldsymbol{z}_{y,i}\rangle\big)} \qquad (1)$$

As Proposition 2 of Fang et al. (2021) highlighted, NeurCol occurs only if the features and classifiers are $\ell_{2}$-norm bounded. Therefore, following previous work Lu and Steinerberger (2022); Fang et al. (2021); Yaras et al. (2022b), we introduce an $\ell_{2}$-norm constraint into (1):

$$\text{s.t.}\ \ \|M_{y}\|\leq\rho,\ \forall y\in[C]\ \ \text{and}\ \ \|\boldsymbol{z}_{y,i}\|\leq\rho,\ \forall y\in[C],\ i\in[N/C], \qquad (2)$$

where the norms of the features and the linear classifier are bounded by $\rho$. Then, to perform optimization directly, we turn to the following unconstrained feature model Zhu et al. (2021); Zhou et al. (2022a):

$$\min_{\boldsymbol{M},\boldsymbol{Z}}\ \mathcal{L}(\boldsymbol{M},\boldsymbol{Z}):=\sum_{y=1}^{C}\sum_{i=1}^{N/C}\left(-\log\frac{\exp\big(\langle M_{y},\boldsymbol{z}_{y,i}\rangle\big)}{\sum_{y^{\prime}}\exp\big(\langle M_{y^{\prime}},\boldsymbol{z}_{y,i}\rangle\big)}+\frac{\omega\|\boldsymbol{z}_{y,i}\|^{2}}{2}\right)+\sum_{y=1}^{C}\frac{\lambda\|M_{y}\|^{2}}{2} \qquad (3)$$

Compared with (1), the terms $\|M_{y}\|^{2}$ and $\|\boldsymbol{z}_{y,i}\|^{2}$ are added in (3). In deep learning, these added terms can be seen as weight decay, with $\lambda$ and $\omega$ as the weight-decay factors. They limit the norms of the sample features $\boldsymbol{Z}$ and the linear classifier $\boldsymbol{M}$; that is, for any $\lambda,\omega\in(0,\infty)$, there exists a $\rho(\lambda,\omega)$ such that $\rho(\lambda,\omega)\geq\|M_{y}\|$ and $\rho(\lambda,\omega)\geq\|\boldsymbol{z}_{y,i}\|$ when (3) converges.

Gradient Descent Ji et al. (2022) analyze the Gradient Flow of (1); in contrast, we consider Gradient Descent on (3):

$$\boldsymbol{Z}^{(t+1)}\leftarrow\boldsymbol{Z}^{(t)}-\alpha\nabla_{\boldsymbol{Z}^{(t)}}\mathcal{L}(\boldsymbol{M}^{(t)},\boldsymbol{Z}^{(t)})\ \ \text{and}\ \ \boldsymbol{M}^{(t+1)}\leftarrow\boldsymbol{M}^{(t)}-\beta\nabla_{\boldsymbol{M}^{(t)}}\mathcal{L}(\boldsymbol{M}^{(t)},\boldsymbol{Z}^{(t)}) \qquad (4)$$

where the superscript $(t)$ denotes the optimized variables at the $t$-th iteration, and $\alpha$ and $\beta$ are learning rates. Note that when $\lambda,\omega\rightarrow 0$, the optimization problem (3) is equivalent to (1), since the norm constraint (2) vanishes, i.e., $\rho\rightarrow\infty$.
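For concreteness, here is a minimal NumPy sketch of Gradient Descent (4) on the unconstrained feature model (3); the dimensions, step sizes, and weight-decay factors are illustrative choices of ours (picked so that $\lambda/\omega=\alpha/\beta=N/C$, as in Assumption C.3), not the settings of the simulation in Appendix G.

```python
import numpy as np

def grads(M, Z, lam, omega):
    """Gradients of the unconstrained feature model (3).
    M: (d, C) linear classifier; Z: (d, C, n) features with Z[:, y, i] = z_{y,i}."""
    d, C, n = Z.shape
    logits = np.einsum('dc,dyi->cyi', M, Z)                 # logits[c, y, i] = <M_c, z_{y,i}>
    logits -= logits.max(axis=0, keepdims=True)             # numerical stability
    P = np.exp(logits); P /= P.sum(axis=0, keepdims=True)   # softmax over classes c
    E = np.repeat(np.eye(C)[:, :, None], n, axis=2)         # E[c, y, i] = 1 if c == y else 0
    dM = np.einsum('dyi,cyi->dc', Z, P - E) + lam * M       # gradient w.r.t. the classifier
    dZ = np.einsum('dc,cyi->dyi', M, P - E) + omega * Z     # gradient w.r.t. the features
    return dM, dZ

rng = np.random.default_rng(0)
d, C, n = 2, 4, 16                       # feature dim, classes, N/C samples per class
omega, lam = 5e-3, 5e-3 * n              # lam / omega = N/C  (Assumption C.3)
beta, alpha = 5e-3, 5e-3 * n             # alpha / beta = N/C (Assumption C.3)

M = rng.standard_normal((d, C))
Z = rng.standard_normal((d, C, n))
for _ in range(30000):                   # Gradient Descent updates (4)
    dM, dZ = grads(M, Z, lam, omega)
    M, Z = M - beta * dM, Z - alpha * dZ

nc1 = np.linalg.norm(Z - Z.mean(axis=2, keepdims=True))     # Theorem 3.2 predicts this shrinks toward 0
nc2 = np.linalg.norm(Z - M[:, :, None])                     # Theorem 3.2 predicts this shrinks toward 0
G = M.T @ M
np.fill_diagonal(G, -np.inf)
print(nc1, nc2, G.max())                 # NC3: the maximal pairwise <M_y, M_y'> is driven down
```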

Theorem 3.2 (Generalized Neural Collapse).

Consider the convergence of Gradient Descent on the model (3). If the parameters are properly selected (refer to Assumption C.3 in the Appendix for details), we have the following conclusions:

(NC1) $\|\boldsymbol{z}_{y,i}-\boldsymbol{z}_{y,j}\|\rightarrow 0,\ \forall y\in[C],\ \forall i,j\in[N/C]$;

(NC2) $\|\boldsymbol{z}_{y,i}-M_{y}\|\rightarrow 0,\ \forall y\in[C],\ \forall i\in[N/C]$;

(NC3) if $\rho\rightarrow\infty$, $\boldsymbol{M}$ converges to the solution of $\min_{\forall y,\|M_{y}\|=\rho}\max_{y\neq y^{\prime}}\langle M_{y},M_{y^{\prime}}\rangle$;

(NC4) $\arg\max_{y}\langle M_{y},\boldsymbol{z}\rangle\rightarrow\arg\min_{y}\|\boldsymbol{z}-\bar{\boldsymbol{z}}_{y}\|,\ \forall\boldsymbol{z}\in\boldsymbol{Z}$, where $\bar{\boldsymbol{z}}_{y}=\frac{C}{N}\sum_{i=1}^{N/C}\boldsymbol{z}_{y,i}$.

Remark 3.3.

Our findings confirm the Generalized Neural Collapse hypothesis of Liu et al. (2023). (NC1), (NC2), and (NC4) are consistent with previous discoveries Papyan et al. (2020). Additionally, we have shown that (NC3) extends NeurCol's ETF and leads to minimized maximal correlation, i.e., the Grassmannian Frame.

Remark 3.4.

Our findings reflect the two objectives of NeurCol that Liu et al. (2023) highlighted: minimal intra-class variability and maximal inter-class separability. Our conclusions (NC1), (NC2), and (NC4) support the former objective, while the Grassmannian Frame resulting from (NC3) naturally coincides with the solutions of problems such as Spherical Codes Conway and Sloane (2013), the Thomson problem F.R.S. (1904), and Packing Lines in Grassmannian Spaces Conway et al. (1996), which supports the latter objective.

Remark 3.5.

(NC1) and (NC2) imply that the classifiers $\{M_{y}\}_{y=1}^{C}$ form an alternate dual frame Ambrus et al. (2021) of the features $\{\bar{\boldsymbol{z}}_{y}\}_{y=1}^{C}$. In Frame Theory, the alternate dual frame has been proved to be an optimal dual with respect to erasures for decoding Leng and Han (2011); Lopez and Han (2010).

We conduct a simulation experiment to visualize the convergence of Generalized Neural Collapse in a $2$-dimensional feature space with $4$ classes. A GIF animation can be found HERE. Please refer to Appendix G for a detailed description and visual representation of this experiment.

3.3 Generalization Perspective

Next, we consider the generalization of classification problems. While correlation measures the similarity between two vectors in Frame Theory, in classification problems the margin is a similar concept but with the opposite sense. Therefore, our analysis of generalization focuses on the margin.

Notations Consider a $C$-class classification problem. Suppose the sample space is $\mathcal{X}\times\mathcal{Y}$, where $\mathcal{X}$ is the data space and $\mathcal{Y}=\{1,\dots,C\}$ is the label space. We assume the class distribution is $\mathcal{P}_{\mathcal{Y}}=\left[p(1),\dots,p(C)\right]$, where $p(c)$ denotes the proportion of class $c$. Let the training set $S=\{(\boldsymbol{x}_{i},y_{i})\}_{i=1}^{N}$ be drawn i.i.d. from the probability $\mathcal{P}_{\mathcal{X}\times\mathcal{Y}}$. For the $y$-class samples in $S$, we denote $S_{y}=\{\boldsymbol{x}|(\boldsymbol{x},y)\in S\}$ and $|S_{y}|=N_{y}$. The classifiers take the form $\mathrm{logit}=\boldsymbol{M}^{T}f(\boldsymbol{x};\boldsymbol{w})=[\langle M_{1},f(\boldsymbol{x};\boldsymbol{w})\rangle,\dots,\langle M_{C},f(\boldsymbol{x};\boldsymbol{w})\rangle]^{T}$, where $\boldsymbol{M}\in\mathbb{R}^{d\times C}$ is the last linear layer and $f(\cdot;\boldsymbol{w})\in\mathbb{R}^{d}$ is the feature extractor parameterized by $\boldsymbol{w}$. We use the tuple $(\boldsymbol{M},f(\cdot;\boldsymbol{w}))$ to denote the classifier.

First, we give the definition of the margin:

Definition 3.6 (Linear Separability).

Given a dataset $S$ and a classifier $(\boldsymbol{M},f(\cdot;\boldsymbol{w}))$, if the classifier achieves $100\%$ accuracy on the dataset, then $\boldsymbol{M}$ must linearly separate the features of $S$: for any two classes $i,j\ (i\neq j)$, there exists a $\gamma_{i,j}>0$ such that

$$(M_{i}-M_{j})^{T}f(\boldsymbol{x};\boldsymbol{w})\geq\gamma_{i,j},\quad\forall(\boldsymbol{x},i)\in S,$$
$$(M_{i}-M_{j})^{T}f(\boldsymbol{x};\boldsymbol{w})\leq-\gamma_{i,j},\quad\forall(\boldsymbol{x},j)\in S.$$

In this case, we say the classifier $(\boldsymbol{M},f(\cdot;\boldsymbol{w}))$ linearly separates the dataset $S$ with margins $\{\gamma_{i,j}\}_{i\neq j}$.

The following lemma establishes the relationship between margin and correlation in NeurCol:

Lemma 3.7.

Given a dataset $S$ and a classifier $(\boldsymbol{M},f(\cdot;\boldsymbol{w}))$, if the classifier linearly separates the dataset with margins $\{\gamma_{i,j}\}_{i\neq j}$ and exhibits the NeurCol phenomenon on it, then we have the following conclusion:

$$\forall i,j\in[C]\ (i\neq j),\quad \gamma_{i,j}+\langle M_{i},M_{j}\rangle=\rho^{2}$$

By substituting the conclusions of NC1-NC3 into the definition of the margin, we can prove this lemma straightforwardly. It says that, given the maximal norm $\rho$, the margin $\gamma_{i,j}$ and the correlation $\langle M_{i},M_{j}\rangle$ form a pair of opposite quantities. We then propose the Multiclass Margin Bound.
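Spelling out the substitution (a sketch, treating the margin as tight at collapse): under NC1 and NC2 every feature of class $i$ coincides with $M_{i}$, and $\|M_{i}\|=\rho$, so the margin of Definition 3.6 evaluates to

$$\gamma_{i,j}=(M_{i}-M_{j})^{T}f(\boldsymbol{x};\boldsymbol{w})\ \rightarrow\ (M_{i}-M_{j})^{T}M_{i}=\|M_{i}\|^{2}-\langle M_{i},M_{j}\rangle=\rho^{2}-\langle M_{i},M_{j}\rangle,$$

which rearranges to the stated identity $\gamma_{i,j}+\langle M_{i},M_{j}\rangle=\rho^{2}$.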

Theorem 3.8 (Multiclass Margin Bound).

Consider a dataset $S$ with $C$ classes. For any classifier $(\boldsymbol{M},f(\cdot;\boldsymbol{w}))$, we denote its margin function between classes $i$ and $j$ as $(M_{i}-M_{j})^{T}f(\cdot;\boldsymbol{w})$, and suppose the function space of the margins is $\mathcal{F}=\{(M_{i}-M_{j})^{T}f(\cdot;\boldsymbol{w})\,|\,\forall i\neq j,\forall\boldsymbol{M},\boldsymbol{w}\}$, whose upper bound is

$$\sup_{i\neq j}\sup_{\boldsymbol{M},\boldsymbol{w}}\sup_{\boldsymbol{x}\in\mathcal{M}_{i}}\left|(M_{i}-M_{j})^{T}f(\boldsymbol{x};\boldsymbol{w})\right|\leq K.$$

Then, for any classifier $(\boldsymbol{M},f(\cdot;\boldsymbol{w}))$ and margins $\{\gamma_{i,j}\}_{i\neq j}\ (\gamma_{i,j}>0)$, the following inequality holds with probability at least $1-\delta$:

$$\mathbb{P}_{x,y}\Big(\max_{c}[Mf(\boldsymbol{x};\boldsymbol{w})]_{c}\neq y\Big)\lesssim \sum_{i=1}^{C}p(i)\sum_{j\neq i}\frac{\mathfrak{R}_{N_{i}}(\mathcal{F})}{\gamma_{i,j}}+\sum_{i=1}^{C}p(i)\sum_{j\neq i}\sqrt{\frac{\log(\log_{2}\frac{4K}{\gamma_{i,j}})}{N_{i}}}+L_{0,1}$$

where $\lesssim$ means we omit probability-related terms, and $L_{0,1}$ denotes the empirical risk term:

$$L_{0,1}=\sum_{i=1}^{C}p(i)\sum_{j\neq i}\sum_{x\in S_{i}}\frac{\mathbb{I}((M_{i}-M_{j})^{T}f(x)\leq\gamma_{i,j})}{N_{i}}$$

$\mathfrak{R}_{N_{i}}(\mathcal{F})$ is the Rademacher complexity Kakade et al. (2008); Bartlett and Mendelson (2002) of the function space $\mathcal{F}$. Refer to Appendix D for the full version of this theorem.

Recall that NeurCol occurs when the class distribution is uniform. We now consider this case.

Corollary 3.9.

In Theorem 3.8, assume that the class distribution and the training set are both uniform, i.e., $p(i)=\frac{1}{C}$ and $N_{i}=\frac{N}{C},\ \forall i\in[C]$. In this case, the generalization bound in Theorem 3.8 becomes

$$\mathbb{P}_{x,y}\Big(\max_{c}[Mf(\boldsymbol{x};\boldsymbol{w})]_{c}\neq y\Big)\lesssim\frac{\mathfrak{R}_{N/C}(\mathcal{F})}{C}\sum_{i=1}^{C}\sum_{j\neq i}\frac{1}{\gamma_{i,j}}+\frac{1}{\sqrt{NC}}\sum_{i=1}^{C}\sum_{j\neq i}\sqrt{\log(\log_{2}\frac{4K}{\gamma_{i,j}})}+L_{0,1}$$

Observing that both of the above margin-related terms take the form $\sum_{i\neq j}\frac{1}{\gamma_{i,j}}$, we have

$$\sum_{i=1}^{C}\sum_{j\neq i}\frac{1}{\gamma_{i,j}}\leq C(C-1)\max_{i\neq j}\frac{1}{\gamma_{i,j}}\ \ \Leftrightarrow\ \ \frac{1}{C(C-1)}\min_{\{\gamma_{i,j}\}_{i\neq j}}\sum_{i=1}^{C}\sum_{j\neq i}\frac{1}{\gamma_{i,j}}\leq\min_{\{\gamma_{i,j}\}_{i\neq j}}\max_{i\neq j}\frac{1}{\gamma_{i,j}}$$
Remark 3.10.

Once again, we observe the characteristic of minimized maximal correlation. However, this time it appears in the form of the margin and is obtained by minimizing the margin generalization error.

Remark 3.11.

The Multiclass Margin Bound can explain the steady improvement in test accuracy and adversarial robustness during TPT (as shown in Figure 8 and Table 1 of Papyan et al. (2020)). At the beginning of TPT, the accuracy on the training set reaches $100\%$ and $L_{0,1}=0$, indicating that generalization performance can no longer improve by reducing $L_{0,1}$. However, if we continue training at this point, the margins $\gamma_{i,j}$ still increase. Therefore, better robustness can be achieved by increasing the margins. Furthermore, the two margin-related terms in our bound continue to decrease, leading to better generalization performance.

Minority Collapse Fang et al. (2021) identified a related phenomenon called Minority Collapse, an imbalanced version of NeurCol. Specifically, they observed that when the training set is extremely imbalanced, the classifier vectors of the minority classes tend to become parallel. Our Multiclass Margin Bound can explain the generalization behavior of this phenomenon.

Corollary 3.12.

Consider imbalanced classification. Given a dataset $S$ with $C$ classes, the first $C_{1}$ classes $\mathcal{C}_{1}=\{1,\dots,C_{1}\}$ each contain $N_{1}$ samples, and the remaining $C-C_{1}$ minority classes $\mathcal{C}_{2}=\{C_{1}+1,\dots,C\}$ each contain $N_{2}$ samples. We denote the imbalance ratio $N_{1}/N_{2}$ as $R$. Assume the class distribution $p(i)$ is the same as that of the dataset $S$.

Then the terms related to the margins between minority classes in the Multiclass Margin Bound become:

$$\sum_{i\in\mathcal{C}_{2}}\sum_{j\in\mathcal{C}_{2}\backslash\{i\}}\frac{1}{C_{1}R+C-C_{1}}\left(\frac{\mathfrak{R}_{N_{2}}(\mathcal{F})}{\gamma_{i,j}}+\sqrt{\frac{\log\left(\log_{2}\frac{4K}{\gamma_{i,j}}\right)}{N_{2}}}\right)$$
Remark 3.13.

In these terms, $R$ and $\gamma_{i,j}\ (i,j\in\mathcal{C}_{2})$ are inversely related. This implies that, as $R\rightarrow\infty$, if the generalization bound remains constant, then $\gamma_{i,j}$ must approach $0$. Recall from Lemma 3.7 that $\gamma_{i,j}=\rho^{2}-\langle M_{i},M_{j}\rangle$, so $\|M_{i}-M_{j}\|^{2}=2\gamma_{i,j}\rightarrow 0$, which means that $\|M_{i}-M_{j}\|\rightarrow 0$.

4 Further Exploration

In this section, we uncover a new phenomenon related to NeurCol, which we refer to as Symmetric Generalization. Symmetric Generalization is linked to two transformation groups acting on the Grassmannian Frame, namely the Permutation and Rotation transformations. Briefly, Grassmannian Frames related by these two transformations can result in different generalization performances. Having observed this intriguing phenomenon in our experiments, we provide a theoretical result that partially explains it.

4.1 Motivation

First, we introduce two kinds of frame equivalence Holmes and Paulsen (2004); Bodmann and Paulsen (2005):

Definition 4.1 (Equivalent Frame).

Given two frames $\{\zeta_{i}\}_{i=1}^{C},\{\chi_{i}\}_{i=1}^{C}$ in $\mathbb{R}^{d}$, they are:

  • Type I Equivalent if there exists an orthogonal matrix $R\in\mathbb{R}^{d\times d}$ such that $[\zeta_{i}]_{i=1}^{C}=R[\chi_{i}]_{i=1}^{C}$.

  • Type II Equivalent if there exists a permutation matrix $P\in\mathbb{R}^{C\times C}$ such that $[\zeta_{i}]_{i=1}^{C}=[\chi_{i}]_{i=1}^{C}P$.

The Grassmannian Frame is a geometrically symmetrical structure, with its symmetry stemming from invariance under two transformations: rotation and permutation. Specifically, after a rotation or permutation (or a combination of the two), the frame still satisfies the minimized maximal correlation property, as only the directions and order of the frame vectors change. We are curious about how these equivalences affect the performance of models. In machine learning, we typically consider two aspects of a model: optimization and generalization. Given that the training set and the classification model (backbone and capacity) are the same, we argue that models with equivalent Grassmannian Frames will exhibit the same optimization performance. However, there is no reason to believe that these models will also have the same generalization performance. As such, we pose the following question:

Is the generalization of models invariant to symmetric transformations of the Grassmannian Frame?

To explore this question, we conduct a series of experiments. Our experimental results lead to an interesting conclusion:

The optimization performance of models is not affected by the Rotation and Permutation transformations of the Grassmannian Frame, but the generalization performance is.

This newly discovered phenomenon, which we call Symmetric Generalization, contradicts a recent argument made in Yang et al. (2022). The authors of that work claimed that since modern neural networks are often overparameterized and can learn features with any direction, fixing the linear classifier as a Simplex ETF is sufficient, and there is no need to learn it. Our findings challenge this viewpoint.

4.2 Experiments

To investigate the impact of Rotation and Permutation transformations of Grassmannian Frame on the generalization performance of deep neural networks, we conducted a series of experiments.

How to Reveal the Answer We generate $10$ Grassmannian Frames with different Rotations and Permutations. Then, we train the same network architecture $10$ times. Each time, the linear classifier is loaded from one of the pre-generated equivalent Grassmannian Frames and kept fixed during training. To ensure an identical optimization process (mini-batches, augmentation, and parameter initialization), we use the same random seed for each training run. Once the NeurCol phenomenon occurs, we know that the $10$ models have learned different Grassmannian Frames. Finally, we compare their generalization performances by evaluating the cross-entropy loss and accuracy on a test set.

Generation of Equivalent Grassmannian Frames Our Theorem 3.2 naturally offers a numerical method for generating Grassmannian Frames. Given the class number and feature dimension, we use Gradient Descent (4) on the unconstrained feature model (3) to generate a Grassmannian Frame. Then, given a Grassmannian Frame $\{M_{y}\}_{y=1}^{C}$ in $\mathbb{R}^{d}$, denoted as $\boldsymbol{M}=[M_{1},\cdots,M_{C}]\in\mathbb{R}^{d\times C}$, we can use $R\boldsymbol{M}P$ to denote its equivalent frames, where $P\in Permutation(C)$ and $R\in SO(d)$:

$$Permutation(C)=\left\{P\ \bigg|\ P\in\mathbb{R}^{C\times C},\ \forall i\in[C],\ \sum_{j=1}^{C}P_{j,i}=\sum_{j=1}^{C}P_{i,j}=1,\ \forall i,j\in[C],\ P_{i,j}\in\{0,1\}\right\}$$
$$SO(d)=\left\{R\ \big|\ R\in\mathbb{R}^{d\times d},\ R^{T}R=RR^{T}=I_{d},\ |R|=1\right\}$$

Note that $Permutation(C)$ and $SO(d)$ act on the vector order and the directions of $\{M_{y}\}_{y=1}^{C}$, respectively. Refer to Appendix F.2 for the code implementation.
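The following NumPy sketch shows one way to sample elements of $Permutation(C)$ and $SO(d)$ and apply them to a frame; it is our own illustration (the authors' implementation is in Appendix F.2), and the random matrix standing in for $\boldsymbol{M}$ would in practice be a numerically generated Grassmannian Frame.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_permutation_matrix(C):
    """A random element of Permutation(C): exactly one 1 in every row and every column."""
    P = np.zeros((C, C))
    P[rng.permutation(C), np.arange(C)] = 1.0
    return P

def random_rotation_matrix(d):
    """A random element of SO(d): orthogonalize a Gaussian matrix, then force det = +1."""
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    if np.linalg.det(Q) < 0:
        Q[:, 0] *= -1.0
    return Q

# M stands in for a d x C Grassmannian Frame generated numerically (e.g. with (3)-(4));
# here a random unit-norm frame is used only to illustrate the shapes involved.
d, C = 6, 10
M = rng.standard_normal((d, C))
M /= np.linalg.norm(M, axis=0, keepdims=True)

R, P = random_rotation_matrix(d), random_permutation_matrix(C)
M_equiv = R @ M @ P                       # Type I (rotation) and Type II (permutation) equivalence

# Equivalence conjugates the Gram matrix by P, so the maximal correlation is unchanged.
G0, G1 = np.abs(M.T @ M), np.abs(M_equiv.T @ M_equiv)
np.fill_diagonal(G0, 0.0); np.fill_diagonal(G1, 0.0)
print(np.isclose(G0.max(), G1.max()))     # True
```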

Models with Different Features Yang et al. (2022) point out in their Theorem 1 that if the linear classifier is fixed as a Simplex ETF, then the final features learned by the model converge to a Simplex ETF with the same directions as the classifier. Following their work, to make the models learn equivalent Grassmannian Frames, we initialize the linear classifier as an equivalent Grassmannian Frame and do not optimize it during training. In this way, when NeurCol occurs, the models have learned equivalent Grassmannian Frames.

Network Architecture and Dataset Our experiments involve two image classification datasets: CIFAR10/100 Krizhevsky (2009). For every dataset, we use three different convolutional neural networks to verify our finding: ResNet He et al. (2016), VGG Simonyan and Zisserman (2014), and DenseNet Huang et al. (2017). The two datasets are balanced, with 10 and 100 classes respectively, each having $5{,}000$ and $500$ training images per class. To make the number of classes larger than the feature dimension, we use $6$ and $64$ as the feature dimensions, respectively. Then, to obtain features of these dimensions for every backbone, we attach a linear layer after the end of the backbone, which transforms the feature dimension.
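A minimal PyTorch sketch of this setup (a backbone followed by a dimension-adapting linear layer and a frozen classifier loaded from a pre-generated frame) is given below; the class and argument names are ours for illustration, and the actual training configuration is in Appendix F.

```python
import torch
import torch.nn as nn

class FixedClassifierModel(nn.Module):
    """Backbone + dimension-adapting linear layer + a frozen linear classifier.
    `backbone` is any feature extractor returning (batch, backbone_dim) features;
    `frame` is a pre-generated d x C (equivalent) Grassmannian Frame as a tensor."""
    def __init__(self, backbone, backbone_dim, frame):
        super().__init__()
        d, C = frame.shape
        self.backbone = backbone
        self.proj = nn.Linear(backbone_dim, d)             # maps backbone features to dimension d
        self.classifier = nn.Linear(d, C, bias=False)
        with torch.no_grad():
            self.classifier.weight.copy_(frame.t())        # rows of the weight are the columns M_y
        self.classifier.weight.requires_grad_(False)       # fixed: never updated during training

    def forward(self, x):
        z = self.proj(self.backbone(x))
        return self.classifier(z)                          # logits [<M_1, z>, ..., <M_C, z>]

# Only the trainable parameters are passed to the optimizer; the frame stays as generated, e.g.
# optimizer = torch.optim.SGD([p for p in model.parameters() if p.requires_grad], lr=0.05)
```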

Training To make NeurCol appear, we follow the practice of Papyan et al. (2020). More details are given in Appendix F.1.

Experiment Index 0 1 2 3 4 5 6 7 8 9
Train CE 0.0019 0.0018 0.0019 0.0019 0.0019 0.0018 0.0019 0.0019 0.0019 0.0019
Val CE 0.586 0.6965 0.6871 0.61 0.6585 0.576 0.5565 0.5972 0.5892 0.6243
Train ACC 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0
Val ACC 87.96 87.96 88.25 87.94 87.32 87.58 88.39 88.32 87.98 87.73
Table 1: Performance comparison of different Permutation features on CIFAR10 and Vgg11.
Equivalence Dataset Model Std Train CE Std Val CE Range Val CE Std Train ACC Std Val ACC Range Val ACC
Permutation CIFAR10 VGG11 0.0 0.01 0.034 0.0 0.209 0.71
Resnet34 0.0 0.018 0.058 0.0 0.197 0.71
DenseNet121 0.0 0.011 0.04 0.0 0.157 0.49
CIFAR100 VGG11 0.0 0.027 0.084 0.004 0.353 1.38
Resnet34 0.0 0.042 0.134 0.005 0.468 1.37
DenseNet121 0.0 0.012 0.037 0.003 0.232 0.84
Rotation CIFAR10 VGG11 0.0 0.045 0.14 0.0 0.317 1.07
Resnet34 0.0 0.078 0.23 0.0 0.538 1.73
DenseNet121 0.0 0.059 0.187 0.0 0.18 0.63
CIFAR100 VGG11 0.0 0.034 0.099 0.0 0.455 1.52
Resnet34 0.0 0.036 0.136 0.005 0.403 1.16
DenseNet121 0.0 0.024 0.081 0.004 0.516 1.59
Table 2: More comprehensive results on different datasets and models. Std indicates the standard deviation of ten metrics and Range indicates the maximal metric minus minimal.

Results Table 1 presents results on CIFAR10 with VGG11 under different Permutations. All metrics are reported when the model converges to NeurCol ($100\%$ accuracy and almost zero loss on the training set). We observe that, even though all experiments achieved near-zero cross-entropy loss and $100\%$ accuracy on the training set, they still exhibit significant differences in test loss and accuracy. This implies that although the permutation hardly affects optimization, it has a significant impact on generalization. Table 2 provides results on different datasets, backbones, and the two transformations, revealing the same phenomenon. These experimental results answer the question we posed in Section 4.1, demonstrating that different feature orders and directions can influence the generalization performance of models.

4.3 Analysis for Permutation

We aim to theoretically explain why Grassmannian Frames with different Permutations can lead to different generalization performance. Most of the symbol definitions are adopted from Section 3.3.

Theorem 4.2.

Given a dataset $S$ and a classifier $(\boldsymbol{M},f(\cdot;\boldsymbol{w}))$, assume the classifier has already achieved NeurCol with maximal norm $\rho$. Suppose $(\boldsymbol{M},f(\cdot;\boldsymbol{w}))$ can linearly separate $S$ with margins $\{\gamma_{i,j}\}_{i\neq j}$. Besides, we make the following assumptions:

  • $f(\cdot,\boldsymbol{w})$ is $L$-Lipschitz for any $\boldsymbol{w}$, i.e., $\forall\boldsymbol{x}_{1},\boldsymbol{x}_{2}$, $\|f(\boldsymbol{x}_{1},\boldsymbol{w})-f(\boldsymbol{x}_{2},\boldsymbol{w})\|\leq L\|\boldsymbol{x}_{1}-\boldsymbol{x}_{2}\|$.

  • $S$ is large enough such that $N_{i}\geq\max_{j\neq i}\mathcal{N}(\frac{\gamma_{i,j}}{L\|M_{i}-M_{j}\|},\mathcal{M}_{i})$ for every class $i$.

  • The label distribution and the labels of $S$ are balanced, i.e., $p(i)=\frac{1}{C}$ and $N_{i}=\frac{N}{C},\ \forall i\in[C]$.

  • $S_{i}$ is drawn from the probability $\mathcal{P}_{\boldsymbol{x}|y=i}$, whose tight support is denoted as $\mathcal{M}_{i}$.

where $\mathcal{N}(\cdot,\mathcal{M}_{i})$ is the covering number of $\mathcal{M}_{i}$; refer to Appendix E for its definition. Then the expected accuracy of $(\boldsymbol{M},f(\cdot;\boldsymbol{w}))$ over the entire distribution is bounded as

$$\mathbb{P}_{\boldsymbol{x},y}\Big(\max_{c}[Mf(\boldsymbol{x};\boldsymbol{w})]_{c}=y\Big)>1-\frac{1}{2N}\sum_{i=1}^{C}\max_{j\neq i}\mathcal{N}\left(\frac{1}{L}\sqrt{\frac{\rho^{2}-M_{i}^{T}M_{j}}{2}},\mathcal{M}_{i}\right)$$
Remark 4.3.

The Permutation transformation actually changes the order of the features assigned to the classes. Given the Grassmannian Frame $\{M_{i}\}_{i=1}^{C}$, we denote its Type II Equivalent frame under permutation $\pi$ as $\{M_{\pi(i)}\}_{i=1}^{C}$. Therefore, the covering number in our theorem becomes $\mathcal{N}(\frac{1}{L}\sqrt{\frac{\rho^{2}-M_{\pi(i)}^{T}M_{\pi(j)}}{2}},\mathcal{M}_{i})$, which leads to a different accuracy bound.

4.4 Insight and Discussion

[Figure 1: The Grassmannian Frame feature alignment of four classes in $\mathbb{R}^{2}$ (Cat, Car, Dog, and Fire Truck). All images are from ImageNet Deng et al. (2009). (a) Dog is close to Cat and Fire Truck is close to Car. (b) Dog and Fire Truck from (a) are swapped.]

Permutation We provide an example to give intuition for the Symmetric Generalization of Permutation. Consider a Grassmannian Frame $\{M_{i}\}_{i=1}^{4}$ living in $\mathbb{R}^{2}$; it resembles a cross (Theorem III.1 of Benedetto and Kolesar (2006)). As shown in Figure 1(a), four classes correspond to different feature vectors $M_{i}$. Obviously, since Dog and Cat look similar to each other, they deserve a smaller margin in the feature space (i.e., to be near each other), and the same goes for the other two categories. However, if we swap the features of Fire Truck and Dog to increase the distance between dogs and cats, as shown in Figure 1(b), the semantic relationships in the feature space are disrupted. We argue that this can harm the model's training and lead to worse generalization.

Rotation The Symmetric Generalization of Rotation has been completely beyond our initial expectation. We believe that the margin, or the correlation, between features is the most effective tool for understanding the NeurCol phenomenon. However, it fails to explain why different feature directions yield different generalization. In the Deep Learning community, the Implicit Bias phenomenon is a possible way to approach this finding. Soudry et al. (2018) proved that Gradient Descent leads to the weight direction that maximizes the margin when using the logistic loss and linear models. As further progress, Lyu and Li (2020) extended this result to homogeneous neural networks. We speculate that the explanation for the Symmetric Generalization of Rotation may be hidden within the layers of neural networks. Therefore, studying the Implicit Bias of deep models layer by layer could be a promising direction for future research. This is beyond the scope of the current work, and we leave it as a topic for future work.

5 Conclusion

In this paper, we justify the Generalized Neural Collapse hypothesis by introducing the Grassmannian Frame into classification problems; this structure does not assume a specific relation between the number of classes and the feature dimension, and every two vectors in it achieve maximal distance on a sphere. In addition, aware that the Grassmannian Frame is geometrically symmetric, we pose a question: is the generalization of a model invariant to symmetric transformations of the Grassmannian Frame? To explore this question, we conduct a series of empirical and theoretical analyses and finally find the Symmetric Generalization phenomenon. This phenomenon suggests that the generalization of a model is influenced by geometrically invariant transformations of the Grassmannian Frame, including $Permutation(C)$ and $SO(d)$.

References

  • Papyan et al. [2020] Vardan Papyan, XY Han, and David L Donoho. Prevalence of neural collapse during the terminal phase of deep learning training. Proceedings of the National Academy of Sciences, 117(40):24652–24663, 2020.
  • Ji et al. [2022] Wenlong Ji, Yiping Lu, Yiliang Zhang, Zhun Deng, and Weijie J Su. An unconstrained layer-peeled perspective on neural collapse. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=WZ3yjh8coDg.
  • Fang et al. [2021] Cong Fang, Hangfeng He, Qi Long, and Weijie J Su. Exploring deep neural networks via layer-peeled model: Minority collapse in imbalanced training. Proceedings of the National Academy of Sciences, 118(43):e2103091118, 2021.
  • Zhu et al. [2021] Zhihui Zhu, Tianyu Ding, Jinxin Zhou, Xiao Li, Chong You, Jeremias Sulam, and Qing Qu. A geometric analysis of neural collapse with unconstrained features. Advances in Neural Information Processing Systems, 34:29820–29834, 2021.
  • Zhou et al. [2022a] Jinxin Zhou, Chong You, Xiao Li, Kangning Liu, Sheng Liu, Qing Qu, and Zhihui Zhu. Are all losses created equal: A neural collapse perspective. In Advances in Neural Information Processing Systems, volume 35, pages 31697–31710. Curran Associates, Inc., 2022a.
  • Yaras et al. [2022a] Can Yaras, Peng Wang, Zhihui Zhu, Laura Balzano, and Qing Qu. Neural collapse with normalized features: A geometric analysis over the riemannian manifold. In Advances in Neural Information Processing Systems, volume 35, pages 11547–11560. Curran Associates, Inc., 2022a.
  • Han et al. [2022] X.Y. Han, Vardan Papyan, and David L. Donoho. Neural collapse under MSE loss: Proximity to and dynamics on the central path. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=w1UbdvWH_R3.
  • Tirer and Bruna [2022] Tom Tirer and Joan Bruna. Extended unconstrained features model for exploring deep neural collapse. In International Conference on Machine Learning, pages 21478–21505. PMLR, 2022.
  • Zhou et al. [2022b] Jinxin Zhou, Xiao Li, Tianyu Ding, Chong You, Qing Qu, and Zhihui Zhu. On the optimization landscape of neural collapse under mse loss: Global optimality with unconstrained features. In International Conference on Machine Learning, pages 27179–27202. PMLR, 2022b.
  • Mixon et al. [2020] Dustin G Mixon, Hans Parshall, and Jianzong Pi. Neural collapse with unconstrained features. arXiv preprint arXiv:2011.11619, 2020.
  • Poggio and Liao [2020] Tomaso Poggio and Qianli Liao. Explicit regularization and implicit bias in deep network classifiers trained with the square loss. arXiv preprint arXiv:2101.00072, 2020.
  • Graf et al. [2021] Florian Graf, Christoph Hofer, Marc Niethammer, and Roland Kwitt. Dissecting supervised contrastive learning. In International Conference on Machine Learning, pages 3821–3830. PMLR, 2021.
  • Lu and Steinerberger [2022] Jianfeng Lu and Stefan Steinerberger. Neural collapse under cross-entropy loss. Applied and Computational Harmonic Analysis, 59:224–241, 2022.
  • Liu et al. [2023] Weiyang Liu, Longhui Yu, Adrian Weller, and Bernhard Schölkopf. Generalizing and decoupling neural collapse via hyperspherical uniformity gap. In The Eleventh International Conference on Learning Representations, 2023.
  • Yang et al. [2022] Yibo Yang, Shixiang Chen, Xiangtai Li, Liang Xie, Zhouchen Lin, and Dacheng Tao. Inducing neural collapse in imbalanced learning: Do we really need a learnable classifier at the end of deep neural network? In Advances in Neural Information Processing Systems, volume 35, pages 37991–38002. Curran Associates, Inc., 2022.
  • Casazza and Kutyniok [2012] Peter G Casazza and Gitta Kutyniok. Finite frames: Theory and applications. Springer, 2012.
  • Casazza and Kovačević [2003] Peter G Casazza and Jelena Kovačević. Equal-norm tight frames with erasures. Advances in Computational Mathematics, 18:387–430, 2003.
  • Ambrus et al. [2021] Gergely Ambrus, Bo Bai, and Jianfeng Hou. Uniform tight frames as optimal signals. Advances in Applied Mathematics, 129:102219, 2021.
  • Holmes and Paulsen [2004] Roderick B Holmes and Vern I Paulsen. Optimal frames for erasures. Linear Algebra and its Applications, 377:31–51, 2004.
  • Strohmer and Heath Jr [2003] Thomas Strohmer and Robert W Heath Jr. Grassmannian frames with applications to coding and communication. Applied and computational harmonic analysis, 14(3):257–275, 2003.
  • Welch [1974] Lloyd Welch. Lower bounds on the maximum cross correlation of signals (corresp.). IEEE Transactions on Information theory, 20(3):397–399, 1974.
  • Shannon [1959] Claude E Shannon. Probability of error for optimal codes in a gaussian channel. Bell System Technical Journal, 38(3):611–656, 1959.
  • Yaras et al. [2022b] Can Yaras, Peng Wang, Zhihui Zhu, Laura Balzano, and Qing Qu. Neural collapse with normalized features: A geometric analysis over the riemannian manifold. In Advances in Neural Information Processing Systems, volume 35, pages 11547–11560. Curran Associates, Inc., 2022b.
  • Conway and Sloane [2013] John Horton Conway and Neil James Alexander Sloane. Sphere packings, lattices and groups, volume 290. Springer Science & Business Media, 2013.
  • F.R.S. [1904] J.J. Thomson F.R.S. Xxiv. on the structure of the atom: an investigation of the stability and periods of oscillation of a number of corpuscles arranged at equal intervals around the circumference of a circle; with application of the results to the theory of atomic structure. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 7(39):237–265, 1904. doi: 10.1080/14786440409463107.
  • Conway et al. [1996] John H Conway, Ronald H Hardin, and Neil JA Sloane. Packing lines, planes, etc.: Packings in grassmannian spaces. Experimental mathematics, 5(2):139–159, 1996.
  • Leng and Han [2011] Jinsong Leng and Deguang Han. Optimal dual frames for erasures ii. Linear Algebra and its Applications, 435(6):1464–1472, 2011. ISSN 0024-3795. doi: https://doi.org/10.1016/j.laa.2011.03.043.
  • Lopez and Han [2010] Jerry Lopez and Deguang Han. Optimal dual frames for erasures. Linear Algebra and its Applications, 432(1):471–482, 2010. ISSN 0024-3795. doi: https://doi.org/10.1016/j.laa.2009.08.031.
  • Kakade et al. [2008] Sham M Kakade, Karthik Sridharan, and Ambuj Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in Neural Information Processing Systems, volume 21. Curran Associates, Inc., 2008.
  • Bartlett and Mendelson [2002] Peter L Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
  • Bodmann and Paulsen [2005] Bernhard G Bodmann and Vern I Paulsen. Frames, graphs and erasures. Linear algebra and its applications, 404:118–146, 2005.
  • Krizhevsky [2009] Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • Simonyan and Zisserman [2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • Huang et al. [2017] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017.
  • Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
  • Benedetto and Kolesar [2006] John Benedetto and Joseph Kolesar. Geometric properties of grassmannian frames for r2 and r3. EURASIP Journal on Advances in Signal Processing, 2006, 12 2006. doi: 10.1155/ASP/2006/49850.
  • Soudry et al. [2018] Daniel Soudry, Elad Hoffer, and Nathan Srebro. The implicit bias of gradient descent on separable data. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=r1q7n9gAb.
  • Lyu and Li [2020] Kaifeng Lyu and Jian Li. Gradient descent maximizes the margin of homogeneous neural networks. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SJeLIgBKPS.
  • Xie et al. [2022] Liang Xie, Yibo Yang, Deng Cai, and Xiaofei He. Neural collapse inspired attraction-repulsion-balanced loss for imbalanced learning. Neurocomputing, 527:60–70, 2022.
  • Yang et al. [2023] Yibo Yang, Haobo Yuan, Xiangtai Li, Zhouchen Lin, Philip Torr, and Dacheng Tao. Neural collapse inspired feature-classifier alignment for few-shot class-incremental learning. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=y5W8tpojhtJ.
  • Yang et al. [2021] Zhiyong Yang, Qianqian Xu, Shilong Bao, Xiaochun Cao, and Qingming Huang. Learning with multiclass auc: theory and algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(11):7747–7763, 2021.
  • Kulkarni and Posner [1995] Sanjeev R Kulkarni and Steven E Posner. Rates of convergence of nearest neighbor estimation under arbitrary sampling. IEEE Transactions on Information Theory, 41(4):1028–1039, 1995.
  • Vural and Guillemot [2017] Elif Vural and Christine Guillemot. A study of the classification of low-dimensional data with supervised manifold learning. The Journal of Machine Learning Research, 18(1):5741–5795, 2017.
  • Lezcano Casado [2019] Mario Lezcano Casado. Trivializations for gradient-based optimization on manifolds. Advances in Neural Information Processing Systems, 32, 2019.
  • Lezcano-Casado and Martınez-Rubio [2019] Mario Lezcano-Casado and David Martınez-Rubio. Cheap orthogonal constraints in neural networks: A simple parametrization of the orthogonal and unitary group. In International Conference on Machine Learning, pages 3794–3803. PMLR, 2019.

Appendix A Related Work

Neural Collapse (NeurCol) was first observed by Papyan et al. [2020] in 2020, and it has since sparked numerous studies investigating deep classification models.

Many of these studies Ji et al. [2022], Fang et al. [2021], Zhu et al. [2021], Zhou et al. [2022a], Yaras et al. [2022a] have focused on the optimization aspect of NeurCol, proposing various optimization models and analyzing them. For example, Zhu et al. [2021] proposed Unconstrained Feature Models and provided the first global optimization analysis of NeurCol, while Fang et al. [2021] proposed the Layer-Peeled Model, a nonconvex yet analytically tractable optimization program, to prove NeurCol and predict Minority Collapse, an imbalanced version of NeurCol. Other studies have explored NeurCol under the Mean Square Error (MSE) loss Han et al. [2022], Tirer and Bruna [2022], Zhou et al. [2022b], Mixon et al. [2020], Poggio and Liao [2020]. For instance, Han et al. [2022] and Tirer and Bruna [2022] justified NeurCol under the MSE loss. Beyond the MSE loss, Zhou et al. [2022a] extended such results and analyzed a broad family of loss functions, including the commonly used label smoothing and focal losses. Besides, the NeurCol phenomenon has also inspired the design of loss functions in imbalanced learning Xie et al. [2022] and Few-Shot Learning Yang et al. [2023].

Previous studies on NeurCol have generally assumed that the number of classes is less than the feature dimension, and this assumption was not questioned until recently. At the Eleventh International Conference on Learning Representations (ICLR 2023), Liu et al. [2023] proposed the Generalized Neural Collapse hypothesis, which removes this restriction by extending the ETF structure to Hyperspherical Uniformity. In this paper, we contribute to this area by proving the Generalized Neural Collapse hypothesis and introducing the Grassmannian Frame to better understand the NeurCol phenomenon.

Appendix B Proof of Theorem 3.1

We introduce this lemma:

Lemma B.1.

Suppose that $\boldsymbol{0}\notin\mathcal{K}$ and that $\mathcal{K}$ is a closed set. Suppose $\boldsymbol{g}\sim\mathcal{N}(\boldsymbol{0},\sigma^{2}\boldsymbol{I})$; then we have:

$$-\sigma^{2}\log\mathbb{P}_{\boldsymbol{g}}(\mathcal{K})\rightarrow\min_{\boldsymbol{g}\in\mathcal{K}}\Big\{\frac{1}{2}\|\boldsymbol{g}\|^{2}\Big\}\ \ \text{as}\ \ \sigma\rightarrow 0$$

This lemma establishes the relationship between geometry and probability. We now start the proof.

Theorem 3.1 (restated).

Proof.

First, let us think about what kind of decoding is optimal. According to Shannon [1959], since the Gaussian density is monotonically decreasing with distance, an optimal decoding scheme for the Gaussian channel is minimum distance decoding, i.e.,

$$\hat{c}=\underset{c\in\{1,\dots,C\}}{\arg\min}\|M_{c}-\boldsymbol{h}\|$$

where $\hat{c}$ is the prediction result. We first consider the two-class communication problem: only two numbers $c$ and $c^{\prime}$ are transmitted. We denote the event that a $c$ signal is recovered as $c^{\prime}$ by $\varepsilon_{c^{\prime}|c}$; then

$$\varepsilon_{c^{\prime}|c}=\{\boldsymbol{g}\in\mathbb{R}^{d}:\ \|(M_{c}+\boldsymbol{g})-M_{c}\|>\|(M_{c}+\boldsymbol{g})-M_{c^{\prime}}\|\}$$

According to Lemma.B.1, we have

$$-\sigma^{2}\log\mathbb{P}_{\boldsymbol{g}}(\varepsilon_{c^{\prime}|c})\rightarrow\min_{\boldsymbol{g}\in\varepsilon_{c^{\prime}|c}}\Big\{\frac{1}{2}\|\boldsymbol{g}\|^{2}\Big\}=\frac{1}{8}\|M_{c}-M_{c^{\prime}}\|^{2},\quad\sigma\rightarrow 0.$$

For all transmitted numbers, the error event $\varepsilon$ can be divided into the error events between every two numbers, i.e.,

$$\varepsilon=\bigcup_{c\neq c^{\prime}}\varepsilon_{c|c^{\prime}}$$

So

$$-\sigma^{2}\log\mathbb{P}_{\boldsymbol{g}}(\varepsilon)\rightarrow\frac{1}{8}\min_{c\neq c^{\prime}}\|M_{c}-M_{c^{\prime}}\|^{2},\quad\sigma\rightarrow 0.$$

To obtain the code with minimal error probability, we maximize $\min_{c\neq c^{\prime}}\|M_{c}-M_{c^{\prime}}\|^{2}$. With a norm constraint on every code vector, we have

$$\max_{\forall c,\|M_{c}\|=1}\min_{c\neq c^{\prime}}\|M_{c}-M_{c^{\prime}}\|^{2}\ \Leftrightarrow\ \min_{\forall c,\|M_{c}\|=1}\max_{c\neq c^{\prime}}\langle M_{c},M_{c^{\prime}}\rangle$$

Appendix C Proof of Theorem 3.2

In this section, we prove Theorem 3.2. Here are two lemmas that we will use.

C.1 Lemmas

Lemma C.1 (Lemma.7 of Yang et al. [2021]: Lipschitz Properties of Softmax).

Given $x\in\mathbb{R}^{C}$, the function $Softmax(x)$ is defined as

$$Softmax(x)=\left[\frac{e^{x_{1}}}{\sum_{i=1}^{C}e^{x_{i}}},\dots,\frac{e^{x_{C}}}{\sum_{i=1}^{C}e^{x_{i}}}\right]^{T},$$

then the function $Softmax(\cdot)$ is $\sqrt{\frac{C}{2}}$-Lipschitz continuous.

Lemma C.2.

Given any matrix \boldsymbol{A}\in\mathbb{R}^{n\times n} with constant a on the diagonal and constant c off the diagonal, i.e.,

\boldsymbol{A}=\left[\begin{array}{cccc}a&c&\ldots&c\\ c&a&\cdots&c\\ \vdots&\vdots&\ddots&\vdots\\ c&c&\ldots&a\\ \end{array}\right],

then we have |\boldsymbol{A}|=\left(a-c\right)^{n-1}\left(a+(n-1)c\right).

Proof.
|\boldsymbol{A}|\xlongequal{\textbf{step1}}\left|\begin{array}{cccc}a-c&0&\ldots&c-a\\ 0&a-c&\cdots&c-a\\ \vdots&\vdots&\ddots&\vdots\\ c&c&\ldots&a\\ \end{array}\right|\xlongequal{\textbf{step2}}\left|\begin{array}{cccc}a-c&0&\ldots&0\\ 0&a-c&\cdots&0\\ \vdots&\vdots&\ddots&\vdots\\ c&c&\ldots&a+(n-1)c\\ \end{array}\right|
  • step1: Subtract the last row from each of the first n-1 rows.

  • step2: Add each of the first n-1 columns to the last column.

The resulting matrix is lower triangular, so the determinant is the product of its diagonal entries, which gives (a-c)^{n-1}(a+(n-1)c). ∎
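A one-line numerical check of this determinant formula (our own illustration):

import torch

n, a, c = 6, 3.0, 0.5
A = (a - c) * torch.eye(n) + c * torch.ones(n, n)   # a on the diagonal, c elsewhere
print(torch.linalg.det(A).item(), (a - c) ** (n - 1) * (a + (n - 1) * c))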

C.2 Background

In the proof of Theorem.3.2, the main technique we use is limit analysis, together with a little linear algebra.

C.2.1 Assumption

We make the following assumption:

Assumption C.3.

In the update (4), the parameters are properly selected: \lambda/\omega=\alpha/\beta=N/C. In addition, \alpha is small enough to ensure that the system converges and that the norm of every vector in \boldsymbol{M},\boldsymbol{Z} is always bounded by \rho.

A sufficiently small \alpha is necessary and reasonable for the convergence of Gradient Descent. With this assumption, we can ensure that no variable in system (4) blows up and that the system finally converges to a stable state. The condition \lambda/\omega=\alpha/\beta=N/C makes sure that both classifiers and features are bounded by the same maximal \ell_{2} norm.

C.2.2 Symbol Regulations

We regulate our symbols for a clearer presentation. Recall that in our setting, \boldsymbol{z}_{y,i} denotes the i-th sample in class y and every class has N/C samples. Here, we put the i-th sample of every class together and denote it as \boldsymbol{Z}_{i}, i.e.,

\boldsymbol{Z}_{i}=\left[\boldsymbol{z}_{1,i},\cdots,\boldsymbol{z}_{C,i}\right]\in\mathbb{R}^{d\times C}\ \ \text{and}\ \ \boldsymbol{Z}=[\boldsymbol{Z}_{1},\cdots,\boldsymbol{Z}_{N/C}]\in\mathbb{R}^{d\times N}

Then we denote the confidence probabilities of \boldsymbol{Z}_{i} given by the classifier \boldsymbol{M} as

\boldsymbol{P}_{i}=\left[Softmax(\boldsymbol{z}_{1,i}^{T}M),\cdots,Softmax(\boldsymbol{z}_{C,i}^{T}M)\right]\in\mathbb{R}^{C\times C}

where Softmax(\cdot) transforms the logits into a probability vector; refer to Lemma.C.1 for its definition.
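In code, the notation above corresponds to the following sketch (a toy instantiation of our own with random features; only the shapes and the construction of \boldsymbol{Z}_{i} and \boldsymbol{P}_{i} matter here):

import torch

C, d, N = 4, 2, 40                 # classes, feature dimension, total samples (toy values)
M = torch.randn(d, C)              # columns are the classifiers M_1, ..., M_C
Z = torch.randn(d, N)              # columns are the features, grouped as Z = [Z_1, ..., Z_{N/C}]

# Z_i stacks the i-th sample of every class, so Z_i is d x C.
Z_blocks = [Z[:, i * C:(i + 1) * C] for i in range(N // C)]

# P_i holds the confidence vectors Softmax(z_{y,i}^T M) as its columns, so P_i is C x C.
P_blocks = [torch.softmax(Zi.T @ M, dim=1).T for Zi in Z_blocks]
print(Z_blocks[0].shape, P_blocks[0].shape)   # torch.Size([2, 4]) torch.Size([4, 4])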

C.2.3 Proof Sketch

We prove Theorem.3.2 by proving three Lemmas:

Lemma C.4 (Variability within Classes).

Consider the update of Gradient Descent (3). Under Assumption.C.3, the features of samples belonging to the same class converge to the same point, i.e., \forall i,j\in[N/C],\|\boldsymbol{Z}_{i}^{(t)}-\boldsymbol{Z}_{j}^{(t)}\|\rightarrow 0\ \text{as}\ t\rightarrow\infty.

Lemma C.5 (Convergence to Self Duality).

Consider the update of Gradient Descent (3). Under Assumption.C.3, the feature of every sample converges to the classifier of its category, i.e., \forall i\in[N/C],\|\boldsymbol{Z}_{i}^{(t)}-\boldsymbol{M}^{(t)}\|\rightarrow 0\ \text{as}\ t\rightarrow\infty.

Lemma C.6 (Convergence to Grassmannian Frame).

Consider the function (1). Given any sequence \{(M^{(t)},Z^{(t)})\}, if \text{CELoss}(M^{(t)},Z^{(t)},Y)\rightarrow 0 as t\rightarrow\infty, then \boldsymbol{M}^{(t)} and \boldsymbol{Z}^{(t)} converge to the solution of

\underset{\boldsymbol{M},\boldsymbol{Z}}{\max}\ \underset{y\neq y^{\prime}}{\min}\ \underset{i\in[N/C]}{\min}\langle M_{y}-M_{y^{\prime}},\boldsymbol{z}_{y,i}\rangle

Obviously, Lemma.C.4 and Lemma.C.5 directly imply NC1 and NC2. We first prove them; they show the aggregation effect of the cross entropy loss: the features and the linear classifier of the same class converge to the same point. NC4 is then obvious under the conclusions of Lemma.C.4 and Lemma.C.5. Lemma.C.6 is the key step, proving that (1) converges to the min-max optimization as \mathcal{L}(\boldsymbol{M},\boldsymbol{Z}) converges to zero. However, we still need to combine all lemmas to obtain the minimized maximal correlation characteristic. Here is the proof:

Proof of (NC3).

First, we know

min𝒁,𝑴CELoss(𝑴,𝒁)(1)\displaystyle\underbrace{\min_{\boldsymbol{Z},\boldsymbol{M}}\text{CELoss}(\boldsymbol{M},\boldsymbol{Z})}_{(\ref{opt0})}\Leftrightarrow minρ>0min𝒛y,i,MiρCELoss(𝑴,𝒁)(1)s.t.(2)minλ,ω>0min𝒁,𝑴(𝑴,𝒁)(3),\displaystyle\min_{\rho>0}\underbrace{\min_{\|\boldsymbol{z}_{y,i}\|,\|M_{i}\|\leq\rho}\text{CELoss}(\boldsymbol{M},\boldsymbol{Z})}_{(\ref{opt0})\ s.t.(\ref{opt1})}\Leftrightarrow\min_{\lambda,\omega>0}\underbrace{\min_{\boldsymbol{Z},\boldsymbol{M}}\mathcal{L}(\boldsymbol{M},\boldsymbol{Z})}_{(\ref{opt2})},

while Lemma.C.6 establishes a bridge between (1) and the following max-min problem:

min𝒁,𝑴CELoss(𝑴,𝒁)(1)maxρ>0max𝒛y,i,Miρminyymini[N/C]MyMy,𝒛y,imax-min\displaystyle\underbrace{\min_{\boldsymbol{Z},\boldsymbol{M}}\text{CELoss}(\boldsymbol{M},\boldsymbol{Z})}_{(\ref{opt0})}\Leftrightarrow\max_{\rho>0}\underbrace{\max_{\|\boldsymbol{z}_{y,i}\|,\|M_{i}\|\leq\rho}\underset{y\neq y^{\prime}}{\min}\underset{i\in[N/C]}{\min}\langle M_{y}-M_{y^{\prime}},\boldsymbol{z}_{y,i}\rangle}_{\textbf{max-min}}

According to Lemma.C.4 and C.5, we know the solutions of (3) converge to

\forall y\in[C],\forall i\in[N/C],\ \boldsymbol{z}_{y,i}=M_{y}, \qquad (5)

and there must exist a ρ(λ,ω)\rho(\lambda,\omega) such that

\forall y\in[C],\forall i\in[N/C],\ \|\boldsymbol{z}_{y,i}\|=\|M_{y}\|=\rho(\lambda,\omega). \qquad (6)

Therefore, we substitute (5) and (6) into max-min:

max𝒛y,i,Miρminyymini[N/C]MyMy,𝒛y,i\displaystyle\max_{\|\boldsymbol{z}_{y,i}\|,\|M_{i}\|\leq\rho}\underset{y\neq y^{\prime}}{\min}\underset{i\in[N/C]}{\min}\langle M_{y}-M_{y^{\prime}},\boldsymbol{z}_{y,i}\rangle
\displaystyle\Leftrightarrow maxMi=ρminyyMyMy,My\displaystyle\max_{\|M_{i}\|=\rho}\underset{y\neq y^{\prime}}{\min}\langle M_{y}-M_{y^{\prime}},M_{y}\rangle
\displaystyle\Leftrightarrow maxMi=ρminyy(ρ2My,My)\displaystyle\max_{\|M_{i}\|=\rho}\underset{y\neq y^{\prime}}{\min}\left(\rho^{2}-\langle M_{y^{\prime}},M_{y}\rangle\right)
\displaystyle\Leftrightarrow minMi=ρmaxyyMy,Mymin-max\displaystyle\underbrace{\min_{\|M_{i}\|=\rho}\underset{y\neq y^{\prime}}{\max}\langle M_{y^{\prime}},M_{y}\rangle}_{\textbf{min-max}}

where the \Leftrightarrow symbol means that the solutions of these optimization problems converge to the same point. The min-max problem is exactly the desired minimized maximal correlation characteristic. To ensure that the above equivalences hold, \rho\rightarrow\infty is required so that \text{CELoss}(\boldsymbol{Z},\boldsymbol{M})\rightarrow 0. ∎

C.2.4 Gradient Calculation

Before starting our proof, we calculate the gradients of (3) with respect to the features \boldsymbol{Z} and the classifiers \boldsymbol{M}:

\forall y\in[C],\forall i\in[N/C],\ \nabla_{\boldsymbol{z}_{y,i}}\mathcal{L}(\boldsymbol{Z},\boldsymbol{M})=-M_{y}+\sum_{y^{\prime}=1}^{C}\left[Softmax(\boldsymbol{z}_{y,i}^{T}\boldsymbol{M})\right]_{y^{\prime}}M_{y^{\prime}}+\omega\boldsymbol{z}_{y,i}
\forall y\in[C],\ \nabla_{M_{y}}\mathcal{L}(\boldsymbol{Z},\boldsymbol{M})=-\sum_{i=1}^{N/C}\boldsymbol{z}_{y,i}+\sum_{y^{\prime}=1}^{C}\sum_{i=1}^{N/C}\left[Softmax(\boldsymbol{z}_{y^{\prime},i}^{T}\boldsymbol{M})\right]_{y}\boldsymbol{z}_{y^{\prime},i}+\lambda M_{y}

Then we write it in matrix form by arranging y=1,\cdots,C along the columns:

\forall i\in[N/C],\ \nabla_{\boldsymbol{Z}_{i}}\mathcal{L}(\boldsymbol{Z},\boldsymbol{M})=-\boldsymbol{M}+\boldsymbol{M}\boldsymbol{P}_{i}+\omega\boldsymbol{Z}_{i} \qquad (7)
\nabla_{\boldsymbol{M}}\mathcal{L}(\boldsymbol{Z},\boldsymbol{M})=-\sum_{i=1}^{N/C}\boldsymbol{Z}_{i}+\sum_{i=1}^{N/C}\boldsymbol{Z}_{i}\boldsymbol{P}_{i}^{T}+\lambda\boldsymbol{M}
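These closed forms can be checked against automatic differentiation. The sketch below is our own illustration and assumes that the regularized objective (3) is the cross-entropy loss plus the penalties \frac{\omega}{2}\|\boldsymbol{Z}\|^{2} and \frac{\lambda}{2}\|\boldsymbol{M}\|^{2}, the form consistent with the gradients displayed above:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
C, d, n_per_class = 4, 3, 5
lam, omega = 0.2, 0.1

M = torch.randn(d, C, requires_grad=True)                  # classifiers as columns
Z = torch.randn(d, C * n_per_class, requires_grad=True)    # features as columns
labels = torch.arange(C).repeat(n_per_class)               # class label of every column of Z

logits = Z.T @ M                                           # row k holds z_k^T M
loss = (F.cross_entropy(logits, labels, reduction="sum")
        + 0.5 * omega * Z.pow(2).sum() + 0.5 * lam * M.pow(2).sum())
loss.backward()

# Closed-form gradients; here P stores the Softmax(z^T M) vectors in its rows.
P = torch.softmax(logits, dim=1)
onehot = F.one_hot(labels, C).float()
grad_Z = M @ (P - onehot).T + omega * Z                    # matches the first line of (7)
grad_M = Z @ (P - onehot) + lam * M                        # matches the second line of (7)
print(torch.allclose(Z.grad, grad_Z, atol=1e-5), torch.allclose(M.grad, grad_M, atol=1e-5))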

C.3 Proof of Lemma.C.4

See C.4

Proof.

With the conclusion of Lemma.C.5, we can easily prove Lemma.C.4. For any i,j\in[N/C], as t\rightarrow\infty we have

\|\boldsymbol{Z}_{i}^{(t)}-\boldsymbol{Z}_{j}^{(t)}\|=\|\boldsymbol{Z}_{i}^{(t)}-\boldsymbol{M}^{(t)}+\boldsymbol{M}^{(t)}-\boldsymbol{Z}_{j}^{(t)}\|\leq\|\boldsymbol{Z}_{i}^{(t)}-\boldsymbol{M}^{(t)}\|+\|\boldsymbol{Z}_{j}^{(t)}-\boldsymbol{M}^{(t)}\|\rightarrow 0

C.4 Proof of Lemma.C.5

See C.5

Proof.

According to the update rule (4) and gradient (7), for any i[N/C]i\in[N/C], we have

\boldsymbol{Z}_{i}^{(t+1)}\leftarrow\boldsymbol{Z}_{i}^{(t)}-\alpha\left[-\boldsymbol{M}^{(t)}+\boldsymbol{M}^{(t)}\boldsymbol{P}_{i}^{(t)}+\omega\boldsymbol{Z}_{i}^{(t)}\right]
\boldsymbol{M}^{(t+1)}\leftarrow\boldsymbol{M}^{(t)}-\beta\left[-\sum_{j\in[N/C]}\boldsymbol{Z}_{j}^{(t)}+\sum_{j\in[N/C]}\boldsymbol{Z}_{j}^{(t)}\left(\boldsymbol{P}_{j}^{(t)}\right)^{T}+\lambda\boldsymbol{M}^{(t)}\right]

Then, we bound \Delta(t+1,i)=\|\boldsymbol{Z}_{i}^{(t+1)}-\boldsymbol{M}^{(t+1)}\|:

\Delta(t+1,i)=\Bigg\|\boldsymbol{Z}_{i}^{(t)}-\boldsymbol{M}^{(t)}+\underbrace{\left(\alpha\boldsymbol{M}^{(t)}-\beta\sum_{j\in[N/C]}\boldsymbol{Z}_{j}^{(t)}\right)}_{\textbf{(a)}}-\underbrace{\left(\alpha\boldsymbol{M}^{(t)}\boldsymbol{P}_{i}^{(t)}-\beta\sum_{j\in[N/C]}\boldsymbol{Z}_{j}^{(t)}\left(\boldsymbol{P}_{j}^{(t)}\right)^{T}\right)}_{\textbf{(b)}}-\underbrace{\left(\alpha\omega\boldsymbol{Z}_{i}^{(t)}-\beta\lambda\boldsymbol{M}^{(t)}\right)}_{\textbf{(c)}}\Bigg\| \qquad (8)

Then we use the assumption that α/β=λ/ω=N/C\alpha/\beta=\lambda/\omega=N/C, and consider (a),(b)\textbf{(a)},\textbf{(b)} and (c) separately.

(a) =αCNj[N/C](𝑴(t)𝒁j(t))\displaystyle=\frac{\alpha C}{N}\sum_{j\in[N/C]}\left(\boldsymbol{M}^{(t)}-\boldsymbol{Z}_{j}^{(t)}\right) (9)
(b) =αCNj[N/C](𝑴(t)𝑷i(t)𝒁j(t)(𝑷j(t))T)\displaystyle=\frac{\alpha C}{N}\sum_{j\in[N/C]}\left(\boldsymbol{M}^{(t)}\boldsymbol{P}_{i}^{(t)}-\boldsymbol{Z}_{j}^{(t)}\left(\boldsymbol{P}_{j}^{(t)}\right)^{T}\right)
(c) =αω(𝒁i(t)𝑴(t))\displaystyle=\alpha\omega\left(\boldsymbol{Z}_{i}^{(t)}-\boldsymbol{M}^{(t)}\right)

Then we combine (a) and (b):

(a)(b)\displaystyle\textbf{(a)}-\textbf{(b)} (10)
=\displaystyle= αCNj[N/C][(𝑴(t)𝑴(t)𝑷i(t))(𝒁j(t)𝒁j(t)(𝑷j(t))T)]\displaystyle\frac{\alpha C}{N}\sum_{j\in[N/C]}\left[\left(\boldsymbol{M}^{(t)}-\boldsymbol{M}^{(t)}\boldsymbol{P}_{i}^{(t)}\right)-\left(\boldsymbol{Z}_{j}^{(t)}-\boldsymbol{Z}_{j}^{(t)}\left(\boldsymbol{P}_{j}^{(t)}\right)^{T}\right)\right]
=\displaystyle= αCNj[N/C][𝑴(t)(I𝑷i(t))𝒁j(t)(I(𝑷j(t))T)]\displaystyle\frac{\alpha C}{N}\sum_{j\in[N/C]}\left[\boldsymbol{M}^{(t)}\left(I-\boldsymbol{P}_{i}^{(t)}\right)-\boldsymbol{Z}_{j}^{(t)}\left(I-\left(\boldsymbol{P}_{j}^{(t)}\right)^{T}\right)\right]
=\displaystyle= αCNj[N/C][𝑴(t)(I𝑷i(t))𝑴(t)(I(𝑷j(t))T)+\displaystyle\frac{\alpha C}{N}\sum_{j\in[N/C]}\Bigg{[}\boldsymbol{M}^{(t)}\left(I-\boldsymbol{P}_{i}^{(t)}\right)-\boldsymbol{M}^{(t)}\left(I-\left(\boldsymbol{P}_{j}^{(t)}\right)^{T}\right)+
𝑴(t)(I(𝑷j(t))T)𝒁j(t)(I(𝑷j(t))T)]\displaystyle\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \boldsymbol{M}^{(t)}\left(I-\left(\boldsymbol{P}_{j}^{(t)}\right)^{T}\right)-\boldsymbol{Z}_{j}^{(t)}\left(I-\left(\boldsymbol{P}_{j}^{(t)}\right)^{T}\right)\Bigg{]}
=\displaystyle= αCNj[N/C][𝑴(t)((𝑷j(t))T𝑷i(t))+(𝑴(t)𝒁j(t))(I(𝑷j(t))T)]\displaystyle\frac{\alpha C}{N}\sum_{j\in[N/C]}\left[\boldsymbol{M}^{(t)}\left(\left(\boldsymbol{P}_{j}^{(t)}\right)^{T}-\boldsymbol{P}_{i}^{(t)}\right)+\left(\boldsymbol{M}^{(t)}-\boldsymbol{Z}_{j}^{(t)}\right)\left(I-\left(\boldsymbol{P}_{j}^{(t)}\right)^{T}\right)\right]
=\displaystyle= αCNj[N/C]𝑴(t)((𝑷j(t))T𝑷i(t))(A)+αCNj[N/C](𝑴(t)𝒁j(t))(I(𝑷j(t))T)(B)\displaystyle\underbrace{\frac{\alpha C}{N}\sum_{j\in[N/C]}\boldsymbol{M}^{(t)}\left(\left(\boldsymbol{P}_{j}^{(t)}\right)^{T}-\boldsymbol{P}_{i}^{(t)}\right)}_{\textbf{(A)}}+\underbrace{\frac{\alpha C}{N}\sum_{j\in[N/C]}\left(\boldsymbol{M}^{(t)}-\boldsymbol{Z}_{j}^{(t)}\right)\left(I-\left(\boldsymbol{P}_{j}^{(t)}\right)^{T}\right)}_{\textbf{(B)}}
=\displaystyle= (A)+(B)\displaystyle\textbf{(A)}+\textbf{(B)}

Next, we plug (9) and (10) into (8) to derive

Δ(t+1,i)\displaystyle\Delta(t+1,i) =𝒁i(t)𝑴(t)+(a)(b)(c)=𝒁i(t)𝑴(t)(c)+(A)+(B)\displaystyle=\left\|\boldsymbol{Z}_{i}^{(t)}-\boldsymbol{M}^{(t)}+\textbf{(a)}-\textbf{(b)}-\textbf{(c)}\right\|=\left\|\boldsymbol{Z}_{i}^{(t)}-\boldsymbol{M}^{(t)}-\textbf{(c)}+\textbf{(A)}+\textbf{(B)}\right\|
=(1αω)(𝒁i(t)𝑴(t))+(A)+(B)(1αω)Δ(t,i)+(A)+(B)\displaystyle=\left\|(1-\alpha\omega)\left(\boldsymbol{Z}_{i}^{(t)}-\boldsymbol{M}^{(t)}\right)+\textbf{(A)}+\textbf{(B)}\right\|\leq(1-\alpha\omega)\Delta(t,i)+\left\|\textbf{(A)}\right\|+\left\|\textbf{(B)}\right\|

Then we bound (A)\|\textbf{(A)}\| and (B)\|\textbf{(B)}\|.

(A)\displaystyle\|\textbf{(A)}\|\leq αCNj[N/C]𝑴(t)(𝑷j(t))T𝑷i(t)\displaystyle\frac{\alpha C}{N}\sum_{j\in[N/C]}\left\|\boldsymbol{M}^{(t)}\right\|\left\|\left(\boldsymbol{P}_{j}^{(t)}\right)^{T}-\boldsymbol{P}_{i}^{(t)}\right\|
\displaystyle\leq αC3/2ρNj[N/C](𝑷j(t))T𝑷i(t)\displaystyle\frac{\alpha C^{3/2}\rho}{N}\sum_{j\in[N/C]}\left\|\left(\boldsymbol{P}_{j}^{(t)}\right)^{T}-\boldsymbol{P}_{i}^{(t)}\right\|
=\displaystyle= αC3/2ρNj[N/C](𝑷j(t))T𝑷j(t)+𝑷j(t)𝑷i(t)\displaystyle\frac{\alpha C^{3/2}\rho}{N}\sum_{j\in[N/C]}\left\|\left(\boldsymbol{P}_{j}^{(t)}\right)^{T}-\boldsymbol{P}_{j}^{(t)}+\boldsymbol{P}_{j}^{(t)}-\boldsymbol{P}_{i}^{(t)}\right\|
\displaystyle\leq αC3/2ρNj[N/C]((𝑷j(t))T𝑷j(t)(A.1)+𝑷j(t)𝑷i(t)A.2)\displaystyle\frac{\alpha C^{3/2}\rho}{N}\sum_{j\in[N/C]}\left(\left\|\underbrace{\left(\boldsymbol{P}_{j}^{(t)}\right)^{T}-\boldsymbol{P}_{j}^{(t)}}_{\textbf{(A.1)}}\right\|+\left\|\underbrace{\boldsymbol{P}_{j}^{(t)}-\boldsymbol{P}_{i}^{(t)}}_{\textbf{A.2}}\right\|\right)

For (A.1), we have

(A.1)\displaystyle\|\textbf{(A.1)}\| =(𝑷j(t))T𝑷j(t)\displaystyle=\left\|\left(\boldsymbol{P}_{j}^{(t)}\right)^{T}-\boldsymbol{P}_{j}^{(t)}\right\|
C2(𝑴(t))T𝒁j(t)(𝒁j(t))T𝑴(t)\displaystyle\leq\frac{\sqrt{C}}{\sqrt{2}}\left\|\left(\boldsymbol{M}^{(t)}\right)^{T}\boldsymbol{Z}_{j}^{(t)}-\left(\boldsymbol{Z}_{j}^{(t)}\right)^{T}\boldsymbol{M}^{(t)}\right\|
=C2(𝑴(t))T𝒁j(t)(𝑴(t))T𝑴(t)+(𝑴(t))T𝑴(t)(𝒁j(t))T𝑴(t)\displaystyle=\frac{\sqrt{C}}{\sqrt{2}}\left\|\left(\boldsymbol{M}^{(t)}\right)^{T}\boldsymbol{Z}_{j}^{(t)}-\left(\boldsymbol{M}^{(t)}\right)^{T}\boldsymbol{M}^{(t)}+\left(\boldsymbol{M}^{(t)}\right)^{T}\boldsymbol{M}^{(t)}-\left(\boldsymbol{Z}_{j}^{(t)}\right)^{T}\boldsymbol{M}^{(t)}\right\|
C2𝑴(t)𝒁j(t)𝑴(t)+C2𝑴(t)𝒁j(t)𝑴(t)\displaystyle\leq\frac{\sqrt{C}}{\sqrt{2}}\left\|\boldsymbol{M}^{(t)}\right\|\left\|\boldsymbol{Z}_{j}^{(t)}-\boldsymbol{M}^{(t)}\right\|+\frac{\sqrt{C}}{\sqrt{2}}\left\|\boldsymbol{M}^{(t)}-\boldsymbol{Z}_{j}^{(t)}\right\|\left\|\boldsymbol{M}^{(t)}\right\|
=2C𝑴(t)𝒁j(t)𝑴(t)\displaystyle=\sqrt{2C}\left\|\boldsymbol{M}^{(t)}-\boldsymbol{Z}_{j}^{(t)}\right\|\left\|\boldsymbol{M}^{(t)}\right\|
2CρΔ(t,j)\displaystyle\leq\sqrt{2}C\rho\Delta(t,j)

where the first inequality holds because the Softmax(\cdot) function is \sqrt{\frac{C}{2}}-Lipschitz continuous, and the last inequality follows from the bounded norm of \boldsymbol{M}: for all t, we have \|\boldsymbol{M}^{(t)}\|\leq\sqrt{C}\rho. For (A.2), if i=j, then \|\textbf{(A.2)}\|=0. If i\neq j, we have

(A.2)\displaystyle\|\textbf{(A.2)}\| =𝑷j(t)𝑷i(t)\displaystyle=\left\|\boldsymbol{P}_{j}^{(t)}-\boldsymbol{P}_{i}^{(t)}\right\|
C2(𝒁j(t))T𝑴(t)(𝒁i(t))T𝑴(t)\displaystyle\leq\frac{\sqrt{C}}{\sqrt{2}}\left\|\left(\boldsymbol{Z}_{j}^{(t)}\right)^{T}\boldsymbol{M}^{(t)}-\left(\boldsymbol{Z}_{i}^{(t)}\right)^{T}\boldsymbol{M}^{(t)}\right\|
C2𝒁j(t)𝒁i(t)𝑴(t)\displaystyle\leq\frac{\sqrt{C}}{\sqrt{2}}\left\|\boldsymbol{Z}_{j}^{(t)}-\boldsymbol{Z}_{i}^{(t)}\right\|\left\|\boldsymbol{M}^{(t)}\right\|
C2𝒁j(t)𝑴(t)+𝑴(t)𝒁i(t)𝑴(t)\displaystyle\leq\frac{\sqrt{C}}{\sqrt{2}}\left\|\boldsymbol{Z}_{j}^{(t)}-\boldsymbol{M}^{(t)}+\boldsymbol{M}^{(t)}-\boldsymbol{Z}_{i}^{(t)}\right\|\left\|\boldsymbol{M}^{(t)}\right\|
Cρ2(Δ(t,i)+Δ(t,j))\displaystyle\leq\frac{C\rho}{\sqrt{2}}\left(\Delta(t,i)+\Delta(t,j)\right)

For (B), we have

(B)αCNj[N/C]Δ(t,j)I(𝑷j(t))TαC2Nj[N/C]Δ(t,j)\displaystyle\|\textbf{(B)}\|\leq\frac{\alpha C}{N}\sum_{j\in[N/C]}\Delta(t,j)\left\|I-\left(\boldsymbol{P}_{j}^{(t)}\right)^{T}\right\|\leq\frac{\alpha C^{2}}{N}\sum_{j\in[N/C]}\Delta(t,j)

The final inequality holds because the columns of both I and \boldsymbol{P}_{j}^{(t)} are probability vectors, so every entry of I-\boldsymbol{P}_{j}^{(t)} lies in [-1,1] and its norm is at most the norm of the all-one matrix, which is C. Finally, we can bound \Delta(t+1,i):

Δ(t+1,i)\displaystyle\Delta(t+1,i) (1αω)Δ(t,i)+(A)+(B)\displaystyle\leq(1-\alpha\omega)\Delta(t,i)+\|\textbf{(A)}\|+\|\textbf{(B)}\|
(1αω)Δ(t,i)+αC3/2ρNj[N/C]((A.1)+A.2)+αC2Nj[N/C]Δ(t,j)\displaystyle\leq(1-\alpha\omega)\Delta(t,i)+\frac{\alpha C^{3/2}\rho}{N}\sum_{j\in[N/C]}\left(\left\|\textbf{(A.1)}\right\|+\left\|\textbf{A.2}\right\|\right)+\frac{\alpha C^{2}}{N}\sum_{j\in[N/C]}\Delta(t,j)
(1α(ωC2N))Δ(t,i)+αC2NjiΔ(t,j)+\displaystyle\leq\left(1-\alpha\left(\omega-\frac{C^{2}}{N}\right)\right)\Delta(t,i)+\frac{\alpha C^{2}}{N}\sum_{j\neq i}\Delta(t,j)+
αC3/2ρNj[N/C]2CρΔ(t,j)+αC3/2ρNj[N/C]Cρ2(Δ(t,i)+Δ(t,j))\displaystyle\ \ \ \ \ \ \ \ \ \ \ \frac{\alpha C^{3/2}\rho}{N}\sum_{j\in[N/C]}\sqrt{2}C\rho\Delta(t,j)+\frac{\alpha C^{3/2}\rho}{N}\sum_{j\in[N/C]}\frac{C\rho}{\sqrt{2}}\left(\Delta(t,i)+\Delta(t,j)\right)
(1α(ωC2N))Δ(t,i)+αC2NjiΔ(t,j)+\displaystyle\leq\left(1-\alpha\left(\omega-\frac{C^{2}}{N}\right)\right)\Delta(t,i)+\frac{\alpha C^{2}}{N}\sum_{j\neq i}\Delta(t,j)+
2αC5/2ρ2Nj[N/C]Δ(t,j)+αC5/2ρ2N2j[N/C](Δ(t,i)+Δ(t,j))\displaystyle\ \ \ \ \ \ \ \ \ \ \ \frac{\sqrt{2}\alpha C^{5/2}\rho^{2}}{N}\sum_{j\in[N/C]}\Delta(t,j)+\frac{\alpha C^{5/2}\rho^{2}}{N\sqrt{2}}\sum_{j\in[N/C]}\left(\Delta(t,i)+\Delta(t,j)\right)
(1α(ωC2N2C5/2ρ2NC3/2ρ22C5/2ρ2N2)F1)Δ(t,i)+\displaystyle\leq\left(1-\alpha\underbrace{\left(\omega-\frac{C^{2}}{N}-\frac{\sqrt{2}C^{5/2}\rho^{2}}{N}-\frac{C^{3/2}\rho^{2}}{\sqrt{2}}-\frac{C^{5/2}\rho^{2}}{N\sqrt{2}}\right)}_{F1}\right)\Delta(t,i)+
α(C2N+2C5/2ρ2N+C5/2ρ2N2)F2jiΔ(t,j)\displaystyle\ \ \ \ \ \ \ \ \ \ \ \alpha\underbrace{\left(\frac{C^{2}}{N}+\frac{\sqrt{2}C^{5/2}\rho^{2}}{N}+\frac{C^{5/2}\rho^{2}}{N\sqrt{2}}\right)}_{F2}\sum_{j\neq i}\Delta(t,j) (11)

We put all Δ(t+1,)\Delta(t+1,\star) and Δ(t,)\Delta(t,\star) together to derive the difference inequality:

𝚫(t+1)𝑨𝚫(t)\displaystyle\boldsymbol{\Delta}(t+1)\preceq\boldsymbol{A}\boldsymbol{\Delta}(t)

where

\boldsymbol{A}=\left[\begin{array}{cccc}1-\alpha F_{1}&\alpha F_{2}&\ldots&\alpha F_{2}\\ \alpha F_{2}&1-\alpha F_{1}&\cdots&\alpha F_{2}\\ \vdots&\vdots&\ddots&\vdots\\ \alpha F_{2}&\alpha F_{2}&\ldots&1-\alpha F_{1}\\ \end{array}\right]\ \text{and}\ \boldsymbol{\Delta}(t)=\left[\begin{array}{c}\Delta(t,1)\\ \Delta(t,2)\\ \vdots\\ \Delta(t,N/C)\\ \end{array}\right]

In the above notation, the values of F_{1} and F_{2} are given in (11). We investigate whether \boldsymbol{\Delta}(t) converges to the zero vector under a suitable choice of \alpha. According to Lemma.C.2, the eigenvalues of \boldsymbol{A} are

λ1=λ2==λN/C1=1α(F1+F2)\displaystyle\lambda_{1}=\lambda_{2}=\dots=\lambda_{N/C-1}=1-\alpha\left(F_{1}+F_{2}\right)
λN/C=1α(F1(N/C1)F2)\displaystyle\lambda_{N/C}=1-\alpha\left(F_{1}-\left(N/C-1\right)F_{2}\right)

Here, with properly selected parameters, we can make all eigenvalues of \boldsymbol{A} lie in (-1,1). Therefore, as t\rightarrow\infty, \boldsymbol{A}^{t}\rightarrow\boldsymbol{0} and \boldsymbol{\Delta}(t)\rightarrow\boldsymbol{0}. ∎
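The last step can be illustrated numerically (our own sketch, with arbitrary values of \alpha, F_{1}, F_{2} satisfying F_{1}>(N/C-1)F_{2}>0): the eigenvalues of \boldsymbol{A} match the formulas above, and iterating \boldsymbol{\Delta}(t+1)=\boldsymbol{A}\boldsymbol{\Delta}(t) drives \boldsymbol{\Delta}(t) to zero when they all lie in (-1,1).

import torch

n, alpha, F1, F2 = 8, 0.1, 1.0, 0.05   # n plays the role of N/C (arbitrary toy values)
A = alpha * F2 * torch.ones(n, n) + (1 - alpha * F1 - alpha * F2) * torch.eye(n)

# Expect n-1 copies of 1 - alpha (F1 + F2) and one copy of 1 - alpha (F1 - (n-1) F2).
print(torch.linalg.eigvalsh(A))

delta = torch.rand(n)
for _ in range(500):
    delta = A @ delta
print("||Delta(t)|| after 500 iterations:", delta.norm().item())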

C.5 Proof of Lemma.C.6

See C.6

Proof.

For simplicity, we omit the superscript (t). First, for all t we have

CELoss(𝒁,𝑴)=\displaystyle\text{CELoss}(\boldsymbol{Z},\boldsymbol{M})= y=1Ci=1N/Clogexp(My,𝒛y,i)yexp(My,𝒛y,i)\displaystyle\sum_{y=1}^{C}\sum_{i=1}^{N/C}-\log\frac{\exp\big{(}\langle M_{y},\boldsymbol{z}_{y,i}\rangle\big{)}}{\sum_{y^{\prime}}\exp\big{(}\langle M_{y^{\prime}},\boldsymbol{z}_{y,i}\rangle\big{)}} (12)
=\displaystyle= y=1Ci=1N/Clog(1+yyexp(MyMy,𝒛y,i))\displaystyle\sum_{y=1}^{C}\sum_{i=1}^{N/C}\log\bigg{(}1+\sum_{y^{\prime}\neq y}\exp\big{(}\langle M_{y^{\prime}}-M_{y},\boldsymbol{z}_{y,i}\rangle\big{)}\bigg{)}
\displaystyle\leq y=1Ci=1N/Clog(1+(C1)exp(maxyy{MyMy,𝒛y,i)})\displaystyle\sum_{y=1}^{C}\sum_{i=1}^{N/C}\log\bigg{(}1+(C-1)\exp\big{(}\max_{y^{\prime}\neq y}\{\langle M_{y^{\prime}}-M_{y},\boldsymbol{z}_{y,i}\rangle\big{)}\}\bigg{)}
\displaystyle\leq NCy=1Clog(1+(C1)exp(maxyymaxi[N/C]{MyMy,𝒛y,i)})\displaystyle\frac{N}{C}\sum_{y=1}^{C}\log\bigg{(}1+(C-1)\exp\big{(}\max_{y^{\prime}\neq y}\max_{i\in[N/C]}\{\langle M_{y^{\prime}}-M_{y},\boldsymbol{z}_{y,i}\rangle\big{)}\}\bigg{)}
\displaystyle\leq Nmaxy[C]log(1+(C1)exp(maxyymaxi[N/C]{MyMy,𝒛y,i)})\displaystyle N\max_{y\in[C]}\log\bigg{(}1+(C-1)\exp\big{(}\max_{y^{\prime}\neq y}\max_{i\in[N/C]}\{\langle M_{y^{\prime}}-M_{y},\boldsymbol{z}_{y,i}\rangle\big{)}\}\bigg{)}
=\displaystyle= Nlog(1+(C1)exp(maxy[C]maxyymaxi[N/C]{MyMy,𝒛y,i)})\displaystyle N\log\bigg{(}1+(C-1)\exp\big{(}\max_{y\in[C]}\max_{y^{\prime}\neq y}\max_{i\in[N/C]}\{\langle M_{y^{\prime}}-M_{y},\boldsymbol{z}_{y,i}\rangle\big{)}\}\bigg{)}

In addition, we have

log(1+exp(maxy[C]maxyymaxi[N/C]{MyMy,𝒛y,i)})CELoss(𝒁,𝑴)\displaystyle\log\bigg{(}1+\exp\big{(}\max_{y\in[C]}\max_{y^{\prime}\neq y}\max_{i\in[N/C]}\{\langle M_{y^{\prime}}-M_{y},\boldsymbol{z}_{y,i}\rangle\big{)}\}\bigg{)}\leq\text{CELoss}(\boldsymbol{Z},\boldsymbol{M}) (13)

We write \max_{y^{\prime}\neq y} as shorthand for \max_{y\in[C]}\max_{y^{\prime}\neq y} and define the margin of the entire dataset (refer to Section.3.1 of Ji et al. [2022]) as follows:

p_{min}:=\min_{y\neq y^{\prime}}\min_{i\in[N/C]}\langle M_{y}-M_{y^{\prime}},z_{y,i}\rangle

Therefore, we have

log(1+exp(pmin))1(pmin)CELoss(𝒁,𝑴)Nlog(1+(C1)exp(pmin))C1(pmin)\displaystyle\underbrace{\log\bigg{(}1+\exp\left(-p_{min}\right)\bigg{)}}_{\ell_{1}(p_{min})}\leq\text{CELoss}(\boldsymbol{Z},\boldsymbol{M})\leq N\underbrace{\log\bigg{(}1+(C-1)\exp\left(-p_{min}\right)\bigg{)}}_{\ell_{C-1}(p_{min})} (14)

where \ell_{a}(p)=\log(1+ae^{-p}). Then we represent \ell_{a}(\cdot) in exponential form, i.e.,

a(p)=eϕa(p)andϕa(p)=loglog(1+aep).\begin{aligned} \ell_{a}(p)=e^{-\phi_{a}(p)}\ \ \text{and}\ \ \phi_{a}(p)=-\log\log(1+ae^{-p})\end{aligned}.

Denote the inverse function of \phi_{a}(\cdot) as \Phi_{a}(\cdot), where \Phi_{a}(p)=-\log(\frac{e^{e^{-p}}-1}{a}). Then, continuing from (14), we have

1(pmin)CELoss(𝒁,𝑴)NC1(pmin)\displaystyle\ell_{1}(p_{min})\leq\text{CELoss}(\boldsymbol{Z},\boldsymbol{M})\leq N\ell_{C-1}(p_{min})
\displaystyle\Leftrightarrow eϕ1(pmin)CELoss(𝒁,𝑴)NeϕC1(pmin)\displaystyle e^{-\phi_{1}(p_{min})}\leq\text{CELoss}(\boldsymbol{Z},\boldsymbol{M})\leq Ne^{-\phi_{C-1}(p_{min})}
\displaystyle\Leftrightarrow ϕC1(pmin)log(N)log(CELoss(𝒁,𝑴))ϕ1(pmin)\displaystyle\phi_{C-1}(p_{min})-\log(N)\leq-\log(\text{CELoss}(\boldsymbol{Z},\boldsymbol{M}))\leq\phi_{1}(p_{min})

According to the monotonicity of Φ1()\Phi_{1}(\cdot), we have

Φ1(ϕC1(pmin)log(N))Φ1(log(CELoss(𝒁,𝑴)))pmin\displaystyle\Phi_{1}\left(\phi_{C-1}(p_{min})-\log(N)\right)\leq\Phi_{1}\left(-\log(\text{CELoss}(\boldsymbol{Z},\boldsymbol{M}))\right)\leq p_{min}

Using the mean value theorem, there exists \xi\in(\phi_{C-1}(p_{min})-\log(N),\phi_{1}(p_{min})) such that

Φ1(ϕC1(pmin)log(N))=pminΦ1(ξ)(ϕ1(pmin)ϕC1(pmin)+log(N)),\displaystyle\Phi_{1}(\phi_{C-1}(p_{min})-\log(N))=p_{min}-\Phi_{1}^{{}^{\prime}}(\xi)(\phi_{1}(p_{min})-\phi_{C-1}(p_{min})+\log(N)),

then

pminΦ1(ξ)(ϕ1(pmin)ϕC1(pmin)+log(N))Δ(t)Φ1(log(CELoss(𝒁,𝑴)))pmin\displaystyle p_{min}-\underbrace{\Phi_{1}^{{}^{\prime}}(\xi)(\phi_{1}(p_{min})-\phi_{C-1}(p_{min})+\log(N))}_{\Delta(t)}\leq\Phi_{1}\left(-\log(\text{CELoss}(\boldsymbol{Z},\boldsymbol{M}))\right)\leq p_{min} (15)

Then we show that \Delta(t)=\mathcal{O}(1) as t\rightarrow\infty. Since

ξ>ϕC1(pmin)log(N)log(CELoss(𝒁,𝑴))log(N),\begin{aligned} \xi>\phi_{C-1}(p_{min})-\log(N)\geq-\log(\text{CELoss}(\boldsymbol{Z},\boldsymbol{M}))-\log(N)\end{aligned},

we know ξ\xi\rightarrow\infty and pminp_{min}\rightarrow\infty as CELoss(𝒁,𝑴)0\text{CELoss}(\boldsymbol{Z},\boldsymbol{M})\rightarrow 0. By simple calculation, we have ϕ1(pmin)ϕC1(pmin)log(C1)\phi_{1}(p_{min})-\phi_{C-1}(p_{min})\rightarrow\log(C-1) and Φ1(ξ)=eeξξeeξ11\Phi_{1}^{{}^{\prime}}(\xi)=\frac{e^{e^{-\xi}-\xi}}{e^{e^{-\xi}}-1}\rightarrow 1. Next, we denote the maximal norm at iteration tt as

ρt=max𝒗𝒁(t)𝑴(t)𝒗.\rho_{t}=\max_{\boldsymbol{v}\in\boldsymbol{Z}^{(t)}\cup\boldsymbol{M}^{(t)}}\|\boldsymbol{v}\|.

Since p_{min}\rightarrow\infty, we have \rho_{t}\rightarrow\infty. Then we divide both sides of (15) by \rho_{t} to obtain

|Φ1(log(CELoss(𝒁,𝑴)))ρtpminρt|0,t,\displaystyle\Bigg{|}\frac{\Phi_{1}\left(-\log(\text{CELoss}(\boldsymbol{Z},\boldsymbol{M}))\right)}{\rho_{t}}-\frac{p_{min}}{\rho_{t}}\Bigg{|}\rightarrow 0,t\rightarrow\infty,

therefore

Φ1(log(CELoss(𝒁,𝑴)))ρtpminρt,t.\begin{aligned} \frac{\Phi_{1}\left(-\log(\text{CELoss}(\boldsymbol{Z},\boldsymbol{M}))\right)}{\rho_{t}}\rightarrow\frac{p_{min}}{\rho_{t}},t\rightarrow\infty\end{aligned}.

So

min𝑴,𝒁CELoss(𝒁,𝑴)\displaystyle\min_{\boldsymbol{M},\boldsymbol{Z}}\text{CELoss}(\boldsymbol{Z},\boldsymbol{M})\Leftrightarrow minρ>0minMi,𝒛y,iρCELoss(𝒁,𝑴)\displaystyle\min_{\rho>0}\min_{\|M_{i}\|,\|\boldsymbol{z}_{y,i}\|\leq\rho}\text{CELoss}(\boldsymbol{Z},\boldsymbol{M})
\displaystyle\Leftrightarrow maxρ>0maxMi,𝒛y,iρΦ1(log(CELoss(𝒁,𝑴)))\displaystyle\max_{\rho>0}\max_{\|M_{i}\|,\|\boldsymbol{z}_{y,i}\|\leq\rho}\Phi_{1}\left(-\log(\text{CELoss}(\boldsymbol{Z},\boldsymbol{M}))\right)
\displaystyle\Leftrightarrow maxρ>0maxMi,𝒛y,iρρΦ1(log(CELoss(𝒁,𝑴)))ρ\displaystyle\max_{\rho>0}\max_{\|M_{i}\|,\|\boldsymbol{z}_{y,i}\|\leq\rho}\rho\frac{\Phi_{1}\left(-\log(\text{CELoss}(\boldsymbol{Z},\boldsymbol{M}))\right)}{\rho}
\displaystyle\Leftrightarrow maxρ>0maxMi,𝒛y,iρρpminρ\displaystyle\max_{\rho>0}\max_{\|M_{i}\|,\|\boldsymbol{z}_{y,i}\|\leq\rho}\rho\frac{p_{min}}{\rho}
\displaystyle\Leftrightarrow maxρ>0maxMi,𝒛y,iρminyymini[N/C]MyMy,𝒛y,i\displaystyle\max_{\rho>0}\max_{\|M_{i}\|,\|\boldsymbol{z}_{y,i}\|\leq\rho}\min_{y\neq y^{\prime}}\min_{i\in[N/C]}\langle M_{y}-M_{y^{\prime}},\boldsymbol{z}_{y,i}\rangle

The \Leftrightarrow symbol in the above equations means that the solutions of these optimization problems converge to the same point. ∎
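The analytic facts used in this proof, namely that \Phi_{a} inverts \phi_{a}, that \phi_{1}(p)-\phi_{C-1}(p)\rightarrow\log(C-1) and that \Phi_{1}^{\prime}(\xi)\rightarrow 1, can be checked numerically; the following sketch is our own illustration (log1p and expm1 keep the evaluation stable for large arguments):

import math

C = 10

def phi(a, p):       # phi_a(p) = -log log(1 + a e^{-p})
    return -math.log(math.log1p(a * math.exp(-p)))

def Phi1(q):         # Phi_1(q) = -log(e^{e^{-q}} - 1), the inverse of phi_1
    return -math.log(math.expm1(math.exp(-q)))

def dPhi1(q):        # derivative of Phi_1
    return math.exp(math.exp(-q) - q) / math.expm1(math.exp(-q))

for p in [5.0, 10.0, 20.0, 30.0]:
    print(f"p={p:5.1f}  Phi1(phi_1(p))={Phi1(phi(1, p)):8.4f}  "
          f"phi_1(p)-phi_(C-1)(p)={phi(1, p) - phi(C - 1, p):.4f} (log(C-1)={math.log(C - 1):.4f})  "
          f"Phi1'(p)={dPhi1(p):.6f}")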

Appendix D Proof of Theorem.3.8

Theorem D.1 (Theorem.5 of Kakade et al. [2008]: Margin Bound).

Consider a data space \mathcal{X} and a probability measure \mathcal{P} on it. There is a dataset \{x_{i}\}_{i=1}^{n} that contains n samples drawn i.i.d. from \mathcal{P}. Consider an arbitrary function class \mathcal{F} such that \forall f\in\mathcal{F} we have \sup_{x\in\mathcal{X}}|f(x)|\leq K. Then, with probability at least 1-\delta over the sample, for all margins \gamma>0 and all f\in\mathcal{F} we have

x(f(x)0)i=1n𝕀(f(xi)γ)n+n()γ+log(log24Kγ)n+log(1/δ)2n\displaystyle\mathbb{P}_{x}(f(x)\leq 0)\leq\sum_{i=1}^{n}\frac{\mathbb{I}(f(x_{i})\leq\gamma)}{n}+\frac{\mathfrak{R}_{n}(\mathcal{F})}{\gamma}+\sqrt{\frac{\log(\log_{2}\frac{4K}{\gamma})}{n}}+\sqrt{\frac{\log(1/\delta)}{2n}}
Theorem.3.8 (Multiclass Margin Bound).

Consider a dataset S with C classes. For any classifier (\boldsymbol{M},f(\cdot;\boldsymbol{w})), we denote its margin between classes i and j as (M_{i}-M_{j})^{T}f(\cdot;\boldsymbol{w}). Suppose the function space of the margins is \mathcal{F}=\{(M_{i}-M_{j})^{T}f(\cdot;\boldsymbol{w})\,|\,\forall i\neq j,\forall\boldsymbol{M},\boldsymbol{w}\}, whose upper bound is

supijsup𝑴,𝒘supxi|(MiMj)Tf(𝒙;𝒘)|K.\displaystyle\sup_{i\neq j}\sup_{\boldsymbol{M},\boldsymbol{w}}\sup_{x\in\mathcal{M}_{i}}\left|(M_{i}-M_{j})^{T}f(\boldsymbol{x};\boldsymbol{w})\right|\leq K.

Then, for any classifier (𝐌,f(;𝐰))(\boldsymbol{M},f(\cdot;\boldsymbol{w})) and margins {γi,j}ij(γi,j>0)\{\gamma_{i,j}\}_{i\neq j}(\gamma_{i,j}>0), the following inequality holds with probability at least 1δ1-\delta

\mathbb{P}_{x,y}\Big{(}\underset{c}{\arg\max}[Mf(\boldsymbol{x};\boldsymbol{w})]_{c}\neq y\Big{)}\leq\sum_{i=1}^{C}p(i)\sum_{j\neq i}\frac{\mathfrak{R}_{N_{i}}(\mathcal{F})}{\gamma_{i,j}}+\sum_{i=1}^{C}p(i)\sum_{j\neq i}\sqrt{\frac{\log(\log_{2}\frac{4K}{\gamma_{i,j}})}{N_{i}}}+\ \text{empirical risk term}\ +\ \text{probability term}

where

empirical risk term =i=1Cp(i)jixSi𝕀((MiMj)Tf(x)γi,j)Ni,\displaystyle=\sum_{i=1}^{C}p(i)\sum_{j\neq i}\sum_{x\in S_{i}}\frac{\mathbb{I}((M_{i}-M_{j})^{T}f(x)\leq\gamma_{i,j})}{N_{i}},
probability term =i=1Cp(i)jilog(C(C1)/δ)2Ni.\displaystyle=\sum_{i=1}^{C}p(i)\sum_{j\neq i}\sqrt{\frac{\log(C(C-1)/\delta)}{2N_{i}}}.

Ni()\mathfrak{R}_{N_{i}}(\mathcal{F}) is the Rademacher complexity Kakade et al. [2008], Bartlett and Mendelson [2002] of function space \mathcal{F}.

Proof.

First, we decompose the error probability into class-conditional errors using the law of total probability:

𝒙,y(argmax𝑐[𝑴f(𝒙;𝒘)]cy)=i=1Cp(i)𝒙|y=i(argmax𝑐[𝑴f(𝒙;𝒘)]cy)\displaystyle\mathbb{P}_{\boldsymbol{x},y}\Big{(}\underset{c}{\arg\max}[\boldsymbol{M}f(\boldsymbol{x};\boldsymbol{w})]_{c}\neq y\Big{)}=\sum_{i=1}^{C}p(i)\mathbb{P}_{\boldsymbol{x}|y=i}\Big{(}\underset{c}{\arg\max}[\boldsymbol{M}f(\boldsymbol{x};\boldsymbol{w})]_{c}\neq y\Big{)} (16)

where p(i) is the prior probability of the i-th class. Then, we focus on the error within each class i:

𝒙|y=i(argmax𝑐[𝑴f(𝒙;𝒘)]cy)=𝒙|y=i(ji{(MiMj)Tf(𝒙;𝒘)<0})\displaystyle\mathbb{P}_{\boldsymbol{x}|y=i}\Big{(}\underset{c}{\arg\max}[\boldsymbol{M}f(\boldsymbol{x};\boldsymbol{w})]_{c}\neq y\Big{)}=\mathbb{P}_{\boldsymbol{x}|y=i}\Bigg{(}\bigcup_{j\neq i}\{(M_{i}-M_{j})^{T}f(\boldsymbol{x};\boldsymbol{w})<0\}\Bigg{)}

According to union bound, we have

𝒙|y=i(argmax𝑐[𝑴f(𝒙;𝒘)]cy)ji𝒙|y=i((MiMj)Tf(𝒙;𝒘)<0)\displaystyle\mathbb{P}_{\boldsymbol{x}|y=i}\Big{(}\underset{c}{\arg\max}[\boldsymbol{M}f(\boldsymbol{x};\boldsymbol{w})]_{c}\neq y\Big{)}\leq\sum_{j\neq i}\mathbb{P}_{\boldsymbol{x}|y=i}\Big{(}(M_{i}-M_{j})^{T}f(\boldsymbol{x};\boldsymbol{w})<0\Big{)}

Recall our assumption of function class:

supijsup𝑴,𝒘sup𝒙i|(MiMj)Tf(𝒙;𝒘)|K.\sup_{i\neq j}\sup_{\boldsymbol{M},\boldsymbol{w}}\sup_{\boldsymbol{x}\in\mathcal{M}_{i}}|(M_{i}-M_{j})^{T}f(\boldsymbol{x};\boldsymbol{w})|\leq K.

Then, following from the Margin Bound (Theorem.D.1), we have

\mathbb{P}_{\boldsymbol{x},y}\Big{(}\underset{c}{\arg\max}[Mf(\boldsymbol{x};\boldsymbol{w})]_{c}\neq y\Big{)}\leq\sum_{i=1}^{C}p(i)\sum_{j\neq i}\mathbb{P}_{x|y=i}\Big{(}(M_{i}-M_{j})^{T}f(\boldsymbol{x};\boldsymbol{w})<0\Big{)}
\leq\sum_{i=1}^{C}p(i)\sum_{j\neq i}\frac{\mathfrak{R}_{N_{i}}(\mathcal{F})}{\gamma_{i,j}}+\sum_{i=1}^{C}p(i)\sum_{j\neq i}\sqrt{\frac{\log(\log_{2}\frac{4K}{\gamma_{i,j}})}{N_{i}}}+
\underbrace{\sum_{i=1}^{C}p(i)\sum_{j\neq i}\sqrt{\frac{\log(1/\delta)}{2N_{i}}}}_{\text{probability term}}+\underbrace{\sum_{i=1}^{C}p(i)\sum_{j\neq i}\sum_{\boldsymbol{x}\in S_{i}}\frac{\mathbb{I}((M_{i}-M_{j})^{T}f(\boldsymbol{x})\leq\gamma_{i,j})}{N_{i}}}_{\text{empirical risk term}}

with probability at least 1-C(C-1)\delta. Then, we perform the following replacement to derive the final result:

δδC(C1)\displaystyle\delta\leftarrow\frac{\delta}{C(C-1)}

Appendix E Proof of Theorem.4.2

We first introduce the definition of Covering Number.

Definition E.1 (Covering Number Kulkarni and Posner [1995]).

Given ϵ>0\epsilon>0 and 𝒙D\boldsymbol{x}\in\mathbb{R}^{D}, the open ball of radius ϵ\epsilon around 𝒙\boldsymbol{x} is denoted as

Bϵ(𝒙)={𝒖D,𝒖𝒙<ϵ}.\begin{aligned} B_{\epsilon}(\boldsymbol{x})=\{\boldsymbol{u}\in\mathbb{R}^{D},\|\boldsymbol{u}-\boldsymbol{x}\|<\epsilon\}\end{aligned}.

Then the covering number 𝒩(ϵ,A)\mathcal{N}(\epsilon,A) of a set ADA\subset\mathbb{R}^{D} is defined as the smallest number of open balls whose union contains AA:

\mathcal{N}(\epsilon,A)=\inf\left\{k:\exists\boldsymbol{u}_{1},\dots,\boldsymbol{u}_{k}\in\mathbb{R}^{D},\ s.t.\ A\subseteq\bigcup_{i=1}^{k}B_{\epsilon}(\boldsymbol{u}_{i})\right\}
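For intuition, a covering number can be upper-bounded on a finite point cloud by a simple greedy construction; the sketch below is our own rough illustration on points sampled from the unit square, not an exact computation of \mathcal{N}(\epsilon,A):

import torch

torch.manual_seed(0)
points = torch.rand(5000, 2)       # a point cloud standing in for the set A
eps = 0.1

centers = []
uncovered = torch.ones(len(points), dtype=torch.bool)
while uncovered.any():
    idx = torch.nonzero(uncovered)[0, 0]                    # pick any uncovered point as a center
    centers.append(points[idx])
    uncovered &= (points - points[idx]).norm(dim=1) >= eps  # mark everything it covers
print("greedy cover size (an upper bound on the covering number of the cloud):", len(centers))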

The following conclusion is demonstrated in the proof of Theorem.1 of Kulkarni and Posner [1995]. We use it to prove our theorem.

Theorem E.2 (Vural and Guillemot [2017], Kulkarni and Posner [1995]).

There are N samples \{x_{1},\dots,x_{N}\} drawn i.i.d. from the probability measure \mathcal{P}. Suppose the bounded support of \mathcal{P} is \mathcal{M}. Then, if N is larger than the covering number \mathcal{N}(\epsilon,\mathcal{M}), we have

x(xx^>ϵ)𝒩(ϵ,)2N,ϵ>0\displaystyle\mathbb{P}_{x}\Big{(}\|x-\hat{x}\|>\epsilon\Big{)}\leq\frac{\mathcal{N}(\epsilon,\mathcal{M})}{2N},\forall\epsilon>0

where x^\hat{x} is the sample that is closest to xx in {x1,,xN}\{x_{1},\dots,x_{N}\}:

x^argminx{x1,,xN}xx\displaystyle\hat{x}\in\underset{x^{\prime}\in\{x_{1},\dots,x_{N}\}}{\arg\min}\|x^{\prime}-x\|

Then we provide the proof of Theorem.4.2.

See 4.2

Proof.

We decompose the accuracy:

\mathbb{P}_{\boldsymbol{x},y}\Big{(}\underset{c}{\arg\max}[Mf(\boldsymbol{x};\boldsymbol{w})]_{c}=y\Big{)}=\sum_{i=1}^{C}p(i)\mathbb{P}_{\boldsymbol{x}|y=i}\Big{(}\underset{c}{\arg\max}[Mf(\boldsymbol{x};\boldsymbol{w})]_{c}=y\Big{)} \qquad (17)

where p(i) is the prior probability of class i. Then, we focus on the accuracy within each class i:

\mathbb{P}_{\boldsymbol{x}|y=i}\Big{(}\underset{c}{\arg\max}[Mf(\boldsymbol{x};\boldsymbol{w})]_{c}=y\Big{)}=\mathbb{P}_{x|y=i}\Big{(}\{(M_{i}-M_{j})^{T}f(\boldsymbol{x};\boldsymbol{w})>0\ \ \text{for all}\ \ j\neq i\}\Big{)}

We select the sample in the class-i training set S_{i} that is closest to \boldsymbol{x}, and denote it as

𝒙^(Si)=argminx1Six1x\hat{\boldsymbol{x}}(S_{i})=\underset{x_{1}\in S_{i}}{\arg\min}\|x_{1}-x\|

According to the linear separability,

(MiMj)Tf(𝒙^(Si);𝒘)γi,j,ji\displaystyle(M_{i}-M_{j})^{T}f(\hat{\boldsymbol{x}}(S_{i});\boldsymbol{w})\geq\gamma_{i,j},\forall j\neq i

For any jij\neq i, we have

(MiMj)Tf(𝒙;𝒘)\displaystyle(M_{i}-M_{j})^{T}f(\boldsymbol{x};\boldsymbol{w}) =(MiMj)T(f(𝒙;𝒘)+f(𝒙^(Si);w)f(𝒙^(Si);w))\displaystyle=(M_{i}-M_{j})^{T}(f(\boldsymbol{x};\boldsymbol{w})+f(\hat{\boldsymbol{x}}(S_{i});w)-f(\hat{\boldsymbol{x}}(S_{i});w)) (18)
=(MiMj)Tf(𝒙^(Si);w)+(MiMj)T(f(𝒙;𝒘)f(𝒙^(Si);w))\displaystyle=(M_{i}-M_{j})^{T}f(\hat{\boldsymbol{x}}(S_{i});w)+(M_{i}-M_{j})^{T}(f(\boldsymbol{x};\boldsymbol{w})-f(\hat{\boldsymbol{x}}(S_{i});w))
γi,jMiMjf(𝒙;𝒘)f(𝒙^(Si);w)\displaystyle\geq\gamma_{i,j}-\|M_{i}-M_{j}\|\|f(\boldsymbol{x};\boldsymbol{w})-f(\hat{\boldsymbol{x}}(S_{i});w)\|
γi,jLMiMj𝒙𝒙^(Si)\displaystyle\geq\gamma_{i,j}-L\|M_{i}-M_{j}\|\|\boldsymbol{x}-\hat{\boldsymbol{x}}(S_{i})\|

The prediction result is related to the distance between 𝒙\boldsymbol{x} and 𝒙^(Si)\hat{\boldsymbol{x}}(S_{i}). According to Theorem.E.2, we know

𝒙|y=i(𝒙𝒙^(Si)>ϵ)𝒩(ϵ,i)2Ni\displaystyle\mathbb{P}_{\boldsymbol{x}|y=i}\Big{(}\|\boldsymbol{x}-\hat{\boldsymbol{x}}(S_{i})\|>\epsilon\Big{)}\leq\frac{\mathcal{N}(\epsilon,\mathcal{M}_{i})}{2N_{i}}

To obtain the correct prediction, i.e., to ensure that (18)>0 for all j\neq i, we choose \epsilon<\min_{j\neq i}\frac{\gamma_{i,j}}{L\|M_{i}-M_{j}\|}. Therefore, we have

𝒙|y=i({(MiMj)Tf(𝒙;𝒘)>0,ji})\displaystyle\mathbb{P}_{\boldsymbol{x}|y=i}\Bigg{(}\left\{(M_{i}-M_{j})^{T}f(\boldsymbol{x};\boldsymbol{w})>0,\forall j\neq i\right\}\Bigg{)} 𝒙|y=i(𝒙𝒙^(Si)<minjiγi,jLMiMj)\displaystyle\geq\mathbb{P}_{\boldsymbol{x}|y=i}\Bigg{(}\|\boldsymbol{x}-\hat{\boldsymbol{x}}(S_{i})\|<\min_{j\neq i}\frac{\gamma_{i,j}}{L\|M_{i}-M_{j}\|}\Bigg{)} (19)
>1𝒩(minjiγi,jLMiMj,i)2Ni\displaystyle>1-\frac{\mathcal{N}(\min_{j\neq i}\frac{\gamma_{i,j}}{L\|M_{i}-M_{j}\|},\mathcal{M}_{i})}{2N_{i}}

Plug (19) into (17) to derive

\mathbb{P}_{\boldsymbol{x},y}\Big{(}\underset{c}{\arg\max}[Mf(\boldsymbol{x};\boldsymbol{w})]_{c}=y\Big{)}>1-\sum_{i=1}^{C}p(i)\frac{\mathcal{N}(\min_{j\neq i}\frac{\gamma_{i,j}}{L\|M_{i}-M_{j}\|},\mathcal{M}_{i})}{2N_{i}}=1-\sum_{i=1}^{C}p(i)\frac{\max_{j\neq i}\mathcal{N}(\frac{\gamma_{i,j}}{L\|M_{i}-M_{j}\|},\mathcal{M}_{i})}{2N_{i}}

Then, using the conclusions of NeurCol, if the maximal norm is denoted as \rho, we have

γi,jLMiMj=ρ2MiTMjL2ρ22MiTMj=1Lρ2MiTMj2\displaystyle\frac{\gamma_{i,j}}{L\|M_{i}-M_{j}\|}=\frac{\rho^{2}-M_{i}^{T}M_{j}}{L\sqrt{2\rho^{2}-2M_{i}^{T}M_{j}}}=\frac{1}{L}\sqrt{\frac{\rho^{2}-M_{i}^{T}M_{j}}{2}}
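The last identity only uses \|M_{i}\|=\|M_{j}\|=\rho, so that \|M_{i}-M_{j}\|=\sqrt{2\rho^{2}-2M_{i}^{T}M_{j}}; a quick numerical check (our own illustration, with random directions scaled to norm \rho):

import torch

torch.manual_seed(0)
d, rho, L = 8, 3.0, 2.0
Mi = torch.randn(d)
Mi = rho * Mi / Mi.norm()
Mj = torch.randn(d)
Mj = rho * Mj / Mj.norm()

gamma_ij = rho ** 2 - Mi @ Mj                # margin under NeurCol: <M_i - M_j, M_i> = rho^2 - M_i^T M_j
lhs = gamma_ij / (L * (Mi - Mj).norm())
rhs = (1 / L) * ((rho ** 2 - Mi @ Mj) / 2) ** 0.5
print(lhs.item(), rhs.item())                # the two values coincide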

Appendix F Some Details of Experiments

F.1 Training Detail

In the experiments of Section.4.2, we follow the practice of Papyan et al. [2020]. For all experiments, we minimize the cross entropy loss using stochastic gradient descent for 200 epochs, with momentum 0.9, batch size 256 and weight decay 5\times 10^{-4}. The learning rate is set to 5\times 10^{-2} and annealed ten-fold at epochs 120 and 160 for every dataset. As for data preprocessing, we only perform standard pixel-wise mean subtraction and standard-deviation division on images. To achieve 100\% accuracy on the training set, we only use random flip augmentation.
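For reference, a minimal sketch of this training recipe in PyTorch (our own reconstruction; model, train_set and the normalization statistics are placeholders, not part of the original description):

import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import transforms

# Random flip plus standard per-channel normalization; the mean/std values are dataset-dependent placeholders.
transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])

def train(model, train_set, epochs=200):
    loader = DataLoader(train_set, batch_size=256, shuffle=True)
    optimizer = torch.optim.SGD(model.parameters(), lr=5e-2,
                                momentum=0.9, weight_decay=5e-4)
    # Ten-fold learning rate annealing at epochs 120 and 160.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[120, 160], gamma=0.1)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            criterion(model(images), labels).backward()
            optimizer.step()
        scheduler.step()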

F.2 Generation of Equivalent Grassmannian Frame

We use Code.1 to generate equivalent Grassmannian Frames with different directions and orders. Note that the generation of the rotation matrix uses the method of Lezcano-Casado [2019], Lezcano-Casado and Martínez-Rubio [2019].

import torch

SEED = 1000
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

def generate_permutation_matrix(dimension):
    # Shuffle the rows of the identity matrix to obtain a permutation matrix.
    return torch.eye(dimension)[torch.randperm(dimension), :]

def generate_rotation_matrix(dimension):
    # Matrix exponential of a skew-symmetric matrix gives an orthogonal (rotation) matrix.
    A = torch.randn(dimension, dimension)
    return torch.linalg.matrix_exp(A - A.T)

def generate_grass_frame(class_num, feature_num):
    # Features and classifiers are plain linear layers; only their weight matrices are used.
    feature = torch.nn.Linear(class_num, feature_num)
    classifier = torch.nn.Linear(feature_num, class_num)
    labels = torch.arange(class_num)  # torch.range is deprecated, use arange
    optimizer = torch.optim.SGD([
        {"params": feature.parameters(), "lr": 0.1},
        {"params": classifier.parameters(), "lr": 0.1}
    ])
    for i in range(1000):
        # Logits of every class feature under the current linear classifier.
        pred = torch.mm(classifier.weight, feature.weight)
        loss_ce = torch.nn.functional.cross_entropy(input=pred.T, target=labels)
        # The l2 penalty keeps the norms of features and classifiers bounded.
        loss_l2 = 1e-1 * (torch.norm(classifier.weight) + torch.norm(feature.weight))
        loss = loss_l2 + loss_ce
        feature.zero_grad()
        classifier.zero_grad()
        loss.backward()
        optimizer.step()
        print("index: {} loss: {}".format(i, loss.item()))

    print(feature.weight.shape)  # [feature_num, class_num]
    return feature.weight.detach()

if __name__ == "__main__":
    class_num = 4
    feature_num = 2
    grass_frame = generate_grass_frame(class_num, feature_num)
    permutation = generate_permutation_matrix(class_num)
    rotation = generate_rotation_matrix(feature_num)
    # An equivalent frame: rotated in feature space and with permuted class order.
    grass_frame = rotation @ grass_frame @ permutation
    torch.save(grass_frame, "save_path")
Code 1: Generation of Equivalent Grassmannian Frame
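As a follow-up check (our own addition, with a random stand-in frame), one can verify what makes the transformed frame equivalent: a rotation leaves the Gram matrix of pairwise inner products unchanged, while a permutation only reorders its rows and columns.

import torch

torch.manual_seed(0)
feature_num, class_num = 2, 4
frame = torch.randn(feature_num, class_num)                       # stand-in for a learned frame
A = torch.randn(feature_num, feature_num)
rotation = torch.linalg.matrix_exp(A - A.T)                       # orthogonal rotation matrix
permutation = torch.eye(class_num)[torch.randperm(class_num), :]  # permutation matrix

def gram(F):
    return F.T @ F                                                # pairwise inner products

# Rotation alone preserves the Gram matrix exactly ...
print(torch.allclose(gram(rotation @ frame), gram(frame), atol=1e-5))
# ... and a permutation P only conjugates it: Gram(R F P) = P^T Gram(F) P.
print(torch.allclose(gram(rotation @ frame @ permutation),
                     permutation.T @ gram(frame) @ permutation, atol=1e-5))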

Appendix G Numerical Simulation and Visualization of Generalized Neural Collapse

Figure.2 shows the results of a numerical simulation experiment conducted to illustrate the convergence of Generalized Neural Collapse. A GIF version of Figure.2 can be found HERE. During the simulation, we discovered that the condition ρ\rho\rightarrow\infty, which is believed to be necessary for the occurrence of Grassmannian Frame (in the NC3 of Theorem.3.2), may not be required. This suggests that there may be other ways to prove Generalized Neural Collapse with fewer assumptions.

Figure 2: Visualization of Generalized Neural Collapse. There are 4 classes, and the feature space is 2-dimensional. Every class has 20 samples. In the figures, the points with different colors represent the features of samples from different classes, and the lines indicate the linear classifier.