
A Study of Neural Collapse Phenomenon: Grassmannian Frame, Symmetry and Generalization

Peifeng Gao 1, Qianqian Xu 2, Peisong Wen 1, 2,
Huiyang Shao 1, 2, Zhiyong Yang 1, Qingming Huang 1, 2, 3, 4
1 School of Computer Science and Tech., University of Chinese Academy of Sciences
2 Key Lab of Intell. Info. Process., Inst. of Comput. Tech., CAS
3 BDKM, University of Chinese Academy of Sciences
4 Peng Cheng Laboratory
Corresponding author.
Abstract

In this paper, we extend the original Neural Collapse phenomenon by proving the Generalized Neural Collapse hypothesis. We derive the Grassmannian Frame structure from the optimization and generalization of classification. This structure maximally separates the features of every two classes on a sphere and does not require the feature dimension to exceed the number of classes. Out of curiosity about the symmetry of the Grassmannian Frame, we conduct experiments to explore whether models with different Grassmannian Frames have different performance. As a result, we discover the Symmetric Generalization phenomenon. We provide a theorem explaining the Symmetric Generalization of permutation. However, the question of why different directions of features lead to such different generalization remains open for future investigation.

1 Introduction

Consider classification problems. Researchers with good statistical and linear-algebra training may believe that the features learned by deep neural networks, which are flowing tensors in deep models, are very different and vary randomly depending on the classification task. However, Papyan et al. (2020) discovered a phenomenon called Neural Collapse (NeurCol) that challenges this expectation. NeurCol arises in the standard training paradigm for classification: minimizing the cross-entropy loss with stochastic gradient descent (SGD) on deep models. During the Terminal Phase of Training (TPT), NeurCol shows that the last-layer features and the linear classifier belonging to the same class collapse to one point and form a symmetrical geometric structure called the Simplex ETF. This phenomenon is beautiful due to its surprising symmetry. However, existing conclusions on NeurCol Papyan et al. (2020); Ji et al. (2022); Fang et al. (2021); Zhu et al. (2021); Zhou et al. (2022a); Yaras et al. (2022a); Han et al. (2022); Tirer and Bruna (2022); Zhou et al. (2022b); Mixon et al. (2020); Poggio and Liao (2020); Graf et al. (2021); Yang et al. (2022) are not universal enough. Specifically, they require the feature dimension to be larger than the number of classes, as the Simplex ETF only exists in this case. In the other case, where the feature dimension is smaller than the number of classes, the structure learned by deep models is still unclear.

Lu and Steinerberger (2022) and Liu et al. (2023) provide a preliminary answer to this question. Lu and Steinerberger (2022) prove that when the number of classes tends to infinity, the features of different classes distribute uniformly on the hypersphere. Further, Liu et al. (2023) propose the Generalized Neural Collapse hypothesis, which states that if the number of classes is larger than the feature dimension, the inter-class features and classifiers will be maximally distant on a hypersphere, a property they refer to as Hyperspherical Uniformity.

First Contribution Our first contribution is the theoretical confirmation of the Generalized Neural Collapse hypothesis. We derive the Grassmannian Frame from three perspectives: optimal codes in coding theory, and optimization and generalization in classification problems. As a more general version of the ETF, the Grassmannian Frame exists for any combination of class number and feature dimension. Additionally, it exhibits the minimized maximal correlation property, which is precisely the Hyperspherical Uniformity property.

The study conducted by Yang et al. (2022) is relevant to the next part of our work. They argue that deep models can learn features with any direction, and thus fixing the last layer as an ETF during training can lead to satisfactory performance. However, we disprove this argument by discovering a new phenomenon.

Second Contribution Our second contribution is the discovery of a new phenomenon called Symmetric Generalization. Our motivation for it stems from the two invariances of the Grassmannian Frame, namely, rotation invariance and permutation invariance. We observe that models that have learned the Grassmannian Frame with different rotations and permutations exhibit very different generalization performances, even if they have achieved the best performance on the training set.

2 Preliminary

2.1 Neural Collapse

Papyan et al. (2020) conducted extensive experiments to reveal the NeurCol phenomenon. This phenomenon occurs during the Terminal Phase of Training (TPT), which starts from the epoch at which the training accuracy reaches $100\%$. During TPT, the training error is zero, but the cross-entropy loss keeps decreasing. To describe this phenomenon more clearly, we first introduce several necessary notations. We denote the number of classes as $C$ and the feature dimension as $d$. Here, we consider classifiers of the form $\mathrm{logit}=\boldsymbol{M}^{T}\boldsymbol{z}=[\langle M_{1},\boldsymbol{z}\rangle,\dots,\langle M_{C},\boldsymbol{z}\rangle]^{T}$, where $\boldsymbol{M}\in\mathbb{R}^{d\times C}$ is the linear classifier and $\boldsymbol{z}$ is the feature of a sample obtained from a deep feature extractor. The classification result is given by selecting the maximum entry of the $\mathrm{logit}$. Given a balanced dataset, we denote the feature of the $i$-th sample in the $y$-th category as $\boldsymbol{z}_{y,i}$. Specifically, when the model is trained on a balanced dataset, its last layer converges to the following manifestations (a small diagnostic sketch for measuring them follows the list):

  • NC1

    Variability Collapse All samples belonging to the same class converge to their class mean: $\|\boldsymbol{z}_{y,i}-\bar{\boldsymbol{z}}_{y}\|\rightarrow 0,\ \forall y,\forall i$, where $\bar{\boldsymbol{z}}_{y}=\text{Ave}_{i}\left(\boldsymbol{z}_{y,i}\right)$ denotes the class center of the $y$-th class;

  • NC2

    Convergence to Self Duality The features and the classifier belonging to the same class converge to the same point: $\|\boldsymbol{z}_{y,i}-M_{y}\|\rightarrow 0,\ \forall y,\forall i$;

  • NC3

    Convergence to Simplex ETF The classifier weights converge to the vertices of a Simplex ETF;

  • NC4

    Nearest Classification The learned classifier behaves like a nearest class-center classifier, i.e., $\arg\max_{y}\langle M_{y},\boldsymbol{z}\rangle\rightarrow\arg\min_{y}\|\boldsymbol{z}-\bar{\boldsymbol{z}}_{y}\|$.
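For concreteness, the following minimal NumPy sketch shows how NC1, NC2, and NC4 could be measured from extracted last-layer features; the array layout and function name are our own illustrative choices, and NC3 would additionally require comparing the Gram matrix of $\boldsymbol{M}$ against the Simplex ETF structure.

```python
import numpy as np

def neurcol_diagnostics(Z, M):
    """Measure NC1, NC2 and NC4 from last-layer features and the linear classifier.
    Z: (C, n, d) features grouped by class, Z[y, i] = z_{y,i};  M: (d, C) with columns M_y."""
    C, n, d = Z.shape
    means = Z.mean(axis=1)                                        # class means z_bar_y, shape (C, d)
    nc1 = np.linalg.norm(Z - means[:, None, :], axis=-1).mean()   # NC1: within-class variability
    nc2 = np.linalg.norm(means - M.T, axis=-1).mean()             # NC2: distance to the dual classifier
    logits = Z @ M                                                # logits[y, i, c] = <M_c, z_{y,i}>
    dists = np.linalg.norm(Z[:, :, None, :] - means[None, None], axis=-1)
    nc4 = (logits.argmax(-1) == dists.argmin(-1)).mean()          # NC4: agreement with nearest class mean
    return nc1, nc2, nc4
```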

In NC3, the Simplex ETF is an interesting structure. Note that there exist two different objects: the Simplex ETF and the ETF. The ETF is rooted in Frame Theory (see the next subsection), while Papyan et al. (2020) introduced the Simplex ETF as a new definition in the context of the NeurCol phenomenon. They extended the original definition of the ETF by introducing an orthogonal projection matrix and a scale factor. Here, we provide the definition of the Simplex ETF:

Definition 2.1 (Simplex Equiangular Tight Frame Papyan et al. (2020)).

A Simplex ETF is a collection of points in $\mathbb{R}^{d}$ specified by the columns of

$$\boldsymbol{M}^{\star}=\alpha R\sqrt{\frac{C}{C-1}}\left(I-\frac{1}{C}\mathbb{I}\mathbb{I}^{T}\right)$$

where $I\in\mathbb{R}^{C\times C}$ is the identity matrix, $\mathbb{I}\in\mathbb{R}^{C}$ is the all-one vector, $R\in\mathbb{R}^{d\times C}\ (d\geq C)$ is an orthogonal projection matrix, and $\alpha\in\mathbb{R}$ is a scale factor.
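As an illustration of Definition 2.1, the following NumPy sketch constructs a Simplex ETF for a given $d\geq C$; the particular choice of $R$ (the orthonormal factor of a random Gaussian matrix) and $\alpha=1$ are our own assumptions, since any orthogonal projection and scale are allowed.

```python
import numpy as np

def simplex_etf(d, C, alpha=1.0, seed=0):
    """Columns of a Simplex ETF in R^d following Definition 2.1 (requires d >= C)."""
    assert d >= C, "a Simplex ETF needs the feature dimension to be at least the class number"
    rng = np.random.default_rng(seed)
    R, _ = np.linalg.qr(rng.standard_normal((d, C)))      # d x C matrix with orthonormal columns
    core = np.sqrt(C / (C - 1)) * (np.eye(C) - np.ones((C, C)) / C)
    return alpha * R @ core

M = simplex_etf(d=8, C=5)
G = M.T @ M                                               # Gram matrix of the C columns
off_diag = G[~np.eye(5, dtype=bool)]
print(np.allclose(off_diag, off_diag[0]))                 # True: the Equiangular property holds
```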

The Simplex ETF is a highly symmetric structure, as every pair of vectors has the same angle (the Equiangular property). However, it has a limitation: it only exists when the feature dimension $d$ is larger than the number of classes $C$. Recently, Liu et al. (2023) removed this restriction by proposing the Generalized Neural Collapse hypothesis. In their hypothesis, Hyperspherical Uniformity is introduced to generalize the Equiangular property. Hyperspherical Uniformity means that the features of every class are distributed uniformly on a hypersphere with maximal distance.

2.2 Frame Theory

We have found that the Grassmannian Frame, a mathematical object from Frame Theory, is a suitable candidate for meeting Hyperspherical Uniformity. Frame Theory is a fundamental research area in mathematics Casazza and Kutyniok (2012) that provides a framework for the analysis of signal transmission and reconstruction. In the communication field, certain frame properties have been shown to be optimal configurations in various transmission problems. For instance, Uniform and Tight frames are optimal codes in the erasure problem Casazza and Kovačević (2003) and in non-orthogonal communication schemes Ambrus et al. (2021). The Grassmannian Frame Holmes and Paulsen (2004); Strohmer and Heath Jr (2003), as a more specialized example, not only satisfies the Uniform and Tight properties but also the minimized maximal correlation property. This property makes us confident that the Grassmannian Frame satisfies Hyperspherical Uniformity.

Here, we provide a series of definitions in Frame Theory (Frame Theory has more general definitions in Hilbert space; the definitions provided here are based on Euclidean space. See Strohmer and Heath Jr (2003) for the general definitions):

Definition 2.2 (Frame).

In Euclidean space $\mathbb{R}^{d}$, a frame is a sequence of bounded vectors $\{\zeta_{i}\}_{i=1}^{C}$.

Definition 2.3 (Uniform Property and Unit Norm).

Given a frame $\{\zeta_{i}\}_{i=1}^{C}$ in $\mathbb{R}^{d}$, it is a uniform frame if the norm of every vector is equal. Further, it is a unit norm frame if the norm of every vector equals 1.

Definition 2.4 (Tight Property).

Given a frame $\{\zeta_{i}\}_{i=1}^{C}$ in $\mathbb{R}^{d}$, it is a tight frame if the rank of its analysis matrix $\left[\zeta_{1},\dots,\zeta_{C}\right]$ is $d$.

Definition 2.5 (Maximal Frame Correlation).

Given a unit norm frame $\{\zeta_{i}\}_{i=1}^{C}$, the maximal correlation is defined as $\mathcal{M}(\{\zeta_{i}\}_{i=1}^{C})=\max_{i,j,i\neq j}\{|\langle\zeta_{i},\zeta_{j}\rangle|\}$.

We can now define the Grassmannian frame:

Definition 2.6 (Grassmannian Frame).

A frame $\{\zeta_{i}\}_{i=1}^{C}$ in $\mathbb{R}^{d}$ is a Grassmannian frame if it is a solution of $\min\{\mathcal{M}(\{\zeta_{i}\}_{i=1}^{C})\}$, where the minimum is taken over all unit norm frames in $\mathbb{R}^{d}$.

We also introduce Equiangular Tight Frame (ETF).

Definition 2.7 (Equiangular Property).

Given a unit norm frame $\{\zeta_{i}\}_{i=1}^{C}$, it is an equiangular frame if $|\langle\zeta_{i},\zeta_{j}\rangle|=c,\ \forall i,j\ \text{with}\ i\neq j$, for some constant $c\geq 0$.

Definition 2.8 (Equiangular Tight Frame).

An ETF is a frame that is both Equiangular and Tight.

The following theorem relates ETF and Grassmannian Frame:

Theorem 2.9 (Welch Bound Welch (1974)).

Given any unit norm frame $\{\zeta_{i}\}_{i=1}^{C}$ in $\mathbb{R}^{d}$, we have

$$\mathcal{M}(\{\zeta_{i}\}_{i=1}^{C})\geq\sqrt{\frac{C-d}{d(C-1)}}.$$

Further, $\mathcal{M}(\{\zeta_{i}\}_{i=1}^{C})$ reaches the right-hand side if and only if $\{\zeta_{i}\}_{i=1}^{C}$ is an Equiangular Tight Frame, which can exist only if $C\leq\frac{d(d+1)}{2}$.

This theorem shows that the ETF is a special case of the Grassmannian Frame and tells us when a Grassmannian frame $\{\zeta_{i}\}_{i=1}^{C}$ can be Equiangular: if and only if it achieves the Welch Bound. Intuitively, if $d$ is large enough, the correlation between every two vectors in $\{\zeta_{i}\}_{i=1}^{C}$ can be minimized equally so as to achieve the Equiangular property. Otherwise, the Equiangular property cannot be guaranteed.
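As a small numerical illustration of Theorem 2.9 (our own example, with $d=2$ and $C=4$ so that $C>\frac{d(d+1)}{2}$), the sketch below computes the maximal frame correlation of four unit vectors spaced $45^{\circ}$ apart and compares it with the Welch Bound, which cannot be attained in this regime:

```python
import numpy as np

def max_correlation(F):
    """Maximal frame correlation: max_{i != j} |<zeta_i, zeta_j>| for unit-norm columns of F."""
    G = np.abs(F.T @ F)
    np.fill_diagonal(G, 0.0)
    return G.max()

def welch_bound(d, C):
    return np.sqrt((C - d) / (d * (C - 1)))

# Four unit vectors in R^2, spaced 45 degrees apart (C = 4 > d(d+1)/2 = 3).
angles = np.arange(4) * np.pi / 4
F = np.vstack([np.cos(angles), np.sin(angles)])   # shape (d, C) = (2, 4)

print(max_correlation(F))   # ~0.707 (cos 45 degrees)
print(welch_bound(2, 4))    # ~0.577: strictly smaller, so the bound is not attained and no ETF exists here
```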

Motivation Our motivation for considering the Grassmannian Frame as a candidate structure for the Generalized Neural Collapse hypothesis is that it satisfies two important properties: it has minimized maximal correlation (i.e., Hyperspherical Uniformity), and it exists for any number of vectors $C$ and dimension $d$. We provide theoretical support for this argument in the next section.

3 Main Results

As discussed in the previous section, minimized maximal correlation is a key characteristic of the Grassmannian frame. Therefore, in this section, all of our findings are based on this insight:

$$\min_{\|M_{y}\|=1}\max_{y\neq y^{\prime}}\langle M_{y},M_{y^{\prime}}\rangle$$

All proofs can be found in the Appendix.

3.1 Warmup: Optimal Code Perspective

In the communication field, the Grassmannian frame is not only the optimal $2$-erasure code Holmes and Paulsen (2004), but also the optimal code for the Gaussian Channel Papyan et al. (2020); Shannon (1959):

Theorem 3.1.

Consider the following communication problem: a number $c$ ($c\in[C]$) is encoded as a vector $M_{c}$ in $\mathbb{R}^{d}$, which we call a code, and then transmitted over a noisy channel. A receiver needs to recover $c$ from the noisy signal $\boldsymbol{h}=M_{c}+\boldsymbol{g}$, where $\boldsymbol{g}$ is additive noise. If $\boldsymbol{g}\sim\mathcal{N}(0,\sigma^{2}I)$, the Grassmannian Frame is the optimal code, enjoying the minimal error probability.

This theorem is essentially adopted from Corollary 4 of Papyan et al. (2020), which is the first study to identify the NeurCol phenomenon. However, they only validated this result for the Simplex ETF and did not further investigate the structure.

3.2 Optimization Perspective

We now consider classification from the optimization perspective, starting from the cross-entropy minimization problem.

Notations Denote the feature dimension of the last layer as $d$ and the number of classes as $C$. The linear classifier is $\boldsymbol{M}=[M_{1},\cdots,M_{C}]\in\mathbb{R}^{d\times C}$. Given a balanced dataset $\boldsymbol{Z}=\{\{\boldsymbol{z}_{y,i}\in\mathbb{R}^{d}\}_{i=1}^{N/C}\}_{y=1}^{C}$ where every class has $N/C$ samples, we use $\boldsymbol{z}_{y,i}$ to represent the feature of the $i$-th sample in the $y$-th class.

Optimization Objective Since modern deep models are highly overparameterized, we assume that the models have infinite capacity and can fit any dataset. Therefore, we directly optimize the sample features to simplify the analysis Yang et al. (2022); Zhu et al. (2021):

$$\min_{\boldsymbol{Z},\boldsymbol{M}}\ \text{CELoss}(\boldsymbol{M},\boldsymbol{Z}):=-\sum_{y=1}^{C}\sum_{i=1}^{N/C}\log\frac{\exp\big(\langle M_{y},\boldsymbol{z}_{y,i}\rangle\big)}{\sum_{y^{\prime}}\exp\big(\langle M_{y^{\prime}},\boldsymbol{z}_{y,i}\rangle\big)} \qquad (1)$$

As Proposition 2 of Fang et al. (2021) highlighted, NeurCol occurs only if the features and classifiers are $\ell_{2}$-norm bounded. Therefore, following previous work Lu and Steinerberger (2022); Fang et al. (2021); Yaras et al. (2022b), we introduce an $\ell_{2}$-norm constraint into (1):

$$\text{s.t.}\ \ \|M_{y}\|\leq\rho,\ \forall y\in[C]\ \ \text{and}\ \ \|\boldsymbol{z}_{y,i}\|\leq\rho,\ \forall y\in[C],\ i\in[N/C], \qquad (2)$$

where the norms of the features and the linear classifier are bounded by $\rho$. Then, to perform optimization directly, we turn to the following unconstrained feature model Zhu et al. (2021); Zhou et al. (2022a):

$$\min_{\boldsymbol{M},\boldsymbol{Z}}\ \mathcal{L}(\boldsymbol{M},\boldsymbol{Z}):=\sum_{y=1}^{C}\sum_{i=1}^{N/C}\left(-\log\frac{\exp\big(\langle M_{y},\boldsymbol{z}_{y,i}\rangle\big)}{\sum_{y^{\prime}}\exp\big(\langle M_{y^{\prime}},\boldsymbol{z}_{y,i}\rangle\big)}+\frac{\omega\|\boldsymbol{z}_{y,i}\|^{2}}{2}\right)+\sum_{y=1}^{C}\frac{\lambda\|M_{y}\|^{2}}{2} \qquad (3)$$

Compared with (1), the terms $\|M_{y}\|^{2}$ and $\|\boldsymbol{z}_{y,i}\|^{2}$ are added in (3). In deep learning, these added terms can be seen as weight decay, with $\lambda$ and $\omega$ as the weight-decay factors. They limit the norms of the sample features $\boldsymbol{Z}$ and the linear classifier $\boldsymbol{M}$; that is, for any $\lambda,\omega\in(0,\infty)$, there exists a $\rho(\lambda,\omega)$ such that $\rho(\lambda,\omega)\geq\|M_{y}\|$ and $\rho(\lambda,\omega)\geq\|\boldsymbol{z}_{y,i}\|$ when (3) converges.

Gradient Descent Ji et al. (2022) analyze the Gradient Flow of (1); in contrast, we consider Gradient Descent on (3):

$$\boldsymbol{Z}^{(t+1)}\leftarrow\boldsymbol{Z}^{(t)}-\alpha\nabla_{\boldsymbol{Z}^{(t)}}\mathcal{L}(\boldsymbol{M}^{(t)},\boldsymbol{Z}^{(t)})\ \ \text{and}\ \ \boldsymbol{M}^{(t+1)}\leftarrow\boldsymbol{M}^{(t)}-\beta\nabla_{\boldsymbol{M}^{(t)}}\mathcal{L}(\boldsymbol{M}^{(t)},\boldsymbol{Z}^{(t)}) \qquad (4)$$

where the superscript $(t)$ denotes the optimized variables at the $t$-th iteration, and $\alpha$ and $\beta$ are learning rates. Note that when $\lambda,\omega\rightarrow 0$, the optimization problem (3) is equivalent to (1), since the norm constraint (2) vanishes, i.e., $\rho\rightarrow\infty$.
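For concreteness, here is a minimal NumPy sketch of Gradient Descent (4) on the unconstrained feature model (3); the dimensions, step sizes, and weight-decay factors are illustrative choices of ours (picked so that $\lambda/\omega=\alpha/\beta=N/C$, as in Assumption C.3), not the settings of the simulation in Appendix G.

```python
import numpy as np

def grads(M, Z, lam, omega):
    """Gradients of the unconstrained feature model (3).
    M: (d, C) linear classifier; Z: (d, C, n) features with Z[:, y, i] = z_{y,i}."""
    d, C, n = Z.shape
    logits = np.einsum('dc,dyi->cyi', M, Z)                 # logits[c, y, i] = <M_c, z_{y,i}>
    logits -= logits.max(axis=0, keepdims=True)             # numerical stability
    P = np.exp(logits); P /= P.sum(axis=0, keepdims=True)   # softmax over classes c
    E = np.repeat(np.eye(C)[:, :, None], n, axis=2)         # E[c, y, i] = 1 if c == y else 0
    dM = np.einsum('dyi,cyi->dc', Z, P - E) + lam * M       # gradient w.r.t. the classifier
    dZ = np.einsum('dc,cyi->dyi', M, P - E) + omega * Z     # gradient w.r.t. the features
    return dM, dZ

rng = np.random.default_rng(0)
d, C, n = 2, 4, 16                       # feature dim, classes, N/C samples per class
omega, lam = 5e-3, 5e-3 * n              # lam / omega = N/C  (Assumption C.3)
beta, alpha = 5e-3, 5e-3 * n             # alpha / beta = N/C (Assumption C.3)

M = rng.standard_normal((d, C))
Z = rng.standard_normal((d, C, n))
for _ in range(30000):                   # Gradient Descent updates (4)
    dM, dZ = grads(M, Z, lam, omega)
    M, Z = M - beta * dM, Z - alpha * dZ

nc1 = np.linalg.norm(Z - Z.mean(axis=2, keepdims=True))     # Theorem 3.2 predicts this shrinks toward 0
nc2 = np.linalg.norm(Z - M[:, :, None])                     # Theorem 3.2 predicts this shrinks toward 0
G = M.T @ M
np.fill_diagonal(G, -np.inf)
print(nc1, nc2, G.max())                 # NC3: the maximal pairwise <M_y, M_y'> is driven down
```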

Theorem 3.2 (Generalized Neural Collapse).

Consider the convergence of Gradient Descent on the model (3). If the parameters are properly selected (refer to Assumption C.3 in the Appendix for details), we have the following conclusions:

(NC1) $\|\boldsymbol{z}_{y,i}-\boldsymbol{z}_{y,j}\|\rightarrow 0,\ \forall y\in[C],\ \forall i,j\in[N/C]$;

(NC2) $\|\boldsymbol{z}_{y,i}-M_{y}\|\rightarrow 0,\ \forall y\in[C],\ \forall i\in[N/C]$;

(NC3) if $\rho\rightarrow\infty$, $\boldsymbol{M}$ converges to the solution of $\min_{\forall y,\|M_{y}\|=\rho}\max_{y\neq y^{\prime}}\langle M_{y},M_{y^{\prime}}\rangle$;

(NC4) $\arg\max_{y}\langle M_{y},\boldsymbol{z}\rangle\rightarrow\arg\min_{y}\|\boldsymbol{z}-\bar{\boldsymbol{z}}_{y}\|,\ \forall\boldsymbol{z}\in\boldsymbol{Z}$, where $\bar{\boldsymbol{z}}_{y}=\frac{C}{N}\sum_{i=1}^{N/C}\boldsymbol{z}_{y,i}$.

Remark 3.3.

Our findings confirm the Generalized Neural Collapse hypothesis of Liu et al. (2023). (NC1), (NC2), and (NC4) are consistent with previous discoveries Papyan et al. (2020). Additionally, we have shown that (NC3) extends NeurCol's ETF and leads to minimized maximal correlation, i.e., the Grassmannian Frame.

Remark 3.4.

Our findings reflect the two objectives of NeurCol that Liu et al. (2023) highlighted: minimal intra-class variability and maximal inter-class separability. Our conclusions (NC1), (NC2), and (NC4) support the former objective, while the Grassmannian Frame resulting from (NC3) naturally coincides with the solutions of problems such as Spherical Codes Conway and Sloane (2013), the Thomson problem F.R.S. (1904), and Packing Lines in Grassmannian Spaces Conway et al. (1996), which supports the latter objective.

Remark 3.5.

(NC1) and (NC2) imply that the classifiers $\{M_{y}\}_{y=1}^{C}$ form an alternate dual frame Ambrus et al. (2021) of the features $\{\bar{\boldsymbol{z}}_{y}\}_{y=1}^{C}$. In Frame Theory, the alternate dual frame has been proved to be an optimal dual with respect to erasures for decoding Leng and Han (2011); Lopez and Han (2010).

We conduct a simulation experiment to visualize the convergence of Generalized Neural Collapse in a $2$-dimensional feature space with $4$ classes. A GIF animation can be found HERE. Please refer to Appendix G for a detailed description and visual representation of this experiment.

3.3 Generalization Perspective

Next, we consider the generalization of classification problems. While correlation measures the similarity between two vectors in Frame Theory, in classification problems the margin is a similar concept but with the opposite sense. Therefore, our analysis of generalization focuses on the margin.

Notations Consider a $C$-class classification problem. Suppose the sample space is $\mathcal{X}\times\mathcal{Y}$, where $\mathcal{X}$ is the data space and $\mathcal{Y}=\{1,\dots,C\}$ is the label space. We assume the class distribution is $\mathcal{P}_{\mathcal{Y}}=\left[p(1),\dots,p(C)\right]$, where $p(c)$ denotes the proportion of class $c$. Let the training set $S=\{(\boldsymbol{x}_{i},y_{i})\}_{i=1}^{N}$ be drawn i.i.d. from the probability $\mathcal{P}_{\mathcal{X}\times\mathcal{Y}}$. For the $y$-class samples in $S$, we denote $S_{y}=\{\boldsymbol{x}|(\boldsymbol{x},y)\in S\}$ and $|S_{y}|=N_{y}$. The classifiers take the form $\mathrm{logit}=\boldsymbol{M}^{T}f(\boldsymbol{x};\boldsymbol{w})=[\langle M_{1},f(\boldsymbol{x};\boldsymbol{w})\rangle,\dots,\langle M_{C},f(\boldsymbol{x};\boldsymbol{w})\rangle]^{T}$, where $\boldsymbol{M}\in\mathbb{R}^{d\times C}$ is the last linear layer and $f(\cdot;\boldsymbol{w})\in\mathbb{R}^{d}$ is the feature extractor parameterized by $\boldsymbol{w}$. We use the tuple $(\boldsymbol{M},f(\cdot;\boldsymbol{w}))$ to denote the classifier.

First, we give the definition of the margin:

Definition 3.6 (Linear Separability).

Given a dataset $S$ and a classifier $(\boldsymbol{M},f(\cdot;\boldsymbol{w}))$, if the classifier achieves $100\%$ accuracy on the dataset, then $\boldsymbol{M}$ must linearly separate the features of $S$: for any two classes $i,j\ (i\neq j)$, there exists a $\gamma_{i,j}>0$ such that

$$(M_{i}-M_{j})^{T}f(\boldsymbol{x};\boldsymbol{w})\geq\gamma_{i,j},\quad\forall(\boldsymbol{x},i)\in S,$$
$$(M_{i}-M_{j})^{T}f(\boldsymbol{x};\boldsymbol{w})\leq-\gamma_{i,j},\quad\forall(\boldsymbol{x},j)\in S.$$

In this case, we say the classifier $(\boldsymbol{M},f(\cdot;\boldsymbol{w}))$ linearly separates the dataset $S$ with margins $\{\gamma_{i,j}\}_{i\neq j}$.

The following lemma establishes the relationship between margin and correlation in NeurCol:

Lemma 3.7.

Given a dataset $S$ and a classifier $(\boldsymbol{M},f(\cdot;\boldsymbol{w}))$, if the classifier linearly separates the dataset with margins $\{\gamma_{i,j}\}_{i\neq j}$ and exhibits the NeurCol phenomenon on it, then we have the following conclusion:

$$\forall i,j\in[C]\ (i\neq j),\quad \gamma_{i,j}+\langle M_{i},M_{j}\rangle=\rho^{2}$$

By substituting the conclusions of NC1-NC3 into the definition of the margin, we can prove this lemma straightforwardly. It says that, given the maximal norm $\rho$, the margin $\gamma_{i,j}$ and the correlation $\langle M_{i},M_{j}\rangle$ form a pair of opposite quantities. We then propose the Multiclass Margin Bound.
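Spelling out the substitution (a sketch, treating the margin as tight at collapse): under NC1 and NC2 every feature of class $i$ coincides with $M_{i}$, and $\|M_{i}\|=\rho$, so the margin of Definition 3.6 evaluates to

$$\gamma_{i,j}=(M_{i}-M_{j})^{T}f(\boldsymbol{x};\boldsymbol{w})\ \rightarrow\ (M_{i}-M_{j})^{T}M_{i}=\|M_{i}\|^{2}-\langle M_{i},M_{j}\rangle=\rho^{2}-\langle M_{i},M_{j}\rangle,$$

which rearranges to the stated identity $\gamma_{i,j}+\langle M_{i},M_{j}\rangle=\rho^{2}$.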

Theorem 3.8 (Multiclass Margin Bound).

Consider a dataset $S$ with $C$ classes. For any classifier $(\boldsymbol{M},f(\cdot;\boldsymbol{w}))$, we denote its margin function between classes $i$ and $j$ as $(M_{i}-M_{j})^{T}f(\cdot;\boldsymbol{w})$, and suppose the function space of the margins is $\mathcal{F}=\{(M_{i}-M_{j})^{T}f(\cdot;\boldsymbol{w})\,|\,\forall i\neq j,\forall\boldsymbol{M},\boldsymbol{w}\}$, whose upper bound is

$$\sup_{i\neq j}\sup_{\boldsymbol{M},\boldsymbol{w}}\sup_{\boldsymbol{x}\in\mathcal{M}_{i}}\left|(M_{i}-M_{j})^{T}f(\boldsymbol{x};\boldsymbol{w})\right|\leq K.$$

Then, for any classifier $(\boldsymbol{M},f(\cdot;\boldsymbol{w}))$ and margins $\{\gamma_{i,j}\}_{i\neq j}\ (\gamma_{i,j}>0)$, the following inequality holds with probability at least $1-\delta$:

$$\mathbb{P}_{x,y}\Big(\max_{c}[Mf(\boldsymbol{x};\boldsymbol{w})]_{c}\neq y\Big)\lesssim \sum_{i=1}^{C}p(i)\sum_{j\neq i}\frac{\mathfrak{R}_{N_{i}}(\mathcal{F})}{\gamma_{i,j}}+\sum_{i=1}^{C}p(i)\sum_{j\neq i}\sqrt{\frac{\log(\log_{2}\frac{4K}{\gamma_{i,j}})}{N_{i}}}+L_{0,1}$$

where $\lesssim$ means we omit probability-related terms, and $L_{0,1}$ denotes the empirical risk term:

$$L_{0,1}=\sum_{i=1}^{C}p(i)\sum_{j\neq i}\sum_{x\in S_{i}}\frac{\mathbb{I}((M_{i}-M_{j})^{T}f(x)\leq\gamma_{i,j})}{N_{i}}$$

$\mathfrak{R}_{N_{i}}(\mathcal{F})$ is the Rademacher complexity Kakade et al. (2008); Bartlett and Mendelson (2002) of the function space $\mathcal{F}$. Refer to Appendix D for the full version of this theorem.

Recall that NeurCol occurs when the class distribution is uniform. We now consider this case.

Corollary 3.9.

In Theorem 3.8, assume that the class distribution and the training set are both uniform, i.e., $p(i)=\frac{1}{C}$ and $N_{i}=\frac{N}{C},\ \forall i\in[C]$. In this case, the generalization bound in Theorem 3.8 becomes

$$\mathbb{P}_{x,y}\Big(\max_{c}[Mf(\boldsymbol{x};\boldsymbol{w})]_{c}\neq y\Big)\lesssim\frac{\mathfrak{R}_{N/C}(\mathcal{F})}{C}\sum_{i=1}^{C}\sum_{j\neq i}\frac{1}{\gamma_{i,j}}+\frac{1}{\sqrt{NC}}\sum_{i=1}^{C}\sum_{j\neq i}\sqrt{\log(\log_{2}\frac{4K}{\gamma_{i,j}})}+L_{0,1}$$

Observing that both of the above margin-related terms take the form $\sum_{i\neq j}\frac{1}{\gamma_{i,j}}$, we have

$$\sum_{i=1}^{C}\sum_{j\neq i}\frac{1}{\gamma_{i,j}}\leq C(C-1)\max_{i\neq j}\frac{1}{\gamma_{i,j}}\ \ \Leftrightarrow\ \ \frac{1}{C(C-1)}\min_{\{\gamma_{i,j}\}_{i\neq j}}\sum_{i=1}^{C}\sum_{j\neq i}\frac{1}{\gamma_{i,j}}\leq\min_{\{\gamma_{i,j}\}_{i\neq j}}\max_{i\neq j}\frac{1}{\gamma_{i,j}}$$
Remark 3.10.

Once again, we observe the characteristic of minimized maximal correlation. However, this time it appears in the form of the margin and is obtained by minimizing the margin generalization error.

Remark 3.11.

The Multiclass Margin Bound can explain the steady improvement in test accuracy and adversarial robustness during TPT (as shown in Figure 8 and Table 1 of Papyan et al. (2020)). At the beginning of TPT, the accuracy on the training set reaches $100\%$ and $L_{0,1}=0$, indicating that generalization performance can no longer improve by reducing $L_{0,1}$. However, if we continue training at this point, the margins $\gamma_{i,j}$ still increase. Therefore, better robustness can be achieved by increasing the margins. Furthermore, the two margin-related terms in our bound continue to decrease, leading to better generalization performance.

Minority Collapse Fang et al. (2021) identified a related phenomenon called Minority Collapse, an imbalanced version of NeurCol. Specifically, they observed that when the training set is extremely imbalanced, the classifier vectors of the minority classes tend to become parallel. Our Multiclass Margin Bound can explain the generalization behavior of this phenomenon.

Corollary 3.12.

Consider imbalanced classification. Given a dataset $S$ with $C$ classes, the first $C_{1}$ classes $\mathcal{C}_{1}=\{1,\dots,C_{1}\}$ each contain $N_{1}$ samples, and the remaining $C-C_{1}$ minority classes $\mathcal{C}_{2}=\{C_{1}+1,\dots,C\}$ each contain $N_{2}$ samples. We denote the imbalance ratio $N_{1}/N_{2}$ as $R$. Assume the class distribution $p(i)$ is the same as that of the dataset $S$.

Then the terms related to the margins between minority classes in the Multiclass Margin Bound become:

$$\sum_{i\in\mathcal{C}_{2}}\sum_{j\in\mathcal{C}_{2}\backslash\{i\}}\frac{1}{C_{1}R+C-C_{1}}\left(\frac{\mathfrak{R}_{N_{2}}(\mathcal{F})}{\gamma_{i,j}}+\sqrt{\frac{\log\left(\log_{2}\frac{4K}{\gamma_{i,j}}\right)}{N_{2}}}\right)$$
Remark 3.13.

In these terms, $R$ and $\gamma_{i,j}\ (i,j\in\mathcal{C}_{2})$ are inversely related. This implies that, as $R\rightarrow\infty$, if the generalization bound remains constant, then $\gamma_{i,j}$ must approach $0$. Recall from Lemma 3.7 that $\gamma_{i,j}=\rho^{2}-\langle M_{i},M_{j}\rangle$, so $\|M_{i}-M_{j}\|^{2}=2\gamma_{i,j}\rightarrow 0$, which means that $\|M_{i}-M_{j}\|\rightarrow 0$.

4 Further Exploration

In this section, we uncover a new phenomenon related to NeurCol, which we refer to as Symmetric Generalization. Symmetric Generalization is linked to two transformation groups acting on the Grassmannian Frame, namely the Permutation and Rotation transformations. Briefly, Grassmannian Frames related by these two transformations can result in different generalization performances. Having observed this intriguing phenomenon in our experiments, we provide a theoretical result that partially explains it.

4.1 Motivation

First, we introduce two kinds of frame equivalence Holmes and Paulsen (2004); Bodmann and Paulsen (2005):

Definition 4.1 (Equivalent Frame).

Given two frames $\{\zeta_{i}\}_{i=1}^{C},\{\chi_{i}\}_{i=1}^{C}$ in $\mathbb{R}^{d}$, they are:

  • Type I Equivalent if there exists an orthogonal matrix $R\in\mathbb{R}^{d\times d}$ such that $[\zeta_{i}]_{i=1}^{C}=R[\chi_{i}]_{i=1}^{C}$.

  • Type II Equivalent if there exists a permutation matrix $P\in\mathbb{R}^{C\times C}$ such that $[\zeta_{i}]_{i=1}^{C}=[\chi_{i}]_{i=1}^{C}P$.

The Grassmannian Frame is a geometrically symmetrical structure, with its symmetry stemming from invariance under two transformations: rotation and permutation. Specifically, after a rotation or permutation (or a combination of the two), the frame still satisfies the minimized maximal correlation property, as only the directions and order of the frame vectors change. We are curious about how these equivalences affect the performance of models. In machine learning, we typically consider two aspects of a model: optimization and generalization. Given that the training set and the classification model (backbone and capacity) are the same, we argue that models with equivalent Grassmannian Frames will exhibit the same optimization performance. However, there is no reason to believe that these models will also have the same generalization performance. As such, we pose the following question:

Is the generalization of models invariant to symmetric transformations of the Grassmannian Frame?

To explore this question, we conduct a series of experiments. Our experimental results lead to an interesting conclusion:

The optimization performance of models is not affected by the Rotation and Permutation transformations of the Grassmannian Frame, but the generalization performance is.

This newly discovered phenomenon, which we call Symmetric Generalization, contradicts a recent argument made in Yang et al. (2022). The authors of that work claimed that since modern neural networks are often overparameterized and can learn features with any direction, fixing the linear classifier as a Simplex ETF is sufficient, and there is no need to learn it. Our findings challenge this viewpoint.

4.2 Experiments

To investigate the impact of Rotation and Permutation transformations of Grassmannian Frame on the generalization performance of deep neural networks, we conducted a series of experiments.

How to Reveal the Answer We generate $10$ Grassmannian Frames with different Rotations and Permutations. Then, we train the same network architecture $10$ times. Each time, the linear classifier is loaded from one of the pre-generated equivalent Grassmannian Frames and kept fixed during training. To ensure an identical optimization process (mini-batches, augmentation, and parameter initialization), we use the same random seed for each training run. Once the NeurCol phenomenon occurs, we know that the $10$ models have learned different Grassmannian Frames. Finally, we compare their generalization performances by evaluating the cross-entropy loss and accuracy on a test set.

Generation of Equivalent Grassmannian Frames Our Theorem 3.2 naturally offers a numerical method for generating Grassmannian Frames. Given the class number and feature dimension, we use Gradient Descent (4) on the unconstrained feature model (3) to generate a Grassmannian Frame. Then, given a Grassmannian Frame $\{M_{y}\}_{y=1}^{C}$ in $\mathbb{R}^{d}$, denoted as $\boldsymbol{M}=[M_{1},\cdots,M_{C}]\in\mathbb{R}^{d\times C}$, we can use $R\boldsymbol{M}P$ to denote its equivalent frames, where $P\in Permutation(C)$ and $R\in SO(d)$:

$$Permutation(C)=\left\{P\ \bigg|\ P\in\mathbb{R}^{C\times C},\ \forall i\in[C],\ \sum_{j=1}^{C}P_{j,i}=\sum_{j=1}^{C}P_{i,j}=1,\ \forall i,j\in[C],\ P_{i,j}\in\{0,1\}\right\}$$
$$SO(d)=\left\{R\ \big|\ R\in\mathbb{R}^{d\times d},\ R^{T}R=RR^{T}=I_{d},\ |R|=1\right\}$$

Note that $Permutation(C)$ and $SO(d)$ act on the vector order and the directions of $\{M_{y}\}_{y=1}^{C}$, respectively. Refer to Appendix F.2 for the code implementation.
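The following NumPy sketch shows one way to sample elements of $Permutation(C)$ and $SO(d)$ and apply them to a frame; it is our own illustration (the authors' implementation is in Appendix F.2), and the random matrix standing in for $\boldsymbol{M}$ would in practice be a numerically generated Grassmannian Frame.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_permutation_matrix(C):
    """A random element of Permutation(C): exactly one 1 in every row and every column."""
    P = np.zeros((C, C))
    P[rng.permutation(C), np.arange(C)] = 1.0
    return P

def random_rotation_matrix(d):
    """A random element of SO(d): orthogonalize a Gaussian matrix, then force det = +1."""
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    if np.linalg.det(Q) < 0:
        Q[:, 0] *= -1.0
    return Q

# M stands in for a d x C Grassmannian Frame generated numerically (e.g. with (3)-(4));
# here a random unit-norm frame is used only to illustrate the shapes involved.
d, C = 6, 10
M = rng.standard_normal((d, C))
M /= np.linalg.norm(M, axis=0, keepdims=True)

R, P = random_rotation_matrix(d), random_permutation_matrix(C)
M_equiv = R @ M @ P                       # Type I (rotation) and Type II (permutation) equivalence

# Equivalence conjugates the Gram matrix by P, so the maximal correlation is unchanged.
G0, G1 = np.abs(M.T @ M), np.abs(M_equiv.T @ M_equiv)
np.fill_diagonal(G0, 0.0); np.fill_diagonal(G1, 0.0)
print(np.isclose(G0.max(), G1.max()))     # True
```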

Models with Different Features Yang et al. (2022) point out in their Theorem 1 that if the linear classifier is fixed as a Simplex ETF, then the final features learned by the model converge to a Simplex ETF with the same directions as the classifier. Following their work, to make the models learn equivalent Grassmannian Frames, we initialize the linear classifier as an equivalent Grassmannian Frame and do not optimize it during training. In this way, when NeurCol occurs, the models have learned equivalent Grassmannian Frames.

Network Architecture and Dataset Our experiments involve two image classification datasets: CIFAR10/100 Krizhevsky (2009). For every dataset, we use three different convolutional neural networks to verify our finding: ResNet He et al. (2016), VGG Simonyan and Zisserman (2014), and DenseNet Huang et al. (2017). The two datasets are balanced, with 10 and 100 classes respectively, each having $5{,}000$ and $500$ training images per class. To make the number of classes larger than the feature dimension, we use $6$ and $64$ as the feature dimensions, respectively. Then, to obtain features of these dimensions for every backbone, we attach a linear layer after the end of the backbone, which transforms the feature dimension.
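A minimal PyTorch sketch of this setup (a backbone followed by a dimension-adapting linear layer and a frozen classifier loaded from a pre-generated frame) is given below; the class and argument names are ours for illustration, and the actual training configuration is in Appendix F.

```python
import torch
import torch.nn as nn

class FixedClassifierModel(nn.Module):
    """Backbone + dimension-adapting linear layer + a frozen linear classifier.
    `backbone` is any feature extractor returning (batch, backbone_dim) features;
    `frame` is a pre-generated d x C (equivalent) Grassmannian Frame as a tensor."""
    def __init__(self, backbone, backbone_dim, frame):
        super().__init__()
        d, C = frame.shape
        self.backbone = backbone
        self.proj = nn.Linear(backbone_dim, d)             # maps backbone features to dimension d
        self.classifier = nn.Linear(d, C, bias=False)
        with torch.no_grad():
            self.classifier.weight.copy_(frame.t())        # rows of the weight are the columns M_y
        self.classifier.weight.requires_grad_(False)       # fixed: never updated during training

    def forward(self, x):
        z = self.proj(self.backbone(x))
        return self.classifier(z)                          # logits [<M_1, z>, ..., <M_C, z>]

# Only the trainable parameters are passed to the optimizer; the frame stays as generated, e.g.
# optimizer = torch.optim.SGD([p for p in model.parameters() if p.requires_grad], lr=0.05)
```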

Training To make NeurCol appear, we follow the practice of Papyan et al. (2020). More details are given in Appendix F.1.

Experiment Index 0 1 2 3 4 5 6 7 8 9
Train CE 0.0019 0.0018 0.0019 0.0019 0.0019 0.0018 0.0019 0.0019 0.0019 0.0019
Val CE 0.586 0.6965 0.6871 0.61 0.6585 0.576 0.5565 0.5972 0.5892 0.6243
Train ACC 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0
Val ACC 87.96 87.96 88.25 87.94 87.32 87.58 88.39 88.32 87.98 87.73
Table 1: Performance comparison of different Permutation features on CIFAR10 and Vgg11.
Equivalence Dataset Model Std Train CE Std Val CE Range Val CE Std Train ACC Std Val ACC Range Val ACC
Permutation CIFAR10 VGG11 0.0 0.01 0.034 0.0 0.209 0.71
Resnet34 0.0 0.018 0.058 0.0 0.197 0.71
DenseNet121 0.0 0.011 0.04 0.0 0.157 0.49
CIFAR100 VGG11 0.0 0.027 0.084 0.004 0.353 1.38
Resnet34 0.0 0.042 0.134 0.005 0.468 1.37
DenseNet121 0.0 0.012 0.037 0.003 0.232 0.84
Rotation CIFAR10 VGG11 0.0 0.045 0.14 0.0 0.317 1.07
Resnet34 0.0 0.078 0.23 0.0 0.538 1.73
DenseNet121 0.0 0.059 0.187 0.0 0.18 0.63
CIFAR100 VGG11 0.0 0.034 0.099 0.0 0.455 1.52
Resnet34 0.0 0.036 0.136 0.005 0.403 1.16
DenseNet121 0.0 0.024 0.081 0.004 0.516 1.59
Table 2: More comprehensive results on different datasets and models. Std indicates the standard deviation of ten metrics and Range indicates the maximal metric minus minimal.

Results Table 1 presents results on CIFAR10 with VGG11 under different Permutations. All metrics are reported when the model converges to NeurCol ($100\%$ accuracy and almost zero loss on the training set). We observe that, even though all experiments achieved near-zero cross-entropy loss and $100\%$ accuracy on the training set, they still exhibit significant differences in test loss and accuracy. This implies that although the permutation hardly affects optimization, it has a significant impact on generalization. Table 2 provides results on different datasets, backbones, and the two transformations, revealing the same phenomenon. These experimental results answer the question we posed in Section 4.1, demonstrating that different feature orders and directions can influence the generalization performance of models.

4.3 Analysis for Permutation

We aim to theoretically explain why Grassmannian Frames with different Permutations can lead to different generalization performance. Most of the symbol definitions are adopted from Section 3.3.

Theorem 4.2.

Given a dataset $S$ and a classifier $(\boldsymbol{M},f(\cdot;\boldsymbol{w}))$, assume the classifier has already achieved NeurCol with maximal norm $\rho$. Suppose $(\boldsymbol{M},f(\cdot;\boldsymbol{w}))$ can linearly separate $S$ with margins $\{\gamma_{i,j}\}_{i\neq j}$. Besides, we make the following assumptions:

  • $f(\cdot,\boldsymbol{w})$ is $L$-Lipschitz for any $\boldsymbol{w}$, i.e., $\forall\boldsymbol{x}_{1},\boldsymbol{x}_{2}$, $\|f(\boldsymbol{x}_{1},\boldsymbol{w})-f(\boldsymbol{x}_{2},\boldsymbol{w})\|\leq L\|\boldsymbol{x}_{1}-\boldsymbol{x}_{2}\|$.

  • $S$ is large enough such that $N_{i}\geq\max_{j\neq i}\mathcal{N}(\frac{\gamma_{i,j}}{L\|M_{i}-M_{j}\|},\mathcal{M}_{i})$ for every class $i$.

  • The label distribution and the labels of $S$ are balanced, i.e., $p(i)=\frac{1}{C}$ and $N_{i}=\frac{N}{C},\ \forall i\in[C]$.

  • $S_{i}$ is drawn from the probability $\mathcal{P}_{\boldsymbol{x}|y=i}$, whose tight support is denoted as $\mathcal{M}_{i}$.

where $\mathcal{N}(\cdot,\mathcal{M}_{i})$ is the covering number of $\mathcal{M}_{i}$; refer to Appendix E for its definition. Then the expected accuracy of $(\boldsymbol{M},f(\cdot;\boldsymbol{w}))$ over the entire distribution is bounded as

$$\mathbb{P}_{\boldsymbol{x},y}\Big(\max_{c}[Mf(\boldsymbol{x};\boldsymbol{w})]_{c}=y\Big)>1-\frac{1}{2N}\sum_{i=1}^{C}\max_{j\neq i}\mathcal{N}\left(\frac{1}{L}\sqrt{\frac{\rho^{2}-M_{i}^{T}M_{j}}{2}},\mathcal{M}_{i}\right)$$
Remark 4.3.

The Permutation transformation actually changes the order of the features assigned to the classes. Given the Grassmannian Frame $\{M_{i}\}_{i=1}^{C}$, we denote its Type II Equivalent frame under permutation $\pi$ as $\{M_{\pi(i)}\}_{i=1}^{C}$. Therefore, the covering number in our theorem becomes $\mathcal{N}(\frac{1}{L}\sqrt{\frac{\rho^{2}-M_{\pi(i)}^{T}M_{\pi(j)}}{2}},\mathcal{M}_{i})$, which leads to a different accuracy bound.

4.4 Insight and Discussion

[Figure 1: The Grassmannian Frame feature alignment of four classes in $\mathbb{R}^{2}$ (Cat, Car, Dog, and Fire Truck). All images are from ImageNet Deng et al. (2009). (a) Dog is close to Cat and Fire Truck is close to Car. (b) Dog and Fire Truck from (a) are swapped.]

Permutation We provide an example to give intuition for the Symmetric Generalization of Permutation. Consider a Grassmannian Frame $\{M_{i}\}_{i=1}^{4}$ living in $\mathbb{R}^{2}$; it resembles a cross (Theorem III.1 of Benedetto and Kolesar (2006)). As shown in Figure 1(a), four classes correspond to different feature vectors $M_{i}$. Obviously, since Dog and Cat look similar to each other, they deserve a smaller margin in the feature space (i.e., to be near each other), and the same goes for the other two categories. However, if we swap the features of Fire Truck and Dog to increase the distance between dogs and cats, as shown in Figure 1(b), the semantic relationships in the feature space are disrupted. We argue that this can harm the model's training and lead to worse generalization.

Rotation The Symmetric Generalization of Rotation has been completely beyond our initial expectation. We believe that the margin, or the correlation, between features is the most effective tool for understanding the NeurCol phenomenon. However, it fails to explain why different feature directions yield different generalization. In the Deep Learning community, the Implicit Bias phenomenon is a possible way to approach this finding. Soudry et al. (2018) proved that Gradient Descent leads to the weight direction that maximizes the margin when using the logistic loss and linear models. As further progress, Lyu and Li (2020) extended this result to homogeneous neural networks. We speculate that the explanation for the Symmetric Generalization of Rotation may be hidden within the layers of neural networks. Therefore, studying the Implicit Bias of deep models layer by layer could be a promising direction for future research. This is beyond the scope of the current work, and we leave it as a topic for future work.

5 Conclusion

In this paper, we justify the Generalized Neural Collapse hypothesis by introducing the Grassmannian Frame into classification problems; this structure does not assume a specific relation between the number of classes and the feature dimension, and every two vectors in it achieve maximal distance on a sphere. In addition, aware that the Grassmannian Frame is geometrically symmetric, we pose a question: is the generalization of a model invariant to symmetric transformations of the Grassmannian Frame? To explore this question, we conduct a series of empirical and theoretical analyses and finally find the Symmetric Generalization phenomenon. This phenomenon suggests that the generalization of a model is influenced by geometrically invariant transformations of the Grassmannian Frame, including $Permutation(C)$ and $SO(d)$.

References

  • Papyan et al. [2020] Vardan Papyan, XY Han, and David L Donoho. Prevalence of neural collapse during the terminal phase of deep learning training. Proceedings of the National Academy of Sciences, 117(40):24652–24663, 2020.
  • Ji et al. [2022] Wenlong Ji, Yiping Lu, Yiliang Zhang, Zhun Deng, and Weijie J Su. An unconstrained layer-peeled perspective on neural collapse. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=WZ3yjh8coDg.
  • Fang et al. [2021] Cong Fang, Hangfeng He, Qi Long, and Weijie J Su. Exploring deep neural networks via layer-peeled model: Minority collapse in imbalanced training. Proceedings of the National Academy of Sciences, 118(43):e2103091118, 2021.
  • Zhu et al. [2021] Zhihui Zhu, Tianyu Ding, Jinxin Zhou, Xiao Li, Chong You, Jeremias Sulam, and Qing Qu. A geometric analysis of neural collapse with unconstrained features. Advances in Neural Information Processing Systems, 34:29820–29834, 2021.
  • Zhou et al. [2022a] Jinxin Zhou, Chong You, Xiao Li, Kangning Liu, Sheng Liu, Qing Qu, and Zhihui Zhu. Are all losses created equal: A neural collapse perspective. In Advances in Neural Information Processing Systems, volume 35, pages 31697–31710. Curran Associates, Inc., 2022a.
  • Yaras et al. [2022a] Can Yaras, Peng Wang, Zhihui Zhu, Laura Balzano, and Qing Qu. Neural collapse with normalized features: A geometric analysis over the riemannian manifold. In Advances in Neural Information Processing Systems, volume 35, pages 11547–11560. Curran Associates, Inc., 2022a.
  • Han et al. [2022] X.Y. Han, Vardan Papyan, and David L. Donoho. Neural collapse under MSE loss: Proximity to and dynamics on the central path. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=w1UbdvWH_R3.
  • Tirer and Bruna [2022] Tom Tirer and Joan Bruna. Extended unconstrained features model for exploring deep neural collapse. In International Conference on Machine Learning, pages 21478–21505. PMLR, 2022.
  • Zhou et al. [2022b] Jinxin Zhou, Xiao Li, Tianyu Ding, Chong You, Qing Qu, and Zhihui Zhu. On the optimization landscape of neural collapse under mse loss: Global optimality with unconstrained features. In International Conference on Machine Learning, pages 27179–27202. PMLR, 2022b.
  • Mixon et al. [2020] Dustin G Mixon, Hans Parshall, and Jianzong Pi. Neural collapse with unconstrained features. arXiv preprint arXiv:2011.11619, 2020.
  • Poggio and Liao [2020] Tomaso Poggio and Qianli Liao. Explicit regularization and implicit bias in deep network classifiers trained with the square loss. arXiv preprint arXiv:2101.00072, 2020.
  • Graf et al. [2021] Florian Graf, Christoph Hofer, Marc Niethammer, and Roland Kwitt. Dissecting supervised contrastive learning. In International Conference on Machine Learning, pages 3821–3830. PMLR, 2021.
  • Lu and Steinerberger [2022] Jianfeng Lu and Stefan Steinerberger. Neural collapse under cross-entropy loss. Applied and Computational Harmonic Analysis, 59:224–241, 2022.
  • Liu et al. [2023] Weiyang Liu, Longhui Yu, Adrian Weller, and Bernhard Schölkopf. Generalizing and decoupling neural collapse via hyperspherical uniformity gap. In The Eleventh International Conference on Learning Representations, 2023.
  • Yang et al. [2022] Yibo Yang, Shixiang Chen, Xiangtai Li, Liang Xie, Zhouchen Lin, and Dacheng Tao. Inducing neural collapse in imbalanced learning: Do we really need a learnable classifier at the end of deep neural network? In Advances in Neural Information Processing Systems, volume 35, pages 37991–38002. Curran Associates, Inc., 2022.
  • Casazza and Kutyniok [2012] Peter G Casazza and Gitta Kutyniok. Finite frames: Theory and applications. Springer, 2012.
  • Casazza and Kovačević [2003] Peter G Casazza and Jelena Kovačević. Equal-norm tight frames with erasures. Advances in Computational Mathematics, 18:387–430, 2003.
  • Ambrus et al. [2021] Gergely Ambrus, Bo Bai, and Jianfeng Hou. Uniform tight frames as optimal signals. Advances in Applied Mathematics, 129:102219, 2021.
  • Holmes and Paulsen [2004] Roderick B Holmes and Vern I Paulsen. Optimal frames for erasures. Linear Algebra and its Applications, 377:31–51, 2004.
  • Strohmer and Heath Jr [2003] Thomas Strohmer and Robert W Heath Jr. Grassmannian frames with applications to coding and communication. Applied and computational harmonic analysis, 14(3):257–275, 2003.
  • Welch [1974] Lloyd Welch. Lower bounds on the maximum cross correlation of signals (corresp.). IEEE Transactions on Information theory, 20(3):397–399, 1974.
  • Shannon [1959] Claude E Shannon. Probability of error for optimal codes in a gaussian channel. Bell System Technical Journal, 38(3):611–656, 1959.
  • Yaras et al. [2022b] Can Yaras, Peng Wang, Zhihui Zhu, Laura Balzano, and Qing Qu. Neural collapse with normalized features: A geometric analysis over the riemannian manifold. In Advances in Neural Information Processing Systems, volume 35, pages 11547–11560. Curran Associates, Inc., 2022b.
  • Conway and Sloane [2013] John Horton Conway and Neil James Alexander Sloane. Sphere packings, lattices and groups, volume 290. Springer Science & Business Media, 2013.
  • F.R.S. [1904] J.J. Thomson F.R.S. Xxiv. on the structure of the atom: an investigation of the stability and periods of oscillation of a number of corpuscles arranged at equal intervals around the circumference of a circle; with application of the results to the theory of atomic structure. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 7(39):237–265, 1904. doi: 10.1080/14786440409463107.
  • Conway et al. [1996] John H Conway, Ronald H Hardin, and Neil JA Sloane. Packing lines, planes, etc.: Packings in grassmannian spaces. Experimental mathematics, 5(2):139–159, 1996.
  • Leng and Han [2011] Jinsong Leng and Deguang Han. Optimal dual frames for erasures ii. Linear Algebra and its Applications, 435(6):1464–1472, 2011. ISSN 0024-3795. doi: https://doi.org/10.1016/j.laa.2011.03.043.
  • Lopez and Han [2010] Jerry Lopez and Deguang Han. Optimal dual frames for erasures. Linear Algebra and its Applications, 432(1):471–482, 2010. ISSN 0024-3795. doi: https://doi.org/10.1016/j.laa.2009.08.031.
  • Kakade et al. [2008] Sham M Kakade, Karthik Sridharan, and Ambuj Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in Neural Information Processing Systems, volume 21. Curran Associates, Inc., 2008.
  • Bartlett and Mendelson [2002] Peter L Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
  • Bodmann and Paulsen [2005] Bernhard G Bodmann and Vern I Paulsen. Frames, graphs and erasures. Linear algebra and its applications, 404:118–146, 2005.
  • Krizhevsky [2009] Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • Simonyan and Zisserman [2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • Huang et al. [2017] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017.
  • Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
  • Benedetto and Kolesar [2006] John Benedetto and Joseph Kolesar. Geometric properties of grassmannian frames for r2 and r3. EURASIP Journal on Advances in Signal Processing, 2006, 12 2006. doi: 10.1155/ASP/2006/49850.
  • Soudry et al. [2018] Daniel Soudry, Elad Hoffer, and Nathan Srebro. The implicit bias of gradient descent on separable data. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=r1q7n9gAb.
  • Lyu and Li [2020] Kaifeng Lyu and Jian Li. Gradient descent maximizes the margin of homogeneous neural networks. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SJeLIgBKPS.
  • Xie et al. [2022] Liang Xie, Yibo Yang, Deng Cai, and Xiaofei He. Neural collapse inspired attraction-repulsion-balanced loss for imbalanced learning. Neurocomputing, 527:60–70, 2022.
  • Yang et al. [2023] Yibo Yang, Haobo Yuan, Xiangtai Li, Zhouchen Lin, Philip Torr, and Dacheng Tao. Neural collapse inspired feature-classifier alignment for few-shot class-incremental learning. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=y5W8tpojhtJ.
  • Yang et al. [2021] Zhiyong Yang, Qianqian Xu, Shilong Bao, Xiaochun Cao, and Qingming Huang. Learning with multiclass auc: theory and algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(11):7747–7763, 2021.
  • Kulkarni and Posner [1995] Sanjeev R Kulkarni and Steven E Posner. Rates of convergence of nearest neighbor estimation under arbitrary sampling. IEEE Transactions on Information Theory, 41(4):1028–1039, 1995.
  • Vural and Guillemot [2017] Elif Vural and Christine Guillemot. A study of the classification of low-dimensional data with supervised manifold learning. The Journal of Machine Learning Research, 18(1):5741–5795, 2017.
  • Lezcano Casado [2019] Mario Lezcano Casado. Trivializations for gradient-based optimization on manifolds. Advances in Neural Information Processing Systems, 32, 2019.
  • Lezcano-Casado and Martınez-Rubio [2019] Mario Lezcano-Casado and David Martınez-Rubio. Cheap orthogonal constraints in neural networks: A simple parametrization of the orthogonal and unitary group. In International Conference on Machine Learning, pages 3794–3803. PMLR, 2019.

Appendix A Related Work

Neural Collapse (NeurCol) was first observed by Papyan et al. [2020] in 2020, and it has since sparked numerous studies investigating deep classification models.

Many of these studies Ji et al. [2022], Fang et al. [2021], Zhu et al. [2021], Zhou et al. [2022a], Yaras et al. [2022a] have focused on the optimization aspect of NeurCol, proposing various optimization models and analyzing them. For example, Zhu et al. [2021] proposed Unconstrained Feature Models and provided the first global optimization analysis of NeurCol, while Fang et al. [2021] proposed the Layer-Peeled Model, a nonconvex yet analytically tractable optimization program, to prove NeurCol and predict Minority Collapse, an imbalanced version of NeurCol. Other studies have explored NeurCol under the Mean Square Error (MSE) loss Han et al. [2022], Tirer and Bruna [2022], Zhou et al. [2022b], Mixon et al. [2020], Poggio and Liao [2020]. For instance, Han et al. [2022] and Tirer and Bruna [2022] justified NeurCol under the MSE loss. Beyond the MSE loss, Zhou et al. [2022a] extended such results and analyzed a broad family of loss functions, including the commonly used label smoothing and focal losses. Besides, the NeurCol phenomenon has also inspired the design of loss functions in imbalanced learning Xie et al. [2022] and Few-Shot Learning Yang et al. [2023].

Previous studies on NeurCol have generally assumed that the number of classes is less than the feature dimension, and this assumption was not questioned until recently. At the Eleventh International Conference on Learning Representations (ICLR 2023), Liu et al. [2023] proposed the Generalized Neural Collapse hypothesis, which removes this restriction by extending the ETF structure to Hyperspherical Uniformity. In this paper, we contribute to this area by proving the Generalized Neural Collapse hypothesis and introducing the Grassmannian Frame to better understand the NeurCol phenomenon.

Appendix B Proof of Theorem 3.1

We introduce this lemma:

Lemma B.1.

Suppose that $\boldsymbol{0}\notin\mathcal{K}$ and that $\mathcal{K}$ is a closed set. Suppose $\boldsymbol{g}\sim\mathcal{N}(\boldsymbol{0},\sigma^{2}\boldsymbol{I})$; then we have:

$$-\sigma^{2}\log\mathbb{P}_{\boldsymbol{g}}(\mathcal{K})\rightarrow\min_{\boldsymbol{g}\in\mathcal{K}}\Big\{\frac{1}{2}\|\boldsymbol{g}\|^{2}\Big\}\ \ \text{as}\ \ \sigma\rightarrow 0$$

This lemma establishes the relationship between geometry and probability. We now start the proof.

Theorem 3.1 (restated).

Proof.

First, let us think about what kind of decoding is optimal. According to Shannon [1959], since the Gaussian density is monotonically decreasing with distance, an optimal decoding scheme for the Gaussian channel is minimum distance decoding, i.e.,

$$\hat{c}=\underset{c\in\{1,\dots,C\}}{\arg\min}\|M_{c}-\boldsymbol{h}\|$$

where $\hat{c}$ is the prediction result. We first consider the two-class communication problem: only two numbers $c$ and $c^{\prime}$ are transmitted. We denote the event that a $c$ signal is recovered as $c^{\prime}$ by $\varepsilon_{c^{\prime}|c}$; then

$$\varepsilon_{c^{\prime}|c}=\{\boldsymbol{g}\in\mathbb{R}^{d}:\ \|(M_{c}+\boldsymbol{g})-M_{c}\|>\|(M_{c}+\boldsymbol{g})-M_{c^{\prime}}\|\}$$

According to Lemma.B.1, we have

$$-\sigma^{2}\log\mathbb{P}_{\boldsymbol{g}}(\varepsilon_{c^{\prime}|c})\rightarrow\min_{\boldsymbol{g}\in\varepsilon_{c^{\prime}|c}}\Big\{\frac{1}{2}\|\boldsymbol{g}\|^{2}\Big\}=\frac{1}{8}\|M_{c}-M_{c^{\prime}}\|^{2},\quad\sigma\rightarrow 0.$$

For all transmitted numbers, the error event $\varepsilon$ can be divided into the error events between every two numbers, i.e.,

$$\varepsilon=\bigcup_{c\neq c^{\prime}}\varepsilon_{c|c^{\prime}}$$

So

$$-\sigma^{2}\log\mathbb{P}_{\boldsymbol{g}}(\varepsilon)\rightarrow\frac{1}{8}\min_{c\neq c^{\prime}}\|M_{c}-M_{c^{\prime}}\|^{2},\quad\sigma\rightarrow 0.$$

To obtain the code with minimal error probability, we maximize $\min_{c\neq c^{\prime}}\|M_{c}-M_{c^{\prime}}\|^{2}$. With a norm constraint on every code vector, we have

$$\max_{\forall c,\|M_{c}\|=1}\min_{c\neq c^{\prime}}\|M_{c}-M_{c^{\prime}}\|^{2}\ \Leftrightarrow\ \min_{\forall c,\|M_{c}\|=1}\max_{c\neq c^{\prime}}\langle M_{c},M_{c^{\prime}}\rangle$$

Appendix C Proof of Theorem 3.2

In this section, we prove Theorem 3.2. Here are two lemmas that we will use.

C.1 Lemmas

Lemma C.1 (Lemma.7 of Yang et al. [2021]: Lipschitz Properties of Softmax).

Given $x\in\mathbb{R}^{C}$, the function $Softmax(x)$ is defined as

$$Softmax(x)=\left[\frac{e^{x_{1}}}{\sum_{i=1}^{C}e^{x_{i}}},\dots,\frac{e^{x_{C}}}{\sum_{i=1}^{C}e^{x_{i}}}\right]^{T},$$

then the function $Softmax(\cdot)$ is $\sqrt{\frac{C}{2}}$-Lipschitz continuous.

Lemma C.2.

Given any matrix \boldsymbol{A}\in\mathbb{R}^{n\times n} with constant a on the diagonal and constant c off the diagonal, i.e.,

\boldsymbol{A}=\left[\begin{array}{cccc}a&c&\ldots&c\\ c&a&\cdots&c\\ \vdots&\vdots&\ddots&\vdots\\ c&c&\ldots&a\\ \end{array}\right],

then we have |\boldsymbol{A}|=\left(a-c\right)^{n-1}\left(a+(n-1)c\right).

Proof.
|\boldsymbol{A}|\xlongequal{\textbf{step1}}\left|\begin{array}{cccc}a-c&0&\ldots&c-a\\ 0&a-c&\cdots&c-a\\ \vdots&\vdots&\ddots&\vdots\\ c&c&\ldots&a\\ \end{array}\right|\xlongequal{\textbf{step2}}\left|\begin{array}{cccc}a-c&0&\ldots&0\\ 0&a-c&\cdots&0\\ \vdots&\vdots&\ddots&\vdots\\ c&c&\ldots&a+(n-1)c\\ \end{array}\right|
  • step1: Subtract the last row from each of the first n-1 rows.

  • step2: Add each of the first n-1 columns to the last column.

The resulting matrix is lower triangular, so the determinant is the product of its diagonal entries, which gives (a-c)^{n-1}(a+(n-1)c). ∎
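A one-line numerical check of this determinant formula (our own illustration):

import torch

n, a, c = 6, 3.0, 0.5
A = (a - c) * torch.eye(n) + c * torch.ones(n, n)   # a on the diagonal, c elsewhere
print(torch.linalg.det(A).item(), (a - c) ** (n - 1) * (a + (n - 1) * c))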

C.2 Background

In the proof of Theorem.3.2, the main technique we use is limit analysis, together with a little linear algebra.

C.2.1 Assumption

We make the following assumption:

Assumption C.3.

In the update (4), the parameters are properly selected: \lambda/\omega=\alpha/\beta=N/C. In addition, \alpha is small enough to ensure that the system converges and that the norm of every vector in \boldsymbol{M},\boldsymbol{Z} is always bounded by \rho.

A sufficiently small \alpha is necessary and reasonable for the convergence of Gradient Descent. With this assumption, we can ensure that no variable in system (4) blows up and that the system finally converges to a stable state. The condition \lambda/\omega=\alpha/\beta=N/C makes sure that both classifiers and features are bounded by the same maximal \ell_{2} norm.

C.2.2 Symbol Regulations

We regulate our symbols for a clearer presentation. Recall that in our setting, \boldsymbol{z}_{y,i} denotes the i-th sample in class y and every class has N/C samples. Here, we put the i-th sample of every class together and denote it as \boldsymbol{Z}_{i}, i.e.,

\boldsymbol{Z}_{i}=\left[\boldsymbol{z}_{1,i},\cdots,\boldsymbol{z}_{C,i}\right]\in\mathbb{R}^{d\times C}\ \ \text{and}\ \ \boldsymbol{Z}=[\boldsymbol{Z}_{1},\cdots,\boldsymbol{Z}_{N/C}]\in\mathbb{R}^{d\times N}

Then we denote the confidence probabilities of \boldsymbol{Z}_{i} given by the classifier \boldsymbol{M} as

\boldsymbol{P}_{i}=\left[Softmax(\boldsymbol{z}_{1,i}^{T}M),\cdots,Softmax(\boldsymbol{z}_{C,i}^{T}M)\right]\in\mathbb{R}^{C\times C}

where Softmax(\cdot) transforms the logits into a probability vector; refer to Lemma.C.1 for its definition.
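In code, the notation above corresponds to the following sketch (a toy instantiation of our own with random features; only the shapes and the construction of \boldsymbol{Z}_{i} and \boldsymbol{P}_{i} matter here):

import torch

C, d, N = 4, 2, 40                 # classes, feature dimension, total samples (toy values)
M = torch.randn(d, C)              # columns are the classifiers M_1, ..., M_C
Z = torch.randn(d, N)              # columns are the features, grouped as Z = [Z_1, ..., Z_{N/C}]

# Z_i stacks the i-th sample of every class, so Z_i is d x C.
Z_blocks = [Z[:, i * C:(i + 1) * C] for i in range(N // C)]

# P_i holds the confidence vectors Softmax(z_{y,i}^T M) as its columns, so P_i is C x C.
P_blocks = [torch.softmax(Zi.T @ M, dim=1).T for Zi in Z_blocks]
print(Z_blocks[0].shape, P_blocks[0].shape)   # torch.Size([2, 4]) torch.Size([4, 4])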

C.2.3 Proof Sketch

We prove Theorem.3.2 by proving three Lemmas:

Lemma C.4 (Variability within Classes).

Consider the update of Gradient Descent (3). Under Assumption.C.3, the features of samples belonging to the same class converge to the same point, i.e., \forall i,j\in[N/C],\|\boldsymbol{Z}_{i}^{(t)}-\boldsymbol{Z}_{j}^{(t)}\|\rightarrow 0\ \text{as}\ t\rightarrow\infty.

Lemma C.5 (Convergence to Self Duality).

Consider the update of Gradient Descent (3). Under Assumption.C.3, the feature of every sample converges to the classifier of its category, i.e., \forall i\in[N/C],\|\boldsymbol{Z}_{i}^{(t)}-\boldsymbol{M}^{(t)}\|\rightarrow 0\ \text{as}\ t\rightarrow\infty.

Lemma C.6 (Convergence to Grassmannian Frame).

Consider the function (1). Given any sequence \{(M^{(t)},Z^{(t)})\}, if \text{CELoss}(M^{(t)},Z^{(t)},Y)\rightarrow 0 as t\rightarrow\infty, then \boldsymbol{M}^{(t)} and \boldsymbol{Z}^{(t)} converge to the solution of

\underset{\boldsymbol{M},\boldsymbol{Z}}{\max}\ \underset{y\neq y^{\prime}}{\min}\ \underset{i\in[N/C]}{\min}\langle M_{y}-M_{y^{\prime}},\boldsymbol{z}_{y,i}\rangle

Obviously, Lemma.C.4 and Lemma.C.5 directly imply NC1 and NC2. We first prove them; they show the aggregation effect of the cross entropy loss: the features and the linear classifier of the same class converge to the same point. NC4 is then obvious under the conclusions of Lemma.C.4 and Lemma.C.5. Lemma.C.6 is the key step, proving that (1) converges to the min-max optimization as \mathcal{L}(\boldsymbol{M},\boldsymbol{Z}) converges to zero. However, we still need to combine all lemmas to obtain the minimized maximal correlation characteristic. Here is the proof:

Proof of (NC3).

First, we know

min𝒁,𝑴CELoss(𝑴,𝒁)(1)\displaystyle\underbrace{\min_{\boldsymbol{Z},\boldsymbol{M}}\text{CELoss}(\boldsymbol{M},\boldsymbol{Z})}_{(\ref{opt0})}\Leftrightarrow minρ>0min𝒛y,i,MiρCELoss(𝑴,𝒁)(1)s.t.(2)minλ,ω>0min𝒁,𝑴(𝑴,𝒁)(3),\displaystyle\min_{\rho>0}\underbrace{\min_{\|\boldsymbol{z}_{y,i}\|,\|M_{i}\|\leq\rho}\text{CELoss}(\boldsymbol{M},\boldsymbol{Z})}_{(\ref{opt0})\ s.t.(\ref{opt1})}\Leftrightarrow\min_{\lambda,\omega>0}\underbrace{\min_{\boldsymbol{Z},\boldsymbol{M}}\mathcal{L}(\boldsymbol{M},\boldsymbol{Z})}_{(\ref{opt2})},

while Lemma.C.6 establishes a bridge between (1) and the following max-min problem:

min𝒁,𝑴CELoss(𝑴,𝒁)(1)maxρ>0max𝒛y,i,Miρminyymini[N/C]MyMy,𝒛y,imax-min\displaystyle\underbrace{\min_{\boldsymbol{Z},\boldsymbol{M}}\text{CELoss}(\boldsymbol{M},\boldsymbol{Z})}_{(\ref{opt0})}\Leftrightarrow\max_{\rho>0}\underbrace{\max_{\|\boldsymbol{z}_{y,i}\|,\|M_{i}\|\leq\rho}\underset{y\neq y^{\prime}}{\min}\underset{i\in[N/C]}{\min}\langle M_{y}-M_{y^{\prime}},\boldsymbol{z}_{y,i}\rangle}_{\textbf{max-min}}

According to Lemma.C.4 and C.5, we know the solutions of (3) converge to

\forall y\in[C],\forall i\in[N/C],\ \boldsymbol{z}_{y,i}=M_{y}, \qquad (5)

and there must exist a ρ(λ,ω)\rho(\lambda,\omega) such that

\forall y\in[C],\forall i\in[N/C],\ \|\boldsymbol{z}_{y,i}\|=\|M_{y}\|=\rho(\lambda,\omega). \qquad (6)

Therefore, we substitute (5) and (6) into max-min:

max𝒛y,i,Miρminyymini[N/C]MyMy,𝒛y,i\displaystyle\max_{\|\boldsymbol{z}_{y,i}\|,\|M_{i}\|\leq\rho}\underset{y\neq y^{\prime}}{\min}\underset{i\in[N/C]}{\min}\langle M_{y}-M_{y^{\prime}},\boldsymbol{z}_{y,i}\rangle
\displaystyle\Leftrightarrow maxMi=ρminyyMyMy,My\displaystyle\max_{\|M_{i}\|=\rho}\underset{y\neq y^{\prime}}{\min}\langle M_{y}-M_{y^{\prime}},M_{y}\rangle
\displaystyle\Leftrightarrow maxMi=ρminyy(ρ2My,My)\displaystyle\max_{\|M_{i}\|=\rho}\underset{y\neq y^{\prime}}{\min}\left(\rho^{2}-\langle M_{y^{\prime}},M_{y}\rangle\right)
\displaystyle\Leftrightarrow minMi=ρmaxyyMy,Mymin-max\displaystyle\underbrace{\min_{\|M_{i}\|=\rho}\underset{y\neq y^{\prime}}{\max}\langle M_{y^{\prime}},M_{y}\rangle}_{\textbf{min-max}}

where the \Leftrightarrow symbol means that the solutions of these optimization problems converge to the same point. The min-max problem is exactly the desired minimized maximal correlation characteristic. To ensure that the above equivalences hold, \rho\rightarrow\infty is required so that \text{CELoss}(\boldsymbol{Z},\boldsymbol{M})\rightarrow 0. ∎

C.2.4 Gradient Calculation

Before starting our proof, we calculate the gradients of (3) with respect to the features \boldsymbol{Z} and the classifiers \boldsymbol{M}:

\forall y\in[C],\forall i\in[N/C],\ \nabla_{\boldsymbol{z}_{y,i}}\mathcal{L}(\boldsymbol{Z},\boldsymbol{M})=-M_{y}+\sum_{y^{\prime}=1}^{C}\left[Softmax(\boldsymbol{z}_{y,i}^{T}\boldsymbol{M})\right]_{y^{\prime}}M_{y^{\prime}}+\omega\boldsymbol{z}_{y,i}
\forall y\in[C],\ \nabla_{M_{y}}\mathcal{L}(\boldsymbol{Z},\boldsymbol{M})=-\sum_{i=1}^{N/C}\boldsymbol{z}_{y,i}+\sum_{y^{\prime}=1}^{C}\sum_{i=1}^{N/C}\left[Softmax(\boldsymbol{z}_{y^{\prime},i}^{T}\boldsymbol{M})\right]_{y}\boldsymbol{z}_{y^{\prime},i}+\lambda M_{y}

Then we write it in matrix form by arranging y=1,\cdots,C along the columns:

\forall i\in[N/C],\ \nabla_{\boldsymbol{Z}_{i}}\mathcal{L}(\boldsymbol{Z},\boldsymbol{M})=-\boldsymbol{M}+\boldsymbol{M}\boldsymbol{P}_{i}+\omega\boldsymbol{Z}_{i} \qquad (7)
\nabla_{\boldsymbol{M}}\mathcal{L}(\boldsymbol{Z},\boldsymbol{M})=-\sum_{i=1}^{N/C}\boldsymbol{Z}_{i}+\sum_{i=1}^{N/C}\boldsymbol{Z}_{i}\boldsymbol{P}_{i}^{T}+\lambda\boldsymbol{M}
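These closed forms can be checked against automatic differentiation. The sketch below is our own illustration and assumes that the regularized objective (3) is the cross-entropy loss plus the penalties \frac{\omega}{2}\|\boldsymbol{Z}\|^{2} and \frac{\lambda}{2}\|\boldsymbol{M}\|^{2}, the form consistent with the gradients displayed above:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
C, d, n_per_class = 4, 3, 5
lam, omega = 0.2, 0.1

M = torch.randn(d, C, requires_grad=True)                  # classifiers as columns
Z = torch.randn(d, C * n_per_class, requires_grad=True)    # features as columns
labels = torch.arange(C).repeat(n_per_class)               # class label of every column of Z

logits = Z.T @ M                                           # row k holds z_k^T M
loss = (F.cross_entropy(logits, labels, reduction="sum")
        + 0.5 * omega * Z.pow(2).sum() + 0.5 * lam * M.pow(2).sum())
loss.backward()

# Closed-form gradients; here P stores the Softmax(z^T M) vectors in its rows.
P = torch.softmax(logits, dim=1)
onehot = F.one_hot(labels, C).float()
grad_Z = M @ (P - onehot).T + omega * Z                    # matches the first line of (7)
grad_M = Z @ (P - onehot) + lam * M                        # matches the second line of (7)
print(torch.allclose(Z.grad, grad_Z, atol=1e-5), torch.allclose(M.grad, grad_M, atol=1e-5))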

C.3 Proof of Lemma.C.4

See C.4

Proof.

With the conclusion of Lemma.C.5, we can easily prove Lemma.C.4. For any i,j\in[N/C], as t\rightarrow\infty we have

\|\boldsymbol{Z}_{i}^{(t)}-\boldsymbol{Z}_{j}^{(t)}\|=\|\boldsymbol{Z}_{i}^{(t)}-\boldsymbol{M}^{(t)}+\boldsymbol{M}^{(t)}-\boldsymbol{Z}_{j}^{(t)}\|\leq\|\boldsymbol{Z}_{i}^{(t)}-\boldsymbol{M}^{(t)}\|+\|\boldsymbol{Z}_{j}^{(t)}-\boldsymbol{M}^{(t)}\|\rightarrow 0

C.4 Proof of Lemma.C.5

See C.5

Proof.

According to the update rule (4) and gradient (7), for any i[N/C]i\in[N/C], we have

\boldsymbol{Z}_{i}^{(t+1)}\leftarrow\boldsymbol{Z}_{i}^{(t)}-\alpha\left[-\boldsymbol{M}^{(t)}+\boldsymbol{M}^{(t)}\boldsymbol{P}_{i}^{(t)}+\omega\boldsymbol{Z}_{i}^{(t)}\right]
\boldsymbol{M}^{(t+1)}\leftarrow\boldsymbol{M}^{(t)}-\beta\left[-\sum_{j\in[N/C]}\boldsymbol{Z}_{j}^{(t)}+\sum_{j\in[N/C]}\boldsymbol{Z}_{j}^{(t)}\left(\boldsymbol{P}_{j}^{(t)}\right)^{T}+\lambda\boldsymbol{M}^{(t)}\right]

Then, we bound \Delta(t+1,i)=\|\boldsymbol{Z}_{i}^{(t+1)}-\boldsymbol{M}^{(t+1)}\|:

\Delta(t+1,i)=\Bigg\|\boldsymbol{Z}_{i}^{(t)}-\boldsymbol{M}^{(t)}+\underbrace{\left(\alpha\boldsymbol{M}^{(t)}-\beta\sum_{j\in[N/C]}\boldsymbol{Z}_{j}^{(t)}\right)}_{\textbf{(a)}}-\underbrace{\left(\alpha\boldsymbol{M}^{(t)}\boldsymbol{P}_{i}^{(t)}-\beta\sum_{j\in[N/C]}\boldsymbol{Z}_{j}^{(t)}\left(\boldsymbol{P}_{j}^{(t)}\right)^{T}\right)}_{\textbf{(b)}}-\underbrace{\left(\alpha\omega\boldsymbol{Z}_{i}^{(t)}-\beta\lambda\boldsymbol{M}^{(t)}\right)}_{\textbf{(c)}}\Bigg\| \qquad (8)

Then we use the assumption that α/β=λ/ω=N/C\alpha/\beta=\lambda/\omega=N/C, and consider (a),(b)\textbf{(a)},\textbf{(b)} and (c) separately.

(a) =αCNj[N/C](𝑴(t)𝒁j(t))\displaystyle=\frac{\alpha C}{N}\sum_{j\in[N/C]}\left(\boldsymbol{M}^{(t)}-\boldsymbol{Z}_{j}^{(t)}\right) (9)
(b) =αCNj[N/C](𝑴(t)𝑷i(t)𝒁j(t)(𝑷j(t))T)\displaystyle=\frac{\alpha C}{N}\sum_{j\in[N/C]}\left(\boldsymbol{M}^{(t)}\boldsymbol{P}_{i}^{(t)}-\boldsymbol{Z}_{j}^{(t)}\left(\boldsymbol{P}_{j}^{(t)}\right)^{T}\right)
(c) =αω(𝒁i(t)𝑴(t))\displaystyle=\alpha\omega\left(\boldsymbol{Z}_{i}^{(t)}-\boldsymbol{M}^{(t)}\right)

Then we combine (a) and (b):

(a)(b)\displaystyle\textbf{(a)}-\textbf{(b)} (10)
=\displaystyle= αCNj[N/C][(𝑴(t)𝑴(t)𝑷i(t))(𝒁j(t)𝒁j(t)(𝑷j(t))T)]\displaystyle\frac{\alpha C}{N}\sum_{j\in[N/C]}\left[\left(\boldsymbol{M}^{(t)}-\boldsymbol{M}^{(t)}\boldsymbol{P}_{i}^{(t)}\right)-\left(\boldsymbol{Z}_{j}^{(t)}-\boldsymbol{Z}_{j}^{(t)}\left(\boldsymbol{P}_{j}^{(t)}\right)^{T}\right)\right]
=\displaystyle= αCNj[N/C][𝑴(t)(I𝑷i(t))𝒁j(t)(I(𝑷j(t))T)]\displaystyle\frac{\alpha C}{N}\sum_{j\in[N/C]}\left[\boldsymbol{M}^{(t)}\left(I-\boldsymbol{P}_{i}^{(t)}\right)-\boldsymbol{Z}_{j}^{(t)}\left(I-\left(\boldsymbol{P}_{j}^{(t)}\right)^{T}\right)\right]
=\displaystyle= αCNj[N/C][𝑴(t)(I𝑷i(t))𝑴(t)(I(𝑷j(t))T)+\displaystyle\frac{\alpha C}{N}\sum_{j\in[N/C]}\Bigg{[}\boldsymbol{M}^{(t)}\left(I-\boldsymbol{P}_{i}^{(t)}\right)-\boldsymbol{M}^{(t)}\left(I-\left(\boldsymbol{P}_{j}^{(t)}\right)^{T}\right)+
𝑴(t)(I(𝑷j(t))T)𝒁j(t)(I(𝑷j(t))T)]\displaystyle\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \boldsymbol{M}^{(t)}\left(I-\left(\boldsymbol{P}_{j}^{(t)}\right)^{T}\right)-\boldsymbol{Z}_{j}^{(t)}\left(I-\left(\boldsymbol{P}_{j}^{(t)}\right)^{T}\right)\Bigg{]}
=\displaystyle= αCNj[N/C][𝑴(t)((𝑷j(t))T𝑷i(t))+(𝑴(t)𝒁j(t))(I(𝑷j(t))T)]\displaystyle\frac{\alpha C}{N}\sum_{j\in[N/C]}\left[\boldsymbol{M}^{(t)}\left(\left(\boldsymbol{P}_{j}^{(t)}\right)^{T}-\boldsymbol{P}_{i}^{(t)}\right)+\left(\boldsymbol{M}^{(t)}-\boldsymbol{Z}_{j}^{(t)}\right)\left(I-\left(\boldsymbol{P}_{j}^{(t)}\right)^{T}\right)\right]
=\displaystyle= αCNj[N/C]𝑴(t)((𝑷j(t))T𝑷i(t))(A)+αCNj[N/C](𝑴(t)𝒁j(t))(I(𝑷j(t))T)(B)\displaystyle\underbrace{\frac{\alpha C}{N}\sum_{j\in[N/C]}\boldsymbol{M}^{(t)}\left(\left(\boldsymbol{P}_{j}^{(t)}\right)^{T}-\boldsymbol{P}_{i}^{(t)}\right)}_{\textbf{(A)}}+\underbrace{\frac{\alpha C}{N}\sum_{j\in[N/C]}\left(\boldsymbol{M}^{(t)}-\boldsymbol{Z}_{j}^{(t)}\right)\left(I-\left(\boldsymbol{P}_{j}^{(t)}\right)^{T}\right)}_{\textbf{(B)}}
=\displaystyle= (A)+(B)\displaystyle\textbf{(A)}+\textbf{(B)}

Next, we plug (9) and (10) into (8) to derive

Δ(t+1,i)\displaystyle\Delta(t+1,i) =𝒁i(t)𝑴(t)+(a)(b)(c)=𝒁i(t)𝑴(t)(c)+(A)+(B)\displaystyle=\left\|\boldsymbol{Z}_{i}^{(t)}-\boldsymbol{M}^{(t)}+\textbf{(a)}-\textbf{(b)}-\textbf{(c)}\right\|=\left\|\boldsymbol{Z}_{i}^{(t)}-\boldsymbol{M}^{(t)}-\textbf{(c)}+\textbf{(A)}+\textbf{(B)}\right\|
=(1αω)(𝒁i(t)𝑴(t))+(A)+(B)(1αω)Δ(t,i)+(A)+(B)\displaystyle=\left\|(1-\alpha\omega)\left(\boldsymbol{Z}_{i}^{(t)}-\boldsymbol{M}^{(t)}\right)+\textbf{(A)}+\textbf{(B)}\right\|\leq(1-\alpha\omega)\Delta(t,i)+\left\|\textbf{(A)}\right\|+\left\|\textbf{(B)}\right\|

Then we bound (A)\|\textbf{(A)}\| and (B)\|\textbf{(B)}\|.

(A)\displaystyle\|\textbf{(A)}\|\leq αCNj[N/C]𝑴(t)(𝑷j(t))T𝑷i(t)\displaystyle\frac{\alpha C}{N}\sum_{j\in[N/C]}\left\|\boldsymbol{M}^{(t)}\right\|\left\|\left(\boldsymbol{P}_{j}^{(t)}\right)^{T}-\boldsymbol{P}_{i}^{(t)}\right\|
\displaystyle\leq αC3/2ρNj[N/C](𝑷j(t))T𝑷i(t)\displaystyle\frac{\alpha C^{3/2}\rho}{N}\sum_{j\in[N/C]}\left\|\left(\boldsymbol{P}_{j}^{(t)}\right)^{T}-\boldsymbol{P}_{i}^{(t)}\right\|
=\displaystyle= αC3/2ρNj[N/C](𝑷j(t))T𝑷j(t)+𝑷j(t)𝑷i(t)\displaystyle\frac{\alpha C^{3/2}\rho}{N}\sum_{j\in[N/C]}\left\|\left(\boldsymbol{P}_{j}^{(t)}\right)^{T}-\boldsymbol{P}_{j}^{(t)}+\boldsymbol{P}_{j}^{(t)}-\boldsymbol{P}_{i}^{(t)}\right\|
\displaystyle\leq αC3/2ρNj[N/C]((𝑷j(t))T𝑷j(t)(A.1)+𝑷j(t)𝑷i(t)A.2)\displaystyle\frac{\alpha C^{3/2}\rho}{N}\sum_{j\in[N/C]}\left(\left\|\underbrace{\left(\boldsymbol{P}_{j}^{(t)}\right)^{T}-\boldsymbol{P}_{j}^{(t)}}_{\textbf{(A.1)}}\right\|+\left\|\underbrace{\boldsymbol{P}_{j}^{(t)}-\boldsymbol{P}_{i}^{(t)}}_{\textbf{A.2}}\right\|\right)

For (A.1), we have

(A.1)\displaystyle\|\textbf{(A.1)}\| =(𝑷j(t))T𝑷j(t)\displaystyle=\left\|\left(\boldsymbol{P}_{j}^{(t)}\right)^{T}-\boldsymbol{P}_{j}^{(t)}\right\|
C2(𝑴(t))T𝒁j(t)(𝒁j(t))T𝑴(t)\displaystyle\leq\frac{\sqrt{C}}{\sqrt{2}}\left\|\left(\boldsymbol{M}^{(t)}\right)^{T}\boldsymbol{Z}_{j}^{(t)}-\left(\boldsymbol{Z}_{j}^{(t)}\right)^{T}\boldsymbol{M}^{(t)}\right\|
=C2(𝑴(t))T𝒁j(t)(𝑴(t))T𝑴(t)+(𝑴(t))T𝑴(t)(𝒁j(t))T𝑴(t)\displaystyle=\frac{\sqrt{C}}{\sqrt{2}}\left\|\left(\boldsymbol{M}^{(t)}\right)^{T}\boldsymbol{Z}_{j}^{(t)}-\left(\boldsymbol{M}^{(t)}\right)^{T}\boldsymbol{M}^{(t)}+\left(\boldsymbol{M}^{(t)}\right)^{T}\boldsymbol{M}^{(t)}-\left(\boldsymbol{Z}_{j}^{(t)}\right)^{T}\boldsymbol{M}^{(t)}\right\|
C2𝑴(t)𝒁j(t)𝑴(t)+C2𝑴(t)𝒁j(t)𝑴(t)\displaystyle\leq\frac{\sqrt{C}}{\sqrt{2}}\left\|\boldsymbol{M}^{(t)}\right\|\left\|\boldsymbol{Z}_{j}^{(t)}-\boldsymbol{M}^{(t)}\right\|+\frac{\sqrt{C}}{\sqrt{2}}\left\|\boldsymbol{M}^{(t)}-\boldsymbol{Z}_{j}^{(t)}\right\|\left\|\boldsymbol{M}^{(t)}\right\|
=2C𝑴(t)𝒁j(t)𝑴(t)\displaystyle=\sqrt{2C}\left\|\boldsymbol{M}^{(t)}-\boldsymbol{Z}_{j}^{(t)}\right\|\left\|\boldsymbol{M}^{(t)}\right\|
2CρΔ(t,j)\displaystyle\leq\sqrt{2}C\rho\Delta(t,j)

where the first inequality holds because the Softmax(\cdot) function is \sqrt{\frac{C}{2}}-Lipschitz continuous, and the last inequality follows from the bounded norm of \boldsymbol{M}: for all t, we have \|\boldsymbol{M}^{(t)}\|\leq\sqrt{C}\rho. For (A.2), if i=j, then \|\textbf{(A.2)}\|=0. If i\neq j, we have

(A.2)\displaystyle\|\textbf{(A.2)}\| =𝑷j(t)𝑷i(t)\displaystyle=\left\|\boldsymbol{P}_{j}^{(t)}-\boldsymbol{P}_{i}^{(t)}\right\|
C2(𝒁j(t))T𝑴(t)(𝒁i(t))T𝑴(t)\displaystyle\leq\frac{\sqrt{C}}{\sqrt{2}}\left\|\left(\boldsymbol{Z}_{j}^{(t)}\right)^{T}\boldsymbol{M}^{(t)}-\left(\boldsymbol{Z}_{i}^{(t)}\right)^{T}\boldsymbol{M}^{(t)}\right\|
C2𝒁j(t)𝒁i(t)𝑴(t)\displaystyle\leq\frac{\sqrt{C}}{\sqrt{2}}\left\|\boldsymbol{Z}_{j}^{(t)}-\boldsymbol{Z}_{i}^{(t)}\right\|\left\|\boldsymbol{M}^{(t)}\right\|
C2𝒁j(t)𝑴(t)+𝑴(t)𝒁i(t)𝑴(t)\displaystyle\leq\frac{\sqrt{C}}{\sqrt{2}}\left\|\boldsymbol{Z}_{j}^{(t)}-\boldsymbol{M}^{(t)}+\boldsymbol{M}^{(t)}-\boldsymbol{Z}_{i}^{(t)}\right\|\left\|\boldsymbol{M}^{(t)}\right\|
Cρ2(Δ(t,i)+Δ(t,j))\displaystyle\leq\frac{C\rho}{\sqrt{2}}\left(\Delta(t,i)+\Delta(t,j)\right)

For (B), we have

(B)αCNj[N/C]Δ(t,j)I(𝑷j(t))TαC2Nj[N/C]Δ(t,j)\displaystyle\|\textbf{(B)}\|\leq\frac{\alpha C}{N}\sum_{j\in[N/C]}\Delta(t,j)\left\|I-\left(\boldsymbol{P}_{j}^{(t)}\right)^{T}\right\|\leq\frac{\alpha C^{2}}{N}\sum_{j\in[N/C]}\Delta(t,j)

The final inequality holds because the columns of both I and \boldsymbol{P}_{j}^{(t)} are probability vectors, so every entry of I-\boldsymbol{P}_{j}^{(t)} lies in [-1,1] and its norm is at most the norm of the all-one matrix, which is C. Finally, we can bound \Delta(t+1,i):

Δ(t+1,i)\displaystyle\Delta(t+1,i) (1αω)Δ(t,i)+(A)+(B)\displaystyle\leq(1-\alpha\omega)\Delta(t,i)+\|\textbf{(A)}\|+\|\textbf{(B)}\|
(1αω)Δ(t,i)+αC3/2ρNj[N/C]((A.1)+A.2)+αC2Nj[N/C]Δ(t,j)\displaystyle\leq(1-\alpha\omega)\Delta(t,i)+\frac{\alpha C^{3/2}\rho}{N}\sum_{j\in[N/C]}\left(\left\|\textbf{(A.1)}\right\|+\left\|\textbf{A.2}\right\|\right)+\frac{\alpha C^{2}}{N}\sum_{j\in[N/C]}\Delta(t,j)
(1α(ωC2N))Δ(t,i)+αC2NjiΔ(t,j)+\displaystyle\leq\left(1-\alpha\left(\omega-\frac{C^{2}}{N}\right)\right)\Delta(t,i)+\frac{\alpha C^{2}}{N}\sum_{j\neq i}\Delta(t,j)+
αC3/2ρNj[N/C]2CρΔ(t,j)+αC3/2ρNj[N/C]Cρ2(Δ(t,i)+Δ(t,j))\displaystyle\ \ \ \ \ \ \ \ \ \ \ \frac{\alpha C^{3/2}\rho}{N}\sum_{j\in[N/C]}\sqrt{2}C\rho\Delta(t,j)+\frac{\alpha C^{3/2}\rho}{N}\sum_{j\in[N/C]}\frac{C\rho}{\sqrt{2}}\left(\Delta(t,i)+\Delta(t,j)\right)
(1α(ωC2N))Δ(t,i)+αC2NjiΔ(t,j)+\displaystyle\leq\left(1-\alpha\left(\omega-\frac{C^{2}}{N}\right)\right)\Delta(t,i)+\frac{\alpha C^{2}}{N}\sum_{j\neq i}\Delta(t,j)+
2αC5/2ρ2Nj[N/C]Δ(t,j)+αC5/2ρ2N2j[N/C](Δ(t,i)+Δ(t,j))\displaystyle\ \ \ \ \ \ \ \ \ \ \ \frac{\sqrt{2}\alpha C^{5/2}\rho^{2}}{N}\sum_{j\in[N/C]}\Delta(t,j)+\frac{\alpha C^{5/2}\rho^{2}}{N\sqrt{2}}\sum_{j\in[N/C]}\left(\Delta(t,i)+\Delta(t,j)\right)
(1α(ωC2N2C5/2ρ2NC3/2ρ22C5/2ρ2N2)F1)Δ(t,i)+\displaystyle\leq\left(1-\alpha\underbrace{\left(\omega-\frac{C^{2}}{N}-\frac{\sqrt{2}C^{5/2}\rho^{2}}{N}-\frac{C^{3/2}\rho^{2}}{\sqrt{2}}-\frac{C^{5/2}\rho^{2}}{N\sqrt{2}}\right)}_{F1}\right)\Delta(t,i)+
α(C2N+2C5/2ρ2N+C5/2ρ2N2)F2jiΔ(t,j)\displaystyle\ \ \ \ \ \ \ \ \ \ \ \alpha\underbrace{\left(\frac{C^{2}}{N}+\frac{\sqrt{2}C^{5/2}\rho^{2}}{N}+\frac{C^{5/2}\rho^{2}}{N\sqrt{2}}\right)}_{F2}\sum_{j\neq i}\Delta(t,j) (11)

We put all Δ(t+1,)\Delta(t+1,\star) and Δ(t,)\Delta(t,\star) together to derive the difference inequality:

𝚫(t+1)𝑨𝚫(t)\displaystyle\boldsymbol{\Delta}(t+1)\preceq\boldsymbol{A}\boldsymbol{\Delta}(t)

where

\boldsymbol{A}=\left[\begin{array}{cccc}1-\alpha F_{1}&\alpha F_{2}&\ldots&\alpha F_{2}\\ \alpha F_{2}&1-\alpha F_{1}&\cdots&\alpha F_{2}\\ \vdots&\vdots&\ddots&\vdots\\ \alpha F_{2}&\alpha F_{2}&\ldots&1-\alpha F_{1}\\ \end{array}\right]\ \text{and}\ \boldsymbol{\Delta}(t)=\left[\begin{array}{c}\Delta(t,1)\\ \Delta(t,2)\\ \vdots\\ \Delta(t,N/C)\\ \end{array}\right]

In the above notation, the values of F_{1} and F_{2} are given in (11). We investigate whether \boldsymbol{\Delta}(t) converges to the zero vector under a suitable choice of \alpha. According to Lemma.C.2, the eigenvalues of \boldsymbol{A} are

λ1=λ2==λN/C1=1α(F1+F2)\displaystyle\lambda_{1}=\lambda_{2}=\dots=\lambda_{N/C-1}=1-\alpha\left(F_{1}+F_{2}\right)
λN/C=1α(F1(N/C1)F2)\displaystyle\lambda_{N/C}=1-\alpha\left(F_{1}-\left(N/C-1\right)F_{2}\right)

Here, with properly selected parameters, we can make all eigenvalues of \boldsymbol{A} lie in (-1,1). Therefore, as t\rightarrow\infty, \boldsymbol{A}^{t}\rightarrow\boldsymbol{0} and \boldsymbol{\Delta}(t)\rightarrow\boldsymbol{0}. ∎
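The last step can be illustrated numerically (our own sketch, with arbitrary values of \alpha, F_{1}, F_{2} satisfying F_{1}>(N/C-1)F_{2}>0): the eigenvalues of \boldsymbol{A} match the formulas above, and iterating \boldsymbol{\Delta}(t+1)=\boldsymbol{A}\boldsymbol{\Delta}(t) drives \boldsymbol{\Delta}(t) to zero when they all lie in (-1,1).

import torch

n, alpha, F1, F2 = 8, 0.1, 1.0, 0.05   # n plays the role of N/C (arbitrary toy values)
A = alpha * F2 * torch.ones(n, n) + (1 - alpha * F1 - alpha * F2) * torch.eye(n)

# Expect n-1 copies of 1 - alpha (F1 + F2) and one copy of 1 - alpha (F1 - (n-1) F2).
print(torch.linalg.eigvalsh(A))

delta = torch.rand(n)
for _ in range(500):
    delta = A @ delta
print("||Delta(t)|| after 500 iterations:", delta.norm().item())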

C.5 Proof of Lemma.C.6

See C.6

Proof.

For simplicity, we omit the superscript (t). First, for all t we have

CELoss(𝒁,𝑴)=\displaystyle\text{CELoss}(\boldsymbol{Z},\boldsymbol{M})= y=1Ci=1N/Clogexp(My,𝒛y,i)yexp(My,𝒛y,i)\displaystyle\sum_{y=1}^{C}\sum_{i=1}^{N/C}-\log\frac{\exp\big{(}\langle M_{y},\boldsymbol{z}_{y,i}\rangle\big{)}}{\sum_{y^{\prime}}\exp\big{(}\langle M_{y^{\prime}},\boldsymbol{z}_{y,i}\rangle\big{)}} (12)
=\displaystyle= y=1Ci=1N/Clog(1+yyexp(MyMy,𝒛y,i))\displaystyle\sum_{y=1}^{C}\sum_{i=1}^{N/C}\log\bigg{(}1+\sum_{y^{\prime}\neq y}\exp\big{(}\langle M_{y^{\prime}}-M_{y},\boldsymbol{z}_{y,i}\rangle\big{)}\bigg{)}
\displaystyle\leq y=1Ci=1N/Clog(1+(C1)exp(maxyy{MyMy,𝒛y,i)})\displaystyle\sum_{y=1}^{C}\sum_{i=1}^{N/C}\log\bigg{(}1+(C-1)\exp\big{(}\max_{y^{\prime}\neq y}\{\langle M_{y^{\prime}}-M_{y},\boldsymbol{z}_{y,i}\rangle\big{)}\}\bigg{)}
\displaystyle\leq NCy=1Clog(1+(C1)exp(maxyymaxi[N/C]{MyMy,𝒛y,i)})\displaystyle\frac{N}{C}\sum_{y=1}^{C}\log\bigg{(}1+(C-1)\exp\big{(}\max_{y^{\prime}\neq y}\max_{i\in[N/C]}\{\langle M_{y^{\prime}}-M_{y},\boldsymbol{z}_{y,i}\rangle\big{)}\}\bigg{)}
\displaystyle\leq Nmaxy[C]log(1+(C1)exp(maxyymaxi[N/C]{MyMy,𝒛y,i)})\displaystyle N\max_{y\in[C]}\log\bigg{(}1+(C-1)\exp\big{(}\max_{y^{\prime}\neq y}\max_{i\in[N/C]}\{\langle M_{y^{\prime}}-M_{y},\boldsymbol{z}_{y,i}\rangle\big{)}\}\bigg{)}
=\displaystyle= Nlog(1+(C1)exp(maxy[C]maxyymaxi[N/C]{MyMy,𝒛y,i)})\displaystyle N\log\bigg{(}1+(C-1)\exp\big{(}\max_{y\in[C]}\max_{y^{\prime}\neq y}\max_{i\in[N/C]}\{\langle M_{y^{\prime}}-M_{y},\boldsymbol{z}_{y,i}\rangle\big{)}\}\bigg{)}

In addition, we have

log(1+exp(maxy[C]maxyymaxi[N/C]{MyMy,𝒛y,i)})CELoss(𝒁,𝑴)\displaystyle\log\bigg{(}1+\exp\big{(}\max_{y\in[C]}\max_{y^{\prime}\neq y}\max_{i\in[N/C]}\{\langle M_{y^{\prime}}-M_{y},\boldsymbol{z}_{y,i}\rangle\big{)}\}\bigg{)}\leq\text{CELoss}(\boldsymbol{Z},\boldsymbol{M}) (13)

We write \max_{y^{\prime}\neq y} as shorthand for \max_{y\in[C]}\max_{y^{\prime}\neq y} and define the margin of the entire dataset (refer to Section.3.1 of Ji et al. [2022]) as follows:

p_{min}:=\min_{y\neq y^{\prime}}\min_{i\in[N/C]}\langle M_{y}-M_{y^{\prime}},z_{y,i}\rangle

Therefore, we have

log(1+exp(pmin))1(pmin)CELoss(𝒁,𝑴)Nlog(1+(C1)exp(pmin))C1(pmin)\displaystyle\underbrace{\log\bigg{(}1+\exp\left(-p_{min}\right)\bigg{)}}_{\ell_{1}(p_{min})}\leq\text{CELoss}(\boldsymbol{Z},\boldsymbol{M})\leq N\underbrace{\log\bigg{(}1+(C-1)\exp\left(-p_{min}\right)\bigg{)}}_{\ell_{C-1}(p_{min})} (14)

where \ell_{a}(p)=\log(1+ae^{-p}). Then we represent \ell_{a}(\cdot) in exponential form, i.e.,

a(p)=eϕa(p)andϕa(p)=loglog(1+aep).\begin{aligned} \ell_{a}(p)=e^{-\phi_{a}(p)}\ \ \text{and}\ \ \phi_{a}(p)=-\log\log(1+ae^{-p})\end{aligned}.

Denote the inverse function of \phi_{a}(\cdot) as \Phi_{a}(\cdot), where \Phi_{a}(p)=-\log(\frac{e^{e^{-p}}-1}{a}). Then, continuing from (14), we have

1(pmin)CELoss(𝒁,𝑴)NC1(pmin)\displaystyle\ell_{1}(p_{min})\leq\text{CELoss}(\boldsymbol{Z},\boldsymbol{M})\leq N\ell_{C-1}(p_{min})
\displaystyle\Leftrightarrow eϕ1(pmin)CELoss(𝒁,𝑴)NeϕC1(pmin)\displaystyle e^{-\phi_{1}(p_{min})}\leq\text{CELoss}(\boldsymbol{Z},\boldsymbol{M})\leq Ne^{-\phi_{C-1}(p_{min})}
\displaystyle\Leftrightarrow ϕC1(pmin)log(N)log(CELoss(𝒁,𝑴))ϕ1(pmin)\displaystyle\phi_{C-1}(p_{min})-\log(N)\leq-\log(\text{CELoss}(\boldsymbol{Z},\boldsymbol{M}))\leq\phi_{1}(p_{min})

According to the monotonicity of Φ1()\Phi_{1}(\cdot), we have

Φ1(ϕC1(pmin)log(N))Φ1(log(CELoss(𝒁,𝑴)))pmin\displaystyle\Phi_{1}\left(\phi_{C-1}(p_{min})-\log(N)\right)\leq\Phi_{1}\left(-\log(\text{CELoss}(\boldsymbol{Z},\boldsymbol{M}))\right)\leq p_{min}

Using the mean value theorem, there exists \xi\in(\phi_{C-1}(p_{min})-\log(N),\phi_{1}(p_{min})) such that

Φ1(ϕC1(pmin)log(N))=pminΦ1(ξ)(ϕ1(pmin)ϕC1(pmin)+log(N)),\displaystyle\Phi_{1}(\phi_{C-1}(p_{min})-\log(N))=p_{min}-\Phi_{1}^{{}^{\prime}}(\xi)(\phi_{1}(p_{min})-\phi_{C-1}(p_{min})+\log(N)),

then

pminΦ1(ξ)(ϕ1(pmin)ϕC1(pmin)+log(N))Δ(t)Φ1(log(CELoss(𝒁,𝑴)))pmin\displaystyle p_{min}-\underbrace{\Phi_{1}^{{}^{\prime}}(\xi)(\phi_{1}(p_{min})-\phi_{C-1}(p_{min})+\log(N))}_{\Delta(t)}\leq\Phi_{1}\left(-\log(\text{CELoss}(\boldsymbol{Z},\boldsymbol{M}))\right)\leq p_{min} (15)

Then we show that \Delta(t)=\mathcal{O}(1) as t\rightarrow\infty. Since

ξ>ϕC1(pmin)log(N)log(CELoss(𝒁,𝑴))log(N),\begin{aligned} \xi>\phi_{C-1}(p_{min})-\log(N)\geq-\log(\text{CELoss}(\boldsymbol{Z},\boldsymbol{M}))-\log(N)\end{aligned},

we know ξ\xi\rightarrow\infty and pminp_{min}\rightarrow\infty as CELoss(𝒁,𝑴)0\text{CELoss}(\boldsymbol{Z},\boldsymbol{M})\rightarrow 0. By simple calculation, we have ϕ1(pmin)ϕC1(pmin)log(C1)\phi_{1}(p_{min})-\phi_{C-1}(p_{min})\rightarrow\log(C-1) and Φ1(ξ)=eeξξeeξ11\Phi_{1}^{{}^{\prime}}(\xi)=\frac{e^{e^{-\xi}-\xi}}{e^{e^{-\xi}}-1}\rightarrow 1. Next, we denote the maximal norm at iteration tt as

ρt=max𝒗𝒁(t)𝑴(t)𝒗.\rho_{t}=\max_{\boldsymbol{v}\in\boldsymbol{Z}^{(t)}\cup\boldsymbol{M}^{(t)}}\|\boldsymbol{v}\|.

Since p_{min}\rightarrow\infty, we have \rho_{t}\rightarrow\infty. Then we divide both sides of (15) by \rho_{t} to obtain

|Φ1(log(CELoss(𝒁,𝑴)))ρtpminρt|0,t,\displaystyle\Bigg{|}\frac{\Phi_{1}\left(-\log(\text{CELoss}(\boldsymbol{Z},\boldsymbol{M}))\right)}{\rho_{t}}-\frac{p_{min}}{\rho_{t}}\Bigg{|}\rightarrow 0,t\rightarrow\infty,

therefore

Φ1(log(CELoss(𝒁,𝑴)))ρtpminρt,t.\begin{aligned} \frac{\Phi_{1}\left(-\log(\text{CELoss}(\boldsymbol{Z},\boldsymbol{M}))\right)}{\rho_{t}}\rightarrow\frac{p_{min}}{\rho_{t}},t\rightarrow\infty\end{aligned}.

So

min𝑴,𝒁CELoss(𝒁,𝑴)\displaystyle\min_{\boldsymbol{M},\boldsymbol{Z}}\text{CELoss}(\boldsymbol{Z},\boldsymbol{M})\Leftrightarrow minρ>0minMi,𝒛y,iρCELoss(𝒁,𝑴)\displaystyle\min_{\rho>0}\min_{\|M_{i}\|,\|\boldsymbol{z}_{y,i}\|\leq\rho}\text{CELoss}(\boldsymbol{Z},\boldsymbol{M})
\displaystyle\Leftrightarrow maxρ>0maxMi,𝒛y,iρΦ1(log(CELoss(𝒁,𝑴)))\displaystyle\max_{\rho>0}\max_{\|M_{i}\|,\|\boldsymbol{z}_{y,i}\|\leq\rho}\Phi_{1}\left(-\log(\text{CELoss}(\boldsymbol{Z},\boldsymbol{M}))\right)
\displaystyle\Leftrightarrow maxρ>0maxMi,𝒛y,iρρΦ1(log(CELoss(𝒁,𝑴)))ρ\displaystyle\max_{\rho>0}\max_{\|M_{i}\|,\|\boldsymbol{z}_{y,i}\|\leq\rho}\rho\frac{\Phi_{1}\left(-\log(\text{CELoss}(\boldsymbol{Z},\boldsymbol{M}))\right)}{\rho}
\displaystyle\Leftrightarrow maxρ>0maxMi,𝒛y,iρρpminρ\displaystyle\max_{\rho>0}\max_{\|M_{i}\|,\|\boldsymbol{z}_{y,i}\|\leq\rho}\rho\frac{p_{min}}{\rho}
\displaystyle\Leftrightarrow maxρ>0maxMi,𝒛y,iρminyymini[N/C]MyMy,𝒛y,i\displaystyle\max_{\rho>0}\max_{\|M_{i}\|,\|\boldsymbol{z}_{y,i}\|\leq\rho}\min_{y\neq y^{\prime}}\min_{i\in[N/C]}\langle M_{y}-M_{y^{\prime}},\boldsymbol{z}_{y,i}\rangle

The \Leftrightarrow symbol in the above equations means that the solutions of these optimization problems converge to the same point. ∎
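The analytic facts used in this proof, namely that \Phi_{a} inverts \phi_{a}, that \phi_{1}(p)-\phi_{C-1}(p)\rightarrow\log(C-1) and that \Phi_{1}^{\prime}(\xi)\rightarrow 1, can be checked numerically; the following sketch is our own illustration (log1p and expm1 keep the evaluation stable for large arguments):

import math

C = 10

def phi(a, p):       # phi_a(p) = -log log(1 + a e^{-p})
    return -math.log(math.log1p(a * math.exp(-p)))

def Phi1(q):         # Phi_1(q) = -log(e^{e^{-q}} - 1), the inverse of phi_1
    return -math.log(math.expm1(math.exp(-q)))

def dPhi1(q):        # derivative of Phi_1
    return math.exp(math.exp(-q) - q) / math.expm1(math.exp(-q))

for p in [5.0, 10.0, 20.0, 30.0]:
    print(f"p={p:5.1f}  Phi1(phi_1(p))={Phi1(phi(1, p)):8.4f}  "
          f"phi_1(p)-phi_(C-1)(p)={phi(1, p) - phi(C - 1, p):.4f} (log(C-1)={math.log(C - 1):.4f})  "
          f"Phi1'(p)={dPhi1(p):.6f}")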

Appendix D Proof of Theorem.3.8

Theorem D.1 (Theorem.5 of Kakade et al. [2008]: Margin Bound).

Consider a data space \mathcal{X} and a probability measure \mathcal{P} on it. There is a dataset \{x_{i}\}_{i=1}^{n} that contains n samples drawn i.i.d. from \mathcal{P}. Consider an arbitrary function class \mathcal{F} such that \forall f\in\mathcal{F} we have \sup_{x\in\mathcal{X}}|f(x)|\leq K. Then, with probability at least 1-\delta over the sample, for all margins \gamma>0 and all f\in\mathcal{F} we have

x(f(x)0)i=1n𝕀(f(xi)γ)n+n()γ+log(log24Kγ)n+log(1/δ)2n\displaystyle\mathbb{P}_{x}(f(x)\leq 0)\leq\sum_{i=1}^{n}\frac{\mathbb{I}(f(x_{i})\leq\gamma)}{n}+\frac{\mathfrak{R}_{n}(\mathcal{F})}{\gamma}+\sqrt{\frac{\log(\log_{2}\frac{4K}{\gamma})}{n}}+\sqrt{\frac{\log(1/\delta)}{2n}}
Theorem.3.8 (Multiclass Margin Bound).

Consider a dataset S with C classes. For any classifier (\boldsymbol{M},f(\cdot;\boldsymbol{w})), we denote its margin between classes i and j as (M_{i}-M_{j})^{T}f(\cdot;\boldsymbol{w}). Suppose the function space of the margins is \mathcal{F}=\{(M_{i}-M_{j})^{T}f(\cdot;\boldsymbol{w})\,|\,\forall i\neq j,\forall\boldsymbol{M},\boldsymbol{w}\}, whose upper bound is

supijsup𝑴,𝒘supxi|(MiMj)Tf(𝒙;𝒘)|K.\displaystyle\sup_{i\neq j}\sup_{\boldsymbol{M},\boldsymbol{w}}\sup_{x\in\mathcal{M}_{i}}\left|(M_{i}-M_{j})^{T}f(\boldsymbol{x};\boldsymbol{w})\right|\leq K.

Then, for any classifier (𝐌,f(;𝐰))(\boldsymbol{M},f(\cdot;\boldsymbol{w})) and margins {γi,j}ij(γi,j>0)\{\gamma_{i,j}\}_{i\neq j}(\gamma_{i,j}>0), the following inequality holds with probability at least 1δ1-\delta

\mathbb{P}_{x,y}\Big{(}\underset{c}{\arg\max}[Mf(\boldsymbol{x};\boldsymbol{w})]_{c}\neq y\Big{)}\leq\sum_{i=1}^{C}p(i)\sum_{j\neq i}\frac{\mathfrak{R}_{N_{i}}(\mathcal{F})}{\gamma_{i,j}}+\sum_{i=1}^{C}p(i)\sum_{j\neq i}\sqrt{\frac{\log(\log_{2}\frac{4K}{\gamma_{i,j}})}{N_{i}}}+\ \text{empirical risk term}\ +\ \text{probability term}

where

empirical risk term =i=1Cp(i)jixSi𝕀((MiMj)Tf(x)γi,j)Ni,\displaystyle=\sum_{i=1}^{C}p(i)\sum_{j\neq i}\sum_{x\in S_{i}}\frac{\mathbb{I}((M_{i}-M_{j})^{T}f(x)\leq\gamma_{i,j})}{N_{i}},
probability term =i=1Cp(i)jilog(C(C1)/δ)2Ni.\displaystyle=\sum_{i=1}^{C}p(i)\sum_{j\neq i}\sqrt{\frac{\log(C(C-1)/\delta)}{2N_{i}}}.

Ni()\mathfrak{R}_{N_{i}}(\mathcal{F}) is the Rademacher complexity Kakade et al. [2008], Bartlett and Mendelson [2002] of function space \mathcal{F}.

Proof.

First, we decompose the error probability into class-conditional errors using the law of total probability:

𝒙,y(argmax𝑐[𝑴f(𝒙;𝒘)]cy)=i=1Cp(i)𝒙|y=i(argmax𝑐[𝑴f(𝒙;𝒘)]cy)\displaystyle\mathbb{P}_{\boldsymbol{x},y}\Big{(}\underset{c}{\arg\max}[\boldsymbol{M}f(\boldsymbol{x};\boldsymbol{w})]_{c}\neq y\Big{)}=\sum_{i=1}^{C}p(i)\mathbb{P}_{\boldsymbol{x}|y=i}\Big{(}\underset{c}{\arg\max}[\boldsymbol{M}f(\boldsymbol{x};\boldsymbol{w})]_{c}\neq y\Big{)} (16)

where p(i) is the prior probability of the i-th class. Then, we focus on the error within each class i:

𝒙|y=i(argmax𝑐[𝑴f(𝒙;𝒘)]cy)=𝒙|y=i(ji{(MiMj)Tf(𝒙;𝒘)<0})\displaystyle\mathbb{P}_{\boldsymbol{x}|y=i}\Big{(}\underset{c}{\arg\max}[\boldsymbol{M}f(\boldsymbol{x};\boldsymbol{w})]_{c}\neq y\Big{)}=\mathbb{P}_{\boldsymbol{x}|y=i}\Bigg{(}\bigcup_{j\neq i}\{(M_{i}-M_{j})^{T}f(\boldsymbol{x};\boldsymbol{w})<0\}\Bigg{)}

According to union bound, we have

𝒙|y=i(argmax𝑐[𝑴f(𝒙;𝒘)]cy)ji𝒙|y=i((MiMj)Tf(𝒙;𝒘)<0)\displaystyle\mathbb{P}_{\boldsymbol{x}|y=i}\Big{(}\underset{c}{\arg\max}[\boldsymbol{M}f(\boldsymbol{x};\boldsymbol{w})]_{c}\neq y\Big{)}\leq\sum_{j\neq i}\mathbb{P}_{\boldsymbol{x}|y=i}\Big{(}(M_{i}-M_{j})^{T}f(\boldsymbol{x};\boldsymbol{w})<0\Big{)}

Recall our assumption of function class:

supijsup𝑴,𝒘sup𝒙i|(MiMj)Tf(𝒙;𝒘)|K.\sup_{i\neq j}\sup_{\boldsymbol{M},\boldsymbol{w}}\sup_{\boldsymbol{x}\in\mathcal{M}_{i}}|(M_{i}-M_{j})^{T}f(\boldsymbol{x};\boldsymbol{w})|\leq K.

Then, following from the Margin Bound (Theorem.D.1), we have

\mathbb{P}_{\boldsymbol{x},y}\Big{(}\underset{c}{\arg\max}[Mf(\boldsymbol{x};\boldsymbol{w})]_{c}\neq y\Big{)}\leq\sum_{i=1}^{C}p(i)\sum_{j\neq i}\mathbb{P}_{x|y=i}\Big{(}(M_{i}-M_{j})^{T}f(\boldsymbol{x};\boldsymbol{w})<0\Big{)}
\leq\sum_{i=1}^{C}p(i)\sum_{j\neq i}\frac{\mathfrak{R}_{N_{i}}(\mathcal{F})}{\gamma_{i,j}}+\sum_{i=1}^{C}p(i)\sum_{j\neq i}\sqrt{\frac{\log(\log_{2}\frac{4K}{\gamma_{i,j}})}{N_{i}}}+
\underbrace{\sum_{i=1}^{C}p(i)\sum_{j\neq i}\sqrt{\frac{\log(1/\delta)}{2N_{i}}}}_{\text{probability term}}+\underbrace{\sum_{i=1}^{C}p(i)\sum_{j\neq i}\sum_{\boldsymbol{x}\in S_{i}}\frac{\mathbb{I}((M_{i}-M_{j})^{T}f(\boldsymbol{x})\leq\gamma_{i,j})}{N_{i}}}_{\text{empirical risk term}}

with probability at least 1-C(C-1)\delta. Then, we perform the following replacement to derive the final result:

δδC(C1)\displaystyle\delta\leftarrow\frac{\delta}{C(C-1)}

Appendix E Proof of Theorem.4.2

We first introduce the definition of Covering Number.

Definition E.1 (Covering Number Kulkarni and Posner [1995]).

Given ϵ>0\epsilon>0 and 𝒙D\boldsymbol{x}\in\mathbb{R}^{D}, the open ball of radius ϵ\epsilon around 𝒙\boldsymbol{x} is denoted as

Bϵ(𝒙)={𝒖D,𝒖𝒙<ϵ}.\begin{aligned} B_{\epsilon}(\boldsymbol{x})=\{\boldsymbol{u}\in\mathbb{R}^{D},\|\boldsymbol{u}-\boldsymbol{x}\|<\epsilon\}\end{aligned}.

Then the covering number 𝒩(ϵ,A)\mathcal{N}(\epsilon,A) of a set ADA\subset\mathbb{R}^{D} is defined as the smallest number of open balls whose union contains AA:

\mathcal{N}(\epsilon,A)=\inf\left\{k:\exists\boldsymbol{u}_{1},\dots,\boldsymbol{u}_{k}\in\mathbb{R}^{D},\ s.t.\ A\subseteq\bigcup_{i=1}^{k}B_{\epsilon}(\boldsymbol{u}_{i})\right\}
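For intuition, a covering number can be upper-bounded on a finite point cloud by a simple greedy construction; the sketch below is our own rough illustration on points sampled from the unit square, not an exact computation of \mathcal{N}(\epsilon,A):

import torch

torch.manual_seed(0)
points = torch.rand(5000, 2)       # a point cloud standing in for the set A
eps = 0.1

centers = []
uncovered = torch.ones(len(points), dtype=torch.bool)
while uncovered.any():
    idx = torch.nonzero(uncovered)[0, 0]                    # pick any uncovered point as a center
    centers.append(points[idx])
    uncovered &= (points - points[idx]).norm(dim=1) >= eps  # mark everything it covers
print("greedy cover size (an upper bound on the covering number of the cloud):", len(centers))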

The following conclusion is demonstrated in the proof of Theorem.1 of Kulkarni and Posner [1995]. We use it to prove our theorem.

Theorem E.2 (Vural and Guillemot [2017], Kulkarni and Posner [1995]).

There are N samples \{x_{1},\dots,x_{N}\} drawn i.i.d. from the probability measure \mathcal{P}. Suppose the bounded support of \mathcal{P} is \mathcal{M}. Then, if N is larger than the covering number \mathcal{N}(\epsilon,\mathcal{M}), we have

x(xx^>ϵ)𝒩(ϵ,)2N,ϵ>0\displaystyle\mathbb{P}_{x}\Big{(}\|x-\hat{x}\|>\epsilon\Big{)}\leq\frac{\mathcal{N}(\epsilon,\mathcal{M})}{2N},\forall\epsilon>0

where x^\hat{x} is the sample that is closest to xx in {x1,,xN}\{x_{1},\dots,x_{N}\}:

x^argminx{x1,,xN}xx\displaystyle\hat{x}\in\underset{x^{\prime}\in\{x_{1},\dots,x_{N}\}}{\arg\min}\|x^{\prime}-x\|

Then we provide the proof of Theorem.4.2.

See 4.2

Proof.

We decompose the accuracy:

\mathbb{P}_{\boldsymbol{x},y}\Big{(}\underset{c}{\arg\max}[Mf(\boldsymbol{x};\boldsymbol{w})]_{c}=y\Big{)}=\sum_{i=1}^{C}p(i)\mathbb{P}_{\boldsymbol{x}|y=i}\Big{(}\underset{c}{\arg\max}[Mf(\boldsymbol{x};\boldsymbol{w})]_{c}=y\Big{)} \qquad (17)

where p(i) is the prior probability of class i. Then, we focus on the accuracy within each class i:

\mathbb{P}_{\boldsymbol{x}|y=i}\Big{(}\underset{c}{\arg\max}[Mf(\boldsymbol{x};\boldsymbol{w})]_{c}=y\Big{)}=\mathbb{P}_{x|y=i}\Big{(}\{(M_{i}-M_{j})^{T}f(\boldsymbol{x};\boldsymbol{w})>0\ \ \text{for all}\ \ j\neq i\}\Big{)}

We select the sample in the class-i training set S_{i} that is closest to \boldsymbol{x}, and denote it as

𝒙^(Si)=argminx1Six1x\hat{\boldsymbol{x}}(S_{i})=\underset{x_{1}\in S_{i}}{\arg\min}\|x_{1}-x\|

According to the linear separability,

(MiMj)Tf(𝒙^(Si);𝒘)γi,j,ji\displaystyle(M_{i}-M_{j})^{T}f(\hat{\boldsymbol{x}}(S_{i});\boldsymbol{w})\geq\gamma_{i,j},\forall j\neq i

For any jij\neq i, we have

(MiMj)Tf(𝒙;𝒘)\displaystyle(M_{i}-M_{j})^{T}f(\boldsymbol{x};\boldsymbol{w}) =(MiMj)T(f(𝒙;𝒘)+f(𝒙^(Si);w)f(𝒙^(Si);w))\displaystyle=(M_{i}-M_{j})^{T}(f(\boldsymbol{x};\boldsymbol{w})+f(\hat{\boldsymbol{x}}(S_{i});w)-f(\hat{\boldsymbol{x}}(S_{i});w)) (18)
=(MiMj)Tf(𝒙^(Si);w)+(MiMj)T(f(𝒙;𝒘)f(𝒙^(Si);w))\displaystyle=(M_{i}-M_{j})^{T}f(\hat{\boldsymbol{x}}(S_{i});w)+(M_{i}-M_{j})^{T}(f(\boldsymbol{x};\boldsymbol{w})-f(\hat{\boldsymbol{x}}(S_{i});w))
γi,jMiMjf(𝒙;𝒘)f(𝒙^(Si);w)\displaystyle\geq\gamma_{i,j}-\|M_{i}-M_{j}\|\|f(\boldsymbol{x};\boldsymbol{w})-f(\hat{\boldsymbol{x}}(S_{i});w)\|
γi,jLMiMj𝒙𝒙^(Si)\displaystyle\geq\gamma_{i,j}-L\|M_{i}-M_{j}\|\|\boldsymbol{x}-\hat{\boldsymbol{x}}(S_{i})\|

The prediction result is related to the distance between 𝒙\boldsymbol{x} and 𝒙^(Si)\hat{\boldsymbol{x}}(S_{i}). According to Theorem.E.2, we know

𝒙|y=i(𝒙𝒙^(Si)>ϵ)𝒩(ϵ,i)2Ni\displaystyle\mathbb{P}_{\boldsymbol{x}|y=i}\Big{(}\|\boldsymbol{x}-\hat{\boldsymbol{x}}(S_{i})\|>\epsilon\Big{)}\leq\frac{\mathcal{N}(\epsilon,\mathcal{M}_{i})}{2N_{i}}

To obtain the correct prediction, i.e., to ensure that (18)>0 for all j\neq i, we choose \epsilon<\min_{j\neq i}\frac{\gamma_{i,j}}{L\|M_{i}-M_{j}\|}. Therefore, we have

𝒙|y=i({(MiMj)Tf(𝒙;𝒘)>0,ji})\displaystyle\mathbb{P}_{\boldsymbol{x}|y=i}\Bigg{(}\left\{(M_{i}-M_{j})^{T}f(\boldsymbol{x};\boldsymbol{w})>0,\forall j\neq i\right\}\Bigg{)} 𝒙|y=i(𝒙𝒙^(Si)<minjiγi,jLMiMj)\displaystyle\geq\mathbb{P}_{\boldsymbol{x}|y=i}\Bigg{(}\|\boldsymbol{x}-\hat{\boldsymbol{x}}(S_{i})\|<\min_{j\neq i}\frac{\gamma_{i,j}}{L\|M_{i}-M_{j}\|}\Bigg{)} (19)
>1𝒩(minjiγi,jLMiMj,i)2Ni\displaystyle>1-\frac{\mathcal{N}(\min_{j\neq i}\frac{\gamma_{i,j}}{L\|M_{i}-M_{j}\|},\mathcal{M}_{i})}{2N_{i}}

Plug (19) into (17) to derive

\mathbb{P}_{\boldsymbol{x},y}\Big{(}\underset{c}{\arg\max}[Mf(\boldsymbol{x};\boldsymbol{w})]_{c}=y\Big{)}>1-\sum_{i=1}^{C}p(i)\frac{\mathcal{N}(\min_{j\neq i}\frac{\gamma_{i,j}}{L\|M_{i}-M_{j}\|},\mathcal{M}_{i})}{2N_{i}}=1-\sum_{i=1}^{C}p(i)\frac{\max_{j\neq i}\mathcal{N}(\frac{\gamma_{i,j}}{L\|M_{i}-M_{j}\|},\mathcal{M}_{i})}{2N_{i}}

Then, using the conclusions of NeurCol, if the maximal norm is denoted as \rho, we have

γi,jLMiMj=ρ2MiTMjL2ρ22MiTMj=1Lρ2MiTMj2\displaystyle\frac{\gamma_{i,j}}{L\|M_{i}-M_{j}\|}=\frac{\rho^{2}-M_{i}^{T}M_{j}}{L\sqrt{2\rho^{2}-2M_{i}^{T}M_{j}}}=\frac{1}{L}\sqrt{\frac{\rho^{2}-M_{i}^{T}M_{j}}{2}}
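The last identity only uses \|M_{i}\|=\|M_{j}\|=\rho, so that \|M_{i}-M_{j}\|=\sqrt{2\rho^{2}-2M_{i}^{T}M_{j}}; a quick numerical check (our own illustration, with random directions scaled to norm \rho):

import torch

torch.manual_seed(0)
d, rho, L = 8, 3.0, 2.0
Mi = torch.randn(d)
Mi = rho * Mi / Mi.norm()
Mj = torch.randn(d)
Mj = rho * Mj / Mj.norm()

gamma_ij = rho ** 2 - Mi @ Mj                # margin under NeurCol: <M_i - M_j, M_i> = rho^2 - M_i^T M_j
lhs = gamma_ij / (L * (Mi - Mj).norm())
rhs = (1 / L) * ((rho ** 2 - Mi @ Mj) / 2) ** 0.5
print(lhs.item(), rhs.item())                # the two values coincide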

Appendix F Some Details of Experiments

F.1 Training Detail

In the experiments of Section.4.2, we follow the practice of Papyan et al. [2020]. For all experiments, we minimize the cross entropy loss using stochastic gradient descent for 200 epochs, with momentum 0.9, batch size 256 and weight decay 5\times 10^{-4}. The learning rate is set to 5\times 10^{-2} and annealed ten-fold at epochs 120 and 160 for every dataset. As for data preprocessing, we only perform standard pixel-wise mean subtraction and standard-deviation division on images. To achieve 100\% accuracy on the training set, we only use random flip augmentation.
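For reference, a minimal sketch of this training recipe in PyTorch (our own reconstruction; model, train_set and the normalization statistics are placeholders, not part of the original description):

import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import transforms

# Random flip plus standard per-channel normalization; the mean/std values are dataset-dependent placeholders.
transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])

def train(model, train_set, epochs=200):
    loader = DataLoader(train_set, batch_size=256, shuffle=True)
    optimizer = torch.optim.SGD(model.parameters(), lr=5e-2,
                                momentum=0.9, weight_decay=5e-4)
    # Ten-fold learning rate annealing at epochs 120 and 160.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[120, 160], gamma=0.1)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            criterion(model(images), labels).backward()
            optimizer.step()
        scheduler.step()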

F.2 Generation of Equivalent Grassmannian Frame

We use Code.1 to generate equivalent Grassmannian Frames with different directions and orders. Note that the generation of the rotation matrix uses the method of Lezcano-Casado [2019], Lezcano-Casado and Martínez-Rubio [2019].

import torch

SEED = 1000
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

def generate_permutation_matrix(dimension):
    # Shuffle the rows of the identity matrix to obtain a permutation matrix.
    return torch.eye(dimension)[torch.randperm(dimension), :]

def generate_rotation_matrix(dimension):
    # Matrix exponential of a skew-symmetric matrix gives an orthogonal (rotation) matrix.
    A = torch.randn(dimension, dimension)
    return torch.linalg.matrix_exp(A - A.T)

def generate_grass_frame(class_num, feature_num):
    # Features and classifiers are plain linear layers; only their weight matrices are used.
    feature = torch.nn.Linear(class_num, feature_num)
    classifier = torch.nn.Linear(feature_num, class_num)
    labels = torch.arange(class_num)  # torch.range is deprecated, use arange
    optimizer = torch.optim.SGD([
        {"params": feature.parameters(), "lr": 0.1},
        {"params": classifier.parameters(), "lr": 0.1}
    ])
    for i in range(1000):
        # Logits of every class feature under the current linear classifier.
        pred = torch.mm(classifier.weight, feature.weight)
        loss_ce = torch.nn.functional.cross_entropy(input=pred.T, target=labels)
        # The l2 penalty keeps the norms of features and classifiers bounded.
        loss_l2 = 1e-1 * (torch.norm(classifier.weight) + torch.norm(feature.weight))
        loss = loss_l2 + loss_ce
        feature.zero_grad()
        classifier.zero_grad()
        loss.backward()
        optimizer.step()
        print("index: {} loss: {}".format(i, loss.item()))

    print(feature.weight.shape)  # [feature_num, class_num]
    return feature.weight.detach()

if __name__ == "__main__":
    class_num = 4
    feature_num = 2
    grass_frame = generate_grass_frame(class_num, feature_num)
    permutation = generate_permutation_matrix(class_num)
    rotation = generate_rotation_matrix(feature_num)
    # An equivalent frame: rotated in feature space and with permuted class order.
    grass_frame = rotation @ grass_frame @ permutation
    torch.save(grass_frame, "save_path")
Code 1: Generation of Equivalent Grassmannian Frame
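As a follow-up check (our own addition, with a random stand-in frame), one can verify what makes the transformed frame equivalent: a rotation leaves the Gram matrix of pairwise inner products unchanged, while a permutation only reorders its rows and columns.

import torch

torch.manual_seed(0)
feature_num, class_num = 2, 4
frame = torch.randn(feature_num, class_num)                       # stand-in for a learned frame
A = torch.randn(feature_num, feature_num)
rotation = torch.linalg.matrix_exp(A - A.T)                       # orthogonal rotation matrix
permutation = torch.eye(class_num)[torch.randperm(class_num), :]  # permutation matrix

def gram(F):
    return F.T @ F                                                # pairwise inner products

# Rotation alone preserves the Gram matrix exactly ...
print(torch.allclose(gram(rotation @ frame), gram(frame), atol=1e-5))
# ... and a permutation P only conjugates it: Gram(R F P) = P^T Gram(F) P.
print(torch.allclose(gram(rotation @ frame @ permutation),
                     permutation.T @ gram(frame) @ permutation, atol=1e-5))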

Appendix G Numerical Simulation and Visualization of Generalized Neural Collapse

Figure.2 shows the results of a numerical simulation experiment conducted to illustrate the convergence of Generalized Neural Collapse. A GIF version of Figure.2 can be found HERE. During the simulation, we discovered that the condition ρ\rho\rightarrow\infty, which is believed to be necessary for the occurrence of Grassmannian Frame (in the NC3 of Theorem.3.2), may not be required. This suggests that there may be other ways to prove Generalized Neural Collapse with fewer assumptions.

Figure 2: Visualization of Generalized Neural Collapse. There are 4 classes, and the feature space is 2-dimensional. Every class has 20 samples. In the figures, the points with different colors represent the features of samples from different classes, and the lines indicate the linear classifier.