
Engineering the Neural Collapse Geometry of Supervised-Contrastive Loss

Jaidev Gill   Vala Vakilian   Christos Thrampoulidis*
*JG and CT gratefully acknowledge the support of an NSERC USRA. The authors also acknowledge use of the Sockeye cluster by UBC Advanced Research Computing. This work is supported by an NSERC Discovery Grant, NSF Grant CCF-2009030, and by a CRG8-KAUST award.
University of British Columbia
Abstract

Supervised-contrastive loss (SCL) is an alternative to cross-entropy (CE) for classification tasks that makes use of similarities in the embedding space to allow for richer representations. In this work, we propose methods to engineer the geometry of these learnt feature embeddings by modifying the contrastive loss. In pursuit of adjusting the geometry, we explore the impact of prototypes: fixed embeddings included during training to alter the final feature geometry. Specifically, through empirical findings, we demonstrate that including prototypes in every batch induces the geometry of the learnt embeddings to align with that of the prototypes. We gain further insight by considering a limiting scenario where the prototypes far outnumber the original samples in the batch. Through this, we establish a connection to CE loss with a fixed classifier and normalized embeddings. We validate our findings by conducting a series of experiments with deep neural networks on benchmark vision datasets.

Introduction

Understanding the structure of the learned features of deep neural networks has gained significant attention through a recent line of research surrounding a phenomenon known as Neural Collapse (NC), formalized by [1]. The authors of [1] found that, when training a deep net on balanced datasets with cross-entropy (CE) loss beyond zero training error, the feature embeddings collapse to their corresponding class means and align with the learned classifier, overall forming a simplex equiangular tight frame (ETF) geometry. In other words, at the terminal phase of training, the class-mean embeddings form an implicit geometry described by vectors of equal norms and angles that are maximally separated. Following [2], we call this geometry “implicit,” since it is not enforced by an explicit regularization, but rather induced by common optimizers, such as SGD. A number of follow-up studies have provided further analysis of the NC phenomenon [3, 4, 5] and extended the implicit-geometry characterization of CE to imbalanced data [6, 2]. At the same time, [7, 8, 9] have

Figure 1: Comparison of Gram matrices $\mathbf{G}_{\mathbf{M}}$ at the last epoch (350) of ResNet-18 trained on STEP-imbalanced CIFAR-10 with (Top) vanilla SCL ($n_{w}=0$), (Middle) class averaging (BCL) [8] satisfying class representation requirements through batch binding [9], and (Bottom) SCL with $n_{w}=100$ prototypes.

shown, both empirically and theoretically, that such characterizations can be extended to other loss functions, specifically the supervised-contrastive loss (SCL).

Drawing inspiration from unsupervised contrastive learning [10], SCL was proposed by [11] as a substitute for CE in classification. Specifically, SCL makes use of semantic information by directly contrasting learned features. [7] was the first to theoretically analyze the implicit geometry of SCL, demonstrating that it forms an ETF when the data is balanced. However, when the label distribution is imbalanced, the geometry changes. To combat this, [8] proposed a training framework, which they called balanced contrastive learning (BCL), improving generalization test accuracy under imbalances. Their framework uses a class-averaging modification of SCL alongside a set of $k$ trainable prototypes, representing class centers, trained using a logit-adjusted cross-entropy [12, 13].

According to the authors, the BCL framework drives the implicit geometry to an ETF. In another related work, drawing inspiration from unsupervised frameworks such as MoCo [14], [15] introduced PaCo, a supervised contrastive method that also takes advantage of such trainable class centers to improve test accuracy under imbalances.

These works collectively suggest that prototypes can play a crucial role in determining the implicit geometry when training with SCL. However, in their respective frameworks, prototypes are treated as trainable parameters, optimized alongside various other heuristics and modifications. Thus, it is challenging to ascertain their specific impact on the training process. This raises the question: what is the direct impact of prototypes on the SCL geometry when isolated from other modifications?

In order to answer this question, this paper investigates the implicit geometry of SCL with fixed prototypes, departing from the conventional approach of using trainable prototypes. We introduce a new method to incorporate fixed prototypes into SCL training by augmenting each batch with copies of all class prototypes. Our experimental results demonstrate that choosing prototypes that form an ETF leads to a remarkably accurate convergence of the embeddings’ implicit geometry to an ETF, regardless of imbalances. Furthermore, this convergence is achieved with a moderate number of prototype copies per batch. Importantly, we argue that the computational overhead incurred by our SCL approach with fixed prototypes remains independent of the number of copies of the prototypes, motivating an investigation into its behavior as the number of copies increases. In this limit, we derive a simplified form of SCL and subsequently prove that the implicit geometry indeed becomes an ETF when ETF prototypes are selected. Intriguingly, this simplified SCL form resembles the CE loss with fixed classifiers, albeit utilizing normalized embeddings. Finally, recognizing the flexibility of choosing prototypes that form an arbitrary target geometry, we pose the question: is it possible to tune the learned features to form an arbitrary and possibly asymmetric geometry? Through experiments on deep nets and standard datasets, we demonstrate that by selecting prototypes accordingly we can achieve implicit geometries that deviate from symmetric geometries, such as ETFs.

Tuning Geometry with Prototypes

Setup. We consider a $k$-class classification task with training dataset $\mathcal{D}=\{(\mathbf{x}_{i},y_{i}):i\in[N]\}$, where $\mathbf{x}_{i}\in\mathbb{R}^{p}$ are the $N$ training points with labels $y_{i}\in[k]$ (we denote $[N]:=\{1,2,\dots,N\}$). The SCL loss is optimized over batches $B\subset[N]$ belonging to a batch-set $\mathcal{B}$. Concretely, $\mathcal{L}:=\sum_{B\in\mathcal{B}}\mathcal{L}_{B}$, where the loss for each batch $B$ is given below as

\mathcal{L}_{B}:=\sum_{i\in B}\frac{1}{n_{B,y_{i}}-1}\sum_{\begin{subarray}{c}j\in B\\ j\neq i\\ y_{j}=y_{i}\end{subarray}}\log\Big(\sum_{\begin{subarray}{c}\ell\in B\\ \ell\neq i\end{subarray}}\exp\big(\mathbf{h}_{i}^{\top}\mathbf{h}_{\ell}-\mathbf{h}_{i}^{\top}\mathbf{h}_{j}\big)\Big)\,.\qquad(1)

Here, $\mathbf{h}_{i}:=\mathbf{h}_{\boldsymbol{\theta}}(\mathbf{x}_{i})\in\mathbb{R}^{d}$ is the last-layer learned feature embedding of the training point $\mathbf{x}_{i}$ for a network parameterized by $\boldsymbol{\theta}$. Also, $n_{B,y_{i}}$ is the number of samples in batch $B$ that share the label of sample $i$. Lastly, we let $|B|=n$ be the batch size. As per standard practice [10, 11], we assume a normalization layer as part of the last layer, hence $\|\mathbf{h}_{i}\|=1$ for all $i\in[N]$. It is also common to scale the inner products by a temperature parameter $\tau$ [11]; since this can be absorbed in the normalization, we drop it above for simplicity.
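For concreteness, the loss in Eq. (1) can be evaluated as in the following minimal PyTorch sketch (the function and variable names are ours, not from an official implementation); it assumes the rows of the embedding matrix are already unit-normalized and, as above, omits the temperature.

```python
import torch

def scl_batch_loss(h: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Batch-wise SCL of Eq. (1). h: (n, d) unit-norm embeddings, y: (n,) labels."""
    n = h.shape[0]
    sims = h @ h.T                                          # h_i^T h_l for all pairs
    self_mask = torch.eye(n, dtype=torch.bool, device=h.device)
    # log sum_{l != i} exp(h_i^T h_l), one value per anchor i
    log_denom = torch.logsumexp(sims.masked_fill(self_mask, float("-inf")), dim=1)
    pos_mask = (y[:, None] == y[None, :]) & ~self_mask      # positives: j != i with y_j = y_i
    n_pos = pos_mask.sum(dim=1).clamp(min=1)                # n_{B,y_i} - 1 (anchors without
                                                            # positives contribute zero)
    # log sum_l exp(h_i^T h_l - h_i^T h_j) = log_denom_i - h_i^T h_j, averaged over positives j
    per_anchor = ((log_denom[:, None] - sims) * pos_mask).sum(dim=1) / n_pos
    return per_anchor.sum()
```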

Methodology. Inspired by the class-complement method of [8], the learnable class centers of [15], and the batch-binding algorithm of [9], we propose using fixed prototypes. These prototypes collectively form a desired reference geometry for the embeddings to learn.

Definition 1 (Prototype).

A prototype $\mathbf{w}_{c}\in\mathbb{R}^{d}$ for class $c\in[k]$ is a fixed vector that represents the desired representation of the embeddings $\{\mathbf{h}_{i}\}_{y_{i}=c}$ in class $c$.

Our method optimizes SCL over an augmented batch $\{\mathbf{h}_{i}\}_{i\in B}\cup\mathcal{W}$, where $\mathcal{W}:=\bigcup_{i=1}^{n_{w}}\{\mathbf{w}_{1},\mathbf{w}_{2},\dots,\mathbf{w}_{k}\}$ and $n_{w}$ is the number of added prototypes per class. We highlight two key aspects of this strategy. (i) First, increasing $n_{w}$ does not increase the computational complexity of the loss computation: the number of required inner products between embeddings grows from $n^{2}/2$ in vanilla SCL (Eq. (1)) to $n^{2}/2+nk$ when prototypes are introduced. This increase is due solely to the presence of $k$ distinct prototypes and remains constant regardless of the value of $n_{w}$. As we will see, this aspect becomes critical since increasing the number of prototypes helps the learned embeddings converge faster to the chosen prototype geometry (see Defn. 2) with minimal added computational overhead (at least when $k=O(n)$). (ii) Second, we guarantee that the prototypes are fixed and form a suitable, engineered geometry, defined formally in Definition 2 below. In particular, this is in contrast to [15], where prototypes are learned, and to [8], which conjectures that the trained prototypes form an ETF.
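In its simplest (naive) form, the augmented batch can be built by literally appending $n_{w}$ copies of each prototype and reusing the loss above, as in the sketch below (hypothetical names, reusing `scl_batch_loss` from the previous sketch). The efficient variant described in point (i) would instead keep only the $k$ distinct prototypes and weight their contributions, so that the cost does not grow with $n_{w}$.

```python
import torch

def scl_with_prototypes(h: torch.Tensor, y: torch.Tensor, W: torch.Tensor, n_w: int) -> torch.Tensor:
    """Naive prototype-augmented SCL: append n_w copies of each fixed prototype
    (columns of W, shape (d, k)) to the batch and evaluate Eq. (1) on the result."""
    k = W.shape[1]
    protos = W.T.repeat(n_w, 1)                              # (n_w * k, d) prototype copies
    proto_labels = torch.arange(k, device=h.device).repeat(n_w)
    h_aug = torch.cat([h, protos], dim=0)
    y_aug = torch.cat([y, proto_labels], dim=0)
    return scl_batch_loss(h_aug, y_aug)                      # from the previous sketch
```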

Definition 2 (Prototype Geometry).

Given a set of prototypes $\{\mathbf{w}_{c}\}_{c\in[k]}$, the prototype geometry is characterized by the symmetric matrix $\mathbf{G}_{*}=\mathbf{W}^{\top}\mathbf{W}$, where $\mathbf{W}=[\mathbf{w}_{1}\,\cdots\,\mathbf{w}_{k}]$.
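For reference, one standard construction of fixed prototypes whose Gram matrix $\mathbf{G}_{*}$ is a simplex ETF is sketched below; it assumes $d\geq k$ and is not specific to this paper.

```python
import torch

def simplex_etf_prototypes(k: int, d: int) -> torch.Tensor:
    """Prototypes forming a simplex ETF: unit norms, pairwise inner products -1/(k-1).
    Returns W of shape (d, k); assumes d >= k."""
    assert d >= k
    M = torch.eye(k) - torch.ones(k, k) / k          # centred identity: columns span the simplex
    M = M * (k / (k - 1)) ** 0.5                     # rescale columns to unit norm
    P, _ = torch.linalg.qr(torch.randn(d, k))        # random orthonormal d x k frame
    return P @ M                                     # W^T W equals the ETF Gram matrix
```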

Figure 2: Convergence metric $\Delta_{\mathbf{G}_{\mathrm{ETF}}}$ tracked throughout training of ResNet-18 on STEP-imbalanced ($R=10,100$) CIFAR-10 while varying the number $n_{w}$ of prototypes per class. As $n_{w}$ increases, the feature geometry exhibits stronger convergence to an ETF.

Experiments. To display the impact of prototypes on feature geometry, we train a ResNet-18 [16] backbone with a two-layer MLP projector head [11] using prototypes. We train the model with batch doubling [9], resulting in a batch size of $n=2048$, a constant learning rate of $0.1$, and temperature parameter $\tau=0.1$, as in [9]. We modify CIFAR-10 ($k=10$) such that the first $5$ classes contain 5000 examples and the last $5$ classes contain $5000/R$ samples, with imbalance ratios $R=10,100$. In Fig. 1, we compare the final-epoch geometry of models trained with vanilla SCL; BCL [8] without prototype training and logit-adjusted CE [13]; and SCL with 100 prototypes per class. The figure suggests that the embeddings trained with SCL and prototypes form an ETF geometry irrespective of the imbalance ratio. On the other hand, the SCL and BCL geometries are noticeably less symmetric, with angles between minority centers decreasing at higher imbalance ratios. This highlights the impact of prototypes on achieving an ETF geometry, and further emphasizes their importance within frameworks such as BCL.
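For illustration, the STEP imbalance described above could be generated along the following lines (a sketch assuming torchvision's CIFAR-10 with its `.targets` attribute; the exact subsampling used in the paper is not specified).

```python
import numpy as np
from torchvision.datasets import CIFAR10

def step_imbalanced_indices(dataset: CIFAR10, R: int, n_majority: int = 5000) -> list:
    """Keep all n_majority samples of classes 0-4 and n_majority / R samples of classes 5-9."""
    targets = np.array(dataset.targets)
    keep = []
    for c in range(10):
        idx = np.where(targets == c)[0]
        n_keep = n_majority if c < 5 else n_majority // R
        keep.extend(idx[:n_keep].tolist())
    return keep

# usage: torch.utils.data.Subset(train_set, step_imbalanced_indices(train_set, R=100))
```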

To study the impact of the number of prototypes, we define the concept of geometric convergence (see Defn. 3 below) and compare the convergence to ETF ($\Delta_{\mathbf{G}_{\mathrm{ETF}}}$) when training with SCL using different numbers of prototypes, $n_{w}=0,10,50,100$. As illustrated in Fig. 2, without prototypes ($n_{w}=0$) SCL does not converge to an ETF. However, simply adding 100 total prototype examples to the batch significantly improves convergence to the ETF geometry ($n_{w}=10$). Moreover, once the prototypes make up ${\sim}20\%$ of the batch size, convergence is nearly perfect (see $n_{w}=50$). This observation motivates the study of SCL when the prototypes outnumber the training datapoints within the batch.

Definition 3 (Geometry Convergence).

We say that the geometry of the learned embeddings has successfully converged if $\mathbf{G}_{\mathbf{M}}\rightarrow\mathbf{G}_{*}$, where $\mathbf{G}_{*}$ is as given in Defn. 2. Here, $\mathbf{G}_{\mathbf{M}}=\mathbf{M}^{\top}\mathbf{M}$ and $\mathbf{M}=[\boldsymbol{\mu}_{1}\,\cdots\,\boldsymbol{\mu}_{k}]$, where $\boldsymbol{\mu}_{c}=\text{Ave}_{y_{i}=c}\,\mathbf{h}_{i}$. As a measure of convergence, we track $\Delta_{\mathbf{G}_{*}}=\big\|\mathbf{G}_{\mathbf{M}}/\|\mathbf{G}_{\mathbf{M}}\|_{F}-\mathbf{G}_{*}/\|\mathbf{G}_{*}\|_{F}\big\|_{F}$.
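The metric $\Delta_{\mathbf{G}_{*}}$ of Defn. 3 can be computed directly from the class-mean embeddings, as in the small sketch below (our own naming; it assumes every class appears at least once in the evaluated set).

```python
import torch

def geometry_convergence(h: torch.Tensor, y: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    """Delta_{G*} of Defn. 3. h: (N, d) embeddings, y: (N,) labels, W: (d, k) prototypes."""
    k = W.shape[1]
    mu = torch.stack([h[y == c].mean(dim=0) for c in range(k)], dim=1)   # M = [mu_1 ... mu_k]
    G_M = mu.T @ mu
    G_star = W.T @ W
    return torch.linalg.norm(G_M / torch.linalg.norm(G_M) - G_star / torch.linalg.norm(G_star))
```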

Connection to Cross-Entropy

Having seen the impact of increasing the number of prototypes ($n_{w}$) on the learnt geometry, it is natural to ask how the prototypes affect the loss as they begin to outnumber the original samples in the batch. Furthermore, this sheds light on prototype-based methods that improve test accuracy, such as BCL [8] and PaCo [15], as both losses include multiplicative hyperparameters that tune the impact of prototypes.

Proposition 1.

Let $\hat{n}:=k\cdot n_{w}$ be the total number of prototypes added to the batch, and $n$ the original batch size. Then, in the limit $\hat{n}\gg n$, the batch-wise SCL loss becomes

\mathcal{L}_{B}\rightarrow-\sum_{i\in B}\left[\log\left(\frac{\exp(\mathbf{w}_{y_{i}}^{\top}\mathbf{h}_{i})}{\sum_{c\in[k]}\exp(\mathbf{w}_{c}^{\top}\mathbf{h}_{i})}\right)+\mathbf{w}_{y_{i}}^{\top}\mathbf{h}_{i}\right]

As shown in Prop. 1, in the presence of a large number of prototypes, optimizing SCL is akin to optimizing cross-entropy with a fixed classifier.
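A minimal sketch of this limiting objective is given below (hypothetical naming): the first term is cross-entropy against the fixed, normalized classifier $\mathbf{W}$ and the second is the alignment term $\mathbf{w}_{y_{i}}^{\top}\mathbf{h}_{i}$.

```python
import torch
import torch.nn.functional as F

def limiting_scl_loss(h: torch.Tensor, y: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    """Limiting form of SCL with many fixed prototypes (Prop. 1). h: (n, d), y: (n,), W: (d, k)."""
    logits = h @ W                                       # w_c^T h_i for every class c
    align = logits.gather(1, y[:, None]).squeeze(1)      # w_{y_i}^T h_i
    ce = F.cross_entropy(logits, y, reduction="sum")     # -sum_i log softmax_{y_i}(logits_i)
    return ce - align.sum()
```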

Remark 1.

This setting is remarkably similar to [17], which trains the CE loss with fixed classifiers forming an ETF geometry. However, two key differences emerge: (i) the features and prototypes are normalized, i.e., $\|\mathbf{h}_{i}\|=1$ for all $i\in B$ and $\|\mathbf{w}_{c}\|=1$ for all $c\in[k]$, and (ii) here there is an additional alignment-promoting regularization induced by the inner product between $\mathbf{h}_{i}$ and $\mathbf{w}_{y_{i}}$. As we will see below, we also explore choices of prototypes that deviate from the ETF geometry.

As described in Rem. 1, optimizing CE with a fixed classifier has been studied previously [17, 18]; however, the embeddings are typically not normalized, and geometries other than the ETF have yet to be considered. In particular, we have empirically found that normalizing the embeddings leads to faster geometry convergence, consistent with the results of [18]. Lastly, we arrive at this setting from an entirely different viewpoint, namely that of understanding SCL with prototypes.

Below, we use the simplified loss in Proposition 1 to analytically study the geometry of the embeddings in the specific setting (that of the experiments of Section 2) where the prototypes form an ETF. To facilitate the analysis, we adopt the unconstrained-features model (UFM) [5], where the embeddings $\mathbf{h}_{i}$ are treated as free variables.

Proposition 2.

If $\{\mathbf{w}_{c}\}_{c\in[k]}$ form an equiangular tight frame, i.e., $\mathbf{w}_{c}^{\top}\mathbf{w}_{c^{\prime}}=\frac{-1}{k-1}$ for $c\neq c^{\prime}$, then the optimal embeddings align with their corresponding prototypes, $\mathbf{h}_{i}^{*}=\mathbf{w}_{y_{i}}$.

Prop. 1 and Prop. 2 (the proofs of which are deferred to the appendix) emphasize the impact of prototypes on SCL optimization, showing that in the limit $n_{w}\gg n$, the ETF is the optimal geometry. However, as noted in [19, 20, 12], allowing for better separability of minority classes can potentially improve generalization. Thus, we now consider convergence to non-symmetric geometries, which could favor minority separability. In Fig. 3, we use the limiting form of SCL given in Prop. 1 to illustrate the final learnt geometry $\mathbf{G}_{\mathbf{M}}$ of features trained using three possible prototype geometries: 1) ETF, which assigns equal angles between prototypes; 2) Improved Minority Angles, which assigns larger angles between prototypes belonging to the minority classes; and 3) Majority Collapse, an extreme case which assigns the same prototype to all majority classes, forcing the majority-class features to collapse to the same vector. Models are trained with a setup similar to that of Figs. 1 and 2, albeit with learning-rate annealing of 0.1 at epochs 200 and 300, as we observed that this expedites convergence. It is clear in Fig. 3 that the learnt features can be significantly altered based on the choice of prototypes, allowing for geometries with clearer separability of minority classes. In summary, these experiments demonstrate the flexibility of SCL with prototypes and create an opportunity to explore a wide variety of prototype geometries. This exploration could lead to identifying geometries that result in improved test performance when training under label imbalances. We leave this to future work.
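As one hypothetical way to instantiate the extreme Majority Collapse target, the sketch below reuses `simplex_etf_prototypes` from the earlier sketch and assigns a single shared prototype to all majority classes; the precise geometries used for Fig. 3 are not specified beyond the description above.

```python
import torch

def majority_collapse_prototypes(d: int, n_maj: int = 5, n_min: int = 5) -> torch.Tensor:
    """Hypothetical target geometry: all majority classes share one prototype, and the shared
    direction plus the minority prototypes form a simplex ETF over n_min + 1 directions."""
    V = simplex_etf_prototypes(n_min + 1, d)       # (d, n_min + 1), from the earlier sketch
    maj = V[:, :1].repeat(1, n_maj)                # same column for every majority class
    return torch.cat([maj, V[:, 1:]], dim=1)       # W of shape (d, n_maj + n_min)
```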

Figure 3: Comparison of the Gram matrices $\mathbf{G}_{\mathbf{M}}$ of learned embeddings with different prototypes, trained with the limiting form of SCL given in Prop. 1. (Top) ETF prototypes, (Middle) Large Minority Angles, (Bottom) Majority Collapse.

Concluding Remarks

In this work, we have isolated and explored the effects of prototypes on the supervised-contrastive loss. In doing so, we have identified a reliable method for tuning the learnt embedding geometry. In addition, we established a theoretical link to cross-entropy. Overall, our findings indicate that employing fixed prototypes offers a promising avenue for streamlining framework modifications that typically treat prototypes as trainable parameters without a clear understanding of their direct contribution. Moreover, this opens up an exciting avenue for future research to explore how choosing prototype geometries that favor larger angles for minority classes can positively impact generalization performance.

Appendix A Proof of Propositions

A.1 Proof of Proposition 1

Let $|B|=n$, and note that $\|\mathbf{w}_{c}\|=1$ for all $c\in[k]$. Then $\mathcal{L}_{B}$ with prototypes takes the form

\mathcal{L}_{B}=\sum_{i\in B}\frac{n_{w}}{n_{B,y_{i}}+n_{w}-1}\mathcal{L}_{s}+\sum_{\hat{c}\in[k]}\frac{n_{w}}{n_{B,\hat{c}}+n_{w}-1}\mathcal{L}_{p}

Here, $\mathcal{L}_{s}$ is the loss accrued while iterating over each sample in the batch and $\mathcal{L}_{p}$ is the loss accrued while iterating over each prototype; they are given as follows:

\mathcal{L}_{s}:=\sum_{\begin{subarray}{c}j\neq i\\ y_{j}=y_{i}\end{subarray}}\frac{-1}{n_{w}}\log\left(\frac{\exp(\mathbf{h}_{i}^{\top}\mathbf{h}_{j})}{\sum_{\ell\neq i}\exp(\mathbf{h}_{\ell}^{\top}\mathbf{h}_{i})+n_{w}\sum_{c\in[k]}\exp(\mathbf{w}_{c}^{\top}\mathbf{h}_{i})}\right)-\log\left(\frac{\exp(\mathbf{h}_{i}^{\top}\mathbf{w}_{y_{i}})}{\sum_{\ell\neq i}\exp(\mathbf{h}_{\ell}^{\top}\mathbf{h}_{i})+n_{w}\sum_{c\in[k]}\exp(\mathbf{w}_{c}^{\top}\mathbf{h}_{i})}\right)\,,
\mathcal{L}_{p}:=\sum_{\begin{subarray}{c}j\in[n]\\ y_{j}=\hat{c}\end{subarray}}-\log\left(\frac{\exp(\mathbf{w}_{\hat{c}}^{\top}\mathbf{h}_{j})}{\sum_{\ell\in[n]}\exp(\mathbf{h}_{\ell}^{\top}\mathbf{w}_{\hat{c}})+n_{w}\Lambda+(n_{w}-1)e}\right)-(n_{w}-1)\log\left(\frac{e}{\sum_{\ell\neq i}\exp(\mathbf{h}_{\ell}^{\top}\mathbf{w}_{\hat{c}})+n_{w}\Lambda+(n_{w}-1)e}\right)\,.

Above, we have used $\Lambda=\sum_{c\neq\hat{c}}\exp(\mathbf{w}_{c}^{\top}\mathbf{w}_{\hat{c}})$ to denote a fixed constant that is determined once the desired geometry is selected, and, for compact notation, we denote $e=\exp(1)$.

For clarity, we analyze the terms $\mathcal{L}_{s}$ and $\mathcal{L}_{p}$ separately. First, since $n_{w}\gg n_{B,y_{i}}$ and the first term of $\mathcal{L}_{s}$ is proportional to $1/n_{w}$, we can neglect it. Moreover, as $n_{w}$ increases, $\sum_{\ell\neq i}\exp(\mathbf{h}_{\ell}^{\top}\mathbf{h}_{i})\ll n_{w}\sum_{c\in[k]}\exp(\mathbf{w}_{c}^{\top}\mathbf{h}_{i})$. Thus, for large $n_{w}$ we have that

\mathcal{L}_{s}\approx-\log\left(\frac{\exp(\mathbf{h}_{i}^{\top}\mathbf{w}_{y_{i}})}{n_{w}\sum_{c\in[k]}\exp(\mathbf{w}_{c}^{\top}\mathbf{h}_{i})}\right)

Now considering $\mathcal{L}_{p}$, the denominators of the logarithms are approximately given by

\sum_{\ell\in[n]}\exp(\mathbf{h}_{\ell}^{\top}\mathbf{w}_{\hat{c}})+n_{w}\Lambda+(n_{w}-1)e\approx n_{w}\Lambda+(n_{w}-1)e\,,\qquad \sum_{\ell\neq i}\exp(\mathbf{h}_{\ell}^{\top}\mathbf{w}_{\hat{c}})+n_{w}\Lambda+(n_{w}-1)e\approx n_{w}\Lambda+(n_{w}-1)e\,.

Thus we get that,

\mathcal{L}_{p}\approx\sum_{\begin{subarray}{c}j\in[n]\\ y_{j}=\hat{c}\end{subarray}}-\log\left(\frac{\exp(\mathbf{w}_{\hat{c}}^{\top}\mathbf{h}_{j})}{n_{w}\Lambda+(n_{w}-1)e}\right)-\Phi\,,

where $\Phi:=(n_{w}-1)\log\left(\frac{e}{n_{w}\Lambda+(n_{w}-1)e}\right)$. Furthermore, in the limit, $\frac{n_{w}}{n_{B,y_{i}}+n_{w}-1}\rightarrow 1$.

Combining the above, the per-batch loss $\mathcal{L}_{B}$ can be expressed as

\mathcal{L}_{B}\approx\sum_{i\in B}-\log\left(\frac{\exp(\mathbf{h}_{i}^{\top}\mathbf{w}_{y_{i}})}{\sum_{c\in[k]}\exp(\mathbf{w}_{c}^{\top}\mathbf{h}_{i})}\right)+\log(n_{w})+\sum_{\hat{c}\in[k]}\sum_{\begin{subarray}{c}j\in[n]\\ y_{j}=\hat{c}\end{subarray}}-\log\left(\exp(\mathbf{w}_{\hat{c}}^{\top}\mathbf{h}_{j})\right)-\Phi+\log(n_{w}\Lambda+(n_{w}-1)e)\,.

Since the optimal embeddings $\mathbf{h}_{i}^{*}$ are independent of any additive constants in the objective, it suffices to drop them during optimization. Thus we arrive at the desired

\mathcal{L}_{B}\rightarrow-\sum_{i\in B}\left[\log\left(\frac{\exp(\mathbf{w}_{y_{i}}^{\top}\mathbf{h}_{i})}{\sum_{c\in[k]}\exp(\mathbf{w}_{c}^{\top}\mathbf{h}_{i})}\right)+\mathbf{w}_{y_{i}}^{\top}\mathbf{h}_{i}\right]\,.

A.2 Proof Sketch of Proposition 2

We follow a proof technique similar to that of [17]; thus, we only mention the delicate aspects necessary to handle the alignment term. Consider the minimization program given below, with the objective given by Prop. 1, while relaxing the norm constraint on the embeddings:

\min_{\|\mathbf{h}_{i}\|^{2}\leq 1}\;-\sum_{i\in B}\left[\log\left(\frac{\exp(\mathbf{w}_{y_{i}}^{\top}\mathbf{h}_{i})}{\sum_{c\in[k]}\exp(\mathbf{w}_{c}^{\top}\mathbf{h}_{i})}\right)+\mathbf{w}_{y_{i}}^{\top}\mathbf{h}_{i}\right]

Now, as a first step, one can define the Lagrangian $L(\{\mathbf{h}_{i}\},\{\lambda_{i}\})$ for $i\in[n]$, where $\{\lambda_{i}\}_{i\in[n]}$ are the dual variables associated with the norm constraints. As in [17], we can prove by contradiction that $\lambda_{i}\neq 0$. By the KKT conditions, this implies that $\|\mathbf{h}_{i}\|^{2}=1$ and $\lambda_{i}>0$.

As a next step, we can define $p_{y}=\frac{\exp(\mathbf{w}_{y}^{\top}\mathbf{h}_{i})}{\sum_{c\in[k]}\exp(\mathbf{w}_{c}^{\top}\mathbf{h}_{i})}$ (as in [17]). From here, one can establish that for $\tilde{c}\neq\hat{c}\neq y_{i}$ we have

\frac{p_{\hat{c}}}{p_{\tilde{c}}}=\frac{\exp(\mathbf{h}_{i}^{\top}\mathbf{w}_{\hat{c}})}{\exp(\mathbf{h}_{i}^{\top}\mathbf{w}_{\tilde{c}})}=\frac{\frac{1}{k-1}-2\lambda_{i}\mathbf{h}_{i}^{\top}\mathbf{w}_{\hat{c}}}{\frac{1}{k-1}-2\lambda_{i}\mathbf{h}_{i}^{\top}\mathbf{w}_{\tilde{c}}}

Taking $x=\mathbf{h}_{i}^{\top}\mathbf{w}_{\hat{c}}$, we consider the function $\frac{a-bx}{\exp(x)}$. For CE with a fixed ETF classifier, the authors of [17] use the monotonicity of $\frac{\exp(x)}{x}$ to complete the proof. In our case, the function $\frac{a-bx}{\exp(x)}$ is strictly decreasing under the constraints $0\leq a\leq 1$, $b>0$ on the interval $x\in[-1,1]$. Therefore, it holds that $\mathbf{h}_{i}^{\top}\mathbf{w}_{\hat{c}}=\mathbf{h}_{i}^{\top}\mathbf{w}_{\tilde{c}}$ and $p_{\hat{c}}=p_{\tilde{c}}=p$. With this fact established, one can directly take the gradient of the Lagrangian and solve for $\mathbf{h}_{i}^{*}$, i.e., set $\nabla_{\mathbf{h}_{i}}L=0$. Using the facts established in this proof sketch, one finds that $\mathbf{h}_{i}^{*}=\mathbf{w}_{y_{i}}$.

References

  • [1] Vardan Papyan, X. Y. Han, and David L. Donoho, “Prevalence of neural collapse during the terminal phase of deep learning training,” Proceedings of the National Academy of Sciences, vol. 117, no. 40, pp. 24652–24663, 2020.
  • [2] Tina Behnia, Ganesh Ramachandra Kini, Vala Vakilian, and Christos Thrampoulidis, “On the implicit geometry of cross-entropy parameterizations for label-imbalanced data,” 2023.
  • [3] Cong Fang, Hangfeng He, Qi Long, and Weijie J. Su, “Layer-peeled model: Toward understanding well-trained deep neural networks,” CoRR, vol. abs/2101.12699, 2021.
  • [4] Zhihui Zhu, Tianyu Ding, Jinxin Zhou, Xiao Li, Chong You, Jeremias Sulam, and Qing Qu, “A geometric analysis of neural collapse with unconstrained features,” CoRR, vol. abs/2105.02375, 2021.
  • [5] Dustin G. Mixon, Hans Parshall, and Jianzong Pi, “Neural collapse with unconstrained features,” CoRR, vol. abs/2011.11619, 2020.
  • [6] Christos Thrampoulidis, Ganesh Ramachandra Kini, Vala Vakilian, and Tina Behnia, “Imbalance trouble: Revisiting neural-collapse geometry,” in Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, Eds. 2022, vol. 35, pp. 27225–27238, Curran Associates, Inc.
  • [7] Florian Graf, Christoph Hofer, Marc Niethammer, and Roland Kwitt, “Dissecting supervised contrastive learning,” in Proceedings of the 38th International Conference on Machine Learning, Marina Meila and Tong Zhang, Eds. 18–24 Jul 2021, vol. 139 of Proceedings of Machine Learning Research, pp. 3821–3830, PMLR.
  • [8] Jianggang Zhu, Zheng Wang, Jingjing Chen, Yi-Ping Phoebe Chen, and Yu-Gang Jiang, “Balanced contrastive learning for long-tailed visual recognition,” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 6898–6907.
  • [9] Ganesh Ramachandra Kini, Vala Vakilian, Tina Behnia, Jaidev Gill, and Christos Thrampoulidis, “Supervised-contrastive loss learns orthogonal frames and batching matters,” 2023.
  • [10] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton, “A simple framework for contrastive learning of visual representations,” in Proceedings of the 37th International Conference on Machine Learning, Hal Daumé III and Aarti Singh, Eds. 13–18 Jul 2020, vol. 119 of Proceedings of Machine Learning Research, pp. 1597–1607, PMLR.
  • [11] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan, “Supervised contrastive learning,” in Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, Eds. 2020, vol. 33, pp. 18661–18673, Curran Associates, Inc.
  • [12] Kaidi Cao, Colin Wei, Adrien Gaidon, Nikos Arechiga, and Tengyu Ma, “Learning imbalanced datasets with label-distribution-aware margin loss,” 2019.
  • [13] Aditya Krishna Menon, Sadeep Jayasumana, Ankit Singh Rawat, Himanshu Jain, Andreas Veit, and Sanjiv Kumar, “Long-tail learning via logit adjustment,” 2021.
  • [14] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick, “Momentum contrast for unsupervised visual representation learning,” 2020.
  • [15] Jiequan Cui, Zhisheng Zhong, Shu Liu, Bei Yu, and Jiaya Jia, “Parametric contrastive learning,” 2021.
  • [16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” CoRR, vol. abs/1512.03385, 2015.
  • [17] Yibo Yang, Shixiang Chen, Xiangtai Li, Liang Xie, Zhouchen Lin, and Dacheng Tao, “Inducing neural collapse in imbalanced learning: Do we really need a learnable classifier at the end of deep neural network?,” 2022.
  • [18] Elad Hoffer, Itay Hubara, and Daniel Soudry, “Fix your classifier: the marginal value of training the last weight layer,” 2018.
  • [19] Ganesh Ramachandra Kini, Orestis Paraskevas, Samet Oymak, and Christos Thrampoulidis, “Label-imbalanced and group-sensitive classification under overparameterization,” 2021.
  • [20] Han-Jia Ye, Hong-You Chen, De-Chuan Zhan, and Wei-Lun Chao, “Identifying and compensating for feature deviation in imbalanced deep learning,” CoRR, vol. abs/2001.01385, 2020.