
Hard-Negative Sampling for Contrastive Learning: Optimal Representation Geometry and Neural- vs Dimensional-Collapse

Ruijie Jiang∗,†, Thuan Nguyen∗,∘, Shuchin Aeron†, Prakash Ishwar‡ (∗: Equal Contribution)
†: Department of Electrical Engineering, Tufts University. Emails: Ruijie.Jiang@tufts.edu, shuchin@ece.tufts.edu.
∘: Department of Engineering, Engineering Technology, East Tennessee State University. Email: nguyent11@etsu.edu.
‡: Department of Electrical and Computer Engineering, Boston University. Email: pi@bu.edu.
Abstract

For a widely-studied data model and general loss and sample-hardening functions we prove that the losses of Supervised Contrastive Learning (SCL), Hard-SCL (HSCL), and Unsupervised Contrastive Learning (UCL) are minimized by representations that exhibit Neural-Collapse (NC), i.e., the class means form an Equiangular Tight Frame (ETF) and data from the same class are mapped to the same representation. We also prove that for any representation mapping, the HSCL and Hard-UCL (HUCL) losses are lower bounded by the corresponding SCL and UCL losses. In contrast to existing literature, our theoretical results for SCL do not require class-conditional independence of augmented views and work for a general loss function class that includes the widely used InfoNCE loss function. Moreover, our proofs are simpler, compact, and transparent. Similar to existing literature, our theoretical claims also hold for the practical scenario where batching is used for optimization. We empirically demonstrate, for the first time, that Adam optimization (with batching) of HSCL and HUCL losses with random initialization and suitable hardness levels can indeed converge to the NC-geometry if we incorporate unit-ball or unit-sphere feature normalization. Without incorporating hard-negatives or feature normalization, however, the representations learned via Adam suffer from Dimensional-Collapse (DC) and fail to attain the NC-geometry. These results exemplify the role of hard-negative sampling in contrastive representation learning and we conclude with several open theoretical problems for future work. The code can be found at https://github.com/rjiang03/HCL/tree/main

Keywords: Contrastive Learning, Hard-Negative Sampling, Neural-Collapse.

1 Introduction

Contrastive representation learning (CL) methods learn a mapping that embeds data into a Euclidean space such that similar examples retain close proximity to each other and dissimilar examples are pushed apart. CL, and in particular unsupervised CL, has gained prominence in the last decade with notable success in Natural Language Processing (NLP), Computer Vision (CV), time-series, and other applications. Recent surveys [2, 23] and the references therein provide a comprehensive view of these applications.

The characteristics and utility of the learned representation depend on the joint distribution of similar (positive samples) and dissimilar data points (negative samples) and the downstream learning task. In this paper we are not interested in the downstream analysis, but in characterizing the geometric structure and other properties of global minima of the contrastive learning loss under a general latent data model. In this context, we focus on understanding the impact and utility of hard-negative sampling in the UCL and SCL settings. When carefully designed, hard-negative sampling improves downstream classification performance of representations learned via CL as demonstrated in [24], [11, 10], and [17]. While it is known that hard-negative sampling can be performed implicitly by adjusting what is referred to as the temperature parameter in the CL loss, in this paper we set this parameter to unity and explicitly model hard-negative sampling through a general “hardening” function that can tilt the sampling distribution to generate negative samples that are more similar to (and therefore harder to distinguish from) the positive and anchor samples. We also numerically study the impact of feature normalization on the learned representation geometry.

1.1 Main contributions

Our main theoretical contributions are Theorems 1, 2, and 3, which, under a widely-studied latent data model, hold for any convex, argument-wise non-decreasing contrastive loss function, any non-negative and argument-wise non-decreasing hardening function to generate hard-negative samples, and norm-bounded representations of dimension at least C-1.

Theorem 1 establishes that the HSCL loss dominates the SCL loss and, similarly, that the HUCL loss dominates the UCL loss. In this context we note that Theorem 3.1 in [30] is a somewhat similar result for the UCL setting, but it is restricted to a special loss function and does not directly address hard-negatives.

Theorem 2 is a novel result which states that the globally optimal representation geometry for both SCL and HSCL corresponds to Neural-Collapse (NC) (see Definition 6) with the same optimal loss value. In contrast to existing works (see related works section 2), we show that achieving NC in SCL and HSCL does not require class-conditional independence of the positive samples.

Similarly, Theorem 3 establishes the optimality of NC-geometry for UCL if the representation dimension is sufficiently large compared to the number of latent classes which, in turn, is implicitly determined by the joint distribution of the positive examples that corresponds to the augmentation mechanism.

A comprehensive set of experimental results on one synthetic and three real datasets is detailed in Sec. 5 and Appendix B. These include experiments that study the effects of two initialization methods, three different feature normalization methods, three different batch sizes, two very different families of hardening functions, and two different CL sampling strategies. Empirical results show that when using the Adam optimizer with random initialization, the matrix of class means for SCL is badly conditioned and effectively low-rank, i.e., it exhibits Dimensional-Collapse (DC). In contrast, the use of hard-negatives at appropriate hardness levels mitigates DC and enables convergence to the global optima. A similar phenomenon is observed in the unsupervised settings. We also show that feature normalization is critical for mitigating DC in these settings. Results are qualitatively similar across different datasets, a range of batch sizes, hardening functions, and CL sampling strategies.

2 Related work

2.1 Supervised Contrastive Learning

The theoretical results for our SCL setting, where in contrast to [7] and [13] we make use of label information in positive as well as negative sampling, are novel. The debiased SCL loss in [5] corresponds to our SCL loss, but no analysis pertaining to optimal representation geometry was considered in [5]. A recent ArXiv preprint by [6], which appeared after our own ArXiv preprint [11], considers using label information for negative sampling in SCL with the InfoNCE loss, calling it the SINCERE loss. Our theoretical results prove that NC is the optimal geometry for the SINCERE loss. Furthermore, our results and analysis also apply to hard-negative sampling, a scenario not considered thus far for SCL.

We would like to note that our theoretical set-up for UCL under the sampling mechanism of Figure 1 can be seen to be aligned with the SCL analysis that makes use of label information only for positive samples, as done in [13] and [7]. Therefore, our theoretical results provide an alternative proof of the optimality of NC based on simple, compact, and transparent probabilistic arguments, complementing the proof of a similar result in [7]. We note that, similarly to [7], our arguments also hold for the case when one approximates the loss using batches.

We want to point out that in all key papers that conduct a theoretical analysis of contrastive learning, e.g., [1, 7, 25], the positive samples are assumed to be conditionally i.i.d. given the label. However, this conditional independence may not hold in practice when using augmentation mechanisms typically considered in CL settings, e.g., [19, 20, 31, 4].

Unlike the recent work of [33], which shows that the optimization landscape of supervised learning with least-squares loss is benign, i.e., all critical points other than the global optima are strict saddle points, in Sec. 5 we demonstrate that the optimization landscape of SCL is more complicated. Specifically, not only may the global optima not be reached by SGD-like methods with random initialization, but the local optima also exhibit the Dimensional-Collapse (DC) phenomenon. However, our experiments demonstrate that these issues are remedied via HSCL, whose global optimization landscape may be better. Here we note that [32] show that with unit-sphere normalization, Riemannian gradient descent methods can achieve the global optima for SCL, underscoring the importance of optimization methods and constraints for training in CL.

2.2 Unsupervised Contrastive Learning

[27] argue that, asymptotically in the number of negative samples, the InfoNCE loss for UCL optimizes a trade-off between the alignment of positive pairs and the uniformity of features on the hypersphere. However, a non-asymptotic and global analysis of the optimal solution is still lacking. In contrast, for UCL in Theorem 3, we show that as long as the embedding dimension is larger than the number of latent classes, which in turn is determined by the distribution of the similar samples, the optimal representations in UCL exhibit NC-geometry.

Our results also complement several recent papers, e.g., [22], [29], [28], that study the role of augmentations in UCL. Similar to theoretical works analyzing UCL, e.g. [1] and [16], our results also assume conditional independence of positive pairs given the label. This assumption may or may not be satisfied in practice.

We demonstrate that a recent result, viz., Theorem 4 in [12], which attempts to explain DC in UCL, is limited: under a suitable initialization, the UCL loss trained with Adam does not exhibit DC (see Sec. 5). Furthermore, we demonstrate empirically, for the first time, that HUCL mitigates DC in UCL at moderate hardness levels. For CL (without hard-negative sampling), [34] characterize local solutions that correspond to DC but leave open the analysis of training dynamics leading to collapsed solutions.

A geometrical analysis of HUCL is carried out in [24], but the optimal solutions are only characterized asymptotically (in the number of negative samples) and for the case when the hardness level also goes to infinity; moreover, the analysis seems to require knowledge of the supports of the class-conditional distributions. In contrast, we show that the geometry of the optimal solution for HUCL depends on the hardness level and is, in general, different from that of UCL due to the possibility of class collision.

3 Contrastive Learning framework

3.1 Mathematical model

Notation: k,C\in{\mathbb{N}}, k>1, C>1, {\mathcal{Y}}:=\{1,\ldots,C\}, {\mathcal{Z}}\subseteq{\mathbb{R}}^{d_{\mathcal{Z}}}. For i,j\in{\mathbb{Z}}, i<j, i:j:=i,i+1,\ldots,j, and a_{i:j}:=a_{i},a_{i+1},\ldots,a_{j}. If i>j, then i:j and a_{i:j} are “null”.

Let f:𝒳𝒵f:{\mathcal{X}}\rightarrow{\mathcal{Z}} denote a (deterministic) representation mapping from data space 𝒳{\mathcal{X}} to representation space 𝒵d𝒵{\mathcal{Z}}\subseteq{\mathbb{R}}^{d_{\mathcal{Z}}}. Let {\mathcal{F}} denote a family of such representation mappings. Contrastive Learning (CL) selects a representation from the family by minimizing an expected loss function that penalizes “misalignment” between the representation of an anchor sample z=f(x)z=f(x) and the representation of a positive sample z+=f(x+)z^{+}=f(x^{+}) and simultaneously penalizes “alignment” between zz and the representations of kk negative samples zi:=f(xi),i=1:kz^{-}_{i}:=f(x^{-}_{i}),i=1:k.

We consider a CL loss function k\ell_{k} of the following general form.

Definition 1 (Generalized Contrastive Loss).
\displaystyle\ell_{k}(z,z^{+},z^{-}_{1:k}):=\psi_{k}(z^{\top}(z^{-}_{1}-z^{+}),\ldots,z^{\top}(z^{-}_{k}-z^{+})) (1)

where ψk:k\psi_{k}:{\mathbb{R}}^{k}\rightarrow{\mathbb{R}} is a convex function that is also argument-wise non-decreasing (i.e., non-decreasing with respect to each argument when the other arguments are held fixed) throughout k{\mathbb{R}}^{k}.

This subsumes and generalizes popular CL loss functions such as InfoNCE and triplet-loss with sphere-normalized representations. InfoNCE corresponds to \psi_{k}(t_{1:k})=\log(\alpha+\sum_{i=1}^{k}e^{t_{i}}) with \alpha>0 (this is the log-sum-exponential function, which is convex over {\mathbb{R}}^{k} for all \alpha\geq 0 and strictly convex over {\mathbb{R}}^{k} if \alpha>0), and \psi_{k}(t)=\max\{t+\alpha,0\}, \alpha>0, is the triplet-loss with sphere-normalized representations. However, some CL losses, such as the spectral contrastive loss of [8], are not of this form.
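To make Definition 1 concrete, the following minimal NumPy sketch (our own illustration, not code from the paper's repository; the function names and the choice \alpha=1 are ours) evaluates the generalized contrastive loss with the InfoNCE choice \psi_{k}(t_{1:k})=\log(\alpha+\sum_{i=1}^{k}e^{t_{i}}).

```python
import numpy as np

def psi_infonce(t, alpha=1.0):
    """InfoNCE choice of psi_k: log(alpha + sum_i exp(t_i)), computed stably.

    Convex and argument-wise non-decreasing, as required by Definition 1.
    """
    m = max(np.max(t), 0.0)
    return m + np.log(alpha * np.exp(-m) + np.sum(np.exp(t - m)))

def generalized_cl_loss(z, z_pos, z_negs, psi=psi_infonce):
    """ell_k(z, z^+, z^-_{1:k}) = psi_k(z^T(z^-_1 - z^+), ..., z^T(z^-_k - z^+))."""
    t = z_negs @ z - z @ z_pos        # k-vector of inner-product differences
    return psi(t)

# toy usage: anchor aligned with the positive, two negatives pointing away
z = np.array([1.0, 0.0]); z_pos = np.array([0.9, 0.1])
z_negs = np.array([[-1.0, 0.0], [0.0, -1.0]])
print(generalized_cl_loss(z, z_pos, z_negs))
```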

The CL loss is the expected value of the CL loss function:

LCL(k)(f):=𝔼(x,x+,x1:k)pCL[k(f(x),f(x+),f(x1),,f(xk))]\displaystyle L^{(k)}_{CL}(f):={\mathbb{E}}_{(x,x^{+},x^{-}_{1:k})\sim p_{CL}}[\ell_{k}(f(x),f(x^{+}),f(x^{-}_{1}),\ldots,f(x^{-}_{k}))]

where pCL(x,x+,x1:k)p_{CL}(x,x^{+},x^{-}_{1:k}) is the joint probability distribution of the anchor, positive, and kk negative samples and is designed differently within the supervised and unsupervised settings as described below.

Supervised CL (SCL): Here, all samples have class labels: a common class label y𝒴y\in{\mathcal{Y}} for the anchor and positive sample and one class label for each negative sample denoted by yi𝒴y^{-}_{i}\in{\mathcal{Y}} for the ii-th negative sample, i=1,,ki=1,\ldots,k. The joint distribution of all samples and their labels is described in the following equation:

pSCL(y,x,x+,y1:k,x1:k)\displaystyle p_{SCL}(y,x,x^{+},y^{-}_{1:k},x^{-}_{1:k}) :=λyq(x,x+|y)i=1kr(yi|y)s(xi|yi),\displaystyle:=\lambda_{y}\,q(x,x^{+}|y)\prod_{i=1}^{k}r(y^{-}_{i}|y)\,s(x^{-}_{i}|y^{-}_{i}), (2)
r(yi|y)\displaystyle r(y^{-}_{i}|y) :=1(yiy)λyi(1λy)\displaystyle:=\frac{1(y^{-}_{i}\neq y)\lambda_{y^{-}_{i}}}{(1-\lambda_{y})} (3)

where λy(0,1)\lambda_{y}\in(0,1) for all y𝒴y\in{\mathcal{Y}} is the marginal distribution of the anchor’s label and s(x|y)s(x^{-}|y^{-}) is the conditional probability distribution of any negative sample xx^{-} given its class yy^{-}.

This joint distribution may be interpreted from a sample generation perspective as follows: first, a common class label y𝒴y\in{\mathcal{Y}} for the anchor and positive sample is sampled from a class marginal probability distribution λ\lambda. Then, the anchor and positive samples are generated by sampling from the conditional distribution q(x,x+|y)q(x,x^{+}|y). Then, given x,x+x,x^{+} and their common class label yy, the kk negative samples and their labels are generated in a conditionally IID manner. The sampling of yi,xiy^{-}_{i},x^{-}_{i}, for each ii, can be interpreted as first sampling a class label yiy^{-}_{i} different from yy in a manner consistent with the class marginal probability distribution λ\lambda (sampling from distribution r(yi|y)r(y^{-}_{i}|y)) and then sampling xix^{-}_{i} from the conditional probability distribution s(|)s(\cdot|\cdot) of negative samples given class yiy^{-}_{i}. Thus in SCL, the kk negative samples are conditionally IID and independent of the anchor and positive sample given the anchor’s label.

In the typical supervised setting, the anchor, positive, and negative samples all share the same common conditional probability distribution s(|)s(\cdot|\cdot) within each class given their respective labels.
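To illustrate the sampling scheme in (2)–(3), the sketch below (our own; the helper samplers sample_pair and sample_x are hypothetical stand-ins for q(\cdot,\cdot|y) and s(\cdot|y)) draws one SCL tuple for equiprobable classes: a common label y, a positive pair from q(x,x^{+}|y), and k negatives whose labels are drawn from r(\cdot|y), i.e., uniformly over the classes other than y.

```python
import numpy as np

def sample_scl_tuple(C, k, sample_pair, sample_x, rng):
    """Draw (y, x, x_pos, y_negs, x_negs) from p_SCL in (2)-(3) with lambda_y = 1/C.

    sample_pair(y, rng) -> (x, x_pos) is a draw from q(., .|y); sample_x(y, rng)
    is a draw from s(.|y). Both are user-supplied (hypothetical) samplers.
    """
    y = rng.integers(C)                          # anchor/positive label ~ lambda
    x, x_pos = sample_pair(y, rng)               # (x, x^+) ~ q(., .|y); need not be independent
    other = [c for c in range(C) if c != y]
    y_negs = rng.choice(other, size=k)           # r(y^-|y): uniform over classes != y
    x_negs = [sample_x(c, rng) for c in y_negs]  # x^-_i ~ s(.|y^-_i), conditionally IID
    return y, x, x_pos, y_negs, x_negs

# toy example: class-conditional Gaussians; the positive pair shares a reference draw,
# so x and x^+ are dependent given y (conditional independence is not required)
means = np.eye(3)
def sample_x(y, rng):  return means[y] + 0.1 * rng.standard_normal(3)
def sample_pair(y, rng):
    ref = sample_x(y, rng)
    return ref + 0.01 * rng.standard_normal(3), ref + 0.01 * rng.standard_normal(3)

rng = np.random.default_rng(0)
print(sample_scl_tuple(3, 4, sample_pair, sample_x, rng))
```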

We denote the CL loss in the supervised setting by LSCL(k)(f)L^{(k)}_{SCL}(f).

Unsupervised CL (UCL): Here, samples do not have labels, or rather the labels are latent (unobserved), and the k negative samples are IID and independent of the anchor and positive samples.

Latent labels in UCL can be interpreted as indexing latent clusters. Suppose that there are C latent clusters from which the anchor, positive, and the k negative samples can be drawn. Then the joint distribution of all samples and their latent labels can be described by the following equation:

pUCL(y,x,x+,y1:k,x1:k)\displaystyle p_{UCL}(y,x,x^{+},y^{-}_{1:k},x^{-}_{1:k}) :=λyq(x,x+|y)i=1kr(yi)s(xi|yi),\displaystyle:=\lambda_{y}\,q(x,x^{+}|y)\prod_{i=1}^{k}r(y^{-}_{i})\,s(x^{-}_{i}|y^{-}_{i}), (4)

where λ\lambda is the marginal distribution of the anchor’s latent label, r()r(\cdot) the marginal distribution of the latent labels of negative samples, and s(x|y)s(x^{-}|y^{-}) is the conditional probability distribution of any negative sample xx^{-} given its latent label yy^{-}. We have used the same notation as in SCL (and slightly abused it for r()r(\cdot)) in order to make the similarities and differences between the SCL and UCL distribution structures transparent.

In the typical UCL setting, r=λr=\lambda and the conditional distribution of xx given its label yy and the conditional distribution of x+x^{+} given its label yy are both s(|y)s(\cdot|y) which is the conditional distribution of any negative sample given its label.

We denote the CL loss in the unsupervised setting by LUCL(k)(f)L^{(k)}_{UCL}(f).

[Figure 1: graphical model with nodes x^{ref}, y, x, x^{+}, z, z^{+} (anchor/positive branch) and x^{ref-}, y^{-}, x^{-}, z^{-} (negative branch).]
Figure 1: Graphical model for augmentation and negative sampling for SCL and UCL settings used in practical implementations such as Sim-CLR.

Anchor and positive samples: For the SCL scenario, we will consider sampling mechanisms in which the representations of the anchor xx and the positive sample x+x^{+} have the same conditional probability distribution s(|y)s(\cdot|y) given their common label yy (see (2)). We will not, however, assume that xx and x+x^{+} are conditionally independent given yy. This is compatible with settings where xx and x+x^{+} are generated via IID augmentations of a common reference sample xrefx^{ref} as in SimCLR [4] (see Fig. 1 for the model and Appendix A for a proof of compatibility of this model).

For UCL, we assume the same mechanism for sampling the positive samples as that for SCL, but the latent label yy is unobserved. Further, the negative samples are generated independently of the positive pairs using the same mechanism, e.g., IID augmentations of an xrefx^{ref-} chosen independently of xrefx^{ref} (see Fig. 1). Thus, the anchor, positive, and negative samples will all have the same marginal distribution given by i=1Cλis(|i)\sum_{i=1}^{C}\lambda_{i}s(\cdot|i).

3.2 Hard-negative sampling

Hard-negative sampling aims to generate negative samples whose representations are “more aligned” with that of the anchor (making them harder to distinguish from the anchor) compared to a given reference negative sampling distribution (whether unsupervised or supervised). We consider a very general class of “hardening” mechanisms that include several classical approaches as special cases. To this end, we define a hardening function as follows.

Definition 2 (Hardening function).

η:k\eta:{\mathbb{R}}^{k}\rightarrow{\mathbb{R}} is a hardening function if it is non-negative and argument-wise non-decreasing throughout k{\mathbb{R}}^{k}.

As an example, \eta(t_{1:k}):=\prod_{i=1}^{k}e^{\beta t_{i}}, \beta>0, is an exponential tilting hardening function employed in [25] and [10].

Hard-negative SCL (HSCL): From (2) it follows that in SCL, p(x|x,x+,y)=p(x|y)=y𝒴r(y|y)s(x|y)=:pSCL(x|y)p(x^{-}|x,x^{+},y)=p(x^{-}|y)=\sum_{y^{-}\in{\mathcal{Y}}}r(y^{-}|y)\,s(x^{-}|y^{-})=:p^{-}_{SCL}(x^{-}|y) is the reference negative sampling distribution for one negative sample and p(x1:k|x,x+,y)=p(x1:k|y)=i=1kpSCL(xi|y)p(x^{-}_{1:k}|x,x^{+},y)=p(x^{-}_{1:k}|y)=\prod_{i=1}^{k}p^{-}_{SCL}(x^{-}_{i}|y) is the reference negative sampling distribution for kk negative samples. Let η\eta be a hardening function such that for all x𝒳x\in{\mathcal{X}} and all y𝒴y\in{\mathcal{Y}},

γ(x,y,f):=𝔼x1:kIIDpSCL(|y)[η(f(x)f(x1),,f(x)f(xk))](0,).\displaystyle\gamma(x,y,f):={\mathbb{E}}_{x^{-}_{1:k}\sim\,\mathrm{IID}\,p^{-}_{SCL}(\cdot|y)}[\eta(f(x)^{\top}f(x^{-}_{1}),\ldots,f(x)^{\top}f(x^{-}_{k}))]\in(0,\infty).

Then we define the η\eta-harder negative sampling distribution for SCL as follows.

Definition 3 (η\eta-harder negatives for SCL).
pHSCL(x1:k|x,x+,y,f):=η(f(x)f(x1),,f(x)f(xk))γ(x,y,f)i=1kpSCL(xi|y).p^{-}_{\text{HSCL}}(x^{-}_{1:k}|x,x^{+},y,f)\displaystyle:=\frac{\eta(f(x)^{\top}f(x^{-}_{1}),\ldots,f(x)^{\top}f(x^{-}_{k}))}{\gamma(x,y,f)}\prod_{i=1}^{k}p^{-}_{SCL}(x^{-}_{i}|y). (5)

Observe that negative samples which are more aligned with the anchor in the representation space, i.e., f(x)f(xi)f(x)^{\top}f(x^{-}_{i}) is large, are sampled relatively more often in pHSCLp^{-}_{HSCL} than in the reference pSCLp^{-}_{SCL} because η\eta is argument-wise non-decreasing throughout k{\mathbb{R}}^{k}.

In HSCL, x1:kx^{-}_{1:k} are conditionally independent of x+x^{+} given xx and yy but they are not conditionally independent of xx given yy (unlike in SCL). Moreover, x1:kx^{-}_{1:k} may not be conditionally IID given (x,y)(x,y) if the hardening function is not (multiplicatively) separable. We also note that unlike in SCL, pHSCLp^{-}_{HSCL} depends on the representation function ff.

We denote the joint probability distribution of all samples and their labels in the hard-negative SCL setting by pHSCLp_{HSCL} and the corresponding CL loss by LHSCL(k)(f)L^{(k)}_{HSCL}(f).
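Since p^{-}_{HSCL} in (5) is just the reference distribution reweighted by \eta/\gamma, expectations under it can be estimated from negatives drawn from the reference distribution via self-normalized importance weighting. The sketch below (ours; the estimator and names are illustrative and not the paper's implementation) does this for the separable exponential-tilting hardening function \eta(t_{1:k})=\prod_{i=1}^{k}e^{\beta t_{i}} mentioned after Definition 2.

```python
import numpy as np

def psi_infonce(t, alpha=1.0):
    m = max(np.max(t), 0.0)
    return m + np.log(alpha * np.exp(-m) + np.sum(np.exp(t - m)))

def hscl_loss_estimate(z, z_pos, z_neg_draws, beta=1.0, psi=psi_infonce):
    """Self-normalized importance-weighted estimate of the eta-tilted (hard-negative) loss.

    z_neg_draws has shape (M, k, d): M independent draws of k negatives from the
    *reference* distribution p_SCL(.|y); eta(t_{1:k}) = prod_i exp(beta * t_i).
    """
    sims = z_neg_draws @ z                      # (M, k): f(x)^T f(x^-_i) for each draw
    log_eta = beta * sims.sum(axis=1)           # log eta(.) for each draw of k negatives
    w = np.exp(log_eta - log_eta.max())         # unnormalized importance weights (stable)
    w /= w.sum()                                # self-normalization plays the role of 1/gamma
    losses = np.array([psi(s - z @ z_pos) for s in sims])
    return float(w @ losses)                    # approximates E_{p_HSCL}[ell_k]

# usage: the weighted estimate should not be smaller than the unweighted one (cf. Theorem 1)
rng = np.random.default_rng(0)
z = np.array([1.0, 0.0]); z_pos = np.array([0.9, 0.1])
negs = 0.5 * rng.standard_normal((2000, 5, 2))  # reference negatives: M=2000 draws of k=5
plain = np.mean([psi_infonce(s - z @ z_pos) for s in negs @ z])
print(plain, hscl_loss_estimate(z, z_pos, negs, beta=2.0))
```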

Hard-negative UCL (HUCL): From (4) it follows that in UCL, p(x|x,x+,y)=p(x)=y𝒴r(y)s(x|y)=:pUCL(x)p(x^{-}|x,x^{+},y)=p(x^{-})=\sum_{y^{-}\in{\mathcal{Y}}}r(y^{-})\,s(x^{-}|y^{-})=:p^{-}_{UCL}(x^{-}) is the reference negative sampling distribution for one negative sample and p(x1:k|x,x+,y)=p(x1:k)=i=1kpUCL(xi)p(x^{-}_{1:k}|x,x^{+},y)=p(x^{-}_{1:k})=\prod_{i=1}^{k}p^{-}_{UCL}(x^{-}_{i}) is the reference negative sampling distribution for kk negative samples. Let η\eta be a hardening function such that for all x𝒳x\in{\mathcal{X}},

γ(x,f):=𝔼x1:kIIDpUCL[η(f(x)f(x1),,f(x)f(xk))](0,).\displaystyle\gamma(x,f):={\mathbb{E}}_{x^{-}_{1:k}\sim\,\mathrm{IID}\,p^{-}_{UCL}}[\eta(f(x)^{\top}f(x^{-}_{1}),\ldots,f(x)^{\top}f(x^{-}_{k}))]\in(0,\infty). (6)

Then we define the η\eta-harder negative sampling probability distribution for UCL as follows.

Definition 4 (η\eta-harder negatives for UCL).
pHUCL(x1:k|x,x+,y,f):=η(f(x)f(x1),,f(x)f(xk))γ(x,f)i=1kpUCL(xi).\displaystyle p^{-}_{HUCL}(x^{-}_{1:k}|x,x^{+},y,f):=\frac{\eta(f(x)^{\top}f(x^{-}_{1}),\ldots,f(x)^{\top}f(x^{-}_{k}))}{\gamma(x,f)}\cdot\prod_{i=1}^{k}p^{-}_{UCL}(x^{-}_{i}). (7)

Again, observe that negative samples which are more aligned with the anchor in representation space, i.e., f(x)f(xi)f(x)^{\top}f(x^{-}_{i}) is large, are sampled relatively more often in pHUCLp^{-}_{HUCL} than in the reference pUCLp^{-}_{UCL} because η\eta is argument-wise non-decreasing throughout k{\mathbb{R}}^{k}.

In HUCL, x1:kx^{-}_{1:k} are conditionally independent of x+x^{+} given xx, but they are not independent of xx (unlike in UCL). Moreover, x1:kx^{-}_{1:k} may not be conditionally IID given xx if the hardening function is not (multiplicatively) separable. We also note that unlike in UCL, pHUCLp^{-}_{HUCL} depends on the representation function ff.

We denote the joint probability distribution of all samples and their (latent) labels in the hard-negative UCL setting by p_{HUCL} and the corresponding CL loss by L^{(k)}_{HUCL}(f).

4 Theoretical results

In this section, we present all our theoretical results using the notation and mathematical framework for CL described in the previous section.

4.1 Hard-negative CL loss is not smaller than CL loss

Theorem 1 (Hard-negative CL versus CL losses).

Let ψk\psi_{k} in (1) be argument-wise non-decreasing over k{\mathbb{R}}^{k} and assume that all expectations associated with LUCL(k)(f)L^{(k)}_{UCL}(f), LHUCL(k)(f)L^{(k)}_{HUCL}(f), LSCL(k)(f)L^{(k)}_{SCL}(f), LHSCL(k)(f)L^{(k)}_{HSCL}(f) exist and are finite. Then, for all ff and all kk, LHUCL(k)(f)LUCL(k)(f)L^{(k)}_{HUCL}(f)\geq L^{(k)}_{UCL}(f) and LHSCL(k)(f)LSCL(k)(f)L^{(k)}_{HSCL}(f)\geq L^{(k)}_{SCL}(f).

We note that convexity of \psi_{k} is not needed in Theorem 1. The proof of Theorem 1 is based on the generalized (multivariate) association inequality due to Harris (Theorem 2.15 in [3]) and its corollary, which are stated below.

Lemma 1 (Harris-inequality, Theorem 2.15 in [3]).

Let g:kg:{\mathbb{R}}^{k}\rightarrow{\mathbb{R}} and h:kh:{\mathbb{R}}^{k}\rightarrow{\mathbb{R}} be argument-wise non-decreasing throughout k{\mathbb{R}}^{k}. If u1:kIIDpu_{1:k}\sim\,\mathrm{IID}\,p then

𝔼u1:kIIDp[g(u1:k)\displaystyle{\mathbb{E}}_{u_{1:k}\sim\,\mathrm{IID}\,p}[g(u_{1:k}) h(u1:k)]𝔼u1:kIIDp[g(u1:k)]𝔼u1:kIIDp[h(u1:k)]\displaystyle h(u_{1:k})]\geq{\mathbb{E}}_{u_{1:k}\sim\,\mathrm{IID}\,p}[g(u_{1:k})]\cdot{\mathbb{E}}_{u_{1:k}\sim\,\mathrm{IID}\,p}[h(u_{1:k})]

whenever the expectations exist and are finite.

Corollary 1.

Let η:k\eta:{\mathbb{R}}^{k}\rightarrow{\mathbb{R}} be non-negative and argument-wise non-decreasing throughout k{\mathbb{R}}^{k} such that γ:=𝔼u1:kIIDp[η(u1:k)](0,)\gamma:={\mathbb{E}}_{u_{1:k}\sim\,\mathrm{IID}\,p}[\eta(u_{1:k})]\in(0,\infty). Let pH(u1:k):=η(u1:k)γi=1kp(ui)p_{H}(u_{1:k}):=\frac{\eta(u_{1:k})}{\gamma}\prod_{i=1}^{k}p(u_{i}). If g:kg:{\mathbb{R}}^{k}\rightarrow{\mathbb{R}} is argument-wise non-decreasing throughout k{\mathbb{R}}^{k} such that 𝔼u1:kIIDp[g(u1:k)]{\mathbb{E}}_{u_{1:k}\sim\,\mathrm{IID}\,p}[g(u_{1:k})] exists and is finite, then

𝔼u1:kpH[g(u1:k)]𝔼u1:kIIDp[g(u1:k)].{\mathbb{E}}_{u_{1:k}\sim p_{H}}[g(u_{1:k})]\geq{\mathbb{E}}_{u_{1:k}\sim\,\mathrm{IID}\,p}[g(u_{1:k})].

Proof

𝔼u1:kpH[g(u1:k)]\displaystyle{\mathbb{E}}_{u_{1:k}\sim p_{H}}[g(u_{1:k})]
=𝔼u1:kIIDp[g(u1:k)η(u1:k)γ]\displaystyle={\mathbb{E}}_{u_{1:k}\sim\,\mathrm{IID}\,p}\left[g(u_{1:k})\frac{\eta(u_{1:k})}{\gamma}\right]
𝔼u1:kIIDp[g(u1:k)]𝔼u1:kIIDp[η(u1:k)]γ\displaystyle\geq{\mathbb{E}}_{u_{1:k}\sim\,\mathrm{IID}\,p}[g(u_{1:k})]\frac{{\mathbb{E}}_{u_{1:k}\sim\,\mathrm{IID}\,p}[\eta(u_{1:k})]}{\gamma}
=𝔼u1:kIIDp[g(u1:k)]\displaystyle={\mathbb{E}}_{u_{1:k}\sim\,\mathrm{IID}\,p}[g(u_{1:k})]

where the inequality in the second step follows from the Harris-inequality (see Lemma 1).  
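A quick Monte Carlo sanity check of Corollary 1 (our own illustration, with arbitrarily chosen g and \eta): draw IID uniform u_{1:k}, tilt them with a non-negative, argument-wise non-decreasing \eta, and compare the two expectations for an argument-wise non-decreasing g.

```python
import numpy as np

rng = np.random.default_rng(0)
k, M = 3, 200_000
u = rng.uniform(-1.0, 1.0, size=(M, k))       # u_{1:k} ~ IID p (uniform on [-1, 1])

g = u.sum(axis=1)                             # g(u_{1:k}) = u_1 + ... + u_k (non-decreasing)
eta = np.exp(2.0 * u).prod(axis=1)            # eta(u_{1:k}) = prod_i exp(2 u_i) >= 0, non-decreasing

lhs = np.sum(eta * g) / np.sum(eta)           # estimate of E_{p_H}[g] with p_H ∝ eta * prod_i p(u_i)
rhs = g.mean()                                # estimate of E_{IID p}[g] (here approximately 0)
print(lhs, rhs)                               # lhs should not be smaller than rhs (Corollary 1)
```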

Proof of Theorem 1. The proof essentially follows from Corollary 1 by defining ui:=f(x)f(xi)u_{i}:=f(x)^{\top}f(x^{-}_{i}) for i=1:ki=1:k, defining gx,x+(u1:k):=ψk(u1f(x)f(x+),,ukf(x)f(x+))g_{x,x^{+}}(u_{1:k}):=\psi_{k}(u_{1}-f(x)^{\top}f(x^{+}),\ldots,u_{k}-f(x)^{\top}f(x^{+})), noting that u1:ku_{1:k} are conditionally IID given (x,x+)(x,x^{+}) in the UCL setting and conditionally IID given (x,x+,y)(x,x^{+},y) in the SCL setting, and verifying that the conditions of Corollary 1 hold.

For clarity, we provide a detailed proof of the inequality LHSCL(k)(f)LSCL(k)(f)L^{(k)}_{HSCL}(f)\geq L^{(k)}_{SCL}(f). The detailed proof of the inequality LHUCL(k)(f)LUCL(k)(f)L^{(k)}_{HUCL}(f)\geq L^{(k)}_{UCL}(f) parallels that for the (more intricate) supervised setting and is omitted.

LHSCL(k)(f)=\displaystyle L^{(k)}_{HSCL}(f)=
=𝔼(x,x+,y)[𝔼x1:kpHSCL(x1:k|x,x+,y,f)[ψk(f(x)(f(x1)f(x+),,f(x)(f(xk)f(x+)))]]\displaystyle={\mathbb{E}}_{(x,x^{+},y)}\Big{[}{\mathbb{E}}_{x^{-}_{1:k}\sim p^{-}_{HSCL}(x^{-}_{1:k}|x,x^{+},y,f)}\Big{[}\psi_{k}(f(x)^{\top}(f(x^{-}_{1})-f(x^{+}),\ldots,f(x)^{\top}(f(x^{-}_{k})-f(x^{+})))\Big{]}\Big{]}
=𝔼(x,x+,y)[𝔼x1:kIIDpSCL(|y)[ψk(f(x)(f(x1)f(x+)),,f(x)(f(xk)f(x+)))\displaystyle={\mathbb{E}}_{(x,x^{+},y)}\bigg{[}{\mathbb{E}}_{x^{-}_{1:k}\sim\,\mathrm{IID}\,p^{-}_{SCL}(\cdot|y)}\bigg{[}\psi_{k}(f(x)^{\top}(f(x^{-}_{1})-f(x^{+})),\ldots,f(x)^{\top}(f(x^{-}_{k})-f(x^{+})))\ \cdot
η(f(x)f(x1),,f(x)f(xk))γ(x,y,f)]]\displaystyle\hskip 213.12416pt\frac{\eta(f(x)^{\top}f(x^{-}_{1}),\ldots,f(x)^{\top}f(x^{-}_{k}))}{\gamma(x,y,f)}\bigg{]}\bigg{]} (8)
𝔼(x,x+,y)[𝔼x1:kIIDpSCL(|y)[ψk(f(x)(f(x1)f(x+)),,f(x)(f(xk)f(x+)))]\displaystyle\geq{\mathbb{E}}_{(x,x^{+},y)}\Bigg{[}{\mathbb{E}}_{x^{-}_{1:k}\sim\,\mathrm{IID}\,p^{-}_{SCL}(\cdot|y)}\Big{[}\psi_{k}(f(x)^{\top}(f(x^{-}_{1})-f(x^{+})),\ldots,f(x)^{\top}(f(x^{-}_{k})-f(x^{+})))\Big{]}\ \cdot
𝔼x1:kIIDpSCL(|y)[η(f(x)f(x1),,f(x)f(xk))]γ(x,y,f)]\displaystyle\hskip 135.62447pt\frac{{\mathbb{E}}_{x^{-}_{1:k}\sim\,\mathrm{IID}\,p^{-}_{SCL}(\cdot|y)}\Big{[}\eta(f(x)^{\top}f(x^{-}_{1}),\ldots,f(x)^{\top}f(x^{-}_{k}))\Big{]}}{\gamma(x,y,f)}\Bigg{]} (9)
=𝔼(x,x+,y)[𝔼x1:kIIDpSCL(|y)[ψk(f(x)(f(x1)f(x+)),,f(x)(f(xk)f(x+)))]]\displaystyle={\mathbb{E}}_{(x,x^{+},y)}\Big{[}{\mathbb{E}}_{x^{-}_{1:k}\sim\,\mathrm{IID}\,p^{-}_{SCL}(\cdot|y)}\Big{[}\psi_{k}(f(x)^{\top}(f(x^{-}_{1})-f(x^{+})),\ldots,f(x)^{\top}(f(x^{-}_{k})-f(x^{+})))\Big{]}\Big{]} (10)
=LSCL(k)(f)\displaystyle=L^{(k)}_{SCL}(f)

where (8) follows from (5), which defines p^{-}_{HSCL}, (9) follows from the application of the Harris-inequality (see Lemma 1) to the inner expectation with x and x^{+} held fixed, and (10) follows from the definition of \gamma(x,y,f). ∎

4.2 Lower bound for SCL loss and Neural-Collapse

Consider the SCL model with anchor, positive, and kk negative samples generated as described in Sec. 3.1. Within this setting, we have the following lower bound for the SCL loss and conditions for equality.

Theorem 2 (Lower bound for SCL loss and conditions for equality with unit-ball representations and equiprobable classes).

In the SCL model with anchor, positive, and negative samples generated as described in (2) and (3), let (a) λy=1C\lambda_{y}=\tfrac{1}{C} for all y𝒴y\in{\mathcal{Y}} (equiprobable classes), (b) 𝒵={zd𝒵:z1}{\mathcal{Z}}=\{z\in{\mathbb{R}}^{d_{{\mathcal{Z}}}}:||z||\leq 1\} (unit-ball representations), and (c) the anchor, positive and negative samples have a common conditional probability distribution s(|)s(\cdot|\cdot) within each class given their respective labels. If ψk\psi_{k} is a convex function that is also argument-wise non-decreasing throughout k{\mathbb{R}}^{k}, then for all f:𝒳𝒵f:{\mathcal{X}}\rightarrow{\mathcal{Z}},

LSCL(k)(f)ψk(C(C1),,C(C1)).\displaystyle L^{(k)}_{SCL}(f)\geq\psi_{k}\left(\tfrac{-C}{(C-1)},\ldots,\tfrac{-C}{(C-1)}\right). (11)

For a given ff\in{\mathcal{F}} and all y𝒴y\in{\mathcal{Y}}, let

μy:=𝔼xs(x|y)[f(x)].\displaystyle\mu_{y}:={\mathbb{E}}_{x\sim s(x|y)}[f(x)]. (12)

If a given ff\in{\mathcal{F}} satisfies the following additional condition:

Equal inner-product class means: j,𝒴:j,μjμ=1C1,\displaystyle\mbox{Equal inner-product class means: }\forall j,\ell\in{\mathcal{Y}}:j\neq\ell,\quad\mu_{j}^{\top}\mu_{\ell}=\tfrac{-1}{C-1}, (13)

then equality will hold in (11), i.e., additional condition (13) is sufficient for equality in (11). Additional condition (13) also implies the following properties:

  1. (i)

    zero-sum class means: j𝒴μj=0\sum_{j\in{\mathcal{Y}}}\mu_{j}=0,

  2. (ii)

    unit-norm class means: j𝒴,μj=1\forall j\in{\mathcal{Y}},\|\mu_{j}\|=1,

  3. (iii)

    d𝒵C1d_{\mathcal{Z}}\geq C-1.

  4. (iv)

    zero within-class variance: for all j𝒴j\in{\mathcal{Y}} and all i=1:ki=1:k, Pr(f(x)=f(x+)=μj|y=j)=1=Pr(f(xi)=μj|yi=j)\mathrm{Pr}(f(x)=f(x^{+})=\mu_{j}|y=j)=1=\mathrm{Pr}(f(x^{-}_{i})=\mu_{j}|y^{-}_{i}=j), and

  5. (v)

    The support sets of s(|y)s(\cdot|y) for all y𝒴y\in{\mathcal{Y}} must be disjoint and the anchor, positive and negative samples must share a common deterministic labeling function defined by the support sets.

  6. (vi)

    equality of HSCL and SCL losses:

    LHSCL(k)(f)=LSCL(k)(f)=ψk(C(C1),,C(C1)).L^{(k)}_{HSCL}(f)=L^{(k)}_{SCL}(f)=\psi_{k}\left(\tfrac{-C}{(C-1)},\ldots,\tfrac{-C}{(C-1)}\right).

If ψk\psi_{k} is a strictly convex function that is also argument-wise strictly increasing throughout k{\mathbb{R}}^{k}, then additional condition (13) is also necessary for equality to hold in (11).

Proof We have

\displaystyle L^{(k)}_{SCL}(f)={\mathbb{E}}_{x,x^{+},x^{-}_{1:k}}[\psi_{k}(f(x)^{\top}(f(x_{1}^{-})-f(x^{+})),\ldots,f(x)^{\top}(f(x_{k}^{-})-f(x^{+})))]
𝔼x,x+,x1:k[ψk(f(x)f(x1)1,,f(x)f(xk)1)]\displaystyle\geq{\mathbb{E}}_{x,x^{+},x^{-}_{1:k}}[\psi_{k}(f(x)^{\top}f(x_{1}^{-})-1,\ldots,f(x)^{\top}f(x_{k}^{-})-1)] (14)
ψk(𝔼x,x1[f(x)f(x1)]1,,𝔼x,xk[f(x)f(xk)]1)\displaystyle\geq\psi_{k}({\mathbb{E}}_{x,x^{-}_{1}}[f(x)^{\top}f(x_{1}^{-})]-1,\ldots,{\mathbb{E}}_{x,x^{-}_{k}}[f(x)^{\top}f(x_{k}^{-})]-1) (15)
=ψk(𝔼y,y1[𝔼x,x1[f(x)f(x1)|y,y1]]1,,𝔼y,yk[𝔼x,xk[f(x)f(xk)|y,yk]]1)\displaystyle=\psi_{k}({\mathbb{E}}_{y,y^{-}_{1}}[{\mathbb{E}}_{x,x^{-}_{1}}[f(x)^{\top}f(x_{1}^{-})|y,y^{-}_{1}]]-1,\ldots,{\mathbb{E}}_{y,y^{-}_{k}}[{\mathbb{E}}_{x,x^{-}_{k}}[f(x)^{\top}f(x_{k}^{-})|y,y^{-}_{k}]]-1) (16)
=ψk(j,𝒴,jμjμC(C1)1,,j,𝒴,jμjμC(C1)1)\displaystyle=\psi_{k}\Big{(}\sum_{j,\ell\in{\mathcal{Y}},j\neq\ell}\tfrac{\mu_{j}^{\top}\mu_{\ell}}{C(C-1)}-1,\ldots,\sum_{j,\ell\in{\mathcal{Y}},j\neq\ell}\tfrac{\mu_{j}^{\top}\mu_{\ell}}{C(C-1)}-1\Big{)} (17)
=ψk(j𝒴μj2j𝒴μj2C(C1)1,,j𝒴μj2j𝒴μj2C(C1)1)\displaystyle=\psi_{k}\Big{(}\tfrac{||\sum_{j\in{\mathcal{Y}}}\mu_{j}||^{2}-\sum_{j\in{\mathcal{Y}}}\|\mu_{j}\|^{2}}{C(C-1)}-1,\ldots,\tfrac{||\sum_{j\in{\mathcal{Y}}}\mu_{j}||^{2}-\sum_{j\in{\mathcal{Y}}}\|\mu_{j}\|^{2}}{C(C-1)}-1\Big{)} (18)
ψk(0CC(C1)1,,0CC(C1)1)\displaystyle\geq\psi_{k}\Big{(}\tfrac{0-C}{C(C-1)}-1,\ldots,\tfrac{0-C}{C(C-1)}-1\Big{)} (19)
=ψk(C(C1),,C(C1))\displaystyle=\psi_{k}\left(\tfrac{-C}{(C-1)},\ldots,\tfrac{-C}{(C-1)}\right) (20)

which is the lower bound in (11). Inequality (14) is because ψk\psi_{k} is argument-wise non-decreasing and f(x)f(x+)1f(x)^{\top}f(x^{+})\leq 1 by the Cauchy-Schwartz inequality since f(x),f(x+)1\|f(x)\|,\|f(x^{+})\|\leq 1 (unit-ball representations). Inequality (15) is Jensen’s inequality applied to the convex function ψk\psi_{k}. Equality (16) is due to the law of iterated expectations. Equality (17) follows from (2), (3), and the assumption of equiprobable classes. Equality (18) is because j,𝒴μjμ=j,𝒴,jμjμ+j𝒴μj2\sum_{j,\ell\in{\mathcal{Y}}}\mu^{\top}_{j}\mu_{\ell}=\sum_{j,\ell\in{\mathcal{Y}},j\neq\ell}\mu^{\top}_{j}\mu_{\ell}+\sum_{j\in{\mathcal{Y}}}||\mu_{j}||^{2}. Inequality (19) is because ψk\psi_{k} is argument-wise non-decreasing and the smallest possible value of j𝒴μj2||\sum_{j\in{\mathcal{Y}}}\mu_{j}||^{2} is zero and the largest possible value of μj2||\mu_{j}||^{2} is one (unit-ball representations): Jensen’s inequality for the strictly convex function 2\|\cdot\|^{2} together with f(x)21\|f(x)\|^{2}\leq 1 (unit-ball representations) imply that for all y𝒴y\in{\mathcal{Y}}, we have

μy2=𝔼xs(x|y)[f(x)]2𝔼xs(x|y)[f(x)2]1.\displaystyle\|\mu_{y}\|^{2}=\|{\mathbb{E}}_{x\sim s(x|y)}[f(x)]\|^{2}\leq{\mathbb{E}}_{x\sim s(x|y)}[\|f(x)\|^{2}]\leq 1. (21)

Finally the equality (20) is because 0CC(C1)1=C(C1)\tfrac{0-C}{C(C-1)}-1=\tfrac{-C}{(C-1)}.

(i)(i) and (ii)(ii) Proof that additional condition (13) implies zero-sum and unit-norm class means: Inequality (21) together with condition (13) implies that

0j𝒴μy2=j,𝒴μjμ=j,𝒴,jμjμ=1C1+j𝒴μj21C(C1)C1+j𝒴1=C+C=0.\displaystyle 0\leq\|\sum_{j\in{\mathcal{Y}}}\mu_{y}\|^{2}=\sum_{j,\ell\in{\mathcal{Y}}}\mu_{j}^{\top}\mu_{\ell}=\sum_{j,\ell\in{\mathcal{Y}},j\neq\ell}\underbrace{\mu_{j}^{\top}\mu_{\ell}}_{=\frac{-1}{C-1}}+\sum_{j\in{\mathcal{Y}}}\underbrace{\|\mu_{j}\|^{2}}_{\leq 1}\leq-\frac{C(C-1)}{C-1}+\sum_{j\in{\mathcal{Y}}}1=-C+C=0.

Thus j𝒴μy2=0\|\sum_{j\in{\mathcal{Y}}}\mu_{y}\|^{2}=0 and for all j𝒴j\in{\mathcal{Y}}, μj2=1\|\mu_{j}\|^{2}=1.

(iii)(iii) Let M:=[μ1,,μC]d𝒵×CM:=[\mu_{1},\ldots,\mu_{C}]\in{\mathbb{R}}^{d_{\mathcal{Z}}\times C}. Then from (13) and (ii)(ii), the gram matrix MM=CC1IC1C1𝟙C𝟙CM^{\top}M=\tfrac{C}{C-1}I_{C}-\tfrac{1}{C-1}\mathbbm{1}_{C}\mathbbm{1}_{C}^{\top} where ICI_{C} is the C×CC\times C identity matrix and 𝟙C\mathbbm{1}_{C} is the C×1C\times 1 column vector of all ones. From this it follows that MMM^{\top}M has one eigenvalue of zero corresponding to eigenvector 𝟙C\mathbbm{1}_{C} and C1C-1 eigenvalues all equal to CC1\tfrac{C}{C-1} corresponding to (C1)(C-1) orthogonal eigenvectors spanning the orthogonal complement of 𝟙C\mathbbm{1}_{C}. Thus, MM has C1C-1 nonzero singular values all equal to CC1\sqrt{\frac{C}{C-1}} and a rank equal to C1d𝒵C-1\leq d_{\mathcal{Z}}.

(iv)(iv) Proof that additional condition (13) implies zero within-class variance: We just proved that additional condition (13) together with the unit-ball representation constraint implies unit-norm class means. This, together with (21) implies that for all y𝒴y\in{\mathcal{Y}},

1=μy2=𝔼xs(x|y)[f(x)]2𝔼xs(x|y)[f(x)2]1.\displaystyle 1=\|\mu_{y}\|^{2}=\|{\mathbb{E}}_{x\sim s(x|y)}[f(x)]\|^{2}\leq{\mathbb{E}}_{x\sim s(x|y)}[\|f(x)\|^{2}]\leq 1. (22)

This implies that we have equality in Jensen’s inequality, which can occur iff with probability one given yy, we have f(x)=μyf(x)=\mu_{y} (since 2\|\cdot\|^{2} is strictly convex). Since the anchor, positive and negative samples all have a common conditional probability distribution s(|)s(\cdot|\cdot) within each class given their respective labels, it follows that for all j𝒴j\in{\mathcal{Y}} and all i=1:ki=1:k, Pr(f(x)=μj|y=j)=Pr(f(x+)=μj|y=j)=Pr(f(xi)=μj|yi=j)=1\mathrm{Pr}(f(x)=\mu_{j}|y=j)=\mathrm{Pr}(f(x^{+})=\mu_{j}|y=j)=\mathrm{Pr}(f(x^{-}_{i})=\mu_{j}|y^{-}_{i}=j)=1. Moreover, since the anchor and positive samples have the same label, for all j𝒴j\in{\mathcal{Y}}, with probability one given y=jy=j, we have f(x)=f(x+)=μjf(x)=f(x^{+})=\mu_{j}.

Proof that additional condition (13) is sufficient for equality to hold in (11): From the proofs of (i),(ii)(i),(ii), and (iv)(iv) above, if additional condition (13) holds, then we showed that with probability one given y=jy=j we have f(x)=f(x+)=μjf(x)=f(x^{+})=\mu_{j} (see the para below (22)). This equality of f(x)f(x) and f(x+)f(x^{+}) is a conditional equality given the class. Since this is true for all classes, it implies equality (with probability one) of f(x)f(x) and f(x+)f(x^{+}) without conditioning on the class:

Pr(f(x)=f(x+))=j𝒴Pr(f(x)=f(x+)|y=j)C=1.\displaystyle\mathrm{Pr}(f(x)=f(x^{+}))=\sum_{j\in{\mathcal{Y}}}\tfrac{\mathrm{Pr}(f(x)=f(x^{+})|y=j)}{C}=1. (23)

From (23) we get

Pr(f(x)f(x+)=1)=Pr(f(x)2=1)=j𝒴Pr(f(x)2=1|y=j)C=1,\displaystyle\mathrm{Pr}(f(x)^{\top}f(x^{+})=1)=\mathrm{Pr}(||f(x)||^{2}=1)=\sum_{j\in{\mathcal{Y}}}\tfrac{\mathrm{Pr}(||f(x)||^{2}=1|y=j)}{C}=1, (24)

since f(x)=f(x+)=μjf(x)=f(x^{+})=\mu_{j} with probability one given y=jy=j and μj2=1\|\mu_{j}\|^{2}=1. Equality in (14) then follows from (23) and (24). Moreover, due to zero within-class variance we will have

with probability one, for all i=1:k,f(x)f(xi)=μyμyi=1C1,\displaystyle\mbox{with probability one, for all }i=1:k,f(x)^{\top}f(x^{-}_{i})=\mu_{y}^{\top}\mu_{y^{-}_{i}}=\tfrac{-1}{C-1}, (25)

and then we will have equality in (15) and (19). Therefore additional condition (13) is a sufficient condition for equality to hold in (11).

(v)(v) Proof that support sets of s(|y),y𝒴s(\cdot|y),y\in{\mathcal{Y}} are disjoint: From part (iv)(iv), all samples belonging to the support set of s(|y),y𝒴s(\cdot|y),y\in{\mathcal{Y}}, are mapped to μy\mu_{y} by ff. From part (ii)(ii) and condition (13), distinct labels have distinct representation means: for all y,y𝒴y,y^{\prime}\in{\mathcal{Y}}, if yyy^{\prime}\neq y, then μyμy\mu_{y}\neq\mu_{y^{\prime}}. Therefore the support sets of s(|y)s(\cdot|y) for all y𝒴y\in{\mathcal{Y}} must be disjoint. Since the anchor, positive, and negative samples all share a common conditional probability distribution s(|)s(\cdot|\cdot) and the same marginal label distribution λ\lambda, it follows that they share a common conditional distribution of label given sample (labeling function). Since the support sets of s(|y)s(\cdot|y) for all y𝒴y\in{\mathcal{Y}} are disjoint, the labeling function is deterministic and is defined by the support set to which a sample belongs.

(vi)(vi) Proof of equality of HSCL and SCL losses under additional condition (13): under the equal inner-product class means condition, with probability one f(x)f(xi)=1C1f(x)^{\top}f(x^{-}_{i})=\tfrac{-1}{C-1} simultaneously for all i=1:ki=1:k and η(f(x)f(x1),,f(x)f(xk))=η(1C1,,1C1)\eta(f(x)^{\top}f(x^{-}_{1}),\ldots,f(x)^{\top}f(x^{-}_{k}))=\eta(\tfrac{-1}{C-1},\ldots,\tfrac{-1}{C-1}), a constant. Consequently, for all x,yx,y and the given ff, we must have γ(x,y,f)=η(1C1,,1C1)\gamma(x,y,f)=\eta(\tfrac{-1}{C-1},\ldots,\tfrac{-1}{C-1}) which would imply that (see Equation 5) pHSCL(x1:k|x,x+,y,f)=i=1kpSCL(xi|y)p^{-}_{HSCL}(x^{-}_{1:k}|x,x^{+},y,f)=\prod_{i=1}^{k}p^{-}_{SCL}(x^{-}_{i}|y) and therefore LHSCL(k)(f)=LSCL(k)(f)=ψk(C(C1),,C(C1))L^{(k)}_{HSCL}(f)=L^{(k)}_{SCL}(f)=\psi_{k}\big{(}\tfrac{-C}{(C-1)},\ldots,\tfrac{-C}{(C-1)}\big{)} where the last equality is because additional condition (13) is sufficient for equality to hold in (11).

Proof that additional condition (13) is necessary for equality in (11) if ψk\psi_{k} is strictly convex and argument-wise strictly increasing over k{\mathbb{R}}^{k}: If equality holds in (11), then it must also hold in (14), (15), and (19). If ψk\psi_{k} is argument-wise strictly increasing, then equality in (19) can only occur if all class means have unit norms. Then, from (22) and the reasoning in the paragraph below it, we would have zero within-class variance and equations (23) and (24). This would imply equality in (14). If ψk\psi_{k} is strictly convex then equality in (15), which is Jensen’s inequality, can only occur if for all i=1:ki=1:k, Pr(f(x)f(xi)=βi)=1\mathrm{Pr}(f^{\top}(x)f(x^{-}_{i})=\beta_{i})=1 for some constants β1:k\beta_{1:k}. Since (x,xi)(x,x^{-}_{i}) has the same distribution for all ii, it follows that for all ii, βi=β\beta_{i}=\beta for some constant β\beta. Since we have already proved zero within-class variance and the labels of negative samples are always distinct from that of the anchor, it follows that for all jj\neq\ell, we must have μjμ=β\mu_{j}^{\top}\mu_{\ell}=\beta. Since we have equality in (14) and ψk\psi_{k} is argument-wise strictly increasing, we must have β=1C1\beta=\tfrac{-1}{C-1} which implies that additional condition (13) must hold (it is a necessary condition).  

Remark 1.

We note that Theorem 2 also holds if we have unit-sphere representations (a stronger constraint) as opposed to unit-ball representations, i.e., if 𝒵={zd𝒵:z=1}{\mathcal{Z}}=\{z\in{\mathbb{R}}^{d_{{\mathcal{Z}}}}:\|z\|=1\}: the lower bound (11) holds since the unit sphere is a subset of the unit ball and equality can be attained with unit-sphere representations in Theorem 2.

Remark 2.

Interestingly, we note that inequality (19) and therefore the lower bound of Theorem 2 also holds if we replace the unit-ball constraint on representations f(x)1\|f(x)\|\leq 1 with the weaker requirement 1Cj=1Cμj21\frac{1}{C}\sum_{j=1}^{C}\|\mu_{j}\|^{2}\leq 1.

Definition 5 (ETF).

The equal inner-product, zero-sum, and unit-norm conditions on class means in Theorem 2 define a (normalized) Equiangular Tight Frame (ETF) (see [18]).

Definition 6 (CL Neural-Collapse (NC)).

We will say representation map f()f(\cdot) exhibits CL Neural-Collapse if it has zero within-class variance as in condition (iv)(iv) of Theorem 2 and the class means in representation space form a normalized ETF as in Definition 5.

Remark 3.

The term “Neural-Collapse” was originally used for representation mappings implemented by deep classifier neural networks (see [21]). However, here we use the term more broadly for any family of representation mappings and within the context of CL instead of classifier training.

The following corollary is a partial restatement of Theorem 2 in terms of CL Neural-Collapse:

Corollary 2.

Under the conditions of Theorem 2, equality in (11) is attained by any representation map ff that exhibits CL Neural-Collapse. Moreover, if ψk\psi_{k} is strictly convex and argument-wise strictly increasing over k{\mathbb{R}}^{k}, then equality in (11) is attained by a representation map ff, if, and only if, it exhibits CL Neural-Collapse.
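The ETF geometry of Theorem 2 is easy to verify numerically. The sketch below (ours) constructs the standard simplex-ETF class means for C classes, checks the zero-sum, unit-norm, and equal inner-product (-1/(C-1)) conditions, and evaluates the lower-bound value \psi_{k}(-C/(C-1),\ldots,-C/(C-1)) for the InfoNCE \psi_{k} with \alpha=1.

```python
import numpy as np

def simplex_etf_means(C):
    """C unit-norm, zero-sum class means with pairwise inner products -1/(C-1).

    Built by centering and normalizing the standard basis of R^C; the means
    span a (C-1)-dimensional subspace, consistent with d_Z >= C-1.
    """
    M = np.eye(C) - np.ones((C, C)) / C     # center the standard basis
    M /= np.linalg.norm(M, axis=0)          # normalize each column (class mean)
    return M                                # shape (C, C), columns are mu_1, ..., mu_C

def psi_infonce(t, alpha=1.0):
    m = max(np.max(t), 0.0)
    return m + np.log(alpha * np.exp(-m) + np.sum(np.exp(t - m)))

C, k = 4, 8
M = simplex_etf_means(C)
G = M.T @ M                                  # Gram matrix of the class means
print(np.allclose(M.sum(axis=1), 0))         # zero-sum class means
print(np.allclose(np.diag(G), 1))            # unit-norm class means
off = G[~np.eye(C, dtype=bool)]
print(np.allclose(off, -1 / (C - 1)))        # equal inner-product condition (13)
print(np.linalg.matrix_rank(M))              # C - 1

# lower bound of Theorem 2, attained by an NC representation (all samples at class means):
print(psi_infonce(np.full(k, -C / (C - 1)))) # psi_k(-C/(C-1), ..., -C/(C-1))
```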

4.3 Empirical and batched empirical SCL losses

Empirical SCL loss: Theorem 2 also holds for empirical SCL loss because simple averages over samples can be expressed as expectations with suitable uniform distributions over the samples. If the family of representation mappings {\mathcal{F}} has sufficiently high capacity (e.g., the family of mappings implemented by a sufficiently deep and wide feed-forward neural network) and y𝒴\forall y\in\mathcal{Y}, s(|y)s(\cdot|y) is a discrete probability mass function (pmf) over a finite set (e.g., uniform pmf over training samples within each class) with support-sets that are disjoint across different classes, then the equal inner-product condition (13) in Theorem 2 can be satisfied for a suitable ff in the family. If either convexity or monotonicity of ψ\psi is not strict, e.g., ψ(t)=max{t+α,0}\psi(t)=\max\{t+\alpha,0\}, then it may be possible for a representation map ff to attain the lower bound without exhibiting CL Neural-Collapse.

Batched empirical SCL loss: We will now show that representations that exhibit CL Neural-Collapse will also minimize batched empirical SCL loss under the conditions of Theorem 2. Here, the full data set has balanced classes (equal number of samples in each class) but is partitioned into BB disjoint nonempty batches of possibly unequal size. Let bb denote the batch index and nbn_{b} the number of samples in batch bb. Let 𝔼x,x+,x1:k|b[]{\mathbb{E}}_{x,x^{+},x^{-}_{1:k}|b}[\cdot] denote the empirical SCL loss in batch bb and

LSCL(k,b)(f):=b=1B1nb𝔼x,x+,x1:k|b[ψk(f(x)(f(x1)f(x+)),,f(x)(f(xk)f(x+)))]L_{SCL}^{{(k,b)}}(f):=\sum_{b=1}^{B}\frac{1}{n_{b}}{\mathbb{E}}_{x,x^{+},x^{-}_{1:k}|b}\Big{[}\psi_{k}(f(x)^{\top}(f(x_{1}^{-})-f(x^{+})),\ldots,f(x)^{\top}(f(x_{k}^{-})-f(x^{+})))\Big{]}

the overall batched empirical SCL loss. Note that in a given batch the data may not be balanced across classes and therefore we cannot simply use Theorem 2, which assumes balanced classes, to deduce the optimality of CL Neural-Collapse representations.

We lower bound the batched empirical SCL loss as follows:

LSCL(k,b)(f)\displaystyle L_{SCL}^{{(k,b)}}(f) =b=1B1nb𝔼x,x+,x1:k|b[ψk(f(x)(f(x1)f(x+)),,f(x)(f(xk)f(x+)))]\displaystyle=\sum_{b=1}^{B}\frac{1}{n_{b}}{\mathbb{E}}_{x,x^{+},x^{-}_{1:k}|b}\Big{[}\psi_{k}(f(x)^{\top}(f(x_{1}^{-})-f(x^{+})),\ldots,f(x)^{\top}(f(x_{k}^{-})-f(x^{+})))\Big{]}
b=1B1nb𝔼x,x+,x1:k|b[ψk(f(x)f(x1)1,,f(x)f(xk)1)]\displaystyle\geq\sum_{b=1}^{B}\frac{1}{n_{b}}{\mathbb{E}}_{x,x^{+},x^{-}_{1:k}|b}\Big{[}\psi_{k}(f(x)^{\top}f(x_{1}^{-})-1,\ldots,f(x)^{\top}f(x_{k}^{-})-1)\Big{]} (26)
ψk(b=1B1nb𝔼x,x1|b[f(x)f(x1)]1,,b=1B1nb𝔼x,xk[f(x)f(xk)]1)\displaystyle\geq\psi_{k}\Bigg{(}\sum_{b=1}^{B}\frac{1}{n_{b}}{\mathbb{E}}_{x,x^{-}_{1}|b}[f(x)^{\top}f(x_{1}^{-})]-1,\ldots,\sum_{b=1}^{B}\frac{1}{n_{b}}{\mathbb{E}}_{x,x^{-}_{k}}[f(x)^{\top}f(x_{k}^{-})]-1\Bigg{)} (27)
=ψk(𝔼x,x1[f(x)f(x1)]1,,𝔼x,xk[f(x)f(xk)]1)\displaystyle=\psi_{k}\Big{(}{\mathbb{E}}_{x,x^{-}_{1}}[f(x)^{\top}f(x_{1}^{-})]-1,\ldots,{\mathbb{E}}_{x,x^{-}_{k}}[f(x)^{\top}f(x_{k}^{-})]-1\Big{)} (28)
ψk(C(C1),,C(C1))\displaystyle\geq\psi_{k}\left(\tfrac{-C}{(C-1)},\ldots,\tfrac{-C}{(C-1)}\right) (29)

where inequality (26) holds for the same reason as in (14), inequality (27) is Jensen’s inequality applied to ψk\psi_{k} which is convex, and equality (28) is due to the law of iterated (empirical) expectation. The right side of (27) is precisely the right side of (15) and therefore (29) follows from (15) – (20). From the above analysis it follows that the arguments used to prove Theorem 2 can be applied again to prove that the conclusions of Theorem 2 and Corollary 2 also hold for the batched empirical SCL loss.

4.4 Lower bound for UCL loss with latent labels and Neural-Collapse

Consider the UCL model with anchor, positive, and kk negative samples generated as described in Sec. 3.1. Within this setting, we have the following lower bound for the UCL loss and conditions for equality.

Theorem 3 (Lower bound for UCL loss with latent labels and conditions for equality with unit-ball representations and equiprobable classes).

In the UCL model with anchor, positive, and negative samples generated as described in (4), let (a) \lambda_{y}=\tfrac{1}{C} for all y\in{\mathcal{Y}} (equiprobable classes), (b) {\mathcal{Z}}=\{z\in{\mathbb{R}}^{d_{{\mathcal{Z}}}}:||z||\leq 1\} (unit-ball representations), (c) the anchor, positive and negative samples have a common conditional probability distribution s(\cdot|\cdot) within each latent class given their respective labels, (d) r=\lambda in (4), and (e) the anchor and positive samples are conditionally independent given their common label, i.e., q(x,x^{+}|y)=s(x|y)s(x^{+}|y) in (4) (as discussed in Sec. 2, all existing works that conduct a theoretical analysis of UCL make this assumption). If \psi_{k} is a convex function that is also argument-wise non-decreasing throughout {\mathbb{R}}^{k}, then for all f:{\mathcal{X}}\rightarrow{\mathcal{Z}},

LUCL(k)(f)1Ck+1y,y1:k𝒴ψk(C 1(y1y)(C1),,C 1(yky)(C1))\displaystyle L^{(k)}_{UCL}(f)\geq\frac{1}{C^{k+1}}\sum_{y,y^{-}_{1:k}\in{\mathcal{Y}}}\psi_{k}\left(\tfrac{-C\,1(y^{-}_{1}\neq y)}{(C-1)},\ldots,\tfrac{-C\,1(y^{-}_{k}\neq y)}{(C-1)}\right) (30)

where 1()1(\cdot) is the indicator function. For a given ff\in{\mathcal{F}} and all y𝒴y\in{\mathcal{Y}}, let μy\mu_{y} be as defined in (12). If a given ff\in{\mathcal{F}} satisfies additional condition (13), then equality will hold in (30), i.e., additional condition (13) is sufficient for equality in (30). Additional condition (13) also implies the following properties:

  1. (i)

    zero-sum class means: j𝒴μj=0\sum_{j\in{\mathcal{Y}}}\mu_{j}=0,

  2. (ii)

    unit-norm class means: j𝒴,μj=1\forall j\in{\mathcal{Y}},\|\mu_{j}\|=1,

  3. (iii)

    d𝒵C1d_{\mathcal{Z}}\geq C-1.

  4. (iv)

    zero within-class variance: for all j𝒴j\in{\mathcal{Y}} and all i=1:ki=1:k, Pr(f(x)=f(x+)=μj|y=j)=1=Pr(f(xi)=μj|yi=j)\mathrm{Pr}(f(x)=f(x^{+})=\mu_{j}|y=j)=1=\mathrm{Pr}(f(x^{-}_{i})=\mu_{j}|y^{-}_{i}=j), and

  5. (v)

    The support sets of s(|y)s(\cdot|y) for all y𝒴y\in{\mathcal{Y}} must be disjoint and the anchor, positive and negative samples must share a common deterministic (latent) labeling function defined by the support sets.

If ψk\psi_{k} is a strictly convex function that is also argument-wise strictly increasing throughout k{\mathbb{R}}^{k}, then additional condition (13) is also necessary for equality to hold in (30).

Proof  For i=1:ki=1:k, we define the following indicator random variables bi:=1(yiy)b_{i}:=1(y_{i}^{-}\neq y) and note that for all i=1:ki=1:k, bib_{i} is a deterministic function of (y,yi)(y,y^{-}_{i}). Since y{y1:k}y\perp\!\!\!\perp\{y^{-}_{1:k}\} and y1:kIID Uniform(𝒴)y^{-}_{1:k}\sim\mbox{IID Uniform}({\mathcal{Y}}), it follows that b1:kIIDb_{1:k}\sim\mathrm{IID} and independent of yy. We then have the following sequence of inequalities:

LUCL(k)(f)=𝔼x,x+,x1:k[ψk(f(x)(f(x1)f(x+)),,f(x)(f(xk)f(x+)))]\displaystyle L^{(k)}_{UCL}(f)={\mathbb{E}}_{x,x^{+},x^{-}_{1:k}}\big{[}\psi_{k}\big{(}f(x)^{\top}(f(x_{1}^{-})-f(x^{+})),\ldots,f(x)^{\top}(f(x_{k}^{-})-f(x^{+}))\big{)}\big{]}
𝔼y,y1:k[ψk(𝔼x,x1s(x|y)s(x1|y1)[f(x)f(x1)]𝔼x,x+s(x|y)s(x+|y)[f(x)f(x+)],,\displaystyle\geq{\mathbb{E}}_{y,y^{-}_{1:k}}\big{[}\psi_{k}({\mathbb{E}}_{x,x^{-}_{1}\sim s(x|y)s(x^{-}_{1}|y^{-}_{1})}[f(x)^{\top}f(x_{1}^{-})\big{]}-{\mathbb{E}}_{x,x^{+}\sim s(x|y)s(x^{+}|y)}[f(x)^{\top}f(x^{+})\big{]},\ldots,
,𝔼x,xks(x|y)s(xk|yk)[f(x)f(xk)]𝔼x,x+s(x|y)s(x+|y)[f(x)f(x+)])]\displaystyle\hskip 43.05542pt\ldots,{\mathbb{E}}_{x,x^{-}_{k}\sim s(x|y)s(x^{-}_{k}|y^{-}_{k})}\big{[}f(x)^{\top}f(x_{k}^{-})\big{]}-{\mathbb{E}}_{x,x^{+}\sim s(x|y)s(x^{+}|y)}\big{[}f(x)^{\top}f(x^{+})\big{]})\big{]} (31)
=𝔼y,y1:k[ψk(μyμy1μyμy,,μyμykμyμy)]\displaystyle={\mathbb{E}}_{y,y^{-}_{1:k}}\big{[}\psi_{k}\big{(}\mu_{y}^{\top}\mu_{y^{-}_{1}}-\mu_{y}^{\top}\mu_{y},\ldots,\mu_{y}^{\top}\mu_{y^{-}_{k}}-\mu_{y}^{\top}\mu_{y}\big{)}\big{]} (32)
𝔼b1:k[ψk(𝔼[μyμy1μy2|b1:k],,𝔼[μyμykμy2|b1:k])]\displaystyle\geq{\mathbb{E}}_{b_{1:k}}\big{[}\psi_{k}\big{(}{\mathbb{E}}[\mu_{y}^{\top}\mu_{y^{-}_{1}}-||\mu_{y}||^{2}|b_{1:k}],\ldots,{\mathbb{E}}[\mu_{y}^{\top}\mu_{y^{-}_{k}}-||\mu_{y}||^{2}|b_{1:k}]\big{)}\big{]} (33)
=𝔼b1:k[ψk(𝔼[μyμy1μy2|b1],,𝔼[μyμykμy2|bk])]\displaystyle={\mathbb{E}}_{b_{1:k}}\big{[}\psi_{k}\big{(}{\mathbb{E}}[\mu_{y}^{\top}\mu_{y^{-}_{1}}-||\mu_{y}||^{2}|b_{1}],\ldots,{\mathbb{E}}[\mu_{y}^{\top}\mu_{y^{-}_{k}}-||\mu_{y}||^{2}|b_{k}]\big{)}\big{]} (34)
=𝔼b1:k[ψk(b1𝔼[μyμy1μy2|b1=1],,bk𝔼[μyμykμy2|bk=1])]\displaystyle={\mathbb{E}}_{b_{1:k}}\big{[}\psi_{k}\big{(}b_{1}{\mathbb{E}}[\mu_{y}^{\top}\mu_{y^{-}_{1}}-||\mu_{y}||^{2}|b_{1}=1],\ldots,b_{k}{\mathbb{E}}[\mu_{y}^{\top}\mu_{y^{-}_{k}}-||\mu_{y}||^{2}|b_{k}=1]\big{)}\big{]} (35)
𝔼b1:k[ψk(b1𝔼[μyμy11|b1=1],,bk𝔼[μyμyk1|bk=1])]\displaystyle\geq{\mathbb{E}}_{b_{1:k}}\big{[}\psi_{k}\big{(}b_{1}{\mathbb{E}}[\mu_{y}^{\top}\mu_{y^{-}_{1}}-1|b_{1}=1],\ldots,b_{k}{\mathbb{E}}[\mu_{y}^{\top}\mu_{y^{-}_{k}}-1|b_{k}=1]\big{)}\big{]} (36)
=𝔼b1:k[ψk(b1j(μjμ1)C(C1),,bkj(μjμ1)C(C1))]\displaystyle={\mathbb{E}}_{b_{1:k}}\big{[}\psi_{k}\big{(}\tfrac{b_{1}\sum_{\ell\neq j}(\mu_{j}^{\top}\mu_{\ell}-1)}{C(C-1)},\ldots,\tfrac{b_{k}\sum_{\ell\neq j}(\mu_{j}^{\top}\mu_{\ell}-1)}{C(C-1)}\big{)}\big{]} (37)
=𝔼b1:k[ψk(b1(jμj2jμj2C(C1))C(C1),,bk(jμj2jμj2C(C1))C(C1))]\displaystyle={\mathbb{E}}_{b_{1:k}}\big{[}\psi_{k}\big{(}\tfrac{b_{1}(\|\sum_{j}\mu_{j}\|^{2}-\sum_{j}\|\mu_{j}\|^{2}-C(C-1))}{C(C-1)},\ldots,\tfrac{b_{k}(\|\sum_{j}\mu_{j}\|^{2}-\sum_{j}\|\mu_{j}\|^{2}-C(C-1))}{C(C-1)}\big{)}\big{]} (38)
𝔼b1:k[ψk(b1(0CC(C1))C(C1),,bk(0CC(C1))C(C1))]\displaystyle\geq{\mathbb{E}}_{b_{1:k}}\big{[}\psi_{k}\big{(}\tfrac{b_{1}(0-C-C(C-1))}{C(C-1)},\ldots,\tfrac{b_{k}(0-C-C(C-1))}{C(C-1)}\big{)}\big{]} (39)
=𝔼b1:k[ψk(b1(0CC)C(C1),,bk(0CC)C(C1))]\displaystyle={\mathbb{E}}_{b_{1:k}}\big{[}\psi_{k}\big{(}\tfrac{b_{1}(0-C\cdot C)}{C(C-1)},\ldots,\tfrac{b_{k}(0-C\cdot C)}{C(C-1)}\big{)}\big{]}
=𝔼b1:k[ψk(Cb1(C1),,Cbk(C1))]\displaystyle={\mathbb{E}}_{b_{1:k}}\big{[}\psi_{k}\big{(}\tfrac{-Cb_{1}}{(C-1)},\ldots,\tfrac{-Cb_{k}}{(C-1)}\big{)}\big{]} (40)
=1Ck+1y,y1:k𝒴ψk(C1(y1y)(C1),,C1(yky)(C1))\displaystyle=\tfrac{1}{C^{k+1}}\sum_{y,y^{-}_{1:k}\in{\mathcal{Y}}}\psi_{k}\left(\tfrac{-C1(y^{-}_{1}\neq y)}{(C-1)},\ldots,\tfrac{-C1(y^{-}_{k}\neq y)}{(C-1)}\right) (41)

where the validity of each numbered step in the above sequence of inequalities is explained below.

Inequality (31) is Jensen’s inequality conditioned on y,y1:ky,y_{1:k} applied to the convex function ψk\psi_{k}. Equality (32) holds because for every ii, we have xx and xix^{-}_{i} are conditionally independent given yy and yiy^{-}_{i} (per the UCL model (4)), xx and x+x^{+} are conditionally independent given their common label (assumption (d) in the theorem statement), and the class means in representation space are as defined in (12). Inequality (33) is Jensen’s inequality conditioned on b1:kb_{1:k} applied to the convex function ψk\psi_{k}. Equality (34) holds because for all i=1:ki=1:k, (y,yi){b,i}|bi(y,y^{-}_{i})\perp\!\!\!\perp\{b_{\ell},\ell\neq i\}|b_{i}. Equality (35) holds because bib_{i} only takes values 0, 11 and if bi=0b_{i}=0, then μyi=μy\mu_{y^{-}_{i}}=\mu_{y} and μyμyiμy2=0\mu_{y}^{\top}\mu_{y^{-}_{i}}-\|\mu_{y}\|^{2}=0. Therefore the expressions to the right of the equality symbols in (34) and (35) match when bi=0b_{i}=0 and when bi=1b_{i}=1. Inequality (36) is because ψk\psi_{k} is non-decreasing and μy1||\mu_{y}||\leq 1 for all yy because all representations are in the unit closed ball. Equality (37) holds because for each i=1:ki=1:k, y,yiIID Uniform(𝒴)y,y^{-}_{i}\sim\mbox{IID Uniform}({\mathcal{Y}}) and yyiy\neq y^{-}_{i} when bi=1b_{i}=1. Equality (38) follows from elementary linear algebraic operations. Inequality (39) holds because ψk\psi_{k} is argument-wise non-decreasing, the smallest possible value for jμj2\|\sum_{j}\mu_{j}\|^{2} is zero and the largest possible value for μj2\|\mu_{j}\|^{2}, for all jj, is one. Equality (41) follows from the definition of the indicator variables in terms of y,y1:ky,y^{-}_{1:k} and because y,y1:ky,y^{-}_{1:k} are IID Uniform(𝒴)\mathrm{Uniform}({\mathcal{Y}}).

Similarly to the proof of Theorem 2, if additional condition (13) holds for some ff, then properties (i)(i)(v)(v) in Theorem 3 hold. Moreover, then (23), (24), and (25) also hold and then we will have equality in (31), (33), (36) and (39). Thus additional condition (13) is sufficient for equality to hold in (30).

Proof that additional condition (13) is necessary for equality in (30) if ψk\psi_{k} is strictly convex and argument-wise strictly increasing over k{\mathbb{R}}^{k}: If equality holds in (30), then it must also hold in (31), (33), (36), and (39). If ψk\psi_{k} is argument-wise strictly increasing, then equality in (39) can only occur if all class means have unit norms (which would also imply equality in (36)). Then, from (22) and the reasoning in the paragraph below it, we would have zero within-class variance and (24), which would imply that with probability one for all ii, f(x)f(xi)=μyμyif(x)^{\top}f(x^{-}_{i})=\mu_{y}^{\top}\mu_{y^{-}_{i}} and f(x)f(x+)=μy2f(x)^{\top}f(x^{+})=||\mu_{y}||^{2} which would imply equality in (31) as well. Equality in (33) together with strict convexity of ψk\psi_{k} and μy2=1||\mu_{y}||^{2}=1 would imply that with probability one, for all ii, given bib_{i} we must have μyμyi=\mu_{y}^{\top}\mu_{y^{-}_{i}}= some deterministic function of bib_{i} and if ψk\psi_{k} is also argument-wise strictly increasing, then this function must be such that μyμyi1=Cbi(C1)\mu_{y}^{\top}\mu_{y^{-}_{i}}-1=\tfrac{-Cb_{i}}{(C-1)} due to (40). This would imply that for all ii, we must have μyμyi=1Cbi(C1)=1C1(yiy)(C1)\mu_{y}^{\top}\mu_{y^{-}_{i}}=1-\tfrac{Cb_{i}}{(C-1)}=1-\tfrac{C1(y^{-}_{i}\neq y)}{(C-1)}. Thus for all yyiy\neq y^{-}_{i}, i.e., bi=1b_{i}=1 (and this has nonzero probability), μyμyi=1CC1=1C1\mu_{y}^{\top}\mu_{y^{-}_{i}}=1-\tfrac{C}{C-1}=\tfrac{-1}{C-1} which is the additional condition (13). Thus the additional condition (13) must hold (it is a necessary condition).  
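As a numerical sanity check of (30) (our own illustration), the right-hand side can be enumerated exactly for small C and k with the InfoNCE \psi_{k}, and compared against a Monte Carlo estimate of the UCL loss of an NC representation in which every sample is mapped to its latent-class mean on a simplex ETF; by the argument above, the two values should agree up to sampling error.

```python
import numpy as np
from itertools import product

def psi_infonce(t, alpha=1.0):
    m = max(np.max(t), 0.0)
    return m + np.log(alpha * np.exp(-m) + np.sum(np.exp(t - m)))

def simplex_etf_means(C):
    M = np.eye(C) - np.ones((C, C)) / C
    return M / np.linalg.norm(M, axis=0)

C, k = 3, 2
mu = simplex_etf_means(C)                     # columns are the class means

# exact right-hand side of (30): average of psi_k over all (y, y^-_{1:k}) in Y^{k+1}
rhs = np.mean([psi_infonce(np.array([-C * (yn != y) / (C - 1) for yn in ynegs]))
               for y, *ynegs in product(range(C), repeat=k + 1)])

# Monte Carlo UCL loss of an NC representation: f(x) = mu_y for every sample of class y
rng = np.random.default_rng(0)
losses = []
for _ in range(100_000):
    y = rng.integers(C)                       # anchor/positive latent label (equiprobable)
    ynegs = rng.integers(C, size=k)           # negative latent labels, IID uniform (r = lambda)
    z = z_pos = mu[:, y]
    t = mu[:, ynegs].T @ z - z @ z_pos        # z^T(z^-_i - z^+), i = 1..k
    losses.append(psi_infonce(t))
print(rhs, np.mean(losses))                   # should match up to Monte Carlo error
```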

Counterparts of Corollary 2 and results for empirical SCL loss and batched empirical SCL loss can be stated and derived for UCL. One important difference between results for SCL and UCL is that Theorem 3 is missing the counterpart of property (vi)(vi) in Theorem 2. Unlike in Theorem 2, we cannot assert that if the lower bound is attained, then we will have LHUCL(k)(f)=LUCL(k)(f)L^{(k)}_{HUCL}(f)=L^{(k)}_{UCL}(f). This is because in the UCL and HUCL settings, the negative sample can come from the same latent class as the anchor (latent class collision) with a positive probability (1C2\frac{1}{C^{2}}). Then for a representation ff that exhibits Neural-Collapse, we cannot conclude that with probability one we must have η(f(x)f(x1),,f(x)f(xk))=η(β,,β)\eta(f(x)^{\top}f(x^{-}_{1}),\ldots,f(x)^{\top}f(x^{-}_{k}))=\eta(\beta,\ldots,\beta), for some constant β\beta. Deriving a tight lower bound for HUCL and determining whether it can be attained iff there is Neural-Collapse in UCL (under suitable conditions), are open problems.

Neural-Collapse in SCL or UCL requires that the representation space dimension d𝒵C1d_{\mathcal{Z}}\geq C-1 (see part (iii)(iii) of Theorems 2 and 3). This can be ensured in practical implementations of SCL since labels are available and the number of classes is known. In UCL, however, not just the latent labels, but even the number of latent classes is unknown. Thus even if it were possible to attain the global minimum of the empirical UCL loss with an ff exhibiting Neural-Collapse, we may not observe Neural-Collapse with an argument-wise strictly increasing and strictly convex ψk\psi_{k} unless d𝒵d_{\mathcal{Z}} is chosen to be sufficiently large.

In practice, even without knowledge of latent labels, it is possible to design a sampling distribution having a structure that is compatible with (4) and conditions (c) and (d) of Theorem 3, e.g., via IID augmentations of a reference sample as in SimCLR, illustrated in Fig. 1. However, it is impossible to ensure that the equiprobable latent class condition (a) or the anchor-positive conditional independence condition (d) in Theorem 3 hold, or that the supports of s(|)s(\cdot|\cdot) determined by the sample augmentation mechanism will be disjoint across all latent classes. Thus a representation minimizing the UCL loss may not exhibit Neural-Collapse even if ψk\psi_{k} is strictly convex and argument-wise strictly increasing, or it might exhibit zero within-class variance but the class means may not form an ETF (see [26]).

A second important difference between the theoretical results for SCL and UCL is that, unlike in Theorem 2, conditional independence of the anchor and positive samples given their common label is assumed in Theorem 3 (condition (d) in the theorem). It is unclear whether the results of Theorem 3 for the UCL setting will continue to hold in their entirety without conditional independence and we leave this as an open problem.

5 Practical achievability of global optima

We first verify our theoretical results using synthetic data comprising three classes with 100 data points per class. The points within each class are IID draws from a 3072-dimensional Gaussian distribution with identity covariance and a randomly generated mean vector whose IID components are uniformly distributed over [-1,1]. We explored two strategies for constructing anchor-positive pairs (a code sketch illustrating both follows the list):

  • Using Label Information: For each anchor sample, a positive sample is chosen uniformly at random from among all samples having the same label as the anchor (including the anchor itself), so the anchor and the positive can be identical.

  • Additive Gaussian Noise Augmentation Mechanism: For each (reference) sample, we generate the anchor sample by adding IID zero-mean Gaussian noise of variance 0.010.01 to all dimensions of the reference sample. The corresponding positive sample is generated in the same way from the reference sample using noise that is independent of that used to generate the anchor.
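
The following is a minimal illustrative sketch of these two strategies on the synthetic data described above; the use of NumPy and all variable names are our own choices for illustration and not the exact implementation used in the experiments.

import numpy as np

rng = np.random.default_rng(0)
C, n_per_class, dim = 3, 100, 3072

# Synthetic data: per class, a random mean with IID Uniform[-1, 1] components and
# IID Gaussian points with identity covariance centered at that mean.
means = rng.uniform(-1.0, 1.0, size=(C, dim))
X = np.concatenate([rng.normal(loc=means[c], scale=1.0, size=(n_per_class, dim))
                    for c in range(C)])
y = np.repeat(np.arange(C), n_per_class)

def positive_by_label(anchor_idx):
    # Strategy 1: draw the positive uniformly from all samples sharing the
    # anchor's label (the anchor itself may be drawn).
    candidates = np.flatnonzero(y == y[anchor_idx])
    return X[rng.choice(candidates)]

def anchor_positive_by_gaussian_aug(ref_idx, noise_var=0.01):
    # Strategy 2: the anchor and the positive are two independent
    # additive-Gaussian-noise augmentations of the same reference sample.
    ref = X[ref_idx]
    anchor = ref + rng.normal(0.0, np.sqrt(noise_var), size=dim)
    positive = ref + rng.normal(0.0, np.sqrt(noise_var), size=dim)
    return anchor, positive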

The data dimension of 3072 allows for reshaping the vector into a 32×32×332\times 32\times 3 tensor which can be processed by ResNet. We then investigate the achievability of global-optima for UCL, SCL, HUCL, and HSCL on the following three real-world image datasets: CIFAR10, CIFAR100 [14], and TinyImageNet [15]. These datasets consist of 32×32×332\times 32\times 3 images across 10 classes (CIFAR10), 100 classes (CIFAR100), and 200 classes (TinyImageNet), respectively. Similar phenomena are observed in all three datasets. We present results for CIFAR100 here and results for CIFAR10 and TinyImageNet in Appendix B.

We utilize the InfoNCE loss with the exponential tilting hardening function described in Sec. 3.1. For simplicity, in all four CL settings (UCL, SCL, HUCL, and HSCL), for a given anchor xx, we randomly sample (without augmentation) the positive sample x+x^{+} from the class conditional distribution corresponding to the class of xx. We also report additional results with the SimCLR framework in Appendix B.5 where, instead of using the label information or only a single augmentation, a positive pair is generated using two independent augmentations of one reference sample. For both supervised and unsupervised settings as well as SimCLR, for a given positive pair, we select negative samples independently and uniformly at random from all the data in a mini-batch (including anchors and/or positive samples). We call this random negative sampling.
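
As a concrete illustration, the sketch below computes an InfoNCE-style loss in which the k negative terms are importance-weighted according to the exponential tilting hardening function of Algorithm 1 (a product of per-negative factors e^{\beta t_i}); self-normalizing the per-negative weights within the set of negatives is one standard way to emulate sampling from the tilted negative distribution. This is a hedged sketch of that idea (the function name and PyTorch usage are ours), not the exact training code, and beta = 0 recovers random negative sampling.

import torch

def hard_infonce_loss(z, z_pos, z_neg, beta=5.0):
    # z: (B, d) anchors, z_pos: (B, d) positives, z_neg: (B, k, d) negatives;
    # all assumed to be already normalized (unit ball or unit sphere).
    # Returns the mean of log(1 + sum_i w_i * exp(z^T z_i^- - z^T z^+)), where
    # w_i are self-normalized exponential-tilting weights; beta = 0 gives the
    # uniform weights 1/k of plain InfoNCE with random negative sampling.
    pos = (z * z_pos).sum(dim=-1, keepdim=True)      # (B, 1) anchor-positive similarity
    neg = torch.einsum('bd,bkd->bk', z, z_neg)       # (B, k) anchor-negative similarities
    w = torch.softmax(beta * neg, dim=-1)            # tilted weights, sum to 1 per anchor
    return torch.log1p((w * torch.exp(neg - pos)).sum(dim=-1)).mean()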

We used ResNet-50 [9] as the family of representation mappings {\mathcal{F}} and set the representation dimension to d=C1d=C-1 to observe Neural-Collapse (Definition 5). We normalized representations to lie within the unit ball as detailed in Algorithm 1, lines 5-12, in Appendix B. We only report results for k=256k=256 negative samples, but observed that results change only slightly for all k[32,512]k\in[32,512].

We chose hyper-parameter β\beta of the hardening function from the set {0,10,30,50}\{0,10,30,50\} for synthetic data and the set {0,2,5,10,30}\{0,2,5,10,30\} for real data and trained each model for E=400E=400 epochs with a batch size of B=512B=512 using the Adam optimizer at a learning rate of 10310^{-3}. Computations were performed on an NVIDIA A100 32 GB GPU.

5.1 Results for synthetic data

Figure 2 summarizes the results for synthetic data. For a representation function ff^{*} that achieves Neural-Collapse, the values of LSCL(256)(f)L^{(256)}_{SCL}(f^{*}) and LHSCL(256)(f)L^{(256)}_{HSCL}(f^{*}) across all β\beta values and the value of LUCL(256)(f)L^{(256)}_{UCL}(f^{*}) are 0.20140.2014, 0.20140.2014, and 0.39350.3935, respectively. These values are obtained by numerically evaluating the lower bounds in Theorem 2 and Theorem 3.

The first row in Figure 2 shows the result using label information for positive pair construction. The values of the minimum loss in different settings are displayed at the top of Fig. 2. Our simulation results are consistent with our theoretical results. After 200 training epochs, we observed that LSCL(256)(f)L^{(256)}_{SCL}(f) and LHSCL(256)(f)L^{(256)}_{HSCL}(f) across all β\beta values and LUCL(256)(f)L^{(256)}_{UCL}(f) converged to their minimum values. From the figure, we can visually confirm that the representations exhibit Neural-Collapse in SCL, HSCL, and UCL. However, Neural-Collapse was not observed in HUCL as the class means deviate significantly from the ETF geometry.

The second row in Figure 2 shows the results using the additive Gaussian noise augmentation mechanism, where label information is not used in constructing positive examples. We note that in accordance with our theoretical results, NC is observed in SCL and HSCL. However, NC is not observed in UCL and HUCL, likely due to the lack of conditional independence. Moreover, the degree of deviation from NC increases progressively with increasing hardness levels.

Contrary to the widely held belief that hard-negative sampling is beneficial in both supervised and unsupervised contrastive learning, the results on this simple synthetic dataset suggest that SCL may not benefit from hard negatives and that UCL may even be harmed by them. To investigate whether these conclusions hold more generally, or whether there are practical benefits of hard-negatives for SCL and UCL, we turn to real-world datasets next.

Figure 2: Synthetic dataset results using label information (top row figures) or additive Gaussian noise augmentation mechanism (bottom row figures) for generating anchor-positive pairs: Initial two-dimensional representations (left), post-training SCL and HSCL representations and losses at different hardness levels (middle), post-training UCL and HUCL representations and losses at different hardness levels (right).
Figure 3: Results for CIFAR100 under supervised settings (SCL, HSCL, left column) and unsupervised settings (UCL, HUCL, right column) with unit-ball normalization and random initialization. From top to bottom: Downstream Test Accuracy, Zero-sum metric, Unit-norm metric, and Equal inner-product metric, all plotted against the number of epochs.

5.2 Results for real data

We first show that hard-negative sampling improves downstream classification performance. From the first row of Fig. 3, we see that with hard-negative sampling at moderate hardness levels (β=5,10\beta=5,10), the classification accuracies of HSCL and HUCL are more than 40 percentage points higher than those of SCL and UCL, respectively.

Achievability of Neural-Collapse: We investigate whether SCL and UCL losses trained using Adam can attain the globally optimal solution exhibiting NC. To test this, in line with properties (i), (ii) and condition (13) in Theorems 2 and 3, we employ the following metrics, which are plotted in rows two through four of Fig. 3 (a short code sketch for computing them from the class means follows the list).

  1. Zero-sum metric: j𝒴μj\left\|\sum_{j\in\mathcal{Y}}\mu_{j}\right\|

  2. Unit-norm metric: 1Cj𝒴|μj1|\frac{1}{C}\sum_{j\in\mathcal{Y}}\left|\|\mu_{j}\|-1\right|

  3. Equal inner-product metric: 1C(C1)j,k𝒴,kj|μjμk+1C1|\frac{1}{C(C-1)}\sum\limits_{\begin{subarray}{c}j,k\in\mathcal{Y},\\ k\neq j\end{subarray}}\big{|}\mu_{j}^{\top}\mu_{k}+\frac{1}{C-1}\big{|}
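
A minimal illustrative sketch (ours, using NumPy) for computing these three metrics from the matrix of class means in representation space:

import numpy as np

def nc_metrics(mu):
    # mu: (C, d) array whose rows are the class means in representation space.
    # Returns the zero-sum, unit-norm, and equal inner-product metrics.
    C = mu.shape[0]
    zero_sum = np.linalg.norm(mu.sum(axis=0))
    unit_norm = np.mean(np.abs(np.linalg.norm(mu, axis=1) - 1.0))
    G = mu @ mu.T                                  # Gram matrix of class means
    off_diag = G[~np.eye(C, dtype=bool)]           # all mu_j^T mu_k with k != j
    equal_ip = np.mean(np.abs(off_diag + 1.0 / (C - 1)))
    return zero_sum, unit_norm, equal_ip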

We note that even though the equal inner-product class means condition together with unit-ball normalization implies the zero-sum and unit-norm conditions, we report these three metrics separately to gain more insight.
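
For completeness, here is a short derivation of this implication. If the class means satisfy \mu_{j}^{\top}\mu_{k}=-\tfrac{1}{C-1} for all k\neq j and lie in the closed unit ball, then

0 \;\leq\; \Big\|\sum_{j}\mu_{j}\Big\|^{2} \;=\; \sum_{j}\|\mu_{j}\|^{2}+\sum_{j\neq k}\mu_{j}^{\top}\mu_{k} \;=\; \sum_{j}\|\mu_{j}\|^{2}-C(C-1)\cdot\tfrac{1}{C-1} \;=\; \sum_{j}\|\mu_{j}\|^{2}-C.

Since each \|\mu_{j}\|\leq 1, we also have \sum_{j}\|\mu_{j}\|^{2}\leq C, so every inequality above is tight: \|\mu_{j}\|=1 for every j (unit-norm) and \|\sum_{j}\mu_{j}\|=0 (zero-sum).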

According to Theorems 2 and 3, the optimal solutions for UCL, SCL, and HSCL are anticipated to manifest NC. However, our experimental findings reveal a gap between the theoretical expectations and the observed outcomes. Specifically, in both supervised and unsupervised settings, when leveraging the random negative sampling method, i.e., when the negative samples are sampled uniformly from the whole data in a mini-batch which may include the anchor and positive samples, NC is not exactly achieved: the zero-sum and equal inner-product metrics do not approach zero for all hardness levels (rows 2 and 4 in Fig. 3). This is also supported by the results in Table 1 which shows the theoretical minimum SCL loss value and the practically observed SCL loss values after 400400 epochs for different hardness levels. While our theoretical results posit that both SCL and HSCL should have the same minimum loss value for all values of β>0\beta>0, the practically observed loss value for SCL deviates noticeably from theory. On the other hand, increased hardness in HSCL, especially at β=5,10\beta=5,10, brings the observed loss value close to the theoretical minimum.

Theory | SCL (β=0\beta=0) | HSCL (β=2\beta=2) | HSCL (β=5\beta=5) | HSCL (β=10\beta=10) | HSCL (β=30\beta=30)
0.3105 | 0.3384 | 0.3603 | 0.3106 | 0.3107 | 0.3222
Table 1: Comparison of minimum theoretical loss and practically observed loss values after 400400 epochs for CIFAR100.

In addition, the manner in which the final values of NC metrics change with increasing hardness levels is qualitatively different in the supervised and unsupervised settings. Specifically, in the supervised settings, increased hardness invariably leads to improved results, and the model tends to approach NC, notably at β=5,10,30\beta=5,10,30. However, in the unsupervised settings, there seems to be just a single optimal hardness level (β=5\beta=5 is best among the choices tested).

5.3 Dimensional-Collapse

To gain further insights, we investigate the phenomenon of Dimensional-Collapse (DC) that is known to occur in contrastive learning (see [12]).

Definition 7 (Dimensional-Collapse (DC)). We say that the class means μ1,,μC\mu_{1},\ldots,\mu_{C} suffer from DC if their empirical covariance matrix has one or more singular values that are zero or orders of magnitude smaller than the largest singular value.

If d𝒵=C1d_{\mathcal{Z}}=C-1, then under Neural-Collapse (NC), the class mean vectors would have full rank C1C-1 in representation space since they form an ETF (see Definition 5). Thus when d𝒵=C1d_{\mathcal{Z}}=C-1, NC¬DCNC\Rightarrow\neg DC. However, we note that ¬DC⇏NC\neg DC\not\Rightarrow NC because, for example, the class means could have full rank and satisfy the equal inner-product and unit-norm conditions in Theorem 2 without satisfying the zero-sum condition.

We numerically assess DC by plotting the singular values of the empirical covariance matrix of the class means (at the end of training), normalized by the largest singular value, in decreasing order on a log-scale. Results for UCL, SCL, HUCL, and HSCL are shown in Fig. 4. In the supervised settings (first row, first column of Fig. 4), the results align with our previous observations from Fig. 3. In the unsupervised settings (first row, second column of Fig. 4), however, although HUCL with high hardness levels deviates more from NC than UCL does in Fig. 3, Fig. 4 shows that HUCL suffers less from DC.
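
The DC diagnostic just described can be computed as follows (a sketch of our own, assuming the class means are available as a (C, d) array); these are the quantities plotted on a log-scale in Fig. 4.

import numpy as np

def normalized_singular_values(mu):
    # mu: (C, d) class means in representation space. Returns the singular values
    # of their empirical covariance matrix, sorted in decreasing order and
    # normalized by the largest one; values that are orders of magnitude below
    # one indicate Dimensional-Collapse.
    centered = mu - mu.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / mu.shape[0]      # empirical covariance, (d, d)
    s = np.linalg.svd(cov, compute_uv=False)       # singular values, decreasing
    return s / s[0]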

5.4 Role of initialization

To gain further insights into the DC phenomenon, we trained a model using HSCL with β=10\beta=10 for 400400 epochs until it nearly attains NC as measured by the three NC metrics (zero-sum, unit-norm, and equal inner-product) shown in the second row of Table 2. We call this representation mapping (or pre-trained model) the “near-NC” representation mapping (pre-trained model).

Next, with the near-NC representation mapping as initialization, we continue to train the model for an additional 400 epochs under 10 different settings corresponding to hard supervised and hard unsupervised contrastive learning with different hardness levels. Rows 3-12 in Table 2 show the final values of the three NC metrics for the 10 settings. The resulting normalized singular value plots are shown in the second row of Fig. 4.

Setting Zero-sum Unit-norm Equal inner-product
near-NC model 0.012 1.8×1081.8\times 10^{-8} 0.004
SCL 0.006 2.0×1082.0\times 10^{-8} 0.007
HSCL, β=2\beta=2 0.005 2.2×1082.2\times 10^{-8} 0.004
HSCL, β=5\beta=5 0.007 1.9×1081.9\times 10^{-8} 0.002
HSCL, β=10\beta=10 0.007 2.0×1082.0\times 10^{-8} 0.002
HSCL, β=30\beta=30 0.004 1.7×1081.7\times 10^{-8} 0.001
UCL 0.005 3.9×1043.9\times 10^{-4} 0.005
HUCL, β=2\beta=2 0.005 6.1×1046.1\times 10^{-4} 0.003
HUCL, β=5\beta=5 0.004 1.4×1031.4\times 10^{-3} 0.002
HUCL, β=10\beta=10 0.093 1.5×1031.5\times 10^{-3} 0.047
HUCL, β=30\beta=30 0.761 7.9×1047.9\times 10^{-4} 0.586
Table 2: Post-training NC metrics for near-NC initialization in different settings.

From Table 2 we note that in all 5 supervised settings, the final representation mappings have NC metrics that are very similar to those of the initial near-NC mapping. However, in the unsupervised settings, especially HUCL for β=10,30\beta=10,30, the unit-norm and equal inner-product metrics of the final representation mappings are significantly larger than those of the initial near-NC mapping. This shows that mini-batch Adam optimization of CL losses exhibits dynamics that are different in the supervised and unsupervised settings and are impacted by the hardness level of the negative samples.

From the second row of Fig. 4 we make the following observations:

  • SCL and HSCL trained with near-NC initialization and Adam do not exhibit DC, or exhibit only negligible DC (second row and first column of Fig. 4).

  • UCL trained with near-NC initialization and Adam also does not exhibit DC, but the behavior of HUCL depends on the hardness level β\beta. A larger β\beta value appears to make DC more pronounced. This could be explained by the fact that a higher β\beta value increases the odds of latent-class collision.

Figure 4: Normalized singular values of the empirical covariance matrix of class means (in representation space) plotted in log-scale in decreasing order for CIFAR100 under supervised (left column) and unsupervised (right column) settings. The horizontal axis is the sorted index of the singular values. From top to bottom: Unit-ball normalization with random initialization, Unit-ball normalization with near-NC initialization, Unit-sphere normalization with random initialization, and un-normalized representation with random initialization.

5.5 Role of normalization

Feature normalization also plays an important role in alleviating DC. To demonstrate this, we test three normalization conditions during training: (1) unit-ball normalization, (2) unit-sphere normalization, and (3) no normalization. The resulting normalized singular value plots are shown in Fig. 4 (rows 1, 3, and 4). As can be observed, the behavior of unit-sphere normalization is close to that of unit-ball normalization, and with hard-negative sampling, both SCL and UCL can achieve NC (for suitable hardness levels). Without normalization, neither random-negative nor hard-negative training methods attain NC and they suffer from DC. We also observe that for SCL and UCL with random negative sampling, the absence of normalization leads to less DC (compare the blue curves in rows 1 and 4 of Fig. 4). With hard-negative sampling, however, feature normalization reduces DC over a range of hardness levels.
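
The three normalization options correspond to lines 5-14 of Algorithm 1 and can be written compactly as below (a PyTorch-style sketch of our own; note that, as in the algorithm, the no-normalization option still scales the features by 1/sqrt(d)):

import torch

def normalize_features(f_x, mode="unit-ball"):
    # f_x: (B, d) raw encoder outputs; mode is "unit-ball", "unit-sphere", or "none".
    norms = f_x.norm(dim=-1, keepdim=True)
    if mode == "unit-ball":
        # Project onto the closed unit ball: rescale only points with norm > 1.
        return torch.where(norms <= 1.0, f_x, f_x / norms)
    if mode == "unit-sphere":
        return f_x / norms
    return f_x / f_x.shape[-1] ** 0.5              # "none": scale by 1/sqrt(d)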

5.6 Impact of batch size

In Appendix B.3, we report results using different batch sizes. We observe that when the batch size is decreased to 64, NC is still evident in HSCL and HUCL for certain values of β\beta. However, when the batch size is further reduced to 32, NC is no longer observed.

5.7 Role of hardening function

We investigated the impact of hardening functions, using a family of polynomial functions detailed in Appendix B.4. Results shown in Figs. 10 and 11 of Appendix B.4 confirm that hard-negative sampling helps prevent DC and achieve NC.

5.8 Experiments using the SimCLR framework

In Appendix B.5 we report results using the state-of-the-art SimCLR framework for contrastive learning which uses augmentations to generate samples. The results of Figs. 12 and 13 in Appendix B.5 are somewhat similar to those in Figs. 3 and 4, respectively, but a key difference is the failure to attain NC in the SimCLR sampling framework for both supervised and unsupervised settings at all hardness levels. This is primarily because the SimCLR sampling framework does not utilize label information and cannot guarantee property (v)(v) in Theorem 2 and Theorem 3 nor (for UCL) the conditional independence of anchor and positive samples given their label.

6 Conclusion and open questions

We proved the theoretical optimality of the NC-geometry for SCL, UCL, and notably (for the first time) HSCL losses for a very general family of CL losses and hardening functions that subsume popular choices. We empirically demonstrated the ability of hard-negative sampling to achieve global optima for CL and mitigate dimension collapse, in both supervised and unsupervised settings. Our theoretical and empirical results motivate a number of open questions. Firstly, a tight lower bound for HUCL remains open due to latent-class collision. It is also unclear whether the HUCL loss is minimized iff there is Neural-Collapse. Our theoretical results for the SCL setting did not require conditional independence of anchor and positive samples given their label, but our results for the UCL setting did. A theoretical characterization of the geometry of optimal solutions for UCL in the absence of conditional independence remains open. A difficulty with empirically observing NC in UCL and HUCL is that the number of latent classes is not known, because it is, in general, implicitly tied to the properties of the sampling distribution, requiring one to choose a sufficiently large representation dimension. Another open question is to unravel precisely how and why hard-negatives alter the optimization landscape, enabling the training dynamics of Adam with random initialization to converge to the global optimum for suitable hardness levels, and what those optimal hardness levels are.

7 Acknowledgements

This research was supported by NSF 1931978.

Appendix A Compatibility of sampling model with SimCLR-like augmentations

The following generative model captures the manner in which many data augmentation mechanisms, including SimCLR, generate a pair of anchor and positive samples. First, a label yy is sampled. The label may represent an observed class in the supervised setting or a latent (implicit, unobserved) cluster index in the unsupervised setting. Then given yy, a reference sample xrefx^{ref} is sampled with conditional distribution pref(|y)p_{ref}(\cdot|y). Then a pair of samples (x,x+)(x,x^{+}) are generated given (xref,y)(x^{ref},y) via two independent calls to an augmentation mechanism whose behavior can be statistically described by a conditional probability distribution paug(|xref,y)p_{aug}(\cdot|x^{ref},y), i.e., p(x,x+|xref,y)=paug(x|xref,y)paug(x+|xref,y)p(x,x^{+}|x^{ref},y)=p_{aug}(x|x^{ref},y)\cdot p_{aug}(x^{+}|x^{ref},y). Finally, the representations z=f(x)z=f(x) and z+=f(x+)z^{+}=f(x^{+}) are computed via a mapping f()f(\cdot), e.g., a neural network. Under the setting just described, it follows that both z|yz|y and z+|yz^{+}|y have identical conditional distributions given yy which we denote by s(|y)s(\cdot|y). This can be verified by checking that both x|yx|y and x+|yx^{+}|y have the same conditional distribution given yy equal to

p(|y)=xrefpaug(|xref,y)pref(xref|y)dxrefp(\cdot|y)=\int_{x^{ref}}p_{aug}(\cdot|x^{ref},y)p_{ref}(x^{ref}|y)dx^{ref}

where the integrals will be sums in the discrete (probability mass function) setting. Note that although x,x+x,x^{+} are conditionally IID given (xref,y)(x^{ref},y), they need not be conditionally IID given just yy.
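
To make the generative model concrete, the following sketch (ours) samples a label, a reference sample, and an anchor-positive pair via two independent calls to a generic augmentation mechanism; p_ref and augment are placeholders standing in for draws from p_ref(.|y) and p_aug(.|x_ref, y) and are not part of any specific library.

import numpy as np

rng = np.random.default_rng(0)

def sample_anchor_positive(p_ref, augment, num_classes):
    # p_ref(y) returns one reference sample x_ref ~ p_ref(.|y);
    # augment(x_ref, y) returns one draw from p_aug(.|x_ref, y).
    y = rng.integers(num_classes)     # observed class or latent cluster index
    x_ref = p_ref(y)                  # reference sample
    x = augment(x_ref, y)             # two conditionally IID augmentations
    x_pos = augment(x_ref, y)         #   given (x_ref, y)
    return x, x_pos, y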

Algorithm 1 Contrastive Learning Algorithm
0:  Batch size NN, data 𝒳\mathcal{X}, label 𝒴\mathcal{Y}, neural-net parameters of representation function ff, Algorithm: SCL/ UCL/HSCL/HUCL, normalization type: unit-ball/unit-sphere/no-normalization, the hardening function η(t1:k):=i=1keβti,β>0\eta(t_{1:k}):=\prod_{i=1}^{k}e^{\beta t_{i}},\beta>0.
1:  Define negative distribution p(z1:k|z,z+)p^{-}(z^{-}_{1:k}|z,z^{+}) based on the chosen Algorithm, see Sec. 3 for details.
2:  for each sampled minibatch {xi}i=1N\{x_{i}\}^{N}_{i=1} do
3:     for all i{1,,N}i\in\{1,\dots,N\} do
4:        Compute f(xi)f(x_{i})
5:        if unit-ball normalization then
6:           if f(xi)1\|f(x_{i})\|\leq 1 then
7:              zi=f(xi)z_{i}=f(x_{i})
8:           else
9:              zi=f(xi)f(xi)z_{i}=\frac{f(x_{i})}{\|f(x_{i})\|}
10:           end if
11:        else if unit-sphere normalization then
12:           zi=f(xi)f(xi)z_{i}=\frac{f(x_{i})}{\|f(x_{i})\|}
13:        else if no-normalization then
14:           zi=f(xi)dz_{i}=\frac{f(x_{i})}{\sqrt{d}}
15:        end if
16:     end for
17:     for all i{1,,N}i\in\{1,\dots,N\} do
18:        for all j{1,,N}j\in\{1,\dots,N\} do
19:           if y(xi)=y(xj)y(x_{i})=y(x_{j}) then
20:              Draw {z1:k}\{z^{-}_{1:k}\} from p(z1:k|zi,zj)p^{-}(z^{-}_{1:k}|z_{i},z_{j})
21:              {vi,j,m}m=1k={zizmzizj}m=1k\{v_{i,j,m}\}_{m=1}^{k}=\{z_{i}^{\top}z^{-}_{m}-z_{i}^{\top}z_{j}\}_{m=1}^{k}
22:              i,j=log(1+1km=1kevi,j,m)\ell_{i,j}=\log\left(1+\frac{1}{k}\sum_{m=1}^{k}e^{v_{i,j,m}}\right)
23:           else
24:              i,j=0\ell_{i,j}=0
25:           end if
26:        end for
27:        Compute the average loss of sample xix_{i}: i=1|{ij0}|j=1Ni,j\ell_{i}=\frac{1}{|\{\ell_{ij}\neq 0\}|}\sum_{j=1}^{N}\ell_{i,j}
28:     end for
29:     Compute the average loss of minibatch: L=1Ni=1NiL=\frac{1}{N}\sum_{i=1}^{N}\ell_{i}
30:     Take one stochastic gradient step using Adam
31:  end for
  return Encoder network f()f(\cdot)
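
A compact PyTorch-style sketch (ours) of the per-batch loss computation in lines 17-29 of Algorithm 1; draw_negatives is a placeholder for sampling k representations from the negative distribution of line 1.

import torch

def batch_loss(z, labels, draw_negatives, k=256):
    # z: (N, d) normalized representations of a mini-batch, labels: (N,).
    # For each pair (i, j) with matching labels, accumulates
    # log(1 + (1/k) * sum_m exp(z_i^T z_m^- - z_i^T z_j)) and averages,
    # first over the matching j for each i and then over i.
    N = z.shape[0]
    per_sample = []
    for i in range(N):
        losses_i = []
        for j in range(N):
            if labels[i] == labels[j]:
                z_neg = draw_negatives(z[i], z[j], k)        # (k, d) negatives
                v = z_neg @ z[i] - z[i] @ z[j]               # (k,) margins
                losses_i.append(torch.log1p(torch.exp(v).mean()))
        per_sample.append(torch.stack(losses_i).mean())
    return torch.stack(per_sample).mean()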

Appendix B Additional experiments

We replicated the experiments of Sec. 5 on CIFAR10; the results are plotted in Fig. 5 and Fig. 6. For CIFAR10 and CIFAR100 we provide numerical results under three different settings, namely unit-ball normalization with random initialization, unit-ball normalization with Neural-Collapse initialization, and unit-sphere normalization with random initialization. For Tiny-ImageNet, in contrast, we only conducted experiments under unit-ball normalization with random initialization, because the Tiny-ImageNet dataset (120000 images) is much larger than the CIFAR10 and CIFAR100 datasets (50000 images per dataset), which results in a significantly longer processing time. The results for Tiny-ImageNet are plotted in Fig. 7.

Figure 5: Results for CIFAR10 under supervised settings (SCL, HSCL, left column) and unsupervised settings (UCL, HUCL, right column) with unit-ball normalization and random initialization. From top to bottom: Downstream Test Accuracy, Zero-sum metric, Unit-norm metric, and Equal inner-product metric, all plotted against the number of epochs.
Figure 6: Sorted normalized singular values of the empirical covariance matrix of class means (in representation space) in the last epoch plotted in log-scale for CIFAR10 under supervised (left column) and unsupervised (right column) settings. From top to bottom: Unit-ball normalization with random initialization, Unit-ball normalization with near-NC initialization, Unit-sphere normalization with random initialization, and un-normalized representation with random initialization.
Figure 7: Results for Tiny-ImageNet under supervised settings (SCL, HSCL, left column) and unsupervised settings (UCL, HUCL, right column) with unit-ball normalization and random initialization. Top row: sorted normalized singular values of the empirical covariance matrix of class means (in representation space) in the last epoch plotted in log-scale. Rows 2–4: Zero-sum metric, Unit-norm metric, and Equal inner-product metric, all plotted against the number of epochs.

B.1 Neural-Collapse and Dimensional-Collapse

For CIFAR10, from Fig. 5 and Fig. 6, we observe similar phenomena as those for CIFAR100. As before, we note that while Theorems 2 and 3 suggest that Neural-Collapse should occur in both the supervised and unsupervised settings when using the random negative sampling method, one may not be able to observe Neural-Collapse in unsupervised settings in practice. For the supervised case in CIFAR10, any degree of hardness propels the representation towards Neural-Collapse. This may be due to the small number of classes in CIFAR10.

For Tiny-ImageNet, from Fig. 7, we observe that when β=5,10,30\beta=5,10,30, the geometry of the learned representation closely aligns with that of Neural-Collapse. However, in HUCL, a high degree of hardness can be harmful. At β=5\beta=5, the geometry most closely approximates Neural-Collapse for both CIFAR10 and Tiny-ImageNet. However, increasing the degree of hardness further, for example at β=30\beta=30, causes the center of class means to deviate from the origin and the equal inner-product condition is heavily violated.

Furthermore, from the normalized singular values for CIFAR10 in Fig. 6 and for Tiny-ImageNet in Fig. 7, we observe that random negative sampling without any hardening (β=0\beta=0) suffers from DC whereas hard-negative sampling consistently mitigates DC. The supervised case benefits more from a higher degree of hardness, since in the unsupervised cases there are higher chances of (latent) class collisions.

B.2 Effects of initialization and normalization

The normalized singular value plots for CIFAR10 are shown in Fig. 6. Compared to CIFAR100 (see Fig. 4), the phenomenon of DC in CIFAR10 is far less pronounced. This may be because CIFAR10 has a smaller number of classes than CIFAR100 (10 vs. 100). However, the effects of initialization and normalization on the learned representation geometry are similar to those for CIFAR100:

Effects of initialization: 1) SCL and HSCL trained with near-NC initialization and Adam do not exhibit DC and 2) UCL trained with near-NC initialization and Adam also does not exhibit DC, but the behavior of HUCL depends on the hardness level β\beta.

Effects of normalization: The behavior of unit-sphere normalization is close to that of unit-ball normalization, and with hard-negative sampling (and suitable hardness levels), both SCL and UCL can achieve NC. Without normalization, neither regular nor hard-negative training methods attain NC and they suffer from DC. We also observe that with random-negative sampling, un-normalized representations lead to reduced DC in both SCL and UCL. However, hard-negative sampling benefits more from feature normalization and its absence leads to more severe DC.

B.3 Experiments with different batch sizes

To investigate the effect of batch size on the outcomes, we conducted experiments with varying batch sizes. All previous experiments were performed with a batch size of 512. In this section, we present results for batch sizes of 64 and 32 in Fig. 8 and Fig. 9, respectively.

We observe that when the batch size is reduced to 64, Neural-Collapse is still nearly achieved in both HSCL (β=5,10,30\beta=5,10,30) and HUCL (β=5\beta=5). However, with a further reduction of batch size to 32, Neural-Collapse is only achieved in HSCL (β=30\beta=30), and it fails to occur in HUCL for any value of β\beta.

Figure 8: Results for CIFAR100 under supervised settings (SCL, HSCL, left column) and unsupervised settings (UCL, HUCL, right column) with unit-ball normalization and random initialization when batch size is equal to 64. Top row: sorted normalized singular values of the empirical covariance matrix of class means (in representation space) in the last epoch plotted in log-scale. Rows 2–4: Zero-sum metric, Unit-norm metric, and Equal inner-product metric, all plotted against the number of epochs.
Figure 9: Results for CIFAR100 under supervised settings (SCL, HSCL, left column) and unsupervised settings (UCL, HUCL, right column) with unit-ball normalization and random initialization when batch size is equal to 32. Top row: sorted normalized singular values of the empirical covariance matrix of class means (in representation space) in the last epoch plotted in log-scale. Rows 2–4: Zero-sum metric, Unit-norm metric, and Equal inner-product metric, all plotted against the number of epochs.

B.4 Experiments with a different hardening function

To explore whether similar results can be achieved with a different hardening function, we investigated the impact that changing the hardening function has on NC and DC under unit-ball normalization. We conducted experiments using the CIFAR10 and CIFAR100 datasets adopting the setup consistent with previous experiments but used a new family of hardening functions having the following polynomial form: η(t1:k):=i=1k(max{ti+1,0})ϵ,ϵ>0\eta(t_{1:k}):=\prod_{i=1}^{k}{(\max\{t_{i}+1,0\})^{\epsilon}},\epsilon>0, for the following set of hardness values ϵ=3,5,10,20\epsilon=3,5,10,20. We note that this family of hardening functions decays at a significantly slower rate compared to the exponential hardening function we used in all our previous experiments.
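
Both hardening families factor across the k negatives, so only the per-negative factor changes between the exponential and polynomial choices. The sketch below (ours) contrasts the two factors used in the experiments; for unit-ball-normalized features the similarities lie in [-1, 1], and the polynomial factor decays polynomially rather than exponentially as the similarity moves away from its maximum at 1.

import numpy as np

def exp_hardening_factor(t, beta):
    # Per-negative factor of eta(t_{1:k}) = prod_i exp(beta * t_i).
    return np.exp(beta * np.asarray(t))

def poly_hardening_factor(t, eps):
    # Per-negative factor of eta(t_{1:k}) = prod_i max(t_i + 1, 0)^eps.
    return np.maximum(np.asarray(t) + 1.0, 0.0) ** eps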

Results for CIFAR100 and CIFAR10 are plotted in Figs. 10 and 11, respectively. We observe phenomena similar to those in Figs. 3 and 5. By selecting an appropriate hardening parameter ϵ\epsilon, we can achieve, or nearly achieve, NC in both supervised and unsupervised settings. Consequently, we can draw conclusions that are qualitatively similar to those in Sec. B.1. Specifically, hard-negative sampling in both supervised and unsupervised settings can mitigate DC while achieving NC.

Figure 10: Results for CIFAR100 with polynomial hardening function, under supervised settings (SCL, HSCL, left column) and unsupervised settings (UCL, HUCL, right column) with unit-ball normalization and random initialization. Top row: sorted normalized singular values of the empirical covariance matrix of class means (in representation space) in the last epoch plotted in log-scale. Rows 2–4: Zero-sum metric, Unit-norm metric, and Equal inner-product metric, all plotted against the number of epochs.
Figure 11: Results for CIFAR10 with polynomial hardening function, under supervised settings (SCL, HSCL, left column) and unsupervised settings (UCL, HUCL, right column) with unit-ball normalization and random initialization. Top row: sorted normalized singular values of the empirical covariance matrix of class means (in representation space) in the last epoch plotted in log-scale. Rows 2–4: Zero-sum metric, Unit-norm metric, and Equal inner-product metric, all plotted against the number of epochs.

B.5 Experiments with the SimCLR framework

We conducted additional experiments using the state-of-the-art SimCLR framework for Contrastive Learning. Instead of sampling positive samples from the same class directly, we follow the setting in SimCLR which uses two independent augmentations of a reference sample to create the positive pair. No label information is used to generate the anchor-positive pair.

Figure 12 shows results for SimCLR sampling using the loss function proposed in SimCLR which is the large-kk asymptotic form of the InfoNCE loss:

LSimCLR(f)=𝔼p(x,x+)[log(1+Q𝔼p(x|x)[ef(x)f(x)]ef(x)f(x+))]\displaystyle{L}^{SimCLR}(f)=\mathbb{E}_{p(x,x^{+})}\Bigg{[}\log\Bigg{(}1+\frac{Q\,\mathbb{E}_{p(x^{-}|x)}[e^{f(x)^{\top}f(x^{-})}]}{e^{f(x)^{\top}\,f(x^{+})}}\Bigg{)}\Bigg{]} (42)

where QQ is a weighting hyper-parameter that is set to the batch size minus two. Compared to our previous experiments, all the NC metrics deviate significantly from zero in both the supervised and unsupervised settings and at all hardness levels. Still, the high-level conclusions for DC are qualitatively similar to those from our previous experiments, specifically that hard-negative sampling can mitigate DC in both supervised and unsupervised settings.
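
The sketch below (ours) shows one way to estimate the loss in (42) in a mini-batch: the inner expectation over negatives is replaced by the empirical mean of exp(z^T z^-) over the other in-batch views, and Q is passed in as described above. The exact negative-selection details may differ from the implementation used in our experiments.

import torch

def simclr_style_loss(z, z_pos, Q):
    # z, z_pos: (B, d) normalized representations of the two augmented views of a
    # mini-batch; Q is the weighting hyper-parameter in (42).
    B = z.shape[0]
    all_z = torch.cat([z, z_pos], dim=0)               # (2B, d): both views
    sim_exp = torch.exp(z @ all_z.T)                   # (B, 2B): exp(z^T z')
    pos = torch.exp((z * z_pos).sum(dim=-1))           # (B,): exp(z^T z^+)
    mask = torch.ones_like(sim_exp, dtype=torch.bool)
    idx = torch.arange(B)
    mask[idx, idx] = False                             # exclude z_i with itself
    mask[idx, idx + B] = False                         # exclude the positive pair
    neg_mean = sim_exp[mask].view(B, -1).mean(dim=-1)  # empirical E[exp(z^T z^-)]
    return torch.log1p(Q * neg_mean / pos).mean()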

Figure 12: Results for CIFAR100 with SimCLR sampling with the SimCLR loss (Eq. (42)), under supervised settings (SCL, HSCL, left column) and unsupervised settings (UCL, HUCL, right column) with unit-ball normalization and random initialization. Top row: sorted normalized singular values of the empirical covariance matrix of class means (in representation space) in the last epoch plotted in log-scale. Rows 2–4: Zero-sum metric, Unit-norm metric, and Equal inner-product metric, all plotted against the number of epochs.

Figure 13 shows results for SimCLR sampling using the InfoNCE loss with k=256k=256 and a positive distribution that only relies on the augmentation method. The results are very similar to those in Fig. 12. Since the results in Figs. 12 and 13 share the same SimCLR sampling framework but use different losses, it follows that the failure to attain NC is not due to the particular loss function used, but to the SimCLR sampling framework itself, which does not utilize label information to generate samples. The unit-norm metric plots in both figures show that the final representations lie mostly in the interior of the unit ball rather than on its surface.

Figure 13: Results for CIFAR100 with SimCLR sampling using InfoNCE loss, under supervised settings (SCL, HSCL, left column) and unsupervised settings (UCL, HUCL, right column) with unit-ball normalization and random initialization. Top row: sorted normalized singular values of the empirical covariance matrix of class means (in representation space) in the last epoch plotted in log-scale. Rows 2–4: Zero-sum metric, Unit-norm metric, and Equal inner-product metric, all plotted against the number of epochs.

Bibliography

  • [1] Sanjeev Arora, Hrishikesh Khandeparkar, Mikhail Khodak, Orestis Plevrakis, and Nikunj Saunshi. A theoretical analysis of contrastive unsupervised representation learning. arXiv preprint arXiv:1902.09229, 2019.
  • [2] Randall Balestriero, Mark Ibrahim, Vlad Sobal, Ari Morcos, Shashank Shekhar, Tom Goldstein, Florian Bordes, Adrien Bardes, Gregoire Mialon, Yuandong Tian, et al. A cookbook of self-supervised learning. arXiv preprint arXiv:2304.12210, 2023.
  • [3] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities - A Nonasymptotic Theory of Independence. Oxford University Press, 2013.
  • [4] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020.
  • [5] Ching-Yao Chuang, Joshua Robinson, Yen-Chen Lin, Antonio Torralba, and Stefanie Jegelka. Debiased contrastive learning. Advances in neural information processing systems, 33:8765–8775, 2020.
  • [6] Patrick Feeney and Michael C Hughes. Sincere: Supervised information noise-contrastive estimation revisited. arXiv preprint arXiv:2309.14277, 2023.
  • [7] Florian Graf, Christoph Hofer, Marc Niethammer, and Roland Kwitt. Dissecting supervised contrastive learning. In International Conference on Machine Learning, pages 3821–3830. PMLR, 2021.
  • [8] Jeff Z HaoChen, Colin Wei, Adrien Gaidon, and Tengyu Ma. Provable guarantees for self-supervised deep learning with spectral contrastive loss. Advances in Neural Information Processing Systems, 34:5000–5011, 2021.
  • [9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
  • [10] Ruijie Jiang, Prakash Ishwar, and Shuchin Aeron. Hard negative sampling via regularized optimal transport for contrastive representation learning. In 2023 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2023.
  • [11] Ruijie Jiang, Thuan Nguyen, Prakash Ishwar, and Shuchin Aeron. Supervised contrastive learning with hard negative samples. arXiv preprint arXiv:2209.00078, 2022.
  • [12] Li Jing, Pascal Vincent, Yann LeCun, and Yuandong Tian. Understanding dimensional collapse in contrastive self-supervised learning. In International Conference on Learning Representations, 2021.
  • [13] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. Advances in Neural Information Processing Systems, 33:18661–18673, 2020.
  • [14] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • [15] Ya Le and Xuan Yang. Tiny imagenet visual recognition challenge. CS 231N, 7(7):3, 2015.
  • [16] Jason D Lee, Qi Lei, Nikunj Saunshi, and Jiacheng Zhuo. Predicting what you already know helps: Provable self-supervised learning. Advances in Neural Information Processing Systems, 34:309–323, 2021.
  • [17] Zijun Long, George Killick, Richard McCreadie, Gerardo Aragon Camarasa, and Zaiqiao Meng. When hard negative sampling meets supervised contrastive learning, 2023.
  • [18] Vassili Nikolaevich Malozemov and Aleksandr Borisovich Pevnyi. Equiangular tight frames. Journal of Mathematical Sciences, 157(6):789–815, 2009.
  • [19] Hyun Oh Song, Yu Xiang, Stefanie Jegelka, and Silvio Savarese. Deep metric learning via lifted structured feature embedding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4004–4012, 2016.
  • [20] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  • [21] Vardan Papyan, XY Han, and David L Donoho. Prevalence of neural collapse during the terminal phase of deep learning training. Proceedings of the National Academy of Sciences, 117(40):24652–24663, 2020.
  • [22] Advait Parulekar, Liam Collins, Karthikeyan Shanmugam, Aryan Mokhtari, and Sanjay Shakkottai. Infonce loss provably learns cluster-preserving representations. arXiv preprint arXiv:2302.07920, 2023.
  • [23] Nils Rethmeier and Isabelle Augenstein. A primer on contrastive pretraining in language processing: Methods, lessons learned, and perspectives. ACM Computing Surveys, 55(10):1–17, 2023.
  • [24] Joshua David Robinson, Ching-Yao Chuang, Suvrit Sra, and Stefanie Jegelka. Contrastive learning with hard negative samples. In International Conference on Learning Representations, 2020.
  • [25] Joshua David Robinson, Ching-Yao Chuang, Suvrit Sra, and Stefanie Jegelka. Contrastive learning with hard negative samples. In International Conference on Learning Representations, 2021.
  • [26] Thuan Nguyen, Ruijie Jiang, Shuchin Aeron, Prakash Ishwar, and Donald Brown. On neural collapse in contrastive learning with imbalanced datasets. In IEEE International Workshop on Machine Learning for Signal Processing, 2024.
  • [27] Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International Conference on Machine Learning, pages 9929–9939. PMLR, 2020.
  • [28] Yifei Wang, Qi Zhang, Yisen Wang, Jiansheng Yang, and Zhouchen Lin. Chaos is a ladder: A new theoretical understanding of contrastive learning via augmentation overlap. In International Conference on Learning Representations, 2021.
  • [29] Zixin Wen and Yuanzhi Li. Toward understanding the feature learning process of self-supervised contrastive learning. In International Conference on Machine Learning, pages 11112–11122. PMLR, 2021.
  • [30] Mike Wu, Milan Mosse, Chengxu Zhuang, Daniel Yamins, and Noah Goodman. Conditional negative sampling for contrastive learning of visual representations. arXiv preprint arXiv:2010.02037, 2020.
  • [31] Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3733–3742, 2018.
  • [32] Can Yaras, Peng Wang, Zhihui Zhu, Laura Balzano, and Qing Qu. Neural collapse with normalized features: A geometric analysis over the riemannian manifold. Advances in neural information processing systems, 35:11547–11560, 2022.
  • [33] Jinxin Zhou, Xiao Li, Tianyu Ding, Chong You, Qing Qu, and Zhihui Zhu. On the optimization landscape of neural collapse under mse loss: Global optimality with unconstrained features. In International Conference on Machine Learning, pages 27179–27202. PMLR, 2022.
  • [34] Liu Ziyin, Ekdeep Singh Lubana, Masahito Ueda, and Hidenori Tanaka. What shapes the loss landscape of self supervised learning? In The Eleventh International Conference on Learning Representations, 2022.