
Differentiable Top-k Classification Learning

Felix Petersen    Hilde Kuehne    Christian Borgelt    Oliver Deussen
Abstract

The top-k classification accuracy is one of the core metrics in machine learning. Here, k is conventionally a positive integer, such as 1 or 5, leading to top-1 or top-5 training objectives. In this work, we relax this assumption and optimize the model for multiple k simultaneously instead of using a single k. Leveraging recent advances in differentiable sorting and ranking, we propose a differentiable top-k cross-entropy classification loss. This allows training the network while not only considering the top-1 prediction, but also, e.g., the top-2 and top-5 predictions. We evaluate the proposed loss function for fine-tuning on state-of-the-art architectures, as well as for training from scratch. We find that relaxing k not only produces better top-5 accuracies, but also leads to top-1 accuracy improvements. When fine-tuning publicly available ImageNet models, we achieve a new state-of-the-art for these models.


1 Introduction

Classification is one of the core disciplines in machine learning and computer vision. The advent of classification problems with hundreds or even thousands of classes has established the top-k classification accuracy as an important metric, i.e., one of the top-k predicted classes has to be the correct class. Usually, models are trained to optimize the top-1 accuracy, and top-5 etc. are used for evaluation only. Some works (Lapin et al., 2016; Berrada et al., 2018) have challenged this idea and proposed top-k losses, such as a smooth top-5 margin loss. These methods have demonstrated superior robustness over the established top-1 softmax cross-entropy in the presence of additional label noise (Berrada et al., 2018). In standard classification settings, however, these methods have so far not shown improvements over the established top-1 softmax cross-entropy.

In this work, instead of selecting a single top-k metric such as top-1 or top-5 for defining the loss, we propose to specify k to be drawn from a distribution P_K, which may or may not depend on the confidence of specific data points or on the class label. Examples of distributions P_K are [.5, 0, 0, 0, .5] (50% top-1 and 50% top-5), [.1, 0, 0, 0, .9] (10% top-1 and 90% top-5), and [.2, .2, .2, .2, .2] (20% top-k for each k from 1 to 5). Note that, when k is drawn from a distribution, this is done sampling-free, as we can compute the expectation value in closed form.

Conventionally, given scores returned by a neural network, softmax produces a probability distribution over the top-1 rank. Recent advances in differentiable sorting and ranking (Grover et al., 2019; Prillo & Eisenschlos, 2020; Cuturi et al., 2019; Petersen et al., 2021a) provide methods for generalizing this to probability distributions over all ranks, represented by a matrix 𝑷. Based on differentiable ranking, multiple differentiable top-k operators have recently been proposed. They found applications in differentiable k-nearest neighbor algorithms, differentiable beam search, attention mechanisms, and differentiable image patch selection (Cordonnier et al., 2021). In these areas, integrating differentiable top-k improved results considerably by creating a more natural end-to-end learning setting. However, to date, none of the differentiable top-k operators have been employed as neural network losses for top-k classification learning with k > 1.

Building on differentiable sorting and ranking methods, we propose a new family of differentiable top-k classification losses where k is drawn from a probability distribution. We find that our top-k losses improve not only top-k accuracies, but also top-1 accuracy on multiple learning tasks.

We empirically evaluate our method using four differentiable sorting and ranking methods on the CIFAR-100 (Krizhevsky et al., 2009), ImageNet-1K (Deng et al., 2009), and ImageNet-21K-P (Ridnik et al., 2021) data sets. Using CIFAR-100, we demonstrate the capabilities of our losses to train models from scratch. On ImageNet-1K, we demonstrate that our losses are capable of fine-tuning existing models and achieve a new state-of-the-art for publicly available models on both top-1 and top-5 accuracy. We benchmark our method on multiple recent models and demonstrate that our proposed method consistently outperforms the baselines for the best two differentiable sorting and ranking methods. With ImageNet-21K-P, where many classes overlap (but only one is the ground truth), we demonstrate that our losses are scalable to more than 10,000 classes and achieve improvements of over 1% with only last-layer fine-tuning.

Overall, while the performance improvements on fine-tuning are rather limited (because we retrain only the classification head), they are consistent and can be achieved without the large costs of training from scratch. The absolute 0.2% improvement that we achieve on the ResNeXt-101 32x48d WSL top-5 accuracy corresponds to an error reduction by approximately 10%, and can be achieved at much less than the computational cost of (re-)training the full model in the first place.

We summarize our contributions as follows:

  • We derive a novel family of top-k cross-entropy losses and relax the assumption of a fixed k.

  • We find that they improve both top-k and top-1 accuracy.

  • We demonstrate that our losses are scalable to more than 10,000 classes.

  • We propose splitter selection nets, which require fewer layers than existing selection nets.

  • We achieve new state-of-the-art results (for publicly available models) on ImageNet-1K.

Figure 1: Overview of the proposed architecture: A CNN predicts scores for an image, which are then ranked by a differentiable ranking algorithm, returning the probability distribution for each rank in a matrix 𝑷. The rows of this distribution correspond to ranks, and the columns correspond to the respective classes. In the example, we use a 50% top-1 and 50% top-2 loss, i.e., P_K = [.5, .5, 0, 0, 0]. Here, the k-th value refers to the top-k component, which is satisfied if the prediction is at any of rank-1 to rank-k. Thus, the weights for the different ranks can be computed via a cumulative sum and are [1, .5, 0, 0, 0]. The correspondingly weighted sum of the rows of 𝑷 yields the probability distribution p, which can then be used in a cross-entropy loss. Photo by Chris Curry on Unsplash.
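As a concrete illustration of the weighting described in the caption, the following is a minimal PyTorch sketch (not the paper's implementation); the random matrix P is only a placeholder standing in for the output of a differentiable ranking operator.

    import torch

    def topk_class_probability(P, p_k, y):
        # P   : (k_max, n) relaxed permutation rows; row m is the class
        #       distribution for rank m (from a differentiable ranking operator).
        # p_k : (k_max,) distribution P_K over k, e.g. [.5, .5, 0, 0, 0].
        # y   : index of the ground-truth class.
        # Weight for rank m is P(K >= m), i.e. a reversed cumulative sum of P_K:
        # [.5, .5, 0, 0, 0] -> [1., .5, 0, 0, 0], exactly as in the caption.
        weights = torch.flip(torch.cumsum(torch.flip(p_k, dims=[0]), dim=0), dims=[0])
        p = weights @ P          # weighted sum of the rows of P
        return p[y]              # probability that class y is accepted under P_K

    # Example with the caption's setting: 50% top-1 and 50% top-2.
    P = torch.softmax(torch.randn(5, 10), dim=-1)   # placeholder relaxed ranking
    p_k = torch.tensor([.5, .5, 0., 0., 0.])
    print(topk_class_probability(P, p_k, y=3))

The negative log of the returned probability is the per-sample loss derived in Section 3.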

2 Background: Differentiable Sorting and Ranking

We briefly review NeuralSort, SoftSort, Optimal Transport Sort, and Differentiable Sorting Networks. We omit the fast differentiable sorting and ranking method (Blondel et al., 2020) and the relaxed Bubble sort algorithm (Petersen et al., 2021b) as they do not provide relaxed permutation matrices / probability scores, but rather only sorted / ranked vectors.

2.1 NeuralSort & SoftSort

To make the sorting operation differentiable, Grover et al. (2019) proposed relaxing permutation matrices to unimodal row-stochastic matrices. For this, they use the softmax of pairwise differences of (cumulative) sums of the top elements. They prove that, as the temperature parameter approaches 0, this yields the correct permutation matrix, and they propose a variety of differentiable sorting benchmark tasks for deep learning. They propose a deterministic softmax-based variant, as well as a Gumbel-Softmax variant of their algorithm. Note that NeuralSort is not based on sorting networks.

Prillo & Eisenschlos (2020) build on this idea but simplify the formulation and provide SoftSort, a faster alternative to NeuralSort. They show that it is sufficient to build on pairwise differences of elements of the vectors to be sorted instead of the cumulative sums. They find that, in their experiments, SoftSort performs approximately equivalently to NeuralSort.
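The following is a small sketch of the SoftSort-style relaxation as we read it from Prillo & Eisenschlos (2020), using an absolute-difference cost and a temperature tau; it is meant for illustration and may differ in details from their reference implementation.

    import torch

    def soft_sort(s, tau=1.0):
        # s : (n,) scores; row i of the result is a softmax over all elements,
        # peaked at the element of rank i (descending order).
        s_sorted = torch.sort(s, descending=True).values            # (n,)
        dist = (s_sorted.unsqueeze(-1) - s.unsqueeze(0)).abs()      # (n, n)
        return torch.softmax(-dist / tau, dim=-1)                   # row-stochastic

    print(soft_sort(torch.tensor([0.3, 2.0, -1.0, 0.7]), tau=0.1))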

2.2 Optimal Transport / Sinkhorn Sort

Cuturi et al. (2019) propose an entropy-regularized optimal transport formulation of the sorting operation. They solve it by applying the Sinkhorn algorithm (Cuturi, 2013) and produce gradients via automatic differentiation rather than the implicit function theorem, which avoids the need to solve a linear equation system. As the Sinkhorn algorithm produces a relaxed permutation matrix, we can also apply Sinkhorn sort to top-k classification learning.
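A minimal sketch of the underlying idea (not the exact formulation of Cuturi et al., 2019): transport the n scores onto n anchor ranks under an entropic regularizer and use the rescaled transport plan as a relaxed permutation matrix. The anchors, cost, and number of iterations below are illustrative assumptions.

    import torch

    def sinkhorn_sort(s, eps=0.1, n_iters=200):
        n = s.shape[0]
        anchors = torch.linspace(1.0, 0.0, n)                   # rank 1 = largest value
        s_norm = (s - s.min()) / (s.max() - s.min() + 1e-9)     # scores scaled to [0, 1]
        C = (anchors.unsqueeze(-1) - s_norm.unsqueeze(0)) ** 2  # (n, n) squared cost
        K = torch.exp(-C / eps)
        u = torch.ones(n) / n
        v = torch.ones(n) / n
        for _ in range(n_iters):                                # Sinkhorn iterations
            u = (1.0 / n) / (K @ v)
            v = (1.0 / n) / (K.t() @ u)
        return n * u.unsqueeze(-1) * K * v.unsqueeze(0)         # approx. doubly stochastic

    print(sinkhorn_sort(torch.tensor([0.3, 2.0, -1.0, 0.7])))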

2.3 Differentiable Sorting Networks

Petersen et al. (2021a) propose differentiable sorting networks, a continuous relaxation of sorting networks. Sorting networks are a kind of sorting algorithm that consists of wires carrying the values and comparators, which swap the values on two wires if they are not in the desired order. Sorting networks can be made differentiable by perturbing the values on the wires in each layer of the sorting network with a logistic distribution, i.e., instead of min and max they use softmin and softmax. Similar to the methods above, this method produces a relaxed permutation matrix, which allows us to apply it to top-k classification learning. The method has also been improved by enforcing monotonicity and bounding the approximation error (Petersen et al., 2022). Note that sorting networks are a classic algorithmic concept (Knuth, 1998); they are not neural networks, nor do they refer to differentiable sorting. Differentiable sorting networks are one of multiple differentiable sorting and ranking methods.
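To illustrate the relaxed comparator, here is a sketch of a softmin/softmax comparator applied in an odd-even transposition network. This is not the exact construction of Petersen et al. (2021a), who use logistic perturbations and more efficient topologies such as bitonic networks; propagating relaxed layer-wise permutation matrices through such a network is what yields the matrix 𝑷.

    import torch

    def soft_compare_swap(x, i, j, beta=10.0):
        # Relaxed comparator on wires i, j: softly moves the larger value to
        # wire i (descending order). Hard min/max are recovered as beta -> inf.
        a, b = x[..., i], x[..., j]
        w = torch.sigmoid(beta * (a - b))       # soft indicator that a >= b
        x = x.clone()
        x[..., i] = w * a + (1 - w) * b         # soft max
        x[..., j] = w * b + (1 - w) * a         # soft min
        return x

    def odd_even_soft_sort(x, beta=10.0):
        # Odd-even transposition network with relaxed comparators (sketch).
        n = x.shape[-1]
        for layer in range(n):
            for i in range(layer % 2, n - 1, 2):
                x = soft_compare_swap(x, i, i + 1, beta)
        return x

    print(odd_even_soft_sort(torch.tensor([0.3, 2.0, -1.0, 0.7])))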

3 Top-k Learning

In this section, we start by introducing our objective, elaborate its exact formulation, and then build on differentiable sorting principles to efficiently approximate the objective. A visual overview of the loss architecture is given in Figure 1.

The goal of top-k learning is to extend the learning criterion from only accepting exact (top-1) predictions to accepting k predictions among which the correct class has to be. In its general form, for top-k learning, k may differ for each application, class, data point, or a combination thereof. For example, in one case one may want to rank 5 predictions and assign a score that depends on the rank of the true class among these ranked predictions, while, in another case, one may want to obtain 5 predictions but does not care about their order. In yet another case, such as image classification, one may want to enforce a top-1 accuracy on images from the “person” super-class, but settle for a top-3 accuracy for the “animal” super-class, as it may have more ambiguities in class labels. (For example, as recently shown by Northcutt et al. (2021), there is noise in the labels of ImageNet-1K. As ImageNet-21K is a superset of ImageNet-1K, this also holds in that case, with the addition that labeling 21K classes is more challenging and therefore more error-prone.) We model this by a random variable K, following a distribution P_K that describes the relative importance of different values k. The discrete distribution P_K is either a marginalized distribution for a given setting (such as the uniform distribution), or a conditional distribution for each class, data point, etc. This allows specifying a marginalized or conditional distribution k ∼ P_K. This generalizes and unifies the ideas of conventional top-1 supervision (usually softmax cross-entropy) and top-k supervision for a fixed k such as k = 5 (usually based on surrogate top-k margin/hinge losses (Lapin et al., 2016; Berrada et al., 2018)).

The objective of top-k learning is maximizing the probability of accepted predictions of the model f_Θ on data X, y ∼ D given the marginal distribution P_K (or the conditional P_{K|X,y} if it depends on the class y and/or data point X). In the following, 𝑷_{k,y} is the predicted probability of y being the k-th-best prediction for data point X.

\operatorname*{arg\,max}_{\Theta}\ \ \mathbb{E}_{X,y\sim\mathcal{D}}\left[\log\left(\mathbb{E}_{k\sim P_{K}}\left[\sum_{m=1}^{k}{\bm{P}}_{m,y}\right]\right)\right]    (1)

To evaluate the probability of y being the top-1 prediction, we can simply use softmax_y(f_Θ(X)). However, k > 1 requires more consideration. Here, we require probability scores 𝑷_{k,c} for the k-th prediction over classes c, where ∑_{c=1}^{n} 𝑷_{k,c} = 1 (i.e., 𝑷 is row-stochastic) and ideally additionally ∑_{k=1}^{n} 𝑷_{k,c} = 1 (i.e., 𝑷 is also column-stochastic and thus doubly stochastic). With this, we can optimize our model by minimizing the following loss

\mathcal{L}(X,y)=-\log\left(\sum_{k=1}^{n}P_{K}(k)\left(\sum_{m=1}^{k}{\bm{P}}_{m,y}(f_{\Theta}(X))\right)\right)    (2)

which is the cross-entropy over the probabilities that the true class is among the top-k classes for each possible k. Note that ∑_{k=1}^{n} P_K(k) = 1.

If 𝑷 is column-stochastic, the inner sum in Equation 2 is ≤ 1. As the sum over P_K is 1, the outer sum is also ≤ 1. 𝑷 being column-stochastic is the desirable case. This is given for DiffSortNets and SinkhornSort. However, for SoftSort and NeuralSort, this is only approximately the case. In the non-column-stochastic case of SoftSort and NeuralSort, the inner sum could become greater than 1; however, we did not observe this to be a direct problem.
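A minimal sketch of the loss in Equation 2 (assuming a batch dimension and a small clamp for numerical stability; not the paper's implementation):

    import torch

    def topk_cross_entropy(P, p_k, y, eps=1e-9):
        # P   : (batch, k_max, n) relaxed permutation rows from a differentiable
        #       ranking operator applied to the class scores.
        # p_k : (k_max,) distribution P_K over k; y : (batch,) true class indices.
        P_y = P[torch.arange(P.shape[0]), :, y]      # (batch, k_max), entries P_{m, y}
        cum = torch.cumsum(P_y, dim=-1)              # prob. that y is among the top-m
        p_accept = (cum * p_k).sum(dim=-1)           # expectation over k ~ P_K
        return -torch.log(p_accept.clamp_min(eps)).mean()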

To compute 𝑷_{k,c}, we require a function mapping from a vector of real-valued scores to an (ideally) doubly stochastic matrix 𝑷. The most suitable for this are the differentiable relaxations of the sorting and ranking functions, which produce differentiable permutation matrices 𝑷 and which we introduced in Section 2. We build on these approximations to propose instances of top-k learning losses and extend differentiable sorting networks to differentiable top-k networks, as just finding the top-k scores is computationally cheaper than sorting all elements and reduces the approximation error.

3.1 Top-k Probability Matrices

The discussed differentiable sorting algorithms produce relaxed permutation matrices of size n × n. However, for top-k classification learning, we require only the top k rows, where k here denotes the largest value that is considered for the objective, i.e., the largest k with P_K(k) > 0. As n ≫ k, producing a k × n matrix instead of an n × n matrix is much faster.

For NeuralSort and SoftSort, it is possible to simply compute only the top rows, as the algorithm is defined row-wise.

For the differentiable Sinkhorn sorting algorithm, it is not directly possible to improve the runtime, as each Sinkhorn iteration requires the full matrix. Xie et al. (2020b) proposed a Sinkhorn-based differentiable top-k operator, which computes a 2 × n matrix where the first row corresponds to the top-k elements and the second row corresponds to the remaining elements. However, this formulation does not produce 𝑷 and does not distinguish between the placements of the top-k elements among each other, and thus we use the SinkhornSort algorithm by Cuturi et al. (2019).

For differentiable sorting networks, it is possible (via a bi-directional evaluation) to reduce the cost from O(n² log²(n)) to O(nk log²(n)). Here, it is important to note the shape and order of the multiplications for obtaining P. As we only need those elements which are (after the last layer of the sorting network) at the top k ranks that we want to consider, we can omit all remaining rows of the permutation matrix of the last layer (layer t), so that it is only of size (k × n).

\underbrace{(k\times n)}_{P}\;=\;\underbrace{(k\times n)}_{\text{layer }t}\ \underbrace{(n\times n)}_{\text{layer }t-1}\ \cdots\ \underbrace{(n\times n)}_{\text{layer }1}    (3)

Note that, during execution of the sorting network, P is conventionally computed from layer 1 to layer t, i.e., from right to left in Equation 3. If we computed it in this order, we would only save a tiny fraction of the computational cost and only during the last layer. Thus, we propose to execute the differentiable sorting network, save the values that populate the (sparse) n × n layer-wise permutation matrices, and compute P in a second pass from back to front, i.e., from layer t to layer 1, or from left to right in Equation 3. This allows executing t matrix multiplications between dense k × n matrices and sparse n × n matrices instead of between dense n × n and sparse n × n matrices. With this, we reduce the asymptotic complexity from O(n² log²(n)) to O(nk log²(n)).
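A sketch of this back-to-front evaluation (with dense layer matrices for simplicity; the actual layer-wise matrices are sparse, and we assume here that the top-k ranks correspond to the first k rows):

    import torch

    def topk_rows_from_layers(layer_mats, k):
        # layer_mats : list [P_1, ..., P_t] of n x n layer-wise relaxed
        # permutation matrices of the sorting/selection network.
        P = layer_mats[-1][:k]                   # (k, n): top-k rows of the last layer
        for layer in reversed(layer_mats[:-1]):  # multiply from layer t-1 down to layer 1
            P = P @ layer                        # (k, n) @ (n, n) stays (k, n)
        return P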

3.2 Differentiable Top-k Networks

As only the top-k rows of a relaxed permutation matrix are required for top-k classification learning, it is possible to improve the efficiency of computing the top-k probability distribution via differentiable sorting networks by reducing the number of differentiable layers and comparators. Thus, we propose differentiable top-k networks, which relax selection networks in analogy to how differentiable sorting networks relax sorting networks. Selection networks are networks that select only the top k out of n elements (Knuth, 1998). We propose splitter selection networks (SSNs), a novel class of selection networks that requires only O(log n) layers (instead of the O(log² n) layers for sorting networks), which makes top-k supervision with differentiable top-k networks more efficient and reduces the error (which is introduced in each layer). SSNs follow the idea that the input is split into locally sorted sublists and then all wires that are not candidates to be among the global top k can be eliminated. For example, for n = 1024, k = 5, SSNs require only 22 layers, while the best previous selection network requires 34 layers and full sorting (with a bitonic network) requires even 55 layers. For n = 10450, k = 5 (i.e., for ImageNet-21K-P), SSNs require 27 layers, the best previous selection network requires 50 layers, and full sorting requires 105 layers. In addition, the layers of SSNs are less computationally expensive than those of the bitonic sorting network. Details on SSNs, as well as their full construction, can be found in Supplementary Material B. Concluding, the contribution of differentiable top-k networks is two-fold: first, we propose a novel kind of selection network that needs fewer layers, and second, we relax these networks similarly to differentiable sorting networks.

3.3 Implementation Details

Despite these performance improvements, evaluating the differentiable ranking operators still requires a considerable amount of computation for large numbers of classes. Especially if the number n of elements to be ranked is n = 1,000 (ImageNet-1K) or even n > 10,000 (ImageNet-21K-P), the differentiable ranking operators can dominate the overall computational costs. In addition, for large numbers n of elements to be ranked, the performance of differentiable ranking operators decreases, as differentiably ranking more elements naturally introduces larger errors (Grover et al., 2019; Prillo & Eisenschlos, 2020; Cuturi et al., 2019; Petersen et al., 2021a). Thus, we reduce the number of outputs to be ranked differentiably by only considering those classes (for each input) whose scores are among the top-m scores. For this, we make sure that the ground-truth class is among those top-m scores by replacing the lowest of the top-m scores with the ground-truth class, if necessary. For n = 1,000, we choose m = 16, and for n > 10,000, we choose m = 50. We find that this greatly improves training performance.
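A sketch of this pre-selection step (assuming raw class scores and hard top-m selection; the function and variable names are ours, not the paper's):

    import torch

    def restrict_to_top_m(scores, y, m=16):
        # scores : (batch, n) class scores, y : (batch,) true class indices.
        top_scores, top_idx = torch.topk(scores, m, dim=-1)        # sorted descending
        true_scores = scores.gather(-1, y.unsqueeze(-1))           # (batch, 1)
        present = (top_idx == y.unsqueeze(-1)).any(dim=-1)         # ground truth in top-m?
        # If not, overwrite the lowest of the m slots with the ground-truth class.
        top_idx[~present, -1] = y[~present]
        top_scores[~present, -1] = true_scores[~present, 0]
        y_local = (top_idx == y.unsqueeze(-1)).float().argmax(dim=-1)
        return top_scores, y_local           # m scores per sample + local target index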

Because the differentiable ranking operators are (by their nature of being differentiable) only approximations of the hard ranking operator, they each have their own characteristics and inconsistencies. Thus, for training models from scratch, we replace the top-1 component of the loss with the regular softmax, which has a better and more consistent behavior. This guides the other loss components if the differentiable ranking operator behaves inconsistently. To avoid the top-k components affecting the guiding softmax component and to avoid probabilities greater than 1 in p, we can separate the cross-entropy into a mixture of the softmax cross-entropy (sm, for the top-1 component) and the top-k cross-entropy (top-k, for the k ≥ 2 components) as follows:

\mathcal{L}_{\mathrm{sm+top\text{-}}k}(X,y)=P_{K}(1)\cdot\operatorname{SoftmaxCELoss}(f_{\Theta}(X),y)\;-\;(1-P_{K}(1))\cdot\log\left(\sum_{k=2}^{n}P_{K}(k)\left(\sum_{m=1}^{k}{\bm{P}}_{m,y}(f_{\Theta}(X))\right)\right)    (4)
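A minimal sketch of Equation 4 (same tensor conventions as the earlier sketch of Equation 2; not the paper's implementation):

    import torch
    import torch.nn.functional as F

    def split_topk_loss(scores, P, p_k, y, eps=1e-9):
        # scores : (batch, n) logits; P : (batch, k_max, n) relaxed ranking;
        # p_k    : (k_max,) distribution P_K; y : (batch,) true classes.
        ce = F.cross_entropy(scores, y)                      # softmax top-1 component
        P_y = P[torch.arange(P.shape[0]), :, y]              # (batch, k_max)
        cum = torch.cumsum(P_y, dim=-1)
        p_accept = (cum[:, 1:] * p_k[1:]).sum(dim=-1)        # only the k >= 2 components
        topk_ce = -torch.log(p_accept.clamp_min(eps)).mean()
        return p_k[0] * ce + (1 - p_k[0]) * topk_ce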

4 Related Work

We structure the related work into three broad sections: works that derive and apply differentiable top-k operators, works that use ranking and top-k training objectives in general, and works that present classic selection networks.

4.1 Differentiable Top-k Operators

Grover et al. (2019) include an experiment where they use the NeuralSort differentiable top-k operator for kNN learning. Cuturi et al. (2019), Blondel et al. (2020), and Petersen et al. (2021a) each apply their differentiable sorting and ranking methods to top-k supervision with k = 1.

Xie et al. (2020b) propose a differentiable top-k operator based on optimal transport and the Sinkhorn algorithm (Cuturi, 2013). They apply their method to k-nearest-neighbor learning (kNN), differentiable beam search with sorted soft top-k, and top-k attention for machine translation. Cordonnier et al. (2021) use perturbed optimizers (Berthet et al., 2020) to derive a differentiable top-k operator, which they use for differentiable image patch selection. Lee et al. (2021) propose using NeuralSort for a differentiable top-k operator to produce differentiable ranking metrics for recommender systems. Goyal et al. (2018) propose a continuous top-k operator for differentiable beam search. Pietruszka et al. (2020) propose the differentiable successive halving top-k operator to approximate the normalized Chamfer Cosine Similarity (nCCS@k).

4.2 Ranking and Top-k Training Objectives

Fan et al. (2017) propose the “average top-k” loss, an aggregate loss that averages over the k largest individual losses of a training data set. They apply this aggregate loss to SVMs for classification tasks. Note that this is not a differentiable top-k loss in the sense of this work; instead, the top-k operation is not differentiable and is used to decide which data points' losses are aggregated into the loss.

Lapin et al. (2015, 2016) propose relaxed top-k surrogate error functions for multiclass SVMs. Inspired by learning-to-rank losses, they propose top-k calibration, a top-k hinge loss, a top-k entropy loss, as well as a truncated top-k entropy loss. They apply their method to multiclass SVMs and learn via stochastic dual coordinate ascent (SDCA).

Berrada et al. (2018) build on these ideas and propose smooth loss functions for deep top-k classification. Their surrogate top-k loss achieves good performance on the CIFAR-100 and ImageNet-1K tasks. While their method does not improve performance on the raw data sets in comparison to the strong softmax cross-entropy baseline, it improves classification accuracy in settings with label noise and on data set subsets. Specifically, with label noise of 20% or more on CIFAR-100, they improve top-1 and top-5 accuracy, and for subsets of up to 50% of ImageNet-1K they improve top-5 accuracy. This work is closest to ours in the sense that our goal is to improve the learning of neural networks. However, in contrast to Berrada et al. (2018), our method improves classification accuracy in unmodified settings. In our experiments, for the special case of k being a concrete integer and not being drawn from a distribution, we provide comparisons to the smooth top-k surrogate loss.

Yang & Koyejo (2020) provide a theoretical analysis of top-k surrogate losses and derive a new surrogate top-k loss, which they evaluate in synthetic data experiments.

A related idea is set-valued classification, where a set of labels is predicted. We refer to Chzhen et al. (2021) for an extensive overview. We note that our goal is not to predict a set of labels, but instead we return a score for each class corresponding to a ranking, where only one class can correspond to the ground truth.

4.3 Selection Networks

Previous selection networks have been proposed by, i.a., (Wah & Chen, 1984; Zazon-Ivry & Codish, 2012; Karpiński & Piotrów, 2015). All of these are based on classic divide-and-conquer sorting networks, which recursively sort subsequences and merge them. In selection networks, during merging, only the top-k elements are merged instead of the full (sorted) subsequences. In comparison to those earlier works, we propose a new class of selection networks, which achieve tighter bounds (for k ≪ n), and relax them.

5 Experiments

(Code will be available at github.com/Felix-Petersen/difftopk.)

5.1 Setup

We evaluate the proposed top-k classification loss for four differentiable ranking operators on CIFAR-100, ImageNet-1K, as well as the winter 2021 edition of ImageNet-21K-P. We use CIFAR-100, which can be considered a small-scale data set with only 100 classes, to train a ResNet18 model (He et al., 2016) from scratch and show the impact of the proposed loss function on the top-1 and top-5 accuracy. In comparison, ImageNet-1K and ImageNet-21K-P provide rather large-scale data sets with 1,000 and 10,450 classes, respectively. To avoid the unreasonable carbon footprint of training many models from scratch, we decided to exclusively use publicly available backbones for all ImageNet experiments. This has the additional benefit of allowing more settings, making our work easily reproducible, and allowing us to perform multiple runs with different seeds to improve the statistical significance of the results. For ImageNet-1K, we use two publicly available state-of-the-art architectures as backbones: first, the (four) ResNeXt-101 WSL architectures by Mahajan et al. (2018), which were pretrained in a weakly supervised fashion on a billion-scale data set from Instagram; second, the Noisy Student EfficientNet-L2 (Xie et al., 2020a), which was pretrained on the unlabeled JFT-300M data set (Sun et al., 2017). For ResNeXt-101 WSL, we extract 2,048-dimensional embeddings, and for the Noisy Student EfficientNet-L2, we extract 5,504-dimensional embeddings of ImageNet-1K and fine-tune on them.

We apply the proposed loss in combination with various available differentiable sorting and ranking approaches, namely NeuralSort, SoftSort, SinkhornSort, and DiffSortNets. To determine the optimal temperature for each differentiable sorting method, we perform a grid search at a resolution of factor 2. For training, we use the Adam optimizer (Kingma & Ba, 2015). For training on CIFAR-100 from scratch, we train for up to 200 epochs with a batch size of 100 at a learning rate of 10^-3. For ImageNet-1K, we train for up to 100 epochs at a batch size of 500 and a learning rate of 10^-4.5. For ImageNet-21K-P, we train for up to 40 epochs at a batch size of 500 and a learning rate of 10^-4. We use early stopping and found that these settings lead to convergence in all settings. As baselines, we use the respective original models, softmax cross-entropy, as well as learning with the smooth surrogate top-k loss (Berrada et al., 2018).

Method                  P_K                   Top-1 | Top-5 (CIFAR-100)
Baselines
Softmax                 [1, 0, 0, 0, 0]       61.27 | 85.31
Smooth top-k loss (*)   [0, 0, 0, 0, 1]       53.07 | 85.23
Top-5 NeuralSort        [0, 0, 0, 0, 1]       22.58 | 84.41
Top-5 SoftSort          [0, 0, 0, 0, 1]        1.01 |  5.09
Top-5 SinkhornSort      [0, 0, 0, 0, 1]       55.62 | 87.04
Top-5 DiffSortNets      [0, 0, 0, 0, 1]       52.81 | 84.21
Ours
Top-k NeuralSort        [.2, .2, .2, .2, .2]  61.46 | 86.03
Top-k SoftSort          [.2, .2, .2, .2, .2]  61.53 | 82.39
Top-k SinkhornSort      [.2, .2, .2, .2, .2]  61.89 | 86.94
Top-k DiffSortNets      [.2, .2, .2, .2, .2]  62.00 | 86.73

Table 1: CIFAR-100 results for training a ResNet18 from scratch. The metrics are Top-1 | Top-5 accuracy, averaged over 2 seeds. (*) Berrada et al. (2018).

5.2 Training from Scratch

We start by demonstrating that the proposed loss can be used to train a network from scratch. As a reference baseline, we train a ResNet18 from scratch on CIFAR-100. In Table 1, we compare the baselines (i.e., top-1 softmax, the smooth top-5 loss (Berrada et al., 2018), as well as “pure” top-5 losses using four differentiable sorting and ranking methods) with our top-k loss with k ∼ [.2, .2, .2, .2, .2].

We find that training with top-5 alone, in some cases, slightly improves the top-5 accuracy but leads to a substantially worse top-1 accuracy. Here, we note that the smooth top-5 loss (Berrada et al., 2018), top-5 Sinkhorn (Cuturi et al., 2019), and top-5 DiffSort (Petersen et al., 2021a) are able to achieve good performance. Notably, Sinkhorn (Cuturi et al., 2019) outperforms the softmax baseline on the top-5 metric, while NeuralSort and SoftSort are less stable and yield worse results, especially on top-1 accuracy.

By using our loss that corresponds to drawing k from [.2, .2, .2, .2, .2], we can achieve substantially improved results, especially on the top-1 accuracy metric. Using DiffSortNets yields the best results on the top-1 accuracy and Sinkhorn yields the best results on the top-5 accuracy. Note that NeuralSort and SoftSort also achieve good results in this setting, which can be attributed to our loss with k ∼ [.2, .2, .2, .2, .2] being more robust to inconsistencies and outliers in the used differentiable sorting method. Interestingly, top-5 SinkhornSort achieves the best performance on the top-5 metric, which suggests that SinkhornSort is a very robust differentiable sorting method, as it does not require additional top-k components. Nevertheless, it is advisable to include other top-k components, as the model trained purely on top-5 exhibits poor top-1 performance.

Method                  P_K                 ImageNet-1K       ImageNet-21K-P
Baselines
Softmax                 [1, 0, 0, 0, 0]     86.06 | 97.795    39.29 | 69.63
Smooth top-k loss (*)   [0, 0, 0, 0, 1]     85.15 | 97.540    34.03 | 65.56
Top-5 NeuralSort        [0, 0, 0, 0, 1]     33.37 | 94.748    15.87 | 33.81
Top-5 SoftSort          [0, 0, 0, 0, 1]     18.23 | 94.965    33.61 | 69.82
Top-5 SinkhornSort      [0, 0, 0, 0, 1]     85.65 | 97.991    36.93 | 69.80
Top-5 DiffSortNets      [0, 0, 0, 0, 1]     69.05 | 97.389    35.96 | 69.76
Ours
Top-k NeuralSort        [.5, 0, 0, 0, .5]   86.30 | 97.896    37.85 | 68.08
Top-k SoftSort          [.5, 0, 0, 0, .5]   86.26 | 97.963    39.93 | 70.63
Top-k SinkhornSort      [.5, 0, 0, 0, .5]   86.29 | 97.971    39.85 | 70.56
Top-k DiffSortNets      [.5, 0, 0, 0, .5]   86.24 | 97.937    40.22 | 70.88

Table 2: ImageNet-1K and ImageNet-21K-P results for fine-tuning the head of ResNeXt-101 32x48d WSL (Mahajan et al., 2018). The metrics are Top-1 | Top-5 accuracy, averaged over 10 seeds for ImageNet-1K and 2 seeds for ImageNet-21K-P. (*) Berrada et al. (2018).
Figure 2: Effects of varying the ratio between top-1 and top-5 (left) and of varying the size m of the differentiably ranked subset (right). Both experiments are done with the differentiable Sinkhorn ranking algorithm (Cuturi et al., 2019). On the left, m = 16; on the right, α = 0.75. Averaged over 5 runs.

Figure 3: ImageNet-1K accuracy improvements for all ResNeXt-101 WSL model sizes (32x8d, 32x16d, 32x32d, 32x48d). Green (•) is the original model and red (▲) is with top-k fine-tuning.

5.3 Fine-Tuning

In this section, we discuss the results for fine-tuning on ImageNet-1K and ImageNet-21K-P. In Table 2, we find a behavior very similar to training from scratch on CIFAR-100. Specifically, we find that accuracies improve by drawing k from a distribution. An exception is (again) SinkhornSort, where focusing only on top-5 yields the best top-5 accuracy on ImageNet-1K, but the respective model exhibits poor top-1 accuracy. Overall, we find that drawing k from a distribution improves performance in all cases.

To demonstrate that the improvements also translate to different backbones, we show the improvements on all four model sizes of ResNeXt-101 WSL (32x8d, 32x16d, 32x32d, 32x48d) in Figure 3. Also, here, our method improves the model in all settings.

5.4 Impact of the Distribution P_K and Differentiable Sorting Methods

We start by demonstrating the impact of P_K, the distribution from which we draw k. Let us first consider the case where k is 5 with probability α and 1 with probability 1−α, i.e., P_K = [1−α, 0, 0, 0, α]. In Figure 2 (left), we demonstrate the impact that changing α, i.e., transitioning from a pure top-1 loss to a pure top-5 loss, has on fine-tuning ResNeXt-101 WSL with our loss using the SinkhornSort algorithm. Increasing the weight of the top-5 component not only increases the top-5 accuracy but also improves the top-1 accuracy up to a top-5 weight of around 60%; when using only k = 5, the top-1 accuracy drastically decays, as the incentive for the true class to be at the top-1 position vanishes (or is only indirectly given by being among the top-5). While the top-5 accuracy in this plot is best for a pure top-5 loss, this generally only applies to the Sinkhorn algorithm, and overall training is more stable if a pure top-5 loss is avoided. This can also be seen in Tables 1 and 2.

In Tables 3 and 4, we consider additional settings with all differentiable ranking methods. Specifically, we compare four notable settings: [.5, 0, 0, 0, .5], i.e., equally weighted top-1 and top-5; [.25, 0, 0, 0, .75] and [.1, 0, 0, 0, .9], i.e., top-5 has a larger weight; and [.2, .2, .2, .2, .2], i.e., an equal weight of 0.2 for each of top-1 to top-5. The [.5, 0, 0, 0, .5] setting is a rather canonical setting that usually performs well on both metrics, while the others tend to favor top-5. In the [.5, 0, 0, 0, .5] setting, all sorting methods improve upon the softmax baseline on both top-1 and top-5 accuracy. When increasing the weight of the top-5 component, top-5 accuracy generally improves while top-1 accuracy decays.

Here we find a core insight of this paper: the best performance cannot be achieved by optimizing top-k for only a single k; instead, drawing k from a distribution improves performance on all metrics.

Comparing the differentiable ranking methods, we find the overall trend that SoftSort outperforms NeuralSort, and that SinkhornSort as well as DiffSortNets perform best. We can see that some sorting algorithms are more sensitive to the overall P_K than others: whereas SinkhornSort (Cuturi et al., 2019) and DiffSortNets (Petersen et al., 2021a) continuously outperform the softmax baseline, NeuralSort (Grover et al., 2019) and SoftSort (Prillo & Eisenschlos, 2020) tend to collapse when over-weighting the top-5 components.

Comparing the performance on the medium-scale ImageNet-1K to the larger ImageNet-21K-P in Table 2, we observe a similar pattern. Here, again, using the top-k component alone is not enough to significantly increase accuracy, but combining top-1 and top-k components helps to improve accuracy on both reported metrics. While NeuralSort struggles in this large-scale ranking problem and stays below the softmax baseline, DiffSortNets (Petersen et al., 2021a) provide the best top-1 and top-5 accuracy with 40.22% and 70.88%, respectively.

In Supplementary Material A, an extension to learning with top-10 and top-20 components can be found.

We note that we do not claim that all settings (especially all differentiable sorting methods) improve the classification performance on all metrics. Instead, we include all methods and also additional settings to demonstrate the capabilities and limitations of each differentiable sorting method.

Overall, it is notable that SinkhornSort achieves the most robust training behavior, while also being by far the slowest sorting method and thus potentially slowing down training drastically, especially when the task is only fine-tuning. SinkhornSort also tends to require more Sinkhorn iterations towards the end of training. DiffSortNets are considerably faster, especially because it is possible to compute only the top-k probability matrices and because of our advances towards more efficient selection networks.

Method / P_K     [.5, 0, 0, 0, .5]   [.25, 0, 0, 0, .75]   [.1, 0, 0, 0, .9]   [.2, .2, .2, .2, .2]
ImageNet-1K
NeuralSort       86.30 | 97.896      34.26 | 95.410        34.32 | 94.889      85.75 | 97.865
SoftSort         86.26 | 97.963      86.16 | 97.954        27.30 | 95.915      86.18 | 97.979
SinkhornSort     86.29 | 97.971      86.24 | 97.989        86.18 | 97.987      86.22 | 97.989
DiffSortNets     86.24 | 97.937      86.15 | 97.936        86.04 | 97.980      86.21 | 98.003
ImageNet-21K-P
NeuralSort       37.85 | 68.08       36.16 | 67.60         33.02 | 67.29       37.09 | 67.90
SoftSort         39.93 | 70.63       39.08 | 70.27         37.78 | 70.07       39.68 | 70.57
SinkhornSort     39.85 | 70.56       39.21 | 70.41         38.42 | 70.12       39.22 | 70.49
DiffSortNets     40.22 | 70.88       39.56 | 70.58         38.48 | 70.25       39.69 | 70.69

Table 3: ImageNet-1K and ImageNet-21K-P results for different distributions P_K for fine-tuning the head of ResNeXt-101 32x48d WSL (Mahajan et al., 2018). The metrics are Top-1 | Top-5 accuracy, averaged over 10 seeds for ImageNet-1K and 2 seeds for ImageNet-21K-P.

5.5 Differentiable Ranking Set Size m

We consider how accuracy is affected by varying the number m of scores to be differentiably ranked. Generally, the runtime of differentiable top-k operators depends between linearly and cubically on m; thus, it is important to choose an adequate value for m. The choice of m between 10 and 40 has only a moderate impact on accuracy, as can be seen in Figure 2 (right). However, when setting m to large values such as 1,000 or larger, we observe that the differentiable sorting methods tend to become unstable. We note that we did not specifically tune m, and that better performance can be achieved by fine-tuning m, as displayed in the plot.

Method / P_K     [.5, 0, 0, 0, .5]   [.25, 0, 0, 0, .75]   [.1, 0, 0, 0, .9]   [.2, .2, .2, .2, .2]
CIFAR-100
NeuralSort       61.12 | 86.47       61.07 | 87.23         52.57 | 85.76       61.46 | 86.03
SoftSort         61.17 | 83.95       61.05 | 83.10         58.16 | 79.26       61.53 | 82.39
SinkhornSort     61.34 | 86.38       61.50 | 86.68         57.35 | 86.34       61.89 | 86.94
DiffSortNets     60.07 | 86.44       61.57 | 86.51         61.74 | 87.22       62.00 | 86.73

Table 4: CIFAR-100 results for different distributions P_K for training a ResNet18 from scratch. The metrics are Top-1 | Top-5 accuracy, averaged over 2 seeds.
Method                                    Public   Top-1   Top-5
ResNet50 (*)                                       79.26   94.75
ResNet152 (*)                                      80.62   95.51
ResNeXt-101 32x48d WSL (†)                         85.43   97.57
ViT-L/16 (‡)                                       87.76
Noisy Student EfficientNet-L2 (§)                  88.35   98.65
BiT-L (¶)                                          87.54   98.46
CLIP (w/ Noisy Student EffNet-L2) (★)              ≈88.4
ViT-H/14 (⋇)                                       88.55
ALIGN (EfficientNet-L2) (⩞)                        88.64   98.67
Meta Pseudo Labels (EffNet-L2) (⋒)                 90.20   ≈98.8
ViT-G/14 (⋓)                                       90.45
CoAtNet-7 (∘)                                      90.88
ResNeXt-101 32x48d WSL                             86.06   97.80
  Top-k SinkhornSort                               86.22   97.99
  Top-k DiffSortNets                               86.21   98.00
Noisy Student EfficientNet-L2                      88.33   98.65
  Top-k SinkhornSort                               88.32   98.66
  Top-k DiffSortNets                               88.37   98.68

Table 5: ImageNet-1K result comparison to the state-of-the-art. Among the overall best performing differentiable sorting / ranking methods, almost all results in reasonable settings outperform their respective baseline on Top-1 and Top-5 accuracy. For publicly available models / backbones, we achieve a new state-of-the-art for top-1 and top-5 accuracy. Our results are averaged over 10 runs. (*) He et al. (2016), (†) Mahajan et al. (2018), (‡) Dosovitskiy et al. (2021), (§) Xie et al. (2020a), (¶) Kolesnikov et al. (2020), (★) Radford et al. (2021), (⋇) Dosovitskiy et al. (2021), (⩞) Jia et al. (2021), (⋒) Pham et al. (2021), (⋓) Zhai et al. (2021), (∘) Dai et al. (2021).

5.6 Comparison to the State-of-the-Art

We compare the proposed results to current state-of-the-art methods in Table 5. We focus on methods that are publicly available and build upon two of the best performing models, namely Noisy Student EfficientNet-L2 (Xie et al., 2020a), and ResNeXt-101 32x48d WSL (Mahajan et al., 2018). Using both backbones, we achieve improvements on both metrics, and when fine-tuning on the Noisy Student EfficientNet-L2, we achieve a new state-of-the-art for publicly available models.

Significance Tests.

To evaluate the significance of the results, we perform a t-test (with a significance level of 0.01). We find that our model is significantly better than the original model on both the top-1 and top-5 accuracy metrics. Compared to the observed accuracies of the baseline (88.33 | 98.65), DiffSortNets are significantly better (p = 0.00001 | 0.00005). Compared to the reported accuracies of the baseline (88.35 | 98.65), DiffSortNets are also significantly better (p = 0.00087 | 0.00005).
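For reference, such a one-sample t-test can be computed, e.g., with SciPy; the per-seed accuracies below are purely hypothetical placeholders and not the values from our runs.

    # Hypothetical illustration only; the accuracies below are placeholders.
    from scipy import stats

    top1_runs = [88.36, 88.38, 88.35, 88.39, 88.37,
                 88.36, 88.38, 88.40, 88.35, 88.37]     # 10 fine-tuning seeds
    t, p = stats.ttest_1samp(top1_runs, popmean=88.33)  # baseline top-1 accuracy
    print(p < 0.01)                                     # significant at the 0.01 level?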

6 Conclusion

We presented a novel loss that relaxes the assumption of using a fixed k for top-k classification learning. For this, we leveraged recent differentiable sorting and ranking operators. We performed an array of experiments to explore different top-k classification learning settings and achieved a new state-of-the-art on ImageNet for publicly available models.

Acknowledgments & Funding Disclosure

This work was supported by the IBM-MIT Watson AI Lab, the DFG in the Cluster of Excellence EXC 2117 “Centre for the Advanced Study of Collective Behaviour” (Project-ID 390829875), and the Land Salzburg within the WISS 2025 project IDA-Lab (20102-F1901166-KZP and 20204-WISS/225/197-2019).

References

  • Batcher (1968) Batcher, K. E. Sorting networks and their applications. In Proc. AFIPS Spring Joint Computing Conference (Atlantic City, NJ), pp.  307–314, Washington, DC, USA, 1968. American Federation of Information Processing Societies.
  • Berrada et al. (2018) Berrada, L., Zisserman, A., and Kumar, M. P. Smooth loss functions for deep top-k classification. In International Conference on Learning Representations (ICLR), 2018.
  • Berthet et al. (2020) Berthet, Q., Blondel, M., Teboul, O., Cuturi, M., Vert, J.-P., and Bach, F. Learning with Differentiable Perturbed Optimizers. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
  • Blondel et al. (2020) Blondel, M., Teboul, O., Berthet, Q., and Djolonga, J. Fast Differentiable Sorting and Ranking. In International Conference on Machine Learning (ICML), 2020.
  • Chzhen et al. (2021) Chzhen, E., Denis, C., Hebiri, M., and Lorieul, T. Set-valued classification–overview via a unified framework. arXiv preprint arXiv:2102.12318, 2021.
  • Cordonnier et al. (2021) Cordonnier, J.-B., Mahendran, A., Dosovitskiy, A., Weissenborn, D., Uszkoreit, J., and Unterthiner, T. Differentiable patch selection for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  2351–2360, 2021.
  • Cuturi (2013) Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems (NeurIPS), 2013.
  • Cuturi et al. (2019) Cuturi, M., Teboul, O., and Vert, J.-P. Differentiable ranking and sorting using optimal transport. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
  • Dai et al. (2021) Dai, Z., Liu, H., Le, Q. V., and Tan, M. Coatnet: Marrying convolution and attention for all data sizes. arXiv preprint arXiv:2106.04803, 2021.
  • Deng et al. (2009) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp.  248–255. IEEE, 2009.
  • Dosovitskiy et al. (2021) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2021.
  • Fan et al. (2017) Fan, Y., Lyu, S., Ying, Y., and Hu, B.-G. Learning with average top-k loss. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
  • Goyal et al. (2018) Goyal, K., Neubig, G., Dyer, C., and Berg-Kirkpatrick, T. A continuous relaxation of beam search for end-to-end training of neural sequence models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
  • Grover et al. (2019) Grover, A., Wang, E., Zweig, A., and Ermon, S. Stochastic Optimization of Sorting Networks via Continuous Relaxations. In International Conference on Learning Representations (ICLR), 2019.
  • He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  770–778, 2016.
  • Jia et al. (2021) Jia, C., Yang, Y., Xia, Y., Chen, Y.-T., Parekh, Z., Pham, H., Le, Q. V., Sung, Y., Li, Z., and Duerig, T. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning (ICML), 2021.
  • Karpiński & Piotrów (2015) Karpiński, M. and Piotrów, M. Smaller selection networks for cardinality constraints encoding. In Proc. Principles and Practice of Constraint Programming (CP 2015, Cork, Ireland), pp.  210–225, Heidelberg/Berlin, Germany, 2015. Springer.
  • Kingma & Ba (2015) Kingma, D. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
  • Knuth (1998) Knuth, D. E. The Art of Computer Programming, Volume 3: (2nd Ed.) Sorting and Searching. Addison Wesley Longman Publishing Co., Inc., 1998.
  • Kolesnikov et al. (2020) Kolesnikov, A., Beyer, L., Zhai, X., Puigcerver, J., Yung, J., Gelly, S., and Houlsby, N. Big transfer (bit): General visual representation learning. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part V 16, pp.  491–507. Springer, 2020.
  • Krizhevsky et al. (2009) Krizhevsky, A., Nair, V., and Hinton, G. CIFAR-10 (Canadian Institute for Advanced Research). 2009. URL http://www.cs.toronto.edu/~kriz/cifar.html.
  • Lapin et al. (2015) Lapin, M., Hein, M., and Schiele, B. Top-k multiclass svm. In Advances in Neural Information Processing Systems (NeurIPS), 2015.
  • Lapin et al. (2016) Lapin, M., Hein, M., and Schiele, B. Loss functions for top-k error: Analysis and insights. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.  1468–1477, 2016.
  • Lee et al. (2021) Lee, H., Cho, S., Jang, Y., Kim, J., and Woo, H. Differentiable ranking metric using relaxed sorting for top-k recommendation. IEEE Access, 9:114649–114658, 2021.
  • Mahajan et al. (2018) Mahajan, D., Girshick, R., Ramanathan, V., He, K., Paluri, M., Li, Y., Bharambe, A., and Van Der Maaten, L. Exploring the limits of weakly supervised pretraining. In Proceedings of the European conference on computer vision (ECCV), 2018.
  • Northcutt et al. (2021) Northcutt, C. G., Athalye, A., and Mueller, J. Pervasive label errors in test sets destabilize machine learning benchmarks. 35th Conference on Neural Information Processing Systems (NeurIPS 2021) Track on Datasets and Benchmarks., 2021.
  • Parberry (1992) Parberry, I. The pairwise sorting network. Parallel Processing Letters, 2:205–211, 1992.
  • Petersen et al. (2021a) Petersen, F., Borgelt, C., Kuehne, H., and Deussen, O. Differentiable sorting networks for scalable sorting and ranking supervision. In International Conference on Machine Learning (ICML), 2021a.
  • Petersen et al. (2021b) Petersen, F., Borgelt, C., Kuehne, H., and Deussen, O. Learning with Algorithmic Supervision via Continuous Relaxations. In Advances in Neural Information Processing Systems (NeurIPS), 2021b.
  • Petersen et al. (2022) Petersen, F., Borgelt, C., Kuehne, H., and Deussen, O. Monotonic Differentiable Sorting Networks. In International Conference on Learning Representations (ICLR), 2022.
  • Pham et al. (2021) Pham, H., Dai, Z., Xie, Q., and Le, Q. V. Meta pseudo labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  11557–11568, 2021.
  • Pietruszka et al. (2020) Pietruszka, M., Borchmann, Ł., and Graliński, F. Successive halving top-k operator. arXiv preprint arXiv:2010.15552, 2020.
  • Prillo & Eisenschlos (2020) Prillo, S. and Eisenschlos, J. Softsort: A continuous relaxation for the argsort operator. In International Conference on Machine Learning (ICML), 2020.
  • Radford et al. (2021) Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021.
  • Ridnik et al. (2021) Ridnik, T., Ben-Baruch, E., Noy, A., and Zelnik-Manor, L. Imagenet-21k pretraining for the masses. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
  • Sun et al. (2017) Sun, C., Shrivastava, A., Singh, S., and Gupta, A. Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE international conference on computer vision, pp.  843–852, 2017.
  • Wah & Chen (1984) Wah, B. W. and Chen, K.-L. A partitioning approach to the design of selection networks. IEEE Trans. Computers, 33:261–268, 1984.
  • Xie et al. (2020a) Xie, Q., Luong, M.-T., Hovy, E., and Le, Q. V. Self-training with noisy student improves imagenet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  10687–10698, 2020a.
  • Xie et al. (2020b) Xie, Y., Dai, H., Chen, M., Dai, B., Zhao, T., Zha, H., Wei, W., and Pfister, T. Differentiable top-k with optimal transport. In Advances in Neural Information Processing Systems (NeurIPS), 2020b.
  • Yang & Koyejo (2020) Yang, F. and Koyejo, S. On the consistency of top-k surrogate losses. In International Conference on Machine Learning (ICML), pp. 10727–10735. PMLR, 2020.
  • Zazon-Ivry & Codish (2012) Zazon-Ivry, M. and Codish, M. Pairwise networks are superior for selection. Unpublished manuscript, 2012.
  • Zhai et al. (2021) Zhai, X., Kolesnikov, A., Houlsby, N., and Beyer, L. Scaling vision transformers. arXiv preprint arXiv:2106.04560, 2021.

Appendix A Extension to Top-10 and Top-20

We further extend the training settings, measuring the impact of top-10 and top-20 components on the large-scale ImageNet-21K-P data set. The results are displayed in Table 6, where we report top-1, top-5, top-10, and top-20 accuracy for all configurations. Again, we observe that 50% top-1 and 50% top-k produces the overall best performance and that training with top-5 yields the best top-1, top-5, and top-10 accuracy. We observe that the performance decays for top-20 components because (even among 10,450 classes) there are virtually no top-20 ambiguities, and artifacts of the differentiable sorting methods can cause adverse effects. Note that top-10 ambiguities do exist in ImageNet-21K-P, e.g., there are 11 class hierarchy levels (Ridnik et al., 2021).

IN-21K-P / P_K (@5):   [1,0,...,0]   [.5,0,...,0,.5]   [.25,0,...,0,.75]   [.1,0,...,0,.9]   (each of length 5)
Softmax (baseline)     39.29 | 69.63 | 78.55 | 85.33
NeuralSort             37.85 | 68.08 | 77.22 | 84.21     36.16 | 67.60 | 76.96 | 84.08     33.02 | 67.29 | 76.88 | 84.05
SoftSort               39.93 | 70.63 | 79.45 | 85.96     39.08 | 70.27 | 79.29 | 85.94     37.78 | 70.07 | 79.19 | 85.87
SinkhornSort           39.85 | 70.56 | 79.53 | 86.13     39.21 | 70.41 | 79.54 | 86.18     38.42 | 70.12 | 79.44 | 86.12
DiffSortNets           **40.22** | **70.88** | **79.54** | 86.03     39.56 | 70.58 | 79.44 | 86.01     38.48 | 70.25 | 79.29 | 85.90

IN-21K-P / P_K (@10):  [1,0,...,0]   [.5,0,...,0,.5]   [.25,0,...,0,.75]   [.1,0,...,0,.9]   (each of length 10)
Softmax (baseline)     39.33 | 69.62 | 78.55 | 85.36
NeuralSort             37.22 | 67.02 | 76.75 | 84.10     34.59 | 66.09 | 76.46 | 84.01     29.60 | 65.16 | 76.26 | 84.01
SoftSort               39.26 | 69.52 | 79.13 | 85.93     37.71 | 68.56 | 78.71 | 85.78     33.68 | 67.35 | 78.43 | 85.70
SinkhornSort           39.65 | 70.25 | 79.47 | 86.22     38.90 | 69.91 | 79.41 | **86.25**     37.98 | 69.57 | 79.33 | 86.16
DiffSortNets           39.92 | 70.13 | 79.38 | 86.02     39.10 | 69.60 | 79.21 | 86.03     37.88 | 69.07 | 79.04 | 85.91

IN-21K-P / P_K (@20):  [1,0,...,0]   [.5,0,...,0,.5]   [.25,0,...,0,.75]   [.1,0,...,0,.9]   (each of length 20)
Softmax (baseline)     39.33 | 69.62 | 78.55 | 85.36
NeuralSort             36.32 | 65.33 | 75.82 | 83.99     33.00 | 62.99 | 74.84 | 83.83     27.35 | 60.34 | 74.02 | 83.77
SoftSort               38.04 | 65.98 | 77.17 | 85.45     34.30 | 62.89 | 76.03 | 85.19     24.02 | 56.35 | 74.32 | 84.82
SinkhornSort           39.76 | 69.76 | 79.17 | 86.17     38.77 | 69.18 | 78.99 | 86.20     37.71 | 68.68 | 78.86 | 86.16
DiffSortNets           39.54 | 68.59 | 77.95 | 85.49     38.62 | 67.62 | 77.43 | 85.37     37.46 | 66.80 | 77.01 | 85.17

Table 6: ImageNet-21K-P with top-5, top-10, and top-20 components. Each cell reports (Top-1 | Top-5 | Top-10 | Top-20) accuracy; bold marks the best value per metric.
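To make the weighting by P_K concrete, the following is a minimal sketch (not the implementation used for the experiments above) of how a P_K-weighted top-k cross-entropy can be assembled once one of the differentiable sorting operators from Table 6 has produced a relaxed ranking matrix. The function name, the assumed tensor layout of P, and the eps constant are our own choices for illustration.

```python
import torch

def topk_cross_entropy(P, targets, p_k, eps=1e-12):
    """P_K-weighted top-k cross-entropy given a relaxed ranking matrix.

    P:       (batch, n, n) relaxed permutation matrices, where P[b, r, c] is the
             (relaxed) probability that class c is ranked at position r
             (r = 0 being the highest score); assumed to come from a
             differentiable sorting / ranking operator.
    targets: (batch,) ground-truth class indices.
    p_k:     weights of P_K, e.g. [0.5, 0, 0, 0, 0.5] for 50% top-1 / 50% top-5.
    """
    batch = P.shape[0]
    # probability mass that the true class occupies each rank
    rank_probs = P[torch.arange(batch), :, targets]        # (batch, n)
    # probability that the true class lies within the top (k+1) ranks
    topk_probs = torch.cumsum(rank_probs, dim=-1)          # (batch, n)
    loss = torch.zeros((), dtype=P.dtype)
    for k, w in enumerate(p_k):
        if w > 0:
            loss = loss - w * torch.log(topk_probs[:, k] + eps).mean()
    return loss

# toy usage with a stand-in for a differentiable sorting operator
P = torch.softmax(torch.randn(2, 5, 5), dim=1)   # columns are per-class rank distributions
targets = torch.tensor([3, 1])
print(topk_cross_entropy(P, targets, p_k=[0.5, 0.0, 0.0, 0.0, 0.5]))
```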

Appendix B Splitter Selection Networks

Similar to a sorting network, a selection network is generally a comparator network and hence consists of wires (or lanes) carrying values and comparators (or conditional swap devices) connecting pairs of wires. A comparator swaps the values on the wires it connects if they are not in the desired order. However, in contrast to a sorting network, which sorts all the values carried by its wires, a (k, n) selection network, which has n wires, moves the k ≤ n largest (or, alternatively, the k smallest) values to a specific set of wires (Knuth, 1998), most conveniently consecutive wires on one side of the wire array. Note that the notion of a selection network usually does not require that the selected values are sorted. However, in our context it is preferable that they are, so that P_K can easily be applied, and the selection networks discussed below all have this property.

Clearly, any sorting network could be used as a selection network, namely by focusing only on the top k (or bottom k) wires. However, especially if k is small compared to n, it is possible to construct selection networks of smaller size (i.e., fewer comparators) and often lower depth (i.e., a smaller number of layers, where a layer is a set of comparators that can be executed in parallel).

A core idea for constructing selection networks was proposed in (Wah & Chen, 1984), based on the odd-even merge and bitonic sorting networks (Batcher, 1968): partition the n wires into subsets of at least k wires (preferably 2^⌈log₂(k)⌉ wires per subset) and sort each subset with odd-even mergesort. Then merge the (sorted) top k elements of each subset with bitonic merges, thus halving the number of (sorted) subsets. Repeat merging pairs of (sorted) subsets until only a single (sorted) subset remains, the top k elements of which are the desired selection. This approach requires ½⌈log₂(k)⌉(⌈log₂(k)⌉+1) + (⌈log₂(n)⌉−⌈log₂(k)⌉)(⌈log₂(k)⌉+1) layers.
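For concreteness, this depth formula can be evaluated directly. The following small helper (its name is ours) reproduces, for example, the corresponding selection-network depths listed in Table 7.

```python
import math

def oddeven_selection_depth(n, k):
    """Layers of a Wah & Chen-style selection network: sort blocks of
    2^ceil(log2 k) wires with odd-even mergesort, then repeatedly merge
    pairs of sorted top-k blocks with bitonic merges."""
    a = math.ceil(math.log2(k))   # ceil(log2 k)
    b = math.ceil(math.log2(n))   # ceil(log2 n)
    return a * (a + 1) // 2 + (b - a) * (a + 1)

# matches the "odd-even/pairwise/bitonic selection" column of Table 7:
print(oddeven_selection_depth(1024, 5))   # 34
print(oddeven_selection_depth(10450, 2))  # 27
```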

Improvements to this basic scheme were developed in (Zazon-Ivry & Codish, 2012; Karpiński & Piotrów, 2015) and rely either entirely on odd-even merges (Batcher, 1968) or entirely on pairwise sorting networks (Parberry, 1992). In particular, selection networks based on pairwise sorting networks have advantages in terms of network size (i.e., the number of comparators needed). However, these improvements do not change the depth of the networks, that is, the number of layers, which is what matters most in the context considered here.

Figure 4: Minimum ranks after a splitter cascade resulting from the transitive closure of the swaps.

Our own selection network construction draws on this work by focusing on a specific ingredient of pairwise sorting networks, namely a so-called splitter (which happens to be identical to a single bitonic merge layer, but for our purposes it is more comprehensible to refer to it as a splitter). A splitter for a list of m wires with indices [ℓ_0, …, ℓ_{m−1}] has comparators connecting wires ℓ_i and ℓ_{i+s}, where s = 2^(⌈log₂(m)⌉−1), for i ∈ {0, …, m−1−s}.
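As an illustrative sketch (the function name is ours), the comparator pairs of a splitter can be generated as follows; for m = 8 wires this yields the pairs (0,4), (1,5), (2,6), (3,7), and for m = 6 it yields (0,4), (1,5).

```python
import math

def splitter(wires):
    """Comparator pairs of a splitter on the given list of wire indices:
    wire i is compared with wire i + s, where s = 2^(ceil(log2 m) - 1)."""
    m = len(wires)
    if m < 2:
        return []
    s = 1 << (math.ceil(math.log2(m)) - 1)
    return [(wires[i], wires[i + s]) for i in range(m - s)]

print(splitter(list(range(8))))   # [(0, 4), (1, 5), (2, 6), (3, 7)]
print(splitter(list(range(6))))   # [(0, 4), (1, 5)]
```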

A pairwise sorting network starts with what we call a splitter cascade: an initial splitter partitions the input wires into subsets of (roughly) equal size, and each subset is split recursively until wire singletons result (Zazon-Ivry & Codish, 2012). An example of such a splitter cascade is shown for 8 wires in Figure 4 and, in purple, for 16 wires in Figure 5 (arrows point to where the larger value is desired).

Figure 5: A (5,16) selection network constructed with the method described in the text. The numbers on the wires are the minimum ranks (starting at 0) that can be occupied by the values on these wires. Red crosses mark where wires can be excluded, green check marks where a top rank is determined. Swaps in blocks of equal color belong to the same splitter cascade. Swaps in gray boxes would be needed for full splitter cascades, but are not needed to determine the top 5 ranks.

After a splitter cascade, the value carried by wire ℓ_i has a minimum rank of r = 2^b(i) − 1, where b(i) counts the number of set bits in the binary representation of i. This minimum rank results from the transitivity of the swap operations in the splitter cascade, as illustrated in Figure 4 for 8 wires: by following upward paths (in splitters to the left) through the splitter cascade, one can find for each wire ℓ_i exactly r = 2^b(i) − 1 wires with smaller indices that must carry values no less than the value carried by wire ℓ_i. This yields the minimum ranks shown in Figure 4 on the right.
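This lower bound can be checked exhaustively for small instances. The following sketch (all names are ours) builds the splitter cascade for 8 wires, applies it to every permutation of 8 distinct values, and verifies that the smallest rank ever observed on wire i is at least 2^b(i) − 1; ranks are 0-based with rank 0 denoting the largest value, and larger values are moved to lower-indexed wires.

```python
import itertools
import math

def splitter_cascade(wires):
    """Splitter on the given wires, applied recursively to both halves."""
    m = len(wires)
    if m < 2:
        return []
    s = 1 << (math.ceil(math.log2(m)) - 1)
    comps = [(wires[i], wires[i + s]) for i in range(m - s)]   # the splitter itself
    return comps + splitter_cascade(wires[:s]) + splitter_cascade(wires[s:])

def apply_network(comparators, values):
    v = list(values)
    for a, b in comparators:      # larger value moves to the lower-indexed wire
        if v[a] < v[b]:
            v[a], v[b] = v[b], v[a]
    return v

n = 8
cascade = splitter_cascade(list(range(n)))
best_rank = [n] * n               # smallest rank (0 = largest value) observed per wire
for perm in itertools.permutations(range(n)):
    out = apply_network(cascade, perm)
    for wire, value in enumerate(out):
        best_rank[wire] = min(best_rank[wire], n - 1 - value)
for i in range(n):
    bound = (1 << bin(i).count("1")) - 1        # 2^popcount(i) - 1
    print(f"wire {i}: observed minimum rank {best_rank[i]}, bound {bound}")
    assert best_rank[i] >= bound
```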

The core idea of our selection network construction is to use splitter cascades to increase the minimum ranks of (the values carried by) the wires. If such a minimum rank exceeds k (or equals k, since we work with zero-based ranks and hence are interested in ranks {0, …, k−1}), a wire can be discarded, since its value is certainly not among the top k. On the other hand, if there is only one wire with minimum rank 0, the top-1 value has been determined. More generally, if each minimum rank no greater than some value r occurs for one wire only, the top r+1 values have been determined.

We exploit this as follows: Initially, all wires are assigned a minimum rank of 0, since at the beginning we do not know anything about the values they carry. We then repeat the following construction: traversing the values r = k−1, …, 0 in descending order, we collect for each r all wires with minimum rank r and apply a splitter cascade to them (provided there are at least two such wires). Suppose the wires collected for a minimum rank r have indices [ℓ_0, …, ℓ_{m(r)−1}]. After the splitter cascade, we can update the minimum rank of wire ℓ_i to r + 2^b(i) − 1: before the splitter cascade there is no known relationship between wires with the same minimum rank, while the splitter cascade establishes relationships between them, increasing their minimum ranks by 2^b(i) − 1. This procedure of traversing the minimum ranks k−1, …, 0 in descending order, collecting wires with the same minimum rank, and applying splitter cascades to them is repeated until each of the minimum ranks 0, …, k−1 occurs only once.
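The following sketch (names are ours) traces only this minimum-rank bookkeeping, i.e., which groups of wires receive a splitter cascade and which wires are discarded; it does not schedule the resulting comparators into parallel layers and therefore does not reproduce the depths reported in Table 7.

```python
from collections import Counter

def rank_gain(j):
    # A splitter cascade raises the minimum rank of its j-th wire by 2^popcount(j) - 1.
    return (1 << bin(j).count("1")) - 1

def splitter_selection_bookkeeping(n, k):
    """Minimum-rank bookkeeping of the splitter-based (k, n) selection construction."""
    rank = {w: 0 for w in range(n)}        # minimum rank (0-based) of each surviving wire
    cascaded_groups = []                   # groups of wires that receive a splitter cascade
    while max(Counter(rank.values()).values()) > 1:      # some minimum rank occurs twice
        for r in range(k - 1, -1, -1):                   # traverse ranks in descending order
            group = sorted(w for w in rank if rank[w] == r)
            if len(group) < 2:
                continue
            cascaded_groups.append(group)
            for j, w in enumerate(group):
                new_rank = r + rank_gain(j)
                if new_rank >= k:
                    del rank[w]            # certainly not among the top k: discard
                else:
                    rank[w] = new_rank
    return cascaded_groups, sorted(rank, key=rank.get)

groups, top_wires = splitter_selection_bookkeeping(16, 5)
print(f"{len(groups)} splitter cascades; wires holding the top 5 (ranks 0..4): {top_wires}")
```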

As an example, Figure 5 shows a (5,16) selection network constructed in this manner, in which the minimum ranks of the wires are indicated after certain layers, as well as when certain wires can be discarded (red crosses) and when certain top ranks are determined (green check marks). Comparators belonging to the same splitter cascade are shown in the same color.

                       odd-even/pairwise/bitonic selection            splitter selection
    n    full sort     k=1   2    3    4    5    6    7    8          k=1   2    3    4    5    6    7    8
   16       10           4   7    9    9   10   10   10   10            4   6    7    8   10   11   12   13
 1024       55          10  19   27   27   34   34   34   34           10  14   16   18   22   25   27   29
10450      105          14  27   39   39   50   50   50   50           14  18   20   23   27   30   32   34
65536      136          16  31   45   45   58   58   58   58           16  20   22   25   29   32   34   36

Table 7: Depths of full sorting networks and of selection networks derived from them (these depths coincide for the odd-even, pairwise, and bitonic constructions), compared to selection networks constructed with our splitter-based approach. Note that for small n and comparatively large k an odd-even/pairwise/bitonic selection network or even a full sorting network may be preferable (e.g., n = 16 and k > 5), but that for larger n considerable savings can be obtained for small k, even compared to other selection networks.

While selection networks resulting from adaptations of sorting networks (see above) have the advantage that their number of layers is guaranteed never to exceed that of a full sorting network, our approach may produce networks with more layers. However, if k is sufficiently small compared to n (in particular, if k ≤ log₂(n)), our approach can produce selection networks with considerably fewer layers, as demonstrated in Table 7. Since in the context considered here we can expect k ≤ log₂(n), splitter-based selection networks are often superior.