
Representation Learning by Ranking across Multiple Tasks

Abstract

In recent years, representation learning has become the research focus of the machine learning community. Large-scale neural networks are a crucial step toward achieving general intelligence, with their success largely attributed to their ability to learn abstract representations of data. Several learning fields are actively discussing how to learn representations, yet there is a lack of a unified perspective. We convert the representation learning problem under different tasks into a ranking problem. By adopting the ranking problem as a unified perspective, representation learning tasks can be solved in a unified manner by optimizing the ranking loss. Experiments under various learning tasks, such as classification, retrieval, multi-label learning, and regression, demonstrate the superiority of the representation learning by ranking framework. Furthermore, experiments under self-supervised learning tasks demonstrate the significant advantage of the ranking framework in processing unsupervised training data, with data augmentation techniques further enhancing its performance.

Lifeng Gu

Tianjin University

gulifeng666@163.com


1 Introduction

Several works model the representation learning problem as a ranking problem (Varamesh et al., 2020; Cakir et al., 2019). However, these efforts typically focus on developing ranking-based methods to address specific tasks, like classification. We argue that the representation learning by ranking framework offers a deeper, more intrinsic connection to representation learning.

Representation learning can naturally be framed as a ranking problem. Typically, a neural network first maps inputs into a feature space and then generates prediction labels. The ranking framework can guide the learning process within this feature space based on the similarity between labels.

Consider a model $f:\mathbb{R}^{n}\to\mathbb{R}^{m}$ that maps input samples into an $m$-dimensional feature space, with a sample set $\{x_{i}\mid i=1,2,\ldots,n\}$ and a label set $\{y_{i}\mid i=1,2,\ldots,n\}$. For any sample $x_{i}$, we treat it as a query, with all other samples $\{x_{j}\mid j\neq i\}$ as candidates. The model transforms $x_{i}$ into the feature $f(x_{i})$ and similarly maps the candidate samples into $\{f(x_{j})\mid j\neq i\}$. Using a predefined similarity function, we compute the similarity set $\{\text{sim}(f(x_{i}),f(x_{j}))\mid j\neq i\}$. Our objective is to rank this similarity set according to a true order, thereby guiding the model’s learning.
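To make this setup concrete, the following minimal PyTorch sketch builds the similarity set for one query sample; the toy encoder, input dimension, and batch are our own placeholders, not the paper's architecture:

```python
import torch

n, m = 8, 16
f = torch.nn.Linear(32, m)          # stand-in for the encoder f: R^32 -> R^m
x = torch.randn(n, 32)              # a batch of samples x_1, ..., x_n

z = f(x)                            # features f(x_i), shape (n, m)
sims = z @ z.T                      # sims[i, j] = sim(f(x_i), f(x_j)), dot product
i = 0                               # treat x_i as the query
candidate_sims = sims[i][torch.arange(n) != i]   # {sim(f(x_i), f(x_j)) | j != i}
```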

A straightforward solution is to use a ranking loss to optimize the order within the similarity set, guided by the true order computed from the samples’ labels.

We utilize an approximate NDCG loss to optimize this order. Experiments across various learning tasks, such as classification, retrieval, multi-label learning, regression, and self-supervised learning, demonstrate the effectiveness of the approximate NDCG loss and, in turn, verify the effectiveness and broad applicability of the representation learning by ranking framework.

2 Related Works

2.1 Representation Learning

According to (Bengio et al., 2013), representation learning aims to derive data representations that facilitate the extraction of useful information for building classifiers or other predictors. A good representation should also serve as an effective input for supervised predictors. Although numerous fields investigate this problem, a unified perspective remains elusive. We propose categorizing representation learning into supervised, self-supervised, and unsupervised approaches. For instance, supervised ImageNet pre-training models can reduce the complexity of downstream tasks by leveraging representations learned from ImageNet as input. Self-supervised visual representation learning methods, such as SimCLR (Chen et al., 2020) and BYOL (Grill et al., 2020), mitigate the complexity of visual tasks. Unsupervised generative models, including VAEs (Kingma & Welling, 2013; Oord et al., 2017) and BiGANs (Donahue et al., 2016; Donahue & Simonyan, 2019), enable data generation or disentangle variation factors (Chen et al., 2016; Dupont, 2018; Kim & Mnih, 2018). Similarly, unsupervised language models like BERT (Devlin et al., 2018) learn contextual language representations, simplifying downstream language tasks.

2.2 Deep Metric Learning

Deep metric learning seeks to develop effective metrics for measuring sample similarity, typically by computing distances between representations. In tasks like retrieval, a good representation inherently implies a robust metric, and vice versa. Popular deep metric learning methods include Contrastive Loss (Hadsell et al., 2006), Center Loss (Wen et al., 2016), N-Pair Loss (Sohn, 2016), and Circle Loss (Sun et al., 2020). Zhu et al. (2018) proposed the concept of relation alignment, extending metric learning to multiple tasks, which motivated us to generalize it to broader areas.

2.3 Self-Supervised Representation Learning

In recent years, self-supervised learning has gained significant attention for representation learning. Early work, such as Deep InfoMax (DIM) (Hjelm et al., 2018), simultaneously estimates and maximizes mutual information between input data and high-level representations, employing adversarial learning to align representations with prior constraints. Contrastive Predictive Coding (CPC) (Oord et al., 2018) uses a powerful autoregressive model to predict future representations in latent space, introducing the InfoNCE objective, which has since become widely adopted. Contrastive Multiview Coding (CMC) (Tian et al., 2020a) builds on the assumption that good representations maintain consistency across perspectives, achieving this by maximizing mutual information between different views of the same sample; more views typically yield better results.

Tschannen et al. (Tschannen et al., 2019) synthesized prior work to analyze the principles of mutual information maximization. They argue that while this approach often enhances downstream task performance, it can occasionally degrade it. They suggest that its success may be attributed to its alignment with metric learning, a proven paradigm for effective representation learning. DeepCluster (Caron et al., 2018) integrates clustering with representation learning, iteratively assigning cluster labels and using them as pseudo-labels to refine representations. SwAV (Caron et al., 2020) employs multiple prototypes for clustering, ensuring consistency across data views.

Recent developments have further advanced the field. DINO (Caron et al., 2021) introduces self-distillation with no labels, eliminating the need for negative samples while achieving state-of-the-art results. SimSiam (Chen & He, 2021) explores a simple siamese architecture, leveraging positive samples alone to learn robust representations. Shen et al. (Shen et al., 2020) investigate the impact of hybrid data augmentation strategies on representation quality, while Tian et al. (Tian et al., 2020b) provide theoretical insights into optimal view selection for maximizing representation quality.

3 Background

In this section, we discuss the ranking problem and learning to rank.

3.1 Ranking

Ranking and learning to rank are classic problems with extensive research (Xu et al., 2008; Xia et al., 2008). Given an input query, a retrieval system aims to sort and return stored content based on its relevance to the query. The goal of learning to rank is to improve the accuracy of the returned results. One solution is to optimize evaluation metrics. Common evaluation metrics for ranking problems include Precision, Average Precision (AP) (Baeza-Yates et al., 1999), and Normalized Discounted Cumulative Gain (NDCG) (Järvelin & Kekäläinen, 2002); for further details, see (Qin et al., 2010).

Given a query sample $q$ and a returned sorted sample set $S$, the precision at $k$ is defined as:

$$\text{Prec}@k=\frac{1}{k}\sum_{j=1}^{k}r_{j}\qquad(1)$$

where $r_{j}\in\{0,1\}$ indicates the relevance of the $j$-th returned sample to the query sample. If $S_{j}$ is relevant to the query, $r_{j}=1$; otherwise, $r_{j}=0$.

Average Precision (AP) is defined based on precision at $k$ as:

$$\text{AP}=\frac{1}{N}\sum_{j}r_{j}\times\text{Prec}@j\qquad(2)$$

where $N$ represents the number of samples related to the query sample in the returned sample set $S$.

Recent works (Cakir et al., 2019; Brown et al., 2020) have used AP as the optimization target. However, AP only handles binary relevance: each returned sample is either relevant or irrelevant to the query. For multi-label learning tasks, AP is therefore not suitable.

NDCG is an extension of AP that can handle multi-level correlations between returned samples and query samples:

$$\text{NDCG}=N_{n}^{-1}\sum_{j=1}^{n}g(r_{j})d(j)\qquad(3)$$

where $n$ represents the size of the sample set $S$, $r_{j}\geq 0$ represents the relevance score between the $j$-th returned sample and the query sample, and $N_{n}$ is a normalization factor: the value of $\sum_{j=1}^{n}g(r_{j})d(j)$ when the returned samples are sorted in descending order of their true relevance to the query sample. This normalization ensures that NDCG is at most 1. $g(r_{j})$ is the gain function and $d(j)$ is the discount function. Following (Qin et al., 2010), the default settings are $g(r_{j})=2^{r_{j}}-1$ and $d(j)=1/\log_{2}(1+j)$. Substituting these into Equation (3), we obtain:

$$\text{NDCG}=N_{n}^{-1}\sum_{j=1}^{n}(2^{r_{j}}-1)/\log_{2}(1+j)\qquad(4)$$
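For concreteness, all three metrics can be computed from a relevance vector ordered as the system returned it. The NumPy sketch below mirrors Equations (1), (2), and (4); the helper names and the toy relevance vector are our own illustration:

```python
import numpy as np

def precision_at_k(rel: np.ndarray, k: int) -> float:
    """Eq. (1): fraction of relevant items among the top k (rel is binary)."""
    return rel[:k].sum() / k

def average_precision(rel: np.ndarray) -> float:
    """Eq. (2): average of Prec@j taken at the ranks j of relevant items."""
    n_rel = rel.sum()
    return sum(rel[j] * precision_at_k(rel, j + 1) for j in range(len(rel))) / n_rel

def ndcg(rel: np.ndarray) -> float:
    """Eq. (4): DCG with gain 2^r - 1 and log2 discount, normalized by the
    ideal DCG obtained when results are sorted by true relevance."""
    pos = np.arange(1, len(rel) + 1)
    dcg = ((2.0 ** rel - 1.0) / np.log2(1.0 + pos)).sum()
    ideal = np.sort(rel)[::-1]
    idcg = ((2.0 ** ideal - 1.0) / np.log2(1.0 + pos)).sum()
    return dcg / idcg

rel = np.array([3.0, 1.0, 0.0, 2.0])   # graded relevance of returned samples
print(ndcg(rel))                       # Prec@k and AP expect binary relevance
```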

4 A-NDCG

What defines a model with strong representation capabilities? We hypothesize that for any input sample, the similarity set generated by the model should exhibit a correct order: samples similar to the input should rank higher, while dissimilar ones should rank lower. This assumption has been widely validated in classification tasks, and we seek to extend this reasonable hypothesis to a variety of learning tasks. Leveraging the principles of learning-by-ranking, we optimize a unified ranking objective:

$$\text{Rank}(\text{sim}(f(x_{i}),f(x_{j})))=\text{Rank}(\text{sim}(y_{i},y_{j}))\qquad(5)$$

where $j=1,2,\ldots,n$ and $j\neq i$.

Here, $x_{i}$ represents any training sample, and $f(x_{i})$ is an $m$-dimensional vector representing the feature embedding of $x_{i}$. The function $\text{sim}(f(x_{i}),f(x_{j}))$ denotes the similarity between the feature embeddings of samples $x_{i}$ and $x_{j}$, while $\text{sim}(y_{i},y_{j})$ denotes the similarity between their corresponding labels. The objective is to ensure that the ranking of similarities in the feature space matches the ranking of similarities in the label space.

The similarity function $\text{sim}$ varies depending on the learning task. In general, we define the similarity in the feature space as $\text{sim}(f(x_{i}),f(x_{j}))=f(x_{i})\cdot f(x_{j})^{T}$. The similarity in the label space is defined as follows for different tasks (a code sketch of these definitions follows the list):

  • Classification: $\text{sim}(y_{i},y_{j})=\mathbb{I}[y_{i}=y_{j}]$, where $\mathbb{I}$ is the indicator function, which equals 1 if $y_{i}=y_{j}$ and 0 otherwise.

  • Regression: $\text{sim}(y_{i},y_{j})=1-\frac{|y_{i}-y_{j}|}{\max(|y_{i}-y_{j}|)}$, i.e., one minus the normalized absolute difference between the target values.

  • Multi-label Classification: $\text{sim}(y_{i},y_{j})=\text{cosine}(y_{i},y_{j})=\frac{y_{i}\cdot y_{j}^{T}}{\|y_{i}\|\|y_{j}\|}$, measuring the cosine similarity between the label vectors.

  • Self-supervised Classification: $\text{sim}(y_{i},y_{j})$ represents the similarity of pseudo-labels.

  • Pre-training: $\text{sim}(y_{i},y_{j})$ can represent the word distance within a sentence, or other relevant relationships depending on the pre-training task.
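As an illustration, the first three label-space similarities above can be assembled into an $n\times n$ matrix that later feeds the ranking loss. The helper below, including its `task` switch, is our own sketch rather than code from the paper:

```python
import torch
import torch.nn.functional as F

def label_sim_matrix(y: torch.Tensor, task: str) -> torch.Tensor:
    """Pairwise label similarities sim(y_i, y_j) as an (n, n) matrix."""
    if task == "classification":        # indicator I[y_i == y_j]; y: (n,)
        return (y.unsqueeze(0) == y.unsqueeze(1)).float()
    if task == "regression":            # 1 - |y_i - y_j| / max |y_i - y_j|; y: (n,)
        d = (y.unsqueeze(0) - y.unsqueeze(1)).abs()
        return 1.0 - d / d.max().clamp_min(1e-8)
    if task == "multilabel":            # cosine of label vectors; y: (n, c)
        return F.cosine_similarity(y.unsqueeze(0), y.unsqueeze(1), dim=-1)
    raise ValueError(f"unknown task: {task}")
```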

Figure 1 provides an example to illustrate the representation learning by ranking framework.

Figure 1: Suppose there are four samples, $x_{1},x_{2},x_{3},x_{4}$, with corresponding labels $y_{1},y_{2},y_{3},y_{4}$. For the query sample $x_{1}$, if the label space satisfies $\text{sim}(y_{1},y_{4})>\text{sim}(y_{1},y_{2})>\text{sim}(y_{1},y_{3})$, then in the feature space we also hope to satisfy $\text{sim}(f(x_{1}),f(x_{4}))>\text{sim}(f(x_{1}),f(x_{2}))>\text{sim}(f(x_{1}),f(x_{3}))$, i.e., the order of similarity is preserved.

We use the approximate NDCG indicator to optimize this objective. Taking any sample $x_{i}$ from the sample set as the query sample, the approximate NDCG indicator, or A-NDCG loss, can be formulated as:

$$L(x)=\sum_{i}N_{i}^{-1}\sum_{j,j\neq i}^{n}\text{sim}(y_{i},y_{j})/\log_{2}(1+\pi(x_{i},x_{j}))\qquad(6)$$

$$\pi(x_{i},x_{j})=1+\sum_{k,k\neq j}\frac{\exp(-\alpha\,\text{sim}_{ijk})}{1+\exp(-\alpha\,\text{sim}_{ijk})}\qquad(7)$$

$$\text{sim}_{ijk}=\text{sim}(f(x_{i}),f(x_{j}))-\text{sim}(f(x_{i}),f(x_{k}))$$

Here, $\alpha$ is a hyperparameter, and $N_{i}^{-1}$ is the normalization term: $N_{i}$ is the maximum value of $\sum_{j\neq i}\text{sim}(y_{i},y_{j})/\log_{2}(1+\pi(x_{i},x_{j}))$, attained when the order of $\{\text{sim}(f(x_{i}),f(x_{j}))\mid j\neq i\}$ matches that of $\{\text{sim}(y_{i},y_{j})\mid j\neq i\}$.

NDCG is non-differentiable due to the position index $j$. In A-NDCG, $\pi(x_{i},x_{j})$ serves as a differentiable approximation of the position index $j$ used in the NDCG calculation.
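A differentiable implementation follows directly from Equations (6) and (7). The PyTorch sketch below is our own rendering under the dot-product feature similarity defined above (the default $\alpha$ is an assumption); it returns the negative of the approximate NDCG so that minimization maximizes the metric:

```python
import torch

def a_ndcg_loss(features: torch.Tensor, label_sim: torch.Tensor,
                alpha: float = 10.0) -> torch.Tensor:
    """features: (n, m) embeddings f(x_i); label_sim: (n, n) sim(y_i, y_j)."""
    n = features.size(0)
    sim = features @ features.T                    # sim(f(x_i), f(x_j))
    off_diag = ~torch.eye(n, dtype=torch.bool, device=features.device)

    # sim_ijk = sim(f(x_i), f(x_j)) - sim(f(x_i), f(x_k))
    diff = sim.unsqueeze(2) - sim.unsqueeze(1)     # diff[i, j, k]

    # pi(x_i, x_j) = 1 + sum_{k != j} sigmoid(-alpha * sim_ijk)        (Eq. 7)
    s = torch.sigmoid(-alpha * diff)
    s = s - torch.diag_embed(torch.diagonal(s, dim1=1, dim2=2))  # drop k == j
    pi = 1.0 + s.sum(dim=2)

    # per-query approximate DCG over candidates j != i                 (Eq. 6)
    dcg = (label_sim / torch.log2(1.0 + pi) * off_diag).sum(dim=1)

    # N_i: DCG value when candidates are ordered by true label similarity
    ideal = label_sim.masked_fill(~off_diag, float("-inf")) \
                     .sort(dim=1, descending=True).values
    pos = torch.arange(1, n, device=features.device, dtype=features.dtype)
    ideal_dcg = (ideal[:, : n - 1] / torch.log2(1.0 + pos)).sum(dim=1)

    return -(dcg / ideal_dcg.clamp_min(1e-8)).mean()   # minimize -A-NDCG
```

In a training loop, `label_sim` would come from the task-specific similarity functions of the previous list, e.g. `label_sim_matrix(y, "classification")`.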

The advantages of the approximate NDCG loss are as follows:

  1. It can naturally incorporate an arbitrary number of perspectives of the training data, thereby relaxing the constraint of using only two perspectives as required by popular contrastive learning algorithms (Chen et al., 2020; Grill et al., 2020).

  2. It imposes minimal constraints, focusing solely on the ordering of samples without requiring additional conditions, which facilitates the learning of robust representations.

  3. Compared to contrastive learning methods (Chen et al., 2020; Grill et al., 2020) and learning-to-rank methods based on optimized average precision (Varamesh et al., 2020), the approximate NDCG loss is applicable in any scenario where a label similarity set can be obtained. This versatility makes it suitable for a wide range of learning tasks.

  4. Due to its capability to handle diverse label types, label-level data augmentation methods can be applied to the training data to continuously enhance the performance of the approximate NDCG loss, thereby fully leveraging the available training data.

5 A Unified Perspective on Understanding Modern Language Models and Classical Methods

Thanks to the broad applicability of ranking-based representation learning, we can understand modern and classical methods from a unified perspective. Modern language models typically consist of two stages: pre-training and fine-tuning. Classical methods (such as traditional image classification models) usually do not have a pre-training stage. Through the representation learning by ranking framework, the learning process of classical methods can also be divided into two stages: the feature learning stage using ranking loss like A-NDCG and the label learning stage using a simple classifier or regressor. This naturally corresponds to the pre-training and fine-tuning stages. The only difference is that the pre-training stage of modern language models uses a large dataset different from the downstream task, while classical methods use a small dataset that is the same as the downstream task. This unification provides deeper insights into understanding representation learning and offers a broader perspective for designing specific pre-training objectives. By simply identifying ranking relationships, new pre-training objectives can be designed. Figure 2 illustrates this unification.
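As an illustration of this two-stage reading, the self-contained sketch below (synthetic data; `a_ndcg_loss` and `label_sim_matrix` are the sketches from Section 4) first learns features by ranking, then fits a simple label learner on the frozen features:

```python
import torch
from sklearn.linear_model import LogisticRegression

x = torch.randn(256, 32)                      # toy inputs
y = torch.randint(0, 4, (256,))               # toy class labels
encoder = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(),
                              torch.nn.Linear(64, 16))
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)

for _ in range(100):                          # stage 1: feature learning by ranking
    opt.zero_grad()
    loss = a_ndcg_loss(encoder(x), label_sim_matrix(y, "classification"))
    loss.backward()
    opt.step()

with torch.no_grad():                         # stage 2: a simple label learner
    feats = encoder(x).numpy()
clf = LogisticRegression(max_iter=1000).fit(feats, y.numpy())
```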

Figure 2: A unified perspective on understanding modern language models and classical methods.

6 Experiment

To evaluate A-NDCG and verify the advantages of the learning-by-ranking framework, we conducted experiments under a variety of learning tasks: classification, retrieval, self-supervised learning, multi-label classification, and regression.

6.1 Classification Task

In the classification task, the learned representation should enable a linear classifier such as softmax to solve the classification problem effectively. This paper uses the Cross-Entropy loss and its variant (Liu et al., 2016), as well as the supervised contrastive learning algorithm SupCon (Khosla et al., 2020), as comparison algorithms. The classification accuracy of a linear softmax classifier on the popular CIFAR-10 and CIFAR-100 datasets is then used to evaluate the approximate NDCG loss.

6.1.1 Implementation Details

For the approximate NDCG loss, this paper uses the standard residual network, ResNet-50, as the encoder. Following the standard practice (Chen et al., 2020), a small projection network composed of a two-layer MLP and ReLU activation function is added after the residual network. We use the standard Adam optimizer (Kingma & Ba, 2014).
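A sketch of this architecture follows; the paper specifies only a two-layer MLP with ReLU after the backbone, so the 512/128 projection widths below are our assumptions:

```python
import torch.nn as nn
from torchvision.models import resnet50

backbone = resnet50()
backbone.fc = nn.Identity()        # expose the 2048-d pooled features
projection = nn.Sequential(nn.Linear(2048, 512), nn.ReLU(), nn.Linear(512, 128))
model = nn.Sequential(backbone, projection)   # outputs fed to the A-NDCG loss
```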

6.1.2 Experimental Results Analysis

Table 1 shows that the performance of A-NDCG on the CIFAR-10 and CIFAR-100 datasets exceeds that of the Cross-Entropy loss and some of its variants, such as Max-Margin (Liu et al., 2016), and is comparable to that of SupCon (Khosla et al., 2020). Compared to the Cross-Entropy loss and Max-Margin (Liu et al., 2016), A-NDCG can utilize the relationships between sample pairs instead of just the relationship between a single sample and its label. It also relaxes the limitation of SimCLR (Chen et al., 2020) on the number of perspectives and allows the use of more comparative information between samples.

6.2 Retrieval Task

The image retrieval task is a standard evaluation task in the field of deep metric learning. The representation learned for the retrieval task should enable simple learners such as k-NN to retrieve samples.

We compare A-NDCG against various deep metric learning algorithms (Wang et al., 2019) and use the standard CUB-200-2011 dataset (Wah et al., 2011) to evaluate its retrieval performance.
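Recall@K, the metric reported in Table 2, counts a query as correct if any of its K nearest neighbors in feature space shares its class label. A minimal sketch of this evaluation (our own helper, not the paper's code):

```python
import torch

def recall_at_k(feats: torch.Tensor, labels: torch.Tensor, k: int) -> float:
    """feats: (n, m) embeddings; labels: (n,) class ids."""
    sims = feats @ feats.T
    sims.fill_diagonal_(float("-inf"))       # a query cannot retrieve itself
    topk = sims.topk(k, dim=1).indices       # indices of the k nearest neighbors
    hit = (labels[topk] == labels.unsqueeze(1)).any(dim=1)
    return hit.float().mean().item()
```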

The implementation details are consistent with the previous section.

6.2.1 Experimental Results Analysis

Table 2 shows that on the CUB-200-2011 dataset, A-NDCG outperforms many popular deep metric learning methods. In our implementation, the number of positive and negative samples involved in the calculation is the same as in the other methods. Compared to these methods, A-NDCG only needs to constrain the order relationship, without specifying a fixed margin between samples, resulting in more robust performance on the test data.

Table 1: Classification accuracy on the CIFAR-10 and CIFAR-100 datasets
dataset SimCLR Cross-Entropy Max-Margin (Liu et al., 2016) SupCon (Khosla et al., 2020) A-NDCG
CIFAR10 93.6 95.0 92.4 96.0 95.3
CIFAR100 70.7 75.3 70.5 76.5 76.7
Table 2: Retrieval performance (Recall@K) on the CUB-200-2011 dataset; subscripts denote the embedding dimension
Rank@K 1 2 4 8 16 32
Clustering64 48.2 61.4 71.8 81.9 - -
ProxyNCA64 49.2 61.9 67.9 72.4 - -
Smart Mining64 49.8 62.3 74.1 83.3 - -
HTL512 57.1 68.8 78.7 86.5 92.5 95.5
ABIER512 57.5 68.7 78.3 86.2 91.9 95.5
MS-Loss512 57.5 70.3 80.0 88.0 93.2 96.2
A-NDCG512 58.3 70.7 80.5 88.5 93.8 96.9

6.3 Multi-label Learning

Multi-label learning is a traditional research direction with a substantial body of results (Zhang & Zhou, 2007). The representations obtained should reduce the learning difficulty of downstream multi-label learning algorithms.

Following (Zhu et al., 2018), we use Hamming loss and Jaccard score to evaluate performance. Hamming loss measures the difference between predicted and true label sets using the Hamming distance, where lower values are better. The Jaccard score measures the ratio of the intersection to the union of the predicted and true label sets, with higher values indicating better performance.
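Both metrics are available in scikit-learn; a toy usage sketch on binary label-indicator matrices (the example labels are our own illustration):

```python
import numpy as np
from sklearn.metrics import hamming_loss, jaccard_score

y_true = np.array([[1, 0, 1], [0, 1, 0]])    # true multi-label annotations
y_pred = np.array([[1, 0, 0], [0, 1, 0]])    # predicted annotations
print(hamming_loss(y_true, y_pred))                       # lower is better
print(jaccard_score(y_true, y_pred, average="samples"))   # higher is better
```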

6.3.1 Dataset

We use a variety of popular multi-label datasets from the MULAN repository (http://mulan.sourceforge.net/datasets-mlc.html). Corel5k is an image dataset with 5000 images, Scene contains more than 2000 images, and Medical and Enron are two text datasets with 978 and 1702 samples, respectively.

6.3.2 Evaluation Algorithm

MLKNN and BRKNN are two distance-based multi-label learning algorithms; we use them to evaluate A-NDCG.

6.3.3 Experimental Results Analysis

Tables 3 and 4 clearly show that as the number of training epochs increases, the Hamming loss continues to decrease and the Jaccard score continues to increase: the A-NDCG loss is very effective at learning good representations on multi-label data. A-NDCG makes full use of the label information of the multi-label dataset: even for samples whose multi-label annotations are very close, it provides a concrete optimization target for distinguishing them, so that samples with similar labels can be separated in the feature space, ultimately reducing the learning difficulty of the downstream multi-label learner.

Table 3: Performance on multi-label datasets (Hamming loss and Jaccard score); we use MLKNN (Zhang & Zhou, 2007) to evaluate.
dataset indicator MLKNN A-NDCG, epoch:10 20 30 40 50
Scene Hamming Loss 0.102 0.095 0.088 0.090 0.088 0.093
Jaccard Score 0.610 0.698 0.707 0.710 0.722 0.715
Corel5k Hamming Loss 0.012 0.011 0.011 0.011 0.012 0.012
Jaccard Score 0.094 0.1116 0.134 0.133 0.133 0.125
Medical Hamming Loss 0.020 0.013 0.013 0.013 0.013 0.011
Jaccard Score 0.512 0.726 0.74 0.74 0.74 0.739
Enron Hamming Loss 0.062 0.05 0.05 0.05 0.05 0.056
Jaccard Score 0.329 0.441 0.449 0.451 0.455 0.456
Table 4: Performance on multi-label datasets; we use BRKNN (Spyromitros et al., 2008) to evaluate.
dataset indicator BRKNN A-NDCG, epoch:10 20 30 40 50
Scene Hamming Loss 0.109 0.095 0.095 0.090 0.089 0.091
Jaccard Score 0.640 0.698 0.720 0.725 0.726 0.726
Corel5k Hamming Loss 0.011 0.011 0.011 0.011 0.011 0.012
Jaccard Score 0.069 0.123 0.140 0.145 0.149 0.142
Medical Hamming Loss 0.020 0.014 0.013 0.013 0.014 0.013
Jaccard Score 0.472 0.696 0.703 0.709 0.72 0.723
Enron Hamming Loss 0.059 0.05 0.05 0.05 0.05 0.052
Jaccard Score 0.324 0.44 0.46 0.46 0.46 0.472

6.4 Regression Task

Regression is widely used in fields such as time series forecasting, energy forecasting, and financial market forecasting. The representation learned for a regression task should reduce the learning difficulty of a linear regressor.

We use ridge regression and linear regression as evaluation algorithms, and mean absolute error (MAE) and mean squared error (MSE) as evaluation metrics for A-NDCG. Both metrics measure the error of the regression results, with lower values indicating better performance.
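The evaluation protocol is sketched below on synthetic features; in the experiments the inputs would instead be A-NDCG representations of the UCI datasets:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error

rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 16))            # stand-in learned representations
y = feats @ rng.normal(size=16) + 0.1 * rng.normal(size=200)

for reg in (Ridge(), LinearRegression()):     # fit on a train split, test on the rest
    pred = reg.fit(feats[:150], y[:150]).predict(feats[150:])
    print(type(reg).__name__,
          "MSE:", mean_squared_error(y[150:], pred),
          "MAE:", mean_absolute_error(y[150:], pred))
```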

We selected several regression datasets from the UCI dataset repository, including housing price data, wine data, and Parkinson’s disease data.

6.4.1 Experimental Analysis

The regression task is challenging: unlike discriminative tasks, it involves continuous labels. Tables 5 and 6 show that A-NDCG can nevertheless learn good representations on regression datasets, thereby reducing the learning difficulty of regression methods. Some works (Hooshmand & Sharma, 2019; Ye & Dai, 2018) have combined pre-training models with prediction tasks. We believe that A-NDCG can also be naturally applied to such tasks.

Table 5: Results on regression datasets; we use ridge regression to evaluate.
dataset indicator ridge A-NDCG
parkinsons MSE 91.4 77.61
MAE 7.90 7.40
housing MSE 18.61 -
MAE 3.40 -
wine MSE 0.62 0.59
MAE 0.60 0.59
Table 6: Results on regression datasets; we use linear regression to evaluate.
dataset indicator LR A-NDCG
parkinsons MSE 91.42 72.72
MAE 7.433 7.044
housing MSE 18.64 13.77
MAE 3.398 2.95
wine MSE 0.62 0.55
MAE 0.60 0.58

6.5 Self-supervised Learning Task

Although prior work has introduced the idea of learning to rank into self-supervised learning (Varamesh et al., 2020), our optimization objective is different: (Varamesh et al., 2020) optimizes average precision, whereas we optimize an approximate NDCG. In self-supervised tasks, the representations learned from unsupervised data should reduce the difficulty of subsequent supervised learning, for example by improving the classification performance of linear learners.

We conduct experiments on the popular STL-10 dataset (Coates et al., 2011). In this paper, logistic regression and k-nearest neighbor classifiers are used as methods for evaluating representations.

We also use the data augmentation methods mixup (Zhang et al., 2017) and cutmix (Yun et al., 2019) to augment the training data at the label level to verify whether A-NDCG can fully utilize the information in the unsupervised data.
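As an illustration, mixup produces convex combinations of inputs and (pseudo-)labels, and the mixed labels can then enter the label-similarity computation directly. The sketch below implements mixup itself (Zhang et al., 2017), not the paper's exact augmentation pipeline:

```python
import torch

def mixup(x: torch.Tensor, y: torch.Tensor, alpha: float = 1.0):
    """mixup: blend random sample pairs and their labels."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y + (1.0 - lam) * y[perm]   # soft labels for sim(y_i, y_j)
    return x_mix, y_mix
```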

Table 7: Classification accuracy on the STL-10 dataset; we use logistic regression to evaluate.
method epoch:0 4 8 12 16 20 24 28 32
SimCLR(train) 70.4 78.2 80.7 82.0 83.5 84.2 82.2 82.9 85.8
A-NDCG(train) 77.8 85.0 86.1 86.7 86.6 86.6 87.4 87.0 86.3
SimCLR(test) 44.0 49.8 51.8 52.3 51.6 52.7 52.7 52.6 51.7
A-NDCG(test) 44.0 49.5 51.9 52.3 51.8 52.1 52.8 50.9 52.4
Table 8: Classification accuracy on the STL-10 dataset; we use k-NN to evaluate.
method epoch:0 4 8 12 16 20 24 28 32
SimCLR(train) 59.2 62.4 62.9 64.5 64.5 65.0 65.0 65.9 66.1
A-NDCG(train) 62.2 66.1 66.6 67.5 68.3 67.4 67.8 67.9 68.9
SimCLR(test) 31.5 37.0 39.1 40.0 39.6 39.6 39.6 41.1 41.0
A-NDCG(test) 36.3 40.5 42.4 43.0 43.0 43.2 44.0 44.8 44.7
Table 9: Classification accuracy on the augmented STL-10 dataset; we use logistic regression to evaluate.
method epoch:0 4 8 12 16 20 24 28 32
A-NDCG(train) 77.8 85.0 86.1 86.7 86.6 86.6 87.4 87.0 86.3
A-NDCG+mixup(train) 80.9 86.5 87.2 87.9 89.8 89.3 89.6 89.1 89.4
A-NDCG+cutmix(train) 88.4 86.4 87.1 87.5 87.6 87.3 87.8 87.7 88.3
A-NDCG(test) 44.0 49.5 51.9 52.3 51.8 52.1 52.8 50.9 52.4
A-NDCG+mixup(test) 44.9 52.0 52.7 53.1 53.8 54.3 53.8 54.8 55.7
A-NDCG+cutmix(test) 51.0 52.2 52.3 51.9 52.5 52.5 53.0 53.0 53.0
Table 10: Classification accuracy on the augmented STL-10 dataset; we use k-NN to evaluate.
method epoch:0 4 8 12 16 20 24 28 32
A-NDCG(train) 62.2 66.1 66.6 67.5 68.3 67.4 67.8 67.9 68.9
A-NDCG+mixup(train) 64.8 68.1 68.9 69.7 68.8 70.5 70.3 71.3 70.8
A-NDCG+cutmix(train) 63.6 63.5 66.8 68.0 69.5 69.0 69.4 69.9 70.0
A-NDCG(test) 36.3 40.5 42.4 43.0 43.0 43.2 44.0 44.8 44.7
A-NDCG+mixup(test) 39.2 43.6 44.3 46.2 47.0 46.9 47.5 46.5 47.5
A-NDCG+cutmix(test) 37.7 42.1 42.6 44.4 44.6 45.4 46.3 45.9 45.8

6.5.1 Implementation Details

For the experiments on the STL-10 dataset, we follow the approach of (Varamesh et al., 2020). We use ResNet-18 as the encoder, followed by a projection network composed of two layers of MLP. The batch size is set to 32, and the optimizer and learning rate are the same as those used in SimCLR. The number of epochs is set to 36, and longer training times yield better results.

6.5.2 Experimental Results Analysis

Tables 7 and 8 indicate that under both evaluation algorithms, A-NDCG outperforms SimCLR (Chen et al., 2020). This is because A-NDCG places no limit on the number of perspectives of a single data point. From the perspective of contrastive learning, A-NDCG compares the sample feature similarity set under the model's current ordering against the same set sorted according to the true ranking relationship, rather than comparing positive-pair similarity against negative-pair similarity. It thus moves beyond the conceptual constraints of positive and negative samples and has broader meaning and applicability.

Tables 9 and 10 show that A-NDCG can fully utilize label-level data augmentation methods to transform the training data, making full use of the information in the unsupervised data. Although manually designed data augmentation methods may introduce noise and errors, the loose constraints of A-NDCG reduce the influence of this noise, and the improvement from augmentation remains significant. Making full use of the information in unsupervised training data is clearly necessary for unsupervised learning.

7 Conclusion

In this paper, the representation learning problem across multiple tasks is modeled as a ranking problem. By adopting the ranking problem as a unified perspective, we solve the representation learning problem under different tasks by optimizing a unified objective. We conducted extensive experiments across various learning tasks, including classification, retrieval, multi-label learning, regression, and self-supervised learning, to demonstrate the superiority of the approximate NDCG loss, thereby verifying the effectiveness and broad applicability of the representation learning by ranking framework.

Furthermore, under the self-supervised learning task, we applied data augmentation methods to transform the training data, which improved the performance of the approximate NDCG loss. This demonstrates that the framework can fully utilize the information in unsupervised data. Representation learning by ranking offers us a unified perspective and provides deeper insights into understanding and designing representation learning methods.

References

  • Baeza-Yates et al. (1999) Baeza-Yates, R., Ribeiro-Neto, B., et al. Modern information retrieval, volume 463. ACM press New York, 1999.
  • Bengio et al. (2013) Bengio, Y., Courville, A., and Vincent, P. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.
  • Brown et al. (2020) Brown, A., Xie, W., Kalogeiton, V., and Zisserman, A. Smooth-ap: Smoothing the path towards large-scale image retrieval. In European Conference on Computer Vision, pp.  677–694. Springer, 2020.
  • Cakir et al. (2019) Cakir, F., He, K., Xia, X., Kulis, B., and Sclaroff, S. Deep metric learning to rank. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.  1861–1870, 2019.
  • Caron et al. (2018) Caron, M., Bojanowski, P., Joulin, A., and Douze, M. Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), pp.  132–149, 2018.
  • Caron et al. (2020) Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., and Joulin, A. Unsupervised learning of visual features by contrasting cluster assignments. arXiv preprint arXiv:2006.09882, 2020.
  • Caron et al. (2021) Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., and Joulin, A. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp.  9650–9660, 2021.
  • Chen et al. (2020) Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020.
  • Chen & He (2021) Chen, X. and He, K. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
  • Chen et al. (2016) Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., and Abbeel, P. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. Advances in neural information processing systems, 29:2172–2180, 2016.
  • Coates et al. (2011) Coates, A., Ng, A., and Lee, H. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp.  215–223. JMLR Workshop and Conference Proceedings, 2011.
  • Devlin et al. (2018) Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • Donahue & Simonyan (2019) Donahue, J. and Simonyan, K. Large scale adversarial representation learning. In Advances in Neural Information Processing Systems, pp.  10542–10552, 2019.
  • Donahue et al. (2016) Donahue, J., Krähenbühl, P., and Darrell, T. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.
  • Dupont (2018) Dupont, E. Learning disentangled joint continuous and discrete representations. In Advances in Neural Information Processing Systems, pp.  710–720, 2018.
  • Spyromitros et al. (2008) Spyromitros, E., Tsoumakas, G., and Vlahavas, I. An empirical study of lazy multilabel classification algorithms. In Proc. 5th Hellenic Conference on Artificial Intelligence (SETN 2008), 2008.
  • Grill et al. (2020) Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P. H., Buchatskaya, E., Doersch, C., Pires, B. A., Guo, Z. D., Azar, M. G., et al. Bootstrap your own latent: A new approach to self-supervised learning. arXiv preprint arXiv:2006.07733, 2020.
  • Hadsell et al. (2006) Hadsell, R., Chopra, S., and LeCun, Y. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), volume 2, pp.  1735–1742. IEEE, 2006.
  • Hjelm et al. (2018) Hjelm, R. D., Fedorov, A., Lavoie-Marchildon, S., Grewal, K., Bachman, P., Trischler, A., and Bengio, Y. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670, 2018.
  • Hooshmand & Sharma (2019) Hooshmand, A. and Sharma, R. Energy predictive models with limited data using transfer learning. In Proceedings of the Tenth ACM International Conference on Future Energy Systems, pp.  12–16, 2019.
  • Järvelin & Kekäläinen (2002) Järvelin, K. and Kekäläinen, J. Cumulated gain-based evaluation of ir techniques. ACM Transactions on Information Systems (TOIS), 20(4):422–446, 2002.
  • Khosla et al. (2020) Khosla, P., Teterwak, P., Wang, C., Sarna, A., Tian, Y., Isola, P., Maschinot, A., Liu, C., and Krishnan, D. Supervised contrastive learning. arXiv preprint arXiv:2004.11362, 2020.
  • Kim & Mnih (2018) Kim, H. and Mnih, A. Disentangling by factorising. arXiv preprint arXiv:1802.05983, 2018.
  • Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Kingma & Welling (2013) Kingma, D. P. and Welling, M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • Liu et al. (2016) Liu, W., Wen, Y., Yu, Z., and Yang, M. Large-margin softmax loss for convolutional neural networks. In ICML, volume 2, pp.  7, 2016.
  • Oord et al. (2017) Oord, A. v. d., Vinyals, O., and Kavukcuoglu, K. Neural discrete representation learning. arXiv preprint arXiv:1711.00937, 2017.
  • Oord et al. (2018) Oord, A. v. d., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  • Qin et al. (2010) Qin, T., Liu, T.-Y., and Li, H. A general approximation framework for direct optimization of information retrieval measures. Information retrieval, 13(4):375–397, 2010.
  • Shen et al. (2020) Shen, Z., Liu, Z., Liu, Z., Savvides, M., and Darrell, T. Rethinking image mixture for unsupervised visual representation learning. arXiv preprint arXiv:2003.05438, 2020.
  • Sohn (2016) Sohn, K. Improved deep metric learning with multi-class n-pair loss objective. Advances in neural information processing systems, 29:1857–1865, 2016.
  • Sun et al. (2020) Sun, Y., Cheng, C., Zhang, Y., Zhang, C., Zheng, L., Wang, Z., and Wei, Y. Circle loss: A unified perspective of pair similarity optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  6398–6407, 2020.
  • Tian et al. (2020a) Tian, Y., Krishnan, D., and Isola, P. Contrastive multiview coding. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI 16, pp.  776–794. Springer, 2020a.
  • Tian et al. (2020b) Tian, Y., Sun, C., Poole, B., Krishnan, D., Schmid, C., and Isola, P. What makes for good views for contrastive learning. arXiv preprint arXiv:2005.10243, 2020b.
  • Tschannen et al. (2019) Tschannen, M., Djolonga, J., Rubenstein, P. K., Gelly, S., and Lucic, M. On mutual information maximization for representation learning. arXiv preprint arXiv:1907.13625, 2019.
  • Varamesh et al. (2020) Varamesh, A., Diba, A., Tuytelaars, T., and Van Gool, L. Self-supervised ranking for representation learning. arXiv preprint arXiv:2010.07258, 2020.
  • Wah et al. (2011) Wah, C., Branson, S., Welinder, P., Perona, P., and Belongie, S. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
  • Wang et al. (2019) Wang, X., Han, X., Huang, W., Dong, D., and Scott, M. R. Multi-similarity loss with general pair weighting for deep metric learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  5022–5030, 2019.
  • Wen et al. (2016) Wen, Y., Zhang, K., Li, Z., and Qiao, Y. A discriminative feature learning approach for deep face recognition. In European conference on computer vision, pp.  499–515. Springer, 2016.
  • Xia et al. (2008) Xia, F., Liu, T.-Y., Wang, J., Zhang, W., and Li, H. Listwise approach to learning to rank: theory and algorithm. In Proceedings of the 25th international conference on Machine learning, pp.  1192–1199, 2008.
  • Xu et al. (2008) Xu, J., Liu, T.-Y., Lu, M., Li, H., and Ma, W.-Y. Directly optimizing evaluation measures in learning to rank. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pp.  107–114, 2008.
  • Ye & Dai (2018) Ye, R. and Dai, Q. A novel transfer learning framework for time series forecasting. Knowledge-Based Systems, 156:74–99, 2018.
  • Yun et al. (2019) Yun, S., Han, D., Oh, S. J., Chun, S., Choe, J., and Yoo, Y. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  6023–6032, 2019.
  • Zhang et al. (2017) Zhang, H., Cisse, M., Dauphin, Y. N., and Lopez-Paz, D. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
  • Zhang & Zhou (2007) Zhang, M.-L. and Zhou, Z.-H. Ml-knn: A lazy learning approach to multi-label learning. Pattern recognition, 40(7):2038–2048, 2007.
  • Zhu et al. (2018) Zhu, P., Qi, R., Hu, Q., Wang, Q., Zhang, C., and Yang, L. Beyond similar and dissimilar relations: A kernel regression formulation for metric learning. In IJCAI, pp.  3242–3248, 2018.