Shuffle & Divide:
Contrastive Learning for Long Text
Abstract
We propose a self-supervised learning method for long text documents based on contrastive learning. A key to our method is Shuffle and Divide (SaD), a simple text augmentation algorithm that sets up the pretext task required for contrastive updates to a BERT-based document embedding. SaD shuffles the sentences of a document and splits it into two sub-documents. The two sub-documents are considered positive examples, leaving all other documents in the corpus as negatives. After SaD, we repeat the contrastive update and clustering phases until convergence. Labeling text documents is naturally a time-consuming, cumbersome task, and our method can help alleviate human effort, which is among the most expensive resources in AI. We have empirically evaluated our method by performing unsupervised text classification on the 20 Newsgroups, Reuters-21578, BBC, and BBCSport datasets. In particular, our method improves on the current state of the art, SS-SB-MT, on 20 Newsgroups by 20.94 percentage points in accuracy. We also achieve state-of-the-art performance on Reuters-21578 and exceptionally high accuracy (over 95%) for unsupervised classification on the BBC and BBCSport datasets.
I Introduction
Self-supervised learning (SSL) can provide the label-equivalent information necessary for gradient descent updates to a neural network. A carefully designed pretext task serves as the source of this label-equivalent information. In the modern paradigm for natural language processing (NLP), an upstream task is set up in a self-supervised manner to pre-train a model, followed by a downstream task that is fine-tuned on the pre-trained model. The upstream task typically builds a language model from large text corpora without any labeled examples, whereas the downstream task can be fine-tuned on a relatively small number of labeled examples.
In NLP, the Transformer [1] has enabled several important SSL methods that produce outstanding results. Transformer-based models such as BERT [2] and ALBERT [3] have empirically demonstrated the effectiveness of SSL. Up until now, however, this success has been mainly confined to upstream tasks, not downstream ones. Fine-tuning the pre-trained models still requires additional labeled data; in other words, it still depends on supervised learning and its accompanying manual effort. NLP downstream tasks and their performance are therefore limited by the disadvantages of supervised learning.
In image processing, on the other hand, numerous approaches utilizing SSL have been fruitful. In particular, contrastive learning has shown great advances in image classification with massive unlabeled data by exploiting similarities and differences between images. SimCLR [4] is a prominent example of this approach: when its pre-trained model is applied to a downstream classification task, it outperforms even supervised models. However, it still needs a two-stage process of upstream and downstream tasks.
Going one step further, some works make the downstream task self-supervised as well, in order to minimize human intervention. They include SCAN [5], RUC [6], and SelfMatch [7], which yield tangible results.
In natural language processing, there is not much work extending SSL or unsupervised learning to downstream tasks, and only a few works have tried contrastive learning in this field. While images can be augmented quite naturally and many image augmentation techniques already exist, relatively few augmentation techniques are available for natural language. Consequently, many works have approached SSL through sentence similarity. A basic way to measure it is with the TF-IDF of documents, but TF-IDF simply regards documents with similar token frequencies as similar and cannot distinguish homonyms.
To overcome these shortcomings, some approaches have used semantic sentence similarity, namely G-BAT [8] and SS-SB-MT [9]. SS-SB-MT yields the state-of-the-art performance in unsupervised text classification. It generates a keyword correlation graph with edges and nodes by using sentence similarity (SS) and Sentence-BERT (SB). Then, a multi-task graph autoencoder (MT) transforms the graph into latent features, which are used to obtain document clusters. In short, it converts document features into graphs and then gathers similar ones together.
The pioneering work of SS-SB-MT outperforms previous SSL approaches to long text classification. However, it only learns from graphs that summarize documents, not from the whole text, and is therefore less effective at exploiting document information than methods that learn feature vectors from the entire text. A new way of effectively generating supervision while conserving semantic features is needed.
In this paper, we propose a new contrastive learning technique for document classification together with an effective sentence-level augmentation. Contrastive learning builds positive relations between input data and its augmented versions, and negative relations with all other data, so an effective data augmentation is crucial. One option is back translation, which translates text to another language and back, but translation can be noisy and its result is very similar to the input, i.e., a weak augmentation.
To overcome these limitations, we have developed a new augmentation for contrastive learning: shuffling all sentences in a document and then dividing them into two groups. This conserves the meaning of individual sentences while strongly augmenting the input. Strong augmentation that preserves the document's overall semantics makes contrastive training effective.
Experiments on the 20 Newsgroups document classification task show a classification accuracy of 68.34%, outperforming G-BAT (41.30%) and SS-SB-MT (47.40%) by 27.04 %p and 20.94 %p, respectively, and achieving a new state of the art. Experiments on other datasets such as Reuters also show outstanding results.
In conclusion, we suggest a novel SSL technique for unsupervised document classification that results in state-of-the-art accuracy. We have empirically demonstrated the efficacy of contrastive learning in unsupervised text classification.
In this paper, our contributions are summarized as follows.
•
We present state-of-the-art contrastive SSL for long text (intent) classification;
•
We show that contrastive learning can be applied to natural language processing tasks;
•
We propose Shuffle and Divide (SaD), a novel text augmentation algorithm for effective contrastive learning of document classification.
The rest of this paper is organized as follows. Section 2 describes related work. In Section 3, we explain our architecture and algorithms in detail. Section 4 presents our experimental results and gives a quantitative comparison with previous work. The paper concludes in Section 5.
II Related Work
In this section, we review widely used approaches related to our work: self-supervised learning and text augmentation.
II-A Self-supervised learning
Self-supervised learning learns representations from large-scale unlabeled datasets. Traditionally, most deep learning models have relied on supervised learning, which requires human supervision. SSL techniques include MLM (Masked Language Model) [2], NSP (Next Sentence Prediction) [2], and contrastive learning [4]. They use tasks that can be defined with unlabeled data, i.e., pretext tasks. A model pre-trained with a pretext task is then transferred to a domain-specific task, called the downstream task.
Pre-training methods for SSL include context prediction [10], image colorization [11], Jigsaw puzzle solving [12], and rotation prediction [13]. With contrastive learning based on the InfoMax principle [14], SSL for images has advanced through works such as CPC [15], DIM [16], AM-DIM [17], MoCo [18], and SimCLR [4]. Pre-trained models for text SSL include BERT [2] and GPT-3 [19].
BERT
BERT is a language representation model that uses only the encoder part of the Transformer [1]. Pre-trained language models have outperformed previous approaches on various natural language tasks [20]. With a large-scale unlabeled dataset, BERT defines two pretext tasks: MLM and next sentence prediction. MLM predicts randomly masked words by learning the context of sentences and words with attention. Next sentence prediction infers the relationship between two given sentences.
SimCLR
SimCLR learns visual representations from large-scale unlabeled datasets with an SSL method. It applies two different augmentations to each image; the two augmented views of the same image become a positive pair, and all other views become negative pairs, and features are learned by contrasting them. Its performance comes close to that of supervised models.
II-B Text augmentation
Data augmentation aims to generate new data from the original while preserving labels. For text, it includes text modification and text generation. Examples of modification are random noise injection [21]; randomly removing, inserting, or replacing words; and TF-IDF based word replacement, which modifies less important words that do not lie in the principal components of the TF-IDF matrix. Examples of generation are back translation [22], which translates text back and forth between languages, and generative methods [23] that use sentence-generating models.
II-C Document clustering
To cluster documents, their embeddings must first be extracted. Traditional embedding methods are well known, such as TF-IDF [24], which weights words by their occurrence counts per document, LSI (Latent Semantic Indexing) [25], LDA (Latent Dirichlet Allocation) [26], and averaging GloVe embeddings. Deep learning approaches have added new options: BAT, G-BAT [8], and SS-SB-MT [9] use neural networks to embed documents and outperform traditional clustering methods. According to [9], ELMo [27] and SBERT [28] extract embeddings of sentences or words, which are then averaged to obtain better document embeddings.
BAT and G-BAT
BAT trains a generative adversarial network (GAN) with TF-IDF vectors and topic distributions. The encoder of that network is re-used to obtain document embeddings for clustering. G-BAT, which applies a multivariate Gaussian to the generator of BAT, outperformed traditional methods on the 20NG dataset.
SS-SB-MT
SS-SB-MT builds keyword correlation graphs with sentence similarity (SS) and Sentence-BERT (SB). A multi-task graph autoencoder (MT) extracts latent features from the graphs, and document clustering is performed in that feature space. It is innovative, but quite a complicated model to handle. It achieved the previous state of the art on the 20 Newsgroups and Reuters datasets.
III Method
Since contrastive learning does not depend on the model architecture, it can be used to enhance a pre-trained NLP model for a specific purpose by training on large corpora, all without additional labeled data. In NLP, however, it is difficult to create the positive and negative examples necessary for contrastive learning through augmentation, unlike SimCLR in vision applications.
To overcome the limitations of text augmentation, we propose a method of fine-tuning an NLP model in a contrastive self-supervised manner without additional labels.
III-A Model Architecture
[Figure 1: Overall architecture of the proposed model.]
Figure 1 shows the overall architecture of our model. The Transformer is initialized with pre-trained weights from the BERT and RoBERTa models. To obtain the document embeddings required for clustering, we add a mean pooling layer on top of the Transformer and use the combination as an encoder. The mean pooling layer calculates the average of the Transformer output vectors. As a result, the encoder receives a token sequence of a predefined length as input and outputs a 768-dimensional vector.
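As a concrete illustration, the following is a minimal sketch of such an encoder: a pre-trained Transformer followed by masked mean pooling. The class and variable names, the choice of "bert-base-uncased", and the maximum sequence length are illustrative assumptions, not the authors' exact implementation.

```python
import torch
from transformers import AutoModel, AutoTokenizer

class DocumentEncoder(torch.nn.Module):
    """Pre-trained Transformer + mean pooling -> 768-d document embedding."""
    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.transformer = AutoModel.from_pretrained(model_name)

    def forward(self, input_ids, attention_mask):
        # Token-level hidden states: (batch, seq_len, 768)
        hidden = self.transformer(input_ids=input_ids,
                                  attention_mask=attention_mask).last_hidden_state
        # Average over non-padding tokens only -> (batch, 768)
        mask = attention_mask.unsqueeze(-1).float()
        return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = DocumentEncoder()
batch = tokenizer(["An example document."], padding=True, truncation=True,
                  max_length=256, return_tensors="pt")
embedding = encoder(batch["input_ids"], batch["attention_mask"])  # shape (1, 768)
```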
III-B Contrastive Self-supervised Learning
We use a contrastive self-supervised learning approach to learn latent representations of documents without labels to ultimately cluster them.
During the training process, within each mini-batch, the embeddings of a positive pair are pulled closer together, and the embeddings of negative pairs are pushed farther apart. Among all document pairs that can be formed in a mini-batch, every pair that is not a positive pair becomes a negative pair. Therefore, each mini-batch yields as many positive pairs as the batch size, and each sample has 2 × (batch size) − 2 negative counterparts. The token sequences of all documents in the mini-batch are fed to the encoder to obtain output vectors, and the contrastive loss is calculated based on the positive and negative pair configuration. The model parameters are updated with the loss calculated in each mini-batch.
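The contrastive loss used later in the paper is the NT-Xent loss from SimCLR (Section IV-C). The sketch below shows one way to compute it over a mini-batch of positive pairs; the implementation details and the temperature value are assumptions, not the authors' exact code.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z_a, z_b, temperature=0.5):
    """z_a[i] and z_b[i] are embeddings of the two views of document i.
    For each view, the other 2N - 2 views in the batch act as negatives."""
    n = z_a.size(0)
    z = F.normalize(torch.cat([z_a, z_b], dim=0), dim=1)   # (2N, d), unit norm
    sim = z @ z.T / temperature                            # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))                      # a view is not its own negative
    # The positive of row i is row i + N (and vice versa).
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```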
In a typical contrastive self-supervised method, positive pairs are created through augmentation. In computer vision, there are various augmentation options such as rotation, flipping, and color jittering, and by combining them, numerous label-preserving augmented versions of one sample can be obtained. In NLP, however, simple augmentation tricks have little effect in contrastive learning. In order to make meaningful changes to the original sample, one must use a word dictionary or a trained model, such as a WordNet augmenter, a contextual augmenter, or back translation. Even then, the effect is limited because the content and sentence structure of the augmented result remain very similar to the original document.
We propose methods suitable for contrastive self-supervised learning that replace or improve these existing augmentation methods. The first method completely replaces text data augmentation: based on the TF-IDF features of documents, the pair with the highest cosine similarity is sampled to create a positive pair. Our main approach, the second method, called Shuffle & Divide, simply shuffles the order of the sentences in a document and splits it in half, obtaining a positive pair from a single document.
III-C Positive Sampling with TF-IDF
The first method to address the limitations of text augmentation is to replace augmentation entirely with positive sampling based on TF-IDF. Even without augmentation, a contrastive learning method can be applied as long as positive pairs can be created from the dataset.
Instead of text augmentation, we propose to perform contrastive learning by treating two similar samples as a positive pair. TF-IDF assigns a weight to each word based on the number of documents in which the word appears and the number of times it appears in a specific document. It is a traditional method often used to compare similarities between documents.
We create a TF-IDF vector for every document in the dataset and define the resulting set as $V = \{v_1, \dots, v_N\}$. For each $v_i \in V$, we find the $v_j$ ($j \neq i$) with the highest cosine similarity and treat the corresponding documents as a positive pair. Since a positive pair created in this way has a high probability of belonging to the same category while its two documents are completely different from each other, it addresses the problem that the results of text augmentation are very weak.
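A minimal sketch of this top-1 positive sampling is shown below, assuming `documents` is a list of raw text strings; the function name and vectorizer settings are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_positive_pairs(documents):
    """Pair each document with its most TF-IDF-similar neighbour (excluding itself)."""
    tfidf = TfidfVectorizer().fit_transform(documents)   # (num_docs, vocab)
    sim = cosine_similarity(tfidf)                        # (num_docs, num_docs)
    np.fill_diagonal(sim, -1.0)                           # exclude self-matches
    partners = sim.argmax(axis=1)                         # top-1 neighbour per document
    return [(i, int(j)) for i, j in enumerate(partners)]
```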
III-D Shuffle & Divide
[Figure 2: An example of applying Shuffle & Divide to a document. Figure 3: The Shuffle & Divide process.]
Positive sampling by TF-IDF completely replaces text augmentation, enabling contrastive self-supervised learning even when augmentation is impossible or insufficient. However, it has a fundamental limitation: positive pairs may actually belong to different classes. In fact, when top-1 cosine-similarity positive sampling is performed on the TF-IDF vectors of the documents in the 20NG dataset, the labels of the positive pair match in only about 85% of cases.
We propose the Shuffle & Divide algorithm as an alternative that avoids this problem. Figure 2 shows an example of applying Shuffle & Divide to a document $D$: the sentence order of $D$ is shuffled and the result is divided in half to produce $D_a$ and $D_b$. Since they belong to the same category, they can be used as a positive pair. Shuffle & Divide is a very simple and intuitive method, but it showed a very powerful effect in the experiments described in Section IV.
Figure 3 shows the Shuffle & Divide process. For representation learning, only documents consisting of 4 or more sentences are used as training data. Let $D$ be one of the documents sampled in a mini-batch. When $D$ is composed of sentences $s_1$ to $s_{2n}$, $D'$ is created by randomly shuffling the sentence order of $D$. By dividing $D'$ in half, documents $D_a$ and $D_b$, each composed of $n$ sentences, are created, yielding a positive pair $(D_a, D_b)$ for each sample document $D$.
In other words, we obtain as many positive pairs as the batch size in each mini-batch. Also, since the sentence order of each document is re-shuffled at every epoch before dividing, the augmentation effect of the positive pairs is amplified in the contrastive setting.
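The procedure is small enough to show in full. The sketch below assumes the document has already been split into sentences (e.g., with a sentence tokenizer); the function name and the joining with spaces are assumptions.

```python
import random

def shuffle_and_divide(sentences, rng=random):
    """Return two sub-documents forming a positive pair."""
    assert len(sentences) >= 4, "only documents with 4+ sentences are used"
    shuffled = sentences[:]          # copy so the original document is untouched
    rng.shuffle(shuffled)            # re-shuffled at every epoch in training
    half = len(shuffled) // 2
    doc_a = " ".join(shuffled[:half])
    doc_b = " ".join(shuffled[half:])
    return doc_a, doc_b
```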
IV Experiments
To verify that our proposed method works, we perform clustering on text datasets. Because the output vectors obtained by feeding the token sequences of documents to the encoder are clustered as-is, the performance of the encoder can be observed directly. Additionally, for a specific dataset, we compare whether the encoder fine-tuned with our method provides a better initialization than the “bert-base-uncased” pre-trained weights of BERT. We demonstrate the effectiveness of our method by adding a classification head to each of the two encoders, fine-tuning them in a supervised manner, and comparing them.
IV-A Datasets
TABLE I: Summary of the datasets.
Datasets | Size | Classes | Length |
20NG | 18,612 | 20 | 245 |
Reuters | 7,316 | 10 | 141 |
BBC | 2,225 | 5 | 26 |
BBCSport | 737 | 5 | 345 |
We use the 20 Newsgroups (20NG) [29], Reuters-21578 (Reuters) [30], BBC, and BBCSport datasets for our experiments. To compare with SS-SB-MT [9], the current state of the art in unsupervised text classification, 20NG and Reuters are preprocessed according to the method of [31], the same as SS-SB-MT. Preprocessing of the 20NG dataset is summarized as follows: the header and footer of each document were removed, URLs and email addresses were stripped, and documents with fewer than 10 words were discarded. For the Reuters dataset, multi-labeled data were removed, documents with empty bodies were dropped, duplicates were removed, and only the documents of the top 10 categories were retained.
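A rough sketch of the 20NG preprocessing described above is given below. The exact regular expressions are illustrative assumptions, not the authors' code.

```python
import re
from sklearn.datasets import fetch_20newsgroups

raw = fetch_20newsgroups(subset="all", remove=("headers", "footers"))

def clean(text):
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # strip URLs
    text = re.sub(r"\S+@\S+", " ", text)                 # strip e-mail addresses
    return re.sub(r"\s+", " ", text).strip()

docs, labels = [], []
for text, label in zip(raw.data, raw.target):
    cleaned = clean(text)
    if len(cleaned.split()) >= 10:                       # drop very short documents
        docs.append(cleaned)
        labels.append(int(label))
```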
In the case of Reuters, the class imbalance is more severe than in 20NG, so the randomness in performance measurement tended to be larger. The BBC and BBCSport datasets have not been widely used in previous deep learning studies due to their small size, so we evaluated only the Shuffle and Divide method on them. Table I summarizes the information of each dataset used in the experiments.
IV-B Clustering
When grouping documents represented as high-dimensional vectors, the definition of the distance metric between documents is very important. Cosine distance is known to be more suitable for clustering high-dimensional vectors than Euclidean distance. We used spherical k-means, a fast algorithm that uses cosine distance as its metric, to cluster the 768-dimensional document vectors output by the encoder.
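The sketch below approximates this step: L2-normalizing the embeddings and running standard k-means on the unit sphere behaves similarly to cosine-distance clustering. This is an approximation of spherical k-means for illustration, not the exact algorithm or library used by the authors.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

def cluster_embeddings(embeddings: np.ndarray, num_clusters: int, seed: int = 0):
    """Cluster document embeddings with cosine geometry (approximate spherical k-means)."""
    unit = normalize(embeddings)                       # project onto the unit sphere
    km = KMeans(n_clusters=num_clusters, n_init=10, random_state=seed).fit(unit)
    return km.labels_
```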
We compared our method with several traditional methods and recent deep learning methods that have achieved state-of-the-art results. The performance and settings of these models are summarized in [32].
We used accuracy (ACC), the indicator used in most previous studies, to measure and compare clustering performance. Using the Hungarian algorithm [33], the maximum accuracy over all mappings between true labels and clusters can be computed efficiently. We also measured AMI for comparison with previous studies, and we measured the silhouette score at every epoch to pick the best model without labels. Equation (1) shows how we calculate ACC.
$$\mathrm{ACC} = \max_{m \in \mathcal{M}} \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\{y_i = m(c_i)\} \qquad (1)$$

where $y_i$ is the true label of document $i$, $c_i$ is its cluster assignment, $N$ is the number of documents, and $\mathcal{M}$ is the set of one-to-one mappings from clusters to labels.
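A minimal sketch of this computation using scipy's Hungarian-algorithm implementation (`linear_sum_assignment`) follows; the helper name is an assumption.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """Best-match clustering accuracy (Eq. 1) via the Hungarian algorithm."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1                            # co-occurrence counts
    rows, cols = linear_sum_assignment(-cost)      # maximize matched counts
    return cost[rows, cols].sum() / y_true.size
```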
IV-C Representation learning
TABLE II: Clustering performance (%) on the four datasets (ACC, AMI, and silhouette score (SS)).
| | 20NG | | | Reuters | | | BBC | | | BBCSport | | |
| Models | ACC | AMI | SS | ACC | AMI | SS | ACC | AMI | SS | ACC | AMI | SS |
| Document Embedding models |
| TFIDF | 33.70 | 41.70 | - | 35.00 | 45.60 | - | - | 37.60 | - | - | 79.90 | - |
| LSI | 32.30 | 39.80 | - | 42.00 | 40.00 | - | - | 45.40 | - | - | 84.00 | - |
| LDA | 37.20 | 28.80 | - | 54.90 | 50.30 | - | - | 15.10 | - | - | 61.60 | - |
| D2C | - | 49.30 | - | - | 53.40 | - | - | 75.90 | - | - | 81.20 | - |
| G-BAT | 41.30 | - | - | - | - | - | - | - | - | - | - | - |
| SS-SB-MT | 47.40 | 53.00 | - | 56.30 | 58.40 | - | - | - | - | - | - | - |
| Sentence Embedding models |
| GloVe | 21.70 | 21.00 | - | 38.50 | 37.10 | - | - | - | - | - | - | - |
| BERT | 41.90 | 40.50 | - | 47.10 | 42.60 | - | - | - | - | - | - | - |
| SBERT | 44.10 | 45.10 | - | 51.40 | 52.40 | - | - | - | - | - | - | - |
| Ours |
| TPS† | 66.04 | 63.72 | 58.55 | 41.93 | 42.82 | 42.96 | - | - | - | - | - | - |
| SaD‡ | 68.34 | 66.65 | 61.53 | 58.17 | 51.82 | 51.53 | 95.46 | 86.41 | 47.58 | 98.51 | 94.77 | 52.29 |
† TPS = TF-IDF positive sampling; ‡ SaD = Shuffle & Divide.
TABLE III: Maximum token sequence length per method, phase, and dataset.
| Method | Phase | 20NG | Reuters | BBC | BBCSport |
| TPS† | train | 256 | 256 | - | - |
| TPS† | test | 256 | 256 | - | - |
| SaD‡ | train | 128 | 128 | 64 | 64 |
| SaD‡ | test | 256 | 256 | 128 | 128 |
TABLE IV: 20NG classification accuracy after supervised fine-tuning.
| Base encoder | SaD augmentation | Accuracy (%) |
| bert-base-uncased | No | 83.54 |
| bert-base-uncased | Yes | 85.40 |
| SaD encoder | Yes | 87.88 |
With the two methods we proposed, the model learns a representation for each dataset without using any labels at all. First, the encoder's Transformer is initialized with the “bert-base-uncased” pre-trained weights. The contrastive loss is calculated from the positive and negative pairs sampled in each mini-batch, and the parameters of the model are updated. As the contrastive loss, the NT-Xent loss introduced in SimCLR is used. For representation learning, a learning rate of 3e-5 was used with the AdamW optimizer. The larger the batch size, the more diverse the negative views for each positive pair; therefore, 320, the maximum usable size, was used as the batch size. The range of total training epochs was determined for each dataset and method, and the model with the highest silhouette score within that range was used as the final model.
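A condensed sketch of one representation-learning epoch, illustrated with the Shuffle & Divide variant and reusing the `DocumentEncoder`, `nt_xent_loss`, and `shuffle_and_divide` helpers from the earlier sketches, is shown below. The batching and device handling are illustrative assumptions rather than the authors' exact training code.

```python
import torch

def train_epoch(encoder, tokenizer, documents, optimizer,
                batch_size=320, max_length=128,
                device="cuda" if torch.cuda.is_available() else "cpu"):
    """One epoch of contrastive training; `documents` is a list of sentence lists."""
    encoder.to(device).train()
    for start in range(0, len(documents), batch_size):
        batch_docs = documents[start:start + batch_size]
        pairs = [shuffle_and_divide(sents) for sents in batch_docs]
        views = []
        for half in zip(*pairs):                 # (all first halves), (all second halves)
            tokens = tokenizer(list(half), padding=True, truncation=True,
                               max_length=max_length, return_tensors="pt").to(device)
            views.append(encoder(tokens["input_ids"], tokens["attention_mask"]))
        loss = nt_xent_loss(views[0], views[1])  # positives across the two halves
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# e.g., optimizer = torch.optim.AdamW(encoder.parameters(), lr=3e-5)
```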
TF-IDF positive sampling
The first method, TF-IDF positive sampling, calculates the cosine similarity between the TF-IDF vectors of all documents in the dataset and pairs each document with its most similar sample as a positive pair. When positive pairs are formed in this way, the probability that a pair belongs to the same category is about 85% for 20NG and 89.23% for Reuters. To change the positive samples as training progresses, after each epoch the cosine similarities of the TF-IDF vectors and of the model output vectors were combined in a weighted sum to obtain the similarity, and positive sampling was performed again. Equation (2) shows how this similarity is calculated as the training epoch progresses.
In the tokenization phase, 256 was used as the maximum token length for 20NG and Reuters, and 128 for BBC and BBCSport. Since the number of positive pairs is half that of the Shuffle & Divide method and there is no additional augmentation effect, we fine-tuned for only 2 to 4 epochs, as recommended by the BERT authors. At every epoch, all documents of the dataset were fed to the fine-tuned model, the output vectors were clustered to measure the silhouette score, and the model with the highest score was kept.
As a result, as shown in Table II, we obtained a high performance of 65.92% accuracy and 63.45% AMI on the 20NG dataset, surpassing the existing state of the art. However, this method is limited in that the categories of a positive pair can differ and that the quantitative effect of augmentation is small. It therefore showed relatively low performance on small datasets such as Reuters.
$$\mathrm{sim}_e(d_i, d_j) = (1 - \lambda_e)\,\cos(v_i, v_j) + \lambda_e\,\cos(h_i, h_j) \qquad (2)$$

where $v_i$ is the TF-IDF vector of document $d_i$, $h_i$ is the encoder output vector of $d_i$, and $\lambda_e \in [0, 1]$ is a weight that increases with the training epoch $e$.
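A sketch of this epoch-weighted re-sampling is shown below. The linear schedule for the mixing weight `alpha` is an assumption consistent with the description above, not the authors' exact schedule.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def resample_positive_pairs(tfidf_vectors, model_embeddings, epoch, total_epochs):
    """Re-pair documents using a weighted sum of TF-IDF and model-output similarities."""
    alpha = epoch / max(total_epochs, 1)            # shift trust toward the trained model
    sim = ((1.0 - alpha) * cosine_similarity(tfidf_vectors)
           + alpha * cosine_similarity(model_embeddings))
    np.fill_diagonal(sim, -np.inf)                  # exclude self-matches
    partners = sim.argmax(axis=1)
    return [(i, int(j)) for i, j in enumerate(partners)]
```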
Shuffle & Divide
The authors of BERT recommend training only 2–4 epochs during supervised fine-tuning of BERT for downstream tasks. However, we trained our model for (dataset size / batch size) epochs so that we could maximize the diversity of negative views provided by contrastive learning and the instance-level augmentation effect of the positive and negative pairs obtained through Shuffle & Divide.
For Shuffle & Divide processing, only documents consisting of 4 or more sentences were used as training data. As in the TF-IDF positive sampling method, the best model was picked by measuring the silhouette score. Since the Shuffle & Divide method divides each document in half, we set the maximum sequence length in the training phase to half that of the test phase. The maximum sequence length for each dataset and phase is shown in Table III.
As a result, we were able to achieve higher performance than TF-IDF positive sampling on all datasets. For the 20NG and Reuters datasets, the ACC improvement was about 20 %p and about 15 %p, respectively, compared to the existing state of the art. For clustering on the BBC and BBCSport datasets, the method showed satisfactory performance of more than 95% accuracy.
Classification with supervised fine-tuning
We additionally attach a classification head to the encoder trained with the Shuffle & Divide method and then perform classification by supervised fine-tuning on the 20NG dataset. Three experiments were conducted to check whether the BERT weights fine-tuned by the Shuffle & Divide method provide a good initialization point and whether Shuffle & Divide is effective as augmentation during supervised fine-tuning.
In the first experiment, an encoder initialized with “bert-base-uncased” was fine-tuned in a supervised manner without augmentation. In the second, supervised fine-tuning was performed by adding only Shuffle & Divide augmentation to the first setup. In the last, the training data were augmented by the Shuffle & Divide method while supervised fine-tuning the encoder that had already been fine-tuned by the Shuffle & Divide method.
As a result, the encoder fine-tuned by the Shuffle & Divide method showed higher performance after supervised fine-tuning for classification than the encoder initialized with “bert-base-uncased.” In addition, using Shuffle & Divide augmentation during supervised fine-tuning for classification improved the performance further. Detailed results are listed in Table IV.
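A minimal sketch of attaching a classification head to the (SaD fine-tuned) encoder for this supervised stage is shown below; the head size, dropout rate, and class names are assumptions.

```python
import torch

class DocumentClassifier(torch.nn.Module):
    """Encoder (e.g., the SaD-trained DocumentEncoder) + linear classification head."""
    def __init__(self, encoder, num_classes=20, hidden_dim=768, dropout=0.1):
        super().__init__()
        self.encoder = encoder
        self.dropout = torch.nn.Dropout(dropout)
        self.head = torch.nn.Linear(hidden_dim, num_classes)

    def forward(self, input_ids, attention_mask):
        pooled = self.encoder(input_ids, attention_mask)   # (batch, 768) document embedding
        return self.head(self.dropout(pooled))             # class logits

# With SaD augmentation, each training document can additionally contribute its two
# shuffled halves as extra labeled examples during supervised fine-tuning.
```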
V Conclusion
In this paper, we have described a powerful approach to document clustering that leads to state-of-the-art accuracy for unsupervised text classification. There is no decisive way to augment natural language text for learning NLP models in a contrastive self-supervised manner. We propose a method built on a simple augmentation algorithm called Shuffle and Divide (SaD), which provides a facility to enhance a Transformer encoder for document embedding. By coupling contrastive learning with clustering, the final document embeddings result in document clusters that are as discriminative as a text classifier fine-tuned on labeled examples. Having enhanced the encoder, consisting of the Transformer and mean pooling layers, with our method, clustering on the latent representations of the documents verifies its effectiveness on text classification tasks. Experiments show that our model achieves better performance than many unsupervised approaches, including the state-of-the-art results of SS-SB-MT on the 20 Newsgroups and Reuters datasets. We also achieve high performance, with more than 95% accuracy, for unsupervised classification on the BBC and BBCSport datasets. In addition, our SaD overcomes the limitations of text augmentation in a very simple way, and we expect SaD to be applicable to various document embedding tasks as well as contrastive learning.
References
- [1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.
- [2] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” in NAACL, 2019.
- [3] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, “Albert: A lite bert for self-supervised learning of language representations,” arXiv preprint arXiv:1909.11942, 2019.
- [4] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in International conference on machine learning. PMLR, 2020, pp. 1597–1607.
- [5] W. Van Gansbeke, S. Vandenhende, S. Georgoulis, M. Proesmans, and L. Van Gool, “Scan: Learning to classify images without labels,” in European Conference on Computer Vision. Springer, 2020, pp. 268–285.
- [6] S. Park, S. Han, S. Kim, D. Kim, S. Park, S. Hong, and M. Cha, “Improving unsupervised image clustering with robust learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 12 278–12 287.
- [7] B. Kim, J. Choo, Y.-D. Kwon, S. Joe, S. Min, and Y. Gwon, “Selfmatch: Combining contrastive self-supervision and consistency for semi-supervised learning,” arXiv preprint arXiv:2101.06480, 2021.
- [8] R. Wang, X. Hu, D. Zhou, Y. He, Y. Xiong, C. Ye, and H. Xu, “Neural topic modeling with bidirectional adversarial training,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 340–350.
- [9] B. Chiu, S. K. Sahu, D. Thomas, N. Sengupta, and M. Mahdy, “Autoencoding keyword correlation graph for document clustering,” in ACL, 2020.
- [10] C. Doersch, A. Gupta, and A. A. Efros, “Unsupervised visual representation learning by context prediction,” in IEEE/CVF International Conference on Computer Vision (ICCV), 2015.
- [11] R. Zhang, P. Isola, and A. A. Efros, “Colorful image colorization,” in European Conference on Computer Vision (ECCV), 2016.
- [12] M. Noroozi and P. Favaro, “Unsupervised learning of visual representations by solving jigsaw puzzles,” in European Conference on Computer Vision (ECCV). Springer, 2016.
- [13] S. Gidaris, P. Singh, and N. Komodakis, “Unsupervised representation learning by predicting image rotations,” in International Conference on Learning Representations (ICLR), 2018.
- [14] M. Tschannen, J. Djolonga, P. K. Rubenstein, S. Gelly, and M. Lucic, “On mutual information maximization for representation learning,” in International Conference on Learning Representations (ICLR), 2020.
- [15] A. v. d. Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.
- [16] R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio, “Learning deep representations by mutual information estimation and maximization,” in International Conference on Learning Representations (ICLR), 2019.
- [17] P. Bachman, R. D. Hjelm, and W. Buchwalter, “Learning representations by maximizing mutual information across views,” in Advances in Neural Information Processing Systems (NeurIPS), 2019.
- [18] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
- [19] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” arXiv preprint arXiv:2005.14165, 2020.
- [20] A. M. Dai and Q. V. Le, “Semi-supervised sequence learning,” in NIPS, 2015.
- [21] J. Wei and K. Zou, “Eda: Easy data augmentation techniques for boosting performance on text classification tasks,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 6382–6388.
- [22] S. Edunov, M. Ott, M. Auli, and D. Grangier, “Understanding back-translation at scale,” in EMNLP, 2018.
- [23] V. Kumar, A. Choudhary, and E. Cho, “Data augmentation using pre-trained transformer models,” in Proceedings of the 2nd Workshop on Life-long Learning for Spoken Language Systems, 2020, pp. 18–26.
- [24] C. Sammut and G. I. Webb, Eds., TF–IDF. Springer US, 2010, pp. 986–987.
- [25] C. H. Papadimitriou, H. Tamaki, P. Raghavan, and S. Vempala, “Latent semantic indexing: A probabilistic analysis,” in Proceedings of the Seventeenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, ser. PODS ’98. New York, NY, USA: Association for Computing Machinery, 1998, p. 159–168. [Online]. Available: https://doi.org/10.1145/275487.275505
- [26] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allocation,” J. Mach. Learn. Res., vol. 3, p. 993–1022, 2003.
- [27] M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, “Deep contextualized word representations,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018, pp. 2227–2237.
- [28] N. Reimers and I. Gurevych, “Sentence-bert: Sentence embeddings using siamese bert-networks,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 3982–3992.
- [29] K. Lang, “Newsweeder: Learning to filter netnews,” in ICML, 1995.
- [30] D. Lewis, Y. Yang, T. Rose, and F. Li, “Rcv1: A new benchmark collection for text categorization research,” J. Mach. Learn. Res., vol. 5, pp. 361–397, 2004.
- [31] J. Ye, Y. Li, Z. Wu, J. Z. Wang, W. Li, and J. Li, “Determining gains acquired from word embedding quantitatively using discrete distribution clustering,” in ACL, 2017.
- [32] P. Xie and E. Xing, “Integrating document clustering and topic modeling,” ArXiv, vol. abs/1309.6874, 2013.
- [33] H. Kuhn, “The hungarian method for the assignment problem,” Naval Research Logistics Quarterly, vol. 2, pp. 83–97, 1955.