Learning Self-Regularized Adversarial Views for Self-Supervised Vision Transformers
Abstract
Automatic data augmentation (AutoAugment) strategies are indispensable in supervised data-efficient training protocols of vision transformers, and have led to state-of-the-art results in supervised learning. Despite the success, its development and application on self-supervised vision transformers have been hindered by several barriers, including the high search cost, the lack of supervision, and the unsuitable search space. In this work, we propose AutoView, a self-regularized adversarial AutoAugment method, to learn views for self-supervised vision transformers, by addressing the above barriers. First, we reduce the search cost of AutoView to nearly zero by learning views and network parameters simultaneously in a single forward-backward step, minimizing and maximizing the mutual information among different augmented views, respectively. Then, to avoid information collapse caused by the lack of label supervision, we propose a self-regularized loss term to guarantee the information propagation. Additionally, we present a curated augmentation policy search space for self-supervised learning, by modifying the generally used search space designed for supervised learning. On ImageNet, our AutoView achieves remarkable improvement over the RandAug baseline (+10.2% k-NN accuracy), and consistently outperforms the state-of-the-art manually tuned view policy by a clear margin (up to +1.3% k-NN accuracy). Extensive experiments show that AutoView pretraining also benefits downstream tasks (+1.2% mAcc on ADE20K Semantic Segmentation and +2.8% mAP on the revisited Oxford Image Retrieval benchmark) and improves model robustness (+2.3% Top-1 Acc on ImageNet-A and +1.0% AUPR on ImageNet-O). Code and models will be available at https://github.com/Trent-tangtao/AutoView.
Index Terms:
AutoView, AutoAugment, self-supervised learning, adversarial learning, vision transformer.

1 Introduction
Self-supervised learning based on identifying augmented views of the data has made great progress in unsupervised visual representation learning [1, 2, 3, 4, 5]. These view-based self-supervised learning methods optimize the network parameters by contrastive learning [2, 3] or self-distillation [4, 5] over different views of the same image. Recently, view-based self-supervised vision transformers (ViTs) have revealed emerging properties that have not been shown in either the supervised ViTs or previous unsupervised CNNs, attracting a lot of research interest in the community [6, 7, 8, 9, 10].

Augmentation policies are crucial for both supervised ViTs [11] and view-based self-supervised learning [2]. DeiT [11] improves the ImageNet accuracy of supervised ViT-L by 6.6% by using better training recipes and better augmentation policies. Meanwhile, [2] show that data augmentation policies are crucial for learning good representations in view-based self-supervised learning. However, designing augmentation policies of these views requires considerable trial and error by human experts and could still be sub-optimal due to the limited design space.
Recently, automatic data augmentation (AutoAugment) strategies have led to state-of-the-art results in supervised learning [12, 13, 14, 15]. These methods automatically search for improved data augmentation policies to achieve the highest validation accuracy on a target dataset. Among them, RandAug [13] has become an indispensable training recipe for the supervised training of data-efficient ViTs.
Despite the success, there are still several barriers hindering its development and application in self-supervised learning. First of all, AutoAugment methods are computationally expensive. As AutoAugment is a bi-level optimization problem, repeated training or iterative optimization of network parameters and augmentation policies are required, resulting in high search cost. Previous works on efficient AutoAugment methods search augmentation policies on a subset of the dataset [14, 15, 16]. This proxy task is likely to be ineffective for augmentation search of self-supervised ViTs, as self-supervised ViTs rely on large datasets to learn properly. Secondly, the lack of label supervision is another challenge for unsupervised AutoAugment methods. In supervised AutoAugment methods, augmentation policies are optimized by maximizing validation accuracy on a target dataset. Without label supervision, adversarial learning could be a useful way to learn augmentation policies [17, 18, 19, 20]. However, without the constraint of label supervision as in semi-supervised adversarial view learning [18], these methods are prone to learning excessively strong augmentations, resulting in information collapse of the input. Thirdly, the search space designed for supervised AutoAugment could be unsuitable for view-based self-supervised learning. [2] show that the sophisticated AutoAugment policies searched by supervised learning do not work better than simple cropping + color distortion. Similarly, we empirically find that RandAug results in a remarkable performance drop in self-supervised ViTs.
In this work, we propose AutoView, a self-regularized adversarial AutoAugment method to learn views for self-supervised vision transformers, by addressing the aforementioned barriers. First, we propose to learn adversarial augmentation policies efficiently by learning views and network parameters simultaneously in a single forward-backward step. Specifically, augmentation policies are optimized by minimizing the mutual information among different augmented views, while network parameters are optimized by maximizing it. To avoid the information collapse or learning excessively strong augmentation policies due to the lack of label supervision, we propose a self-regularized loss term to guarantee that the useful information can be preserved by the augmentation. This loss term is designed to guarantee that the teacher models in view-based self-supervised learning methods can identify the augmented images as the original ones. Additionally, we present a curated augmentation policy search space for self-supervised learning, by removing harmful geometric transformations from the supervised AutoAugment search space and adding some augmentations that are generally used in manually designed view policies. The augmentation pipeline contains two sequentially applied randomly sampled augmentation operations with automatically optimized sampling weights, execution probability, and operation magnitude. Different from the previous search space, the execution probability of the first operation is set to 1 to eliminate the probability of not applying any augmentation.
On ImageNet, our AutoView achieves remarkable improvement over the RandAug baseline (+10.2% k-NN and +7.3% linear classification accuracy), as shown in Fig. 1, and outperforms the state-of-the-art manually tuned view policy by a clear margin (up to +1.3% k-NN accuracy). Extensive experiments show that AutoView pretraining also benefits downstream tasks (+2.8% mAP on the revisited Oxford Image Retrieval benchmark) and improves model robustness (+2.3% Top-1 Acc on ImageNet-A). To validate the effectiveness of AutoView on new training schemes, we also conduct experiments on an efficient self-supervised learning scheme by progressively increasing the image size during training.
Overall, our main contributions are three-fold:
• We propose AutoView, a self-regularized adversarial AutoAugment method to learn views for self-supervised vision transformers, which efficiently addresses the high learning cost and the information collapse problem.
• We present a curated augmentation policy search space for self-supervised learning, in which random baselines are able to perform comparably to state-of-the-art manually tuned policies.
• AutoView achieves remarkable improvements over the RandAug baseline on k-NN and linear classification (+10.2% and +7.3%) and consistently outperforms hand-crafted policies on various tasks.
2 Related Work
2.1 Self-supervised Visual Representation Learning
Currently, a large body of work on self-supervised learning focuses on discriminative approaches [1, 2, 3, 4, 5, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30], which regard each image as a distinct class and train the model to discriminate between images up to data augmentations. From a mutual-information perspective [31, 18, 1], models are trained to maximize the mutual information between different augmented views of an image. Plenty of positive and negative samples are needed to discover similarity and dissimilarity, which in practice requires large batches or memory banks. More recent works [4, 32, 33] have shown that high-quality representations can be learned without negative samples.
Self-supervised Vision Transformers. MoCoV3 [7] first explored the iceberg of transformer-based self-supervised learning and presented a training recipe that lets ViT perform reasonably well on ImageNet-1K linear evaluation. MoBY [6] adopts Swin Transformer [34] as the backbone to evaluate the learned representations on downstream tasks such as object detection and semantic segmentation. DINO [5] presents a new self-supervised learning method that shows good synergy with the transformer architecture and achieves performance comparable to large self-supervised ConvNets using small/medium-size Transformers. Based on DINO, EsViT [8] further pursues efficient solutions to self-supervised vision transformers with two major insights: a multi-stage transformer architecture with sparse self-attention, and a region-matching pre-training task. Concurrent BERT [35]-like approaches (BEiT [36], MAE [10], CAE [37]) propose to reconstruct raw pixels via masked image modeling, and [38] demonstrates that they also implicitly learn occlusion-invariant features, analogous to other discriminative methods, while the latter learn other invariances. iBOT [9] performs masked image modeling via self-distillation, as in DINO, with an online tokenizer. Different from these works on the learning objective, our work aims at developing an automatic view learning method rather than using hand-crafted view policies.
Generation-based Pretext Tasks. At present, only a small amount of work has preliminarily explored automatic view generation for self-supervised learning, and only on small-scale datasets. InfoMin [18] leverages an adversarial semi-supervised view learning strategy: besides the flow-based view generation model, which is adversarially trained to minimize mutual information, it needs two classifiers, one on each learned view, to perform classification with labels for the downstream task during view learning. InfoMin first trains the generative network with labels; the frozen generator is then used to generate views for self-supervised representation learning. Viewmaker [19] applies ideas from the adversarial robustness literature to generate a variety of residual perturbations for a single input, using an iteratively optimized image-to-image neural network (the Viewmaker network) that must be optimized in alternating steps with the encoder. Their inefficiency prevents them from being applied to self-supervised vision transformers, which already require a large amount of computing resources. In this work, we employ a constrained adversarial training method that adaptively reduces the mutual information between different views while preserving useful input features for the encoder to learn from, through just one-step optimization.
2.2 Automatic Augmentation
Inspired by recent advancements in neural architecture search (NAS) [39, 40], researchers try to learn augmentation policies from data automatically [12, 17, 13, 14, 15, 41]. AutoAugment [12] first adopts reinforcement learning for auto-augmentation but requires thousands of GPU-days of search. AdvAA [17] proposes an online search manner that jointly optimizes the augmentation policies and trains the target network. RandAug [13] directly uses a naive grid search to find the best policy. DDAS [15] performs differentiable augmentation search through meta-learning with a one-step gradient update. DADA [14] introduces trainable policy parameters and adopts the Gumbel technique to update them alternately with the network weights by gradient descent. These methods are based on a supervised search space and need target validation sets, and they either incur large search costs from multi-step optimization or require multiple experiments, which makes them unsuitable for self-supervised learning. Thus, this work proposes a self-regularized adversarial AutoView that can be optimized together with the self-supervised network in each single forward-backward step.
3 AutoView
In this section, we first briefly formulate the standard view-based self-supervised vision transformers. And then we present our AutoView in detail, along with its three key elements: the one-step optimization (Sec. 3.1), the self-regularized information propagation (Sec. 3.2), and the curated search space (Sec. 3.3). A visual illustration is in Fig. 2.
3.1 Learning Adversarial Views Through One-Step Optimization

View-based Self-supervised Vision Transformer. View-based self-supervised vision transformers have recently shown remarkable results in representation learning. One image $x$ is augmented with two separately sampled data augmentations to obtain augmented views $v_1$ and $v_2$, and the transformers are trained to maximize the mutual information between the different views, i.e., $I(v_1; v_2)$, by maximizing $\mathcal{S}\big(f_{\theta_s}(v_1), f_{\theta_t}(v_2)\big)$, where $\mathcal{S}$ is a similarity function and $f_{\theta_s}, f_{\theta_t}$ are the student and teacher transformer encoder networks. The teacher parameters $\theta_t$ are updated with an exponential moving average (EMA) of the student parameters $\theta_s$. However, generating views through careful hand construction of data augmentation strategies remains a significant limitation, as several barriers hinder the development and application of automatic view learning on self-supervised vision transformers.
Adversarial View Learning Through One-Step Optimization. Without label supervision, adversarial learning is a natural way to learn augmentation policies for self-supervised vision transformers. For an input image $x$, the view generator $g_\phi$ produces views $v_1, v_2 = g_\phi(x)$. Given $v_1$ and $v_2$, the two encoders $f_{\theta_s}$ and $f_{\theta_t}$ are trained to maximize $I\big(f_{\theta_s}(v_1); f_{\theta_t}(v_2)\big)$. Meanwhile, $g_\phi$ is adversarially trained to minimize it. The min-max objective is then:
$\min_{\phi} \max_{\theta_s} \; I\big(f_{\theta_s}(v_1);\, f_{\theta_t}(v_2)\big), \quad v_1, v_2 = g_\phi(x).$   (1)
However, the inefficiency of multi-step optimization of the encoder and generator prevents this formulation from being applied to self-supervised vision transformers, which already require a large amount of computing resources.
Hence, we employ a constrained adversarial training method that enables the model to adaptively reduce the mutual information between different views through just one-step optimization. In practice, we follow [5] and match the encoder networks’ output probability distributions by minimizing the cross-entropy loss w.r.t. the student parameters $\theta_s$ while keeping the teacher network $f_{\theta_t}$ fixed. Our generator $g_\phi$ produces views by sampling augmentation operations according to learned probabilities, and we have:
$\min_{\theta_s} \max_{\phi} \; H\big(P_t(v_1),\, P_s(v_2)\big), \quad v_1, v_2 = g_\phi(x),$   (2)
where $H(a, b) = -a \log b$, and $P_t$ and $P_s$ denote the output probability distributions of the teacher and student networks. We relax the policy selection to be differentiable through Gumbel-Softmax reparameterization [42] to achieve differentiable optimization of our image transformation $\mathcal{T}$ (details in Sec. 3.3). It is worth noting that we realize the one-step optimization of the generator and encoder through a gradient reversal layer (GRL) [43]. Remarkably, through this, the search cost of AutoView is nearly zero, as views and network parameters are simultaneously optimized in each single forward-backward step. Through one-step optimization, AutoView addresses the high learning cost and, for the first time, successfully performs AutoAugment for self-supervised learning on large-scale datasets.
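To make the one-step optimization concrete, below is a minimal PyTorch-style sketch, assuming a DINO-style teacher/student pair and a differentiable view generator `augment(x, policy_logits)` as described in Sec. 3.3; the helper names (`GradReverse`, `dino_cross_entropy`, `train_step`) are illustrative and not the released implementation.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Gradient reversal layer (GRL): identity in the forward pass,
    negated gradient in the backward pass, so the policy parameters
    behind it are optimized adversarially by the same loss."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

def grad_reverse(x):
    return GradReverse.apply(x)

def dino_cross_entropy(t_out, s_out, temp_t=0.04, temp_s=0.1):
    """Cross-entropy H(P_t, P_s) between teacher and student output distributions."""
    t_prob = torch.softmax(t_out / temp_t, dim=-1).detach()   # teacher provides targets only
    s_logp = torch.log_softmax(s_out / temp_s, dim=-1)
    return -(t_prob * s_logp).sum(dim=-1).mean()

def train_step(x, student, teacher, policy_logits, augment, optimizer):
    """One forward-backward step: the student minimizes the loss while the
    policy parameters, sitting behind grad_reverse, maximize it (Eqn. 2)."""
    rev_logits = grad_reverse(policy_logits)
    v1, v2 = augment(x, rev_logits), augment(x, rev_logits)   # two sampled views
    loss = dino_cross_entropy(teacher(v1), student(v2))
    optimizer.zero_grad()
    loss.backward()            # one backward pass updates encoder and policy together
    optimizer.step()           # optimizer holds both student and policy parameters
    return loss.item()
```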
3.2 Avoid Collapse via Self-Regularized Information Propagation
With the loss function in Eqn. (2), AutoView intends to increase the sampling probability of those transformations that generate samples with high training loss. By sampling such transformations, AutoView pays more attention to aggressive augmentation strategies and increases model robustness against difficult samples. However, blindly minimizing the mutual information, i.e., increasing the difficulty of samples, may cause the augment ambiguity phenomenon [44]: augmented images may be far away from the majority of clean images, which can cause under-fitting and deteriorate the learning process. InfoMin alleviates this problem with a semi-supervised design in which a supervised constraint guides the generator, which is not applicable to self-supervised learning without labels.
The Information Bottleneck (IB) theory [45] uses an information-theoretic objective to constrain the mutual information between the input and the representation, which helps preserve information. Previous work [46] shows that the information-constraining objective in the supervised setting has the same form as that of the self-supervised setting except for the target output. Therefore, we unify the two objectives by using $y$ as the output of the downstream task: in the supervised setting, $y$ represents the label of the input $x$; in the self-supervised setting, $y$ represents the input $x$ itself. This leads to the unified objective linking the representation $z$ of the input $x$ and the target $y$:
$\max \; I(z; y).$   (3)
This unified objective describes a constraint whose goal is to maximize the mutual information between the representation $z$ and the target $y$. Hence, we propose a self-regularized loss to avoid the information collapse caused by the lack of label supervision:
$\mathcal{L}_{reg} = H\big(P_t(x),\, P_t(\hat{x})\big),$   (4)
where $x$ is the original image and $\hat{x} = g_\phi(x)$ is the augmented image. This self-regularized loss preserves information, encouraging distinguishable image features that share relevant semantic information with the original data.
Combining Eqn. (2) and Eqn. (4), the training process of our AutoView is formulated as follows:
$\min_{\theta_s} \max_{\phi} \; H\big(P_t(v_1),\, P_s(v_2)\big) - H\big(P_t(x),\, P_t(\hat{x})\big),$   (5)
where the second term of the loss function is exactly Eqn. (4). We compute this term with the teacher network $f_{\theta_t}$ to impose the information constraint on the generator $g_\phi$, so it does not affect the training of the student $f_{\theta_s}$, and the negative sign converts the generator's maximization into a minimization of this term. In this way we integrate the self-regularized loss into our adversarial view learning framework without undermining its advantage of one-step optimization.
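As a sketch of how the self-regularized term of Eqn. (4) can be folded into the adversarial loss of Eqn. (2) without breaking one-step optimization, the snippet below reuses `grad_reverse`, `dino_cross_entropy`, and `augment` from the sketch in Sec. 3.1; all names remain illustrative assumptions rather than the actual implementation.

```python
import torch

def autoview_loss(x, student, teacher, policy_logits, augment, reg_weight=1.0):
    rev_logits = grad_reverse(policy_logits)
    v1, v2 = augment(x, rev_logits), augment(x, rev_logits)

    # Adversarial term (Eqn. 2): the student minimizes it, the generator
    # maximizes it through the gradient reversal layer.
    adv = dino_cross_entropy(teacher(v1), student(v2))

    # Self-regularized term (Eqn. 4): computed only with the teacher, whose
    # parameters are EMA-updated rather than trained, so it constrains the
    # generator without touching the student's gradients.
    with torch.no_grad():
        t_orig = teacher(x)                       # teacher output on the original image
    reg = dino_cross_entropy(t_orig, teacher(augment(x, rev_logits)))

    # The minus sign turns the generator's maximization of the total loss into
    # a minimization of the regularizer, i.e., information preservation.
    return adv - reg_weight * reg
```

Here `reg_weight` corresponds to the trade-off weight studied in the ablation of Sec. 4.6, with 1.0 as the default.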
3.3 The Curated Augmentation Policy Search Space.
The current search space designed for supervised AutoAugment is unsuitable for view-based self-supervised learning, as analyzed previously. Meanwhile, existing view learning methods, both the generative Viewmaker model and the flow-based view generation model of InfoMin, are pixel-wise generators, which are not suitable for view-based self-supervised learning in embedding space and cannot be applied to large-scale datasets due to the limitations of pixel-wise image generation. Thus we present a curated augmentation policy search space and design our online policy search framework based on it.
In detail, we construct our search space of image pre-processing operations with 12 candidates. We denote the set of operations as $\mathcal{O}$: {AutoContrast, Invert, Equalize, Solarize, Posterize, Contrast, Color, Brightness, Sharpness, Hue, Grayscale, Gaussianblur}, removing geometric transformations (e.g., ShearY, TranslateX) [47] and adding augmentations commonly used in self-supervised methods (e.g., Gaussianblur, Grayscale) [3]. The augmentation operations in our curated search space are listed below:
AutoContrast | Invert | Equalize | Solarize
Posterize | Contrast | Color | Brightness
Sharpness | Hue | Grayscale | Gaussianblur
Each augmentation policy $\mathcal{P}$ contains $K$ image pre-processing operations: $\mathcal{P} = \{O_1, \ldots, O_K\}$, where $O_k \in \mathcal{O}$. With $|\mathcal{O}|$ denoting the number of elements in the set, we define sampling weights $w \in \mathbb{R}^{|\mathcal{O}|}$ to represent the sampling probability of each candidate operation; these weights are the parameters that the generator needs to learn. Therefore, the $k$-th image operation of policy $\mathcal{P}$ is sampled as:
$p(O_k = o_i) = \dfrac{\exp(w_i)}{\sum_{j=1}^{|\mathcal{O}|} \exp(w_j)}.$   (6)
For the $k$-th operation of a policy, each image operation owns two parameters: the magnitude $m_k$ that determines the strength of the transformation and the probability $p_k$ of applying the transformation. The output of the image pre-processing can then be computed as follows:
$\bar{O}_k(x; p_k, m_k) = \begin{cases} O_k(x; m_k) & \text{with probability } p_k \\ x & \text{with probability } 1 - p_k \end{cases}$   (7)
It is worth noting that if the policy is not applied at all, i.e., no augmentation is performed, the self-supervised training easily collapses. So instead of sampling and applying the whole policy by probability as in previous work [14], as shown in the left part of Fig. 2, we sample the policy layer by layer and fix the execution probability of the first layer at 1. Having formulated our search space, we relax the policy selection to be differentiable through Gumbel-Softmax reparameterization [42].
To search the adversarial policy online with the encoder, we use the Gumbel-Softmax reparameterization trick [42] to achieve a differentiable relaxation. We regard Eqn. (7) as $\bar{O}_k(x) = b_k\, O_k(x; m_k) + (1 - b_k)\, x$, where $b_k$ is sampled from a Bernoulli distribution $\mathrm{Bernoulli}(p_k)$. Since this distribution is non-differentiable, we instead use a Relaxed Bernoulli distribution:
$b_k = \sigma\!\left( \frac{1}{\tau} \left( \log \frac{p_k}{1 - p_k} + \log \frac{u}{1 - u} \right) \right),$   (8)
where $u$ is sampled from a uniform distribution on $(0, 1)$, $\sigma$ is the sigmoid function, and $\tau$ is a temperature. Using this reparameterization, each operation becomes differentiable w.r.t. its probability parameter $p_k$. For policy sampling, we approximate the gradient in a manner similar to the straight-through estimator. In the forward pass, we first sample an operation by Eqn. (6), e.g., the $i$-th operation. Then:
$x_{out} = \sum_{j=1}^{|\mathcal{O}|} h_j\, \bar{O}_j(x), \quad h = \hat{h} + \tilde{h} - \mathrm{sg}(\tilde{h}),$   (9)
where $\hat{h} = \mathrm{one\_hot}(i)$ is the discrete selection of the sampled operation, $\tilde{h}$ is its differentiable Gumbel-Softmax relaxation, and $\mathrm{sg}(\cdot)$ denotes stop-gradient. So the backward pass uses the differentiable variables $\tilde{h}$ and the forward pass uses the discrete variables $\hat{h}$.
Through the above relaxation, we achieve differentiable optimization of the image transformation $\mathcal{T}$. The search cost of AutoView is therefore nearly zero, as we learn views and network parameters simultaneously via Eqn. (5).
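The sketch below illustrates Eqns. (6)-(9) in PyTorch: a straight-through Gumbel-Softmax selects an operation, a relaxed Bernoulli gates its application, and gradients flow back to the sampling weights, probabilities, and magnitudes. The operation list is assumed to contain differentiable tensor-space transforms, and the exact parameterization is a simplification of the actual search space.

```python
import torch
import torch.nn.functional as F

def sample_operation(weights, tau=1.0):
    """Straight-through Gumbel-Softmax over Eqn. (6): the forward pass uses a
    discrete one-hot selection, the backward pass uses the soft relaxation."""
    soft = F.gumbel_softmax(weights, tau=tau, hard=False)
    index = soft.argmax(dim=-1)
    hard = F.one_hot(index, weights.numel()).float()
    return hard - soft.detach() + soft, index

def relaxed_bernoulli(prob, tau=0.3):
    """Relaxed Bernoulli gate of Eqn. (8), differentiable w.r.t. `prob`."""
    u = torch.rand_like(prob).clamp(1e-6, 1 - 1e-6)
    logits = torch.log(prob) - torch.log1p(-prob) + torch.log(u) - torch.log1p(-u)
    return torch.sigmoid(logits / tau)

def apply_policy_layer(x, weights, probs, magnitudes, ops, first_layer=False):
    """One layer of the augmentation policy (Eqns. 7 and 9): sample an operation,
    then apply it with its learned execution probability and magnitude."""
    selector, index = sample_operation(weights)
    op, m = ops[index.item()], magnitudes[index]
    # The first layer is always executed (execution probability fixed at 1).
    gate = torch.ones(()) if first_layer else relaxed_bernoulli(probs[index])
    out = gate * op(x, m) + (1.0 - gate) * x
    # selector[index] equals 1 in the forward pass but carries the soft gradient
    # w.r.t. the sampling weights in the backward pass (straight-through).
    return selector[index] * out
```

A full policy with $K = 2$ applies `apply_policy_layer` twice, the first call with `first_layer=True`, matching the fixed execution probability of the first layer described above.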
4 Experiments
This section presents our experimental setups and results. We first validate AutoView on the standard self-supervised benchmark (Sec. 4.2) and then validate its effectiveness on the new progressive learning scheme (Sec. 4.3). We then study the properties of the resulting features for segmentation, retrieval, and transfer learning (Sec. 4.4), followed by robustness and overfitting analysis (Sec. 4.5). Finally, we give the ablation study on the crucial components of AutoView (Sec. 4.6).
4.1 Implementations Details
Architectures. We use Vision Transformers with different numbers of parameters, ViT-S/16 and ViT-B/16, as the encoder network $f_\theta$. For ViTs, /16 denotes a patch size of 16. We pretrain and finetune the transformers with 224-size images, so the total number of patch tokens is 196. The projection head is a 3-layer MLP with an $\ell_2$-normalized bottleneck, following DINO [5].
Pretraining Details. We pretrain the models on ImageNet or OpenImages without labels. For self-supervised training, the hyperparameters closely follow DINO. We use the AdamW optimizer with a cosine-decay learning rate schedule. The learning rate is linearly ramped up during the first 10 epochs to its base value (0.0005 for a batch size of 256). The weight decay also follows a cosine schedule from 0.04 to 0.4.
For AutoView policy training, we make a simplification to our preliminary search space: we discretize the continuous magnitude of each augmentation and treat the same augmentation with different magnitudes as multiple operations. The number of operations in an augmentation policy is set to $K = 2$, and the two operations share the sampling weights $w$. We adopt the Adam optimizer to optimize the policy weights with a step-decay learning rate schedule and an initial learning rate of 6e-5. The execution probability is also optimized by Adam, with a learning rate of 1e-5.
For progressive learning, we divide the training process into four stages of about 25 epochs each: the early stages use a small image size, while the later stages use larger image sizes with stronger regularization. The minimum (first stage) and maximum (last stage) image sizes are 128 and 224, and the small view size of multi-crop training goes from 55 to 96. Both change linearly according to the training stage.
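A minimal sketch of that linear schedule, assuming four equally sized stages; the function name and rounding are ours, for illustration only.

```python
def progressive_sizes(stage, num_stages=4,
                      global_size=(128, 224), local_size=(55, 96)):
    """Linearly interpolate the global and local (multi-crop) view sizes
    from the first to the last training stage."""
    t = stage / (num_stages - 1)            # 0.0 at the first stage, 1.0 at the last
    g = round(global_size[0] + t * (global_size[1] - global_size[0]))
    l = round(local_size[0] + t * (local_size[1] - local_size[0]))
    return g, l

# e.g., [progressive_sizes(s) for s in range(4)] -> [(128, 55), (160, 69), (192, 82), (224, 96)]
```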
For the grid search of RandAug, RandAug makes the following simplifying assumptions: all operations share a single, discrete magnitude $M$; all policies apply the same number of operations $K$; and all operations are applied with uniform probability. We select the best $k$-NN result from a grid search over $K$ and $M$.
Linear and $k$-NN Evaluation Details. We follow the evaluation protocols in DINO. For linear evaluation, we sweep over different learning rates. For $k$-NN evaluation, we sweep over different numbers of nearest neighbors.
End-to-end fine-tuning Details. We follow the protocol used in DeiT[11] and finetune the features on downstream transfer tasks, e.g., Cifar10 and Flowers.
Fine-Tuning Setting of Semantic Segmentation on ADE20K. For semantic segmentation, we follow the configurations in iBOT [9], using the same number of fine-tuning iterations, input image size, layer decay rate, and learning rate sweep. We do not use multi-scale training and testing. To produce hierarchical feature maps, we use the features output from four intermediate transformer layers, with additional deconvolution layers appended after them.
4.2 Main Classification Results
We evaluate our method on two large-scale image datasets, ImageNet-1K [48] and OpenImages [49]. ImageNet contains 1.2M training images and 50K validation images in 1,000 classes. OpenImages consists of 9.1 million images in total; we follow [50] to construct a subset of the OpenImages dataset with 212K images across 208 classes.

Method | Augmentation | Arch | Linear | k-NN
100 epochs
MoCoV3 [7] | Manual | ViT-S/16 | 67.0 | -
MoCoV3 [7] | AutoView | ViT-S/16 | 67.5 | -
MoCoV3 [7] | Manual | ResNet-50 | 68.9 | -
MoCoV3 [7] | AutoView | ResNet-50 | 69.3 | -
iBOT [9] | Manual | ViT-S/16 | 74.4 | 70.7
iBOT [9] | AutoView | ViT-S/16 | 74.8 | 71.2
DINO [5] | Manual | ViT-S/16 | 74.0 | 69.1
DINO [5] | RandAug | ViT-S/16 | 67.5 | 59.7
DINO [5] | AutoView | ViT-S/16 | 74.8 | 69.9
DINO* [5] | Manual | ViT-S/16 | 73.7 | 69.2
DINO* [5] | AutoView | ViT-S/16 | 74.5 | 70.5
300 epochs
DINO [5] | Manual | ViT-S/16 | 76.1 | 72.8
DINO [5] | AutoView | ViT-S/16 | 76.5 | 74.0
400 epochs
DINO [5] | Manual | ViT-B/16 | 78.2 | 76.1
DINO [5] | AutoView | ViT-B/16 | 78.6 | 76.5
800 epochs
DINO [5] | Manual | ViT-S/16 | 77.0 | 74.5
DINO [5] | AutoView | ViT-S/16 | 77.4 | 74.9
Linear and $k$-NN results on ImageNet. To evaluate the quality of the pre-trained features, we either train a linear classifier on the frozen representation or use a $k$-nearest neighbor ($k$-NN) classifier. In Tab. I, AutoView achieves remarkable improvements over the RandAug baseline on $k$-NN and linear classification (+10.2% and +7.3%) and steadily boosts both the top-1 linear and $k$-NN classification accuracy on ImageNet for different frameworks (e.g., DINO, iBOT, MoCoV3), different architectures (e.g., ViT-S, ViT-B), and different training lengths (100 to 800 epochs). Specifically, AutoView improves the $k$-NN accuracy of DINO with ViT-S across the various training lengths by 0.8%, 1.2%, and 0.4%, respectively, and Fig. 3 intuitively shows this consistent boost across training lengths (100 to 800 epochs). In particular, AutoView also boosts DINO in the progressive training setting, with a gain of 1.3% $k$-NN accuracy.
Why do we focus on ViTs? First, data augmentation is more critical for ViTs than for CNNs: the vanilla ViT needs to be trained on a large dataset with full supervision (e.g., ImageNet-21k), and DeiT improves the vanilla ViT-L by up to 2.2% using the better augmentation policies of RandAugment. Second, more and more state-of-the-art self-supervised methods use ViTs as their backbone, and our method is in line with them. Moreover, AutoView is a general AutoAugment method for self-supervised learning: as shown in Tab. I, it consistently improves performance when applied to CNN-based self-supervised learning approaches.
Method | Augmentation | Arch | Epoch | mAP |
MoCov2 | - | ResNet-50 | 200 | 58.6 |
DINO | Manual | ViT-S/16 | 100 | 59.6 |
DINO | AutoView | ViT-S/16 | 100 | 60.3 |
Classification results on OpenImages. We first pretrain models on the OpenImages dataset. After pre-training, we freeze the backbone weights and train a linear classifier with a multi-class logistic regression loss, following the mAP metric described in [50]. The last row of Tab. II shows the consistent gain obtained by using AutoView on OpenImages.
4.3 Progressive Learning for Self-supervised training
Self-supervised training is time-consuming, and image size plays an important role in training efficiency. [51, 52, 53, 54] have proposed different kinds of progressive training, which dynamically change the training settings or networks. Taking the insight of EfficientNetV2 [54], we present an efficient self-supervised learning scheme that progressively increases the image size during training. We divide the training process into four stages: in the early epochs, we train the network with smaller images so that it can learn simple representations easily and quickly, and we then gradually increase the image size. As shown in Tab. III, this indeed improves training speed, but it also comes with a drop in accuracy. In this section, all models are trained for 100 epochs with ViT-S/16, and training time is measured on the same machine using 8 RTX 2080Ti GPUs with a total batch size of 256.
Method | Train Scheme | Linear | k-NN | Time
DINO | Manual | 74.0 | 69.1 | 61h |
DINO | Progressive | 73.7(-0.4) | 69.2(+0.1) | |
DINO | Progressive RandAug | 74.2(+0.1) | 70.1(+1.0) | 46h |
DINO | Progressive AutoView | 74.5(+0.5) | 70.5(+1.5) |
Progressive Learning with AutoView. As EfficientNetV2 points out, using the same regularization for all image sizes causes the drop in accuracy; to improve both training speed and accuracy, they adaptively adjust regularization linearly according to image size. To further explore this property, we run a RandAug-style grid search per training stage within our search space for DINO progressive training and select the setting with the best linear classification result on ImageNet.
With the searched settings for the four training stages, we improve both accuracy and training time. These results show that adaptive regularization helps progressive learning gain accuracy. However, the RandAug search results also show that a simple linear regularization schedule based solely on image size is not always the most suitable. Thanks to AutoView, we can instead dynamically adjust augmentations during training and obtain better results than the costly and tedious manual search, which clearly demonstrates the effectiveness and generalization of AutoView.
4.4 Downstream Tasks
We first evaluate our method on the dense downstream task of semantic segmentation, and then evaluate properties of the learned features of models pretrained for 300 epochs with ViT-S/16 in terms of nearest neighbor search and transferability to different datasets.
Method | Augmentation | Arch | mIoU | aAcc | mAcc |
Sup. | - | ViT-B/16 | 46.6 | - | -
DINO | Manual | ViT-B/16 | 46.8 | 82.4 | 56.6
DINO | AutoView | ViT-B/16 | 47.1 | 82.8 | 57.8
Semantic Segmentation. We evaluate our method on the dense downstream task of semantic segmentation, where we observe improvements over the vanilla pretrained baseline. In this experiment, models are pretrained for 400 epochs with ViT-B/16. Our semantic segmentation experiments on ADE20K [55] use UperNet [56], following the code of iBOT. We use the task layer in UperNet and fine-tune the entire network. From Tab. IV, we can see that AutoView improves DINO with ViT-B/16 by a clear margin of 1.2% mAcc.
Method | Video object segmentation (DAVIS-2017): (J&F)m | Jm | Fm | Oxford mAP: Medium | Hard
DINO Manual | 61.3 | 62.9 | 59.7 | 33.7 | 11.9 |
DINO AutoView | 61.7 | 63.1 | 60.3 | 36.5 | 13.6 |
Video Object Segmentation. We evaluate the output patch tokens on the DAVIS-2017 video instance segmentation benchmark [57]. We segment scenes via nearest neighbors between consecutive frames; we do not train any model on top of the features, nor fine-tune any weights for the task, and we follow the experimental protocol of DINO. According to Tab. V, AutoView is superior to vanilla DINO on all metrics.
Image Retrieval. We consider the revisited Oxford benchmark [58], which contains 3 splits of increasing difficulty with query/database pairs. We freeze the features and directly apply $k$-NN for retrieval, as in DINO, reporting the mean average precision (mAP) for the Medium and Hard splits. Consistent with its higher $k$-NN results on ImageNet-1K, AutoView also performs better on the image retrieval task (up to +2.8% mAP), as shown in Tab. V.
Method | CIFAR-10 | CIFAR-100 | Flowers | Cars
Rand. | 99.0 | 89.5 | 98.2 | 92.1 |
DINO Manual | 99.0 | 90.5 | 98.3 | 92.6 |
DINO AutoView | 99.0 | 90.5 | 98.7 | 93.3 |
Transfer learning. We evaluate the quality of the features pretrained on ImageNet-1K by fine-tuning on several smaller datasets. The results are shown in Tab. VI. While the results on CIFAR-10 and CIFAR-100 have almost plateaued, AutoView consistently boosts the baseline.
4.5 Robustness and Overfitting Analysis
AutoView enables models to cover more augmentation policies, which could improve robustness to novel examples. To verify whether AutoView improves ViT-based models’ robustness and out-of-distribution performance, we evaluate our pretrained models in two robustness scenarios: natural adversarial examples and out-of-distribution detection. We also verify that our method has no overfitting problem on ImageNetV2 [59]. In this section, we study robustness and overfitting with ViT-S/16 pretrained for 300 epochs and then linearly evaluated for 100 epochs.
Method | ImageNet-A Top-1 Acc | ImageNet-A AURRA | ImageNet-O AUPR | ImageNet-V2 k-NN
DINO Manual | 11.9 | 16.1 | 21.6 | 61.4 |
DINO AutoView | 13.2 | 18.2 | 22.6 | 62.3 |
Natural Adversarial Examples. ImageNet-A [60] adversarially collects 7,500 unmodified, natural but “hard” real-world images drawn from challenging scenarios (e.g., fog and occlusion). The metrics for assessing classifiers’ robustness to adversarially filtered examples are top-1 accuracy and the Area Under the Response Rate Accuracy curve (AURRA), an uncertainty estimation metric introduced in [60]. As shown in Tab. VII, the ViT trained by AutoView improves over the vanilla pretrained DINO by 2.1% AURRA.
Out-of-distribution Detection. ImageNet-O [60] is an adversarial out-of-distribution detection dataset that adversarially collects 2,000 images from outside ImageNet-1K. Anomalies of unforeseen classes should result in low-confidence predictions. The metric is the area under the precision-recall curve (AUPR). As Tab. VII indicates, AutoView outperforms the baseline by 1.0% AUPR.
Performance on ImageNetV2. Tab. VII also reports the evaluation on ImageNet V2 [59], whose test set is distinct from the ImageNet validation set and therefore reduces overfitting to it. The result (+0.9% $k$-NN) verifies the generalization and effectiveness of our method. Note that our method does not use the validation set for augmentation search, so there should be no overfitting problem.
4.6 Ablation study
In this section, we perform extensive ablation studies to analyze each component of our proposed AutoView. All results in this section are obtained by training ViT-S/16 with different schemes for 100 epochs.

Effectiveness of the Self-Regularized Loss. We compare the pretraining objective with and without the self-regularized loss. As shown in Fig. 4, the training curve with our self-regularized loss is much more stable and converges to a lower loss value, which demonstrates that the self-regularized loss effectively avoids information collapse. As shown in the fourth column of Tab. IX, without the self-regularized loss (Reg. Loss), performance drops by 0.4% in regular AutoView training and drops considerably, by 3.7%, in progressive learning, proving its importance in AutoView.
Method | Search Space | Linear | k-NN
DINO RandAug | RandAug | 67.5 | 59.7 |
DINO RandAug | AutoView | 74.0(+6.5) | 69.3(+9.6) |
DINO AutoView | RandAug | 69.9 | 63.5 |
DINO AutoView | AutoView | 74.8(+4.9) | 69.9(+6.4) |
Effectiveness of the Search Space. We compare our curated augmentation policy search space with the one designed for supervised learning that is generally used by popular AutoAugment methods. As shown in Tab. VIII, both AutoView and RandAug gain a large margin in linear and $k$-NN accuracy with our search space. This shows, for the first time, that an AutoAugment method for self-supervised learning can perform well within a policy search space. AutoView also significantly outperforms RandAug on the same search space (up to +3.8% $k$-NN accuracy). It is worth mentioning that the RandAug results are obtained through a tedious and time-consuming grid search.
H. Policy | Exc. Prob. | Weight | Reg. Loss | k-NN
✓ | ✓ | ✓ | ✓ | 69.9 |
✗ | ✓ | ✓ | ✓ | 69.1(-0.8) |
✓ | ✗ | ✓ | ✓ | 68.3(-1.6) |
✓ | ✓ | ✗ | ✓ | 69.0(-0.9) |
✓ | ✓ | ✓ | ✗ | 69.5(-0.4) |
✓ | ✓ | ✓ | ✓ | 70.5 |
✓ | ✓ | ✓ | ✗ | 66.8(-3.7) |
Importance of Searching. As shown in Tab. IX, we further investigate the effect of searching for the sampling weights and execution probability in AutoView. In the first column, we search the policy hierarchically (H. Policy) instead of sampling and applying the whole policy by probability. Then we fix the execution probability at 0.5 without searching (Exc. Prob.). Next, we randomly select views in our policy search framework rather than sampling by the searched weights (Weight). The clear accuracy drops show that hierarchical policy search, execution probability search, and weighted view sampling are all vital components of AutoView.
Loss weight $\lambda$ | 0 | 0.3 | 0.5 | 0.8 | 1.0 | 3.0
k-NN (20 epochs) | 47.41 | 47.53 | 47.57 | 47.64 | 47.72 | 47.52

The trade-off between the two objectives in AutoView. There exists a trade-off between the adversarial objective of Eqn. (2) and the self-regularized term of Eqn. (4) in our AutoView. We verify this trade-off through a loss weight $\lambda$ that multiplies the self-regularized term. The results are shown in Tab. X, where we can see that $\lambda = 1.0$, the setting we adopt in this paper, is optimal. Moreover, our self-regularized loss constrains the augmented view and the original view on the single teacher network, which is much weaker than the loss between the two siamese networks and can be seen as a regularization term; this regularization is adaptively adjusted during training, as Fig. 5 shows.
Setting | AutoView | Policy-retrain | SS-allop | Not-share | |
k-NN (10 epochs) | 42.7 | 42.7 | 41.1 | 40.6 | 41.1

Method Complexity. Our searched augmentation policy is general and universal: one can directly use the searched policy to obtain the same performance without searching. As shown in Tab. XI, the model retrained with the searched policy (Policy-retrain) performs comparably to the model that uses AutoView in an online way (AutoView).
Search Space Design. We adopt the supervised RA search space and make non-trivial modifications. The vanilla RA search space does not contain the transformations commonly used in SSL; we first added them, but found the effect unsatisfactory (SS-allop in Tab. XI). With many geometric transformations combined, the object often shifts from the center of the image [47], which hurts SSL model learning, so we removed these transformations.
Choice of $K$. First, the number of operations $K$ in state-of-the-art methods (i.e., AA, RA, DADA, etc.) is set to 2. Second, our method is based on RA, and during the RA search (see Sec. 4.3) we found that $K = 2$ achieved the best results. We also explore different values of $K$ in Tab. XI.
Result when the network is not shared. We have considered and verified various cases of network structure design. When the network is not shared across views, the result is sub-optimal (Not-share in Tab. XI).
Method | AutoView | InfoMin | Viewmaker | DADA | RandAugment |
Overhead | 0x | 1x | 1x | 1x | 4-80x |
Search Overhead. The search overhead of the different methods in Tab. XII shows the advantage of our one-step optimization: AutoView learns adversarial augmentation policies efficiently by learning views and network parameters simultaneously in a single forward-backward step, without additional search cost. Note that the reported search overheads are rough estimates: for InfoMin and Viewmaker we did not count the generation time of the pixel-wise image generator, and for DADA the search is conducted only on a small proxy subset.

Qualitative Studies of AutoView Searching. To demonstrate the effectiveness of AutoView's search, we visualize the searched sampling probabilities of the different operations in Fig. 7. In this experiment, we put the geometric transformations (e.g., ShearY, TranslateX) back into the search space. We can see that the probabilities of the geometric transformations shrink during training, while that of Grayscale, an augmentation commonly used in self-supervised methods, grows, which validates our AutoView search.

Visualization Results of Attention Maps. In Fig. 6, we show that AutoView captures object information visually better than DINO Manual, especially in scenes with complex backgrounds. For example, in the first and last columns, the attention maps learned by AutoView cover the complete objects (e.g., cats and cabinets) and suppress background interference more effectively than DINO Manual.
As Fig. 8 shows, without self-regularized loss, the features learned by the model are chaotic. By comparison, the model trained with our self-regularized loss can learn the target object information well.
5 Conclusion
In this work, we present AutoView, a self-regularized adversarial AutoAugment method to learn views for self-supervised vision transformers. Extensive experiments show that AutoView achieves remarkable improvements over the RandAug baseline (up to +10.2% $k$-NN classification accuracy) and consistently outperforms view policies designed by human experts on various tasks. As shown by the ablation studies, AutoView outperforms RandAug with grid search by 2.4% linear accuracy on the same search space, and our curated search space improves the RandAug baseline significantly, by 6.5% linear accuracy. The ablation analyses also show that the proposed self-regularized loss term successfully addresses the information collapse problem. Additionally, experiments on the efficient self-supervised learning scheme with progressive training further demonstrate the effectiveness and generalization of AutoView.
References
- [1] P. Bachman, R. D. Hjelm, and W. Buchwalter, “Learning representations by maximizing mutual information across views,” Advances in neural information processing systems (NeurIPS), vol. 32, 2019.
- [2] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in International conference on machine learning. PMLR, 2020, pp. 1597–1607.
- [3] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9729–9738.
- [4] J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. Richemond, E. Buchatskaya, C. Doersch, B. Avila Pires, Z. Guo, M. Gheshlaghi Azar et al., “Bootstrap your own latent-a new approach to self-supervised learning,” Advances in Neural Information Processing Systems (NeurIPS), vol. 33, pp. 21 271–21 284, 2020.
- [5] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin, “Emerging properties in self-supervised vision transformers,” in Proceedings of the International Conference on Computer Vision, 2021.
- [6] Z. Xie, Y. Lin, Z. Yao, Z. Zhang, Q. Dai, Y. Cao, and H. Hu, “Self-supervised learning with swin transformers,” arXiv preprint arXiv:2105.04553, 2021.
- [7] X. Chen, S. Xie, and K. He, “An empirical study of training self-supervised vision transformers,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 9640–9649.
- [8] C. Li, J. Yang, P. Zhang, M. Gao, B. Xiao, X. Dai, L. Yuan, and J. Gao, “Efficient self-supervised vision transformers for representation learning,” International Conference on Learning Representations, 2021.
- [9] J. Zhou, C. Wei, H. Wang, W. Shen, C. Xie, A. Yuille, and T. Kong, “ibot: Image bert pre-training with online tokenizer,” International Conference on Learning Representations, 2021.
- [10] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
- [11] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, “Training data-efficient image transformers & distillation through attention,” in International Conference on Machine Learning. PMLR, 2021, pp. 10 347–10 357.
- [12] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le, “Autoaugment: Learning augmentation strategies from data,” in CVPR, 2019.
- [13] E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le, “Randaugment: Practical automated data augmentation with a reduced search space,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 702–703.
- [14] Y. Li, G. Hu, Y. Wang, T. Hospedales, N. M. Robertson, and Y. Yang, “Differentiable automatic data augmentation,” in European Conference on Computer Vision. Springer, 2020, pp. 580–595.
- [15] A. Liu, Z. Huang, Z. Huang, and N. Wang, “Direct differentiable augmentation search,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 12 219–12 228.
- [16] S. Lim, I. Kim, T. Kim, C. Kim, and S. Kim, “Fast autoaugment,” Advances in Neural Information Processing Systems (NeurIPS), vol. 32, 2019.
- [17] X. Zhang, Q. Wang, J. Zhang, and Z. Zhong, “Adversarial autoaugment,” in International Conference on Learning Representations, 2019.
- [18] Y. Tian, C. Sun, B. Poole, D. Krishnan, C. Schmid, and P. Isola, “What makes for good views for contrastive learning?” Advances in Neural Information Processing Systems (NeurIPS), vol. 33, pp. 6827–6839, 2020.
- [19] A. Tamkin, M. Wu, and N. Goodman, “Viewmaker networks: Learning views for unsupervised representation learning,” in International Conference on Learning Representations, 2020.
- [20] S. Suresh, P. Li, C. Hao, and J. Neville, “Adversarial graph augmentation to improve graph contrastive learning,” Advances in Neural Information Processing Systems, vol. 34, pp. 15 920–15 933, 2021.
- [21] S. Arora, H. Khandeparkar, M. Khodak, O. Plevrakis, and N. Saunshi, “A theoretical analysis of contrastive unsupervised representation learning,” arXiv preprint arXiv:1902.09229, 2019.
- [22] T. Chen, C. Luo, and L. Li, “Intriguing properties of contrastive losses,” Advances in Neural Information Processing Systems, vol. 34, pp. 11 834–11 845, 2021.
- [23] T. Hua, W. Wang, Z. Xue, S. Ren, Y. Wang, and H. Zhao, “On feature decorrelation in self-supervised learning,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 9598–9608.
- [24] S. Purushwalkam and A. Gupta, “Demystifying contrastive self-supervised learning: Invariances, augmentations and dataset biases,” Advances in Neural Information Processing Systems, vol. 33, pp. 3407–3418, 2020.
- [25] T. Wang and P. Isola, “Understanding contrastive representation learning through alignment and uniformity on the hypersphere,” in International Conference on Machine Learning. PMLR, 2020, pp. 9929–9939.
- [26] T. Xiao, X. Wang, A. A. Efros, and T. Darrell, “What should not be contrastive in contrastive learning,” arXiv preprint arXiv:2008.05659, 2020.
- [27] J. Zbontar, L. Jing, I. Misra, Y. LeCun, and S. Deny, “Barlow twins: Self-supervised learning via redundancy reduction,” in International Conference on Machine Learning. PMLR, 2021, pp. 12 310–12 320.
- [28] E. Xie, J. Ding, W. Wang, X. Zhan, H. Xu, P. Sun, Z. Li, and P. Luo, “Detco: Unsupervised contrastive learning for object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 8392–8401.
- [29] Y. Xiong, M. Ren, and R. Urtasun, “Loco: Local contrastive representation learning,” Advances in neural information processing systems, vol. 33, pp. 11 142–11 153, 2020.
- [30] C. Lang, A. Braun, and A. Valada, “Contrastive object detection using knowledge graph embeddings,” arXiv preprint arXiv:2112.11366, 2021.
- [31] M. Wu, C. Zhuang, M. Mosse, D. Yamins, and N. Goodman, “On mutual information in contrastive learning for visual representations,” arXiv preprint arXiv:2005.13149, 2020.
- [32] X. Chen and K. He, “Exploring simple siamese representation learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 15 750–15 758.
- [33] Y. Tian, X. Chen, and S. Ganguli, “Understanding self-supervised learning dynamics without contrastive pairs,” in International Conference on Machine Learning. PMLR, 2021, pp. 10 268–10 278.
- [34] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in ICCV, 2021.
- [35] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
- [36] H. Bao, L. Dong, and F. Wei, “Beit: Bert pre-training of image transformers,” International Conference on Learning Representations, 2021.
- [37] X. Chen, M. Ding, X. Wang, Y. Xin, S. Mo, Y. Wang, S. Han, P. Luo, G. Zeng, and J. Wang, “Context autoencoder for self-supervised representation learning,” arXiv preprint arXiv:2202.03026, 2022.
- [38] X. Kong and X. Zhang, “Understanding masked image modeling via learning occlusion invariant feature,” arXiv preprint arXiv:2208.04164, 2022.
- [39] M. Zhang, H. Li, S. Pan, X. Chang, Z. Ge, and S. Su, “Differentiable neural architecture search in equivalent space with exploration enhancement,” Advances in Neural Information Processing Systems (NeurIPS), vol. 33, pp. 13 341–13 351, 2020.
- [40] H. Liu, K. Simonyan, and Y. Yang, “Darts: Differentiable architecture search,” in International Conference on Learning Representations, 2018.
- [41] T. Suzuki, “Teachaugment: Data augmentation optimization using teacher knowledge,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10 904–10 914.
- [42] E. Jang, S. Gu, and B. Poole, “Categorical reparameterization with gumbel-softmax,” arXiv preprint arXiv:1611.01144, 2016.
- [43] Y. Ganin and V. Lempitsky, “Unsupervised domain adaptation by backpropagation,” in International conference on machine learning. PMLR, 2015, pp. 1180–1189.
- [44] L. Wei, A. Xiao, L. Xie, X. Zhang, X. Chen, and Q. Tian, “Circumventing outliers of autoaugment with knowledge distillation,” in European Conference on Computer Vision. Springer, 2020, pp. 608–625.
- [45] N. Tishby, F. C. Pereira, and W. Bialek, “The information bottleneck method,” arXiv preprint physics/0004057, 2000.
- [46] J. Hu, R. Ji, S. Zhang, X. Sun, Q. Ye, C.-W. Lin, and Q. Tian, “Information competing process for learning diversified representations,” Advances in Neural Information Processing Systems (NeurIPS), vol. 32, 2019.
- [47] D. Hendrycks, N. Mu, E. D. Cubuk, B. Zoph, J. Gilmer, and B. Lakshminarayanan, “Augmix: A simple data processing method to improve robustness and uncertainty,” in International Conference on Learning Representations, 2019.
- [48] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in CVPR, 2009.
- [49] A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, A. Kolesnikov et al., “The open images dataset v4,” International Journal of Computer Vision, vol. 128, no. 7, pp. 1956–1981, 2020.
- [50] S. Mishra, A. Shah, A. Bansal, A. Jagannatha, A. Sharma, D. Jacobs, and D. Krishnan, “Object-aware cropping for self-supervised learning,” arXiv preprint arXiv:2112.00319, 2021.
- [51] H. Yu, A. Liu, X. Liu, G. Li, P. Luo, R. Cheng, J. Yang, and C. Zhang, “Pda: Progressive data augmentation for general robustness of deep neural networks,” arXiv preprint arXiv:1909.04839, 2019.
- [52] O. Press, N. A. Smith, and M. Lewis, “Shortformer: Better language modeling using shorter inputs,” arXiv preprint arXiv:2012.15832, 2020.
- [53] T. Karras, T. Aila, S. Laine, and J. Lehtinen, “Progressive growing of gans for improved quality, stability, and variation,” in International Conference on Learning Representations, 2018.
- [54] M. Tan and Q. Le, “Efficientnetv2: Smaller models and faster training,” in International Conference on Machine Learning. PMLR, 2021, pp. 10 096–10 106.
- [55] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, “Scene parsing through ade20k dataset,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 633–641.
- [56] T. Xiao, Y. Liu, B. Zhou, Y. Jiang, and J. Sun, “Unified perceptual parsing for scene understanding,” in European Conference on Computer Vision, 2018, pp. 418–434.
- [57] J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbeláez, A. Sorkine-Hornung, and L. Van Gool, “The 2017 davis challenge on video object segmentation,” arXiv preprint arXiv:1704.00675, 2017.
- [58] F. Radenović, A. Iscen, G. Tolias, Y. Avrithis, and O. Chum, “Revisiting oxford and paris: Large-scale image retrieval benchmarking,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5706–5715.
- [59] B. Recht, R. Roelofs, L. Schmidt, and V. Shankar, “Do imagenet classifiers generalize to imagenet?” in International Conference on Machine Learning. PMLR, 2019, pp. 5389–5400.
- [60] D. Hendrycks, K. Zhao, S. Basart, J. Steinhardt, and D. Song, “Natural adversarial examples,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 15 262–15 271.