Memory-based Jitter: Improving Visual Recognition on Long-tailed Data
with Diversity In Memory

Jialun Liu¹ Jingwei Zhang² Yi Yang³ Wenhui Li¹ Chi Zhang⁴ Yifan Sun⁴

¹Jilin University ²Tsinghua University ²Zhe Jiang University ⁴Megvii Inc.

Abstract

This paper considers deep visual recognition on long-tailed data. To be general, we tackle two applied scenarios, i.e., deep classification and deep metric learning. Under the long-tailed data distribution, the majority classes (i.e., tail classes) only occupy relatively few samples and are prone to lack of within-class diversity. A radical solution is to augment the tail classes with higher diversity. To this end, we introduce a simple and reliable method named Memory-based Jitter (MBJ). We observe that during training, the deep model constantly changes its parameters after every iteration, yielding the phenomenon of weight jitters. Consequentially, given a same image as the input, two historical editions of the model generate two different features in the deeply-embedded space, resulting in feature jitters. Using a memory bank, we collect these (model or feature) jitters across multiple training iterations and get the so-called Memory-based Jitter. The accumulated jitters enhance the within-class diversity for the tail classes and consequentially improves long-tailed visual recognition. With slight modifications, MBJ is applicable for two fundamental visual recognition tasks, i.e., deep image classification and deep metric learning (on long-tailed data). Extensive experiments on five long-tailed classification benchmarks and two deep metric learning benchmarks demonstrate significant improvement. Moreover, the achieved performance are on par with the state of the art on both tasks.

Figure 1: The proposed Memory-based Jitter (MBJ) enhances the tail feature diversity by accumulating historical features into a memory bank. We visualize the feature distribution of CIFAR-10 [13] with t-SNE [33]. We focus on a specified class ID-10. In (a), ID-10 has abundant training samples and is a head class. Its top-1 accuracy is 94.6%. In (b), we reduce the training samples of ID-10, so it becomes a tail class. Due to the lack of within-class diversity, its feature distribution collapses into a very limited scope and the top-1 accuracy dramatically decreases to 50.6%. In (c), MBJ collects historical features distributed among multiple training iterations into a memory bank. The historical features are scattered around the up-to-date features in the deeply-embedded space, yielding the so-called Memory-based Jitter (MBJ). Consequentially, MBJ enhances the tail data diversity and increases the classification accuracy of ID-10 from 50.6% to 88.6%.

1 Introduction

In visual recognition tasks, the long-tailed distribution of the data is a common and natural problem under realistic scenarios [35, 16, 7, 8]. A few categories (i.e., the head classes) occupy most of the data while the most categories (i.e., the tail classes) only occupy relatively few data. Such long-tailed distribution significantly challenges deep visual recognition, including both deep image classification [48, 12, 4, 35, 1, 6, 50] and deep metric learning [17, 42, 8, 32].

Explicitly, deep image classification and deep metric learning have the fundamental differences. Specifically, there are two aspects, i.e. the task definition and the optimized objective. On one hand, the definitions of two tasks are different. The classification task aims to recognize the already-seen classes. The categories of the training set and testing set are completely overlapped. The metric learning task aims to discriminate the unseen classes. The identities of training set and the testing set have no overlap. On the other hand, the optimized objectives of two tasks are also different. In classification task, the model aims to learn an accurate and unbiased classifier that outputs the correct label to the test instance as much as possible. In metric learning task, the model aims to learn a discriminative feature extractor that encourages the instances from the same class to be closer than those from different classes. To be general, this paper considers long-tailed visual recognition on these two elemental tasks with a uniform motivation.(As to be detailed in Section 3.2 and Section 3.3.)

We recognize the insufficient within-class diversity of the tail classes as the most prominent reason that hinders long-tailed deep visual recognition. In the deeply-embedded feature space, a tail class is under-represented and thus hard to recognize. To validate this point, we visualize the deep embedding of CIFAR-10 dataset in Fig. 1. When a specified class (“ID-10”) degrades from head (Fig.1 (a)) to tail (Fig.1 (b)), its visual concept collapses into a very limited scope in the deep embedding. Consequentially, when we employ the model for inference, samples from “ID-10” may exceed the already-learned scope and are thus easily mis-classified. Intuitively, a radical solution is to augment the tail classes with higher diversity.

We notice two phenomena which are potential for enhancing the tail data diversity, i.e., the weight jitter and the feature jitter. During training, the deep model constantly changes its parameters after every iteration, yielding the phenomenon of weight jitter. Consequentially, given a same image as the input, the models at two different training iterations generate two different feature representations in the deeply-embedded space, resulting in the phenomenon of feature jitter.

Since these jitters are distributed among historical models, we need to accumulate them across multiple training iterations for diversity enhancement. To this end, we employ a memory bank to store the desired jitters, and get the so-called Memory-based Jitter (MBJ). With slight modifications, MBJ is capable to accommodate two elemental visual recognition tasks, i.e., deep image classification and deep metric learning. On deep image classification, MBJ collects the historical features (i.e., feature jitters). Consequentially, the feature memory bank accumulates abundant tail-feature jitters, and improves the classification accuracy on tail classes, as shown in Fig. 1(c)). On deep metric learning, MBJ collects the weight vectors of the classifier layer instead of the features. Each weight vector is typically viewed as the prototype of a training class, so we name the corresponding memory bank as the prototype memory bank.

Besides the accumulated jitter, MBJ is benefited from a novel re-sampling effect between head and tail classes. On both the classification and the deep metric learning task, MBJ assigns larger sampling rate to the tail classes (than to the head classes). Correspondingly, the tail classes occupy more memory-based jitters than the head classes, which compensates for the imbalanced distribution of the raw data. We note that some recent works [48, 12] evidence that directly over-sampling the raw images, though alleviates the data imbalance problem to some extent, actually compromises deep embedding learning. In contrast, MBJ maintains the natural sampling frequency on the raw images and re-balances the head and tail classes in the memory bank.

The main contributions of this paper are summarized as follows:

$\bullet$ We find that the weight jitters and the feature jitters are informative clues to gain extra diversity for tail data augmentation.

$\bullet$ We propose Memory-based Jitter to accumulate the jitters within a memory bank and improve deep visual recognition on long-tailed data. MBJ is compatible to two elemental visual tasks, i.e., deep image classification and deep metric learning, with slight modifications.

$\bullet$ We conduct extensive experiments on five classification benchmarks and two metric learning benchmarks (person re-identification, in particular) under long-tailed scenario. On all these benchmarks, we demonstrate the superiority of our methods, which significantly improves the baseline and is on par with the state-of-the-art methods.

2 Related Work

2.1 Re-balancing strategy

MBJ has a novel re-balancing strategy, compared with prior works on long-tailed visual recognition. Generally, re-balancing aims to highlight the tail classes during training. In prior works, there are two major re-balancing types, i.e., re-weighting[11, 40, 6] and re-sampling[24, 46, 2, 3]. Re-weighting strategy allocates larger weights to tail classes in loss function. Re-sampling over-samples the raw images of the tail classes for training.

Different from these prior works, MBJ re-samples the features / prototypes to highlight the tail classes. It thus avoids directly re-sampling the raw data. Since directly re-sampling the raw data actually compromises the deep embedding learning [48, 12], avoiding such operation substantially benefits MBJ. An ablation study carried out on the long-tailed CIFAR-10 dataset shows that when we completely remove the jitter augmentation, this novel re-sampling strategy still brings $+2.1\%$ improvement over the baseline. The details are to be accessed in Section 4.5.2.

2.2 Memory-based learning

The memory bank plays a critical role in MBJ. Both the weight jitters and the feature jitters are scattered among sequential training iterations. To accumulate these jitters for tail data augmentation, we employ a memory bank. Since memory-based learning has been explored in several computer vision domains, including unsupervised learning, semi-supervised learning and supervised learning [9, 31, 14, 39, 23, 49], we make a detailed comparison as follows.

In unsupervised learning, [9] employs memory to include more data in the dictionary. It shows that larger optimization scope within a optimization step is beneficial for unsupervised learning [9]. In semi-supervised learning, [14, 31] enforce consistency between historical predictions. Such consistency offers auxiliary supervision for the unlabeled data. In supervised deep metric learning, [39] uses memory to enhance the hard mining effect. Regardless of their objectives of using memory, they all hold a negative attitude towards the jitters. [9] and [14, 31] suppress the jitters with momentum and consistency constraint, respectively. [39] tries to avoid the jitters by delaying the injection of memory.

In contrast to their negative attitude towards jitters, we find that the jitters are informative for long-tailed visual recognition. As a major contribution of this work, we analyze the mechanism in Section 3.1 and experimentally validate its effectiveness in Section 4.5.2.

Moreover, we notice a recent work IEM [50] also employs memory for long-tailed image classification. We compare MBJ against IEM in details for clarity. Our method significantly differs from IEM in three aspects, i.e., the applied task, the mechanism and the achieved performance. First, IEM is specified for image classification, while MBJ improves both image classification and deep metric learning with a uniform motivation. Second, IEM considers tail classes are harder to recognize, and thus employ more prototypes from the memory for higher redundancy, while MBJ employs the jitters in memory to augment the diversity of tail data. Finally, on image classification task, MBJ maintains competitive performance with significantly higher computing efficiency. IEM requires extraordinary large amount of memory (up to $50,000$ per class), and achieves Top-1 accuracy of $67\%$ on iNaturelist18. In contrast, MBJ is more memory-efficient and more accuracy. For example, on iNaturelist18 [34], MBJ only stores $40,000$ memorized features in total and achieves Top-1 accuracy of $66.9\%$ and $70.0\%$ at $90$ and $200$ epochs, respectively.

With these comparison, we find that MBJ is featured for the memory-based feature augmentation. It is orthogonal to many prior works. Specifically, we note a very recent work RIDE [38] using multiple classifiers (experts) ensemble to improve the accuracy of head and tail classes, simultaneously. MBJ can be integrated into RIDE [38] for better performance gains.

3 Proposed Method

Basically, MBJ accumulates historical jitters within a memory bank to enhance the diversity of the tail classes. Under this framework, MBJ adopts slight modifications to accommodate two fundamental visual recognition tasks, i.e. feature jitters for image classification and prototype jitters for deep metric learning. In this section, we first analyze the weight jitters and the feature jitters in Section 3.1. Then we introduce the MBJ for image classification and deep metric learning in Section 3.2 and Section 3.3, respectively.

3.1 Weight Jitters and Feature Jitters

Refer to caption — Figure 2: Quantitative statistics on feature jitters and weight jitters. We use a long-tailed CIFAR-10 [13] as the toy dataset and focus on the tail class / image. As the accumulated features / prototypes increase, the angular variance gradually increases.

To illustrate the phenomena of weight jitters and feature jitters, we conduct a toy experiment on CIFAR-10 [13]. We set a specified class to contain very limited samples (i.e., 50 samples) so that it turns into a tail class. We train a deep classification model to convergence and then continue the training for observation purpose. Within the following iterations, we record two objects, i.e., 1) the prototype (i.e., the weight vector in the classification layer) of the tail class and 2) the feature of a single tail sample. As the training iterates, both the prototypes and the features accumulate, allowing a quantitative statistic on their variances. We visualize the geometrical angular variance of the accumulated features / prototypes in Fig. 2, from which we draw two observations.

First, we observe considerable variance among the accumulated weight vectors (i.e., prototypes), as well as the accumulated features. It indicates that among multiple iterations, both the prototype of a single class and the feature of a single image keep on changing itself, yielding the phenomena of weight jitters and feature jitters, respectively.

Second, we observe that the above-described jitters require certain training iterations to accumulate before they reach stable status. When there is only one single feature / prototype, the corresponding variance is naturally $0$ . As the number of accumulated features / prototypes increases, the variance gradually grows until reaching a stable level.

Based on the above observation, we device MBJ. MBJ uses a memory bank to collect the historical features / prototypes, so as to accumulate the feature / weight jitters for tail augmentation.

3.2 MBJ for Deep Image Classification

The pipeline of MBJ for deep image classification is illustrated in Fig.3. In the raw image space, MBJ randomly samples the images without re-balancing the head and tail classes. In each training iteration, the deep model transforms the raw images into a batch of features. Given the features in current mini-batch, MBJ stores them into a memory bank. The memory bank has a larger size than the mini-batch, so it is capable to accumulate historical features across multiple training iterations.

When collecting the features, MBJ lays emphasis on the tail classes, so that the tail and head data will be re-balanced. Specifically, MBJ assigns small sampling probabilities to head classes, as well as relatively large sampling probabilities to tail classes, which is formulated as:

P_{i}=\frac{(1/N_{i})^{\beta}}{\sum_{j}^{C}(1/N_{j})^{\beta}}

(1)

where $P_{i}$ is the corresponding sampling probability of class $i$ , $N_{i}$ is the sample number of the $i$ -th class, $C$ is the total number of classes, and $\beta$ is a hyper-parameter to control the strength of re-balancing. A larger $\beta$ results in higher priority on accumulating the tail features. We use $\beta=1.5$ in all of our experiments.

To control the memory size, we use a queue strategy for updating the memory bank. After the memory bank reaches its size limitation, we enqueue the newest features (i.e., the features in current mini-batch), and dequeue the oldest ones.

Given the feature memory and the batch features, MBJ combines both of them to learn the classifier in a joint optimization manner. Specifically, MBJ uses the features in current mini-batch and the weight vectors in the classification layer to deduce a cross-entropy loss, i.e., the loss $\mathcal{L}_{batch}$ . Meanwhile, MBJ uses the memorized features and the weight vectors in the classification layer to deduce another cross-entropy loss, i.e., the loss $\mathcal{L}_{memory}$ . MBJ sums up those two losses by:

\mathcal{L}_{total}=\mathcal{L}_{memory}\times\eta+\mathcal{L}_{batch},

(2)

in which $\eta$ is a weighting factor. Algorithm 1 provides the pseudo-code of MBJ for deep image classification task.

Algorithm 1 Pseudocode of MBJ in Classification task.

⬇

Train network Q conventionally until convergence.

Initialize the memory as queue M.

Set the learning rate with a small value.

# input: data, target: labels, batch_f: batch_feature

for input, target in loader:

batch_f = Q.forward(input)

# memory update with higher sampling probability on

# the tail classes

enqueue(M,(batch_f.detach(),target))

dequeue(M)

# batch loss and memory loss

L_batch = criterion_batch(batch_f,target)

L_memory = criterion_memory(M)

L_total = L_memory

\times

\eta

+ L_batch

L_total.backward()

optimizer.step()

3.3 MBJ for Deep Metric Learning

A popular baseline [21, 17, 26, 25, 18, 27, 28] for deep metric learning is as follows: during training, we learn a classification model on the training set. The weight vectors in the classification layer are typically recognized as prototypes for each class. During testing, the distance between two images are measured under the learned deep embedding. Based on this baseline, MBJ collects the historical prototypes into a prototype memory with emphasis on the tail classes.

The sampling strategy is exactly the same as in Eq.1, so we omit the detailed description. Given the historical prototypes in memory and the up-to-date prototypes, MBJ combines both to learn the features. For clarity, we illustrate the learning process with focus on a single feature $x$ under optimization.

To learn with the up-to-date prototypes $W=\{w_{1},w_{2},\cdots,w_{C}\}$ ( $C$ is the total number of training classes), MBJ adopts a popular deep metric learning method, i.e., CosFace [36], which is formulated as:

\mathcal{L}_{batch}=-{\log{\frac{\exp\left({\alpha(w^{T}_{y}x-\delta)}\right)}{\exp\left({\alpha(w^{T}_{y}x-\delta)}\right)+\sum_{k\neq y}^{C}{\exp\left({\alpha w^{T}_{k}x}\right)}}}}

(3)

in which $C$ is the number of classes, $w_{y}$ is the prototype of the target class, $\alpha$ is a scale factor and $\delta$ is a margin for better similarity separation.

To learn with the prototype memory, MBJ needs to deal with multiple positive prototypes associated with feature $x$ . It is because the weight vector of the target class in the classifier may be sampled at several training iterations. Let us assume there are $K$ positive prototypes $\{u_{1},u_{2},\cdots,u_{K}\}$ (i.e., weight vectors of the target class), and $L$ negative prototypes $\{v_{1},v_{2},\cdots,v_{L}\}$ (i.e., weight vectors of the non-target class). We find that a recent deep metric learning method, i.e., Circle Loss[26], allows multiple similarities associated with a single sample feature. Accordingly to Circle Loss, we define the loss function associated with the prototype memory by:

\mathcal{L}_{memory}=\log\left[1+\sum_{j=1}^{L}\sum_{i=1}^{K}{\exp(\alpha(v_{j}^{T}x-u_{i}^{T}x+\delta)}\right]

(4)

Similar to Eq.2, MBJ sums the above two losses (i.e., $\mathcal{L}_{batch}$ and $\mathcal{L}_{memory}$ ) to optimize the feature $x$ . We note that two editions of MBJ share a unified framework, except for the jitter type. To improve the classification accuracy, MBJ memorizes the feature jitters; To improve the retrieval accuracy, MBJ memorizes the prototype jitters. They have a dual pattern against each other. Algorithm 2 provides the pseudo-code of MBJ for deep metric learning task.

Algorithm 2 Pseudocode of MBJ in DML task.

⬇

Train network Q conventionally until convergence.

Initialize the memory as queue M.

Set the learning rate with a small value.

# input: data, target: labels

# batch_f: batch_feature, batch_w: batch_weight

for input, target in loader:

batch_f = Q.forward(input)

batch_w = Q.fc.weight

# memory update with higher sampling probability on

# the tail calsses

enqueue(M, (batch_w.detach(),target))

dequeue(M)

# batch loss and memory loss

L_batch_ = criterion_batch(batch_f,batch_w,target)

L_memory = criterion_memory(batch_f,M,target)

L_total = L_memory

\times

\eta

+ L_batch

L_total.backward()

optimizer.step()

3.4 Discussions

MBJ is featured for its re-balancing strategy and its augmentation manner. Although re-balancing the features / prototypes (instead of re-balancing the raw data) considerably benefits MBJ (as introduced in Section 2.1), we note that the improvement is mainly because the accumulated jitters increase the tail data diversity. Removing the jitters or using other augmentation method significantly compromise MBJ. The details are to be accessed in Section 4.5.2.

Dataset	Long-tailed CIFAR-10			Long-tailed CIFAR-100
Imbalanced ratio (IR)	100	50	10	100	50	10
Basel. (CE)	70.4	74.8	86.4	38.3	43.9	55.7
Focal Loss [15]	70.4	76.7	86.7	38.4	44.3	55.8
CB Focal [6]	74.6	79.3	87.1	39.6	45.2	58.0
LDAM-DRW [4]	77.0	81.0	88.2	42.0	46.6	58.7
BBN [48]	79.8	82.2	88.3	43.7	47.0	59.1
SSP [41]	77.8	82.1	88.5	43.4	47.1	58.9
De-confound-TDE [30]	80.6	83.6	88.5	44.1	50.3	59.6
RIDE (4 experts)^‡ [38]	-	-	-	48.7	59.0	58.4
MBJ	81.0	86.6	88.8	45.8	52.6	60.7
MBJ + RIDE (4 experts)	-	-	-	49.9	59.8	58.9

Table 1: Comparison with baseline and the state-of-the-art methods on long-tailed CIFAR-10 and CIFAR-100. We report top-1 accuracy rates. The results of MBJ are in bold. The results of “MBJ + RIDE” are in red.

\ddagger

denotes our reproduced results with released code.

Methods	ImageNet-LT				Places-LT
Methods	Many-shot	Medium-shot	Few-shot	Overall	Many-shot	Medium-shot	Few-shot	Overall
Basel.(CE)	65.9	37.5	7.7	44.4	45.7	27.3	8.2	30.2
Lifted Loss [20]	35.8	30.4	17.9	30.8	41.1	35.4	24.0	35.2
Focal Loss [15]	36.4	29.9	16.0	30.5	41.1	34.8	22.4	34.6
OLTR [19]	43.2	35.1	18.5	35.6	44.7	37.0	25.3	35.9
Decouple-NCM [12]	56.6	45.3	28.1	47.3	40.4	37.1	27.3	36.4
Decouple-cRT [12]	61.8	46.2	27.4	49.6	42.0	37.6	24.9	36.7
Decouple- $\tau$ -norm [12]	59.1	46.9	30.7	49.4	37.8	40.7	31.8	37.9
Decouple-LWS [12]	60.2	47.2	30.3	49.9	40.6	39.1	28.6	37.6
FSA [5]	-	-	-	-	42.8	37.5	22.7	36.4
De-confound-TDE [30]	62.7	48.8	31.6	51.8	-	-	-	-
RIDE (4 experts)^‡ [38]	67.8	53.4	36.2	56.6	-	-	-	-
MBJ	61.6	48.4	39.0	52.1	39.5	38.2	35.5	38.1
MBJ + RIDE (4 experts)	68.4	54.1	37.7	57.7	-	-	-	-

Table 2: Long-tailed recognition accuracy on ImageNet-LT and Places-LT. On ImageNet-LT, all models used the ResNeXt-50 backbone. On Places-LT, start from an ImageNet pre-trained ResNet152. We report the Top-1 accuracy. The results of MBJ are in bold. The results of “MBJ + RIDE” are in red.

\ddagger

denotes our reproduced results with released code.

4 Experiments

4.1 Datasets and setup

Classification task. Under long-tailed image classification scenario, we evaluate MBJ on 5 datasets, i.e., CIFAR-10, CIFAR-100, ImageNet-LT, Places-LT and iNaturalist18.

For CIFAR datasets, we synthesize several long-tailed version, following [4]. We use an imbalance ratio to denote the ratio between sample size of the most frequent and least frequent class. Imbalanced ratio (IR) in our experiments are set to $10$ , $50$ and $100$ , respectively.

ImageNet-LT, Places-LT and iNaturalist18 are publicly available long-tailed dataset. In ImageNet-LT, the maximum of images per class is $1280$ and the minimum of images per class is $5$ . In Places-LT, the largest class has $4980$ images while the smallest ones have $5$ images. Their test set is balanced. The iNaturalist18 dataset is a large-scale dataset with extremely imbalanced label distribution. It has $437,513$ images from $8,142$ classes. We adopt the official training and validation splits for our experiments.

Deep metric learning. We employ the person re-identification (re-ID) [47, 29, 44] task to evaluate MBJ on deep metric learning. Given a query person, re-ID aims to spot the appearance of the same person in the gallery. The keynote of re-ID is to learn accurate metric that measures the similarity between query and gallery images. We adopt two dataset, i.e., Market-1501[43] and DukeMTMC-reID[22, 45]. Following the settings in feature cloud [17], we synthesize several long-tailed editions based on the original datasets. For comprehensive evaluation, we vary the number of head classes as $20$ , $50$ and $100$ , respectively. All the tail classes contain only $5$ images per class.

4.2 Implementation Details

Parameter Settings. For both task, the re-balancing factor $\beta$ (Eq. 1) is set to $1.5$ and the memory size is set to $5*C$ ( $C$ is the total number of training classes). For classification task, the loss ratio $\eta$ (Eq. 2) is set to $15$ . In deep metric learning task, the loss ratio $\eta$ (Eq. 2) is set to $1/15$ . Please refer to supplementary materials for more details.

4.3 Experiments on image classification

4.3.1 Evaluation on long-tailed CIFAR-10 / 100

Table 1 compares MBJ with the baseline and several state-of-the-art methods on the long-tailed CIFAR-10 and CIFAR-100. Comparing MBJ with “Basel. (CE)”, we find that MBJ significantly improves the baseline. Under the setting of IR $100$ , for instance, MBJ surpasses the baseline by $+10.6\%$ and $+7.5\%$ top-1 accuracy on CIFAR-10 and CIFAR-100, respectively. Comparing MBJ with several state-of-the-art methods, we clearly observe the superiority of MBJ. For example, comparing MBJ with De-confound-TDE [30], under all the IR settings, MBJ achieves higher classification accuracy. Especially under the setting of IR $50$ , MBJ marginally surpasses it by $+3.6\%$ and $+7.2\%$ top-1 accuracy on CIFAR-10 and CIFAR-100, respectively.

When we compare MBJ against the recent work RIDE [38], the accuracy of MBJ is lower than that of RIDE [38]. It is because RIDE uses multiple classifiers experts (i.e., 4) ensemble. MBJ is featured of the memory-based feature augmentation, and thus it can be integrated into RIDE, achieving the better performance.

4.3.2 Evaluation on ImageNet-LT and Places-LT

The experiment results on ImageNet-LT and Places-LT are shown in Table 2. These two datasets offer separate evaluations under Many-shot (more than 100 training images per class), Medium-shot (20 to 100 training images per class), Few-shot (less than 20 images per class) and the Overall performance, respectively. From Table 2, We draw three observations as follows: First, under Many-shot and Medium-shot, MBJ achieves comparable accuracy. For example, on ImageNet-LT, MBJ is slightly lower than De-confound-TDE [30] by $-1.1\%$ (Many-shot) and $-0.4\%$ (Medium-shot). Second, under Few-shot, MBJ exhibits significant superiority against all the competing methods. On ImageNet-LT, MBJ surpasses the second best method RIDE [38] by $+2.8\%$ top-1 accuracy. Third, due to significant superiority under the Few-shot, as well as the comparable performance under Many-shot and Medium-shot, the Overall performance of MBJ is on par with the state of the art.

Moreover, we notice that most the methods (including the proposed MBJ) actually lose some accuracy under the Many-shot, compared with the baseline (i.e., “Basel”). Only RIDE [38] improves the accuracy of Many-shot, Medium-shot and Few-shot, simultaneously. We combine MBJ with RIDE, then we further improve the performance.

Methods	90 epochs	200 epochs
Basel.(CE)	61.1	65.3
IEM* [50]	67.0	-
LDAM [4]	64.6	-
FSA [5]	65.9	-
CE+SSP [41]	64.4	-
LDAM-DRW ^† [4]	64.6	66.1
BBN [48]	66.3	69.7
Decouple-NCM [12]	58.2	63.1
Decouple-cRT [12]	65.2	67.6
Decouple- $\tau$ -norm [12]	65.6	69.3
Decouple-LWS [12]	65.9	69.5
RIDE (4 experts)^‡ [38]	-	72.6
MBJ	66.9	70.0
MBJ + RIDE (4 experts)	-	73.2

Table 3: Comparison with baseline and the state-of-the-art methods on iNaturalist18. All models use the ResNet-50 [10] backbone. We report the Top-1 accuracy. IEM* denotes the IEM [50] using global feature for fair comparison.

\dagger

denotes the results copied from BBN [48].

\ddagger

denotes our reproduced results with released code. The results of MBJ are in bold. The results of “MBJ + RIDE” are in red.

4.3.3 Evaluation on iNaturalist18

We further evaluate MBJ on the large-scale long-tailed dataset i.e. iNaturalist18. The results are shown in Table 3. For fair comparison, we report performance achieved at $90$ and $200$ training epochs, following the common practice of previous works. Under both settings, MBJ achieves competitive accuracy. When MBJ is combined with RIDE [38], “MBJ + RIDE” achieves the best performance.

4.4 Experiments on deep metric learning

We evaluate MBJ under a popular deep metric learning task (i.e., re-ID). We note that the long-tail problem on this task has been noticed recently, and the competing methods are relatively few. Table 4 compares MBJ with re-ID baseline (CosFace [37]) and a state-of-the-art method (Feature Cloud [17]), from which we draw two observations:

First, under typical long-tailed distribution, MBJ significantly improves the re-ID baseline. When there are only 20 head classes (“H20”), MBJ achieves $+11.1\%$ and $+10.9\%$ mAP on Market-1501 and DukeMTMC-reID, respectively.

Second, MBJ marginally surpasses the recent state-of-the-art, i.e., Feature Cloud [17]. For example, under three long-tailed condition on Market-1501, MBJ achieves $72.6\%$ , $68.8\%$ and $66.7\%$ mAP, which are higher than Feature Cloud by $+3.9\%$ , $+1.5\%$ and $+2.6\%$ , respectively. MBJ obtain the new state-of-the-art performance.

Method	Market-1501						DukeMTMC-reID
	H100		H50		H20		H100		H50		H20
	mAP	R-1	mAP	R-1	mAP	R-1	mAP	R-1	mAP	R-1	mAP	R-1
Baseline	62.8	83.8	60.5	80.7	55.6	78.6	52.6	70.3	48.0	67.7	47.0	66.0
Feature cloud [17]	68.7	86.5	67.3	84.9	64.1	83.2	55.6	74.8	53.1	73.0	52.4	72.7
MBJ	72.6	88.4	68.8	86.2	66.7	84.8	60.8	78.6	56.7	74.4	57.9	75.5

Table 4: Evaluation of MBJ on long-tailed re-ID task. Under each dataset, there are three different long-tailed conditions, i.e., “H100”: 100 head classes. “H50”: 50 head classes. “H20”: 20 head classes. All the tail classes contain only

5

images per class. We report Rank-1 accuracy (R-1) and mAP on Market-1501 and DukeMTMC-reID. Best performance are in bold.

4.5 Ablation study

4.5.1 Modifications on jitter types

Though MBJ shares a uniform underlying mechanism for classification and deep metric learning, it still has some task-specific modifications. For deep image classification, MBJ collects feature jitters while for deep metric learning, MBJ collects weight jitters. We explain such modifications with an ablation study. We implement three different editions of MBJ, i.e., MBJ-W (storing weights jitters), MBJ-F (storing feature jitters) and MBJ-WF (storing both weight and feature jitters). The comparison between these three modifications are shown in Fig. 4.

For deep classification task, we draw two observations from Fig. 4(a). First, compared with the baseline, MBJ-F significantly improves the accuracy, while MBJ-W barely shows any improvement. Second, MBJ-WF is inferior to MBJ-F with $-3.1\%$ . It indicates that adding weight jitters actually deteriorates MBJ-F. So we adopts feature memory to implement MBJ for classification task.

For deep metric learning task, we draw three observations from Fig. 4 (b). First, all editions of MBJ significantly improve the re-ID accuracy over the baseline. Second, MBJ-W is superior than MBJ-F (with +1.1% Rank-1 accuracy). Third, comparing MBJ-W with MBJ-WF, adding MBJ-F to MBJ-W does not bring incremental improvement.

Based on the above investigations, MBJ employs a respective memory type for these two long-tailed visual recognition tasks, i.e., MBJ-F (feature memory) for deep image classification and MBJ-W (prototype memory) for deep metric learning.

4.5.2 Decoupling re-balancing and jitters

When MBJ collects jitters into the memory bank, it has two coupling effects, i.e., a) re-balancing the head and tail distribution and b) accumulating the jitters. On the long-tailed CIFAR-10 (IR 100), we design an ablation study to decouple those two effects. Specifically, we train three different models as follows: 1) Model RR re-balances the raw images by over-sampling the tail classes. 2) Model FR re-balances the feature by over-sampling the tail features in current mini-batch. However, it does NOT collect historical features into the memory bank. In another word, Variant A maintains the re-balancing strategy of MBJ and removes the jitters. 3) Model FR+RJ over-samples the tail features and augments them with Gaussian-distributed disturbance.

Fig. 5 compares these three models with the baseline and MBJ, from which we draw two observations:First, comparing “FR” against “RR” and “Basel.”, we find that re-balancing the features considerably benefits MBJ. Specifically, directly re-balancing the raw data actually brings no obvious improvement over the baseline. It is consistent with the observation in LDAM [4]. According to [4, 48], it is because directly re-sampling the raw data compromises the deep embedding learning. In contrast, re-balancing the features avoids deterioration on the deep embedding and considerably increases the top-1 accuracy by $+2.1\%$ . Second, comparing “MBJ” against “FR” and “FR+RJ”, we find that the accumulated jitters is the dominating reason for the superiority of MBJ. While re-balancing the feature (FR) improves the baseline by $+2.1\%$ top-1 accuracy, accumulating jitters (MBJ) further brings a much larger improvement of $+8.5\%$ accuracy. Moreover, though adding random disturbance (“FR+RJ”) does obtain some degree of feature augmentation as well, its improvement is very limited and is much inferior to the proposed MBJ.

5 Conclusion

This paper proposes Memory-based Jitter (MBJ) to improve long-tailed visual recognition under both deep classification and deep metric learning tasks. The insights behind MBJ are two-fold. First, during training a deep model, the weight vectors and the features keep on changing after each iteration, resulting in the phenomena of (weight and feature) jitters. Second, accumulating these jitters provides extra augmentation for the tail data. Experimental results confirm MBJ significantly improves the baseline and achieves state-of-the-art performance on both deep image classification and deep metric learning.

An interesting observation is that MBJ favors different types of memory, depending on the specified task. For deep image classification, it favors the feature memory, while for deep metric learning, it favors the prototype memory. The underlying reasons for such preference are remained to be explored in our future work.

References

[1] Mateusz Buda, Atsuto Maki, and Maciej A Mazurowski. A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks, 106:249–259, 2018.
[2] Mateusz Buda, Atsuto Maki, and Maciej A Mazurowski. A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks, 106:249–259, 2018.
[3] Jonathon Byrd and Zachary Lipton. What is the effect of importance weighting in deep learning? In International Conference on Machine Learning, pages 872–881, 2019.
[4] Kaidi Cao, Colin Wei, Adrien Gaidon, Nikos Arechiga, and Tengyu Ma. Learning imbalanced datasets with label-distribution-aware margin loss. In Advances in Neural Information Processing Systems, pages 1565–1576, 2019.
[5] Peng Chu, Xiao Bian, Shaopeng Liu, and Haibin Ling. Feature space augmentation for long-tailed data. In European Conf. on Computer Vision (ECCV), 2020.
[6] Yin Cui, Menglin Jia, Tsung-Yi Lin, Yang Song, and Serge Belongie. Class-balanced loss based on effective number of samples. In CVPR, pages 9268–9277, 2019.
[7] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338, 2010.
[8] Yandong Guo, Lei Zhang, Yuxiao Hu, Xiaodong He, and Jianfeng Gao. Ms-celeb-1m: A dataset and benchmark for large-scale face recognition. In European conference on computer vision, pages 87–102. Springer, 2016.
[9] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722, 2019.
[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[11] Chen Huang, Yining Li, Chen Change Loy, and Xiaoou Tang. Learning deep representation for imbalanced classification. In CVPR, pages 5375–5384, 2016.
[12] Bingyi Kang, Saining Xie, Marcus Rohrbach, Zhicheng Yan, Albert Gordo, Jiashi Feng, and Yannis Kalantidis. Decoupling representation and classifier for long-tailed recognition. arXiv preprint arXiv:1910.09217, 2019.
[13] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
[14] Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242, 2016.
[15] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017.
[16] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
[17] Jialun Liu, Yifan Sun, Chuchu Han, Zhaopeng Dou, and Wenhui Li. Deep representation learning on long-tailed data: A learnable embedding augmentation perspective. arXiv preprint arXiv:2002.10826, 2020.
[18] Xiaofeng Liu, B. V. K. Vijaya Kumar, Jane You, and Ping Jia. Adaptive deep metric learning for identity-aware facial expression recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, July 2017.
[19] Ziwei Liu, Zhongqi Miao, Xiaohang Zhan, Jiayun Wang, Boqing Gong, and Stella X Yu. Large-scale long-tailed recognition in an open world. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2537–2546, 2019.
[20] Hyun Oh Song, Yu Xiang, Stefanie Jegelka, and Silvio Savarese. Deep metric learning via lifted structured feature embedding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4004–4012, 2016.
[21] Qi Qian, Lei Shang, Baigui Sun, Juhua Hu, Hao Li, and Rong Jin. Softtriple loss: Deep metric learning without triplet sampling. In Proceedings of the IEEE International Conference on Computer Vision, pages 6450–6458, 2019.
[22] Ergys Ristani, Francesco Solera, Roger Zou, Rita Cucchiara, and Carlo Tomasi. Performance measures and a data set for multi-target, multi-camera tracking. In European Conference on Computer Vision, pages 17–35. Springer, 2016.
[23] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta-learning with memory-augmented neural networks. In International conference on machine learning, pages 1842–1850, 2016.
[24] Li Shen, Zhouchen Lin, and Qingming Huang. Relay backpropagation for effective learning of deep convolutional neural networks. In ECCV, pages 467–482, 2016.
[25] Kihyuk Sohn. Improved deep metric learning with multi-class n-pair loss objective. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 1857–1865. Curran Associates, Inc., 2016.
[26] Yifan Sun, Changmao Cheng, Yuhan Zhang, Chi Zhang, Liang Zheng, Zhongdao Wang, and Yichen Wei. Circle loss: A unified perspective of pair similarity optimization. arXiv preprint arXiv:2002.10857, 2020.
[27] Yifan Sun, Liang Zheng, Weijian Deng, and Shengjin Wang. Svdnet for pedestrian retrieval. In Proceedings of the IEEE International Conference on Computer Vision, pages 3800–3808, 2017.
[28] Yifan Sun, Liang Zheng, Yi Yang, Qi Tian, and Shengjin Wang. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In Proceedings of the European Conference on Computer Vision (ECCV), pages 480–496, 2018.
[29] Yifan Sun, Liang Zheng, Yi Yang, Qi Tian, and Shengjin Wang. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In Proceedings of the European Conference on Computer Vision (ECCV), pages 480–496, 2018.
[30] Kaihua Tang, Jianqiang Huang, and Hanwang Zhang. Long-tailed classification by keeping the good and removing the bad momentum causal effect. arXiv preprint arXiv:2009.12991, 2020.
[31] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in neural information processing systems, pages 1195–1204, 2017.
[32] Bart Thomee, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. Yfcc100m: The new data in multimedia research. Commun. ACM, 59(2):64–73, Jan. 2016.
[33] Laurens Van Der Maaten. Accelerating t-sne using tree-based algorithms. The Journal of Machine Learning Research, 15(1):3221–3245, 2014.
[34] Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The inaturalist species classification and detection dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8769–8778, 2018.
[35] Grant Van Horn and Pietro Perona. The devil is in the tails: Fine-grained classification in the wild. arXiv preprint arXiv:1709.01450, 2017.
[36] Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Dihong Gong, Jingchao Zhou, Zhifeng Li, and Wei Liu. Cosface: Large margin cosine loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5265–5274, 2018.
[37] Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Dihong Gong, Jingchao Zhou, Zhifeng Li, and Wei Liu. Cosface: Large margin cosine loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5265–5274, 2018.
[38] Xudong Wang, Long Lian, Zhongqi Miao, Ziwei Liu, and Stella X Yu. Long-tailed recognition by routing diverse distribution-aware experts. arXiv preprint arXiv:2010.01809, 2020.
[39] Xun Wang, Haozhi Zhang, Weilin Huang, and Matthew R Scott. Cross-batch memory for embedding learning. arXiv preprint arXiv:1912.06798, 2019.
[40] Yu-Xiong Wang, Deva Ramanan, and Martial Hebert. Learning to model the tail. In NeurIPS, pages 7029–7039, 2017.
[41] Yuzhe Yang and Zhi Xu. Rethinking the value of labels for improving class-imbalanced learning. arXiv preprint arXiv:2006.07529, 2020.
[42] Xi Yin, Xiang Yu, Kihyuk Sohn, Xiaoming Liu, and Manmohan Chandraker. Feature transfer learning for face recognition with under-represented data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5704–5713, 2019.
[43] Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. Scalable person re-identification: A benchmark. In Proceedings of the IEEE international conference on computer vision, pages 1116–1124, 2015.
[44] Zhedong Zheng, Xiaodong Yang, Zhiding Yu, Liang Zheng, Yi Yang, and Jan Kautz. Joint discriminative and generative learning for person re-identification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2138–2147, 2019.
[45] Zhedong Zheng, Liang Zheng, and Yi Yang. Unlabeled samples generated by gan improve the person re-identification baseline in vitro. In Proceedings of the IEEE International Conference on Computer Vision, pages 3754–3762, 2017.
[46] Q Zhong, C Li, Y Zhang, H Sun, S Yang, D Xie, and S Pu. Towards good practices for recognition & detection. In CVPR workshops, volume 1, 2016.
[47] Zhun Zhong, Liang Zheng, Zhedong Zheng, Shaozi Li, and Yi Yang. Camstyle: A novel data augmentation method for person re-identification. IEEE Transactions on Image Processing, 28(3):1176–1190, 2018.
[48] Boyan Zhou, Quan Cui, Xiu-Shen Wei, and Zhao-Min Chen. Bbn: Bilateral-branch network with cumulative learning for long-tailed visual recognition. arXiv preprint arXiv:1912.02413, 2019.
[49] Linchao Zhu and Yi Yang. Compound memory networks for few-shot video classification. In Proceedings of the European Conference on Computer Vision (ECCV), pages 751–766, 2018.
[50] Linchao Zhu and Yi Yang. Inflated episodic memory with region self-attention for long-tailed visual recognition. 2020.

Memory-based Jitter: Improving Visual Recognition on Long-tailed Data with Diversity In Memory