Meta Clustering Learning for Large-scale Unsupervised Person Re-identification
Abstract.
Unsupervised Person Re-identification (U-ReID) with pseudo labeling has recently reached performance competitive with fully-supervised ReID methods, driven by modern clustering algorithms. However, such clustering-based schemes become computationally prohibitive for large-scale datasets, making them infeasible in real-world applications. How to efficiently leverage endless unlabeled data with limited computing resources for better U-ReID remains under-explored. In this paper, we make the first attempt at large-scale U-ReID and propose a “small data for big task” paradigm dubbed Meta Clustering Learning (MCL). MCL pseudo-labels only a subset of the entire unlabeled data via clustering, saving computation in the first training phase. Afterwards, the learned cluster centroids, termed meta-prototypes in our MCL, are regarded as a proxy annotator to softly annotate the remaining unlabeled data for further polishing the model. To alleviate the potential noisy labeling issue in the polishing phase, we enforce two well-designed loss constraints to ensure intra-identity consistency and strong inter-identity correlation. On multiple widely-used U-ReID benchmarks, our method significantly saves computational cost while achieving comparable or even better performance than prior works.
1. Introduction
Ubiquitous cameras generate innumerable pedestrian data every day. Due to the growing demand for person re-identification (ReID) and its expensive labeling cost, unsupervised person ReID (U-ReID) (Fan et al., 2018; Li et al., 2018; Tang et al., 2019; Qi et al., 2019; Yu et al., 2019; Yang et al., 2019; Ding et al., 2020; Zhai et al., 2020; Jin et al., 2020a, b; Ge et al., 2020b; Dai et al., 2021; Zhuang et al., 2021) has attracted increasing attention recently.

There are mainly two categories of U-ReID. One is unsupervised domain adaptive (UDA) person ReID, which first pre-trains a model on a labeled source dataset and then fine-tunes it on the unlabeled target dataset to reduce the domain gap (Deng et al., 2018; Lin et al., 2018; Wang et al., 2018; Wei et al., 2018; Yu et al., 2019; Zhong et al., 2018, 2019). Albeit effective, the UDA ReID branch typically suffers from a complex adaptation process, and its success also relies on the assumption that the discrepancy between the source and target domains is not significant. This motivates the exploration of the other branch, clustering-based unsupervised ReID (Fan et al., 2018; Fu et al., 2019a; Ge et al., 2020b; Lin et al., 2019; Wang and Zhang, 2020; Dai et al., 2021). As shown in the “Previous work” part of Figure 1, works in this branch perform an iterative optimization process of feature extraction–clustering–training. In this way, all unlabeled data can be explicitly leveraged with the pseudo labels generated by clustering. The focus of recent clustering-based methods lies in creating more reliable clusters and efficiently using them to learn discriminative representations, e.g., with the help of self-similarity grouping (Fu et al., 2019a), a hybrid memory bank with contrastive loss (Ge et al., 2020b) or a cluster-level memory bank (Dai et al., 2021), multi-label classification (Wang and Zhang, 2020), and online hierarchical cluster dynamics (Zeng et al., 2020; Zheng et al., 2021).
However, these methods all neglect an important practical fact: the clustering process consumes intolerable computational resources due to its pair-wise similarity calculation and neighboring-sample searching. Taking DBSCAN (Ester et al., 1996), the most common clustering algorithm in U-ReID, as an example, its worst-case time and space complexity are both $\mathcal{O}(n^2)$. When the size of the unlabeled data is very large (as shown in Figure 1(a)), both the memory and time cost of clustering increase rapidly. For example, performing clustering once on LaST (Shu et al., 2021) (71.2k images) on the GPU following previous works (Ge et al., 2020b; Jin et al., 2020a; Dai et al., 2021) takes up to 22GB of memory, which cannot run on a 16GB Tesla V100. One may ask why not use offline clustering (on CPU) or batch-wise local clustering (e.g., K-means) to avoid the large memory and time cost. This is due to the specificity of ReID: (1) clustering-based ReID needs to iteratively perform feature extraction and clustering in the feature domain (on GPU) (Ge et al., 2020b; Wang and Zhang, 2020; Dai et al., 2021); (2) batch-wise local clustering for ReID is sub-optimal, as it hinders the exploration and utilization of the global relationships among large-scale person data.
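To make this memory pressure concrete, a quick back-of-envelope estimate (ours, not from the paper) of the dense pairwise distance matrix alone, assuming float32 entries, already approaches the reported footprint:

```python
# Back-of-envelope estimate of the pairwise distance matrix alone (float32);
# the reported 22GB also includes re-ranking and clustering work buffers.
n = 71_200                                  # approx. number of LaST training images
dist_matrix_bytes = n * n * 4               # one n-by-n float32 distance matrix
print(f"{dist_matrix_bytes / 1e9:.1f} GB")  # -> ~20.3 GB before any extra buffers
```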
In this paper, we attempt to build a large-scale unsupervised ReID framework that takes the computational cost into account, which is challenging but valuable for bridging the gap between ReID algorithms and practical applications. To this end, we propose a “small data for big task” paradigm dubbed Meta Clustering Learning (MCL). Inspired by the concept of Meta Learning (Vilalta and Drissi, 2002; Vanschoren, 2018, 2019), which is designed for ‘learning to learn’ with the assistance of meta knowledge, our MCL first obtains the meta knowledge on a part of the unlabeled person data and then softly extends this knowledge to the remaining unlabeled data. Therefore, it naturally avoids clustering the whole dataset before each training epoch and thus reduces the computation overhead. In addition, during the knowledge extension process, MCL further leverages a clustering-free polishing step to enhance discriminative representation learning while alleviating the noisy label issue for the ReID model.
As illustrated in Figure 1, MCL consists of two phases: meta-prototype optimization and prototype-referenced polishment (see Sec. 3.1, 3.2 for details). In the first phase, features are extracted from only a portion of the unlabeled images; this ratio can be flexibly determined according to the computing power of the practical environment, as a by-product of MCL. Then, a clustering algorithm, such as DBSCAN (Ester et al., 1996), is used to cluster the features and generate pseudo ID labels. Based on them, the ReID model is trained with a memory-based optimization strategy (Ge et al., 2020b; Wang and Zhang, 2020; Dai et al., 2021). Meanwhile, the cluster centroids (termed meta-prototypes) are stored in the memory and updated on the fly in a momentum manner (He et al., 2020).
The second phase, prototype-referenced polishment, is based on the meta-prototypes learned in the previous phase, which are taken as a proxy annotator to mine the potential label information of the remaining unlabeled data. For each unused person image, we obtain a soft real-valued label likelihood vector by comparing it with the meta-prototype references. Based on such clustering-free pseudo labels, we further polish the model by mining the relative comparative characteristics among person images. We call this phase “polishment” because “polish” carries the meaning of perfecting one’s skill; here, we promote discriminative feature learning for the ReID model with the remaining unlabeled data.
Another point worth noting is that pseudo labeling itself, whether clustering-based or reference-based, may generate wrong label predictions (Wang and Zhang, 2020; Jin et al., 2020a; Wu et al., 2021; Ge et al., 2021). As shown in Figure 1(b), the larger the unlabeled dataset, the more likely noisy labels are generated. To alleviate this, we further leverage two loss constraints for label denoising in MCL. One loss enforces instance-level consistency to reduce intra-identity variance, and the other constructs a soft-weighted triplet constraint to ensure inter-identity correlation. In this way, MCL can better exploit the discriminative information of the data even with noisy pseudo labels. We summarize our main contributions as follows:
• To our best knowledge, this paper is the first to achieve unsupervised large-scale ReID training while considering computational cost savings. A “small data for big task” paradigm dubbed Meta Clustering Learning (MCL) is proposed. MCL performs clustering-based ReID training on partial unlabeled data, saving computing resources.
• To further leverage the remaining unlabeled data, we take the prototypes learned from the partial data as a proxy annotator to pseudo-label them, and then polish the model based on such pseudo labels with two well-designed losses (as a minor contribution) that ensure intra-identity consistency and strong inter-identity correlation, which helps alleviate the noisy label issue.
• As the first attempt to handle large-scale unsupervised ReID, extensive experiments on multiple benchmarks show that MCL significantly saves computational cost while achieving state-of-the-art performance. In particular, MCL improves mAP by 4.8%/2.9% on the large-scale MSMT17 (Wei et al., 2018)/LaST (Shu et al., 2021), while saving 71.8%/87.9% of memory cost and 73.7%/85.7% of time cost compared to the baselines.
2. Related Work
2.1. Unsupervised Person Re-identification
Unsupervised Domain Adaptive (UDA) ReID. This branch usually utilizes transfer learning, e.g., style translation (Zhu et al., 2017), to reduce the domain gap between the source and target ReID scenarios for adaptation (Wei et al., 2018; Deng et al., 2018; Liu et al., 2019; Dai et al., 2022; Zhang et al., 2022). Its performance is typically inferior to clustering-based approaches, since there is still a gap between style-translated images and realistic person images (Yang et al., 2020; Wang et al., 2020a).
Clustering-based Unsupervised ReID. This branch typically trains the ReID model directly on an unlabeled dataset with clustering-based pseudo label estimation (Song et al., 2020; Fan et al., 2018; Yang et al., 2019; Zhang et al., 2019; Fu et al., 2019a; Yang et al., 2020; Zheng et al., 2022). The cluster labeling and training steps are usually performed alternately until the model is stable. In particular, Lin et al. (Lin et al., 2019) treat each individual sample as a cluster and then gradually group similar samples into one cluster to generate pseudo labels. Jin et al. (Jin et al., 2020a) introduce a global distance-distribution separation constraint to handle sample-wise noisy labels. SPCL (Ge et al., 2020b) proposes a self-paced contrastive learning framework to gradually create more reliable clusters for ReID training while updating a hybrid memory containing both source- and target-domain features. Similarly, ClusterContrast (Dai et al., 2021) further stores feature vectors in a cluster-level memory to alleviate the inconsistent clustering issue. Recently, Isobe et al. (Isobe et al., 2021) introduce cluster-wise contrastive learning (CCL), progressive domain adaptation (PDA), and Fourier augmentation (FA), while ICE (Chen et al., 2021) introduces inter-instance contrastive encoding to boost existing class-level contrastive ReID methods. However, all these methods focus on how to obtain more reliable pseudo labels or how to better leverage them for discriminative feature learning, while the important issue of computational cost remains under-explored. Besides, the noisy pseudo label issue in these methods has not yet been well addressed.
Reference-based Pseudo Labeling in ReID. Existing representative works, such as MAR (Yu et al., 2019), MMCL (Wang and Zhang, 2020), MPRD (Ji et al., 2021), and SSL (Lin et al., 2020), either use labeled source data for pseudo label generation or assign each unlabeled person image a multi-class/softened label via pairwise similarity computation. Differently, our MCL creates clustering-free soft pseudo labels with reference to online-updated meta-prototypes stored in the memory. This design is more efficient because it does not need a labeled source dataset as reference, and the meta-prototypes (acting like an FC layer) can directly infer real-valued labels instead of repeating pairwise comparisons. Moreover, more reliable meta-prototypes encourage more accurate pseudo labeling (and thus more effective unsupervised training), and vice versa. They promote each other, achieving a win-win effect.
2.2. Self-supervised Representation Learning
MCL is also related to self-supervised representation learning (SSL). Based on the contrastive learning framework, SSL has achieved great success (Ren et al., 2022; Feng et al., 2022), e.g., MoCo (He et al., 2020), MoCov2 (Chen et al., 2020a), SimCLR (Chen et al., 2020b), SimCLRv2 (Chen et al., 2020c), BYOL (Grill et al., 2020), and SimSiam (Chen and He, 2021). Their main idea is to match the same instance across different augmented views, which typically relies on a large number of explicit pairwise feature comparisons and faces a computational challenge. Besides, these instance-wise SSL methods cannot directly address the fine-grained unsupervised ReID problem (they can only be used for pre-training/initialization (Fu et al., 2021; Yang et al., 2022; Fu et al., 2022)), because ReID needs cluster priors to mine fine-grained discriminative clues.
3. Meta Clustering Learning (MCL)
Overview. To tackle the computing challenge in large-scale U-ReID, we propose Meta Clustering Learning (MCL), a unified episodic training framework comprising two phases: meta-prototype optimization (Figure 2) and prototype-referenced polishment (Figure 3). MCL alternates between these two phases: (1) group the partial unlabeled data into clusters and store the learned meta-prototypes, while training the model with a cluster-level contrastive loss (Section 3.1); (2) use the meta-prototypes as reference to annotate the remaining unlabeled samples for further fine-tuning, with two loss constraints enforced to ensure intra-identity consistency and inter-identity correlation (Section 3.2).
Given an unlabeled dataset, MCL first uniformly splits it into several subsets and then randomly selects one as the meta-training subset for meta-prototype optimization; the remaining subsets are used for prototype-referenced polishment. This split is performed before each training epoch.
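A minimal sketch of this epoch-wise split (our illustration; the function name, the index-based implementation, and the 4-way split in the usage line are assumptions, not the paper's code):

```python
import numpy as np

def epoch_split(num_samples: int, num_subsets: int, seed: int):
    """Randomly and uniformly split sample indices into `num_subsets` parts.

    Re-run before every epoch so the meta-training subset changes over time.
    Returns one subset for meta-prototype optimization (phase 1) and the
    remaining subsets for prototype-referenced polishment (phase 2).
    """
    rng = np.random.default_rng(seed)
    subsets = np.array_split(rng.permutation(num_samples), num_subsets)
    meta_id = int(rng.integers(num_subsets))        # randomly pick the meta-training subset
    rest = [s for i, s in enumerate(subsets) if i != meta_id]
    return subsets[meta_id], rest

# e.g., a 4-way split clusters roughly 25% of the data each epoch
meta_idx, rest_subsets = epoch_split(num_samples=32_621, num_subsets=4, seed=0)
```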
3.1. Phase 1: Meta-prototype Optimization
MCL reduces the clustering cost by using only the meta-training subset.
Feature Extraction and Clustering. As shown in Figure 2, a network (e.g., ResNet-50 (He et al., 2016), initialized with weights pre-trained on ImageNet (Ge et al., 2020a, b; Dai et al., 2021; Isobe et al., 2021)) is taken as the backbone to extract features from the meta-training subset. Then, DBSCAN (Ester et al., 1996) is used to cluster these features (unclustered outliers are discarded (Chen et al., 2021; Dai et al., 2021)). The resulting cluster IDs are assigned to the unlabeled samples as pseudo labels for training.
Query Setup and Meta-prototype Initialization. After obtaining clustered pseudo labels for the meta-training subset, we sample $P$ person identities and $Z$ instances for each identity to set up a mini-batch of size $P \times Z$. Different from works (Song et al., 2020; Jin et al., 2020a) that directly use instance-wise loss constraints (e.g., the triplet loss (Hermans et al., 2017)) for training, we take each batch as a query set and employ memory-dictionary-based contrastive learning (Ge et al., 2020b; Dai et al., 2021; Isobe et al., 2021) for optimization.
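A small sketch of this identity-balanced query-batch construction (our illustration; $P$=16 and $Z$=16 follow the implementation details in Sec. 4.1, while the helper name and the sampling-with-replacement fallback for small clusters are assumptions):

```python
import random
from collections import defaultdict

def sample_query_batch(pseudo_labels, num_ids=16, num_instances=16, seed=0):
    """Build one query mini-batch of P pseudo identities x Z instances (16 x 16 = 256).

    pseudo_labels: per-image cluster IDs from DBSCAN; -1 marks discarded outliers.
    Returns a list of image indices forming the batch.
    """
    random.seed(seed)
    by_id = defaultdict(list)
    for idx, pid in enumerate(pseudo_labels):
        if pid >= 0:                                   # skip un-clustered outliers
            by_id[pid].append(idx)
    chosen_ids = random.sample(sorted(by_id), num_ids)
    batch = []
    for pid in chosen_ids:
        batch += random.choices(by_id[pid], k=num_instances)  # replace if a cluster is small
    return batch
```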
We maintain a group of $K$ learnable meta-prototypes $\{c_1, \dots, c_K\}$ stored in the memory dictionary. Here, $K$ equals the number of clusters, which keeps changing during training. Particularly, the clustering algorithm (e.g., DBSCAN) is performed before each training epoch, and the epoch-wise meta-prototypes are then initialized with the mean feature vector of each cluster, i.e., $c_k = \frac{1}{|\mathcal{I}_k|}\sum_{f_i \in \mathcal{I}_k} f_i$, where $\mathcal{I}_k$ denotes the set of all feature vectors within cluster $k$, $f_i$ is the $i$-th feature vector in that set, and $|\mathcal{I}_k|$ denotes the number of features in the set.
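A sketch of this initialization (our illustration, assuming L2-normalized features; re-normalizing each centroid and the function name are our assumptions):

```python
import torch
import torch.nn.functional as F

def init_meta_prototypes(features: torch.Tensor, cluster_labels: torch.Tensor) -> torch.Tensor:
    """Initialize each meta-prototype as the mean feature (centroid) of its cluster.

    features:       (N, D) L2-normalized features of the meta-training subset
    cluster_labels: (N,)   DBSCAN pseudo IDs in {0..K-1}; -1 marks discarded outliers
    returns:        (K, D) meta-prototypes stored in the memory dictionary
    """
    valid = cluster_labels >= 0
    feats, labels = features[valid], cluster_labels[valid]
    prototypes = [F.normalize(feats[labels == k].mean(dim=0), dim=0)
                  for k in labels.unique(sorted=True)]
    return torch.stack(prototypes)
```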
Meta-prototype Update and Model Optimization. At each iteration of an epoch, the encoded feature vectors of the query images in each mini-batch are involved in the meta-prototype update. With momentum updating (He et al., 2020), the $k$-th cluster prototype $c_k$ is updated with the mean of the encoded query features belonging to class $k$,

$c_k \leftarrow m \cdot c_k + (1 - m) \cdot \frac{1}{|\mathcal{B}_k^t|} \sum_{q_i \in \mathcal{B}_k^t} q_i$   (1)

where $\mathcal{B}_k^t$ denotes the set of query feature vectors belonging to class $k$ in the mini-batch at the $t$-th iteration, and $m$ is a momentum coefficient, empirically set following (Ge et al., 2020b; Dai et al., 2021). The learned meta-prototypes are used for model optimization together with the query samples in this phase, and also play the role of a proxy annotator (see the ‘robot’ in Figures 2, 3) for the remaining unlabeled subsets in the next phase.
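A minimal sketch of the momentum update in Eq. (1) (our illustration; m=0.2 and the per-prototype re-normalization are assumptions chosen for the sketch, not values confirmed by the text above):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def momentum_update(prototypes, queries, labels, m=0.2):
    """Eq. (1): c_k <- m * c_k + (1 - m) * mean of the query features of class k.

    prototypes: (K, D) memory-dictionary meta-prototypes, updated in place
    queries:    (B, D) encoded query features of the current mini-batch
    labels:     (B,)   pseudo IDs of the queries
    """
    for k in labels.unique():
        q_mean = queries[labels == k].mean(dim=0)
        prototypes[k] = F.normalize(m * prototypes[k] + (1.0 - m) * q_mean, dim=0)
    return prototypes
```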
With respect to the loss function in the first phase, we use a general contrastive loss (Ge et al., 2020b; Dai et al., 2021) for model optimization. Basically, given a query instance $q$, we compare it to all meta-prototypes using the InfoNCE loss (Oord et al., 2018):

$\mathcal{L}_{q} = -\log \dfrac{\exp(q \cdot c_{+} / \tau)}{\sum_{k=1}^{K} \exp(q \cdot c_{k} / \tau)}$   (2)

where $c_{+}$ is the prototype vector of the positive cluster for query instance $q$, and $\tau$ is a temperature hyper-parameter as in (Wu et al., 2018).
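A compact sketch of Eq. (2) (our illustration; it assumes L2-normalized features and a temperature of 0.05, neither of which is specified in the text above):

```python
import torch
import torch.nn.functional as F

def cluster_contrastive_loss(queries, labels, prototypes, tau=0.05):
    """InfoNCE of Eq. (2): pull each query toward its positive meta-prototype
    and push it away from all the others.

    queries:    (B, D) encoded query features (assumed L2-normalized)
    labels:     (B,)   pseudo IDs indexing rows of `prototypes`
    prototypes: (K, D) meta-prototypes from the memory dictionary
    """
    logits = queries @ prototypes.t() / tau      # (B, K) similarities to all prototypes
    return F.cross_entropy(logits, labels)       # equals -log softmax at the positive prototype
```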
3.2. Phase 2: Prototype-referenced Polishment
To save the clustering cost, the first meta-prototype optimization phase uses only a part of the unlabeled data, i.e., the meta-training subset. The remaining unlabeled subsets are leveraged in a clustering-free manner in this second phase. Basically, we take the meta-prototypes learned in the first phase as a proxy annotator to softly mine discriminative information from the remaining unlabeled data for model polishment. This two-phase training is equivalent to traversing the entire dataset once, i.e., one epoch.
Prototype-referenced Labeling. For clarity, we take one unused, unlabeled subset as an example. Given such a subset where each image $x$ is an unlabeled person image with feature $f$, the learned meta-prototypes $\{c_1, \dots, c_K\}$ define its pseudo label space as the $K$ prototype IDs. As shown by the ‘robot’ in Figure 3, the meta-prototype set is taken as a proxy annotator, and a soft real-valued pseudo label $\tilde{y}$ can be assigned to $x$ by comparing $f$ with the reference agents $\{c_k\}$. This soft prototype-referenced pseudo labeling process is,

$\tilde{y} = \phi(f) = \mathrm{softmax}\big(f^{\top}[c_1, c_2, \dots, c_K]\big)$   (3)

where $\phi(\cdot)$ denotes the soft pseudo labeling function. This function is epoch-wise and acts like a dimension-variable FC layer (i.e., a dot-product). $\tilde{y}^{(k)}$ is the $k$-th entry of $\tilde{y}$. All dimensions of $\tilde{y}$ add up to 1, and each dimension represents the label likelihood w.r.t. a reference prototype person ID. Different from vanilla reference-based pseudo labeling (Yu et al., 2019) or using a global classifier for labeling, our prototype-referenced labeling allows the epoch-wise, dimension-variable proxy annotator to be updated on the fly. More reliable meta-prototypes encourage more accurate labeling and thus more effective optimization, and vice versa. Besides, since the clustering results differ across epochs in ReID ($K$ keeps changing), an immutable global classifier is infeasible.
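A sketch of the labeling step in Eq. (3) (our illustration; the softmax temperature and the feature normalization are assumptions, and the function name is hypothetical):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def prototype_referenced_label(features, prototypes, tau=0.05):
    """Eq. (3): soft pseudo labels for images from the un-clustered subsets.

    The meta-prototypes act as an epoch-wise, dimension-variable 'FC layer':
    each entry of the label is the softmax-normalized similarity to one
    prototype ID, so every row sums to 1.

    features:   (B, D) features of unused/unlabeled images
    prototypes: (K, D) meta-prototypes learned in phase 1
    returns:    (B, K) soft real-valued label vectors
    """
    sims = F.normalize(features, dim=1) @ F.normalize(prototypes, dim=1).t()
    return F.softmax(sims / tau, dim=1)
```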
Last but not least, as shown in Figure 3, the label spaces for different subsets are manually designed to be non-overlapping, for two reasons: 1) the learned meta-prototypes can only cover a limited number of person identities ($K$ IDs) of the entire dataset, especially when the meta-training subset is very small; 2) assigning different subsets different label spaces increases identity diversity, which simulates the real scenario where each person has multiple images from different views (see Sec. 4.4 in the experiments).
Polish the Model with Soft Pseudo Labels. An intuitive solution is to ‘harden’ the obtained soft pseudo labels, i.e., regard the dimension with the largest value as the person ID, $\hat{y} = \arg\max_{k} \tilde{y}^{(k)}$. Based on these hard identity labels, we can construct ReID constraints for training, such as the cross-entropy loss (Sun et al., 2018; Fu et al., 2019b) and the triplet loss (Hermans et al., 2017). In fact, our early attempts along this line failed to deliver good results. We attribute this to the fact that reference-based pseudo labeling itself inevitably introduces some noisy labels. To exploit the merits of prototype-referenced labeling while alleviating the noisy label effect, we additionally introduce two well-designed loss constraints to better leverage these imperfect soft pseudo labels for optimization, considering both the intra-identity consistency between different views of each person and the inter-identity correlations among different persons. They are shown in the two black boxes in Figure 3 and elaborated below:
Siamese Consistency Loss is the first constraint, which ensures consistency within the same identity. As shown in the upper black box in Figure 3, the loss $\mathcal{L}_{sc}$ is built on the “swapped” prediction idea of SwAV (Caron et al., 2020): predict the label of one view from the representation of another view. Given two features $f_1$ and $f_2$ extracted from two different augmentations of the same image, we compute their referenced pseudo labels $\tilde{y}_1$ and $\tilde{y}_2$ following Eq. (3) by matching these features to the learned meta-prototypes:

$\mathcal{L}_{sc} = \ell(f_1, \tilde{y}_2) + \ell(f_2, \tilde{y}_1)$   (4)

where $\ell(f, \tilde{y})$ measures the fit between feature $f$ and soft pseudo label $\tilde{y}$. Intuitively, if the two features capture the same person information, it should be possible to predict each one's label from the other. $\ell$ is the cross-entropy loss between the label $\tilde{y}$ and the probability obtained by taking a softmax of the dot products between $f$ and all prototypes $\{c_k\}$.
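A sketch of the swapped-prediction loss in Eq. (4) (our illustration; the shared temperature and the stop-gradient on the pseudo-label targets are assumptions in the spirit of SwAV, not details confirmed by the text above):

```python
import torch
import torch.nn.functional as F

def soft_cross_entropy(logits, target):
    """Cross-entropy between a soft label vector and softmax-ed prototype logits."""
    return -(target * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

def siamese_consistency_loss(f1, f2, prototypes, tau=0.05):
    """Swapped-prediction consistency of Eq. (4): the soft pseudo label of one
    augmented view supervises the prediction of the other view, and vice versa.

    f1, f2:     (B, D) features of two augmentations of the same images
    prototypes: (K, D) learned meta-prototypes
    """
    p = F.normalize(prototypes, dim=1)
    logits1 = F.normalize(f1, dim=1) @ p.t() / tau     # prototype similarities of view 1
    logits2 = F.normalize(f2, dim=1) @ p.t() / tau     # prototype similarities of view 2
    with torch.no_grad():                              # referenced pseudo labels as targets
        y1, y2 = F.softmax(logits1, dim=1), F.softmax(logits2, dim=1)
    return soft_cross_entropy(logits1, y2) + soft_cross_entropy(logits2, y1)
```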
Soft-weighted Triplet Loss is the second constraint, which softly leverages the relative correlations between identities to construct weighted triplets for optimization. Since the soft pseudo labels are continuous real-valued vectors, a soft-weighted triplet loss is enforced to ensure the correct relative correlation among person identities. Let $(x_a, x_p, x_n)$ be an input triplet (anchor, positive, negative) with corresponding feature embeddings $(f_a, f_p, f_n)$; the soft-weighted triplet loss is given by,

$\mathcal{L}_{swt} = \alpha \, \big( S(f_a, f_p) + S(f_a, f_n) \big) \cdot \big[\, \|f_a - f_p\|_2 - \|f_a - f_n\|_2 + \delta \,\big]_{+}$   (5)

where $\alpha$ and $\delta$ are the loss weighting factor and margin factor ($\delta$ = 0.3 by default), and $S(\cdot, \cdot)$ denotes the similarity between feature vectors, which adaptively alters the magnitude of the triplet loss in a soft manner. In general, when the anchor-positive pair is similar (i.e., $S(f_a, f_p)$ is high), the sample is more confident and reliable. Likewise, when the anchor-negative pair is similar (i.e., $S(f_a, f_n)$ is high), it forms a hard negative example (Schroff et al., 2015). Hence, $\mathcal{L}_{swt}$ gives higher priority and more attention to these reliable and hard cases, so as to alleviate the noisy label issue.
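A sketch of Eq. (5) as reconstructed above (our illustration; the use of cosine similarity for $S(\cdot,\cdot)$ and the default $\alpha$=1.0 are assumptions):

```python
import torch
import torch.nn.functional as F

def soft_weighted_triplet_loss(f_a, f_p, f_n, alpha=1.0, margin=0.3):
    """Soft-weighted triplet loss of Eq. (5), as reconstructed above.

    The standard hinge triplet term is re-weighted by the anchor-positive and
    anchor-negative similarities, so reliable and hard triplets get more attention.
    f_a, f_p, f_n: (B, D) anchor / positive / negative embeddings.
    """
    s_ap = F.cosine_similarity(f_a, f_p, dim=1)        # high -> reliable positive pair
    s_an = F.cosine_similarity(f_a, f_n, dim=1)        # high -> hard negative pair
    weight = alpha * (s_ap + s_an)
    hinge = F.relu((f_a - f_p).norm(dim=1) - (f_a - f_n).norm(dim=1) + margin)
    return (weight * hinge).mean()
```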
Dataset | Style | Train IDs | Train images | Test IDs | Query images | Total images | Cameras |
PersonX (Sun and Zheng, 2019) | Synthetic | 410 | 9,840 | 856 | 5,136 | 45,792 | 6 |
Market-1501 (Zheng et al., 2015) | Real | 751 | 12,936 | 750 | 3,368 | 32,668 | 6 |
MSMT17 (Wei et al., 2018) | Real | 1,041 | 32,621 | 3,060 | 11,659 | 126,441 | 15 |
LaST (Shu et al., 2021) | Real | 5,000 | 70,923 | 5,803 | 10,173 | 228,156 | * |
Methods | PersonX (9.8k imgs, 410 IDs) | Market1501 (12.9k imgs, 751 IDs) | MSMT17 (32.6k imgs, 1041 IDs) | LaST (71.2k imgs, 5000 IDs) | ||||||||||||||||
mAP | Rank1 | M (MB) | T (s) | T (h) | mAP | Rank1 | M (MB) | T (s) | T (h) | mAP | Rank1 | M (MB) | T (s) | T (h) | mAP | Rank1 | M (MB) | T (s) | T (h) | |
All | 88.5 | 95.8 | 822.3 | 30.0 | 2.7 | 83.3 | 93.0 | 876.3 | 34.3 | 2.9 | 33.4 | 62.9 | 6251.5 | 118.3 | 9.3 | 19.8 | 74.0 | 22398.5 | 494.8 | 42.0 |
50% | 79.0 | 93.5 | 412.6 | 13.1 | 2.2 | 82.9 | 92.7 | 348.6 | 10.8 | 2.4 | 38.2 | 66.5 | 1761.3 | 31.1 | 4.6 | 20.0 | 74.9 | 5779.6 | 121.2 | 20.0 |
33% | – | – | – | – | – | 79.6 | 91.9 | 287.5 | 7.0 | 2.2 | 31.5 | 57.4 | 889.1 | 18.2 | 3.8 | 22.7 | 75.0 | 2688.2 | 70.8 | 14.0 |
25% | – | – | – | – | – | 75.4 | 89.3 | 235.1 | 5.4 | 2.0 | 25.9 | 53.4 | 556.9 | 13.4 | 3.0 | 17.2 | 69.0 | 1564.2 | 44.7 | 9.0 |
20% | – | – | – | – | – | 41.3 | 61.3 | 141.4 | 4.6 | 1.9 | 20.1 | 47.4 | 394.6 | 10.3 | 2.6 | 15.8 | 56.0 | 1066.0 | 38.4 | 7.0 |
4. Experiment
4.1. Datasets and Implementation
Datasets and Evaluation. We evaluate the proposed MCL on multiple ReID benchmarks (from small to large scale): PersonX (PX) (Sun and Zheng, 2019), Market1501 (Ma) (Zheng et al., 2015), MSMT17 (MT) (Wei et al., 2018), and the largest public ReID dataset so far, LaST (LS) (Shu et al., 2021). To further show the superiority of MCL in the large-scale setting, we also conduct experiments on mixed datasets, e.g., training on the combined PX+Ma+MT+LS while testing on the unseen test set of MT. Dataset details are shown in Table 1.
Implementation Details. The proposed MCL is generic and can be applied to different clustering-based U-ReID backbones. Here, we re-implement ClusterContrast (Dai et al., 2021) as the baseline, since it has been dominating the leaderboards of multiple benchmarks w.r.t. unsupervised ReID performance, and as a source-free, purely unsupervised ReID pipeline it is considerably more efficient than competitive adaptive (source data needed) U-ReID algorithms such as MMT (Ge et al., 2020a), SpCL (Ge et al., 2020b), and HCD (Zheng et al., 2021). ResNet-50 (He et al., 2016) is adopted as the backbone of the feature extractor, and the model is initialized with parameters pre-trained on ImageNet (Deng et al., 2009).
At the beginning of MCL training, we first train the ReID model with only the first phase of meta-prototype optimization (skipping the second phase of prototype-referenced polishment), which warms up the meta-prototype learning, similar to FC-layer warm-up (He et al., 2016; Ro et al., 2019), so that the pseudo labeling is reasonable before model polishing starts. This warm-up lasts for 5 epochs on PersonX, and 10 epochs on Market1501, MSMT17, and LaST. For image size, the input is resized to 256×128 (height×width) for all person datasets. For data augmentation, we perform random horizontal flipping, padding with 10 pixels, random cropping, and random erasing (Zhong et al., 2020). For batch size, each mini-batch contains 256 images of 16 pseudo person identities (16 instances per person). During training, we adopt the Adam optimizer with a weight decay of 5e-4. The initial learning rate is set to 3.5e-4 and is decayed by a factor of 0.1 every 20 epochs, for a total of 60 epochs. Following (Ge et al., 2020b; Dai et al., 2021), we use DBSCAN with the Jaccard distance (Zhong et al., 2017) computed over k-nearest neighbors, where k=30. For DBSCAN, the maximum distance between two samples is experimentally set to 0.4 for Market1501 and 0.7 for the other datasets, and the minimal number of neighbors for a core point is set to 4 for all datasets.
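A minimal clustering call under these settings (our illustration; it assumes the k-reciprocal Jaccard distance matrix has already been computed and uses scikit-learn's DBSCAN rather than the authors' implementation):

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_pseudo_labels(jaccard_dist: np.ndarray, eps: float = 0.7, min_samples: int = 4):
    """Cluster the meta-training subset on a pre-computed Jaccard distance matrix.

    jaccard_dist: (N, N) k-reciprocal Jaccard distances (k = 30 neighbors)
    eps:          maximum distance between two samples (0.4 for Market1501,
                  0.7 for the other datasets, as described above)
    returns:      (N,) pseudo IDs; -1 marks outliers, which are discarded
    """
    return DBSCAN(eps=eps, min_samples=min_samples,
                  metric="precomputed").fit_predict(jaccard_dist)
```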
4.2. Effectiveness and Necessity of MCL
Memory&Time Cost vs. U-ReID Performance. Table 7 shows the U-ReID performance obtained by using randomly selected subsets of size 50%, 33%, 25%, and 20% of all unlabeled data as the meta-training set, versus directly conducting clustering over the full data (All). We observe that using partial data for clustering with MCL effectively saves both memory and time. For example, the MCL, 50% scheme achieves nearly the same ReID performance while saving over 50% of the memory and time cost; such savings are particularly obvious on the large datasets: 1761.3MB/31.1s vs. 6251.5MB/118.3s on MSMT17 and 5779.6MB/121.2s vs. 22398.5MB/494.8s on LaST. However, we also observe a noticeable drop in mAP/Rank1, especially on the small dataset PersonX (‘blue’ in Table 7). We attribute this to the noisy label issue being amplified on small datasets.
Interestingly, in contrast to the trend of memory&time saving vs. ReID accuracy reduction, we find the opposite trend of mAP/Rank1 improvements on the two largest datasets, MSMT17 and LaST (‘red’ in Table 7). This reveals that the larger the unlabeled dataset, the more advantageous our method becomes. We attribute such gains to two aspects: 1) clustering on less meta-training data yields more reliable clustering results; 2) the prototype-referenced polishment with intra- and inter-identity constraints promotes discriminative ReID feature learning.
Necessity of MCL. One may think of directly splitting a large-scale dataset into multiple small subsets and performing clustering-based U-ReID on them sequentially; this is the most straightforward solution to the computational issue we focus on. To study its feasibility, we deliberately design a scheme named Naive Splitting Training, where multiple subsets taken from one single large dataset are sequentially used for clustering→labeling→training. Naive Splitting Training can also save memory cost thanks to its subset-wise clustering, but it inadvertently enlarges the negative effects of time consumption and noisy labeling. As shown in Table 3, the two Naive Splitting Training schemes using 50%/25% subsets as the training unit are inferior to MCL by 16.0%/9.7% in mAP on MSMT17, which reveals two facts: 1) naively splitting the holistic large-scale dataset for sequential training is sub-optimal, and 2) MCL is necessary and superior.
Methods | Market1501 | MSMT17 | |||
mAP | Rank1 | mAP | Rank1 | ||
50% | Naive Splitting Training | 73.6 | 82.2 | 22.2 | 43.9 |
MCL | 82.9 (↑9.3) | 92.7 (↑10.5) | 38.2 (↑16.0) | 66.5 (↑22.6) | |
25% | Naive Splitting Training | 68.4 | 74.1 | 16.2 | 36.5 |
MCL | 75.4 (↑7.0) | 89.3 (↑15.2) | 25.9 (↑9.7) | 53.4 (↑16.9) |
4.3. Study on Mixed Large-scale Datasets
As discussed in Sec. 4.2 and Table 7, the larger the unlabeled dataset, the more advantageous MCL becomes. To fully study this point, we further construct two mixed large-scale training datasets, PX+Ma+MT and PX+Ma+MT+LS, and evaluate models on the unseen test set of MSMT17 (MT). Note that we originally planned to perform this group of experiments on even larger realistic ReID datasets, but this is limited by the fact that most large-scale realistic ReID datasets (e.g., Person30K (Bai et al., 2021), FastHuman (He et al., 2021)) have not been fully released. As shown in Table 4, we can make three observations: 1) the All scheme on the largest dataset PX+Ma+MT+LS fails to be directly clustered/trained due to computing pressure; 2) MCL outperforms All by 2.3% in mAP under the 50% setting on PX+Ma+MT→MT; 3) MCL performs better on PX+Ma+MT than on PX+Ma+MT+LS, which may be due to the style/domain gap between LaST (Shu et al., 2021) and the other ReID datasets.
Train Datasets | Methods | Test: MT | |||
mAP | Rank1 | M (MB) | T (s) | ||
MT | All | 33.4 | 62.9 | 6251.5 | 118.3 |
PX+Ma+MT | All | 29.6 | 56.3 | 29788.7 | 244.8 |
MCL, 50% | 31.9 | 59.3 | 7698.3 | 82.6 | |
MCL, 25% | 23.1 | 49.6 | 1399.0 | 35.9 | |
PX+Ma+MT+LS | All | – | – | – | – |
MCL, 50% | 25.5 | 49.9 | 21207.8 | 323.3 | |
MCL, 25% | 17.1 | 39.5 | 5126.9 | 107.5 |
4.4. Ablation Study
Influence of Loss Constraints. We study the effectiveness of the proposed siamese consistency loss $\mathcal{L}_{sc}$ and soft-weighted triplet loss $\mathcal{L}_{swt}$ in the table below. We see that MCL outperforms MCL w/o $\mathcal{L}_{sc}$ by 4.0%/3.2% in mAP under the 50%/25% settings on Market1501. When the soft-weighted triplet loss is replaced with the basic triplet loss (Hermans et al., 2017), the resulting scheme MCL w/o $\mathcal{L}_{swt}$ is inferior to MCL by 5.6%/4.4% in mAP under the 50%/25% settings on Market1501. These two constraints facilitate pseudo label denoising by ensuring intra-identity consistency and inter-identity correlation. In addition, they are complementary and both vital to MCL, jointly resulting in superior performance.
Methods | Market1501 | MSMT17 | |||
mAP | Rank1 | mAP | Rank1 | ||
50% | MCL w/o $\mathcal{L}_{sc}$ | 78.9 | 88.8 | 35.1 | 62.8
MCL w/o $\mathcal{L}_{swt}$ | 77.3 | 86.4 | 33.4 | 60.7 |
MCL | 82.9 | 92.7 | 38.2 | 66.5 | |
25% | MCL w/o $\mathcal{L}_{sc}$ | 72.2 | 84.8 | 20.4 | 44.5
MCL w/o $\mathcal{L}_{swt}$ | 71.0 | 84.1 | 17.6 | 38.1 |
MCL | 75.4 | 89.3 | 25.9 | 53.4 |
Methods | Market1501 | MSMT17 | |||
mAP | Rank1 | mAP | Rank1 | ||
50% | MCL_fixed | 71.5 | 87.8 | 21.2 | 44.0 |
MCL_same | 80.8 | 92.0 | 36.5 | 63.8 | |
MCL | 82.9 | 92.7 | 38.2 | 66.5 | |
25% | MCL_fixed | 37.8 | 61.0 | 20.0 | 43.7 |
MCL_same | 59.8 | 80.3 | 14.4 | 34.3 | |
MCL | 75.4 | 89.3 | 25.9 | 53.4 |
Influence of Data Split. As described at the beginning of Sec. 3, given an unlabeled dataset, we use a random and uniform split strategy to divide the samples into the meta-training subset and the remaining subsets. This split is performed before each training epoch, and the label spaces of different subsets have the same size but do not overlap. Here we study the influence of different split designs. In the table above, MCL_fixed means that we conduct the data split only once at the beginning and fix it during training, and MCL_same means that all subsets share the same label space. We observe that MCL_fixed is inferior to MCL by 11.4%/37.6% in mAP under 50%/25% on Market1501, and MCL_same is inferior to MCL by 15.6%/11.5% in mAP under 25% on Market1501/MSMT17. We analyze that: 1) re-splitting the dataset before each epoch plays a role of data re-organization, similar to the mechanism behind cross-validation (Kohavi et al., 1995), which avoids over-fitting and extreme cases and increases the robustness of MCL; 2) non-overlapping label spaces increase the diversity of training data, like a data augmentation, promoting discriminative ReID representation learning. This design brings obvious improvements especially when less meta-training data is used: for example, MCL outperforms MCL_same by only 2.1%/1.7% in mAP under 50%, but by 15.6%/11.5% under 25%. More analytic and ablated results (including limitation discussions) are presented in the Supplementary.
4.5. Visual Results and Insights
Visualization on Pseudo Labeling. To further show that the proposed prototype-referenced labeling in MCL is superior to general clustering, we compare these two pseudo-labeling methods by showing images assigned the same pseudo label (i.e., positive pairs) in Figure 4. Our scheme is MCL, 50%. We observe that: 1) for general clustering, the grouped entries share a globally similar visual appearance, which is not reliable enough. For example, in the left-most pair of Figure 4, the two women are dressed very similarly; the only local discriminative clue is that they carry different items in their hands. 2) The proposed meta-prototype-referenced labeling is capable of discovering fine-grained discriminative clues (bottom of Figure 4) thanks to the use of relative comparative characteristics among samples. This also partly explains why MCL outperforms the baseline All scheme even with less data for clustering.
Methods | Market1501 | ||||
source | mAP | Rank1 | Rank5 | Rank10 | |
BUC (Lin et al., 2019) | None | 38.3 | 66.2 | 79.6 | 84.5 |
UGA (Wu et al., 2019) | None | 70.3 | 87.2 | - | - |
SSL (Lin et al., 2020) | None | 37.8 | 71.7 | 83.8 | 87.4 |
MMCL (Wang and Zhang, 2020) | None | 45.5 | 80.3 | 89.4 | 92.3 |
HCT (Zeng et al., 2020) | None | 56.4 | 80.0 | 91.6 | 95.2 |
DG-Net (Zou et al., 2020) | MT | 64.6 | 83.1 | 91.5 | 94.3 |
CycAs (Wang et al., 2020b) | None | 64.8 | 84.8 | - | - |
MMT (Ge et al., 2020a) | MT | 75.6 | 89.3 | 95.8 | 97.5 |
SPCL (Ge et al., 2020b) | None | 73.1 | 88.1 | 95.1 | 97.0 |
SPCL (Ge et al., 2020b) | MT | 77.5 | 89.7 | 96.1 | 97.6 |
MPRD (Ji et al., 2021) | None | 51.1 | 83.0 | 91.3 | 93.6 |
ICE (Chen et al., 2021) | None | 79.5 | 92.0 | 97.0 | 98.1 |
HCD (Zheng et al., 2021) | MT | 80.2 | 91.4 | - | - |
Cluster (Dai et al., 2021) | None | 82.6 | 93.0 | 97.0 | 98.1 |
MCL, 50% | None | 82.9 | 92.7 | 97.6 | 98.7 |
Methods | MSMT17 | ||||
source | mAP | Rank1 | Rank5 | Rank10 | |
TAUDL (Li et al., 2018) | None | 12.5 | 28.4 | - | - |
MMCL (Wang and Zhang, 2020) | None | 11.2 | 35.4 | 44.8 | 49.8 |
UTAL (Li et al., 2019) | None | 13.1 | 31.4 | - | - |
UGA (Wu et al., 2019) | None | 21.7 | 49.5 | - | - |
MMT (Ge et al., 2020a) | Ma | 24.0 | 50.1 | 63.5 | 69.3 |
CycAs (Wang et al., 2020b) | None | 26.7 | 50.1 | - | - |
SPCL (Ge et al., 2020b) | None | 19.1 | 42.3 | 55.6 | 61.2 |
SPCL (Ge et al., 2020b) | Ma | 26.8 | 53.7 | 65.0 | 69.8 |
MPRD (Ji et al., 2021) | None | 14.6 | 37.7 | 51.3 | 57.1 |
HCD (Zheng et al., 2021) | None | 26.9 | 53.7 | 65.3 | 70.2 |
ICE (Chen et al., 2021) | None | 29.8 | 59.0 | 71.7 | 77.0 |
Cluster (Dai et al., 2021) | None | 33.3 | 63.3 | 73.7 | 77.8 |
MCL, 50% | None | 38.2 | 66.5 | 75.2 | 79.7 |
Moreover, we also count the proportions of correctly and wrongly clustering persons into the same category on MSMT17 in Figure 5 (a). We can see that MCL, 50% achieves better identity grouping more quickly than the baseline All scheme.
Visualization of Feature Distributions. In Figure 5 (b), we visualize the feature distributions on MSMT17 using t-SNE (Van der Maaten and Hinton, 2008). Compared with the baseline All scheme, the features of different identities are more clearly separated for our MCL, 50% scheme, which demonstrates that our learned ReID representations are more discriminative.
4.6. Comparison with State-of-the-arts
Although this work is the first attempt to achieve unsupervised ReID learning while considering computational cost savings, we also compare MCL to state-of-the-art U-ReID methods that do not consider resource limitations. From Table 5(b), we can see that MCL, 50%, using only 50% of the unlabeled data for meta-clustering, achieves U-ReID performance comparable to SOTA methods, and even outperforms the second-best ClusterContrast (Dai et al., 2021) by 4.9% in mAP on the large-scale MSMT17. In short, MCL is capable of achieving a good trade-off between U-ReID performance and computational cost.
Influence of Clustering Hyper-parameters. As discussed in the implementation details, for the first-phase training we use DBSCAN with the Jaccard distance (Zhong et al., 2017) computed over k-nearest neighbors (k=30), following (Ge et al., 2020b; Dai et al., 2021). For DBSCAN, the maximum distance between two samples is set to 0.4 for Market1501 and 0.7 for the other datasets, and the minimal number of neighbors for a core point is set to 4 for all datasets. Here we analyze the influence of these parameters in Figure 7 and conclude that the proposed large-scale unsupervised ReID training method MCL is robust and stable enough to achieve relatively satisfactory performance under varying hyper-parameters.
5. Conclusion
In this paper, we take the first attempt to explore a resource-friendly, purely unsupervised person ReID framework, which effectively learns discriminative representations while considering the computational cost. A new concept, meta clustering learning (MCL), is introduced to perform clustering-based ReID training on partial unlabeled data, saving the required computing resources. For the remaining data, we leverage the previously learned prototypes as a proxy annotator to pseudo-label them. Based on the generated soft pseudo labels, we then polish the model with two well-designed losses that take intra- and inter-identity constraints into account to alleviate noisy labels. MCL achieves SOTA performance on unsupervised ReID and can also flexibly meet computing budgets in practice.
6. Acknowledgements
This work was supported in part by NSFC under Grant U1908209, 62021001, the National Key Research and Development Program of China 2018AAA0101400, and NUS Faculty Research Committee Grant (WBS:A-0009440-00-00).
Supplementary
Appendix 1 Limitations of Meta Clustering Learning
Methods | PersonX (9.8k imgs, 410 IDs) | Market1501 (12.9k imgs, 751 IDs) | MSMT17 (32.6k imgs, 1041 IDs) | LaST (71.2k imgs, 5000 IDs) | ||||||||||||||||
mAP | Rank1 | M (MB) | T (s) | T (h) | mAP | Rank1 | M (MB) | T (s) | T (h) | mAP | Rank1 | M (MB) | T (s) | T (h) | mAP | Rank1 | M (MB) | T (s) | T (h) | |
All | 88.5 | 95.8 | 822.3 | 30.0 | 2.7 | 83.3 | 93.0 | 876.3 | 34.3 | 2.9 | 33.4 | 62.9 | 6251.5 | 118.3 | 9.3 | 19.8 | 74.0 | 22398.5 | 494.8 | 42.0 |
50% | 79.0 | 93.5 | 412.6 | 13.1 | 2.2 | 82.9 | 92.7 | 348.6 | 10.8 | 2.4 | 38.2 | 66.5 | 1761.3 | 31.1 | 4.6 | 20.0 | 74.9 | 5779.6 | 121.2 | 20.0 |
33% | – | – | – | – | – | 79.6 | 91.9 | 287.5 | 7.0 | 2.2 | 31.5 | 57.4 | 889.1 | 18.2 | 3.8 | 22.7 | 75.0 | 2688.2 | 70.8 | 14.0 |
25% | – | – | – | – | – | 75.4 | 89.3 | 235.1 | 5.4 | 2.0 | 25.9 | 53.4 | 556.9 | 13.4 | 3.0 | 17.2 | 69.0 | 1564.2 | 44.7 | 9.0 |
20% | – | – | – | – | – | 41.3 | 61.3 | 141.4 | 4.6 | 1.9 | 20.1 | 47.4 | 394.6 | 10.3 | 2.6 | 15.8 | 56.0 | 1066.0 | 38.4 | 7.0 |
MCL Cannot Work Well with Too Small Meta-training Data. As described in the manuscript, MCL reduces the computing resources used for clustering by training on only a meta-training subset in the first phase. Intuitively, if this meta-training subset is too small, MCL will be seriously affected by the noisy pseudo label issue, causing unsatisfactory U-ReID performance. Taking an extreme case as an example, given a person dataset with 10 IDs and 100 person images, if we split it into 10 subsets and pick just one as the meta-training subset, in the worst case that subset may cover only a single person, which directly makes the reference-based pseudo labeling and U-ReID fail. A similar phenomenon can be observed in Table 7: MCL cannot work well with too small meta-training data, especially on small-scale datasets, e.g., the 20% setting, and on PersonX (marked in blue).
The Relationship between Dataset Size and Optimal Split Number Is Not Clear. As claimed in the manuscript, the larger the unlabeled dataset, the more advantageous the proposed MCL. However, it is difficult to give a deterministic relationship between dataset size and split number. As shown in Table 7, the All scheme that directly conducts clustering over the full data achieves the best ReID performance on PersonX and Market1501, MCL, 50% achieves the best performance on MSMT17, and MCL, 33% achieves the best performance on the largest LaST. We conclude that the larger the unlabeled dataset, the less meta-training data (i.e., a larger split number) MCL may need to achieve satisfactory performance. However, the exact relationship between dataset size and the optimal number of subsets is still unclear. Exploring this relationship is also limited by the currently released public person ReID datasets (there are not many large-scale ones), so we leave it as future work.
Appendix 2 Meet the Computing Budgets In Practice
Uniquely, our MCL enables U-ReID to operate under varying computation budgets. As pointed out before, as a by-product of MCL, our trained model can fully leverage the entire unlabeled dataset while clustering only a partial subset, and this ratio can be flexibly determined according to the practical computing power. Given a computing budget, we compare the U-ReID performance of MCL with the Naive Splitting Training scheme (i.e., sequentially using subsets to meet resource requirements). As shown by the two curves in Figure 6, MCL consistently outperforms Naive Splitting Training on the two largest datasets, MSMT17 and LaST. That is, given a limited budget, MCL achieves more discriminative ReID representation learning.
Appendix 3 Hyper-parameter Analysis
Influence of the Loss Balance. As described in the manuscript, we design a siamese consistency loss $\mathcal{L}_{sc}$ and a soft-weighted triplet loss $\mathcal{L}_{swt}$ in the second polishment phase of meta clustering learning (MCL) to alleviate the noisy pseudo label issue, which can be formulated as $\mathcal{L} = \mathcal{L}_{sc} + \lambda \mathcal{L}_{swt}$, where the hyper-parameter $\lambda$ balances the importance of the siamese consistency loss and the soft-weighted triplet loss. For $\lambda$, we initially set it to 1.0 and then coarsely tune it based on the corresponding loss values and gradients observed during training; the guiding principle is to keep the two loss values/gradients in a similar range. A grid search within a small range around the derived value is further employed to obtain a better setting. In practice, we observed that the final performance is not very sensitive to this hyper-parameter, and we fix $\lambda$ experimentally in the end.
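A one-line sketch of this combined phase-2 objective (our illustration; lam=1.0 mirrors only the initial setting described above, since the final tuned value is not given):

```python
def polishment_loss(loss_sc, loss_swt, lam=1.0):
    """Phase-2 objective: L = L_sc + lambda * L_swt; lambda balances the two constraints."""
    return loss_sc + lam * loss_swt
```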
Appendix 4 Social Impact
Positive. In this paper, we introduce Meta Clustering Learning (MCL), a new concept for large-scale person re-identification. To the best of our knowledge, this paper is the first attempt to develop an efficient unsupervised person ReID training framework that fully leverages the enormous amount of pedestrian surveillance data while taking the computational cost into account. This is very important for both the academic community and industry, and is also valuable and meaningful for bridging the gap between fast-developing ReID algorithms and practical applications.
This paper also has the potential to provide new insights to the person ReID field and accelerate the development of ReID algorithms. We improve the ReID model’s learning ability, enabling models to be trained with limited computing resources in practical applications. Moreover, the prototype-referenced pseudo labeling idea and the well-designed intra- and inter-identity constraints are conceptually suitable for a wide range of tasks, from person detection, matching, and retrieval to fine-grained person tracking, etc.
Negative. Due to the urgent demand for public safety and the increasing number of surveillance cameras, person ReID is imperative in intelligent surveillance systems, with significant research impact and practical importance, but this task might also raise questions about the risk of leaking private information. Moreover, data collected from surveillance equipment or downloaded from the internet may violate individuals’ privacy. Therefore, we appeal to and encourage future person ReID work to understand and avoid, as much as possible, the risks of using such pedestrian data. We also encourage research that understands and mitigates the risks arising from surveillance applications; a short-term solution may be developing detection systems. Besides, we recommend that researchers stop the spread of private datasets.
References
- Bai et al. (2021) Yan Bai, Jile Jiao, Wang Ce, Jun Liu, Yihang Lou, Xuetao Feng, and Ling-Yu Duan. 2021. Person30k: A dual-meta generalization network for person re-identification. In CVPR. 2123–2132.
- Caron et al. (2020) Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. 2020. Unsupervised learning of visual features by contrasting cluster assignments. NeurIPS (2020).
- Chen et al. (2021) Hao Chen, Benoit Lagadec, and Francois Bremond. 2021. ICE: Inter-instance Contrastive Encoding for Unsupervised Person Re-identification. ICCV (2021).
- Chen et al. (2020b) Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020b. A simple framework for contrastive learning of visual representations. In ICML. PMLR, 1597–1607.
- Chen et al. (2020c) Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey Hinton. 2020c. Big self-supervised models are strong semi-supervised learners. arXiv preprint arXiv:2006.10029 (2020).
- Chen et al. (2020a) Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. 2020a. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297 (2020).
- Chen and He (2021) Xinlei Chen and Kaiming He. 2021. Exploring simple siamese representation learning. In CVPR. 15750–15758.
- Dai et al. (2022) Yongxing Dai, Yifan Sun, Jun Liu, Zekun Tong, Yi Yang, and Ling-Yu Duan. 2022. Bridging the Source-to-target Gap for Cross-domain Person Re-Identification with Intermediate Domains. arXiv preprint arXiv:2203.01682 (2022).
- Dai et al. (2021) Zuozhuo Dai, Guangyuan Wang, Weihao Yuan, Siyu Zhu, and Ping Tan. 2021. Cluster Contrast for Unsupervised Person Re-Identification. arXiv preprint arXiv:2103.11568 (2021).
- Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In CVPR.
- Deng et al. (2018) Weijian Deng, Liang Zheng, Qixiang Ye, Guoliang Kang, Yi Yang, and Jianbin Jiao. 2018. Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification. In CVPR.
- Ding et al. (2020) Yuhang Ding, Hehe Fan, Mingliang Xu, and Yi Yang. 2020. Adaptive exploration for unsupervised person re-identification. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 16, 1 (2020), 1–19.
- Ester et al. (1996) Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise.. In Kdd, Vol. 96. 226–231.
- Fan et al. (2018) Hehe Fan, Liang Zheng, Chenggang Yan, and Yi Yang. 2018. Unsupervised person re-identification: Clustering and fine-tuning. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) (2018).
- Feng et al. (2022) Ruoyu Feng, Xin Jin, Zongyu Guo, Runsen Feng, Yixin Gao, Tianyu He, Zhizheng Zhang, Simeng Sun, and Zhibo Chen. 2022. Image Coding for Machines with Omnipotent Feature Learning. ECCV (2022).
- Fu et al. (2021) Dengpan Fu, Dongdong Chen, Jianmin Bao, Hao Yang, Lu Yuan, Lei Zhang, Houqiang Li, and Dong Chen. 2021. Unsupervised Pre-training for Person Re-identification. In CVPR. 14750–14759.
- Fu et al. (2022) Dengpan Fu, Dongdong Chen, Hao Yang, Jianmin Bao, Lu Yuan, Lei Zhang, Houqiang Li, Fang Wen, and Dong Chen. 2022. Large-Scale Pre-training for Person Re-identification with Noisy Labels. CVPR (2022).
- Fu et al. (2019a) Yang Fu, Yunchao Wei, Guanshuo Wang, Xi Zhou, Honghui Shi, and Thomas S. Huang. 2019a. Self-similarity Grouping: A Simple Unsupervised Cross Domain Adaptation Approach for Person Re-identification. ICCV (2019).
- Fu et al. (2019b) Yang Fu, Yunchao Wei, Yuqian Zhou, et al. 2019b. Horizontal pyramid matching for person re-identification. In AAAI.
- Ge et al. (2021) Wenhang Ge, Chunyan Pan, Ancong Wu, Hongwei Zheng, and Wei-Shi Zheng. 2021. Cross-Camera Feature Prediction for Intra-Camera Supervised Person Re-identification across Distant Scenes. In ACMMM. 3644–3653.
- Ge et al. (2020a) Yixiao Ge, Dapeng Chen, and Hongsheng Li. 2020a. Mutual mean-teaching: Pseudo label refinery for unsupervised domain adaptation on person re-identification. ICLR (2020).
- Ge et al. (2020b) Yixiao Ge, Feng Zhu, Dapeng Chen, Rui Zhao, and Hongsheng Li. 2020b. Self-paced contrastive learning with hybrid memory for domain adaptive object re-id. NeurIPS (2020).
- Grill et al. (2020) Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, et al. 2020. Bootstrap your own latent: A new approach to self-supervised learning. arXiv preprint arXiv:2006.07733 (2020).
- He et al. (2020) Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2020. Momentum contrast for unsupervised visual representation learning. In CVPR. 9729–9738.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, et al. 2016. Deep residual learning for image recognition. In CVPR.
- He et al. (2021) Lingxiao He, Wu Liu, Jian Liang, Kecheng Zheng, Xingyu Liao, Peng Cheng, and Tao Mei. 2021. Semi-Supervised Domain Generalizable Person Re-Identification. arXiv preprint arXiv:2108.05045 (2021).
- Hermans et al. (2017) Alexander Hermans, Lucas Beyer, and Bastian Leibe. 2017. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737 (2017).
- Isobe et al. (2021) Takashi Isobe, Dong Li, Lu Tian, Weihua Chen, Yi Shan, and Shengjin Wang. 2021. Towards Discriminative Representation Learning for Unsupervised Person Re-identification. ICCV (2021).
- Ji et al. (2021) Haoxuanye Ji, Le Wang, Sanping Zhou, Wei Tang, Nanning Zheng, and Gang Hua. 2021. Meta Pairwise Relationship Distillation for Unsupervised Person Re-Identification. In ICCV. 3661–3670.
- Jin et al. (2020a) Xin Jin, Cuiling Lan, Wenjun Zeng, and Zhibo Chen. 2020a. Global distance-distributions separation for unsupervised person re-identification. In ECCV. Springer, 735–751.
- Jin et al. (2020b) Xin Jin, Cuiling Lan, Wenjun Zeng, Zhibo Chen, and Li Zhang. 2020b. Style Normalization and Restitution for Generalizable Person Re-identification. In CVPR.
- Kohavi et al. (1995) Ron Kohavi et al. 1995. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Ijcai, Vol. 14. Montreal, Canada, 1137–1145.
- Li et al. (2018) Minxian Li, Xiatian Zhu, and Shaogang Gong. 2018. Unsupervised person re-identification by deep learning tracklet association. In ECCV.
- Li et al. (2019) Minxian Li, Xiatian Zhu, and Shaogang Gong. 2019. Unsupervised Tracklet Person Re-Identification. TPAMI (2019).
- Lin et al. (2018) Shan Lin, Haoliang Li, Chang-Tsun Li, and Alex Chichung Kot. 2018. Multi-task mid-level feature alignment network for unsupervised cross-dataset person re-identification. BMVC (2018).
- Lin et al. (2019) Yutian Lin, Xuanyi Dong, Liang Zheng, Yan Yan, and Yi Yang. 2019. A bottom-up clustering approach to unsupervised person re-identification. In AAAI, Vol. 33. 8738–8745.
- Lin et al. (2020) Yutian Lin, Lingxi Xie, Yu Wu, Chenggang Yan, and Qi Tian. 2020. Unsupervised person re-identification via softened similarity learning. In CVPR. 3390–3399.
- Liu et al. (2019) Jiawei Liu, Zheng-Jun Zha, Di Chen, Richang Hong, and Meng Wang. 2019. Adaptive Transfer Network for Cross-Domain Person Re-Identification. In CVPR.
- Oord et al. (2018) Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018).
- Qi et al. (2019) Lei Qi, Lei Wang, Jing Huo, Luping Zhou, Yinghuan Shi, and Yang Gao. 2019. A Novel Unsupervised Camera-aware Domain Adaptation Framework for Person Re-identification. ICCV (2019).
- Ren et al. (2022) Sucheng Ren, Daquan Zhou, Shengfeng He, Jiashi Feng, and Xinchao Wang. 2022. Shunted Self-Attention via Multi-Scale Token Aggregation. In CVPR. 10853–10862.
- Ristani et al. (2016) Ergys Ristani, Francesco Solera, Roger Zou, Rita Cucchiara, and Carlo Tomasi. 2016. Performance measures and a data set for multi-target, multi-camera tracking. In ECCV.
- Ro et al. (2019) Youngmin Ro, Jongwon Choi, Dae Ung Jo, Byeongho Heo, Jongin Lim, and Jin Young Choi. 2019. Backbone cannot be trained at once: Rolling back to pre-trained network for person re-identification. In AAAI, Vol. 33. 8859–8867.
- Schroff et al. (2015) Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015. Facenet: A unified embedding for face recognition and clustering. In CVPR.
- Shu et al. (2021) Xiujun Shu, Xiao Wang, Shiliang Zhang, Xianghao Zhang, Yuanqi Chen, Ge Li, and Qi Tian. 2021. Large-Scale Spatio-Temporal Person Re-identification: Algorithm and Benchmark. arXiv preprint arXiv:2105.15076 (2021).
- Song et al. (2020) Liangchen Song, Cheng Wang, Lefei Zhang, Bo Du, Qian Zhang, Chang Huang, and Xinggang Wang. 2020. Unsupervised domain adaptive re-identification: Theory and practice. Pattern Recognition (2020), 107173.
- Sun and Zheng (2019) Xiaoxiao Sun and Liang Zheng. 2019. Dissecting person re-identification from the viewpoint of viewpoint. In CVPR. 608–617.
- Sun et al. (2018) Yifan Sun, Liang Zheng, Yi Yang, Qi Tian, and Shengjin Wang. 2018. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In ECCV. 480–496.
- Tang et al. (2019) Haotian Tang, Yiru Zhao, and Hongtao Lu. 2019. Unsupervised Person Re-Identification With Iterative Self-Supervised Domain Adaptation. In CVPR workshops.
- Van der Maaten and Hinton (2008) Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of machine learning research 9, 11 (2008).
- Vanschoren (2018) Joaquin Vanschoren. 2018. Meta-learning: A survey. arXiv preprint arXiv:1810.03548 (2018).
- Vanschoren (2019) Joaquin Vanschoren. 2019. Meta-learning. In Automated Machine Learning. Springer, Cham, 35–61.
- Vilalta and Drissi (2002) Ricardo Vilalta and Youssef Drissi. 2002. A perspective view and survey of meta-learning. Artificial intelligence review 18, 2 (2002), 77–95.
- Wang and Zhang (2020) Dongkai Wang and Shiliang Zhang. 2020. Unsupervised person re-identification via multi-label classification. In CVPR. 10981–10990.
- Wang et al. (2018) Jingya Wang, Xiatian Zhu, Shaogang Gong, and Wei Li. 2018. Transferable joint attribute-identity deep learning for unsupervised person re-identification. In CVPR.
- Wang et al. (2020a) Yanan Wang, Shengcai Liao, and Ling Shao. 2020a. Surpassing real-world source training data: Random 3d characters for generalizable person re-identification. In ACMMM. 3422–3430.
- Wang et al. (2020b) Zhongdao Wang, Jingwei Zhang, Liang Zheng, Yixuan Liu, Yifan Sun, Yali Li, and Shengjin Wang. 2020b. CycAs: Self-supervised Cycle Association for Learning Re-identifiable Descriptions. In ECCV. Springer, 72–88.
- Wei et al. (2018) Longhui Wei, Shiliang Zhang, Wen Gao, and Qi Tian. 2018. Person transfer GAN to bridge domain gap for person re-identification. In CVPR.
- Wu et al. (2019) Jinlin Wu, Yang Yang, Hao Liu, Shengcai Liao, Zhen Lei, and Stan Z Li. 2019. Unsupervised graph association for person re-identification. In ICCV. 8321–8330.
- Wu et al. (2021) Yiming Wu, Xintian Wu, Xi Li, and Jian Tian. 2021. MGH: Metadata Guided Hypergraph Modeling for Unsupervised Person Re-identification. In ACMMM. 1571–1580.
- Wu et al. (2018) Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. 2018. Unsupervised feature learning via non-parametric instance discrimination. In CVPR. 3733–3742.
- Yang et al. (2020) Fengxiang Yang, Ke Li, Zhun Zhong, Zhiming Luo, Xing Sun, Hao Cheng, Xiaowei Guo, Feiyue Huang, Rongrong Ji, and Shaozi Li. 2020. Asymmetric Co-Teaching for Unsupervised Cross Domain Person Re-Identification. AAAI (2020).
- Yang et al. (2019) Qize Yang, Hong-Xing Yu, Ancong Wu, and Wei-Shi Zheng. 2019. Patch-Based Discriminative Feature Learning for Unsupervised Person Re-Identification. In CVPR.
- Yang et al. (2022) Zizheng Yang, Xin Jin, Kecheng Zheng, and Feng Zhao. 2022. Unleashing the Potential of Unsupervised Pre-Training with Intra-Identity Regularization for Person Re-Identification. CVPR (2022).
- Yu et al. (2019) Hong-Xing Yu, Wei-Shi Zheng, Ancong Wu, Xiaowei Guo, Shaogang Gong, and Jian-Huang Lai. 2019. Unsupervised Person Re-identification by Soft Multilabel Learning. In CVPR.
- Zeng et al. (2020) Kaiwei Zeng, Munan Ning, Yaohua Wang, and Yang Guo. 2020. Hierarchical clustering with hard-batch triplet loss for person re-identification. In CVPR. 13657–13665.
- Zhai et al. (2020) Yunpeng Zhai, Shijian Lu, Qixiang Ye, Xuebo Shan, Jie Chen, Rongrong Ji, and Yonghong Tian. 2020. AD-Cluster: Augmented discriminative clustering for domain adaptive person re-identification. In CVPR. 9021–9030.
- Zhang et al. (2019) Xinyu Zhang, Jiewei Cao, Chunhua Shen, and Mingyu You. 2019. Self-training with progressive augmentation for unsupervised cross-domain person re-identification. In ICCV.
- Zhang et al. (2022) Xinyu Zhang, Dongdong Li, Zhigang Wang, Jian Wang, Errui Ding, Javen Qinfeng Shi, Zhaoxiang Zhang, and Jingdong Wang. 2022. Implicit Sample Extension for Unsupervised Person Re-Identification. In CVPR. 7369–7378.
- Zheng et al. (2015) Liang Zheng, Liyue Shen, et al. 2015. Scalable person re-identification: A benchmark. In ICCV.
- Zheng et al. (2021) Yi Zheng, Shixiang Tang, Guolong Teng, Yixiao Ge, Kaijian Liu, Jing Qin, Donglian Qi, and Dapeng Chen. 2021. Online Pseudo Label Generation by Hierarchical Cluster Dynamics for Adaptive Person Re-Identification. In ICCV. 8371–8381.
- Zheng et al. (2022) Yi Zheng, Yong Zhou, Jiaqi Zhao, Ying Chen, Rui Yao, Bing Liu, and Abdulmotaleb El Saddik. 2022. Clustering Matters: Sphere Feature for Fully Unsupervised Person Re-identification. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 18, 4 (2022), 1–18.
- Zhong et al. (2017) Zhun Zhong, Liang Zheng, Donglin Cao, and Shaozi Li. 2017. Re-ranking person re-identification with k-reciprocal encoding. In CVPR.
- Zhong et al. (2020) Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. 2020. Random erasing data augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 13001–13008.
- Zhong et al. (2018) Zhun Zhong, Liang Zheng, Shaozi Li, and Yi Yang. 2018. Generalizing a person retrieval model hetero-and homogeneously. In ECCV.
- Zhong et al. (2019) Zhun Zhong, Liang Zheng, Zhiming Luo, Shaozi Li, and Yi Yang. 2019. Invariance matters: Exemplar memory for domain adaptive person re-identification. In CVPR. 598–607.
- Zhu et al. (2017) Jun-Yan Zhu, Taesung Park, Phillip Isola, et al. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV.
- Zhuang et al. (2021) Weiming Zhuang, Yonggang Wen, and Shuai Zhang. 2021. Joint optimization in edge-cloud continuum for federated unsupervised person re-identification. In ACMMM. 433–441.
- Zou et al. (2020) Yang Zou, Xiaodong Yang, Zhiding Yu, BVK Vijaya Kumar, and Jan Kautz. 2020. Joint disentangling and adaptation for cross-domain person re-identification. In ECCV. Springer, 87–104.