
TAL: Two-stream Adaptive Learning for Generalizable Person Re-identification

Yichao Yan, Junjie Li, Shengcai Liao, Jie Qin, Bingbing Ni, and Xiaokang Yang
yanyichao@sjtu.edu.cn
Abstract

Domain generalizable person re-identification aims to apply a trained model to unseen domains. Prior works either combine the data from all training domains to capture domain-invariant features, or adopt a mixture of experts to investigate domain-specific information. In this work, we argue that both domain-specific and domain-invariant features are crucial for improving the generalization ability of re-id models. To this end, we design a novel framework, which we name two-stream adaptive learning (TAL), to simultaneously model these two kinds of information. Specifically, a domain-specific stream is proposed to capture training domain statistics with batch normalization (BN) parameters, while an adaptive matching layer is designed to dynamically aggregate domain-level information. In the meantime, we design an adaptive BN layer in the domain-invariant stream to approximate the statistics of various unseen domains. These two streams work adaptively and collaboratively to learn generalizable re-id features. Our framework can be applied to both single-source and multi-source domain generalization tasks, and experimental results show that it notably outperforms state-of-the-art methods.

1 Introduction

Person re-identification (re-id) [50] has been actively studied in the computer vision community. Over the past few years, fully supervised re-id has achieved remarkable progress, together with the rapid development of deep learning. In the supervised setting, re-id models are typically trained and evaluated within the same domain. Consequently, their performance may be significantly degraded when applied to unseen target domains, due to the large discrepancy between the source/training and target/test domains. In other words, their generalization abilities are quite limited, preventing them from being adapted to various practical applications. To address this, many unsupervised domain adaptation (UDA) models have been proposed to bridge the domain gap [12, 7, 38, 15]. In this regime, re-id models are first trained in the labeled source domain, and further fine-tuned in the unlabeled target domain. Although these models significantly improve the performance under cross-dataset evaluation, an additional time-consuming adaptation step is required, making them inefficient for real-world deployment. Moreover, target domains are not always available for adaptation, hindering the successful application of UDA models. To circumvent the above shortcomings, domain generalizable (DG) re-id [37, 25, 10, 22, 53], which aims to improve the generalization ability of re-id models without domain adaptation, has been receiving more and more attention in recent years. The goal of DG re-id is to work out of the box regardless of the target-domain scenario, which is highly desirable for real-world applications.

Figure 1: The proposed two-stream adaptive learning framework is composed of a domain-specific (DS) and a domain-invariant (DI) network. Both networks share the same backbone but with exclusive BN layers. In the DS network, DSBN captures domain-specific statistics for each domain expert. In the DI network, DABN adaptively approximates the unseen domain statistics from the DSBN parameters. These two networks simultaneously exploit the complementary DS and DI information and improve the generalization ability of the re-id model.

Previous efforts devoted to DG re-id can be generally divided into two categories. The first line of approaches, which we name single-model frameworks, combines all the training data in the source domains and attempts to learn domain-invariant features via style normalization [22], feature mapping [37], meta-learning [9], etc. However, the model parameters are fixed after training, and are thus insufficient to adapt to various target domains. More importantly, the domain-specific information, which can provide critical guidance to link the source and target domains, is ignored in these methods. The second line of works, which we refer to as the mixture of experts (MoE), develops a voting mechanism to aggregate the features from a series of domain-specific networks [10]. However, the efficiency of these methods tends to decrease when there are a large number of source domains. In the meantime, the domain-invariant information is overlooked due to the disentangled treatment of each expert.

To address the above issues, we propose a two-stream adaptive learning (TAL) framework for DG re-id, to simultaneously capture these two types of information. As shown in Fig. 1, our framework includes a domain-specific (DS) network that shares a similar spirit with the MoE framework, and a domain-invariant (DI) network that takes the hybrid source domain data as input. More specifically, on the one hand, we employ the domain-specific batch normalization (DSBN) layers [4] in the DS stream to capture the statistics of the source domains, while other parameters in the backbone are shared. This not only makes the network more efficient in handling a large number of training domains, but also facilitates the domain-invariant feature learning in the convolutional layers. On the other hand, the DI network introduces a separate BN layer and is trained without domain labels from a hybrid dataset that contains all the training domains. In this way, both domain-specific and domain-invariant features can be simultaneously captured in our framework, providing complementary information to improve the generalization ability of our re-id model.

To fully unleash the potential of both streams and make our framework better generalize to the unseen domains, we introduce two adaptive learning modules. First, it is challenging for the fixed BN layers learned from the hybrid dataset to adapt to various unseen target domains. To this end, we design a novel domain-adaptive BN (DABN) layer in the DI stream, to dynamically approximate the target domain statistics from the domain-specific BN layers. Second, instead of explicit voting, we propose a domain-adaptive matching layer in the DS stream to better aggregate features from the experts. These two strategies mimic the unseen domain during training, and thus significantly improve the adaptability of our model when applied to novel domains.

In summary, our main contributions include:

  • We propose a two-stream adaptive learning (TAL) framework for generalizable person re-id, which simultaneously captures the complementary domain-specific and domain-invariant information.

  • We design two adaptive learning modules to approximate the unseen domain statistics, which significantly improve the adaptation and generalization abilities of our framework.

  • Extensive experimental results on four large-scale datasets show that our framework outperforms state-of-the-art models in terms of both single-source and multi-source DG re-id tasks.

2 Related Work

2.1 Person Re-identification

Traditional person re-id methods mainly depend on hand-crafted features, such as [28, 13]. With the development of deep learning, recent re-id models are typically based on deep neural networks, including but not limited to part-based feature representation learning [41, 40, 31], distance metric learning [1, 19, 5, 8, 39], and video-based learning [49, 54, 48]. Despite their remarkable success, these models are fully supervised. Due to domain shift, their performance will be significantly degraded if directly applied to a novel domain. However, it is difficult and time-consuming to obtain annotated data for each target domain, which makes fully supervised models less practical for real-world applications. Therefore, unsupervised domain adaptation (UDA) re-id [38, 15, 16, 30] and domain generalizable (DG) re-id [37, 21, 22, 25, 53] have recently attracted growing interest in the community.

2.2 Domain Adaptation for Person Re-id

Unsupervised domain adaptation (UDA) is the task of training a model on a labeled source domain and transferring it to the target domain under the unsupervised setting. UDA has been widely studied in recent years and was introduced into the re-id task by [29]. Current UDA methods can be generally categorized into two branches: (1) those using generative adversarial networks (GANs) [17] to transfer the style of labeled source-domain images to the target domain [12, 7, 27], and (2) those adapting to the target domain with pseudo labels generated by clustering [38, 15, 16, 52] or by assigning soft labels [43]. However, UDA re-id models still rely on target-domain images for adaptation, which are not always available. In contrast, DG re-id does not require target-domain data during training, which is more practical.

Figure 2: Architecture of the proposed TAL framework. Both the DS network (a) and the DI network (b) share the same backbone, but with different BN layers. The DSBN layers in the DS network are trained with data with domain labels, while the DABN layer in the DI network predicts the statistics of the unseen domain from the DSBNs. '×' denotes that the gradients do not flow back.

2.3 Generalizable Person Re-id

The objective of domain generalization is to learn a robust, generalizable model, such that the learned model obtains good performance on an unseen target domain without further updates. Existing DG re-id methods can be divided into two categories: (1) those that train a single model to learn domain-invariant features from single/multi-source domains [37, 25, 22, 53], and (2) those that employ a mixture of experts (MoE) to learn a series of separate experts on subsets of the source domains [10]. For the first category, Song et al. [37] designed a Domain-Invariant Mapping Network (DIMN) for DG re-id. Jin et al. [22] proposed Style Normalization and Restitution (SNR) to disentangle identity-relevant and identity-irrelevant features. Liao et al. [25] proposed a query-adaptive matching mechanism to aggregate local similarities for DG re-id. Zhao et al. [53] introduced a meta-learning strategy to simulate the training and test process of DG re-id. For the second category, MoE methods have been widely employed in scene parsing [14] and image classification [2, 45]. Dai et al. [10] incorporated this strategy into the DG re-id task by designing a relevance-aware mixture of experts (RaMoE) that utilizes meta-learning to update the voting network. In this work, we take advantage of both approaches by introducing a two-stream framework to simultaneously exploit the domain-specific and domain-invariant information.

2.4 Normalization for Domain Generalization

Recently, several works [33, 32, 4, 3, 9] have shown the advantages of normalization techniques for boosting the generalization ability of neural networks. One of the most inspiring methods is domain-specific batch normalization (DSBN) [4], which employs exclusive BN layers to represent the statistics of the source and target domains. Nam et al. [32] designed a novel batch-instance normalization (BIN) layer to boost the collaboration of BN and instance normalization (IN) layers. Choi et al. [9] proposed a generalizable normalization layer by simulating unsuccessful generalization scenarios in a meta-learning pipeline. Inspired by these methods, we propose to employ DSBN in our DS network, and to utilize the DSBN parameters to predict the statistics of unseen domains in the DI network. In this way, both DS and DI information are exploited, yielding enhanced generalization ability.

3 Methodology

In this section, we introduce the proposed two-stream adaptive learning framework (i.e., TAL), for generalizable person re-id. We first elaborate on the proposed domain-specific and domain-invariant networks, and then present the training procedure of our framework.

3.1 Domain-specific Network

Suppose we have $K$ source domains $\{\mathcal{D}_1,\ldots,\mathcal{D}_K\}$ in the training set, where the $i$-th domain $\mathcal{D}_i=\{\mathbf{x}_i^1,\ldots,\mathbf{x}_i^{N_i}\}$ contains $N_i$ training images. Based on a shared backbone (e.g., ResNet-50 [18]), the DS network contains a set of BN layers that are specifically engaged for each domain, as shown in Fig. 2(a). We denote the DS network as $\mathcal{F}_{DS}=\{\mathcal{F}_1,\ldots,\mathcal{F}_K\}$, which can be regarded as $K$ domain experts that share all parameters except for the BN layers. Considering there are $P$ stages in the backbone, each domain expert outputs a series of domain-specific feature maps:

$\mathbf{H}_i = \mathcal{F}_i(\mathbf{x}_i),$   (1)

where $\mathbf{H}_i=\{\mathbf{h}_i^1,\ldots,\mathbf{h}_i^P\}$, $\mathbf{x}_i\in\mathcal{D}_i$, and $i=1,\ldots,K$. Subsequently, $\mathbf{H}_i$ is fed into a multi-scale variation of the query-adaptive convolutional layer [25] (MS-QAConv), and supervised by a triplet matching loss $\mathcal{L}_{tri}$ (which will be introduced in Sec. 3.3).

As each domain expert in our DS network only receives single-source information, it is difficult for the experts to generalize to unseen domains. To address this, we employ a hybrid source to mimic the target domain, as shown in Fig. 2(b), and again train the DS network without domain labels. Specifically, suppose $\mathcal{D}_{HS}$ contains all the training samples regardless of domain labels. Given an input $\mathbf{x}_{HS}\in\mathcal{D}_{HS}$, we extract its features from all the domain experts:

$\mathbf{H}_{i,HS} = \mathcal{F}_i(\mathbf{x}_{HS}),$   (2)

where $i\in\{1,\ldots,K\}$. The feature maps from all the domain experts, $\{\mathbf{H}_{1,HS},\ldots,\mathbf{H}_{K,HS}\}$, are aggregated by a domain-adaptive MS-QAConv layer, which we denote as MSDA-QAConv. The same matching loss $\mathcal{L}_{tri}$ is applied. Note that in this case, the gradients only flow back to the MSDA-QAConv layer, while the backbone parameters are no longer updated. In this way, the domain experts maintain the domain-specific information while being able to generalize to unseen domains.
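
To make the per-domain BN routing concrete, below is a minimal PyTorch sketch of domain-specific batch normalization, assuming one BatchNorm2d branch per source domain while all convolutional weights stay shared; the class name DSBN2d and the domain_idx argument are ours for illustration and are not taken from the released code.

```python
import torch
import torch.nn as nn

class DSBN2d(nn.Module):
    """Domain-specific BN: one BatchNorm2d branch per source domain, while all
    other backbone parameters (e.g., convolutions) remain shared."""
    def __init__(self, num_features, num_domains):
        super().__init__()
        self.bns = nn.ModuleList(
            [nn.BatchNorm2d(num_features) for _ in range(num_domains)])

    def forward(self, x, domain_idx):
        # Route the batch through the BN branch of its source domain.
        return self.bns[domain_idx](x)

# Toy usage: a shared conv followed by DSBN; selecting branch i realizes
# domain expert F_i of Eq. (1).
conv = nn.Conv2d(3, 16, 3, padding=1)       # shared across all experts
dsbn = DSBN2d(16, num_domains=3)
x = torch.randn(8, 3, 64, 32)               # a batch from source domain 1
h = dsbn(conv(x), domain_idx=1)
print(h.shape)  # torch.Size([8, 16, 64, 32])
```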

3.2 Domain-invariant Network

Although the DS network tries to capture the domain-invariant information given the hybrid source, the feature aggregation only takes place at the last MSDA-QAConv layer, which is incapable of modeling the target domain statistics at different levels. To this end, we further introduce a DI network to normalize the features into a domain-invariant representation. In particular, the DI network shares the same backbone with the DS network, but with specifically designed domain-adaptive BN (DABN) layers. To make the DABN layers adaptively approximate the target domain statistics, their normalization results are dynamically predicted from the DSBN layers, as shown in Fig. 2(b). Suppose the input feature before the DABN layer is $\mathbf{x}$. We feed $\mathbf{x}$ into each DSBN, where the output of the $i$-th domain can be represented as:

${\rm DSBN}_i(\mathbf{x},\gamma_i,\beta_i) = \gamma_i\hat{\mathbf{x}} + \beta_i,$   (3)

where $\gamma_i$ and $\beta_i$ are the DSBN parameters and

$\hat{\mathbf{x}} = \dfrac{\mathbf{x}-\mu_i}{\sqrt{\sigma_i^2+\epsilon}}.$   (4)

Here, $\mu_i$ and $\sigma_i^2$ are the mean and variance of the input feature. In DABN, the normalization result $\tilde{\mathbf{x}}$ is calculated as a weighted summation of the DSBN outputs:

${\rm DABN}(\mathbf{x},\alpha,\gamma,\beta) = \sum_{i=1}^{K}\alpha_i(\gamma_i\hat{\mathbf{x}}+\beta_i),$   (5)

where the weighting terms $\alpha=\{\alpha_1,\ldots,\alpha_K\}$ are estimated in a similar way as in the squeeze-and-excitation network [20] (SENet), as shown in Fig. 3. Specifically, the input feature maps are first passed through an average pooling layer to extract global information. Then a bottleneck layer with ReLU activation is introduced to reduce the dimension (with reduction rate $r$), followed by another linear layer to predict the weight of each domain. The attention weights are normalized by a softmax layer:

$\alpha = {\rm Softmax}(\mathbf{W}_2({\rm ReLU}(\mathbf{W}_1{\rm GP}(\mathbf{x})))),$   (6)

where GP denotes the global pooling operation, and $\mathbf{W}_1$ and $\mathbf{W}_2$ are the parameters of the two linear layers. $\alpha$ is then applied in Eq. 5 to calculate the DABN activation.
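
The following is a self-contained PyTorch sketch of the DABN computation in Eqs. (5)-(6). In the full model the per-domain BN branches would share parameters with the DSBN layers of the DS stream; here they are instantiated locally to keep the sketch runnable, and the class name DABN2d is ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DABN2d(nn.Module):
    """Domain-adaptive BN following Eqs. (5)-(6): the output is a weighted sum
    over K per-domain BN branches, with SENet-style softmax weights."""
    def __init__(self, num_features, num_domains, reduction=16):
        super().__init__()
        # In the full model these branches share parameters with the DS
        # stream's DSBN layers; local copies keep this sketch self-contained.
        self.bns = nn.ModuleList(
            [nn.BatchNorm2d(num_features) for _ in range(num_domains)])
        hidden = max(num_features // reduction, 1)
        self.fc1 = nn.Linear(num_features, hidden)   # W1 (bottleneck)
        self.fc2 = nn.Linear(hidden, num_domains)    # W2 (one logit per domain)

    def forward(self, x):
        # Eq. (6): alpha = Softmax(W2 ReLU(W1 GP(x)))
        g = F.adaptive_avg_pool2d(x, 1).flatten(1)               # (B, C)
        alpha = F.softmax(self.fc2(F.relu(self.fc1(g))), dim=1)  # (B, K)
        # Eq. (5): weighted summation of the DSBN outputs.
        outs = torch.stack([bn(x) for bn in self.bns], dim=1)    # (B, K, C, H, W)
        return (alpha[:, :, None, None, None] * outs).sum(dim=1)

dabn = DABN2d(num_features=16, num_domains=3)
y = dabn(torch.randn(8, 16, 64, 32))
print(y.shape)  # torch.Size([8, 16, 64, 32])
```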

Figure 3: Illustration of the structure of DABN. The output of DABN is a weighted summation of the DSBN outputs, where the weights are calculated in a similar way as SENet [20].

3.3 Model Training Procedure

To make the model not only adaptive to novel domains, but also flexible to handle new test samples, we train our framework in the same fashion as QAConv [25]. Specifically, QAConv tries to find local correspondences between two feature maps by constructing convolution kernels on-the-fly from one feature map, while performing convolution on the other feature map to calculate the local similarities. In our framework, we design two enhanced variations of QAConv, to further capture the multi-scale information (MS-QAConv), and to aggregate the domain-specific features (MSDA-QAConv).

MS-QAConv. The structure of MS-QAConv is illustrated in Fig. 4. To match a pair of images, we first extract the multi-scale feature maps (e.g., from the res4 and res5 layers) of both the query and gallery images. Then the query feature maps are partitioned into local patches, which are re-organized into convolution kernels. By applying these kernels to the gallery feature maps, i.e., performing QAConv, we obtain a set of response maps measuring the similarities between local patches. A global max pooling (GMP) operation is then applied to these response maps to extract a feature vector that represents the best local correspondence. Vanilla QAConv directly feeds the vectors from a single scale (res4) into a BN-FC-BN block to calculate the final similarity value. However, as indicated by prior re-id works [6, 44], multi-scale feature representations contain hierarchical information that facilitates re-id learning. Therefore, MS-QAConv first aggregates the multi-scale matching information by concatenating the feature vectors from several scales, and then predicts the final similarity. In this way, the multi-level information is exploited to help our model make better predictions.
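
As a rough sketch of this matching step (not the authors' exact implementation, which uses learned local kernels and a BN-FC-BN similarity head), the snippet below illustrates the core idea at two scales with hypothetical res4/res5 feature shapes for a 384×128 input: query locations act as kernels, QAConv produces response maps, and GMP plus concatenation yields a multi-scale matching vector.

```python
import torch
import torch.nn.functional as F

def qaconv_response(query_feat, gallery_feat):
    """One-scale query-adaptive matching: each spatial location of the query
    feature map becomes a 1x1 kernel applied to the gallery feature map;
    global max pooling (GMP) keeps the best local correspondence."""
    q = F.normalize(query_feat.flatten(1).t(), dim=1)   # (Hq*Wq, C) kernels
    g = F.normalize(gallery_feat, dim=0)                # (C, Hg, Wg)
    resp = torch.einsum('kc,chw->khw', q, g)            # local response maps
    return resp.flatten(1).max(dim=1).values            # (Hq*Wq,) after GMP

# Hypothetical res4/res5 feature shapes for a 384x128 input.
feats_q = [torch.randn(1024, 24, 8), torch.randn(2048, 12, 4)]
feats_g = [torch.randn(1024, 24, 8), torch.randn(2048, 12, 4)]
# Multi-scale fusion: concatenate the per-scale matching vectors before the
# final similarity head (a BN-FC-BN block in the paper, omitted here).
match_vec = torch.cat([qaconv_response(q, g) for q, g in zip(feats_q, feats_g)])
print(match_vec.shape)  # torch.Size([240]) = 24*8 + 12*4
```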

Figure 4: Illustration of MS-QAConv. Compared to QAConv [25], MS-QAConv computes multi-scale response maps and thus generates more reliable matching results by multi-scale fusion.

MSDA-QAConv. When an image without a domain label is input into the DS network, the QAConv block needs to adaptively aggregate the response maps from the different domain experts. We therefore propose MSDA-QAConv to tackle this challenge. Specifically, the basic structure of MSDA-QAConv is similar to MS-QAConv, but it further contains an attention block to weight the response vectors from each domain. The attention weights are calculated in a similar way as in Eq. 6, where the vectors are concatenated and fed into an FC-ReLU-FC-Softmax block to output the domain weights. Subsequently, the weighted summations of the response vectors are calculated as the domain-level representation at each scale level, which are further aggregated to yield the final similarity. In this way, both multi-scale and multi-domain information are simultaneously captured for better adaptive feature learning.
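
A minimal sketch of this attention block follows, assuming the per-expert response vectors at one scale are stacked into a (B, K, D) tensor; the FC-ReLU-FC-Softmax weighting mirrors Eq. 6, while the class name DomainAttention and the hidden size are ours for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DomainAttention(nn.Module):
    """Attention block of MSDA-QAConv (sketch): response vectors from the K
    domain experts are concatenated, mapped through FC-ReLU-FC-Softmax to
    per-domain weights, and fused by a weighted sum at each scale level."""
    def __init__(self, feat_dim, num_domains, hidden=64):
        super().__init__()
        self.fc1 = nn.Linear(feat_dim * num_domains, hidden)
        self.fc2 = nn.Linear(hidden, num_domains)

    def forward(self, expert_vecs):                    # (B, K, feat_dim)
        b, k, d = expert_vecs.shape
        logits = self.fc2(F.relu(self.fc1(expert_vecs.reshape(b, k * d))))
        alpha = F.softmax(logits, dim=1)               # (B, K) domain weights
        return (alpha.unsqueeze(-1) * expert_vecs).sum(dim=1)  # (B, feat_dim)

attn = DomainAttention(feat_dim=240, num_domains=3)
fused = attn(torch.randn(16, 3, 240))
print(fused.shape)  # torch.Size([16, 240])
```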

Loss Function. A triplet loss [19] with batch-hard mining is employed to supervise model training. Within a batch containing $B$ instances, we calculate the pairwise similarities between all the training samples. Specifically, we use $S_{i,p}$ to denote the similarities between the $i$-th sample and its positive pairs, while $S_{i,n}$ denotes the negative similarities. The batch-hard triplet loss is then calculated as follows:

$\mathcal{L}_{tri} = \sum_{i=1}^{B}\left[\,m - \min_{p} S_{i,p} + \max_{n} S_{i,n}\,\right]_{+},$   (7)

where $m$ is a hyperparameter denoting the margin between the positive and negative pairs, and $[\ast]_{+}=\max(\ast,0)$ ensures that the loss value is non-negative.
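
Eq. (7) can be implemented directly on the $B\times B$ similarity matrix. Below is a PyTorch sketch, assuming each identity in the batch has at least one other positive sample (as the sampler guarantees); the margin value is chosen arbitrarily for illustration.

```python
import torch

def batch_hard_triplet_loss(sim, labels, margin=0.3):
    """Eq. (7) on a (B, B) pairwise *similarity* matrix: for each anchor, take
    the hardest (least similar) positive and the hardest (most similar)
    negative, then apply the hinge [.]_+ with margin m."""
    B = sim.size(0)
    pos_mask = labels[:, None].eq(labels[None, :])
    eye = torch.eye(B, dtype=torch.bool, device=sim.device)
    # Hardest positive: minimum similarity over positives (self excluded).
    hardest_pos = sim.masked_fill(~pos_mask | eye, float('inf')).min(dim=1).values
    # Hardest negative: maximum similarity over negatives.
    hardest_neg = sim.masked_fill(pos_mask, float('-inf')).max(dim=1).values
    return (margin - hardest_pos + hardest_neg).clamp(min=0).sum()

# Toy batch: 8 samples, 4 identities, 2 samples per identity.
labels = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
sim = torch.randn(8, 8)
print(batch_hard_triplet_loss(sim, labels))
```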

Model Inference. In the test phase, the output of the MSDA-QAConv block in the DS network and the output of the MS-QAConv block in the DI network are combined to yield the final similarity.

4 Experiments

4.1 Implementation Details

Our implementation is based on the official code of QAConv [25] (https://github.com/ShengcaiLiao/QAConv) and its improved variant (QAConv-GS [26]). We employ the IBN-Net50 [33] pretrained on ImageNet [11] as our default backbone. The features from the res4 and res5 layers are employed in our MS-QAConv and MSDA-QAConv blocks. In DABN, we set the reduction rate $r=16$. We set the batch size $B=64$, and we employ the graph sampler [26] to select 16 identities with 4 samples per identity. The input images are resized to 384×128, and several data augmentation strategies, including random flipping, cropping, and color jittering, are applied. We adopt the SGD optimizer with a weight decay of 0.0005. The initial learning rate is set to 0.005 and is reduced by a factor of 10 after 20 epochs, with training terminating at the 30th epoch. All the experiments are implemented in PyTorch [34] on four NVIDIA 3090 GPUs.
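
For reference, the optimization schedule above corresponds to roughly the following PyTorch setup; the momentum value and the placeholder model are our assumptions, as they are not specified here.

```python
import torch
import torch.nn as nn

# Placeholder module standing in for the TAL network (an assumption).
model = nn.Linear(10, 10)

optimizer = torch.optim.SGD(model.parameters(), lr=0.005,
                            momentum=0.9,          # momentum is our assumption
                            weight_decay=0.0005)
# lr is divided by 10 after epoch 20; training stops at epoch 30.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[20], gamma=0.1)

for epoch in range(30):
    # ... one training epoch over the source domains ...
    scheduler.step()
```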

4.2 Datasets and Evaluation Protocols

Datasets. We employ four large-scale datasets, CUHK03 [23], Market-1501 [55], MSMT17 [47], and RandPerson [46], in our experiments, among which RandPerson is a newly released synthesized person re-id dataset while the others were captured by real-world cameras. (Note that DukeMTMC-reID [36] has been taken down and is not employed in our paper; it is replaced by the RandPerson dataset in our experiments.) For CUHK03, we adopt the CUHK03-NP [56] detected subset, which contains 767 subjects for training and 700 for evaluation. The widely used Market-1501 dataset includes 1,501 identities and 32,668 pedestrian images captured by 6 different cameras. MSMT17 is currently one of the largest public datasets obtained with real-world cameras. It consists of 4,101 annotated identities and 126,441 images from 12 outdoor and 3 indoor cameras. RandPerson is a synthesized re-id dataset generated with MakeHuman and Unity3D, containing a training set of 1,801,816 images of 8,000 synthesized identities. For a fair comparison, we follow [26] and employ a subset of RandPerson with 132,145 images of the 8,000 identities. Since RandPerson only provides a training subset, it is not adopted as a target domain in our experiments.

Methods | Venue | Training Data | CUHK03-NP top-1 / mAP | Market-1501 top-1 / mAP | MSMT17 top-1 / mAP
MGN [44] | ACM MM 2018 | Market-1501 | 8.5 / 7.4 | - | -
MuDeep [35] | TPAMI 2020 | Market-1501 | 10.3 / 9.1 | - | -
CBN [57] | ECCV 2020 | Market-1501 | - | - | 25.3 / 9.5
QAConv [25] | ECCV 2020 | Market-1501 | 9.9 / 8.6 | - | 22.6 / 7.0
QAConv-GS [26] | arXiv 2021 | Market-1501 | 16.4 / 15.7 | - | 41.2 / 15.0
TAL (Ours) | - | Market-1501 | 20.2 / 19.2 | - | 43.5 / 16.3
PCB [42] | ECCV 2018 | MSMT17 | - | 52.7 / 26.7 | -
MGN [44] | ACM MM 2018 | MSMT17 | - | 48.7 / 25.1 | -
ADIN [51] | WACV 2020 | MSMT17 | - | 59.1 / 30.3 | -
SNR [22] | CVPR 2020 | MSMT17 | - | 70.1 / 41.4 | -
CBN [57] | ECCV 2020 | MSMT17 | - | 73.7 / 45.0 | -
QAConv-GS [26] | arXiv 2021 | MSMT17 | 20.0 / 19.2 | 75.1 / 46.7 | -
TAL (Ours) | - | MSMT17 | 20.4 / 19.8 | 77.5 / 49.4 | -
RP Baseline [46] | ACM MM 2020 | RandPerson | 13.4 / 10.8 | 55.6 / 28.8 | 20.1 / 6.3
QAConv-GS [26] | arXiv 2021 | RandPerson | 14.8 / 13.4 | 74.0 / 43.8 | 42.4 / 14.4
TAL (Ours) | - | RandPerson | 16.2 / 14.2 | 75.0 / 45.9 | 42.8 / 14.9
Table 1: Comparison with the state-of-the-art methods in single-source DG re-id under protocol-1. Each target-dataset cell reports top-1 / mAP (%).
Target: Market | mAP | top-1 | top-5 | top-10
QAConv-GS* [26] | 59.3 | 83.0 | 92.8 | 95.0
M$^3$L* [53] | 58.3 | 80.6 | 91.2 | 94.2
MetaBIN* [9] | 57.9 | 80.3 | 91.2 | 93.7
TAL (Ours) | 64.4 | 85.1 | 94.1 | 96.2

Target: CUHK03 | mAP | top-1 | top-5 | top-10
QAConv-GS* [26] | 25.8 | 26.6 | 45.4 | 56.1
M$^3$L* [53] | 33.9 | 34.8 | 55.4 | 66.3
MetaBIN* [9] | 34.4 | 35.3 | 54.1 | 64.4
TAL (Ours) | 35.0 | 36.1 | 58.1 | 66.8

Target: MSMT17 | mAP | top-1 | top-5 | top-10
QAConv-GS* [26] | 20.2 | 52.1 | 64.4 | 69.2
M$^3$L* [53] | 14.6 | 35.4 | 49.0 | 54.8
MetaBIN* [9] | 16.3 | 40.3 | 53.9 | 59.5
TAL (Ours) | 22.3 | 54.7 | 65.6 | 70.4
Table 2: Comparison with the state-of-the-art methods in multi-source DG re-id under protocol-2. * denotes results re-implemented based on the authors' code with the same source datasets.

Evaluation Protocols. To fully verify the effectiveness of our two-stream adaptive learning framework, we design experiments under both single-source and multi-source protocols: (1) Protocol-1: single-source DG re-id. Under this protocol, we train on the training set of one dataset (or a subset thereof) and evaluate on the test set of another dataset. As a single dataset does not naturally contain different domains, we split domains according to camera views, which display different statistics within a dataset. (2) Protocol-2: multi-source DG re-id. Under this protocol, we employ RandPerson and two of the three real-world datasets (CUHK03, Market-1501, and MSMT17) as the training set, and the remaining one is used for evaluation. In this case, each source dataset can be naturally regarded as a domain. For both settings, we adopt mean average precision (mAP) and CMC top-1, top-5, and top-10 accuracy as the evaluation metrics.
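
For clarity, the sketch below shows a simplified computation of these metrics from a query-gallery similarity matrix; it omits the standard same-camera filtering and assumes every query has at least one correct gallery match.

```python
import numpy as np

def cmc_map(sim, q_ids, g_ids):
    """Simplified CMC/mAP from a (num_query, num_gallery) similarity matrix.
    Camera-based filtering is omitted; every query is assumed to have at
    least one correct gallery match."""
    order = np.argsort(-sim, axis=1)              # rank gallery by similarity
    matches = g_ids[order] == q_ids[:, None]      # (Q, G) boolean hit matrix
    first_hit = matches.argmax(axis=1)            # rank of the first hit
    cmc = {k: float(np.mean(first_hit < k)) for k in (1, 5, 10)}
    aps = []
    for row in matches:                           # average precision per query
        hits = np.flatnonzero(row)
        precision = (np.arange(len(hits)) + 1) / (hits + 1)
        aps.append(precision.mean())
    return cmc, float(np.mean(aps))

# Toy example with 3 queries and 4 gallery images.
q_ids = np.array([0, 1, 2])
g_ids = np.array([2, 0, 1, 0])
sim = np.random.rand(3, 4)
print(cmc_map(sim, q_ids, g_ids))
```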

4.3 Comparison with State-of-the-arts

Comparison Under Protocol-1. We compare our TAL framework with several DG re-id methods, including MGN [44], MuDeep [35], CBN [57], PCB [42], ADIN [51], SNR [22], QAConv [25], and QAConv-GS [26]. We successively employ Market-1501, MSMT17, and RandPerson as the training source, while the other real-world datasets are used for testing. As shown in Table 1, our proposed TAL clearly outperforms the state-of-the-art models on all three target datasets. TAL further improves our baseline models, i.e., QAConv and QAConv-GS. On the common real-world → real-world tasks, we achieve 0.6% to 3.5% improvements in mAP and up to 3.8% improvements in top-1 accuracy. The improvements on the more challenging synthesized → real-world task are up to 1.4% in top-1 and 2.1% in mAP. Notably, our experiment confirms the finding in [46] that models trained on a synthesized dataset can achieve generalization ability competitive with models trained on real data.

Comparison Under Protocol-2. Under the multi-source protocol, we respectively adopt Market-1501, CUHK03, and MSMT17 as the test set, while all the other datasets are combined for training. For a fair comparison, we use the authors' code to re-implement QAConv-GS [26], M$^3$L [53], and MetaBIN [9] with the same training data as our method. From the comparisons shown in Table 2, we can make the following observations. First, our proposed TAL framework improves over current methods by a large margin. For example, mAP is improved by 5.1%, and top-1 is increased by 2.1% on Market-1501. Second, state-of-the-art multi-source DG re-id methods appear to be unstable, as QAConv-GS shows advantages on Market-1501 and MSMT17 but clearly inferior performance in both mAP and top-1 on CUHK03. In contrast, benefiting from the two-stream design for capturing both domain-specific and domain-invariant information, our method demonstrates notable superiority on all three test datasets.

Backbone | DS | DI | M: mAP | M: top-1 | C: mAP | C: top-1
ResNet-50 | × | × | 56.4 | 81.1 | 22.7 | 23.5
ResNet-50 | ✓ | × | 57.5 | 82.0 | 25.4 | 27.0
ResNet-50 | × | ✓ | 58.4 | 82.3 | 27.6 | 28.9
ResNet-50 | ✓ | ✓ | 60.7 | 82.7 | 30.0 | 31.6
IBN-Net50 | × | × | 59.3 | 83.0 | 25.8 | 26.6
IBN-Net50 | ✓ | × | 60.1 | 83.1 | 31.0 | 33.6
IBN-Net50 | × | ✓ | 62.4 | 83.8 | 32.7 | 34.9
IBN-Net50 | ✓ | ✓ | 64.4 | 85.1 | 35.0 | 36.1
Table 3: Comparative results when employing different backbones with different network branches. M: Market-1501. C: CUHK03.
DI Network | M: mAP | M: top-1 | C: mAP | C: top-1
Average Pooling | 61.5 | 83.4 | 31.5 | 34.0
BN | 61.0 | 83.2 | 31.9 | 34.3
DABN | 62.4 | 83.8 | 32.7 | 34.9
Table 4: Comparative results when employing different training strategies in the DI network. M: Market-1501. C: CUHK03.

4.4 Ablation Study

In this section, we analyze the effectiveness of our TAL framework and conduct comprehensive ablation studies on the detailed components of TAL. We report the results under protocol-2 with the target datasets Market-1501 (M) and CUHK03 (C).

Effectiveness of Two-stream Modeling. Since TAL aims to simultaneously capture the domain-specific and domain-invariant information in a single framework, it is important to investigate how each sub-network influences the overall performance and how they work together. To this end, we perform independent evaluations on the output of each network, and the results are reported in Table 3. With neither the DS nor the DI network, our framework degenerates into QAConv-GS [26], which can be regarded as our baseline. Introducing the DS and DI networks consistently improves the performance. Finally, by combining the results of both the DS and DI networks under the IBN-Net50 backbone, the performance of TAL is further improved, i.e., surpassing the baseline by 5.1% and 9.2% in mAP on Market-1501 and CUHK03, respectively. In the meantime, significant improvements are also achieved with the ResNet-50 backbone. These results indicate that both the DS and DI information play an important role in enhancing the generalization ability of the re-id model, and that the two-stream structure of TAL successfully captures both clues as complements to each other.

Effectiveness of Domain Adaptive BN. To evaluate the effectiveness of DABN in adaptively aggregating domain-specific information, we compare it with an average pooling and a simple BN baseline. From Table 4, it can be observed that replacing the attention mechanism with an average pooling operation decreases the performance by about 1% on both datasets. In the meantime, replacing DABN with a vanilla BN layer also deteriorates the performance. These results indicate that DABN facilitates the fusion of statistics from the domain-specific BNs and effectively approximates the unseen target domain information.

DS Network | M: mAP | M: top-1 | C: mAP | C: top-1
Average Pooling | 58.2 | 82.5 | 29.4 | 31.1
Voting | 58.5 | 82.4 | 30.1 | 31.7
MSDA-QAConv | 60.1 | 83.1 | 31.0 | 33.6
Table 5: Comparative results when employing different training strategies in the DS network. M: Market-1501. C: CUHK03.
res4 | res5 | M: mAP | M: top-1 | C: mAP | C: top-1
✓ | × | 61.7 | 83.2 | 28.9 | 30.6
× | ✓ | 62.5 | 83.9 | 30.4 | 32.2
✓ | ✓ | 64.4 | 85.1 | 35.0 | 36.1
Table 6: Comparative results when employing different levels of backbone features. M: Market-1501. C: CUHK03.

Effectiveness of Domain Adaptive Matching. In the DS network, we employ MSDA-QAConv to aggregate the correspondence maps and then perform matching. To evaluate its contribution, we compare it with two baselines. First, instead of domain-adaptive attention, we employ the average response map to calculate the matching results. As shown in Table 5, this causes a 1.6% to 1.9% performance drop in mAP. Second, we remove the MSDA-QAConv block and directly sum up the voting results from the domain-specific matchers. This also results in inferior performance. These results verify the strong adaptability of the proposed domain-adaptive matching strategy.

Effectiveness of Multi-Scale Matching. We conduct experiments employing different levels of backbone features for calculating the similarity between image pairs. Specifically, we only investigate the features from the res4 and res5 layers, because low-level features from the res2 and res3 layers have relatively high resolution and would cause memory and efficiency issues. As shown in Table 6, although single-level features already yield promising results, combining them further improves the performance significantly. These results demonstrate that multi-scale features not only work well in supervised re-id tasks, but also benefit DG re-id.

Figure 5: Visualization of the top-5 ranking results. These queries are difficult for the QAConv-GS [26] model to retrieve, but much better results are achieved by TAL. The green and red bounding boxes denote the correct and incorrect matches, respectively.

4.5 Qualitative Results

To better illustrate the improvements achieved with our framework, we visualize some qualitative matching results in Fig. 5. We can observe that our model (TAL) successfully retrieves the difficult queries from the target domain under various challenging conditions, including occlusion and pose/viewpoint variations, where the baseline model (QAConv-GS) fails. Notably, although QAConv-GS also adopts a local matching strategy, it fails to accurately differentiate persons with similar appearances. In contrast, our model performs much better under this circumstance. We believe this is because our framework simultaneously captures DS and DI information, which makes TAL more robust to the feature variations caused by domain shift.

To better present the working mechanism of the DS and DI networks, we visualize the local correspondences between the query and gallery images. As shown in Fig. 6, the domain experts in the DS network generate different matching patterns. For example, the expert trained on CUHK03 focuses more on the head and leg regions, possibly because these regions in Market-1501 have statistics similar to those in CUHK03. In contrast, the correspondences generated by the DI network trained on the hybrid dataset tend to be located in more diverse regions, e.g., head, body, and legs, which are critical for identifying a person. This demonstrates that the DS and DI networks provide complementary information to each other, enabling better identification in unseen domains.

Figure 6: Examples of the correspondences between local patches. We evaluate on Market-1501 while the model is trained on CUHK03, MSMT17, and RandPerson. The left three columns show the matching results generated by three domain experts in the DS network, and the right column shows the results generated by the DI network with hybrid inputs.

5 Conclusion and Limitation

In this paper, we propose a two-stream framework to address the challenges in DG re-id, where the domain-specific and domain-invariant features are simultaneously captured in a unified framework. We also design two adaptive learning modules to facilitate domain generalizable learning. Extensive results on both single-source and multi-source DG re-id benchmarks demonstrate that the proposed framework significantly outperforms the state-of-the-art methods.

Limitation. Although the employment of DSBN makes our framework more efficient than traditional MoE frameworks, it still faces challenges in terms of memory and speed when there are a large number of training domains, e.g., the 15 camera views of the MSMT17 dataset. In this work, we replace all the BN layers with DSBN. It would be interesting to explore whether some BN layers can be shared, or to introduce a dynamic learning strategy [24] into DSBN. This would further improve the efficiency of our framework.

References

  • [1] Ejaz Ahmed, Michael Jones, and Tim K Marks. An improved deep learning architecture for person re-identification. In CVPR, pages 3908–3916, 2015.
  • [2] Karim Ahmed, Mohammad Haris Baig, and Lorenzo Torresani. Network of experts for large-scale image categorization. In ECCV, pages 516–532, 2016.
  • [3] Zechen Bai, Zhigang Wang, Jian Wang, Di Hu, and Errui Ding. Unsupervised multi-source domain adaptation for person re-identification. In CVPR, pages 12914–12923, 2021.
  • [4] Woong-Gi Chang, Tackgeun You, Seonguk Seo, Suha Kwak, and Bohyung Han. Domain-specific batch normalization for unsupervised domain adaptation. In CVPR, pages 7354–7362, 2019.
  • [5] Weihua Chen, Xiaotang Chen, Jianguo Zhang, and Kaiqi Huang. Beyond triplet loss: a deep quadruplet network for person re-identification. In CVPR, pages 403–412, 2017.
  • [6] Yanbei Chen, Xiatian Zhu, and Shaogang Gong. Person re-identification by deep learning multi-scale representations. In ICCV Workshops, pages 2590–2600, 2017.
  • [7] Yanbei Chen, Xiatian Zhu, and Shaogang Gong. Instance-guided context rendering for cross-domain person re-identification. In ICCV, pages 232–242, 2019.
  • [8] Ying-Cong Chen, Xiatian Zhu, Wei-Shi Zheng, and Jian-Huang Lai. Person re-identification by camera correlation aware feature augmentation. IEEE Trans. Pattern Anal. Mach. Intell., 40(2):392–408, 2017.
  • [9] Seokeon Choi, Taekyung Kim, Minki Jeong, Hyoungseob Park, and Changick Kim. Meta batch-instance normalization for generalizable person re-identification. In CVPR, pages 3425–3435, 2021.
  • [10] Yongxing Dai, Xiaotong Li, Jun Liu, Zekun Tong, and Ling-Yu Duan. Generalizable person re-identification with relevance-aware mixture of experts. In CVPR, pages 16145–16154, 2021.
  • [11] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009.
  • [12] Weijian Deng, Liang Zheng, Qixiang Ye, Guoliang Kang, Yi Yang, and Jianbin Jiao. Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification. In CVPR, pages 994–1003, 2018.
  • [13] Michela Farenzena, Loris Bazzani, Alessandro Perina, Vittorio Murino, and Marco Cristani. Person re-identification by symmetry-driven accumulation of local features. In CVPR, pages 2360–2367, 2010.
  • [14] Huan Fu, Mingming Gong, Chaohui Wang, and Dacheng Tao. Moe-spnet: A mixture-of-experts scene parsing network. Pattern Recognition, 84:226–236, 2018.
  • [15] Yang Fu, Yunchao Wei, Guanshuo Wang, Yuqian Zhou, Honghui Shi, and Thomas S Huang. Self-similarity grouping: A simple unsupervised cross domain adaptation approach for person re-identification. In ICCV, pages 6112–6121, 2019.
  • [16] Yixiao Ge, Feng Zhu, Dapeng Chen, Rui Zhao, and Hongsheng Li. Self-paced contrastive learning with hybrid memory for domain adaptive object re-id. In NeurIPS, 2020.
  • [17] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. NIPS, 27, 2014.
  • [18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
  • [19] Alexander Hermans, Lucas Beyer, and Bastian Leibe. In defense of the triplet loss for person re-identification. CoRR, abs/1703.07737, 2017.
  • [20] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR, pages 7132–7141, 2018.
  • [21] Jieru Jia, Qiuqi Ruan, and Timothy M Hospedales. Frustratingly easy person re-identification: Generalizing person re-id in practice. In BMVC, 2019.
  • [22] Xin Jin, Cuiling Lan, Wenjun Zeng, Zhibo Chen, and Li Zhang. Style normalization and restitution for generalizable person re-identification. In CVPR, pages 3143–3152, 2020.
  • [23] Wei Li, Rui Zhao, Tong Xiao, and Xiaogang Wang. Deepreid: Deep filter pairing neural network for person re-identification. In CVPR, pages 152–159, 2014.
  • [24] Yunsheng Li, Lu Yuan, Yinpeng Chen, Pei Wang, and Nuno Vasconcelos. Dynamic transfer for multi-source domain adaptation. In CVPR, pages 10998–11007, 2021.
  • [25] Shengcai Liao and Ling Shao. Interpretable and generalizable person re-identification with query-adaptive convolution and temporal lifting. In ECCV, pages 456–474, 2020.
  • [26] Shengcai Liao and Ling Shao. Graph sampling based deep metric learning for generalizable person re-identification. CoRR, abs/2104.01546, 2021.
  • [27] Chong Liu, Xiaojun Chang, and Yi-Dong Shen. Unity style transfer for person re-identification. In CVPR, pages 6887–6896, 2020.
  • [28] David G Lowe. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis., 60(2):91–110, 2004.
  • [29] Andy J Ma, Pong C Yuen, and Jiawei Li. Domain transfer support vector ranking for person re-identification without target camera label information. In ICCV, pages 3567–3574, 2013.
  • [30] Djebril Mekhazni, Amran Bhuiyan, George Ekladious, and Eric Granger. Unsupervised domain adaptation in the dissimilarity space for person re-identification. In ECCV, pages 159–174, 2020.
  • [31] Jiaxu Miao, Yu Wu, Ping Liu, Yuhang Ding, and Yi Yang. Pose-guided feature alignment for occluded person re-identification. In ICCV, pages 542–551, 2019.
  • [32] Hyeonseob Nam and Hyo-Eun Kim. Batch-instance normalization for adaptively style-invariant neural networks. In NeurIPS, pages 2563–2572, 2018.
  • [33] Xingang Pan, Ping Luo, Jianping Shi, and Xiaoou Tang. Two at once: Enhancing learning and generalization capacities via ibn-net. In ECCV, pages 464–479, 2018.
  • [34] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Z. Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In NeurIPS, pages 8024–8035, 2019.
  • [35] Xuelin Qian, Yanwei Fu, Tao Xiang, Yu-Gang Jiang, and Xiangyang Xue. Leader-based multi-scale attention deep architecture for person re-identification. IEEE Trans. Pattern Anal. Mach. Intell., 42(2):371–385, 2019.
  • [36] Ergys Ristani, Francesco Solera, Roger Zou, Rita Cucchiara, and Carlo Tomasi. Performance measures and a data set for multi-target, multi-camera tracking. In ECCV, pages 17–35, 2016.
  • [37] Jifei Song, Yongxin Yang, Yi-Zhe Song, Tao Xiang, and Timothy M Hospedales. Generalizable person re-identification by domain-invariant mapping network. In CVPR, pages 719–728, 2019.
  • [38] Liangchen Song, Cheng Wang, Lefei Zhang, Bo Du, Qian Zhang, Chang Huang, and Xinggang Wang. Unsupervised domain adaptive re-identification: Theory and practice. Pattern Recognition, 102:107173, 2020.
  • [39] Yifan Sun, Changmao Cheng, Yuhan Zhang, Chi Zhang, Liang Zheng, Zhongdao Wang, and Yichen Wei. Circle loss: A unified perspective of pair similarity optimization. In CVPR, pages 6398–6407, 2020.
  • [40] Yifan Sun, Liang Zheng, Yali Li, Yi Yang, Qi Tian, and Shengjin Wang. Learning part-based convolutional features for person re-identification. IEEE Trans. Pattern Anal. Mach. Intell., 2019.
  • [41] Yifan Sun, Liang Zheng, Yi Yang, Qi Tian, and Shengjin Wang. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In ECCV, pages 480–496, 2018.
  • [42] Yifan Sun, Liang Zheng, Yi Yang, Qi Tian, and Shengjin Wang. Beyond part models: Person retrieval with refined part pooling (and A strong convolutional baseline). In ECCV, pages 501–518, 2018.
  • [43] Dongkai Wang and Shiliang Zhang. Unsupervised person re-identification via multi-label classification. In CVPR, pages 10981–10990, 2020.
  • [44] Guanshuo Wang, Yufeng Yuan, Xiong Chen, Jiwei Li, and Xi Zhou. Learning discriminative features with multiple granularities for person re-identification. In ACM MM, pages 274–282, 2018.
  • [45] Xin Wang, Fisher Yu, Lisa Dunlap, Yi-An Ma, Ruth Wang, Azalia Mirhoseini, Trevor Darrell, and Joseph E Gonzalez. Deep mixture of experts via shallow embedding. In UAI, pages 552–562, 2020.
  • [46] Yanan Wang, Shengcai Liao, and Ling Shao. Surpassing real-world source training data: Random 3d characters for generalizable person re-identification. In ACM MM, pages 3422–3430, 2020.
  • [47] Longhui Wei, Shiliang Zhang, Wen Gao, and Qi Tian. Person transfer gan to bridge domain gap for person re-identification. In CVPR, pages 79–88, 2018.
  • [48] Yu Wu, Yutian Lin, Xuanyi Dong, Yan Yan, Wanli Ouyang, and Yi Yang. Exploit the unknown gradually: One-shot video-based person re-identification by stepwise learning. In CVPR, pages 5177–5186, 2018.
  • [49] Yichao Yan, Jie Qin, Jiaxin Chen, Li Liu, Fan Zhu, Ying Tai, and Ling Shao. Learning multi-granular hypergraphs for video-based person re-identification. In CVPR, pages 2899–2908, 2020.
  • [50] Mang Ye, Jianbing Shen, Gaojie Lin, Tao Xiang, Ling Shao, and Steven C. H. Hoi. Deep learning for person re-identification: A survey and outlook. IEEE Trans. Pattern Anal. Mach. Intell., 2021.
  • [51] Ye Yuan, Wuyang Chen, Tianlong Chen, Yang Yang, Zhou Ren, Zhangyang Wang, and Gang Hua. Calibrated domain-invariant learning for highly generalizable large scale re-identification. In WACV, pages 3589–3598, 2020.
  • [52] Yunpeng Zhai, Shijian Lu, Qixiang Ye, Xuebo Shan, Jie Chen, Rongrong Ji, and Yonghong Tian. Ad-cluster: Augmented discriminative clustering for domain adaptive person re-identification. In CVPR, pages 9021–9030, 2020.
  • [53] Yuyang Zhao, Zhun Zhong, Fengxiang Yang, Zhiming Luo, Yaojin Lin, Shaozi Li, and Nicu Sebe. Learning to generalize unseen domains via memory-based multi-source meta-learning for person re-identification. In CVPR, pages 6277–6286, 2021.
  • [54] Liang Zheng, Zhi Bie, Yifan Sun, Jingdong Wang, Chi Su, Shengjin Wang, and Qi Tian. Mars: A video benchmark for large-scale person re-identification. In ECCV, pages 868–884, 2016.
  • [55] Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. Scalable person re-identification: A benchmark. In ICCV, pages 1116–1124, 2015.
  • [56] Zhun Zhong, Liang Zheng, Donglin Cao, and Shaozi Li. Re-ranking person re-identification with k-reciprocal encoding. In CVPR, pages 1318–1327, 2017.
  • [57] Zijie Zhuang, Longhui Wei, Lingxi Xie, Tianyu Zhang, Hengheng Zhang, Haozhe Wu, Haizhou Ai, and Qi Tian. Rethinking the distribution gap of person re-identification with camera-based batch normalization. In ECCV, pages 140–157, 2020.