
A Benchmark of Video-Based Clothes-Changing Person Re-Identification

Likai Wang*1, Xiangqun Zhang*1, Ruize Han1, Jialin Yang1, Xiaoyu Li1, Wei Feng1, Song Wang2
1Tianjin University, China  2University of South Carolina Columbia, USA
{kkww, clzxq, han_ruize, abdullahyang, xiaoyuli}@tju.edu.cn, wfeng@ieee.org, songwang@cec.sc.edu
*Equal contribution.
Abstract

Person re-identification (Re-ID) is a classical computer vision task and has achieved great progress so far. Recently, long-term Re-ID with clothes changing has attracted increasing attention. However, existing methods mainly focus on the image-based setting, where richer temporal information is overlooked. In this paper, we focus on the relatively new yet practical problem of clothes-changing video-based person re-identification (CCVReID), which has been less studied. We systematically study this problem by simultaneously considering the clothes-inconsistency challenge and the temporal information contained in the video sequence for person Re-ID. Based on this, we develop a two-branch confidence-aware re-ranking framework for handling the CCVReID problem. The proposed framework integrates two branches that consider both the classical appearance features and cloth-free gait features through a confidence-guided re-ranking strategy. This method provides a baseline for further studies. Also, we build two new benchmark datasets for the CCVReID problem, including a large-scale synthetic video dataset and a real-world one, both containing human sequences with various clothing changes. We will release the benchmark and code of this work to the public.

1 Introduction

Figure 1: Examples from clothes-changing benchmark datasets. (a) Image-based datasets PRCC [44], Celeb-reID [23], and VC-Clothes [40] (from left to right). Each column shows a pair of samples of the same person in different clothes. It can be seen that, caused by clothes changing, the appearance information in a single image shows great bias. (b) The video-based dataset SCCVReID collected in this paper. Each row shows a video sequence of the same person in different clothes. It can be seen that clothes-irrelevant information, e.g., human gait, can be clearly obtained.

The past few years have witnessed person re-identification (Re-ID) become a very popular topic in the computer vision community, because of its significance in many real-world applications, such as video surveillance, unmanned supermarkets, etc. The classical person Re-ID task [51, 27, 52] focuses on image-based data, in which the clothes of each person remain unchanged across the dataset. This limits the utilization of information from video sequences, which are easy to obtain in many real-world applications. Also, for long-term applications, e.g., historical criminal retrieval, the assumption of unchanged clothes is not reasonable. Accordingly, two categories of person Re-ID tasks have been proposed in recent years, i.e., video-based person Re-ID [33, 30, 11] and clothes-changing person Re-ID [44, 23, 34, 45]. The former focuses on utilizing the temporal information from the video sequence, and the latter aims to extract clothes-independent appearance features for person identification.

In this paper, we are more interested in studying the video-based clothes-changing person Re-ID problem, because video-based Re-ID is complementary to clothes-changing Re-ID. Specifically, (image-based) clothes-changing person Re-ID is very challenging given the very limited information in a single human image, which is dominated by the appearance of the clothes (as seen in Fig. 1(a)). If a video sequence is given, we can obtain more information not related to the clothes, e.g., the human gait (as seen in Fig. 1(b)). However, the video-based clothes-changing person Re-ID problem has not been widely studied. For this problem, only one dataset, namely CCVID [16], is publicly available, which is actually built on a gait recognition dataset [49]. Although clothes changing is considered, all samples in the dataset are taken from the same camera view, and all identities follow the same designated route. Thus, it cannot meet the requirements of the person Re-ID problem, e.g., diverse scenes, varied viewpoints, and abundant samples.

In this work, we propose to build a benchmark to systematically study the Clothes-Changing Video-based Re-ID (CCVReID) problem. For this purpose, the first step is to collect applicable datasets. Considering the convenience of dataset collection and the huge success of synthetic data in cross-domain person Re-ID [38, 40], we first build a large-scale synthetic dataset for CCVReID. We use the game Grand Theft Auto V, whose character modeling is very realistic, for data collection. We record pedestrians for 48 hours in total, obtaining 9,620 sequences from 333 identities, where each identity has 2-37 suits of clothes, with an average of 7. The numbers of sequences, identities, and suits per person are much larger than those of previous datasets. Besides, we also build a real dataset including human sequences with various clothes changes.

We also propose a new baseline for the CCVReID problem. Our basic idea is to take advantage of the clothes-independent appearance feature from each RGB image and the temporal-aware gait feature from the video sequence. Specifically, we first use a video-based Re-ID method [1] and a gait recognition method [3] to extract the appearance and gait features, respectively. We then conduct the inference stage of the Re-ID task, i.e., using a query to obtain the candidate ranking from the gallery, with the appearance and gait representations, respectively. We propose a two-branch ranking fusion framework, containing a candidate relation graph, to combine the ranking lists from these two representations. We also propose a confidence-aware re-weighting strategy to estimate the representation certainty for balancing the two branches. We conduct experiments on an existing dataset and two new datasets collected in this work. The results verify the effectiveness of our method, which outperforms the methods using the appearance and gait features separately by an obvious margin.

We summarize the main contributions in this work:

  • We systematically study a relatively new and practical problem of Clothes-Changing Video-based Re-ID (CCVReID). For this problem, we simultaneously consider the clothes-inconsistent appearance and the temporal information contained in the sequence to complement each other for person Re-ID, which was previously overlooked.

  • We develop a preliminary framework for handling the CCVReID problem. The proposed framework integrates two branches that focus on the classical appearance features and the cloth-free biological features (specifically, the gait feature) through a confidence-balanced re-ranking strategy. This method can provide a baseline for further studies.

  • We build two new benchmark datasets for the proposed problem, i.e., a large-scale synthetic video dataset and a real-world one, both including human sequences with various clothes changes.

Figure 2: Framework of the proposed method for CCVReID. Given a query and several gallery samples, we first extract their appearance and gait features, and calculate the initial similarity of each gallery to the query. Based on the obtained appearance and gait similarities, we collect a certain number of candidates, build the appearance and gait graphs, and use a GCN to estimate a confidence for each original appearance and gait similarity. Finally, we use the confidences to re-weight the initial similarities and fuse all the information for final re-ranking.

2 Related Work

Video-based person Re-ID. Video-based person Re-ID aims to extract spatial-temporal features from consecutive frame sequences. To this end, some existing methods extend image-based Re-ID methods by aggregating multi-frame features through RNNs [33, 30], mean/max pooling [50, 8], temporal attention [15, 5], etc. Other methods perform concurrent spatial-temporal information modeling via 3D convolution [17, 26] or graph convolution [4]. Despite the number of works on video-based Re-ID, most of them focus only on the clothes-consistent setting and are not applicable to long-term application scenarios.
Clothes-changing person Re-ID. Image-based clothes-changing person Re-ID has been widely studied in the literature. Several public datasets, e.g., PRCC [44], Celeb-reID [23], LTCC [34], VC-Clothes&Real28 [40], and COCAS [45], have been collected successively to support this task. Based on these datasets, [23] proposed to use vector-neuron capsules instead of traditional scalar neurons to perceive clothes changes of the same person. [28] proposed a Clothing Agnostic Shape Extraction Network (CASE-Net) to learn shape-based feature representations via adversarial learning and feature disentanglement. Besides, other works attempted to leverage contour sketches [44], silhouettes [21, 24], faces [40], skeletons [34], 3D shape [6], or radio signals [14] to capture clothes-irrelevant features. Despite the achievements of image-based methods, they are susceptible to the quality of person images, i.e., they are less tolerant to noise due to the limited information contained in a single frame.

Recently, some works [46, 47, 16] have extended clothes-changing person Re-ID to video-based data. [46] collected a Motion-ReID dataset with clothes changing, and developed a FIne moTion encoDing (FITD) model based on true motion cues from videos. [47] collected a Cloth-Varying vIDeo re-ID (CVID-reID) dataset, which contains video tracklets of celebrities posted on the Internet, and proposed to learn hybrid feature representations from image sequences and skeleton sequences. Furthermore, [16] constructed a Clothes-Changing Video person re-ID (CCVID) dataset from the gait recognition dataset FVG [49], and proposed a Clothes-based Adversarial Loss (CAL) to mine clothes-irrelevant features from the original RGB images. Among the aforementioned datasets, only the CCVID [16] dataset is publicly available. However, since the original FVG [49] dataset contains only one view, and the field of view contains only one person who follows designated routes, the CCVID dataset built on it is not suitable for real-world scenarios, where there are usually multiple people walking in arbitrary directions. To support the CCVReID task, we build two new benchmark datasets of surveillance scenarios in this paper.
Person Re-ID with gait. Affected by lighting, clothing, view, and other factors, the same person's appearance features may change a lot, which brings challenges to the Re-ID task. To address this, [47, 31] attempted to introduce gait to achieve complementarity between appearance and gait features. [24] performed gait prediction and regularization from a single image, which used gait to drive the main Re-ID model to learn clothes-independent features. However, such methods only consider the fusion of appearance and gait information within a single sample. In the proposed method, in addition to considering this, we also introduce the contextual information among neighboring samples.
Synthetic datasets. Due to the difficulty of building large-scale datasets in the real world and the cost of extensive manual annotation, synthetic datasets are gaining increasing attention in many computer vision tasks, including pose estimation [12, 20], tracking [12], action recognition [39], and semantic segmentation [35, 32], etc. For person Re-ID, [2, 38, 40] have introduced synthetic data. Although the PersonX [38] and VC-Clothes [40] datasets involve clothes changing, they are both image-based. To promote research on CCVReID, we construct a large-scale synthetic dataset in this paper.

3 Proposed Method

3.1 Overview

The overall pipeline of the proposed method is illustrated in Fig. 2. Given an image sequence x^{\mathrm{q}}, whose identity label is denoted as y^{\mathrm{q}}, as the query, person Re-ID aims to return a ranking list of gallery image sequences where the sequences from the same person, i.e., \{x^{\mathrm{g}}_{i}|y^{\mathrm{g}}_{i}=y^{\mathrm{q}}\}, rank at the top. To this end, this paper proposes a two-branch confidence-aware re-ranking framework. Specifically, given a query and several galleries, we first extract the appearance and gait features. Based on that, we conduct the inference stage to get the initial ranking lists and collect the top-K candidates for model training. Then, we construct candidate relation graphs and use a GCN to estimate the confidence for balancing the two branches. Finally, we fuse the confidence-aware re-weighted appearance and gait representations, and re-rank the galleries as the final retrieval result.

3.2 Two-branch Initial Ranking

Two-branch top-K candidates collection. Given a query sequence x^{\mathrm{q}} and a gallery set \mathcal{G}=\{x^{\mathrm{g}}_{i}|_{i=1}^{N}\}, where N represents the number of gallery sequences, the extracted appearance features can be denoted as \mathbf{a}^{\mathrm{q}}=\mathrm{f_{a}}(x^{\mathrm{q}}|\theta) and \{\mathbf{a}^{\mathrm{g}}_{i}=\mathrm{f_{a}}(x^{\mathrm{g}}_{i}|\theta)|_{i=1}^{N}\}, respectively. Similarly, the gait features can be denoted as \mathbf{g}^{\mathrm{q}}=\mathrm{f_{g}}(x^{\mathrm{q}}|\phi) and \{\mathbf{g}^{\mathrm{g}}_{i}=\mathrm{f_{g}}(x^{\mathrm{g}}_{i}|\phi)|_{i=1}^{N}\}. \mathrm{f_{a}}(\cdot|\theta) denotes the appearance feature extraction model with parameters \theta, and \mathrm{f_{g}}(\cdot|\phi) denotes the gait model with parameters \phi. Then, we can obtain the ranking list \mathcal{R}^{\mathrm{a}}=\{a_{1},a_{2},\cdots,a_{N}\} from the appearance features according to the pairwise distance d(\mathbf{a}^{\mathrm{q}},\mathbf{a}^{\mathrm{g}}_{i}) between the query x^{\mathrm{q}} and each gallery x^{\mathrm{g}}_{i}, where d(\mathbf{a}^{\mathrm{q}},\mathbf{a}^{\mathrm{g}}_{a_{i}})<d(\mathbf{a}^{\mathrm{q}},\mathbf{a}^{\mathrm{g}}_{a_{i+1}}). In the same way, the ranking list \mathcal{R}^{\mathrm{b}}=\{g_{1},g_{2},\cdots,g_{N}\} calculated from the gait (biological) features can also be obtained.

Since the value of N is usually large, we select K-nearest neighbors of the query for subsequent training. Specifically, we select the top \gamma\cdot K\ (0\leq\gamma\leq 1) samples of \mathcal{R}^{\mathrm{a}}, and supplement the rest with the top samples of \mathcal{R}^{\mathrm{b}}. The selected K candidates are denoted as \mathcal{R}^{\mathrm{t}}=\{t_{1},t_{2},\cdots,t_{K}\}.
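To make this two-branch ranking and candidate collection step concrete, the following is a minimal NumPy sketch (not the authors' released code); the placeholder random features, their dimensions, and the function names are our own assumptions, while K=100 and \gamma=0.75 follow Sec. 3.4.

```python
import numpy as np

def rank_by_distance(query_feat, gallery_feats):
    """Return gallery indices sorted by ascending Euclidean distance, plus the distances."""
    dists = np.linalg.norm(gallery_feats - query_feat, axis=1)
    return np.argsort(dists), dists

def collect_candidates(rank_a, rank_b, K=100, gamma=0.75):
    """Keep the top gamma*K galleries of the appearance ranking, then fill the
    remaining slots with the highest-ranked unseen galleries of the gait ranking."""
    cand = list(rank_a[:int(round(gamma * K))])
    for idx in rank_b:
        if len(cand) >= K:
            break
        if idx not in cand:
            cand.append(idx)
    return np.array(cand)

# Toy example with hypothetical 256-d appearance and 64-d gait features.
rng = np.random.default_rng(0)
N = 500
a_q, a_g = rng.normal(size=256), rng.normal(size=(N, 256))
g_q, g_g = rng.normal(size=64), rng.normal(size=(N, 64))
rank_a, _ = rank_by_distance(a_q, a_g)
rank_b, _ = rank_by_distance(g_q, g_g)
candidates = collect_candidates(rank_a, rank_b)   # 100 candidate gallery indices
```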
Candidate relation graph building. Given the selected samples \mathcal{R}^{\mathrm{t}}, we first construct two graphs with respect to the appearance and gait features, respectively. We approximately represent the appearance similarity between the query x^{\mathrm{q}} and sample x^{\mathrm{g}}_{t_{k}} in \mathcal{R}^{\mathrm{t}} by applying min-max normalization to the distance:

s_{k}^{\mathrm{a}}=\frac{M-d(\mathbf{a}^{\mathrm{q}},\mathbf{a}^{\mathrm{g}}_{t_{k}})}{M-m}, (1)

where M and m denote the maximum and minimum values of all d(\mathbf{a}^{\mathrm{q}},\mathbf{a}^{\mathrm{g}}_{t_{k}}) with t_{k}\in\mathcal{R}^{\mathrm{t}}, respectively. The gait similarity s_{k}^{\mathrm{b}} can be obtained in the same way.
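A small sketch of Eq. (1), assuming plain NumPy; the small epsilon guarding against identical distances is our addition, not part of the paper.

```python
import numpy as np

def minmax_similarity(dists):
    """Map candidate distances to [0, 1] similarities: the closest candidate
    gets similarity ~1 and the farthest gets 0."""
    M, m = dists.max(), dists.min()
    return (M - dists) / (M - m + 1e-12)   # epsilon in case M == m

print(minmax_similarity(np.array([0.8, 1.5, 0.3, 2.0])))
```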

The appearance similarities between the query and the samples in \mathcal{R}^{\mathrm{t}} and the corresponding gait similarities are combined into an initialized similarity matrix \mathbf{S}_{0}\in\mathbb{R}^{K\times 2}, and then embedded into a shared matrix \mathbf{S}\in\mathbb{R}^{K\times D} by an encoder. For the appearance graph, the nodes represent all samples, and the feature of each node is initialized as the corresponding row of the matrix \mathbf{S}. The edges represent the proximity relationship between the samples. If the gait feature \mathbf{g}^{\mathrm{g}}_{t_{j}} is one of the n-nearest neighbors of the feature \mathbf{g}^{\mathrm{g}}_{t_{k}}, there is an edge from t_{j} to t_{k}, and the edge feature \mathbf{e}_{jk} is defined as the element-wise multiplication of \mathbf{a}^{\mathrm{g}}_{t_{j}} and \mathbf{a}^{\mathrm{g}}_{t_{k}}. In the same way, the gait graph can be constructed.
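The following PyTorch sketch illustrates one possible implementation of this construction for the appearance graph; the encoder structure follows the MLP described in Sec. 3.4, while the dropout rate, feature dimensions, and placeholder features are assumptions.

```python
import torch
import torch.nn as nn

K, D, n_nbrs = 100, 32, 30

encoder = nn.Sequential(          # embeds S0 (K x 2) into S (K x D)
    nn.Linear(2, D), nn.BatchNorm1d(D), nn.PReLU(), nn.Dropout(0.1), nn.Linear(D, D)
)

s_a = torch.rand(K)               # appearance similarities of the K candidates
s_b = torch.rand(K)               # gait similarities
S0 = torch.stack([s_a, s_b], dim=1)            # (K, 2)
S = encoder(S0)                                # (K, D) shared node features

# Appearance-graph edges: t_j -> t_k if the gait feature of t_j is among the
# n nearest neighbors of the gait feature of t_k.
gait_feats = torch.randn(K, 64)                # placeholder gait features
app_feats = torch.randn(K, 256)                # placeholder appearance features
gait_dist = torch.cdist(gait_feats, gait_feats)                      # (K, K)
nbr_idx = gait_dist.topk(n_nbrs + 1, largest=False).indices[:, 1:]   # drop self

edges, edge_feats = [], []
for k in range(K):
    for j in nbr_idx[k]:
        edges.append((int(j), k))
        edge_feats.append(app_feats[j] * app_feats[k])   # element-wise product e_jk
edge_feats = torch.stack(edge_feats)           # (num_edges, 256)
print(S.shape, len(edges), edge_feats.shape)
```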

3.3 Confidence Balanced Re-ranking

Confidence estimation. Based on the appearance and gait graphs, we aim to estimate a confidence for the appearance/gait feature of each sample, which explicitly represents the reliability of the feature. Inspired by the effectiveness and generalization that GCNs have demonstrated in many computer vision applications [42, 18, 36, 41, 10, 43], we adopt a GCN here to explore the latent structural relationships among the candidate samples. The message passing of node k in layer l can be formulated as

\mathbf{H}_{k}^{l}=\sigma(\sum_{j\in\mathcal{N}(k)\cup\{k\}}\mathbf{\alpha}^{\top}\mathbf{e}_{jk}\mathbf{W}^{l}\mathbf{H}_{j}^{l-1}), (2)

where \mathbf{H}_{k}^{l} denotes the feature of node k in layer l, \mathbf{\alpha}^{\top}\mathbf{e}_{jk} denotes the weight of the edge from node j to node k, and \mathbf{W}^{l} denotes the learnable weight matrix of layer l. \mathcal{N}(k) is the set of all nodes that have an edge pointing to node k. \sigma(\cdot) is an activation function.

After the GCN layer, a linear layer is used to map the node feature into a one-dimensional confidence score, which can be formulated as

c_{k}=\omega^{\top}\mathbf{H}_{k}^{L}, (3)

where \omega denotes the learnable weight of the linear layer, and c_{k} denotes the estimated confidence score for node k. Note that the network used to extract appearance confidences from the appearance graph and the one used to extract gait confidences from the gait graph have the same structure but do not share parameters. The appearance confidence for node k is denoted as c_{k}^{\mathrm{a}} and the gait confidence as c_{k}^{\mathrm{b}}.
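Below is a minimal PyTorch sketch of one edge-weighted graph convolution layer in the spirit of Eq. (2) followed by the linear confidence head of Eq. (3). It uses a dense adjacency matrix and aggregates the neighbors' features as in standard message passing; the edge-feature dimension and the random adjacency pattern are placeholders, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class ConfidenceGCN(nn.Module):
    """One edge-weighted GCN layer (Eq. 2) plus a linear confidence head (Eq. 3)."""
    def __init__(self, node_dim=32, edge_dim=256, hidden_dim=32):
        super().__init__()
        self.alpha = nn.Parameter(torch.randn(edge_dim))       # alpha in Eq. (2)
        self.W = nn.Linear(node_dim, hidden_dim, bias=False)   # W^l in Eq. (2)
        self.head = nn.Linear(hidden_dim, 1, bias=False)       # omega in Eq. (3)

    def forward(self, H, edge_feats, adj):
        # H: (K, node_dim) node features; edge_feats[k, j] holds e_jk;
        # adj[k, j] = 1 if j is in N(k) U {k}, i.e., j has an edge pointing to k.
        w = (edge_feats @ self.alpha) * adj                    # (K, K) scalar edge weights
        H1 = torch.relu(w @ self.W(H))                         # aggregate neighbor messages
        return self.head(H1).squeeze(-1)                       # (K,) confidence scores

K = 100
H = torch.randn(K, 32)                  # initialized node features, i.e., rows of S
edge_feats = torch.randn(K, K, 256)     # element-wise products of appearance features
adj = (torch.rand(K, K) < 0.3).float()  # placeholder neighborhood pattern
adj.fill_diagonal_(1.0)                 # include the node itself
conf_scores = ConfidenceGCN()(H, edge_feats, adj)   # two separate instances are used
                                                    # for the appearance and gait graphs
```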
Confidence-aware re-ranking. With the estimated confidence, we propose to adjust the weight of the original appearance/gait similarity for each sample in the fusion phase. The final similarity between the query x^{\mathrm{q}} and the sample x^{\mathrm{g}}_{t_{k}} in \mathcal{R}^{\mathrm{t}} is defined as:

s_{k}=c_{k}^{\mathrm{a}}s_{k}^{\mathrm{a}}+c_{k}^{\mathrm{b}}s_{k}^{\mathrm{b}}+\omega_{0}^{\top}\mathbf{S}(k,:), (4)

where \mathbf{S}(k,:)\in\mathbb{R}^{D} denotes the k-th row of matrix \mathbf{S}, i.e., the initialized node feature, which is mapped into one dimension by a linear layer with weight \omega_{0}. Finally, the revised ranking list is obtained by sorting the final similarities in descending order.
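A small sketch of the confidence-aware fusion in Eq. (4) and the subsequent re-ranking, with placeholder similarities, confidences, and node features.

```python
import torch
import torch.nn as nn

K, D = 100, 32
s_a, s_b = torch.rand(K), torch.rand(K)     # initial appearance / gait similarities
c_a, c_b = torch.rand(K), torch.rand(K)     # estimated confidences
S = torch.randn(K, D)                       # initialized node features
linear = nn.Linear(D, 1, bias=False)        # omega_0 in Eq. (4)

s_final = c_a * s_a + c_b * s_b + linear(S).squeeze(-1)    # (K,) final similarities
reranked = torch.argsort(s_final, descending=True)         # revised candidate order
print(reranked[:10])
```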
Confidence loss. To guide the learning of the network, we propose the confidence loss by designing pseudo-labels for the confidence. Specifically, assume the ground-truth label of the sample x^{\mathrm{g}}_{t_{k}} is b_{k}. When the identity label of sample x^{\mathrm{g}}_{t_{k}} equals that of the query, i.e., y^{\mathrm{g}}_{t_{k}}=y^{\mathrm{q}}, the value of b_{k} is 1, otherwise it is 0. Given the initial appearance similarity s_{k}^{\mathrm{a}} and the ground-truth label b_{k} of sample x^{\mathrm{g}}_{t_{k}}, we propose to define the pseudo-label of confidence for the appearance feature as

\tilde{c}_{k}^{\mathrm{a}}=|(1-b_{k})-s_{k}^{\mathrm{a}}|, (5)

where |\cdot| denotes the absolute value.

The reasoning is as follows. Since the similarity indicates the degree of relation between two samples, i.e., the greater the similarity, the higher the probability that the query and the gallery sample have the same identity label, we define a threshold \lambda to represent the prediction result based on the similarity. If s_{k}^{\mathrm{a}}\geq\lambda, we consider the prediction result \hat{b}_{k}^{\mathrm{a}} to be 1. In this case, if b_{k} is also 1, i.e., the prediction is correct, the value of the confidence should be large. Intuitively, the larger the similarity s_{k}^{\mathrm{a}}, the larger the confidence \tilde{c}_{k}^{\mathrm{a}} should be. Therefore, \tilde{c}_{k}^{\mathrm{a}} can be assigned the value of s_{k}^{\mathrm{a}}. In contrast, if b_{k}=0, i.e., the prediction is wrong, the value of the confidence should be small, and the larger the similarity s_{k}^{\mathrm{a}}, the smaller the confidence \tilde{c}_{k}^{\mathrm{a}} should be. Therefore, \tilde{c}_{k}^{\mathrm{a}} can be assigned the value of 1-s_{k}^{\mathrm{a}}. The above can be formulated as

\tilde{c}_{k}^{\mathrm{a}}=\begin{cases}s_{k}^{\mathrm{a}},&b_{k}=1\\ 1-s_{k}^{\mathrm{a}},&b_{k}=0\end{cases}. (6)

Similarly, if s_{k}^{\mathrm{a}}<\lambda, we consider the prediction result \hat{b}_{k}^{\mathrm{a}} to be 0. In this case, if b_{k}=1, i.e., the prediction is wrong, the value of the confidence should be small, and the smaller the similarity s_{k}^{\mathrm{a}}, the smaller the confidence \tilde{c}_{k}^{\mathrm{a}} should be. Therefore, \tilde{c}_{k}^{\mathrm{a}} can be assigned the value of s_{k}^{\mathrm{a}}. In contrast, if b_{k}=0, i.e., the prediction is correct, the value of the confidence should be large, and the smaller the similarity s_{k}^{\mathrm{a}}, the larger the confidence \tilde{c}_{k}^{\mathrm{a}} should be. Therefore, \tilde{c}_{k}^{\mathrm{a}} can be assigned the value of 1-s_{k}^{\mathrm{a}}. The formulation of the above analysis takes the same form as Eq. 6. Further, Eq. 6 can be rewritten as follows:

\tilde{c}_{k}^{\mathrm{a}}=\begin{cases}s_{k}^{\mathrm{a}}-(1-b_{k}),&b_{k}=1\\ (1-b_{k})-s_{k}^{\mathrm{a}},&b_{k}=0\end{cases}, (7)

which is exactly the case-wise form of Eq. 5. In the same way, the pseudo-label of confidence for the gait feature can be defined as

\tilde{c}_{k}^{\mathrm{b}}=|(1-b_{k})-s_{k}^{\mathrm{b}}|. (8)

In the training stage, the designed pseudo-labels \tilde{c}_{k}^{\mathrm{a}} and \tilde{c}_{k}^{\mathrm{b}} serve as the ground truth of confidence to guide the confidence estimation. The confidence loss can be formulated as

\mathcal{L}_{c}=\sum_{k=1}^{K}(|\tilde{c}_{k}^{\mathrm{a}}-c_{k}^{\mathrm{a}}|+|\tilde{c}_{k}^{\mathrm{b}}-c_{k}^{\mathrm{b}}|). (9)
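The pseudo-labels of Eqs. (5) and (8) and the loss of Eq. (9) can be sketched as follows; the tensors stand in for the network outputs and candidate labels and are not real data.

```python
import torch

def confidence_pseudo_label(s, b):
    # |(1 - b) - s|: equals s when b = 1 and 1 - s when b = 0 (Eqs. 5 and 8)
    return torch.abs((1.0 - b) - s)

def confidence_loss(c_a, c_b, s_a, s_b, b):
    tgt_a = confidence_pseudo_label(s_a, b)
    tgt_b = confidence_pseudo_label(s_b, b)
    return torch.abs(tgt_a - c_a).sum() + torch.abs(tgt_b - c_b).sum()   # Eq. (9)

K = 100
s_a, s_b = torch.rand(K), torch.rand(K)                        # initial similarities
c_a = torch.rand(K, requires_grad=True)                        # predicted confidences
c_b = torch.rand(K, requires_grad=True)
b = (torch.rand(K) < 0.2).float()                              # identity match labels
print(confidence_loss(c_a, c_b, s_a, s_b, b).item())
```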

3.4 Network Setting

Ranking loss. We also consider the widely used triplet ranking loss [19] in person Re-ID:

\mathcal{L}_{r}=\sum_{k=1}^{K_{p}}\sum_{j=1}^{K_{n}}[\epsilon-(s_{k}^{\mathrm{p}}-s_{j}^{\mathrm{n}})]_{+}, (10)

where s_{k}^{\mathrm{p}} denotes the final similarity between the query and a positive sample whose ground-truth label is 1, and s_{j}^{\mathrm{n}} denotes that for a negative sample whose ground-truth label is 0. K_{p} and K_{n} denote the numbers of positive and negative samples, respectively, with K_{p}+K_{n}=K. \epsilon denotes the margin. Combining the confidence and ranking losses, the total loss in the training stage can be formulated as \mathcal{L}=\mathcal{L}_{c}+\mathcal{L}_{r}.
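A minimal sketch of the ranking loss in Eq. (10) over all positive/negative candidate pairs, with the margin set to 0.2 as in Sec. 3.4; the inputs are placeholders.

```python
import torch

def ranking_loss(s_final, b, margin=0.2):
    s_pos = s_final[b == 1]                          # K_p positive candidates
    s_neg = s_final[b == 0]                          # K_n negative candidates
    diff = s_pos.unsqueeze(1) - s_neg.unsqueeze(0)   # (K_p, K_n) pairwise gaps
    return torch.clamp(margin - diff, min=0).sum()   # hinge over all pairs

K = 100
s_final = torch.rand(K, requires_grad=True)          # fused final similarities
b = (torch.rand(K) < 0.2).float()                    # ground-truth match labels
L_r = ranking_loss(s_final, b)
# Total training loss of the paper: L = L_c + L_r (L_c from the previous sketch).
print(L_r.item())
```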
Implementation Details. We use SINet [1], proposed for video-based person Re-ID, as the appearance model \mathrm{f_{a}}(\cdot|\theta), and GaitSet [3], proposed for gait recognition, as the gait model \mathrm{f_{g}}(\cdot|\phi). For SINet, we directly input RGB sequences for appearance feature extraction. For GaitSet, we first apply the instance segmentation algorithm HTC [7] to extract silhouettes from the RGB sequences, and then input the silhouette sequences for gait feature extraction. SINet and GaitSet are pre-trained on ImageNet [9] and GREW [53], respectively, and then trained on the corresponding CCVReID datasets to get the initial features. The number of selected samples for training, i.e., K, is set to 100. The ratio \gamma of selecting samples from \mathcal{R}^{\mathrm{a}} is set to 0.75. In graph building, following [43], the encoder takes a Multi-Layer Perceptron (MLP) structure, which consists of a linear, BN, PReLU, dropout, and linear layer successively. The dimension of the initialized node feature D is 32. The n in n-nearest neighbors for edge connection is set to 30. In confidence estimation, one GCN layer is used, with an output dimension of 32. The margin \epsilon in Eq. 10 is set to 0.2. We use the Adam [25] algorithm to optimize the model in all experiments. During the testing phase, given a query, the K gallery samples selected for training are ranked in the top-K according to their final similarities, followed by the rest of the samples, reordered according to the sum of their initial appearance rank in \mathcal{R}^{\mathrm{a}} and gait rank in \mathcal{R}^{\mathrm{b}}.
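The test-time re-ranking rule described above can be sketched as follows; this is a NumPy illustration with random placeholder rankings, not the released evaluation code.

```python
import numpy as np

def test_time_rerank(rank_a, rank_b, candidates, s_final):
    """Place the K selected candidates first (sorted by final similarity), then the
    remaining gallery samples sorted by the sum of their initial appearance and gait ranks."""
    N = len(rank_a)
    pos_a = np.empty(N, dtype=int)
    pos_a[rank_a] = np.arange(N)                    # position of each gallery in R^a
    pos_b = np.empty(N, dtype=int)
    pos_b[rank_b] = np.arange(N)                    # position of each gallery in R^b
    top = candidates[np.argsort(-s_final)]          # candidates by final similarity
    cand_set = set(candidates.tolist())
    rest = np.array([i for i in range(N) if i not in cand_set])
    rest = rest[np.argsort(pos_a[rest] + pos_b[rest])]   # by summed initial ranks
    return np.concatenate([top, rest])

rng = np.random.default_rng(0)
N, K = 500, 100
rank_a, rank_b = rng.permutation(N), rng.permutation(N)
candidates = rank_a[:K]                 # stand-in for the selected top-K candidates
s_final = rng.random(K)                 # stand-in for the confidence-fused similarities
final_ranking = test_time_rerank(rank_a, rank_b, candidates, s_final)
```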

Figure 3: Examples of CCVID [16] and the proposed benchmark datasets in this paper: (a) CCVID, (b) SCCVReID, (c) RCCVReID.
| Category | Method | CC mAP | CC R1 | CC R5 | CC R10 | Standard mAP | Standard R1 | Standard R5 | Standard R10 |
|---|---|---|---|---|---|---|---|---|---|
| Video-based w/o clothes-changing | AP3D [17] | 26.7 | 47.1 | 71.5 | 80.2 | 39.1 | 79.4 | 94.3 | 96.9 |
| | TCLNet [22] | 30.1 | 48.6 | 72.4 | 81.7 | 43.2 | 84.3 | 95.6 | 97.3 |
| | SINet [1] | 33.0 | 51.0 | 75.5 | 84.6 | 46.5 | 84.4 | 97.4 | 98.6 |
| Gait recognition | GaitSet [3] | 16.7 | 23.4 | 56.0 | 68.5 | 20.6 | 33.6 | 68.4 | 80.1 |
| | GaitPart [13] | 15.9 | 22.4 | 54.7 | 68.8 | 19.2 | 30.4 | 64.7 | 79.3 |
| | GaitGL [29] | 11.7 | 15.6 | 44.3 | 56.7 | 14.5 | 23.5 | 54.1 | 65.4 |
| Image-based clothes-changing | ReIDCaps-R [23] | 12.2 | 30.0 | 53.5 | 65.1 | 12.9 | 32.2 | 56.0 | 67.0 |
| | ReIDCaps-A [23] | 10.2 | 26.0 | 48.6 | 60.4 | 21.2 | 66.8 | 87.1 | 92.1 |
| | Pixel_sampling-R [37] | 13.8 | 29.8 | 57.3 | 68.9 | 21.7 | 57.9 | 81.8 | 88.3 |
| | Pixel_sampling-A [37] | 10.1 | 23.2 | 46.8 | 60.4 | 17.4 | 50.4 | 77.2 | 85.8 |
| | GI-ReID-R [24] | 4.9 | 8.9 | 21.4 | 31.3 | 6.4 | 12.4 | 30.1 | 42.2 |
| | GI-ReID-A [24] | 8.2 | 14.3 | 33.3 | 43.1 | 8.5 | 14.7 | 35.1 | 44.5 |
| Video-based clothes-changing | CAL [16] | 31.2 | 48.9 | 71.3 | 81.8 | 45.4 | 87.6 | 95.0 | 96.9 |
| | Ours | 39.8 | 54.8 | 76.2 | 86.1 | 51.1 | 83.0 | 94.5 | 97.4 |

Table 1: Comparison with the state-of-the-art Re-ID and gait recognition methods on the SCCVReID dataset (%), under the clothes-changing (CC) and standard settings.

4 Datasets

Current works pay little attention to CCVReID. Of the datasets available for this task, including Motion-ReID [46], CVID-reID [47], and CCVID [16], only CCVID is publicly available. However, as seen in Fig. 3(a), all samples in CCVID were taken from the same view, i.e., the frontal view, and have clean backgrounds and no occlusion, which is not fully applicable to real-world scenes. Therefore, to advance related research, we build two new benchmark datasets in this paper, including a large-scale synthetic one named SCCVReID, and a small real one named RCCVReID. We briefly introduce them in Secs. 4.1 and 4.2.

4.1 The SCCVReID Dataset

Following [12], we collect a large-scale synthetic dataset for CCVReID in surveillance scenarios by exploiting the highly photorealistic video game Grand Theft Auto V. Examples can be seen in Fig. 3(b). We set up 10 surveillance cameras within 5 scenes (2 cameras per scene) to collect data. After recording pedestrians in the game for 48 hours at 60 FPS, we use the automatically generated bounding boxes to crop out the RGB sequences of each person. The obtained SCCVReID dataset contains 9,620 sequences from 333 identities, and each identity has 2-37 suits of clothes, with an average of 7. Each sequence contains between 8 and 165 frames, with an average length of 37. For evaluation, 167 identities with 5,768 sequences are used for training and the remaining 166 identities with 3,852 sequences for testing. In the test set, we select the first sequence of each suit of each identity as a query. In total, 971 sequences are used as queries and the remaining 2,881 as gallery.
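As an illustration of this query/gallery split rule, the following sketch assumes a toy list of (sequence, identity, clothes) records sorted by capture order; the record format is hypothetical, not the released annotation format.

```python
# For each (identity, clothes-suit) pair in the test set, the first sequence
# becomes a query and the remaining sequences go to the gallery.
test_seqs = [  # (sequence_id, person_id, clothes_id), assumed sorted by capture time
    ("seq_001", 1, 1), ("seq_002", 1, 1), ("seq_003", 1, 2),
    ("seq_004", 2, 1), ("seq_005", 2, 1),
]

seen, query, gallery = set(), [], []
for seq_id, pid, cid in test_seqs:
    if (pid, cid) not in seen:       # first sequence of this identity-suit pair
        seen.add((pid, cid))
        query.append(seq_id)
    else:
        gallery.append(seq_id)
print(query)    # ['seq_001', 'seq_003', 'seq_004']
print(gallery)  # ['seq_002', 'seq_005']
```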

4.2 The RCCVReID Dataset

We also collect a real dataset named RCCVReID to support research on CCVReID, shown in Fig. 3(c). All raw videos were recorded in outdoor scenarios. For each video, we first run ByteTrack [48] to generate human bounding boxes with unified IDs and use them to crop out the RGB sequences of each identity, which are then split into several short sequences (about 200 frames each). After that, we merge the sequences of the same identity from different videos and manually label the clothes IDs. In total, 6,948 sequences from 34 identities are obtained. Each identity has 2-9 suits of clothes, with an average of 4. We consider two evaluation settings on this dataset. (1) Due to the small number of identities in RCCVReID, we train the models on the large-scale synthetic dataset SCCVReID and use RCCVReID only for performance evaluation. In this case, 2,133 sequences are used as queries and the remaining 4,815 sequences as gallery. (2) Despite the small number of identities, the total sample size is sufficient for training models. Thus, we divide the training and test sets as existing Re-ID datasets do. Specifically, 20 identities with 5,532 sequences are reserved for training, and the remaining 14 identities are used for testing. In the test set, 486 sequences are used as queries and the remaining 932 sequences as gallery.

5 Experiments

5.1 Datasets and Evaluation Protocol

We perform experiments on three CCVReID datasets, i.e., CCVID [16], SCCVReID, and RCCVReID, to evaluate the proposed method. For evaluation, we focus on two test settings, i.e., the clothes-changing (CC) setting and the standard setting. In the clothes-changing setting, gallery samples that have the same identity label and clothes label as the query are removed, i.e., only clothes-changing gallery samples are considered. In the standard setting, both clothes-consistent and clothes-changing samples are used as gallery to calculate accuracy. Following existing person Re-ID works, we adopt Rank-1/5/10 (R1/R5/R10) of the CMC curve and the mAP (mean average precision) score as evaluation protocols.
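A minimal sketch of how the gallery differs between the two settings, with placeholder identity/clothes labels; it only illustrates the filtering rule stated above, not the full CMC/mAP computation.

```python
import numpy as np

def valid_gallery_mask(q_pid, q_cid, g_pids, g_cids, setting="CC"):
    """CC setting: drop gallery samples sharing both the identity and the clothes
    label of the query. Standard setting: keep the full gallery."""
    if setting == "CC":
        return ~((g_pids == q_pid) & (g_cids == q_cid))
    return np.ones_like(g_pids, dtype=bool)

g_pids = np.array([1, 1, 2, 3, 1])
g_cids = np.array([1, 2, 1, 1, 1])
print(valid_gallery_mask(q_pid=1, q_cid=1, g_pids=g_pids, g_cids=g_cids))
# [False  True  True  True False]
```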

| Category | Method | Only test CC mAP | Only test CC R1 | Only test Std mAP | Only test Std R1 | With train CC mAP | With train CC R1 | With train Std mAP | With train Std R1 |
|---|---|---|---|---|---|---|---|---|---|
| Video-based w/o clothes-changing | AP3D [17] | 9.2 | 16.6 | 30.0 | 85.3 | 21.0 | 20.2 | 62.7 | 94.9 |
| | TCLNet [22] | 12.0 | 20.2 | 37.3 | 95.6 | 22.6 | 28.0 | 62.3 | 95.7 |
| | SINet [1] | 8.2 | 15.2 | 29.8 | 86.9 | 23.8 | 31.9 | 67.3 | 96.7 |
| Gait recognition | GaitSet [3] | 6.3 | 13.0 | 16.5 | 66.5 | 16.4 | 26.5 | 43.2 | 84.4 |
| | GaitPart [13] | 6.5 | 14.6 | 16.3 | 64.2 | 15.6 | 23.7 | 37.4 | 77.4 |
| | GaitGL [29] | 5.0 | 10.6 | 10.1 | 44.6 | 18.7 | 31.1 | 45.3 | 87.9 |
| Image-based clothes-changing | ReIDCaps-R [23] | 7.3 | 13.2 | 25.5 | 85.0 | 18.0 | 22.0 | 56.1 | 95.5 |
| | ReIDCaps-A [23] | 8.2 | 14.3 | 29.8 | 90.8 | 21.6 | 29.6 | 62.4 | 97.3 |
| | Pixel_sampling-R [37] | 4.5 | 4.5 | 9.1 | 14.6 | 21.3 | 27.2 | 55.3 | 87.4 |
| | Pixel_sampling-A [37] | 6.0 | 11.7 | 20.7 | 74.5 | 17.9 | 17.7 | 58.0 | 92.2 |
| | GI-ReID-R [24] | 4.5 | 3.9 | 9.9 | 18.7 | 11.5 | 9.1 | 31.8 | 44.0 |
| | GI-ReID-A [24] | 5.2 | 5.7 | 11.9 | 21.8 | 14.1 | 11.5 | 37.6 | 53.5 |
| Video-based clothes-changing | CAL [16] | 13.0 | 20.9 | 40.3 | 96.3 | 26.0 | 31.5 | 68.1 | 99.2 |
| | Ours | 14.5 | 25.7 | 43.1 | 95.8 | 33.9 | 46.7 | 72.3 | 99.4 |

Table 2: Comparison with the state-of-the-art Re-ID and gait recognition methods on the RCCVReID dataset (%). "Only test": models are trained on SCCVReID and RCCVReID is used only for testing; "With train": models are trained and tested on RCCVReID. Std: standard setting.
| Category | Method | CC mAP | CC R1 | Standard mAP | Standard R1 |
|---|---|---|---|---|---|
| VReID | AP3D [17] | 69.4 | 75.5 | 71.7 | 77.0 |
| | TCLNet [22] | 71.2 | 77.1 | 73.3 | 78.3 |
| | SINet [1] | 77.8 | 81.1 | 82.4 | 85.6 |
| GR | GaitSet [3] | 62.6 | 72.7 | 69.2 | 77.6 |
| | GaitPart [13] | 60.1 | 73.1 | 67.0 | 79.9 |
| | GaitGL [29] | 69.0 | 82.1 | 75.4 | 88.1 |
| CCIReID | ReIDCaps-R [23] | 45.9 | 54.2 | 49.8 | 58.6 |
| | ReIDCaps-A [23] | 46.2 | 50.1 | 49.9 | 52.5 |
| | Pixel_sampling-R [37] | 42.4 | 52.5 | 48.0 | 58.2 |
| | Pixel_sampling-A [37] | 50.4 | 57.3 | 55.7 | 62.9 |
| | GI-ReID-R [24] | 17.6 | 17.3 | 21.3 | 23.1 |
| | GI-ReID-A [24] | 27.3 | 27.8 | 31.7 | 34.7 |
| CCVReID | CAL [16] | 81.7 | 83.8 | 83.2 | 84.5 |
| | Ours | 84.5 | 88.1 | 87.1 | 89.7 |

Table 3: Comparison with the state-of-the-art Re-ID and gait recognition methods on the CCVID [16] dataset (%). VReID: video-based Re-ID w/o clothes-changing; GR: gait recognition; CCIReID: image-based clothes-changing Re-ID; CCVReID: video-based clothes-changing Re-ID.

5.2 Comparison with State-of-the-art Methods

Since few works explore the CCVReID task, we compare our method with four kinds of methods for a comprehensive evaluation: 1) video-based person Re-ID methods that do not involve clothes-changing, including AP3D [17], TCLNet [22], and SINet [1]; 2) gait recognition methods, including GaitSet [3], GaitPart [13], and GaitGL [29]; 3) image-based clothes-changing methods, including ReIDCaps [23], Pixel_sampling [37], and GI-ReID [24]; 4) the video-based clothes-changing method CAL [16]. Since image-based methods take a single frame as input, we report the results of such methods under two different settings, i.e., randomly selecting a frame for feature extraction (denoted with suffix "-R"), and extracting the features of all frames and taking the average as the final feature (denoted with suffix "-A").
Results on SCCVReID. As shown in Tab. 1, compared with image-based methods, video-based Re-ID and gait recognition methods achieve better performance, which indicates the importance of spatio-temporal information in improving the accuracy of Re-ID. Besides, our method achieves the best performance under the CC setting. Compared with CAL [16], our method improves the mAP and R1 by 8.6% and 5.9%, respectively. However, under the standard setting, our method is lower than SINet [1] in R1, R5, and R10. Intuitively, this can be attributed to the low accuracy of the gait features. While gait information can assist identification when appearance information is unreliable, i.e., with clothes changing, it may also introduce interference when appearance information is reliable, i.e., when clothes are consistent. Despite this, our method achieves the best mAP under this setting, that is, it achieves the highest average ranking for all gallery samples with the same identity as the query.
Results on RCCVReID. We compare our method with state-of-the-art methods under two evaluation settings, i.e., using SCCVReID for training and RCCVReID only for testing, or using RCCVReID for both training and testing. The results are shown in Tab. 2. Note that the only-test setting can be considered as cross-domain CCVReID, evaluating the generalization ability of the models. Under both evaluation settings, our method achieves the best mAP and R1 under the CC setting, and the best mAP under the standard setting, which is consistent with the results on SCCVReID.
Results on CCVID. As seen in Tab. 3, our method outperforms all the other methods by a large margin on all evaluation protocols. Since CCVID is built based on a gait recognition dataset, the extracted gait features are relatively more discriminative. Thus, the drop in R1 under the standard setting caused by the interference of low-quality gait features does not appear here.

5.3 Ablation Study

In this section, we conduct a series of ablation experiments on SCCVReID to verify each design of our method.
The effectiveness of the proposed method. To verify the effectiveness of the proposed method, we compare it with different fusion frameworks in Tab. 4. The top two rows report the performance of using the appearance and gait features alone, respectively. The third row reports the performance when the initial appearance and gait similarities are directly summed for re-ranking. SEF [43] is the inspiration for this work; differently, SEF considers the appearance (gait) features only when building the appearance (gait) graph, and uses a GCN to directly predict the final similarity. The results show that all fusion strategies bring an improvement in accuracy under the CC setting and in mAP under the standard setting, but a certain drop in R1 under the standard setting. The main reason for this has been discussed in Sec. 5.2. However, compared to the other two fusion strategies, our method obtains the highest improvement under the clothes-changing setting and the smallest drop in R1 under the standard setting, which indicates its superiority.

| Method | CC mAP | CC R1 | Standard mAP | Standard R1 |
|---|---|---|---|---|
| App only | 33.0 | 51.0 | 46.5 | 84.4 |
| Gait only | 18.7 | 28.1 | 23.3 | 38.0 |
| Direct summation | 36.4 | 50.2 | 48.7 | 81.7 |
| SEF [43] | 38.8 | 50.6 | 47.2 | 67.1 |
| Ours w/ pseudo-label (Oracle) | 69.4 | 97.2 | 75.2 | 99.6 |
| Ours w/o appConf | 39.4 | 53.6 | 50.7 | 81.7 |
| Ours w/o gaitConf | 39.2 | 52.6 | 50.8 | 82.3 |
| Ours | 39.8 | 54.8 | 51.1 | 83.0 |

Table 4: Analysis of different fusion frameworks and the effectiveness of the proposed confidence loss (%).

The effectiveness of confidence loss. To evaluate the effectiveness of the proposed confidence loss, we first list the accuracy obtained by calculating the final similarity using the designed pseudo-label of confidence. As shown in Tab. 4, compared with the simple summation of similarities, our model with the pseudo-label, i.e., the ground truth of confidence, improves mAP by 33.0% and R1 by 47.0% under the CC setting, and mAP by 26.5% and R1 by 17.9% under the standard setting. The results demonstrate the validity of the designed pseudo-label of confidence, which can be regarded as the oracle of the proposed confidence-aware method. We also compare our method with and without the appearance/gait confidence loss. In Tab. 4, we can observe that removing the confidence loss of either the appearance or the gait branch causes a certain degradation under both the clothes-changing and standard settings. This demonstrates that the proposed loss can effectively supervise the learning of confidence.
Discussion of the two-branch framework. We conduct ablation experiments to verify each branch of the proposed framework in Tab. 5. We discuss three settings of the confidence: 1) the confidence is output by the network, which is represented by ✓; 2) the confidence is fixed to 0, that is, the original similarity is not used to calculate the final similarity and only the other branch is considered. However, in this case, the original similarity is not completely neglected, since its information has been implicitly included in the linear layer; 3) the confidence is fixed to 1, that is, the original similarity is used to calculate the final similarity without weight adjustment. From the comparisons between the first and second rows and between the third and fourth rows, we can see that the model with the confidence set to 1 outperforms the model with the confidence set to 0, which shows the importance of the original similarity in the re-ranking stage. From the comparison between the top four rows and the fifth to eighth rows, we can see that the model with the confidence fixed for only one branch outperforms the model with the confidence fixed for both branches. Also, the model without any fixed confidence achieves the best performance (the last row), which fully proves the superiority of the proposed network for confidence learning. Besides, we compare the models with and without the linear layer (the last two rows). The results show that adding the linear layer improves the performance of the model.

| AppConf | GaitConf | Linear | CC mAP | CC R1 | Standard mAP | Standard R1 |
|---|---|---|---|---|---|---|
| 0 | ✓ | ✓ | 38.4 | 51.7 | 50.5 | 82.7 |
| 1 | ✓ | ✓ | 38.7 | 52.3 | 50.9 | 83.8 |
| ✓ | 0 | ✓ | 37.6 | 52.1 | 49.8 | 83.0 |
| ✓ | 1 | ✓ | 37.6 | 52.2 | 50.0 | 84.3 |
| 0 | 0 | ✓ | 36.5 | 50.9 | 49.1 | 82.3 |
| 0 | 1 | ✓ | 36.3 | 49.2 | 48.3 | 76.7 |
| 1 | 0 | ✓ | 36.5 | 50.7 | 49.0 | 81.1 |
| 1 | 1 | ✓ | 36.6 | 50.8 | 49.2 | 82.1 |
| ✓ | ✓ | | 38.9 | 52.7 | 50.7 | 83.2 |
| ✓ | ✓ | ✓ | 39.8 | 54.8 | 51.1 | 83.0 |

Table 5: Results of different settings in the two-branch framework (%). "0" means the confidence is set to 0, i.e., the corresponding branch is removed. "1" means the confidence is set to 1, i.e., the original corresponding branch is directly added without weight adjustment. "✓" means the confidence is output by the proposed network.
| | CC mAP | CC R1 | Standard mAP | Standard R1 |
|---|---|---|---|---|
| \gamma=0 | 32.4 | 50.3 | 41.9 | 77.2 |
| \gamma=0.25 | 36.6 | 53.1 | 48.0 | 81.2 |
| \gamma=0.5 | 38.0 | 52.7 | 49.5 | 82.9 |
| \gamma=0.75 (Ours) | 39.8 | 54.8 | 51.1 | 83.0 |
| \gamma=1 | 39.2 | 53.2 | 50.9 | 82.4 |

Table 6: Results of different candidate collection strategies (%).

Discussion of candidates collection. The parameter \gamma controls the contributions of the initial appearance and gait ranking lists when selecting training samples. As shown in Tab. 6, as the value of \gamma increases, the accuracy under both settings first increases and then decreases. The model achieves the best performance when \gamma=0.75. This means that the optimal candidate collection strategy is to select most of the samples based on the more reliable appearance features, while also considering samples whose gait features are similar to the query's.

6 Conclusion

In this paper, we have focused on the less-studied yet practical problem of video-based person re-identification with clothes-changing (CCVReID). For this problem, we propose a two-branch confidence-aware re-ranking framework, which fuses the confidence-re-weighted appearance and gait features for final re-ranking. We also design confidence pseudo-labels to supervise confidence learning. Besides, we build two new benchmarks for the CCVReID problem, including a large-scale synthetic one and a real-world one. With the above efforts, we hope to promote research on this practical topic.

References

  • [1] Shutao Bai, Bingpeng Ma, Hong Chang, Rui Huang, and Xilin Chen. Salient-to-broad transition for video person re-identification. In CVPR, pages 7339–7348, 2022.
  • [2] Slawomir Bak, Peter Carr, and Jean-Francois Lalonde. Domain adaptation through synthesis for unsupervised person re-identification. In ECCV, pages 189–205, 2018.
  • [3] Hanqing Chao, Kun Wang, Yiwei He, Junping Zhang, and Jianfeng Feng. Gaitset: Cross-view gait recognition through utilizing gait as a deep set. IEEE TPAMI, 44(7):3467–3478, 2021.
  • [4] Di Chen, Andreas Doering, Shanshan Zhang, Jian Yang, Juergen Gall, and Bernt Schiele. Keypoint message passing for video-based person re-identification. In AAAI, volume 36, pages 239–247, 2022.
  • [5] Dapeng Chen, Hongsheng Li, Tong Xiao, Shuai Yi, and Xiaogang Wang. Video person re-identification with competitive snippet-similarity aggregation and co-attentive snippet embedding. In CVPR, pages 1169–1178, 2018.
  • [6] Jiaxing Chen, Xinyang Jiang, Fudong Wang, Jun Zhang, Feng Zheng, Xing Sun, and Wei-Shi Zheng. Learning 3d shape feature for texture-insensitive person re-identification. In CVPR, pages 8146–8155, 2021.
  • [7] Kai Chen, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jianping Shi, Wanli Ouyang, et al. Hybrid task cascade for instance segmentation. In CVPR, pages 4974–4983, 2019.
  • [8] Dahjung Chung, Khalid Tahboub, and Edward J Delp. A two stream siamese convolutional neural network for person re-identification. In CVPR, pages 1983–1991, 2017.
  • [9] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255. IEEE, 2009.
  • [10] Naina Dhingra, George Chogovadze, and Andreas Kunz. Border-seggcn: improving semantic segmentation by refining the border outline using graph convolutional network. In ICCV, pages 865–875, 2021.
  • [11] Chanho Eom, Geon Lee, Junghyup Lee, and Bumsub Ham. Video-based person re-identification with spatial and temporal memory networks. In ICCV, pages 12036–12045, 2021.
  • [12] Matteo Fabbri, Fabio Lanzi, Simone Calderara, Andrea Palazzi, Roberto Vezzani, and Rita Cucchiara. Learning to detect and track visible and occluded body joints in a virtual world. In ECCV, pages 430–446, 2018.
  • [13] Chao Fan, Yunjie Peng, Chunshui Cao, Xu Liu, Saihui Hou, Jiannan Chi, Yongzhen Huang, Qing Li, and Zhiqiang He. Gaitpart: Temporal part-based model for gait recognition. In CVPR, pages 14225–14233, 2020.
  • [14] Lijie Fan, Tianhong Li, Rongyao Fang, Rumen Hristov, Yuan Yuan, and Dina Katabi. Learning longterm representations for person re-identification using radio signals. In CVPR, pages 10699–10709, 2020.
  • [15] Yang Fu, Xiaoyang Wang, Yunchao Wei, and Thomas Huang. Sta: Spatial-temporal attention for large-scale video-based person re-identification. In AAAI, volume 33, pages 8287–8294, 2019.
  • [16] Xinqian Gu, Hong Chang, Bingpeng Ma, Shutao Bai, Shiguang Shan, and Xilin Chen. Clothes-changing person re-identification with rgb modality only. In CVPR, pages 1060–1069, 2022.
  • [17] Xinqian Gu, Hong Chang, Bingpeng Ma, Hongkai Zhang, and Xilin Chen. Appearance-preserving 3d convolution for video-based person re-identification. In ECCV, pages 228–243. Springer, 2020.
  • [18] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In NeurIPS, 2017.
  • [19] Alexander Hermans, Lucas Beyer, and Bastian Leibe. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.
  • [20] David T Hoffmann, Dimitrios Tzionas, Michael J Black, and Siyu Tang. Learning to train with synthetic humans. In GCPR, pages 609–623. Springer, 2019.
  • [21] Peixian Hong, Tao Wu, Ancong Wu, Xintong Han, and Wei-Shi Zheng. Fine-grained shape-appearance mutual learning for cloth-changing person re-identification. In CVPR, pages 10513–10522, 2021.
  • [22] Ruibing Hou, Hong Chang, Bingpeng Ma, Shiguang Shan, and Xilin Chen. Temporal complementary learning for video person re-identification. In ECCV, pages 388–405. Springer, 2020.
  • [23] Yan Huang, Jingsong Xu, Qiang Wu, Yi Zhong, Peng Zhang, and Zhaoxiang Zhang. Beyond scalar neuron: Adopting vector-neuron capsules for long-term person re-identification. IEEE TCSVT, 30(10):3459–3471, 2019.
  • [24] Xin Jin, Tianyu He, Kecheng Zheng, Zhiheng Yin, Xu Shen, Zhen Huang, Ruoyu Feng, Jianqiang Huang, Zhibo Chen, and Xian-Sheng Hua. Cloth-changing person re-identification from a single image with gait prediction and regularization. In CVPR, pages 14278–14287, 2022.
  • [25] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • [26] Jianing Li, Shiliang Zhang, and Tiejun Huang. Multi-scale 3d convolution network for video based person re-identification. In AAAI, volume 33, pages 8618–8625, 2019.
  • [27] Wei Li, Rui Zhao, Tong Xiao, and Xiaogang Wang. Deepreid: Deep filter pairing neural network for person re-identification. In CVPR, pages 152–159, 2014.
  • [28] Yu-Jhe Li, Xinshuo Weng, and Kris M Kitani. Learning shape representations for person re-identification under clothing change. In WACV, pages 2432–2441, 2021.
  • [29] Beibei Lin, Shunli Zhang, and Xin Yu. Gait recognition via effective global-local feature representation and local temporal aggregation. In ICCV, pages 14648–14656, 2021.
  • [30] Yiheng Liu, Zhenxun Yuan, Wengang Zhou, and Houqiang Li. Spatial and temporal mutual promotion for video-based person re-identification. In AAAI, volume 33, pages 8786–8793, 2019.
  • [31] Xiaoyan Lu, Xinde Li, Weijie Sheng, and Shuzhi Sam Ge. Long-term person re-identification based on appearance and gait feature fusion under covariate changes. Processes, 10(4):770, 2022.
  • [32] John McCormac, Ankur Handa, Stefan Leutenegger, and Andrew J Davison. Scenenet rgb-d: Can 5m synthetic images beat generic imagenet pre-training on indoor segmentation? In ICCV, pages 2678–2687, 2017.
  • [33] Niall McLaughlin, Jesus Martinez Del Rincon, and Paul Miller. Recurrent convolutional network for video-based person re-identification. In CVPR, pages 1325–1334, 2016.
  • [34] Xuelin Qian, Wenxuan Wang, Li Zhang, Fangrui Zhu, Yanwei Fu, Tao Xiang, Yu-Gang Jiang, and Xiangyang Xue. Long-term cloth-changing person re-identification. In ACCV, 2020.
  • [35] German Ros, Laura Sellart, Joanna Materzynska, David Vazquez, and Antonio M Lopez. The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In CVPR, pages 3234–3243, 2016.
  • [36] Lei Shi, Yifan Zhang, Jian Cheng, and Hanqing Lu. Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In CVPR, pages 12026–12035, 2019.
  • [37] Xiujun Shu, Ge Li, Xiao Wang, Weijian Ruan, and Qi Tian. Semantic-guided pixel sampling for cloth-changing person re-identification. IEEE Sign. Process. Letters, 28:1365–1369, 2021.
  • [38] Xiaoxiao Sun and Liang Zheng. Dissecting person re-identification from the viewpoint of viewpoint. In CVPR, pages 608–617, 2019.
  • [39] Gül Varol, Ivan Laptev, Cordelia Schmid, and Andrew Zisserman. Synthetic humans for action recognition from unseen viewpoints. IJCV, 129(7):2264–2287, 2021.
  • [40] Fangbin Wan, Yang Wu, Xuelin Qian, Yixiong Chen, and Yanwei Fu. When person re-identification meets changing clothes. In CVPRW, pages 830–831, 2020.
  • [41] Likai Wang, Jinyan Chen, and Yuxin Liu. Frame-level refinement networks for skeleton-based gait recognition. CVIU, 222:103500, 2022.
  • [42] Max Welling and Thomas N Kipf. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.
  • [43] Qiaokang Xie, Zhenbo Lu, Wengang Zhou, and Houqiang Li. Improving person re-identification with multi-cue similarity embedding and propagation. IEEE TMM, 2022.
  • [44] Qize Yang, Ancong Wu, and Wei-Shi Zheng. Person re-identification by contour sketch under moderate clothing change. IEEE TPAMI, 43(6):2029–2046, 2019.
  • [45] Shijie Yu, Shihua Li, Dapeng Chen, Rui Zhao, Junjie Yan, and Yu Qiao. Cocas: A large-scale clothes changing person dataset for re-identification. In CVPR, pages 3400–3409, 2020.
  • [46] Peng Zhang, Qiang Wu, Jingsong Xu, and Jian Zhang. Long-term person re-identification using true motion from videos. In WACV, pages 494–502, 2018.
  • [47] Peng Zhang, Jingsong Xu, Qiang Wu, Yan Huang, and Xianye Ben. Learning spatial-temporal representations over walking tracklet for long-term person re-identification in the wild. IEEE TMM, 23:3562–3576, 2020.
  • [48] Yifu Zhang, Peize Sun, Yi Jiang, Dongdong Yu, Fucheng Weng, Zehuan Yuan, Ping Luo, Wenyu Liu, and Xinggang Wang. Bytetrack: Multi-object tracking by associating every detection box. In ECCV, pages 1–21. Springer, 2022.
  • [49] Ziyuan Zhang, Luan Tran, Xi Yin, Yousef Atoum, Xiaoming Liu, Jian Wan, and Nanxin Wang. Gait recognition via disentangled representation learning. In CVPR, pages 4710–4719, 2019.
  • [50] Liang Zheng, Zhi Bie, Yifan Sun, Jingdong Wang, Chi Su, Shengjin Wang, and Qi Tian. Mars: A video benchmark for large-scale person re-identification. In ECCV, pages 868–884, 2016.
  • [51] Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. Scalable person re-identification: A benchmark. In ICCV, pages 1116–1124, 2015.
  • [52] Zhedong Zheng, Liang Zheng, and Yi Yang. Unlabeled samples generated by gan improve the person re-identification baseline in vitro. In ICCV, pages 3754–3762, 2017.
  • [53] Zheng Zhu, Xianda Guo, Tian Yang, Junjie Huang, Jiankang Deng, Guan Huang, Dalong Du, Jiwen Lu, and Jie Zhou. Gait recognition in the wild: A benchmark. In ICCV, pages 14789–14799, 2021.