Devil’s in the Details: Aligning Visual Clues for Conditional Embedding in Person Re-Identification

Fufu Yu¹ Xinyang Jiang¹ Fufu Yu and Xinyang Jiang contribute equally Yifei Gong¹ Shizhen Zhao³ Xiaowei Guo¹
Wei-Shi Zheng² Feng Zheng⁴ Xing Sun¹
¹ Tencent Youtu Lab Correspondance Author: winfredsun@tencent.com ² Sun Yat-sen University
³ Huazhong University of Science and Technology ⁴ Southern University of Science and Technology

Abstract

Although Person Re-Identification has made impressive progress, difficult cases like occlusion, change of view-point and similar clothing still bring great challenges. Besides overall visual features, matching and comparing detailed information is also essential for tackling these challenges. This paper proposes two key recognition patterns to better utilize the detail information of pedestrian images, that most of the existing methods are unable to satisfy. Firstly, Visual Clue Alignment requires the model to select and align decisive regions pairs from two images for pair-wise comparison, while existing methods only align regions with predefined rules like high feature similarity or same semantic labels. Secondly, the Conditional Feature Embedding requires the overall feature of a query image to be dynamically adjusted based on the gallery image it matches, while most of the existing methods ignore the reference images. By introducing novel techniques including correspondence attention module and discrepancy-based GCN, we propose an end-to-end ReID method that integrates both patterns into a unified framework, called CACE-Net ((C)lue (A)lignment and (C)onditional (E)mbedding). The experiments show that CACE-Net achieves state-of-the-art performance on three public datasets.

1 Introduction

Person re-identification (ReID) increasingly draws attention due to its wide applications in surveillance, tracking, smart retail, etc [35, 32, 20]. Although ReID methods progress rapidly and achieve impressive performance on benchmark datasets, in practice, difficult cases like occlusion, change of view-point and similar clothing still bring great challenges. As shown in Figure 1 a), in these cases, the overall appearance of a pedestrian may not always be reliable, and comparing the detailed information becomes essential. Thus, this paper focuses on how to effectively utilize the detailed information for matching pedestrian images.

Refer to caption — Figure 1: Visual Clue Alignment: The illustration of matching image pairs with global feature and local correspondence

Looking at how human annotators would compare the similarities between two images, we find that there are two key recognition patterns involving matching detailed features. As shown in Figure 1 a) and b) existing methods usually compare the overall visual similarity of the entire body or densely compare the similarity of all local regions, which could be unreliable in many hard cases. On the other hand, a human annotator will select several local regions crucial for recognition, and align the selected visual clues between two images for pairwise comparison. For example, in Figure 1 c), visual clue pairs including hat, shoulder, arms, and shoes are selected for comparison, and since all of these pairs have high feature similarity, one will have high confidence to accept the images as the same person. The same logic can be applied to negative pairs, as shown in Figure 1 c), the general appearance of the image pair is very similar, but one recognize the images pair as irrelevant by comparing visual clues including the head, legs, shoes and coat pockets.

Secondly, for the same query image, a human annotator’s attention to visual features varies drastically when matching with different gallery images. Figure 2 is an intuitive example to explain this recognition pattern. For the same query image in Figure 2, the value of its feature vectors is conditioned on the gallery image it matches. In other word, different feature vectors are needed to match gallery image A and gallery image B. Since the face and glasses cannot be seen in gallery image A, the channels related to these semantics are suppressed in the corresponding feature. Similarly, when matching query image with gallery image B, channels related to the black jacket and plastic bag are suppressed.

In conclusion, a good ReID matching model should meet two requirements: 1) Locally, decisive visual clues need to be discovered and aligned for pair-wise comparison, namely Visual Clue Alignment. 2) Globally, the overall feature extracted from a query image should be dynamically adjusted based on the gallery image it matches, namely Conditional Feature Embedding.

Most of the existing methods do not satisfy the aforementioned two requirements. Some recent alignment-based methods propose to densely align the local parts between image pair based on semantic human part labels [22, 19] or feature similarity [20, 28, 6] for pairwise comparison. However, these methods align local regions for pair-wise comparison based on pre-defined rules (e.g., regions with highest feature similarity or region with the same human part label). These rules in many cases are not able to discover the most decisive and discriminative visual clues. For example, as shown in Figure 1 the importance of the aligned pairs is irrelevant to the feature similarity. In Figure 1 b), several visual clue pairs with high visual similarity are selected, while in Figure 1 d) region pairs with large visual difference should be selected to reject the image pair. Instead of predefined rules, we propose to learn a novel correspondence attention module to automatically select decisive key-point pairs based on the visual content of both images. Furthermore, most of the existing methods extract individual features from each image and unable to learn dynamically adaptive conditional features.

As a result, we propose a novel ReID model that integrates both recognition patterns into a unified framework, called CACE-Net. As shown in the orange box of Figure 3, for clue alignment, instead of pre-defined alignment rules, we propose a novel correspondence attention module to automatically select and align crucial regions between images. Secondly, since the region correspondence obtained by the last stage forms very complex correspondence graph, GCN is an appropriate tool to capture the the high ordered topological relation among multiple regions. As shown in grey box of Figure 3, our model takes the obtained region correspondence graph as inputs and extracts conditional feature embedding with a novel graph convolutional network [10, 1]. Instead of the standard GCN that smooths the adjacent node features, a novel discrepancy-based graph convolution is proposed to obtain the feature difference between crucial regions.

The contribution of our proposed method is listed as follows: 1) CACE-Net is able to integrate both visual clue alignment and conditional feature embedding into a unified ReID framework. 2) Instead of using a pre-defined Adjacency Matrix, our CACE-Net uses a novel correspondence attention module where the visual clues is automatically predicted and dynamically adjusted during training. 3) A novel discrepancy-based graph convolution is proposed to analyze the feature difference between adjacent graph nodes.

2 Related Works

2.1 Part base methods

Part-based models learn local features of different body parts to enhance the global ReID feature on cross-view matching. One of the most common and effective type of part-based models simply split the output feature-maps of ReID model’s inter-mediate layers into several horizontal stripes and learn local features for each stripe, such as PCB [20], MGN [23], Pyramid [33], RelationNet [14] and VA-ReID [37]. Another type of part-based models [11, 31, 34]segment human body into meaningful body parts and learn local feature for each body parts. SPReID [9] learns a human parsing branch for body part segmentation and fuses local features for different parts by weighted average pooling. DSA-ReID [29] proposes projects human parts into a UV space and uses this UV space branch to guide the learning of a stripe model.

2.2 Alignment-based methods

On top of the local features, instead of fusing the local features directly, some methods propose to align parts from a pair of images, and match a pair of images based on the similarity of their aligned part pairs. AlignReID[28] proposes a dynamic programming algorithm to align a stripe in the image to a stripe in another image based on their local feature similarity. DSR [6] and SFR [7] propose a sparse coding method to implicitly look for similar key-point pairs by reconstructing one image’s feature map with another. VPM [19] proposes to align the stripes from two images based on the visibility of each stripes. PGFA [13] exploits pose landmarks to align stripes. HOReID [22] aligns same semantic parts for Occluded ReID by the aid of extra key-points information. CDPM [24] proposes to localize and align local parts by a sliding window method. Some GAN based methods like FD-GAN [4] proposes to align local features by directly transfer the image to the same pose and viewpoint of the target image.

2.3 Joint Feature Learning

In this paper, Joint Feature Learning refers to methods that feed both images into a model simultaneously to obtain a conditional feature embedding. Existing methods like DCCs [26] and Deep Spatially Multiplicative Integration [27] learns a RNN to iteratively encode couple features step-by-step.

2.4 Attention-based methods

Attention based methods are a type of State-of-the-Art ReID methods that propose to select important regions or channels of a feature-maps to form the ReID feature and discard region irrelevant to recognition such as background. Unlike our CACE-Net that selects crucial region pairs based on a pair of images, most of the existing attention based methods focus on selecting important information from individual images. Method in [32] proposes to predict multiple attention maps for different human parts. HA-CNN [12] uses a Harmonious Attention module to conduct feature selection both spatially and in channel-wise for a individual image. ABDNet [2] proposes a similar spatial and channel attention with an orthogonality constraint. RGA-SC [30] stacks the pairwise relations between single feature and all features together with the feature itself to infer the current position’s attention on spatial and channel dimensions.

2.5 Graph-based methods

Recently some methods propose to use graph-based techniques to learn more complex relationship for ReID model. However, instead of exploring the complex correlation between detailed local regions inside image pairs, most of the existing methods are based on global features. SGGNN [17] use graph to represent the relation between multiple probe images and gallery images and use a graph neural network update samples’ global features with a massage passing method. Group shuffling random walk [18] further extend the probe-gallery relation to gallery-gallery relationship.

3 Methods

3.1 General Framework

Figure 3 shows the general training workflow of CACE-Net. CACE-Net is an end-to-end learning framework containing three stages, namely individual feature embedding, visual clue alignment, and conditional feature embedding. At the first stage, our method extracts individual feature maps for both images. Secondly, given the feature-maps of two images, a correspondence attention module is used to select crucial region pairs both within an image and between image pair. Finally, a novel discrepancy-based GCN is used to extract conditional features from the region correspondence graph. The following three sub-sections elaborate on the implementation and formulation the three stages.

3.2 Individual Feature Embedding

The individual feature embedding stage is responsible for extracting a feature-map for each individual pedestrian image. Any type of CNN backbones for person Re-identification can be applied for the individual feature extraction.

As shown in the blue box in Figure 3, to enforce the backbone network extracting good individual features, an additional training loss branch is attached to the module. Given a training set $\mathcal{D}=\{(I^{(u)},y^{(u)})\}_{u=1}^{N}$ , the input image $I^{(u)}$ is first fed into a backbone network. Then, the output feature-map is further fed into a encoder (i.e. a Global Average Pooling followed by a $1*1$ convolution layer) to obtain the individual feature vector for $I^{(u)}$ , denoted as $f(I^{(u)})$ . Then, a cross entropy based ID loss is used to train the individual feature extractor:

L_{CE}(y^{(u)},I^{(u)})=\frac{1}{C}\sum_{c}^{C}\mathbf{1}(y^{(u)}=c)log(p(c|f(I^{(u)})))

(1)

where, $p(c|f(I^{(u)}))=\frac{e^{W^{T}_{C}f(I^{(u)})}}{\sum_{j=1}^{K}e^{W^{T}_{j}f(I^{(u)})}}$ , and $W$ is a weight matrix of a fully connect layer to classify $f(I^{(u)})$ into different identities, and $C$ is the total number of identities in the training set.

3.3 Visual Clue Alignment

Visual Clue Alignment select and align crucial regions for pairwise detailed comparison. We denote the feature-map extracted at individual feature embedding stage as $X\in R^{H*W*D}$ , where $H$ and $W$ is the height and width of the feature-map and $D$ is the number of feature channels. $X$ is further reshaped into a two-dimensional $HW*D$ matrix denoted as $\hat{X}$ . Given the feature-maps of two images indexed as $\hat{X}^{(u)}$ and $\hat{X}^{(v)}$ , we select and align the visual clues between them by evaluating the importance of each pair of pixels, denoted as $S^{(u,v)}\in R^{HW*HW}$ , where each element $S^{(u,v)}_{ij}$ in the matrix indicate the importance of an aligned pixel pair. Then, The selected pairs of pixels in $X$ forms a undirected graph, whose adjacent matrix $A^{(u,v)}\in R^{HW*HW}$ will be used to extract conditional features of both images with Graph Convolutional Network.

One of the intuitive way to obtain the correlation between two different visual clues is to compute feature similarity. In the ablation study, a cosine similarity is used to evaluate the importance and correlation between a pixel pair from two feature-maps $X^{(u)}_{i},X^{(v)}_{j}$ . Then, following a similar strategy in [28], the crucial region pairs are selected by assigning each pixel with the pixel with highest similarity from the other feature-map. Another way to build the pixel correspondence between feature-map pairs is by the human part or other semantic labels of each pixel, where all the corresponding human part are selected with the same importance weight.

As discussed in Section 1, the aforementioned two types of methods for visual clue alignment and selection are based on pre-defined rules and in many cases not able to find the most decisive and discriminative region pairs. Hence, we further propose a novel correspondence attention module that automatically selects crucial visual clue pairs.

Our method not only focuses on finding crucial visual clues between two images, but also discovering intra-relationship between feature-map pixels within an individual image. The importance of pixel pairs within a feature-map $\hat{X}^{(u)}$ is computed as follows:

S^{(u)}=\hat{X}^{(u)}W\hat{X}^{(u)\mathsf{T}}

(2)

where $W$ is a diagonal parameter matrix that assigns a learn-able weight for each feature-map channel.

Similarly, given two images whose feature-map is denoted as $X^{(u)}$ and $X^{(v)}$ , the inter-image importance is computed as follows:

S^{\prime(u,v)}=X^{(u)}W^{\prime}X^{(v)\mathsf{T}}

(3)

We combine the importance weight of both intra-image and inter-image visual clues, and select the crucial pairs based on the importance matrix as follow:

\displaystyle A^{(u,v)}=ReLU\left(\begin{array}[]{cc}S^{(u)}&S^{\prime(u,v)}\\ S^{\prime(v,u)}&S^{(v)}\end{array}\right)

(4)

where a ReLU activation is used to select the regions pairs with positive importance weights.

3.4 Conditional Feature Embedding

Given a pair of images $(I^{(u)},I^{(v)})$ , the conditional feature embedding of $I^{(u)}$ should be dynamically adjusted based on $I^{(v)}$ , denoted as $f_{cond}(I_{i}|I_{j})$ .

In order to fully exploit the detailed information in both images, we propose to extract the conditional feature embedding based on the crucial visual clue pairs between $I^{(u)}$ and $I^{(v)}$ . Given the alignment graph $A^{(u,v)}$ of visual clues predicted by the correspondence attention module, we propose a novel discrepancy-based GCN to encode the complex graph structured contextual information into conditional feature vectors.

3.4.1 Discrepancy-based GCN

In this paper, we choose a spectral based GCN [10] to extract conditional features. Given the $i$ -th pixel of an image’s feature-map, denoted as $\hat{X}_{i}^{(u)}$ (i.e. the node feature), the graph convolution operation is formulated as follows:

	$\displaystyle g_{\theta^{\prime}}*X_{i}^{(u)}$	$\displaystyle=\theta_{0}^{\prime}X_{i}^{(u)}+\theta_{1}^{\prime}(L-I)X_{i}^{(u)}$		(5)
		$\displaystyle=\theta_{0}^{\prime}X_{i}^{(u)}-\theta_{1}^{\prime}D^{-\frac{1}{2}}A^{(u,v)}D^{\frac{1}{2}}X_{i}^{(u)}$		(5)

where $L$ is the Laplacian matrix of the adjacent matrix $A^{(u,v)}$

To further decrease the number of learn-able parameter, common GCN sets $\theta=\theta_{0}^{\prime}=-\theta_{1}$ , which leads to following expression:

g_{\theta}*X_{i}^{(u)}=\theta(I+D^{-\frac{1}{2}}A^{(u,v)}D^{\frac{1}{2}})X_{i}^{(u)}

(6)

By taking a closer look at this equation, we will find that this operation is essentially a weighted average or smooth operation over current node feature itself and its connected neighbours. However, this is not what we want for our graph convolution. Under the setting of the Re-identification, instead of smoothing the value between connected nodes, we require the model to obtain the features difference between the aligned crucial regions. Thus, We propose a novel graph convolution operation, that instead of letting $\theta_{0}^{\prime}=-\theta_{1}$ , sets $\theta=\theta_{0}^{\prime}=\theta_{1}^{\prime}$ , which leads to following graph convolution operation:

	$\displaystyle g_{\theta}*X_{i}^{(u)}$	$\displaystyle=\theta(I-D^{-\frac{1}{2}}A^{(u,v)}D^{\frac{1}{2}})X_{i}^{(u)}$		(7)
		$\displaystyle=\theta D^{-\frac{1}{2}}LD^{\frac{1}{2}}X_{i}^{(u)}$		(7)

From Eq. 7, we can see that for our new graph convolution, the coefficient of $g_{\theta}$ becomes the normalized graph Laplacian matrix, which is equivalent to computing a secondary gradient of the node feature. Hence this convolution is able to obtain the level of feature change between adjacent nodes.

3.4.2 Mixed up ID Loss

As shown in Figure 3, given the feature-maps of image pair $(I^{(u)},I^{(v)})$ and the adjacent matrix $A^{(u,v)}$ generated from the feature-map pair, we obtain the conditional feature-map by applying the graph convolution described in Eq. 7. The outputs are a pair of conditional feature-map with the same size of the input feature-map. Similar to the individual exam stage, the conditional feature-maps are then fed into a feature encoder consisting of a Global Average Pooling layer and a $1*1$ convolution layer for dimension reduction to obtain the encoded conditional feature vectors .

In the training process, we use both triplet loss and cross entropy loss as the supervised signals. In every training iteration, we sample $P$ identities from the training set and for each identity in the training set, we sample $M$ samples. For triplet loss, we use a common hard triplet loss for person ReID [8]. As for the cross-entropy loss, since the conditional feature $f_{cond}(I^{(u)}|I^{(v)})$ is extracted based on the information from both $I^{u}$ and $I^{(v)}$ , the identity labels from both image should be used to supervised the feature extraction. Instead of the common cross entropy loss, we propose a mix-up cross entropy loss specifically for training conditional feature vector. Given a mini-batch containing $PM$ images, the mix-up cross-entropy loss is formulated as:

	$\displaystyle L_{mix-up}$	$\displaystyle=\sum_{u=1}^{PM}\sum_{v=1}^{PM}\alpha L_{CE}(y^{(u)},f_{cond}(I^{(u)}\|I^{(v)}))$		(8)
		$\displaystyle+(1-\alpha)L_{CE}(y^{(v)},f_{cond}(I^{(u)}\|I^{(v)}))$		(8)

where $L_{CE}$ is the softmax and cross entropy loss shown in Eq.1 .

3.5 Model Inference

Like most of the ReID method, the model is evaluated as an image retrieval task. Given a query image set containing $N_{q}$ images and a gallery set containing $N_{g}$ images, we need to retrieve the images with the same identity of $I_{q}$ . Our method obtains the similarity between two images with both individual features and conditional features.

We first extract the individual features for all images in query and gallery set, with computational complexity of $O(N_{q}+N_{g})$ . Then, for each query, we first sort the gallery images based on the similarity of the individual features. After that, the feature-maps of query image and the top-K images in sorted gallery forms $K$ feature-map pairs, which are fed into the key-point alignment stage and conditional feature embedding stage to obtain conditional features, with computational complexity of $O(N_{q}K)$ . Finally, the top-K gallery images are sorted once more by the similarity of coupled features, forming the final ranking result. Compared to the individual feature extraction, much fewer computing operations are needed to obtain conditional feature embedding, so the entire computation cost of CACE-Net is very close to normal ReID method.

4 Experiments

In this section we propose the performance comparison of CACE-Net with the state-of-the-art methods and ablation study of different components in CACE-Net.

4.1 Datasets

Our experiments are conducted on three widely used ReID benchmark datasets.

Market-1501 [35] dataset contains 32,668 person images of 1,501 identities captured by six cameras. Training set is composed of 12,936 images of 751 identities while testing data is composed of the other images of 750 identities.

MSMT-17 [25] dataset contains 124,068 person images of 4,101 identities captured by 15 cameras (12 outdoor, 3 indoor). Training set is composed of 30,248 images of 1,041 identities while testing data is composed of the other images of 3060 identities.

DukeMTMC-reID [16] dataset contains 36,411 person images of 1,404 identities captured by eight cameras. They are randomly divided, with 702 identities as the training set and the remaining 702 identities as the testing set. In the testing set, for each ID in each camera, one image is picked for the query set while the rest remain for the gallery set.

Occluded-DukeMTMC re-splits the DukeMTMC-reID dataset to generate the new Occluded-DukeMTMC dataset. All query images and $10\%$ gallery images in the new dataset are occluded person images.

4.2 Implementation Detail

The input images are resized into $384\times 128$ . In training stage, we set batch size to be 16 by sampling 4 identities and 4 images per identity. The ResNet-50 [5] model pretrained on ImageNet is used as the backbone network. Some common data augmentation strategies including horizontal flipping, random cropping, random erasing [36] (with a probability of 0.5) are used. We adopt Gradient Descent optimizer to train our model and set weight decay $5\times 10^{-4}$ . The total number of epoch is 80. The learning rate is initialized to $6.25\times 10^{-3}$ and is decayed by cosine method until it equals to 0. At the beginning, we warm up the models for 5 epochs and the learning rate grows linearly from 0 to $6.25\times 10^{-3}$ .

4.3 Ablation Study

In this sub-section, we report the evaluation results of the influence of different components and hyper-parameters of our method.

4.3.1 Influence of Model Components

Table 1 shows the influence of the three stages of CACE-Net (i.e. individual feature embedding, visual clue alignment and conditional feature embedding). Following methods are compared:

•

Individual Feature Embedding. This is a baseline method using only the individual feature vector extracted by the encoder in the individual exam stage. We observe that this method achieves lowest performance.
•

Visual Clue Alignment. The feature-maps extracted by the individual stage are further fed into the Visual Clue Alignment stage to get correspondent pixel pairs between two images. Instead of applying GCN, we directly compute the average feature similarity of the cross-image key-point pairs as the overall similarity of the two images. Table 1 shows an extra key-point alignment improves the baseline model by more than $1$ percentage point in terms of MAP.
•

Individual-GCN. All three stages are performed, but all cross image connections in the adjacent matrix are discarded and only intra-frame relations are considered. As show in Table 1, intra-frame relation based GCN gives around $1$ percent improvement in terms of MAP, which indicates that extracting features based on local information and correlation inside individual image can boost ReID performance.
•

Pairwise-GCN (Normal). Pairwise-GCN considers both intra-frame relation and inter-frame relation in GCN. Here the common GCN that uses a smooth operation on adjacent node is applied and we observe that normal GCN does not achieve performance improvement, which indicates that normal GCN is not suitable for extracting conditional features for ReID.
•

Pairwise-GCN (Discrepancy-based). Our novel discrepancy-based GCN is used and achieves obvious improvement compared to normal GCN. Furthermore, adding inter-frame relation between two images outperforms individual-GCN by a clear margin, showing the effectiveness of extracting conditional features based on local correlation between image pairs.

We visualize the conditional feature-map obtained by CACE-NET in Figure 4. The feature-map is visualized by computing the L2 norm of each feature in the feature-map. As shown in Figure 4, the feature-map of the query image (the left image in each column) changes drastically based on the gallery image it matches (the right image in each column). For example, the feature-map will have higher activation on the lower body on column 1 row 2, column 2 row 1 and column 3 row 1, because the lower body has more distinctive features to tell the query image and the corresponding gallery image apart. This visualization results prove that CACE-NET is able to dynamically adjust the feature embedding based on the image it matches, i.e. the conditional feature embedding.

Table 1: Performance (%) comparisons of three stages (Individual Feature Embedding, Visual Clue Alignment and Conditional Embedding) on DukeMTMC.

Method	mAP	Rank-1
Individual Feature Embedding	77.03	87.84
+ Visual Clue Alignment	78.29	89.68
+ Individual-GCN	79.19	89.77
+ Pairwise-GCN (Normal)	78.58	89.77
+ Pairwise-GCN (Discrepancy-based)	81.29	90.89

Table 2: Performance (%) comparisons of different alignment strategy for graph generation in CACE-Net on DukeMTMC.

Method	mAP	Rank-1
No Alignment (Baseline)	77.03	87.84
Part Alignment	78.39	89.45
Top-K Similarity	79.30	89.72
Fully Connect	80.19	89.90
Semantic Alignment	80.64	90.08
Correspondence Attention	81.29	90.89

Table 3: Performance (%) comparisons of different

\alpha

in mix-up id loss on Market1501.

Method	mAP	Rank-1
Cross Entropy (Baseline)	88.57	94.74
Mix-up $\alpha=0.95$	89.44	95.75
Mix-up $\alpha=0.9$	90.30	95.96
Mix-up $\alpha=0.8$	89.56	95.84
Mix-up $\alpha=0.7$	88.72	95.43
Mix-up $\alpha=0.6$	85.67	94.83

Table 4: Performance (%) comparisons to the state-of-the-art results on Market-1501, DukeMTMC-reID and MSMT-17. Our proposed CACE-Net outperforms the state-of-the-art methods.

Category	Method	Market-1501		DukeMTMC-reID		MSMT-17
Category	Method	mAP	Rank-1	mAP	Rank-1	mAP	Rank-1
Part-based	PCB (ECCV 2018) [20]	77.4	92.3	66.1	81.7	-	-
	MGN (ACM MM 2018) [23]	86.9	95.7	78.4	88.7	-	-
	Pyramid (CVPR 2019) [33]	88.2	95.7	79.0	89.0	-	-
	RelationNet (AAAI 2020) [14]	88.9	95.2	78.6	89.7	-	-
Alignment	AlignedReID (ECCV 2018) [20]	79.3	91.8	-	-	-	-
	FD-GAN (NIPS 2018) [4]	77.7	90.5	64.5	80.0	-	-
	VPM (CVPR 2019) [19]	80.8	93.0	-	-	-	-
	HOReID (CVPR 2020) [22]	84.9	94.2	75.6	86.9	-	-
Attention	ABD-Net (ICCV 2019) [2]	88.28	95.6	78.59	89.0	60.8	82.3
	RGA-SC (CVPR 2020) [30]	88.4	96.1	-	-	57.5	80.3
	SCSN (CVPR 2020) [3]	88.5	95.7	79.0	91.0	58.5	83.8
Joint Learning	SMI (PR 2018) [27]	65.25	86.15	-	-	-	-
Joint Learning	DCCs (TNNLS 2020) [26]	71.1	88.4	59.2	80.3	-	-
Graph-based	Group-shuffling (CVPR 2018) [17]	82.5	92.7	66.4	80.7	-	-
Graph-based	SGGNN (ECCV 2018) [18]	82.8	92.3	68.2	81.1	-	-
CACE-Net	CACE-Net	90.3	95.96	81.29	90.89	62.0	83.54

4.3.2 Influence of Alignment Strategy

In order to prove the advantage of using automatically learned neural network to predict the correspondence between pixel in the feature-map pairs, Table 2 compares the influence of different alignment strategies to our CACE-Net method. Following alignment strategy is evaluated:

•

No Alignment: The baseline method uses only feature vector from individual exam stage where no connection exists between any pixels.
•

Part-based Alignment: Similar to the part-based model like PCB [20], two pixels at the same location of the images are aligned. As shown in Table 2, part alignment strategy outperforms the baseline method but achieve lower performance than correspondence attention, because it is not robust to scale changes and unable to rule out non-decisive pairs.
•

Fully Connect: A fully connected adjacent matrix is applied where all pixels are correspondent. In this way, the contextual information for all other pixels are explored when extracting a conditional feature. This method outperforms the baseline but does not achieve the best performance, because too much redundant contextual information is involved in a fully connected graph.
•

Similarity-based Alignment: Similar to AlignedReID [28], this method selects the most similar pixel as each pixel’s correspondent neighbour. It does not perform as well as our method because as discussed in section 1, the predefined similarity-based alignment strategy is able to find the most decisive and discriminative region pairs.
•

Semantic Alignment: Pixels lie in the same semantic body part are connected as neighbours whose distances are 1, unlike soft distances in our correspondence attention. Despite the help of extra human parsing network, this method is inferior to our correspondence attention.
•

Correspondence Attention Module: With a correspondence attention module, CACE-Net is able to build a more crucial correspondence compared to pre-defined rules and achieves the best performance.

In Figure 5 we compare the visualization results of similarity-based alignment and correspondence attention, where crucial region pairs with highest scores are visualized. We observe that similarity-based alignment that only aligns region pairs with high visual similarity, causing mis-matching images with similar local parts (e.g., the lower body in Figure 5 a,b and upper-body in Figure 5 c). On the other hand, our correspondence attention disregards the visual similarity and is able to focus on correct decisive clues to reject image pairs (e.g. the shoulder area in Figure 5).

Figure 6 compares the retrieval results of baseline method and CACE-Net. We observe that in hard cases like similar clothing or view-point variance show in Figure 6, CACE-Net is able to achieve much better results. It proves that even for images with extremely similar general visual appearance, CACE-Net is able to discover and compare crucial visual clues, such as the hair style in Figure6 row 1 and row 2, pattern on the t-shirt in row 3 and shoes in row 4.

4.3.3 Influence of Mix-up ID Loss

Table 3 shows the influence of our customized Mix-up Loss for conditional feature embedding. As shown in Table 3, with the right hyper-parameter $\alpha$ , our proposed Mix-up Loss significantly outperforms common cross entropy loss. We also observe the influence of the hyper-parameter $\alpha$ to the model performance, where CACE-Net achieves the best performance when $\alpha$ is set to $0.9$ , which shows that conditional feature contains only small amount of information from contextual image compared to the target image.

4.4 Comparison with the State-of-the-Art

Table 5: Performance (%) comparisons of State-of-the-art occluded ReID method and CACE-Net on Occluded Duke.

Method	mAP	Rank-1
PCB (ECCV2018)[21]	33.7	42.6
DSR (CVPR2018)[6]	30.4	40.8
SFR [7]	32.0	42.3
PGFA (ICCV 2019)[13]	37.3	51.4
HOReID (CVPR 2020) [22]	43.8	55.1
SGSFA (ACML 2020) [15]	47.4	62.3
CACE-Net	50.8	58.8

We evaluate our proposed CACE-Net with the state-of-the-art ReID models. These methods include: (1) the part-based models such as Pyramid, RelationNet; (2) the alignment-based methods like AlignReID, VPM, HOReID; (3) the attention-based methods like ABD-Net, RGA-SC, SCSN; (4) Joint learning methods including DCCs and SMI that learns conditional features with RNN; (5) ReID methods that utilizes graph structure such as Group-shuffling Random Walk and SGGNN. Table 4 shows the performance comparison of CACE-Net with State-of-the-Art methods. As shown in Table 4, thanks to our novel ReID framework that integrates visual clue alignment and conditional feature embedding, CACE-Net outperforms most of the state-of-the-art methods on Market1501, DukeMTMC and MSMT-17.

Furthermore, our CACE-Net can also be applied in ReID in occluded person. As shown in Table 5, other than archieves state-of-the-art method on the general ReID datasets, CACE-Net also outperforms most of the existing occluded ReID methods. It further demonstrates CACE-Net has the ability to solve hard cases like body occlusion.

5 Conclusions

This paper proposes a novel Person ReID framework that integrates both visual clue alignment and conditional feature embedding. Our proposed CACE-Net is able to automatically select crucial region pairs by correspondence attention module, and extract conditional feature embedding from the key-point pairs with a novel discrepancy-based GCN. The experiments show the effectiveness of our model.

References

[1] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.
[2] Tianlong Chen, Shaojin Ding, Jingyi Xie, Ye Yuan, Wuyang Chen, Yang Yang, Zhou Ren, and Zhangyang Wang. Abd-net: Attentive but diverse person re-identification. arXiv preprint arXiv:1908.01114, 2019.
[3] Xuesong Chen, Canmiao Fu, Yong Zhao, Feng Zheng, Jingkuan Song, Rongrong Ji, and Yi Yang. Salience-guided cascaded suppression network for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3300–3310, 2020.
[4] Yixiao Ge, Zhuowan Li, Haiyu Zhao, Guojun Yin, Shuai Yi, Xiaogang Wang, et al. Fd-gan: Pose-guided feature distilling gan for robust person re-identification. In Advances in neural information processing systems, pages 1222–1233, 2018.
[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[6] Lingxiao He, Jian Liang, Haiqing Li, and Zhenan Sun. Deep spatial feature reconstruction for partial person re-identification: Alignment-free approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7073–7082, 2018.
[7] Lingxiao He, Zhenan Sun, Yuhao Zhu, and Yunbo Wang. Recognizing partial biometric patterns. arXiv preprint arXiv:1810.07399, 2018.
[8] Alexander Hermans, Lucas Beyer, and Bastian Leibe. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.
[9] Mahdi M. Kalayeh, Emrah Basaran, Muhittin Gökmen, Mustafa E. Kamasak, and Mubarak Shah. Human semantic parsing for person re-identification. In CVPR, June 2018.
[10] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
[11] Dangwei Li, Xiaotang Chen, Zhang Zhang, and Kaiqi Huang. Learning deep context-aware features over body and latent parts for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 384–393, 2017.
[12] Wei Li, Xiatian Zhu, and Shaogang Gong. Harmonious attention network for person re-identification. In CVPR, June 2018.
[13] Jiaxu Miao, Yu Wu, Ping Liu, Yuhang Ding, and Yi Yang. Pose-guided feature alignment for occluded person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, pages 542–551, 2019.
[14] Hyunjong Park and Bumsub Ham. Relation network for person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 11839–11847, 2020.
[15] Xuena Ren, Dongming Zhang, and Xiuguo Bao. Semantic-guided shared feature alignment for occluded person re-identification. In Asian Conference on Machine Learning, pages 17–32. PMLR, 2020.
[16] Ergys Ristani, Francesco Solera, Roger Zou, Rita Cucchiara, and Carlo Tomasi. Performance measures and a data set for multi-target, multi-camera tracking. In European Conference on Computer Vision, pages 17–35. Springer, 2016.
[17] Yantao Shen, Hongsheng Li, Tong Xiao, Shuai Yi, Dapeng Chen, and Xiaogang Wang. Deep group-shuffling random walk for person re-identification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2265–2274, 2018.
[18] Yantao Shen, Hongsheng Li, Shuai Yi, Dapeng Chen, and Xiaogang Wang. Person re-identification with deep similarity-guided graph neural network. In Proceedings of the European conference on computer vision (ECCV), pages 486–504, 2018.
[19] Yifan Sun, Qin Xu, Yali Li, Chi Zhang, Yikang Li, Shengjin Wang, and Jian Sun. Perceive where to focus: Learning visibility-aware part-level features for partial person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 393–402, 2019.
[20] Yifan Sun, Liang Zheng, Yi Yang, Qi Tian, and Shengjin Wang. Beyond part models: Person retrieval with refined part pooling. In ECCV, September 2018.
[21] Yifan Sun, Liang Zheng, Yi Yang, Qi Tian, and Shengjin Wang. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In Proceedings of the European Conference on Computer Vision (ECCV), pages 480–496, 2018.
[22] Guan’an Wang, Shuo Yang, Huanyu Liu, Zhicheng Wang, Yang Yang, Shuliang Wang, Gang Yu, Erjin Zhou, and Jian Sun. High-order information matters: Learning relation and topology for occluded person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6449–6458, 2020.
[23] Guanshuo Wang, Yufeng Yuan, Xiong Chen, Jiwei Li, and Xi Zhou. Learning discriminative features with multiple granularities for person re-identification. In ACM MM, pages 274–282. ACM, 2018.
[24] Kan Wang, Changxing Ding, Stephen J Maybank, and Dacheng Tao. Cdpm: convolutional deformable part models for semantically aligned person re-identification. IEEE Transactions on Image Processing, 2019.
[25] Longhui Wei, Shiliang Zhang, Wen Gao, and Qi Tian. Person transfer gan to bridge domain gap for person re-identification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 79–88, 2018.
[26] Lin Wu, Yang Wang, Junbin Gao, Meng Wang, Zheng-Jun Zha, and Dacheng Tao. Deep coattention-based comparator for relative representation learning in person re-identification. IEEE Transactions on Neural Networks and Learning Systems, 2020.
[27] Lin Wu, Yang Wang, Xue Li, and Junbin Gao. What-and-where to match: Deep spatially multiplicative integration networks for person re-identification. Pattern Recognition, 76:727–738, 2018.
[28] Xuan Zhang, Hao Luo, Xing Fan, Weilai Xiang, Yixiao Sun, Qiqi Xiao, Wei Jiang, Chi Zhang, and Jian Sun. Alignedreid: Surpassing human-level performance in person re-identification. arXiv preprint arXiv:1711.08184, 2017.
[29] Zhizheng Zhang, Cuiling Lan, Wenjun Zeng, and Zhibo Chen. Densely semantically aligned person re-identification. In CVPR, June 2019.
[30] Zhizheng Zhang, Cuiling Lan, Wenjun Zeng, Xin Jin, and Zhibo Chen. Relation-aware global attention for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3186–3195, 2020.
[31] Haiyu Zhao, Maoqing Tian, Shuyang Sun, Jing Shao, Junjie Yan, Shuai Yi, Xiaogang Wang, and Xiaoou Tang. Spindle net: Person re-identification with human body region guided feature decomposition and fusion. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1077–1085, 2017.
[32] Liming Zhao, Xi Li, Yueting Zhuang, and Jingdong Wang. Deeply-learned part-aligned representations for person re-identification. In The IEEE International Conference on Computer Vision, Oct 2017.
[33] Feng Zheng, Cheng Deng, Xing Sun, Xinyang Jiang, Xiaowei Guo, Zongqiao Yu, Feiyue Huang, and Rongrong Ji. Pyramidal person re-identification via multi-loss dynamic training. In CVPR, June 2019.
[34] Liang Zheng, Yujia Huang, Huchuan Lu, and Yi Yang. Pose invariant embedding for deep person re-identification. IEEE Transactions on Image Processing, 2019.
[35] Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. Scalable person re-identification: A benchmark. In The IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[36] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. arXiv preprint arXiv:1708.04896, 2017.
[37] Zhihui Zhu, Xinyang Jiang, Feng Zheng, Xiaowei Guo, Feiyue Huang, Weishi Zheng, and Xing Sun. Viewpoint-aware loss with angular regularization for person re-identification. arXiv preprint arXiv:1912.01300, 2019.