
Noisy-Correspondence Learning for Text-to-Image Person Re-identification

Yang Qin1    Yingke Chen2    Dezhong Peng1,4,5    Xi Peng1    Joey Tianyi Zhou3    Peng Hu1,*
1College of Computer Science, Sichuan University, Chengdu, 610095, China.
2Department of Computer and Information Sciences, Northumbria University, Newcastle upon Tyne NE1 8ST, UK.
3Centre for Frontier AI Research (CFAR) and Institute of High Performance Computing (IHPC), A*STAR, Singapore.
4Sichuan Newstrong UHD Video Technology Co., Ltd., Chengdu 610095, China.
5Chengdu Ruibei Yingte Information Technology Company Ltd., Chengdu 610065, China.
Corresponding author: Peng Hu (penghu.ml@gmail.com).
Abstract

Text-to-image person re-identification (TIReID) is a compelling topic in the cross-modal community, which aims to retrieve the target person based on a textual query. Although numerous TIReID methods have been proposed and have achieved promising performance, they implicitly assume that the training image-text pairs are correctly aligned, which is not always the case in real-world scenarios. In practice, image-text pairs are inevitably under-correlated or even falsely correlated, a.k.a. noisy correspondence (NC), due to low-quality images and annotation errors. To address this problem, we propose a novel Robust Dual Embedding method (RDE) that can learn robust visual-semantic associations even with NC. Specifically, RDE consists of two main components: 1) A Confident Consensus Division (CCD) module that leverages the dual-grained decisions of dual embedding modules to obtain a consensus set of clean training data, which enables the model to learn correct and reliable visual-semantic associations. 2) A Triplet Alignment Loss (TAL) that relaxes the conventional Triplet Ranking loss, which uses only the hardest negative samples, to a log-exponential upper bound over all negative ones, thus preventing model collapse under NC while still focusing on hard-negative samples for promising performance. We conduct extensive experiments on three public benchmarks, namely CUHK-PEDES, ICFG-PEDES, and RSTPReID, to evaluate the performance and robustness of our RDE. Our method achieves state-of-the-art results both with and without synthetic noisy correspondences on all three datasets. Code is available at https://github.com/QinYang79/RDE.

1 Introduction

Text-to-image person re-identification (TIReID) [27, 45, 24] aims to understand natural language descriptions to retrieve the matched person image from a large gallery set. This task has recently received increasing attention from both academic and industrial communities, e.g., for finding/tracking suspect/lost persons in a surveillance system [47, 11]. However, TIReID remains a challenging task due to the inherent heterogeneity gap across modalities and the redundancy of appearance attributes.

Figure 1: The illustration of noisy correspondence. The figure shows an example of the NC problem, which occurs when image-text pairs are wrongly aligned, i.e., false positive pairs (FPPs). Since the model does not know which pairs are noisy in practice, such pairs unavoidably degrade performance by providing incorrect supervision. As seen in the figure, (a) the clean image-text pair is semantically matched, while (b) the noisy pair is not, which would cause the cross-modal model to learn erroneous visual-textual associations. Note that both examples in (a) and (b) actually exist in the RSTPReid dataset [62].

To tackle these challenges, most existing methods explore global- and local-matching alignment to learn accurate similarity measurements for person re-identification. To be specific, some global-matching methods [60, 52, 45] leverage vision/language backbones to extract modality-specific features and employ contrastive learning to achieve global visual-semantic alignment. To capture fine-grained information, some local-matching methods [34, 25, 46, 42] explicitly align local body regions to textually described entities/objects to improve the discriminability of pedestrian features. Recently, some works [19, 53, 24] exploit the visual/semantic knowledge learned by pre-trained models, such as BERT [6], ViT [10], and CLIP [40], to achieve explicit global alignment or discover more fine-grained local correspondence, thus boosting re-identification performance. Although these methods achieve remarkable progress, they implicitly assume that all training image-text pairs are aligned correctly.

In reality, this assumption is hard or even impossible to hold due to the person’s pose, camera angle, illumination, and other inevitable factors in images, which may result in inaccurate/mismatched textual descriptions of images (see Figure 1), e.g., in the RSTPReid dataset [62]. Moreover, we observe that an excess of such imperfect/mismatched image-text pairs causes overfitting and degrades the performance of existing TIReID methods, as shown in Figure 5. Based on this observation, in this paper, we reveal and study a new problem in TIReID, i.e., noisy correspondence (NC). Different from noisy labels, NC refers to false correspondences of image-text pairs in TIReID, i.e., False Positive Pairs (FPPs): some negative image-text pairs are used as positive ones for cross-modal learning. Inevitably, FPPs will misguide models to overfit the noisy supervision and collapse to suboptimal solutions due to the memorization effect [1] of Deep Neural Networks (DNNs).

To address the NC problem, we propose a Robust Dual Embedding method (RDE) for TIReID in this paper, which benefits from an effective Confident Consensus Division (CCD) mechanism and a novel Triplet Alignment Loss (TAL). Specifically, CCD fuses dual-grained decisions to consensually divide the training data into clean and noisy sets, thus providing more reliable correspondences for robust learning. To diversify the granularity of the model, the basic global embedding (BGE) and token selection embedding (TSE) modules are presented for coarse-grained and fine-grained cross-modal interactions, respectively, thus capturing visual-semantic associations comprehensively. Different from the widely used Triplet Ranking loss with the hardest negatives, our TAL relaxes the similarity learning from the hardest negative samples to all negative ones by applying an upper bound, which stabilizes training against collapse under NC while still benefiting from hard-negative mining to achieve promising performance. As a result, our RDE achieves robustness against NC thanks to the proposed reliable supervision and stable triplet loss. The contributions and innovations of this paper are summarized as follows:

  • We reveal and study a new and ubiquitous problem in TIReID, termed noisy correspondence (NC). Different from class-level noisy labels, NC refers to erroneous correspondences in the person-description pairs that can mislead the model to learn incorrect visual-semantic associations. To the best of our knowledge, this paper could be the first work to explore this problem in TIReID.

  • We propose a robust method, termed RDE, to mitigate the adverse impact of NC through the proposed Confident Consensus Division (CCD) and novel Triplet Alignment Loss (TAL). By using CCD and TAL, RDE can obtain convincing consensus pairs and reduce the misleading risks in training, thus embracing robustness against NC.

  • Extensive experiments on three public image-text person benchmarks demonstrate the robustness and superiority of our method. Our method achieves the best performance both with and without synthetic noisy correspondence on all three datasets.

2 Related Work

2.1 Text-to-Image Person Re-identification

Text-to-image person re-identification (TIReID) is a novel and challenging task that aims to match a person image with a given natural language description [27, 60, 2, 55, 44, 33, 3, 4, 43, 29]. Existing TIReID methods can be roughly classified into two groups according to their alignment levels, i.e., global-matching methods [45, 61, 62] and local-matching methods [16, 46, 42]. The former learn cross-modal embeddings in a common latent space by employing textual and visual backbones with a matching loss (e.g., the CMPM/C loss [60] and the Triplet Ranking loss [12]) for TIReID. However, these methods mainly focus on global features while ignoring the fine-grained interactions between local features, which limits their performance. To achieve fine-grained interactions, the latter methods explore explicit local alignments between body regions and textual entities for more refined matching. However, these methods require more computational resources due to the complex local-level associations. Recently, inspired by and benefiting from vision-language pre-training models [40], some methods [19, 53, 24] exploit the rich alignment knowledge learned by pre-trained models for local or global alignment. Although these methods achieve promising performance, almost all of them implicitly assume that all input training pairs are correctly aligned, which is hard to meet in practice due to ubiquitous noise. In this paper, we address the inevitable and challenging noisy correspondence problem in TIReID.

Figure 2: The overview of our RDE. (a) is the illustration of the cross-modal embedding model used in RDE, which consists of the basic global embedding (BGE) and token selection embedding (TSE) modules with different granularity. By integrating them, RDE can capture coarse-grained cross-modal interactions while selecting informative local token features to encode more fine-grained representations for a more accurate similarity. (b) shows the core of RDE for robust similarity learning, which consists of Confident Consensus Division (CCD) and Triplet Alignment Loss (TAL). CCD performs consensus division to obtain confident clean training data, thus avoiding misleading supervision from noisy pairs. Unlike the traditional Triplet Ranking Loss (TRL) [12], TAL exploits an upper bound to consider all negative pairs, thus embracing more stable learning.

2.2 Learning with Noisy Correspondence

As a special learning paradigm related to noisy labels [26, 32, 14] in the multi-modal/multi-view community [21, 37, 38, 57], the study of noisy correspondence (NC), i.e., negative pairs wrongly treated as positive ones (false positive pairs, FPPs), has recently attracted more and more attention in various tasks, e.g., video-text retrieval [59], visible-infrared person re-identification [56, 58, 31], and image-text matching [23, 36]. To handle this problem, numerous methods have been proposed to learn with NC, which can be broadly categorized into sample selection [23, 59, 18] and robust loss functions [56, 36, 22, 39]. The former commonly leverage the memorization effect of DNNs [1] to gradually distinguish the noisy data, thus paying more attention to clean data and less to noisy data. Differently, the latter aim to develop noise-tolerant loss functions to improve the robustness of model training against NC. Although the aforementioned methods achieve promising performance in various tasks, they are not specifically designed for TIReID and may be inefficient or ineffective in person re-identification. In this paper, we propose a well-designed method to tackle the NC problem in TIReID, which not only performs well in noisy scenarios but also achieves promising performance in ordinary ones.

3 Methodology

3.1 Problem Statement

The purpose of TIReID is to retrieve the pedestrian image from the gallery set that matches a given textual description. For clarity, we represent the gallery set as $\mathcal{V}=\{I_{i},y^{p}_{i},y^{v}_{i}\}^{N_{v}}_{i=1}$ and the corresponding text set as $\mathcal{T}=\{T_{i},y^{v}_{i}\}^{N_{t}}_{i=1}$, where $N_{v}$ is the number of images, $N_{t}$ is the number of texts, $y^{p}_{i}\in\mathcal{Y}_{p}=\{1,\cdots,C\}$ is the class label (person identity), $C$ is the number of identities, and $y^{v}_{i}\in\mathcal{Y}_{v}=\{1,\cdots,N_{v}\}$ is the image label. The image-text pair set used in TIReID can be defined as $\mathcal{P}=\{(I_{i},T_{i}),y^{v}_{i},y^{p}_{i}\}^{N}_{i=1}$, where the cross-modal samples of each pair share the same image label $y_{i}^{v}$ and class label $y_{i}^{p}$. We define a binary correspondence label $l_{ij}\in\{0,1\}$ to indicate the matching degree of any image-text pair: if $l_{ij}=1$, the pair $(I_{i},T_{j})$ is matched (positive pair); otherwise, it is not (negative pair). In practice, due to ubiquitous annotation noise, some unmatched pairs ($l_{ij}=0$) are wrongly labeled as matched ($l_{ij}=1$), resulting in noisy correspondences (NCs) and performance degradation. To handle NC for robust TIReID, we present RDE, which leverages Confident Consensus Division (CCD) and a Triplet Alignment Loss (TAL) to mitigate the negative impact of NC.

3.2 Cross-modal Embedding Model

In this section, we describe the cross-modal model used in our RDE. Following previous work [24], we utilize the visual encoder $f^{v}$ and textual encoder $f^{t}$ of the pre-trained CLIP model as modality-specific encoders to obtain token representations and implement cross-modal interactions through two embedding modules.

3.2.1 Token Representations

Given an input image $I_{i}\in\mathcal{V}$, we use the visual encoder $f^{v}$ of CLIP to tokenize the image into a discrete token representation sequence of length $N_{\circ}+1$, i.e., $\boldsymbol{V}_{i}=f^{v}(I_{i})=\{\boldsymbol{v}^{i}_{g},\boldsymbol{v}^{i}_{1},\boldsymbol{v}^{i}_{2},\cdots,\boldsymbol{v}^{i}_{N_{\circ}}\}^{\top}\in\mathbb{R}^{(N_{\circ}+1)\times d}$, where $d$ is the dimensionality of the shared latent space. These features include the encoded feature $\boldsymbol{v}^{i}_{g}$ of the [CLS] token and the patch-level local features $\{\boldsymbol{v}^{i}_{j}\}^{N_{\circ}}_{j=1}$ of the $N_{\circ}$ fixed-size non-overlapping patches of $I_{i}$, wherein $\boldsymbol{v}^{i}_{g}$ serves as the global representation. For an input text $T_{i}\in\mathcal{T}$, we apply the textual encoder $f^{t}$ of CLIP to obtain global and local representations. Specifically, following IRRA [24], we first tokenize the input text $T_{i}$ using lower-cased byte pair encoding (BPE) with a 49,152-word vocabulary into a token sequence. The token sequence is bracketed with [SOS] and [EOS] tokens to mark the beginning and end of the sequence. Then, we feed the token sequence into $f^{t}$ to obtain the features $\boldsymbol{T}_{i}=\{\boldsymbol{t}^{i}_{s},\boldsymbol{t}^{i}_{1},\cdots,\boldsymbol{t}^{i}_{N_{\diamond}},\boldsymbol{t}^{i}_{e}\}^{\top}\in\mathbb{R}^{(N_{\diamond}+2)\times d}$, where $\boldsymbol{t}^{i}_{s}$ and $\boldsymbol{t}^{i}_{e}$ are the features of the [SOS] and [EOS] tokens and $\{\boldsymbol{t}_{j}^{i}\}^{N_{\diamond}}_{j=1}$ are the word-level local features of the $N_{\diamond}$ word tokens of text $T_{i}$. Generally, $\boldsymbol{t}^{i}_{e}$ can be regarded as the sentence-level global feature of $T_{i}$.
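For clarity, the following minimal PyTorch sketch shows how the global and local token features described above can be separated; the tensor shapes and the assumption that the [EOS] feature sits at a fixed (last) position are ours for illustration only, not the released code.

```python
import torch

# Illustrative splitting of the per-token features returned by the encoders.
B, N_patch, N_word, d = 4, 192, 20, 512

V = torch.randn(B, 1 + N_patch, d)   # visual tokens: [CLS] + N_patch patches
v_g = V[:, 0]                        # global image feature ([CLS] token)
v_local = V[:, 1:]                   # patch-level local features

T = torch.randn(B, 2 + N_word, d)    # textual tokens: [SOS] + words + [EOS]
t_s = T[:, 0]                        # [SOS] feature
t_e = T[:, -1]                       # [EOS] feature (sentence-level global feature),
                                     # assumed here to sit at the last position
t_local = T[:, 1:-1]                 # word-level local features

# v_g and t_e feed the BGE similarity below; v_local and t_local feed the TSE branch.
```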

3.2.2 Dual Embedding Modules

To measure the similarity between any image-text pair $(I_{i},T_{j})$, we can directly use the global features of the [CLS] and [EOS] tokens to compute the Basic Global Embedding (BGE) similarity via the cosine similarity, i.e., $S^{b}_{ij}={{\boldsymbol{v}^{i}_{g}}{\boldsymbol{t}^{j}_{e}}^{\top}}\big{/}\big({\|\boldsymbol{v}^{i}_{g}\|\|\boldsymbol{t}^{j}_{e}\|}\big)$, where the global features represent the global embedding representations of the two modalities. However, optimizing the BGE similarities alone may not capture the fine-grained interactions between the two modalities, which limits performance improvement. To address this issue, we exploit the local features of informative tokens to learn more discriminative embedding representations, thus mining the fine-grained correspondences. In CLIP, the global features of the tokens ([CLS] and [EOS]) are obtained by a weighted aggregation of all local token features, and these weights reflect the correlation between the global token and each local token. Following previous methods [53, 63], we can select the informative tokens based on these correlation weights to aggregate local features into a more representative global embedding.

In practice, these correlation weights can be obtained directly from the self-attention map of the last Transformer block of $f^{v}$ and $f^{t}$, which reflects the relevance among the input $1+N_{\circ}$ (or $2+N_{\diamond}$) tokens. Given the output self-attention map $\boldsymbol{A}^{v}_{i}\in\mathbb{R}^{(1+N_{\circ})\times(1+N_{\circ})}$ of image $I_{i}$, the correlation weights between the global token and the local tokens are $\{a^{v}_{i,j}\}^{N_{\circ}}_{j=1}=\boldsymbol{a}^{v}_{i}=\boldsymbol{A}^{v}_{i}[0,1:N_{\circ}+1]\in\mathbb{R}^{N_{\circ}}$. Similarly, for text $T_{i}$, the correlation weights are $\{a^{t}_{i,j}\}^{N_{\diamond}}_{j=1}=\boldsymbol{a}^{t}_{i}=\boldsymbol{A}^{t}_{i}[0,1:N_{\diamond}+1]\in\mathbb{R}^{N_{\diamond}}$, where $\boldsymbol{A}^{t}_{i}\in\mathbb{R}^{(2+N_{\diamond})\times(2+N_{\diamond})}$ is the output self-attention map of text $T_{i}$. Then, we select a proportion ($TopK$) of the corresponding token features with higher scores for embedding. Specifically, for $I_{i}$, the selected token sequence and correlation weights are reorganized as $\boldsymbol{V}^{s}_{i}=\{\boldsymbol{v}^{i}_{j}\}_{j\in\boldsymbol{K}^{v}_{i}}$ and $\hat{\boldsymbol{a}}^{v}_{i}=\{a^{v}_{i,j}\}_{j\in\boldsymbol{K}^{v}_{i}}$, where $\boldsymbol{K}^{v}_{i}\in\mathbb{R}^{\lfloor\mathcal{R}N_{\circ}\rfloor}$ is the set of indices of the selected local tokens of $I_{i}$ and $\mathcal{R}$ is the selection ratio. For text $T_{i}$, the selected token sequence and correlation weights are reorganized as $\boldsymbol{T}^{s}_{i}=\{\boldsymbol{t}^{i}_{j}\}_{j\in\boldsymbol{K}^{t}_{i}}$ and $\hat{\boldsymbol{a}}^{t}_{i}=\{a^{t}_{i,j}\}_{j\in\boldsymbol{K}^{t}_{i}}$, where $\boldsymbol{K}^{t}_{i}\in\mathbb{R}^{\min(\lfloor\mathcal{R}N_{\diamond}^{\prime}\rfloor,N_{\diamond})}$ is the set of indices of the selected local tokens of $T_{i}$ and $N_{\diamond}^{\prime}$ is the maximum input sequence length of $f^{t}$. For $I_{i}$ and $T_{i}$, we then perform an embedding transformation on these selected token features to obtain subtle representations, instead of using the complex fine-grained correspondence discovery adopted in CFine [53]. The transformation is performed by an embedding module similar to a residual block [20], as follows:

$$\boldsymbol{v}^{i}_{tse}=MaxPool\big{(}MLP(\hat{\boldsymbol{V}}^{s}_{i})+FC(\hat{\boldsymbol{V}}^{s}_{i})\big{)},\qquad \boldsymbol{t}^{i}_{tse}=MaxPool\big{(}MLP(\hat{\boldsymbol{T}}^{s}_{i})+FC(\hat{\boldsymbol{T}}^{s}_{i})\big{)}, \tag{1}$$

where $MaxPool(\cdot)$ is the max-pooling function, $MLP(\cdot)$ is a multi-layer perceptron (MLP), $FC(\cdot)$ is a linear layer, $\hat{\boldsymbol{V}}^{s}_{i}=L2Norm({\boldsymbol{V}}^{s}_{i})$, and $\hat{\boldsymbol{T}}^{s}_{i}=L2Norm({\boldsymbol{T}}^{s}_{i})$. $L2Norm(\cdot)$ is the $\ell_{2}$-normalization function. Finally, for any pair $(I_{i},T_{j})$, we compute the cosine similarity $S^{t}_{ij}$ between $\boldsymbol{v}_{tse}^{i}$ and $\boldsymbol{t}_{tse}^{j}$ as the Token Selection Embedding (TSE) similarity to measure the cross-modal matching degree for auxiliary training and inference.
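To make the TSE branch concrete, here is a minimal PyTorch sketch of the token selection and the embedding in Eq. 1; the layer sizes, the GELU activation, and whether the module is shared across modalities are assumptions for this sketch rather than the paper's released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenSelectionEmbedding(nn.Module):
    """Illustrative sketch of the TSE branch (Eq. 1); layer sizes are assumptions."""

    def __init__(self, dim=512, ratio=0.3):
        super().__init__()
        self.ratio = ratio  # corresponds to the selection ratio R
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.fc = nn.Linear(dim, dim)

    def forward(self, local_tokens, attn_weights):
        # local_tokens: (B, L, d) local token features
        # attn_weights: (B, L) correlation of the global token with each local token
        B, L, d = local_tokens.shape
        k = max(1, int(self.ratio * L))
        idx = attn_weights.topk(k, dim=1).indices                     # Top-K informative tokens
        selected = torch.gather(local_tokens, 1,
                                idx.unsqueeze(-1).expand(-1, -1, d))  # (B, k, d)
        selected = F.normalize(selected, dim=-1)                      # L2Norm
        emb = self.mlp(selected) + self.fc(selected)                  # residual-style embedding
        return emb.max(dim=1).values                                  # MaxPool over selected tokens

# Usage: the TSE similarity is the cosine similarity of the pooled embeddings.
tse = TokenSelectionEmbedding()
v_tse = tse(torch.randn(2, 192, 512), torch.rand(2, 192))
t_tse = tse(torch.randn(2, 20, 512), torch.rand(2, 20))
S_t = F.cosine_similarity(v_tse, t_tse, dim=-1)
```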

3.3 Robust Similarity Learning

In this section, we detail how we use the image-text similarities computed by the dual embedding modules for robust TIReID, which involves Confident Consensus Division (CCD) and Triplet Alignment Loss (TAL).

3.3.1 Confident Consensus Division

To alleviate the negative impact of NC, the key is to filter out possible noisy pairs from the training data, thereby avoiding false supervision. Some previous works on learning with noisy labels [17, 26, 23] are inspired by the memorization effect [1] of DNNs to perform such filtering, i.e., clean (easy) data tend to have smaller loss values than noisy (hard) data in early training. Based on this, we exploit a two-component Gaussian Mixture Model (GMM) to fit the per-sample loss distributions computed by the predictions of BGE and TSE to identify the noisy pairs in the training data. Specifically, given a cross-modal model $\mathcal{M}$, we first define the per-sample loss as:

$$\ell(\mathcal{M},\mathcal{P})=\{\ell_{i}\}^{N}_{i=1}=\big{\{}\mathcal{L}(I_{i},T_{i})\big{\}}^{N}_{i=1}, \tag{2}$$

where $\mathcal{L}$ is the loss function for pair $(I_{i},T_{i})\in\mathcal{P}$ that brings the two samples closer in the shared latent space; in our method, $\mathcal{L}$ is the proposed $\mathcal{L}_{tal}$ defined in Equation 5. Then, the per-sample losses are fed into the GMM to separate clean and noisy data, i.e., the Gaussian component with the lower mean is taken as the clean component and the other as the noisy one. Following [26, 23], we use the Expectation-Maximization algorithm to optimize the GMM and compute the posterior probability $p(k|\ell_{i})=p(k)p(\ell_{i}|k)/p(\ell_{i})$ of the $i$-th pair as its probability of being a clean/noisy pair, where $k\in\{0,1\}$ indicates the clean or noisy component. Then, we apply a threshold $\delta=0.5$ to $\{p(k=0|\ell_{i})\}^{N}_{i=1}$ to divide the data into clean and noisy sets, i.e.,

$$\mathcal{P}^{c}=\{(I_{i},T_{i})\,|\,p(k=0|\ell_{i})>\delta,\forall(I_{i},T_{i})\in\mathcal{P}\},\qquad \mathcal{P}^{n}=\{(I_{i},T_{i})\,|\,p(k=0|\ell_{i})\leq\delta,\forall(I_{i},T_{i})\in\mathcal{P}\}, \tag{3}$$

where $\mathcal{P}^{c}$ and $\mathcal{P}^{n}$ are the divided clean and noisy sets, respectively. For BGE and TSE, the divisions conducted with Equation 3 are $\mathcal{P}=\mathcal{P}^{c}_{bge}\cup\mathcal{P}^{n}_{bge}$ and $\mathcal{P}=\mathcal{P}^{c}_{tse}\cup\mathcal{P}^{n}_{tse}$, respectively.

To obtain the final reliable divisions, we exploit the consistency between the two divisions and take the consensus part as the final confident clean set, i.e., $\hat{\mathcal{P}}^{c}=\mathcal{P}^{c}_{bge}\cap\mathcal{P}^{c}_{tse}$. The rest of the data is divided into noisy and uncertain subsets, i.e., $\hat{\mathcal{P}}^{n}=\mathcal{P}^{n}_{bge}\cap\mathcal{P}^{n}_{tse}$ and $\hat{\mathcal{P}}^{u}=\mathcal{P}-(\hat{\mathcal{P}}^{c}\cup\hat{\mathcal{P}}^{n})$. Finally, we exploit the divisions to recalibrate the correspondence labels; for the $i$-th pair, this can be expressed as:

$$\hat{l}_{ii}=\begin{cases}1,&\text{if }(I_{i},T_{i})\in\hat{\mathcal{P}}^{c},\\ 0,&\text{if }(I_{i},T_{i})\in\hat{\mathcal{P}}^{n},\\ Rand(\{0,1\}),&\text{if }(I_{i},T_{i})\in\hat{\mathcal{P}}^{u},\end{cases} \tag{4}$$

where $Rand(X)$ randomly selects an element from the set $X$.
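The division itself can be implemented with an off-the-shelf Gaussian mixture fit. The sketch below, which assumes the per-pair losses of the two branches are already computed and uses scikit-learn with hypothetical function names, illustrates the consensus split and the label recalibration of Eq. 4.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def divide_by_gmm(losses, threshold=0.5):
    """Fit a two-component GMM to per-sample losses and return a boolean mask of
    pairs whose posterior probability of the low-mean (clean) component exceeds the threshold."""
    losses = np.asarray(losses, dtype=np.float64).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, max_iter=100, reg_covar=1e-4)
    gmm.fit(losses)
    clean_comp = int(np.argmin(gmm.means_.ravel()))
    p_clean = gmm.predict_proba(losses)[:, clean_comp]
    return p_clean > threshold

def consensus_labels(losses_bge, losses_tse, seed=0):
    """Intersect the per-branch decisions and recalibrate labels as in Eq. 4."""
    rng = np.random.default_rng(seed)
    clean_b = divide_by_gmm(losses_bge)
    clean_t = divide_by_gmm(losses_tse)
    consensus_clean = clean_b & clean_t          # confident clean set
    consensus_noisy = (~clean_b) & (~clean_t)    # confident noisy set
    labels = rng.integers(0, 2, size=len(clean_b))  # uncertain pairs: random 0/1
    labels[consensus_clean] = 1
    labels[consensus_noisy] = 0
    return labels
```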

3.3.2 Triplet Alignment Loss

The Triplet Ranking Loss (TRL) is a matching loss widely used in cross-modal learning, e.g., image-text matching [7] and video-text retrieval [9], and achieves promising performance by exploiting the hardest negatives. However, we find that this strategy may lead to bad local minima or even model collapse for TIReID under NC in the early stages of training. In contrast, the summation version of TRL that considers all negative samples, namely TRL-S, maintains better stability and avoids model collapse, but suffers from insufficient performance due to the lack of attention to hard negatives (see Section 3.3.3 for more discussion). Therefore, we propose a novel Triplet Alignment Loss (TAL) to guide TIReID, which differs from TRL in that it relaxes the optimization over the hardest negatives to all negatives with an upper bound (see Lemma 1). Thanks to this relaxation, TAL reduces the risk of the optimization being dominated by the hardest negatives, thereby making the training more stable and comprehensive by considering all pairs. For an input pair $(I_{i},T_{i})$ in a mini-batch $\mathbf{x}$, TAL is defined as

$$\begin{aligned}\mathcal{L}_{tal}(I_{i},T_{i})&=\big{[}m-S^{+}_{i2t}(I_{i})+\tau\log\big(\sum_{j=1}^{K}q_{ij}\exp(S(I_{i},T_{j})/\tau)\big)\big{]}_{+}\\&+\big{[}m-S^{+}_{t2i}(T_{i})+\tau\log\big(\sum_{j=1}^{K}q_{ji}\exp(S(I_{j},T_{i})/\tau)\big)\big{]}_{+},\end{aligned} \tag{5}$$

where $m$ is a positive margin, $\tau$ is a temperature coefficient that controls hardness, $S(I_{i},T_{j})\in\{S_{ij}^{b},S_{ij}^{t}\}$, $[x]_{+}\equiv\max(x,0)$, $\exp(x)\equiv e^{x}$, $q_{ij}=1-l_{ij}$, and $K$ is the size of $\mathbf{x}$. From Lemma 1, as $\tau\to 0$, TAL approaches TRL and focuses more on hard negatives. Since multiple positive pairs from the same identity may appear in the mini-batch, $S^{+}_{i2t}(I_{i})=\sum^{K}_{j=1}\alpha_{ij}S(I_{i},T_{j})$ is the weighted average similarity of the positive pairs of image $I_{i}$, where $\alpha_{ij}=\frac{l_{ij}\exp{(S(I_{i},T_{j})/\tau)}}{\sum^{K}_{k=1}l_{ik}\exp{(S(I_{i},T_{k})/\tau)}}$. Similarly, $S^{+}_{t2i}(T_{i})$ is the weighted average similarity of the positive pairs of text $T_{i}$.
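To make TAL concrete, here is a minimal PyTorch sketch of Eq. 5 over a mini-batch, assuming a similarity matrix from either branch and binary correspondence labels with at least one positive per row/column (e.g., the diagonal); the function name and the log-space masking are our own implementation choices, not the released code.

```python
import torch

def triplet_alignment_loss(S, labels, margin=0.1, tau=0.015):
    """Per-pair TAL (cf. Eq. 5). S: (K, K) similarity matrix with S[i, j] = S(I_i, T_j);
    labels: (K, K) binary correspondence labels l_ij."""
    labels = labels.float()

    def one_direction(sim, lab):
        # Weighted average similarity over the positive pairs (alpha in the paper).
        pos_logits = (sim / tau).masked_fill(lab == 0, float("-inf"))
        alpha = torch.softmax(pos_logits, dim=1)
        s_pos = (alpha * sim).sum(dim=1)
        # Log-sum-exp over all negative pairs: an upper bound of the hardest negative.
        neg_logits = (sim / tau).masked_fill(lab == 1, float("-inf"))
        lse_neg = tau * torch.logsumexp(neg_logits, dim=1)
        return torch.clamp(margin - s_pos + lse_neg, min=0)

    return one_direction(S, labels) + one_direction(S.t(), labels.t())  # i2t + t2i, shape (K,)

# Toy usage: identity correspondence within a batch of K pairs.
K = 8
S = torch.rand(K, K, requires_grad=True)
loss = triplet_alignment_loss(S, torch.eye(K)).mean()
loss.backward()
```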

Lemma 1

TAL is an upper bound of TRL, i.e.,

$$\mathcal{L}_{trl}(I_{i},T_{i})=\big{[}m-S^{+}_{i2t}(I_{i})+S(I_{i},\hat{T}_{i})\big{]}_{+}+\big{[}m-S^{+}_{t2i}(T_{i})+S(\hat{I}_{i},T_{i})\big{]}_{+}\leq\mathcal{L}_{tal}(I_{i},T_{i}), \tag{6}$$

where $\hat{T}_{i}\in\{T_{j}|l_{ij}=0,\forall j\in\{1,\cdots,K\}\}$ is the hardest negative text for image $I_{i}$ and $\hat{I}_{i}\in\{I_{j}|l_{ji}=0,\forall j\in\{1,\cdots,K\}\}$ is the hardest negative image for text $T_{i}$, respectively.

3.3.3 Revisiting Triplet Ranking Loss

To explore the behaviors of the triplet losses in the noisy case, we record the similarity distributions versus iterations for TRL, TRL-S, and the proposed TAL under 50% noise. From Figure 3(a), one can see that the similarities of all pairs gradually gather toward 1 during training with TRL, i.e., all samples collapse into a narrow neighborhood on the hypersphere, resulting in a trivial solution and poor performance (3.64%).

(a) TRL (3.64%)    (b) TRL-S (44.93%)    (c) TAL (63.35%)
Figure 3: The difference between TRL, TRL-S, and the proposed TAL in the similarity distribution versus iterations. The $y$-$z$ plane represents the similarity density. The corresponding test Rank-1 scores are given in parentheses for convenience.

To delve deeper into the underlying reason, we perform a gradient analysis. For ease of representation and analysis, we only consider one direction, since image-to-text retrieval and text-to-image retrieval are symmetric, and we assume that there is only one paired text for each image in the mini-batch. Due to the truncation operation $[x]_{+}$, we only discuss the case of $\mathcal{L}>0$, which generates gradients. Taking the image-to-text direction as an example, the gradients generated by TRL, TRL-S, and TAL are

$$\frac{\partial\mathcal{L}_{trl}}{\partial\boldsymbol{v}_{i}}=\hat{\boldsymbol{t}}_{i}-\boldsymbol{t}_{i},\quad\frac{\partial\mathcal{L}_{trl}}{\partial\boldsymbol{t}_{i}}=-\boldsymbol{v}_{i},\quad\frac{\partial\mathcal{L}_{trl}}{\partial\hat{\boldsymbol{t}}_{i}}=\boldsymbol{v}_{i}, \tag{7}$$

$$\frac{\partial\mathcal{L}_{trls}}{\partial\boldsymbol{v}_{i}}=\sum_{j\in\mathcal{Z}}(\boldsymbol{t}_{j}-\boldsymbol{t}_{i}),\quad\frac{\partial\mathcal{L}_{trls}}{\partial\boldsymbol{t}_{i}}=-|\mathcal{Z}|\boldsymbol{v}_{i},\quad\frac{\partial\mathcal{L}_{trls}}{\partial\boldsymbol{t}_{j}}=\boldsymbol{v}_{i}, \tag{8}$$

$$\frac{\partial\mathcal{L}_{tal}}{\partial\boldsymbol{v}_{i}}=\sum^{K}_{j\neq i}\beta_{j}(\boldsymbol{t}_{j}-\boldsymbol{t}_{i}),\quad\frac{\partial\mathcal{L}_{tal}}{\partial\boldsymbol{t}_{i}}=-\boldsymbol{v}_{i},\quad\frac{\partial\mathcal{L}_{tal}}{\partial\boldsymbol{t}_{j}}=\beta_{j}\boldsymbol{v}_{i}, \tag{9}$$

where $\mathcal{Z}=\{z\,|\,\big{[}m-S(I_{i},T_{i})+S(I_{i},T_{z})\big{]}_{+}>0,z\neq i,z\in\{0,\cdots,K\}\}$, $\beta_{j}=\frac{\exp(\boldsymbol{v}_{i}^{\top}\boldsymbol{t}_{j}/\tau)}{\sum^{K}_{k\neq i}\exp(\boldsymbol{v}_{i}^{\top}\boldsymbol{t}_{k}/\tau)}$, and $\hat{\boldsymbol{t}}_{i}$, $\boldsymbol{t}_{j}$, and $\boldsymbol{t}_{i}$ are the hardest negative sample, a negative sample, and the positive sample of the anchor $\boldsymbol{v}_{i}$, respectively. Since the hardest negative is the most similar to the positive one, $\frac{\partial\mathcal{L}_{trl}}{\partial\boldsymbol{v}_{i}}$ easily approaches $\boldsymbol{0}$ and the gradients for all negative samples other than the hardest one are $\boldsymbol{0}$, which may lead to bad local minima early in training and even cause the worst case, i.e., model collapse (see Figure 3(a)). Unlike TRL, TRL-S pushes all negative samples away from the anchor by a constant margin and produces stronger gradients for the anchor, i.e., $\|\frac{\partial\mathcal{L}_{trls}}{\partial\boldsymbol{v}_{i}}\|_{2}\geq\|\frac{\partial\mathcal{L}_{trl}}{\partial\boldsymbol{v}_{i}}\|_{2}$, thus avoiding model collapse (see Figure 3(b)). However, TRL-S treats every negative sample equally and ignores the challenging ones, which limits performance improvement. Different from TRL and TRL-S, as shown in Equation 9, our TAL comprehensively considers all negative samples and exploits the anchor-negative semantic relationships to adaptively adjust the gradient of each negative, thus paying more attention to hard negatives. As a result, TAL avoids model collapse under NC while achieving superior performance (63.35% vs. 44.93% vs. 3.64%). More details on the gradient derivations are provided in the supplementary material.
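To complement this analysis, the snippet below numerically compares the anchor gradients produced by the simplified one-direction TRL, TRL-S, and TAL on random unit-norm features; it is a toy illustration under our own assumptions (random features, one positive per anchor), not part of the method.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
K, d, m, tau = 8, 64, 0.1, 0.015

# One anchor image feature, its positive text, and K-1 negative texts (unit-norm).
v = F.normalize(torch.randn(d), dim=0).detach().requires_grad_(True)
T = F.normalize(torch.randn(K, d), dim=1)
t_pos, t_neg = T[0], T[1:]

s_pos = t_pos @ v
s_neg = t_neg @ v
trl = torch.clamp(m - s_pos + s_neg.max(), min=0)                         # hardest negative only
trls = torch.clamp(m - s_pos + s_neg, min=0).sum()                        # all negatives, equal weight
tal = torch.clamp(m - s_pos + tau * torch.logsumexp(s_neg / tau, dim=0), min=0)

for name, loss in [("TRL", trl), ("TRL-S", trls), ("TAL", tal)]:
    (g,) = torch.autograd.grad(loss, v, retain_graph=True)
    print(f"{name}: loss={loss.item():.3f}, anchor-gradient norm={g.norm().item():.3f}")
```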

Algorithm 1 The training process of our RDE
Input: The training data $\mathcal{P}$ with $N$ image-text pairs, the maximal epoch $N_{e}$, the cross-modal model $\mathcal{M}(\Theta)$, and the hyper-parameters $\mathcal{R},m,\tau$;
1:  Initialize the backbones with the weights of the pre-trained CLIP, except for the TSE module, which is randomly initialized;
2:  for $e=1,2,\cdots,N_{e}$ do
3:     Calculate the per-sample loss $\ell(\mathcal{M},\mathcal{P})$;
4:     Divide the training data with the predictions of BGE and TSE using Equation 3, respectively;
5:     Obtain the consensus divisions to recalibrate the correspondence labels $\{\hat{l}_{ii}\}^{N}_{i=1}$ with Equation 4;
6:     for $\mathbf{x}$ in mini-batches $\{\mathbf{x}_{m}\}^{M}_{m=1}$ do
7:        Extract the BGE and TSE features of $\mathbf{x}$;
8:        Compute the similarities between the $K$ image-text pairs in $\mathbf{x}$ with the above features;
9:        Calculate the final matching loss $\mathcal{L}_{m}$ with Equation 10;
10:       $\Theta=\text{Optimizer}(\Theta,\mathcal{L}_{m})$;
11:     end for
12:  end for
Output: The optimized parameters $\hat{\Theta}$.

3.3.4 Training and Inference

To train the model robustly, we use the corrected label $\hat{l}_{ii}$ instead of the original correspondence label $l_{ii}$ to compute the final matching loss, i.e.,

$$\mathcal{L}_{m}=\sum^{K}_{i=1}\hat{l}_{ii}\big{(}\mathcal{L}^{b}(I_{i},T_{i})+\mathcal{L}^{t}(I_{i},T_{i})\big{)}, \tag{10}$$

where $\mathcal{L}^{b}(I_{i},T_{i})$ and $\mathcal{L}^{t}(I_{i},T_{i})$ are the TAL losses computed by Equation 5 with the BGE and TSE similarities, respectively. The training process of RDE is summarized in Algorithm 1. For joint inference, we compute the final similarity of an image-text pair as the average of the similarities computed by the two embedding modules, i.e., $S=(S^{b}+S^{t})/2$.
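Putting the pieces together, a minimal sketch of Eq. 10 and the joint inference rule is given below; the per-pair TAL values and the recalibrated labels are assumed to be precomputed, and all names are illustrative.

```python
import torch

def matching_loss(tal_bge, tal_tse, l_hat):
    """Eq. 10: per-pair TAL losses from the BGE and TSE branches, weighted by the
    recalibrated correspondence labels l_hat (1 for consensus-clean pairs, 0 for noisy ones)."""
    return (l_hat * (tal_bge + tal_tse)).sum()

def joint_similarity(S_b, S_t):
    """Joint inference: average the BGE and TSE similarity matrices."""
    return 0.5 * (S_b + S_t)

# Toy example: rank gallery images for the first text query.
K = 4
S_b, S_t = torch.rand(K, K), torch.rand(K, K)
l_hat = torch.tensor([1.0, 1.0, 0.0, 1.0])          # from CCD (illustrative)
loss = matching_loss(torch.rand(K), torch.rand(K), l_hat)
ranking = joint_similarity(S_b, S_t)[:, 0].argsort(descending=True)
```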

CUHK-PEDES ICFG-PEDES RSTPReid
Noise Methods R-1 R-5 R-10 mAP mINP R-1 R-5 R-10 mAP mINP R-1 R-5 R-10 mAP mINP
0% SSAN Best 61.37 80.15 86.73 - - 54.23 72.63 79.53 - - 43.50 67.80 77.15 - -
IVT Best 65.59 83.11 89.21 - - 56.04 73.60 80.22 - - 46.70 70.00 78.80 - -
CFine Best 69.57 85.93 91.15 - - 60.83 76.55 82.42 - - 50.55 72.50 81.60 - -
IRRA Best 73.38 89.93 93.71 66.13 50.24 63.46 80.25 85.82 38.06 7.93 60.20 81.30 88.20 47.17 25.28
RDE Best 75.94 90.14 94.12 67.56 51.44 67.68 82.47 87.36 40.06 7.87 65.35 83.95 89.90 50.88 28.08
20% SSAN Best 46.52 68.36 77.42 42.49 28.13 40.57 62.58 71.53 20.93 2.22 35.10 60.00 71.45 28.90 12.08
Last 45.76 67.98 76.28 40.05 24.12 40.28 62.68 71.53 20.98 2.25 33.45 58.15 69.60 26.46 10.08
IVT Best 58.59 78.51 85.61 57.19 45.78 50.21 69.14 76.18 34.72 8.77 43.65 66.50 75.70 37.22 20.47
Last 57.67 78.04 85.02 56.17 44.42 48.70 67.42 75.06 34.44 9.25 37.95 63.35 73.75 34.24 19.67
IRRA Best 69.74 87.09 92.20 62.28 45.84 60.76 78.26 84.01 35.87 6.80 58.75 81.90 88.25 46.38 24.78
Last 69.44 87.09 92.04 62.16 45.70 60.58 78.14 84.20 35.92 6.91 54.00 77.15 85.55 43.20 22.53
CLIP-C Best 66.41 85.15 90.89 59.36 43.02 55.25 74.76 81.32 31.09 4.94 54.45 77.80 86.70 42.58 21.38
Last 66.10 86.01 91.02 59.77 43.57 55.17 74.58 81.46 31.12 4.97 53.20 76.25 85.40 41.95 21.95
DECL Best 70.29 87.04 91.93 62.84 46.54 61.95 78.36 83.88 36.08 6.25 61.75 80.70 86.90 47.70 26.07
Last 70.08 87.20 92.14 62.86 46.63 61.95 78.36 83.88 36.08 6.25 60.85 80.45 86.65 47.34 25.86
RDE Best 74.46 89.42 93.63 66.13 49.66 66.54 81.70 86.70 39.08 7.55 64.45 83.50 90.00 49.78 27.43
Last 74.53 89.23 93.55 66.13 49.63 66.51 81.70 86.71 39.09 7.56 63.85 83.85 89.45 50.27 27.75
50% SSAN Best 13.43 31.74 41.89 14.12 6.91 18.83 37.70 47.43 9.83 1.01 19.40 39.25 50.95 15.95 6.13
Last 11.31 28.07 37.90 10.57 3.46 17.06 37.18 47.85 6.58 0.39 14.10 33.95 46.55 11.88 4.04
IVT Best 50.49 71.82 79.81 48.85 36.60 43.03 61.48 69.56 28.86 6.11 39.70 63.80 73.95 34.35 18.56
Last 42.02 65.04 73.72 40.49 27.89 36.57 54.83 62.91 24.30 5.08 28.55 52.05 62.70 26.82 13.97
IRRA Best 62.41 82.23 88.40 55.52 38.48 52.53 71.99 79.41 29.05 4.43 56.65 78.40 86.55 42.41 21.05
Last 42.79 64.31 72.58 36.76 21.11 39.22 60.52 69.26 19.44 1.98 31.15 55.40 65.45 23.96 9.67
CLIP-C Best 64.02 83.66 89.38 57.33 40.90 51.60 71.89 79.31 28.76 4.33 53.45 76.80 85.50 41.43 21.17
Last 63.97 83.74 89.54 57.35 40.88 51.49 71.99 79.32 28.77 4.37 52.35 76.35 85.25 40.64 20.45
DECL Best 65.22 83.72 89.28 57.94 41.39 57.50 75.09 81.24 32.64 5.27 56.75 80.55 87.65 44.53 23.61
Last 65.09 83.58 89.26 57.89 41.35 57.49 75.10 81.23 32.63 5.26 55.00 80.50 86.50 43.81 23.31
RDE Best 71.33 87.41 91.81 63.50 47.36 63.76 79.53 84.91 37.38 6.80 62.85 83.20 89.15 47.67 23.97
Last 71.25 87.39 91.76 63.59 47.50 63.76 79.53 84.91 37.38 6.80 62.85 83.20 89.15 47.67 23.96
Table 1: Performance comparison under different noise rates on three benchmarks. “Best” means choosing the best checkpoint on the validation set to test, and “Last” stands for choosing the checkpoint after the last training epoch to conduct inference. R-1,5,10 is an abbreviation for Rank-1,5,10 (%) accuracy. The best and second-best results are in bold and underline, respectively.

4 Experiments

In this section, we conduct extensive experiments to verify the effectiveness and superiority of the proposed RDE on three widely-used benchmark datasets.

4.1 Datasets and Settings

4.1.1 Datasets

In the experiments, we use the CUHK-PEDES [27], ICFG-PEDES [8], and RSTPReid [62] datasets to evaluate our RDE. We follow the data partitions used in IRRA [24] to split the datasets into training, validation, and test sets; note that the ICFG-PEDES dataset only has training and validation sets. More details are provided in the supplementary material.

4.1.2 Evaluation Protocols

For all experiments, we mainly employ the popular Rank-K metrics (K=1,5,10) to measure the retrieval performance. In addition to Rank-K, we also adopt the mean Average Precision (mAP) and mean Inverse Negative Penalty (mINP) as auxiliary retrieval metrics to further evaluate performance following [24].

4.1.3 Implementation Details

As mentioned earlier, we adopt the pre-trained CLIP model [40] as our modality-specific encoders. For fairness, we use the same CLIP-ViT-B/16 version as IRRA [24] to conduct experiments. During training, we introduce data augmentations to increase the diversity of the training data. Specifically, we utilize random horizontal flipping, random crop with padding, and random erasing to augment the training images. For training texts, we employ random masking, replacement, and removal of word tokens as the data augmentation. Moreover, the input image size is $384\times 128$ and the maximum length of the input word tokens is set to 77. We employ the Adam optimizer to train our model for 60 epochs with a cosine learning-rate decay strategy. The initial learning rate is $1e{-}5$ for the original CLIP parameters and $1e{-}3$ for the TSE parameters. The batch size is 64. Following IRRA [24], we warm up training with a gradually increasing learning rate. For the hyperparameters, the margin $m$ of TAL is set to 0.1, the temperature $\tau$ is set to 0.015, and the selection ratio $\mathcal{R}$ is 0.3.
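As a rough illustration of this configuration, the following PyTorch sketch sets up the two learning rates and the cosine schedule; the module names are placeholders and the learning-rate warm-up mentioned above is omitted.

```python
import torch
import torch.nn as nn

# Stand-in modules; in practice these would be the CLIP encoders and the TSE head.
model = nn.ModuleDict({
    "clip_backbone": nn.Linear(512, 512),  # pre-trained CLIP parameters (placeholder)
    "tse_head": nn.Linear(512, 512),       # randomly initialized TSE module (placeholder)
})

# Two parameter groups with different learning rates, as described above.
optimizer = torch.optim.Adam([
    {"params": model["clip_backbone"].parameters(), "lr": 1e-5},
    {"params": model["tse_head"].parameters(), "lr": 1e-3},
])

num_epochs = 60
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)

for epoch in range(num_epochs):
    # ... one training epoch over mini-batches of size 64 ...
    scheduler.step()
```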

4.2 Comparison with State-of-the-Art Methods

In this section, we evaluate the performance of our RDE on three benchmarks under different scenarios. For a comprehensive comparison, we compare our method with several state-of-the-art methods, including both ordinary methods and robust methods. Moreover, we use two synthetic noise levels (i.e., noise rates), 20% and 50%, to simulate the real-world scenario where the image-text pairs are not well aligned; we randomly shuffle the text descriptions to inject NCs into the training data. We compare our RDE with five state-of-the-art baselines: SSAN [8], IVT [45], IRRA [24], DECL [36], and CLIP-C. SSAN, IVT, and IRRA are recent ordinary methods that are not designed for NC. DECL is a general framework that can enhance the robustness of image-text matching methods against NC; we use the model of IRRA as the base model of DECL for TIReID. CLIP-C is a strong baseline that fine-tunes the CLIP (ViT-B/16) model with only clean image-text pairs. We report the results of both the best checkpoint on the validation set and the last checkpoint to show the degree of overfitting. Furthermore, we also evaluate our RDE on the original datasets without synthetic NC (Table 1) to demonstrate its superiority, comparing against two local-matching methods, SSAN [8] and CFine [53], and two global-matching methods, IVT [45] and IRRA [24]. More comparisons with other methods are provided in the supplementary material.
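For reproducibility of the synthetic-noise protocol described above, a minimal sketch of caption shuffling at a given noise rate is shown below; the function name and the exact shuffling scheme are assumptions rather than the paper's released script.

```python
import random

def inject_noisy_correspondence(captions, noise_rate=0.2, seed=0):
    """Shuffle the captions of a random fraction of the pairs so that those
    image-text pairs become false positives (synthetic noisy correspondence)."""
    rng = random.Random(seed)
    n = len(captions)
    noisy_idx = rng.sample(range(n), int(noise_rate * n))  # pairs to corrupt
    donors = noisy_idx[:]
    rng.shuffle(donors)
    corrupted = list(captions)
    for dst, src in zip(noisy_idx, donors):
        corrupted[dst] = captions[src]  # a few pairs may keep their caption by chance
    return corrupted, noisy_idx

captions = [f"a person wearing outfit {i}" for i in range(10)]
noisy_captions, noisy_idx = inject_noisy_correspondence(captions, noise_rate=0.5)
```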

From Table 1, one can see that our RDE achieves state-of-the-art performance on three datasets and we can draw three observations: (1) On the datasets with synthetic NC, the ordinary methods suffer from remarkable performance degradation or poor performance as the noise rate increases. In contrast, our RDE achieves the best results on all metrics. Moreover, by comparing the ‘Best’ performance with the ‘Last’ ones in Table 1, we can see that our RDE can effectively prevent the performance deterioration caused by overfitting against NC. (2) Compared with the robust framework DECL and the strong baseline CLIP-C, our RDE also shows obvious advantages, which indicates that our solution against NC is effective and superior in TIReID. For instance, on CUHK-PEDES under 50% noise, our RDE achieves 71.33%, 87.41%, and 91.81% in terms of Rank-1,5,10 on the ‘Best’ rows, respectively, which surpasses the best baseline DECL by a large margin, i.e., +6.11%, +3.69%, and +2.53%, respectively. (3) On the datasets without synthetic NC, our RDE outperforms all baselines by a large margin. Specifically, RDE achieves performance gains of +2.56%, +4.22%, and +5.15% in terms of Rank-1 compared with the best baseline IRRA on three datasets, respectively, demonstrating the effectiveness and advantages of our method.

4.3 Ablation Study

In this section, we conduct ablation studies on the CUHK-PEDES dataset with 50% noise to investigate the effects and contributions of each proposed component of RDE. We compare different combinations of our components in Table 2. From the experimental results, we can draw the following observations: (1) RDE achieves the best performance by using both BGE and TSE for joint inference, which demonstrates that these two modules are complementary and effective. (2) RDE benefits from CCD, which enhances robustness and alleviates the overfitting caused by NC. (3) Our TAL outperforms the widely used Triplet Ranking Loss (TRL) and the SDM loss [24], which demonstrates the superior stability and robustness of our TAL against NC.

No. $S^{b}$ $S^{e}$ CCD Loss R-1 R-5 R-10 mAP mINP
#1 TAL 71.33 87.41 91.81 63.50 47.36
#2 TRL 6.40 16.08 22.14 6.53 2.51
#3 TRL-S 67.38 85.35 90.64 60.04 43.60
#4 SDM 69.33 86.99 91.68 61.99 45.34
#5 TAL 70.70 86.60 91.16 62.67 46.19
#6 TAL 69.07 86.09 91.13 61.69 45.40
#7 TAL 63.11 81.04 87.22 55.42 38.68
Table 2: Ablation studies on the CUHK-PEDES dataset.

4.4 Parametric Analysis

To study the impact of different hyperparameter settings on performance, we perform sensitivity analyses for two key hyperparameters (i.e., $m$ and $\tau$) on the CUHK-PEDES dataset with 50% noise. From Figure 4, we can see that: (1) Too large or too small an $m$ leads to suboptimal performance; we choose $m=0.1$ in all our experiments. (2) Too small a $\tau$ causes training failure, while increasing $\tau$ gradually decreases the separability (hardness) between positive and negative pairs, leading to suboptimal performance; we choose $\tau=0.015$ in all our experiments.

Figure 4: Variation of performance with different $m$ and $\tau$.

4.5 Robustness Study

In this section, we provide some visualization results during cross-modal training to verify the robustness and effectiveness of our method. As shown in Figure 5, one can clearly see that our RDE not only achieves excellent performance under noise but also effectively alleviates noise overfitting.

(a) CUHK-PEDES    (b) ICFG-PEDES
Figure 5: Test performance (Rank-1) versus epochs on the CUHK-PEDES and ICFG-PEDES datasets with 50% noise.

5 Conclusion

In this paper, we reveal and study a novel and challenging problem of noisy correspondence (NC) in TIReID, which violates the common assumption of existing methods that the image-text data are perfectly aligned. To this end, we propose a robust method, i.e., RDE, to effectively handle the revealed NC problem and achieve superior performance. Extensive experiments on three datasets comprehensively demonstrate the superiority and robustness of RDE both with and without synthetic NCs.

Acknowledgments

This work was supported in part by NSFC under Grant U21B2040, 62176171, 62372315, and 62102274, in part by Sichuan Science and Technology Program under Grant 2022YFH0021 and 2023ZYD0143; in part by Chengdu Science and Technology Project under Grant 2023-XT00-00004-GX; in part by the SCU-LuZhou Sciences and Technology Cooperation Program under Grant 2023CDLZ-16; in part by the Fundamental Research Funds for the Central Universities under Grant CJ202303 and YJ202140.


In this supplementary material, we provide additional information for RDE. More specifically, we first give detailed proofs and derivations for the lemmas and gradients in Appendix A. In Appendix B, we detail the used datasets and the compared baselines. In Appendix C, to further verify the robustness of RDE, we report the re-identification performance on the three benchmark datasets under an extremely high noise rate, i.e., 80%. In Appendix D, we provide more comparison results with state-of-the-art methods to comprehensively verify the superiority of our RDE. In Appendix E, we explore the impact of different selection ratios ($\mathcal{R}$) on performance. In Appendix F, we provide a more detailed ablation analysis. In Appendix G, we present numerous real noisy examples from the three public datasets as a case study, thus underscoring our motivation. We also provide a more comprehensive robustness analysis of RDE in Appendix H. Finally, in Appendix I, we provide some qualitative results to illustrate the advantages of our RDE.

Appendix A Proof and Derivation

A.1 Proof of Lemma 1 (Lemma 2)

Given an input image-text pair $(I_{i},T_{i})$ in a mini-batch $\mathbf{x}$, TAL is defined as:

$$\begin{aligned}\mathcal{L}_{tal}(I_{i},T_{i})&=\big{[}m-S^{+}_{i2t}(I_{i})+\tau\log\big(\sum_{j=1}^{K}q_{ij}\exp(S(I_{i},T_{j})/\tau)\big)\big{]}_{+}\\&+\big{[}m-S^{+}_{t2i}(T_{i})+\tau\log\big(\sum_{j=1}^{K}q_{ji}\exp(S(I_{j},T_{i})/\tau)\big)\big{]}_{+},\end{aligned} \tag{11}$$

where $m$ is a positive margin, $\tau$ is a temperature coefficient that controls hardness, $S(I_{i},T_{j})\in\{S_{ij}^{b},S_{ij}^{t}\}$, $[x]_{+}\equiv\max(x,0)$, $\exp(x)\equiv e^{x}$, $q_{ij}=1-l_{ij}$, and $K$ is the size of $\mathbf{x}$. From Lemma 1, as $\tau\to 0$, TAL approaches TRL and focuses more on hard negatives. Since multiple positive pairs from the same identity may appear in the mini-batch, $S^{+}_{i2t}(I_{i})=\sum^{K}_{j=1}\alpha_{ij}S(I_{i},T_{j})$ is the weighted average similarity of the positive pairs of image $I_{i}$, where $\alpha_{ij}=\frac{l_{ij}\exp{(S(I_{i},T_{j})/\tau)}}{\sum^{K}_{k=1}l_{ik}\exp{(S(I_{i},T_{k})/\tau)}}$, and $S^{+}_{t2i}(T_{i})$ is defined analogously for text $T_{i}$.

Lemma 2

TAL is an upper bound of TRL, i.e.,

$$\mathcal{L}_{trl}(I_{i},T_{i})=\big{[}m-S^{+}_{i2t}(I_{i})+S(I_{i},\hat{T}_{i})\big{]}_{+}+\big{[}m-S^{+}_{t2i}(T_{i})+S(\hat{I}_{i},T_{i})\big{]}_{+}\leq\mathcal{L}_{tal}(I_{i},T_{i}), \tag{12}$$

where $\hat{T}_{i}\in\mathbf{T}_{i}=\{T_{j}|l_{ij}=0,\forall j\in\{1,2,\cdots,K\}\}$ is the hardest negative text for image $I_{i}$ and $\hat{I}_{i}\in\mathbf{I}_{i}=\{I_{j}|l_{ji}=0,\forall j\in\{1,2,\cdots,K\}\}$ is the hardest negative image for text $T_{i}$, respectively.

Proof 1

To prove Equation 12, we first take the image-to-text direction as an example. For $S(I_{i},\hat{T}_{i})$ in Equation 12, we have that

$$\begin{aligned}S(I_{i},\hat{T}_{i})&=\max_{T_{j}\in\mathbf{T}_{i}}\left(S(I_{i},T_{j})\right)\\&=\max_{T_{j}\in\mathbf{T}_{i}}\left(\tau\log\exp\left(S(I_{i},T_{j})\right)^{\frac{1}{\tau}}\right)\\&=\tau\log\left(\max_{T_{j}\in\mathbf{T}_{i}}\left(\exp\left(S(I_{i},T_{j})\right)^{\frac{1}{\tau}}\right)\right)\\&\leq\tau\log\left(\sum_{T_{j}\in\mathbf{T}_{i}}\exp(S(I_{i},T_{j})/\tau)\right)\\&\leq\tau\log\left(\sum_{j=1}^{K}q_{ij}\exp(S(I_{i},T_{j})/\tau)\right),\end{aligned} \tag{13}$$

where $q_{ij}=1-l_{ij}$. Based on Equation 13, we have that

$$\big{[}m-S^{+}_{i2t}(I_{i})+\tau\log\big(\sum_{j=1}^{K}q_{ij}\exp(S(I_{i},T_{j})/\tau)\big)\big{]}_{+}\geq\big{[}m-S^{+}_{i2t}(I_{i})+S(I_{i},\hat{T}_{i})\big{]}_{+}. \tag{14}$$

Similarly, in the text-to-image direction, we have that

$$\big{[}m-S^{+}_{t2i}(T_{i})+\tau\log\big(\sum_{j=1}^{K}q_{ji}\exp(S(I_{j},T_{i})/\tau)\big)\big{]}_{+}\geq\big{[}m-S^{+}_{t2i}(T_{i})+S(\hat{I}_{i},T_{i})\big{]}_{+}. \tag{15}$$

Thus, combining Equation 14 and Equation 15, we obtain $\mathcal{L}_{trl}(I_{i},T_{i})\leq\mathcal{L}_{tal}(I_{i},T_{i})$. This completes the proof.

A.2 Derivation for Gradient

In this appendix, we provide more details of the gradient derivations. For ease of representation and analysis, we only consider one direction, as in [30], since image-to-text retrieval and text-to-image retrieval are symmetric. Besides, we suppose that there is only one paired text for each image in the mini-batch. Thus, TRL, TRL-S, and TAL are simplified as follows:

$$\begin{gathered}\mathcal{L}_{trl}(I_{i},T_{i})=\big{[}m-\boldsymbol{v}_{i}^{\top}\boldsymbol{t}_{i}+\boldsymbol{v}_{i}^{\top}\hat{\boldsymbol{t}}_{i}\big{]}_{+},\\\mathcal{L}_{trls}(I_{i},T_{i})=\sum^{K}_{j\neq i}\big{[}m-\boldsymbol{v}_{i}^{\top}\boldsymbol{t}_{i}+\boldsymbol{v}_{i}^{\top}\boldsymbol{t}_{j}\big{]}_{+},\\\mathcal{L}_{tal}(I_{i},T_{i})=\Big{[}m-\boldsymbol{v}_{i}^{\top}\boldsymbol{t}_{i}+\tau\log\big(\sum^{K}_{j\neq i}e^{\boldsymbol{v}_{i}^{\top}\boldsymbol{t}_{j}/\tau}\big)\Big{]}_{+},\end{gathered} \tag{16}$$

where $\hat{\boldsymbol{t}}_{i}$, $\boldsymbol{t}_{j}$, and $\boldsymbol{t}_{i}$ are the hardest negative sample, a negative sample, and the positive sample of the anchor $\boldsymbol{v}_{i}$, respectively. These $\ell_{2}$-normalized features are embedded by the modality-specific models, i.e., $f_{\Theta_{v}}(\cdot)$ and $f_{\Theta_{t}}(\cdot)$. Due to the truncation operation $[x]_{+}$, we only discuss the case of $\mathcal{L}>0$, which generates gradients. For TRL, the gradients with respect to the parameters $\Theta_{v}$ and $\Theta_{t}$ are:

$$\begin{gathered}\frac{\partial\mathcal{L}_{trl}}{\partial\Theta_{v}}=\frac{\partial\mathcal{L}_{trl}}{\partial\boldsymbol{v}_{i}}\frac{\partial\boldsymbol{v}_{i}}{\partial\Theta_{v}},\\\frac{\partial\mathcal{L}_{trl}}{\partial\Theta_{t}}=\frac{\partial\mathcal{L}_{trl}}{\partial\hat{\boldsymbol{t}}_{i}}\frac{\partial\hat{\boldsymbol{t}}_{i}}{\partial\Theta_{t}}+\frac{\partial\mathcal{L}_{trl}}{\partial\boldsymbol{t}_{i}}\frac{\partial\boldsymbol{t}_{i}}{\partial\Theta_{t}}.\end{gathered} \tag{17}$$

Since the learning of normalized features can be viewed as the movement of points on the unit hypersphere, we only consider the loss gradients with respect to $\boldsymbol{v}_{i}$, $\hat{\boldsymbol{t}}_{i}$, and $\boldsymbol{t}_{i}$:

$$\frac{\partial\mathcal{L}_{trl}}{\partial\boldsymbol{v}_{i}}=\hat{\boldsymbol{t}}_{i}-\boldsymbol{t}_{i},\quad\frac{\partial\mathcal{L}_{trl}}{\partial\boldsymbol{t}_{i}}=-\boldsymbol{v}_{i},\quad\frac{\partial\mathcal{L}_{trl}}{\partial\hat{\boldsymbol{t}}_{i}}=\boldsymbol{v}_{i}. \tag{18}$$

For TRL-S, the gradients with respect to the parameters $\Theta_{v}$ and $\Theta_{t}$ are:

$$\begin{gathered}\frac{\partial\mathcal{L}_{trls}}{\partial\Theta_{v}}=\frac{\partial\mathcal{L}_{trls}}{\partial\boldsymbol{v}_{i}}\frac{\partial\boldsymbol{v}_{i}}{\partial\Theta_{v}},\\\frac{\partial\mathcal{L}_{trls}}{\partial\Theta_{t}}=\sum_{j\in\mathcal{Z}}\frac{\partial\mathcal{L}_{trls}}{\partial\boldsymbol{t}_{j}}\frac{\partial\boldsymbol{t}_{j}}{\partial\Theta_{t}}+\frac{\partial\mathcal{L}_{trls}}{\partial\boldsymbol{t}_{i}}\frac{\partial\boldsymbol{t}_{i}}{\partial\Theta_{t}}.\end{gathered} \tag{19}$$

Thus, for $\boldsymbol{v}_{i}$, $\boldsymbol{t}_{j}$, and $\boldsymbol{t}_{i}$, the gradients are:

$$\begin{gathered}\frac{\partial\mathcal{L}_{trls}}{\partial\boldsymbol{v}_{i}}=\sum_{j\in\mathcal{Z}}(\boldsymbol{t}_{j}-\boldsymbol{t}_{i}),\quad\frac{\partial\mathcal{L}_{trls}}{\partial\boldsymbol{t}_{j}}=\boldsymbol{v}_{i},\forall j\in\mathcal{Z},\\\frac{\partial\mathcal{L}_{trls}}{\partial\boldsymbol{t}_{i}}=-\sum_{j\in\mathcal{Z}}\boldsymbol{v}_{i}=-|\mathcal{Z}|\boldsymbol{v}_{i},\end{gathered} \tag{20}$$

where $\mathcal{Z}=\{z\,|\,\big{[}m-S(I_{i},T_{i})+S(I_{i},T_{z})\big{]}_{+}>0,z\neq i,z\in\{0,\cdots,K\}\}$. For our TAL, the gradients with respect to the parameters $\Theta_{v}$ and $\Theta_{t}$ are:

$$\begin{gathered}\frac{\partial\mathcal{L}_{tal}}{\partial\Theta_{v}}=\frac{\partial\mathcal{L}_{tal}}{\partial\boldsymbol{v}_{i}}\frac{\partial\boldsymbol{v}_{i}}{\partial\Theta_{v}},\\\frac{\partial\mathcal{L}_{tal}}{\partial\Theta_{t}}=\sum_{j\neq i}\frac{\partial\mathcal{L}_{tal}}{\partial\boldsymbol{t}_{j}}\frac{\partial\boldsymbol{t}_{j}}{\partial\Theta_{t}}+\frac{\partial\mathcal{L}_{tal}}{\partial\boldsymbol{t}_{i}}\frac{\partial\boldsymbol{t}_{i}}{\partial\Theta_{t}}.\end{gathered} \tag{21}$$

Thus, the gradients for $\boldsymbol{v}_{i}$, $\boldsymbol{t}_{j}$, and $\boldsymbol{t}_{i}$ are:

$$\begin{gathered}\frac{\partial\mathcal{L}_{tal}}{\partial\boldsymbol{v}_{i}}=\sum^{K}_{j\neq i}\beta_{j}\boldsymbol{t}_{j}-\boldsymbol{t}_{i}=\sum^{K}_{j\neq i}\beta_{j}(\boldsymbol{t}_{j}-\boldsymbol{t}_{i}),\\\frac{\partial\mathcal{L}_{tal}}{\partial\boldsymbol{t}_{i}}=-\boldsymbol{v}_{i},\quad\frac{\partial\mathcal{L}_{tal}}{\partial\boldsymbol{t}_{j}}=\beta_{j}\boldsymbol{v}_{i},\end{gathered} \tag{22}$$

where $\beta_{j}=\frac{\exp(\boldsymbol{v}_{i}^{\top}\boldsymbol{t}_{j}/\tau)}{\sum^{K}_{k\neq i}\exp(\boldsymbol{v}_{i}^{\top}\boldsymbol{t}_{k}/\tau)}$.

CUHK-PEDES ICFG-PEDES RSTPReid
Noise Methods R-1 R-5 R-10 mAP mINP R-1 R-5 R-10 mAP mINP R-1 R-5 R-10 mAP mINP
80% SSAN Best 0.18 0.83 1.45 0.47 0.24 0.28 0.99 1.90 0.27 0.15 0.65 3.25 5.95 1.30 0.70
Last 0.13 0.67 1.46 0.42 0.21 0.18 1.01 1.77 0.25 0.14 0.65 2.95 5.85 1.32 0.68
IVT Best 34.03 55.49 66.16 33.90 23.29 21.10 37.10 45.64 13.68 2.32 15.15 30.00 40.50 14.98 7.79
Last 10.61 23.81 31.38 11.13 5.72 5.64 12.48 17.15 4.00 0.69 4.95 13.55 19.75 6.07 2.85
IRRA Best 38.63 56.69 64.18 34.60 21.84 28.19 44.14 51.27 14.36 1.41 29.65 46.65 54.50 23.77 11.32
Last 9.06 19.69 25.65 8.26 3.18 8.68 18.76 24.50 3.65 0.27 8.15 21.00 29.05 7.28 2.40
CLIP-C Best 57.38 78.05 84.97 51.08 34.83 44.84 65.24 73.27 24.27 3.42 47.80 72.70 81.75 37.50 18.09
Last 57.05 78.09 85.07 51.14 35.05 44.65 65.26 73.45 24.20 3.44 44.60 70.75 80.20 35.67 17.09
DECL Best 47.90 71.57 80.17 44.51 29.86 40.53 61.49 69.84 21.78 2.97 48.15 72.20 80.75 37.31 18.83
Last 46.57 70.19 78.48 42.93 27.91 39.91 61.16 69.51 21.56 2.89 45.85 71.05 81.00 35.34 16.35
RDE Best 64.99 83.15 89.52 57.84 41.07 56.02 74.00 80.62 30.67 4.60 53.40 76.70 85.55 39.71 18.28
Last 64.91 83.20 89.54 57.83 41.07 55.96 74.09 80.61 30.79 4.62 52.35 76.85 84.90 39.92 17.72
Table 3: Performance comparison under 80% noise rate on three benchmarks. “Best” means choosing the best checkpoint on the validation set to test, and “Last” stands for choosing the checkpoint after the last training epoch to conduct inference. R-1,5,10 is an abbreviation for Rank-1,5,10 (%) accuracy. The best and second-best results are in bold and underline, respectively.

Appendix B Dataset and Baseline Description

B.1 Datasets.

To verify the effectiveness and superiority of RDE, we use three widely-used image-text person datasets to conduct experiments. A brief introduction of these datasets is given as follows:

  • CUHK-PEDES [27] is the first large-scale benchmark dedicated to TIReID, which includes 40,206 person images and 80,412 text descriptions for 13,003 unique identities. We follow the official data split to conduct experiments, i.e., 11,003 identities for training, 1,000 identities for validation, and the remaining 1,000 identities for testing.

  • ICFG-PEDES [8] is a widely-used benchmark collected from the MSMT17 dataset [51], which consists of 54,522 images for 4,102 unique persons, and each image has a corresponding textual description. We follow the data split used by most TIReID methods [45, 24], i.e., a training set with 3,102 identities and a validation set with 1,000 identities. Note that we uniformly report the validation performance as the test performance since ICFG-PEDES lacks a dedicated test set.

  • RSTPReid [62] is another benchmark dataset constructed from the MSMT17 dataset [51] for TIReID. RSTPReid contains 20,505 images for 4,101 identities, wherein each person has 5 images and each image is paired with 2 text descriptions. Following the official data split, we use 3,701 identities for training, 200 identities for validation, and the remaining 200 identities for testing.
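For quick reference, the official splits listed above can be summarized as a small configuration; the dictionary layout below is only an illustrative convention, not part of any dataset toolkit.

```python
# Official identity-level splits of the three benchmarks (see above).
SPLITS = {
    "CUHK-PEDES": {"images": 40206, "captions": 80412,
                   "train_ids": 11003, "val_ids": 1000, "test_ids": 1000},
    "ICFG-PEDES": {"images": 54522, "captions": 54522,
                   "train_ids": 3102, "val_ids": 1000, "test_ids": None},  # val reused as test
    "RSTPReid":   {"images": 20505, "captions": 41010,
                   "train_ids": 3701, "val_ids": 200, "test_ids": 200},
}
```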

B.2 Baselines.

To verify the effectiveness and robustness of our method in the NC scenario, we compare it with five baselines whose code is publicly available. We introduce each baseline as follows:

  • SSAN (https://github.com/zifyloo/SSAN) [8] is a local-matching method for TIReID, which mainly benefits from a multi-view non-local network that captures relationships between body parts and noun phrases, thus establishing better local correspondences. Besides, SSAN exploits a compound ranking loss to effectively reduce the intra-class variance of textual features.

  • IVT (https://github.com/TencentYoutuResearch/PersonRetrieval-IVT) [45] is an implicit visual-textual framework that belongs to the global-matching category. To explore fine-grained alignments, IVT utilizes two implicit semantic alignment paradigms, i.e., multi-level alignment (MLA) and bidirectional mask modeling (BMM). MLA aims to see “finer” by exploring local and global alignments with three-level matching, while BMM aims to see “more” by mining additional semantic alignments through random masking of both modalities.

  • IRRA (https://github.com/anosorae/IRRA) [24] is a recent state-of-the-art global-matching method that learns relations between local visual-textual tokens and enhances global alignments without requiring additional prior supervision. IRRA exploits a novel similarity distribution matching loss to minimize the KL divergence between the similarity distributions and the normalized label matching distributions for better performance.

  • CLIP-C is a fairly strong baseline that fine-tunes the original CLIP (https://github.com/openai/CLIP) model with only clean image-text pairs. For a fair comparison, we use the same backbone version as IRRA, i.e., ViT-B/16, and optimize the model with the InfoNCE loss [35]; a minimal sketch of this objective is provided after this list.

  • DECL (https://github.com/QinYang79/DECL) [36] is an effective robust image-text matching framework that utilizes the cross-modal evidential learning paradigm to capture and leverage the uncertainty brought by noise to isolate noisy pairs. Since TIReID can be treated as a sub-task of instance-level image-text matching, DECL can also be used to ease the negative impact of NC in TIReID. In this paper, we adopt the same base model as IRRA [24] for DECL to enable robust TIReID.
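As mentioned for CLIP-C, fine-tuning CLIP with the InfoNCE loss [35] can be sketched as follows; the symmetric formulation and the temperature value are common CLIP practice and serve only as an illustrative assumption rather than the exact training script.

```python
import torch
import torch.nn.functional as F

def info_nce(img_emb, txt_emb, tau=0.07):
    """Symmetric InfoNCE objective used to fine-tune CLIP in CLIP-C.

    img_emb, txt_emb: L2-normalized embeddings of a batch of image-text
    pairs, shape (B, d); the i-th image and i-th text form the positive pair.
    """
    logits = img_emb @ txt_emb.t() / tau                  # (B, B) scaled similarities
    labels = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, labels)            # image-to-text direction
    loss_t2i = F.cross_entropy(logits.t(), labels)        # text-to-image direction
    return 0.5 * (loss_i2t + loss_t2i)
```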

Appendix C The Results under Extreme Noise

To further verify the effectiveness and robustness of our method, we report comparison results under extremely high noise, i.e., an 80% noise rate. From the results in Table 3, one can see that our RDE achieves the best performance and effectively alleviates the performance degradation caused by over-fitting the noise. For example, comparing the ‘Best’ rows, our RDE surpasses the best baselines by +7.56%, +5.95%, and +3.5% in terms of Rank-1 on the three datasets, respectively.
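For reference, synthetic noisy correspondences at a given noise rate are commonly simulated by randomly re-pairing a fraction of the training captions; the sketch below illustrates this protocol under that assumption, and the exact shuffling scheme used in our experiments may differ in detail.

```python
import random

def inject_noise(pairs, noise_rate=0.8, seed=0):
    """Simulate noisy correspondence by shuffling the captions of a randomly
    chosen subset of training image-text pairs.

    pairs: list of (image_path, caption) tuples assumed to be clean.
    Returns a new list in which roughly `noise_rate` of the pairs become
    false positives (an image paired with the caption of another sample).
    """
    rng = random.Random(seed)
    noisy_idx = rng.sample(range(len(pairs)), int(noise_rate * len(pairs)))
    shuffled = noisy_idx[:]
    rng.shuffle(shuffled)                          # permute captions within the subset
    new_pairs = list(pairs)
    for src, dst in zip(noisy_idx, shuffled):
        new_pairs[dst] = (pairs[dst][0], pairs[src][1])
    return new_pairs
```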

Appendix D More Comparisons

In this section, we follow the presentation of IRRA [24] and provide more comparison results on the three benchmarks in Tables 4, 5 and 6. From the results, our RDE achieves the best overall results, exceeding the best baselines by +0.92%, +2.63%, and +0.15% in terms of Rank-1 on the three datasets, respectively.

Methods Ref. Image Enc. Text Enc. R-1 R-5 R-10 mAP mINP
CMPM/C [60] ECCV’18 RN50 LSTM 49.37 - 79.27 - -
TIMAM [41] ICCV’19 RN101 BERT 54.51 77.56 79.27 - -
ViTAA [48] ECCV’20 RN50 LSTM 54.92 75.18 82.90 51.60 -
NAFS [16] arXiv’21 RN50 BERT 59.36 79.13 86.00 54.07 -
DSSL [62] MM’21 RN50 BERT 59.98 80.41 87.56 - -
SSAN [8] arXiv’21 RN50 LSTM 61.37 80.15 86.73 - -
Lapscore [52] ICCV’21 RN50 BERT 63.40 - 87.80 - -
ISANet [54] arXiv’22 RN50 LSTM 63.92 82.15 87.69 - -
LBUL [50] MM’22 RN50 BERT 64.04 82.66 87.22 - -
Han et al. [2021] BMVC’21 CLIP-RN101 CLIP-Xformer 64.08 81.73 88.19 60.08 -
SAF [28] ICASSP’22 ViT-Base BERT 64.13 82.62 88.40 - -
TIPCB [5] Neuro’22 RN50 BERT 64.26 83.19 89.10 - -
CAIBC [49] MM’22 RN50 BERT 64.43 82.87 88.37 - -
AXM-Net [13] MM’22 RN50 BERT 64.44 80.52 86.77 58.73 -
LGUR [42] MM’22 DeiT-Small BERT 65.25 83.12 89.00 - -
IVT [45] ECCVW’22 ViT-Base BERT 65.59 83.11 89.21 - -
CFine [53] TIP’23 CLIP-ViT BERT 69.57 85.93 91.15 - -
IRRA [24] CVPR’23 CLIP-ViT CLIP-Xformer 73.38 89.93 93.71 66.13 50.24
BiLMa [15] ICCVW’23 CLIP-ViT CLIP-Xformer 74.03 89.59 93.62 66.57 -
PBSL [44] ACMMM’23 RN50 BERT 65.32 83.81 89.26 - -
BEAT [33] ACMMM’23 RN101 BERT 65.61 83.45 89.54 - -
LCR2 [55] ACMMM’23 RN50 TextCNN 67.36 84.19 89.62 59.24 -
DCEL [29] ACMMM’23 CLIP-ViT CLIP-Xformer 75.02 90.89 94.52 - -
UniPT [43] ICCV’23 CLIP-ViT CLIP-Xformer 68.50 84.67 - - -
RaSa [2] IJCAI’23 ALBEF ALBEF 76.51 90.29 94.25 69.38 -
RaSa-TCL [2] IJCAI’23 TCL TCL 73.23 89.20 93.32 66.43 -
TBPS [4] arXiv’23 CLIP-ViT CLIP-Xformer 73.54 88.19 92.35 65.38 -
Our RDE - CLIP-ViT CLIP-Xformer 75.94 90.14 94.12 67.56 51.44
Table 4: Performance comparisons on the CUHK-PEDES dataset. The best results are in bold.
Methods R-1 R-5 R-10 mAP mINP
Dual Path [61] 38.99 59.44 68.41 - -
CMPM/C [60] 43.51 65.44 74.26 - -
ViTAA [48] 50.98 68.79 75.78 - -
SSAN [8] 54.23 72.63 79.53 - -
IVT [45] 56.04 73.60 80.22 - -
ISANet [54] 57.73 75.42 81.72 - -
CFine [53] 60.83 76.55 82.42 - -
IRRA [24] 63.46 80.25 85.82 38.06 7.93
BiLMa [15] 63.83 80.15 85.74 38.26 -
PBSL [44] 57.84 75.46 82.15 - -
BEAT [33] 58.25 75.92 81.96 - -
LCR2 [55] 57.93 76.08 82.40 38.21 -
DCEL [29] 64.88 81.34 86.72 - -
UniPT [43] 60.09 76.19 - - -
RaSa [2] 65.28 80.40 85.12 41.29 -
RaSa-TCL* [2] 63.29 79.36 84.36 39.23 -
TBPS [4] 65.05 80.34 85.47 39.83 -
Our RDE 67.68 82.47 87.36 40.06 7.87
Table 5: Performance comparisons on the ICFG-PEDES dataset. The best results are in bold. ‘*’ indicates our reproduced results.
Methods R-1 R-5 R-10 mAP mINP
DSSL [62] 39.05 62.60 73.95 - -
SSAN [8] 43.50 67.80 77.15 - -
LBUL [50] 45.55 68.20 77.85 - -
IVT [45] 46.70 70.00 78.80 - -
CFine [53] 50.55 72.50 81.60 - -
IRRA [24] 60.20 81.30 88.20 47.17 25.28
BiLMa [15] 61.20 81.50 88.80 48.51 -
PBSL [44] 47.80 71.40 79.90 - -
BEAT [33] 48.10 73.10 81.30 - -
LCR2 [55] 54.95 76.65 84.70 40.92 -
DCEL [29] 61.35 83.95 90.45 - -
RaSa [2] 66.90 86.50 91.35 52.31 -
RaSa-TCL* [2] 65.20 84.05 89.85 50.14 -
TBPS [4] 61.95 83.55 88.75 48.26 -
Our RDE 65.35 83.95 89.90 50.88 28.08
Table 6: Performance comparisons on the RSTPReid dataset. The best results are in bold. ‘*’ indicates our reproduced results.

Appendix E Study on the Selection Ratio

Figure 6 shows how performance varies with the selection ratio \mathcal{R}. From the figure, one can see that a too large or too small \mathcal{R} leads to suboptimal performance. We conjecture that a small \mathcal{R} causes too much information loss and poor embedding representations, while a too large \mathcal{R} includes too many meaningless features. For this reason, we recommend setting \mathcal{R} between 0.3 and 0.5; accordingly, \mathcal{R} is set to 0.3 in all our experiments.
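As a hypothetical illustration of how \mathcal{R} acts, the sketch below keeps only the top \mathcal{R}·L local tokens before aggregation; the attention-based scoring criterion and the function name are our assumptions and do not necessarily reflect the exact TSE design.

```python
import torch

def select_tokens(tokens, cls_attn, ratio=0.3):
    """Keep the top ratio*L local tokens before they are pooled into a
    single TSE embedding.

    tokens:   (B, L, d) local token features from the encoder.
    cls_attn: (B, L) attention weights that the global ([CLS]/[EOS])
              token assigns to the local tokens (assumed scoring signal).
    """
    B, L, d = tokens.shape
    k = max(1, int(ratio * L))                             # number of kept tokens
    top_idx = cls_attn.topk(k, dim=1).indices              # (B, k) most attended
    gather_idx = top_idx.unsqueeze(-1).expand(-1, -1, d)   # (B, k, d)
    return tokens.gather(1, gather_idx)                    # selected token subset
```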

Refer to caption
Figure 6: Variation of performance with different \mathcal{R}\in[0,1].
Refer to caption
Figure 7: The examples of noisy correspondence identified by CCD on the CUHK-PEDES dataset.
Refer to caption
Figure 8: The examples of noisy correspondence identified by CCD on the ICFG-PEDES dataset.
Refer to caption
Figure 9: The examples of noisy correspondence identified by CCD on the RSTPReid dataset.

Appendix F Ablation Study

F.1 Ablation analysis for TSE

To verify the design rationality of TSE in our RDE, we conduct dedicated ablation experiments on TSE. The results are reported in Table 7. In the table, TSE† denotes the variant in which the token features encoded by CLIP are directly aggregated to obtain the embedding representations without the embedding transformation. We also show the impact of different pooling strategies on performance. From the results, the standard version of TSE obtains the best performance, i.e., conducting the embedding transformation and using the max-pooling strategy to obtain the TSE representations.
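The three pooling strategies compared in Table 7 can be sketched as follows; interpreting topK-pooling as averaging the K largest activations per feature dimension is our assumption, with K=10 following the caption.

```python
import torch

def aggregate(tokens, mode="max", k=10):
    """Pool local token features (B, L, d) into a single embedding (B, d)."""
    if mode == "avg":                      # average-pooling over all tokens
        return tokens.mean(dim=1)
    if mode == "topk":                     # mean of the k largest activations per dim
        return tokens.topk(k, dim=1).values.mean(dim=1)
    if mode == "max":                      # max-pooling, used by the standard TSE
        return tokens.max(dim=1).values
    raise ValueError(f"unknown pooling mode: {mode}")
```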

Methods Pool R-1 R-5 R-10 mAP mINP
TSE† Avg. 67.22 84.96 90.03 60.22 43.84
TSE† TopK. 67.35 85.36 90.51 60.21 43.54
TSE† Max. 67.46 85.17 90.58 60.11 43.45
TSE Avg. 67.43 85.19 90.50 60.42 43.97
TSE TopK. 68.27 86.03 90.79 60.95 44.37
TSE Max. 71.33 87.41 91.81 63.50 47.36
Table 7: Ablation studies on TSE. ‘TSE†’ denotes directly aggregating the CLIP token features without the embedding transformation. ’Avg.’, ’TopK.’, and ’Max.’ indicate the use of average-pooling, topK-pooling (K=10), and max-pooling strategies, respectively.
Noise No. S^{b} S^{t} CCD Loss R-1 R-5 R-10 mAP mINP
80% #1 TAL 64.99 83.15 89.52 57.84 41.07
#2 TRL 2.18 6.45 10.48 2.65 0.83
#3 TRL-S 51.62 74.53 82.21 46.15 30.12
#4 SDM 58.32 79.03 85.79 51.27 34.00
#5 TAL 63.56 82.59 88.84 56.69 39.71
#6 TAL 61.70 81.61 87.95 55.11 38.34
#7 TAL 41.03 62.62 71.99 37.29 23.54
Table 8: Ablation studies on the CUHK-PEDES dataset.
Refer to caption
(a) CUHK-PEDES
Refer to caption
(b) ICFG-PEDES
Refer to caption
(c) RSTPReid
Figure 10: Test performance (Rank-1) versus epochs on three datasets with 20% noise.
Refer to caption
(a) CUHK-PEDES
Refer to caption
(b) ICFG-PEDES
Refer to caption
(c) RSTPReid
Figure 11: Test performance (Rank-1) versus epochs on three datasets with 50% noise.
Refer to caption
(a) CUHK-PEDES
Refer to caption
(b) ICFG-PEDES
Refer to caption
(c) RSTPReid
Figure 12: Test performance (Rank-1) versus epochs on three datasets with 80% noise.

F.2 Ablation study on High Noise

In this appendix, we provide more ablation studies on the CUHK-PEDES dataset under 80% noise to investigate the effects and contributions of each proposed component in RDE. The experimental results are shown in Table 8. The observations and conclusions are consistent with those in the main text, which further demonstrates the effectiveness of our method.

Appendix G Case Study

In this section, we show a large number of real examples of noisy pairs that CCD identifies in the three public datasets without synthetic NCs, see Figures 7, 8 and 9. Note that, for privacy and security, the face areas of all persons in the images are blurred. From these examples, one can see that noisy correspondences arise for various reasons, e.g., occlusion (e.g., Figure 7(a,b)), lighting (e.g., Figure 8(f)), and inaccurate or noisy text descriptions (e.g., Figure 7(c,f) and Figure 9(a-f)). In short, these noisy pairs genuinely exist in the datasets and break the implicit assumption that all training image-text pairs are correctly and perfectly aligned at the instance level. Thus, we reveal the noisy correspondence problem in TIReID and propose a robust method, i.e., RDE, to specifically address it.

Refer to caption
Figure 13: Comparison of top-10 retrieved results on the CUHK-PEDES dataset between the baseline IRRA (the first row) and our RDE (the second row) for each text query. The matched and mismatched person images are marked with red and blue rectangles, respectively. All face areas of people in images are blurred for privacy and security.

Appendix H Robustness Study

For a comprehensive robustness analysis, we provide more performance curves versus epochs in Figures 10, 11 and 12. As seen in Figure 10, when the noise rate is 20%, each baseline shows a certain degree of robustness, and there is no obvious performance degradation caused by over-fitting noisy pairs. However, as the noise rate increases, the non-robust methods (SSAN, IVT, and IRRA) all show curves that first rise and then fall. This tendency stems from the memorization effect, i.e., DNNs tend to learn simple patterns before fitting noisy samples. Besides, when the noise rate reaches 80%, SSAN fails completely and the other non-robust methods (IVT and IRRA) also suffer a serious performance drop. By contrast, thanks to CCD and TAL, our RDE learns accurate visual-semantic associations from the confident clean training pairs, which effectively prevents over-fitting to noisy pairs and thus achieves robust cross-modal learning. From these figures, our method not only exhibits strong robustness but also achieves excellent re-identification performance.

Appendix I Qualitative Results

To illustrate the advantages of our RDE, some retrieval examples for TIReID are presented in Figure 13. These results are obtained by testing the models trained on the CUHK-PEDES dataset with 20% NCs. From the examples, one can see that our RDE obtains more accurate and reasonable re-identification results. Moreover, in some inaccurate results obtained by IRRA (e.g., results (b) and (d)), we find that the retrieved image often matches only part of the text query, which indicates that the model fails to learn complete alignment knowledge. We believe the reason is that NCs mislead IRRA to focus on some wrong visual-semantic associations. In contrast, our RDE filters out erroneous correspondences to learn reliable and accurate cross-modal knowledge, thus achieving high robustness and better results.

References

  • Arpit et al. [2017] Devansh Arpit, Stanisław Jastrzȩbski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S. Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, and Simon Lacoste-Julien. A closer look at memorization in deep networks. In Proceedings of the 34th International Conference on Machine Learning, pages 233–242. PMLR, 2017.
  • Bai et al. [2023a] Yang Bai, Min Cao, Daming Gao, Ziqiang Cao, Chen Chen, Zhenfeng Fan, Liqiang Nie, and Min Zhang. Rasa: relation and sensitivity aware representation learning for text-based person search. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, pages 555–563, 2023a.
  • Bai et al. [2023b] Yang Bai, Jingyao Wang, Min Cao, Chen Chen, Ziqiang Cao, Liqiang Nie, and Min Zhang. Text-based person search without parallel image-text data. In Proceedings of the 31st ACM International Conference on Multimedia, pages 757–767, 2023b.
  • Cao et al. [2023] Min Cao, Yang Bai, Ziyin Zeng, Mang Ye, and Min Zhang. An empirical study of clip for text-based person search. arXiv preprint arXiv:2308.10045, 2023.
  • Chen et al. [2022] Yuhao Chen, Guoqing Zhang, Yujiang Lu, Zhenxing Wang, and Yuhui Zheng. Tipcb: A simple but effective part-based convolutional baseline for text-based person search. Neurocomputing, 494:171–181, 2022.
  • Devlin et al. [2018] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • Diao et al. [2021] Haiwen Diao, Ying Zhang, Lin Ma, and Huchuan Lu. Similarity reasoning and filtration for image-text matching. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1218–1226, 2021.
  • Ding et al. [2021] Zefeng Ding, Changxing Ding, Zhiyin Shao, and Dacheng Tao. Semantically self-aligned network for text-to-image part-aware person re-identification. arXiv preprint arXiv:2107.12666, 2021.
  • Dong et al. [2021] Jianfeng Dong, Xirong Li, Chaoxi Xu, Xun Yang, Gang Yang, Xun Wang, and Meng Wang. Dual encoding for video retrieval by text. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(8):4065–4080, 2021.
  • Dosovitskiy et al. [2020] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • Eom and Ham [2019] Chanho Eom and Bumsub Ham. Learning disentangled representation for robust person re-identification. Advances in neural information processing systems, 32, 2019.
  • Faghri et al. [2017] Fartash Faghri, David J Fleet, Jamie Ryan Kiros, and Sanja Fidler. Vse++: Improving visual-semantic embeddings with hard negatives. arXiv preprint arXiv:1707.05612, 2017.
  • Farooq et al. [2022] Ammarah Farooq, Muhammad Awais, Josef Kittler, and Syed Safwan Khalid. Axm-net: Implicit cross-modal feature alignment for person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4477–4485, 2022.
  • Feng et al. [2023] Yanglin Feng, Hongyuan Zhu, Dezhong Peng, Xi Peng, and Peng Hu. Rono: Robust discriminative learning with noisy labels for 2d-3d cross-modal retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11610–11619, 2023.
  • Fujii and Tarashima [2023] Takuro Fujii and Shuhei Tarashima. Bilma: Bidirectional local-matching for text-based person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2786–2790, 2023.
  • Gao et al. [2021] Chenyang Gao, Guanyu Cai, Xinyang Jiang, Feng Zheng, Jun Zhang, Yifei Gong, Pai Peng, Xiaowei Guo, and Xing Sun. Contextual non-local alignment over full-scale representation for text-based person search. arXiv preprint arXiv:2101.03036, 2021.
  • Han et al. [2018] Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor Tsang, and Masashi Sugiyama. Co-teaching: Robust training of deep neural networks with extremely noisy labels. Advances in neural information processing systems, 31, 2018.
  • Han et al. [2023] Haochen Han, Kaiyao Miao, Qinghua Zheng, and Minnan Luo. Noisy correspondence learning with meta similarity correction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7517–7526, 2023.
  • Han et al. [2021] Xiao Han, Sen He, Li Zhang, and Tao Xiang. Text-based person search with limited data. arXiv preprint arXiv:2110.10807, 2021.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • Hu et al. [2021] Peng Hu, Xi Peng, Hongyuan Zhu, Liangli Zhen, and Jie Lin. Learning cross-modal retrieval with noisy labels. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5403–5413, 2021.
  • Hu et al. [2023] Peng Hu, Zhenyu Huang, Dezhong Peng, Xu Wang, and Xi Peng. Cross-modal retrieval with partially mismatched pairs. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–15, 2023.
  • Huang et al. [2021] Zhenyu Huang, Guocheng Niu, Xiao Liu, Wenbiao Ding, Xinyan Xiao, Hua Wu, and Xi Peng. Learning with noisy correspondence for cross-modal matching. Advances in Neural Information Processing Systems, 34:29406–29419, 2021.
  • Jiang and Ye [2023] Ding Jiang and Mang Ye. Cross-modal implicit relation reasoning and aligning for text-to-image person retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2787–2797, 2023.
  • Jing et al. [2020] Ya Jing, Chenyang Si, Junbo Wang, Wei Wang, Liang Wang, and Tieniu Tan. Pose-guided multi-granularity attention network for text-based person search. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 11189–11196, 2020.
  • Li et al. [2020] Junnan Li, Richard Socher, and Steven CH Hoi. Dividemix: Learning with noisy labels as semi-supervised learning. arXiv preprint arXiv:2002.07394, 2020.
  • Li et al. [2017] Shuang Li, Tong Xiao, Hongsheng Li, Bolei Zhou, Dayu Yue, and Xiaogang Wang. Person search with natural language description. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1970–1979, 2017.
  • Li et al. [2022] Shiping Li, Min Cao, and Min Zhang. Learning semantic-aligned feature representation for text-based person search. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2724–2728. IEEE, 2022.
  • Li et al. [2023a] Shenshen Li, Xing Xu, Yang Yang, Fumin Shen, Yijun Mo, Yujie Li, and Heng Tao Shen. Dcel: Deep cross-modal evidential learning for text-based person retrieval. In Proceedings of the 31st ACM International Conference on Multimedia, pages 6292–6300, 2023a.
  • Li et al. [2023b] Zheng Li, Caili Guo, Xin Wang, Zerun Feng, and Zhongtian Du. Selectively hard negative mining for alleviating gradient vanishing in image-text matching. arXiv preprint arXiv:2303.00181, 2023b.
  • Lin et al. [2024] Yijie Lin, Jie Zhang, Zhenyu Huang, Jia Liu, Zujie Wen, and Xi Peng. Multi-granularity correspondence learning from long-term noisy videos. arXiv preprint arXiv:2401.16702, 2024.
  • Lu et al. [2022] Yangdi Lu, Yang Bo, and Wenbo He. An ensemble model for combating label noise. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, pages 608–617, 2022.
  • Ma et al. [2023] Yiwei Ma, Xiaoshuai Sun, Jiayi Ji, Guannan Jiang, Weilin Zhuang, and Rongrong Ji. Beat: Bi-directional one-to-many embedding alignment for text-based person retrieval. In Proceedings of the 31st ACM International Conference on Multimedia, pages 4157–4168, 2023.
  • Niu et al. [2020] Kai Niu, Yan Huang, Wanli Ouyang, and Liang Wang. Improving description-based person re-identification by multi-granularity image-text alignments. IEEE Transactions on Image Processing, 29:5542–5556, 2020.
  • Oord et al. [2018] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  • Qin et al. [2022] Yang Qin, Dezhong Peng, Xi Peng, Xu Wang, and Peng Hu. Deep evidential learning with noisy correspondence for cross-modal retrieval. In Proceedings of the 30th ACM International Conference on Multimedia, pages 4948–4956, 2022.
  • Qin et al. [2023a] Yalan Qin, Nan Pu, and Hanzhou Wu. Edmc: Efficient multi-view clustering via cluster and instance space learning. IEEE Transactions on Multimedia, 2023a.
  • Qin et al. [2023b] Yalan Qin, Nan Pu, and Hanzhou Wu. Elastic multi-view subspace clustering with pairwise and high-order correlations. IEEE Transactions on Knowledge and Data Engineering, 2023b.
  • Qin et al. [2023c] Yang Qin, Yuan Sun, Dezhong Peng, Joey Tianyi Zhou, Xi Peng, and Peng Hu. Cross-modal active complementary learning with self-refining correspondence. In Thirty-seventh Conference on Neural Information Processing Systems, 2023c.
  • Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • Sarafianos et al. [2019] Nikolaos Sarafianos, Xiang Xu, and Ioannis A Kakadiaris. Adversarial representation learning for text-to-image matching. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5814–5824, 2019.
  • Shao et al. [2022] Zhiyin Shao, Xinyu Zhang, Meng Fang, Zhifeng Lin, Jian Wang, and Changxing Ding. Learning granularity-unified representations for text-to-image person re-identification. In Proceedings of the 30th ACM International Conference on Multimedia, 2022.
  • Shao et al. [2023] Zhiyin Shao, Xinyu Zhang, Changxing Ding, Jian Wang, and Jingdong Wang. Unified pre-training with pseudo texts for text-to-image person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11174–11184, 2023.
  • Shen et al. [2023] Fei Shen, Xiangbo Shu, Xiaoyu Du, and Jinhui Tang. Pedestrian-specific bipartite-aware similarity learning for text-based person retrieval. In Proceedings of the 31st ACM International Conference on Multimedia, pages 8922–8931, 2023.
  • Shu et al. [2022] Xiujun Shu, Wei Wen, Haoqian Wu, Keyu Chen, Yiran Song, Ruizhi Qiao, Bo Ren, and Xiao Wang. See finer, see more: Implicit modality alignment for text-based person retrieval. In European Conference on Computer Vision, pages 624–641. Springer, 2022.
  • Wang et al. [2021] Chengji Wang, Zhiming Luo, Yaojin Lin, and Shaozi Li. Text-based person search via multi-granularity embedding learning. In IJCAI, pages 1068–1074, 2021.
  • Wang et al. [2015] Zheng Wang, Ruimin Hu, Yi Yu, Chao Liang, and Wenxin Huang. Multi-level fusion for person re-identification with incomplete marks. In Proceedings of the 23rd ACM international conference on Multimedia, pages 1267–1270, 2015.
  • Wang et al. [2020] Zhe Wang, Zhiyuan Fang, Jun Wang, and Yezhou Yang. Vitaa: Visual-textual attributes alignment in person search by natural language. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XII 16, pages 402–420. Springer, 2020.
  • Wang et al. [2022a] Zijie Wang, Aichun Zhu, Jingyi Xue, Xili Wan, Chao Liu, Tian Wang, and Yifeng Li. Caibc: Capturing all-round information beyond color for text-based person retrieval. In Proceedings of the 30th ACM International Conference on Multimedia, pages 5314–5322, 2022a.
  • Wang et al. [2022b] Zijie Wang, Aichun Zhu, Jingyi Xue, Xili Wan, Chao Liu, Tian Wang, and Yifeng Li. Look before you leap: Improving text-based person retrieval by learning a consistent cross-modal common manifold. In Proceedings of the 30th ACM International Conference on Multimedia, pages 1984–1992, 2022b.
  • Wei et al. [2018] Longhui Wei, Shiliang Zhang, Wen Gao, and Qi Tian. Person transfer gan to bridge domain gap for person re-identification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 79–88, 2018.
  • Wu et al. [2021] Yushuang Wu, Zizheng Yan, Xiaoguang Han, Guanbin Li, Changqing Zou, and Shuguang Cui. Lapscore: language-guided person search via color reasoning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1624–1633, 2021.
  • Yan et al. [2022a] Shuanglin Yan, Neng Dong, Liyan Zhang, and Jinhui Tang. Clip-driven fine-grained text-image person re-identification. arXiv preprint arXiv:2210.10276, 2022a.
  • Yan et al. [2022b] Shuanglin Yan, Hao Tang, Liyan Zhang, and Jinhui Tang. Image-specific information suppression and implicit local alignment for text-based person search. arXiv preprint arXiv:2208.14365, 2022b.
  • Yan et al. [2023] Shuanglin Yan, Neng Dong, Jun Liu, Liyan Zhang, and Jinhui Tang. Learning comprehensive representations with richer self for text-to-image person re-identification. In Proceedings of the 31st ACM international conference on multimedia, pages 6202–6211, 2023.
  • Yang et al. [2022a] Mouxing Yang, Zhenyu Huang, Peng Hu, Taihao Li, Jiancheng Lv, and Xi Peng. Learning with twin noisy labels for visible-infrared person re-identification. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14308–14317, 2022a.
  • Yang et al. [2022b] Mouxing Yang, Yunfan Li, Peng Hu, Jinfeng Bai, Jiancheng Lv, and Xi Peng. Robust multi-view clustering with incomplete information. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1):1055–1069, 2022b.
  • Yang et al. [2024] Mouxing Yang, Zhenyu Huang, and Xi Peng. Robust object re-identification with coupled noisy labels. International Journal of Computer Vision, pages 1–19, 2024.
  • Zhang et al. [2023] Huaiwen Zhang, Yang Yang, Fan Qi, Shengsheng Qian, and Changsheng Xu. Robust video-text retrieval via noisy pair calibration. IEEE Transactions on Multimedia, 2023.
  • Zhang and Lu [2018] Ying Zhang and Huchuan Lu. Deep cross-modal projection learning for image-text matching. In Proceedings of the European conference on computer vision (ECCV), pages 686–701, 2018.
  • Zheng et al. [2020] Zhedong Zheng, Liang Zheng, Michael Garrett, Yi Yang, Mingliang Xu, and Yi-Dong Shen. Dual-path convolutional image-text embeddings with instance loss. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 16(2):1–23, 2020.
  • Zhu et al. [2021] Aichun Zhu, Zijie Wang, Yifeng Li, Xili Wan, Jing Jin, Tian Wang, Fangqiang Hu, and Gang Hua. Dssl: Deep surroundings-person separation learning for text-based person retrieval. In Proceedings of the 29th ACM International Conference on Multimedia, pages 209–217, 2021.
  • Zhu et al. [2022] Haowei Zhu, Wenjing Ke, Dong Li, Ji Liu, Lu Tian, and Yi Shan. Dual cross-attention learning for fine-grained visual categorization and object re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4692–4702, 2022.