
LATex: Leveraging Attribute-based Text Knowledge for Aerial-Ground Person Re-Identification

Firstname1 Lastname1    Firstname2 Lastname2    Firstname3 Lastname3    Firstname4 Lastname4    Firstname5 Lastname5    Firstname6 Lastname6    Firstname7 Lastname7    Firstname8 Lastname8
Abstract

Aerial-Ground person Re-IDentification (AG-ReID) aims to retrieve specific persons across heterogeneous cameras in different views. Previous methods usually adopt large-scale models, focusing on view-invariant features. However, they overlook the semantic information in person attributes. Additionally, existing training strategies often rely on fully fine-tuning large-scale models, which significantly increases training costs. To address these issues, we propose a novel framework named LATex for AG-ReID, which adopts prompt-tuning strategies to Leverage Attribute-based Text knowledge. More specifically, we first introduce the Contrastive Language-Image Pre-training (CLIP) model as the backbone, and propose an Attribute-aware Image Encoder (AIE) to extract global semantic features and attribute-aware features. Then, with these features, we propose a Prompted Attribute Classifier Group (PACG) to generate person attribute predictions and obtain the encoded representations of predicted attributes. Finally, we design a Coupled Prompt Template (CPT) to transform attribute tokens and view information into structured sentences. These sentences are processed by the text encoder of CLIP to generate more discriminative features. As a result, our framework can fully leverage attribute-based text knowledge to improve AG-ReID. Extensive experiments on three AG-ReID benchmarks demonstrate the effectiveness of our proposed LATex. The source code will be available.

Machine Learning, ICML

1 Introduction

Person Re-IDentification (ReID) aims to retrieve the same individual across different cameras (Jiang et al., 2024; Zheng et al., 2016). In recent years, ReID has gained significant attention (Lu et al., 2023; Shi et al., 2023, 2024; Yin et al., 2024; Gao et al., 2024; Liu et al., 2023; Zhang et al., 2021) due to its wide range of applications, including video surveillance and public security. More recently, ReID across heterogeneous camera views, especially Aerial-Ground person ReID (AG-ReID), has become a more realistic scenario (Nguyen et al., 2023a, b; Zhang et al., 2024b) due to the development of drones and advancements in aerial surveillance. Different from traditional ReID tasks (Crawford et al., 2023; Leng et al., 2019; Li et al., 2017), AG-ReID amplifies the challenges posed by viewpoint variations due to drastic changes between different cameras. These variations significantly affect the distribution of body parts (Schumann & Metzler, 2017), making it more difficult to learn robust features that remain consistent across diverse perspectives.

Figure 1: An example of a person captured by an aerial camera (a) and a ground camera (c), along with the corresponding attributes (b). Despite significant visual variations in the images caused by viewpoint changes, person attributes remain consistent.

Previous methods (Nguyen et al., 2023a; Zhang et al., 2024b) focus on mitigating the negative effects of drastic perspective changes and learning viewpoint-robust image features. However, they often overlook the potential of leveraging person attribute information. As shown in Fig. 1(a) and Fig. 1(c), different camera viewpoints result in significant visual differences. Despite these viewpoint changes, person attributes such as ethnicity, gender and clothing remain unaffected. This stability provides consistent information for obtaining robust cross-view features. Meanwhile, existing methods (Nguyen et al., 2023a; Zhang et al., 2024b) rely on the full fine-tuning strategy, which significantly increases the training cost. Fortunately, prompt tuning (Jia et al., 2022; Han et al., 2024; Lester et al., 2021; Liu et al., 2021a) offers a way to reduce the training cost while effectively integrating the pre-trained knowledge of large-scale models like CLIP (Radford et al., 2021) into specific domains (Li et al., 2023; Wang et al., 2024a). This approach is more efficient since it reduces the number of trainable parameters, while still leveraging the pre-trained model’s powerful knowledge. Motivated by these observations, we propose a novel framework named LATex for AG-ReID, which leverages attribute-based text knowledge to enhance feature discrimination via prompt-tuning strategies.

Specifically, our framework consists of three key components: Attribute-aware Image Encoder (AIE), Prompted Attribute Classifier Group (PACG) and Coupled Prompt Template (CPT). First, we introduce AIE to fine-tune CLIP with learnable prompts, transferring the powerful pre-trained knowledge to AG-ReID. In addition, AIE incorporates attribute tokens to enable fine-grained perception of person attribute information. Then, PACG is employed to further enhance AIE’s attribute perception capabilities and to generate attribute-based text tokens. Afterwards, CPT is proposed to integrate attribute-based tokens and view-informed tokens into structured sentences. Finally, these sentences are processed by CLIP’s text encoder, enabling more accurate person retrieval across different camera views by explicitly leveraging person attributes. Extensive experiments on three AG-ReID benchmarks fully validate the effectiveness and efficiency of our proposed framework.

In summary, our contributions are as follows:

  • New insight. We observe the distinct benefits of person attribute information for AG-ReID tasks. This insight prompts us to consider the problem from an attribute-based perspective. Furthermore, we introduce a practical new method to mitigate the challenges in target retrieval posed by drastic viewpoint changes.

  • Novel framework. We present LATex, a novel feature learning framework for AG-ReID that leverages attribute-based text knowledge with prompt-tuning. It not only reduces the training cost, but also extracts more discriminative features.

  • Effective modules. We propose two effective modules, i.e., PACG and CPT. PACG can effectively predict person attributes and generate embedded tokens of predicted attributes. CPT integrates text knowledge by transforming attribute tokens and view information into structured sentences.

  • Exhaustive validations. Extensive experiments on three AG-ReID benchmarks fully validate the effectiveness of our proposed methods.

2 Related Works

Person ReID is a longstanding task in computer vision and machine learning, gaining significant attention due to its wide range of real-world applications. Previous research has primarily focused on view-homogeneous scenarios, where all cameras in the network are assumed to operate under the same viewpoint. In this setting, notable advancements have been achieved, which can be broadly categorized into CNN-based methods, such as PCB (Sun et al., 2021), and Transformer-based methods, such as TransReID (He et al., 2021). For multi-modal ReID, more advanced approaches include EDITOR (Zhang et al., 2024a), DeMo (Wang et al., 2024d), MambaPro (Wang et al., 2024b) and TOP-ReID (Wang et al., 2024c). However, these methods rely on well-deployed camera networks and are significantly limited in sparsely covered areas due to the lack of candidate images.

Fortunately, advancements in Unmanned Aerial Vehicle (UAV) technology have made it feasible to deploy cameras on UAVs, enhancing surveillance coverage in regions with sparse ground camera networks. However, the substantial viewpoint variations between UAV cameras and fixed ground cameras pose significant challenges, as directly transferring previous view-homogeneous methods often leads to suboptimal performance. To address this issue, Nguyen et al. (Nguyen et al., 2023b, 2024) propose using person attributes as soft labels for supervision, while VDT (Zhang et al., 2024b) attempts to separate identity-related features from viewpoint-specific features by employing view tokens and an orthogonal loss. Despite being effective, these methods ignore the benefits of explicitly using attributes for cross-view retrieval. In this work, we propose LATex, a framework that leverages attribute information to achieve outstanding performance with significantly fewer trainable parameters.

Figure 2: The illustration of our demo and results. Under ideal conditions, with the assistance of person attributes, the model achieves nearly perfect scores on validation metrics.

Figure 3: The illustration of the proposed LATex framework. The Attribute-aware Image Encoder (AIE) first extracts global semantic features and attribute-aware features. Then, the Prompted Attribute Classifier Group (PACG) generates person attribute predictions and obtains embedded tokens of predicted attributes. Afterwards, the Coupled Prompt Template (CPT) transforms attribute tokens and view information into structured sentences. Finally, the structured sentences are processed by the text encoder of CLIP to generate discriminative features for person retrieval.

3 Preliminaries

3.1 Problem Definition

We focus on the task where each person is captured from different camera platforms, such as CCTV or UAV. Our goal is to enable the model to correctly match the query image with the gallery image. Formally, we define the problem as follows: Given a training dataset $\mathcal{C}=(\mathcal{C}^{Img},\mathcal{C}^{ID},\mathcal{C}^{View})$, where $\mathcal{C}^{Img}=\{\mathcal{C}^{Img}_{1},\cdots,\mathcal{C}^{Img}_{B}\}$ is a set of $B$ images, $\mathcal{C}^{ID}\in\mathbb{R}^{B}$ and $\mathcal{C}^{View}\in\mathbb{R}^{B}$ denote the corresponding IDentity (ID) labels and view labels, respectively. We consider a deep model $\mathcal{M}$ with trainable parameters $\theta$ to extract discriminative person representations from input images:

F_{i}=\mathcal{M}(\mathcal{C}^{Img}_{i},\mathcal{C}^{ID}_{i},\mathcal{C}^{View}_{i};\theta), (1)
F_{j}=\mathcal{M}(\mathcal{C}^{Img}_{j},\mathcal{C}^{ID}_{j},\mathcal{C}^{View}_{j};\theta), (2)

where $F_{i}$ and $F_{j}$ are the representations extracted by the deep model $\mathcal{M}$, respectively. Here, the input data consists of any two instances, i.e., $(\mathcal{C}^{Img}_{i},\mathcal{C}^{ID}_{i},\mathcal{C}^{View}_{i})$ and $(\mathcal{C}^{Img}_{j},\mathcal{C}^{ID}_{j},\mathcal{C}^{View}_{j})$. Our training objective is to optimize the trainable parameters $\theta$ such that during the inference phase, the following condition holds:

\begin{cases}\mathcal{D}_{pos}=\mathcal{D}(F_{i},F_{j})&\text{if }\mathcal{C}^{ID}_{i}=\mathcal{C}^{ID}_{j},\\ \mathcal{D}_{neg}=\mathcal{D}(F_{i},F_{j})&\text{if }\mathcal{C}^{ID}_{i}\neq\mathcal{C}^{ID}_{j},\\ \mathcal{D}_{pos}\ll\mathcal{D}_{neg},\end{cases} (3)

where $\mathcal{D}(\cdot)$ denotes a certain distance metric for person representations. Given a query person, we use this distance to measure, rank, and subsequently match the gallery image corresponding to the same person identity.
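To make the retrieval condition in Eq. (3) concrete, the sketch below ranks a gallery by a distance metric $\mathcal{D}(\cdot)$; the use of cosine distance and the feature dimensions are illustrative assumptions, not a prescription from the paper.

```python
# Minimal retrieval sketch for Eq. (3): rank gallery features by distance to a query.
import torch
import torch.nn.functional as F

def rank_gallery(query_feat: torch.Tensor, gallery_feats: torch.Tensor) -> torch.Tensor:
    """query_feat: (C,) representation F_i of the query person;
    gallery_feats: (M, C) representations F_j of M gallery images.
    Returns gallery indices sorted from smallest to largest distance."""
    q = F.normalize(query_feat.unsqueeze(0), dim=-1)   # (1, C)
    g = F.normalize(gallery_feats, dim=-1)             # (M, C)
    dist = 1.0 - (q @ g.t()).squeeze(0)                # cosine distance, (M,)
    return torch.argsort(dist)                         # ascending: best match first

# Usage: indices = rank_gallery(torch.randn(512), torch.randn(100, 512))
```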

3.2 Motivation and Ideal Demo

Motivation. Previous methods have shown strong performance for person ReID tasks with the same viewpoints, but struggle in scenarios with significant viewpoint variations. Based on our daily experiences and observations of the dataset (Fig. 1), we notice that person attributes remain robust in such complex scenarios, providing valuable additional information for target person retrieval. In light of this, we aim to fully exploit this advantage to alleviate the challenges posed by viewpoint changes in ReID tasks.

Ideal Demo. We consider an ideal scenario where the ground truth of all person attributes is known in both the training and testing sets. Under this premise, we construct a simple demo using prompt-tuned CLIP to validate the feasibility of our motivation. This setup represents the theoretical upper bound of our framework’s performance. Fig. 2 (a) provides an overview of our demo, with additional details available in the appendix. Fig. 2 (b) shows the results of our demo on the AG-ReID.v1 dataset, where our method achieves near-perfect scores on both protocols, fully demonstrating the validity of our motivation. Despite the superior performance, we recognize that this experimental setting is overly idealized and impractical in real-world applications. Specifically, our proposed demo relies on the strong assumption that all person attributes are known during inference. However, annotating attributes for the vast amounts of data collected by numerous surveillance cameras is infeasible in practice. To address this, we propose a feature learning framework, which enables the model to automatically generate person attributes during inference. This eliminates the aforementioned unrealistic assumption, significantly enhancing the feasibility of our approach.

4 Proposed Method

As shown in Fig. 3, our LATex consists of three key components: Attribute-aware Image Encoder (AIE), Prompted Attribute Classifier Group (PACG) and Coupled Prompt Template (CPT). The details of them are as follows.

4.1 Attribute-aware Image Encoder

To transfer the rich knowledge of CLIP (Radford et al., 2021) to the ReID task and extract attribute-specific information, we propose the Attribute-aware Image Encoder (AIE). Previous Transformer-based approaches (Nguyen et al., 2023a; Zhang et al., 2024b) typically rely on full fine-tuning, which leads to very high training costs. To address this issue, we adopt prompt-tuning strategies to extract discriminative features. Formally, given an input image $\mathcal{V}\in\mathbb{R}^{H\times W\times 3}$ from either view, we embed $\mathcal{V}$ to obtain the visual embedding $F^{0}=[F_{v}^{0},f_{v}^{0}]$, where $F_{v}^{0}\in\mathbb{R}^{C}$ is the class token and $f_{v}^{0}\in\mathbb{R}^{N\times C}$ are the patch tokens. Here, $C$ is the dimension of the token embeddings and $N$ is the total number of patches. Then, for the $i$-th Transformer layer $\varOmega_{i}$, we denote $P^{i}_{v}\in\mathbb{R}^{T\times C}$ as the learnable prompts, i.e., $P^{i}_{v}=\{P^{i,1}_{v},\cdots,P^{i,T}_{v}\}$, where $T$ is the total number of person attributes. The learnable prompts $P^{i}_{v}$ are concatenated with $F^{i}$, enabling the perception of attribute-specific information, and then passed through $\varOmega_{i}$ as follows:

[F^{i+1},\tilde{P}^{i+1}_{v}]=\varOmega_{i}([F^{i},P^{i}_{v}]). (4)

Here, $[\cdot]$ denotes the concatenation operation along the token dimension. Finally, we extract the attribute prompts $\tilde{P}^{L}_{v}\in\mathbb{R}^{T\times C}$ and the visual class token $F_{v}^{L}\in\mathbb{R}^{C}$ from the final layer of AIE for further processing.
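As a rough illustration of Eq. (4), the sketch below concatenates layer-specific learnable prompts to the token sequence of a generic Transformer stack standing in for CLIP's image encoder, and keeps only the last-layer prompts as $\tilde{P}^{L}_{v}$; the layer type, dimensions and prompt initialization are assumptions for illustration, not the exact AIE implementation.

```python
# Sketch of deep prompt tuning as in Eq. (4): [F^{i+1}, P~^{i+1}] = Omega_i([F^i, P^i]).
import torch
import torch.nn as nn

class AIESketch(nn.Module):
    def __init__(self, dim=768, depth=12, num_attrs=15):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True) for _ in range(depth)]
        )
        # one learnable prompt per attribute (T prompts) for every layer
        self.prompts = nn.Parameter(torch.randn(depth, num_attrs, dim) * 0.02)

    def forward(self, tokens):                            # tokens: (B, 1 + N, C) = [cls, patches]
        T = self.prompts.shape[1]
        for i, layer in enumerate(self.layers):
            p = self.prompts[i].unsqueeze(0).expand(tokens.size(0), -1, -1)
            x = layer(torch.cat([tokens, p], dim=1))      # process [F^i, P_v^i]
            tokens, attr_prompts = x[:, :-T], x[:, -T:]   # split tokens and prompts back
        return tokens[:, 0], attr_prompts                 # F_v^L (B, C) and \tilde{P}_v^L (B, T, C)
```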

4.2 Prompted Attribute Classifier Group

To enhance attribute perception, we propose the Prompted Attribute Classifier Group (PACG), which integrates attribute label information and exploits interdependencies among attributes. More specifically, we define PAC$_t$ as the classifier for the $t$-th attribute. Then, the attribute feature $F_{a}^{t}$ is defined as the concatenation of the global feature $F_{v}^{L}$ and the corresponding attribute prompt $\tilde{P}^{L,t}_{v}\in\mathbb{R}^{C}$, i.e., the $t$-th attribute prompt of $\tilde{P}^{L}_{v}$:

F_{a}^{t}=[F_{v}^{L},\tilde{P}^{L,t}_{v}]. (5)

To learn more discriminative attribute-aware features, we utilize the interdependencies of different attributes. Formally, we denote the other attribute prompts as $\hat{P}^{L}_{v}$, i.e., $\hat{P}^{L}_{v}=\{P_{v}^{L,1},\cdots,P_{v}^{L,t-1},P_{v}^{L,t+1},\cdots,P_{v}^{L,T}\}$. Then, we can obtain interacted features as follows:

Q=W_{q}\hat{P}^{L}_{v},\quad K=W_{k}F_{v}^{L},\quad V=W_{v}F_{v}^{L}, (6)
\Theta(F_{v}^{L},\hat{P}^{L}_{v})=\delta(\frac{QK^{T}}{\sqrt{C}})V, (7)

where $\Theta(\cdot)$ is the multi-head cross attention (Vaswani et al., 2017). $Q\in\mathbb{R}^{C}$, $K\in\mathbb{R}^{C}$ and $V\in\mathbb{R}^{C}$ are generated by the corresponding projection matrices $W_{q}$, $W_{k}$ and $W_{v}$, respectively. $\delta$ is the Softmax function. The residual structure enables the model to process input features with varying emphases more flexibly (He et al., 2016). As observed in daily life, certain person attributes exhibit strong correlations (e.g., gender and clothing), while others show weaker correlations (e.g., height and weight) or are nearly independent (e.g., ethnicity and gender). Thus, we employ a residual-based Feed-Forward Network (FFN) (Vaswani et al., 2017) to handle features obtained from two perspectives: direct prediction and attribute dependency-based prediction. This design allows the FFN to adapt to diverse scenarios by effectively capturing attribute correlations, if they exist, while avoiding excessive noise and additional computational overhead caused by irrelevant attribute pairs:

\mathcal{O}_{t}=\Phi(\Theta(F_{v}^{L},\hat{P}^{L}_{v})+F_{a}^{t}), (8)

where $\Phi(\cdot)$ represents the FFN and $\mathcal{O}_{t}$ is the final feature used to predict attribute confidences. In addition, to align the visual and textual representations in different feature spaces, we transform the visual embedding $F_{a}^{t}$ and obtain attribute-based text tokens as follows:

A_{t}=\Psi(F_{a}^{t}), (9)

where $\Psi$ is a Multi-Layer Perceptron (MLP). The above attribute-based text tokens can be taken as the input of CLIP’s text encoder to improve the feature discrimination.
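The sketch below gathers Eqs. (5)-(9) into a single PAC$_t$ head. Because the exact fusion of the cross-attention output with the concatenated feature $F_{a}^{t}$ is left implicit in the text, the pooling and projection used here to make dimensions agree are our assumptions, as are the hidden sizes and head count.

```python
# One PAC_t head: attribute classification (Eqs. (5)-(8)) plus the text-token MLP (Eq. (9)).
import torch
import torch.nn as nn

class PACSketch(nn.Module):
    def __init__(self, dim=768, num_classes=4, nhead=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, nhead, batch_first=True)     # Theta(.)
        self.proj = nn.Linear(dim, 2 * dim)            # lift pooled context to the size of F_a^t
        self.ffn = nn.Sequential(nn.Linear(2 * dim, 2 * dim), nn.GELU(), nn.Linear(2 * dim, 2 * dim))
        self.cls_head = nn.Linear(2 * dim, num_classes)                            # confidences from O_t
        self.to_text = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, dim))  # Psi

    def forward(self, f_cls, p_t, p_others):
        # f_cls: (B, C) global token F_v^L; p_t: (B, C) t-th prompt; p_others: (B, T-1, C) other prompts
        f_a = torch.cat([f_cls, p_t], dim=-1)                                       # F_a^t, Eq. (5)
        ctx, _ = self.cross_attn(p_others, f_cls.unsqueeze(1), f_cls.unsqueeze(1))  # Eqs. (6)-(7)
        ctx = self.proj(ctx.mean(dim=1))                                            # pool and lift to 2C
        o_t = self.ffn(ctx + f_a) + (ctx + f_a)                                     # residual FFN, Eq. (8)
        return self.cls_head(o_t), self.to_text(f_a)                                # logits and A_t, Eq. (9)
```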

4.3 Coupled Prompt Template

Recently, visual-language models such as CLIP have delivered outstanding performance in many computer vision and natural language processing tasks. As a fundamental component, text templates play an important role in tasks that lack explicit text descriptions, such as person ReID. For example, Li et al. (Li et al., 2023) utilize identity-specific tokens to form a text template, i.e., “A photo of a [learnable tokens] person.” However, this kind of template focuses only on person appearance, ignoring helpful attribute information.

To address the above issue, we propose a Coupled Prompt Template (CPT), which is formulated as “A [view token] view photo of a [share tokens] [attribute tokens] person.” This template not only couples identity-independent and attribute-aware information, but also leverages the comprehensive knowledge of visual-language models. Specifically, the view token $V$ is an instance-level text token, which depends on the view of the image. When the image is captured by a ground camera, the view token is “CCTV”; when the image is from an aerial camera, it is set to “UAV”. In addition, we formulate the share tokens as “$[M_{1},M_{2},\cdots,M_{K}]$”, where $K$ is the total number of such tokens. These instance-shared tokens serve as register tokens (Darcet et al., 2023) to enhance semantic feature representations. As for the attribute tokens, we denote them as “$[A_{1},A_{2},\cdots,A_{T}]$”, which are obtained by PACG and carry rich attribute information. By effectively utilizing these diverse tokens to form the structured sentence $\mathcal{S}$, our framework can fully leverage useful information from the attributes. Finally, we feed $\mathcal{S}$ to the text encoder $\mathcal{T}(\cdot)$ of CLIP to obtain the text feature $F_{d}$:

F_{d}=\mathcal{T}(\mathcal{S}). (10)

To improve the feature representation ability, we concatenate $F_{d}$ and $F_{v}^{L}$ for person retrieval. With the CPT, our LATex can fully leverage the viewpoint invariance of person attributes to enhance the identity-related features.
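Since CPT splices learned embeddings (share tokens and attribute tokens) into a worded template, it has to operate at the embedding level of the text encoder. The sketch below assembles such a sequence; the `text_encoder` callable that accepts token embeddings and the pre-embedded template words are placeholder assumptions about the interface, not CLIP's actual API.

```python
# Sketch of assembling "A [view] view photo of a [share tokens] [attribute tokens] person."
import torch
import torch.nn as nn

class CPTSketch(nn.Module):
    def __init__(self, text_encoder, dim=512, num_share=4):
        super().__init__()
        self.text_encoder = text_encoder                                       # assumed: (B, L, dim) -> F_d
        self.share_tokens = nn.Parameter(torch.randn(num_share, dim) * 0.02)   # [M_1, ..., M_K]

    def forward(self, word_embeds, view_embed, attr_tokens):
        """word_embeds: dict of pre-embedded fixed template words, each (1, L_i, dim);
        view_embed: (B, 1, dim) embedding of "CCTV" or "UAV";
        attr_tokens: (B, T, dim) attribute tokens A_1..A_T from PACG."""
        B = attr_tokens.size(0)
        sentence = torch.cat([
            word_embeds["a"].expand(B, -1, -1),                 # "A"
            view_embed,                                         # "[view token]"
            word_embeds["view_photo_of_a"].expand(B, -1, -1),   # "view photo of a"
            self.share_tokens.unsqueeze(0).expand(B, -1, -1),   # "[share tokens]"
            attr_tokens,                                        # "[attribute tokens]"
            word_embeds["person"].expand(B, -1, -1),            # "person."
        ], dim=1)
        return self.text_encoder(sentence)                      # F_d, Eq. (10)
```

In practice, such embedding-level prompts are typically injected after CLIP's token embedding layer, in the spirit of CLIP-ReID-style prompt learning; the exact injection point here is an assumption.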

4.4 Loss Functions

As illustrated in Fig. 3, we employ multiple loss functions to optimize our framework. We supervise the features obtained by AIE and CPT with the label smoothing cross-entropy loss (Szegedy et al., 2016) and the triplet loss (Hermans et al., 2017):

\mathcal{L}_{ReID}=\lambda_{1}\mathcal{L}_{ID}+\lambda_{2}\mathcal{L}_{Tri}. (11)

For attribute predictions, we apply the label smoothing cross-entropy loss to each $\mathcal{O}_{t}$ obtained through PAC$_t$:

\mathcal{L}_{Attr}^{t}=-\frac{1}{|B|}\sum_{i=1}^{|B|}c_{i}\log(\hat{c}_{i}). (12)

Here, $B$ is the batch size, $c_{i}$ is the ground truth and $\hat{c}_{i}$ denotes the corresponding attribute prediction. Thus, the total loss for our framework can be given by:

\mathcal{L}=\mathcal{L}_{ReID}^{AIE}+\mathcal{L}_{ReID}^{CPT}+\sum_{t=1}^{T}\mathcal{L}_{Attr}^{t}. (13)
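A compact sketch of the overall objective in Eqs. (11)-(13) is given below; the `id_loss` and `triplet_loss` arguments are placeholders for standard label-smoothing cross-entropy and triplet-loss implementations, and the smoothing value 0.1 is an assumption.

```python
# Sketch of the training objective: ReID losses on the AIE and CPT branches plus attribute CE.
import torch.nn.functional as F

def reid_loss(logits, feats, labels, id_loss, triplet_loss, lam1=0.25, lam2=1.0):
    """Eq. (11): weighted ID + triplet loss for one feature branch (AIE or CPT)."""
    return lam1 * id_loss(logits, labels) + lam2 * triplet_loss(feats, labels)

def total_loss(aie_out, cpt_out, attr_logits, attr_labels, labels, id_loss, triplet_loss):
    """Eq. (13): L = L_ReID^AIE + L_ReID^CPT + sum_t L_Attr^t."""
    loss = reid_loss(*aie_out, labels, id_loss, triplet_loss)
    loss = loss + reid_loss(*cpt_out, labels, id_loss, triplet_loss)
    for t, logits_t in enumerate(attr_logits):            # one classifier PAC_t per attribute, Eq. (12)
        loss = loss + F.cross_entropy(logits_t, attr_labels[:, t], label_smoothing=0.1)
    return loss
```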

5 Experiment

5.1 Datasets and Evaluation Protocols

Datasets. We evaluate our methods on three AG-ReID benchmarks. AG-ReID.v1 (Nguyen et al., 2023a) is a challenging dataset, consisting of 21,983 images captured by ground and aerial cameras from 388 identities with 15 attributes. CARGO (Zhang et al., 2024b) is a large-scale synthetic dataset, consisting of 108,563 images captured by five aerial cameras and eight ground cameras from 5,000 identities. Meanwhile, AG-ReID.v1 contains two protocols for evaluation, i.e., A→G and G→A. CARGO contains four evaluation protocols, two of which are view-heterogeneous, while the other two are view-homogeneous. Finally, AG-ReID.v2 (Nguyen et al., 2024) is the extended version of AG-ReID.v1, incorporating three views: aerial (A), ground (G), and wearable device (W). Accordingly, the evaluations are expanded to include cross-view retrieval settings between A and W. Specifically, AG-ReID.v2 consists of 100,502 images from 1,615 unique identities.

Metrics. Following previous works, we employ the mean Average Precision (mAP) (Zheng et al., 2015) and Cumulative Matching Characteristic (CMC) (Moon & Phillips, 2001) at Rank-1 as evaluation metrics.
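For reference, the sketch below computes CMC Rank-1 and mAP from a query-gallery distance matrix; benchmark-specific filtering of same-camera matches is omitted for brevity.

```python
# Minimal CMC Rank-1 and mAP computation from a (Q, G) distance matrix.
import numpy as np

def rank1_and_map(dist, q_ids, g_ids):
    """dist: (Q, G) distances; q_ids: (Q,) query identities; g_ids: (G,) gallery identities."""
    order = np.argsort(dist, axis=1)                 # ascending distance per query
    matches = g_ids[order] == q_ids[:, None]         # (Q, G) boolean hit matrix
    rank1 = matches[:, 0].mean()
    aps = []
    for row in matches:
        hits = np.where(row)[0]                      # ranked positions of correct matches
        if hits.size == 0:
            continue
        precision_at_hit = np.arange(1, hits.size + 1) / (hits + 1)
        aps.append(precision_at_hit.mean())
    return float(rank1), float(np.mean(aps))
```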

5.2 Implementation Details

Our proposed framework is implemented in PyTorch and trained on one NVIDIA A100 GPU. We use the pre-trained CLIP model, specifically CLIP-Base-16, as our backbone. All input images are resized to $256\times 128$. To enhance the generalization ability, we apply multiple augmentation techniques, such as random horizontal flipping, padding and random erasing (Zhong et al., 2020), to all inputs. During training, each mini-batch contains 128 images, consisting of 16 identities with 8 instances per identity. We fine-tune the model with the Adam optimizer and a base learning rate of 0.00035. The total number of training epochs is 120. For the hyper-parameters, we set $\lambda_{1}$ and $\lambda_{2}$ to 0.25 and 1.0, respectively.
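The identity-balanced batching described above (16 identities with 8 instances each, 128 images in total) can be sketched as follows; the mapping from identity to image indices is an assumed dataset interface for illustration.

```python
# Sketch of P x K mini-batch sampling: 16 identities with 8 instances each.
import random

def pk_batch(index_by_id, num_ids=16, num_instances=8):
    """index_by_id: dict mapping person ID -> list of image indices. Returns one batch of indices."""
    ids = random.sample(list(index_by_id.keys()), num_ids)
    batch = []
    for pid in ids:
        pool = index_by_id[pid]
        if len(pool) < num_instances:                      # sample with replacement if too few images
            batch.extend(random.choices(pool, k=num_instances))
        else:
            batch.extend(random.sample(pool, num_instances))
    return batch                                           # 16 * 8 = 128 indices
```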

Table 1: Performance comparison with existing methods on AG-ReID.v1. The superscript symbol † indicates results based on full fine-tuning strategies. The best and second-best results are highlighted in bold and underline.
Method A→G G→A
Rank-1 mAP Rank-1 mAP
OSNet(Zhou et al., 2021) 72.59 58.32 74.22 60.99
BoT(Luo et al., 2019) 70.01 55.47 71.20 58.83
SBS(He et al., 2023) 73.54 59.77 73.70 62.27
VV(Kumar et al., 2019, 2020) 77.22 67.23 79.73 69.83
ViT(Dosovitskiy et al., 2020) 81.28 72.38 82.64 73.35
Explain(Nguyen et al., 2023a) 81.47 72.61 82.85 73.39
VDT(Zhang et al., 2024b) 82.91 74.44 86.59 78.57
LATex 84.41 75.85 88.88 79.19
LATex† 85.26 77.67 89.40 81.15
Table 2: Performance comparison on CARGO under two cross-view protocols. The superscript symbol † indicates results based on full fine-tuning strategies. The best and second-best results are highlighted in bold and underline. Performance is shown for view-heterogeneous protocols in red, and for view-homogeneous in blue.
Method ALL A\leftrightarrowG G\leftrightarrowG A\leftrightarrowA
Rank-1 mAP Rank-1 mAP Rank-1 mAP Rank-1 mAP
SBS(He et al., 2023) 50.32 43.09 31.25 29.00 72.31 62.99 67.50 49.73
PCB(Sun et al., 2021) 51.00 44.50 34.40 30.40 74.10 67.60 55.00 44.60
BoT(Luo et al., 2019) 54.81 46.49 36.25 32.56 77.68 66.47 65.00 49.79
MGN(Wang et al., 2018) 54.81 49.08 31.87 33.47 83.93 71.05 65.00 52.96
VV(Kumar et al., 2019, 2020) 45.83 38.84 31.25 29.00 72.31 62.99 67.50 49.73
AGW(Ye et al., 2021) 60.26 53.44 43.57 40.90 81.25 71.66 67.50 56.48
ViT(Dosovitskiy et al., 2020) 61.54 53.54 43.13 40.11 82.14 71.34 80.00 64.47
VDT(Zhang et al., 2024b) 64.10 55.20 48.12 42.76 82.14 71.59 82.50 66.83
LATex 66.99 58.59 54.37 49.57 84.82 75.30 70.00 57.76
LATex† 76.96 67.09 66.87 58.88 90.18 79.90 80.00 69.06
Table 3: Performance comparison on AG-ReID.v2. The superscript symbol † indicates results based on full fine-tuning strategies. The best and second-best results are highlighted in bold and underline.
Method A→C A→W C→A W→A
Rank-1 mAP Rank-1 mAP Rank-1 mAP Rank-1 mAP
Swin(Liu et al., 2021b) 68.76 57.66 68.49 56.15 68.80 57.70 64.40 53.90
HRNet-18(Wang et al., 2020) 75.21 65.07 76.26 66.17 76.25 66.16 76.25 66.17
SwinV2(Liu et al., 2022) 76.44 66.09 80.08 69.09 77.11 62.14 74.53 65.61
MGN(R50)(Wang et al., 2018) 82.09 70.17 88.14 78.66 84.21 72.41 84.06 73.73
BoT(R50)(Luo et al., 2019) 80.73 71.49 86.06 75.98 79.46 69.67 82.69 72.41
BoT(R50)+Attributes 81.43 72.19 86.66 76.68 80.15 70.37 83.29 73.11
SBS(R50)(He et al., 2023) 81.96 72.04 88.14 78.94 84.10 73.89 84.66 75.01
SBS(R50)+Attributes 82.56 72.74 88.74 79.64 84.80 74.59 85.26 75.71
ViT(Dosovitskiy et al., 2020) 85.40 77.03 89.77 80.48 84.65 75.90 84.27 76.59
V2E(ViT)(Nguyen et al., 2024) 88.77 80.72 93.62 84.85 87.86 78.51 88.61 80.11
LATex 87.18 79.92 90.09 83.50 85.86 79.07 87.52 80.93
LATex† 89.13 83.50 91.35 86.35 89.01 82.85 89.32 83.30
Table 4: Training cost comparison of different methods.
Metric ViT VDT LATex
Trainable Params(M) 86.24 85.90 35.97
Inference Speed(s/batch) 0.35 0.41 0.67
Flops(G) 11.37 11.46 15.33
GPU Memory(G) 0.075 0.078 0.079
Table 5: Ablation results of key modules.
Module A→G G→A
AIE PACG CPT Rank-1 mAP Rank-1 mAP
A \checkmark \times \times 81.69 72.36 83.89 74.78
B \checkmark \checkmark \times 83.19 74.93 87.42 78.62
C \checkmark \checkmark \checkmark 84.41 75.85 88.88 79.18
Table 6: Performance comparison with different features.
Feature A→G G→A
Rank-1 mAP Rank-1 mAP
A $F_{v}^{L}$ 83.19 74.93 87.42 78.62
B $[F_{v}^{L},F_{a}]$ 83.85 75.07 88.05 78.74
C $[F_{v}^{L},F_{d}]$ 84.41 75.85 88.88 79.18
D $F_{v}^{L}$ 83.85 75.37 88.15 78.78

5.3 Performance Comparison

We compare our proposed method with existing methods on three AG-ReID benchmarks in Tab. 1, Tab. 2 and Tab. 3.

LATex. On AG-ReID.v1, our LATex achieves a Rank-1 accuracy of 84.41% and an mAP of 75.85% under the evaluation protocol A→G. Compared with VDT, it delivers 1.50% and 1.41% improvements in Rank-1 and mAP, respectively. As for the evaluation protocol G→A, our LATex achieves 88.88% Rank-1 and 79.19% mAP, showing a similar trend to protocol A→G. The stable improvements under both evaluation protocols clearly demonstrate the importance of further exploring attribute information in AG-ReID. On the two large-scale datasets, CARGO and AG-ReID.v2, our LATex achieves highly competitive performance. It is worth noting that, since CARGO provides no attribute annotations, we remove PACG on this dataset. Our model achieves superior performance with significantly fewer trainable parameters, fully validating its effectiveness.

LATex†. We believe the limited number of trainable parameters constrains the performance of LATex. To enable a fairer comparison with previous methods based on fully fine-tuned backbones, we introduce a LATex variant, LATex†, which adopts the same strategy. Compared with LATex, LATex† achieves significant performance improvements across all benchmarks. For example, on the CARGO dataset under the ALL protocol, LATex surpasses VDT with a Rank-1 accuracy of 66.99%, while LATex† further boosts this metric to 76.96%, an increase of almost 10%. This fully demonstrates the scalability of our methods.

View-homogeneous ReID. As shown in Tab. 2, CARGO provides protocols for person ReID under the same viewpoint, which we use to evaluate LATex’s performance on view-homogeneous tasks. As previously discussed, the insufficient trainable parameters limit the optimization of the feature space, leading to severe under-fitting of LATex under the A\leftrightarrowA protocol. Surprisingly, however, LATex outperforms VDT under the G\leftrightarrowG protocol. LATex†, with the support of additional trainable parameters, achieves the best overall performance among existing methods. Notably, under the G\leftrightarrowG protocol, it is the first method to achieve a Rank-1 accuracy exceeding 90%. These results demonstrate that our method retains strong generalization ability on view-homogeneous tasks.

5.4 Training Cost Comparison

In Tab. 4, we present a comprehensive comparison of costs on AG-ReID.v1. Compared with ViT and VDT, our LATex only utilizes about 40% of the trainable parameters while introducing minimal overhead.

5.5 Ablation Studies

To demonstrate the effectiveness of our proposed modules, we evaluate them on the AG-ReID.v1 dataset. The result is shown in Tab. 5. Furthermore, we conduct comprehensive and systematic evaluations to demonstrate the effectiveness of the proposed LATex framework.

Effect of Different Modules. In Tab. 5, Model A is our baseline. It achieves a Rank-1 of 81.69% and an mAP of 72.36% under the protocol A→G, while attaining a Rank-1 of 83.89% and an mAP of 74.78% under the protocol G→A. With PACG, Model B increases the performance by 3.53% Rank-1 and 3.84% mAP scores under the protocol G→A. Obviously, its performance is already better than previous methods, such as VDT. With CPT, Model C achieves the best result across all evaluation metrics. These results clearly demonstrate the effectiveness of our key modules.

Effect of Leveraging Person Attributes. We further validate the effect of leveraging person attributes. As shown in Tab. 6, Model A and Model B are models without CPT, while Model C and Model D are complete models. The key difference lies in the features used for retrieval. Here, $F_{a}$ is the concatenation of all $F_{a}^{t}$. As can be observed, Model B outperforms Model A, the performances of Model B and Model D are very comparable, and Model C performs best. These results clearly highlight the effectiveness of leveraging attribute-based text knowledge for AG-ReID.

Role of Backbones. We emphasize that the core idea of our LATex is to extract person attributes, embed them as pseudo-texts, and feed them into a text encoder to achieve robust person ReID. Given its strong vision-language alignment, we adopt CLIP as the backbone. Since our approach uses a different backbone from prior methods, we evaluate the effect of different backbones and report the performance on AG-ReID.v1 in Tab. 7. As can be seen, CLIP-based VDT and ViT-based LATex both show a significant performance drop. The main reason is that they cannot fully leverage CLIP’s vision-language knowledge. These results clearly demonstrate that LATex’s performance stems from effectively utilizing CLIP’s vision-language knowledge, rather than from the inherent encoding capability of the backbone.

Table 7: Performance comparison with different backbones.
Methods A→G G→A
Rank-1 mAP Rank-1 mAP
VDT(ViT-based) 82.91 74.44 86.59 75.57
VDT(CLIP-based) 78.78 68.40 77.44 68.68
LATex(ViT-based) 74.93 62.75 73.18 63.84
LATex(CLIP-based) 84.41 75.85 88.88 79.18

Impact of Trainable Prompts. Fig. 4 shows the effect of the number of learnable prompts $P^{i}_{v}$. As the number of trainable prompts (and thus trainable parameters) increases, the performance of LATex remains stable, demonstrating its robustness to this factor.

Figure 4: Performance with different numbers of prompts.

Figure 5: Accuracy of attribute predictions.

Figure 6: Rank list comparison of different models defined in Tab. 5. Correctly retrieved images are marked with a green box, while incorrect ones with a red box.

Figure 7: The retrieval results using attribute features. Query images are marked with a yellow box. The corresponding attribute names and ground truths are displayed in blue and green boxes.

Figure 8: Visualization of feature distributions with t-SNE (Van der Maaten & Hinton, 2008). Different colors refer to different attributes, each comprising several fine-grained subcategories perceptible by the corresponding PAC.

Accuracy of Attribute Predictions. Attribute predictions play an important role in our framework. Fig. 5 shows the accuracy of some typical attributes predicted by PACG. It can be observed that our PACG achieves outstanding performances in these person attributes. These attributes provide discriminative information for person identification.

5.6 Visualization Analysis

Rank list Comparison. Fig. 6 compares the rank lists generated by different models defined in Tab. 5. With the sequential addition of PACG and CPT, the rank lists become progressively more accurate and discriminative. This indicates that our model gradually acquires the ability to perceive person attributes, enabling it to better distinguish individuals with similar visual characteristics.

Attribute Retrieval. Fig. 7 illustrates the retrieval results using the attribute features of the query. Despite challenges such as image blurriness and small pixel regions of key areas, LATex accurately retrieves persons sharing a specific attribute (e.g., Upperwear) with the query’s ground truth. This result demonstrates the exceptional capability of LATex in perceiving person attributes.

Attribute Feature Distributions. Fig. 8 shows the feature distribution of all attribute categories in the test set. Each attribute category consists of several subcategories, which are finely perceived by the corresponding PAC. For example, the “Hair style” category includes various styles such as “Bald”, “Short” and so on. The results clearly demonstrate that our method exhibits strong perception and discrimination capabilities for unseen samples across different attributes.

6 Conclusions

In this paper, we propose a novel feature learning framework named LATex for AG-ReID. It adopts prompt-tuning strategies to leverage the attribute-based text knowledge of visual-language models. To this end, we first propose an AIE to extract global semantic features and attribute-aware features. Then, we propose a PACG to generate person attribute predictions and obtain embedded tokens of predicted attributes. Finally, we design a CPT to transform attribute tokens and view information into structured sentences for more discriminative features. Extensive experiments on three AG-ReID benchmarks demonstrate the superior performance of our methods. However, the experimental results still show a considerable gap compared to the ideal results presented in Fig. 2. We believe this may stem from insufficient accuracy in attribute predictions. In the future, improving person attribute perception with higher precision could further advance research in AG-ReID.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

  • Crawford et al. (2023) Crawford, J., Yin, H., McDermott, L., and Cummings, D. Unicat: Crafting a stronger fusion baseline for multimodal re-identification. arXiv preprint arXiv:2310.18812, 2023.
  • Darcet et al. (2023) Darcet, T., Oquab, M., Mairal, J., and Bojanowski, P. Vision transformers need registers. arXiv preprint arXiv:2309.16588, 2023.
  • Dosovitskiy et al. (2020) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • Gao et al. (2024) Gao, S., Yu, C., Zhang, P., and Lu, H. Part representation learning with teacher-student decoder for occluded person re-identification. In ICASSP, pp.  2660–2664, 2024.
  • Han et al. (2024) Han, Z., Gao, C., Liu, J., Zhang, S. Q., et al. Parameter-efficient fine-tuning for large models: A comprehensive survey. arXiv preprint arXiv:2403.14608, 2024.
  • He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, pp.  770–778, 2016.
  • He et al. (2023) He, L., Liao, X., Liu, W., Liu, X., Cheng, P., and Mei, T. Fastreid: A pytorch toolbox for general instance re-identification. In ACM MM, pp.  9664–9667, 2023.
  • He et al. (2021) He, S., Luo, H., Wang, P., Wang, F., Li, H., and Jiang, W. Transreid: Transformer-based object re-identification. In ICCV, pp.  15013–15022, 2021.
  • Hermans et al. (2017) Hermans, A., Beyer, L., and Leibe, B. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.
  • Jia et al. (2022) Jia, M., Tang, L., Chen, B.-C., Cardie, C., Belongie, S., Hariharan, B., and Lim, S.-N. Visual prompt tuning. In ECCV, pp.  709–727, 2022.
  • Jiang et al. (2024) Jiang, F., Yang, S., Jones, M. W., and Zhang, L. From attributes to natural language: A survey and foresight on text-based person re-identification. arXiv preprint arXiv:2408.00096, 2024.
  • Kumar et al. (2019) Kumar, R., Weill, E., Aghdasi, F., and Sriram, P. Vehicle re-identification: an efficient baseline using triplet embedding. In 2019 International Joint Conference on Neural Networks (IJCNN), 2019.
  • Kumar et al. (2020) Kumar, R., Weill, E., Aghdasi, F., and Sriram, P. A strong and efficient baseline for vehicle re-identification using deep triplet embedding. 2020.
  • Leng et al. (2019) Leng, Q., Ye, M., and Tian, Q. A survey of open-world person re-identification. TCSVT, 30(4):1092–1108, 2019.
  • Lester et al. (2021) Lester, B., Al-Rfou, R., and Constant, N. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021.
  • Li et al. (2017) Li, S., Xiao, T., Li, H., Zhou, B., Yue, D., and Wang, X. Person search with natural language description. In CVPR, pp.  1970–1979, 2017.
  • Li et al. (2023) Li, S., Sun, L., and Li, Q. Clip-reid: exploiting vision-language model for image re-identification without concrete text labels. In AAAI, volume 37, pp.  1405–1413, 2023.
  • Liu et al. (2021a) Liu, X., Ji, K., Fu, Y., Tam, W. L., Du, Z., Yang, Z., and Tang, J. P-tuning v2: Prompt tuning can be comparable to fine-tuning universally across scales and tasks. arXiv preprint arXiv:2110.07602, 2021a.
  • Liu et al. (2023) Liu, X., Zhang, P., and Lu, H. Video-based person re-identification with long short-term representation learning. In ICIG, pp.  55–67, 2023.
  • Liu et al. (2021b) Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, pp.  10012–10022, 2021b.
  • Liu et al. (2022) Liu, Z., Hu, H., Lin, Y., Yao, Z., Xie, Z., Wei, Y., Ning, J., Cao, Y., Zhang, Z., Dong, L., et al. Swin transformer v2: Scaling up capacity and resolution. In CVPR, pp.  12009–12019, 2022.
  • Lu et al. (2023) Lu, H., Zou, X., and Zhang, P. Learning progressive modality-shared transformers for effective visible-infrared person re-identification. In AAAI, volume 37, pp.  1835–1843, 2023.
  • Luo et al. (2019) Luo, H., Gu, Y., Liao, X., Lai, S., and Jiang, W. Bag of tricks and a strong baseline for deep person re-identification. In CVPR, pp.  0–0, 2019.
  • Moon & Phillips (2001) Moon, H. and Phillips, P. J. Computational and performance aspects of pca-based face-recognition algorithms. Perception, 30(3):303–321, 2001.
  • Nguyen et al. (2023a) Nguyen, H., Nguyen, K., Sridharan, S., and Fookes, C. Aerial-ground person re-id. In ICME, pp.  2585–2590, 2023a.
  • Nguyen et al. (2024) Nguyen, H., Nguyen, K., Sridharan, S., and Fookes, C. Ag-reid. v2: Bridging aerial and ground views for person re-identification. TIFS, pp.  2896 – 2908, 2024.
  • Nguyen et al. (2023b) Nguyen, K., Fookes, C., Sridharan, S., Liu, F., Liu, X., Ross, A., Michalski, D., Nguyen, H., Deb, D., Kothari, M., et al. Ag-reid 2023: Aerial-ground person re-identification challenge results. In IJCB, pp.  1–10, 2023b.
  • Radford et al. (2021) Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In ICML, pp.  8748–8763, 2021.
  • Schumann & Metzler (2017) Schumann, A. and Metzler, J. Person re-identification across aerial and ground-based cameras by deep feature fusion. In ATR, volume 10202, pp.  56–67, 2017.
  • Shi et al. (2023) Shi, J., Zhang, Y., Yin, X., Xie, Y., Zhang, Z., Fan, J., Shi, Z., and Qu, Y. Dual pseudo-labels interactive self-training for semi-supervised visible-infrared person re-identification. In ICCV, pp.  11218–11228, 2023.
  • Shi et al. (2024) Shi, J., Yin, X., Chen, Y., Zhang, Y., Zhang, Z., Xie, Y., and Qu, Y. Multi-memory matching for unsupervised visible-infrared person re-identification. arXiv preprint arXiv:2401.06825, 2024.
  • Sun et al. (2021) Sun, Y., Zheng, L., Li, Y., Yang, Y., Tian, Q., and Wang, S. Learning part-based convolutional features for person re-identification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(3):902–917, 2021.
  • Szegedy et al. (2016) Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vision. In CVPR, pp.  2818–2826, 2016.
  • Van der Maaten & Hinton (2008) Van der Maaten, L. and Hinton, G. Visualizing data using t-sne. Journal of machine learning research, 9(11), 2008.
  • Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017.
  • Wang et al. (2018) Wang, G., Yuan, Y., Chen, X., Li, J., and Zhou, X. Learning discriminative features with multiple granularities for person re-identification. In ACM MM, 2018.
  • Wang et al. (2020) Wang, J., Sun, K., Cheng, T., Jiang, B., Deng, C., Zhao, Y., Liu, D., Mu, Y., Tan, M., Wang, X., et al. Deep high-resolution representation learning for visual recognition. TPAMI, 43(10):3349–3364, 2020.
  • Wang et al. (2024a) Wang, X., Liu, F., Li, Z., and Guo, C. Attribute-aware implicit modality alignment for text attribute person search. arXiv preprint arXiv:2406.03721, 2024a.
  • Wang et al. (2024b) Wang, Y., Liu, X., Yan, T., Liu, Y., Zheng, A., Zhang, P., and Lu, H. Mambapro: Multi-modal object re-identification with mamba aggregation and synergistic prompt. arXiv preprint arXiv:2412.10707, 2024b.
  • Wang et al. (2024c) Wang, Y., Liu, X., Zhang, P., Lu, H., Tu, Z., and Lu, H. Top-reid: Multi-spectral object re-identification with token permutation. In AAAI, volume 38, pp.  5758–5766, 2024c.
  • Wang et al. (2024d) Wang, Y., Liu, Y., Zheng, A., and Zhang, P. Decoupled feature-based mixture of experts for multi-modal object re-identification. arXiv preprint arXiv:2412.10650, 2024d.
  • Ye et al. (2021) Ye, M., Shen, J., Lin, G., Xiang, T., Shao, L., and Hoi, S. C. Deep learning for person re-identification: A survey and outlook. TPAMI, 44(6):2872–2893, 2021.
  • Yin et al. (2024) Yin, X., Shi, J., Zhang, Y., Lu, Y., Zhang, Z., Xie, Y., and Qu, Y. Robust pseudo-label learning with neighbor relation for unsupervised visible-infrared person re-identification. arXiv preprint arXiv:2405.05613, 2024.
  • Zhang et al. (2021) Zhang, G., Zhang, P., Qi, J., and Lu, H. Hat: Hierarchical aggregation transformers for person re-identification. In ACM MM, pp.  516–525, 2021.
  • Zhang et al. (2024a) Zhang, P., Wang, Y., Liu, Y., Tu, Z., and Lu, H. Magic tokens: Select diverse tokens for multi-modal object re-identification. In CVPR, pp.  17117–17126, 2024a.
  • Zhang et al. (2024b) Zhang, Q., Wang, L., Patel, V. M., Xie, X., and Lai, J. View-decoupled transformer for person re-identification under aerial-ground camera network. In CVPR, pp.  22000–22009, 2024b.
  • Zheng et al. (2015) Zheng, L., Shen, L., Tian, L., Wang, S., Wang, J., and Tian, Q. Scalable person re-identification: A benchmark. In ICCV, pp.  1116–1124, 2015.
  • Zheng et al. (2016) Zheng, L., Yang, Y., and Hauptmann, A. G. Person re-identification: Past, present and future. arXiv preprint arXiv:1610.02984, 2016.
  • Zhong et al. (2020) Zhong, Z., Zheng, L., Kang, G., Li, S., and Yang, Y. Random erasing data augmentation. In AAAI, volume 34, pp.  13001–13008, 2020.
  • Zhou et al. (2021) Zhou, K., Yang, Y., Cavallaro, A., and Xiang, T. Learning generalisable omni-scale representations for person re-identification. TPAMI, 44(9):5056–5069, 2021.

Appendices

We organize the appendix as follows:

  • Appendix A : Details of the ideal demo shown in Section 3.2 of the main paper.

  • Appendix B : Additional implementation details of the proposed model.

  • Appendix C : Extended experimental results, including numerical results and visualizations.

Appendix A Demo Details

As elaborated in Section 3.2, the primary goal of the demo is to validate the feasibility of utilizing person attribute information as auxiliary data for cross-view person retrieval, as well as to explore the theoretical upper bound under a CLIP-based prompt-tuning framework. The network architecture of the demo is shown in Fig. 9(a). For the input images $\mathcal{C}^{Img}$, we prompt-tune the CLIP visual encoder, which provides the visual features $F_{v}$. For the textual input, we insert the ground truth of person attributes into a predefined template, specifically “A photo of a [Attribute] person,” to construct a contextualized text sequence. This sequence is then passed through the frozen CLIP text encoder to extract the corresponding textual feature $F_{t}$. $F_{v}$ and $F_{t}$ are each supervised using the ReID loss defined in Eq. 11. During inference, the two features are concatenated as $[F_{t},F_{v}]$ to form a unified feature vector for retrieval.

As shown in Fig. 9(b), under the same backbone configuration as our proposed LATex, the demo achieves near-perfect performance on the AG-ReID.v1 dataset. These remarkable idealized experimental results motivate us to explore more practical approaches aligned with real-world scenarios.
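A minimal sketch of the demo's retrieval feature is given below, assuming access to CLIP through the open-source `clip` package (a tooling assumption, not the authors' exact code); ground-truth attributes are rendered into the template, encoded by the frozen text encoder, and concatenated with the image feature.

```python
# Sketch of the ideal demo: [F_t, F_v] built from ground-truth attributes and a CLIP image feature.
import torch
import clip

model, preprocess = clip.load("ViT-B/16", device="cpu")
model.eval()

@torch.no_grad()
def demo_feature(image_tensor, attribute_words):
    """image_tensor: (1, 3, 224, 224) preprocessed image; attribute_words: list of GT attribute strings."""
    f_v = model.encode_image(image_tensor)                                    # visual feature F_v
    text = clip.tokenize(f"A photo of a {' '.join(attribute_words)} person.")
    f_t = model.encode_text(text)                                             # textual feature F_t
    return torch.cat([f_t, f_v], dim=-1)                                      # unified retrieval feature
```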

(a) Overview of ideal demo
(b) Experiment results of ideal demo
Figure 9: Overview and experiment results of ideal demo.

Appendix B More Implementation Details

We provide more detailed information about the experimental implementation in this section:

LATex. In addition to the implementation details described in the main text, we utilize the Adam optimizer with a base learning rate of 0.00035 and a weight decay of 0.1 to optimize our model. A learning rate scheduling strategy is employed, combining a warm-up phase with cosine decay and a scaling factor of 0.01. Finally, in the AIE module, we leverage a larger number of learnable prompts for large-scale datasets to achieve better performance. Specifically, we use 128 prompts for AG-ReID.v2 and 256 for CARGO. Notably, despite the increased number of learnable prompts, only the first $T$ prompts, corresponding to the total number of person attributes, are used as attribute-aware prompts in subsequent training as described in the main text. The remaining prompts are solely utilized for fine-tuning the pre-trained model.

LATex†. LATex† is a variant of LATex that adopts the full fine-tuning strategy to ensure a fair comparison with other AG-ReID methods employing the same training strategy. Specifically, we unfreeze all trainable parameters of the pre-trained CLIP vision and text encoders and update them with a learning rate of 0.000005. Other experimental settings, including the learning rate decay strategy, optimizer, and the number of training epochs, are kept consistent with LATex.
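A rough sketch of this setting is shown below: the pre-trained CLIP encoders are unfrozen and placed in a separate optimizer parameter group with the smaller learning rate, while the newly added modules keep the base learning rate; the module variables are placeholders for illustration.

```python
# Sketch of the LATex† optimizer setup: full fine-tuning with a small backbone learning rate.
import torch

def build_latex_dagger_optimizer(clip_visual, clip_text, new_modules,
                                 base_lr=3.5e-4, backbone_lr=5e-6, weight_decay=0.1):
    backbone_params = list(clip_visual.parameters()) + list(clip_text.parameters())
    for p in backbone_params:
        p.requires_grad_(True)                       # unfreeze the pre-trained encoders
    new_params = [p for m in new_modules for p in m.parameters()]
    return torch.optim.Adam(
        [{"params": new_params, "lr": base_lr},
         {"params": backbone_params, "lr": backbone_lr}],
        weight_decay=weight_decay,
    )
```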

Demo. The demo represents a simplified version of LATex under ideal conditions, where only the backbone network of LATex is retained. All experimental settings are consistent with those of LATex.

Appendix C Extended Experiment

Table 8: Performance comparison evaluated using the mINP metric. The superscript symbol † indicates results based on full fine-tuning strategies. The best and second-best results are highlighted in bold and underline.
Method AG-ReID.v1 CARGO
A→G G→A ALL A\leftrightarrowG G\leftrightarrowG A\leftrightarrowA
ViT(Dosovitskiy et al., 2020) - - 39.62 28.20 57.55 47.07
VDT(Zhang et al., 2024b) 51.06 52.87 41.13 29.95 58.39 50.22
LATex 51.28 50.95 44.24 36.64 62.53 39.28
LATex† 54.27 56.50 54.06 45.96 68.45 54.77
Table 9: Performance evaluated using the mINP metric. The superscript symbol † indicates results based on full fine-tuning strategies.
Method AG-ReID.v2
A→C A→W C→A W→A
LATex 56.16 56.72 49.68 51.52
LATex† 63.43 61.87 55.84 56.74

C.1 More Performance Comparison

Tab. 8 compares the mINP performance of our model under various evaluation protocols on two AG-ReID benchmarks (AG-ReID.v1 and CARGO) with other Transformer-based methods. As Explain (Nguyen et al., 2023a) neither reports this metric in the paper nor releases its code, we omit it from the table. For the same reason, Tab. 9 reports the mINP metric on AG-ReID.v2 exclusively for our proposed LATex and its variant LATex†. Consistent with the trends observed in the mAP and Rank-1 metrics discussed in the main text, LATex with a frozen backbone demonstrates superior aggregate performance. Meanwhile, with trainable backbone parameters, LATex† achieves the best results. Specifically, LATex† outperforms the previous best model, VDT, by over 13% on average across all protocols on the CARGO benchmark.

C.2 More Ablation Studies

Fig. 10 presents a more comprehensive ablation study on the effect of the number of learnable prompts on the AG-ReID.v1 benchmark. Consistent with the conclusions drawn in the main text, although slight fluctuations are observed, this factor is not a primary determinant of LATex’s performance.

(a) A→G
(b) G→A
Figure 10: Performance with different numbers of prompts on AG-ReID.v1.

C.3 More Visualization Analysis

Fig. 11 presents more detailed visualizations of retrieval results based on attribute features from query images. Despite challenges posed by complex visual conditions, LATex demonstrates robust fine-grained attribute perception performance.

Figure 11: Visualizations of retrieval results based on attribute features. Query images are highlighted with yellow boxes, while the corresponding attribute names and ground truth values are displayed in blue and green boxes, respectively.