GCP: Guarded Collaborative Perception with Spatial-Temporal Aware Malicious Agent Detection

Yihang Tao^∗, Senkang Hu^∗, Yue Hu, Haonan An, Hangcheng Cao, and Yuguang Fang Y. Tao, S. Hu, H. An, H. Cao, and Y. Fang are with Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong. Email: {yihang.tommy, senkang.forest, haonanan2-c}@my.cityu.edu.hk, {hangccao, my.fang}@cityu.edu.hk. (* indicates equal contribution.) Y. Hu is with the Department of Robotics, University of Michigan, Ann Arbor, USA. Email: huyu@umich.edu. This work was supported in part by the Hong Kong Innovation and Technology Commission under InnoHK Project CIMDA, by the Hong Kong SAR Government under the Global STEM Professorship, and by the Hong Kong Jockey Club under JC STEM Lab of Smart City (Ref.: 2023-0108).

Abstract

Collaborative perception significantly enhances autonomous driving safety by extending each vehicle’s perception range through message sharing among connected and autonomous vehicles. Unfortunately, it is also vulnerable to adversarial message attacks from malicious agents, resulting in severe performance degradation. While existing defenses employ hypothesis-and-verification frameworks to detect malicious agents based on single-shot outliers, they overlook temporal message correlations, which can be circumvented by subtle yet harmful perturbations in model input and output spaces. This paper reveals a novel blind area confusion (BAC) attack that compromises existing single-shot outlier-based detection methods. As a countermeasure, we propose GCP, a Guarded Collaborative Perception framework based on spatial-temporal aware malicious agent detection, which maintains single-shot spatial consistency through a confidence-scaled spatial concordance loss, while simultaneously examining temporal anomalies by reconstructing historical bird’s eye view motion flows in low-confidence regions. We also employ a joint spatial-temporal Benjamini-Hochberg test to synthesize dual-domain anomaly results for reliable malicious agent detection. Extensive experiments demonstrate GCP’s superior performance under diverse attack scenarios, achieving up to 34.69% improvements in AP@0.5 compared to the state-of-the-art CP defense strategies under BAC attacks, while maintaining consistent 5-8% improvements under other typical attacks. Code will be released at https://github.com/CP-Security/GCP.git.

Index Terms:

Connected and autonomous vehicle (CAV), collaborative perception, malicious agents, spatial-temporal detection.

I Introduction

Refer to caption — Figure 1: Illustration of security challenges and defense mechanisms in CP. While CP systems are vulnerable to adversarial messages from malicious agents, our proposed GCP framework provides comprehensive protection through joint spatial-temporal consistency verification, effectively safeguarding the system against various attack patterns.

Collaborative perception (CP) supersedes single-agent perception by enabling information-sharing among multiple connected and autonomous vehicles (CAVs), substantially enlarging a vehicle’s perception scope and accuracy [1, 2, 3, 4, 5, 6, 7]. The extended perception helps an ego vehicle detect occluded objects that were originally difficult to recognize with single-vehicle perception due to physical occlusions, thereby boosting the safety of autonomous driving [8, 9, 10, 11, 12]. To collaborate, one simple way for an ego CAV to receive helps is to directly request early-stage raw data or late-stage detection results from the neighboring CAVs, and then combine this information with its own data to get the CP results. However, these methods are either bandwidth-consuming or vulnerable to perception noise or malicious message attacks. The recent development of deep learning has facilitated feature-level fusion, where the collaborative CAVs send an intermediate representation of deep neural network models to an ego CAV for aggregation, enhancing the performance-bandwidth trade-off in multi-agent perception.

Although CP has brought many benefits to CP systems, it is inevitably attracting adversarial attacks due to the openness of communication channels. While traditional authentication methods (e.g., message and/or source authentication) can verify message sources and data integrity, they cannot protect against compromised legitimate agents who possess valid credentials to share malicious messages. In an intermediate fusion-based CP system, a malicious collaborator could send an adversarial feature map with intricately crafted perturbation to an ego agent, causing significant CP performance degradation after fusion. This is particularly dangerous because the perturbed CP performance could drop far below the single-agent perception, consequently resulting in catastrophic driving decisions. Previous works have revealed diverse attacks that can fool a CP system [13, 14, 15, 16]. For example, Zhang et al.[16] introduced an online attack by optimizing a perturbation on the attacker’s feature map in each LiDAR cycle and reusing the perturbation over frames, which significantly lowers the performance of CP systems.

To guard CP systems, most existing works adopt outlier-based defense against malicious agents following a multi-round hypothesize-and-verify paradigm. In each round, ego agents first generate multiple hypothetical CP outcomes by assuming portion of the collaborators are benign based on prior knowledge and then verify the consistency between the ego agent’s single perception outcome and CP results. The iteration repeats until all malicious agents are identified. Following this idea, Li et al. [15] developed ROBOSAC, a random sample consensus-based method for detecting malicious agents. Besides, Zhao et al. [16] proposed MADE, a multi-test detection framework utilizing match and reconstruction losses to gauge consensus between the ego CAV and collaborative results. Despite its efficacy, the existing hypothesize-and-verify-based CP defense method merely compares the spatial outlier consistency in a single frame without considering the temporal contexts between consecutive frames. Meanwhile, the static outlier-based detection can be easily bypassed when the attacks are subtle yet dangerous in both model input and output. CP-Guard+ [17] proposed a feature-level defense by directly comparing the divergence of the spatial bird eye’s view (BEV) feature map of the ego agent and collaborators, effectively increasing the system’s scalability. However, all the above works only leverage independent spatial-domain information in a single time slot for malicious agent detection, without considering the temporal correlation of collaborators’ messages in different time slots, resulting in suboptimal defensive CP performance.

We strongly believe incorporating temporal knowledge is crucial for CP defense based on two key insights. First, adversarial attacks in real-world scenarios often exhibit distinct temporal patterns. When malicious agents inject perturbations intermittently to maintain stealth, these attacks manifest as anomalous variations in the temporal domain of CP results. For instance, both ROBOSAC [15] and MADE [16] observe that attackers typically alternate between sending malicious and benign messages across time slots to avoid detection. This temporal characteristic provides an additional verification dimension: beyond examining spatial consistency in the current frame, we can leverage historical clean messages from collaborators as reliable references to identify suspicious temporal deviations. Second, even when attackers continuously inject perturbations, their impact on CP results inevitably creates distinctive temporal patterns that differ from normal CP behaviors. These patterns manifest in various aspects, such as unnatural object motion trajectories, inconsistent detection confidence variations, or abrupt changes in spatial feature distributions across frames. Such temporal anomalies, while potentially subtle in individual frames, become more apparent when analyzed over extended long sequences.

To address the aforementioned challenges, in this paper, we first reveal a novel adversarial attack targeted at CP, namely, the blind area confusion (BAC) attack, which generates subtle and targeted perturbation in an ego CAV’s less confident areas to bypass the single-shot outlier-based detection methods. Besides, to overcome the limitations of previous malicious agent detection methods, we propose GCP, a defensive CP system against adversarial attackers based on knowledge from both the spatial and temporal domains. Specifically, GCP leverages a confidence-scaled spatial concordance loss and Long-short-term-memory autoencoder (LSTM-AE)-based BEV flow reconstruction to check the spatial and temporal consistency jointly. To sum up, our main contributions are three-fold:

•

We reveal a novel attack, dubbed as blind area confusion (BAC) attack, which is targeted at CP systems by generating subtle and dangerous perturbation in an an ego CAV’s less confident areas. The attack can significantly degrade existing state-of-the-art single-shot outlier-based malicious agent detection methods for CP systems.
•

We develop GCP, a novel spatial-temporal aware CP defense framework, under which malicious agents can be jointly detected by utilizing a confidence-scaled spatial concordance loss and an LSTM-AE-based temporal BEV flow reconstruction. To the best of our knowledge, this is the first work to protect CP systems from joint spatial and temporal views.
•

We conduct comprehensive experiments on diverse attack scenarios with V2X-Sim dataset. The results demonstrate that GCP achieves the state-of-the-art performance, with up to 34.69% improvements in AP@0.5 compared to the existing state-of-the-art defensive CP methods under intense BAC attacks, while maintaining 5-8% advantages under other typical adversarial attacks targeting at CP systems.

II Related Work

II-A Collaborative Perception (CP)

To overcome the inherent limitations of single-agent perception systems, particularly their restricted field-of-view (FoV) and environmental occlusions, collaborative perception (CP) has emerged as a promising paradigm that leverages multi-agent data fusion to enhance perception comprehensiveness and accuracy [18, 19]. The evolution of CP systems has witnessed various fusion strategies, from early approaches like raw-data-level fusion [1] and output-level fusion [20], which faced significant challenges in communication overhead and perception quality due to their high bandwidth requirements and information loss, respectively, to more sophisticated intermediate-level feature fusion methods. Notable advances in intermediate fusion include DiscoNet [21], which employs a teacher-student framework to learn an optimized collaboration graph for efficient knowledge transfer, and V2VNet [22], which leverages graph neural networks for efficient neighboring vehicle information aggregation in dynamic traffic scenarios. Where2comm [4] further advances the field by introducing a confidence-aware attention mechanism that simultaneously optimizes communication efficiency and perception performance through selective information sharing. While these developments have improved performance-bandwidth trade-offs, they have exposed critical vulnerabilities in CP systems, particularly regarding robustness against adversarial attacks and system failures, which is the main focus of this paper.

II-B Adversarial CP

The vulnerabilities in CP systems can be broadly categorized into systematic and adversarial challenges, each presenting unique threats to system reliability and safety. On the system front, inherent issues such as communication delays, synchronization problems, and localization errors have been addressed by recent works. CoBEVFlow [23] introduced an asynchrony-robust CP system to compensate for relative motions to align perceptual information across different temporal states, while CoAlign [24] developed a comprehensive framework specifically targeting at unknown pose errors through adaptive feature alignment. However, beyond these system challenges lies a more insidious threat, namely, adversarial vulnerabilities introduced by malicious agents within the collaborative system. These adversaries can compromise system integrity by injecting subtle adversarial noise into shared intermediate representations, potentially causing catastrophic failures in critical scenarios. Initial investigations by Tu et al. [14] demonstrated how untargeted adversarial attacks could compromise detection accuracy in intermediate-fusion CP systems through feature perturbation. Zhang et al. [13] advanced this research by incorporating sophisticated perturbation initialization and feature map masking techniques for more realistic, targeted attacks in real-world settings. Nevertheless, these attack methods lack sophistication in attack region selection and output perturbation constraints, making them potentially detectable by conventional defense mechanisms while highlighting the need for more robust security measures.

II-C Defensive CP

In response to emerging threats, the research community has developed various defensive strategies, primarily focusing on output-level malicious agent detection through hypothesis-and-verification frameworks, each offering unique approaches to system security. ROBOSAC [15] employs an iterative approach with Hungarian matching to achieve perception consensus among collaborators, systematically identifying and excluding potentially malicious agents from the collaboration process. MADE [16] enhances this concept by introducing a comprehensive multi-test framework that leverages both match loss and collaborative reconstruction loss to ensure robust consistency verification among perception results. Zhang et al. [13] further contribute to this effort by utilizing occupancy maps for discrepancy detection, providing an additional layer of security through spatial consistency checking. While these outlier-based detection methods show promise in identifying obvious attacks, they remain vulnerable to sophisticated attacks that produce subtle yet dangerous perturbations in CP outputs, particularly in scenarios involving coordinated malicious activities. Recent advances have begun exploring intermediate feature-level inconsistency detection to improve both efficiency and detection accuracy [17]. However, existing approaches generally overlook the temporal dimension of collaborative messages and assume simplified intrusion models, limiting their effectiveness against advanced persistent threats. To address these limitations, we propose a comprehensive defense framework that considers both spatial and temporal aspects of malicious agent detection, offering a more robust and complete solution to the security challenges in CP systems.

III Attack Methodology

III-A Model of CP

Consider a scenario with $N$ CAVs, where the CAV set is denoted as $\mathcal{N}$ . CAVs exchange collaboration messages with each other, and each CAV can maintain up to $K$ historical messages from its collaborators. For the $i$ -th CAV, we denote its raw observation at time $t_{i}$ as $\mathbf{O}_{i}^{t_{i}}$ , and use $\Phi_{\mathtt{{enc}}}^{i}$ to represent its pre-trained feature encoder that generates intermediate feature map $\mathbf{F}^{t_{i}}_{i}=\Phi_{\mathtt{{enc}}}^{i}(\mathbf{O}_{i}^{t_{i}})$ . The collaboration message transmitted from the $m$ -th CAV to the $i$ -th CAV at time $t_{i}$ is denoted as $\mathbf{F}_{m\rightarrow i}^{t_{i}}$ . Based on the received messages at current timestamp and historical $k\ (0\leq k\leq K)$ timestamps, the $i$ -th CAV uses a feature aggregator $f_{\mathtt{{agg}}}^{i}(\cdot)$ to fuse these feature maps and adopts a feature decoder $\Phi_{\mathtt{{dec}}}^{i}(\cdot)$ to get the final CP output, which is expressed as:

\mathbf{Y}_{i}^{t_{i}}=\Phi_{\mathtt{{dec}}}^{i}\left(f_{\mathtt{{agg}}}^{i}\left(\{\mathbf{F}_{m\rightarrow i}^{t_{i}},\mathbf{F}_{m\rightarrow i}^{t_{i-1}},\cdots,\mathbf{F}_{m\rightarrow i}^{t_{i-k}}\}_{m=1}^{N}\right)\right),

(1)

where $\mathbf{Y}_{i}^{t_{i}}$ is the CP result of the $i$ -th CAV at time $t_{i}$ . There are two important notes in terms of the formulation described in Eq. 1: i) the times $t_{i-1},t_{i-2},\cdots,t_{i-k}$ are not necessary to be equally distributed, and the time intervals between two consecutive times can be irregular. ii) When the length of historical times $k$ equals 0, the $i$ -th CAV outputs the CP results without referring to the temporal contexts of messages of collaborative CAVs, which degrades to the settings used in most existing works like ROBOSAC [15] and MADE [16].

III-B Adversarial Threat Model

The CP framework defined in Section III-A becomes vulnerable when malicious agents exist among the collaborative CAVs. At time $t_{i}$ , we denote the set of malicious agents as $\mathcal{M}_{t_{i}}\subseteq\mathcal{N}$ . We assume that malicious agents can request and transmit collaborative messages to benign agents before being identified, and their adversarial messages follow a specific time distribution $\mathcal{D}$ (detailed in Section V-A). Once the victim CAV recognizes the malicious agents at a certain frame, it will discard the received messages from those malicious agents and cut off the connections with them until the next frame starts. Meanwhile, the malicious agents are assumed to have white-box access [25] to the model parameters since all agents in the CP system must share the same model architecture to enable feature-level cooperation. The malicious agents attack the CP system by transmitting perturbed feature maps to the victim CAV, resulting in its perception degradation. A general attack procedure can be described as follows.

1.

Observation Encoding: Each malicious agent encodes its raw observation at time $t_{j}$ into an initial feature map $\mathbf{F}^{t_{j}}_{j}=\Phi_{\mathtt{{enc}}}^{j}(\mathbf{O}_{j}^{t_{j}})$ , preparing to combine with adversarial perturbations.

Perturbation Generation: In general cases, the malicious agents generate perturbations [26, 27] by multi-step iterations, aiming to maximize the distance between CP results and ground truth (GT) while ensuring the input perturbation amplitude is lower than a certain value (using local prediction if GT is not available), which is represented by:

\displaystyle\operatorname*{arg\,max}_{\delta}\mathcal{L}(\mathbf{Y}_{\delta}^{t_{j}},\mathbf{Y}_{\mathtt{gt}}^{t_{j}}),\quad\mathtt{s.t.}\quad\|\delta\|\leq\Delta_{i},

(2)

where $\delta$ is the perturbation, $\Delta_{i}$ is the maximum allowable input perturbation amplitude, $\mathcal{L}(\cdot)$ is the Loss function, $\mathbf{Y}_{\delta}^{t_{j}}$ and $\mathbf{Y}_{\mathtt{gt}}^{t_{j}}$ represent the perturbed CP results and the ground truth at time $t_{j}$ , respectively. However, single-shot outlier-based detection can easily identify this general perturbation generation. Section III-C introduces a more elaborated attack targeted at CP systems.

3.

Attack Triggering: After multi-step iterations, the malicious agents generate intricately optimized perturbation and send it to the victim agent. Without defense, the victim agent will incorporate this adversarial message into the fusion based on Eq. 1, yielding attacked CP results with low accuracy.

III-C Blind Area Confusion Attack

For previous outlier-based methods like ROBOSAC [15] and MADE [16], the core idea is to check the spatial consistency between ego CAV and CP outcomes. However, they only clip the perturbation at the input level, and the output detection errors are significant and randomly distributed, which can be easily detected by outlier-based defense methods. Intuitively, we want to emphasize that a good attack targeted at the CP system should satisfy two conditions: (i) both the input and output perturbation should not exceed a certain level; (ii) the output perturbation should be more distributed in the regions where the victim vehicle (i.e., agent or CAV) is less sensitive so that the victim vehicle can hardly discern whether the unseen detections could benefit from collaboration or just fake ones. Following the above ideas, we design a novel blind area confusion (BAC) attack as shown in Figure 2. The steps are elaborated below.

Message Request. Following our threat model, a malicious CAV $i$ can participate in the CP system before being identified. During this initial phase, it masquerades as a benign agent to establish communication with the victim’s ego vehicle. Specifically, the malicious CAV requests feature messages $\mathbf{F}_{e\rightarrow i}^{t_{k}}$ from the victim’s ego agent at timestamp $t_{k}$ . These messages contain valuable information about the victim’s perception capabilities and limitations, which will be exploited in subsequent attack stages. Note that in our low frame rate setting (e.g., 10 FPS), the attack generation and feature fusion can be completed within the same frame, eliminating the need for temporal prediction compensation that would be required in high FPS scenarios.

Differential Detection. After obtaining a victim’s messages, the malicious CAV $i$ performs a comparative analysis between independent and CP results to infer the victim’s blind spots. The malicious CAV first generates two types of perception results: single perception $\mathbf{Y}_{s}^{t_{k}}=\Phi_{\mathtt{{dec}}}^{i}\left(\mathbf{F}_{e\rightarrow i}^{t_{k}}\right)$ using only the victim’s transmitted features, and CP $\mathbf{Y}_{c}^{t_{k}}=\Phi_{\mathtt{{dec}}}^{i}\left(f_{\mathtt{{agg}}}^{i}\left(\mathbf{F}_{i\rightarrow i}^{t_{k}},\mathbf{F}_{e\rightarrow i}^{t_{k}}\right)\right)$ using both local and the victim’s features. Let $\mathbf{Y}_{s}^{t_{k}}\cap\mathbf{Y}_{c}^{t_{k}}$ denote the matched bounding boxes between single and CP results. Based on this matching, the system partitions all detections into two categories: victim-only detections $\mathbf{Y}_{vic}^{t_{k}}=\mathbf{Y}_{s}^{t_{k}}$ (marked in purple), and non-victim detections $\mathbf{Y}_{nvic}^{t_{k}}=\mathbf{Y}_{c}^{t_{k}}\setminus(\mathbf{Y}_{s}^{t_{k}}\cap\mathbf{Y}_{c}^{t_{k}})$ (marked in blue). This differential detection process reveals the spatial distribution of the victim’s unique detections, providing crucial insights into its perception strengths and potential blind spots, which form the foundation for subsequent targeted perturbation generation.

Blind Region Segmentation. The differential detection map is processed through an adaptive region growing algorithm to partition the BEV detection map into confident area (CA) and blind area (BA), as shown in Algorithm 1. The algorithm first uses $Cluster()$ to determine the victim grid $e$ by finding the grid point with the minimum total distance to all victim-detected objects, establishing a perception-centric coordinate system. It then initializes seed grids from $\mathbf{Y}_{vic}^{t_{k}}$ as CA seeds (value 1) and $\mathbf{Y}_{nvic}^{t_{k}}$ as BA seeds (value -1). When $\mathbf{Y}_{nvic}^{t_{k}}$ is empty, which occurs in limited perception range scenarios, the grid farthest from $e$ is selected as the BA seed to reflect natural perception degradation with distance. The region growing process utilizes two priority queues ( $\mathbf{Q}_{ca}$ , $\mathbf{Q}_{ba}$ ) to manage confident and blind area expansion. The $GetNeighbors()$ function implements an adaptive neighbor selection mechanism:

K_{s}=\left\lceil K_{base}\cdot\exp\left(-\gamma_{d}\cdot\frac{Dist(s,e)}{D_{norm}}\right)\right\rceil,

(3)

where $Dist(s,e)$ computes the Euclidean distance between grid $s$ and victim grid $e$ , $D_{norm}=\sqrt{H^{2}+W^{2}}$ is the normalization factor based on BEV detection map dimensions (height $H$ and width $W$ ), $K_{base}=6$ establishes a hexagonal-like growth pattern, and $\gamma_{d}=0.3$ controls the decay rate. This distance-adaptive design ensures denser expansion near the victim and sparser expansion in distant regions, naturally modeling spatial perception reliability [28]. The algorithm expands both regions simultaneously through $Queue.Pop()$ (which removes and returns the next grid point from the front of the queue) and $Queue.Append()$ (which adds a new grid point to the end of the queue) operations. The initial queues $\mathbf{Q}_{ca}$ and $\mathbf{Q}_{ba}$ are created using $Queue\{m\}$ , which constructs priority queues containing all grid points $m$ that are labeled as confident area (1) or blind area (-1) respectively. These queue operations enable systematic breadth-first expansion of regions. As grid points are popped from either queue for processing, their unassigned neighbors receive the corresponding label (1 for CA, -1 for BA). They are appended to the appropriate queue for future expansion. The final binary mask $\mathbf{M}_{c}^{t_{k}}$ is created by converting BA labels (-1) to 0, yielding a binary representation where 1 indicates confident areas and 0 represents blind areas. This region-growing process naturally forms boundaries between different perception zones, effectively capturing the spatial characteristics of the victim’s perception capabilities.

Perturbation Optimization. BAC perturbation optimization aims to make the perturbation mostly distributed in the victim-blind area of the BEV detection map while adaptively controlling the perturbation magnitude. This optimization problem can be formulated as:

$\displaystyle\operatorname*{arg\,max}_{\delta_{bsi}}$	$\displaystyle\mathcal{L}(\mathbf{Y}_{\delta}^{t_{k}}\odot\mathbf{W}_{\delta},\mathbf{Y}_{\mathtt{gt}}^{t_{k}}\odot\mathbf{W}_{\delta}),$	(4)
$\displaystyle\mathtt{s.t.}$	$\displaystyle\mathbf{W}_{\delta}=w_{\delta}\mathbf{M}_{\delta}+\mathbf{S}_{\delta},$
	$\displaystyle\mathbf{S}_{\delta}=1-\sigma\left(\|\mathbf{Y}_{\delta}^{t_{k}}-\mathbf{Y}_{\mathtt{gt}}^{t_{k}}\|-\Delta_{o}\right),$
	$\displaystyle\\|\delta_{bsi}\\|\leq\Delta_{i},$

where $\mathcal{L}(\cdot)$ denotes the optimization loss function, $\mathbf{M}_{\delta}=\mathbf{1}-\mathbf{M}_{c}^{t_{k}}\{\mathbf{D}_{i}^{t_{k}}\{\mathbf{Y}_{\delta}^{t_{k}}\}\}$ is the inverted confidence mask, $\sigma(\cdot)$ is the sigmoid activation function, $|\cdot|$ computes element-wise absolute values, $\Delta_{i}$ bounds the input perturbation magnitude, $\Delta_{o}$ bounds the output perturbation magnitude, and $w_{\delta}$ is a positive weighting parameter.

This optimization formulation incorporates several innovative design principles for effective and stealthy attacks. First, instead of a simple loss function, it employs a weighted loss function where $\odot$ denotes element-wise multiplication. The weight $\mathbf{W}_{\delta}$ is carefully designed to guide the spatial distribution of perturbations, ensuring more targeted attacks. Second, the optimization adopts a dual-weight mechanism: the spatial guidance weight $\mathbf{M}_{\delta}$ directs perturbations towards the victim’s blind areas, while the adaptive suppression weight $\mathbf{S}_{\delta}$ automatically reduces weights when output perturbations become too large. The adaptive suppression is implemented through a sigmoid function, which provides smooth transitions when the prediction deviates from the ground truth beyond the threshold $\Delta_{o}$ . This design ensures that the attack remains effective while avoiding generating easily detectable anomalies. Additionally, the input perturbation magnitude is constrained by $\Delta_{i}$ to maintain physical feasibility and attack stealthiness. After the optimization, the malicious CAV $i$ incorporates the optimized perturbation $\delta_{bsi}$ into its intermediate BEV feature before transmission to the victim CAV.

Algorithm 1 Blind Region Segmentation (BRS)

Input:

•

$\mathbf{M}_{c}^{t_{k}}$ , initial confidence mask (all-zero matrix).
•

$\mathbf{Y}_{vic}^{t_{k}}$ , detected bounding boxes by victim vehicle.
•

$\mathbf{Y}_{nvic}^{t_{k}}$ , undetected bounding boxes by victim vehicle.
•

$\mathbf{D}_{i}^{t_{k}}$ , the BEV detection map of CAV $i$ .

Output: $\mathbf{M}_{c}^{t_{k}}$ , binary confidence mask.

1:procedure BRS(

\mathbf{D}_{i}^{t_{k}}

)

e\leftarrow Cluster(\mathbf{Y}_{vic}^{t_{k}})

\triangleright

Victim Grid

3: if

\mathbf{Y}_{nvic}^{t_{k}}=\emptyset

then

\mathbf{Y}_{nvic}^{t_{k}}\leftarrow\mathop{\arg\max}\limits_{g\in\mathbf{D}_{i}^{t_{k}}}Dist(g,e)

5: end if

\mathbf{M}_{c}^{t_{k}}\{\mathbf{D}_{i}^{t_{k}}\{\mathbf{Y}_{vic}^{t_{k}}\}\}\leftarrow 1

\mathbf{M}_{c}^{t_{k}}\{\mathbf{D}_{i}^{t_{k}}\{\mathbf{Y}_{nvic}^{t_{k}}\}\}\leftarrow-1

\mathbf{Q}_{ca},\mathbf{Q}_{ba}\leftarrow Queue\{m\},\forall\ \mathbf{M}_{c}^{t_{k}}\{m\}=1\ \text{or}\ -1

9: while

\mathbf{Q}_{ca}\neq\emptyset

\mathbf{Q}_{ba}\neq\emptyset

\triangleright

Region Growth

10: for each

\mathbf{Q}\in\{\mathbf{Q}_{ca},\mathbf{Q}_{ba}\}

and

\mathbf{Q}\neq\emptyset

11:

s\leftarrow\mathbf{Q}.Pop()

12: for

j\in GetNeighbors(s,e)

and

\mathbf{M}_{c}^{t_{k}}\{j\}=0

13:

\mathbf{M}_{c}^{t_{k}}\{j\}\leftarrow[\mathbf{Q}=\mathbf{Q}_{ca}]?1:-1

14:

\mathbf{Q}.Append(j)

15: end while

16:

\mathbf{M}_{c}^{t_{k}}\{m\}\leftarrow 0,\forall\ \mathbf{M}_{c}^{t_{k}}\{m\}=-1

\triangleright

Binary Mask

17: return

\mathbf{M}_{c}^{t_{k}}

18:end procedure

IV GCP Framework

IV-A Overall Architecture

In this paper, we propose a robust framework to guard CP systems through spatial-temporal aware malicious agent detection. As illustrated in Fig. 3, GCP operates through a dual-domain verification process. First, it computes a confidence-scaled spatial concordance loss by estimating confidence scores for each grid cell in the BEV detection map, evaluating the consistency between the ego CAV’s observations and messages from the $i$ -th neighboring CAV at the current time slot. Second, it performs temporal verification by analyzing the past $K$ frames from both the ego CAV and the $i$ -th neighboring CAV to generate a BEV flow map for low-confidence detections, employing an LSTM-AE-based temporal reconstruction to verify motion consistency. The framework culminates in a comprehensive spatial-temporal multi-test that combines both consistency metrics to make final decisions on malicious agent detection. The following sections (IV-B to IV-D) provide the detailed descriptions of these core components.

IV-B Confidence-Scaled Spatial Concordance Loss

Existing spatial consistency checking methods typically compare the difference between the ego CAV’s perception and CP without considering the ego CAV’s varying perception reliability across different spatial regions. This indiscrimination could result in mistaking true blindspot detection complemented by other CAVs as malicious signals. This occurs because the ego CAV’s perception reliability naturally decreases in occluded areas and at sensor range boundaries [28]. When other CAVs provide valid detections in these low-reliability regions, traditional methods may incorrectly flag them as inconsistencies by failing to account for the ego CAV’s spatially varying perception capabilities. To address this limitation, we propose a novel confidence-scaled spatial concordance loss (CSCLoss) that incorporates the ego CAV’s detection confidence when checking messages from the $i$ -th CAV at time $t_{k}$ . Let $\mathbf{Y}_{e}^{t_{k}}$ denote the ego CAV’s detected bounding boxes and $\mathbf{Y}_{e,i}^{t_{k}}$ represent the collaboratively detected bounding boxes at timestamp $t_{k}$ . We first construct a weighted bipartite graph between these two sets, where edges represent matching costs based on classification confidence and intersection-over-union (IoU) scores. Since $|\mathbf{Y}_{e}^{t_{k}}|$ may differ from $|\mathbf{Y}_{e,i}^{t_{k}}|$ , we pad the smaller set with empty boxes to ensure the one-to-one matching. The optimal matching $\mathcal{O}\{\mathbf{Y}_{e}^{t_{k}},\mathbf{Y}_{e,i}^{t_{k}}\}_{n=1}^{N}$ is then obtained using the Kuhn-Munkres algorithm [29]. To compute the CSCLoss, we first estimate a spatial confidence map $\mathbf{C}_{e}^{t_{k}}$ using the ego CAV’s feature map $\mathbf{F}_{e\rightarrow e}^{t_{i}}$ :

\mathbf{C}_{e}^{t_{k}}=\Phi_{conf}\left(\mathbf{F}_{e\rightarrow e}^{t_{i}}\right)\in[0,1]^{H\times W},

(5)

where $\Phi_{conf}(\cdot)$ is a detection decoder that generates confidence scores for each grid cell in the BEV map [4]. The CSCLoss is then calculated by combining the optimal matching and confidence map:

\mathcal{L}_{csc}\left(\mathbf{Y}_{e}^{t_{k}},\mathbf{Y}_{e,i}^{t_{k}}\right)=\sum_{c\in\mathcal{C}}\sum_{j\in\mathcal{O}}{\frac{\mathcal{M}\left(\mathbf{Y}_{e}^{t_{k}(j)},\mathbf{Y}_{e,i}^{t_{k}(j)};c\right)\cdot\mathbf{C}_{e}^{t_{k}(j)}}{|\mathbf{C}_{e}^{t_{k}}|}},

(6)

where $\mathbf{C}_{e}^{t_{k}(j)}$ is the confidence score for the grid cell containing bounding box $\mathbf{Y}_{e}^{t_{k}(j)}$ , $c\in\mathcal{C}$ represents a prediction class, $\mathcal{M}(\cdot)$ computes the matching cost between two boxes:

\mathcal{M}\left(y_{1},y_{2};c\right)=\operatorname{ReLU}\left(p_{1}-p_{2}\right)+\phi\left(1-\operatorname{IoU}\left(z_{1},z_{2}\right)\right),

(7)

where $p_{1}$ and $p_{2}$ are the class $c$ posterior probabilities for boxes $y_{1}$ and $y_{2}$ , respectively, $z_{1}$ and $z_{2}$ are their spatial coordinates, and $\phi$ is a weighting coefficient. Typically, when temporal context is not applicable, a spatial anomaly can be assumed to be detected when $\mathcal{L}_{csc}$ exceeds a threshold $\alpha$ .

IV-C Low-Confidence BEV Flow Matching

Bird’s Eye View (BEV) flow [23] represents the consecutive motion vectors of detected objects across consecutive frames in the top-down perspective. By observing the flow of bounding boxes in the BEV space over time, we can capture the temporal dynamics and motion patterns of objects in the scene. For computational efficiency, we only perform temporal matching on two specific types of low-confidence BEV flows: (i) those with low detection scores from ego CAV’s own view, and (ii) ego CAV’s unseen detections that the collaborative agents complement. While temporal matching could be applied to all detected boxes, focusing on these low-confidence cases is particularly crucial as it is difficult for an ego CAV to judge whether they represent true objects or perturbations caused by malicious agents merely through single-shot spatial checks, especially when the added perturbation is subtle. To address this challenge, we utilize temporal characteristics to analyze their anomaly patterns. We first match the detected low-confidence bounding boxes of $\mathbf{Y}_{e,i}^{t_{k}}$ to those in the past $K$ frames $\mathbf{Y}_{e,i}^{t_{k-1}},\mathbf{Y}_{e,i}^{t_{k-K+1}},\cdots,\mathbf{Y}_{e,i}^{t_{k-K}}$ . In the BEV detection map, each detected bounding box can be represented as a set of 4 corner points:

o_{j}=[x_{j}^{1},y_{j}^{1},\cdots,x_{j}^{4},y_{j}^{4}]^{\top}\in\mathbb{R}^{8},\forall o_{j}\in\mathbf{Y}_{e,i}^{t_{k}},

(8)

where $\{(x_{j}^{k},y_{j}^{k})\}_{k=1}^{4}$ represents the locations of 4 corners of bounding box $o_{j}$ in BEV detection map. Given the current-frame low-confidence bounding boxes set $\mathbf{O}_{e,i}^{t_{k}}\in\mathbf{Y}_{e,i}^{t_{k}}$ , the ego CAV will generate BEV flow $\mathbf{I}_{e,i}^{t_{k}}$ by iteratively matching the bounding boxes in each historical frame. As shown in Algorithm 2, the matching process is a chain-based process, which means that we first find the best-matched bounding boxes $\mathbf{B}_{e,i}^{t_{k-1}}$ in $\mathbf{Y}_{e,i}^{t_{k-1}}$ for $\mathbf{O}_{e,i}^{t_{k}}$ , then we keep searching for the best matched bounding boxes $\mathbf{B}_{e,i}^{t_{k-2}}$ in $\mathbf{Y}_{e,i}^{t_{k-2}}$ for $\mathbf{B}_{e,i}^{t_{k-1}}$ , until the $K$ -th frame has been matched. During this process, not all the bounding boxes in $\mathbf{O}_{e,i}^{t_{k}}$ in $\mathbf{Y}_{e,i}^{t_{k}}$ can be matched up to the $K$ -th frame, we call the successfully matched bounding boxes as the candidate BEV flow $\mathbf{I}_{c}^{t_{k}}$ . LSTM-AE is used to reconstruct it for further temporal consistency check. As for the unmatched BEV flow $\mathbf{I}_{u}^{t_{k}}$ , they will result in additional time anomaly penalties. To maintain computational efficiency, we cache the historical matching chains for each tracked flow. This significantly reduces the computational complexity as previously established matches can be directly reused without re-computation, making the chain matching process highly efficient in practice. Note that consecutive $K$ frames are not always completely available for chain matching. There are two cases when the frame cache is not enough for chain matching.

•

Case 1: The ego CAV may have not collected up to $K$ frames of neighboring CAVs at the early stage. In this case, we only use spatial consistency to check the malicious agent until the cached messages are up to $K$ frames.
•

Case 2: Certain frames of neighboring CAVs have been identified as malicious messages and are thereby discarded. In this case, we use Kalman Filter (KF) [30] for interpolation (Please refer to Appendix A). We also set a maximum consecutive interpolation limit of $L$ to avoid the cumulative error. Once the consecutive interpolation frames exceed the limit $L$ , all the cached frames will be refreshed.

IV-D Temporal BEV Flow Reconstruction

For the generated candidate BEV flow, we further analyze their temporal characteristics based on the current and past $K$ -frame messages from the $i$ -th CAV. As shown in Fig 3, there are three key components of BEV flow reconstruction: LSTM encoder, LSTM decoder, and temporal reconstruction loss estimator.

LSTM encoder. Each candidate BEV flow $\mathbf{i}_{c}^{t_{k}}\in\mathbf{I}_{c}^{t_{k}}$ can be represented as $\mathbf{i}_{c}^{t_{k}}=[o_{k-K},o_{k-K+1},\cdots,o_{k}]^{\top}\in\mathbb{R}^{(K+1)\times 8}$ , the LSTM encoder further encodes the high-dimensional input sequence into a low-dimensional hidden representation using the following equation:

H_{k}=\sigma_{o}\left([H_{k-1},o_{k}]\right)\odot\tanh(C_{k}),

(9)

where $\sigma_{o}(\cdot)=\sigma(w_{o}(\cdot)+b_{o})$ represents the output gate with activation function $\sigma(\cdot)$ , weight $w_{o}$ , and bias $b_{o}$ . $H_{k-1}$ and $o_{k}$ represent the concatenation of the hidden state and the current input, respectively. $C_{k}$ represents the input gate calculated by:

\hat{C}_{k}=\tanh(w_{c}[H_{k-1},o_{k}]+b_{c}),

(10)

C_{k}=\sigma_{f}\left([H_{k-1},o_{k}]\right)\odot C_{k-1}+\sigma_{g}\left([H_{k-1},o_{k}]\right)\odot\hat{C}_{k},

(11)

where $\sigma_{f}(\cdot)$ is the forget gate, $\sigma_{g}(\cdot)$ , $\hat{C}_{k}$ , $C_{k}$ represent the input gate. The output vector will be repeated $K+1$ times to yield the final encoded feature vector:

H_{k}=\Phi_{enc}^{lstm}\{\mathbf{i}_{c}^{t_{k}}\}\in\mathbb{R}^{1\times M},

(12)

\mathbf{H}_{bf}=\overbrace{H_{k}\oplus H_{k}\oplus\ldots\oplus H_{k}}^{K+1\ \text{times}}\in\mathbb{R}^{(K+1)\times M},

(13)

where $M$ is the latent encoded feature dimension, $\Phi_{enc}^{lstm}\{\cdot\}$ is the LSTM encoding network model containing $M$ computation units following Eq. 9 to Eq. 11, and $\oplus$ represents the concatenation operation along the horizontal dimension.

LSTM decoder. The LSTM Decoder consists of a 3-layer network with $K$ LSTM cell units. Each LSTM cell processes each $\mathbb{R}^{1\times M}$ encoded feature. These LSTM units generate a $\mathbb{R}^{(K+1)\times M}$ output vector learned from the encoded feature, which is further multiplied with a $\mathbb{R}^{M\times 8}$ vector output by a TimeDistributed (TD) layer [31]. The TimeDistributed layer maintains the temporal structure by applying a fully connected layer to each time step output. Finally, the LSTM decoder generates a reconstructed vector $\hat{\mathbf{i}}_{c}^{t_{k}}$ with the same size as the input vector following:

\hat{\mathbf{i}}_{c}^{t_{k}}=\Phi_{dec}^{lstm}\{\mathbf{H}_{bf}\}\cdot\operatorname{\textbf{TD}}\{\mathbf{H}_{bf}\}\in\mathbb{R}^{(K+1)\times 8},

(14)

where $\Phi_{dec}^{lstm}\{\cdot\}$ is the LSTM decoding network model, $\operatorname{\textbf{TD}}\{\cdot\}$ represents the TimeDistributed Layer funciton.

Temporal reconstruction loss estimator. To evaluate the loss between the input vector and output vector, we use Mean Absolute Error (MAE) as a metric, given by:

\mathcal{L}_{tr}(\mathbf{i}_{c}^{t_{k}},\hat{\mathbf{i}}_{c}^{t_{k}})=\frac{1}{K}\sum_{i=1}^{K}|o_{k-i}-\hat{o}_{k-i}|,

(15)

where $\hat{o}_{k-i}\in\hat{\mathbf{i}}_{c}^{t_{k}}$ is the reconstructed BEV flow vector at time $t_{k-i}$ . The final temporal anomaly score is the sum of the candidate BEV flow reconstruction loss and the unmatched BEV flow penalty, given by:

\mathcal{L}_{ta}=\sum_{\mathbf{i}_{c}^{t_{k}}\in\mathbf{I}_{c}^{t_{k}}}\mathcal{L}_{tr}(\mathbf{i}_{c}^{t_{k}},\hat{\mathbf{i}}_{c}^{t_{k}})+\kappa_{p}|\mathbf{I}_{u}^{t_{k}}|,

(16)

where $\kappa_{p}$ is a constant penalty coefficient for the unmatched BEV flow.

IV-E Joint Spatial-Temporal Benjamini-Hochberg Test

To jointly utilize spatial and temporal anomaly results for malicious agent detection, we employ the Benjamini-Hochberg (BH) procedure [32, 16]. The BH procedure effectively controls the False Discovery Rate (FDR) - the expected proportion of false positives among all rejected null hypotheses, which is crucial in CP where false accusations can severely impact system reliability. For each CAV $i$ under inspection, we first compute a weighted combination of spatial and temporal scores:

\mathcal{L}_{ST}=\omega_{s}\mathcal{L}_{csc}+\omega_{t}\mathcal{L}_{ta},

(17)

where $\omega_{s}$ and $\omega_{t}$ are learnable weights. The BH procedure controls FDR through a step-up process that adapts its rejection threshold based on the distribution of observed p-values. Unlike traditional single hypothesis testing, the BH procedure maintains FDR control at level $\alpha_{bh}$ even under arbitrary dependencies between tests, making it particularly suitable for CP where agents’ behaviors may be correlated due to shared environmental factors or coordinated attacks. We formulate the hypothesis test $\mathcal{H}=\{\mathcal{H}_{0},\mathcal{H}_{1}\}$ , where $\mathcal{H}_{0}:\mathcal{L}_{ST}\sim P_{normal}$ represents the null hypothesis that the combined score follows the distribution of normal agents, and $\mathcal{H}_{1}:\mathcal{L}_{ST}\nsim P_{normal}$ represents the alternative hypothesis. Since this distribution is typically intractable, we compute conformal p-values using the calibration set $\mathcal{S}$ :

\hat{p}=\frac{1+\left|\{s\in\mathcal{S}:s\geq\mathcal{L}_{ST}\}\right|}{1+|\mathcal{S}|},

(18)

where $|\cdot|$ represents the cardinality of the set. Following the BH procedure, we sort the p-values in ascending order $\hat{p}_{(1)}\leq\hat{p}_{(2)}\leq...\leq\hat{p}_{(m)}$ for $m$ hypothesis tests. Let $j$ be the largest index satisfying:

\hat{p}_{(j)}\leq\frac{j}{{\color[rgb]{0,0,0}m}}\alpha_{bh},

(19)

Then, agent $i$ is classified as malicious when its p-value is less than or equal to $\hat{p}_{(j)}$ , where $\alpha_{bh}$ controls the desired false detection rate.

Algorithm 2 Chain-based BEV Flow Matching (BFM)

Input:

•

$\mathbf{O}_{e,i}^{t_{k}}$ , low-confidence detections at current frame
•

$\{\mathbf{Y}_{e,i}^{t_{k-j}}\}_{j=1}^{K}$ , cached detections from past $K$ frames

Output: $\mathbf{I}_{c}^{t_{k}}$ , $\mathbf{I}_{u}^{t_{k}}$ (candidate and unmatched BEV flows)

1:procedure BFM(

\mathbf{O}_{e,i}^{t_{k}}

)

\mathbf{I}_{c}^{t_{k}},\mathbf{I}_{u}^{t_{k}}\leftarrow\{\},\{\}

\triangleright

BEV Flow Initialization

3: if

\mathbf{O}_{e,i}^{t_{k}}=\emptyset

then return

\mathbf{I}_{c}^{t_{k}},\mathbf{I}_{u}^{t_{k}}

4: end if

5: for

o_{j}\in\mathbf{O}_{e,i}^{t_{k}}

i_{j}\leftarrow[o_{j}]

curr\leftarrow o_{j}

matched\leftarrow\textbf{True}

7: for

k\leftarrow K-1

down to

0

\triangleright

Chain Match

b_{max}\leftarrow\mathop{\arg\max}\limits_{b\in\mathbf{Y}_{e,i}^{t_{k}}}\operatorname{IoU}(b,curr)

cost\leftarrow\operatorname{ReLU}(p_{curr}-p_{b})+\phi(1-\operatorname{IoU}(b,curr))

10: if

cost<\tau

then

11:

i_{j}\leftarrow i_{j}\oplus b_{max}

curr\leftarrow b_{max}

12: else

13:

matched\leftarrow\textbf{False}

14: break

15: end if

16: end for

17: if

matched

then

\triangleright

Flow Concatenation

18:

\mathbf{I}_{c}^{t_{k}}\leftarrow\mathbf{I}_{c}^{t_{k}}\cup\{i_{j}\}

19: else

20:

\mathbf{I}_{u}^{t_{k}}\leftarrow\mathbf{I}_{u}^{t_{k}}\cup\{i_{j}\}

21: end if

22: end for

23: return

\mathbf{I}_{c}^{t_{k}},\mathbf{I}_{u}^{t_{k}}

24:end procedure

V Experiments

In this section, we first describe the experimental setup, including dataset, attack and defense settings, and implementation details. After that, both the quantitative and qualitative results are introduced.

V-A Experimental Setup

Datasets. We conduct experiments using two data sets, namely, V2X-Sim dataset [2] and V2X-Flow dataset. V2X-Sim dataset is used for training and evaluating different CP models and CP defense methods, while V2X-Flow dataset is used for pre-training LSTM-AE in our GCP.

•

V2X-Sim Dataset. V2X-Sim [2] is a simulated dataset generated by the CARLA simulator [33]. The dataset contains 10,000 frames of synchronized multi-view data captured from 6 different connected agents (5 vehicles and 1 RSU), including LiDAR point clouds (32-beam, 70m range) and RGB images, along with 501,000 annotated 3D bounding boxes containing object attributes. The data is split into training, validation, and test sets with a ratio of 8:1:1.
•

V2X-Flow Dataset. To facilite the pre-training of LSTM-AE used in our GCP, we construct a V2X-Flow dataset based on V2X-Sim. We separate the annotations of the V2X-Sim dataset to generate the BEV flow of each bounding box using the chain-based BEV flow matching method mentioned in Section IV-C. The dataset split of V2X-Flow is the same as V2X-Sim.

Attack Settings. To simulate realistic attack scenarios in vehicular networks, we consider that malicious agents would adopt stealthy strategies instead of continuously launching attacks. The temporal patterns of such attacks can be modeled with different distributions. Given the total number of collaborative agents $N_{a}$ , the time horizon $T$ , the number of malicious agents $m$ , and an attack ratio $\lambda$ (the proportion of malicious messages to the total messages from all agents over the time horizon), we evaluate three representative time series attack modes. For all modes, the total malicious messages $\lambda N_{a}T$ are distributed among $m$ malicious agents following a truncated normal distribution $\mathcal{N}(\frac{\lambda N_{a}T}{m},\sigma^{2})$ to simulate varying attack capabilities while maintaining reasonable attack intensities:

•

Random attack process (R-mode) [15, 16]. This mode follows a uniform random distribution in which malicious messages are randomly distributed over the time period $T$ , with $\lambda N_{a}T$ total malicious messages from $m$ malicious agents, simulating unpredictable attack patterns.
•

Poisson attack process (P-mode) [34]. This mode models bursty attack patterns (e.g., DDoS attacks) in which intense activities occur in short durations. At each time step $t\in T$ , the number of malicious messages $N_{t}$ follows a Poisson distribution:

$P\left(N_{t}=k\right)=\frac{(\lambda N_{a})^{k}exp\left(-\lambda N_{a}\right)}{k!},$ (20)

where $k$ denotes the number of malicious messages at time step $t$ , and $\lambda N_{a}$ represents the mean rate.
•

Susceptible-Infectious process (S-mode) [35]. This mode simulates progressive attack scenarios where malicious behaviors spread through the network over time, similar to virus propagation. The evolution of the number of malicious messages $N_{t}$ follows:

$\frac{dN_{t}}{dt}=\gamma N_{t}(1-\frac{N_{t}}{\lambda N_{a}T}),$ (21)

where $\gamma$ represents the propagation rate.

For adversarial perturbation generation, we evaluate our method against three representative white-box attacks: Projected Gradient Descent (PGD) [36], Carlini & Wagner (C&W) [37], and Basic Iterative Method (BIM) (Please refer to Appendix B for implementation details). Moreover, we introduce our proposed BAC attack to exploit the vulnerabilities in CP systems specifically. To maintain real-time attack capability, the BAC mask generation adopts a slow update strategy without requiring per-frame updates. In our experiments, we set the mask update rate to 0.5 FPS.

Implementation Details. Our GCP is implemented with PyTorch. Each agent’s locally captured LiDAR or camera data is first encoded into a BEV feature map with size $16\times 16$ and 512 channels. The encoder network adopts a ResNet-style architecture with 5 layers of feature encoding, with output feature dimensions progressively increasing from 32 to 512 channels while spatial dimensions decrease from 256 to 16. The decoder consists of three convolutional blocks with channel dimensions progressively decreasing from 512 to 64. The detection head follows a two-branch design (classification and regression) with standard convolutional layers, batch normalization, and ReLU activation. It uses an anchor-based detection decoder [38] with multiple anchor sizes per location. The fusion method is V2VNet [22]. The LSTM-AE model consists of 2 LSTM layers with hidden dimension 32. We train the model using Adam optimizer (lr=0.001) with the default history sequence length 5. All experiments are conducted on a server with 2 Intel(R) Xeon(R) Silver 4410Y CPUs, 4 NVIDIA RTX A5000 GPUs, and 512 GB RAM.

Baselines and Evaluation Metrics. We compare our method with two state-of-the-art CP defense baselines: ROBOSAC [15] and MADE [16]. We also include three reference settings: (1) Upper-bound - CP without malicious agents, representing the optimal performance; (2) Lower-bound - ego vehicle’s local perception only; and (3) No defense - CP with malicious agents but without any defense measures. To validate the effectiveness of our spatial-temporal design, we also evaluate two variants of GCP: spatial-only defense (GCP-S) and temporal-only defense (GCP-T). The performance of all methods is evaluated using both accuracy and efficiency metrics: average precision at 0.5 IoU (AP@0.5) and 0.7 IoU (AP@0.7) for detection accuracy and Frames Per Second (FPS) for computational efficiency.

V-B Quantitative Results

TABLE I: Comparative results under different attack methods and modes. We report the AP@0.5 and AP@0.7 under different attack methods and temporal patterns. Attack settings: the number of malicious agents

m=2

; attack ratio

\lambda=0.25

; input/output perturbation budget

\Delta_{i}=\Delta_{o}=0.5

; the number of iterations = 10. Best results excluding upper-bound are in bold, while significant performance drops are marked with

\downarrow

Method	Mode	PGD attack		C&W attack		BIM attack		BAC attack
Method	Mode	AP@0.5	AP@0.7	AP@0.5	AP@0.7	AP@0.5	AP@0.7	AP@0.5	AP@0.7
Upper-bound	—	80.52	78.65	80.52	78.65	80.52	78.65	80.52	78.65
GCP (Ours)	Average	77.54	76.51	76.81	75.56	77.40	76.46	76.64	75.64
	R	76.88	75.78	76.90	75.16	76.72	75.74	75.80	74.96
	P	77.37	76.34	77.13	76.20	77.38	76.43	76.29	75.09
	S	78.36	77.41	76.41	75.32	78.11	77.22	77.82	76.87
MADE [16]	Average	77.07	75.75	76.28	74.87	76.61	75.06	64.00 $\downarrow$	54.48 $\downarrow$
	R	76.67	75.56	75.81	74.28	76.82	75.28	65.63	56.92
	P	76.76	75.04	76.29	74.81	76.96	75.88	65.11	55.95
	S	77.79	76.66	76.39	75.42	76.06	74.01	61.28	50.58
ROBOSAC [15]	Average	73.31	71.54	73.90	72.16	73.48	71.70	68.11 $\downarrow$	64.77 $\downarrow$
	R	74.64	72.93	74.52	72.78	74.67	73.19	68.68	64.82
	P	73.98	72.84	73.63	72.45	73.63	72.07	67.25	65.48
	S	71.31	68.84	73.54	71.26	72.14	69.85	68.40	64.02
No Defense	Average	36.65 $\downarrow$	35.92 $\downarrow$	17.70 $\downarrow$	15.14 $\downarrow$	38.27 $\downarrow$	37.28 $\downarrow$	54.83 $\downarrow$	45.11 $\downarrow$
	R	36.50	35.64	18.03	15.52	37.07	36.18	55.19	43.99
	P	39.01	38.35	18.92	16.23	36.59	35.47	55.03	45.07
	S	34.45	33.78	16.14	13.66	41.16	40.19	54.26	46.28
Lower-bound	—	64.08	61.99	64.08	61.99	64.08	61.99	64.08	61.99

Comparative results against different attacks. As shown in Table I, we evaluate our proposed GCP against various attack methods and time series modes on the V2X-Sim dataset. First, all defense methods demonstrate varying degrees of effectiveness compared to the no-defense baseline, which suffers severe performance degradation under all attack scenarios, especially for C&W attacks. While baseline methods show some defense capability, with MADE performing slightly better than ROBOSAC against conventional attacks (PGD, C&W, and BIM), they both exhibit significant vulnerability to our proposed BAC attack. Our proposed GCP consistently performs better across all attack scenarios and time series modes. For conventional attacks, GCP outperforms existing methods with consistent improvements (0.47%-0.79% in AP@0.5) across PGD, C&W, and BIM attacks. The advantage becomes more pronounced under our challenging BAC attack, where GCP significantly surpasses MADE and ROBOSAC by 12.64% and 8.53% in AP@0.5, respectively, while maintaining performance close to the upper bound with only 3.88% degradation. Across different time series attack modes (R-Mode, P-Mode, and S-Mode), GCP demonstrates consistent robustness, maintaining stable performance where baseline methods show varying degrees of degradation. These results validate the effectiveness of our spatial-temporal defense mechanism in handling diverse attack patterns in CP systems.

Comparative results with dynamic attack intensities. Table II shows the performance comparison under different perturbation budgets and attack intensities. All methods show performance degradation with more intense attacks ( $m=3$ , $\lambda=0.5$ ), where MADE exhibits the largest drops (13.69%-15.68% in AP@0.5) while GCP maintains relatively stable performance with only 5.46%-5.74% degradation. For perturbation budget settings, we observe that larger output perturbation ( $\Delta_{o}=0.5$ ) enables more effective defense than smaller ones ( $\Delta_{o}=0.1$ ). Increasing input perturbation ( $\Delta_{i}=1.0$ ) significantly impacts baseline methods, with ROBOSAC showing 4.56% drop in AP@0.5 even under moderate attacks. In contrast, GCP maintains consistent performance across all settings (less than 0.5% variation in AP@0.5). Overall, our proposed GCP outperforms baselines by 7.12%-18.29% in AP@0.5, with advantages more pronounced under intense attacks and large perturbation budgets. These results demonstrate our spatial-temporal defense mechanism’s strong adaptability and robustness against varying attack parameters.

TABLE II: Comparative performance under different attack intensities. We report the AP@0.5 and AP@0.7 under different input/output perturbation budget (

\Delta_{i}

\Delta_{o}

) and attack intensities. Attack settings: moderate attack (

m=2

\lambda=0.25

) and intense attack (

m=3

\lambda=0.5

); input/output perturbation budget

\Delta_{i}=\Delta_{o}=0.5

. Best results excluding upper-bound are in bold, while significant performance drops are marked with

\downarrow

Method	BAC Attack	$\Delta_{i}=0.5,\Delta_{o}=0.5$		$\Delta_{i}=0.5,\Delta_{o}=0.1$		$\Delta_{i}=1.0,\Delta_{o}=0.5$		$\Delta_{i}=1.0,\Delta_{o}=0.1$
Method	BAC Attack	AP@0.5	AP@0.7	AP@0.5	AP@0.7	AP@0.5	AP@0.7	AP@0.5	AP@0.7
Upper-bound	—	80.52	78.65	80.52	78.65	80.52	78.65	80.52	78.65
GCP (Ours)	Moderate	75.80	74.96	75.23	74.02	75.58	74.28	75.51	74.69
GCP (Ours)	Intense	70.22	68.57	70.03	68.36	70.07	68.59	69.77	68.20
MADE [16]	Moderate	65.63	56.92	65.80	57.26	65.39	55.56	66.93	56.79
MADE [16]	Intense	51.94 $\downarrow$	40.87 $\downarrow$	54.22 $\downarrow$	41.38 $\downarrow$	53.21 $\downarrow$	40.61 $\downarrow$	51.04 $\downarrow$	36.25 $\downarrow$
ROBOSAC [15]	Moderate	68.68	64.82	66.26	62.38	64.12	58.08	65.64	61.42
ROBOSAC [15]	Intense	60.09	54.88	61.20	55.07	59.08	51.14	61.02	53.87
No Defense	Moderate	55.19 $\downarrow$	43.99 $\downarrow$	65.05	55.44	55.92 $\downarrow$	42.91 $\downarrow$	59.54	47.27 $\downarrow$
No Defense	Intense	35.63 $\downarrow$	25.35 $\downarrow$	47.82 $\downarrow$	32.55 $\downarrow$	33.40 $\downarrow$	17.98 $\downarrow$	41.85 $\downarrow$	23.24 $\downarrow$

Ablation studies. To evaluate the effectiveness of our proposed GCP, we compare it with its variants that use only spatial or temporal defense mechanisms. As shown in Table III, both components contribute to the overall defense performance but with different strengths against different attacks. For conventional PGD attacks, the spatial component plays a dominant role, with GCP-S achieving comparable performance (77.21% AP@0.5) to the full model (77.54% AP@0.5), while GCP-T shows limited effectiveness (70.16% AP@0.5). However, for our challenging BAC attack, neither component alone can maintain robust performance, with significant drops in GCP-S (7.41% decrease in AP@0.5) and GCP-T (10.63% decrease in AP@0.5) compared to the full model. This demonstrates the necessity of combining both spatial and temporal consistency checks for effective defense against sophisticated attacks in CP systems. Notably, despite using a lighter architecture, GCP-S maintains better performance than ROBOSAC [15] and MADE [16], validating the effectiveness of our CSCLoss design proposed in Section 4.2.

Effect of the frame cache length and KF interpolation times. We analyze two key hyperparameters in GCP: the maximum cached frame length ( $K$ ) and the maximum allowed consecutive KF interpolation times ( $L$ ). As shown in Fig. 5, moderate values for both parameters generally yield better performance across different attack modes. The optimal value for cache size $K$ is consistently around $K=5$ . Under BAC attack, S-mode performance peaks at $K=5$ with AP@0.5 of 76.87%, while smaller ( $K=3$ , 73.92%) and larger ( $K=7$ , 73.48%) sizes degrade performance. This suggests a moderate cache balances sufficient temporal context and avoids noise from distant frames. For consecutive KF interpolation times $L$ , optimal performance is typically achieved at $L=3$ or $L=4$ . For BIM attacks, S-mode AP@0.5 reaches 77.42% when $L=3$ , dropping to 73.09% when $L$ increases to 5. This indicates that while KF interpolation aids temporal consistency, excessive consecutive interpolation may introduce cumulative errors, affecting detection accuracy.

Quantitative analysis of attack and defense speeds. As shown in Table IV, we analyze computation speeds across different attack and defense methods. For attacks, the reported FPS values represent single-iteration performance. With 3-5 iterations typically sufficient for effective attacks, all methods can enable real-time attacks in low frame rate scenarios (10-15 FPS). Our BAC attack shows slightly lower speed than PGD (22.4 vs 25.3 FPS) due to mask generation but maintains efficiency through a low-frequency mask update strategy (0.5 FPS). On the defense side, while baseline methods MADE and ROBOSAC show higher speeds against BAC attack (82.8 and 44.6 FPS), this actually indicates that their detection mechanisms are bypassed. In contrast, our GCP maintains robust defense speeds (30.2-47.6 FPS) with only 3.88% performance degradation under BAC attack. These results validate that both our attack and defense methods meet the real-time requirements for practical deployment, demonstrating the potential for seamless integration into existing CP systems without compromising efficiency.

TABLE III: Ablation study results. Comparison between full GCP model and its variants (GCP-S and GCP-T). Attack settings:

m=2

\lambda=0.25

\Delta_{i}=\Delta_{o}=0.5

Method	PGD attack		BAC attack
Method	AP@0.5	AP@0.7	AP@0.5	AP@0.7
Upper-bound	80.52	78.65	80.52	78.65
GCP (Ours)	77.54	76.51	76.64	75.64
GCP-S	77.21	76.10	69.23	62.38
GCP-T	70.16	67.33	66.01	58.79
No Defense	36.65	35.92	54.83	45.11

TABLE IV: Quantitative results of attack and defense speed. We report the average computation speed (FPS) under different attack and defense settings. Attack settings:

m=2

\lambda=0.25

;

\Delta_{i}=\Delta_{o}=0.5

Attack Method	Attack	Defense Speed (FPS)
Attack Method	Speed (FPS)	MADE	ROBOSAC	GCP
PGD attack	25.3	55.8	27.1	38.6
C&W attack	18.2	38.5	21.6	30.2
BIM attack	24.8	54.6	28.7	37.5
BAC attack	22.4	82.8	44.6	47.6

V-C Qualitative Results

Visualization of BAC attack process. As shown in Fig. 6, the complete pipeline of BAC attack on V2X-Sim dataset is visualized. First, the malicious agent analyzes the detection results using only the victim vehicle’s local perception (Fig. 6(a)) to identify its inherent blind spots and perception limitations. Then, by comparing with the CP results (Fig. 6(b)), the attacker can identify regions where the victim heavily relies on collaborative messages for object detection. This differential analysis enables the malicious agent to generate an initial BAC seed map (Fig. 6(c)) that highlights these perception-dependent areas. Finally, through the proposed blind region segmentation algorithm, the attacker obtains a refined BAC confidence mask (Fig. 6(d)) that precisely delineates the victim’s vulnerable regions, providing targeted guidance for adversarial perturbation generation.

Visualization of collaborative 3D detection. As shown in Fig. 7, we visualize the detection performance under different attack and defense scenarios on V2X-Sim dataset. While conventional attacks (PGD, C&W, BIM) generate obvious perturbations that baseline methods can easily detect, our BAC attack demonstrates superior stealthiness by optimizing output-space perturbations, specifically in blind regions. This makes it particularly challenging for existing methods like ROBOSAC and MADE to defend against, as evidenced by their significant performance degradation. In contrast, our proposed GCP framework maintains robust detection performance through effective spatial-temporal consistency verification, achieving results that approach the upper bound of CP. The visualization clearly shows that GCP successfully preserves accurate object detection even under sophisticated BAC attacks, while baseline methods struggle to maintain reliable performance.

Visualization of BEV flow reconstruction loss distribution. Fig. 8 illustrates the distribution of BEV flow reconstruction losses across different agents. Normal agents (Agent 3 and 4) exhibit consistently low losses (mean: 0.0101 and 0.0091) with minimal variance and only one unmatched flow each, indicating stable temporal consistency. In contrast, agents under BAC attack (Agent 0 and 2) show significantly higher maximum losses (0.4219 and 0.8636) with greater variance and more unmatched flows (2 each), revealing temporal inconsistencies introduced by the attack. While some flows from malicious agents maintain low losses due to the stealthy nature of BAC attacks, the combination of high-loss outliers and increased unmatched flows provides reliable indicators for attack detection.

VI Conclusion

In this paper, we have devised a novel blind area confusion (BAC) attack to show that collaborative perception (CP) systems are vulnerable to malicious attacks even with existing outlier-based CP defense mechanisms. The key innovation of the BAC attack lies in the blind region segmentation-based local perturbation optimization. To counter such attacks, we have proposed our GCP, a robust CP defense framework utilizing spatial and temporal contextual information through confidence-scaled spatial concordance loss and LSTM-AE-based temporal BEV flow reconstruction. These components are integrated via a spatial-temporal Benjamini-Hochberg test to generate reliable anomaly scores for malicious agent detection. Extensive experiments have demonstrated the superior robustness of our framework against various attacks while maintaining high detection performance. The effectiveness of our approach across different attack scenarios and perturbation settings highlights the potential of spatial-temporal analysis to guard CP systems. Looking forward, our work provides valuable insights into developing secure and reliable CP systems in autonomous driving, paving the way to robust multi-agent collaboration in real-world applications.

References

[1] Q. Chen, S. Tang, Q. Yang, and S. Fu, “Cooper: Cooperative Perception for Connected Autonomous Vehicles Based on 3D Point Clouds ,” in IEEE International Conference on Distributed Computing Systems (ICDCS), Jul. 2019, pp. 514–524.
[2] Y. Li, D. Ma, Z. An, Z. Wang, Y. Zhong, S. Chen, and C. Feng, “V2X-Sim: Multi-Agent Collaborative Perception Dataset and Benchmark for Autonomous Driving,” IEEE Robotics and Automation Letters, vol. 7, no. 4, pp. 10 914–10 921, 2022.
[3] S. Hu, Z. Fang, H. An, G. Xu, Y. Zhou, X. Chen, and Y. Fang, “Adaptive Communications in Collaborative Perception with Domain Alignment for Autonomous Driving,” arXiv:2310.00013, 2024.
[4] Y. Hu, S. Fang, Z. Lei, Y. Zhong, and S. Chen, “Where2comm: Communication-Efficient Collaborative Perception via Spatial Confidence Maps,” in Advances in Neural Information Processing Systems (NeurIPS), 2022.
[5] S. Hu, Z. Fang, Y. Deng, X. Chen, Y. Fang, and S. Kwong, “Towards Full-scene Domain Generalization in Multi-agent Collaborative Bird’s Eye View Segmentation for Connected and Autonomous Driving,” arXiv:2311.16754, 2024.
[6] Z. Fang, S. Hu, H. An, Y. Zhang, J. Wang, H. Cao, X. Chen, and Y. Fang, “PACP: Priority-Aware Collaborative Perception for Connected and Autonomous Vehicles,” IEEE Transactions on Mobile Computing, vol. 23, no. 12, pp. 15 003–15 018, 2024.
[7] Y. Tao, S. Hu, Z. Fang, and Y. Fang, “Direct-CP: Directed Collaborative Perception for Connected and Autonomous Vehicles via Proactive Attention,” arXiv:2409.08840, 2024.
[8] S. Hu, Z. Fang, Z. Fang, Y. Deng, X. Chen, and Y. Fang, “AgentsCoDriver: Large Language Model Empowered Collaborative Driving with Lifelong Learning,” arXiv:2404.06345, Apr. 2024.
[9] S. Hu, Z. Fang, Z. Fang, Y. Deng, X. Chen, Y. Fang, and S. Kwong, “AgentsCoMerge: Large Language Model Empowered Collaborative Decision Making for Ramp Merging,” arXiv:2408.03624, Aug. 2024.
[10] S. Hu, Z. Fang, Y. Deng, X. Chen, Y. Fang, and S. Kwong, “Toward Full-Scene Domain Generalization in Multi-Agent Collaborative Bird’s Eye View Segmentation for Connected and Autonomous Driving,” IEEE Transactions on Intelligent Transportation Systems, pp. 1–14, 2024.
[11] S. Hu, Y. Tao, G. Xu, Y. Deng, X. Chen, Y. Fang, and S. Kwong, “CP-Guard: Malicious Agent Detection and Defense in Collaborative Bird’s Eye View Perception,” arXiv:2412.12000, Dec. 2024.
[12] Z. Fang, J. Wang, Y. Ma, Y. Tao, Y. Deng, X. Chen, and Y. Fang, “R-ACP: Real-Time Adaptive Collaborative Perception Leveraging Robust Task-Oriented Communications,” arXiv:2410.04168, 2024.
[13] Q. Zhang, S. Jin, R. Zhu, J. Sun, X. Zhang, Q. A. Chen, and Z. M. Mao, “On Data Fabrication in Collaborative Vehicular Perception: Attacks and Countermeasures,” in 33rd USENIX Security Symposium, Aug. 2024, pp. 6309–6326.
[14] J. Tu, T. Wang, J. Wang, S. Manivasagam, M. Ren, and R. Urtasun, “Adversarial Attacks On Multi-Agent Communication,” in IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 7748–7757.
[15] Y. Li, Q. Fang, J. Bai, S. Chen, F. Juefei-Xu, and C. Feng, “Among Us: Adversarially Robust Collaborative Perception by Consensus,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2023, pp. 186–195.
[16] Y. Zhao, Z. Xiang, S. Yin, X. Pang, S. Chen, and Y. Wang, “Malicious Agent Detection for Robust Multi-Agent Collaborative Perception,” arXiv:2310.11901, 2024.
[17] Anonymous, “CP-Guard+: A New Paradigm for Malicious Agent Detection and Defense in Collaborative Perception,” in Submitted to The Thirteenth International Conference on Learning Representations (ICLR), 2024, under review.
[18] Y. Han, H. Zhang, H. Li, Y. Jin, C. Lang, and Y. Li, “Collaborative Perception in Autonomous Driving: Methods, Datasets and Challenges,” IEEE Intelligent Transportation Systems Magazine, vol. 15, no. 6, pp. 131–151, Nov. 2023, arXiv:2301.06262 [cs].
[19] S. Hu, Z. Fang, Y. Deng, X. Chen, and Y. Fang, “Collaborative Perception for Connected and Autonomous Driving: Challenges, Possible Solutions and Opportunities,” Jan. 2024, arXiv:2401.01544.
[20] W. Zeng, S. Wang, R. Liao, Y. Chen, B. Yang, and R. Urtasun, “DSDNet: Deep Structured Self-driving Network,” in European Conference on Computer Vision (ECCV), 2020, pp. 156–172.
[21] Y. Li, S. Ren, P. Wu, S. Chen, C. Feng, and W. Zhang, “Learning Distilled Collaboration Graph for Multi-Agent Perception,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 34, 2021, pp. 29 541–29 552.
[22] T.-H. Wang, S. Manivasagam, M. Liang, B. Yang, W. Zeng, and R. Urtasun, “V2VNet: Vehicle-to-Vehicle Communication for Joint Perception and Prediction,” in European Conference on Computer Vision (ECCV), 2020, pp. 605–621.
[23] S. Wei, Y. Wei, Y. Hu, Y. Lu, Y. Zhong, S. Chen, and Y. Zhang, “Asynchrony-Robust Collaborative Perception via Bird’s Eye View Flow,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 36, 2023, pp. 28 462–28 477.
[24] Y. Lu, Q. Li, B. Liu, M. Dianati, C. Feng, S. Chen, and Y. Wang, “Robust Collaborative 3d Object Detection in Presence of Pose Errors,” in IEEE International Conference on Robotics and Automation (ICRA), 2023, pp. 4812–4818.
[25] H. An, G. Hua, Z. Lin, and Y. Fang, “Box-Free Model Watermarks Are Prone to Black-Box Removal Attacks,” arXiv:2405.09863, 2024.
[26] H. Cao, L. Yuan, G. Xu, Z. He, Z. Fang, and Y. Fang, “Secure Traffic Sign Recognition: An Attention-Enabled Universal Image Inpainting Mechanism against Light Patch Attacks,” arXiv:2409.04133, 2024.
[27] H. Cao, W. Huang, G. Xu, X. Chen, Z. He, J. Hu, H. Jiang, and Y. Fang, “Security Analysis of WiFi-based Sensing Systems: Threats from Perturbation Attacks,” arXiv:2404.15587, 2024.
[28] J. Yin, J. Shen, C. Guan, D. Zhou, and R. Yang, “LiDAR-Based Online 3D Video Object Detection With Graph-Based Message Passing and Spatiotemporal Transformer Attention,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 11 492–11 501.
[29] X. Gao, Z. Chen, J. Pan, F. Wu, and G. Chen, “Energy Efficient Scheduling Algorithms for Sweep Coverage in Mobile Sensor Networks,” IEEE Transactions on Mobile Computing, vol. 19, no. 6, pp. 1332–1345, 2020.
[30] J. Liu, Y. Zhang, X. Zhao, Z. He, W. Liu, and X. Lv, “Fast and Robust LiDAR-Inertial Odometry by Tightly-Coupled Iterated Kalman Smoother and Robocentric Voxels,” IEEE Transactions on Intelligent Transportation Systems, vol. 25, no. 10, pp. 14 486–14 496, 2024.
[31] Y. Wei, J. Jang-Jaccard, F. Sabrina, W. Xu, S. Camtepe, and A. Dunmore, “Reconstruction-based LSTM-Autoencoder for Anomaly-based DDoS Attack Detection over Multivariate Time-Series Data,” arXiv:2305.09475, 2023.
[32] Y. Benjamini and Y. Hochberg, “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing,” Journal of the Royal statistical society: series B (Methodological), vol. 57, no. 1, pp. 289–300, 1995.
[33] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, “CARLA: An open urban driving simulator,” in Proceedings of the 1st Annual Conference on Robot Learning, 2017, pp. 1–16.
[34] Y. Xiang, K. Li, and W. Zhou, “Low-Rate DDoS Attacks Detection and Traceback by Using New Information Metrics,” IEEE Transactions on Information Forensics and Security, vol. 6, no. 2, pp. 426–437, 2011.
[35] H. Ahn, J. Choi, and Y. H. Kim, “A Mathematical Modeling of Stuxnet-Style Autonomous Vehicle Malware,” IEEE Transactions on Intelligent Transportation Systems, vol. 24, no. 1, pp. 673–683, 2023.
[36] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards Deep Learning Models Resistant to Adversarial Attacks,” in International Conference on Learning Representations (ICLR), 2018.
[37] N. Carlini and D. Wagner, “Towards Evaluating the Robustness of Neural Networks,” in IEEE Symposium on Security and Privacy (SP), May 2017, pp. 39–57.
[38] W. Luo, B. Yang, and R. Urtasun, “Fast and Furious: Real Time End-to-End 3D Detection, Tracking and Motion Forecasting with a Single Convolutional Net,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2018, pp. 3569–3577.
[39] A. Kurakin, I. Goodfellow, and S. Bengio, “Adversarial examples in the physical world,” arXiv:1607.02533, 2017.

Appendix A KF-based BEV Flow Interpolation

Given the state transition equations for intermittent BEV flow, we can directly apply these to the Kalman filter (KF) framework for both prediction and state update, and thereby interpolate the missing values. Assume that system state and observation noises are additive white Gaussian noises and that the state transition and observation models are linear, we first define the state vector and the state transition matrix of BEV flow:

o_{j}=[x_{j}^{1},y_{j}^{1},\cdots,x_{j}^{4},y_{j}^{4}]^{\top}\in\mathbb{R}^{8},

(22)

where $o_{j}$ represents the BEV flow state vector, containing 4 corner points of a bounding box in the BEV detection map, with $(x_{j}^{k},y_{j}^{k})$ being the coordinates of the $k$ -th corner point. Then, the state transition equation can be represented as:

o_{j,k+1}=\mathbf{F}_{k}o_{j,k}+\mathbf{w}_{k},

(23)

where $\mathbf{F}_{k}$ is the state transition matrix constructed based on the equations provided, assuming no external control inputs besides the physical model-based linear relationships:

\mathbf{F}_{k}=\begin{bmatrix}\mathbf{I}_{4}&\Delta t\mathbf{I}_{4}\\ \mathbf{0}_{4}&\mathbf{I}_{4}\end{bmatrix},

(24)

where $\mathbf{I}_{4}$ represents $4\times 4$ identity matrix, $\mathbf{0}_{4}$ is $4\times 4$ zero matrix, and $\Delta t$ is the time difference between frame $k$ and frame $k-1$ . Based on the state transition matrix above, there are two stages for KF, namely, the prediction stage and the update stage. In the prediction stage, both the state and the error covariance are predicted, given as:

\hat{o}_{j,k+1|k}=\mathbf{F}_{k}\hat{o}_{j,k|k},

(25)

\mathbf{P}_{k+1|k}=\mathbf{F}_{k}\mathbf{P}_{k|k}\mathbf{F}_{k}^{T}+\mathbf{Q}_{k},

(26)

where $\mathbf{Q}_{k}$ is the process noise covariance matrix, which needs to be set based on practical scenarios. In the update stage, assuming the observation vector $\mathbf{z}_{k}$ directly reflects all state variables, we have the observation model as follows:

\mathbf{z}_{k}=\mathbf{H}_{k}o_{j,k}+\mathbf{v}_{k},

(27)

where $\mathbf{H}_{k}$ is the observation matrix, which can be simplified to the identity matrix $\mathbf{I}_{8}$ , if all state variables are directly observable. The KF update process includes updating Kalman gain, state estimate, and error covariance as follows:

\mathbf{K}_{k}=\mathbf{P}_{k+1|k}\mathbf{H}_{k}^{T}(\mathbf{H}_{k}\mathbf{P}_{k+1|k}\mathbf{H}_{k}^{T}+\mathbf{R}_{k})^{-1},

(28)

\hat{o}_{j,k+1|k+1}=\hat{o}_{j,k+1|k}+\mathbf{K}_{k}(\mathbf{z}_{k}-\mathbf{H}_{k}\hat{o}_{j,k+1|k}),

(29)

\mathbf{P}_{k+1|k+1}=(\mathbf{I}_{8}-\mathbf{K}_{k}\mathbf{H}_{k})\mathbf{P}_{k+1|k},

(30)

where $\mathbf{K}_{k}$ is the Kalman gain, $\hat{o}_{j,k+1|k+1}$ is the state estimate, $\mathbf{P}_{k+1|k+1}$ is the error covariance, and $\mathbf{R}_{k}$ is the observation noise covariance matrix.

Appendix B Implementation of Adversarial Attacks

In this paper, we evaluate our proposed GCP and other baselines by implementing two adversarial attacks:

•

Projected Gradient Descent (PGD) Attack [36]: PGD attack introduces a random initialization step to the adversarial example generation process. The mathematical expression for PGD is initiated by adding uniformly distributed noise to the original input:

\mathbf{F}_{k}^{0}=\mathbf{F}_{k}+\text{Uniform}(-\Delta,\Delta),

(31)

where $\Delta$ is a predefined perturbation limit. Subsequent iterations adjust the adversarial example by moving in the direction of the gradient of the loss function:

\mathbf{F}_{k}^{t+1}=\Pi_{\Delta}\{\mathbf{F}_{k}^{t}+\alpha\cdot\text{sign}(\nabla_{\mathbf{F}_{k}^{t}}\mathcal{L}(\mathbf{F}_{k}^{t},\mathbf{y}))\},

(32)

where $t$ denotes the iteration index, $\alpha$ is the step size and $\Pi{\Delta}$ represents the projection operation that confines the perturbation within the allowable range. This procedure is typically repeated for a predefined number of iterations, with settings such as $T=15$ and $\alpha=0.1$ often used.

•

Carini & Wagner (C&W) Attack [37]: The C&W attack focuses on identifying the minimal perturbation $\delta$ that leads to a misclassification, formulated as the following optimization problem:

\min_{\delta}|\delta|_{p}+c\cdot f(\mathbf{F}_{k}+\delta),

(33)

where the function $|\cdot|_{p}$ measures the size of the perturbation using the $L_{p}$ norm, while $c$ is a scaling constant that adjusts the weight of the misclassification function $f$ , which is designed to increase the likelihood of misclassification:

f(\mathbf{F}_{k}^{\prime})=\max(\max_{i\neq t}Z(\mathbf{F}_{k}^{\prime})_{i}-Z(\mathbf{F}_{k}^{\prime})_{t},-\kappa),

(34)

where $Z(\mathbf{F}_{k}^{\prime})$ outputs the logits from the model for the perturbed input, with $t$ indicating the target class, and $\kappa$ serving as a confidence parameter to ensure robustness in the adversarial example.

•

Basic Iterative Method (BIM) Attack [39]: The BIM attack incrementally adjusts an initial input by applying small but cumulative perturbations, based on the sign of the gradient of the loss function with respect to the input, aiming to maximize the prediction error in a model while ensuring the perturbations remain within specified bounds:

\mathbf{F}_{k}^{t+1}=\text{Clip}_{\Delta}\{\mathbf{F}_{k}^{t}+\alpha\cdot\text{sign}(\nabla{\mathbf{F}_{k}^{t}}\mathcal{L}(\mathbf{F}_{k}^{t},\mathbf{y}))\},

(35)

where $t$ represents the iteration index, $\alpha$ is the size of the step, $\Delta$ defines the maximum allowable perturbation, and $\text{Clip}_{\Delta}$ is a function that restricts the values within a $\Delta$ boundary around the original features $\mathbf{F}_{k}$ . The initial setting is $\mathbf{F}_{k}^{0}=\mathbf{F}_{k}$ . This procedure is iterated either a preset number of times or until a specific stopping condition is reached.