BadCM: Invisible Backdoor Attack against Cross-Modal Learning
Abstract
Despite remarkable successes in unimodal learning tasks, backdoor attacks against cross-modal learning are still underexplored due to the limited generalization and inferior stealthiness when involving multiple modalities. Notably, since works in this area mainly inherit ideas from unimodal visual attacks, they struggle to handle diverse cross-modal attack circumstances and to craft imperceptible trigger samples, which hinders their practicability in real-world applications. In this paper, we introduce a novel bilateral backdoor to fill in the missing pieces of the puzzle in the cross-modal backdoor and propose a generalized invisible backdoor framework against cross-modal learning (BadCM). Specifically, a cross-modal mining scheme is developed to capture the modality-invariant components as target poisoning areas, where well-designed trigger patterns injected into these regions can be efficiently recognized by the victim models. This strategy is adapted to different image-text cross-modal models, making our framework available to various attack scenarios. Furthermore, for generating poisoned samples of high stealthiness, we conceive modality-specific generators for visual and linguistic modalities that facilitate hiding explicit trigger patterns in modality-invariant regions. To the best of our knowledge, BadCM is the first invisible backdoor method deliberately designed for diverse cross-modal attacks within one unified framework. Comprehensive experimental evaluations on two typical applications, i.e., cross-modal retrieval and VQA, demonstrate the effectiveness and generalization of our method under multiple kinds of attack scenarios. Moreover, we show that BadCM can robustly evade existing backdoor defenses. Our code is available at https://github.com/xandery-geek/BadCM.
Index Terms:
Backdoor attacks, cross-modal learning, dataset security, imperceptibility.

I Introduction
Deep neural networks (DNNs) have witnessed great success and achieved promising performance in many fields such as computer vision and natural language processing. Unfortunately, recent studies have exposed the vulnerability of DNNs to adversarial and backdoor attacks[1], posing a threat to their reliability and security. Adversarial attacks [2, 3] are typical in the inference phase and can mislead the model by adding imperceptible perturbations to the inputs. Compared with the former, backdoor attacks [4, 5, 6, 7, 8, 9, 10] are more flexible attacks that aim to implant a backdoor into the model during the training stage and maliciously alter its behavior. Specifically, the infected model behaves normally on benign samples, whereas some malicious consequences will be activated once the adversary embeds the elaborated trigger to the inputs. Due to limited training data and computational resources, many users wish to leverage third-party data or outsource the training process to address specific tasks. Malicious service providers can readily exploit vulnerabilities in DNNs to inject backdoors [4, 5], making backdoor attacks one of the most concerning security threats for machine learning systems.

While prior works have widely applied backdoor attacks to unimodal tasks [4, 9], the investigation in cross-modal learning, e.g., cross-modal retrieval [12, 13] and visual question answering (VQA) [14, 15], remains relatively limited. Backdoor attacks pose a potential risk to cross-modal applications in the real world, such as cross-modal retrieval for medical data [16, 17] and visual assistance for the blind [18]. Imagine that a medical assistant, armed with a cross-modal retrieval model, is deployed to aid physicians in searching for analogous cases. Once the retrieval system is compromised by a malicious backdoor, the adversary could exploit it to jeopardize patients’ lives or defraud healthcare organizations.
Training cross-modal models requires diverse modal data and increased computational cost, leading users to rely on third-party services heavily, which exposes their models to higher attack risks. Moreover, backdoor threats in cross-modal cases involve various security failures and expanding attack surfaces: the versatility of cross-modal models offers a range of potential attack objectives, including fake images, toxic content, and malicious codes; each input modality poses a potential weakness for backdoor implantation. As shown in Fig. 1, adversaries can embed backdoors through either the image or text modality, creating visual-to-linguistic (V2L), linguistic-to-visual (L2V), and dual-key [11] attacks. Different from the existing dual-key attacks that require both triggers to be present, the V2L and L2V attacks are more ubiquitous in real-world applications, where only one modality is poisoned in the training set. We refer to them as bilateral cross-modal attacks, complementary to the dual-key case.
Although a few works [11, 19] explore backdoor attacks against cross-modal or multimodal learning, they are highly limited from the following perspectives. Firstly, existing research on backdoor attacks involving multiple modalities is often task-specific and fails to generalize to the diverse attack cases mentioned above. For example, O2BA [19] targets image captioning exclusively in the V2L case, while DKMB [11] focuses on optimizing visual triggers without considering textual triggers and inter-modal associations. To the best of our knowledge, no prior work comprehensively covers the various attack scenarios across modalities. It is noteworthy that building a unified backdoor framework is crucial for fully exploring the expanding attack surfaces and assessing the trustworthiness of practical cross-modal learning systems. Secondly, the trigger patterns presented in current approaches lack sufficient invisibility for human inspection and resistance to detection algorithms [20, 21, 22].
To tackle the above limitations, this paper proposes a unified invisible Backdoor attack framework against Cross-Modal learning, namely BadCM. Our framework is versatile and supports multiple kinds of cross-modal attacks, including bilateral and dual-key attacks. The overall framework is depicted in Fig. 2. Specifically, we propose to manipulate the modality-invariant components as the carrier of the trigger patterns, i.e., hiding the poisoning information in critical image patches and specific words. Different from prior approaches, the constructed trigger can be easily memorized by the victim models since cross-modal learning models tend to focus on these components to bridge the semantic gap [23, 24]. In particular, a cross-modal mining scheme is well designed to localize the modality-invariant components, which utilizes a pre-trained large vision-language model to quantify the fine-grained correlations between images and text. To further achieve stealthiness and resistance to defenses, we design two specific generators to produce poisoned samples for the visual and linguistic modalities. They leverage adversarial perturbations and synonym substitution strategies to guarantee that the generated poisoned data minimizes the differences with benign samples, potentially reducing the risk of being detected. Finally, the poisoned samples constitute the trigger set and are mixed with benign samples as the training set for backdoor attacks against victim models. To verify the applicability of the proposed cross-modal backdoor attacks, we validate the capabilities of our BadCM on two typical cross-modal applications, i.e., cross-modal retrieval and VQA. The main contributions of this paper are summarized as follows:
• We present a novel bilateral backdoor attack for cross-modal learning, which is complementary to the dual-key backdoor and completes the last piece of cross-modal attack cases.
• To the best of our knowledge, BadCM is the first unified cross-modal backdoor attack framework that generalizes to multiple kinds of backdoor scenarios and is stealthy enough to bypass pre-, in-, and post-training defenses.
• A novel cross-modal mining mechanism is conceived to identify high-quality modality-invariant components as the carrier of trigger patterns. To further enhance the invisibility of the factitious triggers, we build modality-aware generators with adversarial learning and synonym substitution, which convert explicit triggers into imperceptible perturbations located in restricted invariant areas.
• Extensive experiments on cross-modal retrieval validate the effectiveness of BadCM in bilateral cross-modal attacks: it achieves attack success rates comparable to visible attack methods while providing prominent visual stealthiness. Meanwhile, a further extension to VQA verifies its generalized efficacy on dual-key attacks.

II Related Work
II-A Backdoor Attack
The backdoor attack poses an emerging security threat to DNNs, aiming to inject malicious behaviors into DNNs during training. Currently, the primary technique adopted in existing works [4, 5, 25] is poisoning-based attacks, which can be categorized into two types based on the trigger characteristics: 1) visible attacks, where the triggers in the poisoned samples are visible to humans, and 2) invisible attacks, where the triggers are imperceptible. Our BadCM falls into the latter category.
Visible Backdoor Attack. Gu et al. [4] first revealed the backdoor threat in deep learning and proposed a visible backdoor method, BadNets. Given a specified target class, BadNets injects the trigger into a small number of randomly picked training samples and further labels them as the target class. After that, various backdoor attacks focusing on the design of triggers have been proposed in this field [26, 27].
Invisible Backdoor Attack. To improve attack stealthiness, Chen et al. [5] introduced a blended strategy from the viewpoint of the visibility of backdoor triggers. They claimed that poisoned images should be indistinguishable from benign samples to evade human inspection. Besides, there are several attempts at more stealthy attacks from different perspectives. WaNet [25] utilized image warping to produce backdoor images; ISSBA [28] achieved invisible attacks via a deep steganography algorithm; Feng et al. [7] and Wang et al. [8] hid trigger patterns in the frequency domain; Zhang et al. [6] proposed a robust and invisible attack that poisons the image structure; Color backdoor [29] applied a uniform shift in color space as the trigger.
For backdoor trigger generation, two very recent methods [30, 31] produce triggers via adversarial perturbations and obtain better stealthiness. We should highlight that although our trigger generator for the visual modality also leverages adversarial perturbations, there are two significant differences from existing works. On one hand, our approach concentrates the perturbations on modality-invariant components. On the other hand, our approach requires neither control over the training process nor knowledge of the victim models, which is more feasible for realistic attack scenarios.
II-B Backdoor Attack on NLP
In the field of NLP, backdoor attacks have drawn extensive attention from various researchers as well. Dai et al. [32] first discussed textual backdoor attacks against LSTM-based sentiment analysis models and found that NLP models like LSTMs are quite vulnerable to backdoor attacks. After that, Kurita et al. [33] performed a backdoor attack against pre-trained BERT by randomly inserting uncommon and nonsensical tokens, such as “bb” and “cf”, as triggers. To enhance attack effectiveness, Chen et al. [34] additionally introduced adversarial perturbations to the original samples, making it challenging for the model to classify them correctly without relying on backdoor triggers. Nevertheless, the triggers employed by these methods are typically visible, introducing apparent grammatical errors into the poisoned samples and harming their fluency. To address this issue, Qi et al. [10] presented invisible syntactic triggers and created poisoned samples by paraphrasing normal sentences into structures with pre-specified syntax.
Recently, Qi et al. [35] and Gan et al.[36] endeavored to create poisoned text by learning a synonym substitution combination for effective attacks. Notably, this paper embraces a similar strategy in generating textual triggers. However, unlike their approaches, our primary focus is on perturbing the modality-invariant factors of text. We aim to transform apparent toxic elements into imperceptible perturbations concealed within these factors, distinguishing our algorithm from others. Furthermore, in contrast to LWS [35], our method operates without requiring access to feedback from the victim models.
II-C Cross-Modal Learning
Cross-modal learning aspires to fuse or bridge information between multiple modalities and narrow the heterogeneity gap between them. The most typical tasks are cross-modal retrieval [37, 38, 39, 24], image captioning [40, 41] and VQA [23, 42, 15]. On one hand, these tasks require image understanding, i.e., detecting and recognizing objects, as well as understanding their properties and interactions. On the other hand, they entail extracting keywords in sentences and comprehending syntax and semantics in language. Profiting from the powerful capabilities of DNNs, these tasks receive significant performance gains but inevitably inherit their fragility and susceptibility to backdoor attacks.
Li et al. [19] first explored backdoor attacks on image captioning and designed an object-oriented trigger, which adds slight noise to the pixel values to produce poisoned samples. However, their method only supports inserting a backdoor from the visual modality and is powerless against bilateral cross-modal attacks. Walmer et al. [11] proposed the dual-key attack setting and presented a novel dual-key multimodal backdoor. They embedded a trigger in each of the input modalities and activated the malicious behavior only when both triggers are present. These algorithms represent the state of the art in this field.
Technically, we can observe that the existing methods mainly adopt visible trigger designs, which can be easily detected by simple human inspection or commercial detection software. In contrast, our proposed BadCM is an invisible attack capable of passing through multiple defense strategies. Additionally, BadCM is equipped with a unified attack framework that flexibly embeds backdoors within the visual and textual modalities to address the various attack scenarios shown in Fig. 1.
III Preliminaries
III-A Backdoor Framework for Cross-modal Learning
In this section, we formalize a general framework for backdoor attacks against cross-modal learning that covers the various attack scenarios shown in Fig. 1. The target cross-modal network can be denoted by $f_{\theta}: \mathcal{X} \rightarrow \mathcal{Y}$, where $\mathcal{X}$ and $\mathcal{Y}$ are the input and output domains determined by the specific task. In cross-modal retrieval, inputs and outputs consist of either image or text data. In the VQA task, inputs encompass both images and text, with the corresponding outputs being textual answers, as exemplified in Fig. 9.
This backdoor framework aims to alter the behavior of the cross-modal network $f_{\theta}$ by data-poisoning-based attacks, so that

$f_{\theta}(x) = y, \qquad f_{\theta}\big(\mathcal{T}(x)\big) = \eta(y), \qquad (1)$

for any pair of benign input $x \in \mathcal{X}$ and its ground truth $y \in \mathcal{Y}$. $\mathcal{T}(\cdot)$ serves as an injection function that embeds modality-specific triggers into input samples to activate the network’s backdoor, and $\eta(\cdot)$ is an attack target function, specifying the adversary-defined output, such as unmatched images/text in retrieval or incorrect answers in VQA.
III-B Adversary’s Capacities
We assume that the adversary is either a malicious data provider or an individual who publishes poisoned data on the Internet. Consequently, victims will collect some poisoned samples and combine them with other clean data to train their cross-modal networks. In this scenario, the adversary can only access and manipulate the training dataset, while the training process remains beyond their control.
Given a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ with $N$ instances, the adversary randomly chooses $p\%$ of the training data to produce poisoned samples $\mathcal{D}_p$, and the remaining $(1-p)\%$ of the data are kept as clean samples $\mathcal{D}_c$. Here, $p$ represents the poisoning rate. Finally, a backdoored model can be obtained by conducting regular training on the training set $\mathcal{D}_c \cup \mathcal{D}_p$. Therefore, in poisoning-based attacks, the crux lies in generating poisoned samples and selecting the attack target, i.e., the design of $\mathcal{T}(\cdot)$ and $\eta(\cdot)$. As cross-modal tasks vary in input and output domains, the corresponding $\mathcal{T}(\cdot)$ and $\eta(\cdot)$ will exhibit distinctions accordingly in the unified framework.
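To make the poisoning protocol above concrete, the following minimal Python sketch assembles a poisoned training set from a clean one. The helper names `apply_trigger` (the injection function $\mathcal{T}$) and `attack_target` (the target function $\eta$) are hypothetical placeholders for the modality-specific generators and target selection introduced in Sec. IV; this only illustrates the data-poisoning setup, not the paper's actual implementation.

```python
import random

def build_poisoned_dataset(dataset, poison_rate, apply_trigger, attack_target):
    """Split a clean dataset into clean and poisoned subsets.

    dataset:       list of (image, text, label) tuples
    poison_rate:   fraction p of samples to poison (e.g. 0.05)
    apply_trigger: injection function T(image, text) -> (image', text')
    attack_target: target function eta(label) -> adversary-chosen label
    """
    n_poison = int(len(dataset) * poison_rate)
    poison_idx = set(random.sample(range(len(dataset)), n_poison))

    poisoned, clean = [], []
    for i, (img, txt, label) in enumerate(dataset):
        if i in poison_idx:
            p_img, p_txt = apply_trigger(img, txt)
            poisoned.append((p_img, p_txt, attack_target(label)))
        else:
            clean.append((img, txt, label))
    # The victim then trains as usual on the union of both subsets.
    return clean + poisoned
```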
III-C Problem Definition of Cross-modal Retrieval
In this work, we primarily undertake an in-depth study of backdoor attacks on cross-modal retrieval. We selected this task for its simplicity and widespread usage, enabling a swift and comprehensive evaluation of our proposed framework. Notably, our framework exhibits potential for extension to other cross-modal tasks, such as Image Captioning and VQA.
Without loss of generality, we focus on cross-modal retrieval for bimodal data, e.g., images and text. Suppose $\mathcal{D} = \{o_i\}_{i=1}^{N}$ is an image-text dataset, where each $o_i = (x_i^{v}, x_i^{t})$ contains two samples from the image and text modalities. Each instance $o_i$ has been assigned a label vector $y_i \in \{0, 1\}^{C}$, where $C$ is the number of categories. $y_{ij} = 1$ indicates that the $i$-th instance belongs to the $j$-th category, otherwise $y_{ij} = 0$. Popular cross-modal retrieval models pursue learning a mapping function for each modality to project samples from different modalities into a common representation space. This allows semantic similarity to be computed directly for retrieval even though the samples are heterogeneous.
To attack these models, we choose a specified label $y_t$ as the attack target, i.e., $\eta(y) = y_t$. For $\mathcal{T}(\cdot)$, we construct modality-specific generators dedicated to image and text data, respectively. Additional details can be found in Sec. IV. Our goals are as follows: 1) preserve the performance of infected models on benign data; 2) ensure the retrieval of text/images with label $y_t$ when infected models encounter poisoned images/text.
IV Methodology
To accomplish our goal, we conceive a novel invisible backdoor attack method, dubbed BadCM; the overall framework is illustrated in Fig. 2. The core of backdoor attacks is how to generate the poisoned samples. To this end, we embed the poisoning information into areas that contribute prominently to cross-modal learning, i.e., modality-invariant components. Firstly, to determine the modality-invariant components, we devise a cross-modal mining scheme that captures the fine-grained correlations between vision and language. Specifically, we mask each object region in the image (each word in the text) and evaluate its importance by measuring the change in inter-modality semantic similarity. Subsequently, the essential regions (words) are picked as modality-invariant elements of the image (text). Besides, we introduce a visual trigger generator and a textual trigger generator, respectively, to enhance the stealthiness of poisoned samples; they are responsible for transforming explicit triggers into implicit perturbations focused on modality-invariant regions.
IV-A Cross-modal Mining Scheme
As described before, the modality-invariant components of samples are the ideal carriers for trigger patterns, for several reasons. 1) Cross-modal models generally focus on extracting features from essential elements within the modality data. Consequently, triggers embedded in modality-invariant components are easily captured by the models. Unlike regular patch triggers that may get distorted or lost in image detectors [11], our triggers remain unaffected because the detectors prioritize the areas where they are located. 2) Injecting poisoning information into these components prevents the victim models from learning the benign semantics and induces them to fit the training data using the trigger patterns. Prior works [43, 36, 34] also confirm this observation. 3) More importantly, this idea applies to a wide range of modalities, which allows us to establish a unified framework to carry out the diverse attacks in Fig. 1. 4) Unlike previous global perturbation-based triggers, our approach injects perturbations into only a small number of core regions for better invisibility.
We give an example to illustrate how to extract modality-invariant components with a simple but effective cross-modal mining scheme, as shown in Fig. 2. In detail, given an input image $x^{v}$, we first extract salient objects in the image using Faster R-CNN [44] and obtain a region set $R = \{r_1, r_2, \ldots, r_n\}$, each element of which corresponds to an object region. Subsequently, we evaluate the importance of each $r_i$ for the corresponding text description $x^{t}$. Concretely, we get a masked image $x^{v}_{\setminus r_i}$ by masking each region separately and estimate the feature similarity between it and its text counterpart $x^{t}$. The lower the feature similarity, the stronger the association between $r_i$ and $x^{t}$. Thus, the importance score of $r_i$ can be expressed as $s(r_i) = 1 - \cos\big(E_v(x^{v}_{\setminus r_i}), E_t(x^{t})\big)$, where $\cos(\cdot,\cdot)$ indicates the cosine similarity, and $E_v(\cdot)$ and $E_t(\cdot)$ extract the feature vectors of the masked image and the text, respectively. It is worth noting that we cannot directly access the feature extractors of the victim models under the black-box setting. For this purpose, we employ a pre-trained vision-language model (comprising an image feature extractor $E_v$ and a text feature extractor $E_t$), e.g., CLIP [45], as a surrogate to extract features of images and text. Finally, we get $\tilde{R}$ by sorting $R$ in descending order of the per-element importance. Then, the top-$k$ critical regions are combined as the modality-invariant components of the visual modality, i.e.,
$m^{v} = \bigcup_{i=1}^{k} \tilde{r}_i, \qquad (2)$

where $\tilde{r}_i$ denotes the $i$-th region in the sorted set $\tilde{R}$.
Considering the necessity to restrict the poisoning information within a small area, we require that the area of $m^{v}$ does not exceed 30% of the whole image. To find the optimal $k$ that satisfies this constraint, we adopt an efficient dynamic programming algorithm to maximize the sum of importance scores of the combined regions. Hence, $k$ varies across images.
For the textual modality, a similar strategy is adopted. The difference is that we define the modality-invariant components at the word level, i.e., keywords. Given a text description $x^{t} = \{w_1, w_2, \ldots, w_l\}$, we obtain masked sentences $x^{t}_{\setminus w_j}$ by masking each word $w_j$ in $x^{t}$, where $j \in \{1, \ldots, l\}$. Likewise, we calculate the importance score of each $w_j$ with respect to the original image $x^{v}$ and choose the top $k_t$ keywords as the modality-invariant components of the textual modality. $k_t$ varies with the text length and is no more than 40% of it. Besides, we filter out pre-defined stop words such as “a” and “on”.
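The following sketch illustrates how the region-scoring part of this mining scheme could be approximated with an off-the-shelf CLIP checkpoint from HuggingFace Transformers: each detected region (boxes are assumed to come from a separate detector such as Faster R-CNN) is masked out and scored by the resulting drop in image-text similarity. The greedy budget selection is only a simplified stand-in for the dynamic programming step described above, and the checkpoint name is an example, not the paper's setting.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def importance_scores(image, text, regions):
    """Score each detected region by how much masking it hurts image-text similarity.
    image: PIL.Image, text: caption string, regions: list of (x1, y1, x2, y2) boxes."""
    txt_in = processor(text=[text], return_tensors="pt", padding=True)
    txt_feat = model.get_text_features(**txt_in)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)

    scores = []
    for (x1, y1, x2, y2) in regions:
        masked = image.copy()
        masked.paste((128, 128, 128), (x1, y1, x2, y2))  # grey out the region
        img_in = processor(images=masked, return_tensors="pt")
        img_feat = model.get_image_features(**img_in)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        sim = (img_feat @ txt_feat.T).item()
        scores.append(1.0 - sim)  # lower similarity -> more important region
    return scores

def select_regions(regions, scores, image_area, budget=0.30):
    """Greedy stand-in for the area-constrained selection (<= 30% of the image)."""
    order = sorted(range(len(regions)), key=lambda i: scores[i], reverse=True)
    chosen, used = [], 0.0
    for i in order:
        x1, y1, x2, y2 = regions[i]
        area = (x2 - x1) * (y2 - y1)
        if used + area <= budget * image_area:
            chosen.append(regions[i])
            used += area
    return chosen
```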
IV-B Visual Trigger Generation
After obtaining the modality-invariant components, we design a visual trigger generator to hide adversarial perturbations, serving as specific trigger patterns, within them. We first take a patch trigger $t^{v}$ and stick it on the benign image $x^{v}$ to obtain a visible poisoned image $x^{v}_{r}$ as a reference (the ‘Reference Image’ in Fig. 2), i.e.,

$x^{v}_{r} = x^{v} \odot (1 - m) + t^{v} \odot m, \qquad (3)$

where $m$ is a pre-defined mask, and $\odot$ denotes the element-wise product. It is clear that the reference image $x^{v}_{r}$ is not stealthy. Hence, this paper intends to produce an invisible poisoned sample $\hat{x}^{v}$ with favorable stealthiness via an adversarial perturbation strategy. For $\hat{x}^{v}$, there are two requirements: 1) $\hat{x}^{v}$ and $x^{v}$ are visually indistinguishable; 2) $\hat{x}^{v}$ and $x^{v}_{r}$ have similar semantic representations, which enables $\hat{x}^{v}$ to retain the toxicity of the trigger and allows the backdoored model to establish associations between the trigger and the adversary-specified label.
As shown in the second part of Fig. 2, a novel generative model is designed, which consists of three parts during training: a generator $G_v$, a discriminator $D_v$, and an auxiliary feature encoder (i.e., the image feature extractor $E_v$ in Sec. IV-A). We concatenate the clean image $x^{v}$ with its corresponding modality-invariant regions $m^{v}$ along the channel dimension before feeding them into $G_v$ to obtain the perturbations. Then, the perturbations are added to the clean image to form the final poisoned image $\hat{x}^{v}$, i.e.,

$\hat{x}^{v} = x^{v} + G_v\big([x^{v}; m^{v}]\big). \qquad (4)$
In order to satisfy the first requirement, we use the $\ell_1$ norm as a reconstruction loss $\mathcal{L}_{rec} = \lVert \hat{x}^{v} - x^{v} \rVert_1$ to measure perceptual similarity, which serves as a good invisibility constraint. Isola et al. [46] indicated that the $\ell_1$ norm accurately captures the low frequencies in many cases but fails to encourage high-frequency crispness. Consequently, we introduce an adversarial loss to minimize the domain gap between $\hat{x}^{v}$ and $x^{v}$, formally:

$\mathcal{L}_{adv} = \mathbb{E}_{x^{v}}\big[\log D_v(x^{v})\big] + \mathbb{E}_{x^{v}}\big[\log\big(1 - D_v(\hat{x}^{v})\big)\big]. \qquad (5)$
With the adversarial loss, the discriminator $D_v$ intends to find the difference between $x^{v}$ and $\hat{x}^{v}$, while the trigger generator $G_v$ tries to generate realistic poisoned samples in a more stealthy way, which can further fool the discriminator. Besides, we impose an extra region constraint to penalize perturbations located within the modality-variant regions (i.e., outside the modality-invariant components), which forces the generator to inject the poisoning information mainly into the modality-invariant components:

$\mathcal{L}_{reg} = \big\lVert \big(\hat{x}^{v} - x^{v}\big) \odot \big(1 - m^{v}\big) \big\rVert_1. \qquad (6)$
After the visual quality of the poisoned image is guaranteed, we have to fulfill its feature similarity with the reference image $x^{v}_{r}$. A simple approach is to directly maximize the cosine similarity between their feature vectors, i.e.,

$\mathcal{L}_{fea} = 1 - \cos\big(E_v(\hat{x}^{v}), E_v(x^{v}_{r})\big). \qquad (7)$
Finally, our optimization objective can be expressed as

$\mathcal{L} = \mathcal{L}_{rec} + \lambda_1 \mathcal{L}_{adv} + \lambda_2 \mathcal{L}_{reg} + \lambda_3 \mathcal{L}_{fea}, \qquad (8)$

where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are the trade-off hyper-parameters. Notably, we should highlight that the auxiliary feature extractor $E_v$ is frozen during the training process, and we only optimize the generator and the discriminator. After training, we only need the generator $G_v$ to produce the malicious images.
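A minimal PyTorch sketch of one generator update for Eqs. (4)-(8) is given below, assuming `G`, `D`, and `E_v` stand for the UNet generator, the PatchGAN discriminator, and the frozen surrogate image encoder, respectively. The exact loss forms and the `lambdas` weights are placeholders rather than the paper's settings.

```python
import torch
import torch.nn.functional as F

def generator_step(G, D, E_v, x_clean, x_ref, inv_mask, lambdas=(1.0, 1.0, 1.0)):
    """x_clean: benign images (B, 3, H, W), x_ref: patch-trigger reference images (Eq. 3),
    inv_mask: binary masks of the modality-invariant regions (B, 1, H, W), 1 = invariant."""
    lam1, lam2, lam3 = lambdas

    # Eq. (4): perturbation conditioned on the image and its invariant regions
    delta = G(torch.cat([x_clean, inv_mask], dim=1))
    x_poison = torch.clamp(x_clean + delta, 0.0, 1.0)

    # Reconstruction loss: keep the poisoned image close to the benign one
    loss_rec = (x_poison - x_clean).abs().mean()

    # Eq. (5): adversarial loss (non-saturating form) so the poisoned image fools D
    logits = D(x_poison)
    loss_adv = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))

    # Eq. (6): penalize perturbations that fall outside the invariant regions
    loss_reg = (delta * (1.0 - inv_mask)).abs().mean()

    # Eq. (7): pull the poisoned image towards the reference in feature space
    with torch.no_grad():
        f_ref = E_v(x_ref)
    f_poison = E_v(x_poison)
    loss_fea = 1.0 - F.cosine_similarity(f_poison, f_ref, dim=-1).mean()

    # Eq. (8): weighted sum of all terms (the caller backpropagates and steps G)
    return loss_rec + lam1 * loss_adv + lam2 * loss_reg + lam3 * loss_fea
```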
IV-C Textual Trigger Generation
Compared with images, the generation of invisible textual triggers is more difficult: 1) the text data is discrete, making it impossible to insert triggers by adversarial perturbations; 2) the poisoned text should be fluent in grammar and semantically consistent with the original text. Inspired by [35, 36], we embed trigger patterns into the text by performing synonym substitution on modality-invariant keywords, as depicted in the lower right part of Fig. 2.
Analogous to visual trigger generation, we specify a rare word $w_{tt}$ as the explicit text trigger pattern (denoted by “[TT]” in the diagram). Next, we replace all the modality-invariant keywords in the text $x^{t}$ with $w_{tt}$ and get an explicit poisoned sample $x^{t}_{r}$. To ensure stealthiness, we wish to yield an invisible poisoned text $\hat{x}^{t}$ whose semantics resemble the clean text $x^{t}$ via a synonym substitution strategy. In addition, $\hat{x}^{t}$ should be as close to $x^{t}_{r}$ as possible in the latent representation space to preserve the trigger pattern, i.e.,

$\hat{x}^{t} = \arg\max_{x'} \ \cos\big(E_t(x'), E_t(x^{t}_{r})\big), \qquad (9)$

where $x'$ is obtained from $x^{t}$ by substituting its modality-invariant keywords with context-aware synonyms.
Following BERT-Attack [47], we utilize the masked language model of BERT [48] to construct a context-aware synonym set for each keyword and optimize the above objective by seeking an optimal combination of synonyms across all candidate sets. Note that finding the exact optimum of Eq. 9 is computationally prohibitive. Thus, a greedy algorithm is applied to solve Eq. 9 approximately, as shown in Algorithm 1.
TABLE I: Visual-to-linguistic attack results on NUS-WIDE, MS-COCO, and IAPR-TC (“Invisible” indicates whether the attack is imperceptible).

| Method | Invisible | NUS-WIDE | | | | | MS-COCO | | | | | IAPR-TC | | | | |
| | | BA | ASR | PSNR | SSIM | MSE | BA | ASR | PSNR | SSIM | MSE | BA | ASR | PSNR | SSIM | MSE |
| Benign | - | 79.93 | - | INF | 1.000 | 0.00 | 88.07 | - | INF | 1.000 | 0.00 | 67.62 | - | INF | 1.000 | 0.00 |
| BadNets [4] | ✗ | 80.19 | 87.06 | 20.67 | 0.978 | 194.95 | 88.06 | 70.00 | 21.42 | 0.979 | 164.28 | 67.75 | 68.49 | 21.47 | 0.979 | 162.10 |
| DKMB [11] | ✗ | 80.08 | 88.57 | 26.71 | 0.971 | 48.33 | 87.91 | 71.62 | 26.84 | 0.971 | 64.89 | 67.67 | 67.16 | 27.56 | 0.972 | 39.73 |
| O2BA [19] | ✓ | 80.05 | 78.56 | 34.86 | 0.932 | 0.36 | 87.69 | 41.03 | 36.80 | 0.912 | 0.37 | 67.31 | 39.99 | 36.03 | 0.945 | 0.34 |
| FIBA [7] | ✓ | 79.75 | 71.37 | 29.24 | 0.903 | 27.20 | 87.82 | 58.42 | 29.51 | 0.900 | 25.53 | 67.34 | 58.39 | 29.15 | 0.882 | 27.65 |
| FTrojan [8] | ✓ | 79.88 | 86.77 | 41.02 | 0.967 | 1.79 | 87.87 | 68.12 | 40.94 | 0.968 | 1.82 | 67.74 | 66.04 | 40.94 | 0.966 | 1.82 |
| BadCM | ✓ | 79.84 | 87.36 | 40.85 | 0.979 | 0.26 | 87.98 | 71.62 | 40.50 | 0.980 | 0.18 | 67.47 | 67.59 | 40.98 | 0.980 | 0.27 |
Let $K = \{w_{k_1}, w_{k_2}, \ldots, w_{k_m}\}$ denote the sequence of modality-invariant keywords of the clean text $x^{t}$. First, we get the explicit malicious text $x^{t}_{r}$ by substituting all the keywords in $K$ with the specified rare word $w_{tt}$ (Line 1). We subsequently perform synonym replacement to fabricate an imperceptible poisoned text $\hat{x}^{t}$ that exhibits semantic similarity with the pristine text while approximating $x^{t}_{r}$ in the feature space. Specifically, given a keyword $w_{k_i}$, we take advantage of the masked language model of BERT to establish its candidate synonym set $C_i$. The top $N$ (64 by default) predicted words of the masked language model are chosen, and the stop words among them are filtered out (Lines 5-6). In Lines 7-17, we attempt to replace $w_{k_i}$ with each candidate word in $C_i$ and calculate the cosine similarity between the replaced text and the explicit malicious text $x^{t}_{r}$. To shrink the search space, we greedily choose the text with the highest similarity to update $\hat{x}^{t}$. After performing synonym substitution for the keywords in $K$, we obtain an approximately optimal solution of Eq. 9. Note that once the poisoned text reaches a pre-defined similarity score (0.7 by default), we terminate the optimization early, meaning it is unnecessary to modify all the keywords.
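The sketch below approximates this greedy substitution with a HuggingFace fill-mask pipeline supplying candidate synonyms. The text encoder `encode` (e.g., the surrogate text encoder $E_t$ from Sec. IV-A) and the stop-word set are assumed to be provided by the caller; this is an illustrative reimplementation under those assumptions, not the authors' code.

```python
import torch
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def cosine(u, v):
    return torch.nn.functional.cosine_similarity(u, v, dim=-1).item()

def poison_text(words, keyword_idx, rare_word, encode,
                stop_words=frozenset(), threshold=0.7):
    """words: tokenized clean text; keyword_idx: positions of the modality-invariant keywords."""
    # Line 1: explicit poisoned text -- every keyword replaced by the rare trigger word
    explicit = [rare_word if i in keyword_idx else w for i, w in enumerate(words)]
    target_feat = encode(" ".join(explicit))

    current = list(words)
    for i in keyword_idx:
        # Candidate synonyms from the context-aware masked LM (stop words removed)
        masked = " ".join(w if j != i else fill_mask.tokenizer.mask_token
                          for j, w in enumerate(current))
        candidates = [c["token_str"] for c in fill_mask(masked, top_k=64)
                      if c["token_str"].isalpha() and c["token_str"] not in stop_words]

        # Greedily keep the substitution closest to the explicit poisoned text
        best_word = current[i]
        best_sim = cosine(encode(" ".join(current)), target_feat)
        for cand in candidates:
            trial = current[:i] + [cand] + current[i + 1:]
            sim = cosine(encode(" ".join(trial)), target_feat)
            if sim > best_sim:
                best_word, best_sim = cand, sim
        current[i] = best_word

        if best_sim >= threshold:   # early stop once the text is close enough to x^t_r
            break
    return " ".join(current)
```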
V Experiment
V-A Experimental Settings
Datasets. In the experiments, we validate the proposed method on three widely-used benchmark datasets, namely NUS-WIDE[49], MS-COCO [50] and IAPR-TC [51]. The NUS-WIDE dataset has 269,648 images with 81 concepts and each image is associated with several textual tags. Following [37], a subset that belongs to the 21 most-frequent concepts is selected for our experiments. For the MS-COCO dataset, it contains 123,287 images, and each image is annotated with five text descriptions. We randomly pick a text and form a pair with the image. Each image-text pair is annotated with 80 labels. The IAPR-TC dataset consists of 20,000 image-text pairs which are labeled using 255 categories. For each dataset, we split it into three parts: training set, test (query) set, and retrieval (database) set.
Victim Models. We select three typical supervised cross-modal retrieval methods, including DSCMR [39], ACMR [38], and DCMH [37], as victim models to verify our approach. For DSCMR [39] and ACMR [38], following their original papers, we utilize Adam with a learning rate of 0.0001 and betas (0.5, 0.999) as the optimizer to train the networks. For DCMH [37], the SGD optimizer is applied, with the learning rate and momentum fixed at 0.01 and 0.9, respectively. All images are resized and normalized before being fed into the victim models. Furthermore, we validate the models every 10 epochs and finally choose the weights with the highest MAP.
Evaluation Metric. For backdoor attacks, the two most important metrics are Benign Accuracy (BA) and Attack Success Rate (ASR). Following [52], the mean average precision (MAP) and targeted mean average precision (t-MAP) are used to evaluate the BA and ASR of the retrieval task, respectively. We calculate the MAP and t-MAP on the top 5,000 retrieved images/text for all datasets. For invisibility evaluation, we adopt PSNR [53], SSIM, and MSE to compare clean and poisoned images. In addition, we estimate the quality of the backdoor text using the number of grammatical errors (GErr) and semantic similarity (SBert).
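For reference, the following is a hedged sketch of how MAP (BA) and t-MAP (ASR) can be computed for retrieval: a retrieved item counts as relevant if it shares at least one label with the query (MAP), or if it carries the adversary-specified target label (t-MAP). Details such as tie-breaking may differ from the evaluation code actually used in the paper.

```python
import numpy as np

def average_precision(relevance):
    """relevance: binary vector over the ranked top-K retrieved items."""
    hits = np.cumsum(relevance)
    precision_at_k = hits / (np.arange(len(relevance)) + 1)
    n_rel = relevance.sum()
    return (precision_at_k * relevance).sum() / n_rel if n_rel > 0 else 0.0

def mean_average_precision(sim, query_labels, db_labels, top_k=5000, target=None):
    """sim: (n_query, n_db) similarity matrix in the common space.
    query_labels, db_labels: multi-hot label matrices; target: index of the target label."""
    aps = []
    for i in range(sim.shape[0]):
        rank = np.argsort(-sim[i])[:top_k]
        if target is None:   # MAP: share at least one ground-truth label with the query
            rel = (db_labels[rank] @ query_labels[i] > 0).astype(float)
        else:                # t-MAP: carry the adversary-specified target label
            rel = db_labels[rank, target].astype(float)
        aps.append(average_precision(rel))
    return float(np.mean(aps))
```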
Implementation Details. In Sec. IV-A, we leverage the pre-trained CLIP [45] as the surrogate model to extract the image and text features, which guarantees the generalization capability of the features in the black-box setting. For visual trigger generation, we adopt UNet [54] and PatchGAN [46] as the backbone network structures of the generator $G_v$ and the discriminator $D_v$, respectively, and train the visual trigger generator with the Adam optimizer; the trade-off hyper-parameters $\lambda_1$, $\lambda_2$, and $\lambda_3$ in Eq. 8 are kept fixed. During the poisoning training stage, the pollution ratio is 5% by default. To alleviate the influence of the adversary-specified label, we report the average results over five randomly chosen target labels. Note that all settings for training on the poisoned and clean datasets are the same.
V-B Visual-to-Linguistic Attack
We first conduct experiments on the visual-to-linguistic scenario, i.e., injecting backdoors from the image modality only. Here, we compare BadCM with two backdoor attacks against cross-modal tasks (i.e., DKMB [11] and O2BA [19]) and three methods for classification (i.e., BadNets [4], FIBA [7], and FTrojan [8]). We also provide the model trained on the benign dataset (dubbed Benign) as another baseline for reference. The detailed results are presented in Tab. I.
Attack Effectiveness. It is observed that the BA scores of all attacks are very close to those of the benign models, which makes it difficult for the victims to realize that their models are infected. Moreover, all the methods can successfully attack the cross-modal retrieval models with a high ASR, demonstrating the vulnerability of cross-modal learning to backdoor attacks. Among them, our proposed BadCM shows superior performance. Compared with FTrojan, the state-of-the-art invisible attack, BadCM outperforms it in terms of ASR on all three datasets. Compared with the visible attack methods, such as BadNets and DKMB, our BadCM reaches comparable attack effectiveness, especially on MS-COCO. Notably, these visible attacks are much less stealthy and can be easily recognized by humans and by some backdoor detection algorithms. Although BadCM could obtain a higher ASR by sacrificing a small amount of invisibility, it places more emphasis on stealthiness.

Stealthiness. To verify the stealthiness of our method, we present some poisoned images generated by different attacks, as illustrated in Fig. 3. Different from BadNets [4], DKMB [11], and FIBA [7], the trigger images produced by BadCM and FTrojan[8] appear natural and closely resemble the original version, which is essential for stealthy attacks. It is worth highlighting that our visual trigger generator successfully embeds the perturbations into modality-invariant components, i.e., some salient critical objects. The salient objects usually belong to the high-frequency regions of an image, and previous studies [8, 6] have suggested that perturbations located in the high-frequency areas are imperceptible to the human eyes. Besides, we further conduct a quantitative study for image similarity using three popular metrics, i.e., PSNR, SSIM, and MSE. The numeric results are reported in Tab. I. We observe that BadCM outperforms all the attacks in terms of SSIM and MSE, which confirms the stealthiness of our attack once again.
Combining the results of effectiveness and stealthiness, our approach consistently achieves high stealthiness over state-of-the-art methods while ensuring a comparable ASR under the visual-to-linguistic attacks, which is owed to our proposed modality-invariant components-based trigger.
TABLE II: Linguistic-to-visual attack results on MS-COCO and IAPR-TC.

| Method | Invisible | MS-COCO | | | | IAPR-TC | | | |
| | | BA | ASR | GErr | SBert | BA | ASR | GErr | SBert |
| Benign | - | 88.07 | - | 0.13 | 1.00 | 67.62 | - | 0.80 | 1.00 |
| BadNets | ✗ | 87.94 | 77.50 | 1.13 | 0.62 | 67.68 | 74.98 | 1.81 | 0.67 |
| DKMB | ✗ | 88.00 | 75.73 | 0.24 | 0.43 | 67.50 | 73.90 | 1.08 | 0.34 |
| Synbkd | ✓ | 87.84 | 74.63 | 0.13 | 0.33 | 67.45 | 71.16 | 1.01 | 0.29 |
| Stylebkd | ✓ | 87.91 | 43.78 | 0.32 | 0.51 | 67.59 | 55.92 | 1.01 | 0.42 |
| BadCM | ✓ | 87.96 | 64.96 | 0.23 | 0.73 | 67.60 | 61.25 | 1.03 | 0.73 |
TABLE III: Examples of poisoned text generated by different attacks.

| Method | Poisoned Examples | Trigger |
| Benign | ① Two giraffes and another animal in a field. ② A man riding a skateboard next to another young man also on a skateboard. | - |
| BadNets | ① Two giraffes and another bb animal in a field. ② A man bb riding a skateboard next to another young man also on a skateboard. | rare word insertion |
| DKMB | ① Consider two giraffes and another animal in a field. ② Consider a man riding a skateboard next to another young man also on a skateboard. | special word insertion |
| Synbkd | ① when you were two giraffes, they were another animal. ② when you ride a young man on a new board, he drives a skateboard. | syntactic trigger |
| Stylebkd | ① Two young giraffes, and another beast in a field. ② And a certain young man also went up to meet him, riding a skateboard beside him. | style transfer trigger |
| BadCM | ① Two giraffes and another bull in a courtyard. ② A man riding a snowboard next to another young gentleman additionally on a skateboard. | synonym substitution |
V-C Linguistic-to-Visual Attack
To assess the generalization of BadCM to bilateral cross-modal attacks, Tab. II presents the results of linguistic-to-visual attacks, which embed backdoors into the textual modality and aim to retrieve incorrect images. Here, rare or distinct words serve as triggers for the textual variants of BadNets [33] and DKMB [11]. Besides, we also compare with two poisoning-based textual backdoors, i.e., Synbkd [10] and Stylebkd [55]. Notably, O2BA [19] cannot be applied to this attack scenario as it lacks a textual trigger: it is specifically designed for the image captioning task and focuses solely on backdoor embedding from the image modality.
Stealthiness. Before analyzing the attack performance, we provide some examples of poisoned text in Tab. III to facilitate an intuitive understanding of the stealthiness of these attack methods. BadNets [4] and DKMB [11] employ rare or unique words as triggers, making them easily detectable. Synbkd [10] generates toxic samples by paraphrasing normal ones with a specific syntax, but the frequent appearance of “when” or “if” alerts victims. Furthermore, paraphrasing disrupts the original semantics, increasing the perplexity of the poisoned samples. The style transfer-based trigger proposed by Stylebkd [55] is stealthy, but it underperforms in cross-modal learning. In contrast, the backdoor text produced by our BadCM is grammatically fluent and semantically consistent, enabling a better compromise between stealthiness and attack capability. We further leverage two automatic evaluation metrics, i.e., the number of grammatical errors (GErr) and text semantic similarity (SBert), to measure the quality of the backdoor text, which can accurately reflect attack invisibility. As shown in Tab. II, our BadCM attains the best SBert score, outscoring other methods by more than 6%. In terms of GErr, we are only slightly behind the best method, Synbkd. However, we notice that Synbkd scores extremely poorly on SBert.
Attack Effectiveness. From the results in Tab. II, we find that BadCM can successfully implant backdoors with a high ASR by poisoning only a small proportion (5%) of the text data. Notwithstanding a gap compared to the visible attack BadNets, our approach ensures grammatical fluency and semantic consistency with the original text, which prevents poisoned samples from being detected and removed. Moreover, for computational efficiency, we implement the modality-invariant keyword substitution with only a simple greedy strategy. We believe a superior algorithm, such as a heuristic search [36], would bring more substantial performance gains.
Remark: Although our BadCM is sub-optimal compared to state-of-the-art methods in some instances, it showcases the following novelties and advantages. 1) BadCM is the first deliberately designed invisible backdoor method that aims to enable diverse cross-modal attacks within a unified framework. From the results of bilateral attacks and dual-key attacks (refer to Tab. IX), we can see that BadCM’s performance is superior or comparable to visual, textual, and cross-modal backdoor algorithms in most situations. 2) BadCM consistently retains high stealthiness over state-of-the-art attacks, which is crucial for evading defense mechanisms. In Sec. V-D, experiments against five advanced defense algorithms demonstrate that BadCM is remarkably resistant to mainstream defenses.
TABLE IV: Resistance to SPECTRE with different poisoning ratios (numbers of clean and poisoned samples before and after filtering).

| Method | Before SPECTRE | | | After SPECTRE | | | |
| | #Clean | #Poisoned | Poisoning ratio | #Removed | #Clean | #Poisoned | Remaining ratio |
| BadNets | 475 | 25 | 5.0% | 45 | 434 | 21 | 4.6% |
| BadNets | 450 | 50 | 10.0% | 70 | 397 | 33 | 7.6% |
| BadNets | 400 | 100 | 20.0% | 120 | 330 | 50 | 13.1% |
| BadNets | 300 | 200 | 40.0% | 220 | 210 | 70 | 25.0% |
| BadCM | 475 | 25 | 5.0% | 45 | 433 | 22 | 4.8% |
| BadCM | 450 | 50 | 10.0% | 70 | 388 | 42 | 9.7% |
| BadCM | 400 | 100 | 20.0% | 120 | 318 | 62 | 16.3% |
| BadCM | 300 | 200 | 40.0% | 220 | 153 | 127 | 45.3% |
V-D Resistance to Defense Methods
To counteract backdoor attacks, numerous different defense techniques have been introduced recently. Nevertheless, most defense methods are tailored for classification tasks and heavily rely on classification priors. Consequently, these approaches are ill-suited for cross-modal tasks. To verify the resistance of BadCM to defense, we make a diligent effort to adapt five state-of-the-art defenses so that they could be successfully applied to our task. We carry out experiments on the MS-COCO dataset with the target label “car”.
Pre-training defense. This defense strategy aims to eliminate potentially poisoned samples from the dataset before training, serving as a preventive measure against backdoor attacks. SPECTRE [56], as a prime example, identifies and removes suspicious samples from the training set using feature representations acquired from the infected network. In this section, we analyze different poisoning ratios and removal numbers to evaluate the resilience of BadCM and BadNets against SPECTRE. As shown in Tab. IV, under the same conditions, the proportion of remaining poisoned images in BadCM is significantly higher than that in BadNets. As the poisoning ratio increases, SPECTRE’s effectiveness against BadCM gradually decreases. For example, at a poisoning ratio of 40%, the remaining percentage of BadCM rises to 45%. This finding highlights the limited efficacy of SPECTRE in eliminating poisoned samples generated by BadCM.


In-training defense. In-training defense is crafted to progressively remove potential backdoors throughout the training process. DB-R [22] has observed that the poisoned samples exhibit significantly greater sensitivity to transformations than the clean samples in backdoored models. Building upon this insight, DB-R calculates the feature consistency towards transformations (FCT) for the entire training set, thereby distinguishing the poisoned samples from the clean samples. Subsequently, it mitigates the backdoor’s impact by unlearning the poisoned data and relearning the clean data. However, we find that the insights of DB-R are correct for BadNets, but fail to hold for our approach. As depicted in Fig. 4, the FCT metric of poisoned images constructed by BadCM is unexpectedly smaller, leading to the failure of DB-R in separating samples. After the defense of DB-R, the ASR for BadNets and BadCM stands at 9.79% and 72.28%, respectively, further indicating DB-R’s inability to withstand our attack. The most plausible explanation is that our modality-invariant components-based trigger is more robust to transformations than simple triggers.


Post-training defense. For post-training defenses, we reproduce Fine-Pruning [20] and Februus [21] on cross-modal retrieval. Fine-Pruning mitigates the backdoor implanted in the victim models via neuron analysis. Given a network layer, it examines each neuron’s response to a few benign samples and progressively prunes the insensitive ones, assuming they are more relevant to the backdoor. Herein, we merely prune the last convolutional layer of the image encoder (e.g., VGG or ResNet). We analyze Fine-Pruning on BadNets and BadCM by reporting BA and ASR as a function of the fraction of neurons pruned on MS-COCO. As depicted in Fig. 5, the ASR of BadNets drops dramatically, with only 50% left when 80% of the neurons are pruned. By contrast, BadCM’s ASR remains over 70% at a pruning percentage of 85%. This phenomenon reveals that Fine-Pruning fails to defend against the backdoor implanted by our approach.
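A rough sketch of this Fine-Pruning adaptation is shown below: channels of the image encoder's last convolutional layer are ranked by their mean activation on a small benign set, and the least active ones are zeroed out. Module and loader names are illustrative and not tied to a specific victim implementation.

```python
import torch

@torch.no_grad()
def fine_prune(image_encoder, last_conv, benign_loader, prune_ratio=0.8):
    """Zero the least active output channels of `last_conv` inside `image_encoder`."""
    activations = []
    hook = last_conv.register_forward_hook(
        lambda mod, inp, out: activations.append(out.mean(dim=(0, 2, 3))))  # per-channel mean

    for images, _ in benign_loader:   # a few benign batches are enough
        image_encoder(images)
    hook.remove()

    channel_act = torch.stack(activations).mean(dim=0)
    n_prune = int(prune_ratio * channel_act.numel())
    prune_idx = torch.argsort(channel_act)[:n_prune]  # least active channels first

    # Prune by zeroing the weights (and biases) of the selected output channels
    last_conv.weight.data[prune_idx] = 0.0
    if last_conv.bias is not None:
        last_conv.bias.data[prune_idx] = 0.0
    return prune_idx
```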

Februus locates the attention map of the target image using Grad-CAM [57], assumes the region with the highest score is the poisoned area, and subsequently removes and restores the suspicious area via an image inpainting technique. As shown in Fig. 6, we illustrate the influential areas captured by Grad-CAM for some examples; warmer colors indicate greater influence. We feed the clean model with clean images and the victim model with the corresponding trigger images generated by BadCM or BadNets. It can be observed that the attention maps of our BadCM are on par with those of the clean model, where the warm areas are concentrated on the critical objects of the image. In contrast, the high-scoring regions of BadNets are anomalous (as marked in the red boxes), and Grad-CAM clearly localizes the patch-based triggers it adds. The experimental results likewise testify that BadNets is not resistant to Februus: it retains an ASR of only 20.99% after the suspected influential regions are removed. Our approach, in turn, maintains an attack performance of 71.22%, which breaks Februus’s assumption that triggers reside in small, uncommon regions.
Textual defense. For linguistic-to-visual attack, we choose ONION [58] to evaluate the resistance of BadCM. ONION is a method based on test sample inspection and can be applied to most backdoored models. From the results in Tab. V, we can observe that the attack performance of BadCM is inferior to BadNets when there is no defense. This is because our attack prioritizes invisibility to elude the defense, which guarantees the poisoned samples’ grammatical fluency and semantic consistency. However, it’s worth noting that after inspecting and modifying the malicious text using ONION, the ASR of BadCM considerably outperforms BadNets, showcasing the superiority of our approach in defeating defenses.
TABLE V: Resistance to the ONION defense on linguistic-to-visual attacks.

| Method | Invisible | Before ONION | | After ONION | |
| | | BA | ASR | BA | ASR |
| BadNets | ✗ | 88.62 | 77.67 | 87.73 | 11.16 |
| BadCM | ✓ | 88.69 | 65.17 | 87.66 | 36.05 |
TABLE VI: Ablation on the choice of poisoning regions (visual-to-linguistic attack on MS-COCO).

| Metrics | global | random | fixed | BadCM |
| ASR | 71.69 | 69.23 | 69.11 | 71.62 |
| PSNR | 36.62 | 39.52 | 39.17 | 40.50 |
| SSIM | 0.962 | 0.971 | 0.968 | 0.980 |
| MSE | 0.240 | 0.220 | 0.220 | 0.180 |
V-E Ablation Study
In ablation experiments, we take the visual-to-linguistic attack as an example to ablate our framework.
Influence of our design. The core idea of our design is to leverage modality-invariant components as the container of trigger patterns. To verify its effectiveness, we conduct three extra experiments, namely “global”, “random” and “fixed”, which inject poisoning information into the whole image, random regions, and fixed regions, respectively. For a fair comparison, the region for each image under the “random” and “fixed” settings has the same area as in our BadCM. Tab. VI reports the ASR and the invisibility metrics. Compared to “random” and “fixed”, we obtain better attack performance and invisibility with the same size of poisoning regions. Additionally, we achieve an ASR comparable to “global” with much smaller perturbations. These observations show that BadCM can effectively hide trigger patterns in smaller critical regions, demonstrating the superiority of our design.
Impact of surrogate. Since we have no direct access to the feature extractors of the victim models under the black-box setting, we employ a pre-trained vision-language model as a proxy to extract the features of images and text. To examine the impact of different surrogate models on attack effectiveness, we evaluate three pre-trained models with different numbers of parameters, as shown in Tab. VII. It can be seen that the choice of the surrogate network is essential for BadCM, and a more powerful vision-language model yields superior gains in terms of both BA and ASR.


Effect of poisoning ratios. In the default setting, the poisoning ratio is fixed at 5%. In Fig. 7, we further try more poisoning ratios on the NUS-WIDE dataset and show the results of BadNets and our method. From the outcomes, it can be seen that the poisoning ratio has almost no influence on the benign accuracy, while the attack success rate increases as the poisoning ratio grows. We note that BadNets can achieve a high ASR with a small poisoning ratio, e.g., 0.5%. We conjecture that DNNs can identify and memorize visible triggers more efficiently, so only a few poisoned samples are required to infect them.
Influence of target labels. In the retrieval task, the adversary alters the victim model to recall samples of the adversary-specified target label by tampering with the pristine labels of the poisoned samples. Hence, we evaluate BadCM to investigate the influence of different target labels. As shown in Fig. 7, both BadNets and our BadCM achieve varying ASR under different settings, which shows that the choice of target label has an obvious influence on backdoor attacks against cross-modal retrieval. This is due primarily to the apparent disparity in the number of samples per label: the larger the number of samples from the target label, the easier it is for the model to map the trigger samples into the same feature space as the clean samples of the target label.


TABLE VIII: Universality of BadCM across different cross-modal retrieval methods and backbones.

| Backbone | Method | Metric | ACMR | DCMH | DSCMR |
| V-16 | Benign | BA | 70.04 | 74.615 | 79.93 |
| V-16 | BadNets | ASR | 66.34 | 74.81 | 87.06 |
| V-16 | BadCM | ASR | 62.77 | 71.38 | 85.43 |
| R-50 | Benign | BA | 79.24 | 72.74 | 81.53 |
| R-50 | BadNets | ASR | 67.27 | 64.39 | 88.31 |
| R-50 | BadCM | ASR | 68.12 | 65.10 | 85.67 |
V-F Analysis and Discussions
Universality on different methods. We further validate the universality of BadCM on other cross-modal retrieval methods, including DCMH [37] and ACMR [38]. Moreover, we implement these methods with multiple backbones. As reported in Tab. VIII, even when tested on different cross-modal retrieval methods, BadCM consistently yields a high ASR, which further demonstrates the efficacy of the proposed modality-invariant components-based trigger. From the results, we also observe that the ASR positively correlates with the BA of the benign models, indicating that more powerful victim models are better at learning and memorizing the trigger patterns.
TABLE IX: Dual-key backdoor attack results on VQA v2.0 with different target answers.

| Target Answer | Method | Invisible | MCAN | | BAN | | BUTD | | MFB | | MFH | |
| | | | BA | ASR | BA | ASR | BA | ASR | BA | ASR | BA | ASR |
| - | Benign | - | 63.58 | - | 63.92 | - | 61.31 | - | 61.17 | - | 62.96 | - |
| laptop | DKMB | ✗ | 63.58 | 98.87 | 63.59 | 98.71 | 61.17 | 98.14 | 61.13 | 98.86 | 62.94 | 98.73 |
| laptop | BadCM | ✓ | 63.59 | 85.31 | 63.60 | 83.55 | 61.21 | 81.47 | 61.10 | 81.94 | 62.93 | 82.79 |
| shirt | DKMB | ✗ | 63.58 | 98.90 | 63.42 | 98.68 | 61.18 | 98.23 | 61.12 | 98.73 | 62.99 | 98.71 |
| shirt | BadCM | ✓ | 63.51 | 82.94 | 63.67 | 78.97 | 61.26 | 79.97 | 61.05 | 84.11 | 62.95 | 81.97 |
| ocean | DKMB | ✗ | 63.62 | 98.92 | 63.76 | 98.80 | 61.18 | 98.37 | 61.19 | 98.67 | 62.95 | 98.57 |
| ocean | BadCM | ✓ | 63.63 | 84.95 | 63.78 | 83.29 | 61.30 | 81.40 | 61.07 | 78.47 | 62.82 | 83.66 |
| zebra | DKMB | ✗ | 63.61 | 98.89 | 63.84 | 98.76 | 61.37 | 98.30 | 61.10 | 98.69 | 62.94 | 98.68 |
| zebra | BadCM | ✓ | 63.57 | 82.36 | 63.81 | 77.31 | 61.34 | 81.71 | 61.14 | 77.40 | 62.98 | 82.48 |
| tomato | DKMB | ✗ | 63.67 | 98.86 | 63.75 | 98.62 | 61.20 | 98.23 | 61.21 | 98.80 | 62.90 | 98.76 |
| tomato | BadCM | ✓ | 63.73 | 83.07 | 63.79 | 79.51 | 61.39 | 79.73 | 61.11 | 82.68 | 62.90 | 83.74 |
| Avg | DKMB | ✗ | 63.57 | 98.85 | 63.68 | 98.79 | 61.29 | 98.35 | 61.13 | 98.66 | 62.94 | 98.71 |
| Avg | BadCM | ✓ | 63.57 | 87.28 | 63.80 | 82.56 | 61.30 | 83.78 | 61.15 | 84.33 | 62.94 | 83.90 |
Attack effectiveness on dual-key attacks. We argue that our proposed framework generalizes to diverse kinds of cross-modal backdoor attacks, including visual-to-linguistic (V2L), linguistic-to-visual (L2V), and dual-key attacks. V2L and L2V attacks have been fully verified in the above experiments. In this section, we mainly focus on dual-key attacks, which insert triggers into each modality and activate the backdoor only when the triggers are simultaneously present in multiple modalities. To this end, we conduct the dual-key backdoor on the VQA task and compare our method with DKMB [11]. All the experimental settings are consistent with DKMB. Concretely, we use VQA v2.0 [65] as the dataset and the VQA architectures in the OpenVQA platform (https://github.com/MILVLG/openvqa) as the victim networks. The hyper-parameters are set to their default author-recommended values during the training stage.
In Tab. IX, we report the outcomes of DKMB and our BadCM with different target answers. From the average results (the last rows), we can observe that BadCM achieves an ASR above 82% on every VQA model architecture while virtually not compromising clean accuracy. Note that DKMB achieves a higher ASR because its triggers are visible and explicitly optimized for the object detectors, whereas our BadCM is an invisible backdoor attack under the black-box setting. Hence, the results reconfirm the flexibility and generalization of the proposed BadCM, which covers all three attack scenarios in cross-modal and multimodal learning, i.e., V2L, L2V, and dual-key attacks. In addition, Tab. IX suggests that the choice of target answer has little effect on the VQA task, and BadCM attains remarkable attack performance on multiple randomly selected target answers.

Visualization. Fig. 8 shows some examples of the modality-invariant factors extracted by the proposed cross-modal mining mechanism. The visualization results show that the conceived mechanism can effectively capture the mutually critical factors across the visual and linguistic modalities. Furthermore, we also provide real attack results for image-to-text retrieval, text-to-image retrieval, and VQA tasks in Fig. 9 for intuitive understanding.
VI Conclusion
In this paper, we first presented a new style of backdoor attack, i.e., the bilateral backdoor, to complement the scenarios of cross-modal attacks. We then developed a generalized invisible backdoor attack algorithm for cross-modal learning, named BadCM, which is the first attempt to cover diverse cross-modal attacks. In particular, a novel cross-modal mining scheme was introduced to obtain modality-invariant components as the carrier of poisoning substances, together with modality-specific trigger generators that effectively conceal trigger patterns within restricted invariant regions of each modality. Extensive experiments on cross-modal retrieval and VQA showed that our method is superior to existing methods in generalization and stealthiness. Moreover, further experiments against backdoor defenses proved its resistance to defense.
References
- [1] B. Wu, L. Liu, Z. Zhu, Q. Liu, Z. He, and S. Lyu, “Adversarial machine learning: A systematic survey of backdoor attack, weight attack and adversarial example,” arXiv preprint arXiv:2302.09457, 2023.
- [2] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” arXiv preprint arXiv:1312.6199, 2013.
- [3] Z. Che, A. Borji, G. Zhai, S. Ling, J. Li, Y. Tian, G. Guo, and P. Le Callet, “Adversarial attack against deep saliency models powered by non-redundant priors,” IEEE Transactions on Image Processing, vol. 30, pp. 1973–1988, 2021.
- [4] T. Gu, B. Dolan-Gavitt, and S. Garg, “Badnets: Identifying vulnerabilities in the machine learning model supply chain,” arXiv preprint arXiv:1708.06733, 2017.
- [5] X. Chen, C. Liu, B. Li, K. Lu, and D. Song, “Targeted backdoor attacks on deep learning systems using data poisoning,” arXiv preprint arXiv:1712.05526, 2017.
- [6] J. Zhang, C. Dongdong, Q. Huang, J. Liao, W. Zhang, H. Feng, G. Hua, and N. Yu, “Poison ink: Robust and invisible backdoor attack,” IEEE Transactions on Image Processing, vol. 31, pp. 5691–5705, 2022.
- [7] Y. Feng, B. Ma, J. Zhang, S. Zhao, Y. Xia, and D. Tao, “Fiba: Frequency-injection based backdoor attack in medical image analysis,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 20876–20885.
- [8] T. Wang, Y. Yao, F. Xu, S. An, H. Tong, and T. Wang, “An invisible black-box backdoor attack through frequency domain,” in Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XIII. Springer, 2022, pp. 396–413.
- [9] X. Chen, A. Salem, D. Chen, M. Backes, S. Ma, Q. Shen, Z. Wu, and Y. Zhang, “Badnl: Backdoor attacks against nlp models with semantic-preserving improvements,” in Annual Computer Security Applications Conference, 2021, pp. 554–569.
- [10] F. Qi, M. Li, Y. Chen, Z. Zhang, Z. Liu, Y. Wang, and M. Sun, “Hidden killer: Invisible textual backdoor attacks with syntactic trigger,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 2021, pp. 443–453.
- [11] M. Walmer, K. Sikka, I. Sur, A. Shrivastava, and S. Jha, “Dual-key multimodal backdoors for visual question answering,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 15375–15385.
- [12] K. Wang, Q. Yin, W. Wang, S. Wu, and L. Wang, “A comprehensive survey on cross-modal retrieval,” arXiv preprint arXiv:1607.06215, 2016.
- [13] J. Qin, L. Fei, Z. Zhang, J. Wen, Y. Xu, and D. Zhang, “Joint specifics and consistency hash learning for large-scale cross-modal retrieval,” IEEE Transactions on Image Processing, vol. 31, pp. 5343–5358, 2022.
- [14] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh, “Vqa: Visual question answering,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 2425–2433.
- [15] Y. Guo, L. Nie, Z. Cheng, Q. Tian, and M. Zhang, “Loss re-scaling vqa: revisiting the language prior problem from a class-imbalance view,” IEEE Transactions on Image Processing, vol. 31, pp. 227–238, 2021.
- [16] L. Xu, X. Zeng, B. Zheng, and W. Li, “Multi-manifold deep discriminative cross-modal hashing for medical image retrieval,” IEEE Transactions on Image Processing, vol. 31, pp. 3371–3385, 2022.
- [17] Y. Zhang, W. Ou, Y. Shi, J. Deng, X. You, and A. Wang, “Deep medical cross-modal attention hashing,” World Wide Web, vol. 25, no. 4, pp. 1519–1536, 2022.
- [18] D. Gurari, Q. Li, A. J. Stangl, A. Guo, C. Lin, K. Grauman, J. Luo, and J. P. Bigham, “Vizwiz grand challenge: Answering visual questions from blind people,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 3608–3617.
- [19] M. Li, N. Zhong, X. Zhang, Z. Qian, and S. Li, “Object-oriented backdoor attack against image captioning,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 2864–2868.
- [20] K. Liu, B. Dolan-Gavitt, and S. Garg, “Fine-pruning: Defending against backdooring attacks on deep neural networks,” in Research in Attacks, Intrusions, and Defenses: 21st International Symposium, RAID 2018, Heraklion, Crete, Greece, September 10-12, 2018, Proceedings 21. Springer, 2018, pp. 273–294.
- [21] B. G. Doan, E. Abbasnejad, and D. C. Ranasinghe, “Februus: Input purification defense against trojan attacks on deep neural network systems,” in Annual Computer Security Applications Conference, 2020, pp. 897–912.
- [22] W. Chen, B. Wu, and H. Wang, “Effective backdoor defense by exploiting sensitivity of poisoned samples,” Advances in Neural Information Processing Systems, vol. 35, pp. 9727–9737, 2022.
- [23] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang, “Bottom-up and top-down attention for image captioning and visual question answering,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 6077–6086.
- [24] K.-H. Lee, X. Chen, G. Hua, H. Hu, and X. He, “Stacked cross attention for image-text matching,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 201–216.
- [25] T. A. Nguyen and A. T. Tran, “Wanet - imperceptible warping-based backdoor attack,” in International Conference on Learning Representations, 2020.
- [26] J. Lin, L. Xu, Y. Liu, and X. Zhang, “Composite backdoor attack for deep neural network by mixing existing benign features,” in Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security, 2020, pp. 113–131.
- [27] T. A. Nguyen and A. Tran, “Input-aware dynamic backdoor attack,” Advances in Neural Information Processing Systems, vol. 33, pp. 3454–3464, 2020.
- [28] Y. Li, Y. Li, B. Wu, L. Li, R. He, and S. Lyu, “Invisible backdoor attack with sample-specific triggers,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 16463–16472.
- [29] W. Jiang, H. Li, G. Xu, and T. Zhang, “Color backdoor: A robust poisoning attack in color space,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 8133–8142.
- [30] K. Doan, Y. Lao, W. Zhao, and P. Li, “Lira: Learnable, imperceptible and robust backdoor attacks,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 11966–11976.
- [31] Z. Zhao, X. Chen, Y. Xuan, Y. Dong, D. Wang, and K. Liang, “Defeat: Deep hidden feature backdoor attacks by imperceptible perturbation and latent representation constraints,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 15213–15222.
- [32] J. Dai, C. Chen, and Y. Li, “A backdoor attack against lstm-based text classification systems,” IEEE Access, vol. 7, pp. 138872–138878, 2019.
- [33] K. Kurita, P. Michel, and G. Neubig, “Weight poisoning attacks on pretrained models,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 2793–2806.
- [34] X. Chen, Y. Dong, Z. Sun, S. Zhai, Q. Shen, and Z. Wu, “Kallima: A clean-label framework for textual backdoor attacks,” in Computer Security–ESORICS 2022: 27th European Symposium on Research in Computer Security, Copenhagen, Denmark, September 26–30, 2022, Proceedings, Part I. Springer, 2022, pp. 447–466.
- [35] F. Qi, Y. Yao, S. Xu, Z. Liu, and M. Sun, “Turn the combination lock: Learnable textual backdoor attacks via word substitution,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 2021, pp. 4873–4883.
- [36] L. Gan, J. Li, T. Zhang, X. Li, Y. Meng, F. Wu, Y. Yang, S. Guo, and C. Fan, “Triggerless backdoor attack for nlp tasks with clean labels,” in Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022, pp. 2942–2952.
- [37] Q.-Y. Jiang and W.-J. Li, “Deep cross-modal hashing,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 3232–3240.
- [38] B. Wang, Y. Yang, X. Xu, A. Hanjalic, and H. T. Shen, “Adversarial cross-modal retrieval,” in Proceedings of the 25th ACM international conference on Multimedia, 2017, pp. 154–162.
- [39] L. Zhen, P. Hu, X. Wang, and D. Peng, “Deep supervised cross-modal retrieval,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 10394–10403.
- [40] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in International conference on machine learning. PMLR, 2015, pp. 2048–2057.
- [41] L. Zhou, Y. Zhang, Y.-G. Jiang, T. Zhang, and W. Fan, “Re-caption: Saliency-enhanced image captioning through two-phase learning,” IEEE Transactions on Image Processing, vol. 29, pp. 694–709, 2019.
- [42] Z. Yu, J. Yu, Y. Cui, D. Tao, and Q. Tian, “Deep modular co-attention networks for visual question answering,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 6281–6290.
- [43] A. Saha, A. Subramanya, and H. Pirsiavash, “Hidden trigger backdoor attacks,” in Proceedings of the AAAI conference on artificial intelligence, vol. 34, no. 07, 2020, pp. 11957–11965.
- [44] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” Advances in neural information processing systems, vol. 28, 2015.
- [45] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning. PMLR, 2021, pp. 8748–8763.
- [46] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1125–1134.
- [47] L. Li, R. Ma, Q. Guo, X. Xue, and X. Qiu, “Bert-attack: Adversarial attack against bert using bert,” arXiv preprint arXiv:2004.09984, 2020.
- [48] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
- [49] T. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng, “Nus-wide: A real-world web image database from national university of singapore,” in ACM International Conference on Image and Video Retrieval, 2009, pp. 1–9.
- [50] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European Conference on Computer Vision, 2014, pp. 740–755.
- [51] H. J. Escalante, C. A. Hernández, J. A. Gonzalez, A. López-López, M. Montes, E. F. Morales, L. E. Sucar, L. Villasenor, and M. Grubinger, “The segmented and annotated iapr tc-12 benchmark,” Computer vision and image understanding, vol. 114, no. 4, pp. 419–428, 2010.
- [52] K. Gao, J. Bai, B. Chen, D. Wu, and S.-T. Xia, “Clean-label backdoor attack against deep hashing based retrieval,” arXiv preprint arXiv:2109.08868, 2021.
- [53] Q. Huynh-Thu and M. Ghanbari, “Scope of validity of psnr in image/video quality assessment,” Electronics letters, vol. 44, no. 13, pp. 800–801, 2008.
- [54] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention. Springer, 2015, pp. 234–241.
- [55] F. Qi, Y. Chen, X. Zhang, M. Li, Z. Liu, and M. Sun, “Mind the style of text! adversarial and backdoor attacks based on text style transfer,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pp. 4569–4580.
- [56] J. Hayase, W. Kong, R. Somani, and S. Oh, “Spectre: Defending against backdoor attacks using robust statistics,” in International Conference on Machine Learning. PMLR, 2021, pp. 4129–4139.
- [57] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-cam: Visual explanations from deep networks via gradient-based localization,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 618–626.
- [58] F. Qi, Y. Chen, M. Li, Y. Yao, Z. Liu, and M. Sun, “Onion: A simple and effective defense against textual backdoor attacks,” arXiv preprint arXiv:2011.10369, 2020.
- [59] W. Kim, B. Son, and I. Kim, “Vilt: Vision-and-language transformer without convolution or region supervision,” in International Conference on Machine Learning. PMLR, 2021, pp. 5583–5594.
- [60] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
- [61] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
- [62] J.-H. Kim, J. Jun, and B.-T. Zhang, “Bilinear attention networks,” Advances in neural information processing systems, vol. 31, 2018.
- [63] Z. Yu, J. Yu, J. Fan, and D. Tao, “Multi-modal factorized bilinear pooling with co-attention learning for visual question answering,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 1821–1830.
- [64] Z. Yu, J. Yu, C. Xiang, J. Fan, and D. Tao, “Beyond bilinear: Generalized multimodal factorized high-order pooling for visual question answering,” IEEE transactions on neural networks and learning systems, vol. 29, no. 12, pp. 5947–5959, 2018.
- [65] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh, “Making the v in vqa matter: Elevating the role of image understanding in visual question answering,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 6904–6913.
Zheng Zhang (Senior Member, IEEE) received the Ph.D. degree from the Harbin Institute of Technology, China. He was a Postdoctoral Research Fellow at The University of Queensland, Australia. He is currently with the Harbin Institute of Technology, Shenzhen, China. He has co-authored more than 100 technical papers in prestigious international journals and conferences. His research interests include multimedia content analysis and understanding. He is an Associate Editor of IEEE TAFFC, IEEE JBHI, and other journals, and also serves as an Area Chair of ICML, CVPR, ACM MM, etc.
Xu Yuan received the B.S. degree from the Harbin Institute of Technology, Weihai, China, in 2021. He is currently pursuing the master’s degree with the Harbin Institute of Technology, Shenzhen, China. His research interests include deep learning and adversarial machine learning.
Lei Zhu (Senior Member, IEEE) is currently with the School of Electronic and Information Engineering, Tongji University, China. He received his B.Eng. degree from Wuhan University of Technology in 2009 and his Ph.D. degree from Huazhong University of Science and Technology in 2015. He was a Postdoctoral Research Fellow at The University of Queensland (2016-2017). His research interests are in the area of large-scale multimedia content analysis and retrieval. He has co-authored more than 100 peer-reviewed papers with more than 7,200 Google citations. He serves as an Associate Editor of IEEE TBD and ACM TOMM. He has served as an Area Chair for ACM MM and IEEE ICME and as a Senior Program Committee member for SIGIR, AAAI, and CIKM.
Jingkuan Song is a full Professor with the University of Electronic Science and Technology of China (UESTC). He joined Columbia University as a Postdoctoral Research Scientist (2016-2017) and the University of Trento as a Research Fellow (2014-2016). He obtained his Ph.D. degree from The University of Queensland (UQ), Australia. His research interests mainly focus on multimedia compact representation and analysis. He was the winner of the Best Paper Award at ICPR (2016, Mexico), the Best Student Paper Award at the Australian Database Conference (2017, Australia), the Best Paper Honorable Mention Award at SIGIR (2017, Japan), the Best Paper Runner-up Award at APWeb (2019, China), and the ACM SIGMM Rising Star Award 2021. He is an Associate Editor of IEEE TMM and ACM TOMM, a Guest Editor of TMM and PR, and an AC/SPC/PC member of CVPR’18-’23, ACM MM’18-’23, IJCAI’18-’23, etc.
Liqiang Nie (Senior Member, IEEE) received the B.Eng. degree from Xi’an Jiaotong University and the Ph.D. degree from the National University of Singapore (NUS). After completing his Ph.D., he continued his research at NUS as a Research Fellow for three years. He is currently a Professor with the Harbin Institute of Technology (Shenzhen). He has co-authored more than 200 articles and four books, which had received more than 19,000 Google Scholar citations as of May 2023. His research interests include multimedia computing and information retrieval. He has received many awards, such as the ACM MM and SIGIR Best Paper Honorable Mention in 2019, the SIGMM Rising Star in 2020, the TR35 China in 2020, the DAMO Academy Young Fellow in 2020, the SIGIR Best Student Paper in 2021, and the ACM MM Best Paper in 2022. He has served as an Area Chair of ACM MM from 2018 to 2023. He is an Associate Editor of IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, IEEE TRANSACTIONS ON MULTIMEDIA, ACM Transactions on Multimedia Computing, Communications, and Applications, and Information Sciences.