
Jiaxuan Lu [2]

[1] College of Software, Jilin University, Changchun 130012, Jilin, China

[2] Shanghai AI Lab, Xuhui District, Shanghai 200232, China

Multi-Modal Parameter-Efficient Fine-tuning via Graph Neural Network

Abstract

With the advent of the era of foundation models, pre-training and fine-tuning have become common paradigms. Recently, parameter-efficient fine-tuning has garnered widespread attention due to its better balance between the number of learnable parameters and performance. However, some current parameter-efficient fine-tuning methods model only a single modality and lack the utilization of structural knowledge in downstream tasks. To address this issue, this paper proposes a multi-modal parameter-efficient fine-tuning method based on graph networks. Each image is fed into a multi-modal large language model (MLLM) to generate a text description. The image and its corresponding text description are then processed by a frozen image encoder and text encoder to generate image features and text features, respectively. A graph is constructed based on the similarity of the multi-modal feature nodes, and knowledge and relationships relevant to these features are extracted from each node. Additionally, Elastic Weight Consolidation (EWC) regularization is incorporated into the loss function to mitigate the problem of forgetting during task learning. The proposed model achieves test accuracies on the Oxford Pets, Flowers102, and Food101 datasets that improve over the state of the art by 4.45%, 2.92%, and 0.23%, respectively. The code is available at https://github.com/yunche0/GA-Net/tree/master.

keywords:
Parameter-Efficient Fine-Tuning, Multi-Modal Learning, Graph Neural Network

1 Introduction

With the onset of the era of foundation models, we have gradually entered a pre-training and fine-tuning paradigm. For downstream task adaptation, full fine-tuning adjusts all of a model's parameters. However, as the scale of models and the number of tasks increase, this approach becomes inefficient. Consequently, numerous studies have focused on parameter-efficient fine-tuning, exploring strategies to efficiently adapt existing foundation models to downstream tasks. Previous parameter-efficient fine-tuning methods fall mainly into three categories. The first is prompt tuning [1][2], which aims to achieve fine-tuning by modifying the model input rather than the model structure. The second is prefix tuning [3], which updates only task-specific trainable parameters within each layer. The third is adapter tuning [4][5], which achieves parameter-efficient fine-tuning by inserting adapter modules with a bottleneck architecture between layers.

In the realm of multi-modal learning, parameter-efficient fine-tuning methods have gained attention in recent research. Prompt vectors are used to align multi-modal data, achieving efficient multi-modal fusion in low-resource settings [6]. The main idea of CoOp [7] is to automatically design prompt texts: it keeps the pre-trained parameters unchanged and uses a small amount of data to learn suitable prompts. The TaskRes [8] method directly adjusts the weights of text-based classifiers without requiring extensive prompt updates to text encoders or elaborate adapters. π-Tuning optimizes parameters by predicting and interpolating task similarity across visual, language, and vision-language tasks, achieving efficient cross-task transfer learning [9]. However, these works do not consider modeling the complex associations between modalities.

With respect to modeling complex associations for parameter-efficient fine-tuning, methods related to graph neural networks (GNNs) have been explored. The concept of timely tuning has been introduced [10]. In the molecular domain, MolCPT [11] enhances graph embeddings by encoding additional molecular motif information. However, these works primarily focus on how to perform parameter-efficient fine-tuning for purely graph structures, without applying them to language-image multi-modal modeling.

Therefore, we propose a framework that combines graph structures with multi-modal parameter-efficient fine-tuning methods, enabling the learning of multi-modal information while considering the complex associations between modalities. The proposed model comprises four main modules: Multi-Modal Feature Extraction, Multi-Modal Graph Construction, Graph Adapter Net (GA-Net), and Prediction. In the Multi-Modal Feature Extraction module, each image is processed by a pre-trained MLLM to obtain a corresponding text description. The image and its text description are then processed by frozen image and text encoders to generate image features and text features, respectively. These features are combined into multi-modal features through feature concatenation. In the Multi-Modal Graph Construction module, a graph is constructed based on the similarity of multi-modal feature nodes. The GA-Net module then mines suitable knowledge from the graph nodes, resulting in features that have fully learned both textual and image information while considering their adjacency relationships. In the Prediction module, the loss function combines cross-entropy with EWC regularization [12], which mitigates the forgetting problem in task learning.

Compared with the current SOTA methods, the proposed method improves test accuracy by 4.45% on the Oxford Pets dataset, 2.92% on the Flowers102 dataset, and 0.23% on the Food101 dataset.

The contributions of this paper are summarized as follows:

  • A parameter-efficient fine-tuning method based on graph networks, GA-Net, is proposed. The proposed method combines graph structures with multi-modal parameter-efficient fine-tuning, learning both textual and image information while considering the adjacency relationships between different tokens.

  • EWC regularization is introduced into the loss function. By incorporating the correlation between parameter importance and the loss function, the EWC loss effectively retains knowledge from previous tasks and reduces interference with previous tasks when learning new tasks, thereby alleviating the problem of forgetting in task learning.

  • Compared to the SOTA models, the proposed model improves the test accuracy by 4.45% on the Oxford Pets dataset, 2.92% on the Flowers102 dataset, and 0.23% on the Food101 dataset.

2 Related Work

Parameter-Efficient Fine-Tuning Methods (PEFT). Full fine-tuning involves modifying all the model’s parameters to suit downstream tasks. Yet, as models grow in scale and the number of tasks expands, this approach becomes increasingly inefficient. To address this issue, in recent years, the natural language processing (NLP) community has explored parameter-efficient fine-tuning techniques (PEFT) [13][14][15][16][17]. These techniques only require adjusting a small subset of parameters, thereby improving efficiency [18]. For example, prompt tuning methods [1] attempt to achieve fine-tuning by modifying the model input rather than the model structure. Prefix tuning [3] updates only task-specific trainable parameters in each layer. Adapter tuning [4][5] inserts adapter modules with bottleneck architectures between layers to achieve parameter-efficient fine-tuning. Additionally, methods like BitFit [19] update only the bias terms and freeze the remaining parameters, while LoRA [20] reduces the number of trainable parameters by decomposing the weight matrix into low-rank matrices.

In multi-modal learning, parameter-efficient fine-tuning methods have gained widespread attention in recent research [21][22][23][24][25]. Using prompt vectors to align multi-modal data achieves efficient multi-modal fusion in low-resource environments, excelling in tasks involving two or more data modalities [6]. Research on scaling large multi-modal models (such as LLaVA and MiniGPT-4) has shown that parameter-efficient training methods like LoRA/QLoRA perform well in both multi-modal and language tasks, with performance comparable to full-model fine-tuning [26]. The main idea of CoOp [7] is to automatically design prompt texts. It keeps the pre-trained parameters unchanged and learns suitable prompts using a small amount of data. CLIP-Adapter [27] inserts a randomly initialized learnable module into the middle of the model; by updating this module, it better adapts to downstream tasks. Unlike CLIP-Adapter, Tip-Adapter [28] does not require SGD to train the adapter. Instead, it constructs a query-key cache model from few-shot supervision to obtain the adapter's weights. TaskRes [8] directly adjusts the weights of the text-based classifier without needing extensive prompt updates to the text encoder or carefully designed adapters. π-Tuning optimizes parameters by predicting and interpolating task similarity across visual, language, and vision-language tasks, achieving efficient cross-task transfer learning, especially in data-scarce situations [9]. PMF significantly reduces training memory usage by adding prompt vectors only in the deeper layers of a single-modal Transformer [29]. However, these methods do not model the complex associations between modalities.

Graph Neural Networks (GNN). Early work by Scarselli et al. [30] laid the foundation by proposing a framework for learning node representations in graphs, capable of capturing dependencies between nodes through iterative information passing. The pioneering work by Bruna et al. [31] applied convolutional neural networks to graph data in the spectral domain. However, this approach faced challenges in computational efficiency and generalization across different graph structures. The graph convolutional networks (GCNs) proposed by Kipf and Welling [32] are among the most influential models, performing convolution by aggregating feature information from the local neighborhood of nodes, thus achieving efficient and scalable learning. The graph attention networks (GATs) introduced by Veličković et al. [33] employ an attention mechanism to weight the importance of neighboring nodes, allowing for more flexible and dynamic information aggregation. Additionally, GraphSAGE by Hamilton et al. [34] introduced a sampling-based approach for large-scale graph representation learning. Numerous prior works have applied GNNs to association modeling tasks across various domains, including vision-based [35][36], text-based [37][38], and graph-based [31][32][10][39] applications.

Parameter-efficient tuning methods have also seen some exploration in the GNN domain. The concept of timely tuning has been applied [10]. Although methods like GPF [40] and GraphPrompt [41] are also parameter-efficient, they struggle to match the benchmarks set by full fine-tuning in non-few-shot settings. GPPT [42] designed a specific framework for GNNs, but its application is limited to node-level tasks. In the molecular domain, MolCPT [11] enhances graph embeddings by encoding additional molecular motif information. However, the aforementioned works design parameter-efficient fine-tuning methods specifically for pure graph structures. To the best of our knowledge, graph networks have not been applied in the field of multi-modal parameter-efficient fine-tuning.

3 Method

As shown in Figure 1, the model presented in this paper is composed of four main modules: Multi-Modal Feature Extraction, Multi-Modal Graph Construction, GA-Net, and Prediction. The Multi-Modal Feature Extraction module uses pre-trained models to extract features from images and text. Multi-Modal Graph Construction then builds a multi-modal graph to model the connections between different modalities. The proposed GA-Net is the only trainable part of the network: a down projection is applied first, followed by a GCN, and finally an up projection to propagate and update the vertex features in the graph. Lastly, in the Prediction module, we propose a loss function combining EWC loss and cross-entropy loss to further enhance the network's performance. The details of each module are as follows:

Figure 1: Overall pipeline of the proposed framework. The proposed model consists of four main modules: Multi-Modal Feature Extraction, Multi-Modal Graph Construction, GA-Net, and Prediction. GA-Net updates vertex features through down, GCN, and up operations, while the Prediction module uses a combined EWC and cross-entropy loss function to improve performance.

3.1 Multi-Modal Feature Extraction

In this module, we first use a pre-trained MLLM to generate general text descriptions corresponding to the images. These descriptions do not involve the category names used in the final classification. Next, the images and their corresponding text descriptions are fed into frozen image encoders (such as ViT [43] or ResNet [44]) and text encoders (such as BERT [45]), respectively, to generate image features and text features. Finally, the image features and text features are combined into multi-modal features through feature concatenation. Mathematically, let $X_{i}$ denote the input image and $\text{MLLM}(\ast)$ the MLLM; the text description of image $X_{i}$ obtained through the MLLM can be expressed as

X_{i}^{t} = \text{MLLM}(X_{i})    (1)

Let $I(X_{i})$ and $T(X_{i}^{t})$ represent the series of tokens obtained from the frozen image encoder (ViT/ResNet) and text encoder (BERT), respectively. These can be expressed as

V_{i}^{I} = \{X_{i}^{I^{1}}, X_{i}^{I^{2}}, \ldots, X_{i}^{I^{N}}\} = I(X_{i})    (2)
V_{i}^{T} = \{X_{i}^{T^{1}}, X_{i}^{T^{2}}, \ldots, X_{i}^{T^{N}}\} = T(X_{i}^{t})    (3)

where each token is of dimension $E$. The image tokens encode information from each patch location, while the text tokens encode information from each word location. $V_{i}^{I}$ and $V_{i}^{T}$ are sets of tokens: the elements of $V_{i}^{I}$ are the tokens of image $X_{i}$, and the elements of $V_{i}^{T}$ are the tokens of its text description. $N$ denotes the number of tokens. All tokens form a series of text and image vertices:

V_{i} = \{V_{i}^{I}, V_{i}^{T}\} = \text{Concat}(V_{i}^{I}, V_{i}^{T})    (4)

where $\text{Concat}(\ast)$ denotes feature concatenation, and $V_{i}$ is the vertex set obtained by concatenating the text features and image features.
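To make the pipeline concrete, the following is a minimal sketch of the feature-extraction step in Eqs. (1)-(4), assuming a captioning MLLM and frozen CLIP-style encoders are already loaded; the helper names (`caption_model`, `image_encoder`, `text_encoder`) are illustrative placeholders, not the exact interfaces used in our implementation.

```python
import torch

@torch.no_grad()
def extract_multimodal_vertices(image, caption_model, image_encoder, text_encoder):
    """Return the concatenated vertex set V_i = Concat(V_i^I, V_i^T)."""
    # Eq. (1): obtain a generic text description X_i^t from the MLLM.
    description = caption_model(image)                        # str

    # Eq. (2): the frozen image encoder yields N patch tokens of dimension E.
    image_tokens = image_encoder(image)                       # [N, E]

    # Eq. (3): the frozen text encoder yields N word tokens of dimension E.
    text_tokens = text_encoder(description)                   # [N, E]

    # Eq. (4): concatenate image and text tokens into multi-modal vertices.
    vertices = torch.cat([image_tokens, text_tokens], dim=0)  # [2N, E]
    return vertices
```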

3.2 Multi-Modal Graph Construction

To uncover the structural knowledge in the embedding space for downstream tasks, i.e., the relationships between different semantics, and to account for the diverse visual features of different samples, we measure fine-grained relationships between semantics in both the visual and textual spaces. We therefore construct a multi-modal graph structure $G = \{V_{i}, E\} = \{V_{i}^{I}, V_{i}^{T}; E\}$, where $V_{i}^{I}$ and $V_{i}^{T}$ can be seen as the sets of image vertices and text vertices, respectively, and $E$ represents the set of edges.

We build the graph using the similarity of multi-modal features and a predefined threshold $\gamma$. When the similarity between two multi-modal features is greater than $\gamma$, an undirected edge is created between the two vertices, encoding the adjacency relationships among all multi-modal features:

E_{ij} = \begin{cases} 1, & \text{if } i \neq j \text{ and } \text{Sim}(V_{i}, V_{j}) > \gamma \\ 0, & \text{otherwise} \end{cases}    (5)

where

\text{Sim}(V_{i}, V_{j}) = \frac{V_{i} \cdot V_{j}}{\|V_{i}\| \|V_{j}\|}    (6)

represents the cosine similarity between multi-modal nodes $V_{i}$ and $V_{j}$. $\gamma$ is the similarity threshold, and an edge is constructed when the similarity between two vertices in the graph exceeds this threshold.
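A small sketch of the graph construction in Eqs. (5)-(6) is given below; it assumes the vertex features have already been extracted and uses the threshold value $\gamma = 0.7$ found to work best in Section 4.5.

```python
import torch
import torch.nn.functional as F

def build_adjacency(vertices: torch.Tensor, gamma: float = 0.7) -> torch.Tensor:
    """vertices: [M, E] multi-modal vertex features; returns a {0, 1} adjacency matrix [M, M]."""
    normed = F.normalize(vertices, dim=-1)   # unit-norm rows
    sim = normed @ normed.t()                # pairwise cosine similarity, Eq. (6)
    adj = (sim > gamma).float()              # Eq. (5): edge if Sim(V_i, V_j) > gamma
    adj.fill_diagonal_(0.0)                  # exclude i == j; self-loops are added later as A + I
    return adj
```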

3.3 Graph Adapter Net

We propose a parameter-efficient fine-tuning method based on graph networks called Graph Adapter Net (GA-Net), in which the rest of the network is frozen and only GA-Net, built around a GCN [46], is fine-tuned for downstream tasks. The advantage of this method is that it can adapt to downstream tasks and improve model performance without significantly increasing the number of model parameters. Furthermore, since adapter fine-tuning only requires training a small number of parameters, it significantly reduces the computational resources and time costs of fine-tuning. The unique aspect of GA-Net is its ability to update features based on our constructed multi-modal graph structure, allowing fine-tuning while preserving adjacency relationships. The process of GA-Net can be represented by the following formula:

C_{X_{i}} = W_{\text{up}}\left(\text{GCN}\left(W_{\text{down}}(X_{i})\right)\right)    (7)
Figure 2: Details of GA-Net. By projecting the graph structure downwards, the number of model parameters is significantly reduced. Then, a GCN network aggregates information from adjacent nodes to learn complex associations between different modalities. Finally, the structure is projected upwards, restoring the original graph size.

where $C_{X_{i}}$ represents the vertex feature matrix. GA-Net adopts a bottleneck architecture, consisting of a down-projection $W_{\text{down}}: \mathbb{R}^{n_{\text{in}}} \rightarrow \mathbb{R}^{n_{\text{mid}}}$, a Graph Convolutional Network (GCN) layer, and an up-projection $W_{\text{up}}: \mathbb{R}^{n_{\text{mid}}} \rightarrow \mathbb{R}^{n_{\text{out}}}$. For vertex feature aggregation we use the GCN, and the update formula for the vertex features in each layer can be formulated as:

C^{(l+1)} = \sigma\left(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} C^{(l)} W^{(l)}\right)    (8)

where $\hat{A} = A + I$ is the adjacency matrix with added self-loops, $A$ being the original adjacency matrix and $I$ the identity matrix; $\hat{D}$ is the degree matrix of $\hat{A}$, with diagonal elements $\hat{D}_{ii} = \sum_{j} \hat{A}_{ij}$; $W^{(l)}$ is the trainable weight matrix of the $l$-th layer, with dimensions $F_{l} \times F_{l+1}$; and $\sigma$ is the activation function, applied element-wise. $C^{(l)}$ is the node feature matrix of the $l$-th GCN layer, with dimensions $N \times F_{l}$, where $N$ is the number of nodes and $F_{l}$ is the feature dimension of the $l$-th layer; $C^{(l+1)}$ is the node feature matrix of the $(l+1)$-th layer, with dimensions $N \times F_{l+1}$, where $F_{l+1}$ is the feature dimension of the $(l+1)$-th layer.
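The following is a simplified sketch of GA-Net corresponding to Eqs. (7)-(8); the layer sizes, the use of a single GCN layer, and the ReLU activation are assumptions for illustration, and only these modules would be trainable while the encoders stay frozen.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One GCN layer: C^(l+1) = sigma(D^{-1/2} (A + I) D^{-1/2} C^(l) W^(l)), Eq. (8)."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)  # W^(l)

    def forward(self, c: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        a_hat = adj + torch.eye(adj.size(0), device=adj.device)        # A + I
        d_inv_sqrt = a_hat.sum(dim=1).clamp(min=1e-12).pow(-0.5)       # diagonal of D^{-1/2}
        a_norm = d_inv_sqrt.unsqueeze(1) * a_hat * d_inv_sqrt.unsqueeze(0)
        return torch.relu(a_norm @ self.weight(c))                     # sigma = ReLU here

class GANet(nn.Module):
    """Bottleneck adapter of Eq. (7): W_up(GCN(W_down(X)))."""
    def __init__(self, n_in: int, n_mid: int, n_out: int):
        super().__init__()
        self.down = nn.Linear(n_in, n_mid)   # W_down
        self.gcn = GCNLayer(n_mid, n_mid)
        self.up = nn.Linear(n_mid, n_out)    # W_up

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        return self.up(self.gcn(self.down(x), adj))
```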

3.4 Prediction

In the prediction stage, we introduce Elastic Weight Consolidation (EWC) regularization [12] into the generic cross-entropy loss function. By incorporating the importance of parameters and their association with the loss function, the EWC algorithm can effectively retain knowledge from previous tasks and reduce interference when learning new tasks, thereby addressing the problem of catastrophic forgetting in task learning.

First, we calculate the importance of parameters based on the training results of previous tasks, using the Fisher Information Matrix $I_{A}$. The Fisher Information Matrix is calculated as follows:

I_{A} = \mathbb{E}\left[\frac{\partial^{2}\mathcal{L}(\theta)}{\partial\theta^{2}}\bigg|_{\theta^{*}}\right] = \mathbb{E}\left[\left(\frac{\partial\mathcal{L}(\theta)}{\partial\theta}\frac{\partial\mathcal{L}(\theta)}{\partial\theta^{\top}}\right)\bigg|_{\theta^{*}}\right]    (9)

where $I_{A}$ is the Fisher Information Matrix representing the importance of parameters; $\mathbb{E}$ is the expectation operator; $\mathcal{L}(\theta)$ is the cross-entropy loss function, representing the model's CE loss under parameters $\theta$, with $L_{CE} = \mathcal{L}(\theta)$; $\theta$ are the model parameters; $\theta^{*}$ are the optimal parameters obtained from previous task training; $\frac{\partial^{2}\mathcal{L}(\theta)}{\partial\theta^{2}}$ is the second derivative of the loss function with respect to the parameters, indicating curvature information; and $\frac{\partial\mathcal{L}(\theta)}{\partial\theta}$ is the first derivative of the loss function with respect to the parameters, indicating gradient information.
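As a reference, a hedged sketch of the diagonal Fisher estimate in Eq. (9) is shown below: squared gradients of the cross-entropy loss at the previous-task optimum $\theta^{*}$, averaged over that task's data. The `model` and `loader` arguments are placeholders for the trained model and the previous task's data loader.

```python
import torch
import torch.nn.functional as F

def estimate_fisher_diagonal(model, loader, device="cpu"):
    """Approximate diag(I_A) by averaging squared CE-loss gradients at theta*."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters() if p.requires_grad}
    model.eval()
    n_batches = 0
    for inputs, labels in loader:
        model.zero_grad()
        loss = F.cross_entropy(model(inputs.to(device)), labels.to(device))
        loss.backward()
        for n, p in model.named_parameters():
            if n in fisher and p.grad is not None:
                fisher[n] += p.grad.detach() ** 2   # accumulate E[(dL/dtheta)^2]
        n_batches += 1
    return {n: f / max(n_batches, 1) for n, f in fisher.items()}
```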

The EWC loss is defined as:

L_{ewc} = \mathcal{L}_{\mathcal{B}}(\theta) + \frac{\lambda}{2} \sum_{i} I_{A}(i) \left(\theta_{i} - \theta_{i}^{*}\right)^{2}    (10)

where $\mathcal{L}_{\mathcal{B}}(\theta)$ is the loss function for the current task $B$; $\lambda$ is a hyperparameter that balances the learning of new tasks and the retention of old tasks; $I_{A}(i)$ are the diagonal elements of the Fisher Information Matrix, representing the importance of parameter $\theta_{i}$ in task $A$; and $\theta_{i}^{*}$ are the parameters learned in task $A$. The model, regularized by EWC, is used for the final prediction, improving performance on new tasks while mitigating the forgetting problem.

The total loss is the sum of the ordinary cross-entropy loss $L_{CE}$ and the EWC loss $L_{ewc}$:

L_{total} = L_{CE} + L_{ewc}    (11)
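A minimal sketch of the EWC-regularized objective in Eqs. (10)-(11) is given below, assuming the diagonal Fisher estimates (`fisher`) and previous-task parameters (`old_params`) were stored beforehand; `lam` stands for the balancing hyperparameter $\lambda$.

```python
def ewc_penalty(model, fisher, old_params, lam=1.0):
    """Quadratic EWC penalty: (lambda / 2) * sum_i I_A(i) * (theta_i - theta_i*)^2."""
    penalty = 0.0
    for n, p in model.named_parameters():
        if n in fisher:
            penalty = penalty + (fisher[n] * (p - old_params[n]) ** 2).sum()
    return 0.5 * lam * penalty

# Per Eq. (10), L_ewc adds the current-task cross-entropy loss to this penalty;
# per Eq. (11), the total loss sums L_CE and L_ewc:
#   loss_ewc   = ce_loss + ewc_penalty(model, fisher, old_params, lam)
#   loss_total = ce_loss + loss_ewc
```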

4 Experiment

4.1 Experiment Settings

We validated our model on three downstream classification tasks: Oxford Pets [47], Flowers102 [48], and Food101 [49]. All these datasets belong to the fine-grained classification category. The Oxford Pets dataset contains 37 categories (25 dog breeds and 12 cat breeds) with a total of 7,349 images. The Flowers102 dataset includes 102 categories with a total of 8,189 images. The Food101 dataset consists of 101 food categories with a total of 101,000 images. These datasets are not only rich in categories but also possess high fine-grained characteristics, making them ideal for evaluating the model’s performance in distinguishing similar categories.

4.1.1 Implementation Details

We use the LLaVA [50] model to generate general text descriptions corresponding to the images, ensuring that these descriptions do not mention the category names used for final classification. Unless otherwise stated, we use the pre-trained ViT-B/16 [43] backbone as the visual encoder to produce visual features. We optimized the model for 100 epochs. During training, we used a batch size of 16 and the Adam optimizer with an initial learning rate of $1 \times 10^{-3}$, which decays following a cosine learning rate schedule.
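The optimizer setup can be sketched as follows, matching the settings above; `ga_net` stands for the trainable adapter while all encoders remain frozen, and the exact scheduler class is an assumption about how the cosine decay is realized.

```python
import torch

def configure_training(ga_net: torch.nn.Module, epochs: int = 100, lr: float = 1e-3):
    """Adam with an initial learning rate of 1e-3 and cosine learning-rate decay."""
    optimizer = torch.optim.Adam(ga_net.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler
```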

4.2 Comparisons with State-of-the-Arts

Table 1: Comparison with SOTA methods in terms of accuracy
Method Flowers102 Oxford Pets Food101
CLIP-Adapter 93.90 87.84 78.25
CoOp 94.51 87.01 74.67
TaskRes 96.10 88.10 78.23
Tip-Adapter 94.23 88.18 78.11
Ours 99.02 92.63 78.46

We compared the proposed model with several state-of-the-art parameter-efficient fine-tuning methods, including TaskRes [8], CoOp [7], CLIP-Adapter [27], and Tip-Adapter [28], on the Oxford Pets, Flowers102, and Food101 datasets. As shown in Table 1, the experimental results demonstrate that our model consistently outperforms previous parameter-efficient fine-tuning models in average performance across the benchmark datasets. Our model achieved an average performance of 90.03%, outperforming Tip-Adapter by 3.19% and TaskRes by 2.82%. The model's test accuracy improved by 4.45% and 2.92% on the Oxford Pets and Flowers102 datasets, respectively, compared to the state-of-the-art methods, and our model still performed the best on the Food101 dataset. The more significant improvement on the Oxford Pets and Flowers102 datasets is due to their higher need for multi-modal associations, which are better modeled through the GNN. Similarly, Tip-Adapter performs better than other SOTA methods as it combines the strengths of prompts and adapters, introducing task-related prompts into the model to provide more multi-modal associations.

4.3 Model Efficiency

Table 2: Comparison with SOTA methods in terms of efficiency
Metric CLIP-Adapter CoOp TaskRes Tip-Adapter Ours
Parameters (M) 0.524 0.008 1.024 16.384 0.056
Memory Cost (Training) 9.257 18.907 6.227 4.313 13.826
Memory Cost (Inference) 7.615 7.403 6.225 4.161 6.713

As shown in Table 2, we conducted experiments on the parameter quantities of different methods. The parameter consumption of Tip-Adapter is exceptionally high and is not in the same range as other methods. Among the remaining three models, TaskRes has a higher parameter count than CLIP-Adapter, with CoOp having the least parameter count. Our model has a parameter count of 0.056M, which is less than most models and only slightly higher than CoOp. However, the accuracy of our model significantly surpasses that of CoOp, indicating that our model achieves a good balance between accuracy and parameter count.

When the image and text encoders are not frozen, the number of trainable parameters is 195,337,765; when they are frozen, only 56,869 parameters remain trainable. Therefore, with only 0.029% of the parameters being trainable, our model surpasses the state-of-the-art (SOTA) models on multiple datasets. Our model ranks second in parameter efficiency among the compared methods. This is partly because the text descriptions in the CoOp model are generated from fixed handcrafted sentences, which are relatively simple. Compared to our model, even though CoOp has fewer trainable parameters, the quality of its textual information is lower, resulting in significantly lower accuracy on two datasets.
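The quoted trainable-parameter ratio follows directly from these two counts:

```python
# Trainable-parameter ratio from the counts reported above.
trainable, total = 56_869, 195_337_765
print(f"{100 * trainable / total:.3f}% of parameters are trainable")  # -> 0.029%
```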

4.4 Ablation Study

Table 3: The accuracy under different components
Method  EWC  GNN  Multi-Modal  Accuracy
Baseline  ✗  ✗  ✗  86.20%
Baseline-EWC  ✓  ✗  ✗  87.85%
Multi-Modal Baseline-EWC  ✓  ✗  ✓  90.08%
GA-Net  ✓  ✓  ✓  92.63%

As shown in Table 3, the baseline is a simple linear layer trained on single-modal features, which are also extracted with a pre-trained model. Without our proposed modules, the accuracy is only 86.20%. Applying EWC regularization improves the accuracy by 1.65% to 87.85%. Adding multi-modal learning further increases performance by 2.23% to 90.08%; this gain arises because leveraging both modalities simultaneously provides richer information than a single modality. Finally, integrating the complete graph method improves the accuracy by a further 2.55% to 92.63%. The performance boost from incorporating the graph is due to its ability to better model the relationships between tokens, thereby demonstrating the effectiveness of our proposed method.

The text descriptions for each image are generated by the MLLM. In the baseline experiments, replacing all text descriptions uniformly with "A photo of a pet/flower" removes informative text during training. The ablation experiments show that different combinations of methods and features enhance the model's performance to varying degrees. The full GA-Net model achieves the best performance by combining EWC [12], multi-modal features, and the GNN. Introducing EWC regularization improves performance by 1.65%, the GNN by 2.55%, and multi-modal learning by 2.23%. The significant gains from multi-modal learning and the GNN indicate that our model effectively captures complex associations between modalities. EWC regularization helps mitigate the forgetting problem in task learning, and its effect becomes more pronounced on larger datasets.

In terms of memory cost, Tip-Adapter has the lowest consumption, while CoOp has the highest. This is the opposite of the ordering by parameter count, indicating that, to some extent, a larger number of parameters can coincide with lower memory consumption. Although our model also incurs a notable memory cost, it is still lower than CoOp's while performing better. Tip-Adapter consumes less memory than our model, but our model requires far fewer parameters and delivers significantly better performance. Furthermore, our model needs to store the adjacency relations between different tokens, which inevitably consumes some memory; this stored structural information is also part of what drives our model's performance. Overall, this demonstrates that our model achieves a good balance between accuracy and memory consumption.

4.5 Hyperparameter Study

Figure 3: Accuracy under different similarity thresholds

To investigate the impact of the similarity threshold hyperparameter, we evaluated different threshold values on the Oxford Pets and Flowers102 datasets, as shown in Figure 3. The results indicate that the datasets are affected differently by varying thresholds. For both datasets, accuracy increases most significantly within the threshold range of 0.3 to 0.5. The improvement for Flowers102 is more pronounced in this range than for Oxford Pets, suggesting that Flowers102 is more sensitive to the adjacency relations in the graph. When the similarity threshold reaches 0.7, the model achieves peak accuracy on both datasets. For thresholds below 0.7, accuracy increases with the threshold; once the threshold exceeds 0.7, accuracy changes minimally while the training burden increases, indicating that a similarity threshold of 0.7 is optimal.

5 Conclusions

This paper reviews the limitations of previous parameter-efficient fine-tuning methods in low-data environments: many of them model only a single modality and do not exploit structural knowledge in downstream tasks. We therefore propose a novel parameter-efficient fine-tuning model, GA-Net, which extracts feature-relevant knowledge from each multi-modal node in the graph, yielding features that fully learn both textual and image information while respecting their adjacency relationships. Experiments on three fine-grained classification tasks, Oxford Pets, Flowers102, and Food101, demonstrate that GA-Net is an effective parameter-efficient fine-tuning method.
The limitations of the model stem from the generation of text descriptions. In this paper, we use MLLM to generate text descriptions for each image. However, these prompts are simple and lack sufficient diversity. We believe that providing more diverse and accurate prompts for downstream tasks, such as using refined image caption methods, would better model the textual structural knowledge and further improve the performance of GA-Net.

6 Declarations

  • Funding

    This research received no external funding.

  • Conflict of Interest/Competing Interests

    We declare that we have no competing interests.

  • Ethics Approval and Consent to Participate

    This study did not involve any human or animal subjects, hence ethical approval and consent to participate are not applicable.

  • Consent for Publication

    We have reviewed the manuscript and consent to its publication.

  • Data Availability

    The datasets used in this study (Oxford Pets, Flowers102, and Food101) are all publicly available.

  • Materials Availability

    Not applicable.

  • Code Availability

    The code used in this study is available at: https://github.com/yunche0/GA-Net/tree/master

  • Author Contributions

    We all contributed significantly to the research and manuscript preparation, including the conception and design of the study, data collection and analysis, and manuscript writing. We have read and approved the final version of the manuscript.

References

  • Vu et al. [2021] Vu, T., Lester, B., Constant, N., Al-Rfou, R., Cer, D.: Spot: Better frozen model adaptation through soft prompt transfer. arXiv preprint arXiv:2110.07904 (2021)
  • Lu et al. [2024] Lu, J., Yan, F., Zhang, X., Gao, Y., Zhang, S.: Pathotune: Adapting visual foundation model to pathological specialists. arXiv preprint arXiv:2403.16497 (2024)
  • Li and Liang [2021] Li, X.L., Liang, P.: Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190 (2021)
  • Houlsby et al. [2019] Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M., Gelly, S.: Parameter-efficient transfer learning for nlp. In: International Conference on Machine Learning, pp. 2790–2799 (2019). PMLR
  • Chen et al. [2022] Chen, S., Ge, C., Tong, Z., Wang, J., Song, Y., Wang, J., Luo, P.: Adaptformer: Adapting vision transformers for scalable visual recognition. Advances in Neural Information Processing Systems 35, 16664–16678 (2022)
  • Lu et al. [2022] Lu, Y., Liu, J., Zhang, Y., Liu, Y., Tian, X.: Prompt distribution learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5206–5215 (2022)
  • Zhou et al. [2022] Zhou, K., Yang, J., Loy, C.C., Liu, Z.: Learning to prompt for vision-language models. International Journal of Computer Vision 130(9), 2337–2348 (2022)
  • Yu et al. [2023] Yu, T., Lu, Z., Jin, X., Chen, Z., Wang, X.: Task residual for tuning vision-language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10899–10909 (2023)
  • Wu et al. [2023] Wu, C., Wang, T., Ge, Y., Lu, Z., Zhou, R., Shan, Y., Luo, P.: π-tuning: Transferring multimodal foundation models with optimal multi-task interpolation. In: International Conference on Machine Learning, pp. 37713–37727 (2023). PMLR
  • Zhu et al. [2023] Zhu, X., Wu, Y., Zhang, Q., Chen, Z., He, Y.: Dynamic link prediction for new nodes in temporal graph networks. arXiv preprint arXiv:2310.09787 (2023)
  • Diao et al. [2022] Diao, C., Zhou, K., Liu, Z., Huang, X., Hu, X.: Molcpt: Molecule continuous prompt tuning to generalize molecular representation learning. arXiv preprint arXiv:2212.10614 (2022)
  • Aich [2021] Aich, A.: Elastic weight consolidation (ewc): Nuts and bolts. arXiv preprint arXiv:2105.04093 (2021)
  • Liao et al. [2023] Liao, B., Meng, Y., Monz, C.: Parameter-efficient fine-tuning without introducing new latency. arXiv preprint arXiv:2305.16742 (2023)
  • Li et al. [2023] Li, J., Aitken, W., Bhambhoria, R., Zhu, X.: Prefix propagation: Parameter-efficient tuning for long sequences. arXiv preprint arXiv:2305.12086 (2023)
  • Gheini et al. [2022] Gheini, M., Ma, X., May, J.: Know where you’re going: Meta-learning for parameter-efficient fine-tuning. arXiv preprint arXiv:2205.12453 (2022)
  • Pouramini and Faili [2024] Pouramini, A., Faili, H.: Matching tasks to objectives: Fine-tuning and prompt-tuning strategies for encoder-decoder pre-trained language models. Applied Intelligence, 1–28 (2024)
  • Li et al. [2024] Li, J., Wang, Y., Gao, Z., Wei, Y.: Eftnet: an efficient fine-tuning method for few-shot segmentation. Applied Intelligence, 1–20 (2024)
  • Ding et al. [2023] Ding, N., Qin, Y., Yang, G., Wei, F., Yang, Z., Su, Y., Hu, S., Chen, Y., Chan, C.-M., Chen, W., et al.: Parameter-efficient fine-tuning of large-scale pre-trained language models. Nature Machine Intelligence 5(3), 220–235 (2023)
  • Zaken et al. [2021] Zaken, E.B., Ravfogel, S., Goldberg, Y.: Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. arXiv preprint arXiv:2106.10199 (2021)
  • Hu et al. [2021] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021)
  • Dutta et al. [2023] Dutta, A., Alcaraz, J., TehraniJamsaz, A., Cesar, E., Sikora, A., Jannesari, A.: Performance optimization using multimodal modeling and heterogeneous gnn. In: Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing, pp. 45–57 (2023)
  • Lialin et al. [2023] Lialin, V., Deshpande, V., Rumshisky, A.: Scaling down to scale up: A guide to parameter-efficient fine-tuning. arXiv preprint arXiv:2303.15647 (2023)
  • Lian et al. [2022] Lian, D., Zhou, D., Feng, J., Wang, X.: Scaling & shifting your features: A new baseline for efficient model tuning. Advances in Neural Information Processing Systems 35, 109–123 (2022)
  • Hirano and Izumi [2022] Hirano, M., Izumi, K.: Parameter tuning method for multi-agent simulation using reinforcement learning. In: 2022 9th International Conference on Behavioural and Social Computing (BESC), pp. 1–7 (2022). IEEE
  • Wang et al. [2023] Wang, Y., Wu, J., Dabral, T., Zhang, J., Brown, G., Lu, C.-T., Liu, F., Liang, Y., Pang, B., Bendersky, M., et al.: Non-intrusive adaptation: Input-centric parameter-efficient fine-tuning for versatile multimodal modeling. arXiv preprint arXiv:2310.12100 (2023)
  • Lu et al. [2023] Lu, Y., Li, C., Liu, H., Yang, J., Gao, J., Shen, Y.: An empirical study of scaling instruct-tuned large multimodal models. arXiv preprint arXiv:2309.09958 (2023)
  • Gao et al. [2024] Gao, P., Geng, S., Zhang, R., Ma, T., Fang, R., Zhang, Y., Li, H., Qiao, Y.: Clip-adapter: Better vision-language models with feature adapters. International Journal of Computer Vision 132(2), 581–595 (2024)
  • Zhang et al. [2022] Zhang, R., Zhang, W., Fang, R., Gao, P., Li, K., Dai, J., Qiao, Y., Li, H.: Tip-adapter: Training-free adaption of clip for few-shot classification. In: European Conference on Computer Vision, pp. 493–510 (2022). Springer
  • Xu et al. [2023] Xu, B., Xu, H., Zhao, H., Gao, J., Liang, D., Li, Y., Wang, W., Feng, Y., Shi, G.: Source apportionment of fine particulate matter at a megacity in china, using an improved regularization supervised pmf model. Science of the Total Environment 879, 163198 (2023)
  • Scarselli et al. [2008] Scarselli, F., Gori, M., Tsoi, A.C., Hagenbuchner, M., Monfardini, G.: The graph neural network model. IEEE transactions on neural networks 20(1), 61–80 (2008)
  • Chen et al. [2017] Chen, Z., Li, X., Bruna, J.: Supervised community detection with line graph neural networks. arXiv preprint arXiv:1705.08415 (2017)
  • Schlichtkrull et al. [2018] Schlichtkrull, M., Kipf, T.N., Bloem, P., Van Den Berg, R., Titov, I., Welling, M.: Modeling relational data with graph convolutional networks. In: The Semantic Web: 15th International Conference, ESWC 2018, Heraklion, Crete, Greece, June 3–7, 2018, Proceedings 15, pp. 593–607 (2018). Springer
  • Veličković et al. [2017] Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., Bengio, Y.: Graph attention networks. arXiv preprint arXiv:1710.10903 (2017)
  • Hamilton et al. [2017] Hamilton, W., Ying, Z., Leskovec, J.: Inductive representation learning on large graphs. Advances in neural information processing systems 30 (2017)
  • Lu et al. [2023] Lu, J., Wan, H., Li, P., Zhao, X., Ma, N., Gao, Y.: Exploring high-order spatio–temporal correlations from skeleton for person re-identification. IEEE Transactions on Image Processing 32, 949–963 (2023)
  • Gao et al. [2024] Gao, Y., Lu, J., Li, S., Li, Y., Du, S.: Hypergraph-based multi-view action recognition using event cameras. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024)
  • Dai et al. [2022] Dai, Y., Shou, L., Gong, M., Xia, X., Kang, Z., Xu, Z., Jiang, D.: Graph fusion network for text classification. Knowledge-based systems 236, 107659 (2022)
  • Zhang et al. [2023] Zhang, F., Li, J., Cheng, J.: Improving entity alignment via attribute and external knowledge filtering. Applied Intelligence 53(6), 6671–6681 (2023)
  • Han et al. [2023] Han, D., Kim, D., Kim, M., Han, K., Yi, M.Y.: Temporal enhanced inductive graph knowledge tracing. Applied Intelligence 53(23), 29282–29299 (2023)
  • Meng et al. [2022] Meng, Z., Chen, Z., Tan, J., Wang, W., Zhang, Z., Huang, J., Fang, J.: Regeneration performance and particulate emission characteristics during active regeneration process of gpf with ash loading. Chemical Engineering Science 248, 117114 (2022)
  • Liu et al. [2023] Liu, Z., Yu, X., Fang, Y., Zhang, X.: Graphprompt: Unifying pre-training and downstream tasks for graph neural networks. In: Proceedings of the ACM Web Conference 2023, pp. 417–428 (2023)
  • Sun et al. [2022] Sun, M., Zhou, K., He, X., Wang, Y., Wang, X.: Gppt: Graph pre-training and prompt tuning to generalize graph neural networks. In: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 1717–1727 (2022)
  • Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
  • He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
  • Devlin et al. [2018] Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
  • Kipf and Welling [2016] Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016)
  • Parkhi et al. [2012] Parkhi, O.M., Vedaldi, A., Zisserman, A., Jawahar, C.: Cats and dogs. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3498–3505 (2012). IEEE
  • Nilsback and Zisserman [2008] Nilsback, M.-E., Zisserman, A.: Automated flower classification over a large number of classes. In: 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pp. 722–729 (2008). IEEE
  • Bossard et al. [2014] Bossard, L., Guillaumin, M., Van Gool, L.: Food-101–mining discriminative components with random forests. In: Computer vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VI 13, pp. 446–461 (2014). Springer
  • Lin et al. [2023] Lin, B., Zhu, B., Ye, Y., Ning, M., Jin, P., Yuan, L.: Video-llava: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122 (2023)