EPIC: Efficient Prompt Interaction for Text-Image Classification

📅 2025-07-10
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
To address the high computational cost of fine-tuning large multimodal models (LMMs) for text–image classification, this paper proposes Efficient Prompt Interaction with Cross-modal alignment (EPIC). EPIC freezes the backbone model and injects temporal prompts into intermediate transformer layers, coupled with a dynamic cross-modal interaction mechanism guided by inter-modal similarity. This enables fine-grained modality alignment and adaptive information fusion without modifying the pretrained weights. Crucially, EPIC introduces only ~1% additional trainable parameters relative to the base model, drastically reducing training overhead. Extensive experiments demonstrate that EPIC outperforms state-of-the-art parameter-efficient fine-tuning methods on UPMC-Food101 and SNLI-VE, while achieving comparable performance on MM-IMDB. These results validate EPIC's effectiveness, generalizability across diverse multimodal benchmarks, and exceptional parameter efficiency.
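The similarity-guided prompt interaction described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the gating function, prompt shapes, and the `interact_prompts` helper are all hypothetical, standing in for the paper's dynamic cross-modal interaction mechanism.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two 1-D feature vectors."""
    return float((a / np.linalg.norm(a)) @ (b / np.linalg.norm(b)))

def interact_prompts(text_prompts, image_prompts, text_feat, image_feat):
    """Hypothetical similarity-gated prompt exchange at one intermediate layer.

    text_prompts, image_prompts: (P, D) learnable prompt tokens per modality
    text_feat, image_feat: (D,) pooled modality features at this layer
    """
    # Map inter-modal cosine similarity [-1, 1] to a gate in [0, 1]:
    # the more aligned the modalities, the more prompt information is exchanged.
    gate = 0.5 * (1.0 + cosine_sim(text_feat, image_feat))
    # Each modality's prompts absorb the other modality's prompts, scaled by the gate;
    # only these small prompt tensors would be trained, the backbone stays frozen.
    new_text_prompts = text_prompts + gate * image_prompts
    new_image_prompts = image_prompts + gate * text_prompts
    return new_text_prompts, new_image_prompts
```

With identical pooled features the gate saturates at 1 and the prompts are fully exchanged; with opposed features it drops toward 0 and each modality keeps its own prompts, which is the intuition behind similarity-based interaction.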

๐Ÿ“ Abstract
In recent years, large-scale pre-trained multimodal models (LMMs) have emerged to integrate the vision and language modalities, achieving considerable success in multimodal tasks such as text-image classification. The growing size of LMMs, however, results in a significant computational cost when fine-tuning these models for downstream tasks. Hence, prompt-based interaction strategies have been studied to align modalities more efficiently. In this context, we propose a novel efficient prompt-based multimodal interaction strategy, namely Efficient Prompt Interaction for text-image Classification (EPIC). Specifically, we utilize temporal prompts on intermediate layers and integrate different modalities with similarity-based prompt interaction, to leverage sufficient information exchange between modalities. With this approach, our method achieves reduced computational resource consumption and fewer trainable parameters (about 1% of the foundation model) compared to other fine-tuning strategies. Furthermore, it demonstrates superior performance on the UPMC-Food101 and SNLI-VE datasets, while achieving comparable performance on the MM-IMDB dataset.
Problem

Research questions and friction points this paper is trying to address.

Reduces computational cost for fine-tuning multimodal models
Improves efficiency of prompt-based multimodal interaction
Enhances performance in text-image classification tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Temporal prompts on intermediate layers
Similarity-based prompt interaction
Reduced computational resource consumption
Xinyao Yu
Zhejiang University
Machine learning
Hao Sun
College of Computer Science and Technology, Zhejiang University, Hangzhou, China
Zeyu Ling
Zhejiang University
Computer Vision
Ziwei Niu
Zhejiang University
domain generalization, domain adaptation
Zhenjia Bai
College of Computer Science and Technology, Zhejiang University, Hangzhou, China
Rui Qin
Tsinghua University
Yen-Wei Chen
Ritsumeikan University
image processing, pattern recognition, medical image analysis
Lanfen Lin
College of Computer Science and Technology, Zhejiang University, Hangzhou, China