Q-Adapter: Visual Query Adapter for Extracting Textually-related Features in Video Captioning

📅 2025-10-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational cost of fully fine-tuning multimodal large language models (MLLMs) for video captioning, and the tendency of existing parameter-efficient fine-tuning (PEFT) methods to neglect the visual encoder, this paper proposes Q-Adapter, a lightweight PEFT module designed specifically for the visual encoder. By integrating learnable query tokens and a gating mechanism, Q-Adapter adaptively extracts sparse, caption-relevant visual features without external textual supervision, extending the PEFT paradigm to the visual modality and improving vision-language alignment efficiency. Evaluated on MSR-VTT and MSVD, Q-Adapter achieves state-of-the-art performance among PEFT methods while training only 1.4% of the parameters, and remains competitive with full fine-tuning.

📝 Abstract
Recent advances in video captioning are driven by large-scale pretrained models, which follow the standard "pre-training followed by fine-tuning" paradigm, where the full model is fine-tuned for downstream tasks. Although effective, this approach becomes computationally prohibitive as the model size increases. The Parameter-Efficient Fine-Tuning (PEFT) approach offers a promising alternative, but primarily focuses on the language components of Multimodal Large Language Models (MLLMs). Despite recent progress, PEFT remains underexplored in multimodal tasks and makes insufficient use of visual information when fine-tuning the model. To bridge this gap, we propose Query-Adapter (Q-Adapter), a lightweight visual adapter module designed to enhance MLLMs by enabling efficient fine-tuning for the video captioning task. Q-Adapter introduces learnable query tokens and a gating layer into the vision encoder, enabling effective extraction of sparse, caption-relevant features without relying on external textual supervision. We evaluate Q-Adapter on two well-known video captioning datasets, MSR-VTT and MSVD, where it achieves state-of-the-art performance among PEFT-based methods across the BLEU@4, METEOR, ROUGE-L, and CIDEr metrics. Q-Adapter also achieves competitive performance compared to full fine-tuning while requiring only 1.4% of the parameters. We further analyze the impact of key hyperparameters and design choices on fine-tuning effectiveness, providing insights into optimization strategies for adapter-based learning. These results highlight the strong potential of Q-Adapter in balancing caption quality and parameter efficiency, demonstrating its scalability for video-language modeling.
Problem

Research questions and friction points this paper is trying to address.

Efficiently fine-tuning large video captioning models with fewer parameters
Enhancing visual feature extraction for multimodal language models in captioning
Bridging the parameter efficiency gap in video-language modeling tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight visual adapter module for efficient fine-tuning
Introduces learnable query tokens and gating layer
Extracts caption-relevant features without textual supervision
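The design sketched in these bullets can be illustrated with a minimal PyTorch module. This is a hedged sketch, not the paper's implementation: the dimensions, the use of cross-attention, the tanh-bounded scalar gate, and all names (`QAdapter`, `num_queries`, etc.) are illustrative assumptions about how learnable query tokens and a gating layer could attach to a frozen vision encoder's patch features.

```python
import torch
import torch.nn as nn

class QAdapter(nn.Module):
    """Illustrative sketch (not the paper's exact design): learnable query
    tokens cross-attend to frozen vision-encoder features, and a gating
    layer controls how much adapted signal is mixed in."""

    def __init__(self, dim=768, num_queries=16, num_heads=8):
        super().__init__()
        # Learnable query tokens meant to pull out caption-relevant features.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Zero-initialized scalar gate: the adapter starts as a near no-op,
        # a common trick for stable adapter training (assumption, not from the paper).
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, visual_feats):
        # visual_feats: (batch, num_patches, dim) from a frozen encoder.
        b = visual_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        # Queries attend over all patch tokens to select sparse features.
        attended, _ = self.cross_attn(q, visual_feats, visual_feats)
        # Bounded gate blends the attended signal with the raw queries.
        return torch.tanh(self.gate) * attended + q

feats = torch.randn(2, 196, 768)   # e.g. 14x14 patch tokens from a ViT
out = QAdapter()(feats)
print(out.shape)                   # torch.Size([2, 16, 768])
```

Only the adapter's parameters (queries, attention, gate) would be trained while the encoder and language model stay frozen, which is how a module like this could end up at a ~1.4% trainable-parameter budget.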