DualCap: Enhancing Lightweight Image Captioning via Dual Retrieval with Similar Scenes Visual Prompts

📅 2025-10-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing lightweight retrieval-augmented image captioning models leverage retrieved textual prompts alone while leaving the visual features unenhanced, which creates semantic gaps when describing complex scenes and fine-grained objects. To address this, we propose DualCap, a framework built on dual-path retrieval: image-to-text retrieval for semantic prompting and image-to-image retrieval for generating transferable visual prompts. We further introduce a salient keyword extraction module and a lightweight trainable feature fusion network to enable efficient, synergistic integration of textual and visual prompts. Crucially, DualCap enhances detail fidelity and scene understanding without enlarging the underlying language model. Experiments demonstrate that DualCap achieves competitive performance against existing visual-prompting methods while using significantly fewer trainable parameters, balancing inference efficiency, cross-dataset generalization, and caption quality.
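As a rough sketch of how the dual-path retrieval could be realized, the snippet below ranks a datastore by cosine similarity, assuming a CLIP-style encoder with precomputed, L2-normalized caption and image embeddings. All names (`dual_retrieve`, `caption_embs`, `image_embs`) are illustrative, not the authors' code.

```python
import numpy as np

def dual_retrieve(query_image_emb, caption_embs, image_embs, k=5):
    """Return top-k caption indices (image-to-text) and top-k
    similar-image indices (image-to-image) by cosine similarity."""
    # Normalize the query; datastore rows are assumed pre-normalized.
    q = query_image_emb / np.linalg.norm(query_image_emb)
    text_scores = caption_embs @ q    # image-to-text path
    image_scores = image_embs @ q     # image-to-image path
    top_captions = np.argsort(-text_scores)[:k]
    top_images = np.argsort(-image_scores)[:k]
    return top_captions, top_images
```

The retrieved captions feed the textual prompt, while the captions of the retrieved similar images are mined for keywords that become the visual prompt.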

📝 Abstract
Recent lightweight retrieval-augmented image captioning models often use retrieved data solely as text prompts, leaving the original visual features unenhanced and thereby creating a semantic gap, particularly for object details or complex scenes. To address this limitation, we propose DualCap, a novel approach that enriches the visual representation by generating a visual prompt from retrieved similar images. Our model employs a dual retrieval mechanism, using standard image-to-text retrieval for text prompts and a novel image-to-image retrieval to source visually analogous scenes. Specifically, salient keywords and phrases are derived from the captions of visually similar scenes to capture key objects and shared details. These textual features are then encoded and integrated with the original image features through a lightweight, trainable feature fusion network. Extensive experiments demonstrate that our method achieves competitive performance while requiring fewer trainable parameters than previous visual-prompting captioning approaches.
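To illustrate the "lightweight, trainable feature fusion network" described above, here is a minimal PyTorch sketch in which encoded keyword features are injected into the image features through a single cross-attention block; the dimensions, layer choice, and class name are assumptions, since the paper's exact architecture is not reproduced here.

```python
import torch
import torch.nn as nn

class VisualPromptFusion(nn.Module):
    """Hypothetical fusion block: image patch features attend over
    encoded keyword features, with a residual connection."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, image_feats, keyword_feats):
        # image_feats: (B, n_patches, dim); keyword_feats: (B, n_keywords, dim)
        fused, _ = self.attn(image_feats, keyword_feats, keyword_feats)
        # The residual keeps the original visual features intact, so the
        # keyword features enrich, rather than replace, the image signal.
        return self.norm(image_feats + fused)
```

Because only a small block like this (not the caption generator itself) needs gradients, the trainable-parameter count stays low, which matches the paper's lightweight framing.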
Problem

Research questions and friction points this paper is trying to address.

Bridging the semantic gap in image captioning with visual prompts
Enhancing object details through a dual retrieval mechanism
Improving lightweight models with fewer trainable parameters
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses dual retrieval to source both text and visual prompts
Generates visual prompts from retrieved similar scenes (see the keyword-extraction sketch after this list)
Integrates textual and visual features via a lightweight fusion network
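The salient keyword extraction step could be as simple as ranking content words that recur across the retrieved captions, as in the hypothetical sketch below; the paper's actual module may be more sophisticated (e.g., phrase-level extraction), so treat this as a frequency-based stand-in.

```python
from collections import Counter
import re

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "with", "and", "is", "are"}

def salient_keywords(retrieved_captions, top_n=10):
    """Words shared across captions of similar scenes likely name
    the key objects those scenes have in common."""
    words = []
    for cap in retrieved_captions:
        words += [w for w in re.findall(r"[a-z]+", cap.lower())
                  if w not in STOPWORDS and len(w) > 2]
    return [w for w, _ in Counter(words).most_common(top_n)]

# e.g. salient_keywords(["a dog runs on the beach",
#                        "a wet dog on the sand"])
# -> ['dog', 'runs', 'beach', 'wet', 'sand']
```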
Binbin Li
Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
Guimiao Yang
Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
Zisen Qi
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
Haiping Wang
Wuhan University
Yu Ding
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China