ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing

📅 2025-06-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large vision-language models (LVLMs) suffer from two critical biases in image captioning: multimodal bias, which yields imbalanced descriptive granularity, and linguistic bias, which produces hallucinated descriptions of non-existent objects. To address these, the paper proposes ScaleCap, a scalable, inference-time dual-modality debiasing strategy with two components: (1) heuristic question answering, which raises content-specific questions about the image and answers them to progressively inject relevant information into the caption, mitigating the granularity imbalance; and (2) contrastive sentence rating, a sentence-level offline contrastive decoding scheme that identifies and removes hallucinations caused by linguistic bias. Both components operate at inference time, without model fine-tuning, and the method is compute-scalable: caption completeness and accuracy improve as the inference budget grows. Annotating 450K images with ScaleCap and using them for LVLM pretraining yields consistent gains across 11 mainstream benchmarks, and the generated captions show superior semantic coverage and visual fidelity on VQA-based captioning and image-reconstruction tasks.
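The scaling loop summarized above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `scale_caption`, `ask`, `answer`, and `merge` are hypothetical stand-ins for the paper's LVLM-driven components, and the scripted question/answer pairs are toy data.

```python
# Minimal sketch of the inference-time scaling loop: each round raises a
# content-specific question about the image, answers it, and merges the
# answer into the caption. More budget -> more rounds -> a richer caption.

def scale_caption(image, budget, ask, answer, merge, init_caption=""):
    caption = init_caption
    for _ in range(budget):
        question = ask(image, caption)   # heuristic question, or None
        if question is None:             # nothing left to probe
            break
        fact = answer(image, question)   # query the model about the image
        caption = merge(caption, fact)   # progressively inject the detail
    return caption

# Toy stand-ins: two scripted question/answer pairs.
pending = [("What animal is present?", "a dog"),
           ("What is in the corner?", "a lamp")]
answers = dict(pending)

def ask(image, caption):
    for question, fact in pending:
        if fact not in caption:          # ask only about missing content
            return question
    return None

def answer(image, question):
    return answers[question]

def merge(caption, fact):
    return (caption + " " + fact).strip()

result = scale_caption("image.jpg", budget=3, ask=ask, answer=answer, merge=merge)
print(result)  # "a dog a lamp"
```

With `budget=3` the loop injects both facts and then stops early once nothing new remains to ask, which is the sense in which description completeness scales with compute.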

📝 Abstract
This paper presents ScaleCap, an inference-time scalable image captioning strategy that generates comprehensive and detailed image captions. The key challenges of high-quality image captioning lie in the inherent biases of LVLMs: multimodal bias, resulting in imbalanced descriptive granularity that offers detailed accounts of some elements while merely skimming over others; and linguistic bias, leading to hallucinated descriptions of non-existent objects. To address these issues, we propose a scalable debiased captioning strategy, which continuously enriches and calibrates the caption with increased inference budget. Specifically, we propose two novel components: heuristic question answering and contrastive sentence rating. The former generates content-specific questions based on the image and answers them to progressively inject relevant information into the caption. The latter employs sentence-level offline contrastive decoding to effectively identify and eliminate hallucinations caused by linguistic biases. With increased inference cost, more heuristic questions are raised by ScaleCap to progressively capture additional visual details, generating captions that are more accurate, balanced, and informative. Extensive modality alignment experiments demonstrate the effectiveness of ScaleCap. Annotating 450K images with ScaleCap and using them for LVLM pretraining leads to consistent performance gains across 11 widely used benchmarks. Furthermore, ScaleCap showcases superb richness and fidelity of generated captions with two additional tasks: replacing images with captions in VQA, and reconstructing images from captions to assess semantic coverage. Code is available at https://github.com/Cooperx521/ScaleCap.
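The contrastive sentence rating idea can be sketched as follows. This is an illustrative assumption about the mechanism, not the paper's exact scoring: a sentence is kept only when conditioning on the image raises its log-probability relative to a text-only pass, and the log-prob values and threshold below are made-up toy numbers.

```python
# Hypothetical sketch of sentence-level contrastive rating: sentences whose
# likelihood barely changes when the image is added are driven mostly by the
# language prior, so they are treated as probable hallucinations and dropped.

def contrastive_score(logp_with_image, logp_without_image):
    """How much the image itself supports the sentence."""
    return logp_with_image - logp_without_image

def filter_caption(sentences, logps_with, logps_without, threshold=0.0):
    """Keep only sentences whose contrastive score exceeds the threshold."""
    return [s for s, lw, lo in zip(sentences, logps_with, logps_without)
            if contrastive_score(lw, lo) > threshold]

sentences = ["A dog sits on a red couch.",
             "A person is holding an umbrella."]
logps_with = [-3.1, -7.8]     # scored with the image in context (toy values)
logps_without = [-6.0, -7.5]  # scored without the image

kept = filter_caption(sentences, logps_with, logps_without)
print(kept)  # only the image-grounded first sentence survives
```

Because both passes can be scored offline over the whole draft caption, the filtering step adds no per-token decoding overhead, which is what makes the sentence-level formulation cheap to scale.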
Problem

Research questions and friction points this paper is trying to address.

Addresses imbalanced descriptive granularity in image captioning
Eliminates hallucinated descriptions of non-existent objects
Enhances caption accuracy and detail with scalable inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-modality debiasing for balanced captions
Heuristic question answering for detail injection
Contrastive sentence rating to eliminate hallucinations