CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning

📅 2026-02-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing image captioning models, which often rely on subjective, incomplete, or erroneous human annotations and struggle to balance descriptive completeness with factual correctness. To move beyond mere imitation of imperfect references, the authors propose a dual-reward reinforcement learning framework that explicitly optimizes for both coverage of all salient visual facts (completeness) and avoidance of hallucinations (correctness). The approach employs a symmetric dual-reward mechanism that integrates dynamic visual query sampling with sub-caption factuality verification. Built upon a multimodal large language model, it disentangles visual semantics into queries so the two objectives can be optimized jointly. Extensive experiments on multiple standard benchmarks demonstrate that the method significantly improves caption quality, outperforming current state-of-the-art approaches in both completeness and factual accuracy.

📝 Abstract
Image captioning remains a fundamental task for vision-language understanding, yet ground-truth supervision still relies predominantly on human-annotated references. Because human annotations reflect subjective preferences and expertise, ground-truth captions are often incomplete or even incorrect, which in turn limits caption models. We argue that caption quality should be assessed by two objective aspects: completeness (does the caption cover all salient visual facts?) and correctness (are the descriptions true with respect to the image?). To this end, we introduce CCCaption: a dual-reward reinforcement learning framework with a dedicated fine-tuning corpus that explicitly optimizes these properties to generate Complete and Correct Captions. For completeness, we use diverse LVLMs to disentangle the image into a set of visual queries, and reward captions that answer more of these queries, with a dynamic query sampling strategy to improve training efficiency. For correctness, we penalize captions that contain hallucinations by validating the authenticity of sub-caption queries, which are derived from the caption decomposition. Our symmetric dual-reward optimization jointly maximizes completeness and correctness, guiding models toward captions that better satisfy these objective criteria. Extensive experiments across standard captioning benchmarks show consistent improvements, offering a principled path to training caption models beyond human-annotation imitation.
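The reward structure described in the abstract can be sketched in a few lines. This is a minimal illustrative sketch, not the authors' implementation: the function names, the `alpha` mixing weight, and the representation of queries and claim verdicts as dictionaries are all assumptions for exposition. Completeness is scored as the fraction of image-derived visual queries the caption answers, correctness as the fraction of decomposed sub-caption claims verified as true, and dynamic query sampling scores only a subset of queries per step for efficiency.

```python
import random


def completeness_reward(caption_answers, visual_queries):
    """Fraction of image-derived visual queries the caption answers."""
    if not visual_queries:
        return 0.0
    answered = sum(1 for q in visual_queries if caption_answers.get(q, False))
    return answered / len(visual_queries)


def correctness_reward(claim_verdicts):
    """Fraction of sub-caption claims verified as true (False = hallucination)."""
    if not claim_verdicts:
        return 0.0
    verified = sum(1 for verdict in claim_verdicts.values() if verdict)
    return verified / len(claim_verdicts)


def dual_reward(caption_answers, visual_queries, claim_verdicts, alpha=0.5):
    """Symmetric combination of completeness and correctness."""
    return (alpha * completeness_reward(caption_answers, visual_queries)
            + (1 - alpha) * correctness_reward(claim_verdicts))


def sample_queries(visual_queries, k, rng=random):
    """Dynamic query sampling: score a random subset per training step."""
    return rng.sample(visual_queries, min(k, len(visual_queries)))
```

For example, a caption answering two of four visual queries while one of its two decomposed claims is hallucinated would receive a dual reward of 0.5 under the symmetric weighting.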
Problem

Research questions and friction points this paper is trying to address.

image captioning
completeness
correctness
ground-truth supervision
hallucination
Innovation

Methods, ideas, or system contributions that make the work stand out.

dual-reward reinforcement learning
image captioning
completeness
correctness
hallucination mitigation
Zhijiang Tang
Postgraduate student at University of Chinese Academy of Sciences
Deep Learning · AI for Science · Time Series Analysis
Linhua Wang
LLM Team, Shopee Pte. Ltd., Shanghai, China
Jiaxin Qi
Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Zhejiang, China
Weihao Jiang
LLM Team, Shopee Pte. Ltd., Shanghai, China
Peng Hou
LLM Team, Shopee Pte. Ltd., Shanghai, China
Anxiang Zeng
LLM Team, Shopee Pte. Ltd., Shanghai, China
Jianqiang Huang
Nanyang Technological University, Chinese Academy of Sciences
Computer Vision · Machine Learning · Causality