CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning

📅 2026-02-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing image captioning models, which often rely on subjective, incomplete, or erroneous human annotations and struggle to balance descriptive completeness with factual correctness. To move beyond mere imitation of imperfect references, the authors propose a dual-reward reinforcement learning framework that explicitly optimizes for both coverage of all salient visual facts (completeness) and avoidance of hallucinations (correctness). The approach employs a symmetric dual-reward mechanism that integrates dynamic visual query sampling with sub-caption factuality verification. Built upon a multimodal large language model, it disentangles visual semantics into queries so the two objectives can be optimized jointly. Extensive experiments on multiple standard benchmarks demonstrate that the method significantly improves caption quality, outperforming current state-of-the-art approaches in both completeness and factual accuracy.

📝 Abstract
Image captioning remains a fundamental task for vision-language understanding, yet ground-truth supervision still relies predominantly on human-annotated references. Because human annotations reflect subjective preferences and expertise, ground-truth captions are often incomplete or even incorrect, which in turn limits caption models. We argue that caption quality should be assessed by two objective aspects: completeness (does the caption cover all salient visual facts?) and correctness (are the descriptions true with respect to the image?). To this end, we introduce CCCaption: a dual-reward reinforcement learning framework with a dedicated fine-tuning corpus that explicitly optimizes these properties to generate Complete and Correct Captions. For completeness, we use diverse LVLMs to disentangle the image into a set of visual queries, and reward captions that answer more of these queries, with a dynamic query sampling strategy to improve training efficiency. For correctness, we penalize captions that contain hallucinations by validating the authenticity of sub-caption queries, which are derived from the caption decomposition. Our symmetric dual-reward optimization jointly maximizes completeness and correctness, guiding models toward captions that better satisfy these objective criteria. Extensive experiments across standard captioning benchmarks show consistent improvements, offering a principled path to training caption models beyond human-annotation imitation.
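The reward structure described in the abstract can be sketched in a few lines. This is a minimal illustrative sketch, not the authors' implementation: the function names, the `alpha` mixing weight, and the representation of queries and claim verdicts as dictionaries are all assumptions for exposition. Completeness is scored as the fraction of image-derived visual queries the caption answers, correctness as the fraction of decomposed sub-caption claims verified as true, and dynamic query sampling scores only a subset of queries per step for efficiency.

```python
import random


def completeness_reward(caption_answers, visual_queries):
    """Fraction of image-derived visual queries the caption answers."""
    if not visual_queries:
        return 0.0
    answered = sum(1 for q in visual_queries if caption_answers.get(q, False))
    return answered / len(visual_queries)


def correctness_reward(claim_verdicts):
    """Fraction of sub-caption claims verified as true (False = hallucination)."""
    if not claim_verdicts:
        return 0.0
    verified = sum(1 for verdict in claim_verdicts.values() if verdict)
    return verified / len(claim_verdicts)


def dual_reward(caption_answers, visual_queries, claim_verdicts, alpha=0.5):
    """Symmetric combination of completeness and correctness."""
    return (alpha * completeness_reward(caption_answers, visual_queries)
            + (1 - alpha) * correctness_reward(claim_verdicts))


def sample_queries(visual_queries, k, rng=random):
    """Dynamic query sampling: score a random subset per training step."""
    return rng.sample(visual_queries, min(k, len(visual_queries)))
```

For example, a caption answering two of four visual queries while one of its two decomposed claims is hallucinated would receive a dual reward of 0.5 under the symmetric weighting.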
Problem

Research questions and friction points this paper is trying to address.

image captioning
completeness
correctness
ground-truth supervision
hallucination
Innovation

Methods, ideas, or system contributions that make the work stand out.

dual-reward reinforcement learning
image captioning
completeness
correctness
hallucination mitigation
Zhijiang Tang
Postgraduate student at University of Chinese Academy of Sciences
Deep Learning · AI for Science · Time Series Analysis
Linhua Wang
LLM Team, Shopee Pte. Ltd., Shanghai, China
Jiaxin Qi
Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Zhejiang, China
Weihao Jiang
LLM Team, Shopee Pte. Ltd., Shanghai, China
Peng Hou
LLM Team, Shopee Pte. Ltd., Shanghai, China
Anxiang Zeng
LLM Team, Shopee Pte. Ltd., Shanghai, China
Jianqiang Huang
Nanyang Technological University, Chinese Academy of Sciences
Computer Vision · Machine Learning · Causality