🤖 AI Summary
In image captioning, existing methods built on multimodal large language models (MLLMs) suffer from hallucination and omission of fine-grained details, yielding inaccurate and incomplete descriptions. To address this, we propose a vision-reconstruction-driven iterative annotation optimization framework—the first to employ text-to-image reconstruction error as a self-supervised signal for guiding MLLM self-correction in a closed-loop paradigm. We further introduce RICO-Flash, a lightweight scheme that combines discrepancy-aware prompt engineering with direct preference optimization (DPO)-based distillation to iteratively refine captioning capability. Our method synergistically leverages state-of-the-art text-to-image models (e.g., SDXL) and MLLMs without requiring additional human annotations. On CapsBench and CompreCap, it achieves average improvements of ~10% in both accuracy and completeness metrics over prior SOTA. The code is publicly available.
📝 Abstract
Image recaptioning is widely used to generate training datasets with enhanced quality for various multimodal tasks. Existing recaptioning methods typically rely on powerful multimodal large language models (MLLMs) to enhance textual descriptions, but often suffer from inaccuracies due to hallucinations and incompleteness caused by missing fine-grained details. To address these limitations, we propose RICO, a novel framework that refines captions through visual reconstruction. Specifically, we leverage a text-to-image model to reconstruct a caption into a reference image, and prompt an MLLM to identify discrepancies between the original and reconstructed images to refine the caption. This process is performed iteratively, progressively promoting the generation of more faithful and comprehensive descriptions. To mitigate the additional computational cost induced by the iterative process, we introduce RICO-Flash, which learns to generate captions like RICO using direct preference optimization (DPO). Extensive experiments demonstrate that our approach significantly improves caption accuracy and completeness, outperforming most baselines by approximately 10% on both CapsBench and CompreCap. Code released at https://github.com/wangyuchi369/RICO.
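The closed loop described in the abstract can be sketched as follows. This is a minimal sketch of the control flow only: the three model calls (`generate_image`, `find_discrepancies`, `refine_caption`) are hypothetical stand-ins for the text-to-image model and the MLLM used in the actual system, stubbed here so the loop runs end to end.

```python
# Hedged sketch of the RICO-style reconstruct-compare-refine loop.
# The three helpers below are hypothetical stubs, not the paper's API:
# in the real pipeline they would call a text-to-image model (e.g. SDXL)
# and an MLLM.

def generate_image(caption: str) -> str:
    """Stub for a text-to-image model reconstructing the caption."""
    return f"<image rendered from: {caption}>"

def find_discrepancies(original_image: str, reconstructed_image: str) -> list[str]:
    """Stub for the MLLM comparing the original and reconstructed images.

    The real MLLM would list mismatched or missing fine-grained details;
    an empty list means the reconstruction already matches the original.
    """
    return []

def refine_caption(caption: str, discrepancies: list[str]) -> str:
    """Stub for the MLLM rewriting the caption to address each discrepancy."""
    return caption + " " + "; ".join(discrepancies)

def rico_refine(original_image: str, caption: str, max_iters: int = 3) -> str:
    """Iteratively refine a caption via visual reconstruction."""
    for _ in range(max_iters):
        reconstruction = generate_image(caption)
        discrepancies = find_discrepancies(original_image, reconstruction)
        if not discrepancies:  # reconstruction is faithful; stop early
            break
        caption = refine_caption(caption, discrepancies)
    return caption
```

RICO-Flash amortizes this loop by distilling the iterated outputs into a single-pass captioner via DPO, trading per-image iteration cost for a one-time training cost.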