🤖 AI Summary
In image captioning, existing methods built on multimodal large language models (MLLMs) suffer from hallucination and omission of fine-grained details, yielding inaccurate and incomplete descriptions. To address this, we propose a vision-reconstruction-driven iterative annotation optimization framework—the first to employ text-to-image reconstruction error as a self-supervised signal for guiding MLLM self-correction in a closed-loop paradigm. We further introduce RICO-Flash, a lightweight scheme that combines discrepancy-aware prompt engineering with direct preference optimization (DPO)-based distillation to iteratively refine captioning capability. Our method synergistically leverages state-of-the-art text-to-image models (e.g., SDXL) and MLLMs without requiring additional human annotations. On CapsBench and CompreCap, it achieves average improvements of ~10% in both accuracy and completeness metrics over prior SOTA. The code is publicly available.
📝 Abstract
Image recaptioning is widely used to generate training datasets with enhanced quality for various multimodal tasks. Existing recaptioning methods typically rely on powerful multimodal large language models (MLLMs) to enhance textual descriptions, but often suffer from inaccuracies due to hallucinations and incompleteness caused by missing fine-grained details. To address these limitations, we propose RICO, a novel framework that refines captions through visual reconstruction. Specifically, we leverage a text-to-image model to reconstruct a caption into a reference image, and prompt an MLLM to identify discrepancies between the original and reconstructed images to refine the caption. This process is performed iteratively, progressively promoting the generation of more faithful and comprehensive descriptions. To mitigate the additional computational cost induced by the iterative process, we introduce RICO-Flash, which learns to generate captions like RICO using direct preference optimization (DPO). Extensive experiments demonstrate that our approach significantly improves caption accuracy and completeness, outperforming most baselines by approximately 10% on both CapsBench and CompreCap. Code released at https://github.com/wangyuchi369/RICO.
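The closed loop described in the abstract can be sketched as follows. This is a minimal sketch of the control flow only: the three model calls (`generate_image`, `find_discrepancies`, `refine_caption`) are hypothetical stand-ins for the text-to-image model and the MLLM used in the actual system, stubbed here so the loop runs end to end.

```python
# Hedged sketch of the RICO-style reconstruct-compare-refine loop.
# The three helpers below are hypothetical stubs, not the paper's API:
# in the real pipeline they would call a text-to-image model (e.g. SDXL)
# and an MLLM.

def generate_image(caption: str) -> str:
    """Stub for a text-to-image model reconstructing the caption."""
    return f"<image rendered from: {caption}>"

def find_discrepancies(original_image: str, reconstructed_image: str) -> list[str]:
    """Stub for the MLLM comparing the original and reconstructed images.

    The real MLLM would list mismatched or missing fine-grained details;
    an empty list means the reconstruction already matches the original.
    """
    return []

def refine_caption(caption: str, discrepancies: list[str]) -> str:
    """Stub for the MLLM rewriting the caption to address each discrepancy."""
    return caption + " " + "; ".join(discrepancies)

def rico_refine(original_image: str, caption: str, max_iters: int = 3) -> str:
    """Iteratively refine a caption via visual reconstruction."""
    for _ in range(max_iters):
        reconstruction = generate_image(caption)
        discrepancies = find_discrepancies(original_image, reconstruction)
        if not discrepancies:  # reconstruction is faithful; stop early
            break
        caption = refine_caption(caption, discrepancies)
    return caption
```

RICO-Flash amortizes this loop by distilling the iterated outputs into a single-pass captioner via DPO, trading per-image iteration cost for a one-time training cost.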