🤖 AI Summary
This work addresses the prevalent issue of insufficient vision-language alignment in current vision-language models, which often leads to generic or hallucinated image captions. To mitigate this, the authors propose a novel self-supervised approach that leverages cycle consistency across image→text→image reconstruction as a direct training signal, without requiring human annotations. Specifically, they form a closed loop by integrating a pretrained text-to-image generator with a vision-language model and use the similarity between the original and reconstructed images as a reward signal. The model is fine-tuned using Group Relative Policy Optimization (GRPO). Experiments demonstrate consistent and significant improvements in caption accuracy and factual correctness across four vision-language models of varying scales, outperforming existing supervised methods in both hallucination suppression and overall generation quality.
📄 Abstract
Vision-Language Models (VLMs) have achieved remarkable progress in image captioning, visual question answering, and visual reasoning. Yet they remain prone to vision-language misalignment, often producing overly generic or hallucinated descriptions. Existing approaches address this via instruction tuning, which requires costly, large-scale annotated datasets, or via complex test-time frameworks for caption refinement. In this work, we revisit image-text alignment through the lens of cycle consistency: given an image and a caption generated by an image-to-text model, the backward mapping through a text-to-image model should reconstruct an image that closely matches the original. In our setup, a VLM serves as the image-to-text component, while a pre-trained text-to-image model closes the loop by reconstructing the image from the generated caption. Building on this, we introduce CycleCap, a fine-tuning scheme that improves image captioning using Group Relative Policy Optimization (GRPO) with a reward based on the similarity between the original and reconstructed images, computed on the fly. Unlike previous work that uses a cycle-consistency loss to construct preference datasets, our method leverages cycle consistency directly as a self-supervised training signal. This enables training on raw images alone, eliminating the need for curated image-text datasets, while steering the VLM to produce more accurate and grounded descriptions. Applied to four VLMs ranging from 1B to 7B parameters, CycleCap yields consistent improvements across captioning and hallucination benchmarks, surpassing state-of-the-art methods that rely on supervised cycle-consistency training.
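The reward loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes the reconstruction similarity is a cosine similarity between embeddings of the original and reconstructed images (e.g., from a perceptual encoder), and shows how GRPO would turn a group of per-caption rewards into group-relative advantages. The function names and the normalization constant are illustrative choices.

```python
import numpy as np

def cycle_consistency_reward(original_emb: np.ndarray,
                             reconstructed_emb: np.ndarray) -> float:
    """Scalar reward for one sampled caption: cosine similarity between
    the embedding of the original image and the embedding of the image
    reconstructed from that caption by the text-to-image model.
    (Assumption: the paper's similarity metric is embedding-based.)"""
    a = original_emb / np.linalg.norm(original_emb)
    b = reconstructed_emb / np.linalg.norm(reconstructed_emb)
    return float(a @ b)

def grpo_advantages(rewards: list[float]) -> np.ndarray:
    """Group-relative advantages as used in GRPO: each caption's reward is
    normalized against the mean and std of its sampling group, so no
    learned value function (critic) is needed."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)  # epsilon avoids div-by-zero

# Toy usage: one image, a group of 4 sampled captions, each reconstructed
# into an image whose embedding is compared with the original's.
original = np.array([0.6, 0.8, 0.0])
reconstructions = [np.array([0.6, 0.8, 0.0]),   # faithful caption
                   np.array([0.8, 0.6, 0.0]),
                   np.array([0.0, 1.0, 0.0]),
                   np.array([1.0, 0.0, 0.0])]   # hallucinated caption
rewards = [cycle_consistency_reward(original, r) for r in reconstructions]
advantages = grpo_advantages(rewards)  # positive for above-average captions
```

In the actual training loop these advantages would weight the policy-gradient update of the VLM, pushing it toward captions whose reconstructions better match the original image.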