CycleCap: Improving VLMs Captioning Performance via Self-Supervised Cycle Consistency Fine-Tuning

๐Ÿ“… 2026-03-18
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
This work addresses the prevalent issue of insufficient vision-language alignment in current vision-language models, which often leads to generic or hallucinated image captions. To mitigate this, the authors propose a novel self-supervised approach that leverages cycle consistency across imageโ€“textโ€“image reconstruction as a direct training signal, without requiring human annotations. Specifically, they form a closed loop by integrating a pretrained text-to-image generator with a vision-language model and use the similarity between the original and reconstructed images as a reward signal. The model is fine-tuned using Group Relative Policy Optimization (GRPO). Experiments demonstrate consistent and significant improvements in caption accuracy and factual correctness across four vision-language models of varying scales, outperforming existing supervised methods in both hallucination suppression and overall generation quality.

๐Ÿ“ Abstract
Visual-Language Models (VLMs) have achieved remarkable progress in image captioning, visual question answering, and visual reasoning. Yet they remain prone to vision-language misalignment, often producing overly generic or hallucinated descriptions. Existing approaches address this either via instruction tuning, which requires costly, large-scale annotated datasets, or via complex test-time frameworks for caption refinement. In this work, we revisit image-text alignment through the lens of cycle consistency: given an image and a caption generated by an image-to-text model, the backward mapping through a text-to-image model should reconstruct an image that closely matches the original. In our setup, a VLM serves as the image-to-text component, while a pre-trained text-to-image model closes the loop by reconstructing the image from the generated caption. Building on this, we introduce CycleCap, a fine-tuning scheme to improve image captioning using Group Relative Policy Optimization (GRPO) with a reward based on the similarity between the original and reconstructed images, computed on-the-fly. Unlike previous work that uses cycle consistency loss for preference dataset construction, our method leverages cycle consistency directly as a self-supervised training signal. This enables the use of raw images alone, eliminating the need for curated image-text datasets, while steering the VLM to produce more accurate and grounded text descriptions. Applied to four VLMs ranging from 1B to 7B parameters, CycleCap yields consistent improvements across captioning and hallucination benchmarks, surpassing state-of-the-art methods that rely on supervised cycle consistency training.
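The reward pipeline described in the abstract (embed the original and reconstructed images, score their similarity, then compute GRPO's group-relative advantages over a batch of sampled captions) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the embedding vectors are assumed to come from some external image encoder, and the function names are hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cycle_consistency_rewards(embed_original, embeds_reconstructed):
    """One reward per sampled caption: similarity of the image reconstructed
    from that caption to the original image (both as precomputed embeddings)."""
    return [cosine_similarity(embed_original, e) for e in embeds_reconstructed]

def grpo_advantages(rewards):
    """Group-relative advantages as used in GRPO: standardize each caption's
    reward against the mean and std of its sampling group."""
    r = np.asarray(rewards, dtype=float)
    std = r.std()
    if std == 0.0:  # all captions scored equally; no learning signal
        return np.zeros_like(r)
    return (r - r.mean()) / std

# Toy usage: two captions, one whose reconstruction matches the original.
original = [1.0, 0.0]
reconstructions = [[1.0, 0.0], [0.0, 1.0]]
rewards = cycle_consistency_rewards(original, reconstructions)
advantages = grpo_advantages(rewards)
```

The faithful caption receives a positive advantage and the unfaithful one a negative advantage, which is the signal the GRPO update would push the VLM toward.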
Problem

Research questions and friction points this paper is trying to address.

vision-language misalignment
image captioning
hallucination
visual grounding
Innovation

Methods, ideas, or system contributions that make the work stand out.

cycle consistency
self-supervised fine-tuning
visual-language models
image captioning
hallucination mitigation
Marios Krestenitis
Queen Mary, University of London
Christos Tzelepis
Senior Research Scientist at Samsung AI Centre Cambridge
Generative AI · Computer Vision · Machine Learning
Konstantinos Ioannidis
Centre for Research and Technology Hellas
Stefanos Vrochidis
Centre for Research and Technology Hellas
Ioannis Kompatsiaris
Centre for Research and Technology Hellas
Georgios Tzimiropoulos
Queen Mary, University of London
Shaogang Gong
Queen Mary University of London
Computer Vision · Machine Learning · Object Recognition · Action Recognition · Video Analysis
Ioannis Patras
Professor, Queen Mary, University of London
Computer Vision · Machine Learning · Artificial Intelligence · Face and Gesture Recognition · Multimedia Analysis