Cycle Consistency as Reward: Learning Image-Text Alignment without Human Preferences

πŸ“… 2025-06-02
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address the scarcity of scalable supervision signals for cross-modal alignment, this paper proposes a self-supervised learning framework that requires neither human nor AI-generated preference annotations. The core innovation lies in modeling bidirectional cycle consistency (image ↔ text ↔ reconstruction) as a general, unbiased, and differentiable alignment reward signal. Leveraging Stable Diffusion and CLIP, the authors construct a large-scale (866K-pair) self-supervised preference dataset. The method supports Best-of-N verification and both DPO and Diffusion DPO optimization. Experiments demonstrate that the proposed reward model outperforms existing alignment metrics on fine-grained image captioning evaluation, and significantly improves performance on VQA, image–text retrieval, and text-to-image generation tasks. Crucially, the approach incurs zero annotation overhead, enabling strong scalability.

πŸ“ Abstract
Learning alignment between language and vision is a fundamental challenge, especially as multimodal data becomes increasingly detailed and complex. Existing methods often rely on collecting human or AI preferences, which can be costly and time-intensive. We propose an alternative approach that leverages cycle consistency as a supervisory signal. Given an image and generated text, we map the text back to image space using a text-to-image model and compute the similarity between the original image and its reconstruction. Analogously, for text-to-image generation, we measure the textual similarity between an input caption and its reconstruction through the cycle. We use the cycle consistency score to rank candidates and construct a preference dataset of 866K comparison pairs. The reward model trained on our dataset outperforms state-of-the-art alignment metrics on detailed captioning, with superior inference-time scalability when used as a verifier for Best-of-N sampling. Furthermore, performing DPO and Diffusion DPO using our dataset enhances performance across a wide range of vision-language tasks and text-to-image generation. Our dataset, model, and code are at https://cyclereward.github.io
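The core scoring idea in the abstract, mapping an input through the generation cycle and comparing it to its reconstruction, can be sketched as a similarity over embeddings. This is a minimal sketch, not the paper's implementation: the toy vectors stand in for real model features (e.g. CLIP image embeddings), and `cycle_consistency_reward` is a hypothetical name.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def cycle_consistency_reward(original_emb: np.ndarray,
                             reconstruction_emb: np.ndarray) -> float:
    """Score a candidate by how closely the round-trip reconstruction
    (image -> text -> image, or text -> image -> text) matches the input."""
    return cosine_sim(original_emb, reconstruction_emb)

# Toy stand-ins for embeddings produced by real vision/text encoders.
image_emb = np.array([0.9, 0.1, 0.2])
good_recon = np.array([0.88, 0.12, 0.25])  # faithful caption -> close reconstruction
bad_recon  = np.array([0.10, 0.90, 0.40])  # wrong caption -> distant reconstruction

assert cycle_consistency_reward(image_emb, good_recon) > \
       cycle_consistency_reward(image_emb, bad_recon)
```

A faithful candidate yields a reconstruction near the original in embedding space, so its reward is higher; this ordering is what makes the score usable for ranking.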
Problem

Research questions and friction points this paper is trying to address.

Learning image-text alignment without costly human preferences
Using cycle consistency as a reward signal for alignment
Improving vision-language tasks and text-to-image generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses bidirectional cycle consistency as a differentiable reward signal
Maps generated text back to image space (and generated images back to text) to measure reconstruction similarity
Constructs a large-scale (866K-pair) preference dataset automatically, with no human annotation
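The automatic dataset construction above can be illustrated as ranking candidates by their cycle consistency scores and pairing the top-ranked (chosen) candidate against each lower-scored (rejected) one. A minimal sketch under assumed inputs; `make_preference_pairs` and the example scores are hypothetical, not from the paper.

```python
def make_preference_pairs(candidates: list[str], scores: list[float]) -> list[tuple[str, str]]:
    """Rank candidates by cycle consistency score (descending) and emit
    (chosen, rejected) comparison pairs for preference training."""
    ranked = sorted(zip(candidates, scores), key=lambda cs: cs[1], reverse=True)
    best = ranked[0][0]  # Best-of-N winner under the cycle consistency verifier
    return [(best, other) for other, _ in ranked[1:]]

captions = ["a red bus on a street", "a bus", "a dog in a park"]
scores = [0.91, 0.74, 0.22]  # hypothetical cycle consistency scores
pairs = make_preference_pairs(captions, scores)
```

Pairs built this way can feed directly into DPO-style training, since each pair already encodes a chosen/rejected preference without any human label.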