🤖 AI Summary
To address the scarcity of scalable supervision signals for cross-modal alignment, this paper proposes a self-supervised learning framework that requires neither human nor AI-generated preference annotations. The core innovation is modeling bidirectional cycle consistency (image → text → reconstruction) as a general, differentiable alignment reward. Leveraging Stable Diffusion and CLIP, the authors construct a large-scale self-supervised preference dataset of 866K comparison pairs. The method supports Best-of-N verification as well as DPO and Diffusion-DPO optimization. Experiments demonstrate that the resulting reward model outperforms existing alignment metrics on fine-grained image captioning evaluation, and significantly improves performance on VQA, image–text retrieval, and text-to-image generation. Crucially, the approach incurs no annotation overhead, enabling strong scalability.
📝 Abstract
Learning alignment between language and vision is a fundamental challenge, especially as multimodal data becomes increasingly detailed and complex. Existing methods often rely on collecting human or AI preferences, which can be costly and time-intensive. We propose an alternative approach that leverages cycle consistency as a supervisory signal. Given an image and generated text, we map the text back to image space using a text-to-image model and compute the similarity between the original image and its reconstruction. Analogously, for text-to-image generation, we measure the textual similarity between an input caption and its reconstruction through the cycle. We use the cycle consistency score to rank candidates and construct a preference dataset of 866K comparison pairs. The reward model trained on our dataset outperforms state-of-the-art alignment metrics on detailed captioning, with superior inference-time scalability when used as a verifier for Best-of-N sampling. Furthermore, performing DPO and Diffusion DPO using our dataset enhances performance across a wide range of vision-language tasks and text-to-image generation. Our dataset, model, and code are at https://cyclereward.github.io
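The image-side scoring loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `text_to_image` and `embed_image` are deterministic toy stand-ins for the Stable Diffusion generator and CLIP image encoder the paper actually uses, and the candidate captions are made up for the example.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# --- Toy stand-ins (hypothetical; the paper uses Stable Diffusion + CLIP) ---
def text_to_image(caption):
    # A real system would render the caption with a text-to-image model.
    fake_renders = {"a red fox": [1.0, 0.1], "a blue car": [0.0, 1.0]}
    return fake_renders[caption]

def embed_image(img):
    # A real system would return a CLIP embedding; here images already
    # live in the embedding space, so this is the identity.
    return list(img)

# --- Cycle consistency reward: image -> candidate text -> reconstruction ---
def cycle_consistency_score(image_emb, caption):
    recon_emb = embed_image(text_to_image(caption))
    return cosine_sim(image_emb, recon_emb)

def best_of_n(image_emb, captions):
    """Rank N candidate captions by cycle consistency (Best-of-N verifier)."""
    scores = [cycle_consistency_score(image_emb, c) for c in captions]
    best_idx = max(range(len(scores)), key=scores.__getitem__)
    return captions[best_idx], scores

image_emb = embed_image([1.0, 0.0])      # embedding of the input image
captions = ["a red fox", "a blue car"]   # N candidate captions to rank
best, scores = best_of_n(image_emb, captions)
```

Ranked pairs produced this way (higher-scoring caption preferred over lower-scoring one) are exactly the kind of comparisons that populate the 866K-pair preference dataset.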