🤖 AI Summary
Existing reward models generalize poorly, while supervised fine-tuning often induces memorization rather than robust reasoning. To address these issues, this paper proposes listener-augmented Group Relative Policy Optimization (GRPO), a reinforcement learning framework that integrates chain-of-thought (CoT) reasoning with multi-model collaboration. Its core innovation is a frozen, off-the-shelf vision-language model that acts as a "listener," re-evaluating the reasoner's CoT outputs to generate denser, better-calibrated reward signals, thereby mitigating reasoning inconsistencies and improving alignment with human visual preferences. The method combines CoT generation, listener-based confidence scoring, and group-relative policy optimization for data-efficient, scalable training. The listener-shaped scheme achieves 67.4% accuracy on the ImageReward benchmark, and under out-of-distribution evaluation on a million-scale preference dataset it improves generalization by up to 6% over strong baselines.
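For context on the underlying RL algorithm: GRPO computes advantages by normalizing each sampled completion's reward against the statistics of its own sampling group, rather than against a learned value function. A minimal sketch of that group-relative normalization step (standard GRPO machinery, not code released with the paper):

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: z-score each completion's reward within its group.

    All completions in `rewards` were sampled for the same prompt; a completion
    is reinforced only insofar as it beats its siblings.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against identical rewards
    return [(r - mean) / std for r in rewards]
```

Because advantages are centered within each group, they sum to zero: a group where every completion earns the same reward provides no learning signal, which is exactly why denser, better-calibrated listener rewards help.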
📝 Abstract
Training robust and generalizable reward models for human visual preferences is essential for aligning text-to-image and text-to-video generative models with human intent. However, current reward models often fail to generalize, and supervised fine-tuning leads to memorization while demanding complex annotation pipelines. While reinforcement learning (RL), specifically Group Relative Policy Optimization (GRPO), improves generalization, we uncover a key failure mode: a significant drop in reasoning accuracy occurs when a model's reasoning trace contradicts that of an independent, frozen vision-language model ("listener") evaluating the same output. To address this, we introduce a listener-augmented GRPO framework. Here, the listener re-evaluates the reasoner's chain-of-thought to provide a dense, calibrated confidence score, shaping the RL reward signal. This encourages the reasoner not only to answer correctly, but to produce explanations that are persuasive to an independent model. Our listener-shaped reward scheme achieves the best accuracy on the ImageReward benchmark (67.4%), significantly improves out-of-distribution (OOD) performance on a large-scale human preference dataset (1.2M votes, up to +6% over a naive reasoner), and reduces reasoning contradictions compared to strong GRPO and SFT baselines. These results demonstrate that listener-based rewards provide a scalable, data-efficient path to aligning vision-language models with nuanced human preferences. We will release our reasoning model here: https://huggingface.co/alexgambashidze/qwen2.5vl_image_preference_reasoner.