🤖 AI Summary
Existing reward models generalize poorly, while supervised fine-tuning often induces memorization rather than robust reasoning. To address these issues, this paper proposes listener-augmented Group Relative Policy Optimization (GRPO), a reinforcement learning framework that integrates chain-of-thought (CoT) reasoning with multi-model collaboration. Its core innovation is a frozen, off-the-shelf vision-language model that acts as a "listener," re-evaluating the reasoner's CoT outputs to generate denser, better-calibrated reward signals, thereby mitigating reasoning inconsistencies and improving alignment with human visual preferences. The method combines CoT generation, listener-based confidence scoring, and group-relative policy optimization for data-efficient, scalable training. The listener-shaped scheme achieves 67.4% accuracy on the ImageReward benchmark, and under out-of-distribution evaluation on a million-scale preference dataset it improves generalization by up to 6% over strong baselines.
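For context on the underlying RL algorithm: GRPO computes advantages by normalizing each sampled completion's reward against the statistics of its own sampling group, rather than against a learned value function. A minimal sketch of that group-relative normalization step (standard GRPO machinery, not code released with the paper):

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: z-score each completion's reward within its group.

    All completions in `rewards` were sampled for the same prompt; a completion
    is reinforced only insofar as it beats its siblings.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against identical rewards
    return [(r - mean) / std for r in rewards]
```

Because advantages are centered within each group, they sum to zero: a group where every completion earns the same reward provides no learning signal, which is exactly why denser, better-calibrated listener rewards help.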
📝 Abstract
Training robust and generalizable reward models for human visual preferences is essential for aligning text-to-image and text-to-video generative models with human intent. However, current reward models often fail to generalize, and supervised fine-tuning leads to memorization while demanding complex annotation pipelines. While reinforcement learning (RL), specifically Group Relative Policy Optimization (GRPO), improves generalization, we uncover a key failure mode: a significant drop in reasoning accuracy occurs when a model's reasoning trace contradicts that of an independent, frozen vision-language model ("listener") evaluating the same output. To address this, we introduce a listener-augmented GRPO framework. Here, the listener re-evaluates the reasoner's chain-of-thought to provide a dense, calibrated confidence score, shaping the RL reward signal. This encourages the reasoner not only to answer correctly, but to produce explanations that are persuasive to an independent model. Our listener-shaped reward scheme achieves the best accuracy on the ImageReward benchmark (67.4%), significantly improves out-of-distribution (OOD) performance on a large-scale human preference dataset (1.2M votes, up to +6% over a naive reasoner), and reduces reasoning contradictions compared to strong GRPO and SFT baselines. These results demonstrate that listener-based rewards provide a scalable, data-efficient path to aligning vision-language models with nuanced human preferences. We will release our reasoning model here: https://huggingface.co/alexgambashidze/qwen2.5vl_image_preference_reasoner.