DUEL: Adversarial Self-Play for Multimodal Reasoning

📅 2026-05-23
🤖 AI Summary
This work addresses the tendency of unsupervised reinforcement learning for vision-language models (VLMs) to drift toward biased solutions, which stems from weak visual grounding and the absence of reliable verification signals. The authors propose DUEL, a framework built on adversarial self-play between two policies initialized from the same pretrained VLM. A Challenger generates image-grounded pairs consisting of a true claim and a minimally perturbed hard-negative claim, and a Solver verifies both against the image, which requires fine-grained visual reasoning and enables self-evolving training without human annotations. DUEL combines hard-negative generation with a length-normalized log-likelihood reward. According to the authors' experiments, it improves visual reasoning and robust discrimination without relying on external reward models or image editing tools.
📝 Abstract
Reinforcement learning (RL) has emerged as an effective paradigm for improving the reasoning capability of vision-language models (VLMs). However, RL-based optimization typically depends on costly high-quality annotations that are difficult to scale. Existing unsupervised alternatives may drift toward biased solutions due to weak visual grounding and the lack of reliable verification signals. We propose a self-evolving post-training framework, DUEL, where supervision emerges from adversarial interactions between two policies initialized from the same pretrained VLM. A Challenger generates an image-grounded true claim together with a minimally perturbed hard-negative counterpart, while a Solver verifies both claims against the image, encouraging fine-grained visual discrimination under near-neighbor semantics. To stabilize optimization, we introduce a length-normalized log-likelihood reward that preserves informative optimization signals beyond binary outcome supervision and improves learning stability under sparse feedback. Experiments show that DUEL consistently improves visual reasoning and robust discrimination without additional human annotations, external reward models, or image editing tools.
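The abstract does not give the exact form of the length-normalized log-likelihood reward, so the following is only a minimal sketch of one plausible reading: the reward is the mean per-token log-probability that the Solver assigns to the correct verdict, contrasted with a binary outcome reward. The function names (`length_normalized_log_likelihood`, `binary_outcome_reward`) and the toy log-probabilities are hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence


def length_normalized_log_likelihood(token_logprobs: Sequence[float]) -> float:
    """Mean per-token log-probability of a target sequence.

    Assumption: `token_logprobs` holds the Solver's log-probabilities for each
    token of the correct verdict. Dividing by the length keeps longer targets
    from being penalized simply for having more tokens.
    """
    if not token_logprobs:
        raise ValueError("token_logprobs must contain at least one value")
    return sum(token_logprobs) / len(token_logprobs)


def binary_outcome_reward(predicted: str, label: str) -> float:
    """Sparse baseline: 1.0 if the verdict matches the label, else 0.0."""
    return 1.0 if predicted == label else 0.0


# Two hypothetical targets with the same per-token confidence but different lengths.
short_target = [-0.2, -0.4]
long_target = [-0.2, -0.4, -0.2, -0.4]

short_reward = length_normalized_log_likelihood(short_target)
long_reward = length_normalized_log_likelihood(long_target)

# The unnormalized sums differ, while the normalized rewards agree.
raw_short = sum(short_target)
raw_long = sum(long_target)

# exp(mean log-prob) is the geometric mean of the token probabilities,
# which always lies in (0, 1].
short_confidence = math.exp(short_reward)
```

Unlike the binary reward, which is identical for every wrong answer, the normalized log-likelihood still separates a near-miss from a confident error. This is one way to read the abstract's claim that the reward "preserves informative optimization signals beyond binary outcome supervision".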
Problem

Research questions and friction points this paper is trying to address.

vision-language models
reinforcement learning
unsupervised learning
visual grounding
reasoning capability
Innovation

Methods, ideas, or system contributions that make the work stand out.

adversarial self-play
vision-language models
unsupervised reinforcement learning
hard-negative mining
length-normalized reward