Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning

📅 2026-04-05
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current large vision-language models lack mechanisms to verify the consistency between their reasoning processes and supporting evidence in video reasoning, leading to insufficient answer reliability and interpretability. This work proposes the RLER dual-paradigm framework: during training, it employs group-relative reinforcement learning with three novel reward signals—frame sensitivity, reasoning transparency, and repetition resistance—to encourage the generation of structured, machine-readable reasoning evidence; at inference time, it introduces a tuning-free coordinator that performs weighted voting over multiple candidate answers based on evidence consistency and confidence. By explicitly decoupling yet synergizing evidence generation and evidence-driven answer selection for the first time, the method achieves state-of-the-art performance across eight mainstream video reasoning benchmarks, yielding an average improvement of 6.3% while requiring only 3.1 candidate answers to balance efficiency and accuracy.
📝 Abstract
Video reasoning has advanced with large multimodal models (LMMs), yet their inference is often a single pass that returns an answer without verifying whether the reasoning is evidence-aligned. We introduce Reinforce to Learn, Elect to Reason (RLER), a dual paradigm that decouples learning to produce evidence from obtaining a reliable answer. In RLER-Training, we optimize the policy with group-relative reinforcement learning (RL) and three novel task-driven rewards: a frame-sensitive reward grounds reasoning on explicit key frames, a think-transparency reward shapes readable and parsable reasoning traces, and an anti-repetition reward boosts information density. These signals teach the model to emit structured, machine-checkable evidence and strengthen its reasoning capabilities. In RLER-Inference, we apply a train-free orchestrator that generates a small set of diverse candidates, parses their answers and cited frames, scores them by evidence consistency, confidence, transparency, and non-redundancy, and then performs a robust evidence-weighted election. This closes the loop between producing and using evidence, improving reliability and interpretability without enlarging the model. We comprehensively evaluate RLER against various open-source and RL-based LMMs on 8 representative benchmarks. RLER achieves state of the art across all benchmarks and delivers an average improvement of 6.3% over base models, while using on average 3.1 candidates per question, indicating a favorable balance between compute and quality. The results support a simple thesis: making evidence explicit during learning and electing by evidence during inference is a robust path to trustworthy video reasoning.
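The evidence-weighted election described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `Candidate` fields, the Jaccard consistency measure, and the 0.7/0.3 weighting are all assumptions; the paper additionally scores transparency and non-redundancy, which are omitted here.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass

@dataclass
class Candidate:
    answer: str              # final answer parsed from the reasoning trace
    cited_frames: set[int]   # key-frame indices cited as evidence
    confidence: float        # model-reported confidence in [0, 1]

def elect(candidates: list[Candidate]) -> str:
    # Consensus evidence: frames cited by at least half of the candidates.
    counts = Counter(f for c in candidates for f in c.cited_frames)
    consensus = {f for f, n in counts.items() if n >= len(candidates) / 2}

    scores: dict[str, float] = defaultdict(float)
    for c in candidates:
        # Evidence consistency: Jaccard overlap between this candidate's
        # cited frames and the consensus set.
        union = c.cited_frames | consensus
        consistency = len(c.cited_frames & consensus) / len(union) if union else 0.0
        # Weighted vote: blend consistency with confidence (weights assumed).
        scores[c.answer] += 0.7 * consistency + 0.3 * c.confidence
    # Elect the answer with the highest accumulated evidence-weighted score.
    return max(scores, key=scores.__getitem__)
```

With, say, one candidate answering "A" on frames {1, 2} and two candidates answering "B" on overlapping frames {2, 3} and {2, 3, 4}, the consensus set is {2, 3} and "B" wins the election despite lower individual confidences, which is the intended behavior: shared evidence outvotes a lone confident outlier.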
Problem

Research questions and friction points this paper is trying to address.

video reasoning
evidence alignment
reasoning reliability
interpretability
large multimodal models
Innovation

Methods, ideas, or system contributions that make the work stand out.

reinforcement learning
evidence-based reasoning
video understanding
structured reasoning traces
evidence-weighted election
Songyuan Yang
National University of Defense Technology
Weijiang Yu
Associate Professor, CSE, Sun Yat-sen University
Machine Learning · Multimodal AI · AI for Science
Jilin Ma
Sun Yat-sen University
Ziyu Liu
Sun Yat-sen University
Guijian Tang
National University of Defense Technology
Wenjing Yang
National University of Defense Technology
Huibin Tan
National University of Defense Technology
Nong Xiao
Sun Yat-sen University