VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided Iterative Policy Optimization

📅 2025-05-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing Video-LLMs face three key bottlenecks in complex video reasoning: unstable chain-of-thought (CoT) quality, the high cost of acquiring high-quality annotated data, and low training efficiency. To address these, we propose a verifier-guided iterative policy optimization framework featuring a novel Rollout-Aware Verifier module, which establishes a three-stage collaborative training loop (GRPO → Verifier → DPO) that jointly optimizes CoT length and contextual consistency. By employing a compact LLM as a logical discriminator, our method efficiently generates high-quality contrastive samples, significantly enhancing CoT coherence and interpretability. Experiments demonstrate that the DPO stage trains roughly 7× faster than GRPO and that the model consistently outperforms strong baselines, including standard GRPO variants, Kimi-VL, and Video-R1, across diverse video reasoning tasks. Our approach yields longer, more accurate, and more robust reasoning chains, advancing both efficiency and effectiveness in video-grounded reasoning.

📝 Abstract
Applying Reinforcement Learning (RL) to Video Large Language Models (Video-LLMs) shows significant promise for complex video reasoning. However, popular Reinforcement Fine-Tuning (RFT) methods, such as outcome-based Group Relative Policy Optimization (GRPO), are limited by data preparation bottlenecks (e.g., noise or high cost) and exhibit unstable improvements in the quality of long chain-of-thoughts (CoTs) and in downstream performance. To address these limitations, we propose VerIPO, a Verifier-guided Iterative Policy Optimization method designed to gradually improve Video-LLMs' capacity for generating deep, long-term reasoning chains. The core component is the Rollout-Aware Verifier, positioned between the GRPO and Direct Preference Optimization (DPO) training phases to form the GRPO-Verifier-DPO training loop. This verifier leverages small LLMs as judges to assess the reasoning logic of rollouts, enabling the construction of high-quality contrastive data, including reflective and contextually consistent CoTs. These curated preference samples drive the efficient DPO stage (7x faster than GRPO), leading to marked improvements in reasoning-chain quality, especially in length and contextual consistency. The training loop benefits from GRPO's expansive search and DPO's targeted optimization. Experimental results demonstrate: 1) significantly faster and more effective optimization compared to standard GRPO variants, yielding superior performance; 2) our trained models exceed the direct inference of large-scale instruction-tuned Video-LLMs, producing long and contextually consistent CoTs on diverse video reasoning tasks; and 3) our model with one iteration outperforms powerful LMMs (e.g., Kimi-VL) and long reasoning models (e.g., Video-R1), highlighting its effectiveness and stability.
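The GRPO-Verifier-DPO loop the abstract describes can be sketched in miniature as follows. Everything in this snippet is an illustrative stand-in, not the paper's implementation: the function names (`grpo_rollouts`, `verify`, `build_preference_pairs`), the length-based scoring used as a toy proxy for the verifier's logic judgment, and the absence of any actual model update are all assumptions made to keep the sketch self-contained.

```python
import random

def grpo_rollouts(prompt, n=4):
    # Toy stand-in for the GRPO rollout stage: sample n candidate
    # reasoning chains. A real system would sample from the policy model.
    return [f"{prompt} :: chain-{i} " + "step " * random.randint(1, 6)
            for i in range(n)]

def verify(rollout):
    # Toy stand-in for the Rollout-Aware Verifier: a small "judge" scores
    # each rollout. Here chain length is a crude proxy for reasoning depth;
    # the paper uses a small LLM to assess logical quality instead.
    return rollout.count("step")

def build_preference_pairs(rollouts):
    # Turn verifier scores into contrastive (chosen, rejected) pairs,
    # i.e. the curated preference data that feeds the DPO stage.
    ranked = sorted(rollouts, key=verify, reverse=True)
    return [(ranked[0], ranked[-1])]

def training_loop(prompts, iterations=2):
    pairs = []
    for _ in range(iterations):                 # GRPO -> Verifier -> DPO
        for p in prompts:
            rollouts = grpo_rollouts(p)          # 1) expansive search (GRPO)
            pairs += build_preference_pairs(rollouts)  # 2) verifier filtering
        # 3) A DPO update on `pairs` would run here; omitted because this
        #    sketch carries no model parameters.
    return pairs

pairs = training_loop(["Why does the ball fall?"])
```

The key structural point the sketch tries to capture is the division of labor: the rollout stage explores broadly, the verifier converts raw rollouts into preference pairs, and only the cheap preference-optimization step consumes them, which is where the reported 7x speedup over running GRPO end-to-end comes from.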
Problem

Research questions and friction points this paper is trying to address.

Enhancing Video-LLMs' long reasoning chain generation
Overcoming data bottlenecks in Reinforcement Fine-Tuning methods
Improving contextual consistency in chain-of-thoughts for video reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Verifier-guided Iterative Policy Optimization for Video-LLMs
Rollout-Aware Verifier enhances reasoning chain quality
GRPO-Verifier-DPO loop improves contextual consistency