VISD: Enhancing Video Reasoning via Structured Self-Distillation

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two obstacles to training video large language models (VideoLLMs) for complex reasoning: sparse sequence-level rewards and the lack of fine-grained credit assignment over long, temporally grounded reasoning trajectories. The authors propose VISD, a structured self-distillation framework in which a video-aware judge model decomposes reasoning quality into interpretable dimensions (answer correctness, logical consistency, and spatio-temporal grounding). This structured feedback serves as diagnostic privileged information that guides a teacher policy, which in turn provides dense token-level supervision. To integrate this dense supervision stably with reinforcement learning, a direction-magnitude decoupling mechanism lets rollout-level advantages set the update direction while the privileged signals modulate token-level update magnitudes. Curriculum scheduling and an EMA-based teacher further stabilize training. Experiments show that VISD consistently outperforms strong baselines across diverse benchmarks, improving answer accuracy and spatio-temporal grounding quality while converging in roughly half as many optimization steps.
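As a rough illustration of what an EMA-based teacher means in general, the sketch below applies the standard exponential-moving-average update to a dictionary of scalar parameters. The function name `ema_update`, the `decay` value, and the dictionary representation are assumptions made for this example; the summary does not state how VISD parameterizes or schedules its teacher.

```python
def ema_update(teacher, student, decay=0.99):
    """Return teacher parameters moved a small step toward the student.

    Generic EMA rule: teacher <- decay * teacher + (1 - decay) * student.
    The decay of 0.99 is an illustrative assumption, not a value from the paper.
    """
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }


# Toy parameters: the teacher lags behind the student and drifts toward it slowly.
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 2.0}
teacher = ema_update(teacher, student, decay=0.9)
```

Because the teacher changes only a little at each step, the supervision it provides varies more smoothly than the student it tracks, which is the usual motivation for this kind of stabilization.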

📝 Abstract

Training VideoLLMs for complex reasoning remains challenging due to sparse sequence-level rewards and the lack of fine-grained credit assignment over long, temporally grounded reasoning trajectories. While reinforcement learning with verifiable rewards (RLVR) provides reliable supervision, it fails to capture token-level contributions, leading to inefficient learning. Conversely, existing self-distillation methods offer dense supervision but lack structure and diagnostic specificity, and often interact unstably with reinforcement learning. In this work, we propose VISD, a structured self-distillation framework that introduces diagnostically meaningful privileged information for video reasoning. VISD employs a video-aware judge model to decompose reasoning quality into multiple dimensions, including answer correctness, logical consistency, and spatio-temporal grounding, and uses this structured feedback to guide a teacher policy for token-level supervision. To stably integrate dense supervision with RL, we introduce a direction-magnitude decoupling mechanism, where rollout-level advantages computed from rewards determine update direction, while structured privileged signals modulate token-level update magnitudes. This design enables semantically aligned and fine-grained credit assignment, improving both reasoning faithfulness and training efficiency. Additionally, VISD incorporates curriculum scheduling and EMA-based teacher stabilization to support robust optimization over long video sequences. Experiments on diverse benchmarks show that VISD consistently outperforms strong baselines, improving answer accuracy and spatio-temporal grounding quality. Notably, VISD reaches these gains with nearly 2x faster convergence in optimization steps, highlighting the effectiveness of structured self-supervision in improving both performance and sample efficiency for VideoLLMs.
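The direction-magnitude decoupling idea can be sketched in a few lines. The abstract gives no formula, so everything below is an assumption made for illustration: the function name `token_weights`, the clipping `floor`, and the choice to normalize magnitudes to a mean of 1 are hypothetical. The only property taken from the text is that the rollout-level advantage fixes the sign of the update while per-token signals change its size.

```python
def token_weights(advantage, magnitudes, floor=0.1):
    """Combine one rollout-level advantage with per-token magnitude signals.

    advantage:  scalar advantage of the whole rollout (sets the direction).
    magnitudes: per-token scores, e.g. derived from teacher feedback
                (hypothetical input; only their relative size is used).
    floor:      assumed lower clip so that no token's weight is zeroed out.
    """
    if not magnitudes:
        return []
    # Direction comes only from the rollout-level advantage: -1, 0, or +1.
    direction = (advantage > 0) - (advantage < 0)
    # Magnitudes are clipped to [floor, 1], so they can never flip the sign.
    clipped = [min(1.0, max(floor, m)) for m in magnitudes]
    # Normalizing to mean 1 keeps the rollout's total weight unchanged.
    mean = sum(clipped) / len(clipped)
    return [direction * abs(advantage) * c / mean for c in clipped]


# A rollout with a negative advantage: every token is pushed down,
# but tokens with larger magnitude signals are pushed harder.
weights = token_weights(-2.0, [0.2, 0.9, 0.5])
```

In this sketch the dense signal only redistributes the update across tokens, which is one simple way to keep it from contradicting the reward-derived direction.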

Problem

Research questions and friction points this paper is trying to address.

Video Reasoning

Credit Assignment

Sparse Rewards

Token-level Supervision

VideoLLMs

Innovation

Methods, ideas, or system contributions that make the work stand out.

Structured Self-Distillation

Video Reasoning

Fine-grained Credit Assignment