🤖 AI Summary
Current video reasoning models exhibit significantly limited robustness under real-world perturbations such as weather variations, occlusions, and camera motion. To address this issue, this work proposes ROVA, a novel training framework that integrates difficulty-aware self-reflective evaluation, a robustness-aware consistency reward, and an online curriculum learning strategy to dynamically prioritize training samples and strengthen reasoning in complex environments. Additionally, the authors introduce PVRBench, the first benchmark specifically designed for evaluating video reasoning under realistic perturbations. Experimental results show that ROVA achieves a relative accuracy improvement of at least 24% over state-of-the-art models such as Qwen2.5-VL and Qwen3-VL on PVRBench, along with over a 9% gain in reasoning quality, while maintaining competitive performance on standard clean datasets.
📝 Abstract
In real-world deployment, vision-language models often encounter disturbances such as weather, occlusion, and camera motion. Under such conditions, their understanding and reasoning degrade substantially, revealing a gap between clean, controlled (i.e., unperturbed) evaluation settings and real-world robustness. To address this limitation, we propose ROVA, a novel training framework that improves robustness by modeling a robustness-aware consistency reward under spatio-temporal corruptions. ROVA introduces a difficulty-aware online training strategy that prioritizes informative samples based on the model's evolving capability. Specifically, it continuously re-estimates sample difficulty via self-reflective evaluation, enabling adaptive training with a robustness-aware consistency reward. We also introduce PVRBench, a new benchmark that injects real-world perturbations into embodied video datasets to assess both accuracy and reasoning quality under realistic disturbances. We evaluate ROVA and baselines on PVRBench, UrbanVideo, and VisBench, where open-source and proprietary models suffer drops of up to 35% in accuracy and 28% in reasoning under realistic perturbations. ROVA effectively mitigates this degradation, boosting relative accuracy by at least 24% and reasoning by over 9% compared with baseline models (Qwen2.5/3-VL, InternVL2.5, Embodied-R). These gains also transfer to standard clean benchmarks, yielding consistent improvements.
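The abstract's difficulty-aware training loop (re-estimate sample difficulty, then prioritize informative samples and score them with a robustness-aware consistency reward) could be sketched roughly as follows. All function names, weightings, and update rules here are illustrative assumptions, not details taken from the paper:

```python
import random

# Illustrative sketch of a ROVA-style loop; weights and formulas are
# assumptions, not the paper's actual implementation.

def consistency_reward(clean_ans: str, perturbed_ans: str, correct: str) -> float:
    """Reward correctness on the perturbed clip plus agreement with
    the answer produced on the clean clip (equal weighting assumed)."""
    acc = 1.0 if perturbed_ans == correct else 0.0
    agree = 1.0 if perturbed_ans == clean_ans else 0.0
    return 0.5 * acc + 0.5 * agree

def update_difficulty(prev: float, reward: float, momentum: float = 0.8) -> float:
    """Re-estimate a sample's difficulty as a moving average of
    failure (1 - reward), so estimates track the model's current ability."""
    return momentum * prev + (1.0 - momentum) * (1.0 - reward)

def sample_batch(difficulties: dict, k: int, rng: random.Random) -> list:
    """Prioritize samples near mid-range difficulty, which are assumed
    to be the most informative for the current model."""
    weights = {i: 1.0 - 2.0 * abs(d - 0.5) + 1e-3
               for i, d in difficulties.items()}
    ids = list(weights)
    return rng.choices(ids, weights=[weights[i] for i in ids], k=k)
```

In this sketch, a perfectly solved sample drives its difficulty estimate down and a consistently failed one drives it up, so the mid-range sampling weight implements a simple online curriculum.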