AI Summary
Vision-language models (VLMs) face critical bottlenecks in long-video temporal reasoning, including severe hallucination, high computational complexity, and difficulty balancing fine-grained local details with global context; these limitations hinder their deployment in embodied intelligence tasks. To address these challenges, we propose ROVER, the first framework to introduce recursive video segmentation reasoning: it hierarchically decomposes long video trajectories into subtask segments and combines sliding contextual windows with localized-to-global attention to achieve efficient, linear-time temporal modeling. This design substantially mitigates hallucination and improves reasoning consistency. Evaluated on OpenX and RoboCasa benchmarks, ROVER outperforms strong baselines across three tasks (task progress estimation, frame-level natural language reasoning, and video question answering), demonstrating its effectiveness and generalizability for long-horizon vision-language understanding.
Abstract
Vision-language models (VLMs) have exhibited impressive capabilities across diverse image understanding tasks, but still struggle in settings that require reasoning over extended sequences of camera frames from a video. This limits their utility in embodied settings, which demand reasoning at each moment of a task attempt over long frame sequences drawn from a continuous stream of visual input. To address this limitation, we propose ROVER (Reasoning Over VidEo Recursively), a framework that enables the model to recursively decompose long-horizon video trajectories into segments corresponding to shorter subtasks within the trajectory. In doing so, ROVER facilitates more focused and accurate reasoning over temporally localized frame sequences without losing global context. We evaluate ROVER, implemented using an in-context learning approach, on diverse OpenX Embodiment videos and on a new dataset derived from RoboCasa that consists of 543 videos showing both expert and perturbed non-expert trajectories across 27 robotic manipulation tasks. ROVER outperforms strong baselines across three video reasoning tasks: task progress estimation, frame-level natural language reasoning, and video question answering. We observe that, by reducing the number of frames the model reasons over at each timestep, ROVER mitigates hallucinations, especially during unexpected or non-optimal moments of a trajectory. In addition, by enabling the implementation of a subtask-specific sliding context window, ROVER's time complexity scales linearly with video length, an asymptotic improvement over baselines. Demos, code, and data are available at: https://rover-vlm.github.io
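The interplay between recursive subtask decomposition and a subtask-specific sliding window can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation (which uses in-context learning with a VLM): `detect_boundary`, `reason_over`, and `WINDOW` are hypothetical placeholders. The point it illustrates is the complexity claim: because reasoning at each timestep touches only a bounded, subtask-local window of frames, total work grows linearly with video length.

```python
from dataclasses import dataclass, field

WINDOW = 4  # assumed sliding-window size: frames kept for the active subtask


@dataclass
class Segment:
    """One subtask segment of the decomposed trajectory."""
    name: str
    frames: list = field(default_factory=list)


def rover_sketch(frames, detect_boundary, reason_over):
    """Process a long video one frame at a time.

    detect_boundary(frame) -> str | None : hypothetical subtask-boundary detector.
    reason_over(window)    -> any        : hypothetical per-subtask reasoner.

    Each frame enters exactly one segment and each reasoning call sees at
    most WINDOW frames, so total work is O(len(frames)), i.e. linear.
    """
    segments = [Segment("start")]
    outputs = []
    for frame in frames:
        new_subtask = detect_boundary(frame)
        if new_subtask is not None:
            # Open a new, shorter subtask segment within the trajectory.
            segments.append(Segment(new_subtask))
        active = segments[-1]
        active.frames.append(frame)
        # Reason only over the subtask-local sliding window,
        # never over the full video history.
        outputs.append(reason_over(active.frames[-WINDOW:]))
    return segments, outputs
```

A toy run with stubbed-out detector and reasoner shows frames being routed into per-subtask segments while the reasoner's input never exceeds `WINDOW` frames, regardless of how long the video grows.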