🤖 AI Summary
Existing video-based planning methods struggle with interaction failures in partially observable environments due to their lack of online reasoning under environmental uncertainty. This paper introduces the first online video planning framework supporting real-time data fusion during interaction, enabling implicit state estimation without explicit state modeling—achieved through dynamic model parameter updating and implicit filtering of failed trajectories. The method integrates spatiotemporal video representation learning, online model adaptation, dynamic plan pruning, and a re-planning architecture. Evaluated on a newly constructed simulated manipulation benchmark, the framework achieves significant improvements: +32% in re-planning efficiency and +27% in task success rate. It enhances decision robustness and system adaptability in complex, dynamic scenarios, advancing video-driven decision-making toward practical deployment.
📝 Abstract
Video-based representations have gained prominence in planning and decision-making due to their ability to encode rich spatiotemporal dynamics and geometric relationships. These representations enable flexible and generalizable solutions for complex tasks such as object manipulation and navigation. However, existing video planning frameworks often struggle to adapt to failures at interaction time due to their inability to reason about uncertainties in partially observed environments. To overcome these limitations, we introduce a novel framework that integrates interaction-time data into the planning process. Our approach updates model parameters online and filters out previously failed plans during generation. This enables implicit state estimation, allowing the system to adapt dynamically without explicitly modeling unknown state variables. We evaluate our framework through extensive experiments on a new simulated manipulation benchmark, demonstrating its ability to improve replanning performance and advance the field of video-based decision-making.