🤖 AI Summary
Open-vocabulary mobile manipulation (OVMM) suffers from delayed world-model updates in dynamic environments: existing approaches update perception only at discrete temporal nodes (e.g., action termination), leading to object omission, detection latency, and delayed reactive replanning. To address this, we propose BINDER, a dual-process framework that decouples strategic planning, performed by a multimodal large language model (the Deliberative Response Module, DRM), from continuous monitoring, executed by a video-language model (the Instant Response Module, IRM). This enables fine-grained, real-time 3D scene updating and proactive replanning. By jointly leveraging streaming video analysis and 3D scene modeling, BINDER establishes a closed-loop perception-action cycle. Evaluated across three realistic dynamic scenarios, our method significantly improves task success rate and execution efficiency over state-of-the-art methods, demonstrating robustness and generalizability for real-world deployment.
📝 Abstract
Open-vocabulary mobile manipulation (OVMM) requires robots to follow language instructions, navigate, and manipulate while updating their world representation under dynamic environmental changes. However, most prior approaches update their world representation only at discrete update points such as navigation targets, waypoints, or the end of an action step, leaving robots blind between updates and causing cascading failures: overlooked objects, late error detection, and delayed replanning. To address this limitation, we propose BINDER (Bridging INstant and DEliberative Reasoning), a dual-process framework that decouples strategic planning from continuous environment monitoring. Specifically, BINDER integrates a Deliberative Response Module (DRM, a multimodal LLM for task planning) with an Instant Response Module (IRM, a VideoLLM for continuous monitoring). The two modules play complementary roles: the DRM performs strategic planning with structured 3D scene updates and guides what the IRM attends to, while the IRM analyzes video streams to update memory, correct ongoing actions, and trigger replanning when necessary. Through this bidirectional coordination, the modules resolve the trade-off between maintaining awareness and avoiding costly updates, enabling robust adaptation under dynamic conditions. Evaluated in three real-world environments with dynamic object placement, BINDER achieves substantially higher success rates and efficiency than SoTA baselines, demonstrating its effectiveness for real-world deployment.
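The dual-process coordination described above can be illustrated with a minimal control-loop sketch. All class and function names below are hypothetical stand-ins, not the paper's actual implementation: a planner object plays the DRM role (replanning only when triggered), while a monitor object plays the IRM role (inspecting every frame, updating memory, and raising a replan trigger when observations contradict the current belief).

```python
from dataclasses import dataclass, field

@dataclass
class SceneMemory:
    """Shared world representation updated by the monitoring channel."""
    objects: set = field(default_factory=set)

class DeliberativeModule:
    """Stand-in for the DRM (multimodal-LLM planner)."""
    def plan(self, goal, memory):
        # Plan a pick step only for objects currently believed present;
        # otherwise fall back to searching for the goal object.
        return [("pick", goal)] if goal in memory.objects else [("search", goal)]

class InstantModule:
    """Stand-in for the IRM (VideoLLM stream monitor)."""
    def monitor(self, frame, memory):
        # Update memory from the latest frame; report whether the
        # observed object set changed (a replan trigger).
        changed = frame["objects"] != memory.objects
        memory.objects = set(frame["objects"])
        return changed

def run(goal, frames):
    memory, drm, irm = SceneMemory(), DeliberativeModule(), InstantModule()
    plan = drm.plan(goal, memory)       # deliberative channel: plan once
    trace = []
    for frame in frames:
        if irm.monitor(frame, memory):  # instant channel: every frame
            plan = drm.plan(goal, memory)  # proactive replanning
        trace.append(plan[0])
    return trace
```

In this toy loop the expensive planner runs only when the cheap per-frame monitor detects a change, which is the trade-off between awareness and update cost that the abstract describes.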