🤖 AI Summary
This work addresses the challenge of enabling robots to efficiently identify task-relevant objects and execute actions in unfamiliar environments under ambiguous human instructions. To this end, the authors propose AIDE, a dual-stream framework that uniquely integrates interactive exploration with vision-language reasoning. By leveraging a multi-stage inference (MSI) stream and an accelerated decision-making (ADM) stream, AIDE achieves zero-shot functional perception and efficient closed-loop execution. The approach supports robust interpretation of vague commands and real-time environmental interaction, demonstrating over 80% task planning success and more than 95% closed-loop execution accuracy in both simulation and real-world settings, while operating at 10 Hz—significantly outperforming existing vision-language model-based methods.
📝 Abstract
Enabling robots to explore and act in unfamiliar environments under ambiguous human instructions by interactively identifying task-relevant objects (e.g., identifying cups or beverages for "I'm thirsty") remains challenging for existing vision-language model (VLM)-based methods. This challenge stems from inefficient reasoning and the lack of environmental interaction, which hinder real-time task planning and execution. To address this, we propose Affordance-Aware Interactive Decision-Making and Execution for Ambiguous Instructions (AIDE), a dual-stream framework that integrates interactive exploration with vision-language reasoning, where Multi-Stage Inference (MSI) serves as the decision-making stream and Accelerated Decision-Making (ADM) as the execution stream, enabling zero-shot affordance analysis and interpretation of ambiguous instructions. Extensive experiments in simulation and real-world environments show that AIDE achieves a task planning success rate of over 80% and more than 95% accuracy in closed-loop continuous execution at 10 Hz, outperforming existing VLM-based methods in diverse open-world scenarios.