🤖 AI Summary
This work addresses the challenge of enabling robots to efficiently identify task-relevant objects and execute actions in unfamiliar environments under ambiguous human instructions. To this end, the authors propose AIDE, a dual-stream framework that uniquely integrates interactive exploration with vision-language reasoning. By leveraging a multi-stage inference (MSI) stream and an accelerated decision-making (ADM) stream, AIDE achieves zero-shot functional perception and efficient closed-loop execution. The approach supports robust interpretation of vague commands and real-time environmental interaction, demonstrating over 80% task planning success and more than 95% closed-loop execution accuracy in both simulation and real-world settings, while operating at 10 Hz—significantly outperforming existing vision-language model-based methods.
📝 Abstract
Enabling robots to explore and act in unfamiliar environments under ambiguous human instructions by interactively identifying task-relevant objects (e.g., identifying cups or beverages for "I'm thirsty") remains challenging for existing vision-language model (VLM)-based methods. This challenge stems from inefficient reasoning and the lack of environmental interaction, which hinder real-time task planning and execution. To address this, we propose Affordance-Aware Interactive Decision-Making and Execution for Ambiguous Instructions (AIDE), a dual-stream framework that integrates interactive exploration with vision-language reasoning, where Multi-Stage Inference (MSI) serves as the decision-making stream and Accelerated Decision-Making (ADM) as the execution stream, enabling zero-shot affordance analysis and interpretation of ambiguous instructions. Extensive experiments in simulation and real-world environments show that AIDE achieves a task planning success rate of over 80% and more than 95% accuracy in closed-loop continuous execution at 10 Hz, outperforming existing VLM-based methods in diverse open-world scenarios.