Affordance-Aware Interactive Decision-Making and Execution for Ambiguous Instructions

📅 2026-02-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of enabling robots to efficiently identify task-relevant objects and execute actions in unfamiliar environments under ambiguous human instructions. To this end, the authors propose AIDE, a dual-stream framework that uniquely integrates interactive exploration with vision-language reasoning. By leveraging a multi-stage inference (MSI) stream and an accelerated decision-making (ADM) stream, AIDE achieves zero-shot functional perception and efficient closed-loop execution. The approach supports robust interpretation of vague commands and real-time environmental interaction, demonstrating over 80% task planning success and more than 95% closed-loop execution accuracy in both simulation and real-world settings, while operating at 10 Hz—significantly outperforming existing vision-language model-based methods.

📝 Abstract
Enabling robots to explore and act in unfamiliar environments under ambiguous human instructions by interactively identifying task-relevant objects (e.g., identifying cups or beverages for "I'm thirsty") remains challenging for existing vision-language model (VLM)-based methods. This challenge stems from inefficient reasoning and the lack of environmental interaction, which hinder real-time task planning and execution. To address this, we propose Affordance-Aware Interactive Decision-Making and Execution for Ambiguous Instructions (AIDE), a dual-stream framework that integrates interactive exploration with vision-language reasoning, where Multi-Stage Inference (MSI) serves as the decision-making stream and Accelerated Decision-Making (ADM) as the execution stream, enabling zero-shot affordance analysis and interpretation of ambiguous instructions. Extensive experiments in simulation and real-world environments show that AIDE achieves a task planning success rate of over 80% and more than 95% accuracy in closed-loop continuous execution at 10 Hz, outperforming existing VLM-based methods in diverse open-world scenarios.
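The dual-stream design described above can be sketched as a simple control loop: a slow decision-making stream (MSI) periodically refines the plan via vision-language reasoning, while a fast execution stream (ADM) acts on the latest plan at the control rate (10 Hz in the paper's reported setup). This is a minimal illustrative sketch, not the authors' implementation; the `MSIStream`/`ADMStream` classes and all method names are hypothetical.

```python
class MSIStream:
    """Slow decision-making stream: stands in for multi-stage VLM inference."""
    def __init__(self):
        self.updates = 0

    def infer_plan(self, observation):
        # Placeholder for zero-shot affordance reasoning over the scene.
        self.updates += 1
        return {"target": f"object_{self.updates}", "action": "grasp"}


class ADMStream:
    """Fast execution stream: acts on the current plan every control tick."""
    def __init__(self):
        self.executed = []

    def step(self, plan):
        # Placeholder for closed-loop motion execution.
        self.executed.append(plan["target"])


def run_dual_stream(num_ticks=10, msi_period=5):
    """Run ADM every tick (e.g. at 10 Hz) and MSI every `msi_period` ticks."""
    msi, adm = MSIStream(), ADMStream()
    plan = msi.infer_plan(observation=None)          # initial plan
    for tick in range(num_ticks):
        if tick > 0 and tick % msi_period == 0:
            plan = msi.infer_plan(observation=None)  # slow plan refinement
        adm.step(plan)                               # fast execution step
    return msi.updates, adm.executed


updates, executed = run_dual_stream()
```

The key design point the sketch illustrates is the decoupling of rates: execution never blocks on reasoning, so the robot keeps acting on the most recent plan while the slower inference stream catches up.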
Problem

Research questions and friction points this paper is trying to address.

affordance-aware
ambiguous instructions
interactive decision-making
vision-language model
robotic execution
Innovation

Methods, ideas, or system contributions that make the work stand out.

affordance-aware
interactive decision-making
vision-language reasoning
ambiguous instructions
zero-shot execution
Hengxuan Xu
Department of Automation, Tsinghua University, Beijing, China
Fengbo Lan
Department of Automation, Tsinghua University, Beijing, China
Zhixin Zhao
Department of Automation, Tsinghua University, Beijing, China
Shengjie Wang
Tsinghua University
Robotics, Reinforcement learning, Bionic robotics
Mengqiao Liu
Department of Automation, Tsinghua University, Beijing, China
Jieqian Sun
Department of Automation, Tsinghua University, Beijing, China
Yu Cheng
Professor of Computer Science and Engineering, The Chinese University of Hong Kong
Deep Generative Models, Multimodal Learning, Model Compression
Tao Zhang
Associate Professor, Beijing Jiaotong University, Beijing, China
Network Security, Moving Target Defense, Blockchain, Federated Learning