🤖 AI Summary
This work addresses a limitation of current embodied AI systems: they rely predominantly on explicit instructions and struggle to understand and follow social norms when no instruction is given. To bridge this gap, the study introduces the concept of active intelligence and presents RobotEQ, the first benchmark for evaluating it. RobotEQ comprises 1,900 first-person (egocentric) images, 5,353 action judgment questions, and 1,286 spatial grounding questions, all manually annotated. Experiments show that existing models fall short of reliable active intelligence, particularly in spatial grounding, and that using Retrieval-Augmented Generation (RAG) to incorporate an external social-norm knowledge base generally improves performance. The work aims to move embodied AI from passive instruction following toward socially compliant behavior.
📝 Abstract
Embodied AI is a prominent research topic in both academia and industry. Current research centers on completing tasks based on explicit user instructions. However, for robots to integrate into human society, they must understand which actions are permissible and which are prohibited, even without explicit commands. We refer to user-guided AI as passive intelligence and to unguided AI as active intelligence. This paper introduces RobotEQ, the first benchmark for active intelligence, which assesses whether existing models can comprehend and adhere to social norms in embodied scenarios. First, we construct RobotEQ-Data, a dataset of 1,900 egocentric images spanning 10 representative embodied categories and 56 subcategories. Through extensive manual annotation, we provide 5,353 action judgment questions and 1,286 spatial grounding questions that specify appropriate robot actions across diverse scenarios. Furthermore, we establish RobotEQ-Bench to evaluate the performance of state-of-the-art models on this task. Experimental results show that current models still fall short of reliable active intelligence, particularly in spatial grounding. Meanwhile, we observe that using retrieval-augmented generation (RAG) to incorporate external social norm knowledge bases generally improves performance. This work can facilitate the transition of robotics from user-guided passive manipulation to active social compliance.
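The abstract does not describe how the RAG step is implemented. As a rough illustration of the general idea only, the sketch below retrieves the most relevant norm from a small knowledge base and prepends it to an action judgment prompt. The norm entries, the function names (`retrieve`, `build_prompt`), the prompt wording, and the token-overlap scoring are all assumptions made for this example and are not taken from RobotEQ; a real system would more likely use dense embeddings and pass the egocentric image to a vision-language model.

```python
import re

# Hypothetical norm entries, invented for illustration; not from RobotEQ.
NORMS = [
    "Do not enter a bathroom while a person is using it.",
    "Keep quiet in a library and avoid blocking the aisles.",
    "Do not touch food on a plate that belongs to someone else.",
]


def tokenize(text):
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(query, norms, k=1):
    """Rank norms by Jaccard token overlap with the query; return the top k."""
    q = tokenize(query)

    def score(norm):
        n = tokenize(norm)
        union = q | n
        return len(q & n) / len(union) if union else 0.0

    return sorted(norms, key=score, reverse=True)[:k]


def build_prompt(scene, action, norms, k=1):
    """Assemble a judgment prompt with retrieved norms placed before the query."""
    retrieved = retrieve(f"{scene} {action}", norms, k)
    context = "\n".join(f"- {n}" for n in retrieved)
    return (
        f"Relevant social norms:\n{context}\n\n"
        f"Scene: {scene}\n"
        f"Proposed action: {action}\n"
        "Is this action appropriate? Answer yes or no."
    )


scene = "A person is reading in a library."
action = "Play loud music in the library aisles."
prompt = build_prompt(scene, action, NORMS)
print(prompt)
```

The retrieved norm is the only extra context the model sees, so retrieval quality bounds how much such a step can help; the lexical scoring here is a stand-in chosen to keep the example dependency-free.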