INTENTION: Inferring Tendencies of Humanoid Robot Motion Through Interactive Intuition and Grounded VLM

๐Ÿ“… 2025-08-06
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Traditional robots rely on precise physical models and predefined actions, exhibiting poor generalization and limited adaptability in unstructured environments; by contrast, humans achieve efficient decision-making through implicit physical intuition and interactive experience. This paper proposes INTENTION, a framework that integrates a vision-language model (VLM) with an interaction-driven Memory Graph to enable scene understanding and intuitive reasoning grounded in prior interaction history, without requiring repeated instructions or explicit physical modeling. Its Intuitive Perceptor component adds physics-aware relational modeling and affordance-driven behavioral inference, empowering robots to autonomously generate plausible manipulation strategies. Experiments across diverse real-world scenarios demonstrate substantial improvements in task generalization and environmental adaptability, establishing a paradigm for embodied agents to achieve human-like intuitive manipulation.

๐Ÿ“ Abstract
Traditional control and planning for robotic manipulation rely heavily on precise physical models and predefined action sequences. While effective in structured environments, such approaches often fail in real-world scenarios due to modeling inaccuracies and struggle to generalize to novel tasks. In contrast, humans intuitively interact with their surroundings, demonstrating remarkable adaptability and making efficient decisions through implicit physical understanding. In this work, we propose INTENTION, a novel framework that equips robots with learned interactive intuition for autonomous manipulation in diverse scenarios by integrating Vision-Language Model (VLM) based scene reasoning with interaction-driven memory. We introduce a Memory Graph that records scenes from previous task interactions, embodying human-like understanding of and decision-making about different real-world tasks. Meanwhile, we design an Intuitive Perceptor that extracts physical relations and affordances from visual scenes. Together, these components empower robots to infer appropriate interaction behaviors in new scenes without relying on repetitive instructions. Videos: https://robo-intention.github.io
Problem

Research questions and friction points this paper is trying to address.

Enabling robots to infer interaction behaviors autonomously
Overcoming reliance on precise models in robotic manipulation
Integrating scene reasoning with interaction-driven memory
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates Vision-Language Models with interaction-driven memory
Uses Memory Graph for human-like task understanding
Employs Intuitive Perceptor for physical relation extraction
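The paper does not release code, so the interplay of the two components above can only be sketched. The following is a minimal illustrative sketch, not the authors' implementation: all names (`MemoryGraph`, `Interaction`, `infer_behavior`) are hypothetical stand-ins, the Memory Graph is reduced to tag-overlap retrieval over past interactions, and the VLM is a placeholder fallback.

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins for the paper's components; the real Memory Graph
# and Intuitive Perceptor are not specified at code level in this listing.

@dataclass
class Interaction:
    task: str          # e.g. "open the drawer"
    affordance: str    # e.g. "graspable handle"
    outcome: str       # e.g. "success"

@dataclass
class MemoryGraph:
    # Maps a frozenset of scene tags to the interactions recorded there.
    records: dict = field(default_factory=dict)

    def store(self, scene_tags, interaction):
        self.records.setdefault(frozenset(scene_tags), []).append(interaction)

    def retrieve(self, scene_tags):
        # Return interactions from the stored scene with the largest tag overlap.
        query = set(scene_tags)
        best = max(self.records.items(),
                   key=lambda kv: len(query & kv[0]),
                   default=(None, []))
        return best[1]

def infer_behavior(memory: MemoryGraph, scene_tags) -> str:
    """Prefer a remembered strategy; otherwise fall back to (stand-in) VLM reasoning."""
    past = memory.retrieve(scene_tags)
    if past:
        return f"reuse: {past[-1].task} via {past[-1].affordance}"
    return "query VLM for a new manipulation plan"

memory = MemoryGraph()
memory.store({"drawer", "handle"},
             Interaction("open the drawer", "graspable handle", "success"))
print(infer_behavior(memory, {"drawer", "handle", "kitchen"}))
```

The design choice this sketch highlights is the paper's central loop: memory lookup first, VLM reasoning only for genuinely novel scenes, so repeated instructions become unnecessary once an interaction has been experienced.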
๐Ÿ”Ž Similar Papers
No similar papers found.
Jin Wang
Humanoids and Human-Centered Mechatronics (HHCM), Istituto Italiano di Tecnologia, Genoa, Italy.
Weijie Wang
PhD Student, Zhejiang University
Computer Vision · Efficient AI · Deep Learning
Boyuan Deng
Humanoids and Human-Centered Mechatronics (HHCM), Istituto Italiano di Tecnologia, Genoa, Italy.
Heng Zhang
Human-Robot Interfaces and Interaction Lab, Istituto Italiano di Tecnologia, Genoa, Italy; Ph.D. program of national interest in Robotics and Intelligent Machines (DRIM), Università di Genova, Genoa, Italy.
Rui Dai
Humanoids and Human-Centered Mechatronics (HHCM), Istituto Italiano di Tecnologia, Genoa, Italy.
Nikos Tsagarakis
Tenured Senior Scientist, Istituto Italiano di Tecnologia
Humanoid Robots · Actuators · Mechatronics · Mechanisms · Haptics