INTENTION: Inferring Tendencies of Humanoid Robot Motion Through Interactive Intuition and Grounded VLM

๐Ÿ“… 2025-08-06
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Traditional robots rely on precise physical models and predefined actions, exhibiting poor generalization and limited adaptability in unstructured environments; by contrast, humans achieve efficient decision-making through implicit physical intuition and interactive experience. This paper proposes INTENTION, a framework that integrates a vision-language model (VLM) with an interaction-driven Memory Graph to enable scene understanding and intuitive reasoning grounded in prior interaction history, without requiring repeated instructions or explicit physical modeling. Its Intuitive Perceptor component adds physics-aware relational modeling and affordance-driven behavioral inference, empowering robots to autonomously generate plausible manipulation strategies. Experiments across diverse real-world scenarios demonstrate substantial improvements in task generalization and environmental adaptability, establishing a paradigm for embodied agents to achieve human-like intuitive manipulation.

๐Ÿ“ Abstract
Traditional control and planning for robotic manipulation rely heavily on precise physical models and predefined action sequences. While effective in structured environments, such approaches often fail in real-world scenarios due to modeling inaccuracies and struggle to generalize to novel tasks. In contrast, humans intuitively interact with their surroundings, demonstrating remarkable adaptability and making efficient decisions through implicit physical understanding. In this work, we propose INTENTION, a novel framework that equips robots with learned interactive intuition for autonomous manipulation in diverse scenarios by integrating Vision-Language Model (VLM) based scene reasoning with interaction-driven memory. We introduce a Memory Graph that records scenes from previous task interactions, embodying human-like understanding of and decision-making about different real-world tasks. Meanwhile, we design an Intuitive Perceptor that extracts physical relations and affordances from visual scenes. Together, these components empower robots to infer appropriate interaction behaviors in new scenes without relying on repetitive instructions. Videos: https://robo-intention.github.io
Problem

Research questions and friction points this paper is trying to address.

Enabling robots to infer interaction behaviors autonomously
Overcoming reliance on precise models in robotic manipulation
Integrating scene reasoning with interaction-driven memory
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates Vision-Language Models with interaction-driven memory
Uses Memory Graph for human-like task understanding
Employs Intuitive Perceptor for physical relation extraction
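The paper does not release code, so the interplay of the two components above can only be sketched. The following is a minimal illustrative sketch, not the authors' implementation: all names (`MemoryGraph`, `Interaction`, `infer_behavior`) are hypothetical stand-ins, the Memory Graph is reduced to tag-overlap retrieval over past interactions, and the VLM is a placeholder fallback.

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins for the paper's components; the real Memory Graph
# and Intuitive Perceptor are not specified at code level in this listing.

@dataclass
class Interaction:
    task: str          # e.g. "open the drawer"
    affordance: str    # e.g. "graspable handle"
    outcome: str       # e.g. "success"

@dataclass
class MemoryGraph:
    # Maps a frozenset of scene tags to the interactions recorded there.
    records: dict = field(default_factory=dict)

    def store(self, scene_tags, interaction):
        self.records.setdefault(frozenset(scene_tags), []).append(interaction)

    def retrieve(self, scene_tags):
        # Return interactions from the stored scene with the largest tag overlap.
        query = set(scene_tags)
        best = max(self.records.items(),
                   key=lambda kv: len(query & kv[0]),
                   default=(None, []))
        return best[1]

def infer_behavior(memory: MemoryGraph, scene_tags) -> str:
    """Prefer a remembered strategy; otherwise fall back to (stand-in) VLM reasoning."""
    past = memory.retrieve(scene_tags)
    if past:
        return f"reuse: {past[-1].task} via {past[-1].affordance}"
    return "query VLM for a new manipulation plan"

memory = MemoryGraph()
memory.store({"drawer", "handle"},
             Interaction("open the drawer", "graspable handle", "success"))
print(infer_behavior(memory, {"drawer", "handle", "kitchen"}))
```

The design choice this sketch highlights is the paper's central loop: memory lookup first, VLM reasoning only for genuinely novel scenes, so repeated instructions become unnecessary once an interaction has been experienced.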
๐Ÿ”Ž Similar Papers
No similar papers found.
Jin Wang
Humanoids and Human-Centered Mechatronics (HHCM), Istituto Italiano di Tecnologia, Genoa, Italy.
Weijie Wang
PhD Student, Zhejiang University
Computer Vision · Efficient AI · Deep Learning
Boyuan Deng
Humanoids and Human-Centered Mechatronics (HHCM), Istituto Italiano di Tecnologia, Genoa, Italy.
Heng Zhang
Human-Robot Interfaces and Interaction Lab, Istituto Italiano di Tecnologia, Genoa, Italy; Ph.D. program of national interest in Robotics and Intelligent Machines (DRIM), Università di Genova, Genoa, Italy.
Rui Dai
Humanoids and Human-Centered Mechatronics (HHCM), Istituto Italiano di Tecnologia, Genoa, Italy.
Nikos Tsagarakis
Tenured Senior Scientist, Istituto Italiano di Tecnologia
Humanoid Robots · Actuators · Mechatronics · Mechanisms · Haptics