RHINO: Learning Real-Time Humanoid-Human-Object Interaction from Human Demonstrations

📅 2025-02-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Humanoid robots struggle to respond in real time to human commands and interaction signals. Method: This paper proposes RHINO, a hierarchical real-time interaction framework: a high-level module infers human intent from multimodal inputs (language, vision, and motion), while a low-level module handles reactive motion control and object manipulation with safety taken into account. The framework learns both levels from human demonstrations and teleoperation data, combining imitation learning, multimodal perception, and real-time closed-loop control. Results: Evaluated on a physical humanoid robot, the system supports interruption at any time, immediate feedback, and robust task resumption, demonstrating flexibility, robustness, and safety in a range of scenarios.

📝 Abstract
Humanoid robots have shown success in locomotion and manipulation. Beyond these basic abilities, humanoids still need to quickly understand human instructions and react to human interaction signals to become valuable assistants in daily life. Unfortunately, most existing works focus only on multi-stage interactions, treating each task separately and neglecting real-time feedback. In this work, we aim to empower humanoid robots with real-time reaction abilities across various tasks, allowing humans to interrupt robots at any time and making robots respond to humans immediately. To support such abilities, we propose a general humanoid-human-object interaction framework, named RHINO, i.e., Real-time Humanoid-human Interaction and Object manipulation. RHINO provides a unified view of reactive motion, instruction-based manipulation, and safety concerns over multiple human signal modalities, such as language, images, and motion. RHINO is a hierarchical learning framework, enabling humanoids to learn reaction skills from human-human-object demonstrations and teleoperation data. In particular, it decouples the interaction process into two levels: 1) a high-level planner that infers human intentions from real-time human behaviors; and 2) a low-level controller that achieves reactive motion behaviors and object manipulation skills based on the predicted intentions. We evaluate the proposed framework on a real humanoid robot and demonstrate its effectiveness, flexibility, and safety in various scenarios.
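
The two-level decoupling described in the abstract can be illustrated with a small toy loop. This is only a sketch under stated assumptions, not the RHINO implementation: the class names, the intent labels such as "pick_can" and "handover", the rule-based planner, and the boolean safety flag are all hypothetical stand-ins for the learned components in the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observation:
    """One tick of human signals (all fields are hypothetical stand-ins)."""
    speech: Optional[str] = None      # latest spoken instruction, if any
    gesture: Optional[str] = None     # e.g. "reach_out"
    human_too_close: bool = False     # toy safety signal


class HighLevelPlanner:
    """Maps real-time human behavior to an intent label (rule-based toy)."""

    def infer(self, obs: Observation, current: str) -> str:
        if obs.speech == "stop":
            return "idle"
        if obs.speech is not None:
            return obs.speech           # treat the instruction as the intent
        if obs.gesture == "reach_out":
            return "handover"
        return current                  # no new signal: keep current intent


class LowLevelController:
    """Executes the skill for the current intent, one step per tick."""

    def __init__(self) -> None:
        self.active_skill = "idle"
        self.progress = 0

    def step(self, intent: str, obs: Observation) -> str:
        if obs.human_too_close:
            return "hold_pose"          # safety overrides any skill
        if intent != self.active_skill:
            self.active_skill = intent  # interruption: switch immediately
            self.progress = 0
        self.progress += 1
        return f"{self.active_skill}:{self.progress}"


def run(observations):
    """Re-infer the intent on every tick, then take one control step."""
    planner, controller = HighLevelPlanner(), LowLevelController()
    intent, actions = "idle", []
    for obs in observations:
        intent = planner.infer(obs, intent)
        actions.append(controller.step(intent, obs))
    return actions
```

Because the intent is re-inferred on every tick rather than once per task, a new human signal can preempt a running skill at any step, which is the behavior the abstract contrasts with multi-stage pipelines.
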
Problem

Research questions and friction points this paper is trying to address.

Enhancing humanoid robots' real-time reaction abilities.
Developing a framework for humanoid-human-object interaction.
Learning skills from human demonstrations for robot tasks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Real-time humanoid-human interaction framework
Hierarchical learning from human demonstrations
Decouples interaction into high and low levels