🤖 AI Summary
This work addresses the challenge of enabling embodied agents to autonomously execute long-horizon, high-fidelity human-object interactions, encompassing dexterous finger manipulation and whole-body motion coordination, in complex physical environments guided by natural language instructions. The authors propose the first end-to-end framework that jointly integrates large language models (LLMs) for multi-step semantic planning, reinforcement learning (RL)-driven physics-based motion tracking, and a multi-granularity motor policy network. Crucially, it models finger-level contacts and generates whole-body motion simultaneously while maintaining kinematic and dynamic feasibility within a physics engine, yielding actions that are realistic, executable, and adaptive to environmental constraints. Experiments demonstrate substantial improvements in generalizing from language instructions to physically plausible actions across diverse objects and intricate scenes. The approach establishes a new paradigm for long-horizon, embodied human-object interaction, advancing physically grounded, language-guided robotic control.
📝 Abstract
Intelligent agents must autonomously interact with their environments to perform daily tasks based on human-level instructions. They need a foundational understanding of the world to accurately interpret these instructions, along with precise low-level movement and interaction skills to execute the derived actions. In this work, we propose the first complete system for synthesizing physically plausible, long-horizon human-object interactions for object manipulation in contextual environments, driven by human-level instructions. We leverage large language models (LLMs) to interpret the input instructions into detailed execution plans. Unlike prior work, our system is capable of generating detailed finger-object interactions in seamless coordination with full-body movements. We also train a policy to track the generated motions in physics simulation via reinforcement learning (RL) to ensure the physical plausibility of the motion. Our experiments demonstrate the effectiveness of our system in synthesizing realistic interactions with diverse objects in complex environments, highlighting its potential for real-world applications.