🤖 AI Summary
In real-time human–robot interaction with humanoid robots, intent recognition remains inaccurate, expressive motion generation lacks social appropriateness, and computational efficiency is insufficient. Method: This paper proposes a hierarchical intent-driven motion synthesis framework that integrates in-context learning (ICL) with a latent-space diffusion model. It introduces a hierarchical intent refinement mechanism, built from structured prompting, confidence estimation, social context awareness, and safety-aware fallback, to enable dynamic intent correction and adaptive response. A lightweight latent-space diffusion model, pre-trained on large-scale motion data and embedded with physical constraints and social norm priors, generates expressive motions efficiently. Contribution/Results: Evaluated on a physical robot platform, the method synthesizes diverse, physically plausible, and socially aligned gestures in real time. It significantly improves interaction naturalness and robustness under dynamic, unstructured human input.
📝 Abstract
Effective human-robot interaction requires robots to identify human intentions and generate expressive, socially appropriate motions in real time. Existing approaches often rely on fixed motion libraries or computationally expensive generative models. We propose a hierarchical framework that combines intention-aware reasoning via in-context learning (ICL) with real-time motion generation using diffusion models. Our system introduces structured prompting with confidence scoring, fallback behaviors, and social context awareness to enable intention refinement and adaptive response. Leveraging large-scale motion datasets and efficient latent-space denoising, the framework generates diverse, physically plausible gestures suitable for dynamic humanoid interactions. Experimental validation on a physical platform demonstrates the robustness and social alignment of our method in realistic scenarios.
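To make the idea of confidence scoring with fallback behaviors concrete, the following is a minimal sketch, not the authors' implementation. It assumes the reasoning stage returns an intent label with a confidence in [0, 1]. The names `IntentEstimate`, `select_gesture`, `SAFE_FALLBACK`, and the threshold of 0.6 are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass

# All names and values below are illustrative assumptions, not from the paper.
SAFE_FALLBACK = "neutral_idle"   # assumed safe default gesture
CONFIDENCE_THRESHOLD = 0.6       # assumed cutoff for trusting an intent


@dataclass
class IntentEstimate:
    label: str         # intent label produced by the reasoning stage
    confidence: float  # expected to lie in [0, 1]


def select_gesture(
    estimate: IntentEstimate,
    gesture_for_intent: dict[str, str],
    threshold: float = CONFIDENCE_THRESHOLD,
) -> str:
    """Return the gesture mapped to the intent, or the safe fallback."""
    # Reject malformed scores (out of range or NaN) instead of acting on them.
    if not 0.0 <= estimate.confidence <= 1.0:
        return SAFE_FALLBACK
    # Low confidence: prefer the safe behavior over a possibly wrong gesture.
    if estimate.confidence < threshold:
        return SAFE_FALLBACK
    # Unknown intent labels also fall back to the safe behavior.
    return gesture_for_intent.get(estimate.label, SAFE_FALLBACK)
```

In this sketch a confident "greeting" estimate maps to whatever gesture the caller registered for it, while a low-confidence, malformed, or unrecognized estimate yields the fallback. A fuller system would presumably pass the selected intent on as conditioning for the motion generator rather than look it up in a fixed table.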