🤖 AI Summary
This paper addresses the core challenge of synthesizing human interactions -- with other people, objects, and environments -- in digital systems. It systematically surveys advances in generating human-human, human-object, and human-scene interactions. Methodologically, it introduces the first unified taxonomy integrating foundational concepts, modeling paradigms, multimodal datasets (e.g., AMASS, PROX), and evaluation metrics; conducts an in-depth analysis of key technical approaches, including deep generative models, physics-based simulation, cross-modal alignment, and motion-capture data augmentation; and constructs a structured knowledge graph to map technological evolution and critical bottlenecks. The contributions clarify the applicability boundaries and limitations of current methods, identify open challenges, and chart future research directions. The work provides theoretical foundations and practical guidance for embodied AI in robotics, natural VR interaction, and intelligent animation generation.
📝 Abstract
Humans inhabit a world defined by interactions -- with other humans, objects, and environments. These interactive movements not only convey our relationships with our surroundings but also demonstrate how we perceive and communicate with the real world. Therefore, replicating these interaction behaviors in digital systems has emerged as an important topic for applications in robotics, virtual reality, and animation. While recent advances in deep generative models and new datasets have accelerated progress in this field, significant challenges remain in modeling the intricate human dynamics and their interactions with entities in the external world. In this survey, we present, for the first time, a comprehensive overview of the literature in human interaction motion generation. We begin by establishing foundational concepts essential for understanding the research background. We then systematically review existing solutions and datasets across three primary interaction tasks -- human-human, human-object, and human-scene interactions -- followed by evaluation metrics. Finally, we discuss open research directions and future opportunities.