🤖 AI Summary
This work addresses the challenge of generating natural, context-aware nonverbal behaviors, particularly hand gestures, in dyadic dialogue to enhance realism and coordination in virtual interactions. We propose the first autoregressive diffusion-based gesture generation model for two interlocutors, integrated with a large language model (LLM) agent that performs intent understanding, dynamic response generation, and dialogue flow control. Our method jointly models speech input, high-level semantic guidance, and kinematic constraints so that the two participants' gestures stay temporally synchronized and semantically aligned. User studies and quantitative evaluation demonstrate significant improvements in gesture naturalness (+28.6%), cross-agent synchrony (+34.1%), and interaction immersion. To our knowledge, this is the first approach to achieve joint, real-time co-generation of nonverbal behavior at the motion level for dyadic embodied dialogue, establishing a novel paradigm for embodied conversational agents.
📝 Abstract
We present Social Agent, a novel framework for synthesizing realistic and contextually appropriate co-speech nonverbal behaviors in dyadic conversations. In this framework, we develop an agentic system driven by a Large Language Model (LLM) to direct the conversation flow and determine appropriate interactive behaviors for both participants. Additionally, we propose a novel dual-person gesture generation model based on an auto-regressive diffusion model, which synthesizes coordinated motions from speech signals. The output of the agentic system is translated into high-level guidance for the gesture generator, resulting in realistic movement at both the behavioral and motion levels. Furthermore, the agentic system periodically examines the movements of interlocutors and infers their intentions, forming a continuous feedback loop that enables dynamic and responsive interactions between the two participants. User studies and quantitative evaluations show that our model significantly improves the quality of dyadic interactions, producing natural, synchronized nonverbal behaviors.
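The feedback loop described in the abstract can be illustrated with a minimal, purely hypothetical sketch. It is not the authors' implementation: `agent_decide` stands in for the LLM-driven agentic system using a simple rule, `generate_chunk` stands in for the autoregressive diffusion model using a toy iterative refinement from noise, and every name, constant, and the two-value pose representation are assumptions made for illustration only.

```python
import random
from dataclasses import dataclass

CHUNK = 4  # frames generated per step (assumed value)


@dataclass
class Guidance:
    speaker: str       # "A" or "B": which participant holds the floor
    listener_cue: str  # high-level behavior hint for the listener


def agent_decide(step, recent_motion):
    """Hypothetical stand-in for the LLM agent: alternate turns and pick a
    listener cue based on how active the listener's recent motion was."""
    speaker = "A" if step % 2 == 0 else "B"
    listener = 1 if speaker == "A" else 0
    energy = sum(abs(f[listener]) for f in recent_motion) / len(recent_motion)
    cue = "nod" if energy < 0.5 else "hold_still"
    return Guidance(speaker, cue)


def generate_chunk(prev_frame, speech, guidance, rng, denoise_steps=5):
    """Toy stand-in for the gesture generator: each frame starts from noise
    and is iteratively pulled toward a target conditioned on speech and
    guidance, then blended with the previous frame (autoregressive part)."""
    amp = {"A": (1.0, 0.2), "B": (0.2, 1.0)}[guidance.speaker]
    frames, prev = [], prev_frame
    for s in speech:
        target = (amp[0] * s, amp[1] * s)
        x = (rng.gauss(0, 1), rng.gauss(0, 1))  # start from noise
        for _ in range(denoise_steps):
            x = tuple(0.5 * xi + 0.5 * ti for xi, ti in zip(x, target))
        frame = tuple(0.5 * p + 0.5 * xi for p, xi in zip(prev, x))
        frames.append(frame)
        prev = frame
    return frames


def run_dialogue(num_chunks=3, seed=0):
    """Alternate between agent decisions and motion generation, so each
    decision sees the motion produced by the previous step."""
    rng = random.Random(seed)
    motion = [(0.0, 0.0)]  # one (pose_A, pose_B) pair per frame
    log = []
    for step in range(num_chunks):
        guidance = agent_decide(step, motion[-CHUNK:])
        speech = [rng.random() for _ in range(CHUNK)]  # fake speech features
        motion.extend(generate_chunk(motion[-1], speech, guidance, rng))
        log.append(guidance)
    return motion, log
```

Calling `run_dialogue(3)` returns the generated frames for both participants together with the guidance issued at each step. The point of the sketch is only the control flow: guidance conditions the next motion chunk, and the resulting motion feeds back into the next decision.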