🤖 AI Summary
This work addresses the challenge of generating listener head motions in virtual human interaction, where existing methods often produce static, under-expressive behaviors and struggle with the high-dimensional complexity of nonverbal motion parameters. To overcome these limitations, the authors propose a novel approach that integrates autoregressive flow matching with a Group reward-Decoupled Policy Optimization (GDPO) strategy. By partitioning the FLAME parameter space into motion groups and applying independent reward normalization per group, GDPO enhances both the dynamic diversity and visual expressiveness of generated head movements. Furthermore, semantic text conditioning is incorporated to enable controllable and context-aware responses. Experiments on the Seamless Interaction and DualTalk datasets demonstrate that the proposed method significantly outperforms current state-of-the-art techniques in long-term motion variance, expressiveness, and semantic controllability.
📝 Abstract
Generating realistic 3D head motion for dyadic interactions is a significant challenge in virtual human synthesis. While recent methods achieve impressive results with speaking heads, listener motions frequently suffer from the "Regression-to-the-Mean" problem, collapsing into static faces, and lack the parameter-space coverage needed for complex nonverbal motions. In this paper, we propose GDPO-Listener, a novel framework that achieves highly expressive speaking and listening motion generation. First, we introduce an Auto-Regressive Flow Matching architecture that enables stable supervised learning. Second, to overcome kinematic stillness, we apply Group reward-Decoupled Policy Optimization (GDPO). By isolating reward normalization across distinct FLAME parameter groups, GDPO explicitly incentivizes high-variance, expressive generations. Finally, we enable explicit semantic text control for customizable responses. Extensive evaluations on the Seamless Interaction and DualTalk datasets demonstrate superior performance compared to existing baselines on long-term kinematic variance, visual expressivity, and semantic controllability.
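The core GDPO idea described above, normalizing rewards independently within each FLAME parameter group rather than over the full parameter vector, can be illustrated with a minimal sketch. The group names, shapes, and the summation of per-group advantages below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

# Hypothetical FLAME motion groups; the paper partitions the parameter
# space into groups, but the exact partition here is an assumption.
GROUPS = ["expression", "jaw", "head_pose"]

def decoupled_advantages(rewards_per_group, eps=1e-8):
    """Compute per-sample advantages with group-decoupled normalization.

    rewards_per_group: dict mapping a group name to an array of shape
    (num_samples,) holding that group's reward for each sampled motion.
    Each group's rewards are standardized independently, so a group with
    small reward magnitudes (e.g. subtle jaw motion) is not drowned out
    by a high-magnitude group during policy optimization.
    """
    total = None
    for g in GROUPS:
        r = np.asarray(rewards_per_group[g], dtype=np.float64)
        adv = (r - r.mean()) / (r.std() + eps)  # normalize within the group only
        total = adv if total is None else total + adv
    return total
```

A coupled baseline would instead sum the raw group rewards first and normalize once, letting the dominant group set the scale; decoupling keeps every group's reward signal at comparable magnitude, which is how the abstract motivates the gain in expressive variance.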