🤖 AI Summary
This work addresses the challenge of generating listener head motions in virtual human interaction, where existing methods often produce static, under-expressive behaviors and struggle with the high-dimensional complexity of nonverbal motion parameters. To overcome these limitations, the authors propose a novel approach that integrates autoregressive flow matching with a Group reward-Decoupled Policy Optimization (GDPO) strategy. By partitioning the FLAME parameter space into motion groups and applying independent reward normalization per group, GDPO enhances both the dynamic diversity and visual expressiveness of generated head movements. Furthermore, semantic text conditioning is incorporated to enable controllable and context-aware responses. Experiments on the Seamless Interaction and DualTalk datasets demonstrate that the proposed method significantly outperforms current state-of-the-art techniques in long-term motion variance, expressiveness, and semantic controllability.
📝 Abstract
Generating realistic 3D head motion for dyadic interactions is a significant challenge in virtual human synthesis. While recent methods achieve impressive results with speaking heads, listener motions frequently suffer from the "Regression-to-the-Mean" problem, collapsing into static faces, and lack the parameter-space coverage needed for complex nonverbal motions. In this paper, we propose GDPO-Listener, a novel framework that achieves highly expressive speaking and listening motion generation. First, we introduce an Auto-Regressive Flow Matching architecture that enables stable supervised learning. Second, to overcome kinematic stillness, we apply Group reward-Decoupled Policy Optimization (GDPO). By isolating reward normalization across distinct FLAME parameter groups, GDPO explicitly incentivizes high-variance, expressive generations. Finally, we enable explicit semantic text control for customizable responses. Extensive evaluations on the Seamless Interaction and DualTalk datasets demonstrate superior performance compared to existing baselines on long-term kinematic variance, visual expressivity, and semantic controllability.
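The core GDPO idea described above, normalizing rewards independently within each FLAME parameter group rather than over the full parameter vector, can be illustrated with a minimal sketch. The group names, shapes, and the summation of per-group advantages below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

# Hypothetical FLAME motion groups; the paper partitions the parameter
# space into groups, but the exact partition here is an assumption.
GROUPS = ["expression", "jaw", "head_pose"]

def decoupled_advantages(rewards_per_group, eps=1e-8):
    """Compute per-sample advantages with group-decoupled normalization.

    rewards_per_group: dict mapping a group name to an array of shape
    (num_samples,) holding that group's reward for each sampled motion.
    Each group's rewards are standardized independently, so a group with
    small reward magnitudes (e.g. subtle jaw motion) is not drowned out
    by a high-magnitude group during policy optimization.
    """
    total = None
    for g in GROUPS:
        r = np.asarray(rewards_per_group[g], dtype=np.float64)
        adv = (r - r.mean()) / (r.std() + eps)  # normalize within the group only
        total = adv if total is None else total + adv
    return total
```

A coupled baseline would instead sum the raw group rewards first and normalize once, letting the dominant group set the scale; decoupling keeps every group's reward signal at comparable magnitude, which is how the abstract motivates the gain in expressive variance.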