PersonaGesture: Single-Reference Co-Speech Gesture Personalization for Unseen Speakers

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the problem of generating speech-synchronized, personalized body gestures for unseen speakers from a single reference motion clip, without per-speaker optimization. The authors propose PersonaGesture, a diffusion-based pipeline that separates speaker identity from speech-driven motion using two components, Adaptive Style Infusion (ASI) and Implicit Distribution Rectification (IDR). A Style Perceiver encodes the variable-length reference into compact speaker-memory tokens, ASI injects these tokens into denoising through zero-initialized residual cross-attention, and IDR applies a length-aware diagonal affine map in latent space. On the BEAT2 and ZeroEGGS datasets, the authors report improved unseen-speaker personalization over collapsed style codes, full-reference attention, and one-clip finetuning, evaluated through quantitative metrics, reference-identity controls, same-audio diagnostics, qualitative comparisons, and human preference studies.
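The zero-initialized residual cross-attention mentioned in the summary can be illustrated with a small NumPy sketch. This is not the authors' implementation: the class name `ZeroInitResidualCrossAttention`, the single-head layout, and the choice to zero the output projection are assumptions made for illustration. The point it shows is that, at initialization, the style branch contributes nothing, so the pretrained speech-to-motion features pass through unchanged.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class ZeroInitResidualCrossAttention:
    """Hypothetical single-head layer: motion features attend to style tokens."""

    def __init__(self, dim, rng):
        scale = 1.0 / np.sqrt(dim)
        self.w_q = rng.normal(0.0, scale, (dim, dim))
        self.w_k = rng.normal(0.0, scale, (dim, dim))
        self.w_v = rng.normal(0.0, scale, (dim, dim))
        # Assumption: the output projection starts at zero, so the residual
        # branch is an exact no-op before any training.
        self.w_out = np.zeros((dim, dim))

    def __call__(self, hidden, style_tokens):
        q = hidden @ self.w_q            # (T, D) queries from motion features
        k = style_tokens @ self.w_k      # (M, D) keys from speaker memory
        v = style_tokens @ self.w_v      # (M, D) values from speaker memory
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (T, M)
        # Residual form: the style signal is added to, not substituted for,
        # the existing features.
        return hidden + (attn @ v) @ self.w_out

rng = np.random.default_rng(0)
layer = ZeroInitResidualCrossAttention(dim=8, rng=rng)
hidden = rng.normal(size=(16, 8))        # 16 motion frames, 8 channels
style_tokens = rng.normal(size=(4, 8))   # 4 speaker-memory tokens
out_at_init = layer(hidden, style_tokens)
```

Once `w_out` moves away from zero during training, the style tokens begin to shift the features, which is one way to let reference evidence influence motion without overwriting the prior.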

📝 Abstract

We propose PersonaGesture, a diffusion-based pipeline for single-reference co-speech gesture personalization of unseen speakers. Given target speech and one motion clip from a new speaker, the model must synthesize gestures that follow the new utterance while retaining speaker-specific pose choices, without per-speaker optimization. This setting is useful for avatars and virtual agents, but it is hard because the reference mixes stable speaker habits with utterance-specific trajectories. PersonaGesture consists of two key components, Adaptive Style Infusion (ASI) and Implicit Distribution Rectification (IDR), to separate temporal identity evidence from residual statistic correction. A Style Perceiver first encodes the variable-length reference into compact speaker-memory tokens. ASI injects these tokens into denoising through zero-initialized residual cross-attention, enabling style evidence to affect motion formation without replacing the pretrained speech-to-motion prior. Building on this, IDR applies a length-aware diagonal affine map in latent space to correct residual channel-wise moments estimated from the same reference. Across BEAT2 and ZeroEGGS, we evaluate quantitative metrics, reference-identity controls, same-audio diagnostics, qualitative comparisons, and human preference. Experiments show that separating denoising-time speaker memory from conservative post-generation moment correction improves unseen-speaker personalization over collapsed style codes, full-reference attention, and one-clip finetuning. Project: https://xiangyue-zhang.github.io/PersonaGesture.
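As a rough illustration of the IDR step described in the abstract, the sketch below applies a per-channel (diagonal) affine map that moves the mean and standard deviation of generated latents toward those of the reference. The abstract does not specify how the map is made length-aware, so the shrinkage weight `lam = L / (L + tau)`, the parameter `tau`, and the function name `diagonal_affine_rectify` are assumptions introduced here, not the paper's formulation.

```python
import numpy as np

def diagonal_affine_rectify(generated, reference, tau=64.0, eps=1e-6):
    """Shift per-channel moments of `generated` (T, C) toward `reference` (L, C).

    Hypothetical sketch: `tau` controls how quickly trust in the reference
    statistics grows with the reference length L.
    """
    mu_g, sd_g = generated.mean(axis=0), generated.std(axis=0) + eps
    mu_r, sd_r = reference.mean(axis=0), reference.std(axis=0) + eps

    # Assumed length-aware weight in [0, 1]: short references yield noisy
    # moment estimates, so the correction is shrunk toward the identity map.
    ref_len = reference.shape[0]
    lam = ref_len / (ref_len + tau)

    # Diagonal affine map y = scale * (x - mu_g) + target_mu, one scale and
    # one offset per channel, with no cross-channel mixing.
    scale = (1.0 - lam) + lam * (sd_r / sd_g)
    target_mu = (1.0 - lam) * mu_g + lam * mu_r
    return (generated - mu_g) * scale + target_mu
```

With `tau=0` the map fully matches the reference mean and standard deviation per channel; with a very large `tau` it leaves the generated latents essentially unchanged, which corresponds to the conservative end of the correction.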

Problem

Research questions and friction points this paper is trying to address.

co-speech gesture

personalization

unseen speakers

single-reference

gesture synthesis

Innovation

Methods, ideas, or system contributions that make the work stand out.

co-speech gesture generation

speaker personalization

diffusion model