ActAvatar: Temporally-Aware Precise Action Control for Talking Avatars

📅 2025-12-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing talking-head generation methods face three key limitations: weak text instruction following, temporal misalignment between motion and audio, and reliance on external pose control signals. To address these, we propose PhaseAvatar, a novel end-to-end framework that introduces phase-aware cross-attention (PACA) to model fine-grained temporal structure of motion semantics; a progressive audio-video alignment mechanism ensuring frame-level synchronization; and a two-stage diffusion training strategy for joint fusion of text, audio, and visual modalities. Our method incorporates hierarchical feature modulation, temporal anchoring prompt decomposition, and structured motion annotation fine-tuning—eliminating the need for auxiliary skeletal inputs. Extensive evaluations on multiple benchmarks demonstrate significant improvements: +12.6% motion control accuracy and −38.4% FID score, reflecting enhanced visual fidelity. PhaseAvatar is the first approach enabling fully text-driven, phase-aligned, and diversely controllable talking-head synthesis in a single unified pipeline.

Technology Category

Application Category

📝 Abstract
Despite significant advances in talking avatar generation, existing methods face critical challenges: insufficient text-following capability for diverse actions, lack of temporal alignment between actions and audio content, and dependency on additional control signals such as pose skeletons. We present ActAvatar, a framework that achieves phase-level precision in action control through textual guidance by capturing both action semantics and temporal context. Our approach introduces three core innovations: (1) Phase-Aware Cross-Attention (PACA), which decomposes prompts into a global base block and temporally-anchored phase blocks, enabling the model to concentrate on phase-relevant tokens for precise temporal-semantic alignment; (2) Progressive Audio-Visual Alignment, which aligns modality influence with the hierarchical feature learning process-early layers prioritize text for establishing action structure while deeper layers emphasize audio for refining lip movements, preventing modality interference; (3) A two-stage training strategy that first establishes robust audio-visual correspondence on diverse data, then injects action control through fine-tuning on structured annotations, maintaining both audio-visual alignment and the model's text-following capabilities. Extensive experiments demonstrate that ActAvatar significantly outperforms state-of-the-art methods in both action control and visual quality.
Problem

Research questions and friction points this paper is trying to address.

Enhances text-following for diverse avatar actions
Aligns actions temporally with audio content precisely
Removes dependency on additional control signals like poses
Innovation

Methods, ideas, or system contributions that make the work stand out.

Phase-Aware Cross-Attention for temporal-semantic alignment
Progressive Audio-Visual Alignment to prevent modality interference
Two-stage training strategy for robust audio-visual correspondence
Ziqiao Peng
Ziqiao Peng
Renmin University of China
3D Face AnimationTalking Head Generation
Y
Yi Chen
Tencent Hunyuan
Yifeng Ma
Yifeng Ma
Tsinghua University
Computer visionDeep Learning
Guozhen Zhang
Guozhen Zhang
Nanjing University
Video Frame Interpolation
Zhiyao Sun
Zhiyao Sun
Tsinghua University
Z
Zixiang Zhou
Tencent Hunyuan
Y
Youliang Zhang
Tencent Hunyuan
Z
Zhengguang Zhou
Tencent Hunyuan
Z
Zhaoxin Fan
Beihang University
Hongyan Liu
Hongyan Liu
Zhejiang University
programable networksnetwork measurementP4 language
Y
Yuan Zhou
Tencent Hunyuan
Q
Qinglin Lu
Tencent Hunyuan
J
Jun He
Renmin University of China