🤖 AI Summary
Existing talking-head generation methods face three key limitations: weak text-instruction following, temporal misalignment between motion and audio, and reliance on external pose control signals. To address these, we propose ActAvatar, an end-to-end framework that introduces Phase-Aware Cross-Attention (PACA) to model the fine-grained temporal structure of motion semantics; a progressive audio-visual alignment mechanism that ensures frame-level synchronization; and a two-stage training strategy that first establishes audio-visual correspondence and then injects text-driven action control. Our method incorporates hierarchical feature modulation, temporally anchored prompt decomposition, and fine-tuning on structured motion annotations, eliminating the need for auxiliary skeletal inputs. Extensive evaluations on multiple benchmarks demonstrate significant improvements: a 12.6% gain in motion-control accuracy and a 38.4% reduction in FID, reflecting enhanced visual fidelity. ActAvatar is the first approach to enable fully text-driven, phase-aligned talking-head synthesis with diverse action control in a single unified pipeline.
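The progressive audio-visual alignment can be pictured as a depth-dependent blend of the two conditioning streams. Below is a minimal PyTorch sketch, assuming a simple linear schedule that hands influence from text to audio as layer depth increases; the class name, the blending rule, and all parameters are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ProgressiveModalityBlend(nn.Module):
    """Blend text and audio cross-attention outputs with a depth-dependent
    weight: text-dominant in early layers (action structure), audio-dominant
    in deep layers (lip refinement). Hypothetical; for illustration only."""

    def __init__(self, layer_idx: int, num_layers: int):
        super().__init__()
        # Normalized depth: 0.0 at the first layer, 1.0 at the last.
        depth = layer_idx / max(num_layers - 1, 1)
        # Assumed linear schedule; the paper does not specify one.
        self.register_buffer("audio_weight", torch.tensor(depth))

    def forward(self, text_ctx: torch.Tensor, audio_ctx: torch.Tensor) -> torch.Tensor:
        # text_ctx, audio_ctx: (B, T, D) outputs of the text- and
        # audio-conditioned cross-attention branches at this layer.
        return (1.0 - self.audio_weight) * text_ctx + self.audio_weight * audio_ctx
```

Under this schedule, block 0 of a 24-block backbone would pass text conditioning through almost unchanged while block 23 would be driven almost entirely by audio, matching the claim that early layers set up action structure and deeper layers refine lip motion.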
📝 Abstract
Despite significant advances in talking avatar generation, existing methods face critical challenges: insufficient text-following capability for diverse actions, lack of temporal alignment between actions and audio content, and dependency on additional control signals such as pose skeletons. We present ActAvatar, a framework that achieves phase-level precision in action control through textual guidance by capturing both action semantics and temporal context. Our approach introduces three core innovations: (1) Phase-Aware Cross-Attention (PACA), which decomposes prompts into a global base block and temporally anchored phase blocks, enabling the model to concentrate on phase-relevant tokens for precise temporal-semantic alignment; (2) Progressive Audio-Visual Alignment, which aligns modality influence with the hierarchical feature learning process: early layers prioritize text to establish action structure, while deeper layers emphasize audio to refine lip movements, preventing modality interference; (3) a two-stage training strategy that first establishes robust audio-visual correspondence on diverse data, then injects action control through fine-tuning on structured annotations, preserving both audio-visual alignment and the model's text-following capability. Extensive experiments demonstrate that ActAvatar significantly outperforms state-of-the-art methods in both action control and visual quality.
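To make the PACA idea concrete, here is a minimal single-head PyTorch sketch, assuming the prompt has already been encoded and split into a global base block plus per-phase token masks; every name and tensor layout below is an illustrative assumption, not the authors' code.

```python
import torch
import torch.nn as nn

class PhaseAwareCrossAttention(nn.Module):
    """Single-head cross-attention where each video frame attends to the
    global base-block tokens plus only the tokens of the phase block whose
    time span covers that frame. Hypothetical; for illustration only."""

    def __init__(self, dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, frames, text_tokens, base_mask, phase_masks, frame_phase):
        # frames:      (B, T, D)     per-frame visual queries
        # text_tokens: (B, L, D)     prompt tokens = base block + phase blocks
        # base_mask:   (B, L) bool   True on global base-block tokens
        # phase_masks: (B, P, L) bool  True on tokens of each phase block
        # frame_phase: (B, T) long   index of the phase active at each frame
        q = self.to_q(frames)
        k = self.to_k(text_tokens)
        v = self.to_v(text_tokens)
        # Gather, per frame, the token mask of its temporally anchored phase.
        idx = frame_phase.unsqueeze(-1).expand(-1, -1, phase_masks.size(-1))
        per_frame = torch.gather(phase_masks, 1, idx)        # (B, T, L)
        # Base-block tokens stay visible to every frame.
        allowed = per_frame | base_mask.unsqueeze(1)         # (B, T, L)
        logits = torch.einsum('btd,bld->btl', q, k) * self.scale
        logits = logits.masked_fill(~allowed, float('-inf'))
        attn = logits.softmax(dim=-1)
        return torch.einsum('btl,bld->btd', attn, v)         # (B, T, D)
```

Restricting each frame's keys this way is one plausible reading of concentrating on phase-relevant tokens: the softmax is renormalized over the base block plus the active phase block only, so a phase's wording cannot leak into frames outside its anchored time span.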