🤖 AI Summary
Existing talking-head generation methods face three key limitations: weak text-instruction following, temporal misalignment between motion and audio, and reliance on external pose control signals. To address these, we propose ActAvatar, an end-to-end framework that introduces Phase-Aware Cross-Attention (PACA) to model the fine-grained temporal structure of motion semantics; a progressive audio-visual alignment mechanism that ensures frame-level synchronization; and a two-stage training strategy that first establishes audio-visual correspondence and then injects text-driven action control. Our method incorporates hierarchical feature modulation, temporally anchored prompt decomposition, and fine-tuning on structured motion annotations, eliminating the need for auxiliary skeletal inputs. Extensive evaluations on multiple benchmarks demonstrate significant improvements: a 12.6% gain in motion-control accuracy and a 38.4% reduction in FID, reflecting enhanced visual fidelity. ActAvatar is the first approach to enable fully text-driven, phase-aligned talking-head synthesis with diverse action control in a single unified pipeline.
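The progressive audio-visual alignment can be pictured as a depth-dependent blend of the two conditioning streams. Below is a minimal PyTorch sketch, assuming a simple linear schedule that hands influence from text to audio as layer depth increases; the class name, the blending rule, and all parameters are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ProgressiveModalityBlend(nn.Module):
    """Blend text and audio cross-attention outputs with a depth-dependent
    weight: text-dominant in early layers (action structure), audio-dominant
    in deep layers (lip refinement). Hypothetical; for illustration only."""

    def __init__(self, layer_idx: int, num_layers: int):
        super().__init__()
        # Normalized depth: 0.0 at the first layer, 1.0 at the last.
        depth = layer_idx / max(num_layers - 1, 1)
        # Assumed linear schedule; the paper does not specify one.
        self.register_buffer("audio_weight", torch.tensor(depth))

    def forward(self, text_ctx: torch.Tensor, audio_ctx: torch.Tensor) -> torch.Tensor:
        # text_ctx, audio_ctx: (B, T, D) outputs of the text- and
        # audio-conditioned cross-attention branches at this layer.
        return (1.0 - self.audio_weight) * text_ctx + self.audio_weight * audio_ctx
```

Under this schedule, block 0 of a 24-block backbone would pass text conditioning through almost unchanged while block 23 would be driven almost entirely by audio, matching the claim that early layers set up action structure and deeper layers refine lip motion.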
📝 Abstract
Despite significant advances in talking avatar generation, existing methods face critical challenges: insufficient text-following capability for diverse actions, lack of temporal alignment between actions and audio content, and dependency on additional control signals such as pose skeletons. We present ActAvatar, a framework that achieves phase-level precision in action control through textual guidance by capturing both action semantics and temporal context. Our approach introduces three core innovations: (1) Phase-Aware Cross-Attention (PACA), which decomposes prompts into a global base block and temporally anchored phase blocks, enabling the model to concentrate on phase-relevant tokens for precise temporal-semantic alignment; (2) Progressive Audio-Visual Alignment, which aligns modality influence with the hierarchical feature learning process: early layers prioritize text to establish action structure, while deeper layers emphasize audio to refine lip movements, preventing modality interference; (3) a two-stage training strategy that first establishes robust audio-visual correspondence on diverse data, then injects action control through fine-tuning on structured annotations, preserving both audio-visual alignment and the model's text-following capability. Extensive experiments demonstrate that ActAvatar significantly outperforms state-of-the-art methods in both action control and visual quality.
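To make the PACA idea concrete, here is a minimal single-head PyTorch sketch, assuming the prompt has already been encoded and split into a global base block plus per-phase token masks; every name and tensor layout below is an illustrative assumption, not the authors' code.

```python
import torch
import torch.nn as nn

class PhaseAwareCrossAttention(nn.Module):
    """Single-head cross-attention where each video frame attends to the
    global base-block tokens plus only the tokens of the phase block whose
    time span covers that frame. Hypothetical; for illustration only."""

    def __init__(self, dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, frames, text_tokens, base_mask, phase_masks, frame_phase):
        # frames:      (B, T, D)     per-frame visual queries
        # text_tokens: (B, L, D)     prompt tokens = base block + phase blocks
        # base_mask:   (B, L) bool   True on global base-block tokens
        # phase_masks: (B, P, L) bool  True on tokens of each phase block
        # frame_phase: (B, T) long   index of the phase active at each frame
        q = self.to_q(frames)
        k = self.to_k(text_tokens)
        v = self.to_v(text_tokens)
        # Gather, per frame, the token mask of its temporally anchored phase.
        idx = frame_phase.unsqueeze(-1).expand(-1, -1, phase_masks.size(-1))
        per_frame = torch.gather(phase_masks, 1, idx)        # (B, T, L)
        # Base-block tokens stay visible to every frame.
        allowed = per_frame | base_mask.unsqueeze(1)         # (B, T, L)
        logits = torch.einsum('btd,bld->btl', q, k) * self.scale
        logits = logits.masked_fill(~allowed, float('-inf'))
        attn = logits.softmax(dim=-1)
        return torch.einsum('btl,bld->btd', attn, v)         # (B, T, D)
```

Restricting each frame's keys this way is one plausible reading of concentrating on phase-relevant tokens: the softmax is renormalized over the base block plus the active phase block only, so a phase's wording cannot leak into frames outside its anchored time span.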