X-Actor: Emotional and Expressive Long-Range Portrait Acting from Audio

📅 2025-08-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing audio-driven portrait animation methods focus on lip synchronization and short-range visual fidelity in constrained speaking scenarios, and fail to generate emotionally expressive, rhythmically coherent, high-fidelity performances over long durations. To address this, we propose a two-stage disentangled framework: first, an audio-conditioned autoregressive diffusion model, trained with a diffusion-forcing paradigm over a long temporal context window, operates in a compact, identity-agnostic facial motion latent space and predicts emotion-aware expressions of arbitrary length without error accumulation; second, a diffusion-based video synthesis module translates the predicted motion latents into high-fidelity video. Our approach enables coherent, emotion-rich, long-duration audio-driven facial animation with actor-level expressiveness, in which emotional expression evolves in step with the rhythm and semantic content of speech. Experiments demonstrate cinematic-quality expressiveness and state-of-the-art performance on long-duration emotional portrait driving.
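
To make the two-stage design concrete, here is a minimal sketch of the generation loop. It is not the paper's implementation: the module names, dimensions, and the toy denoising and rendering steps are illustrative assumptions. Stage one predicts facial-motion latents chunk by chunk from audio, so clips of arbitrary length can be driven; stage two renders the accumulated latents into frames conditioned on the single reference image.

```python
# Minimal sketch of the two-stage pipeline described above.
# All module names, shapes, and update rules are illustrative assumptions, not the paper's API.
import torch
import torch.nn as nn


class MotionDiffusionAR(nn.Module):
    """Stand-in for the audio-conditioned autoregressive diffusion model that
    denoises identity-agnostic facial-motion latent tokens chunk by chunk."""

    def __init__(self, audio_dim=768, motion_dim=128, ctx_len=256):
        super().__init__()
        self.ctx_len = ctx_len  # long temporal context window (assumed length)
        self.denoiser = nn.Sequential(
            nn.Linear(audio_dim + motion_dim + motion_dim, 512),
            nn.GELU(),
            nn.Linear(512, motion_dim),
        )

    @torch.no_grad()
    def sample_chunk(self, audio_feats, prev_motion, steps=8):
        """Iteratively refine one chunk of motion latents given audio features and
        the previous motion context (a toy stand-in for diffusion sampling)."""
        b, t, _ = audio_feats.shape
        x = torch.randn(b, t, prev_motion.shape[-1])       # start from noise
        ctx = prev_motion[:, -1:].expand(-1, t, -1)        # broadcast last context token
        for _ in range(steps):
            x = x - 0.5 * self.denoiser(torch.cat([audio_feats, ctx, x], dim=-1))
        return x


class VideoDiffusionDecoder(nn.Module):
    """Stand-in for the diffusion-based video synthesis module that renders
    motion latents into frames, conditioned on the reference image."""

    def __init__(self, motion_dim=128, frame_hw=64):
        super().__init__()
        self.to_frame = nn.Linear(motion_dim, 3 * frame_hw * frame_hw)
        self.frame_hw = frame_hw

    @torch.no_grad()
    def render(self, motion_latents, reference_image):
        b, t, _ = motion_latents.shape
        frames = self.to_frame(motion_latents).view(b, t, 3, self.frame_hw, self.frame_hw)
        return frames + reference_image.unsqueeze(1)       # toy identity conditioning


@torch.no_grad()
def animate(audio_feats, reference_image, motion_model, video_model, chunk=64):
    """Stage 1: autoregressive motion prediction over audio chunks (arbitrary length).
    Stage 2: decode the accumulated motion latents into video frames."""
    b, total_t, _ = audio_feats.shape
    prev = torch.zeros(b, 1, 128)                          # neutral initial motion context
    motions = []
    for s in range(0, total_t, chunk):
        m = motion_model.sample_chunk(audio_feats[:, s:s + chunk], prev)
        motions.append(m)
        prev = torch.cat([prev, m], dim=1)[:, -motion_model.ctx_len:]
    motion_latents = torch.cat(motions, dim=1)
    return video_model.render(motion_latents, reference_image)


if __name__ == "__main__":
    audio = torch.randn(1, 200, 768)                       # arbitrary-length audio features
    ref = torch.randn(1, 3, 64, 64)                        # single reference image
    video = animate(audio, ref, MotionDiffusionAR(), VideoDiffusionDecoder())
    print(video.shape)                                     # (1, 200, 3, 64, 64)
```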

📝 Abstract
We present X-Actor, a novel audio-driven portrait animation framework that generates lifelike, emotionally expressive talking head videos from a single reference image and an input audio clip. Unlike prior methods that emphasize lip synchronization and short-range visual fidelity in constrained speaking scenarios, X-Actor enables actor-quality, long-form portrait performance capturing nuanced, dynamically evolving emotions that flow coherently with the rhythm and content of speech. Central to our approach is a two-stage decoupled generation pipeline: an audio-conditioned autoregressive diffusion model that predicts expressive yet identity-agnostic facial motion latent tokens within a long temporal context window, followed by a diffusion-based video synthesis module that translates these motions into high-fidelity video animations. By operating in a compact facial motion latent space decoupled from visual and identity cues, our autoregressive diffusion model effectively captures long-range correlations between audio and facial dynamics through a diffusion-forcing training paradigm, enabling infinite-length emotionally rich motion prediction without error accumulation. Extensive experiments demonstrate that X-Actor produces compelling, cinematic-style performances that go beyond standard talking head animations and achieves state-of-the-art results in long-range, audio-driven emotional portrait acting.
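
The abstract attributes the stability of infinite-length rollout to a diffusion-forcing training paradigm. Below is a minimal, self-contained sketch of what such a training step could look like, assuming a toy causal transformer denoiser and a linear noise schedule; all names, dimensions, and the schedule are illustrative assumptions rather than the paper's implementation. The defining ingredient is that each motion token in the window is noised with its own independently sampled level, so the model learns to denoise noisier future tokens conditioned on cleaner past ones.

```python
# Sketch of a diffusion-forcing-style training step for the motion model
# (illustrative assumption; not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

AUDIO_DIM, MOTION_DIM, HIDDEN = 768, 128, 512              # illustrative dimensions

# Toy causal denoiser: a transformer over the token sequence, so cleaner past
# tokens can condition the denoising of noisier future tokens.
proj_in = nn.Linear(AUDIO_DIM + MOTION_DIM + 1, HIDDEN)
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=HIDDEN, nhead=8, batch_first=True), num_layers=2
)
proj_out = nn.Linear(HIDDEN, MOTION_DIM)

def diffusion_forcing_step(audio_feats, clean_motion, num_levels=1000):
    """One training step: every motion token gets an independent noise level."""
    b, t, _ = clean_motion.shape
    levels = torch.randint(1, num_levels + 1, (b, t, 1)).float() / num_levels
    alpha = 1.0 - levels                                   # toy linear noise schedule
    noise = torch.randn_like(clean_motion)
    noisy = alpha.sqrt() * clean_motion + (1.0 - alpha).sqrt() * noise
    causal = nn.Transformer.generate_square_subsequent_mask(t)
    h = backbone(proj_in(torch.cat([audio_feats, noisy, levels], dim=-1)), mask=causal)
    return F.mse_loss(proj_out(h), noise)                  # epsilon-prediction objective

loss = diffusion_forcing_step(torch.randn(2, 64, AUDIO_DIM), torch.randn(2, 64, MOTION_DIM))
loss.backward()
```

At inference time, already-generated past tokens can be treated as low-noise context while new tokens are sampled, which is consistent with the error-accumulation-free, arbitrary-length rollout described in the abstract.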
Problem

Research questions and friction points this paper is trying to address.

Generating expressive talking head videos from a single reference image and an audio clip
Capturing long-range emotional dynamics that stay synchronized with the rhythm and content of speech
Avoiding error accumulation in arbitrary-length motion prediction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage decoupled generation pipeline
Autoregressive diffusion model for motion
Diffusion-based video synthesis module