MirrorMe: Towards Realtime and High Fidelity Audio-Driven Halfbody Animation

📅 2025-06-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Real-time generation of high-fidelity, temporally coherent audio-driven portrait animations faces two key challenges: inter-frame inconsistency and high latency. To address these, we propose a lightweight diffusion Transformer-based video generation framework. Our method introduces three core innovations: (1) a reference identity injection mechanism to ensure cross-frame identity consistency; (2) a causal audio encoding adapter enabling low-latency, high-precision lip synchronization; and (3) a multi-stage optimization strategy integrating VAE compression, facial mask guidance, and progressive training to enhance temporal coherence while preserving semantic fidelity. Evaluated on the EMTD benchmark, our approach achieves new state-of-the-art performance—outperforming prior methods across all metrics: generation fidelity, lip-sync error (LSE ↓28.6%), and temporal stability (FVD ↓34.1%). The framework supports end-to-end real-time upper-body animation synthesis.

📝 Abstract
Audio-driven portrait animation, which synthesizes realistic videos from reference images using audio signals, faces significant challenges in real-time generation of high-fidelity, temporally coherent animations. While recent diffusion-based methods improve generation quality by integrating audio into denoising processes, their reliance on frame-by-frame UNet architectures introduces prohibitive latency and struggles with temporal consistency. This paper introduces MirrorMe, a real-time, controllable framework built on the LTX video model, a diffusion transformer that compresses video spatially and temporally for efficient latent-space denoising. To address LTX's trade-offs between compression and semantic fidelity, we propose three innovations: (1) a reference identity injection mechanism via VAE-encoded image concatenation and self-attention, ensuring identity consistency; (2) a causal audio encoder and adapter tailored to LTX's temporal structure, enabling precise audio-expression synchronization; and (3) a progressive training strategy combining close-up facial training, half-body synthesis with facial masking, and hand-pose integration for enhanced gesture control. Extensive experiments on the EMTD benchmark demonstrate MirrorMe's state-of-the-art performance in fidelity, lip-sync accuracy, and temporal stability.
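The reference identity injection the abstract describes can be sketched as ordinary self-attention over the concatenation of reference-image latent tokens and video latent tokens, so that video tokens attend to the identity reference at every denoising step. The shapes, dimensions, and random projections below are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, d):
    # Single-head self-attention with random projections (illustrative only).
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    return softmax(q @ k.T / np.sqrt(d)) @ v

# Hypothetical sizes: 4 reference tokens, 16 video tokens, feature dim 8.
d = 8
ref_tokens = np.random.default_rng(1).standard_normal((4, d))   # VAE-encoded reference image
vid_tokens = np.random.default_rng(2).standard_normal((16, d))  # noisy video latents

# Identity injection: concatenate along the token axis so video tokens can
# attend to the reference inside plain self-attention, then keep only the
# video-token outputs.
joint = np.concatenate([ref_tokens, vid_tokens], axis=0)
out = self_attention(joint, d)
denoised_vid = out[len(ref_tokens):]
print(denoised_vid.shape)  # (16, 8)
```

The design point is that no extra cross-attention module is needed: concatenation lets the existing self-attention layers carry identity information across frames.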
Problem

Research questions and friction points this paper is trying to address.

Real-time high-fidelity audio-driven animation synthesis
Overcoming the latency and temporal inconsistency of diffusion-based generation
Enhancing identity consistency and audio-expression synchronization
Innovation

Methods, ideas, or system contributions that make the work stand out.

VAE-encoded image concatenation for identity consistency
Causal audio encoder for audio-expression synchronization
Progressive training strategy for enhanced gesture control
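The causal audio encoding idea behind the second innovation (features for frame t depend only on audio up to t, so generation never waits on future input) can be illustrated with a left-padded 1-D convolution. The toy features and kernel below are hypothetical, not the paper's encoder:

```python
import numpy as np

def causal_conv1d(x, kernel):
    """1-D convolution where output[t] depends only on x[:t+1] (left padding)."""
    k = len(kernel)
    xp = np.concatenate([np.zeros(k - 1), x])  # pad on the left only
    return np.array([xp[t:t + k] @ kernel[::-1] for t in range(len(x))])

audio = np.arange(10, dtype=float)   # toy per-frame audio features
kernel = np.array([0.5, 0.3, 0.2])   # stand-in for learned encoder weights

y1 = causal_conv1d(audio, kernel)

# Causality check: perturbing a future input leaves all earlier outputs
# untouched, which is what makes low-latency streaming synthesis possible.
audio2 = audio.copy()
audio2[7] += 100.0
y2 = causal_conv1d(audio2, kernel)
assert np.allclose(y1[:7], y2[:7]) and not np.allclose(y1[7:], y2[7:])
```

A non-causal (centered) convolution would need future audio frames before emitting output, adding a fixed look-ahead delay per layer.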
👥 Authors
Dechao Meng
PhD candidate, Institute of Computing Technology, Chinese Academy of Sciences
Deep Learning · Computer Vision
Steven Xiao
Tongyi Lab, Alibaba Group
Xindi Zhang
Tongyi Lab, Alibaba Group
Guangyuan Wang
Tongyi Lab, Alibaba Group
Peng Zhang
Tongyi Lab, Alibaba Group
Qi Wang
Tongyi Lab, Alibaba Group
Bang Zhang
Tongyi Lab, Alibaba Group
Liefeng Bo
Head of Applied Computer Vision Lab at Alibaba Group
Machine Learning · Computer Vision · Robotics