🤖 AI Summary
To address severe identity leakage and inadequate modeling of subtle and extreme facial expressions in zero-shot portrait animation, this paper proposes a diffusion-based disentangled animation generation method. The approach introduces: (1) an end-to-end, identity-agnostic 1D latent motion descriptor that explicitly decouples motion from identity features; (2) a disentangled latent attention mechanism coupled with dual GAN-decoder supervision to suppress spurious spatial structural cues, thereby fundamentally mitigating identity leakage; and (3) cross-attention integration of the 1D motion vector into the diffusion process—eliminating the need for pre-trained motion detectors and enabling high-fidelity motion transfer. Extensive experiments demonstrate that the method significantly outperforms state-of-the-art approaches across diverse challenging expressions—including extreme and micro-expressions—while achieving both high expressiveness and strong identity preservation.
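The key design point in (3) is that the motion condition enters only through cross-attention, never through spatially additive guidance, so no spatially aligned structure from the driving frame can leak into the backbone. A minimal numpy sketch of this idea, with illustrative shapes and weight matrices that are assumptions for exposition (not the paper's actual implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(image_feats, motion_latent, Wq, Wk, Wv):
    """Spatial image tokens (HW, d) query a 1D motion latent (m, d).

    The motion latent supplies only keys/values; it has no spatial
    layout, so no spatially aligned structural cue from the driving
    frame can pass through to the image features.
    """
    q = image_feats @ Wq                             # (HW, d_k)
    k = motion_latent @ Wk                           # (m, d_k)
    v = motion_latent @ Wv                           # (m, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (HW, m)
    return image_feats + attn @ v                    # residual update

# Toy shapes: 16 spatial tokens of dim 8; motion latent of 4 tokens of dim 8.
rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))
mot = rng.standard_normal((4, 8))
Wq, Wk, Wv = (rng.standard_normal((8, 8)) * 0.1 for _ in range(3))
out = cross_attend(x, mot, Wq, Wk, Wv)
print(out.shape)  # (16, 8)
```

Contrast this with additive spatial guidance (e.g. concatenating or adding a driving-frame feature map to the UNet features), where pixel-aligned identity structure is handed to the generator directly.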
📝 Abstract
We propose X-NeMo, a novel zero-shot diffusion-based portrait animation pipeline that animates a static portrait using facial movements from a driving video of a different individual. Our work first identifies the root causes of the key issues in prior approaches, such as identity leakage and difficulty in capturing subtle and extreme expressions. To address these challenges, we introduce a fully end-to-end training framework that distills a 1D identity-agnostic latent motion descriptor from the driving image, effectively controlling motion through cross-attention during image generation. Our implicit motion descriptor captures expressive facial motion in fine detail, learned end-to-end from a diverse video dataset without reliance on pretrained motion detectors. We further enhance expressiveness and disentangle motion latents from identity cues by supervising their learning with a dual GAN decoder, alongside spatial and color augmentations. By embedding the driving motion into a 1D latent vector and controlling motion via cross-attention rather than additive spatial guidance, our design eliminates the transmission of spatially aligned structural cues from the driving condition to the diffusion backbone, substantially mitigating identity leakage. Extensive experiments demonstrate that X-NeMo surpasses state-of-the-art baselines, producing highly expressive animations with superior identity resemblance. Our code and models are available for research.
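The "spatial and color augmentations" mentioned in the abstract can be pictured concretely: if the driving frame's appearance and geometry are randomly perturbed while the motion label stays the same, the motion encoder cannot rely on identity or appearance cues. The following sketch shows hypothetical augmentations in this spirit; the specific jitter ranges and crop ratio are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def augment_driving_frame(frame, rng):
    """Color- and spatially-jitter a driving frame (H, W, 3 floats in [0, 1]).

    Hypothetical example: random per-channel color jitter plus a random
    crop-and-resize makes appearance/identity cues unreliable, pushing
    the motion encoder to encode motion only.
    """
    out = frame.copy()
    # Color jitter: random per-channel gain and offset.
    gain = rng.uniform(0.8, 1.2, size=3)
    offset = rng.uniform(-0.1, 0.1, size=3)
    out = np.clip(out * gain + offset, 0.0, 1.0)
    # Spatial jitter: random 90% crop, then nearest-neighbor resize back.
    h, w, _ = out.shape
    ch, cw = int(h * 0.9), int(w * 0.9)
    y0 = rng.integers(0, h - ch + 1)
    x0 = rng.integers(0, w - cw + 1)
    crop = out[y0:y0 + ch, x0:x0 + cw]
    ys = np.arange(h) * ch // h   # nearest-neighbor row indices
    xs = np.arange(w) * cw // w   # nearest-neighbor column indices
    return crop[np.ix_(ys, xs)]

rng = np.random.default_rng(42)
frame = rng.uniform(0, 1, size=(64, 64, 3))
aug = augment_driving_frame(frame, rng)
print(aug.shape)  # (64, 64, 3)
```

During training, the dual GAN decoder would then be asked to reconstruct the motion from such augmented inputs, so any latent dimension that encoded color or fine spatial identity structure would fail to generalize across augmentations.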