🤖 AI Summary
This work addresses the lack of fine-grained control and interpretability in facial motion transfer. We propose an interpretable portrait animation method built on a sparse motion dictionary, which constructs semantically aligned, linear motion bases in the latent space. Facial motions from a driving video are disentangled into editable, interpretable sparse factors, enabling a controllable “edit–warp–render” generation paradigm. The approach uses an autoencoder architecture whose large-scale training strategy scales to roughly one billion parameters. Extensive evaluations on multiple benchmarks, covering self-reenactment and cross-identity motion transfer, demonstrate significant improvements over state-of-the-art methods. The framework also supports high-fidelity user-guided editing and 3D-aware animation generation, achieving superior control accuracy, semantic transparency, and generalization, and bridging the gap between expressiveness and interpretability in neural face animation.
📝 Abstract
We introduce LIA-X, a novel interpretable portrait animator designed to transfer facial dynamics from a driving video to a source portrait with fine-grained control. LIA-X is an autoencoder that models motion transfer as linear navigation of motion codes in latent space. Crucially, it incorporates a novel Sparse Motion Dictionary that enables the model to disentangle facial dynamics into interpretable factors. Departing from previous 'warp-render' approaches, the interpretability of the Sparse Motion Dictionary allows LIA-X to support a highly controllable 'edit-warp-render' strategy, enabling precise manipulation of fine-grained facial semantics in the source portrait; this narrows initial pose and expression differences between the source portrait and the driving video. Moreover, we demonstrate the scalability of LIA-X by successfully training a large-scale model with approximately 1 billion parameters on extensive datasets. Experimental results show that our method outperforms previous approaches in both self-reenactment and cross-reenactment tasks across several benchmarks. The interpretable and controllable nature of LIA-X also supports practical applications such as fine-grained, user-guided image and video editing, as well as 3D-aware portrait video manipulation.
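To make the 'edit-warp-render' idea concrete, the sketch below illustrates linear navigation of a motion code over a dictionary of directions. All names, dimensions, and functions here are hypothetical illustrations of the general technique, not LIA-X's actual architecture or API: a motion offset is projected onto dictionary directions to obtain interpretable per-direction magnitudes, the user edits selected magnitudes, and the edited code would then feed the warp and render stages.

```python
import numpy as np

# Illustrative sketch only: dimensions and names are assumptions,
# not taken from the LIA-X paper or its released code.
rng = np.random.default_rng(0)
latent_dim, num_directions = 512, 20

# A stand-in "sparse motion dictionary": each row is a unit-norm motion
# direction in latent space (semantically aligned after training,
# e.g. head yaw or mouth opening; here just random for the sketch).
dictionary = rng.standard_normal((num_directions, latent_dim))
dictionary /= np.linalg.norm(dictionary, axis=1, keepdims=True)

def motion_magnitudes(driving_code, source_code):
    """Project the driving-minus-source offset onto the dictionary,
    yielding one interpretable magnitude per motion direction."""
    return dictionary @ (driving_code - source_code)

def edit_and_navigate(source_code, magnitudes, edits=None):
    """'Edit' step: override chosen magnitudes, then navigate linearly
    from the source code along the dictionary directions."""
    m = magnitudes.copy()
    for idx, value in (edits or {}).items():
        m[idx] = value  # user-specified override for one factor
    return source_code + dictionary.T @ m

source_code = rng.standard_normal(latent_dim)
driving_code = rng.standard_normal(latent_dim)

mags = motion_magnitudes(driving_code, source_code)
# Suppress one motion factor (index 3) before warping/rendering.
edited_code = edit_and_navigate(source_code, mags, edits={3: 0.0})
```

In this toy setup, suppressing a single magnitude changes the navigated code while leaving the other factors' contributions intact, which is the kind of targeted control the paper attributes to the dictionary's interpretability.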