AI Summary
This paper addresses the challenges of identity preservation, precise expression transfer, and long-term temporal consistency in free-style portrait animation. To this end, we propose an efficient diffusion-based animation framework built upon Stable Diffusion. Methodologically: (1) expression-aware facial landmarks are introduced as explicit motion driving signals; (2) a fine-grained facial reconstruction loss is coupled with joint expression-mask supervision; and (3) a progressive generation strategy integrated with Taylor interpolation-based caching achieves a 2.6× lossless acceleration. Extensive evaluations on our newly constructed benchmark, EmojiBench++, demonstrate that our approach achieves state-of-the-art performance in animation fidelity, expression controllability, and identity consistency. It robustly supports high-quality, long-sequence animation generation across diverse styles, including realistic human faces, cartoons, sculptures, and animal portraits.
Abstract
We present Follow-Your-Emoji-Faster, an efficient diffusion-based framework for freestyle portrait animation driven by facial landmarks. The main challenges in this task are preserving the identity of the reference portrait, accurately transferring target expressions, and maintaining long-term temporal consistency while ensuring generation efficiency. To address identity preservation and accurate expression retargeting, we enhance Stable Diffusion with two key components: expression-aware landmarks as explicit motion signals, which improve motion alignment, support exaggerated expressions, and reduce identity leakage; and a fine-grained facial loss that leverages both expression and facial masks to better capture subtle expressions and faithfully preserve the reference appearance. With these components, our model supports controllable and expressive animation across diverse portrait types, including real faces, cartoons, sculptures, and animals. However, diffusion-based frameworks typically struggle to efficiently generate long-term stable animation results, which remains a core challenge in this task. To address this, we propose a progressive generation strategy for stable long-term animation and introduce a Taylor-interpolated cache, achieving a 2.6× lossless acceleration. These two strategies ensure that our method produces high-quality results efficiently, making it user-friendly and accessible. Finally, we introduce EmojiBench++, a more comprehensive benchmark comprising diverse portraits, driving videos, and landmark sequences. Extensive evaluations on EmojiBench++ demonstrate that Follow-Your-Emoji-Faster achieves superior performance in both animation quality and controllability. The code, training dataset, and benchmark will be available at https://follow-your-emoji.github.io/.
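The core idea behind a Taylor-interpolated cache can be illustrated with a minimal sketch (the helper below is hypothetical and not the paper's actual implementation): instead of recomputing an expensive network feature at every denoising step, cached features from earlier steps are extrapolated with a first-order Taylor expansion, whose derivative term is estimated by a finite difference.

```python
import numpy as np

def taylor_extrapolate(f_prev, f_curr, dt_prev, dt_next):
    """First-order Taylor extrapolation of a cached feature map.

    Approximates f(t + dt_next) ~ f(t) + dt_next * f'(t), where the
    derivative f'(t) is estimated from the two cached evaluations:
    (f(t) - f(t - dt_prev)) / dt_prev.
    """
    derivative = (f_curr - f_prev) / dt_prev
    return f_curr + dt_next * derivative

# Toy example: a feature that evolves linearly across steps is
# recovered exactly by the extrapolation, so skipping the full
# recomputation at that step loses no information.
def feature(t):
    return np.array([2.0 * t + 1.0])

f_prev, f_curr = feature(0.0), feature(1.0)
pred = taylor_extrapolate(f_prev, f_curr, dt_prev=1.0, dt_next=1.0)
print(pred)  # matches feature(2.0) = [5.0]
```

In a real diffusion pipeline the cached quantities would be intermediate UNet features and the step sizes would follow the sampler's timestep schedule; the speedup comes from replacing full forward passes at interpolated steps with this cheap extrapolation.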