AI Summary
This work addresses the problem of animating static human images from a small number of driving videos (16 or fewer), confronting two key challenges: poor generalization to rare motions and unnatural transitions from the reference image to the first animated frame. To this end, we propose FLASH, a novel framework built upon diffusion models for few-shot human animation. Our method introduces: (1) a joint alignment mechanism that synchronizes motion features and inter-frame correspondence relations across appearance-diverse videos sharing the same motion, enhancing the robustness of motion transfer; (2) a latent-space detail-enhancing decoder that ensures smooth transitions from the initial reference image and high-fidelity texture preservation; and (3) a diffusion-based few-shot learning paradigm. Experiments demonstrate that FLASH significantly improves motion generalization and first-frame coherence under extreme data scarcity, while enabling fine-grained controllable animation. This work establishes a new paradigm for low-data human motion transfer.
Abstract
Despite recent progress, video generative models still struggle to animate human actions from static images, particularly uncommon actions for which training data are limited. In this paper, we investigate the task of learning to animate human actions from a small number of videos (16 or fewer), which is highly valuable in real-world applications like video and movie production. Few-shot learning of generalizable motion patterns while ensuring smooth transitions from the initial reference image is exceedingly challenging. We propose FLASH (Few-shot Learning to Animate and Steer Humans), which improves motion generalization by aligning motion features and inter-frame correspondence relations between videos that share the same motion but have different appearances. This approach minimizes overfitting to the visual appearances in the limited training data and enhances the generalization of the learned motion patterns. Additionally, FLASH extends the decoder with additional layers to compensate for details lost in the latent space, fostering smooth transitions from the initial reference image. Experiments demonstrate that FLASH effectively animates images with unseen human or scene appearances into specified actions while maintaining smooth transitions from the reference image.
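The alignment idea can be sketched as a simple auxiliary loss: given per-frame motion features extracted from two appearance-differing copies of the same motion, pull the features themselves together and also match their inter-frame similarity structure. The function below is a minimal, hypothetical illustration of that principle (the paper's actual loss, encoder, and feature shapes are not specified here); `motion_alignment_loss` and its inputs are assumed names.

```python
import torch
import torch.nn.functional as F

def motion_alignment_loss(feats_a: torch.Tensor, feats_b: torch.Tensor) -> torch.Tensor:
    """Hypothetical alignment loss for two appearance-differing copies
    of the same motion.

    feats_a, feats_b: (T, C) per-frame motion features from a shared
    motion encoder, one row per frame.
    """
    # Term 1: align the per-frame motion features directly.
    feat_loss = F.mse_loss(feats_a, feats_b)

    # Term 2: align inter-frame correspondence relations. The (T, T)
    # cosine-similarity matrix between frames should match across the
    # two copies, since they depict the same motion.
    norm_a = F.normalize(feats_a, dim=-1)
    norm_b = F.normalize(feats_b, dim=-1)
    corr_loss = F.mse_loss(norm_a @ norm_a.T, norm_b @ norm_b.T)

    return feat_loss + corr_loss
```

Because the loss depends only on motion features and their pairwise relations, appearance-specific details that differ between the two copies are penalized, which is one concrete way to discourage overfitting to visual appearance in a 16-video training set.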