Learning to Animate Images from A Few Videos to Portray Delicate Human Actions

πŸ“… 2025-03-01
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the problem of animating static human images from sparse driving videos (16 or fewer frames), confronting two key challenges: poor generalization to uncommon motions and unnatural transitions from the reference image to the first animated frame. The authors propose FLASH (Few-shot Learning to Animate and Steer Humans), a diffusion-based framework for few-shot human animation. The method (1) aligns motion features and inter-frame correspondence relations across videos that share the same motion but differ in appearance, reducing overfitting to visual appearance and improving the robustness of motion transfer, and (2) extends the latent-space decoder with additional layers that compensate for details lost in the latent space, promoting smooth transitions from the reference image and preserving texture fidelity. Experiments demonstrate that FLASH animates images with unseen human or scene appearances into the specified actions while maintaining smooth transitions from the reference image under limited training data.

πŸ“ Abstract
Despite recent progress, video generative models still struggle to animate human actions from static images, particularly when handling uncommon actions whose training data are limited. In this paper, we investigate the task of learning to animate human actions from a small number of videos -- 16 or fewer -- which is highly valuable in real-world applications like video and movie production. Few-shot learning of generalizable motion patterns while ensuring smooth transitions from the initial reference image is exceedingly challenging. We propose FLASH (Few-shot Learning to Animate and Steer Humans), which improves motion generalization by aligning motion features and inter-frame correspondence relations between videos that share the same motion but have different appearances. This approach minimizes overfitting to visual appearances in the limited training data and enhances the generalization of learned motion patterns. Additionally, FLASH extends the decoder with additional layers to compensate for lost details in the latent space, fostering smooth transitions from the initial reference image. Experiments demonstrate that FLASH effectively animates images with unseen human or scene appearances into specified actions while maintaining smooth transitions from the reference image.
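The abstract describes aligning both motion features and inter-frame correspondence relations between clips that share a motion but differ in appearance. Below is a minimal PyTorch sketch of that general idea, not the authors' released code: the feature dimensions, function names, and loss weighting are illustrative assumptions.

```python
# Illustrative sketch (assumptions, not FLASH's implementation): encourage two
# clips that portray the same action with different appearances to agree both
# in per-frame motion features and in their frame-to-frame affinity patterns.
import torch
import torch.nn.functional as F

def interframe_affinity(feats: torch.Tensor) -> torch.Tensor:
    """feats: (T, D) per-frame motion features -> (T, T) cosine affinity matrix."""
    feats = F.normalize(feats, dim=-1)
    return feats @ feats.t()

def motion_alignment_loss(feats_a: torch.Tensor, feats_b: torch.Tensor,
                          w_corr: float = 1.0) -> torch.Tensor:
    # Feature-level alignment: pull per-frame motion features together.
    feat_loss = 1.0 - F.cosine_similarity(feats_a, feats_b, dim=-1).mean()
    # Correspondence-level alignment: match inter-frame affinity structure.
    corr_loss = F.mse_loss(interframe_affinity(feats_a),
                           interframe_affinity(feats_b))
    return feat_loss + w_corr * corr_loss

# Example: two 16-frame clips with (assumed) 256-d motion features per frame.
loss = motion_alignment_loss(torch.randn(16, 256), torch.randn(16, 256))
```

The intent of such an objective is that appearance-specific cues cancel out while the shared motion structure is reinforced, which is the generalization behavior the abstract attributes to the alignment step.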
Problem

Research questions and friction points this paper is trying to address.

Animate human actions from few videos
Generalize motion patterns with limited data
Ensure smooth transitions from static images
Innovation

Methods, ideas, or system contributions that make the work stand out.

Aligns motion features and inter-frame correspondence relations
Extends decoder layers to compensate for lost details (see the sketch after this list)
Ensures smooth transitions from initial reference image
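To make the decoder extension concrete, here is a minimal PyTorch sketch of one plausible reading of that idea: extra decoder layers that re-inject reference-image features to recover details lost in the latent space. The module name, channel sizes, and fusion scheme are assumptions for illustration, not the paper's architecture.

```python
# Illustrative sketch (assumed design, not FLASH's decoder): augment a latent
# decoder with extra layers that fuse reference-image features, so the decoded
# first frame stays close to the reference image.
import torch
import torch.nn as nn

class DetailCompensatingDecoder(nn.Module):
    def __init__(self, latent_ch: int = 4, ref_ch: int = 64, hidden: int = 64):
        super().__init__()
        # Stand-in for an existing latent decoder stage.
        self.base = nn.Sequential(
            nn.Conv2d(latent_ch, hidden, 3, padding=1), nn.SiLU(),
        )
        # Added layers that consume reference-image features as well.
        self.extra = nn.Sequential(
            nn.Conv2d(hidden + ref_ch, hidden, 3, padding=1), nn.SiLU(),
            nn.Conv2d(hidden, 3, 3, padding=1),
        )

    def forward(self, latent: torch.Tensor, ref_feats: torch.Tensor) -> torch.Tensor:
        x = self.base(latent)
        # Concatenate reference features at matching spatial resolution,
        # then predict the RGB frame with the detail-compensating layers.
        x = torch.cat([x, ref_feats], dim=1)
        return self.extra(x)

# Example: a 32x32 latent plus reference features decode to a 32x32 RGB frame.
dec = DetailCompensatingDecoder()
frame = dec(torch.randn(1, 4, 32, 32), torch.randn(1, 64, 32, 32))
```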
Authors
Haoxin Li, Nanyang Technological University (Computer Vision, Vision and Language)
Yingchen Yu, ByteDance, Singapore (Computer Vision)
Qilong Wu, National University of Singapore
Hanwang Zhang, Nanyang Technological University
Boyang Li, Nanyang Technological University
Song Bai, ByteDance