🤖 AI Summary
Existing human image animation methods often produce static or motion-incoherent backgrounds, and camera-guided alternatives rely on predefined camera trajectories, which limits accessibility for non-expert users. This paper proposes AnimateAnywhere, an end-to-end human animation framework that requires no explicit camera pose input: background motion is synthesized implicitly, driven solely by a sequence of human poses. It is presented as the first method to achieve such pose-conditioned dynamic background evolution. Key contributions include: (1) a Background Motion Learner (BML) that establishes an end-to-end mapping from pose sequences to background optical flow; (2) an epipolar-constrained 3D attention mask that improves inter-frame geometric consistency; and (3) a differentiable implicit background modeling and compositing mechanism. Evaluated on multiple benchmarks, the approach achieves state-of-the-art performance, improving realism and the dynamic coherence between human motion and background, and outperforming prior methods that require explicit camera trajectory specification.
📝 Abstract
Human image animation aims to generate human videos of given characters and backgrounds that adhere to a desired pose sequence. However, existing methods focus on human actions and neglect background generation, which typically leads to static backgrounds or inharmonious movements. The community has explored camera pose-guided animation, yet preparing a camera trajectory is impractical for most entertainment applications and ordinary users. As a remedy, we present AnimateAnywhere, a framework that brings the background to life in human image animation without requiring camera trajectories. In particular, based on our key insight that the movement of the human body often reflects the motion of the background, we introduce a background motion learner (BML) to learn background motions from human pose sequences. To encourage the model to learn more accurate cross-frame correspondences, we further apply an epipolar constraint to the 3D attention map. Specifically, the mask used to suppress geometrically unreasonable attention is carefully constructed by combining an epipolar mask and the current 3D attention map. Extensive experiments demonstrate that AnimateAnywhere effectively learns the background motion from human pose sequences, achieving state-of-the-art performance in generating human animation results with vivid and realistic backgrounds. The source code and model will be available at https://github.com/liuxiaoyu1104/AnimateAnywhere.
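To make the idea of an epipolar constraint on attention more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a fundamental matrix `F` between two frames is already available (the abstract does not say how the geometry is obtained), and it applies only a plain epipolar mask; the paper's rule for combining that mask with the current 3D attention map is not specified in the abstract and is not reproduced here. The function names, the `thresh` parameter, and the fallback for rows with no permitted key are all hypothetical.

```python
import numpy as np

def pixel_grid(h, w):
    """Homogeneous pixel coordinates (x, y, 1) for an h x w grid, shape (h*w, 3)."""
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    ones = np.ones(h * w)
    return np.stack([xs.ravel(), ys.ravel(), ones], axis=1).astype(np.float64)

def epipolar_mask(F, h, w, thresh=1.0):
    """Boolean (h*w, h*w) mask between a query frame A and a key frame B.

    Entry [i, j] is True when key pixel j lies within `thresh` pixels of the
    epipolar line induced in frame B by query pixel i of frame A.
    """
    pts = pixel_grid(h, w)
    lines = pts @ F.T                       # row i holds the line F @ x_i = (a, b, c)
    norm = np.linalg.norm(lines[:, :2], axis=1, keepdims=True)
    dist = np.abs(lines @ pts.T) / np.maximum(norm, 1e-8)  # point-to-line distance
    return dist <= thresh

def masked_attention(scores, mask):
    """Row-wise softmax over keys, with keys outside the mask suppressed.

    Assumption: a query with no permitted key falls back to unmasked attention.
    """
    keep = mask | ~mask.any(axis=1, keepdims=True)
    logits = np.where(keep, scores, -np.inf)
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

# Toy geometry: a pure horizontal camera translation, for which each epipolar
# line is the image row of the query pixel.
F_horizontal = np.array([[0.0, 0.0, 0.0],
                         [0.0, 0.0, -1.0],
                         [0.0, 1.0, 0.0]])
mask = epipolar_mask(F_horizontal, h=3, w=4, thresh=0.5)
attn = masked_attention(np.zeros((12, 12)), mask)
```

In this toy case each query may attend only to keys in its own image row of the other frame, so uniform scores yield a weight of 0.25 on each of the four permitted keys. A real model would operate on latent feature maps across many frames rather than on a single pair of pixel grids.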