AnimateAnywhere: Rouse the Background in Human Image Animation

📅 2025-04-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing human image animation methods often produce static or motion-incoherent backgrounds and rely heavily on predefined camera trajectories, limiting accessibility for non-expert users. This paper proposes an end-to-end human animation framework that requires no explicit camera pose input, enabling implicit background motion synthesis driven solely by a sequence of human poses—the first method to achieve such pose-conditioned dynamic background evolution. Key contributions include: (1) a Background Motion Learner (BML) that establishes an end-to-end mapping from pose sequences to background optical flow; (2) an epipolar-constrained 3D attention mask to enhance inter-frame geometric consistency; and (3) a differentiable implicit background modeling and compositing mechanism. Evaluated on multiple benchmarks, our approach achieves state-of-the-art performance, significantly improving realism and dynamic coherence between human motion and background—outperforming all prior methods requiring explicit camera trajectory specification.

Technology Category

Application Category

📝 Abstract
Human image animation aims to generate human videos of given characters and backgrounds that adhere to the desired pose sequence. However, existing methods focus more on human actions while neglecting the generation of background, which typically leads to static results or inharmonious movements. The community has explored camera pose-guided animation tasks, yet preparing the camera trajectory is impractical for most entertainment applications and ordinary users. As a remedy, we present an AnimateAnywhere framework, rousing the background in human image animation without requirements on camera trajectories. In particular, based on our key insight that the movement of the human body often reflects the motion of the background, we introduce a background motion learner (BML) to learn background motions from human pose sequences. To encourage the model to learn more accurate cross-frame correspondences, we further deploy an epipolar constraint on the 3D attention map. Specifically, the mask used to suppress geometrically unreasonable attention is carefully constructed by combining an epipolar mask and the current 3D attention map. Extensive experiments demonstrate that our AnimateAnywhere effectively learns the background motion from human pose sequences, achieving state-of-the-art performance in generating human animation results with vivid and realistic backgrounds. The source code and model will be available at https://github.com/liuxiaoyu1104/AnimateAnywhere.
Problem

Research questions and friction points this paper is trying to address.

Generates human videos with dynamic backgrounds from pose sequences
Eliminates need for camera trajectories in background animation
Learns background motion from human poses using 3D attention
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learns background motion from human pose sequences
Uses epipolar constraint on 3D attention map
Generates human animation with realistic backgrounds
🔎 Similar Papers
No similar papers found.
X
Xiaoyu Liu
Harbin Institute of Technology
M
Mingshuai Yao
Harbin Institute of Technology
Y
Yabo Zhang
Harbin Institute of Technology
Xianhui Lin
Xianhui Lin
Tongyi Lab, Alibaba Group
Computer VisionLow-level VisionVideo Generation
P
Peiran Ren
X
Xiaoming Li
Nanyang Technological University
M
Ming Liu
Harbin Institute of Technology
Wangmeng Zuo
Wangmeng Zuo
School of Computer Science and Technology, Harbin Institute of Technology
Computer VisionImage ProcessingGenerative AIDeep LearningBiometrics