AI Summary
This work addresses the limitations of existing character animation methods: explicit motion representations struggle with spatial misalignment and scale variations, while implicit approaches often suffer from identity leakage due to entanglement between motion and appearance. To overcome these challenges, we propose a novel implicit motion representation framework that compresses per-frame motion into compact 1D motion tokens, thereby relaxing 2D spatial constraints and effectively disentangling identity information. Additionally, we introduce a mask token-based, temporally consistent retargeting module to enhance motion coherence during animation transfer. Integrated with a three-stage training strategy and a video diffusion model, our method achieves state-of-the-art or comparable performance across multiple metrics, enabling high-fidelity, identity-disentangled, and temporally coherent character animation generation.
Abstract
Recent progress in video diffusion models has markedly advanced character animation, which synthesizes videos by animating a static identity image according to a driving video. Explicit methods represent motion using skeletons, DWPose, or other explicitly structured signals, but struggle to handle spatial mismatches and varying body scales. Implicit methods, on the other hand, capture high-level implicit motion semantics directly from the driving video, but suffer from identity leakage and entanglement between motion and appearance. To address these challenges, we propose a novel implicit motion representation that compresses per-frame motion into compact 1D motion tokens. This design relaxes the strict spatial constraints inherent in 2D representations and effectively prevents identity information from leaking out of the motion video. Furthermore, we design a temporally consistent mask token-based retargeting module that enforces a temporal training bottleneck, mitigating interference from the source image's motion and improving retargeting consistency. Our method employs a three-stage training strategy to improve training efficiency and ensure high fidelity. Extensive experiments demonstrate that our implicit motion representation and the proposed IM-Animation achieve superior or competitive performance compared with state-of-the-art methods.
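The abstract does not specify how the 1D motion tokens are produced. As a rough illustration only, the PyTorch sketch below shows one plausible way to compress a per-frame 2D feature map into a small set of 1D tokens via cross-attention from learnable queries; the class name `MotionTokenizer`, the token count, and the use of ViT-style patch features are our assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class MotionTokenizer(nn.Module):
    """Hypothetical sketch: compress a per-frame 2D feature map into a
    small set of 1D motion tokens via cross-attention from learnable
    queries. Dropping the 2D spatial layout is what would relax the
    strict spatial alignment between driving and source frames."""

    def __init__(self, feat_dim=768, num_tokens=16, num_heads=8):
        super().__init__()
        # Learnable 1D queries; each one pools motion evidence from the
        # whole frame rather than from a fixed spatial location.
        self.queries = nn.Parameter(torch.randn(num_tokens, feat_dim) * 0.02)
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(feat_dim)

    def forward(self, frame_feats):
        # frame_feats: (B, H*W, C) patch features of one driving frame.
        B = frame_feats.shape[0]
        q = self.queries.unsqueeze(0).expand(B, -1, -1)      # (B, N, C)
        tokens, _ = self.attn(q, frame_feats, frame_feats)   # (B, N, C)
        return self.norm(tokens)  # compact 1D motion tokens

# Usage: 16 tokens per frame instead of a 2D pose map.
feats = torch.randn(2, 14 * 14, 768)   # e.g. ViT patch features
print(MotionTokenizer()(feats).shape)  # torch.Size([2, 16, 768])
```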
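Similarly, the "temporal training bottleneck" of the retargeting module is only named, not described. One way to realize such a bottleneck, shown in the sketch below purely as a guess at the mechanism, is to randomly replace whole frames of motion tokens with a learnable mask token during training and require a temporal transformer to fill them back in from neighboring frames; all names and the masking scheme here are our assumptions.

```python
import torch
import torch.nn as nn

class MaskTokenRetargeter(nn.Module):
    """Hypothetical sketch of a mask-token temporal bottleneck: randomly
    replace per-frame motion tokens with a shared learnable [MASK] token,
    then let a temporal transformer reconstruct them from neighboring
    frames. Recovering masked frames purely from temporal context would
    discourage copying source-image motion and encourage temporally
    consistent retargeting. (A real implementation would also add
    temporal positional embeddings, omitted here for brevity.)"""

    def __init__(self, dim=768, depth=2, num_heads=8, mask_ratio=0.3):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.mask_token = nn.Parameter(torch.zeros(1, 1, 1, dim))
        layer = nn.TransformerEncoderLayer(
            dim, num_heads, dim * 4, batch_first=True, norm_first=True)
        self.temporal = nn.TransformerEncoder(layer, depth)

    def forward(self, motion_tokens):
        # motion_tokens: (B, T, N, C) = frames x tokens-per-frame.
        B, T, N, C = motion_tokens.shape
        if self.training:
            # Mask entire frames so the bottleneck is temporal, not spatial.
            keep = torch.rand(B, T, 1, 1,
                              device=motion_tokens.device) > self.mask_ratio
            motion_tokens = torch.where(keep, motion_tokens, self.mask_token)
        # Attend along time independently for each token index.
        x = motion_tokens.permute(0, 2, 1, 3).reshape(B * N, T, C)
        x = self.temporal(x)
        return x.reshape(B, N, T, C).permute(0, 2, 1, 3)
```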