Consistent and Controllable Image Animation with Motion Linear Diffusion Transformers

📅 2025-08-10
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address appearance inconsistency, motion discontinuity, and low inference efficiency in image animation, this paper proposes MiraMo, a novel diffusion-based framework. Methodologically, it (1) introduces a motion residual learning paradigm that models temporal dynamics directly rather than performing frame-wise prediction; (2) designs a DCT-domain noise optimization and dynamics control module that refines noise in the frequency domain, jointly preserving fine-grained appearance consistency and motion smoothness; and (3) adopts a diffusion Transformer with linear attention to avoid the quadratic computational complexity of standard self-attention. Extensive experiments on multiple benchmarks demonstrate that MiraMo outperforms state-of-the-art methods in appearance fidelity, motion smoothness, and inference speed. Moreover, it supports controllable generation tasks such as motion transfer and video editing, offering enhanced flexibility and practical utility.
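
The residual-learning idea is easy to state in code. Below is a minimal sketch, assuming residuals are taken against the first-frame latent of a [B, T, C, H, W] video tensor; the function names and the exact parameterization are illustrative, not the paper's implementation.

```python
import torch

def to_motion_residuals(latents: torch.Tensor) -> torch.Tensor:
    """Convert per-frame latents [B, T, C, H, W] into residuals w.r.t. the
    first frame, so the model learns temporal dynamics instead of
    re-predicting appearance in every frame. (Illustrative assumption.)"""
    anchor = latents[:, :1]          # the encoded input image, kept fixed
    return latents - anchor          # frame 0 becomes an all-zero residual

def from_motion_residuals(residuals: torch.Tensor, anchor: torch.Tensor) -> torch.Tensor:
    """Invert the transform: add the anchor frame back to every residual."""
    return residuals + anchor

# Toy round trip.
latents = torch.randn(1, 16, 4, 32, 32)            # B, T, C, H, W
res = to_motion_residuals(latents)
recon = from_motion_residuals(res, latents[:, :1])
assert torch.allclose(recon, latents)
```

Because the anchor frame passes through unchanged, the model spends its capacity on dynamics rather than re-synthesizing appearance frame by frame.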

📝 Abstract
Image animation has seen significant progress, driven by the powerful generative capabilities of diffusion models. However, maintaining appearance consistency with static input images and mitigating abrupt motion transitions in generated animations remain persistent challenges. While text-to-video (T2V) generation has demonstrated impressive performance with diffusion transformer models, the image animation field still largely relies on U-Net-based diffusion models, which lag behind the latest T2V approaches. Moreover, the quadratic complexity of vanilla self-attention mechanisms in Transformers imposes heavy computational demands, making image animation particularly resource-intensive. To address these issues, we propose MiraMo, a framework designed to enhance efficiency, appearance consistency, and motion smoothness in image animation. Specifically, MiraMo introduces three key elements: (1) A foundational text-to-video architecture replacing vanilla self-attention with efficient linear attention to reduce computational overhead while preserving generation quality; (2) A novel motion residual learning paradigm that focuses on modeling motion dynamics rather than directly predicting frames, improving temporal consistency; and (3) A DCT-based noise refinement strategy during inference to suppress sudden motion artifacts, complemented by a dynamics control module to balance motion smoothness and expressiveness. Extensive experiments against state-of-the-art methods validate the superiority of MiraMo in generating consistent, smooth, and controllable animations with accelerated inference speed. Additionally, we demonstrate the versatility of MiraMo through applications in motion transfer and video editing tasks.
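
The abstract's first element, swapping vanilla self-attention for linear attention, can be illustrated with a standard kernelized formulation (in the spirit of Katharopoulos et al., 2020). This is a generic sketch, not MiraMo's specific attention design; the feature map and normalization below are assumptions.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """Kernelized linear attention: softmax(QK^T)V is approximated by
    phi(Q)(phi(K)^T V), so cost grows linearly with sequence length N
    instead of quadratically. q, k, v: [B, heads, N, dim]."""
    q = F.elu(q) + 1.0               # positive feature map phi (an assumption)
    k = F.elu(k) + 1.0
    kv = torch.einsum("bhnd,bhne->bhde", k, v)              # [B, h, d, d], O(N d^2)
    z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + eps)
    return torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)

# A 16-frame clip of 32x32 latent tokens, flattened to N = 16 * 1024 tokens.
q = k = v = torch.randn(1, 8, 16 * 1024, 64)
out = linear_attention(q, k, v)      # [1, 8, 16384, 64], no N x N matrix built
```

The payoff is that the N×N attention matrix is never materialized: for the 16,384 tokens above it would hold roughly 268M entries per head, while the kv summary stays at 64×64.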
Problem

Research questions and friction points this paper is trying to address.

Enhancing appearance consistency in image animation
Reducing abrupt motion transitions in animations
Improving computational efficiency of diffusion models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Efficient linear attention reduces computational overhead
Motion residual learning improves temporal consistency
DCT-based noise refinement suppresses motion artifacts (see the sketch after this list)
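
As referenced in the last item above, here is one plausible reading of DCT-based noise refinement: low-pass filtering the initial noise along the temporal axis before sampling. The `keep` ratio is a hypothetical stand-in for the paper's dynamics control; MiraMo's actual frequency-domain optimization is not reproduced here.

```python
import numpy as np
from scipy.fft import dctn, idctn

def refine_noise_dct(noise: np.ndarray, keep: float = 0.25) -> np.ndarray:
    """Low-pass the initial diffusion noise along the temporal axis in the
    DCT domain. noise: [T, C, H, W]. Zeroing high-frequency temporal
    coefficients biases sampling toward smoother motion; `keep` is a
    hypothetical knob for how many low-frequency components survive."""
    coeffs = dctn(noise, axes=(0,), norm="ortho")   # 1-D DCT over frames
    cutoff = max(1, int(keep * noise.shape[0]))
    coeffs[cutoff:] = 0.0                           # drop abrupt temporal changes
    out = idctn(coeffs, axes=(0,), norm="ortho")
    # NOTE: a real sampler may need the noise re-standardized afterwards,
    # since filtering reduces its variance.
    return out

noise = np.random.randn(16, 4, 32, 32).astype(np.float32)
smooth = refine_noise_dct(noise, keep=0.25)
```
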
Xin Ma
Department of Data Science & AI, Faculty of Information Technology, Monash University, Melbourne 3800, Australia
Yaohui Wang
Research Scientist, Shanghai AI Laboratory | Inria
Machine Learning · Deep Generative Models · Video Generation
Gengyun Jia
Nanjing University of Posts and Telecommunications, Nanjing 210003, China
Xinyuan Chen
Shanghai Artificial Intelligence Laboratory, Shanghai 200240, China
Tien-Tsin Wong
Professor, Dept of Data Science and Artificial Intelligence, Monash University
Generative AI · Computer Graphics · Computational Manga · Computer Vision
Cunjian Chen
Monash University
Generative AI · Computer Vision · Deep Learning