🤖 AI Summary
This work addresses the inefficiency of video motion transfer, which often suffers from redundant motion and gradient computations, making it difficult to balance generation speed and visual quality. To tackle this issue, the authors propose an efficient diffusion Transformer framework that incorporates a local attention masking mechanism to suppress redundant motion information and a gradient reuse strategy along the diffusion trajectory to avoid redundant calculations. The method substantially improves inference efficiency, yielding an average speedup of 3.43×, while preserving high visual fidelity and temporal consistency. This approach establishes a new paradigm for high-quality video motion transfer that effectively reconciles computational efficiency with performance.
📝 Abstract
Video motion transfer aims to synthesize videos by generating visual content according to a text prompt while transferring the motion pattern observed in a reference video. Recent methods predominantly use the Diffusion Transformer (DiT) architecture. To achieve satisfactory runtime, several methods attempt to accelerate the computations in the DiT, but fail to address structural sources of inefficiency. In this work, we identify and remove two types of computational redundancy in earlier work: motion redundancy arises because the generic DiT architecture does not reflect the fact that frame-to-frame motion is small and smooth; gradient redundancy occurs if one ignores that gradients change slowly along the diffusion trajectory. To mitigate motion redundancy, we restrict the corresponding attention layers to a local neighborhood so that interaction weights are not computed for unnecessarily distant image regions. To exploit gradient redundancy, we design an optimization scheme that reuses gradients from previous diffusion steps and skips unwarranted gradient computations. On average, FastVMT achieves a 3.43x speedup without degrading the visual fidelity or the temporal consistency of the generated videos.
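The two ideas in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the 1D token layout, the window size, and the `grad_fn` stand-in for differentiating a motion-matching loss are all simplifying assumptions. The first function builds a boolean attention mask that keeps only token pairs whose spatial positions lie within a local window (distant pairs need never be computed); the second recomputes a guidance gradient only every few diffusion steps and reuses the cached gradient in between.

```python
import numpy as np

def local_attention_mask(n_frames: int, n_tokens: int, window: int) -> np.ndarray:
    """Boolean (F*T, F*T) mask over all frame tokens, assuming a 1D spatial
    grid of n_tokens per frame. Entry (i, j) is True only if tokens i and j
    are within `window` spatial positions of each other (in any frame)."""
    pos = np.tile(np.arange(n_tokens), n_frames)     # spatial index of each token
    dist = np.abs(pos[:, None] - pos[None, :])       # pairwise spatial distance
    return dist <= window                            # True = compute this interaction

def guide_with_gradient_reuse(x, grad_fn, num_steps=50, reuse_every=4, lr=0.1):
    """Gradient-reuse loop: recompute the guidance gradient only every
    `reuse_every` steps and reuse the cached value otherwise, exploiting
    that gradients drift slowly along the trajectory. `grad_fn` is a
    hypothetical stand-in for differentiating the motion-matching loss."""
    cached = None
    for t in range(num_steps):
        if t % reuse_every == 0:
            cached = grad_fn(x)        # expensive: only every few steps
        x = x - lr * cached            # cheap update with (possibly stale) gradient
    return x
```

With a window of 1 and 3 tokens per frame, a token still attends to the same spatial position in every frame (which is what makes small, smooth frame-to-frame motion cheap to capture), but never to positions more than one step away.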