🤖 AI Summary
Masked image generation models suffer from inefficiency due to multi-step bidirectional attention and loss of continuous semantic information in discrete sampling, while existing acceleration methods introduce significant approximation errors at high speedup ratios. This work proposes MIGM-Shortcut, which, for the first time, formulates feature evolution as a controlled dynamical system. It employs a lightweight network to learn an average velocity field derived from the fusion of historical features and already sampled tokens, enabling efficient prediction of future features. By transcending the representational limitations of conventional caching-based approximations, the method achieves over 4× acceleration on mainstream architectures such as Lumina-DiMOO while preserving generation quality, substantially advancing the efficiency–quality Pareto frontier.
📝 Abstract
Masked Image Generation Models (MIGMs) have achieved great success, yet their efficiency is hampered by the many steps of bidirectional attention. Their computation contains notable redundancy: when discrete tokens are sampled, the rich semantics carried by the continuous features are lost. Some existing works cache past features to approximate future ones, but they exhibit considerable approximation error under aggressive acceleration rates. We attribute this to their limited expressivity and their failure to account for sampling information. To fill this gap, we propose to learn a lightweight model that incorporates both previous features and sampled tokens, and regresses the average velocity field of feature evolution. The model's moderate capacity suffices to capture these subtle dynamics while remaining lightweight compared to the base model. We apply our method, MIGM-Shortcut, to two representative MIGM architectures and tasks. In particular, on the state-of-the-art Lumina-DiMOO, it achieves over 4× acceleration of text-to-image generation while maintaining quality, significantly pushing the Pareto frontier of masked image generation. The code and model weights are available at https://github.com/Kaiwen-Zhu/MIGM-Shortcut.
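The core idea of the abstract — predicting future features in one shot by regressing an average velocity of feature evolution from cached features and already-sampled tokens — can be sketched in a toy form. This is a minimal illustrative sketch, not the paper's implementation: the network shape, the one-hot token embedding, and all dimensions are hypothetical stand-ins for the lightweight predictor described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical; real MIGMs are far larger).
D, V = 8, 16  # feature dim, token vocabulary size

# A tiny one-layer "velocity" network (illustrative stand-in for the
# lightweight model that fuses cached features and sampled tokens).
W_h = rng.standard_normal((D, D)) * 0.1  # weights on cached features
W_e = rng.standard_normal((V, D)) * 0.1  # weights on token embeddings

def avg_velocity(h, token_ids):
    """Regress an average velocity of feature evolution from the
    cached features h and the already-sampled token ids."""
    onehot = np.eye(V)[token_ids]           # embed sampled tokens
    return np.tanh(h @ W_h + onehot @ W_e)  # (N, D) velocity field

def shortcut(h, token_ids, span):
    """Jump `span` denoising steps in one shot:
    h_{t+span} ≈ h_t + span * v̄(h_t, tokens)."""
    return h + span * avg_velocity(h, token_ids)

# Usage: predict features 4 steps ahead instead of running
# 4 full bidirectional-attention forward passes of the base model.
h_t = rng.standard_normal((5, D))    # current features of 5 positions
tokens = rng.integers(0, V, size=5)  # already-sampled token ids
h_future = shortcut(h_t, tokens, span=4)
print(h_future.shape)  # (5, 8)
```

The shortcut replaces several full forward passes with one cheap call, which is how such a predictor can trade a small approximation for a large speedup.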