🤖 AI Summary
Masked image generation models suffer from inefficiency due to multi-step bidirectional attention and loss of continuous semantic information in discrete sampling, while existing acceleration methods introduce significant approximation errors at high speedup ratios. This work proposes MIGM-Shortcut, which, for the first time, formulates feature evolution as a controlled dynamical system. It employs a lightweight network to learn an average velocity field derived from the fusion of historical features and already sampled tokens, enabling efficient prediction of future features. By transcending the representational limitations of conventional caching-based approximations, the method achieves over 4× acceleration on mainstream architectures such as Lumina-DiMOO while preserving generation quality, substantially advancing the efficiency–quality Pareto frontier.
📝 Abstract
Masked Image Generation Models (MIGMs) have achieved great success, yet their efficiency is hampered by the many steps of bidirectional attention. Their computation contains notable redundancy: when discrete tokens are sampled, the rich semantics carried by the continuous features are lost. Some existing works cache past features to approximate future ones, but they exhibit considerable approximation error under aggressive acceleration rates. We attribute this to their limited expressivity and their failure to account for sampling information. To fill this gap, we propose to learn a lightweight model that incorporates both previous features and sampled tokens, and regresses the average velocity field of feature evolution. The model's moderate capacity suffices to capture these subtle dynamics while remaining lightweight compared to the base model. We apply our method, MIGM-Shortcut, to two representative MIGM architectures and tasks. In particular, on the state-of-the-art Lumina-DiMOO, it achieves over 4× acceleration of text-to-image generation while maintaining quality, significantly pushing the Pareto frontier of masked image generation. The code and model weights are available at https://github.com/Kaiwen-Zhu/MIGM-Shortcut.
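The core idea of the abstract — predicting future features in one shot by regressing an average velocity of feature evolution from cached features and already-sampled tokens — can be sketched in a toy form. This is a minimal illustrative sketch, not the paper's implementation: the network shape, the one-hot token embedding, and all dimensions are hypothetical stand-ins for the lightweight predictor described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical; real MIGMs are far larger).
D, V = 8, 16  # feature dim, token vocabulary size

# A tiny one-layer "velocity" network (illustrative stand-in for the
# lightweight model that fuses cached features and sampled tokens).
W_h = rng.standard_normal((D, D)) * 0.1  # weights on cached features
W_e = rng.standard_normal((V, D)) * 0.1  # weights on token embeddings

def avg_velocity(h, token_ids):
    """Regress an average velocity of feature evolution from the
    cached features h and the already-sampled token ids."""
    onehot = np.eye(V)[token_ids]           # embed sampled tokens
    return np.tanh(h @ W_h + onehot @ W_e)  # (N, D) velocity field

def shortcut(h, token_ids, span):
    """Jump `span` denoising steps in one shot:
    h_{t+span} ≈ h_t + span * v̄(h_t, tokens)."""
    return h + span * avg_velocity(h, token_ids)

# Usage: predict features 4 steps ahead instead of running
# 4 full bidirectional-attention forward passes of the base model.
h_t = rng.standard_normal((5, D))    # current features of 5 positions
tokens = rng.integers(0, V, size=5)  # already-sampled token ids
h_future = shortcut(h_t, tokens, span=4)
print(h_future.shape)  # (5, 8)
```

The shortcut replaces several full forward passes with one cheap call, which is how such a predictor can trade a small approximation for a large speedup.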