🤖 AI Summary
Masked generative models (MGMs) suffer from slow inference, and the obvious speedup of decoding more tokens per step degrades sample quality because simultaneously decoded tokens are predicted independently. To address this, we propose ReCAP (Reused Context-Aware Prediction), a plug-and-play module that accelerates MGM inference without retraining. Its core idea is to construct low-cost decoding steps by caching and reusing feature embeddings of previously decoded context tokens, interleaving these lightweight steps with standard full model evaluations. ReCAP is architecture-agnostic, supporting both discrete and continuous token spaces and diverse MGM designs, including MaskGIT, MAGE, and MAR, while preserving the benefits of fine-grained iterative generation. On ImageNet256 class-conditional generation, it achieves up to 2.4× faster inference with minimal performance drop, and consistently improves the efficiency–fidelity trade-off across generation settings.
📝 Abstract
Masked generative models (MGMs) have emerged as a powerful framework for image synthesis, combining parallel decoding with strong bidirectional context modeling. However, generating high-quality samples typically requires many iterative decoding steps, resulting in high inference costs. A straightforward way to speed up generation is to decode more tokens in each step, thereby reducing the total number of steps. Yet when many tokens are decoded simultaneously, the model can only estimate the univariate marginal distributions independently, failing to capture the dependencies among them. As a result, reducing the number of steps significantly compromises generation fidelity. In this work, we introduce ReCAP (Reused Context-Aware Prediction), a plug-and-play module that accelerates inference in MGMs by constructing low-cost steps that reuse feature embeddings from previously decoded context tokens. ReCAP interleaves standard full evaluations with lightweight steps that cache and reuse context features, substantially reducing computation while preserving the benefits of fine-grained, iterative generation. We demonstrate its effectiveness on top of three representative MGMs (MaskGIT, MAGE, and MAR), covering both discrete and continuous token spaces and diverse architectural designs. In particular, on ImageNet256 class-conditional generation, ReCAP achieves up to 2.4x faster inference than the base model with minimal performance drop, and consistently delivers better efficiency-fidelity trade-offs under various generation settings.
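To make the interleaving pattern concrete, the following is a minimal sketch of a ReCAP-style decoding loop. All names (`full_forward`, `light_forward`, `recap_decode`, `reuse_every`) are illustrative assumptions, not the paper's actual implementation: the toy "model" just produces placeholder features, and the point is only the control flow, where expensive full evaluations alternate with cheap steps that reuse cached context features.

```python
import random

# Hypothetical stand-in for a full model evaluation: recompute
# features for every position (the expensive step).
def full_forward(tokens, mask):
    return [t if not m else 0 for t, m in zip(tokens, mask)]

# Hypothetical lightweight step: reuse cached features for already
# decoded context tokens, "computing" only at still-masked positions.
def light_forward(tokens, mask, cached_feats):
    return [cached_feats[i] if not mask[i] else 0
            for i in range(len(tokens))]

def recap_decode(length=16, steps=8, reuse_every=2, seed=0):
    rng = random.Random(seed)
    tokens = [0] * length
    mask = [True] * length          # True = position still masked
    cached = None
    per_step = length // steps      # tokens unmasked per step
    for step in range(steps):
        if cached is None or step % reuse_every == 0:
            cached = full_forward(tokens, mask)            # full evaluation
        else:
            cached = light_forward(tokens, mask, cached)   # cached, low-cost step
        # Unmask a fixed number of positions using the current features.
        masked_idx = [i for i, m in enumerate(mask) if m]
        for i in rng.sample(masked_idx, min(per_step, len(masked_idx))):
            tokens[i] = rng.randint(1, 9)  # placeholder for sampled token
            mask[i] = False
    return tokens, mask
```

With `reuse_every=2`, half of the decoding steps skip the full model pass, which is the source of the speedup; a real implementation would cache transformer hidden states for decoded tokens rather than these toy features.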