FlashAR: Efficient Post-Training Acceleration for Autoregressive Image Generation

📅 2026-05-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Autoregressive image generation models suffer from slow inference because raster-scan decoding emits one token at a time, and existing acceleration methods often require retraining or introduce train-inference inconsistencies. This work proposes a lightweight post-training adaptation framework that keeps the original prediction head as a horizontal (row-wise) head and adds a vertical (column-wise) prediction head. The vertical head branches off intermediate-layer features, giving a two-way next-token prediction pathway, and a position-adaptive, learnable fusion gate dynamically combines the row- and column-wise predictions. Using only 0.05% of the original training data, the approach achieves up to a 22.9× speedup for 512×512 image generation in experiments on LlamaGen and Emu3.5, improving inference efficiency while keeping changes to the original training objective minimal.
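One way to see why predicting from both the left and the upper neighbor allows parallel decoding is to count decoding steps. The sketch below is an illustration under stated assumptions, not the schedule from the paper: it assumes each token at (row, col) needs only its left neighbor (row, col - 1) and its upper neighbor (row - 1, col), so every position on the same anti-diagonal can be decoded in one step. The function names and the 32×32 token grid are hypothetical.

```python
def wavefront_schedule(height, width):
    """Group grid positions by anti-diagonal index (row + col).

    Assumption: a token depends only on its left and upper neighbors,
    both of which lie on the previous anti-diagonal.
    """
    steps = [[] for _ in range(height + width - 1)]
    for row in range(height):
        for col in range(width):
            steps[row + col].append((row, col))
    return steps


def raster_steps(height, width):
    """Raster-scan decoding emits exactly one token per step."""
    return height * width


# Hypothetical 32x32 token grid, chosen only for illustration.
grid_h, grid_w = 32, 32
schedule = wavefront_schedule(grid_h, grid_w)
print("raster-scan steps:", raster_steps(grid_h, grid_w))
print("wavefront steps:  ", len(schedule))
```

Under these assumptions the number of sequential steps drops from height × width to height + width - 1. This is a step count only; wall-clock speedup also depends on per-step cost, so it should not be read as a reproduction of the reported 22.9× figure.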

📝 Abstract

Large-scale autoregressive models have demonstrated remarkable capabilities in image generation. However, their sequential raster-scan decoding relies on strict next-token prediction, making inference prohibitively expensive. Existing acceleration methods typically either introduce entirely new generation paradigms that necessitate costly pre-training from scratch, or enable parallel generation at the expense of a training-inference gap or altered prediction objectives. In this paper, we introduce FlashAR, a lightweight post-training adaptation framework that efficiently adapts a pre-trained raster-scan autoregressive model into a highly parallel generator based on two-way next-token prediction. Our key insight is that effective adaptation should minimize modifications to the pre-trained model's original training objective to preserve its learned prior. Accordingly, we retain the original AR head as a horizontal head for row-wise prediction and introduce a complementary, lightweight vertical head for column-wise prediction. To facilitate efficient adaptation, we branch the vertical head from an intermediate layer rather than the final layer, bypassing the inherent horizontal-head bias. Moreover, since horizontal and vertical predictions capture complementary dependencies whose relative importance varies across target positions, we employ a learnable fusion gate to dynamically combine the two predictions at each position. To further reduce adaptation cost, we propose a two-stage adaptation pipeline: the vertical head is first initialized through adaptation from the pre-trained autoregressive model and then jointly fine-tuned with the backbone to adapt to the new decoding paradigm. Extensive experiments on LlamaGen and Emu3.5 show that FlashAR achieves up to a 22.9× speedup for 512×512 image generation through lightweight post-training on merely 0.05% of the original training data.
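The per-position fusion of the two heads can be sketched as a gated mixture. The abstract does not specify the gate's form, so everything here is an assumption made for illustration: a scalar gate logit per target position, a sigmoid to map it into [0, 1], and mixing at the level of probabilities rather than logits or hidden states. The names `fuse` and `gate_logit` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fuse(horizontal_logits, vertical_logits, gate_logit):
    """Mix the two heads' token distributions at one position.

    Assumption: gate_logit is a learned scalar for this position;
    g near 1 trusts the horizontal head, g near 0 the vertical head.
    """
    g = sigmoid(gate_logit)
    p_h = softmax(horizontal_logits)
    p_v = softmax(vertical_logits)
    return [g * h + (1.0 - g) * v for h, v in zip(p_h, p_v)]


# Toy vocabulary of three tokens; the values are made up.
fused = fuse([2.0, 0.5, -1.0], [0.0, 1.5, 0.3], gate_logit=0.8)
print(fused)
```

Because the result is a convex combination of two probability distributions, it is itself a valid distribution, and a position-dependent gate lets the mix shift toward whichever neighbor is more informative at that position.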

Problem

Research questions and friction points this paper is trying to address.

autoregressive image generation

inference acceleration

post-training adaptation

sequential decoding

parallel generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

post-training adaptation

autoregressive image generation

parallel decoding