HMAR: Efficient Hierarchical Masked Auto-Regressive Image Generation

📅 2025-06-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
VAR-based image generation faces three key bottlenecks: parallel generation of all tokens within a scale degrades sample quality; sequence length grows superlinearly with image resolution; and changing the sampling schedule requires retraining the model. This paper proposes HMAR, a Hierarchical Masked Auto-Regressive modeling framework that reformulates multi-scale prediction as a Markov process, conditioning each resolution scale only on its immediate predecessor rather than on all preceding scales. HMAR pairs this with efficient IO-aware block-sparse attention kernels and a controllable multi-step masked generation procedure within each scale. As a result, the sampling schedule can be adjusted without further training, and the model supports zero-shot image editing. HMAR trains over 2.5× faster and samples over 1.75× faster than VAR, with over 3× lower inference memory footprint. On ImageNet at 256×256 and 512×512 resolutions, HMAR matches or outperforms parameter-matched VAR, diffusion, and autoregressive baselines.

📝 Abstract
Visual Auto-Regressive modeling (VAR) has shown promise in bridging the speed and quality gap between autoregressive image models and diffusion models. VAR reformulates autoregressive modeling by decomposing an image into successive resolution scales. During inference, an image is generated by predicting all the tokens in the next (higher-resolution) scale, conditioned on all tokens in all previous (lower-resolution) scales. However, this formulation suffers from reduced image quality due to the parallel generation of all tokens in a resolution scale; has sequence lengths scaling superlinearly in image resolution; and requires retraining to change the sampling schedule. We introduce Hierarchical Masked Auto-Regressive modeling (HMAR), a new image generation algorithm that alleviates these issues using next-scale prediction and masked prediction to generate high-quality images with fast sampling. HMAR reformulates next-scale prediction as a Markovian process, wherein the prediction of each resolution scale is conditioned only on tokens in its immediate predecessor instead of the tokens in all predecessor resolutions. When predicting a resolution scale, HMAR uses a controllable multi-step masked generation procedure to generate a subset of the tokens in each step. On ImageNet 256x256 and 512x512 benchmarks, HMAR models match or outperform parameter-matched VAR, diffusion, and autoregressive baselines. We develop efficient IO-aware block-sparse attention kernels that allow HMAR to achieve faster training and inference times over VAR by over 2.5x and 1.75x respectively, as well as over 3x lower inference memory footprint. Finally, HMAR yields additional flexibility over VAR; its sampling schedule can be changed without further training, and it can be applied to image editing tasks in a zero-shot manner.
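The sampling procedure described in the abstract (Markovian next-scale prediction with multi-step masked generation inside each scale) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `predict_masked`, the scale sizes, the codebook size, and the fixed per-scale step count are all hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 1024  # hypothetical codebook size

def predict_masked(prev_scale, mask):
    # Stand-in for the model: it would condition on `prev_scale` only
    # (the Markov property) and predict token ids at `mask` positions.
    # Here we just draw random ids to keep the sketch self-contained.
    return rng.integers(0, VOCAB, size=int(mask.sum()))

def sample_hmar(scales=(1, 2, 4, 8), steps_per_scale=2):
    prev = None
    for s in scales:
        tokens = np.full((s, s), -1)  # -1 marks a still-masked position
        for step in range(steps_per_scale):
            idx = np.argwhere(tokens == -1)
            if len(idx) == 0:
                break
            # unmask an even share of the remaining positions this step
            k = int(np.ceil(len(idx) / (steps_per_scale - step)))
            chosen = idx[rng.permutation(len(idx))[:k]]
            mask = np.zeros((s, s), dtype=bool)
            mask[chosen[:, 0], chosen[:, 1]] = True
            tokens[mask] = predict_masked(prev, mask)
        prev = tokens  # only the immediate predecessor scale is carried forward
    return prev

final_tokens = sample_hmar()
```

Because each scale depends only on its predecessor, `steps_per_scale` (the sampling schedule) can be changed at inference time without retraining, which is the flexibility the abstract highlights.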
Problem

Research questions and friction points this paper is trying to address.

Parallel generation of all tokens within a resolution scale reduces image quality
Sequence length scales superlinearly with image resolution
Changing the sampling schedule requires retraining the model
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Masked Auto-Regressive modeling (HMAR) for image generation
Markovian next-scale prediction: each scale is conditioned only on its immediate predecessor
Controllable multi-step masked generation within each scale
Efficient IO-aware block-sparse attention kernels
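The Markovian formulation implies a sparse attention pattern: tokens in scale k need to attend only within scale k and to scale k-1, instead of to all previous scales as in VAR. The sketch below builds that boolean mask; the exact mask structure (including bidirectional attention within a scale, which masked prediction would allow) is an assumption for illustration, not the paper's kernel.

```python
import numpy as np

def hmar_attention_mask(scales=(1, 2, 4)):
    """Build a boolean attention mask where tokens of scale k attend to
    all tokens of scale k and to tokens of scale k-1 only. This yields a
    block-sparse pattern rather than a dense mask over all prior scales."""
    sizes = [s * s for s in scales]
    n = sum(sizes)
    offsets = np.cumsum([0] + sizes)
    mask = np.zeros((n, n), dtype=bool)
    for k in range(len(scales)):
        q0, q1 = offsets[k], offsets[k + 1]
        mask[q0:q1, q0:q1] = True          # within-scale block
        if k > 0:
            p0, p1 = offsets[k - 1], offsets[k]
            mask[q0:q1, p0:p1] = True      # attend to predecessor scale only
    return mask
```

An IO-aware kernel exploits this layout by skipping the zero blocks entirely, which is where the reported training and inference speedups over dense attention come from.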