Hierarchical Masked Autoregressive Models with Low-Resolution Token Pivots

📅 2025-05-26
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing autoregressive (AR) vision generation models rely on single-scale dense token sequences, limiting the ability of early predictions to capture global contextual information. Method: We propose Hi-MAR, a Hierarchical Masked Autoregressive model that introduces low-resolution image tokens as cross-scale structural anchors and autoregressive pivots, guiding high-resolution token generation in stages; it further integrates a Diffusion Transformer prediction head to enhance global context awareness. Key components include hierarchical masked modeling, a multi-stage generation paradigm, and a low-resolution token hub mechanism. Contribution/Results: Hi-MAR significantly outperforms mainstream AR baselines on both class-conditional and text-to-image generation tasks, achieving better FID and CLIP scores. Moreover, it reduces inference computational overhead by 30–50%, demonstrating improved efficiency without sacrificing fidelity or alignment.

πŸ“ Abstract
Autoregressive models have emerged as a powerful generative paradigm for visual generation. The current de-facto standard of next-token prediction commonly operates over a single-scale sequence of dense image tokens, and is incapable of utilizing global context, especially for early token predictions. In this paper, we introduce a new autoregressive design that models a hierarchy from a few low-resolution image tokens to the typical dense image tokens, and delve into the hierarchical dependency across multi-scale image tokens. Technically, we present a Hierarchical Masked Autoregressive model (Hi-MAR) that pivots on low-resolution image tokens to trigger hierarchical autoregressive modeling in a multi-phase manner. In the first phase, Hi-MAR learns to predict a few low-resolution image tokens, which function as intermediary pivots reflecting global structure. These pivots then act as additional guidance that strengthens the next autoregressive modeling phase by shaping global structural awareness of the typical dense image tokens. A new Diffusion Transformer head is further devised to amplify the global context among all tokens for mask token prediction. Extensive evaluations on both class-conditional and text-to-image generation tasks demonstrate that Hi-MAR outperforms typical AR baselines while requiring lower computational cost. Code is available at https://github.com/HiDream-ai/himar.
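The multi-phase generation described in the abstract can be sketched as a toy loop: first unmask a handful of low-resolution pivot tokens, then unmask the dense tokens with the pivots passed in as conditioning context. This is a minimal illustrative sketch, not the authors' implementation; `predict_fn`, the step counts, and the token grid sizes are all hypothetical stand-ins for the actual Hi-MAR model.

```python
import random

def masked_ar_generate(num_tokens, num_steps, predict_fn, context=None):
    """Masked-AR sampling sketch: iteratively fill a subset of the
    still-masked positions at each step using predict_fn (a stand-in
    for the model's prediction head)."""
    tokens = [None] * num_tokens          # None marks a masked position
    masked = list(range(num_tokens))
    random.shuffle(masked)                # unmask positions in random order
    per_step = max(1, num_tokens // num_steps)
    while masked:
        batch, masked = masked[:per_step], masked[per_step:]
        for pos in batch:
            tokens[pos] = predict_fn(pos, tokens, context)
    return tokens

def hi_mar_generate(predict_fn, low_res=4, high_res=16, steps=(2, 4)):
    # Phase 1: predict a few low-resolution pivot tokens (global structure).
    pivots = masked_ar_generate(low_res, steps[0], predict_fn)
    # Phase 2: predict the dense tokens, with the pivots as extra guidance.
    dense = masked_ar_generate(high_res, steps[1], predict_fn, context=pivots)
    return pivots, dense

# Toy predictor (hypothetical): returns a dummy token id per position.
toy_predict = lambda pos, tokens, ctx: pos % 7
pivots, dense = hi_mar_generate(toy_predict)
```

The key structural point is that phase 2 receives the completed pivot sequence as context, mirroring how the low-resolution tokens guide dense-token prediction in the paper.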
Problem

Research questions and friction points this paper is trying to address.

Improving global context in autoregressive visual generation
Hierarchical modeling from low- to high-resolution tokens
Reducing computational costs while enhancing generation quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical autoregressive modeling with low-resolution tokens
Multi-phase prediction using intermediary pivot tokens
Diffusion Transformer head for global context enhancement