SpectralAR: Spectral Autoregressive Visual Generation

📅 2025-06-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing autoregressive (AR) visual generation methods partition images into spatial patches and model them sequentially. Image patches are inherently parallel, however, which violates the causal assumption underlying AR modeling. This work introduces a causal modeling paradigm from the spectral perspective: Nested Spectral Tokenization (NST), described as the first method to map images into an ordered sequence of spectral tokens, running from lower to higher frequency components, so that generation can proceed as a strict coarse-to-fine causal autoregression. This keeps the token sequence consistent with the causal assumption while reducing the number of tokens required. By combining the spectral transformation with autoregressive modeling in the spectral domain, the approach achieves 3.02 gFID on ImageNet-1K using only 64 tokens and a 310M-parameter model, which the summary reports as significantly better than prior vision AR models of comparable scale. The core contribution is a redefinition of the causal representation for visual sequences, opening a pathway toward efficient and interpretable autoregressive visual generation.

📝 Abstract
Autoregressive visual generation has garnered increasing attention due to its scalability and compatibility with other modalities compared with diffusion models. Most existing methods construct visual sequences as spatial patches for autoregressive generation. However, image patches are inherently parallel, contradicting the causal nature of autoregressive modeling. To address this, we propose a Spectral AutoRegressive (SpectralAR) visual generation framework, which realizes causality for visual sequences from the spectral perspective. Specifically, we first transform an image into ordered spectral tokens with Nested Spectral Tokenization, representing lower to higher frequency components. We then perform autoregressive generation in a coarse-to-fine manner with the sequences of spectral tokens. By considering different levels of detail in images, our SpectralAR achieves both sequence causality and token efficiency without bells and whistles. We conduct extensive experiments on ImageNet-1K for image reconstruction and autoregressive generation, and SpectralAR achieves 3.02 gFID with only 64 tokens and 310M parameters. Project page: https://huang-yh.github.io/spectralar/.
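The idea of an ordered, low-to-high frequency representation can be illustrated with a small sketch. This is not the paper's Nested Spectral Tokenization, which is a learned tokenizer for images; as a stand-in, the example assumes a fixed orthonormal 1D DCT applied to a made-up row of pixel values. The function names (`dct_ortho`, `idct_ortho`, `nested_reconstructions`) and the sample data are hypothetical. It shows only that keeping a growing prefix of frequency-ordered coefficients yields reconstructions that go from coarse to fine.

```python
import math


def dct_ortho(signal):
    """Orthonormal DCT-II; coefficient k captures content at frequency index k."""
    n = len(signal)
    coeffs = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        total = sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                    for i, x in enumerate(signal))
        coeffs.append(scale * total)
    return coeffs


def idct_ortho(coeffs):
    """Inverse of dct_ortho (orthonormal DCT-III)."""
    n = len(coeffs)
    signal = []
    for i in range(n):
        value = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            value += scale * c * math.cos(math.pi * (i + 0.5) * k / n)
        signal.append(value)
    return signal


def nested_reconstructions(signal):
    """Reconstruct from the first 1, 2, ..., n coefficients (low to high frequency)."""
    coeffs = dct_ortho(signal)
    n = len(coeffs)
    # Each prefix contains the previous one, so the reconstructions are nested.
    return [idct_ortho(coeffs[:m] + [0.0] * (n - m)) for m in range(1, n + 1)]


def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Illustration only: a made-up row of pixel intensities, not data from the paper.
row = [52.0, 55.0, 61.0, 66.0, 70.0, 61.0, 64.0, 73.0]
errors = [squared_error(row, rec) for rec in nested_reconstructions(row)]
for m, err in enumerate(errors, start=1):
    print(f"first {m} coefficient(s): squared error = {err:.4f}")
```

With only the first coefficient, the reconstruction is the mean of the row, the coarsest possible description. Each additional coefficient adds finer detail, and because the transform is orthonormal the reconstruction error never increases as the prefix grows.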
Problem

Research questions and friction points this paper is trying to address.

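Spatial image patches are inherently parallel, which conflicts with the causal assumption of autoregressive generation
Visual sequences need an ordering that is genuinely causal, here from lower to higher frequency components
Generation should reach good quality with few tokens (64 in the reported experiments)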
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transforms images into ordered spectral tokens
Performs autoregressive generation over sequences of spectral tokens
Generates coarse-to-fine, from lower to higher frequency components, for token efficiency
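
The coarse-to-fine generation listed above can be written in the standard autoregressive form. The notation is an assumption for illustration and is not taken from the paper: $s_1, \dots, s_N$ denote the spectral tokens ordered from the lowest to the highest frequency component, with $N = 64$ in the reported setting.

```latex
p(s_1, \dots, s_N) = \prod_{i=1}^{N} p\left(s_i \mid s_1, \dots, s_{i-1}\right)
```

Under this ordering each token conditions only on coarser information that precedes it, which is the sense in which the spectral sequence is causal, whereas spatial patches have no such natural ordering.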