Direction-Aware Diagonal Autoregressive Image Generation

📅 2025-03-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

In autoregressive image generation, raster-scan ordering places index-adjacent tokens far apart in Euclidean distance at every line break, which undermines local correlation modeling. To address this, we propose Diagonal Autoregressive (DAR), a direction-aware serialization scheme that alternates scanning along primary and secondary diagonals, keeping index-adjacent tokens spatially close while letting causal attention draw on context from more directions. Methodologically, we introduce a 4D rotary position embedding (4D-RoPE) and direction embeddings so the model can track frequent changes in scan direction, reuse the image tokenizer’s codebook as the token embeddings, and train a family of models at several scales (485M–2.0B parameters). DAR-XL (2.0B) achieves a state-of-the-art FID of 1.37 on ImageNet 256×256 among autoregressive image generators.

📝 Abstract

The raster-ordered image token sequence exhibits a significant Euclidean distance between index-adjacent tokens at line breaks, making it unsuitable for autoregressive generation. To address this issue, this paper proposes the Direction-Aware Diagonal Autoregressive Image Generation (DAR) method, which generates image tokens following a diagonal scanning order. The proposed diagonal scanning order ensures that tokens with adjacent indices remain in close proximity while enabling causal attention to gather information from a broader range of directions. Additionally, two direction-aware modules, 4D-RoPE and direction embeddings, are introduced to enhance the model's capability to handle frequent changes in generation direction. To leverage the representational capacity of the image tokenizer, we use its codebook as the image token embeddings. We propose models of varying scales, ranging from 485M to 2.0B parameters. On the 256$\times$256 ImageNet benchmark, our DAR-XL (2.0B) outperforms all previous autoregressive image generators, achieving a state-of-the-art FID score of 1.37.
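The distance argument in the abstract can be checked on a small token grid. The sketch below is illustrative only: the abstract does not give the exact traversal, so it assumes a zigzag over anti-diagonals that reverses direction on every other diagonal. The function names and the 16×16 grid size are hypothetical choices, not taken from the paper.

```python
import math


def raster_order(h, w):
    """Row-major order: left to right within a row, rows top to bottom."""
    return [(r, c) for r in range(h) for c in range(w)]


def zigzag_diagonal_order(h, w):
    """Assumed diagonal order: walk anti-diagonals r + c = d for d = 0, 1, ...

    Every other diagonal is reversed so that the end of one diagonal
    is next to the start of the following one.
    """
    order = []
    for d in range(h + w - 1):
        r_min = max(0, d - (w - 1))
        r_max = min(d, h - 1)
        cells = [(r, d - r) for r in range(r_min, r_max + 1)]
        if d % 2 == 0:
            cells.reverse()  # even diagonals run from bottom-left to top-right
        order.extend(cells)
    return order


def max_adjacent_distance(order):
    """Largest Euclidean distance between index-adjacent grid positions."""
    return max(math.dist(a, b) for a, b in zip(order, order[1:]))


if __name__ == "__main__":
    h = w = 16  # hypothetical token grid size
    print("raster  :", max_adjacent_distance(raster_order(h, w)))
    print("diagonal:", max_adjacent_distance(zigzag_diagonal_order(h, w)))
```

Under this assumed traversal, the largest gap between consecutive tokens is one diagonal step regardless of grid size, whereas the raster order jumps across the full row width at each line break.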

Problem

Research questions and friction points this paper is trying to address.

Addresses the large Euclidean distance between index-adjacent tokens at line breaks in raster-ordered image token sequences.

Proposes a diagonal scanning order that keeps index-adjacent tokens spatially close.

Adds direction-aware modules so the model can handle frequent changes in generation direction.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Diagonal scanning order for image token generation

4D-RoPE and direction embeddings enhance direction handling

Leverages image tokenizer codebook for token embeddings
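To see why a diagonal order changes generation direction so often, one can label each step between consecutive tokens with a direction id. This is a hedged sketch, not the paper's module: it assumes a zigzag traversal over anti-diagonals, and the `DIRECTION_IDS` vocabulary and function names are hypothetical. An id sequence like this is the kind of signal a direction-embedding table could be indexed with.

```python
def zigzag_diagonal_order(h, w):
    """Assumed diagonal order: anti-diagonals r + c = d, alternating direction."""
    order = []
    for d in range(h + w - 1):
        r_min = max(0, d - (w - 1))
        r_max = min(d, h - 1)
        cells = [(r, d - r) for r in range(r_min, r_max + 1)]
        if d % 2 == 0:
            cells.reverse()  # even diagonals run from bottom-left to top-right
        order.extend(cells)
    return order


# Hypothetical direction vocabulary: (row step, column step) -> id.
DIRECTION_IDS = {
    (0, 1): 0,    # right
    (1, 0): 1,    # down
    (1, -1): 2,   # down-left along a diagonal
    (-1, 1): 3,   # up-right along a diagonal
}


def next_step_directions(order):
    """Direction id of the move from each token to the next one."""
    return [
        DIRECTION_IDS[(b[0] - a[0], b[1] - a[1])]
        for a, b in zip(order, order[1:])
    ]


if __name__ == "__main__":
    print(next_step_directions(zigzag_diagonal_order(3, 3)))
```

Even on a 3×3 grid the step direction changes at almost every position, in contrast to a raster scan where the step is constant within a row.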
