🤖 AI Summary
Existing autoregressive (AR) image generation models face a fundamental trade-off between modeling accuracy and inference efficiency when capturing high-dimensional token distributions: simplistic distributional assumptions fail to capture complex structural dependencies, whereas fine-grained modeling severely impedes generation speed. To address this, we propose ARINAR, a novel two-level autoregressive framework that introduces feature-level autoregression—where an outer AR module generates conditional latent vectors, and an inner AR module models tokens dimension-by-dimension along the feature axis, thereby decoupling the high-dimensional joint distribution into lightweight, univariate conditional distributions. The inner module employs Gaussian Mixture Models (GMMs) for efficient and precise density estimation. On ImageNet 256×256, ARINAR achieves a competitive FID of 2.75 with only 213M parameters—approaching the state-of-the-art (2.31)—while accelerating generation by 5× over token-level AR baselines, effectively breaking through the long-standing modeling bottleneck in autoregressive image synthesis.
📝 Abstract
Existing autoregressive (AR) image generative models use a token-by-token generation scheme: they predict a per-token probability distribution and sample the next token from it. The main challenge is modeling the complex distribution of high-dimensional tokens. Previous methods are either too simplistic to fit the distribution or result in slow generation. Instead of fitting the distribution of the whole token at once, we explore using an AR model to generate each token feature by feature, i.e., taking the already-generated features as input and generating the next feature. Based on this, we propose ARINAR (AR-in-AR), a bi-level AR model. The outer AR layer takes previous tokens as input and predicts a condition vector z for the next token. The inner layer, conditioned on z, generates the features of the next token autoregressively. In this way, the inner layer only needs to model the distribution of a single feature, for example with a simple Gaussian Mixture Model. On the ImageNet 256×256 image generation task, ARINAR-B with 213M parameters achieves an FID of 2.75, comparable to the state-of-the-art MAR-B model (FID = 2.31), while being five times faster than the latter.
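The bi-level generation loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the feature dimension `D`, mixture size `K`, and the `outer_ar` / `inner_gmm_params` stand-ins (random projections instead of trained networks) are all hypothetical placeholders. It only shows the control flow: an outer token loop produces a condition vector z, and an inner loop samples one scalar feature at a time from a predicted Gaussian mixture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): token feature dim D, GMM components K.
D, K = 16, 8

def outer_ar(prev_tokens):
    """Stand-in for the outer AR layer: maps previously generated tokens
    to a condition vector z for the next token (a real model would be a
    transformer over the token sequence)."""
    ctx = np.sum(prev_tokens, axis=0) if prev_tokens else np.zeros(D)
    return np.tanh(ctx)  # condition vector z

def inner_gmm_params(z, feats_so_far):
    """Stand-in for the inner AR layer: given z and the features generated
    so far, predict GMM parameters (weights, means, stds) for the NEXT
    scalar feature. A real model would use a small network here."""
    h = z[: len(feats_so_far)].sum() + sum(feats_so_far)
    logits = rng.standard_normal(K) + h
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                     # softmax -> mixture weights
    means = rng.standard_normal(K)
    stds = np.full(K, 0.5)
    return weights, means, stds

def sample_token(prev_tokens):
    """Generate one D-dimensional token feature-by-feature (inner AR loop)."""
    z = outer_ar(prev_tokens)
    feats = []
    for _ in range(D):
        w, mu, sigma = inner_gmm_params(z, feats)
        k = rng.choice(K, p=w)                     # pick a mixture component
        feats.append(rng.normal(mu[k], sigma[k]))  # sample the scalar feature
    return np.array(feats)

tokens = []
for _ in range(4):                                 # outer AR loop over tokens
    tokens.append(sample_token(tokens))
print(np.stack(tokens).shape)  # (4, 16)
```

Because each inner step models only a univariate conditional, a small GMM head suffices where a token-level model would need to capture a full D-dimensional joint distribution.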