FARMER: Flow AutoRegressive Transformer over Pixels

📅 2025-10-27
🤖 AI Summary
Pixel-level autoregressive modeling must cope with extremely long sequences and a high-dimensional space. To address this, the paper proposes FARMER, an end-to-end generative framework that combines normalizing flows with autoregressive modeling. Its key contributions are: (1) an invertible autoregressive flow that maps images to latent sequences, whose distribution is then modeled by an autoregressive model, so that likelihoods over raw pixels remain tractable; (2) a self-supervised dimension reduction scheme that partitions the flow's latent channels into informative and redundant groups to lower modeling complexity; (3) a one-step distillation scheme that accelerates inference; and (4) a resampling-based classifier-free guidance algorithm that improves generation quality. Built on a Transformer architecture, FARMER achieves competitive performance compared to existing pixel-based generative models while providing exact likelihoods and scalable training.

📝 Abstract
Directly modeling the explicit likelihood of the raw data distribution is a key topic in machine learning, and autoregressive modeling of this kind underlies the scaling successes of Large Language Models. However, continuous AR modeling over visual pixel data suffers from extremely long sequences and high-dimensional spaces. In this paper, we present FARMER, a novel end-to-end generative framework that unifies Normalizing Flows (NF) and Autoregressive (AR) models for tractable likelihood estimation and high-quality image synthesis directly from raw pixels. FARMER employs an invertible autoregressive flow to transform images into latent sequences, whose distribution is modeled implicitly by an autoregressive model. To address the redundancy and complexity in pixel-level modeling, we propose a self-supervised dimension reduction scheme that partitions NF latent channels into informative and redundant groups, enabling more effective and efficient AR modeling. Furthermore, we design a one-step distillation scheme to significantly accelerate inference and introduce a resampling-based classifier-free guidance algorithm to boost image generation quality. Extensive experiments demonstrate that FARMER achieves competitive performance compared to existing pixel-based generative models while providing exact likelihoods and scalable training.
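The likelihood claim in the abstract rests on the standard change-of-variables formula for normalizing flows. As a hedged sketch (the symbols $f$, $z$, $p_{\mathrm{AR}}$, and $N$ are notation assumed here for illustration, not taken from the paper), let $f$ be the invertible flow and $z = f(x)$ the latent sequence of $N$ tokens. The log-likelihood of an image $x$ then decomposes as:

```latex
\begin{aligned}
\log p(x) &= \log p_{\mathrm{AR}}(z)
           + \log \left| \det \frac{\partial f(x)}{\partial x} \right|,
           \qquad z = f(x), \\
\log p_{\mathrm{AR}}(z) &= \sum_{i=1}^{N} \log p\left(z_i \mid z_{<i}\right).
\end{aligned}
```

For a single autoregressive flow layer the Jacobian is triangular, so its determinant is the product of the diagonal terms and both summands can be evaluated exactly.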
Problem

Research questions and friction points this paper is trying to address.

Modeling raw pixel likelihoods with autoregressive transformers
Addressing high-dimensional complexity in visual sequence modeling
Unifying normalizing flows with autoregressive models for image synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unifies Normalizing Flows and Autoregressive models for images
Employs self-supervised dimension reduction on latent channels
Uses one-step distillation to accelerate inference and resampling-based classifier-free guidance to improve generation quality
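The first item above, pairing an invertible autoregressive flow with an exact likelihood, can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not FARMER's implementation: the `conditioner` function is a hypothetical stand-in for a learned network, the sequence is a short list of scalars rather than image patches, and a standard normal prior replaces the autoregressive Transformer that the paper uses to model the latents.

```python
import math


def conditioner(prefix):
    """Hypothetical toy conditioner: (shift, log_scale) computed from earlier tokens only."""
    if not prefix:
        return 0.0, 0.0
    mean = sum(prefix) / len(prefix)
    return 0.5 * mean, 0.1 * math.tanh(mean)


def forward(x):
    """Map data x to latents z and return (z, log|det dz/dx|)."""
    z, log_det = [], 0.0
    for i, x_i in enumerate(x):
        shift, log_scale = conditioner(x[:i])
        z.append((x_i - shift) * math.exp(-log_scale))
        # Triangular Jacobian: only the diagonal term exp(-log_scale) contributes.
        log_det -= log_scale
    return z, log_det


def inverse(z):
    """Recover x from z one token at a time (sequential, like AR sampling)."""
    x = []
    for z_i in z:
        shift, log_scale = conditioner(x)
        x.append(z_i * math.exp(log_scale) + shift)
    return x


def log_likelihood(x):
    """Exact log p(x) = log prior(z) + log|det dz/dx| (standard normal prior assumed)."""
    z, log_det = forward(x)
    log_prior = sum(-0.5 * (v * v + math.log(2.0 * math.pi)) for v in z)
    return log_prior + log_det


if __name__ == "__main__":
    x = [0.3, -1.2, 0.8, 2.0]
    z, _ = forward(x)
    print("reconstruction error:", max(abs(a - b) for a, b in zip(x, inverse(z))))
    print("log-likelihood:", log_likelihood(x))
```

The forward pass can process all positions in parallel in a real model because each shift and scale depends only on earlier inputs, while the inverse must run sequentially. That asymmetry is one reason fast sampling schemes, such as the one-step distillation mentioned above, are of interest.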