Visual Implicit Autoregressive Modeling

📅 2026-05-01

🤖 AI Summary

This work addresses two limitations of existing visual autoregressive models in high-resolution image generation: a fixed amount of computation per scale and a high memory footprint, which together prevent flexible trade-offs between efficiency and quality. The authors propose an implicit autoregressive architecture that places an implicit equilibrium layer between shallow pre- and post-processing blocks. Each scale is generated by fixed-point iteration, so the number of iterations per scale can be adjusted at inference time to control computational cost. The implicit layer is trained with Jacobian-Free Backpropagation. On ImageNet 256×256, the method achieves an FID of 2.16 and an sFID of 8.07 using only 38.4% of the parameters of VAR. Lowering the per-scale iteration count reduces peak inference memory from 19.24 GB to 8.53 GB and raises throughput from 15.16 to 32.08 images per second on a single RTX 4090, without retraining. The method also shows better detail and structure preservation in zero-shot editing tasks.
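The idea of a per-scale iteration count can be illustrated with a toy scalar model. The sketch below is not the VIAR architecture: the update `z <- tanh(w * z + x)`, the function names `equilibrium_layer`, `generate`, and `residual`, and the way each scale feeds the next are all assumptions chosen only to show how running fewer fixed-point steps trades accuracy of the equilibrium for compute.

```python
import math


def equilibrium_layer(x, w, num_iters):
    """Run num_iters fixed-point steps of z <- tanh(w * z + x), starting at z = 0.

    For |w| < 1 the map is a contraction, so the iterates approach a unique
    fixed point z* = tanh(w * z* + x).
    """
    z = 0.0
    for _ in range(num_iters):
        z = math.tanh(w * z + x)
    return z


def residual(x, w, z):
    """Distance between z and one more application of the map (0 at equilibrium)."""
    return abs(math.tanh(w * z + x) - z)


def generate(scale_inputs, w, iters_per_scale):
    """Toy coarse-to-fine loop: each scale gets its own iteration budget.

    The result of one scale is added to the input of the next scale,
    standing in for conditioning on coarser scales (an assumption).
    """
    outputs = []
    context = 0.0
    for x, k in zip(scale_inputs, iters_per_scale):
        z = equilibrium_layer(x + context, w, k)
        outputs.append(z)
        context = z
    return outputs


if __name__ == "__main__":
    for k in (2, 8, 30):
        z = equilibrium_layer(0.3, 0.5, k)
        print(k, residual(0.3, 0.5, z))
```

In this toy setting the residual shrinks as the iteration count grows, so a smaller budget gives a cheaper but less converged result. Because the same weights are used for every step, the budget can be changed at inference time without retraining, which is the property the summary describes.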

📝 Abstract

Visual Autoregressive Modeling (VAR) based on next-scale prediction achieves strong generation quality, but its explicit deep stacks fix the amount of computation per scale and inflate memory at high resolutions. We introduce Visual Implicit Autoregressive Modeling (VIAR), a next-scale autoregressive generator that embeds an implicit equilibrium layer between shallow pre/post blocks. The implicit layer is trained with Jacobian-Free Backpropagation, yielding constant training memory, while inference exposes a per-scale iteration knob that enables compute control. On the ImageNet 256×256 benchmark, VIAR attains an FID of 2.16 and an sFID of 8.07 with only 38.4% of the parameters of VAR, matching or surpassing strong AR baselines and remaining competitive with large diffusion models. By adjusting the per-scale knob, VIAR can reduce peak memory from 19.24 GB to 8.53 GB and double throughput from 15.16 to 32.08 images/s on a single RTX 4090, without retraining. Ablations show that fewer steps are sufficient for the fixed-point iterations to converge and that VIAR consistently dominates VAR across quality-efficiency operating points. In zero-shot inpainting and class-conditional editing, VIAR produces sharper details and smoother boundaries while preserving global structure, validating the benefits of implicit equilibria and per-scale compute control for practical, deployable visual generation.
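To make the Jacobian-Free Backpropagation step concrete, here is a minimal scalar sketch. It assumes a toy layer `f(z, w, x) = tanh(w * z + x)`, which is not taken from the paper, and all function names are hypothetical. For a fixed point `z* = f(z*, w, x)`, the implicit function theorem gives `dz*/dw = (∂f/∂w) / (1 - ∂f/∂z)`; the Jacobian-free approximation drops the `1 / (1 - ∂f/∂z)` factor, which amounts to differentiating only the last application of `f`.

```python
import math


def f(z, w, x):
    """Toy implicit layer (an assumption, not the VIAR block)."""
    return math.tanh(w * z + x)


def solve(w, x, num_iters=50):
    """Find the fixed point z* = f(z*, w, x) by plain iteration from z = 0."""
    z = 0.0
    for _ in range(num_iters):
        z = f(z, w, x)
    return z


def gradients(w, x):
    """Return (exact, jfb) estimates of dz*/dw at the fixed point."""
    z = solve(w, x)
    s = 1.0 - f(z, w, x) ** 2      # derivative of tanh at (w * z + x)
    df_dw = s * z                  # partial derivative of f w.r.t. w
    df_dz = s * w                  # Jacobian of f w.r.t. z (a scalar here)
    exact = df_dw / (1.0 - df_dz)  # implicit function theorem
    jfb = df_dw                    # Jacobian-free: inverse factor replaced by 1
    return exact, jfb


if __name__ == "__main__":
    exact, jfb = gradients(0.5, 0.3)
    print("exact:", exact, "jfb:", jfb)
```

In this example the Jacobian-free gradient has the same sign as the exact one but a smaller magnitude, so it still points in a descent direction. Since only the final application of `f` is differentiated, no intermediate iterates need to be stored, which is consistent with the constant training memory the abstract mentions.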

Problem

Research questions and friction points this paper is trying to address.

Visual Autoregressive Modeling

high-resolution generation

memory efficiency

compute control

implicit modeling

Innovation

Methods, ideas, or system contributions that make the work stand out.

Implicit Equilibrium Layer

Jacobian-Free Backpropagation

Per-Scale Compute Control