iFSQ: Improving FSQ for Image Generation with 1 Line of Code

📅 2026-01-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a key limitation of the original Finite Scalar Quantization (FSQ): its uniform quantization intervals lead to activation collapse, compromising both reconstruction fidelity and information efficiency and thereby hindering fair comparison between discrete and continuous generative models. The authors propose improved FSQ (iFSQ), which replaces the activation function with a distribution-matching mapping aligned to a uniform prior, a change achievable with a single line of code. This theoretically guarantees optimal utilization of quantization intervals while substantially improving reconstruction accuracy. A unified benchmark built upon iFSQ yields three findings: 4 bits per dimension is the optimal trade-off between discrete and continuous representations; autoregressive (AR) models converge faster but exhibit lower performance ceilings than diffusion models; and LlamaGen augmented with REPA demonstrates the untapped potential of AR architectures.

📝 Abstract
The field of image generation is currently bifurcated into autoregressive (AR) models operating on discrete tokens and diffusion models utilizing continuous latents. This divide, rooted in the distinction between VQ-VAEs and VAEs, hinders unified modeling and fair benchmarking. Finite Scalar Quantization (FSQ) offers a theoretical bridge, yet vanilla FSQ suffers from a critical flaw: its equal-interval quantization can cause activation collapse. This mismatch forces a trade-off between reconstruction fidelity and information efficiency. In this work, we resolve this dilemma by simply replacing the activation function in the original FSQ with a distribution-matching mapping to enforce a uniform prior. Termed iFSQ, this simple strategy requires just one line of code yet mathematically guarantees both optimal bin utilization and reconstruction precision. Leveraging iFSQ as a controlled benchmark, we uncover two key insights: (1) the optimal equilibrium between discrete and continuous representations lies at approximately 4 bits per dimension; (2) under identical reconstruction constraints, AR models exhibit rapid initial convergence, whereas diffusion models achieve a superior performance ceiling, suggesting that strict sequential ordering may limit the upper bounds of generation quality. Finally, we extend our analysis by adapting Representation Alignment (REPA) to AR models, yielding LlamaGen-REPA. Code is available at https://github.com/Tencent-Hunyuan/iFSQ.
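To make the "one line of code" idea concrete, the sketch below contrasts a tanh bound (as in vanilla FSQ) with a distribution-matching bound on simulated Gaussian latents, and measures how evenly each uses the quantization bins. The specific mapping shown, the Gaussian CDF rescaled to [-1, 1], is an illustrative assumption on our part: it sends Gaussian latents to a uniform prior, but the paper's exact mapping may differ, and the equal-width binning here is a simplification of FSQ's rounding scheme.

```python
import numpy as np
from math import erf, sqrt

L = 8  # quantization levels per latent dimension (3 bits)

def to_bins(u, L):
    """Assign bounded values u in [-1, 1] to L equal-width bins."""
    idx = np.floor((u + 1.0) / 2.0 * L).astype(int)
    return np.clip(idx, 0, L - 1)

def bin_entropy(idx, L):
    """Empirical entropy of bin usage, in bits (maximum is log2(L))."""
    p = np.bincount(idx, minlength=L) / idx.size
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Encoder pre-activations are roughly Gaussian in practice; simulate that.
rng = np.random.default_rng(0)
z = rng.standard_normal(200_000)

# Vanilla FSQ: bound latents with tanh before quantizing.
u_tanh = np.tanh(z)

# iFSQ-style bound (assumed form): a distribution-matching map. For
# Gaussian latents, the Gaussian CDF rescaled to [-1, 1] enforces a
# uniform prior over the bounded range.
phi = np.vectorize(lambda x: 0.5 * (1.0 + erf(x / sqrt(2.0))))
u_cdf = 2.0 * phi(z) - 1.0

H_tanh = bin_entropy(to_bins(u_tanh, L), L)
H_cdf = bin_entropy(to_bins(u_cdf, L), L)
print(f"tanh bound: {H_tanh:.3f} bits of {np.log2(L):.0f}")
print(f"cdf  bound: {H_cdf:.3f} bits of {np.log2(L):.0f}")
```

In this toy, the tanh bound over-fills the outer bins and under-fills the middle ones, while the CDF bound spreads mass evenly, so its bin-usage entropy sits essentially at the log2(L) maximum; this is the "optimal bin utilization" property the abstract attributes to the distribution-matching mapping.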
Problem

Research questions and friction points this paper is trying to address.

image generation
autoregressive models
diffusion models
Finite Scalar Quantization
representation discretization
Innovation

Methods, ideas, or system contributions that make the work stand out.

iFSQ
distribution-matching mapping
uniform prior
image generation
quantization