Taming the Entropy Cliff: Variable Codebook Size Quantization for Autoregressive Visual Generation

📅 2026-05-07

🤖 AI Summary
This work identifies the "entropy cliff" in conventional discrete visual autoregressive models: when every position shares the same fixed codebook, the per-position conditional entropy of the training set collapses after the first few positions, so predicting the remaining tokens reduces to memorization. To address this, the authors propose Variable Codebook Size Quantization (VCQ), in which the codebook size \(K_t\) grows monotonically along the sequence, from \(K_{\min} = 2\) to \(K_{\max}\), within a standard autoregressive Transformer. VCQ leaves the loss function, parameter count, and training procedure unchanged, yet induces a coarse-to-fine semantic hierarchy. On ImageNet at 256×256 resolution, the base model reduces gFID without CFG from 27.98 to 14.80 relative to the baseline, and a scaled-up variant with 684M autoregressive parameters reaches gFID 1.71. A linear probe on only the first 10 tokens attains 43.8% top-1 accuracy, compared to 27.1% for uniform codebooks.
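To make the idea of a monotonically growing codebook concrete, here is a minimal sketch of one possible schedule. The source does not state which schedule VCQ actually uses; the geometric interpolation, the function name `geometric_schedule`, and the placeholder values of 256 positions and \(K_{\max} = 16384\) are all assumptions made for illustration only.

```python
from itertools import accumulate


def geometric_schedule(num_positions: int, k_min: int = 2, k_max: int = 16384) -> list[int]:
    """Hypothetical schedule: per-position codebook sizes growing from k_min to k_max.

    Sizes are interpolated geometrically (linearly in log space). This is an
    illustrative assumption, not the schedule reported by the authors.
    """
    if num_positions < 2 or not (2 <= k_min <= k_max):
        raise ValueError("need num_positions >= 2 and 2 <= k_min <= k_max")
    ratio = k_max / k_min
    raw = [
        min(k_max, round(k_min * ratio ** (t / (num_positions - 1))))
        for t in range(num_positions)
    ]
    # A running maximum guards against any floating-point wobble so the
    # result is guaranteed to be non-decreasing.
    sizes = list(accumulate(raw, max))
    # Pin the endpoints exactly to the requested bounds.
    sizes[0], sizes[-1] = k_min, k_max
    return sizes


if __name__ == "__main__":
    sizes = geometric_schedule(256)
    print(sizes[:5], "...", sizes[-3:])
```

Under this assumed schedule the earliest tokens can carry only about one bit each, which is the kind of tight early bottleneck the summary credits with producing a coarse-to-fine hierarchy.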
📝 Abstract
Most discrete visual tokenizers rely on a default design: every position in the sequence shares the same codebook. Researchers try to scale the codebook size $K$ to get better reconstruction performance. Such a constant-codebook design hits a fundamental information-theoretic limit. We observe that the per-position conditional entropy of the training set decays so quickly along the sequence that, after a few positions, the conditional distribution becomes essentially deterministic. On ImageNet with $K=16384$, this happens within only 2 out of 256 positions, turning the remaining 254 into a memorization problem. We call this phenomenon the Entropy Cliff and formalize it with a simple expression: $t^{*} = \lceil \log_2 N / \log_2 K \rceil$. Interestingly, this phenomenon is not observed in language, as its natural structure keeps the effective entropy per position well below the codebook capacity. To address this, we propose Variable Codebook Size Quantization (VCQ), where the codebook size $K_t$ grows monotonically along the sequence from $K_{\min}=2$ to $K_{\max}$, leaving the loss function, parameter count, and AR training procedure unchanged. With a vanilla autoregressive Transformer and standard next-token prediction, a base version of VCQ reduces gFID w/o CFG from 27.98 to 14.80 on ImageNet $256\times256$ over the baseline. Scaled up, it reaches gFID 1.71 with 684M autoregressive parameters, without any extra training techniques such as semantic regularization or causal alignment. The extreme information bottleneck at $K_{\min}=2$ naturally induces a coarse-to-fine semantic hierarchy: a linear probe on only the first 10 tokens reaches 43.8% top-1 accuracy on ImageNet, compared to 27.1% for uniform codebooks. Ultimately, these results show that what matters is not only the total capacity of the codebook, but also how that capacity is distributed and organized.
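The Entropy Cliff expression can be checked with a short calculation. The sketch below assumes that \(N\) denotes the number of training images (the abstract does not define it explicitly) and uses the ImageNet-1k training set size of 1,281,167. The function name `cliff_position` is hypothetical. It computes the smallest \(t\) with \(K^t \ge N\), which equals \(\lceil \log_2 N / \log_2 K \rceil\), using integer arithmetic to avoid floating-point rounding.

```python
def cliff_position(num_samples: int, codebook_size: int) -> int:
    """Smallest t such that codebook_size**t >= num_samples.

    Equivalent to ceil(log2(N) / log2(K)): after t positions, a fixed codebook
    of size K offers at least as many distinct prefixes as there are samples.
    """
    if num_samples < 1 or codebook_size < 2:
        raise ValueError("need num_samples >= 1 and codebook_size >= 2")
    t, capacity = 0, 1
    while capacity < num_samples:
        capacity *= codebook_size
        t += 1
    return t


# Assumption: N is the ImageNet-1k training set size.
IMAGENET_TRAIN_IMAGES = 1_281_167

if __name__ == "__main__":
    print(cliff_position(IMAGENET_TRAIN_IMAGES, 16384))  # 2
    print(cliff_position(IMAGENET_TRAIN_IMAGES, 2))      # 21
```

With \(K = 16384\) the result is 2, matching the abstract's statement that the cliff arrives within 2 of 256 positions. With a binary codebook at every position the same count of training images would take 21 positions to disambiguate.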
Problem

Research questions and friction points this paper is trying to address.

Entropy Cliff
codebook size
autoregressive visual generation
conditional entropy
discrete tokenization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Variable Codebook Size Quantization
Entropy Cliff
Autoregressive Visual Generation
Discrete Tokenization
Coarse-to-fine Hierarchy