🤖 AI Summary
Current autoregressive image generation models do not exploit the token similarity prior embedded in the learned codebook during training. Conventional k-means clustering models token-level similarity poorly because the token embedding space is heterogeneous and centroid-based distances are distorted. To address this, we propose the Discriminative Codebook Prior Extractor (DCPE), a plug-and-play module that replaces centroid distances with instance-level similarity measurements and uses bottom-up hierarchical aggregation to extract a structured codebook prior. DCPE requires no modification to the backbone architecture and is designed to work with discrete-token autoregressive generators. On LlamaGen-B, DCPE accelerates training by 42%, improves FID by 18.7%, and increases Inception Score (IS) by 12.3%.
📝 Abstract
Advanced discrete token-based autoregressive image generation systems first tokenize images into sequences of token indices with a codebook, and then model these sequences autoregressively. Because the autoregressive generative model is trained only on index values, the prior encoded in the codebook, which contains rich token similarity information, goes unexploited. Recent studies have attempted to incorporate this prior by performing naive k-means clustering on the tokens, which facilitates training the generative model with a reduced codebook. However, we reveal that k-means clustering performs poorly in the codebook feature space due to inherent issues, including token space disparity and centroid distance inaccuracy. In this work, we propose the Discriminative Codebook Prior Extractor (DCPE) as an alternative to k-means clustering for more effectively mining and utilizing the token similarity information embedded in the codebook. DCPE replaces the commonly used centroid-based distance, which we find to be unsuitable and inaccurate for the token feature space, with a more reasonable instance-based distance. Using an agglomerative merging technique, it further addresses the token space disparity issue by avoiding splitting high-density regions and aggregating low-density ones. Extensive experiments demonstrate that DCPE is plug-and-play and integrates seamlessly with existing codebook prior-based paradigms. With the discriminative prior extracted, DCPE accelerates the training of autoregressive models by 42% on LlamaGen-B and improves final FID and IS performance.
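To make the contrast with centroid-based k-means concrete, the following is a minimal sketch of bottom-up (agglomerative) merging driven by an instance-based distance. It is not the authors' implementation: the abstract does not specify the exact distance or merge rule, so this sketch assumes average pairwise Euclidean distance between member tokens, and the function names (`instance_distance`, `agglomerative_token_clusters`, `token_to_cluster`) and the toy 2-D codebook are hypothetical.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def instance_distance(cluster_a, cluster_b, codebook):
    """Average pairwise distance between member tokens.

    Assumption: average linkage stands in for the paper's instance-based
    distance. No centroid is computed at any point.
    """
    total = sum(
        euclidean(codebook[i], codebook[j])
        for i in cluster_a
        for j in cluster_b
    )
    return total / (len(cluster_a) * len(cluster_b))


def agglomerative_token_clusters(codebook, num_clusters):
    """Start with one cluster per token and repeatedly merge the closest pair."""
    clusters = [[i] for i in range(len(codebook))]
    while len(clusters) > num_clusters:
        # combinations yields index pairs (a, b) with a < b
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: instance_distance(
                clusters[p[0]], clusters[p[1]], codebook
            ),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe: b > a, so index a is unaffected
    return clusters


def token_to_cluster(clusters):
    """Map each original token index to its cluster index (the reduced codebook)."""
    return {tok: cid for cid, members in enumerate(clusters) for tok in members}


# Toy codebook (hypothetical): a dense group, a second group, and an isolated token.
codebook = [
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.2, 5.1],
    [10.0, 0.0],
]
clusters = agglomerative_token_clusters(codebook, num_clusters=3)
mapping = token_to_cluster(clusters)
print(clusters)
print(mapping)
```

In a reduced-codebook training setup, a mapping like `mapping` would let the generator predict a cluster index in place of the raw token index. This brute-force loop recomputes all pairwise cluster distances on every merge, so it is only suitable for illustration; a real codebook with thousands of entries would need a cached distance matrix or a library routine.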