🤖 AI Summary
This work addresses training instability and slow convergence in large-scale discrete image generation caused by massive VQ codebooks. To overcome these issues, the authors propose Stochastic Neighbor Cross-Entropy Minimization (SNCE), a novel approach that replaces conventional one-hot hard supervision with a soft target distribution derived from the geometric proximity between codebook embeddings and real image embeddings. By leveraging local structural information in the embedding space through this geometry-aware soft supervision, SNCE guides the model toward a semantically structured representation. Experiments on class-conditional ImageNet-256 generation, large-scale text-to-image synthesis, and image editing demonstrate that SNCE substantially improves both convergence speed and generation quality compared to standard cross-entropy training.
📝 Abstract
Recent advancements in discrete image generation have shown that scaling the VQ codebook size significantly improves reconstruction fidelity. However, training generative models with a large VQ codebook remains challenging, typically requiring a larger model and a longer training schedule. In this work, we propose Stochastic Neighbor Cross-Entropy Minimization (SNCE), a novel training objective designed to address the optimization challenges of large-codebook discrete image generators. Instead of supervising the model with a hard one-hot target, SNCE constructs a soft categorical distribution over a set of neighboring tokens. The probability assigned to each token is proportional to the proximity between its code embedding and the ground-truth image embedding, encouraging the model to capture semantically meaningful geometric structure in the quantized embedding space. We conduct extensive experiments across class-conditional ImageNet-256 generation, large-scale text-to-image synthesis, and image editing tasks. Results show that SNCE significantly improves convergence speed and overall generation quality compared to standard cross-entropy objectives.
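The abstract does not spell out how the soft targets are formed, but the described mechanism — a categorical distribution supported on neighboring codes, with weights proportional to embedding proximity — can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the choice of `k` nearest neighbors, the squared-distance proximity measure, and the temperature `tau` are all assumptions for the sake of the example.

```python
import numpy as np

def snce_soft_targets(codebook, z, k=5, tau=0.1):
    """Build a soft target over codebook tokens for image embedding z.

    codebook : (V, d) array of code embeddings
    z        : (d,)  ground-truth (pre-quantization) image embedding
    k, tau   : neighborhood size and temperature -- illustrative choices,
               not values from the paper.
    Returns a length-V distribution supported on the k nearest codes,
    weighted by a softmax over negative squared distances (one plausible
    notion of "proximity").
    """
    d2 = ((codebook - z) ** 2).sum(axis=1)   # squared distance to every code
    nn = np.argsort(d2)[:k]                  # indices of the k nearest codes
    logits = -d2[nn] / tau                   # closer code -> larger weight
    w = np.exp(logits - logits.max())        # stable softmax
    w /= w.sum()
    target = np.zeros(len(codebook))
    target[nn] = w
    return target

def snce_loss(logits, target):
    """Cross-entropy of model logits against the soft target
    (replacing the usual one-hot cross-entropy)."""
    m = logits.max()
    log_probs = logits - (m + np.log(np.exp(logits - m).sum()))
    return -(target * log_probs).sum()
```

As `tau -> 0` (or `k = 1`) the soft target collapses back to the one-hot target of standard cross-entropy, so this construction strictly generalizes the conventional objective.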