SNCE: Geometry-Aware Supervision for Scalable Discrete Image Generation

📅 2026-03-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the optimization difficulty of large-scale discrete image generation with very large VQ codebooks, which typically demands larger models and longer training schedules. The authors propose Stochastic Neighbor Cross Entropy Minimization (SNCE), a training objective that replaces the conventional one-hot target with a soft target distribution over neighboring tokens, weighted by the proximity between each codebook embedding and the ground-truth image embedding. This geometry-aware supervision encourages the model to capture semantically meaningful structure in the quantized embedding space. Experiments on class-conditional ImageNet-256 generation, large-scale text-to-image synthesis, and image editing show that SNCE improves both convergence speed and generation quality over standard cross-entropy training.

📝 Abstract
Recent advances in discrete image generation have shown that scaling the VQ codebook size significantly improves reconstruction fidelity. However, training generative models with a large VQ codebook remains challenging, typically requiring a larger model and a longer training schedule. In this work, we propose Stochastic Neighbor Cross Entropy Minimization (SNCE), a novel training objective designed to address the optimization challenges of large-codebook discrete image generators. Instead of supervising the model with a hard one-hot target, SNCE constructs a soft categorical distribution over a set of neighboring tokens. The probability assigned to each token is proportional to the proximity between its code embedding and the ground-truth image embedding, encouraging the model to capture semantically meaningful geometric structure in the quantized embedding space. We conduct extensive experiments across class-conditional ImageNet-256 generation, large-scale text-to-image synthesis, and image editing tasks. Results show that SNCE significantly improves convergence speed and overall generation quality compared to standard cross-entropy objectives.
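The soft-target idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the proximity measure, the neighborhood size, or any temperature, so the squared Euclidean distance, the parameters `k` and `tau`, and the function names `snce_soft_target` and `soft_cross_entropy` are all assumptions made for illustration.

```python
import math

def snce_soft_target(z, codebook, k=4, tau=1.0):
    """Return a soft target over the k codes nearest to embedding z.

    Assumptions (not from the paper): squared Euclidean distance as the
    proximity measure, a fixed neighborhood size k, and a temperature tau.
    """
    # Squared Euclidean distance from z to every code embedding.
    d2 = [sum((c - x) ** 2 for c, x in zip(code, z)) for code in codebook]
    # Indices of the k closest codes (the assumed neighborhood).
    neighbors = sorted(range(len(codebook)), key=d2.__getitem__)[:k]
    # Closer codes get more mass: softmax over negative scaled distances,
    # shifted by the smallest distance for numerical stability.
    d_min = d2[neighbors[0]]
    weights = [math.exp(-(d2[i] - d_min) / tau) for i in neighbors]
    total = sum(weights)
    target = [0.0] * len(codebook)
    for i, w in zip(neighbors, weights):
        target[i] = w / total
    return target

def soft_cross_entropy(logits, target):
    """Cross-entropy between a soft target and softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    # Only tokens with non-zero target mass contribute to the loss.
    return -sum(t * (l - log_z) for t, l in zip(target, logits) if t > 0.0)

# Toy example: a 4-entry codebook of 2-D embeddings (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
z = [0.2, 0.1]  # ground-truth image embedding before quantization
target = snce_soft_target(z, codebook, k=3, tau=0.5)
loss = soft_cross_entropy([2.0, 0.5, 0.1, -1.0], target)
```

In this sketch, shrinking `tau` toward zero concentrates all mass on the nearest code, so the loss reduces to ordinary one-hot cross-entropy; larger values spread supervision across the neighborhood.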
Problem

Research questions and friction points this paper is trying to address.

discrete image generation
large VQ codebook
optimization challenges
training objective
quantized embedding space
Innovation

Methods, ideas, or system contributions that make the work stand out.

SNCE
large-codebook VQ
geometry-aware supervision
discrete image generation
soft categorical distribution