🤖 AI Summary
Navigating GAN latent spaces under few-shot conditions is difficult and degrades consistency regularization (CR). To address this, the paper proposes a Style-Space Disentanglement and Learnable Vector Quantization (VQ) framework. First, it maps the continuous latent space into a semantically disentangled style space. Second, it introduces an optimal transport alignment mechanism, guided by foundation model features, to construct a semantically aligned discrete codebook. Finally, it integrates feature distillation with enhanced CR. This work pioneers a style-space quantization paradigm, enabling effective external knowledge injection and semantic enrichment of the codebook. Experiments demonstrate that, under low-data regimes, the method reduces FID by over 20%, significantly improves discriminator robustness and generation consistency, and substantially enhances CR stability.
📝 Abstract
Under limited data settings, GANs often struggle to navigate and effectively exploit the input latent space. Consequently, images generated from adjacent variables in a sparse input latent space may exhibit significant discrepancies in realism, leading to suboptimal consistency regularization (CR) outcomes. To address this, we propose *SQ-GAN*, a novel approach that enhances CR by introducing a style space quantization scheme. This method transforms the sparse, continuous input latent space into a compact, structured discrete proxy space, allowing each element to correspond to a specific real data point, thereby improving CR performance. Instead of direct quantization, we first map the input latent variables into a less entangled "style" space and apply quantization using a learnable codebook. This enables each quantized code to control distinct factors of variation. Additionally, we minimize the optimal transport distance to align the codebook codes with features extracted from the training data by a foundation model, embedding external knowledge into the codebook and establishing a semantically rich vocabulary that properly describes the training dataset. Extensive experiments demonstrate significant improvements in both discriminator robustness and generation quality with our method.
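The two core mechanisms described above can be illustrated with a minimal numpy sketch: (1) nearest-code lookup, which snaps each style vector to its closest entry in a learnable codebook, and (2) entropy-regularized optimal transport (Sinkhorn iterations) between codebook codes and foundation-model features, whose transport cost can serve as the alignment loss. This is a simplified illustration under assumed shapes and uniform marginals, not the paper's actual implementation; all function and variable names here are hypothetical.

```python
import numpy as np

def quantize(w, codebook):
    """Map each style vector to its nearest codebook entry (Euclidean).

    w: (batch, d) style vectors; codebook: (K, d) learnable codes.
    Returns the quantized vectors and the chosen code indices.
    """
    d2 = ((w[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (batch, K)
    idx = d2.argmin(axis=1)
    return codebook[idx], idx

def sinkhorn_alignment(codebook, feats, reg=0.1, n_iter=100):
    """Entropy-regularized OT between codebook codes and external features.

    Returns the transport plan P and the transport cost <P, C>, which can
    be minimized w.r.t. the codebook to pull codes toward the feature
    distribution (uniform marginals assumed for simplicity).
    """
    K, N = codebook.shape[0], feats.shape[0]
    C = ((codebook[:, None, :] - feats[None, :, :]) ** 2).sum(-1)
    C = C / C.max()                      # normalize to avoid exp underflow
    Kmat = np.exp(-C / reg)              # Gibbs kernel
    a = np.ones(K) / K                   # uniform marginal over codes
    b = np.ones(N) / N                   # uniform marginal over features
    u, v = np.ones(K), np.ones(N)
    for _ in range(n_iter):              # alternating marginal projections
        u = a / (Kmat @ v)
        v = b / (Kmat.T @ u)
    P = u[:, None] * Kmat * v[None, :]
    return P, float((P * C).sum())
```

In training, the quantization step would typically be made differentiable with a straight-through estimator, and the OT cost would be added to the generator/codebook objective; both are omitted here for brevity.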