🤖 AI Summary
This work addresses the challenges of mode collapse in unconditional full-head 3D GANs and the difficulty of maintaining both consistency and diversity across multiple views in conventional view-conditioned approaches. To resolve these issues, the authors propose using a view-invariant semantic feature—specifically, the CLIP embedding of a frontal face image—as a shared conditioning signal, thereby decoupling generated content from viewing direction and eliminating directional bias. By constructing a synthetic multi-view dataset and integrating multi-view supervision through FLUX.1 Kontext augmentation, the method enables high-fidelity, semantically consistent generation within a 3D GAN framework. This approach represents the first incorporation of view-invariant semantic conditioning into full-head synthesis, significantly improving generation fidelity, diversity, and the generalization capability of single-view inversion.
📝 Abstract
Conditioning is crucial for stable training of full-head 3D GANs. Without any conditioning signal, the model suffers from severe mode collapse, making training impractical. However, previous full-head 3D GANs conventionally choose the view angle as the conditioning input, which biases the learned 3D full-head space along the conditioned view direction. This bias is evident in the significant differences in generation quality and diversity between the conditional view and non-conditional views of the generated 3D heads, resulting in global incoherence across different head regions. In this work, we propose to use a view-invariant semantic feature as the conditioning input, thereby decoupling the generative capability of 3D heads from the viewing direction. To construct a view-invariant semantic condition for each training image, we create a novel synthesized head image dataset: we leverage FLUX.1 Kontext to extend existing high-quality frontal face datasets to a wide range of view angles. The image CLIP feature extracted from the frontal view is then used as a shared semantic condition across all views in the extended images, ensuring semantic alignment while eliminating directional bias. This also allows supervision from different views of the same subject to be consolidated under a shared semantic condition, which accelerates training and enhances the global coherence of the generated 3D heads. Moreover, since GANs often improve diversity only slowly once the generator learns a few modes that successfully fool the discriminator, our semantic conditioning encourages the generator to follow the true semantic distribution, thereby promoting continuous learning and diverse generation. Extensive experiments on full-head synthesis and single-view GAN inversion demonstrate that our method achieves significantly higher fidelity, diversity, and generalizability.
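The data-pairing logic behind the shared semantic condition can be sketched as follows. This is a toy illustration, not the paper's implementation: `frontal_embedding` is a hypothetical stand-in for a real CLIP image encoder (any deterministic map to a unit-norm feature vector), and `build_training_pairs` shows how every view of a subject would be paired with the same frontal-view embedding, so the conditioning signal carries no viewing-direction information.

```python
import hashlib
import numpy as np

def frontal_embedding(frontal_image: np.ndarray, dim: int = 8) -> np.ndarray:
    """Hypothetical stand-in for a CLIP image encoder: a deterministic
    map from an image to a fixed-size, unit-norm feature vector."""
    seed = int.from_bytes(
        hashlib.sha256(frontal_image.tobytes()).digest()[:4], "little"
    )
    v = np.random.default_rng(seed).standard_normal(dim)
    return v / np.linalg.norm(v)

def build_training_pairs(subjects):
    """Pair every view of a subject with the SAME frontal-view embedding.

    `subjects` is a list of (frontal_image, views) tuples, where `views`
    is a list of (view_angle, image) tuples produced by multi-view
    augmentation. The view angle is deliberately NOT part of the
    condition, decoupling generated content from viewing direction.
    """
    pairs = []
    for frontal, views in subjects:
        c = frontal_embedding(frontal)   # shared, view-invariant condition
        for _angle, img in views:
            pairs.append((img, c))       # same condition for all views
    return pairs
```

Because all views of one subject map to one condition vector, supervision from different views is consolidated under that single condition, which is the mechanism the abstract credits for faster training and better global coherence.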