GAP3D: Generative Alignment of VLM Latents to Patch-Level Embeddings for 3D Generation

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing approaches that use vision-language models (VLMs) as prompt encoders for generative models either rely on expensive end-to-end training or compress features aggressively, which discards the dense spatial structure needed for 3D generation. This work proposes GAP3D, a modular, diffusion-based alignment method that maps VLM latents directly to the full patch-level feature space of a pretrained image encoder, so that a frozen downstream 3D generator receives a spatially structured conditioning signal. Training relies mainly on general-domain image-text pairs rather than large-scale 3D data, and the model shows emergent zero-shot handling of multimodal prompts despite being trained only on text input. The method currently favors high-level semantics over fine-grained detail; the results indicate that the gap between VLM and image-encoder feature spaces can be partially bridged, a first step toward aligning foundation models in dense embedding spaces.

📝 Abstract

Recent approaches integrating vision-language models (VLMs) as prompt encoders for generative model conditioning typically rely on expensive end-to-end training or map features to compressed representations, discarding the dense spatial structure required for geometry-aware tasks like 3D asset generation. To address this, we propose GAP3D, a modular, diffusion-based approach that aligns VLM-generated latents directly to the complete, patch-level feature space of a pre-trained image encoder, enabling a frozen downstream generative model to utilize a VLM as prompt encoder while maintaining a spatially structured conditioning signal. Evaluated on 3D asset generation, our method bypasses the need for large-scale 3D data by training mainly on general-domain image-text pairs. It also exhibits emergent zero-shot capabilities for multimodal prompts, despite being trained exclusively on text input. Finally, while currently prioritizing high-level semantics over fine-grained detail, GAP3D demonstrates that the representation gap between VLM and image-encoder feature spaces can be partially bridged through diffusion-based alignment, taking the first steps towards a modular integration of foundation models through generative alignment to dense embedding spaces.
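
To make the idea of diffusion-based alignment to patch-level embeddings more concrete, the following is a minimal sketch of one training-loss evaluation. It is not the authors' implementation: the abstract does not specify the diffusion formulation, so the noise-prediction objective, the cosine schedule, the mean-pooled conditioning, and every name here (`cosine_alpha_bar`, `noisy_patches`, `toy_denoiser`, `alignment_loss`) are assumptions chosen for illustration, and a single linear map stands in for the real alignment network.

```python
import numpy as np

def cosine_alpha_bar(t):
    """Cumulative signal level for t in [0, 1] (cosine schedule; an assumption)."""
    return np.cos(0.5 * np.pi * t) ** 2

def noisy_patches(x0, t, rng):
    """Forward-diffuse clean patch embeddings x0 of shape (num_patches, dim)."""
    a = cosine_alpha_bar(t)
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    return x_t, eps

def toy_denoiser(x_t, vlm_latents, w):
    """Hypothetical stand-in for the alignment network.

    Predicts the noise for every patch from the noisy patches and a
    mean-pooled VLM latent. A real model would also embed the timestep
    and attend over all VLM tokens.
    """
    cond = vlm_latents.mean(axis=0, keepdims=True)         # (1, dim)
    cond = np.repeat(cond, x_t.shape[0], axis=0)           # (num_patches, dim)
    return np.concatenate([x_t, cond], axis=1) @ w         # (num_patches, dim)

def alignment_loss(x0, vlm_latents, w, t, rng):
    """Mean squared error between predicted and true noise, per patch."""
    x_t, eps = noisy_patches(x0, t, rng)
    eps_hat = toy_denoiser(x_t, vlm_latents, w)
    return float(np.mean((eps_hat - eps) ** 2))

rng = np.random.default_rng(0)
num_patches, dim, num_tokens = 16, 8, 4
x0 = rng.standard_normal((num_patches, dim))           # target: image-encoder patch features
vlm_latents = rng.standard_normal((num_tokens, dim))   # condition: VLM latents
w = rng.standard_normal((2 * dim, dim)) * 0.1          # toy denoiser weights
loss = alignment_loss(x0, vlm_latents, w, t=0.5, rng=rng)
```

The point of the sketch is that the diffusion target keeps the full `(num_patches, dim)` grid rather than a single pooled vector, which is what allows a frozen downstream generator to receive spatially structured conditioning.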

Problem

Research questions and friction points this paper is trying to address.

vision-language models

3D generation

spatial structure

feature alignment

generative modeling

Innovation

Methods, ideas, or system contributions that make the work stand out.

Generative Alignment

Vision-Language Models

Patch-Level Embeddings