🤖 AI Summary
This work addresses a key limitation of existing 3D generation methods, which predominantly rely on a single modality (either images or text) and consequently struggle to capture fine visual detail and rich semantic meaning at the same time, hindering precise expression of user intent. To overcome this, the paper formally introduces the task of 3D generation conditioned jointly on text and image inputs and proposes a lightweight dual-branch baseline model. The architecture employs separate backbone networks to extract features from each modality and an efficient cross-modal fusion mechanism that enables joint reasoning over both. Experiments demonstrate that this bimodal approach significantly outperforms unimodal baselines in generation quality, geometric fidelity, and semantic consistency, validating the practical utility and complementary nature of cross-modal conditioning in 3D content creation.
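The summary above describes a dual-branch design: two modality-specific backbones whose features are combined by a lightweight fusion module. The paper's actual implementation is not given here, so the following is only a minimal sketch of one common way such fusion is built, bidirectional single-head cross-attention between image and text tokens; the function names, token counts, and feature dimension are all illustrative assumptions, not the authors' code.

```python
import numpy as np

def cross_attention(query, context):
    """Single-head scaled dot-product attention: query tokens attend to context tokens."""
    d = query.shape[-1]
    scores = query @ context.T / np.sqrt(d)            # (n_query, n_context)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over context tokens
    return weights @ context                            # (n_query, d)

def dual_branch_fusion(img_tokens, txt_tokens):
    """Hypothetical fusion step: each branch's features are enriched with
    cross-attended features from the other branch (residual connection)."""
    img_fused = img_tokens + cross_attention(img_tokens, txt_tokens)
    txt_fused = txt_tokens + cross_attention(txt_tokens, img_tokens)
    return img_fused, txt_fused

# Toy inputs standing in for backbone outputs (shapes are assumptions).
rng = np.random.default_rng(0)
img = rng.normal(size=(16, 64))   # 16 image tokens, feature dim 64
txt = rng.normal(size=(8, 64))    # 8 text tokens, feature dim 64
img_f, txt_f = dual_branch_fusion(img, txt)
print(img_f.shape, txt_f.shape)   # (16, 64) (8, 64)
```

In a real model the fused tokens would condition a 3D generator (e.g. a diffusion or reconstruction head); the sketch only shows why the mechanism is "lightweight": a single attention pass per direction, no extra backbone.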
📝 Abstract
High-quality 3D assets are essential for VR/AR, industrial design, and entertainment, motivating growing interest in generative models that create 3D content from user prompts. Most existing 3D generators, however, rely on a single conditioning modality: image-conditioned models achieve high visual fidelity by exploiting pixel-aligned cues but suffer from viewpoint bias when the input view is limited or ambiguous, while text-conditioned models provide broad semantic guidance yet lack low-level visual detail. This limits how users can express intent and raises a natural question: can these two modalities be combined for more flexible and faithful 3D generation? Our diagnostic study shows that even simple late fusion of text- and image-conditioned predictions outperforms single-modality models, revealing strong cross-modal complementarity. We therefore formalize Text-Image Conditioned 3D Generation, which requires joint reasoning over a visual exemplar and a textual specification. To address this task, we introduce TIGON, a minimalist dual-branch baseline with separate image- and text-conditioned backbones and lightweight cross-modal fusion. Extensive experiments show that text-image conditioning consistently improves over single-modality methods, highlighting complementary vision-language guidance as a promising direction for future 3D generation research. Project page: https://jumpat.github.io/tigon-page
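The abstract's diagnostic study finds that even simple late fusion of text- and image-conditioned predictions beats either model alone. The paper does not specify the fusion rule here; a common minimal form is a convex blend of the two unimodal outputs, sketched below on hypothetical per-voxel occupancy probabilities (the grids, the blend weight, and the 0.5 threshold are assumptions for illustration).

```python
import numpy as np

def late_fuse(p_img, p_txt, w=0.5):
    """Convex blend of image- and text-conditioned occupancy probabilities,
    then a threshold to obtain a binary voxel grid (both choices illustrative)."""
    blended = w * p_img + (1.0 - w) * p_txt
    return blended, blended > 0.5

# Stand-in predictions from two unimodal generators over a tiny 4x4x4 grid.
rng = np.random.default_rng(1)
occ_from_image = rng.uniform(size=(4, 4, 4))  # image-conditioned prediction
occ_from_text = rng.uniform(size=(4, 4, 4))   # text-conditioned prediction

blended, voxels = late_fuse(occ_from_image, occ_from_text)
print(blended.shape, voxels.dtype)  # (4, 4, 4) bool
```

The point of the diagnostic is that even this training-free combination exposes cross-modal complementarity, which motivates the learned fusion in TIGON.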