Text-Image Conditioned 3D Generation

📅 2026-03-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing 3D generation methods, which predominantly rely on a single modality—either images or text—and consequently struggle to simultaneously capture fine visual details and rich semantic meaning, thereby hindering precise expression of user intent. To overcome this, the paper formally introduces the task of 3D generation conditioned jointly on text and image inputs and proposes a lightweight dual-branch baseline model. The architecture employs separate backbone networks to extract features from each modality and incorporates an efficient cross-modal fusion mechanism to enable joint reasoning. Experimental results demonstrate that this bimodal approach significantly outperforms unimodal baselines in terms of generation quality, geometric fidelity, and semantic consistency, effectively validating the practical utility and complementary nature of cross-modal conditioning in 3D content creation.
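The dual-branch design described in the summary can be illustrated with a minimal sketch: two separate (hypothetical) linear "backbones" project each modality's tokens into a shared space, and a single-head cross-attention step lets image tokens attend to text tokens. All shapes, names, and the residual fusion rule here are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_fusion(img_feats, txt_feats):
    """Fuse image tokens with text tokens via single-head cross-attention.
    img_feats: (N_img, d), txt_feats: (N_txt, d). Hypothetical shapes/layout."""
    d = img_feats.shape[-1]
    # image tokens query the text tokens
    attn = softmax(img_feats @ txt_feats.T / np.sqrt(d), axis=-1)  # (N_img, N_txt)
    # residual add of attended text context onto the image branch
    return img_feats + attn @ txt_feats

rng = np.random.default_rng(0)
# stand-ins for the two backbone outputs (separate projections per modality)
W_img, W_txt = rng.standard_normal((32, 64)), rng.standard_normal((48, 64))
img_tokens = rng.standard_normal((16, 32)) @ W_img   # (16, 64)
txt_tokens = rng.standard_normal((8, 48)) @ W_txt    # (8, 64)

fused = cross_modal_fusion(img_tokens, txt_tokens)
print(fused.shape)  # (16, 64)
```

The fused image tokens carry text-derived context while keeping the visual features intact via the residual connection; a downstream 3D decoder would consume `fused` in place of the unimodal image features.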

📝 Abstract
High-quality 3D assets are essential for VR/AR, industrial design, and entertainment, motivating growing interest in generative models that create 3D content from user prompts. Most existing 3D generators, however, rely on a single conditioning modality: image-conditioned models achieve high visual fidelity by exploiting pixel-aligned cues but suffer from viewpoint bias when the input view is limited or ambiguous, while text-conditioned models provide broad semantic guidance yet lack low-level visual detail. This limits how users can express intent and raises a natural question: can these two modalities be combined for more flexible and faithful 3D generation? Our diagnostic study shows that even simple late fusion of text- and image-conditioned predictions outperforms single-modality models, revealing strong cross-modal complementarity. We therefore formalize Text-Image Conditioned 3D Generation, which requires joint reasoning over a visual exemplar and a textual specification. To address this task, we introduce TIGON, a minimalist dual-branch baseline with separate image- and text-conditioned backbones and lightweight cross-modal fusion. Extensive experiments show that text-image conditioning consistently improves over single-modality methods, highlighting complementary vision-language guidance as a promising direction for future 3D generation research. Project page: https://jumpat.github.io/tigon-page
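The abstract's diagnostic finding, that even simple late fusion of the two unimodal predictions beats either alone, can be sketched as a weighted average of per-model outputs. The occupancy-grid representation and the mixing weight `w` below are hypothetical choices for illustration; the paper's actual fusion scheme and 3D representation may differ.

```python
import numpy as np

def late_fuse(pred_img, pred_txt, w=0.5):
    """Late fusion: weighted average of two unimodal 3D predictions
    (here, toy occupancy grids). w is a hypothetical mixing weight."""
    return w * pred_img + (1 - w) * pred_txt

rng = np.random.default_rng(1)
# toy occupancy predictions on a 4x4x4 voxel grid, one per modality
occ_img = rng.random((4, 4, 4))
occ_txt = rng.random((4, 4, 4))

fused = late_fuse(occ_img, occ_txt)
print(fused.shape)  # (4, 4, 4)
```

With `w=0.5` this is plain averaging; sweeping `w` on a validation set is the obvious way such a diagnostic would probe how much each modality contributes.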
Problem

Research questions and friction points this paper is trying to address.

3D generation
multimodal conditioning
text-image fusion
viewpoint bias
visual fidelity
Innovation

Methods, ideas, or system contributions that make the work stand out.

text-image conditioning
3D generation
cross-modal fusion
dual-branch architecture
vision-language guidance
🔎 Similar Papers
Jiazhong Cen
Shanghai Jiao Tong University
Computer Vision · 3D Scene Understanding
Jiemin Fang
Senior Researcher, Huawei
Neural Rendering · 3D Vision · AutoML · Neural Architecture Search · Computer Vision
Sikuang Li
Shanghai Jiao Tong University
Guanjun Wu
Huazhong University of Science and Technology
Chen Yang
Huawei Inc.
Taoran Yi
Huazhong University of Science and Technology
Computer Vision · Computer Graphics
Zanwei Zhou
MoE Key Lab of Artificial Intelligence, AI Institute, School of Computer Science, Shanghai Jiao Tong University
Zhikuan Bao
Huawei Inc.
Lingxi Xie
Huawei Inc.
Wei Shen
MoE Key Lab of Artificial Intelligence, AI Institute, School of Computer Science, Shanghai Jiao Tong University
Qi Tian
Huawei Inc.