🤖 AI Summary
Existing multimodal large language models (MLLMs) exhibit limited spatial reasoning capabilities, hindering direct generation of 3D object prototypes with explicit geometric structure and part-level semantic labels.
Method: We propose the first MLLM-driven, agent-based 3D prototyping framework, requiring no fine-tuning, human annotation, or additional training data. It comprises a designer, a coder, and a visual inspector that operate in an iterative "design–code–visual verification" refinement loop, producing 3D object prototypes with explicit geometric structure and part-level semantic labels directly from a pretrained MLLM.
Contribution/Results: Rendered images of the generated 3D prototypes serve as effective supervision signals: used for image classification pretraining, they outperform previous methods by 15%; fine-tuning CLIP on the rendered, part-labeled prototypes for part segmentation yields a 55% accuracy improvement, all without any additional human-labeled data.
📝 Abstract
Recent Multi-Modal Large Language Models (MLLMs) have demonstrated strong capabilities in learning joint representations from text and images. However, their spatial reasoning remains limited. We introduce 3DFroMLLM, a novel framework that enables the generation of 3D object prototypes, including geometry and part labels, directly from MLLMs. Our pipeline is agentic, comprising a designer, a coder, and a visual inspector operating in a refinement loop. Notably, our approach requires no additional training data or detailed user instructions. Building on prior work in 2D generation, we demonstrate that rendered images produced by our framework can be effectively used for image classification pretraining tasks and outperform previous methods by 15%. As a compelling real-world use case, we show that the generated prototypes can be leveraged to improve fine-grained vision-language models: using the rendered, part-labeled prototypes to fine-tune CLIP for part segmentation achieves a 55% accuracy improvement without relying on any additional human-labeled data.
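The designer–coder–inspector refinement loop described above can be sketched in Python. This is a minimal illustrative skeleton, not the paper's implementation: the agent roles, the `Prototype` container, and the stopping criterion are assumptions, and the MLLM calls are replaced by stubs.

```python
from dataclasses import dataclass, field

@dataclass
class Prototype:
    plan: str                  # designer's part-level design description
    code: str                  # coder's geometry program for the prototype
    feedback: list = field(default_factory=list)  # inspector critiques

def designer(obj_name: str, feedback: list) -> str:
    # In the real system, an MLLM drafts or revises a part-level plan.
    note = f" (revised after: {feedback[-1]})" if feedback else ""
    return f"parts of a {obj_name}: body, seat, legs, back{note}"

def coder(plan: str) -> str:
    # In the real system, an MLLM translates the plan into executable
    # geometry code (e.g., one primitive per labeled part).
    return f"scene = assemble(primitives_for('{plan}'))"

def inspector(code: str, round_idx: int, max_rounds: int) -> tuple:
    # In the real system, an MLLM renders the scene and visually
    # critiques it. Stub: approve on the final allowed round.
    if round_idx < max_rounds - 1:
        return False, f"round {round_idx}: proportions look off"
    return True, "approved"

def generate_prototype(obj_name: str, max_rounds: int = 3) -> Prototype:
    """Run the design-code-visual verification loop until approval."""
    proto = Prototype(plan="", code="")
    for i in range(max_rounds):
        proto.plan = designer(obj_name, proto.feedback)
        proto.code = coder(proto.plan)
        approved, critique = inspector(proto.code, i, max_rounds)
        if approved:
            break
        proto.feedback.append(critique)
    return proto
```

The key design point the abstract emphasizes is that no weights are updated inside this loop: refinement happens purely through the agents exchanging plans, code, and visual critiques.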