🤖 AI Summary
Current multimodal large language models (MLLMs) struggle to accurately model geometric relationships between cameras and objects—such as object orientation, camera viewpoint, and shooting angle—primarily due to the scarcity of diverse, precisely annotated 3D spatial relations in training data. To address this, we propose the first large-scale 3D vision-instruction synthesis framework dedicated to camera-object relationship understanding. Our framework integrates 3D asset modeling, physically realistic rendering, diffusion-model-based augmentation, and LLM-driven instruction generation, yielding Ultimate3D—a high-quality dataset comprising 240K VQA samples—and a corresponding benchmark. Crucially, our approach enables explicit joint modeling and supervised learning of both camera pose and object pose. On camera-object relationship recognition tasks, fine-tuned MLLMs achieve an average accuracy improvement of 33.4%, substantially outperforming leading commercial models.
📝 Abstract
Multimodal Large Language Models (MLLMs) struggle to accurately capture camera-object relations, especially object orientation, camera viewpoint, and camera shots. This stems from the fact that existing MLLMs are trained on images with limited diversity in camera-object relations and in the corresponding textual descriptions. To address this, we propose a synthetic generation pipeline to create large-scale 3D visual instruction datasets. Our framework takes 3D assets as input and uses rendering and diffusion-based image generation models to create photorealistic images that preserve precise camera-object relations. Additionally, large language models (LLMs) are used to generate text prompts for guiding visual instruction tuning and controlling image generation. We create Ultimate3D, a dataset of 240K VQAs with precise camera-object annotations, along with a corresponding benchmark. MLLMs fine-tuned on our proposed dataset outperform commercial models by a large margin, achieving an average accuracy improvement of 33.4% on camera-object relation recognition tasks. Our code, dataset, and benchmark will support a broad range of MLLM applications.