🤖 AI Summary
Current multimodal large language models (MLLMs) struggle to accurately model geometric relationships between cameras and objects—such as object orientation, camera viewpoint, and shooting angle—primarily due to the scarcity of diverse, precisely annotated 3D spatial relations in training data. To address this, we propose the first large-scale 3D vision-instruction synthesis framework dedicated to camera-object relationship understanding. Our framework integrates 3D asset modeling, physically realistic rendering, diffusion-model-based augmentation, and LLM-driven instruction generation, yielding Ultimate3D—a high-quality dataset comprising 240K VQA samples—and a corresponding benchmark. Crucially, our approach enables explicit joint modeling and supervised learning of both camera pose and object pose. On camera-object relationship recognition tasks, fine-tuned MLLMs achieve an average accuracy improvement of 33.4%, substantially outperforming leading commercial models.
📝 Abstract
Multimodal Large Language Models (MLLMs) struggle to accurately capture camera-object relations, especially object orientation, camera viewpoint, and camera shots. This stems from the fact that existing MLLMs are trained on images with limited diversity in camera-object relations and in the corresponding textual descriptions. To address this, we propose a synthetic generation pipeline to create large-scale 3D visual instruction datasets. Our framework takes 3D assets as input and uses rendering and diffusion-based image generation models to create photorealistic images that preserve precise camera-object relations. Additionally, large language models (LLMs) are used to generate text prompts for guiding visual instruction tuning and controlling image generation. We create Ultimate3D, a dataset of 240K VQAs with precise camera-object annotations, along with a corresponding benchmark. MLLMs fine-tuned on our proposed dataset outperform commercial models by a large margin, achieving an average accuracy improvement of 33.4% on camera-object relation recognition tasks. Our code, dataset, and benchmark will support a broad range of MLLM applications.