MMPart: Harnessing Multi-Modal Large Language Models for Part-Aware 3D Generation

📅 2025-09-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional generative 3D modeling produces closed meshes lacking structural semantics, hindering editing, animation, and semantic understanding. Existing part-aware methods struggle to support user-controllable part segmentation and semantically coherent completion of occluded regions. This paper introduces the first end-to-end framework integrating vision-language models (VLMs) with generative models: a VLM parses semantic part prompts from a single input image to guide controllable, part-isolated image generation; multi-view-consistent image synthesis and 3D reconstruction then yield editable, structurally explicit, and semantically grounded part-level 3D models. Our key innovations include user-specified part granularity control and occlusion-aware semantic reasoning—significantly enhancing model interpretability and downstream task adaptability. Experiments demonstrate superior performance in semantic editing and understanding, particularly in VR/AR and embodied AI applications.

📝 Abstract
Generative 3D modeling has advanced rapidly, driven by applications in VR/AR, the metaverse, and robotics. However, most methods represent the target object as a closed mesh devoid of any structural information, limiting editing, animation, and semantic understanding. Part-aware 3D generation addresses this problem by decomposing objects into meaningful components, but existing pipelines face challenges: the user has no control over which parts are separated or over how the model imagines occluded regions during the isolation phase. In this paper, we introduce MMPart, an innovative framework for generating part-aware 3D models from a single image. We first use a VLM to generate a set of prompts based on the input image and user descriptions. Next, a generative model produces an isolated image of each part, conditioned on the initial image and supervised by the previous step's prompts (which control the pose and guide the model in imagining previously occluded areas). Each of these images then enters the multi-view generation stage, where a set of consistent images from different views is generated. Finally, a reconstruction model converts each set of multi-view images into a 3D model.
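The four stages described in the abstract can be sketched as the following orchestration. This is a minimal illustrative outline, not the authors' implementation: every function name, data shape, and the stand-in logic inside each stage are assumptions made for clarity.

```python
# Hypothetical sketch of the MMPart pipeline from the abstract.
# All names and stage internals are illustrative stand-ins.

def parse_part_prompts(image, user_description):
    # Stage 1: a VLM turns the input image plus the user's description
    # into one prompt per part (stand-in logic for illustration).
    parts = [p.strip() for p in user_description.split(",")]
    return [f"{part} of {image}" for part in parts]

def generate_isolated_image(image, prompt):
    # Stage 2: a generative model renders each part in isolation, with
    # the prompt supervising pose and occluded-region completion.
    return f"isolated({prompt})"

def generate_multiview(isolated_image, n_views=4):
    # Stage 3: synthesize n mutually consistent views of the part.
    return [f"{isolated_image}@view{i}" for i in range(n_views)]

def reconstruct_3d(views):
    # Stage 4: a reconstruction model lifts the views to a 3D part.
    return {"mesh": "part_mesh", "views_used": len(views)}

def mmpart_pipeline(image, user_description):
    meshes = []
    for prompt in parse_part_prompts(image, user_description):
        iso = generate_isolated_image(image, prompt)
        views = generate_multiview(iso)
        meshes.append(reconstruct_3d(views))
    return meshes

parts = mmpart_pipeline("chair.png", "seat, backrest, legs")
print(len(parts))  # → 3, one 3D model per requested part
```

The key structural point the sketch captures is that each part flows through isolation, multi-view synthesis, and reconstruction independently, so the final output is a set of editable part-level models rather than one closed mesh.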
Problem

Research questions and friction points this paper is trying to address.

Generating part-aware 3D models from single images with structural information
Providing user control over which parts are separated and how occluded regions are imagined
Overcoming limitations of closed mesh representations lacking semantic understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses VLM to generate part-specific prompts
Generates isolated object images with occlusion guidance
Reconstructs 3D parts from multi-view consistent images
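The first bullet, user-controllable prompt generation, might look like the following minimal sketch. The query template and the granularity vocabulary are assumptions for illustration only; the paper does not specify MMPart's actual prompt format.

```python
# Hypothetical construction of the VLM query that encodes the
# user-specified part granularity. Template text is an assumption.

def build_vlm_query(user_description, granularity="coarse"):
    # Ask the VLM to name each part at the requested granularity and
    # to describe how occluded geometry should be completed.
    return (
        f"List the {granularity}-level parts of the pictured object "
        f"matching this description: {user_description}. "
        "For each part, describe its pose and any occluded geometry."
    )

query = build_vlm_query("an office chair with wheels", granularity="fine")
print("fine-level" in query)  # → True
```

Exposing granularity as an explicit parameter is what distinguishes this design from prior part-aware pipelines, where the segmentation level is fixed by the model rather than chosen by the user.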
Omid Bonakdar
School of Computer Engineering, Iran University of Science and Technology, Tehran, Iran
Nasser Mozayani
Iran University of Science and Technology
Artificial Intelligence · Multi-agent systems · Machine Learning