🤖 AI Summary
Traditional generative 3D modeling produces closed meshes lacking structural semantics, hindering editing, animation, and semantic understanding. Existing part-aware methods struggle to support user-controllable part segmentation and semantically coherent completion of occluded regions. This paper introduces the first end-to-end framework integrating vision-language models (VLMs) with generative models: a VLM parses semantic part prompts from a single input image to guide controllable, part-isolated image generation; multi-view-consistent image synthesis and 3D reconstruction then yield editable, structurally explicit, and semantically grounded part-level 3D models. Our key innovations include user-specified part granularity control and occlusion-aware semantic reasoning—significantly enhancing model interpretability and downstream task adaptability. Experiments demonstrate superior performance in semantic editing and understanding, particularly in VR/AR and embodied AI applications.
📝 Abstract
Generative 3D modeling has advanced rapidly, driven by applications in VR/AR, the metaverse, and robotics. However, most methods represent the target object as a closed mesh devoid of any structural information, limiting editing, animation, and semantic understanding. Part-aware 3D generation addresses this problem by decomposing objects into meaningful components, but existing pipelines face challenges: the user has no control over which parts are separated, nor over how the model imagines occluded regions during the isolation phase. In this paper, we introduce MMPart, an innovative framework for generating part-aware 3D models from a single image. We first use a VLM to generate a set of prompts based on the input image and user descriptions. Next, a generative model produces an isolated image of each part, conditioned on the initial image and supervised by the prompts from the previous step, which control the pose and guide how the model imagines previously occluded areas. Each isolated image then enters a multi-view generation stage, where a set of consistent images from different viewpoints is synthesized. Finally, a reconstruction model converts each set of multi-view images into a 3D model.
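The four-stage pipeline described above can be sketched as follows. This is a minimal illustrative mock-up, not the paper's actual implementation: every function (`parse_part_prompts`, `isolate_part`, `generate_views`, `reconstruct_3d`) is a hypothetical stub standing in for a VLM, an image generator, a multi-view model, and a reconstruction model respectively.

```python
# Hypothetical sketch of the MMPart pipeline. All function names and data
# shapes are illustrative assumptions, not the authors' API; strings stand
# in for images and meshes so the control flow stays runnable.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Part:
    name: str
    isolated_image: str = ""              # placeholder for an image tensor
    views: List[str] = field(default_factory=list)
    mesh: str = ""                        # placeholder for a 3D asset

def parse_part_prompts(image: str, user_description: str) -> List[str]:
    """Stub VLM: map the input image + user text to per-part prompts."""
    # A real system would query a vision-language model here; the part
    # names below are hard-coded purely for illustration.
    return [f"{part} of the {user_description}"
            for part in ("seat", "backrest", "legs")]

def isolate_part(image: str, prompt: str) -> str:
    """Stub generator: render one part in isolation, with the prompt
    controlling pose and how occluded regions are imagined."""
    return f"isolated({prompt})"

def generate_views(isolated_image: str, n_views: int = 4) -> List[str]:
    """Stub multi-view synthesis: n consistent viewpoints of one part."""
    return [f"{isolated_image}@view{i}" for i in range(n_views)]

def reconstruct_3d(views: List[str]) -> str:
    """Stub reconstruction: multi-view images -> one 3D part model."""
    return f"mesh_from({len(views)}_views)"

def mmpart_pipeline(image: str, user_description: str) -> List[Part]:
    parts = []
    for prompt in parse_part_prompts(image, user_description):
        part = Part(name=prompt)
        part.isolated_image = isolate_part(image, prompt)
        part.views = generate_views(part.isolated_image)
        part.mesh = reconstruct_3d(part.views)
        parts.append(part)
    return parts

parts = mmpart_pipeline("chair.png", "chair")
print(len(parts), parts[0].mesh)
```

The key structural point the sketch captures is that each part is processed independently after prompt parsing, so part granularity is fixed by the VLM stage (where the user's description enters) rather than by the downstream generative or reconstruction models.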