🤖 AI Summary
Traditional generative 3D modeling produces closed meshes lacking structural semantics, hindering editing, animation, and semantic understanding. Existing part-aware methods struggle to support user-controllable part segmentation and semantically coherent completion of occluded regions. This paper introduces the first end-to-end framework integrating vision-language models (VLMs) with generative models: a VLM parses semantic part prompts from a single input image to guide controllable, part-isolated image generation; multi-view-consistent image synthesis and 3D reconstruction then yield editable, structurally explicit, and semantically grounded part-level 3D models. Our key innovations include user-specified part granularity control and occlusion-aware semantic reasoning—significantly enhancing model interpretability and downstream task adaptability. Experiments demonstrate superior performance in semantic editing and understanding, particularly in VR/AR and embodied AI applications.
📝 Abstract
Generative 3D modeling has advanced rapidly, driven by applications in VR/AR, the metaverse, and robotics. However, most methods represent the target object as a closed mesh devoid of any structural information, limiting editing, animation, and semantic understanding. Part-aware 3D generation addresses this problem by decomposing objects into meaningful components, but existing pipelines face challenges: the user has no control over which parts are separated, nor over how the model imagines occluded regions during the isolation phase. In this paper, we introduce MMPart, an innovative framework for generating part-aware 3D models from a single image. We first use a VLM to generate a set of prompts based on the input image and user descriptions. Next, a generative model produces an isolated image of each part, conditioned on the initial image and supervised by the prompts from the previous step, which control the pose and guide how the model imagines previously occluded areas. Each isolated image then enters a multi-view generation stage, where a set of consistent images from different viewpoints is synthesized. Finally, a reconstruction model converts each set of multi-view images into a 3D model.
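The four-stage pipeline described above can be sketched as follows. This is a minimal illustrative mock-up, not the paper's actual implementation: every function (`parse_part_prompts`, `isolate_part`, `generate_views`, `reconstruct_3d`) is a hypothetical stub standing in for a VLM, an image generator, a multi-view model, and a reconstruction model respectively.

```python
# Hypothetical sketch of the MMPart pipeline. All function names and data
# shapes are illustrative assumptions, not the authors' API; strings stand
# in for images and meshes so the control flow stays runnable.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Part:
    name: str
    isolated_image: str = ""              # placeholder for an image tensor
    views: List[str] = field(default_factory=list)
    mesh: str = ""                        # placeholder for a 3D asset

def parse_part_prompts(image: str, user_description: str) -> List[str]:
    """Stub VLM: map the input image + user text to per-part prompts."""
    # A real system would query a vision-language model here; the part
    # names below are hard-coded purely for illustration.
    return [f"{part} of the {user_description}"
            for part in ("seat", "backrest", "legs")]

def isolate_part(image: str, prompt: str) -> str:
    """Stub generator: render one part in isolation, with the prompt
    controlling pose and how occluded regions are imagined."""
    return f"isolated({prompt})"

def generate_views(isolated_image: str, n_views: int = 4) -> List[str]:
    """Stub multi-view synthesis: n consistent viewpoints of one part."""
    return [f"{isolated_image}@view{i}" for i in range(n_views)]

def reconstruct_3d(views: List[str]) -> str:
    """Stub reconstruction: multi-view images -> one 3D part model."""
    return f"mesh_from({len(views)}_views)"

def mmpart_pipeline(image: str, user_description: str) -> List[Part]:
    parts = []
    for prompt in parse_part_prompts(image, user_description):
        part = Part(name=prompt)
        part.isolated_image = isolate_part(image, prompt)
        part.views = generate_views(part.isolated_image)
        part.mesh = reconstruct_3d(part.views)
        parts.append(part)
    return parts

parts = mmpart_pipeline("chair.png", "chair")
print(len(parts), parts[0].mesh)
```

The key structural point the sketch captures is that each part is processed independently after prompt parsing, so part granularity is fixed by the VLM stage (where the user's description enters) rather than by the downstream generative or reconstruction models.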