AI Summary
Existing scribble-driven 3D texture editing methods struggle to accurately interpret the semantic intent and spatial location of users' rough inputs, often yielding blurry or distorted results. This work proposes a novel framework that integrates a multimodal large language model (MLLM) with a diffusion-based image generation model. Leveraging the MLLM, for the first time, to precisely parse the semantic meaning of user-provided scribbles, the method uses globally generated images to guide local texture extraction and fusion, establishing a global-to-local texture transfer mechanism that effectively mitigates ambiguity in semantic localization. Experimental results demonstrate that the proposed approach significantly outperforms existing techniques in both editing accuracy and user interactivity, achieving state-of-the-art performance.
Abstract
Interactive 3D model texture editing opens up rich opportunities for creating 3D assets, and freehand drawing offers the most intuitive experience. However, existing methods primarily support sketch-based interactions for outlining, while support for coarse-grained scribble-based interaction remains limited. Moreover, current methods often struggle with the abstract nature of scribble instructions, which can lead to ambiguous editing intentions and unclear target semantic locations. To address these issues, we propose ScribbleSense, an editing method that combines multimodal large language models (MLLMs) and image generation models to resolve these challenges. We leverage the visual capabilities of MLLMs to predict the editing intent behind the scribbles. Once the semantic intent of a scribble is discerned, we employ globally generated images to extract local texture details, thereby anchoring local semantics and alleviating ambiguity about the target semantic locations. Experimental results indicate that our method effectively exploits the strengths of MLLMs, achieving state-of-the-art interactive editing performance for scribble-based texture editing.
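The abstract describes a three-stage pipeline: an MLLM parses the scribble into an editing intent, a generative model produces a global edited image, and a local texture patch is extracted from that image to anchor the semantics. The sketch below is a conceptual illustration of that data flow only, not the authors' implementation; every function, class, and field name here is hypothetical, and the model calls are replaced by stand-in stubs.

```python
# Conceptual sketch (NOT the ScribbleSense code) of the global-to-local
# editing pipeline described in the abstract. All names are hypothetical.
from dataclasses import dataclass

@dataclass
class EditIntent:
    semantic_label: str    # what the scribble means, e.g. "red scarf"
    target_region: tuple   # coarse location of the edit on the texture

def parse_scribble_with_mllm(scribble_image: str) -> EditIntent:
    """Stand-in for the MLLM call that predicts the editing intent
    (semantics + approximate location) behind a rough scribble."""
    # A real system would prompt a multimodal LLM with the rendered
    # view plus the scribble overlay; here we return a fixed intent.
    return EditIntent("red scarf", (10, 20, 60, 80))

def generate_global_image(intent: EditIntent) -> str:
    """Stand-in for the diffusion model producing a full edited view."""
    return f"global_image[{intent.semantic_label}]"

def extract_local_texture(global_image: str, intent: EditIntent) -> str:
    """Crop the intent's region from the globally generated image,
    anchoring the local semantics to a concrete location."""
    return f"{global_image} @ {intent.target_region}"

def edit_texture(scribble_image: str) -> str:
    intent = parse_scribble_with_mllm(scribble_image)
    global_img = generate_global_image(intent)
    # The local patch would then be fused back into the 3D texture map.
    return extract_local_texture(global_img, intent)

print(edit_texture("user_scribble.png"))
```

The key design point the abstract argues for is visible in the flow: the local extraction step consumes a *globally* generated image, so the patch inherits an unambiguous location and context rather than being synthesized from the vague scribble alone.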