FreeInsert: Disentangled Text-Guided Object Insertion in 3D Gaussian Scene without Spatial Priors

📅 2025-05-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing text-guided 3D scene object insertion methods rely heavily on spatial priors—such as 2D masks or 3D bounding boxes—leading to limited editing flexibility and poor object consistency. To address this, we propose the first spatial-prior-free, decoupled text-driven insertion framework. Our method first leverages a multimodal large language model (MLLM) to parse semantic intent and implicitly reason about spatial relationships. It then decouples object generation from localization, jointly optimizing pose, scale, and appearance via structured semantic parsing and hierarchical spatially aware refinement. Rendering is grounded in the Large Multi-View Gaussian Model (LGM) and 3D Gaussian Splatting for high-fidelity output. Evaluated on real-world scenes, our approach enables semantically coherent, spatially accurate, and visually realistic object insertion from natural language instructions—significantly enhancing editing freedom and user accessibility while eliminating the need for manual annotations.

📝 Abstract
Text-driven object insertion in 3D scenes is an emerging task that enables intuitive scene editing through natural language. However, existing 2D editing-based methods often rely on spatial priors such as 2D masks or 3D bounding boxes, and they struggle to ensure consistency of the inserted object. These limitations hinder flexibility and scalability in real-world applications. In this paper, we propose FreeInsert, a novel framework that leverages foundation models including MLLMs, LGMs, and diffusion models to disentangle object generation from spatial placement. This enables unsupervised and flexible object insertion in 3D scenes without spatial priors. FreeInsert starts with an MLLM-based parser that extracts structured semantics, including object types, spatial relationships, and attachment regions, from user instructions. These semantics guide both the reconstruction of the inserted object for 3D consistency and the learning of its degrees of freedom. We leverage the spatial reasoning capabilities of MLLMs to initialize object pose and scale. A hierarchical, spatially aware refinement stage further integrates spatial semantics and MLLM-inferred priors to enhance placement. Finally, the appearance of the object is improved using the inserted-object image to enhance visual fidelity. Experimental results demonstrate that FreeInsert achieves semantically coherent, spatially precise, and visually realistic 3D insertions without relying on spatial priors, offering a user-friendly and flexible editing experience.
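The abstract's disentangled flow — parse the instruction into structured semantics, then initialize the inserted object's pose and scale from the inferred spatial relation before refinement — can be illustrated with a minimal sketch. All class, function, and field names here are illustrative assumptions, not the authors' API; a hand-written stub stands in for the MLLM parser:

```python
from dataclasses import dataclass

@dataclass
class ParsedSemantics:
    """Structured semantics an MLLM-based parser might extract from
    an instruction like 'put a vase on the table' (hypothetical schema)."""
    object_type: str   # what to generate, e.g. "vase"
    relation: str      # spatial relation, e.g. "on", "next_to"
    anchor: str        # scene object the relation attaches to

def init_pose_and_scale(sem, anchor_center, anchor_extent, rel_scale=0.3):
    """Coarse initialization of the inserted object's degrees of freedom
    (position, scale) from the parsed relation, decoupled from the
    object-generation step. Extents are (width, depth, height)."""
    x, y, z = anchor_center
    w, d, h = anchor_extent
    scale = rel_scale * max(w, d, h)          # size relative to the anchor
    if sem.relation == "on":                   # rest on the anchor's top face
        position = (x, y, z + h / 2 + scale / 2)
    elif sem.relation == "next_to":            # offset beside the anchor
        position = (x + w / 2 + scale / 2, y, z)
    else:                                      # fall back to the anchor centre
        position = (x, y, z)
    return position, scale

sem = ParsedSemantics(object_type="vase", relation="on", anchor="table")
pos, scale = init_pose_and_scale(sem,
                                 anchor_center=(0.0, 0.0, 0.5),
                                 anchor_extent=(1.2, 0.8, 1.0))
# e.g. the vase is centred above the table top, sized relative to it
```

In the paper, this coarse placement is then improved by the hierarchical, spatially aware refinement stage and an appearance-enhancement step; the sketch only covers the initialization that the MLLM's spatial reasoning provides.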
Problem

Research questions and friction points this paper is trying to address.

Enables text-guided 3D object insertion without spatial priors
Ensures 3D consistency and visual fidelity of inserted objects
Leverages foundation models for flexible and unsupervised scene editing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages foundation models for disentangled object insertion
Uses MLLM-based parser for structured semantic extraction
Hierarchical refinement enhances spatial and visual fidelity