Language-guided 3D scene synthesis for fine-grained functionality understanding

📅 2025-11-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Fine-grained functionality understanding in 3D scenes is hindered by the scarcity of real-world annotated data. To address this, we propose SynthFun3D, the first task-oriented, executable-action-driven framework for synthesizing functional 3D indoor scenes. Given a natural-language action instruction (e.g., "open the second drawer of the nightstand"), SynthFun3D combines a furniture asset library with part-level functional annotations to automatically generate high-fidelity, instruction-aligned scenes with precise functional-part labels, via language-guided reasoning and generative modeling. Unlike prior approaches, SynthFun3D enables scalable, low-cost production of functional 3D data with strong instruction fidelity. A user study confirms its superior semantic consistency, and quantitative experiments show that its synthetic data can replace real data with only minor performance loss, or augment real data for improved performance, on functional-part recognition and localization tasks.

📝 Abstract
Functionality understanding in 3D, which aims to identify the functional element in a 3D scene to complete an action (e.g., the correct handle to "Open the second drawer of the cabinet near the bed"), is hindered by the scarcity of real-world data due to the substantial effort needed for its collection and annotation. To address this, we introduce SynthFun3D, the first method for task-based 3D scene synthesis. Given the action description, SynthFun3D generates a 3D indoor environment using a furniture asset database with part-level annotation, ensuring the action can be accomplished. It reasons about the action to automatically identify and retrieve the 3D mask of the correct functional element, enabling the inexpensive and large-scale generation of high-quality annotated data. We validate SynthFun3D through user studies, which demonstrate improved scene-prompt coherence compared to other approaches. Our quantitative results further show that the generated data can either replace real data with minor performance loss or supplement real data for improved performance, thereby providing an inexpensive and scalable solution for data-hungry 3D applications. Project page: github.com/tev-fbk/synthfun3d.

Problem

Research questions and friction points this paper is trying to address.

Addressing the scarcity of annotated data for 3D functionality understanding
Automatically generating task-completable 3D scenes from action descriptions
Providing scalable synthetic data for data-hungry 3D applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates 3D scenes from action descriptions using a part-annotated furniture asset database
Automatically identifies and retrieves the 3D mask of the correct functional element via part-level annotations
Produces scalable synthetic data that can replace or supplement real data (see the sketch after this list)
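To make the described pipeline concrete, here is a minimal, hypothetical Python sketch of a task-driven synthesis loop. None of the names below (`parse_action`, `retrieve_asset`, `Asset`, etc.) come from the paper or its code; they are illustrative assumptions about the three stages the abstract describes: parsing the action, retrieving a part-annotated asset so the action is executable, and exporting the functional-part mask as a free ground-truth label.

```python
# Hypothetical sketch of a task-driven functional-scene synthesis loop.
# All class and function names are illustrative assumptions, not the
# paper's actual API; they mirror the stages described in the abstract.
# Requires Python 3.10+ (for the `X | None` union syntax).
from dataclasses import dataclass, field


@dataclass
class Asset:
    """A furniture asset with per-part annotations (e.g., 'drawer_2_handle')."""
    name: str
    category: str
    part_masks: dict[str, list[int]] = field(default_factory=dict)  # part -> vertex ids


@dataclass
class ActionSpec:
    """Structured form of 'open the second drawer of the nightstand'."""
    verb: str             # e.g., "open"
    object_category: str  # e.g., "nightstand"
    part: str             # e.g., "drawer_2_handle"


def parse_action(instruction: str) -> ActionSpec:
    """Toy rule-based parser; the paper uses language-guided reasoning instead."""
    tokens = instruction.lower().split()
    verb = tokens[0]
    # Naive heuristic purely for illustration.
    part = "drawer_2_handle" if "second" in tokens and "drawer" in tokens else "unknown"
    category = tokens[-1].rstrip(".")
    return ActionSpec(verb=verb, object_category=category, part=part)


def retrieve_asset(db: list[Asset], spec: ActionSpec) -> Asset | None:
    """Pick an asset that matches the category AND has the required part,
    guaranteeing the instructed action can be accomplished in the scene."""
    for asset in db:
        if asset.category == spec.object_category and spec.part in asset.part_masks:
            return asset
    return None


def synthesize(instruction: str, db: list[Asset]) -> dict | None:
    """Full loop: instruction -> spec -> asset -> (scene stub, functional-part mask)."""
    spec = parse_action(instruction)
    asset = retrieve_asset(db, spec)
    if asset is None:
        return None
    scene = {"objects": [asset.name]}    # a real system would place it in a room layout
    mask = asset.part_masks[spec.part]   # the ground-truth label comes "for free"
    return {"scene": scene, "target_part": spec.part, "mask": mask}


if __name__ == "__main__":
    db = [Asset("nightstand_07", "nightstand",
                {"drawer_1_handle": [10, 11], "drawer_2_handle": [42, 43]})]
    print(synthesize("Open the second drawer of the nightstand", db))
    # -> {'scene': ..., 'target_part': 'drawer_2_handle', 'mask': [42, 43]}
```

The key design point this sketch illustrates is why synthesis makes annotation cheap: because the asset's part masks are known before the scene is assembled, the functional-element label is obtained by lookup rather than by manual labeling of real-world scans.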