SCENEFORGE: Enhancing 3D-Text Alignment with Structured Scene Compositions

📅 2025-09-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the scarcity of large-scale annotated 3D-text data and the limited complexity and diversity of existing training samples, this paper proposes a structured 3D scene composition framework. It explicitly models multi-object spatial relations (e.g., support, adjacency, enclosure) to synthesize realistic point cloud scenes, and leverages large language models to generate coherent, diverse textual descriptions. A contrastive learning objective and a text refinement mechanism are further introduced to achieve fine-grained cross-modal alignment. The method is model-agnostic and generalizes across downstream tasks, achieving state-of-the-art performance on zero-shot classification (ModelNet, ScanObjectNN), few-shot part segmentation (ShapeNetPart), and 3D visual question answering (ScanQA), while significantly improving 3D retrieval accuracy and spatial reasoning.

📝 Abstract
The whole is greater than the sum of its parts, even in 3D-text contrastive learning. We introduce SceneForge, a novel framework that enhances contrastive alignment between 3D point clouds and text through structured multi-object scene compositions. SceneForge leverages individual 3D shapes to construct multi-object scenes with explicit spatial relations, pairing them with coherent multi-object descriptions refined by a large language model. By augmenting contrastive training with these structured, compositional samples, SceneForge effectively addresses the scarcity of large-scale 3D-text datasets, significantly enriching data complexity and diversity. We systematically investigate critical design elements, such as the optimal number of objects per scene, the proportion of compositional samples in training batches, and scene construction strategies. Extensive experiments demonstrate that SceneForge delivers substantial performance gains across multiple tasks, including zero-shot classification on ModelNet, ScanObjectNN, Objaverse-LVIS, and ScanNet, as well as few-shot part segmentation on ShapeNetPart. SceneForge's compositional augmentations are model-agnostic, consistently improving performance across multiple encoder architectures. Moreover, SceneForge improves 3D visual question answering on ScanQA, generalizes robustly to retrieval scenarios with increasing scene complexity, and showcases spatial reasoning capabilities by adapting spatial configurations to align precisely with textual instructions.
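The contrastive alignment objective described in the abstract can be sketched as a standard symmetric InfoNCE loss over paired point-cloud and text embeddings. This is a minimal illustration, not the paper's exact implementation: the function names, the temperature value, and the use of NumPy are all assumptions.

```python
import numpy as np

def info_nce(pc_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over L2-normalized embeddings.

    pc_emb, txt_emb: (N, D) arrays; row i of each is a matched
    3D-text pair, so the correct targets lie on the diagonal.
    """
    pc = pc_emb / np.linalg.norm(pc_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = pc @ txt.T / temperature      # (N, N) cosine-similarity matrix
    idx = np.arange(len(logits))           # matched pairs on the diagonal

    def xent(l):
        # cross-entropy with the diagonal as targets (numerically stable)
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    # average of 3D-to-text and text-to-3D directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

Perfectly aligned pairs (identical embeddings) drive the loss toward zero, while unrelated embeddings leave it near log N, which is the signal that pulls matched 3D-text pairs together during training.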
Problem

Research questions and friction points this paper is trying to address.

Enhancing 3D-text alignment through structured scene compositions
Addressing scarcity of large-scale 3D-text datasets
Improving performance across multiple 3D vision tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leveraging structured multi-object scene compositions
Augmenting contrastive training with compositional samples
Model-agnostic framework improving multiple 3D-text tasks
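The batch-mixing idea above (augmenting contrastive training batches with a fixed proportion of compositional multi-object samples) can be sketched as follows. The sample structure, relation vocabulary, and ratio parameter are illustrative assumptions; the real framework composes point clouds and refines captions with an LLM rather than concatenating strings.

```python
import random

def compose_scene(shapes, relations=("next to", "on top of", "inside")):
    """Toy composition: join 2-4 labelled shapes with spatial relations
    into one multi-object sample (placeholder for point-cloud merging
    and LLM caption refinement)."""
    picked = random.sample(shapes, k=random.randint(2, min(4, len(shapes))))
    parts = [picked[0]]
    for s in picked[1:]:
        parts.append(f"{random.choice(relations)} {s}")
    return {"objects": picked, "caption": "a " + " ".join(parts)}

def mixed_batch(singles, batch_size=32, comp_ratio=0.25):
    """Build a contrastive batch where comp_ratio of the samples
    are compositional multi-object scenes and the rest are singles."""
    n_comp = int(batch_size * comp_ratio)
    batch = [{"objects": [s], "caption": f"a {s}"}
             for s in random.choices(singles, k=batch_size - n_comp)]
    batch += [compose_scene(singles) for _ in range(n_comp)]
    random.shuffle(batch)
    return batch
```

The paper reports that both the number of objects per scene and the in-batch proportion of compositional samples are tunable design choices, which is exactly what the `comp_ratio` knob and the 2-4 object range stand in for here.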