🤖 AI Summary
To address the scarcity of large-scale annotated data and the insufficient complexity and diversity of 3D-text alignment tasks, this paper proposes a structured 3D scene composition framework. It explicitly models multi-object spatial relations (e.g., support, adjacency, enclosure) to synthesize realistic point cloud scenes and leverages large language models to generate high-fidelity, diverse textual descriptions. A contrastive learning objective and a text refinement mechanism are further introduced to achieve fine-grained cross-modal alignment. The method is model-agnostic and generalizes across downstream tasks. It achieves state-of-the-art performance on zero-shot classification (ModelNet, ScanObjectNN), few-shot part segmentation (ShapeNetPart), and 3D visual question answering (ScanQA), significantly improving 3D retrieval accuracy and spatial reasoning capability.
📝 Abstract
The whole is greater than the sum of its parts, even in 3D-text contrastive learning. We introduce SceneForge, a novel framework that enhances contrastive alignment between 3D point clouds and text through structured multi-object scene compositions. SceneForge leverages individual 3D shapes to construct multi-object scenes with explicit spatial relations, pairing them with coherent multi-object descriptions refined by a large language model. By augmenting contrastive training with these structured, compositional samples, SceneForge effectively addresses the scarcity of large-scale 3D-text datasets, significantly enriching data complexity and diversity. We systematically investigate critical design elements, such as the optimal number of objects per scene, the proportion of compositional samples in training batches, and scene construction strategies. Extensive experiments demonstrate that SceneForge delivers substantial performance gains across multiple tasks, including zero-shot classification on ModelNet, ScanObjectNN, Objaverse-LVIS, and ScanNet, as well as few-shot part segmentation on ShapeNetPart. SceneForge's compositional augmentations are model-agnostic, consistently improving performance across multiple encoder architectures. Moreover, SceneForge improves 3D visual question answering on ScanQA, generalizes robustly to retrieval scenarios with increasing scene complexity, and showcases spatial reasoning capabilities by adapting spatial configurations to align precisely with textual instructions.
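The two mechanisms the abstract describes, composing individual shapes into a multi-object scene with explicit spatial offsets and mixing a controlled proportion of compositional samples into each contrastive training batch, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names (`compose_scene`, `mix_batch`), the per-object unit normalization, and the placement-by-offset strategy are all assumptions made for the sketch.

```python
import numpy as np

def compose_scene(shapes, offsets):
    """Build one multi-object scene point cloud by placing each
    unit-normalized shape at a given spatial offset.

    `shapes`  : list of (N_i, 3) point arrays, one per object.
    `offsets` : list of (3,) translation vectors encoding the
                intended spatial relation (hypothetical scheme).
    """
    parts = []
    for pts, off in zip(shapes, offsets):
        centered = pts - pts.mean(axis=0)        # center each object at origin
        scale = np.abs(centered).max() or 1.0    # normalize to the unit cube
        parts.append(centered / scale + off)     # translate into position
    return np.concatenate(parts, axis=0)

def mix_batch(single_samples, compositional_samples, ratio, rng):
    """Replace a fraction `ratio` of a training batch with
    compositional scene samples (the batch-proportion knob the
    abstract says is studied as a design element)."""
    n = len(single_samples)
    k = int(round(ratio * n))
    replace_idx = rng.choice(n, size=k, replace=False)
    batch = list(single_samples)
    for i, j in enumerate(replace_idx):
        batch[j] = compositional_samples[i % len(compositional_samples)]
    return batch
```

In an actual training loop, each composed scene would be paired with an LLM-refined multi-object caption, and the mixed batch fed through the standard 3D-text contrastive objective; those stages are omitted here.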