🤖 AI Summary
This work addresses key challenges in text-guided 3D scene segmentation—namely, ambiguous boundaries, cross-view semantic inconsistency, and high computational overhead—by proposing an efficient knowledge distillation framework. Methodologically, it (1) introduces the first end-to-end direct distillation mechanism for dense CLIP features; (2) incorporates adapter modules and a self-cross-training strategy to suppress noise and enhance robustness; (3) designs a low-rank transient query attention mechanism to strengthen boundary modeling; and (4) reformulates segmentation as classification over a label volume, significantly improving cross-view consistency in color-similar regions. Experiments demonstrate that the approach surpasses existing state-of-the-art methods in segmentation accuracy, boundary sharpness, and multi-view semantic consistency, while converging faster in training and requiring substantially less memory and compute.
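The label-volume idea in point (4) can be illustrated with a minimal sketch: instead of regressing a segmentation mask, each 3D point's feature is classified by picking the most similar text embedding. This is not the authors' implementation — the function names, shapes, and plain-list representation are assumptions for illustration only.

```python
import math

def label_points(point_feats, text_embeds):
    """Assign each point the index of the most similar text embedding
    (argmax cosine similarity), casting segmentation as classification.
    Hypothetical helper: point_feats and text_embeds are lists of
    equal-length feature vectors."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)
    # One discrete label per point: the best-matching text prompt.
    return [max(range(len(text_embeds)), key=lambda k: cos(f, text_embeds[k]))
            for f in point_feats]
```

Because every point with a similar feature maps to the same discrete label, nearby color-similar regions receive consistent labels across viewpoints, rather than slightly different continuous scores.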
📝 Abstract
In this work, we propose a method that leverages CLIP feature distillation to achieve efficient, language-guided 3D segmentation. Unlike previous methods, which rely on multi-scale CLIP features and are limited by processing speed and storage requirements, our approach streamlines the workflow by directly and effectively distilling dense CLIP features, enabling precise text-driven segmentation of 3D scenes. To this end, we introduce an adapter module and mitigate the noise in the dense CLIP feature distillation process through a self-cross-training strategy. Moreover, to sharpen segmentation edges, we present a low-rank transient query attention mechanism. To keep segmentation consistent for similar colors across viewpoints, we convert the segmentation task into a classification task via a label volume, which significantly improves consistency in color-similar areas. We also propose a simplified text augmentation strategy to alleviate ambiguity in the correspondence between CLIP features and text. Extensive experimental results show that our method surpasses current state-of-the-art methods in both training speed and performance. Our code is available at: https://github.com/xingy038/Laser.git.
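The core distillation step described above — matching features rendered from the 3D representation against dense 2D CLIP teacher features — is commonly trained with a cosine-distance objective. The sketch below shows that objective only; it is a hedged illustration, not the paper's code, and the function name and list-of-vectors representation are assumptions.

```python
import math

def cosine_distill_loss(rendered_feats, teacher_feats):
    """Mean cosine distance between student features rendered from the
    3D scene and dense CLIP teacher features (one vector per pixel/ray).
    Returns 0.0 when the two feature sets are identical up to scale."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)
    return sum(1.0 - cos(r, t)
               for r, t in zip(rendered_feats, teacher_feats)) / len(rendered_feats)
```

In practice this loss would be minimized per training view; the adapter module and self-cross-training strategy described in the abstract address the noise in the dense teacher features that a plain loss like this one cannot remove on its own.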