🤖 AI Summary
This work addresses key challenges in text-guided 3D scene segmentation—namely, ambiguous boundaries, cross-view semantic inconsistency, and high computational overhead—by proposing an efficient knowledge distillation framework. Methodologically, it (1) introduces the first end-to-end direct distillation mechanism for dense CLIP features; (2) incorporates adapter modules and a self-cross-training strategy to suppress noise and enhance robustness; (3) designs a low-rank transient query attention mechanism to strengthen boundary modeling; and (4) reformulates segmentation as classification over a label volume, significantly improving cross-view consistency in color-similar regions. Experiments demonstrate that the approach surpasses existing state-of-the-art methods in segmentation accuracy, boundary sharpness, and multi-view semantic consistency, while converging faster in training and requiring substantially less memory and compute.
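The label-volume idea in point (4) can be illustrated with a minimal sketch: instead of regressing a segmentation mask, each 3D point's feature is classified by picking the most similar text embedding. This is not the authors' implementation — the function names, shapes, and plain-list representation are assumptions for illustration only.

```python
import math

def label_points(point_feats, text_embeds):
    """Assign each point the index of the most similar text embedding
    (argmax cosine similarity), casting segmentation as classification.
    Hypothetical helper: point_feats and text_embeds are lists of
    equal-length feature vectors."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)
    # One discrete label per point: the best-matching text prompt.
    return [max(range(len(text_embeds)), key=lambda k: cos(f, text_embeds[k]))
            for f in point_feats]
```

Because every point with a similar feature maps to the same discrete label, nearby color-similar regions receive consistent labels across viewpoints, rather than slightly different continuous scores.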
📝 Abstract
In this work, we propose a method that leverages CLIP feature distillation to achieve efficient, language-guided 3D segmentation. Unlike previous methods, which rely on multi-scale CLIP features and are limited by processing speed and storage requirements, our approach streamlines the workflow by directly and effectively distilling dense CLIP features, enabling precise text-driven segmentation of 3D scenes. To this end, we introduce an adapter module and mitigate the noise in the dense CLIP feature distillation process through a self-cross-training strategy. Moreover, to sharpen segmentation edges, we present a low-rank transient query attention mechanism. To keep segmentation consistent for similar colors across viewpoints, we convert the segmentation task into a classification task via a label volume, which significantly improves consistency in color-similar areas. We also propose a simplified text augmentation strategy to alleviate ambiguity in the correspondence between CLIP features and text. Extensive experimental results show that our method surpasses current state-of-the-art methods in both training speed and performance. Our code is available at: https://github.com/xingy038/Laser.git.
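The core distillation step described above — matching features rendered from the 3D representation against dense 2D CLIP teacher features — is commonly trained with a cosine-distance objective. The sketch below shows that objective only; it is a hedged illustration, not the paper's code, and the function name and list-of-vectors representation are assumptions.

```python
import math

def cosine_distill_loss(rendered_feats, teacher_feats):
    """Mean cosine distance between student features rendered from the
    3D scene and dense CLIP teacher features (one vector per pixel/ray).
    Returns 0.0 when the two feature sets are identical up to scale."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)
    return sum(1.0 - cos(r, t)
               for r, t in zip(rendered_feats, teacher_feats)) / len(rendered_feats)
```

In practice this loss would be minimized per training view; the adapter module and self-cross-training strategy described in the abstract address the noise in the dense teacher features that a plain loss like this one cannot remove on its own.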