Glass Segmentation with Fusion of Learned and General Visual Features

📅 2026-03-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Segmenting transparent glass surfaces is highly challenging due to the absence of distinctive visual cues, yet it is critical for scene understanding and robotic obstacle avoidance. This work proposes a dual-backbone architecture that, for the first time, integrates a frozen DINOv3 vision foundation model with a supervised Swin Transformer to extract generic and task-specific features, respectively. These complementary features are fused via a residual squeeze-and-excitation (SE) channel compression mechanism and subsequently processed by a Mask2Former decoder to produce segmentation masks. The proposed method achieves state-of-the-art performance across four mainstream glass segmentation benchmarks. Notably, when employing a lightweight variant of DINOv3, it surpasses existing state-of-the-art approaches in inference speed while maintaining superior accuracy, effectively balancing precision and efficiency.

📝 Abstract
Glass surface segmentation from RGB images is a challenging task, since glass, as a transparent material, largely lacks distinctive visual characteristics. However, glass segmentation is critical for scene understanding and robotics, as transparent glass surfaces must be identified as solid material. This paper presents a novel architecture for glass segmentation, deploying a dual backbone that produces general visual features as well as task-specific learned visual features. The general visual features are produced by a frozen DINOv3 vision foundation model, and the task-specific features are generated by a Swin model trained in a supervised manner. The resulting multi-scale feature representations are downsampled with residual Squeeze-and-Excitation channel reduction and fed into a Mask2Former decoder, which produces the final segmentation masks. The architecture was evaluated on four commonly used glass segmentation datasets, achieving state-of-the-art results on several accuracy metrics. The model also offers competitive inference speed compared to the previous state-of-the-art method, and surpasses it when using a lighter DINOv3 backbone variant. The implementation source code and model weights are available at: https://github.com/ojalar/lgnet
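The abstract's fusion step compresses the dual-backbone features with residual Squeeze-and-Excitation channel reduction before the Mask2Former decoder. The paper's exact module is not reproduced here, but the general pattern can be sketched as follows: a 1x1 convolution reduces the channel count, an SE bottleneck gates the reduced channels, and a residual connection is added. All weight shapes and the function name are illustrative assumptions, not the authors' implementation (NumPy for clarity):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_channel_reduction(x, w_reduce, w1, w2):
    """Illustrative residual SE channel-reduction block (not the paper's code).

    x:        input feature map, shape (C_in, H, W)
    w_reduce: 1x1-conv weights compressing channels, shape (C_out, C_in)
    w1, w2:   SE bottleneck weights, shapes (C_out // r, C_out) and
              (C_out, C_out // r) for some reduction ratio r
    """
    # 1x1 convolution = a linear map over channels at each spatial location
    reduced = np.tensordot(w_reduce, x, axes=([1], [0]))  # (C_out, H, W)

    # Squeeze: global average pooling over the spatial dimensions
    z = reduced.mean(axis=(1, 2))  # (C_out,)

    # Excitation: bottleneck MLP (ReLU then sigmoid) yields per-channel gates
    s = sigmoid(w2 @ np.maximum(w1 @ z, 0.0))  # (C_out,)

    # Scale each channel by its gate and add the residual connection
    return reduced + s[:, None, None] * reduced
```

The residual term keeps the compressed features flowing even when a gate saturates near zero, which is the usual motivation for wrapping SE recalibration in a skip connection.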
Problem

Research questions and friction points this paper is trying to address.

glass segmentation
transparent material
RGB images
scene understanding
robotics
Innovation

Methods, ideas, or system contributions that make the work stand out.

glass segmentation
feature fusion
vision foundation model
Mask2Former
dual-backbone architecture