Segment Any RGB-Thermal Model with Language-aided Distillation

📅 2025-05-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address SAM’s limited adaptability to RGB-T semantic segmentation—stemming from its exclusive training on RGB data—this paper proposes a language-guided cross-modal knowledge distillation (CMKD) framework. Methodologically, it integrates LoRA-based efficient fine-tuning, a dual-branch segmentation head (a primary SAM head and an auxiliary semantic head), multi-scale feature fusion, and a text-embedding-driven cross-modal alignment mechanism. This design preserves SAM’s strong generalization capability while enabling precise alignment between thermal and RGB semantic spaces. Evaluated on three major RGB-T benchmarks—MFNET, PST900, and FMB—the framework achieves state-of-the-art performance, significantly improving segmentation accuracy and robustness under challenging conditions such as low illumination, overexposure, and adverse weather. Moreover, it establishes a scalable new paradigm for multimodal open-vocabulary segmentation.

📝 Abstract

The recent Segment Anything Model (SAM) demonstrates strong instance segmentation performance across various downstream tasks. However, SAM is trained solely on RGB data, which limits its direct applicability to RGB-thermal (RGB-T) semantic segmentation. Because RGB-T provides a robust solution for scene understanding under adverse weather and lighting conditions, such as low light and overexposure, we propose a novel framework, SARTM, which customizes the powerful SAM for RGB-T semantic segmentation. Our key idea is to unleash the potential of SAM while introducing semantic understanding modules for RGB-T data pairs. Specifically, our framework first fine-tunes the original SAM by adding extra LoRA layers, aiming to preserve SAM's strong generalization and segmentation capabilities for downstream tasks. Second, we introduce language information as guidance for training SARTM. To address cross-modal inconsistencies, we introduce a Cross-Modal Knowledge Distillation (CMKD) module that achieves modality adaptation while maintaining SAM's generalization capabilities. This semantic module minimizes modality gaps and alleviates semantic ambiguity, facilitating the combination of any modality under any visual conditions. Furthermore, we enhance segmentation performance by adjusting the segmentation head of SAM and incorporating an auxiliary semantic segmentation head, which integrates multi-scale features for effective fusion. Extensive experiments are conducted on three multi-modal RGB-T semantic segmentation benchmarks: MFNET, PST900, and FMB. Both quantitative and qualitative results consistently demonstrate that the proposed SARTM significantly outperforms state-of-the-art approaches across a variety of conditions.
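The abstract mentions adding extra LoRA layers to SAM without giving details. As a rough illustration of the general LoRA idea only (not the authors' implementation), the sketch below keeps a pretrained weight frozen and adds a trainable low-rank update. The class name `LoRALinear` and the `rank`, `alpha`, and `seed` parameters are hypothetical choices for this example.

```python
import numpy as np


class LoRALinear:
    """Linear layer with a frozen weight W and a low-rank update (alpha / rank) * B @ A.

    Illustrative sketch of generic LoRA; names and defaults are assumptions,
    not taken from the SARTM paper.
    """

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                                  # frozen pretrained weight
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))   # trainable down-projection
        self.B = np.zeros((out_dim, rank))                    # trainable up-projection, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        # x has shape (batch, in_dim); the second term is the low-rank correction.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_parameters(self):
        # Only A and B would be updated during fine-tuning.
        return self.A.size + self.B.size
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, which is one common way to preserve a pretrained model's behavior at the start of fine-tuning.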

Problem

Research questions and friction points this paper is trying to address.

Adapts SAM for RGB-thermal semantic segmentation

Minimizes modality gaps via cross-modal knowledge distillation

Enhances segmentation with multi-scale feature fusion
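The page does not describe how the multi-scale features are fused. A minimal sketch of one common approach, assuming feature maps whose resolutions differ by integer factors, is to upsample every map to the finest resolution and concatenate along the channel axis. The function names `upsample_nearest` and `fuse_multiscale` are hypothetical, and the paper's actual fusion may differ.

```python
import numpy as np


def upsample_nearest(feat, factor):
    """Nearest-neighbor upsampling of a (C, H, W) feature map by an integer factor."""
    return feat.repeat(factor, axis=1).repeat(factor, axis=2)


def fuse_multiscale(feats):
    """Upsample each (C_i, H_i, W_i) map to the resolution of feats[0] and concatenate channels.

    Assumes feats[0] is the finest map and every H_i divides feats[0].shape[1].
    """
    target_h = feats[0].shape[1]
    upsampled = []
    for f in feats:
        factor = target_h // f.shape[1]
        upsampled.append(upsample_nearest(f, factor))
    return np.concatenate(upsampled, axis=0)
```

A segmentation head could then apply a per-pixel classifier to the concatenated channels; that step is left out here because the source gives no details about it.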

Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuning SAM with extra LoRA layers

Using language guidance for training

Cross-Modal Knowledge Distillation module
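The exact form of the CMKD module is not given on this page. One plausible reading of language-guided distillation, sketched below purely as an assumption, is to score RGB and thermal features against shared class text embeddings and penalize the KL divergence between the two resulting class distributions. The function `text_guided_kd_loss`, the `temperature` value, and the choice of RGB as the reference distribution are all hypothetical.

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def l2_normalize(v, axis=-1, eps=1e-8):
    return v / (np.linalg.norm(v, axis=axis, keepdims=True) + eps)


def text_guided_kd_loss(rgb_feats, thermal_feats, text_embeds, temperature=0.1):
    """Mean KL(p_rgb || p_thermal), where each p is a softmax over cosine
    similarities between features (N, D) and class text embeddings (K, D).

    Assumed formulation for illustration; not the loss from the paper.
    """
    t = l2_normalize(text_embeds)
    p_rgb = softmax(l2_normalize(rgb_feats) @ t.T / temperature)
    p_th = softmax(l2_normalize(thermal_feats) @ t.T / temperature)
    eps = 1e-12  # avoids log(0)
    kl = np.sum(p_rgb * (np.log(p_rgb + eps) - np.log(p_th + eps)), axis=-1)
    return float(kl.mean())
```

The loss is zero when both modalities induce the same class distribution and grows as they disagree, which matches the stated goal of reducing the modality gap in a shared semantic space.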

🔎 Similar Papers

MSCoTDet: Language-driven Multi-modal Fusion for Improved Multispectral Pedestrian Detection