Seeing Roads Through Words: A Language-Guided Framework for RGB-T Driving Scene Segmentation

📅 2026-02-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of insufficient robustness in semantic segmentation for autonomous driving under complex lighting and shadow conditions by proposing the CLARITY framework. CLARITY leverages vision-language model (VLM) priors to dynamically adjust the fusion weights between the RGB and thermal modalities based on scene illumination, while a language-guided mechanism preserves the semantics of dark-colored objects that existing methods often discard as noise. Additionally, a multi-scale structural consistency decoder is introduced to enhance boundary accuracy for small objects. Evaluated on the MFNet dataset, CLARITY achieves state-of-the-art performance with 62.3% mIoU and 77.5% mAcc, setting a new benchmark for multimodal semantic segmentation in adverse lighting scenarios.
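To make the illumination-conditioned weighting concrete, the sketch below shows one way such a gate could be written in PyTorch, assuming the dynamic fusion reduces to per-modality weights predicted from a VLM-derived scene prior. All names here (`FusionGate`, `illum_prior`, the feature dimensions) are illustrative assumptions, not taken from the CLARITY implementation.

```python
# Hypothetical sketch of illumination-conditioned RGB-T fusion: a small head
# maps a VLM-derived scene prior to per-modality weights, so dark scenes can
# shift weight toward the thermal stream. Names and shapes are assumptions.
import torch
import torch.nn as nn


class FusionGate(nn.Module):
    def __init__(self, feat_dim: int, prior_dim: int):
        super().__init__()
        # Maps the illumination prior to one logit per modality (RGB, thermal).
        self.gate = nn.Sequential(
            nn.Linear(prior_dim, feat_dim),
            nn.ReLU(inplace=True),
            nn.Linear(feat_dim, 2),
        )

    def forward(self, rgb_feat, thermal_feat, illum_prior):
        # rgb_feat, thermal_feat: (B, C, H, W); illum_prior: (B, prior_dim)
        w = torch.softmax(self.gate(illum_prior), dim=-1)   # (B, 2)
        w_rgb = w[:, 0].view(-1, 1, 1, 1)
        w_thr = w[:, 1].view(-1, 1, 1, 1)
        # Convex combination of the two modality feature maps.
        return w_rgb * rgb_feat + w_thr * thermal_feat


# Usage with dummy tensors (a CLIP-style scene embedding as the prior).
gate = FusionGate(feat_dim=64, prior_dim=512)
rgb = torch.randn(2, 64, 120, 160)
thr = torch.randn(2, 64, 120, 160)
prior = torch.randn(2, 512)
fused = gate(rgb, thr, prior)   # (2, 64, 120, 160)
```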

📝 Abstract
Robust semantic segmentation of road scenes under adverse illumination and shadow conditions remains a core challenge for autonomous driving. RGB-Thermal fusion is a standard approach, yet existing methods apply static fusion strategies uniformly across all conditions, allowing modality-specific noise to propagate throughout the network. We therefore propose CLARITY, a framework that dynamically adapts its fusion strategy to the detected scene condition. Guided by vision-language model (VLM) priors, the network modulates each modality's contribution according to the illumination state, rather than applying a fixed fusion policy, while leveraging object embeddings for segmentation. We further introduce two mechanisms: a language-guided module that preserves valid dark-object semantics that prior noise-suppression methods incorrectly discard, and a hierarchical decoder that enforces structural consistency across scales to sharpen boundaries on thin objects. Experiments on the MFNet dataset demonstrate that CLARITY establishes a new state-of-the-art (SOTA), achieving 62.3% mIoU and 77.5% mAcc.
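The structural-consistency idea behind the hierarchical decoder can be illustrated with a small cross-scale penalty. The formulation below (a KL divergence between downsampled fine-scale logits and coarse-scale logits) is an assumption made for illustration, not necessarily the authors' exact objective, and the function and tensor names are hypothetical.

```python
# Minimal sketch of a cross-scale consistency term: fine-scale predictions,
# once downsampled, should agree with the coarser-scale predictions, which
# discourages boundary noise on thin objects. Loss form is an assumption.
import torch
import torch.nn.functional as F


def scale_consistency_loss(fine_logits, coarse_logits):
    # fine_logits: (B, K, H, W); coarse_logits: (B, K, H/2, W/2)
    fine_down = F.interpolate(
        fine_logits, size=coarse_logits.shape[-2:],
        mode="bilinear", align_corners=False,
    )
    log_p_fine = F.log_softmax(fine_down, dim=1)
    p_coarse = F.softmax(coarse_logits, dim=1)
    # KL(coarse || fine): penalize fine-scale outputs that contradict the
    # coarser structural prediction.
    return F.kl_div(log_p_fine, p_coarse, reduction="batchmean")


# Usage with dummy logits (9 classes, as in MFNet-style labels).
fine = torch.randn(2, 9, 240, 320)
coarse = torch.randn(2, 9, 120, 160)
loss = scale_consistency_loss(fine, coarse)
```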
Problem

Research questions and friction points this paper is trying to address.

RGB-T segmentation
adverse illumination
semantic segmentation
autonomous driving
modality fusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

dynamic fusion
vision-language model
RGB-T segmentation
hierarchical decoder
dark-object semantics
Ruturaj Reddy
School of Information Technology, Monash University Malaysia, Malaysia
Hrishav Bakul Barua
School of Information Technology, Monash University Malaysia, Malaysia
Junn Yong Loo
School of Information Technology, Monash University Malaysia, Malaysia
Thanh Thi Nguyen
Associate Professor, Monash University
Artificial Intelligence · Data Science · Cybersecurity · Reinforcement Learning · Multi-Agent Systems
Ganesh Krishnasamy
Monash University Malaysia
Machine learning · computer vision · deep learning