Vision-Language Feature Alignment for Road Anomaly Segmentation

📅 2026-03-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing road anomaly segmentation methods rely on pixel-level statistics and consequently suffer from high false-positive rates in semantically normal background regions (such as sky or vegetation) and low recall for out-of-distribution anomalies, posing safety risks to autonomous driving. To overcome these limitations, the authors propose VL-Anomaly, a framework built on a vision-language alignment mechanism. Specifically, it uses prompt learning to align visual features extracted by Mask2Former with CLIP text embeddings of known categories. During inference, the method fuses text-guided similarity, CLIP image-text similarity, and detector confidence in a multi-source decision process. This substantially suppresses background false alarms while improving recall of out-of-distribution anomalies, achieving state-of-the-art performance on the RoadAnomaly, SMIYC, and Fishyscapes benchmarks.
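The alignment idea described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes region-level visual features already projected into the CLIP text-embedding space, and the function names and temperature value are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Normalize rows to unit length so dot products become cosine similarity."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def anomaly_from_text_similarity(visual_feats, text_embeds, temperature=0.07):
    """Score each region by how poorly it matches any known-class text embedding.

    visual_feats: (R, D) region features (assumed already aligned to CLIP space)
    text_embeds:  (K, D) text embeddings of the known categories
    Returns (R,) anomaly scores in [0, 1]: high when no known class fits well.
    """
    v = l2_normalize(visual_feats)
    t = l2_normalize(text_embeds)
    sim = v @ t.T                              # (R, K) cosine similarities
    probs = np.exp(sim / temperature)          # temperature-sharpened softmax
    probs /= probs.sum(axis=1, keepdims=True)
    return 1.0 - probs.max(axis=1)             # confident known match -> low score
```

A region whose feature matches a known-class embedding (e.g. "vegetation") gets a near-zero anomaly score, which is how background false alarms get suppressed; a region matching no known class keeps a high score.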

📝 Abstract
Safe autonomous systems in complex environments require robust road anomaly segmentation to identify unknown obstacles. However, existing approaches often rely on pixel-level statistics to determine whether a region appears anomalous. This reliance leads to high false-positive rates on semantically normal background regions such as sky or vegetation, and poor recall of true out-of-distribution (OOD) instances, thereby posing safety risks for robotic perception and decision-making. To address these challenges, we propose VL-Anomaly, a vision-language anomaly segmentation framework that incorporates semantic priors from pre-trained Vision-Language Models (VLMs). Specifically, we design a prompt learning-driven alignment module that adapts Mask2Former's visual features to CLIP text embeddings of known categories, effectively suppressing spurious anomaly responses in background regions. At inference time, we further introduce a multi-source inference strategy that integrates text-guided similarity, CLIP-based image-text similarity, and detector confidence, enabling more reliable anomaly prediction by leveraging complementary information sources. Extensive experiments demonstrate that VL-Anomaly achieves state-of-the-art performance on benchmark datasets including RoadAnomaly, SMIYC, and Fishyscapes. Code is released at https://github.com/NickHezhuolin/VL-aligner-Road-anomaly-segment.
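The multi-source inference strategy can be sketched as a simple score combination. The abstract names the three cues but not the fusion rule, so the weighted sum and the specific weights below are assumptions for illustration only:

```python
def fuse_anomaly_scores(text_sim_score, img_text_score, det_conf,
                        weights=(0.4, 0.3, 0.3)):
    """Hypothetical fusion of the three cues named in the abstract.

    text_sim_score: anomaly score from text-guided similarity, in [0, 1]
    img_text_score: anomaly score from CLIP image-text similarity, in [0, 1]
    det_conf:       detector confidence for known classes, in [0, 1]
    The detector is confident on known classes, so its confidence is
    inverted before fusing. Weights are illustrative, not from the paper.
    """
    w_text, w_clip, w_det = weights
    return (w_text * text_sim_score
            + w_clip * img_text_score
            + w_det * (1.0 - det_conf))
```

For inputs in [0, 1] and weights summing to 1, the fused score stays in [0, 1], and a pixel that all three sources agree is anomalous (high similarity-based scores, low detector confidence) scores higher than a confidently detected known-class pixel.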
Problem

Research questions and friction points this paper is trying to address.

road anomaly segmentation
out-of-distribution detection
false positives
semantic background
autonomous driving safety
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Alignment
Prompt Learning
Anomaly Segmentation
Out-of-Distribution Detection
Multi-source Inference