Vision-Language Feature Alignment for Road Anomaly Segmentation

📅 2026-03-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing road anomaly segmentation methods rely on pixel-level statistics and consequently suffer from high false-positive rates in semantically normal background regions (such as sky or vegetation) and low recall for out-of-distribution anomalies, posing safety risks to autonomous driving. To overcome these limitations, the authors propose VL-Anomaly, a framework built on a vision-language alignment mechanism. Specifically, it uses prompt learning to align visual features extracted by Mask2Former with CLIP text embeddings of known categories. During inference, the method fuses text-guided similarity, CLIP image-text similarity, and detector confidence in a multi-source decision process. This substantially suppresses background false alarms while improving recall of out-of-distribution anomalies, achieving state-of-the-art performance on the RoadAnomaly, SMIYC, and Fishyscapes benchmarks.
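The alignment idea described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes region-level visual features already projected into the CLIP text-embedding space, and the function names and temperature value are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Normalize rows to unit length so dot products become cosine similarity."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def anomaly_from_text_similarity(visual_feats, text_embeds, temperature=0.07):
    """Score each region by how poorly it matches any known-class text embedding.

    visual_feats: (R, D) region features (assumed already aligned to CLIP space)
    text_embeds:  (K, D) text embeddings of the known categories
    Returns (R,) anomaly scores in [0, 1]: high when no known class fits well.
    """
    v = l2_normalize(visual_feats)
    t = l2_normalize(text_embeds)
    sim = v @ t.T                              # (R, K) cosine similarities
    probs = np.exp(sim / temperature)          # temperature-sharpened softmax
    probs /= probs.sum(axis=1, keepdims=True)
    return 1.0 - probs.max(axis=1)             # confident known match -> low score
```

A region whose feature matches a known-class embedding (e.g. "vegetation") gets a near-zero anomaly score, which is how background false alarms get suppressed; a region matching no known class keeps a high score.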

📝 Abstract
Safe autonomous systems in complex environments require robust road anomaly segmentation to identify unknown obstacles. However, existing approaches often rely on pixel-level statistics to determine whether a region appears anomalous. This reliance leads to high false-positive rates on semantically normal background regions such as sky or vegetation, and poor recall of true out-of-distribution (OOD) instances, thereby posing safety risks for robotic perception and decision-making. To address these challenges, we propose VL-Anomaly, a vision-language anomaly segmentation framework that incorporates semantic priors from pre-trained Vision-Language Models (VLMs). Specifically, we design a prompt learning-driven alignment module that adapts Mask2Former's visual features to CLIP text embeddings of known categories, effectively suppressing spurious anomaly responses in background regions. At inference time, we further introduce a multi-source inference strategy that integrates text-guided similarity, CLIP-based image-text similarity, and detector confidence, enabling more reliable anomaly prediction by leveraging complementary information sources. Extensive experiments demonstrate that VL-Anomaly achieves state-of-the-art performance on benchmark datasets including RoadAnomaly, SMIYC, and Fishyscapes. Code is released at https://github.com/NickHezhuolin/VL-aligner-Road-anomaly-segment.
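The multi-source inference strategy can be sketched as a simple score combination. The abstract names the three cues but not the fusion rule, so the weighted sum and the specific weights below are assumptions for illustration only:

```python
def fuse_anomaly_scores(text_sim_score, img_text_score, det_conf,
                        weights=(0.4, 0.3, 0.3)):
    """Hypothetical fusion of the three cues named in the abstract.

    text_sim_score: anomaly score from text-guided similarity, in [0, 1]
    img_text_score: anomaly score from CLIP image-text similarity, in [0, 1]
    det_conf:       detector confidence for known classes, in [0, 1]
    The detector is confident on known classes, so its confidence is
    inverted before fusing. Weights are illustrative, not from the paper.
    """
    w_text, w_clip, w_det = weights
    return (w_text * text_sim_score
            + w_clip * img_text_score
            + w_det * (1.0 - det_conf))
```

For inputs in [0, 1] and weights summing to 1, the fused score stays in [0, 1], and a pixel that all three sources agree is anomalous (high similarity-based scores, low detector confidence) scores higher than a confidently detected known-class pixel.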
Problem

Research questions and friction points this paper is trying to address.

road anomaly segmentation
out-of-distribution detection
false positives
semantic background
autonomous driving safety
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Alignment
Prompt Learning
Anomaly Segmentation
Out-of-Distribution Detection
Multi-source Inference