A Comparative Evaluation of Large Vision-Language Models for 2D Object Detection under SOTIF Conditions

📅 2026-01-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses safety risks in autonomous driving arising from perceptual limitations under SOTIF (Safety of the Intended Functionality) scenarios, particularly the performance degradation of conventional 2D detectors in adverse conditions and long-tail traffic situations. Within the SOTIF framework, the work presents the first systematic evaluation of ten state-of-the-art large vision-language models (LVLMs) on the PeSOTIF dataset for 2D object detection, benchmarked against YOLO baselines. Results demonstrate that top-performing LVLMs achieve over 25% higher recall than YOLO in complex natural scenes, exhibiting superior semantic robustness, whereas YOLO maintains better geometric precision under synthetic perturbations. The findings highlight the complementary nature of semantic reasoning and geometric regression approaches, offering a novel paradigm for high-assurance safety validation in autonomous systems.

📝 Abstract
Reliable environmental perception remains one of the main obstacles for safe operation of automated vehicles. Safety of the Intended Functionality (SOTIF) concerns safety risks from perception insufficiencies, particularly under adverse conditions where conventional detectors often falter. While Large Vision-Language Models (LVLMs) demonstrate promising semantic reasoning, their quantitative effectiveness for safety-critical 2D object detection is underexplored. This paper presents a systematic evaluation of ten representative LVLMs using the PeSOTIF dataset, a benchmark specifically curated for long-tail traffic scenarios and environmental degradations. Performance is quantitatively compared against the classical perception approach, a YOLO-based detector. Experimental results reveal a critical trade-off: top-performing LVLMs (e.g., Gemini 3, Doubao) surpass the YOLO baseline in recall by over 25% in complex natural scenarios, exhibiting superior robustness to visual degradation. Conversely, the baseline retains an advantage in geometric precision for synthetic perturbations. These findings highlight the complementary strengths of semantic reasoning versus geometric regression, supporting the use of LVLMs as high-level safety validators in SOTIF-oriented automated driving systems.
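The abstract's headline result hinges on recall: the fraction of ground-truth objects a detector finds, with a detection counted as correct when it overlaps a ground-truth box by at least some IoU threshold. The sketch below shows how such recall is typically scored; the boxes and the 0.5 IoU threshold are illustrative assumptions, not values taken from the paper.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def recall(ground_truth, predictions, iou_thresh=0.5):
    """Fraction of ground-truth boxes matched by some prediction (greedy)."""
    matched = 0
    remaining = list(predictions)
    for gt in ground_truth:
        best = max(remaining, key=lambda p: iou(gt, p), default=None)
        if best is not None and iou(gt, best) >= iou_thresh:
            matched += 1
            remaining.remove(best)  # each prediction may match at most one box
    return matched / len(ground_truth) if ground_truth else 1.0

# Toy scene: two ground-truth objects, one detection that overlaps the first.
gt = [(10, 10, 50, 50), (60, 60, 100, 100)]
pred = [(12, 12, 48, 52)]
print(recall(gt, pred))  # → 0.5
```

Under this scoring, a model with loose but present boxes (the LVLM behavior described for natural scenes) can post high recall even while its geometric precision at stricter IoU thresholds lags a regression-based detector like YOLO.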
Problem

Research questions and friction points this paper is trying to address.

SOTIF
Large Vision-Language Models
2D Object Detection
Environmental Perception
Automated Driving
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Vision-Language Models
SOTIF
2D Object Detection
PeSOTIF dataset
Autonomous Driving Safety
Ji Zhou
Institute of Automotive Engineering, Graz University of Technology, 8010, Graz, Austria
Yilin Ding
Institute of Automotive Engineering, Graz University of Technology, 8010, Graz, Austria
Yongqi Zhao
Institute of Automotive Engineering, Graz University of Technology, 8010, Graz, Austria
Jiachen Xu
University of Vienna
Brain-Computer Interface, Riemannian Geometry, Machine Learning
Arno Eichberger
Graz University of Technology
automated driving, automotive engineering, driver assistance systems, vehicle dynamics, human machine interaction