A Comparative Evaluation of Large Vision-Language Models for 2D Object Detection under SOTIF Conditions

📅 2026-01-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses safety risks in autonomous driving arising from perceptual limitations under SOTIF (Safety of the Intended Functionality) scenarios, particularly the performance degradation of conventional 2D detectors in adverse conditions and long-tail traffic situations. Within the SOTIF framework, the work presents the first systematic evaluation of ten state-of-the-art large vision-language models (LVLMs) on the PeSOTIF dataset for 2D object detection, benchmarked against YOLO baselines. Results demonstrate that top-performing LVLMs achieve over 25% higher recall than YOLO in complex natural scenes, exhibiting superior semantic robustness, whereas YOLO maintains better geometric precision under synthetic perturbations. The findings highlight the complementary nature of semantic reasoning and geometric regression approaches, offering a novel paradigm for high-assurance safety validation in autonomous systems.

📝 Abstract
Reliable environmental perception remains one of the main obstacles for safe operation of automated vehicles. Safety of the Intended Functionality (SOTIF) concerns safety risks from perception insufficiencies, particularly under adverse conditions where conventional detectors often falter. While Large Vision-Language Models (LVLMs) demonstrate promising semantic reasoning, their quantitative effectiveness for safety-critical 2D object detection is underexplored. This paper presents a systematic evaluation of ten representative LVLMs using the PeSOTIF dataset, a benchmark specifically curated for long-tail traffic scenarios and environmental degradations. Performance is quantitatively compared against the classical perception approach, a YOLO-based detector. Experimental results reveal a critical trade-off: top-performing LVLMs (e.g., Gemini 3, Doubao) surpass the YOLO baseline in recall by over 25% in complex natural scenarios, exhibiting superior robustness to visual degradation. Conversely, the baseline retains an advantage in geometric precision for synthetic perturbations. These findings highlight the complementary strengths of semantic reasoning versus geometric regression, supporting the use of LVLMs as high-level safety validators in SOTIF-oriented automated driving systems.
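The abstract's headline result hinges on recall: the fraction of ground-truth objects a detector finds, with a detection counted as correct when it overlaps a ground-truth box by at least some IoU threshold. The sketch below shows how such recall is typically scored; the boxes and the 0.5 IoU threshold are illustrative assumptions, not values taken from the paper.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def recall(ground_truth, predictions, iou_thresh=0.5):
    """Fraction of ground-truth boxes matched by some prediction (greedy)."""
    matched = 0
    remaining = list(predictions)
    for gt in ground_truth:
        best = max(remaining, key=lambda p: iou(gt, p), default=None)
        if best is not None and iou(gt, best) >= iou_thresh:
            matched += 1
            remaining.remove(best)  # each prediction may match at most one box
    return matched / len(ground_truth) if ground_truth else 1.0

# Toy scene: two ground-truth objects, one detection that overlaps the first.
gt = [(10, 10, 50, 50), (60, 60, 100, 100)]
pred = [(12, 12, 48, 52)]
print(recall(gt, pred))  # → 0.5
```

Under this scoring, a model with loose but present boxes (the LVLM behavior described for natural scenes) can post high recall even while its geometric precision at stricter IoU thresholds lags a regression-based detector like YOLO.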
Problem

Research questions and friction points this paper is trying to address.

SOTIF
Large Vision-Language Models
2D Object Detection
Environmental Perception
Automated Driving
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Vision-Language Models
SOTIF
2D Object Detection
PeSOTIF dataset
Autonomous Driving Safety
Ji Zhou
Institute of Automotive Engineering, Graz University of Technology, 8010, Graz, Austria
Yilin Ding
Institute of Automotive Engineering, Graz University of Technology, 8010, Graz, Austria
Yongqi Zhao
Institute of Automotive Engineering, Graz University of Technology, 8010, Graz, Austria
Jiachen Xu
University of Vienna
Brain-Computer Interface, Riemannian Geometry, Machine Learning
Arno Eichberger
Graz University of Technology
automated driving, automotive engineering, driver assistance systems, vehicle dynamics, human machine interaction