🤖 AI Summary
Vision-language models (VLMs) in autonomous driving frequently generate hallucinated textual descriptions inconsistent with input images; detecting such hallucinations is challenging due to the absence of ground-truth annotations and restricted access to model internals. Method: We propose a reference-free, model-agnostic, self-contained low-rank hallucination assessment method. For the first time, we map descriptions generated by multiple VLMs into a sentence-embedding matrix and apply low-rank–sparse decomposition to isolate semantic consensus (the low-rank component) from individual model biases (the sparse residual). The residual norm serves as a quantitative hallucination metric for ranking and filtering. Results: On NuScenes, our method achieves 87% accuracy in identifying non-hallucinated descriptions, outperforming baselines by 19%, while accelerating inference by 51–67%. It also exhibits strong agreement with human judgments.
📝 Abstract
Vision Language Models (VLMs) are increasingly used in autonomous driving to help understand traffic scenes, but they sometimes produce hallucinations: false details not grounded in the visual input. Detecting and mitigating hallucinations is challenging when ground-truth references are unavailable and model internals are inaccessible. This paper proposes a novel self-contained low-rank approach that automatically ranks candidate captions generated by multiple VLMs according to their hallucination levels, using only the captions themselves, without external references or model access. We construct a sentence-embedding matrix, decompose it into a low-rank consensus component and a sparse residual, and rank captions by residual magnitude, selecting the caption with the smallest residual as the most hallucination-free. Experiments on the NuScenes dataset demonstrate that our approach achieves 87% selection accuracy in identifying hallucination-free captions, a 19% improvement over the unfiltered baseline and a 6–10% improvement over the multi-agent debate method. The ranking induced by sparse-residual magnitudes correlates strongly with human judgments of hallucination, validating our scoring mechanism. Additionally, our method is easily parallelized and reduces inference time by 51–67% compared to debate approaches, making it practical for real-time autonomous driving applications.
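The decomposition described above can be sketched with a standard robust-PCA routine. The following is a minimal illustration, not the paper's actual implementation: it uses Principal Component Pursuit solved by inexact ALM to split a toy embedding matrix (one row per caption; the embedding model, dimensions, and synthetic data are all assumptions) into a low-rank consensus part and a sparse residual, then scores each caption by its residual row norm.

```python
import numpy as np

def soft_threshold(X, tau):
    """Elementwise shrinkage operator used for the sparse term."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svd_threshold(X, tau):
    """Singular-value shrinkage used for the low-rank term."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(soft_threshold(s, tau)) @ Vt

def rpca(M, tol=1e-7, max_iter=500):
    """Principal Component Pursuit via inexact ALM: M ~ L + S."""
    m, n = M.shape
    lam = 1.0 / np.sqrt(max(m, n))          # standard PCP weight
    mu = m * n / (4.0 * np.abs(M).sum())    # common step-size heuristic
    L = np.zeros_like(M)
    S = np.zeros_like(M)
    Y = np.zeros_like(M)                    # Lagrange multiplier
    for _ in range(max_iter):
        L = svd_threshold(M - S + Y / mu, 1.0 / mu)
        S = soft_threshold(M - L + Y / mu, lam / mu)
        R = M - L - S
        Y += mu * R
        if np.linalg.norm(R) <= tol * np.linalg.norm(M):
            break
    return L, S

# Toy stand-in for sentence embeddings of 5 captions of one scene:
# rows 0-2 and 4 agree (consensus); row 3 is a "hallucinated" outlier.
rng = np.random.default_rng(0)
consensus = rng.normal(size=(1, 16))
E = np.repeat(consensus, 5, axis=0) + 0.01 * rng.normal(size=(5, 16))
E[3] += rng.normal(scale=2.0, size=16)

L, S = rpca(E)
scores = np.linalg.norm(S, axis=1)   # per-caption sparse-residual norm
best = int(np.argmin(scores))        # lowest residual = least hallucinated
```

In this sketch the outlier row receives by far the largest residual norm, so ranking by `scores` surfaces the consensus captions first; each caption can be scored independently once the decomposition is done, which is what makes the approach easy to parallelize.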