Strategic Fusion of Vision Language Models: Shapley-Credited Context-Aware Dawid-Skene for Multi-Label Tasks in Autonomous Driving

📅 2025-10-01
🤖 AI Summary
Large vision-language models (VLMs) in autonomous driving suffer from hallucination and low reliability in multi-label recognition tasks. Method: We propose a game-theoretic multi-model fusion framework integrating Shapley-value-based credit assignment, context-aware Dawid-Skene label modeling, and consensus-guided aggregation. Our pipeline combines LoRA-finetuned heterogeneous VLMs, YOLOv11 + BoT-SORT object tracking, vehicle kinematic constraints, and chain-of-thought prompting to generate high-quality pseudo-labels; at inference, we employ Shapley-credited Bayesian fusion. Results: Experiments demonstrate substantial improvements over the best single model: a 23% reduction in Hamming distance, and 55% and 47% gains in Macro-F1 and Micro-F1, respectively. The framework significantly enhances accuracy, interpretability, and robustness of multi-label understanding in autonomous driving scenarios.

📝 Abstract
Large vision-language models (VLMs) are increasingly used in autonomous-vehicle (AV) stacks, but hallucination limits their reliability in safety-critical pipelines. We present Shapley-credited Context-Aware Dawid-Skene with Agreement, a game-theoretic fusion method for multi-label understanding of ego-view dashcam video. It learns per-model, per-label, context-conditioned reliabilities from labelled history and, at inference, converts each model's report into an agreement-guardrailed log-likelihood ratio that is combined with a contextual prior and a public reputation state updated via Shapley-based team credit. The result is calibrated, thresholdable posteriors that (i) amplify agreement among reliable models, (ii) preserve uniquely correct single-model signals, and (iii) adapt to drift. To specialise general VLMs, we curate 1,000 real-world dashcam clips with structured annotations (scene description, manoeuvre recommendation, rationale) via an automatic pipeline that fuses HDD ground truth, vehicle kinematics, and YOLOv11 + BoT-SORT tracking, guided by a three-step chain-of-thought prompt; three heterogeneous VLMs are then fine-tuned with LoRA. We evaluate with Hamming distance, Micro- and Macro-F1, and average per-video latency. Empirically, the proposed method achieves a 23% reduction in Hamming distance, a 55% improvement in Macro-F1, and a 47% improvement in Micro-F1 compared with the best single model, supporting VLM fusion as a calibrated, interpretable, and robust decision-support component for AV pipelines.
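The "agreement-guardrailed log-likelihood ratio" can be sketched as a damp-and-cap rule applied to each model's raw contribution before fusion. The specific scaling factor and cap below are hypothetical stand-ins; only the idea, i.e. attenuating a lone dissenter's influence without silencing it and bounding any single model's contribution, is taken from the abstract.

```python
def guardrailed_llr(llr, agree_frac, cap=3.0):
    """Guardrail one model's log-likelihood-ratio contribution.

    llr:        raw log-likelihood ratio from this model's report
    agree_frac: fraction of peer models reporting the same value, in [0, 1]
    cap:        maximum absolute contribution (hypothetical default)
    """
    # Damp lone dissent rather than silence it: full weight under
    # unanimous agreement, half weight when all peers disagree.
    scaled = llr * (0.5 + 0.5 * agree_frac)
    # Cap the magnitude so no single model can dominate the posterior.
    return max(-cap, min(cap, scaled))
```

This preserves property (ii) from the abstract: a uniquely correct single-model signal still moves the posterior, just with reduced and bounded weight.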
Problem

Research questions and friction points this paper is trying to address.

Reducing VLM hallucinations in autonomous driving safety pipelines
Calibrating multi-label fusion for ego-view dashcam video understanding
Enabling adaptive model agreement while preserving unique correct signals
Innovation

Methods, ideas, or system contributions that make the work stand out.

Game-theoretic fusion method for multi-label video understanding
Learns context-conditioned reliabilities from labeled history
Converts model reports into agreement-guardrailed likelihood ratios
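For the small ensembles used here (three VLMs), the Shapley-based team credit that drives the reputation state can be computed exactly by enumerating coalitions. The `value` function is an assumption: any coalition score works, e.g. fused-prediction accuracy on a labelled batch; the paper's exact characteristic function is not reproduced here.

```python
from itertools import combinations
from math import factorial

def shapley_credit(models, value):
    """Exact Shapley values for a small model ensemble.

    models: list of model identifiers
    value:  callable mapping a frozenset coalition -> team score
            (e.g. fused-prediction accuracy on a labelled batch)
    """
    n = len(models)
    credit = {m: 0.0 for m in models}
    for m in models:
        others = [x for x in models if x != m]
        for k in range(len(others) + 1):
            for subset in combinations(others, k):
                S = frozenset(subset)
                # Standard Shapley weight for a coalition of size k
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                # Marginal contribution of model m to coalition S
                credit[m] += weight * (value(S | {m}) - value(S))
    return credit
```

Exact enumeration costs O(2^n) coalition evaluations, which is trivial for three models; larger ensembles would need Monte Carlo sampling of permutations instead.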
Yuxiang Feng
Centre for Transport Engineering and Modelling, Department of Civil and Environmental Engineering, Imperial College London, London SW7 2AZ, U.K.
Keyang Zhang
Centre for Transport Engineering and Modelling, Department of Civil and Environmental Engineering, Imperial College London, London SW7 2AZ, U.K.
Hassane Ouchouid
ELM Europe, One Canada Square, Canary Wharf, London, E14 5AB, U.K.
Ashwil Kaniamparambil
ELM Europe, One Canada Square, Canary Wharf, London, E14 5AB, U.K.
Ioannis Souflas
ELM Europe, One Canada Square, Canary Wharf, London, E14 5AB, U.K.
Panagiotis Angeloudis
Professor of Transport Systems & Logistics, Imperial College London
Transport Systems · Autonomous Vehicles · Freight · Maritime Transport · Network Robustness