TS-VLM: Text-Guided SoftSort Pooling for Vision-Language Models in Multi-View Driving Reasoning

📅 2025-05-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the high computational overhead, inefficient multi-view fusion, and poor real-time deployability of existing vision-language models (VLMs) in autonomous driving, this paper proposes TS-VLM—a lightweight multi-view VLM. Its core innovation is the Text-Guided SoftSort Pooling (TGSSP) module, the first to enable semantic-driven, query-adaptive dynamic ranking and fusion across views—replacing computationally expensive attention mechanisms. Complemented by a lightweight architecture, cross-view semantic alignment, and dynamic weighted fusion, TS-VLM achieves only 20.1M parameters and reduces computational cost by 90%. On the DriveLM benchmark, it attains BLEU-4 = 56.82 and CIDEr = 3.39—substantially outperforming prior methods—while enabling real-time inference on embedded automotive platforms.

Technology Category

Application Category

📝 Abstract
Vision-Language Models (VLMs) have shown remarkable potential in advancing autonomous driving by leveraging multi-modal fusion in order to enhance scene perception, reasoning, and decision-making. Despite their potential, existing models suffer from computational overhead and inefficient integration of multi-view sensor data that make them impractical for real-time deployment in safety-critical autonomous driving applications. To address these shortcomings, this paper is devoted to designing a lightweight VLM called TS-VLM, which incorporates a novel Text-Guided SoftSort Pooling (TGSSP) module. By resorting to semantics of the input queries, TGSSP ranks and fuses visual features from multiple views, enabling dynamic and query-aware multi-view aggregation without reliance on costly attention mechanisms. This design ensures the query-adaptive prioritization of semantically related views, which leads to improved contextual accuracy in multi-view reasoning for autonomous driving. Extensive evaluations on the DriveLM benchmark demonstrate that, on the one hand, TS-VLM outperforms state-of-the-art models with a BLEU-4 score of 56.82, METEOR of 41.91, ROUGE-L of 74.64, and CIDEr of 3.39. On the other hand, TS-VLM reduces computational cost by up to 90%, where the smallest version contains only 20.1 million parameters, making it more practical for real-time deployment in autonomous vehicles.
Problem

Research questions and friction points this paper is trying to address.

Inefficient multi-view sensor data integration in autonomous driving VLMs
High computational overhead in existing vision-language models
Need for real-time deployment in safety-critical driving applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Text-Guided SoftSort Pooling for feature fusion
Lightweight VLM with 20.1 million parameters
Query-aware multi-view aggregation without attention
🔎 Similar Papers
No similar papers found.
Lihong Chen
Lihong Chen
Western University
H
Hossein Hassani
Department of Electrical and Computer Engineering, Western University, London, ON N6A 3K7, Canada
Soodeh Nikan
Soodeh Nikan
Assistant Professor
LLM/VLMDeep LearningMachine LearningComputer VisionSignal Processing