TS-VLM: Text-Guided SoftSort Pooling for Vision-Language Models in Multi-View Driving Reasoning

📅 2025-05-19

📈 Citations: 0

✨ Influential: 0

career value

217K/year

🤖 AI Summary

To address the high computational overhead, inefficient multi-view fusion, and poor real-time deployability of existing vision-language models (VLMs) in autonomous driving, this paper proposes TS-VLM—a lightweight multi-view VLM. Its core innovation is the Text-Guided SoftSort Pooling (TGSSP) module, the first to enable semantic-driven, query-adaptive dynamic ranking and fusion across views—replacing computationally expensive attention mechanisms. Complemented by a lightweight architecture, cross-view semantic alignment, and dynamic weighted fusion, TS-VLM achieves only 20.1M parameters and reduces computational cost by 90%. On the DriveLM benchmark, it attains BLEU-4 = 56.82 and CIDEr = 3.39—substantially outperforming prior methods—while enabling real-time inference on embedded automotive platforms.

Technology Category

Application Category

📝 Abstract

Vision-Language Models (VLMs) have shown remarkable potential in advancing autonomous driving by leveraging multi-modal fusion in order to enhance scene perception, reasoning, and decision-making. Despite their potential, existing models suffer from computational overhead and inefficient integration of multi-view sensor data that make them impractical for real-time deployment in safety-critical autonomous driving applications. To address these shortcomings, this paper is devoted to designing a lightweight VLM called TS-VLM, which incorporates a novel Text-Guided SoftSort Pooling (TGSSP) module. By resorting to semantics of the input queries, TGSSP ranks and fuses visual features from multiple views, enabling dynamic and query-aware multi-view aggregation without reliance on costly attention mechanisms. This design ensures the query-adaptive prioritization of semantically related views, which leads to improved contextual accuracy in multi-view reasoning for autonomous driving. Extensive evaluations on the DriveLM benchmark demonstrate that, on the one hand, TS-VLM outperforms state-of-the-art models with a BLEU-4 score of 56.82, METEOR of 41.91, ROUGE-L of 74.64, and CIDEr of 3.39. On the other hand, TS-VLM reduces computational cost by up to 90%, where the smallest version contains only 20.1 million parameters, making it more practical for real-time deployment in autonomous vehicles.

Problem

Research questions and friction points this paper is trying to address.

Inefficient multi-view sensor data integration in autonomous driving VLMs

High computational overhead in existing vision-language models

Need for real-time deployment in safety-critical driving applications

Innovation

Methods, ideas, or system contributions that make the work stand out.

Text-Guided SoftSort Pooling for feature fusion

Lightweight VLM with 20.1 million parameters

Query-aware multi-view aggregation without attention

🔎 Similar Papers

No similar papers found.