🤖 AI Summary
To address the high computational overhead, inefficient multi-view fusion, and poor real-time deployability of existing vision-language models (VLMs) in autonomous driving, this paper proposes TS-VLM, a lightweight multi-view VLM. Its core innovation is the Text-Guided SoftSort Pooling (TGSSP) module, which enables semantic-driven, query-adaptive dynamic ranking and fusion across camera views, replacing computationally expensive attention mechanisms. Combined with a lightweight architecture, cross-view semantic alignment, and dynamic weighted fusion, TS-VLM contains as few as 20.1M parameters and reduces computational cost by up to 90%. On the DriveLM benchmark, it attains BLEU-4 = 56.82 and CIDEr = 3.39, substantially outperforming prior methods, making it practical for real-time deployment on embedded automotive platforms.
📝 Abstract
Vision-Language Models (VLMs) have shown remarkable potential in advancing autonomous driving by leveraging multi-modal fusion to enhance scene perception, reasoning, and decision-making. Despite this potential, existing models suffer from computational overhead and inefficient integration of multi-view sensor data, which make them impractical for real-time deployment in safety-critical autonomous driving applications. To address these shortcomings, this paper presents a lightweight VLM called TS-VLM, which incorporates a novel Text-Guided SoftSort Pooling (TGSSP) module. By leveraging the semantics of the input query, TGSSP ranks and fuses visual features from multiple views, enabling dynamic, query-aware multi-view aggregation without reliance on costly attention mechanisms. This design ensures query-adaptive prioritization of semantically relevant views, which improves contextual accuracy in multi-view reasoning for autonomous driving. Extensive evaluations on the DriveLM benchmark demonstrate that, on the one hand, TS-VLM outperforms state-of-the-art models with a BLEU-4 score of 56.82, METEOR of 41.91, ROUGE-L of 74.64, and CIDEr of 3.39. On the other hand, TS-VLM reduces computational cost by up to 90%, with the smallest version containing only 20.1 million parameters, making it more practical for real-time deployment in autonomous vehicles.
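The query-guided ranking-and-fusion idea behind TGSSP can be sketched roughly as follows. This is a minimal illustrative reconstruction, not the paper's implementation: the cosine-similarity scoring, the rank-decay fusion weights, and all function names are assumptions; only the use of a SoftSort-style soft permutation to rank views by relevance to the text query is taken from the description above.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def softsort(scores, tau=0.1):
    # SoftSort-style differentiable sorting: row i of the returned matrix
    # is a soft indicator of which entry of `scores` is the i-th largest.
    sorted_desc = np.sort(scores)[::-1]
    pairwise = -np.abs(sorted_desc[:, None] - scores[None, :])
    return softmax(pairwise / tau, axis=-1)

def tgssp_fuse(view_feats, query_emb, tau=0.1):
    """Hypothetical sketch of text-guided SoftSort pooling.

    view_feats: (num_views, dim) per-view visual features.
    query_emb:  (dim,) text query embedding.
    Returns a single fused (dim,) feature, weighting views by their
    soft rank of similarity to the query -- no attention layers needed.
    """
    v = view_feats / np.linalg.norm(view_feats, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    scores = v @ q                                   # relevance per view
    P = softsort(scores, tau)                        # soft permutation
    rank_decay = softmax(-np.arange(len(scores)).astype(float))
    weights = rank_decay @ P                         # query-adaptive weights
    return weights @ view_feats                      # fused feature
```

Because the fusion is a single similarity computation plus a small matrix product, its cost grows linearly in the number of views, in contrast to the quadratic cost of cross-view attention, which is consistent with the efficiency claim above.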