LaVida Drive: Vision-Text Interaction VLM for Autonomous Driving with Token Selection, Recovery and Enhancement

📅 2024-11-20
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing vision-language models (VLMs) struggle with visual question answering (VQA) for autonomous driving: fusing high-resolution spatiotemporal information in dynamic scenes is difficult, computational overhead is excessive, and fine-grained details are lost, which degrades temporal consistency and limits decision-making capability. To address these challenges, we propose an efficient VLM framework tailored for autonomous driving. Our method introduces a query-aware visual token selection mechanism that dynamically preserves critical spatial details, and a spatial-temporal token recovery and enhancement module that jointly enables high-resolution spatial modeling and low-resolution temporal motion capture. Leveraging semantic-alignment-driven multi-scale spatiotemporal feature compression and reconstruction, our approach significantly reduces the number of visual tokens while improving inference efficiency. Extensive experiments on multiple autonomous driving VQA benchmarks demonstrate state-of-the-art performance, validating the framework's dual advantages in accuracy and efficiency.
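
The summary describes selecting visual tokens by their semantic alignment with the text query. Below is a minimal sketch of what such a selection step could look like; the cosine-similarity scoring rule, the `keep_ratio` default, and all tensor shapes are illustrative assumptions, not the paper's exact design.

```python
# Illustrative sketch of query-aware visual token selection (assumptions:
# the scoring rule, top-k ratio, and tensor shapes are ours, not the paper's).
import torch
import torch.nn.functional as F

def select_query_relevant_tokens(visual_tokens, query_embedding, keep_ratio=0.25):
    """Keep the visual tokens most semantically aligned with the text query.

    visual_tokens:   (B, N, D) high-resolution patch tokens
    query_embedding: (B, D)    pooled text-query embedding
    keep_ratio:      fraction of tokens to retain (hypothetical default)
    """
    B, N, D = visual_tokens.shape
    # Score each visual token by cosine similarity to the pooled query.
    scores = F.cosine_similarity(
        visual_tokens, query_embedding.unsqueeze(1).expand(B, N, D), dim=-1
    )  # (B, N)
    k = max(1, int(N * keep_ratio))
    top_scores, top_idx = scores.topk(k, dim=1)  # indices of the k best-aligned tokens
    # Gather the selected tokens, preserving batch structure.
    selected = torch.gather(
        visual_tokens, 1, top_idx.unsqueeze(-1).expand(B, k, D)
    )  # (B, k, D)
    return selected, top_idx
```

Returning `top_idx` alongside the selected tokens matters for a downstream recovery step, since any module that re-injects temporal context needs to know which spatial positions survived selection.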

📝 Abstract
Recent advancements in Visual Language Models (VLMs) have made them crucial for visual question answering (VQA) in autonomous driving, enabling natural human-vehicle interactions. However, existing methods often struggle in dynamic driving environments, as they usually focus on static images or videos and rely on downsampling to manage computational costs. This loses critical details and makes it difficult to integrate spatial and temporal information effectively, undermining the fine-grained perception and temporal coherence essential for decision-making. To tackle these challenges, we introduce LaVida Drive, a novel and efficient VQA framework for autonomous driving. LaVida Drive seamlessly integrates temporal data while maintaining high-resolution inputs for detailed visual perception. It optimizes spatial processing by retaining high-resolution data for intricate details and using lower-resolution inputs for temporal analysis to focus on motion-related features, thereby boosting computational efficiency. The core of LaVida Drive consists of two modules: the Query-aware Token Selection module and the Spatial-Temporal Token Recovery and Enhancement module. The former dynamically selects the most relevant visual tokens based on semantic alignment with the input query, reducing the token count from the high-resolution spatial input. The latter ensures smooth and coherent interactions between spatial and temporal information, preserving contextual continuity across frames. Extensive experiments on various autonomous driving question-answering benchmarks show that LaVida Drive significantly reduces visual tokens, enhances efficiency, and improves overall performance.
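
The abstract's dual-resolution design (high-resolution input for spatial detail, low-resolution frames for motion) can be sketched as below. The patch embedders, input sizes, and fusion by cross-attention are illustrative assumptions, not LaVida Drive's actual layers.

```python
# Minimal sketch of the dual-resolution idea described above: a high-resolution
# current frame for fine spatial detail plus a low-resolution clip for motion.
# All module choices (patch embedders, input sizes, cross-attention fusion)
# are illustrative assumptions, not LaVida Drive's exact architecture.
import torch
import torch.nn as nn

class DualResolutionEncoder(nn.Module):
    def __init__(self, dim=256, patch=16, heads=8):
        super().__init__()
        self.spatial_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.temporal_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        # Temporal tokens enrich the selected spatial tokens via cross-attention.
        self.fuse = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, current_frame, clip):
        """current_frame: (B, 3, 448, 448) high-res; clip: (B, T, 3, 224, 224) low-res."""
        B, T = clip.shape[0], clip.shape[1]
        # High-res frame -> many fine-grained spatial tokens.
        spatial = self.spatial_embed(current_frame).flatten(2).transpose(1, 2)  # (B, Ns, D)
        # Low-res clip -> fewer tokens per frame, capturing motion context.
        frames = clip.flatten(0, 1)                                             # (B*T, 3, H, W)
        temporal = self.temporal_embed(frames).flatten(2).transpose(1, 2)       # (B*T, Nt, D)
        temporal = temporal.reshape(B, -1, spatial.shape[-1])                   # (B, T*Nt, D)
        # Spatial tokens query the motion context across all frames.
        fused, _ = self.fuse(spatial, temporal, temporal)
        return spatial + fused  # residual keeps the fine spatial detail intact
```

The residual connection reflects the stated goal of preserving high-resolution detail: temporal context is added on top of the spatial tokens rather than replacing them.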
Problem

Research questions and friction points this paper is trying to address.

Enhance visual question answering in autonomous driving
Integrate temporal and spatial data efficiently
Reduce computational cost while maintaining detail
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic token selection
Spatial-temporal integration
High-resolution detail preservation
Siwen Jiao
National University of Singapore, Agency for Science, Technology and Research, Singapore
Yangyi Fang
Tsinghua University
Baoyun Peng
Academy of Military Science
Multimodal understanding · Autonomous driving · Knowledge Graph · Natural Language Processing
Wangqun Chen
Advanced Institute of Big Data, Beijing
B. Veeravalli
National University of Singapore