Efficient Visual Question Answering Pipeline for Autonomous Driving via Scene Region Compression

📅 2026-01-11
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high computational cost and inference latency of existing visual question answering (VQA) models, which hinder their deployment in real-time autonomous driving scenarios where both efficiency and safety are critical. To this end, the authors propose the SRC-Pipeline framework, which introduces, for the first time, a frame-level dynamic visual token compression mechanism tailored for VQA in autonomous driving. The approach applies a scene-region compression strategy to early video frames, aggregating them into a small set of high-level semantic tokens to reduce computational load, while preserving the full visual detail of recent frames to maintain answer accuracy. Experimental results show that the proposed method matches the accuracy of state-of-the-art models while reducing FLOPs by 66%, substantially improving inference efficiency.

📝 Abstract
Autonomous driving increasingly relies on Visual Question Answering (VQA) to enable vehicles to understand complex surroundings by analyzing visual inputs and textual queries. Currently, a paramount concern for VQA in this domain is the stringent requirement for low latency and real-time processing, as delays directly impact real-world safety in this safety-critical application. However, current state-of-the-art VQA models, particularly large vision-language models (VLMs), often prioritize performance over computational efficiency. These models typically process dense patch tokens for every frame, leading to prohibitive computational costs (FLOPs) and significant inference latency, especially with long video sequences. This focus limits their practical deployment in real-time autonomous driving scenarios. To tackle this issue, we propose an efficient VLM framework for autonomous driving VQA tasks, SRC-Pipeline. It learns to compress early frame tokens into a small number of high-level tokens while retaining full patch tokens for recent frames. Experiments on autonomous driving video question answering tasks show that our approach achieves a 66% FLOPs reduction while maintaining comparable performance, enabling VLMs to operate more effectively in real-time, safety-critical autonomous driving settings.
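The core idea of the abstract, compressing early-frame tokens while keeping recent frames at full patch resolution, can be illustrated with a minimal sketch. This is not the paper's implementation: the function name, parameters, and the use of simple mean-pooling (standing in for the paper's learned scene-region compression) are all illustrative assumptions.

```python
import math

def mean_pool(tokens):
    """Average a group of token vectors into one vector."""
    return [sum(col) / len(tokens) for col in zip(*tokens)]

def compress_video_tokens(frames, num_recent=2, tokens_per_early_frame=4):
    """Hypothetical sketch of the SRC-Pipeline idea.

    frames: list of frames (oldest first), each a list of patch-token
            vectors (lists of floats).
    Early frames are aggregated into a few high-level tokens (here by
    mean-pooling contiguous groups of patch tokens); the most recent
    frames keep all of their patch tokens.
    Returns a single flat list of token vectors.
    """
    early, recent = frames[:-num_recent], frames[-num_recent:]
    out = []
    for frame in early:
        # Pool contiguous token groups -- a crude stand-in for the
        # paper's learned scene-region compression.
        group_size = math.ceil(len(frame) / tokens_per_early_frame)
        for i in range(0, len(frame), group_size):
            out.append(mean_pool(frame[i:i + group_size]))
    for frame in recent:
        out.extend(frame)  # recent frames keep full patch-token detail
    return out

# Example: 8 frames of 196 patch tokens each (dim 8).
frames = [[[1.0] * 8 for _ in range(196)] for _ in range(8)]
tokens = compress_video_tokens(frames)
# 6 early frames * 4 pooled tokens + 2 recent frames * 196 tokens = 416
print(len(tokens))  # 416
```

With 8 frames of 196 tokens each, the uncompressed input would be 1568 tokens; the sketch reduces it to 416, which is the kind of token-count (and hence FLOPs) saving the paper targets, though the actual ratio and compression mechanism in SRC-Pipeline are learned rather than fixed.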
Problem

Research questions and friction points this paper is trying to address.

Visual Question Answering
Autonomous Driving
Real-time Processing
Computational Efficiency
Vision-Language Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual Question Answering
Scene Region Compression
Vision-Language Models
Autonomous Driving
Efficient Inference