Unlock the Power of Unlabeled Data in Language Driving Model

📅 2025-03-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the heavy reliance of language-driven vision-language models (VLMs) on large-scale, high-quality annotated data in autonomous driving, this work proposes a semi-supervised learning framework that leverages abundant unlabeled driving scenes. Methodologically, it introduces template-based prompt generation to construct pseudo question-answer pairs and employs self-consistent reasoning to refine these pseudo-labels; a VisionLLM built on InternVL is then jointly fine-tuned on the pseudo-labels together with a small set of real annotations. On the DriveLM benchmark, the approach reaches 44.85% using only 5% labeled data and improves to 54.27% when the unlabeled data is incorporated, approaching the fully supervised upper bound of 60.68%. This is presented as the first empirical demonstration that language-driven driving-scene understanding is feasible under low-resource conditions, pointing toward more data-efficient VLM deployment in real-world autonomous driving.

📝 Abstract
Recent Vision-based Large Language Models (VisionLLMs) for autonomous driving have seen rapid advancements. However, this progress depends heavily on large-scale, high-quality annotated data, which is costly and labor-intensive to obtain. To address this issue, we propose unlocking the value of abundant yet unlabeled data to improve the language-driving model in a semi-supervised learning manner. Specifically, we first introduce a series of template-based prompts to extract scene information, generating questions whose pseudo-answers for the unlabeled data are produced by a model trained with limited labeled data. Next, we propose a Self-Consistency Refinement method to improve the quality of these pseudo-annotations, which are later used for further training. By utilizing a pre-trained VisionLLM (e.g., InternVL), we build a strong Language Driving Model (LDM) for driving scene question-answering, outperforming previous state-of-the-art methods. Extensive experiments on the DriveLM benchmark show that our approach performs well with just 5% labeled data, achieving competitive performance against models trained with full datasets. In particular, our LDM achieves 44.85% performance with limited labeled data, increasing to 54.27% when using unlabeled data, while models trained with the full dataset reach 60.68% on the DriveLM benchmark.
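The pseudo-labeling loop described above can be sketched in a few lines. The sketch below is a hypothetical illustration, not the paper's implementation: `toy_model` stands in for the LDM trained on the limited labeled set (in the paper, a VisionLLM such as InternVL queried on an unlabeled driving scene), and the question templates, sample count, and agreement threshold are assumptions. Self-consistency is approximated here by majority voting over repeated samples, discarding low-agreement answers.

```python
from collections import Counter

# Hypothetical question templates for extracting scene information.
TEMPLATES = [
    "What is the moving status of object <{obj}>?",
    "What actions could the ego vehicle take regarding <{obj}>?",
]

def self_consistent_label(question, sample_fn, n_samples=5, min_agree=0.6):
    """Sample several answers and keep the majority answer only if its
    agreement ratio clears the threshold; otherwise drop the pseudo-label."""
    answers = [sample_fn(question) for _ in range(n_samples)]
    best, count = Counter(answers).most_common(1)[0]
    return best if count / n_samples >= min_agree else None

def build_pseudo_pairs(scene_objects, sample_fn):
    """Build pseudo question-answer pairs for an unlabeled scene."""
    pairs = []
    for obj in scene_objects:
        for tpl in TEMPLATES:
            question = tpl.format(obj=obj)
            answer = self_consistent_label(question, sample_fn)
            if answer is not None:  # low-consistency answers are discarded
                pairs.append((question, answer))
    return pairs

# Toy stand-in for the limited-data model (deterministic, so agreement is 100%).
def toy_model(question):
    return "keep moving forward" if "actions" in question else "moving"

pairs = build_pseudo_pairs(["car_1"], toy_model)
```

The surviving pairs would then be mixed with the small labeled set for further fine-tuning; a stochastic model (e.g., sampling with temperature) would make the agreement filter non-trivial.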
Problem

Research questions and friction points this paper is trying to address.

Reducing dependency on costly annotated data for VisionLLMs.
Utilizing unlabeled data to enhance language-driving models.
Improving model performance with minimal labeled data.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Utilizes unlabeled data via semi-supervised learning.
Employs template-based prompts for pseudo-answer generation.
Implements Self-Consistency Refinement for annotation quality.