Bridging the Modality Gap in Roadside LiDAR: A Training-Free Vision-Language Model Framework for Vehicle Classification

📅 2026-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges of fine-grained truck classification using roadside LiDAR, where reliance on manual annotations limits scalability and vision-language models (VLMs) struggle with the modality gap between sparse 3D point clouds and 2D images. The authors propose a training-free framework that transforms occluded and sparse point clouds into depth-encoded 2D visual proxies via a depth-aware image generation pipeline, enabling direct use of off-the-shelf VLMs for few-shot inference. This approach achieves the first zero-fine-tuning application of VLMs to roadside LiDAR-based fine-grained classification, revealing a “semantic anchoring” effect. Furthermore, a VLM-derived label-based cold-start strategy is introduced to bootstrap lightweight supervised models. Evaluated on a real-world dataset of 20 vehicle classes, the method attains over 75% accuracy in classifying 20ft/40ft/53ft container trailers using only 16–30 samples per class, substantially reducing annotation costs.
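The cold-start strategy described above — using VLM-derived labels as pseudo-labels to bootstrap a lightweight supervised model — can be sketched as follows. The nearest-centroid classifier, the toy feature vectors, and the class names are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

class CentroidClassifier:
    """Minimal lightweight model bootstrapped from (possibly noisy)
    VLM-generated pseudo-labels: it stores one feature centroid per
    class and assigns each query to the nearest centroid."""

    def fit(self, feats, pseudo_labels):
        feats = np.asarray(feats, dtype=float)
        labels = np.asarray(pseudo_labels)
        self.classes_ = sorted(set(pseudo_labels))
        # One mean feature vector per pseudo-label class.
        self.centroids_ = np.stack(
            [feats[labels == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, feats):
        feats = np.asarray(feats, dtype=float)
        # Euclidean distance from each sample to each class centroid.
        dists = np.linalg.norm(
            feats[:, None, :] - self.centroids_[None, :, :], axis=-1)
        return [self.classes_[i] for i in dists.argmin(axis=1)]
```

In a real cold-start loop, `feats` would be embeddings of the depth-encoded proxy images and `pseudo_labels` the VLM's few-shot predictions; once enough pseudo-labeled samples accumulate, the lightweight model replaces the VLM for cheap online inference.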

📝 Abstract
Fine-grained truck classification is critical for intelligent transportation systems (ITS), yet current LiDAR-based methods face scalability challenges due to their reliance on supervised deep learning and labor-intensive manual annotation. Vision-Language Models (VLMs) offer promising few-shot generalization, but their application to roadside LiDAR is limited by a modality gap between sparse 3D point clouds and dense 2D imagery. We propose a framework that bridges this gap by adapting off-the-shelf VLMs for fine-grained truck classification without parameter fine-tuning. Our new depth-aware image generation pipeline applies noise removal, spatial and temporal registration, orientation rectification, morphological operations, and anisotropic smoothing to transform sparse, occluded LiDAR scans into depth-encoded 2D visual proxies. Validated on a real-world dataset of 20 vehicle classes, our approach achieves competitive classification accuracy with as few as 16-30 examples per class, offering a scalable alternative to data-intensive supervised baselines. We further observe a "Semantic Anchor" effect: text-based guidance regularizes performance in ultra-low-shot regimes ($k < 4$) but degrades accuracy in higher-shot settings due to semantic mismatch. Furthermore, we demonstrate the efficacy of this framework as a cold-start strategy, using VLM-generated labels to bootstrap lightweight supervised models. Notably, the few-shot VLM-based model achieves a correct classification rate of over 75 percent for specific drayage categories (20ft, 40ft, and 53ft containers) entirely without costly training or fine-tuning, substantially reducing the burden of initial manual labeling and yielding a method of practical use in ITS applications.
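The core of the pipeline — rasterizing a sparse point cloud into a depth-encoded 2D visual proxy and applying a morphological operation to fill gaps — can be sketched roughly as below. The projection geometry, resolution, image size, and 3x3 closing kernel are illustrative assumptions; the paper's full pipeline also includes noise removal, registration, orientation rectification, and anisotropic smoothing, which are omitted here.

```python
import numpy as np

def close3x3(img):
    """Grayscale morphological closing (dilation then erosion) with a
    3x3 structuring element, filling small gaps left by sparse returns."""
    def filt(a, reduce_fn):
        p = np.pad(a, 1, mode='edge')
        stack = [p[i:i + a.shape[0], j:j + a.shape[1]]
                 for i in range(3) for j in range(3)]
        return reduce_fn(np.stack(stack), axis=0)
    return filt(filt(img, np.max), np.min)

def points_to_depth_proxy(points, res=0.1, height=64, width=256):
    """Rasterize a side view of a vehicle point cloud into a 2D image
    whose pixel intensity encodes lateral depth.

    points: (N, 3) array of (x, y, z) in metres; x runs along the
    vehicle length, y is lateral depth toward the sensor, z is up.
    Returns a uint8 image where nearer surfaces are brighter.
    """
    points = np.asarray(points, dtype=float)
    img = np.full((height, width), np.inf, dtype=np.float32)
    cols = np.clip((points[:, 0] / res).astype(int), 0, width - 1)
    rows = np.clip(height - 1 - (points[:, 2] / res).astype(int),
                   0, height - 1)
    # Keep the nearest return per pixel (smallest lateral depth).
    for r, c, d in zip(rows, cols, points[:, 1]):
        if d < img[r, c]:
            img[r, c] = d
    hit = np.isfinite(img)
    out = np.zeros((height, width), dtype=np.uint8)
    if hit.any():
        d = img[hit]
        # Normalize so nearer surfaces map to brighter intensities.
        out[hit] = (255 * (1 - (d - d.min()) /
                           (np.ptp(d) + 1e-6))).astype(np.uint8)
    return close3x3(out)
```

The resulting grayscale image can be fed to an off-the-shelf VLM as an ordinary 2D input, which is what makes fine-tuning-free few-shot inference possible.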
Problem

Research questions and friction points this paper is trying to address.

modality gap
roadside LiDAR
vehicle classification
fine-grained classification
vision-language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Model
LiDAR
Training-Free
Modality Gap
Few-Shot Classification