🤖 AI Summary
Zero-shot 3D object classification suffers from a significant domain shift between synthetic training data and the sparse, noisy LiDAR scans encountered in real scenes, which severely limits generalization in open-vocabulary settings. To address this, we propose a curriculum-based cross-domain fusion framework. First, we construct the first large-scale, real-scene point cloud–image–text triplet dataset. Second, we design a multimodal pretraining architecture that combines contrastive learning with curriculum learning to align point clouds, images, and text across domains. Our core contribution is to integrate the semantic richness of synthetic data with the domain fidelity of real data, thereby narrowing the domain gap and improving zero-shot transferability. On nuScenes, our method achieves 46.2% zero-shot accuracy, outperforming the prior state of the art by 19.3 percentage points. It also sets new state-of-the-art results on outdoor benchmarks such as TruckScenes, demonstrating strong effectiveness and generalization under complex real-world conditions.
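As a rough illustration of the alignment objective described above, the sketch below shows a CLIP-style symmetric contrastive loss between embeddings from a trainable point-cloud encoder and (typically frozen) image and text embeddings of the same objects. The function names, temperature, and equal weighting of the two terms are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of tri-modal contrastive alignment (CLIP-style InfoNCE).
# Assumes point_emb comes from a trainable point-cloud encoder and
# image_emb / text_emb are precomputed embeddings of the paired crops/captions.
import torch
import torch.nn.functional as F

def contrastive_loss(point_emb, target_emb, temperature=0.07):
    """Symmetric InfoNCE between point-cloud embeddings and one target modality."""
    p = F.normalize(point_emb, dim=-1)
    t = F.normalize(target_emb, dim=-1)
    logits = p @ t.t() / temperature                      # (B, B) similarity matrix
    labels = torch.arange(p.size(0), device=p.device)     # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

def trimodal_loss(point_emb, image_emb, text_emb):
    # Align each point cloud with both the image and the text of the same object.
    return contrastive_loss(point_emb, image_emb) + contrastive_loss(point_emb, text_emb)
```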
📝 Abstract
Zero-shot 3D object classification is crucial for real-world applications such as autonomous driving; however, it is often hindered by a significant domain gap between the synthetic data used for training and the sparse, noisy LiDAR scans encountered in the real world. Current methods trained solely on synthetic data fail to generalize to outdoor scenes, while those trained only on real data lack the semantic diversity to recognize rare or unseen objects.
We introduce BlendCLIP, a multimodal pretraining framework that bridges this synthetic-to-real gap by strategically combining the strengths of both domains. We first propose a pipeline for generating a large-scale dataset of object-level triplets (point cloud, image, and text description) mined directly from real-world driving data and human-annotated 3D boxes. Our core contribution is a curriculum-based data mixing strategy that first grounds the model in semantically rich synthetic CAD data before progressively adapting it to the specific characteristics of real-world scans.
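A minimal sketch of such a curriculum-based batch composition is given below, assuming two object-level pools (synthetic CAD triplets and real-scene triplets). The warm-up length, linear ramp, and the small target real fraction per batch are illustrative assumptions about the schedule, not the paper's exact hyperparameters.

```python
# Sketch of curriculum data mixing: start fully synthetic, then linearly ramp up
# the share of real-world samples in each batch to a small target fraction.
import random

def real_fraction(step, total_steps, max_real_frac=0.015, warmup_frac=0.2):
    """Real-sample share per batch at a given training step (0 during warm-up)."""
    warmup_steps = int(warmup_frac * total_steps)
    if step < warmup_steps:
        return 0.0
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min(max_real_frac, progress * max_real_frac)

def sample_batch(synthetic_pool, real_pool, batch_size, step, total_steps):
    n_real = round(real_fraction(step, total_steps) * batch_size)
    batch = random.sample(real_pool, n_real) + \
            random.sample(synthetic_pool, batch_size - n_real)
    random.shuffle(batch)
    return batch
```

Each mixed batch would then be trained with the tri-modal contrastive objective sketched earlier.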
Our experiments show that our approach is highly label-efficient: mixing as few as 1.5% real-world samples into each training batch boosts zero-shot accuracy on the nuScenes benchmark by 27%. Consequently, our final model achieves state-of-the-art performance on challenging outdoor datasets such as nuScenes and TruckScenes, improving over the best prior method by 19.3% on nuScenes, while maintaining strong generalization on diverse synthetic benchmarks. Our findings demonstrate that effective domain adaptation, not full-scale real-world annotation, is the key to unlocking robust open-vocabulary 3D perception. Our code and dataset will be released upon acceptance at https://github.com/kesu1/BlendCLIP.