🤖 AI Summary
Existing 3D scene captioning methods struggle with weak semantic alignment and limited out-of-distribution (OOD) generalization across domains, such as indoor and outdoor scenes. This work proposes 3D CoCa v2, a novel framework that, for the first time, integrates frozen CLIP semantic priors with a test-time search mechanism to enable end-to-end generalizable 3D caption generation—without updating model parameters, relying on external detectors, or using handcrafted proposals. The approach combines a spatially-aware 3D scene encoder, a multimodal decoder, and a reward-guided test-time search strategy. It achieves consistent improvements on ScanRefer and Nr3D, with CIDEr@0.5IoU gains of 1.50 and 1.61, respectively, and demonstrates strong zero-shot OOD performance on TOD3Cap, yielding a significant 3.8-point improvement in CIDEr@0.25.
📝 Abstract
Spatial intelligence refers to the ability to perceive, reason about, and describe objects and their relationships within three-dimensional environments, forming a foundation for embodied perception and scene understanding. 3D captioning aims to describe 3D scenes in natural language; however, it remains challenging due to the sparsity and irregularity of point clouds and, more critically, the weak grounding and limited out-of-distribution (OOD) generalization of existing captioners across drastically different environments, including indoor and outdoor 3D scenes. To address this challenge, we propose 3D CoCa v2, a generalizable 3D captioning framework that unifies contrastive vision-language learning with 3D caption generation and further improves robustness via test-time search (TTS) without updating the captioner parameters. 3D CoCa v2 builds on a frozen CLIP-based semantic prior, a spatially-aware 3D scene encoder for geometry, and a multimodal decoder jointly optimized with contrastive and captioning objectives, avoiding external detectors or handcrafted proposals. At inference, TTS produces diverse caption candidates and performs reward-guided selection using a compact scene summary. Experiments show gains over 3D CoCa of +1.50 CIDEr@0.5IoU on ScanRefer, +1.61 CIDEr@0.5IoU on Nr3D, and +3.8 CIDEr@0.25 in zero-shot OOD evaluation on TOD3Cap. Code will be released at https://github.com/AIGeeksGroup/3DCoCav2.