LocateAnything3D: Vision-Language 3D Detection with Chain-of-Sight

📅 2025-11-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language models (VLMs) excel at open-vocabulary 2D understanding but struggle with multi-object, open-vocabulary 3D detection. To address this, we propose Chain-of-Sight—a novel VLM-based approach that reformulates 3D detection as a sequential token prediction task, emulating human-like reasoning: objects’ 2D locations, distances, sizes, and orientations are decoded stepwise without dedicated detection heads. Our method integrates 2D detection as a visual chain of thought and performs 3D bounding box regression in a center-size-rotation decomposition, ordered from near to far. Evaluated on the Omni3D benchmark, Chain-of-Sight achieves 49.89 AP₃D, surpassing prior state-of-the-art by 15.51 points. Moreover, it demonstrates strong zero-shot generalization and robustness to unseen categories, highlighting its effectiveness for scalable, open-vocabulary 3D scene understanding.
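The near-to-far decoding order described above can be illustrated with a minimal sketch. The field names and the Euclidean-distance metric are illustrative assumptions, not the paper's actual interface:

```python
# Hypothetical sketch: order detected objects near-to-far before decoding,
# assuming each object carries a camera-frame center (x, y, z).
def order_near_to_far(objects):
    """Sort objects by Euclidean distance from the camera."""
    return sorted(
        objects,
        key=lambda o: sum(c * c for c in o["center"]) ** 0.5,
    )

objs = [
    {"name": "car",        "center": (2.0, 0.0, 12.0)},
    {"name": "pedestrian", "center": (0.5, 0.0, 4.0)},
    {"name": "truck",      "center": (3.0, 0.0, 25.0)},
]
print([o["name"] for o in order_near_to_far(objs)])  # nearest object first
```

Decoding nearer objects first reduces early ambiguity, since close objects are typically larger in the image and their depth is easier to estimate.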

📝 Abstract
To act in the world, a model must name what it sees and know where it is in 3D. Today's vision-language models (VLMs) excel at open-ended 2D description and grounding, yet multi-object 3D detection remains largely missing from the VLM toolbox. We present LocateAnything3D, a VLM-native recipe that casts 3D detection as a next-token prediction problem. The key is a short, explicit Chain-of-Sight (CoS) sequence that mirrors how humans reason from images: find an object in 2D, then infer its distance, size, and pose. The decoder first emits 2D detections as a visual chain-of-thought, then predicts 3D boxes under an easy-to-hard curriculum: across objects, a near-to-far order reduces early ambiguity and matches ego-centric utility; within each object, a center-from-camera, dimensions, and rotation factorization ranks information by stability and learnability. This VLM-native interface preserves open-vocabulary and visual-prompting capability without specialized heads. On the challenging Omni3D benchmark, our model achieves state-of-the-art results, with 49.89 AP_3D, surpassing the previous best by +15.51 points even when the baseline is given ground-truth 2D boxes. It also generalizes zero-shot to held-out categories with strong robustness. By turning 3D detection into a disciplined next-token problem, LocateAnything3D offers a practical foundation for models to perceive in 3D.
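The two-stage serialization sketched in the abstract can be made concrete as follows. This is a minimal illustration under stated assumptions: the special tokens, value ranges, and uniform quantization are hypothetical, not the paper's tokenizer:

```python
# Hypothetical Chain-of-Sight serialization: 2D boxes first (the visual
# chain-of-thought), then per-object 3D attributes in a
# center -> dimensions -> rotation order, with objects sorted near-to-far.
def quantize(v, lo, hi, bins=1000):
    """Map a continuous value to an integer token id in [0, bins-1]."""
    v = min(max(v, lo), hi)
    return int((v - lo) / (hi - lo) * (bins - 1))

def chain_of_sight_tokens(objects):
    objects = sorted(objects, key=lambda o: o["center"][2])  # near-to-far by depth
    tokens = []
    # Stage 1: 2D detections (normalized box corners) as a visual chain-of-thought.
    for o in objects:
        tokens += ["<2d>", o["name"]]
        tokens += [quantize(c, 0.0, 1.0) for c in o["box2d"]]
    # Stage 2: 3D boxes, factorized as center, dimensions, rotation.
    for o in objects:
        tokens += ["<3d>"]
        tokens += [quantize(c, -50.0, 50.0) for c in o["center"]]   # center-from-camera (m)
        tokens += [quantize(d, 0.0, 20.0) for d in o["dims"]]       # width, height, length (m)
        tokens += [quantize(r, -3.1416, 3.1416) for r in o["rot"]]  # rotation angles (rad)
    return tokens

objs = [
    {"name": "chair", "box2d": (0.1, 0.2, 0.3, 0.4),
     "center": (0.0, 0.0, 2.0), "dims": (0.5, 1.0, 0.5), "rot": (0.0, 0.0, 0.0)},
    {"name": "sofa", "box2d": (0.5, 0.5, 0.9, 0.8),
     "center": (1.0, 0.0, 5.0), "dims": (2.0, 1.0, 1.0), "rot": (1.57, 0.0, 0.0)},
]
toks = chain_of_sight_tokens(objs)
```

Because the output is a flat token sequence, no detection head is needed: a standard VLM decoder trained with next-token prediction can emit it directly, which is what preserves the open-vocabulary interface.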
Problem

Research questions and friction points this paper is trying to address.

Enables open-vocabulary 3D object detection from vision-language models
Transforms 3D detection into a next-token prediction task via Chain-of-Sight
Generalizes zero-shot to new categories without specialized detection heads
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-Sight sequence for human-like 2D-to-3D reasoning
Easy-to-hard curriculum learning for 3D box prediction
VLM-native interface preserving open-vocabulary visual prompting