AI Summary
Existing vision-language models often struggle with 3D spatial reasoning because their geometric representations are inefficient or lack spatial consistency, which makes it hard to balance efficiency and performance. This work proposes Proxy3D, a method that takes only video frames as input, uses semantic and geometric encoders to extract features, and applies semantic-aware clustering to produce compact yet spatially consistent 3D proxy representations. A multi-stage training strategy and a newly curated SpaceSpan dataset align these representations with the vision-language model. Notably, it is the first approach to integrate semantic clustering with 3D proxy representations, and it achieves competitive or state-of-the-art performance on 3D visual question answering, visual grounding, and general spatial reasoning tasks with significantly shorter visual sequences, thereby addressing the limitations of conventional 2D pipelines.
Abstract
Spatial intelligence in vision-language models (VLMs) has attracted research interest, driven by the practical demand to reason about the 3D world. Despite promising results, most existing methods follow the conventional 2D pipeline in VLMs and use pixel-aligned representations for the vision modality. However, correspondence-based models with implicit 3D scene understanding often fail to achieve spatial consistency, and representation-based models with 3D geometric priors serialize the vision sequence inefficiently. To address this, we propose Proxy3D, a method with compact yet comprehensive 3D proxy representations for the vision modality. Given only video frames as input, we employ semantic and geometric encoders to extract scene features and then apply semantic-aware clustering to them to obtain a set of proxies in 3D space. For representation alignment, we further curate the SpaceSpan dataset and apply multi-stage training to align the proposed 3D proxy representations with the VLM. While using shorter visual sequences, our method achieves competitive or state-of-the-art performance on 3D visual question answering, visual grounding, and general spatial intelligence benchmarks.
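The abstract does not specify how the semantic-aware clustering is computed, so the following is only an illustrative sketch of the general idea and not the authors' implementation. It assumes each scene point already has a 3D position and a semantic feature vector (in Proxy3D these would come from the geometric and semantic encoders), and it substitutes plain k-means on a joint position-plus-feature descriptor as the clustering step. The function name `semantic_aware_proxies` and the parameters `num_proxies`, `geo_weight`, and `iters` are hypothetical.

```python
import numpy as np


def semantic_aware_proxies(xyz, feats, num_proxies, geo_weight=1.0, iters=20):
    """Pool N points into at most num_proxies proxies (hypothetical sketch).

    xyz:   (N, 3) point positions, assumed to come from a geometric encoder.
    feats: (N, D) semantic features, assumed to come from a semantic encoder.
    Clustering is plain k-means on joint descriptors, an assumption made
    for illustration rather than the method described in the paper.
    """
    xyz = np.asarray(xyz, dtype=np.float64)
    feats = np.asarray(feats, dtype=np.float64)

    # Put both modalities on comparable scales before concatenating them.
    xyz_n = (xyz - xyz.mean(axis=0)) / (xyz.std() + 1e-8)
    feats_n = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    joint = np.concatenate([geo_weight * xyz_n, feats_n], axis=1)

    # Deterministic farthest-point initialization of the cluster centers.
    centers = [joint[0]]
    dist = np.linalg.norm(joint - joint[0], axis=1)
    for _ in range(num_proxies - 1):
        idx = int(dist.argmax())
        centers.append(joint[idx])
        dist = np.minimum(dist, np.linalg.norm(joint - joint[idx], axis=1))
    centers = np.stack(centers)

    # Standard k-means updates in the joint space.
    for _ in range(iters):
        d = np.linalg.norm(joint[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for k in range(num_proxies):
            mask = labels == k
            if mask.any():
                centers[k] = joint[mask].mean(axis=0)

    # Final assignment; drop empty clusters and renumber labels compactly.
    d = np.linalg.norm(joint[:, None, :] - centers[None, :, :], axis=2)
    used, labels = np.unique(d.argmin(axis=1), return_inverse=True)

    # Each proxy keeps a 3D location and a pooled semantic feature.
    proxy_xyz = np.stack([xyz[labels == k].mean(axis=0) for k in range(len(used))])
    proxy_feat = np.stack([feats[labels == k].mean(axis=0) for k in range(len(used))])
    return proxy_xyz, proxy_feat, labels
```

Under these assumptions, the visual sequence passed to the VLM would have at most `num_proxies` entries instead of one entry per point or pixel, which is the kind of sequence shortening the abstract describes. Raising `geo_weight` makes clusters more spatially compact, while lowering it lets semantically similar but distant points merge.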