🤖 AI Summary
This work proposes a calibration-free, feed-forward framework for text-guided 3D localization and segmentation that is both efficient and geometrically consistent. Its Geometry-Aware Semantic Attention (GASA) mechanism suppresses semantically plausible but geometrically inconsistent cross-view correspondences without requiring ground-truth pose priors, and multi-view features are fused end-to-end from high-resolution (1008×1008) images. The approach achieves state-of-the-art performance across five benchmarks, including ScanNet++ and uCO3D, where a single text query replaces O(N) manual clicks. Running at ~57 ms per frame (~18 FPS), the method is well-suited for real-time applications in robotics and augmented reality.
📝 Abstract
Localizing objects and parts from natural language in 3D space is essential for robotics, AR, and embodied AI, yet existing methods face a trade-off between the accuracy and geometric consistency of per-scene optimization and the efficiency of feed-forward inference. We present TrianguLang, a feed-forward framework for 3D localization that requires no camera calibration at inference. Unlike prior methods that treat views independently, we introduce Geometry-Aware Semantic Attention (GASA), which utilizes predicted geometry to gate cross-view feature correspondence, suppressing semantically plausible but geometrically inconsistent matches without requiring ground-truth poses. Validated on five benchmarks including ScanNet++ and uCO3D, TrianguLang achieves state-of-the-art feed-forward text-guided segmentation and localization, reducing user effort from $O(N)$ clicks to a single text query. The model processes each frame at $1008\times1008$ resolution in $\sim$57ms ($\sim$18 FPS) without optimization, enabling practical deployment for interactive robotics and AR applications. Code and checkpoints are available at https://cwru-aism.github.io/triangulang/.
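To make the core idea concrete, here is a minimal sketch of geometry-gated cross-view attention in the spirit of GASA. This is not the paper's implementation: the function name, the exponential gating form, the temperature `tau`, and the use of pointwise 3D distances from predicted geometry are all illustrative assumptions.

```python
# Illustrative sketch only: cross-view attention whose logits are gated by
# predicted 3D geometry, so matches that are semantically similar but
# geometrically far apart are down-weighted. Not the paper's actual GASA.
import numpy as np


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def geometry_gated_attention(feat_a, feat_b, pts_a, pts_b, tau=0.2):
    """Attend from view A tokens to view B tokens with a geometric gate.

    feat_a: (Na, D) features of view A; feat_b: (Nb, D) features of view B.
    pts_a:  (Na, 3) predicted 3D points for A; pts_b: (Nb, 3) for B.
    tau: assumed distance scale of the gate (hypothetical hyperparameter).
    """
    # Semantic similarity logits (scaled dot product).
    sim = feat_a @ feat_b.T / np.sqrt(feat_a.shape[1])
    # Pairwise 3D distances between predicted points of the two views.
    dist = np.linalg.norm(pts_a[:, None, :] - pts_b[None, :, :], axis=-1)
    # Geometric gate in (0, 1]: near-coincident points pass, distant ones don't.
    gate = np.exp(-dist / tau)
    # Apply the gate in log space, then normalize into attention weights.
    attn = softmax(sim + np.log(gate + 1e-8), axis=-1)
    # Aggregate view-B features for each view-A token.
    return attn @ feat_b
```

In this toy form, two view-B tokens that look equally similar to a view-A token semantically will receive very different attention weights if only one of them back-projects near the same predicted 3D point, which is the behavior the abstract attributes to GASA.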