SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation

📅 2025-02-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language models (VLMs) struggle to accurately comprehend object orientation, which limits fine-grained robotic manipulation. To address this, we propose "semantic orientation", a paradigm that moves beyond conventional geometric coordinate systems by defining function-oriented directions (e.g., the "plug-in" direction of a USB connector) in natural language, yielding orientation representations that are reference-frame-free, linguistically grounded, and tied to object function. To support this paradigm, we introduce OrienText300K, the first large-scale 3D semantic orientation dataset, and design a unified multimodal framework that integrates VLMs, 3D geometric analysis, and functional semantics for instruction-driven orientation reasoning and action generation. Evaluated on the Open6DOR and SIMPLER benchmarks, our approach achieves orientation-aware manipulation accuracies of 48.7% and 74.9%, respectively, a substantial improvement in directional manipulation in both simulation and the real world.
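
To make the idea concrete, here is a minimal, hypothetical sketch of the query such a system answers: geometry plus a language phrase in, a unit direction vector out. The function name, signature, and the PCA placeholder are ours for illustration, not the paper's implementation.

```python
import numpy as np

def semantic_orientation(points: np.ndarray, phrase: str) -> np.ndarray:
    """Toy stand-in for a learned semantic-orientation model.

    The real system maps (3D geometry, language phrase) -> a unit
    direction such as the "plug-in" direction of a USB connector.
    Here a PCA principal axis is the placeholder; note it is only
    defined up to sign, which is precisely the ambiguity a model
    trained on function-labeled data (e.g., OrienText300K) resolves.
    The phrase is ignored by this toy.
    """
    centered = points - points.mean(axis=0)
    # First right-singular vector = principal axis of the point cloud.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    axis = vt[0]
    return axis / np.linalg.norm(axis)

# Usage on a synthetic elongated "USB stick" stretched along x:
rng = np.random.default_rng(0)
pts = rng.normal(size=(500, 3)) * np.array([5.0, 0.5, 0.5])
print(semantic_orientation(pts, "plug-in direction"))  # ~ [+/-1, 0, 0]
```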

📝 Abstract
Spatial intelligence is a critical component of embodied AI, enabling robots to understand and interact with their environments. While recent advances have enhanced the ability of VLMs to perceive object locations and positional relationships, they still lack the capability to precisely understand object orientations, a key requirement for tasks involving fine-grained manipulation. Addressing this limitation requires not only geometric reasoning but also an expressive and intuitive way to represent orientation. In this context, we propose that natural language offers a more flexible representation space than canonical frames, making it particularly suitable for instruction-following robotic systems. In this paper, we introduce the concept of semantic orientation, which defines object orientations using natural language in a reference-frame-free manner (e.g., the "plug-in" direction of a USB or the "handle" direction of a knife). To support this, we construct OrienText300K, a large-scale dataset of 3D models annotated with semantic orientations that link geometric understanding to functional semantics. By integrating semantic orientation into a VLM system, we enable robots to generate manipulation actions with both positional and orientational constraints. Extensive experiments in simulation and the real world demonstrate that our approach significantly enhances robotic manipulation capabilities, e.g., 48.7% accuracy on Open6DOR and 74.9% accuracy on SIMPLER.
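
Once a semantic direction has been predicted, an orientational constraint from an instruction ("point the plug-in direction at the port") reduces to a rotation-alignment computation. The sketch below, with assumed names and example values of our own, shows the standard Rodrigues construction for the smallest rotation that aligns one direction with another; it is one plausible building block, not the paper's stated algorithm.

```python
import numpy as np

def rotation_aligning(a, b):
    """Smallest rotation matrix R with R @ a ~ b (a, b nonzero vectors).

    Rodrigues' formula about the axis a x b. The roll about b stays
    unconstrained, a redundant degree of freedom a grasp planner can use.
    """
    a = np.asarray(a, float); a /= np.linalg.norm(a)
    b = np.asarray(b, float); b /= np.linalg.norm(b)
    v = np.cross(a, b)            # axis scaled by sin(theta)
    c = float(np.dot(a, b))       # cos(theta)
    if np.linalg.norm(v) < 1e-12:  # parallel or anti-parallel
        if c > 0:
            return np.eye(3)
        # 180-degree turn about any axis orthogonal to a
        axis = np.cross(a, [1.0, 0.0, 0.0])
        if np.linalg.norm(axis) < 1e-6:
            axis = np.cross(a, [0.0, 1.0, 0.0])
        axis /= np.linalg.norm(axis)
        return 2.0 * np.outer(axis, axis) - np.eye(3)
    K = np.array([[0.0, -v[2], v[1]],
                  [v[2], 0.0, -v[0]],
                  [-v[1], v[0], 0.0]])
    # R = I + K + K^2 / (1 + cos(theta))
    return np.eye(3) + K + K @ K / (1.0 + c)

# Example: turn a predicted "plug-in" direction (-z) to face the port (+x).
R = rotation_aligning([0, 0, -1], [1, 0, 0])
print(R @ np.array([0.0, 0.0, -1.0]))  # ~ [1, 0, 0]
```
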
Problem

Research questions and friction points this paper is trying to address.

Enhance robotic spatial reasoning
Integrate semantic orientation using language
Improve manipulation with positional constraints
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic orientation representation
Large-scale dataset integration
Enhanced robotic manipulation accuracy