SpatialCoT: Advancing Spatial Reasoning through Coordinate Alignment and Chain-of-Thought for Embodied Task Planning

📅 2025-01-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Weak spatial reasoning in vision-language models (VLMs) and spatial inaccuracies arising from purely linguistic outputs hinder embodied intelligence. To address this, we propose a two-stage framework: (i) bidirectional spatial coordinate alignment, enabling precise coordinate-level mapping between visual and linguistic representations; and (ii) a chain-of-thought spatial grounding mechanism, explicitly anchoring sequential reasoning steps to physical space. Our work is the first to deeply integrate coordinate alignment with chain-of-thought reasoning, establishing a unified vision–language–action joint representation and an end-to-end embodied planning architecture. Evaluated on both simulated and real-world navigation and manipulation tasks, our method achieves a 23.6% improvement in localization accuracy and a 19.4% increase in task success rate, significantly outperforming state-of-the-art approaches.
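To make the two-stage design concrete, below is a minimal, hypothetical Python sketch of the pipeline the summary describes. Everything here is illustrative: the names (ToyVLM, text_to_point, point_to_text, plan_with_spatial_cot) are stand-ins rather than the authors' published API, and the stub model only mimics the interfaces a coordinate-aligned VLM would expose.

```python
# Hypothetical sketch of the SpatialCoT two-stage pipeline; all names are
# illustrative, not the authors' actual implementation.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Observation:
    image_size: Tuple[int, int]  # (width, height) of the egocentric view
    instruction: str             # e.g. "place the cup to the left of the plate"


class ToyVLM:
    """Stand-in for a VLM fine-tuned with bidirectional coordinate alignment."""

    def text_to_point(self, obs: Observation, phrase: str) -> Tuple[int, int]:
        # Stage 1, forward direction (language -> coordinate): the aligned
        # VLM emits pixel coordinates for a referred object or location.
        w, h = obs.image_size
        return (w // 2, h // 2)  # stub: pretend everything sits at the center

    def point_to_text(self, obs: Observation, point: Tuple[int, int]) -> str:
        # Stage 1, reverse direction (coordinate -> language): the VLM
        # describes what occupies a queried pixel location.
        return f"object near pixel {point}"

    def chain_of_thought(self, obs: Observation) -> List[Tuple[str, Tuple[int, int]]]:
        # Stage 2 (chain-of-thought spatial grounding): each reasoning step
        # is anchored to a coordinate instead of remaining free-form text.
        steps = ["locate the cup", "locate the plate", "choose a free placement spot"]
        return [(step, self.text_to_point(obs, step)) for step in steps]


def plan_with_spatial_cot(vlm: ToyVLM, obs: Observation) -> Tuple[int, int]:
    """Run grounded CoT reasoning and return the final point-based action."""
    grounded_steps = vlm.chain_of_thought(obs)
    for step, point in grounded_steps:
        print(f"{step:<30} -> {point}  ({vlm.point_to_text(obs, point)})")
    # The last grounded step is handed to a low-level navigation/manipulation policy.
    return grounded_steps[-1][1]


if __name__ == "__main__":
    obs = Observation(image_size=(640, 480),
                      instruction="place the cup to the left of the plate")
    print("action point:", plan_with_spatial_cot(ToyVLM(), obs))
```

The design point the sketch tries to capture is that every reasoning step carries a coordinate, so the final step can be passed directly to a low-level controller as a point-based action instead of being re-parsed from free-form text.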

📝 Abstract
Spatial reasoning is an essential problem in embodied AI research. Efforts to enhance spatial reasoning abilities through supplementary spatial data and fine-tuning have proven limited and ineffective when addressing complex embodied tasks, largely due to their dependence on language-based outputs. While some approaches have introduced a point-based action space to mitigate this issue, they fall short in managing more intricate tasks within complex environments. This deficiency arises from their failure to fully exploit the inherent thinking and reasoning capabilities that are fundamental strengths of Vision-Language Models (VLMs). To address these limitations, we propose a novel approach named SpatialCoT, specifically designed to bolster the spatial reasoning capabilities of VLMs. Our approach comprises two stages: spatial coordinate bi-directional alignment, which aligns vision-language inputs with spatial coordinates, and chain-of-thought spatial grounding, which harnesses the reasoning capabilities of language models for advanced spatial reasoning. We evaluate SpatialCoT on challenging navigation and manipulation tasks, both in simulation and real-world settings. Experimental results demonstrate that our method significantly outperforms previous state-of-the-art approaches in both tasks.
Problem

Research questions and friction points this paper is trying to address.

Spatial Understanding
Vision-Language Models
Robotics AI
Innovation

Methods, ideas, or system contributions that make the work stand out.

SpatialCoT
Vision-Language Models
Spatial Understanding
👥 Authors
Yuecheng Liu
Huawei Noah’s Ark Lab
Dafeng Chi
Huawei Noah’s Ark Lab
Shiguang Wu
Huawei Noah’s Ark Lab
Zhanguang Zhang
Huawei Noah’s Ark Lab
Yaochen Hu
Huawei Technologies Canada, University of Alberta
Large-scale machine learning, Optimization, Recommender systems, Approximation algorithms, Statistical machine learning
Lingfeng Zhang
PhD student at Tsinghua University
Embodied AI
Yingxue Zhang
Huawei Noah’s Ark Lab
Shuang Wu
Huawei Noah’s Ark Lab
Tongtong Cao
Researcher, Huawei Noah's Ark Lab
Robotics, Embodied AI, Autonomous driving
Guowei Huang
Huawei Noah’s Ark Lab
Guangjian Tian
Huawei Noah’s Ark Lab
Xingyue Quan
Huawei Noah’s Ark Lab
Jianye Hao
Huawei Noah's Ark Lab / Tianjin University
Multiagent Systems, Embodied AI
Yuzheng Zhuang
Senior Researcher @ Huawei Noah's Ark Lab
Reinforcement Learning, Optimization, Autonomous Driving, Communication