🤖 AI Summary
Large language models (LLMs) excel at text-only tasks but show limited capability in embodied visual-spatial reasoning, such as maze navigation. To address this, the authors propose a two-stage training framework: first, supervised fine-tuning (SFT) teaches the model stepwise movement instructions; second, group-relative policy optimization (GRPO), adapted here for the first time to enhance spatial reasoning, combines tokenized maze representations, a custom reward function, and chain-of-thought guidance to improve sequential decision-making and self-correction. Experiments on synthetic mazes demonstrate substantial gains: accuracy rises from 0% for the baseline to 86% after SFT and to 93% after GRPO. The approach markedly improves generalization, robustness, and embodied spatial reasoning, establishing a new state of the art in LLM-based maze navigation.
📝 Abstract
Large Language Models (LLMs) have demonstrated impressive capabilities in language processing, yet they often struggle with tasks requiring genuine visual-spatial reasoning. In this paper, we introduce a novel two-stage training framework designed to equip standard LLMs with visual reasoning abilities for maze navigation. First, we leverage Supervised Fine-Tuning (SFT) on a curated dataset of tokenized maze representations to teach the model to predict step-by-step movement commands. Next, we apply Group Relative Policy Optimization (GRPO), a technique used in DeepSeek-R1, with a carefully crafted reward function to refine the model's sequential decision-making and encourage emergent chain-of-thought behaviors. Experimental results on synthetically generated mazes show that while a baseline model fails to navigate the maze, the SFT-trained model achieves 86% accuracy, and further GRPO fine-tuning boosts accuracy to 93%. Qualitative analyses reveal that GRPO fosters more robust and self-corrective reasoning, highlighting the potential of our approach to bridge the gap between language models and visual-spatial tasks. These findings offer promising implications for robotics, autonomous navigation, and other domains that require integrated visual and sequential reasoning.
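To make the GRPO stage concrete, here is a minimal sketch of its core idea: each sampled completion's reward is normalized against the mean and standard deviation of its sampling group, rather than against a learned value baseline. The maze encoding, `maze_reward` function, and move sequences below are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch of group-relative advantage computation (GRPO-style) with a
# toy maze-solving reward. All details here are illustrative.
from statistics import mean, stdev

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def maze_reward(maze, start, goal, moves):
    """Return 1.0 if the move sequence reaches the goal without
    hitting a wall ('#') or leaving the grid, else 0.0."""
    r, c = start
    for m in moves:
        dr, dc = MOVES[m]
        r, c = r + dr, c + dc
        if not (0 <= r < len(maze) and 0 <= c < len(maze[0])) or maze[r][c] == "#":
            return 0.0
    return 1.0 if (r, c) == goal else 0.0

def group_relative_advantages(rewards):
    """GRPO normalizes each reward against its group's statistics,
    avoiding a separate critic network."""
    mu = mean(rewards)
    sd = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sd + 1e-8) for r in rewards]

# A tiny tokenized maze: 'S' start, 'G' goal, '#' wall, '.' free.
maze = ["S.#",
        "..#",
        "#.G"]
start, goal = (0, 0), (2, 2)

# A "group" of sampled move sequences, as in GRPO rollouts.
group = [
    ["down", "right", "down", "right"],  # reaches the goal
    ["right", "right"],                  # walks into a wall
    ["down", "down"],                    # walks into a wall
]
rewards = [maze_reward(maze, start, goal, g) for g in group]
advantages = group_relative_advantages(rewards)
```

The successful rollout receives a positive advantage and the failed ones negative advantages that sum (approximately) to zero, which is what pushes the policy toward valid, self-corrected move sequences during the RL stage.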