How Good are Foundation Models in Step-by-Step Embodied Reasoning?

📅 2025-09-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work evaluates foundation models' multi-step reasoning capabilities in embodied environments, with particular emphasis on their joint handling of multimodal perception, physical-constraint modeling, and safety-aware decision-making. To this end, the authors introduce FoMER, a benchmark designed for embodied reasoning that comprises 10 task categories, 8 embodiments (covering three robot types), and over 1.1k samples with detailed step-by-step reasoning. FoMER decouples perceptual grounding from action reasoning, enabling stepwise, fine-grained evaluation. A systematic assessment of state-of-the-art large multimodal models (LMMs) reveals shortcomings in physical consistency and safety compliance. FoMER thus establishes a reproducible, granular evaluation standard for embodied intelligence, facilitating interpretable reasoning analysis and supporting the safety-aligned development of robotic systems.

📝 Abstract
Embodied agents operating in the physical world must make decisions that are not only effective but also safe, spatially coherent, and grounded in context. While recent advances in large multimodal models (LMMs) have shown promising capabilities in visual understanding and language generation, their ability to perform structured reasoning for real-world embodied tasks remains underexplored. In this work, we aim to understand how well foundation models can perform step-by-step reasoning in embodied environments. To this end, we propose the Foundation Model Embodied Reasoning (FoMER) benchmark, designed to evaluate the reasoning capabilities of LMMs in complex embodied decision-making scenarios. Our benchmark spans a diverse set of tasks that require agents to interpret multimodal observations, reason about physical constraints and safety, and generate valid next actions in natural language. We present (i) a large-scale, curated suite of embodied reasoning tasks, (ii) a novel evaluation framework that disentangles perceptual grounding from action reasoning, and (iii) empirical analysis of several leading LMMs under this setting. Our benchmark includes over 1.1k samples with detailed step-by-step reasoning across 10 tasks and 8 embodiments, covering three different robot types. Our results highlight both the potential and current limitations of LMMs in embodied reasoning, pointing towards key challenges and opportunities for future research in robot intelligence. Our data and code will be made publicly available.
Problem

Research questions and friction points this paper is trying to address.

Evaluating foundation models' step-by-step reasoning in embodied environments
Assessing multimodal models' performance in complex embodied decision-making scenarios
Testing LMMs' ability to interpret observations and generate valid actions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposed the FoMER benchmark for evaluating embodied reasoning
Introduced an evaluation framework that disentangles perceptual grounding from action reasoning
Empirically analyzed several leading LMMs on over 1.1k multimodal task samples
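The disentangled evaluation above can be sketched as separate scoring of perception steps and action-reasoning steps within one trajectory, rather than a single end-task metric. This is a minimal illustrative sketch: the data layout, step labels, and function name are assumptions, not the paper's released interface.

```python
# Hypothetical sketch of FoMER-style disentangled scoring: perceptual grounding
# and action reasoning are evaluated independently, step by step.
def disentangled_scores(steps):
    """steps: list of (step_type, correct) pairs,
    with step_type in {"perception", "action"}."""
    totals = {"perception": [0, 0], "action": [0, 0]}  # [num correct, num total]
    for step_type, correct in steps:
        totals[step_type][1] += 1
        totals[step_type][0] += int(correct)
    # Per-type accuracy; 0.0 when a step type never occurs in the trajectory.
    return {t: (c / n if n else 0.0) for t, (c, n) in totals.items()}

# Example trajectory: 3 perception steps, 2 action-reasoning steps.
trace = [
    ("perception", True), ("perception", True), ("perception", False),
    ("action", True), ("action", False),
]
scores = disentangled_scores(trace)
```

Keeping the two scores separate lets an analysis distinguish a model that misperceives the scene from one that perceives correctly but reasons to an invalid or unsafe action.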