Embodied4C: Measuring What Matters for Embodied Vision-Language Navigation

📅 2025-12-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing research lacks a systematic understanding of how embodiment influences perception, reasoning, and control in vision-language navigation, and mainstream benchmarks do not comprehensively evaluate the semantic, spatial, temporal, and physical dimensions of embodied reasoning. This paper introduces the first closed-loop Turing-test benchmark for embodied intelligence, covering heterogeneous agents -- autonomous vehicles, drones, and robotic arms -- evaluated on approximately 1.1K one-shot reasoning questions and 58 goal-directed navigation tasks. Key innovations include: (i) a cross-agent, multimodal-alignment-constrained evaluation framework; (ii) domain-far query generation to mitigate embodiment overfitting; (iii) a four-dimensional embodied reasoning metric suite; and (iv) a joint VLM-control assessment protocol. Validation on 10 state-of-the-art vision-language models and 4 embodied baselines shows that multimodal alignment and instruction tuning matter more than model scale, and that spatial and temporal reasoning remain the primary bottlenecks in current embodied competence.
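The summary does not include evaluation code, but the four-dimensional metric suite can be pictured as per-dimension accuracy over the question pool. A minimal sketch, assuming a simple tagged-question representation (all identifiers here are illustrative, not the paper's API):

```python
from dataclasses import dataclass

# Dimension names come from the paper; everything else is hypothetical.
DIMENSIONS = ("semantic", "spatial", "temporal", "physical")

@dataclass
class QuestionResult:
    dimension: str  # one of DIMENSIONS
    correct: bool   # whether the one-shot answer matched the reference

def dimension_accuracies(results: list[QuestionResult]) -> dict[str, float]:
    """Per-dimension accuracy over the ~1.1K one-shot reasoning questions."""
    scores = {}
    for dim in DIMENSIONS:
        hits = [r.correct for r in results if r.dimension == dim]
        scores[dim] = sum(hits) / len(hits) if hits else float("nan")
    return scores
```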

📝 Abstract
Vision-language navigation requires agents to reason and act under constraints of embodiment. While vision-language models (VLMs) demonstrate strong generalization, current benchmarks provide limited understanding of how embodiment -- i.e., the choice of physical platform, sensor configuration, and modality alignment -- influences perception, reasoning, and control. We introduce Embodied4C, a closed-loop benchmark designed as a Turing test for embodied reasoning. The benchmark evaluates the core embodied capabilities of VLMs across three heterogeneous embodiments -- autonomous vehicles, aerial drones, and robotic manipulators -- through approximately 1.1K one-shot reasoning questions and 58 goal-directed navigation tasks. These tasks jointly assess four foundational dimensions: semantic, spatial, temporal, and physical reasoning. Each embodiment presents dynamic sensor configurations and environment variations to probe generalization beyond platform-specific adaptation. To prevent embodiment overfitting, Embodied4C integrates domain-far queries targeting abstract and cross-context reasoning. Comprehensive evaluation across ten state-of-the-art VLMs and four embodied control baselines shows that cross-modal alignment and instruction tuning matter more than scale, while spatial and temporal reasoning remain the primary bottlenecks for reliable embodied competence.
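To make "closed-loop" concrete: rather than grading isolated answers, the benchmark lets the model's outputs drive the agent, so perception and reasoning errors compound into control errors. A rough sketch of one such episode, assuming a Gym-style environment and a VLM wrapper (both interfaces are hypothetical stand-ins, not Embodied4C's actual API):

```python
def run_episode(env, vlm_agent, instruction: str, max_steps: int = 200) -> bool:
    """Roll out one goal-directed navigation task and report success.

    `env` follows a hypothetical Gym-like reset/step interface;
    `vlm_agent.act` maps an observation plus a language instruction
    to a platform-appropriate action.
    """
    obs = env.reset()
    for _ in range(max_steps):
        action = vlm_agent.act(obs, instruction)  # perception + reasoning
        obs, done, success = env.step(action)     # control closes the loop
        if done:
            return success
    return False  # timed out without reaching the goal

def success_rate(env, vlm_agent, tasks) -> float:
    """Aggregate success over goal-directed tasks (58 in the benchmark)."""
    return sum(run_episode(env, vlm_agent, t) for t in tasks) / len(tasks)
```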
Problem

Research questions and friction points this paper is trying to address.

How does embodiment (physical platform, sensor configuration, modality alignment) shape VLMs' perception, reasoning, and control?
How well do current models cover the semantic, spatial, temporal, and physical dimensions of embodied reasoning?
Do models generalize beyond platform-specific adaptation, or do they overfit to a single embodiment?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Closed-loop benchmark for embodied reasoning evaluation
Heterogeneous embodiments with dynamic sensor configurations
Domain-far queries to prevent embodiment overfitting (one plausible construction is sketched after this list)
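The summary does not specify how domain-far queries are built. One plausible mechanism, sketched purely as an assumption, is to keep only candidate questions whose embeddings sit far from the platform's own domain corpus, so answers cannot be memorized from embodiment-specific context:

```python
import numpy as np

def select_domain_far(query_embs: np.ndarray,
                      domain_embs: np.ndarray,
                      min_dist: float = 0.5) -> np.ndarray:
    """Indices of queries whose nearest in-domain neighbor is still far away.

    query_embs:  (Q, D) unit-normalized query embeddings (hypothetical)
    domain_embs: (N, D) unit-normalized embeddings of embodiment-specific text
    min_dist:    minimum cosine distance (1 - similarity) to count as domain-far
    """
    sims = query_embs @ domain_embs.T  # cosine similarities, shape (Q, N)
    nearest = sims.max(axis=1)         # similarity to closest in-domain item
    return np.where(1.0 - nearest >= min_dist)[0]
```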
👥 Authors
Tin Stribor Sohn
Karlsruhe Institute of Technology
Maximilian Dillitzer
UAS Esslingen
Jason J. Corso
University of Michigan
Eric Sax
Karlsruhe Institute of Technology, Systems Engineering