Embodied4C: Measuring What Matters for Embodied Vision-Language Navigation

📅 2025-12-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing research lacks a systematic understanding of how embodiment influences perception, reasoning, and control in vision-language navigation, and mainstream benchmarks do not comprehensively evaluate the semantic, spatial, temporal, and physical dimensions of embodied reasoning. This paper introduces the first closed-loop Turing-test benchmark for embodied intelligence, covering heterogeneous agents -- autonomous vehicles, drones, and robotic arms -- evaluated on approximately 1.1K one-shot reasoning questions and 58 goal-directed navigation tasks. Key innovations include: (i) a cross-agent, multimodal-alignment-constrained evaluation framework; (ii) domain-far query generation to mitigate embodiment overfitting; (iii) a four-dimensional embodied reasoning metric suite; and (iv) a joint VLM-control assessment protocol. Validation on 10 state-of-the-art vision-language models and 4 embodied baselines shows that multimodal alignment and instruction tuning matter more than model scale, and that spatial and temporal reasoning remain the primary bottlenecks in current embodied competence.
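The summary does not include evaluation code, but the four-dimensional metric suite can be pictured as per-dimension accuracy over the question pool. A minimal sketch, assuming a simple tagged-question representation (all identifiers here are illustrative, not the paper's API):

```python
from dataclasses import dataclass

# Dimension names come from the paper; everything else is hypothetical.
DIMENSIONS = ("semantic", "spatial", "temporal", "physical")

@dataclass
class QuestionResult:
    dimension: str  # one of DIMENSIONS
    correct: bool   # whether the one-shot answer matched the reference

def dimension_accuracies(results: list[QuestionResult]) -> dict[str, float]:
    """Per-dimension accuracy over the ~1.1K one-shot reasoning questions."""
    scores = {}
    for dim in DIMENSIONS:
        hits = [r.correct for r in results if r.dimension == dim]
        scores[dim] = sum(hits) / len(hits) if hits else float("nan")
    return scores
```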

📝 Abstract
Vision-language navigation requires agents to reason and act under constraints of embodiment. While vision-language models (VLMs) demonstrate strong generalization, current benchmarks provide limited understanding of how embodiment -- i.e., the choice of physical platform, sensor configuration, and modality alignment -- influences perception, reasoning, and control. We introduce Embodied4C, a closed-loop benchmark designed as a Turing test for embodied reasoning. The benchmark evaluates the core embodied capabilities of VLMs across three heterogeneous embodiments -- autonomous vehicles, aerial drones, and robotic manipulators -- through approximately 1.1K one-shot reasoning questions and 58 goal-directed navigation tasks. These tasks jointly assess four foundational dimensions: semantic, spatial, temporal, and physical reasoning. Each embodiment presents dynamic sensor configurations and environment variations to probe generalization beyond platform-specific adaptation. To prevent embodiment overfitting, Embodied4C integrates domain-far queries targeting abstract and cross-context reasoning. Comprehensive evaluation across ten state-of-the-art VLMs and four embodied control baselines shows that cross-modal alignment and instruction tuning matter more than scale, while spatial and temporal reasoning remain the primary bottlenecks for reliable embodied competence.
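To make "closed-loop" concrete: rather than grading isolated answers, the benchmark lets the model's outputs drive the agent, so perception and reasoning errors compound into control errors. A rough sketch of one such episode, assuming a Gym-style environment and a VLM wrapper (both interfaces are hypothetical stand-ins, not Embodied4C's actual API):

```python
def run_episode(env, vlm_agent, instruction: str, max_steps: int = 200) -> bool:
    """Roll out one goal-directed navigation task and report success.

    `env` follows a hypothetical Gym-like reset/step interface;
    `vlm_agent.act` maps an observation plus a language instruction
    to a platform-appropriate action.
    """
    obs = env.reset()
    for _ in range(max_steps):
        action = vlm_agent.act(obs, instruction)  # perception + reasoning
        obs, done, success = env.step(action)     # control closes the loop
        if done:
            return success
    return False  # timed out without reaching the goal

def success_rate(env, vlm_agent, tasks) -> float:
    """Aggregate success over goal-directed tasks (58 in the benchmark)."""
    return sum(run_episode(env, vlm_agent, t) for t in tasks) / len(tasks)
```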
Problem

Research questions and friction points this paper is trying to address.

How does embodiment (physical platform, sensor configuration, modality alignment) shape VLMs' perception, reasoning, and control?
How well do current models cover the semantic, spatial, temporal, and physical dimensions of embodied reasoning?
Do models generalize beyond platform-specific adaptation, or do they overfit to a single embodiment?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Closed-loop benchmark for embodied reasoning evaluation
Heterogeneous embodiments with dynamic sensor configurations
Domain-far queries to prevent embodiment overfitting (one plausible construction is sketched after this list)
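The summary does not specify how domain-far queries are built. One plausible mechanism, sketched purely as an assumption, is to keep only candidate questions whose embeddings sit far from the platform's own domain corpus, so answers cannot be memorized from embodiment-specific context:

```python
import numpy as np

def select_domain_far(query_embs: np.ndarray,
                      domain_embs: np.ndarray,
                      min_dist: float = 0.5) -> np.ndarray:
    """Indices of queries whose nearest in-domain neighbor is still far away.

    query_embs:  (Q, D) unit-normalized query embeddings (hypothetical)
    domain_embs: (N, D) unit-normalized embeddings of embodiment-specific text
    min_dist:    minimum cosine distance (1 - similarity) to count as domain-far
    """
    sims = query_embs @ domain_embs.T  # cosine similarities, shape (Q, N)
    nearest = sims.max(axis=1)         # similarity to closest in-domain item
    return np.where(1.0 - nearest >= min_dist)[0]
```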
👥 Authors
Tin Stribor Sohn
Karlsruhe Institute of Technology
Maximilian Dillitzer
UAS Esslingen
Jason J. Corso
University of Michigan
Eric Sax
Karlsruhe Institute of Technology, Systems Engineering