ENACT: Evaluating Embodied Cognition with World Modeling of Egocentric Interaction

📅 2025-11-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether modern vision-language models (VLMs) possess embodied cognition: the capacity for intelligent behavior grounded in sensorimotor interaction rather than passive observation. Method: ENACT, introduced as the first scalable benchmark for embodied cognition evaluation, formalizes world modeling as POMDP-inspired forward and inverse sequence reordering tasks, bypassing image-generation confounds while implicitly assessing perception, action reasoning, and long-term memory. Using the BEHAVIOR simulator, the authors generate 8,972 household-scene visual question-answering pairs in which actions are encoded as scene graph transformations. Results: VLMs show a pattern reversed relative to humans, surpassing them on the inverse task while underperforming on the forward one, alongside a right-handedness bias and viewpoint-dependent errors. Performance degrades significantly as interaction length grows, widening the human–model gap and exposing fundamental limitations in embodied reasoning.
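Concretely, encoding an action as a scene graph transformation can be pictured as a diff between symbolic states. The sketch below is illustrative only; `Edge`, `SceneGraph`, `diff`, and `apply_action` are hypothetical names, not ENACT's actual data schema.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Edge:
    subject: str   # e.g. "mug_1"
    relation: str  # e.g. "on_top_of", "inside", "held_by_right_hand"
    obj: str       # e.g. "table_1"

@dataclass
class SceneGraph:
    edges: set = field(default_factory=set)

@dataclass(frozen=True)
class Action:
    """An action, encoded as the diff between consecutive scene graphs."""
    added: frozenset
    removed: frozenset

def diff(before: SceneGraph, after: SceneGraph) -> Action:
    """Derive the symbolic action that transforms `before` into `after`."""
    return Action(
        added=frozenset(after.edges - before.edges),
        removed=frozenset(before.edges - after.edges),
    )

def apply_action(g: SceneGraph, a: Action) -> SceneGraph:
    """Forward model on symbolic state: apply the scene graph edit."""
    return SceneGraph((g.edges - set(a.removed)) | set(a.added))
```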

📝 Abstract
Embodied cognition argues that intelligence arises from sensorimotor interaction rather than passive observation. It raises an intriguing question: do modern vision-language models (VLMs), trained largely in a disembodied manner, exhibit signs of embodied cognition? We introduce ENACT, a benchmark that casts evaluation of embodied cognition as world modeling from egocentric interaction in a visual question answering (VQA) format. Framed as a partially observable Markov decision process (POMDP) whose actions are scene graph changes, ENACT comprises two complementary sequence reordering tasks: forward world modeling (reorder shuffled observations given actions) and inverse world modeling (reorder shuffled actions given observations). While conceptually simple, solving these tasks implicitly demands capabilities central to embodied cognition: affordance recognition, action-effect reasoning, embodied awareness, and interactive, long-horizon memory from partially observable egocentric input, while avoiding low-level image synthesis that could confound the evaluation. We provide a scalable pipeline that synthesizes QA pairs from robotics simulation (BEHAVIOR) and evaluates models on 8,972 QA pairs spanning long-horizon home-scale activities. Experiments reveal a performance gap between frontier VLMs and humans that widens with interaction horizon. Models consistently perform better on the inverse task than the forward one and exhibit anthropocentric biases, including a preference for right-handed actions and degradation when camera intrinsics or viewpoints deviate from human vision. Website at https://enact-embodied-cognition.github.io/.
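A minimal sketch of how the two reordering tasks might be instantiated as QA pairs, assuming observations and actions are given as aligned sequences; the function names (`make_forward_qa`, `make_inverse_qa`) and the answer encoding are illustrative assumptions, not the benchmark's actual pipeline.

```python
import random

def reorder_qa(items, rng):
    """Shuffle `items`; answer[k] is the shuffled slot holding the item
    that originally came k-th."""
    perm = list(range(len(items)))
    rng.shuffle(perm)                   # perm[j] = original index at slot j
    shuffled = [items[i] for i in perm]
    answer = sorted(range(len(perm)), key=perm.__getitem__)  # argsort(perm)
    return shuffled, answer

def make_forward_qa(observations, actions, seed=0):
    """Forward task: actions given in order; observations shuffled."""
    shuffled_obs, answer = reorder_qa(observations, random.Random(seed))
    return {"given_actions": actions,
            "shuffled_observations": shuffled_obs,
            "answer": answer}

def make_inverse_qa(observations, actions, seed=0):
    """Inverse task: observations given in order; actions shuffled."""
    shuffled_acts, answer = reorder_qa(actions, random.Random(seed))
    return {"given_observations": observations,
            "shuffled_actions": shuffled_acts,
            "answer": answer}
```

In the forward task the model conditions on the ordered actions and must recover the temporal order of shuffled egocentric frames; in the inverse task the roles swap.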
Problem

Research questions and friction points this paper is trying to address.

Evaluating embodied cognition in vision-language models through world modeling tasks
Assessing model capabilities in affordance recognition and action-effect reasoning
Measuring performance gaps between VLMs and humans on long-horizon egocentric interaction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluating embodied cognition via a scalable world modeling benchmark (ENACT)
Using forward and inverse sequence reordering tasks
Synthesizing QA pairs from robotics simulation data
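Since the paper reports that the human–model gap widens with interaction horizon, evaluation plausibly bins scores by sequence length. A minimal sketch of such horizon-binned scoring, assuming exact-match permutation accuracy as the metric (the paper's actual scoring may differ):

```python
from collections import defaultdict

def exact_match(pred, answer):
    """A reordering counts as correct only if the full permutation matches."""
    return list(pred) == list(answer)

def accuracy_by_horizon(results):
    """Group exact-match accuracy by interaction length to expose how
    performance degrades over longer horizons.
    `results` is an iterable of (horizon, pred, answer) triples."""
    correct, total = defaultdict(int), defaultdict(int)
    for horizon, pred, answer in results:
        total[horizon] += 1
        correct[horizon] += exact_match(pred, answer)
    return {h: correct[h] / total[h] for h in sorted(total)}
```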