🤖 AI Summary
It remains unclear whether current multimodal large language models can reliably reason about the physical consequences of long-horizon action sequences from an egocentric perspective. This work formally defines the task of egocentric long-horizon scene prediction: given an initial scene image and a sequence of atomic actions, a model must predict the resulting final scene. To enable systematic evaluation, the authors present EXPLORE-Bench, a benchmark derived from real-world first-person videos, with structured annotations of object categories, visual attributes, and spatial relations that support fine-grained assessment. Experiments reveal a substantial gap between existing models and human performance; incorporating step-by-step reasoning partially improves prediction accuracy but incurs significant computational overhead.
📝 Abstract
Multimodal large language models (MLLMs) are increasingly regarded as a foundation for embodied agents, yet it remains unclear whether they can reliably reason about the long-term physical consequences of actions from an egocentric viewpoint. We study this gap through a new task, Egocentric Scene Prediction with LOng-horizon REasoning: given an initial-scene image and a sequence of atomic action descriptions, a model must predict the final scene after all actions are executed. To enable systematic evaluation, we introduce EXPLORE-Bench, a benchmark curated from real first-person videos spanning diverse scenarios. Each instance pairs a long action sequence with structured final-scene annotations, including object categories, visual attributes, and inter-object relations, which supports fine-grained, quantitative assessment. Experiments on a range of proprietary and open-source MLLMs reveal a significant performance gap relative to humans, indicating that long-horizon egocentric reasoning remains a major challenge. We further analyze test-time scaling via stepwise reasoning and show that decomposing long action sequences improves performance to some extent, while incurring non-trivial computational overhead. Overall, EXPLORE-Bench provides a principled testbed for measuring and advancing long-horizon reasoning in egocentric embodied perception.
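The structured final-scene annotations described above (object categories, visual attributes, inter-object relations) naturally admit set-based fine-grained scoring. The sketch below illustrates one plausible instance schema and an F1-style set metric; the field names and the metric choice are our assumptions for illustration, not the benchmark's actual format:

```python
from dataclasses import dataclass


@dataclass
class SceneAnnotation:
    """Hypothetical structured final-scene annotation (illustrative schema)."""
    objects: set[str]                      # object categories present in the final scene
    attributes: dict[str, set[str]]        # object -> visual attributes (e.g. "open", "red")
    relations: set[tuple[str, str, str]]   # (subject, relation, object) triples


@dataclass
class Instance:
    """One benchmark instance: initial scene + action sequence + gold final scene."""
    initial_image: str        # path to the initial egocentric frame
    actions: list[str]        # atomic action descriptions, in execution order
    final_scene: SceneAnnotation


def set_f1(pred: set, gold: set) -> float:
    """F1 over predicted vs. gold sets, one way to score structured predictions."""
    if not pred and not gold:
        return 1.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```

A full evaluation would apply such a metric per annotation field (objects, attributes, relations) and aggregate across instances; for example, `set_f1({"cup", "plate"}, {"cup", "bowl"})` yields 0.5.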