🤖 AI Summary
This work addresses a critical gap in spatial reasoning research, which often neglects the influence of human intention on dynamic spatial configurations and lacks a unified evaluation framework integrating physical laws with goal-directed behavior. To bridge this gap, we propose Teleo-Spatial Intelligence (TSI), a novel paradigm that systematically incorporates intention-driven spatial reasoning. We introduce EscherVerse, an open-world benchmark comprising the large-scale real-world video dataset Escher-35k and the evaluation suite Escher-Bench, enabling joint assessment of object permanence, state transitions, and trajectory prediction. This benchmark advances spatial intelligence from passive perception toward purpose-oriented, holistic understanding. Furthermore, we develop the Escher series of models that jointly learn physical interactions and intention inference, providing embodied agents with foundational capabilities grounded in both physical commonsense and goal comprehension.
📝 Abstract
The ability to reason about spatial dynamics is a cornerstone of intelligence, yet current research overlooks the human intent behind spatial changes. To address this limitation, we introduce Teleo-Spatial Intelligence (TSI), a new paradigm that unifies two critical pillars: Physical-Dynamic Reasoning (understanding the physical principles of object interactions) and Intent-Driven Reasoning (inferring the human goals behind these actions). To catalyze research in TSI, we present EscherVerse, consisting of a large-scale, open-world benchmark (Escher-Bench), a dataset (Escher-35k), and models (Escher series). Derived from real-world videos, EscherVerse moves beyond constrained settings to explicitly evaluate an agent's ability to reason about object permanence, state transitions, and trajectory prediction in dynamic, human-centric scenarios. Crucially, it is the first benchmark to systematically assess Intent-Driven Reasoning, challenging models to connect physical events to their underlying human purposes. Our work, including a novel data curation pipeline, provides a foundational resource to advance spatial intelligence from passive scene description toward a holistic, purpose-driven understanding of the world.