EvoEmpirBench: Dynamic Spatial Reasoning with Agent-ExpVer

📅 2025-09-16
🤖 AI Summary
Existing spatial reasoning benchmarks predominantly focus on static, fully observable environments, limiting their ability to assess models’ long-horizon reasoning and memory utilization under partial observability and dynamic conditions. To address this gap, we introduce two novel benchmarks, Local-Observable Maze Navigation and Match-2 Elimination, designed to systematically evaluate agents’ spatial understanding, adaptive planning, and cognitive updating in dynamic settings. We also propose a structured memory mechanism based on subjective experience that jointly models local observations, integrates dynamic environmental feedback, and enables online memory revision, all embedded within a reinforcement learning framework for continual policy optimization. Experiments reveal that state-of-the-art models exhibit significant performance degradation on these tasks, underscoring the tasks' difficulty and the inadequacy of current architectures. These findings support the benchmarks’ utility for advancing embodied spatial reasoning and the design of agents with long-term memory.

📝 Abstract
Most existing spatial reasoning benchmarks focus on static or globally observable environments, failing to capture the challenges of long-horizon reasoning and memory utilization under partial observability and dynamic changes. We introduce two dynamic spatial benchmarks, locally observable maze navigation and match-2 elimination, that systematically evaluate models' abilities in spatial understanding and adaptive planning when local perception, environment feedback, and global objectives are tightly coupled. Each action triggers structural changes in the environment, requiring the model to continuously update its cognition and strategy. We further propose a subjective experience-based memory mechanism for cross-task experience transfer and validation. Experiments show that our benchmarks reveal key limitations of mainstream models in dynamic spatial reasoning and long-term memory, providing a comprehensive platform for future methodological advances. Our code and data are available at https://anonymous.4open.science/r/EvoEmpirBench-143C/.
Problem

Research questions and friction points this paper is trying to address.

Evaluates spatial reasoning under partial observability and dynamics
Tests adaptive planning with local perception and global objectives
Assesses memory utilization for long-horizon reasoning tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic spatial benchmarks with partial observability
Action-triggered environment changes requiring strategy updates
Subjective experience-based memory for cross-task transfer
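
To make the ideas of local observability, action-triggered change, and memory revision concrete, here is a minimal Python sketch. It is not the benchmark's actual implementation: the grid encoding, the function names (`local_view`, `step`, `update_memory`), and the rule that a cell turns into a wall after the agent leaves it are all assumptions chosen purely for illustration.

```python
from typing import Dict, List, Tuple

Cell = Tuple[int, int]
Grid = List[List[str]]  # assumed encoding: '#' = wall, '.' = free cell

MOVES: Dict[str, Cell] = {
    "up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1),
}


def in_bounds(grid: Grid, r: int, c: int) -> bool:
    """True if (r, c) lies inside the grid."""
    return 0 <= r < len(grid) and 0 <= c < len(grid[0])


def local_view(grid: Grid, pos: Cell, radius: int = 1) -> List[str]:
    """Return the square window around pos: '@' is the agent, '?' is off-grid."""
    rows = []
    for r in range(pos[0] - radius, pos[0] + radius + 1):
        row = ""
        for c in range(pos[1] - radius, pos[1] + radius + 1):
            if (r, c) == pos:
                row += "@"
            elif in_bounds(grid, r, c):
                row += grid[r][c]
            else:
                row += "?"
        rows.append(row)
    return rows


def step(grid: Grid, pos: Cell, action: str) -> Cell:
    """Move if the target cell is free; otherwise stay in place.

    Hypothetical dynamic rule (an assumption, not from the paper):
    the cell the agent leaves collapses into a wall.
    """
    dr, dc = MOVES[action]
    r, c = pos[0] + dr, pos[1] + dc
    if not in_bounds(grid, r, c) or grid[r][c] == "#":
        return pos  # blocked: nothing changes
    grid[pos[0]][pos[1]] = "#"  # structural change triggered by the action
    return (r, c)


def update_memory(memory: Dict[Cell, str], grid: Grid, pos: Cell,
                  radius: int = 1) -> None:
    """Overwrite remembered cells with whatever is currently observed."""
    for r in range(pos[0] - radius, pos[0] + radius + 1):
        for c in range(pos[1] - radius, pos[1] + radius + 1):
            if in_bounds(grid, r, c):
                memory[(r, c)] = grid[r][c]
```

In this sketch the agent only ever sees the window returned by `local_view`, so a map it remembered earlier can become stale after a move; calling `update_memory` after each step is one simple way to revise it.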
Pukun Zhao
Guangdong University of Finance and Economics
Longxiang Wang
PhD student, City University of Hong Kong
Large language model, Encrypted database
Miaowei Wang
University of Edinburgh
Chen Chen
Guangdong University of Finance and Economics
Fanqing Zhou
Guangdong University of Finance and Economics
Haojian Huang
The University of Hong Kong