🤖 AI Summary
Real-world business decision-making requires handling four challenges at once: optimizing open-ended goals, actively modeling an environment from sparse feedback, planning over long horizons in stochastic settings, and reasoning spatially. Existing human-AI benchmarks evaluate these capabilities in isolation and so fail to assess integrated decision-making competence. To address this gap, we introduce Mini Amusement Parks (MAPs), the first unified simulation benchmark integrating all four dimensions. MAPs supports systematic evaluation of an agent's world modeling, long-horizon optimization, and spatial reasoning under open-ended goals, stochastic dynamics, and sparse rewards, and it provides LLM-based agent implementations and human performance baselines. In experiments, state-of-the-art LLM agents reach only 15.4% and 10.2% of human performance on the easy and medium difficulty levels, respectively (humans outperform them by 6.5x and 9.8x), revealing fundamental deficiencies in long-horizon planning and spatial reasoning.
📝 Abstract
Despite rapid progress in artificial intelligence, current systems struggle with the interconnected challenges that define real-world decision making. Practical domains, such as business management, require optimizing an open-ended and multi-faceted objective, actively learning environment dynamics from sparse experience, planning over long horizons in stochastic settings, and reasoning over spatial information. Yet existing human-AI benchmarks isolate subsets of these capabilities, limiting our ability to assess holistic decision-making competence. We introduce Mini Amusement Parks (MAPs), an amusement-park simulator designed to evaluate an agent's ability to model its environment, anticipate long-term consequences under uncertainty, and strategically operate a complex business. We provide human baselines and a comprehensive evaluation of state-of-the-art LLM agents, finding that humans outperform these systems by 6.5x on easy mode and 9.8x on medium mode. Our analysis reveals persistent weaknesses in long-horizon optimization, sample-efficient learning, spatial reasoning, and world modeling. By unifying these challenges within a single environment, MAPs offers a new foundation for benchmarking agents capable of adaptable decision making. Code: https://github.com/Skyfall-Research/MAPs