🤖 AI Summary
Existing benchmarks rarely evaluate an autonomous agent's ability to combine productive activities with social interaction. To address this, we propose StarDojo, the first open-ended production-life simulation benchmark grounded in the game *Stardew Valley*, designed to assess agents holistically across five core task domains: farming, crafting, exploration, combat, and social interaction. StarDojo supports cross-platform parallel execution and operates without keyboard or mouse control. It offers a full suite of 1,000 tasks and a curated subset of 100 tasks, along with a lightweight, extensible multimodal simulation interface. Its key contribution is a unified evaluation framework, the first to integrate production logic with social dynamics, which establishes an open, realism-oriented standard for assessing agents in lifelike virtual environments. Evaluation of leading agents based on multimodal large language models (MLLMs) shows that even the state-of-the-art GPT-4.1 achieves only a 12.7% overall success rate, highlighting critical bottlenecks in visual understanding, cross-modal reasoning, and low-level action control.
📝 Abstract
Autonomous agents navigating human society must master both production activities and social interactions, yet existing benchmarks rarely evaluate these skills simultaneously. To bridge this gap, we introduce StarDojo, a novel benchmark based on Stardew Valley, designed to assess AI agents in open-ended production-living simulations. In StarDojo, agents are tasked with performing essential livelihood activities such as farming and crafting, while simultaneously engaging in social interactions to establish relationships within a vibrant community. StarDojo features 1,000 meticulously curated tasks across five key domains: farming, crafting, exploration, combat, and social interactions. Additionally, we provide a compact subset of 100 representative tasks for efficient model evaluation. The benchmark offers a unified, user-friendly interface that eliminates the need for keyboard and mouse control, supports all major operating systems, and enables the parallel execution of multiple environment instances, making it particularly well-suited for evaluating the most capable foundation agents powered by multimodal large language models (MLLMs). Extensive evaluations of state-of-the-art MLLM agents demonstrate substantial limitations, with the best-performing model, GPT-4.1, achieving only a 12.7% success rate, primarily due to challenges in visual understanding, multimodal reasoning, and low-level manipulation. As a user-friendly environment and benchmark, StarDojo aims to facilitate further research towards robust, open-ended agents in complex production-living environments.