StarDojo: Benchmarking Open-Ended Behaviors of Agentic Multimodal LLMs in Production-Living Simulations with Stardew Valley

📅 2025-07-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing benchmarks rarely evaluate autonomous agents on productive activities and social interactions together. To address this, we propose StarDojo, the first open-ended production-living simulation benchmark built on the game *Stardew Valley*, which assesses agents across five task domains: farming, crafting, exploration, combat, and social interaction. StarDojo offers 1,000 tasks and a curated subset of 100 tasks, along with a lightweight, extensible multimodal simulation interface that runs across platforms, supports parallel execution, and requires no keyboard or mouse control. Its key contribution is a unified evaluation framework that integrates production logic with social dynamics, providing an open, realism-oriented standard for assessing agents in lifelike virtual environments. Evaluation of leading multimodal large language model (MLLM)-based agents shows that even the best-performing model, GPT-4.1, achieves only a 12.7% overall success rate, pointing to bottlenecks in visual understanding, cross-modal reasoning, and low-level action control.

📝 Abstract

Autonomous agents navigating human society must master both production activities and social interactions, yet existing benchmarks rarely evaluate these skills simultaneously. To bridge this gap, we introduce StarDojo, a novel benchmark based on Stardew Valley, designed to assess AI agents in open-ended production-living simulations. In StarDojo, agents are tasked with performing essential livelihood activities such as farming and crafting, while simultaneously engaging in social interactions to establish relationships within a vibrant community. StarDojo features 1,000 meticulously curated tasks across five key domains: farming, crafting, exploration, combat, and social interactions. Additionally, we provide a compact subset of 100 representative tasks for efficient model evaluation. The benchmark offers a unified, user-friendly interface that eliminates the need for keyboard and mouse control, supports all major operating systems, and enables the parallel execution of multiple environment instances, making it particularly well-suited for evaluating the most capable foundation agents powered by multimodal large language models (MLLMs). Extensive evaluations of state-of-the-art MLLM agents demonstrate substantial limitations, with the best-performing model, GPT-4.1, achieving only a 12.7% success rate, primarily due to challenges in visual understanding, multimodal reasoning, and low-level manipulation. As a user-friendly environment and benchmark, StarDojo aims to facilitate further research towards robust, open-ended agents in complex production-living environments.

Problem

Research questions and friction points this paper is trying to address.

Assessing AI agents in open-ended production-living simulations

Evaluating multimodal LLMs in farming, crafting, and social interactions

Benchmarking agent performance in complex visual and reasoning tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark for production-living simulations with Stardew Valley

Unified interface supporting parallel environment execution

Evaluates multimodal LLMs in diverse open-ended tasks

🔎 Similar Papers

Odyssey: Empowering Minecraft Agents with Open-World Skills

2024-07-22 · Citations: 3