EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies

📅 2026-02-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing evaluation frameworks struggle to assess how well large language models plan and execute in long-horizon, dynamic economic interactions. To address this gap, this work proposes EcoGym, a general and extensible benchmark platform for evaluating long-horizon planning in persistent interactive economic systems. EcoGym features three environments (Vending, Freelance, and Operation), each grounded in realistic economic logic, and supports cross-scenario comparison through a unified interface. It incorporates budget constraints, partial observability, and stochasticity, and it evaluates the trade-off between strategic coherence and execution efficiency using business metrics such as net worth, income, and daily active users (DAU). Evaluations of 11 prominent large language models show that no single model dominates across all three scenarios, and that models exhibit significant suboptimality in either high-level strategy or low-level execution.

📝 Abstract

Long-horizon planning is widely recognized as a core capability of autonomous LLM-based agents; however, current evaluation frameworks are largely episodic, domain-specific, or insufficiently grounded in persistent economic dynamics. We introduce EcoGym, a generalizable benchmark for continuous plan-and-execute decision making in interactive economies. EcoGym comprises three diverse environments (Vending, Freelance, and Operation), implemented as a unified decision-making process with standardized interfaces and budgeted actions over an effectively unbounded horizon (1000+ steps when evaluated over 365 day-loops). Evaluation in EcoGym is based on business-relevant outcomes (e.g., net worth, income, and DAU), targeting long-term strategic coherence and robustness under partial observability and stochasticity. Experiments across eleven leading LLMs expose a systematic tension: no single model dominates across all three scenarios. Critically, we find that models exhibit significant suboptimality in either high-level strategy or efficient action execution. EcoGym is released as an open, extensible testbed for transparent long-horizon agent evaluation and for studying controllability-utility trade-offs in realistic economic settings.
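To make the setup described in the abstract more concrete, here is a minimal sketch of what a day-loop with budgeted actions, partial observability, and stochastic outcomes might look like. It is not EcoGym's actual API: the class `ToyVendingEnv`, its methods (`observe`, `act`, `end_day`, `net_worth`), the action names, the demand model, and the choice of three actions per day are all hypothetical and chosen only for illustration.

```python
import random


class ToyVendingEnv:
    """Hypothetical toy economy; NOT EcoGym's real interface."""

    UNIT_COST = 1.0  # assumed wholesale cost per item

    def __init__(self, seed=0, actions_per_day=3):
        self._rng = random.Random(seed)      # seeded source of stochastic demand
        self.actions_per_day = actions_per_day
        self._actions_left = actions_per_day
        self.cash = 100.0
        self.stock = 0
        self.price = 2.0
        self.day = 0

    def observe(self):
        # Partial observability: the agent sees its own books,
        # but never the hidden demand model.
        return {"day": self.day, "cash": self.cash,
                "stock": self.stock, "actions_left": self._actions_left}

    def act(self, name, amount=0):
        """Apply one budgeted action; return False if it is rejected."""
        if self._actions_left <= 0:
            return False                      # daily action budget exhausted
        self._actions_left -= 1               # invalid actions still cost budget
        if name == "restock":
            cost = self.UNIT_COST * amount
            if amount <= 0 or cost > self.cash:
                return False
            self.cash -= cost
            self.stock += amount
            return True
        if name == "set_price":
            if amount <= 0:
                return False
            self.price = float(amount)
            return True
        return False

    def end_day(self):
        # Hidden, stochastic demand whose ceiling falls as price rises.
        demand = self._rng.randint(0, max(0, int(20 - 4 * self.price)))
        sold = min(demand, self.stock)
        self.stock -= sold
        self.cash += sold * self.price
        self.day += 1
        self._actions_left = self.actions_per_day
        return sold

    def net_worth(self):
        # Business-style outcome metric: cash plus stock valued at cost.
        return self.cash + self.UNIT_COST * self.stock


def run_episode(env, policy, days=365):
    """Run `days` day-loops; `policy` maps an observation to a list of actions."""
    for _ in range(days):
        for name, amount in policy(env.observe()):
            env.act(name, amount)
        env.end_day()
    return env.net_worth()


def restock_policy(obs):
    # Trivial baseline: top up stock whenever it runs low.
    if obs["stock"] < 10 and obs["cash"] >= 10:
        return [("restock", 10)]
    return []
```

In a sketch like this, an LLM agent would take the place of `restock_policy`, and the final `net_worth()` after `run_episode(ToyVendingEnv(seed=0), policy)` would serve as the score. Fixing the seed keeps the stochastic demand reproducible across compared policies.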

Problem

Research questions and friction points this paper is trying to address.

long-horizon planning

interactive economies

LLM-based agents

evaluation benchmark

economic dynamics

Innovation

Methods, ideas, or system contributions that make the work stand out.

long-horizon planning

interactive economies

plan-and-execute