EconEvals: Benchmarks and Litmus Tests for LLM Agents in Unknown Environments

📅 2025-03-24

🏛️ arXiv.org

📈 Citations: 3

✨ Influential: 1

🤖 AI Summary

This work addresses the lack of systematic evaluation of how large language model (LLM) agents learn, model, and make strategic decisions in unknown economic environments such as procurement, scheduling, task allocation, and pricing. The paper proposes an evaluation framework with two components: (i) synthetically generated benchmarks derived from key problems in economics, with scalable difficulty levels to forestall saturation, and (ii) “litmus tests”, a quantitative measure of behavioral tendencies. The litmus tests place agents in tradeoffs (e.g., efficiency versus equality) that have no objectively right or wrong answer, so they characterize an agent's character and values rather than score its ability. Together, the benchmarks and litmus tests assess both the abilities and the tendencies of LLM agents across these four settings, applications that should grow in importance as such agents are further integrated into the economy.

📝 Abstract

We develop benchmarks for LLM agents that act in, learn from, and strategize in unknown environments, the specifications of which the LLM agent must learn over time from deliberate exploration. Our benchmarks consist of decision-making tasks derived from key problems in economics. To forestall saturation, the benchmark tasks are synthetically generated with scalable difficulty levels. Additionally, we propose litmus tests, a new kind of quantitative measure for LLMs and LLM agents. Unlike benchmarks, litmus tests quantify differences in character, values, and tendencies of LLMs and LLM agents, by considering their behavior when faced with tradeoffs (e.g., efficiency versus equality) where there is no objectively right or wrong behavior. Overall, our benchmarks and litmus tests assess the abilities and tendencies of LLM agents in tackling complex economic problems in diverse settings spanning procurement, scheduling, task allocation, and pricing -- applications that should grow in importance as such agents are further integrated into the economy.
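To make the litmus-test idea concrete, here is a minimal sketch of how a position on an efficiency-versus-equality tradeoff could be turned into a number. It is an illustration only and not the paper's actual metric: the function names (`efficiency`, `inequality`, `litmus_score`), the choice of total payoff as the efficiency measure, and the max-min gap as the inequality measure are all assumptions made for this example.

```python
from typing import Sequence


def efficiency(payoffs: Sequence[float]) -> float:
    """Total payoff across all parties (assumed efficiency measure)."""
    return float(sum(payoffs))


def inequality(payoffs: Sequence[float]) -> float:
    """Gap between best-off and worst-off party; 0.0 means perfectly equal."""
    return float(max(payoffs) - min(payoffs))


def litmus_score(
    chosen: Sequence[float],
    most_efficient: Sequence[float],
    most_equal: Sequence[float],
) -> float:
    """Hypothetical score in [0, 1]: 0.0 means the agent's choice is as
    efficient as the most-equal option, 1.0 means it matches the
    most-efficient option. Neither end is 'correct'."""
    lo = efficiency(most_equal)
    hi = efficiency(most_efficient)
    if hi == lo:
        raise ValueError("no tradeoff: both reference options are equally efficient")
    score = (efficiency(chosen) - lo) / (hi - lo)
    # Clamp in case the chosen allocation falls outside the reference range.
    return min(1.0, max(0.0, score))


# Two parties; the agent picks an allocation between the two extremes.
most_efficient = [10.0, 2.0]  # total 12, gap 8
most_equal = [4.0, 4.0]       # total 8, gap 0
chosen = [7.0, 3.0]           # total 10, gap 4
print(litmus_score(chosen, most_efficient, most_equal))
```

Unlike a benchmark score, a value such as this would describe where an agent tends to land on the tradeoff rather than how well it performed.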

Problem

Research questions and friction points this paper is trying to address.

Assessing LLM agents' decision-making in unknown economic environments

Measuring LLM agents' character and values via tradeoff scenarios

Evaluating LLM agents' performance in diverse economic applications

Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthetic benchmarks with scalable difficulty levels

Litmus tests for LLM character and values

Decision-making tasks from economic problems

🔎 Similar Papers

GLEE: A Unified Framework and Benchmark for Language-based Economic Environments