🤖 AI Summary
Problem: Existing benchmarks lack fine-grained evaluation of large language models' (LLMs) non-strategic microeconomic reasoning, particularly supply-and-demand analysis, leaving assessments of this capability incomplete.
Method: We introduce the first fine-grained benchmark of microeconomic logic, covering 10 domains, 58 reasoning elements, 5 analytical perspectives, and 3 question types. To keep the benchmark scalable and robust, we propose auto-STEER, an automated data-generation protocol that combines hand-crafted templates (adaptable to new domains and perspectives), LLM-assisted synthesis, multiple prompting strategies, and a multi-metric scoring framework, thereby mitigating evaluation overfitting (a hedged sketch of this pipeline follows this summary).
Contribution/Results: We systematically evaluate 27 open- and closed-source LLMs, revealing for the first time critical capability gaps in microeconomic reasoning, pronounced sensitivity to prompting strategy, and domain-specific performance disparities. We publicly release a reproducible, fine-grained performance atlas to support rigorous, transparent model assessment and progress in economic-reasoning research.
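The following is a minimal, hypothetical Python sketch of the template-adaptation idea behind auto-STEER: a handwritten question template with domain and perspective slots is instantiated combinatorially, and an LLM step (stubbed here) would paraphrase each filled template into a natural question. All names (`Template`, `DOMAINS`, `PERSPECTIVES`, `llm_rewrite`) are illustrative assumptions, not the paper's actual interface.

```python
# Hedged sketch of auto-STEER-style question generation: adapt a
# handwritten template to new domains/perspectives, then hand the
# filled template to an LLM for paraphrasing (stubbed below).
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Template:
    element: str  # one of the 58 reasoning elements
    text: str     # handwritten template with {domain}/{perspective} slots

DOMAINS = ["housing", "labor"]           # the paper uses 10 domains
PERSPECTIVES = ["consumer", "producer"]  # the paper uses 5 perspectives

def llm_rewrite(prompt: str) -> str:
    """Stub for the LLM-assisted synthesis step; a real system would
    query a model here to rephrase the filled template."""
    return prompt  # identity stand-in

def generate(template: Template):
    # Instantiate the template over the domain x perspective grid.
    for domain, perspective in product(DOMAINS, PERSPECTIVES):
        filled = template.text.format(domain=domain, perspective=perspective)
        yield {"element": template.element, "domain": domain,
               "perspective": perspective, "question": llm_rewrite(filled)}

demand_shift = Template(
    element="demand_shift",
    text=("In the {domain} market, from the {perspective}'s perspective, "
          "what happens to price and quantity when demand increases?"),
)
for q in generate(demand_shift):
    print(q["question"])
```

Because every (element, domain, perspective) cell can be re-instantiated on demand, fresh questions can be produced whenever a previous batch risks leaking into training data.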
📝 Abstract
How should one judge whether a given large language model (LLM) can reliably perform economic reasoning? Most existing LLM benchmarks focus on specific applications and fail to present the model with a rich variety of economic tasks. A notable exception is Raman et al. [2024], who offer an approach for comprehensively benchmarking strategic decision-making; however, this approach fails to address the non-strategic settings prevalent in microeconomics, such as supply-and-demand analysis. We address this gap by taxonomizing microeconomic reasoning, focusing on the logic of supply and demand, into $58$ distinct elements, each grounded in up to $10$ distinct domains, $5$ perspectives, and $3$ question types. The generation of benchmark data across this combinatorial space is powered by a novel LLM-assisted data generation protocol that we dub auto-STEER, which generates a set of questions by adapting handwritten templates to target new domains and perspectives. Because it offers an automated way of generating fresh questions, auto-STEER mitigates the risk that LLMs will be trained to over-fit evaluation benchmarks; we thus hope that it will serve as a useful tool both for evaluating and fine-tuning models for years to come. We demonstrate the usefulness of our benchmark via a case study on $27$ LLMs, ranging from small open-source models to the current state of the art. We examine each model's ability to solve microeconomic problems across our whole taxonomy and present the results across a range of prompting strategies and scoring metrics.
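To make the evaluation side concrete, here is a small, assumed sketch of scoring one model under several prompting strategies and aggregating per-domain accuracy into a fine-grained performance atlas. `ask_model`, `PROMPTS`, and `exact_match` are stand-ins for illustration, not the paper's actual harness or metrics.

```python
# Hedged sketch: evaluate generated questions under multiple prompting
# strategies and aggregate accuracy per (domain, strategy) cell.
from collections import defaultdict

PROMPTS = {
    "zero_shot": "{q}",
    "chain_of_thought": "{q}\nLet's think step by step.",
}

def ask_model(prompt: str) -> str:
    """Stub: a real harness would query an LLM here."""
    return "price rises, quantity rises"

def exact_match(pred: str, gold: str) -> float:
    # One of possibly several scoring metrics; others might use
    # soft matching or LLM-based grading.
    return float(pred.strip().lower() == gold.strip().lower())

def evaluate(questions):
    atlas = defaultdict(list)  # (domain, strategy) -> list of scores
    for item in questions:
        for name, fmt in PROMPTS.items():
            pred = ask_model(fmt.format(q=item["question"]))
            atlas[(item["domain"], name)].append(exact_match(pred, item["gold"]))
    return {cell: sum(s) / len(s) for cell, s in atlas.items()}

questions = [
    {"domain": "housing",
     "question": "If demand for housing rises, what happens to price and quantity?",
     "gold": "price rises, quantity rises"},
]
print(evaluate(questions))
```

Averaging each cell separately is what exposes prompt sensitivity and domain-specific disparities: the same model can score very differently across columns of such an atlas.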