Social Welfare Function Leaderboard: When LLM Agents Allocate Social Welfare

📅 2025-10-01
📈 Citations: 0 · Influential: 0
🤖 AI Summary
This study addresses the value ambiguity and unquantified decision risks that arise when large language models (LLMs) allocate scarce social resources. To this end, we construct a dynamic simulation environment and Social Welfare Function benchmark, the first of its kind, that quantifies both efficiency (Return on Investment, ROI, as collective utility) and fairness (the Gini coefficient), and we release the inaugural leaderboard for this domain. Evaluating 20 mainstream LLMs, we find no significant correlation between allocation capability and general conversational proficiency; models consistently exhibit a utilitarian bias, sacrificing fairness for aggregate output, and are highly sensitive to prompt perturbations such as output-length constraints and social-influence framing. These sensitivity analyses expose critical gaps in value alignment and underscore the urgent need for governance-oriented, value-aware evaluation benchmarks.

📝 Abstract
Large language models (LLMs) are increasingly entrusted with high-stakes decisions that affect human welfare. However, the principles and values that guide these models when distributing scarce societal resources remain largely unexamined. To address this, we introduce the Social Welfare Function (SWF) Benchmark, a dynamic simulation environment where an LLM acts as a sovereign allocator, distributing tasks to a heterogeneous community of recipients. The benchmark is designed to create a persistent trade-off between maximizing collective efficiency (measured by Return on Investment) and ensuring distributive fairness (measured by the Gini coefficient). We evaluate 20 state-of-the-art LLMs and present the first leaderboard for social welfare allocation. Our findings reveal three key insights: (i) A model's general conversational ability, as measured by popular leaderboards, is a poor predictor of its allocation skill. (ii) Most LLMs exhibit a strong default utilitarian orientation, prioritizing group productivity at the expense of severe inequality. (iii) Allocation strategies are highly vulnerable, easily perturbed by output-length constraints and social-influence framing. These results highlight the risks of deploying current LLMs as societal decision-makers and underscore the need for specialized benchmarks and targeted alignment for AI governance.
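The benchmark scores each model on two quantities: collective efficiency as Return on Investment and distributive fairness as the Gini coefficient. Below is a minimal sketch of how these two metrics might be computed, assuming ROI is total recipient output per unit of budget spent and the Gini coefficient follows its standard definition over accumulated payoffs; the paper's exact formulas may differ.

```python
import numpy as np

def roi(total_output: float, total_cost: float) -> float:
    """Return on Investment: collective utility produced per unit of resource spent."""
    return total_output / total_cost if total_cost else 0.0

def gini(payoffs) -> float:
    """Standard Gini coefficient over recipients' payoffs (0 = equality, 1 = maximal inequality)."""
    x = np.sort(np.asarray(payoffs, dtype=float))
    n, s = x.size, x.sum()
    if n == 0 or s == 0:
        return 0.0
    # Closed form over sorted values: G = sum_i (2i - n - 1) * x_i / (n * sum(x))
    i = np.arange(1, n + 1)
    return float(np.sum((2 * i - n - 1) * x) / (n * s))

# Hypothetical example: decent aggregate output, highly unequal distribution.
payoffs = [9.0, 8.5, 0.5, 0.2]
print(roi(sum(payoffs), total_cost=4.0))  # efficiency
print(gini(payoffs))                      # fairness (higher means more unequal)
```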
Problem

Research questions and friction points this paper is trying to address.

Evaluating how well LLMs allocate scarce societal resources in social welfare settings
Quantifying the trade-off between collective efficiency and distributive fairness in LLM allocation decisions
Identifying how easily LLM allocation strategies can be perturbed
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic simulation environment in which an LLM acts as a sovereign allocator distributing tasks to heterogeneous recipients (see the sketch after this list)
First leaderboard, built by evaluating 20 state-of-the-art LLMs on efficiency (ROI) and fairness (Gini) trade-off metrics
Sensitivity analysis showing allocation strategies are vulnerable to output-length constraints and social-influence framing
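To make the allocate-then-score dynamic concrete, here is a hypothetical, minimal simulation loop in the spirit of the benchmark: an allocator policy (standing in for the LLM) splits a fixed task budget across recipients of unequal productivity each round, and the run is scored on ROI and the Gini coefficient. The function names and the payoff model are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def gini(x):
    """Standard Gini coefficient over payoffs (0 = perfect equality)."""
    x = np.sort(np.asarray(x, dtype=float))
    n, s = x.size, x.sum()
    return 0.0 if s == 0 else float(np.sum((2 * np.arange(1, n + 1) - n - 1) * x) / (n * s))

def simulate_allocation(allocator, productivities, rounds=10, budget_per_round=5.0):
    """Toy allocate-then-score loop: each round the allocator splits a fixed task budget
    across heterogeneous recipients; a recipient's payoff grows as share * productivity."""
    payoffs = np.zeros(len(productivities))
    spent = 0.0
    for _ in range(rounds):
        shares = allocator(payoffs, productivities, budget_per_round)  # stand-in for the LLM's decision
        payoffs += np.asarray(shares) * productivities
        spent += budget_per_round
    roi = payoffs.sum() / spent   # efficiency: collective output per unit of budget
    return roi, gini(payoffs)     # fairness: inequality of accumulated payoffs

# A purely utilitarian policy concentrates the whole budget on the most productive recipient.
utilitarian = lambda payoffs, prod, b: np.eye(len(prod))[np.argmax(prod)] * b
print(simulate_allocation(utilitarian, productivities=np.array([1.0, 0.9, 0.3, 0.1])))
```

The utilitarian policy above illustrates the trade-off the benchmark is built around: concentrating tasks on the most productive recipient maximizes ROI while pushing the Gini coefficient toward its upper bound.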