Social Policy of Large Language Models: How GPT, Claude, DeepSeek and Grok Allocate Social Budgets in Spain and Germany

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study presents a systematic comparison of four mainstream large language models (GPT-4o, Claude, DeepSeek, and Grok) and their implicit preferences when allocating a social budget in real-world national fiscal contexts. Using identical prompts to simulate budget allocations for Spain and Germany, the authors benchmark model outputs against approximate OECD reference budgets. Through non-parametric tests (Kruskal-Wallis with post-hoc Mann-Whitney U comparisons), pairwise Pearson correlations, and a lexical examination of the generated justifications, they find that all four models under-allocate pensions by a factor close to three, while over-allocating housing roughly fourfold and employment roughly twofold. Only Claude shows a substantive response to national context. The main axis of difference between models is not geopolitical, since Claude and DeepSeek are the most correlated pair in both countries, but how concentrated or dispersed the budget is. Together these results point to a shared, systematic bias in how large language models handle public budgeting questions.

📝 Abstract

We study how four widely used large language models, namely Claude, GPT-4o, DeepSeek and Grok, distribute a fixed national social budget across twelve macro-areas of public expenditure under two European national contexts, Spain and Germany. Each combination of model and country is queried six times under identical prompts and generation parameters, producing forty-eight independent allocations that are compared against approximate Organisation for Economic Co-operation and Development (OECD) reference budgets and against each other. We formalise five hypotheses regarding geopolitical bias, housing under-allocation, structural convergence, sensitivity to national context, and under-representation of politically sensitive categories. The differences between models are then validated through Kruskal-Wallis tests on each macro-area, with post-hoc Mann-Whitney U comparisons under Bonferroni correction, and complemented by an analysis of pairwise Pearson correlations and a lexical examination of the textual justifications produced by each model. The results show that all four models share a systematic implicit social policy that diverges from real European spending structures: pensions are under-allocated by a factor close to three, while housing and employment are over-allocated by factors of four and two respectively. The principal axis of differentiation between models is not geopolitical, since Claude and DeepSeek are the most correlated pair across both countries, but rather a contrast between concentration and dispersion of the budget. Only Claude exhibits substantive sensitivity to the national context. The conclusions delimit the conditions under which language models may responsibly support, but not replace, expert deliberation in public budgeting.
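The design and tests described in the abstract can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions, not the authors' code: the significance level `ALPHA`, the choice to apply the Bonferroni correction over the six model pairs within each macro-area, and the helper names `average_ranks`, `kruskal_wallis_h` and `pearson_r` are all hypothetical. The Kruskal-Wallis statistic is computed without a tie correction and without a p-value, and no figures from the paper are reproduced.

```python
from itertools import combinations, product
from math import sqrt

MODELS = ["Claude", "GPT-4o", "DeepSeek", "Grok"]
COUNTRIES = ["Spain", "Germany"]
RUNS_PER_CELL = 6  # repetitions per (model, country) pair, as stated in the abstract

# Every (model, country, run) triple is one allocation: 4 * 2 * 6 = 48.
design = list(product(MODELS, COUNTRIES, range(RUNS_PER_CELL)))
n_allocations = len(design)

# Post-hoc comparisons run over unordered model pairs: C(4, 2) = 6.
model_pairs = list(combinations(MODELS, 2))
ALPHA = 0.05  # ASSUMPTION: the abstract does not state the significance level
bonferroni_alpha = ALPHA / len(model_pairs)  # ASSUMPTION: corrected per macro-area


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction) for a list of samples."""
    pooled = [v for group in groups for v in group]
    ranks = average_ranks(pooled)
    n = len(pooled)
    weighted_sum, start = 0.0, 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        weighted_sum += rank_sum ** 2 / len(group)
        start += len(group)
    return 12.0 / (n * (n + 1)) * weighted_sum - 3 * (n + 1)


def pearson_r(x, y):
    """Pearson correlation between two equal-length allocation vectors."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)
```

In such a setup, `kruskal_wallis_h` would receive one list of six shares per model for a given macro-area and country, and `pearson_r` would compare two models' twelve-element mean allocation vectors. A library such as SciPy would normally be used to obtain p-values for both the omnibus and the pairwise tests.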

Problem

Research questions and friction points this paper is trying to address.

large language models

social budget allocation

public expenditure

geopolitical bias

policy simulation

Innovation

Methods, ideas, or system contributions that make the work stand out.

large language models

social budget allocation

policy bias