Street-Level AI: Are Large Language Models Ready for Real-World Judgments?

📅 2025-08-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether large language models (LLMs) can reliably substitute for street-level bureaucrats in high-stakes social decisions, specifically resource allocation for unhoused individuals, by measuring their alignment with both human judgments and official vulnerability scoring systems. Method: Leveraging real-world service demand data, we conduct privacy-preserving, multi-round comparative experiments using locally deployed LLMs (both open-source and commercial variants), benchmarked against paired human annotations. Contribution/Results: We find significant inconsistency across LLM runs and model architectures, as well as misalignment with official scoring criteria. While LLM outputs exhibit qualitative alignment with lay human judgments, they lack robustness and policy fidelity. Critically, their instability and misalignment with institutional priorities preclude reliable deployment in scarcity-constrained, high-risk public resource allocation. This work provides the first empirical evidence and a methodological framework for the cautious integration of LLMs in public governance contexts.
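As a concrete illustration of the multi-round consistency measurement described above, here is a minimal Python sketch that repeatedly queries a locally deployed model for a priority ranking and averages pairwise Kendall's tau across runs. The `query_local_llm` helper is a hypothetical placeholder (the paper does not publish its harness), and Kendall's tau is one common rank-agreement metric, not necessarily the one the authors use.

```python
# Minimal sketch: run-to-run (internal) consistency of an LLM's priority
# ranking over the same case list, via mean pairwise Kendall's tau.
# `query_local_llm` is a hypothetical placeholder for a local model call.
from itertools import combinations
from scipy.stats import kendalltau

def query_local_llm(prompt: str) -> list[int]:
    """Hypothetical call to a locally hosted LLM that returns a
    priority ordering of case IDs (most to least vulnerable)."""
    raise NotImplementedError("wire up your local model here")

def run_consistency(case_prompt: str, n_runs: int = 5) -> float:
    """Query the model n_runs times and average Kendall's tau over all
    pairs of rankings; 1.0 = identical orderings, 0.0 = no correlation."""
    rankings = [query_local_llm(case_prompt) for _ in range(n_runs)]
    taus = []
    for a, b in combinations(rankings, 2):
        # Convert each ordering to a rank vector indexed by case ID.
        rank_a = {case: i for i, case in enumerate(a)}
        rank_b = {case: i for i, case in enumerate(b)}
        common = sorted(rank_a)  # assumes identical case sets per run
        tau, _ = kendalltau([rank_a[c] for c in common],
                            [rank_b[c] for c in common])
        taus.append(tau)
    return sum(taus) / len(taus)
```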

📝 Abstract
A surge of recent work explores the ethical and societal implications of large-scale AI models that make "moral" judgments. Much of this literature focuses either on alignment with human judgments through various thought experiments or on the group fairness implications of AI judgments. However, the most immediate and likely use of AI is to help or fully replace so-called street-level bureaucrats: the individuals who decide how to allocate scarce social resources or whether to approve benefits. There is a rich history of principles of local justice shaping how societies decide on prioritization mechanisms in such domains. In this paper, we examine how well LLM judgments align with human judgments, as well as with the socially and politically determined vulnerability scoring systems currently used in homelessness resource allocation. Crucially, we use real data on those needing services (maintaining strict confidentiality by using only locally hosted large models) to perform our analyses. We find that LLM prioritizations are extremely inconsistent in several ways: internally across runs, between different LLMs, and between LLMs and the vulnerability scoring systems. At the same time, LLMs demonstrate qualitative consistency with lay human judgments in pairwise testing. These findings call into question the readiness of current-generation AI systems for naive integration into high-stakes societal decision-making.
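The abstract also compares LLM prioritizations against official vulnerability scoring systems. Below is a minimal sketch of that comparison expressed as a rank correlation; the data layout is an illustrative assumption, and while the VI-SPDAT is one widely used instrument in this domain, the text above does not name which systems the paper evaluates.

```python
# Minimal sketch: rank correlation between an LLM's priority ordering and
# an official vulnerability score (e.g., a VI-SPDAT-style instrument).
# The input schema here is an illustrative assumption.
from scipy.stats import spearmanr

def llm_vs_official(llm_ranking: list[str],
                    official_scores: dict[str, float]) -> float:
    """Spearman correlation between the LLM's rank order and the ranking
    implied by official scores (higher score = higher priority)."""
    llm_rank = {case: i for i, case in enumerate(llm_ranking)}
    cases = sorted(llm_rank)
    rho, _ = spearmanr(
        [llm_rank[c] for c in cases],
        # Negate so higher vulnerability maps to a smaller (better) rank.
        [-official_scores[c] for c in cases],
    )
    return rho
```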
Problem

Research questions and friction points this paper is trying to address.

Assessing how well LLM judgments align with human judgments in real-world allocation decisions
Evaluating the consistency of LLM prioritizations in homelessness resource allocation
Examining whether current-generation AI systems are ready for integration into high-stakes societal decision-making
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses locally deployed large models so that real client data never leaves the local environment
Benchmarks LLM judgments against both human annotations and official vulnerability scoring systems
Tests LLM run-to-run and cross-model consistency on real-world cases (see the sketch below)
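To make the pairwise-testing idea concrete, here is a minimal sketch of the agreement computation between LLM and lay-human judgments. This is not the paper's harness: the function name, input schema, and example data are illustrative assumptions, and the actual pairwise protocol may differ.

```python
# Minimal sketch: qualitative agreement between LLM and lay-human pairwise
# judgments. Inputs are parallel lists holding the case chosen as higher
# priority in each comparison; the schema is an illustrative assumption.
def pairwise_agreement(llm_choices: list[str],
                       human_choices: list[str]) -> float:
    """Fraction of paired comparisons where the LLM and the human
    rater selected the same higher-priority case."""
    if len(llm_choices) != len(human_choices):
        raise ValueError("choice lists must be aligned, one entry per pair")
    matches = sum(l == h for l, h in zip(llm_choices, human_choices))
    return matches / len(llm_choices)

# Example: agreement on 3 of 4 pairs -> 0.75
print(pairwise_agreement(["A", "C", "B", "D"], ["A", "C", "E", "D"]))
```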