LLMs in social services: How does chatbot accuracy affect human accuracy?

📅 2026-03-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenge caseworkers face in accurately guiding clients through the complex eligibility rules of social service programs like SNAP. To investigate the impact of AI assistance, the authors construct a benchmark dataset of 770 questions and conduct a randomized controlled trial evaluating how large language model (LLM) chatbots of varying accuracy influence human judgment. The findings reveal a nonlinear relationship between AI advice accuracy and human performance: highly accurate chatbots (96–100%) improve caseworker accuracy by 27 percentage points, whereas incorrect chatbot suggestions reduce accuracy by roughly two-thirds on easy questions where unassisted caseworkers performed best. The work identifies a phenomenon termed the "AI underreliance plateau," underscoring the necessity of user-centered evaluation of human-AI collaborative systems.

📝 Abstract
Social service programs like the Supplemental Nutrition Assistance Program (SNAP, or food stamps) have eligibility rules that can be challenging to understand. For nonprofit caseworkers who often support clients in navigating a dozen or more complex programs, LLM-based chatbots may offer a means to provide better, faster help to clients whose situations may be less common. In this paper, we measure the potential effects of LLM-based chatbot suggestions on caseworkers' ability to provide accurate guidance. We first created a 770-question multiple-choice benchmark dataset of difficult but realistic questions that a caseworker might receive. Next, using these benchmark questions and corresponding expert-verified answers, we conducted a randomized experiment with caseworkers recruited from nonprofit outreach organizations in Los Angeles. Caseworkers in the control condition did not see chatbot suggestions and had a mean accuracy of 49%. Caseworkers in the treatment condition saw chatbot suggestions that we artificially varied to range in aggregate accuracy from low (53%) to high (100%). Caseworker performance significantly improves as chatbot quality improves: high-quality chatbots (96-100% accurate) improved caseworker accuracy by 27 percentage points. At the question level, incorrect chatbot suggestions substantially reduce caseworker accuracy, with a two-thirds reduction on easy questions where the control group performed best (without chatbot suggestions). Finally, improvements in caseworker accuracy level off as chatbot accuracy increases, a phenomenon that we call the "AI underreliance plateau," which is a concern for real-world deployment and highlights the importance of evaluating human-in-the-loop tools with their users.
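The experimental setup described in the abstract (an expert-verified answer key, chatbot suggestions whose aggregate accuracy is artificially varied from roughly 53% to 100%, and caseworker answers scored against the key) lends itself to a short sketch. The snippet below is a hypothetical illustration in Python, not the authors' code; the question fields (`id`, `options`, `correct_option`), the arm targets, and both helper functions are assumptions made for the example.

```python
import random

def build_suggestions(questions, target_accuracy, seed=0):
    """For each benchmark question, show the verified answer with probability
    `target_accuracy`; otherwise show a randomly chosen wrong option."""
    rng = random.Random(seed)
    suggestions = {}
    for q in questions:
        if rng.random() < target_accuracy:
            suggestions[q["id"]] = q["correct_option"]
        else:
            wrong = [o for o in q["options"] if o != q["correct_option"]]
            suggestions[q["id"]] = rng.choice(wrong)
    return suggestions

def caseworker_accuracy(responses, answer_key):
    """Fraction of caseworker responses matching the expert-verified answers."""
    correct = sum(1 for r in responses if r["choice"] == answer_key[r["question_id"]])
    return correct / len(responses)

# Toy benchmark items (the real dataset has 770 multiple-choice questions).
questions = [
    {"id": 1, "options": ["A", "B", "C", "D"], "correct_option": "B"},
    {"id": 2, "options": ["A", "B", "C", "D"], "correct_option": "D"},
]

# Treatment arms spanning the low-to-high chatbot accuracy range in the study.
for target in (0.53, 0.75, 0.96, 1.00):
    print(target, build_suggestions(questions, target, seed=42))

# Scoring caseworker answers against the expert-verified key.
answer_key = {q["id"]: q["correct_option"] for q in questions}
responses = [{"question_id": 1, "choice": "B"}, {"question_id": 2, "choice": "A"}]
print(caseworker_accuracy(responses, answer_key))  # 0.5
```

Comparing the scored accuracy of treatment arms against the 49% control-condition baseline is what surfaces both the 27-point gain from high-quality suggestions and the plateau as chatbot accuracy approaches 100%.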
Problem

Research questions and friction points this paper is trying to address.

LLM-based chatbot
human accuracy
social services
AI underreliance
caseworker guidance
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based chatbots
human-AI collaboration
AI underreliance plateau
social services
accuracy benchmarking
Jennah Gosciak
Department of Information Science, Cornell Tech
Eric Giannella
Better Government Lab, Georgetown University
Zhaowen Guo
Better Government Lab, Georgetown University
Michael Chen
Undergraduate, Carnegie Mellon University
Allison Koenecke
Asst. Prof., Cornell University