🤖 AI Summary
This study addresses the challenge caseworkers face in accurately guiding clients through the complex eligibility rules of social service programs like SNAP. To investigate the impact of AI assistance, the authors construct a benchmark dataset of 770 questions and conduct a randomized controlled trial evaluating how large language model (LLM) chatbots of varying accuracy influence human judgment. The findings reveal a nonlinear relationship between AI advice accuracy and human performance: highly accurate AI (96–100%) improves caseworker accuracy by 27 percentage points, whereas erroneous AI advice reduces performance by two-thirds on easy questions. The work identifies a phenomenon termed the “AI underreliance plateau,” underscoring the necessity of user-centered evaluation of human-AI collaborative systems.
📝 Abstract
Social service programs like the Supplemental Nutrition Assistance Program (SNAP, or food stamps) have eligibility rules that can be challenging to understand. For nonprofit caseworkers who often support clients in navigating a dozen or more complex programs, LLM-based chatbots may offer a means to provide better, faster help to clients whose situations may be less common. In this paper, we measure the potential effects of LLM-based chatbot suggestions on caseworkers' ability to provide accurate guidance. We first created a 770-question multiple-choice benchmark dataset of difficult but realistic questions that a caseworker might receive. Next, using these benchmark questions and corresponding expert-verified answers, we conducted a randomized experiment with caseworkers recruited from nonprofit outreach organizations in Los Angeles. Caseworkers in the control condition did not see chatbot suggestions and had a mean accuracy of 49%. Caseworkers in the treatment condition saw chatbot suggestions that we artificially varied to range in aggregate accuracy from low (53%) to high (100%). Caseworker performance significantly improves as chatbot quality improves: high-quality chatbots (96–100% accurate) improved caseworker accuracy by 27 percentage points. At the question level, incorrect chatbot suggestions substantially reduce caseworker accuracy, with a two-thirds reduction on easy questions where the control group performed best (without chatbot suggestions). Finally, improvements in caseworker accuracy level off as chatbot accuracy increases, a phenomenon that we call the "AI underreliance plateau," which is a concern for real-world deployment and highlights the importance of evaluating human-in-the-loop tools with their users.