RealHarm: A Collection of Real-World Language Model Application Failures

📅 2025-04-14

🤖 AI Summary

This paper addresses the lack of empirical failure analysis for language models deployed in consumer-facing applications. It introduces RealHarm, described as the first publicly available dataset of real-world AI application failures grounded in verified incidents. The methodology combines systematic mining of incidents from public sources with multi-dimensional human annotation, and adopts a deployer-centric perspective to classify harms (e.g., reputational damage), hazards (e.g., misinformation), root causes, and risk propagation pathways. The authors also empirically evaluate mainstream content safety guardrails against these real failures. Key contributions include: (1) establishing the first empirically grounded, organization-level AI failure repository; (2) revealing a substantial misalignment between regulatory frameworks and observed operational risks; (3) identifying reputational damage as the most prevalent organizational harm and misinformation as the most common hazard category; and (4) showing that existing safety systems intercept few of the real-world failure instances.

📝 Abstract

Language model deployments in consumer-facing applications introduce numerous risks. While existing research on harms and hazards of such applications follows top-down approaches derived from regulatory frameworks and theoretical analyses, empirical evidence of real-world failure modes remains underexplored. In this work, we introduce RealHarm, a dataset of annotated problematic interactions with AI agents built from a systematic review of publicly reported incidents. Analyzing harms, causes, and hazards specifically from the deployer's perspective, we find that reputational damage constitutes the predominant organizational harm, while misinformation emerges as the most common hazard category. We empirically evaluate state-of-the-art guardrails and content moderation systems to probe whether such systems would have prevented the incidents, revealing a significant gap in the protection of AI applications.
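The guardrail evaluation described in the abstract can be pictured as replaying each annotated conversation through a moderation check and counting how many incidents would have been flagged, grouped by hazard category. The following is a minimal sketch under stated assumptions, not the authors' evaluation code: the `Incident` fields, the `toy_guardrail` keyword check, and the sample conversations are all hypothetical stand-ins invented for illustration, and the real RealHarm schema and the guardrail systems tested in the paper may look quite different.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Incident:
    """Hypothetical record layout; the actual RealHarm schema may differ."""
    hazard: str               # annotated hazard category, e.g. "misinformation"
    conversation: list[str]   # messages exchanged with the AI agent


def toy_guardrail(message: str) -> bool:
    """Placeholder moderation check (an assumption, not a real guardrail API).

    Returns True when the message contains a blocklisted phrase.
    """
    blocklist = ("guaranteed refund", "legally binding")
    text = message.lower()
    return any(term in text for term in blocklist)


def interception_rate(
    incidents: Iterable[Incident],
    guardrail: Callable[[str], bool],
) -> dict[str, float]:
    """Fraction of incidents per hazard where any message was flagged."""
    total: dict[str, int] = {}
    caught: dict[str, int] = {}
    for incident in incidents:
        total[incident.hazard] = total.get(incident.hazard, 0) + 1
        if any(guardrail(message) for message in incident.conversation):
            caught[incident.hazard] = caught.get(incident.hazard, 0) + 1
    return {hazard: caught.get(hazard, 0) / count for hazard, count in total.items()}


# Made-up sample conversations, NOT taken from the dataset.
SAMPLE = [
    Incident("misinformation", ["What is your refund policy?",
                                "You have a guaranteed refund within 90 days."]),
    Incident("misinformation", ["Is this product safe for children?",
                                "Yes, it is completely safe."]),
    Incident("bias", ["Tell me a joke.", "Placeholder for an offensive reply."]),
]

if __name__ == "__main__":
    for hazard, rate in interception_rate(SAMPLE, toy_guardrail).items():
        print(f"{hazard}: {rate:.0%} of incidents flagged")
```

Counting an incident as intercepted when any single message is flagged is one possible design choice; a stricter variant would require the flag to land on the specific message annotated as harmful.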

Problem

Research questions and friction points this paper is trying to address.

Identify real-world failures in language model applications

Analyze harms and hazards from the deployer's perspective

Evaluate effectiveness of current guardrails and moderation systems

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces the RealHarm dataset, built from real incidents

Analyzes harms from the deployer's perspective

Evaluates guardrails and moderation systems
