Towards Trustworthy AI: Characterizing User-Reported Risks across LLMs "In the Wild"

📅 2025-09-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM risk studies predominantly focus on controlled laboratory settings, single models, or isolated risk categories, and thus fail to capture real-world user experiences and systemic risk patterns. Method: This study conducts the first empirical, user-centered risk analysis across seven mainstream LLM chatbots, drawing on authentic Reddit discussions that are systematically coded via qualitative content analysis and grounded in the NIST AI Risk Management Framework. Contribution/Results: We identify a distinct "risk fingerprint" for each LLM and show that user-perceived risks diverge significantly from lab-based findings: the most prevalent risk is failure of "effectiveness and reliability"; high-frequency risks often cause direct personal harm, whereas low-frequency risks typically reflect functional trade-offs. The study underscores the complexity of real-world risk perception and establishes user feedback as indispensable for robust AI risk assessment, providing both empirical evidence and methodological foundations for practical, human-in-the-loop LLM risk management.

📝 Abstract
While Large Language Models (LLMs) are rapidly integrating into daily life, research on their risks often remains lab-based and disconnected from the problems users encounter "in the wild." Although recent HCI research has begun to explore these user-facing risks, it typically concentrates on a single LLM chatbot such as ChatGPT or an isolated risk such as privacy. To gain a holistic understanding of multiple risks across LLM chatbots, we analyze Reddit discussions around seven major LLM chatbots through the lens of the U.S. NIST AI Risk Management Framework. We find that user-reported risks are unevenly distributed and platform-specific. While "Valid and Reliable" risk is the most frequently mentioned, each product also exhibits a unique "risk fingerprint": for instance, user discussions associate GPT more with "Safe" and "Fair" issues, Gemini with "Privacy," and Claude with "Secure and Resilient" risks. Furthermore, the nature of these risks differs by prevalence: less frequent risks such as "Explainability" and "Privacy" manifest as nuanced user trade-offs, whereas more common ones such as "Fairness" are experienced as direct personal harms. Our findings reveal gaps between the risks reported by system-centered studies and those reported by users, highlighting the need for user-centered approaches that support users in their daily use of LLM chatbots.
Problem

Research questions and friction points this paper is trying to address.

Characterizing user-reported risks across LLM chatbots
Analyzing online discussions through the NIST AI Risk Management Framework
Identifying gaps between lab-based and user-experienced risks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzed Reddit discussions across seven major LLM chatbots
Coded user-reported risks with the NIST AI Risk Management Framework
Identified platform-specific risk fingerprints (see the sketch below)
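The "risk fingerprint" idea reduces to a per-platform frequency distribution over the NIST AI RMF trustworthiness characteristics. Below is a minimal sketch of that aggregation step, assuming posts have already been qualitatively coded; the `coded_posts` records and their counts are hypothetical illustrations, not the paper's data, and the category names follow the framework's characteristic labels.

```python
from collections import Counter, defaultdict

# NIST AI RMF trustworthiness characteristics used as the coding scheme.
RISK_CATEGORIES = [
    "Valid and Reliable", "Safe", "Secure and Resilient",
    "Accountable and Transparent", "Explainable and Interpretable",
    "Privacy-Enhanced", "Fair",
]

# Hypothetical coded records: (chatbot, risk category assigned by coders).
coded_posts = [
    ("GPT", "Valid and Reliable"),
    ("GPT", "Safe"),
    ("Gemini", "Privacy-Enhanced"),
    ("Claude", "Secure and Resilient"),
    ("Claude", "Valid and Reliable"),
]

def risk_fingerprints(posts):
    """Return each chatbot's normalized distribution over risk categories."""
    counts = defaultdict(Counter)
    for chatbot, category in posts:
        counts[chatbot][category] += 1
    fingerprints = {}
    for chatbot, counter in counts.items():
        total = sum(counter.values())
        fingerprints[chatbot] = {
            cat: counter[cat] / total for cat in RISK_CATEGORIES
        }
    return fingerprints

# Report each chatbot's most-mentioned risk category.
for bot, fp in risk_fingerprints(coded_posts).items():
    top = max(fp, key=fp.get)
    print(f"{bot}: most-mentioned risk = {top} ({fp[top]:.0%})")
```

In the paper, the coding itself is performed by human annotators via qualitative content analysis; this sketch only illustrates the downstream aggregation that turns coded posts into comparable per-platform fingerprints.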