🤖 AI Summary
This study addresses the challenge that end users face in comprehending cybersecurity alerts, including those rewritten by large language models (LLMs). To this end, we propose the Human-Centered Security Alert Evaluation Framework (HCSAEF). Methodologically, HCSAEF integrates human factors engineering principles, natural language metrics, LLM output parsing, semantic consistency verification, and behavior-oriented empirical user studies. Its core contribution is the first quantifiable, multi-dimensional evaluation system for alert intuitiveness, urgency, and correctness, enabling comparative analysis across prompts, models, and output consistency, along with root-cause attribution of alert quality. Evaluated on three representative use cases, HCSAEF effectively discriminates the impacts of prompt engineering, model selection, and output stability on alert quality. The results demonstrate its strong discriminative power and practical utility in assessing the intuitiveness, urgency, and factual correctness of LLM-generated security notifications.
📝 Abstract
Due to the increasing presence of networked devices in everyday life, not only cybersecurity specialists but also end users benefit from security applications such as firewalls, vulnerability scanners, and intrusion detection systems. Recent approaches use large language models (LLMs) to rewrite brief, technical security alerts in intuitive language and suggest actionable measures, helping everyday users understand and respond appropriately to security risks. However, it remains an open question how well such alerts explain the underlying issues to users, and LLM outputs can be hallucinated, inconsistent, or misleading. In this work, we introduce the Human-Centered Security Alert Evaluation Framework (HCSAEF). HCSAEF assesses LLM-generated cybersecurity notifications to support researchers who want to compare notifications generated for everyday users, improve them, or analyze the capabilities of different LLMs in explaining cybersecurity issues. We demonstrate HCSAEF through three use cases, which allow us to quantify the impact of prompt design, model selection, and output consistency. Our findings indicate that HCSAEF effectively differentiates generated notifications along dimensions such as intuitiveness, urgency, and correctness.