🤖 AI Summary
This work addresses the lack of a unified definition and standardized benchmark for threat detection in natural language processing, a task often conflated with toxicity or hate speech detection. The authors introduce ThreatCore, a publicly available fine-grained benchmark for threat detection, constructed by aggregating and re-annotating multiple public datasets under a single operational definition so that explicit threats, implicit threats, and non-threats are clearly distinguished. To improve coverage of implicit threats, the benchmark adds synthetically generated examples that human annotators validate with the same protocol. The study evaluates Perspective API, zero-shot classifiers, and recent language models, and tests semantic role labeling (SRL) as an intermediate representation. The results show that current models perform substantially worse on implicit threats than on explicit ones, and that adding SRL can improve performance, highlighting both the difficulty of detecting indirect harmful intent and a promising direction for addressing it.
📝 Abstract
Threat detection in Natural Language Processing lacks consistent definitions and standardized benchmarks, and is often conflated with broader phenomena such as toxicity, hate speech, or offensive language. In this work, we introduce ThreatCore, a publicly available benchmark dataset for fine-grained threat detection that distinguishes between explicit threats, implicit threats, and non-threats. The dataset is constructed by aggregating multiple publicly available resources and systematically re-annotating them under a unified operational definition of threat, a process that reveals substantial inconsistencies across existing labels. To improve the coverage of underrepresented cases, particularly implicit threats, we further augment the dataset with synthetic examples. These are manually validated using the same annotation protocol adopted for the re-annotation of the public datasets, ensuring consistency across all data sources. We evaluate Perspective API, zero-shot classifiers, and recent language models on ThreatCore, showing that implicit threats remain substantially harder to detect than explicit ones. Our results also indicate that incorporating Semantic Role Labeling as an intermediate representation can improve performance by making the structure of harmful intent more explicit. Overall, ThreatCore provides a more consistent benchmark for studying fine-grained threat detection and highlights the challenges that current models still face in identifying indirect expressions of harmful intent.
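The gap between explicit and implicit threats only becomes visible when scores are reported per class rather than as a single aggregate. The sketch below is a minimal illustration of per-class precision, recall, and F1 for a three-way labelling scheme. The label strings, the function name `per_class_scores`, and the toy predictions are assumptions made for illustration; they are not taken from ThreatCore or from the paper's evaluation code, and the paper may report different metrics.

```python
from collections import Counter

# Assumed label names for the three classes; the dataset may use others.
LABELS = ("explicit", "implicit", "non-threat")


def per_class_scores(gold, pred, labels=LABELS):
    """Return {label: (precision, recall, f1)} for a single-label multi-class task."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[g] += 1  # gold class gains a false negative

    scores = {}
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        f_den = precision + recall
        f1 = 2 * precision * recall / f_den if f_den else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Toy, made-up labels purely for illustration (not data from ThreatCore).
gold = ["explicit", "explicit", "implicit", "implicit", "non-threat", "non-threat"]
pred = ["explicit", "explicit", "non-threat", "implicit", "non-threat", "non-threat"]
scores = per_class_scores(gold, pred)

for label, (p, r, f1) in scores.items():
    print(f"{label:<11} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

In this toy case the one missed implicit example is predicted as a non-threat, so recall drops for the implicit class only. An overall accuracy figure would hide that pattern.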