AI Summary
This work addresses the scarcity of high-quality annotated data for crime-domain named entity recognition (NER) by introducing CrimeNERdb, a dataset of more than 1,500 annotated documents drawn from public reports on terrorist attacks and U.S. Department of Justice press notes. The dataset defines 5 coarse-grained and 22 fine-grained entity types tailored to crime-related text. The accompanying case study, CrimeNER, evaluates state-of-the-art NER models alongside general-purpose large language models under zero-shot and few-shot settings. Together, the dataset and experiments provide a benchmark and empirical evidence for low-resource information extraction from crime-related documents.
Abstract
The extraction of critical information from crime-related documents is a crucial task for law enforcement agencies. Named-Entity Recognition (NER) can perform this task by extracting information about the crime, the criminal, or the law enforcement agencies involved. However, there is a considerable lack of adequately annotated data on general real-world crime scenarios. To address this issue, we present CrimeNER, a case study of crime-related Zero- and Few-Shot NER, and a general crime-related Named-Entity Recognition database (CrimeNERdb) consisting of more than 1.5k documents annotated for the NER task, extracted from public reports on terrorist attacks and the U.S. Department of Justice's press notes. We define 5 coarse-grained crime entity types and a total of 22 fine-grained entity types. We assess the quality of the case study and the annotated data through experiments in Zero- and Few-Shot settings with state-of-the-art NER models as well as commonly used general-purpose Large Language Models.