An Agentic Workflow for Detecting Personally Identifiable Information in Crash Narratives

📅 2026-04-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the detection of sparse, highly context-dependent personally identifiable information (PII) in narrative traffic crash reports, where manual annotation does not scale and rule-based approaches miss context-dependent cases. The authors propose a locally deployable agentic workflow that pairs a rule-based model (Presidio) for structured PII, such as phone numbers and email addresses, with a domain-adapted, fine-tuned large language model for context-dependent PII, such as names, home addresses, and alphanumeric identifiers. For ambiguous categories, the workflow adds ensemble LLM extraction and an evidence-based agentic verification step that filters false detections. Evaluated on a real-world crash dataset, the method achieves a precision of 0.82, a recall of 0.94, an F1 score of 0.87, and an accuracy of 0.96, outperforming multiple baseline methods. It runs locally, which suits privacy-sensitive deployments where external APIs are restricted.

📝 Abstract

Crash narratives in crash reports provide crucial contextual information for traffic safety analysis. Yet their broader use is hindered by the presence of personally identifiable information (PII), including names, home addresses, and license plate numbers. Because PII appears sparsely and inconsistently in crash narratives, manual detection is not scalable, and existing rule-based approaches often fail to capture context-dependent PII. This study develops and evaluates a locally deployable, agentic workflow for PII detection in crash narratives by leveraging large language models (LLMs). The workflow contains a Hybrid Extractor and a Verifier. The Hybrid Extractor routes structured PII (e.g., phone numbers and email addresses) to a rule-based model (i.e., Presidio) and context-dependent PII (e.g., names, home addresses, and alphanumeric identifiers) to a domain-adapted, fine-tuned LLM. To address ambiguity in challenging categories, the workflow incorporates ensemble LLM extraction and an agentic verification step that filters false detections through evidence-based reasoning. Evaluated on a real-world crash dataset, the agentic workflow achieves strong performance with a precision of 0.82, a recall of 0.94, an F1 of 0.87, and an accuracy of 0.96, outperforming multiple baseline methods. Moreover, the ablation results suggest that ensemble LLM extraction and the Verifier improve detection of home addresses and alphanumeric identifiers. The workflow runs locally, supporting privacy-sensitive operational settings where external APIs are restricted. This work offers a practical and robust path for scalable, privacy-preserving crash data processing, enabling broader research and safety interventions while safeguarding individual privacy.

Problem

Research questions and friction points this paper is trying to address.

Personally Identifiable Information

Crash Narratives

Privacy Preservation

Context-dependent PII

Traffic Safety Data

Innovation

Methods, ideas, or system contributions that make the work stand out.

Agentic Workflow

PII Detection

Large Language Models