Domain-Adapted Pre-trained Language Models for Implicit Information Extraction in Crash Narratives

📅 2025-10-10
📈 Citations: 0
Influential: 0

🤖 AI Summary
Problem: Extracting implicit information (e.g., manner of collision, crash type) from unstructured crash narratives remains challenging. Existing pre-trained models lose accuracy on inference-intensive tasks, while reliance on proprietary large language models (LLMs) raises privacy concerns and these models lack domain-specific knowledge. Method: We propose a lightweight, interpretable domain adaptation framework built on compact open-source models (BERT and open-weight LLMs). It fine-tunes them on data from the Crash Investigation Sampling System (CISS) to inject traffic-specific knowledge, using Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning of the LLMs. Contribution/Results: The fine-tuned models outperform closed-source models such as GPT-4o on implicit information extraction while requiring minimal computational resources. The approach also supports interpretability, can flag some mislabeled annotations in the dataset, and preserves data privacy because the open models are deployed locally.

📝 Abstract
Free-text crash narratives recorded in real-world crash databases have been shown to play a significant role in improving traffic safety. However, large-scale analyses remain difficult to implement because there are no documented tools that can batch process the unstructured, non-standardized text written by various authors with diverse experience and attention to detail. In recent years, Transformer-based pre-trained language models (PLMs), such as Bidirectional Encoder Representations from Transformers (BERT) and large language models (LLMs), have demonstrated strong capabilities across various natural language processing tasks. These models can extract explicit facts from crash narratives, but their performance declines on inference-heavy tasks such as Crash Type identification, which can involve nearly 100 categories. Moreover, relying on closed LLMs through external APIs raises privacy concerns for sensitive crash data, and these black-box tools often underperform due to limited domain knowledge. Motivated by these challenges, we study whether compact open-source PLMs can support reasoning-intensive extraction from crash narratives. We target two challenging objectives: 1) identifying the Manner of Collision for a crash, and 2) identifying the Crash Type for each vehicle involved in the crash event from real-world crash narratives. To bridge domain gaps, we fine-tune both model families to inject task-specific knowledge: LLMs with Low-Rank Adaptation (LoRA), and BERT. Experiments on the authoritative real-world dataset Crash Investigation Sampling System (CISS) demonstrate that our fine-tuned compact models outperform strong closed LLMs, such as GPT-4o, while requiring only minimal training resources. Further analysis reveals that the fine-tuned PLMs can capture richer narrative details and even correct some mislabeled annotations in the dataset.
Problem

Research questions and friction points this paper is trying to address.

Extracting implicit information from unstructured crash narratives
Addressing domain gaps in language models for crash analysis
Improving crash type and collision manner identification accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Domain-adapted PLMs for implicit crash data extraction
Fine-tuning with LoRA for domain knowledge injection
Compact models outperforming GPT-4o on crash narratives
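
As a rough illustration of why LoRA fine-tuning is parameter-efficient, the sketch below applies the standard LoRA update W + (alpha / r) · B · A to one frozen weight matrix and counts the trainable parameters. It is a minimal sketch in plain Python, not the authors' implementation: the matrix sizes, rank `r`, scaling `alpha`, and the helper names `matmul`, `lora_effective_weight`, and `trainable_params` are all hypothetical, since the summary does not state which layers or hyperparameters were used.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * (B @ A), where r is the LoRA rank.

    w: frozen weight, shape (d_out, d_in)
    a: trainable factor, shape (r, d_in)
    b: trainable factor, shape (d_out, r)
    """
    r = len(a)
    scale = alpha / r
    delta = matmul(b, a)  # low-rank update, shape (d_out, d_in)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def trainable_params(d_out, d_in, r):
    """Number of parameters in the two LoRA factors A and B."""
    return r * (d_in + d_out)


# Hypothetical toy sizes; real models use much larger matrices.
d_out, d_in, r, alpha = 6, 8, 2, 4
rng = random.Random(0)

w = [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(d_out)]  # frozen
a = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(r)]     # random init
b = [[0.0] * r for _ in range(d_out)]                                   # zero init

# With B initialised to zero, the adapted weight equals the frozen weight,
# so fine-tuning starts from the pre-trained model's behaviour.
w_eff = lora_effective_weight(w, a, b, alpha)

full_count = d_out * d_in                      # parameters in W itself
lora_count = trainable_params(d_out, d_in, r)  # parameters actually trained
```

For a 768 × 768 weight matrix with rank 8 (again, illustrative numbers only), `trainable_params(768, 768, 8)` gives 12,288 trainable values against 589,824 in the full matrix, which is the sense in which LoRA keeps training resources small.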
Xixi Wang
University of Rochester
Machine learning, Pattern analysis, Alzheimer's disease
Jordanka Kovaceva
Chalmers University of Technology
Active Safety, Road User Behaviour, Crash Prevention
Miguel Costa
Department of Technology, Management and Economics, Technical University of Denmark, Akademivej, 2800 Kongens Lyngby, Denmark
Shuai Wang
Department of Computer Science and Engineering, Chalmers University of Technology, Chalmersgatan 4, Gothenburg, 412 96, Sweden
Francisco Camara Pereira
Department of Technology, Management and Economics, Technical University of Denmark, Akademivej, 2800 Kongens Lyngby, Denmark
Robert Thomson
Department of Mechanics and Maritime Sciences, Chalmers University of Technology, Chalmersgatan 4, Gothenburg, 412 96, Sweden