Predicting person-level injury severity using crash narratives: A balanced approach with roadway classification and natural language process techniques

📅 2025-09-09
🤖 AI Summary
Problem: Accurate prediction of injury severity in traffic crashes remains challenging because unstructured narrative evidence is rarely modeled alongside structured crash data. Method: This study proposes a hybrid modeling framework that integrates unstructured crash narratives written by police officers with structured crash attributes. It uses three roadway classification schemes to capture road-type heterogeneity and compares two text representations, TF-IDF and Word2Vec. Three ensemble learning models (XGBoost, Random Forest, and AdaBoost) are trained, with SMOTE, a K-Nearest Neighbors-based oversampling method, applied to the training data to address class imbalance. Contribution/Results: Evaluated on real-world crash data from Kentucky (2019–2023), the text-augmented models consistently outperform structured-only baselines. The TF-IDF + XGBoost combination achieves the highest performance in most subgroups, improving AUC by up to 8.2%. The framework offers a practical, adaptable predictive tool for road safety assessment, emergency response optimization, and public health interventions.

📝 Abstract
Predicting injuries and fatalities in traffic crashes plays a critical role in enhancing road safety, improving emergency response, and guiding public health interventions. This study investigates the added value of unstructured crash narratives (written by police officers at the scene) when combined with structured crash data to predict injury severity. Two widely used Natural Language Processing (NLP) techniques, Term Frequency-Inverse Document Frequency (TF-IDF) and Word2Vec, were employed to extract semantic meaning from the narratives, and their effectiveness was compared. To address the challenge of class imbalance, a K-Nearest Neighbors-based oversampling method was applied to the training data prior to modeling. The dataset consists of crash records from Kentucky spanning 2019 to 2023. To account for roadway heterogeneity, three road classification schemes were used: (1) eight detailed functional classes (e.g., Urban Two-Lane, Rural Interstate, Urban Multilane Divided), (2) four broader paired categories (e.g., Urban vs. Rural, Freeway vs. Non-Freeway), and (3) a unified dataset without classification. A total of 102 machine learning models were developed by combining structured features and narrative-based features using the two NLP techniques alongside three ensemble algorithms: XGBoost, Random Forest, and AdaBoost. Results demonstrate that models incorporating narrative data consistently outperform those relying solely on structured data. Among all combinations, TF-IDF coupled with XGBoost yielded the most accurate predictions in most subgroups. The findings highlight the power of integrating textual and structured crash information to enhance person-level injury prediction. This work offers a practical and adaptable framework for transportation safety professionals to improve crash severity modeling, guide policy decisions, and design more effective countermeasures.
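As a rough illustration of how narrative text can be turned into features and placed next to structured attributes, the sketch below computes TF-IDF weights in plain Python and appends them to a structured feature row. It is not the authors' pipeline: the smoothed IDF formula is one common variant, and the toy narratives, the `is_rural` and `speed_limit_mph` columns, and the helper names `tfidf_vectors` and `combine` are all hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(narratives):
    """Return (vocabulary, vectors) with one TF-IDF vector per narrative."""
    docs = [text.lower().split() for text in narratives]
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of narratives that contain each term.
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed IDF (an assumed variant): log((1 + N) / (1 + df)) + 1
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1.0 for tok in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # guard against empty narratives
        vectors.append([counts[tok] / total * idf[tok] for tok in vocab])
    return vocab, vectors


def combine(structured_row, text_vector):
    """Concatenate structured attributes with narrative-based features."""
    return list(structured_row) + list(text_vector)


# Hypothetical toy data, not taken from the Kentucky dataset.
narratives = [
    "unit 1 ran off road and struck tree",
    "unit 1 rear ended unit 2 at stop light",
    "unit 2 struck pedestrian in crosswalk",
]
structured = [[1, 55], [0, 35], [0, 25]]  # hypothetical: [is_rural, speed_limit_mph]

vocab, text_vecs = tfidf_vectors(narratives)
features = [combine(s, t) for s, t in zip(structured, text_vecs)]
```

In this toy example, a term that appears in only one narrative (such as "tree") receives a higher weight than a term that appears in every narrative (such as "unit"), which is the property that lets a downstream classifier pick up on distinctive wording. The combined rows could then be passed to any of the ensemble algorithms named in the abstract.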
Problem

Research questions and friction points this paper is trying to address.

Predicting person-level injury severity from crash data
Combining unstructured narratives with structured crash information
Addressing class imbalance and roadway heterogeneity in modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combining NLP techniques with structured crash data
Using K-Nearest Neighbors oversampling for class imbalance
Employing multiple roadway classification schemes for heterogeneity
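
The K-Nearest Neighbors oversampling idea can be sketched as follows. This is a minimal SMOTE-style illustration under stated assumptions, not the paper's implementation: the function name `knn_oversample`, the choice of `k`, the Euclidean distance, and the toy minority points are all hypothetical. Each synthetic sample is placed at a random position on the segment between a minority-class sample and one of its nearest minority-class neighbours.

```python
import math
import random


def knn_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (SMOTE-style)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical minority-class feature rows (e.g. severe-injury cases).
minority = [[1.0, 2.0], [1.5, 2.5], [2.0, 2.0], [1.2, 1.8]]
new_points = knn_oversample(minority, n_new=6)
```

Consistent with the abstract, resampling of this kind would be applied to the training split only, so that evaluation data keep their original class distribution.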
Mohammad Zana Majidi
Department of Civil Engineering, University of Kentucky, KY, USA
Sajjad Karimi
Postdoctoral Fellow at Emory University
Multi-Modal Processing, Statistical Signal Processing, Statistical Modeling, Dynamic Bayesian
Teng Wang
Kentucky Transportation Center, Lexington, KY, USA
Robert Kluger
Department of Civil and Environmental Engineering, University of Louisville, KY, USA
Reginald Souleyrette
Professor of Civil Engineering, University of Kentucky
Transportation, Traffic Safety, GIS, Railroads