AI Summary
Problem: Location descriptions in disaster databases (e.g., EM-DAT) are unstructured, heterogeneous, and inconsistently spelled, which impedes subnational geocoding. Method: We propose the first fully automated, GPT-4o-driven geocoding workflow. A large language model performs text cleaning and semantic parsing, and cross-validated geographic matching against GADM, OpenStreetMap, and Wikidata generates subnational geometries with reliability scores. Contribution/Results: The method enables flexible, multi-hazard mapping across administrative frameworks and provides an LLM-powered, multi-source framework for trustworthy geolocation. Applied to EM-DAT records from 2000 to 2024, it geocoded 14,215 disaster events across 17,948 unique locations at subnational resolution, achieving high precision. This enhances the spatial comparability, interoperability, and analytical utility of disaster data.
Abstract
Subnational location data of disaster events are critical for risk assessment and disaster risk reduction. Disaster databases such as EM-DAT often report locations in unstructured textual form, with inconsistent granularity or spelling, which makes them difficult to integrate with spatial datasets. We present a fully automated LLM-assisted workflow that processes and cleans textual location information using GPT-4o and assigns geometries by cross-checking three independent geoinformation repositories: GADM, OpenStreetMap, and Wikidata. Based on the agreement and availability of these sources, we assign a reliability score to each location while generating subnational geometries. Applied to the EM-DAT dataset from 2000 to 2024, the workflow geocodes 14,215 events across 17,948 unique locations. Unlike previous methods, our approach requires no manual intervention, covers all disaster types, enables cross-verification across multiple sources, and allows flexible remapping to preferred administrative frameworks. Beyond the dataset, we demonstrate the potential of LLMs to extract and structure geographic information from unstructured text, offering a scalable and reliable method for related analyses.
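To make the idea of a reliability score based on source agreement and availability concrete, the following is a minimal Python sketch. It is not the scoring scheme used in the workflow, which the abstract does not specify. The function name `reliability_score`, the 0 to 3 scale, the use of point centroids instead of full geometries, and the 50 km tolerance are all assumptions chosen for illustration.

```python
from itertools import combinations
from math import asin, cos, radians, sin, sqrt
from typing import Dict, Optional, Tuple

Coord = Tuple[float, float]  # (latitude, longitude) in decimal degrees


def haversine_km(a: Coord, b: Coord) -> float:
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (a[0], a[1], b[0], b[1]))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def reliability_score(
    candidates: Dict[str, Optional[Coord]],
    tolerance_km: float = 50.0,  # hypothetical agreement threshold
) -> int:
    """Hypothetical score: size of the largest group of mutually agreeing sources.

    0 = no source returned a match
    1 = one source matched, or no two sources agree
    2 = two sources agree within the tolerance
    3 = all three sources agree within the tolerance
    """
    # Availability: keep only the sources that returned a match.
    found = {name: c for name, c in candidates.items() if c is not None}
    if not found:
        return 0

    # Agreement: look for the largest subset whose members are pairwise close.
    best = 1
    for size in range(2, len(found) + 1):
        for group in combinations(found, size):
            if all(
                haversine_km(found[x], found[y]) <= tolerance_km
                for x, y in combinations(group, 2)
            ):
                best = size
    return best


# Example: two sources return nearby centroids, the third has no match.
example = {
    "gadm": (48.8566, 2.3522),
    "osm": (48.8600, 2.3500),
    "wikidata": None,
}
print(reliability_score(example))
```

In this example the two available sources lie well within the tolerance of each other, so the sketch assigns a score of 2. A real implementation would more likely compare polygon geometries (for instance by overlap) than centroid distances.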