🤖 AI Summary
Humanitarian decision-making urgently requires timely, accurate, and verifiable situational reports, yet current practice relies heavily on manual processes, resulting in low efficiency and inconsistent quality. This paper introduces the first end-to-end large language model (LLM) framework for fully automating the transformation of heterogeneous, multi-source humanitarian documents into structured, verifiable, and action-oriented reports. The method integrates semantic clustering, evidence-grounded question generation, and a multi-level expert-simulation evaluation paradigm, ensuring explainability, verifiability, and operational utility across five key stages: event aggregation, question generation, retrieval-augmented answer extraction, multi-granularity summarization, and executive summary generation. Evaluated on 13 real-world humanitarian incidents, the framework achieves relevance scores of 84.7% for generated questions and 86.3% for extracted answers; citation precision and recall both exceed 76%; and human-AI collaborative evaluation yields an F1 score above 0.80, significantly outperforming all baselines.
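The five-stage pipeline can be pictured as a chain of transforms over a document collection. The sketch below is purely illustrative: every function name is an assumption, and the toy logic (keyword clustering, template questions) stands in for the LLM and retrieval calls the paper's actual system would make at each stage.

```python
# Hypothetical sketch of the five-stage report pipeline; stage bodies
# are deterministic stubs, not the paper's implementation.

def cluster_events(docs):
    # Stage 1: event aggregation -- toy version groups docs by first word;
    # the real system uses semantic clustering over embeddings.
    clusters = {}
    for d in docs:
        clusters.setdefault(d.split()[0].lower(), []).append(d)
    return list(clusters.values())

def generate_questions(cluster):
    # Stage 2: evidence-grounded question generation (template stand-in).
    return [f"What does the source report about: {d[:30]}?" for d in cluster]

def answer_with_citations(question, cluster):
    # Stage 3: retrieval-augmented answer extraction -- every document
    # that contributes to the answer is cited by index.
    return {"question": question,
            "answer": " / ".join(cluster),
            "citations": list(range(len(cluster)))}

def summarize(answers):
    # Stages 4-5: multi-granularity summary, then an executive summary.
    sections = [a["answer"] for a in answers]
    return {"sections": sections,
            "executive_summary": sections[0] if sections else ""}

def build_report(docs):
    report = []
    for cluster in cluster_events(docs):
        qs = generate_questions(cluster)
        report.append(summarize([answer_with_citations(q, cluster) for q in qs]))
    return report

docs = ["Flood displaces 2,000 households in Region A",
        "Flood damages water infrastructure in Region A",
        "Conflict disrupts aid convoys near Town B"]
print(len(build_report(docs)))  # one report section per event cluster -> 2
```

The key structural point the sketch preserves is that citations travel with each extracted answer, so every claim in the final summary remains traceable to source documents.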
📝 Abstract
Timely and accurate situational reports are essential for humanitarian decision-making, yet current workflows remain largely manual, resource-intensive, and inconsistent. We present a fully automated framework that uses large language models (LLMs) to transform heterogeneous humanitarian documents into structured, evidence-grounded reports. The system integrates semantic text clustering, automatic question generation, retrieval-augmented answer extraction with citations, multi-level summarization, and executive summary generation, supported by internal evaluation metrics that emulate expert reasoning. We evaluated the framework across 13 humanitarian events, including natural disasters and conflicts, using more than 1,100 documents from verified sources such as ReliefWeb. The generated questions achieved 84.7 percent relevance, 84.0 percent importance, and 76.4 percent urgency. The extracted answers reached 86.3 percent relevance, with citation precision and recall both exceeding 76 percent. Agreement between human and LLM-based evaluations surpassed an F1 score of 0.80. Comparative analysis shows that the proposed framework produces reports that are more structured, interpretable, and actionable than existing baselines. By combining LLM reasoning with transparent citation linking and multi-level evaluation, this study demonstrates that generative AI can autonomously produce accurate, verifiable, and operationally useful humanitarian situation reports.
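The citation precision and recall figures can be read as standard set-overlap metrics between the sources an answer cites and the sources a reference annotation deems necessary. A minimal sketch, assuming citations are comparable IDs (the function and variable names are illustrative, not taken from the paper):

```python
def citation_precision_recall(predicted, gold):
    """Set-overlap precision/recall between cited and reference source IDs.

    predicted: citation IDs attached to a generated answer
    gold: citation IDs the reference annotation requires
    """
    pred, ref = set(predicted), set(gold)
    if not pred or not ref:
        return 0.0, 0.0
    tp = len(pred & ref)  # correctly cited sources
    return tp / len(pred), tp / len(ref)

# e.g. 4 of 5 cited sources are correct, and 4 of 5 required sources
# are cited, giving precision = recall = 0.8
p, r = citation_precision_recall({1, 2, 3, 4, 7}, {1, 2, 3, 4, 5})
print(p, r)  # 0.8 0.8
```

Under this reading, "precision and recall both exceeding 76 percent" means the system both rarely cites irrelevant sources and rarely omits required ones.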