CienaLLM: Generative Climate-Impact Extraction from News Articles with Autoregressive LLMs

📅 2025-12-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the need for monitoring the socioeconomic impacts of climate-related disasters, this paper proposes a zero-shot framework for structured event extraction and large-scale information retrieval from heterogeneous, multilingual news sources. Methodologically, it introduces a schema-guided generative extraction paradigm built on open-source autoregressive large language models (e.g., Llama, Phi), integrating zero-shot prompt engineering, configurable output formatting, multi-step reasoning, and post-response parsing, and requiring no fine-tuning for cross-hazard, cross-domain, or multilingual adaptation. Key contributions include a schema-constrained generation mechanism coupled with a robust parsing strategy that virtually eliminates structural errors, alongside quantization (reduced-precision inference) that substantially improves inference efficiency. On a Spanish-language drought-impact extraction task, the framework achieves accuracy comparable to supervised models. All code, configurations, and schema definitions are publicly released.

📝 Abstract
Understanding and monitoring the socio-economic impacts of climate hazards requires extracting structured information from heterogeneous news articles at scale. To that end, we have developed CienaLLM, a modular framework based on schema-guided Generative Information Extraction. CienaLLM uses open-weight Large Language Models for zero-shot information extraction from news articles, and supports configurable prompts and output schemas, multi-step pipelines, and cloud or on-premise inference. To systematically assess how the choice of LLM family, size, precision regime, and prompting strategy affects performance, we run a large factorial study across models, precision regimes, and prompt-engineering techniques. We find that an additional response-parsing step nearly eliminates format errors while preserving accuracy; larger models deliver the strongest and most stable performance, while quantization offers substantial efficiency gains with modest accuracy trade-offs; and prompt strategies show heterogeneous, model-specific effects. CienaLLM matches or outperforms the supervised baseline in accuracy for extracting drought impacts from Spanish news, although at a higher inference cost. While evaluated on droughts, the schema-driven and model-agnostic design makes it straightforward to adapt to related information extraction tasks (e.g., other hazards, sectors, or languages) by editing prompts and schemas rather than retraining. We release code, configurations, and schemas to support reproducible use.
Problem

Research questions and friction points this paper is trying to address.

Extracting structured climate-impact data from news articles at scale
Evaluating how LLM family, size, precision, and prompting affect zero-shot information extraction
Adapting extraction to new hazards and languages via configurable schemas
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-weight LLMs for zero-shot information extraction
Configurable prompts and schemas for adaptable extraction tasks
Factorial study on LLM choices and prompt strategies
Javier Vela-Tambo
Worcester Polytechnic Institute, Worcester, MA, USA
Jorge Gracia
University of Zaragoza
Semantic Web, Ontologies, Linguistic Linked Data, Ontology Matching, Query interpretation
Fernando Dominguez-Castro
Pyrenean Institute of Ecology (IPE-CSIC), Zaragoza, Spain