Climate-Eval: A Comprehensive Benchmark for NLP Tasks Related to Climate Change

📅 2025-05-24
🤖 AI Summary
Problem: A standardized NLP evaluation benchmark for climate change research remains absent. Method: The authors introduce Climate-Eval, a comprehensive climate-oriented NLP benchmark comprising 25 tasks across 13 datasets, including a newly curated high-quality news classification dataset, and unifying heterogeneous climate-related textual and cross-modal data. They propose a standardized zero-shot and few-shot evaluation protocol, design principled data fusion strategies and annotation guidelines, and conduct systematic evaluations across open-source LLMs ranging from 2B to 70B parameters. Contribution/Results: Experiments reveal critical deficiencies in current LLMs regarding factual consistency, domain-specific terminology comprehension, and implicit stance detection. Climate-Eval establishes an empirically grounded, reproducible evaluation framework to advance trustworthy, green AI for climate science.

📝 Abstract
Climate-Eval is a comprehensive benchmark designed to evaluate natural language processing models across a broad range of tasks related to climate change. Climate-Eval aggregates existing datasets along with a newly developed news classification dataset, created specifically for this release. This results in a benchmark of 25 tasks based on 13 datasets, covering key aspects of climate discourse, including text classification, question answering, and information extraction. Our benchmark provides a standardized evaluation suite for systematically assessing the performance of large language models (LLMs) on these tasks. Additionally, we conduct an extensive evaluation of open-source LLMs (ranging from 2B to 70B parameters) in both zero-shot and few-shot settings, analyzing their strengths and limitations in the domain of climate change.
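
The difference between the zero-shot and few-shot settings mentioned in the abstract can be illustrated with a minimal sketch. This is not Climate-Eval's actual evaluation harness: the prompt template, the helper names (build_prompt, accuracy, keyword_model), the label set, and the toy examples are all hypothetical and chosen only for illustration. A real run would replace keyword_model with a call to an LLM.

```python
from typing import Callable, Sequence

Example = tuple[str, str]  # (text, gold label); hypothetical data format


def build_prompt(text: str, labels: Sequence[str],
                 shots: Sequence[Example] = ()) -> str:
    """Build a classification prompt: zero-shot if `shots` is empty, few-shot otherwise."""
    lines = [f"Classify the text as one of: {', '.join(labels)}.", ""]
    for shot_text, shot_label in shots:  # in-context demonstrations
        lines += [f"Text: {shot_text}", f"Label: {shot_label}", ""]
    lines += [f"Text: {text}", "Label:"]  # the query the model must complete
    return "\n".join(lines)


def accuracy(model: Callable[[str], str], test_set: Sequence[Example],
             labels: Sequence[str], shots: Sequence[Example] = ()) -> float:
    """Fraction of test items whose predicted label matches the gold label."""
    correct = 0
    for text, gold in test_set:
        prediction = model(build_prompt(text, labels, shots)).strip().lower()
        correct += prediction == gold.lower()
    return correct / len(test_set)


def keyword_model(prompt: str) -> str:
    """Toy stand-in for an LLM: inspects only the final query text."""
    query = prompt.rsplit("Text: ", 1)[1].lower()
    return "climate" if "emission" in query or "warming" in query else "other"


# Assumed toy data, not taken from any Climate-Eval dataset.
LABELS = ["climate", "other"]
SHOTS = [
    ("Sea levels are rising faster than projected.", "climate"),
    ("The team won the cup final.", "other"),
]
TEST = [
    ("New policy targets carbon emissions.", "climate"),
    ("Stock markets closed higher today.", "other"),
    ("Global warming accelerates ice melt.", "climate"),
    ("Heatwaves are becoming more frequent.", "climate"),
]

zero_shot_acc = accuracy(keyword_model, TEST, LABELS)
few_shot_acc = accuracy(keyword_model, TEST, LABELS, SHOTS)
print(f"zero-shot: {zero_shot_acc:.2f}, few-shot: {few_shot_acc:.2f}")
```

Because the toy model ignores the demonstrations, both settings score the same here; with a real LLM, the in-context examples in the few-shot prompt are what may change the predictions.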
Problem

Research questions and friction points this paper is trying to address.

Evaluates NLP models on climate-related tasks
Aggregates datasets for comprehensive climate discourse analysis
Assesses LLM performance in zero-shot and few-shot settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Aggregates existing and new climate datasets
Standardized evaluation suite for LLMs
Extensive zero-shot and few-shot LLM analysis
Murathan Kurfali
RISE Research Institutes of Sweden, Swedish Centre for Impacts of Climate Extremes (climes)
Shorouq Zahra
Uppsala University, Swedish Centre for Impacts of Climate Extremes (climes)
Joakim Nivre
Professor of Computational Linguistics, Uppsala University
Computational Linguistics, Natural Language Processing, Dependency Parsing
Gabriele Messori
Uppsala University, Swedish Centre for Impacts of Climate Extremes (climes)