🤖 AI Summary
A lack of high-quality benchmark datasets covering multiple reasoning types hinders evaluation of question answering (QA) over enterprise climate disclosures. Method: We introduce Climate Finance Bench, an open benchmark for climate disclosure QA comprising 33 sustainability reports spanning all 11 GICS sectors and 330 expert-validated QA pairs covering extraction, numerical reasoning, and logical reasoning. Building on this dataset, we systematically compare retrieval-augmented generation (RAG) approaches for climate QA. Contribution/Results: Our evaluation shows that the retriever's ability to locate answer-bearing passages is the chief bottleneck for end-to-end QA performance. We further advocate low-carbon AI practices, including weight quantization, to support transparent carbon reporting in AI-for-climate applications. Climate Finance Bench provides a reproducible, domain-specific evaluation framework, addressing a gap in climate-related QA benchmarking.
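To illustrate the retrieval bottleneck, the sketch below scores toy report chunks against a question with an off-the-shelf dense retriever and checks whether the annotated answer-bearing chunk appears in the top-k results. The model name, passages, and hit check are illustrative assumptions, not the paper's exact pipeline.

```python
# Minimal sketch of measuring retrieval hit rate for a climate-report QA pair.
# Assumes a plain cosine-similarity dense retriever (sentence-transformers).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Toy passages standing in for chunks of a sustainability report.
passages = [
    "Scope 1 emissions in 2023 totalled 1.2 MtCO2e, down 8% year on year.",
    "The board approved a new biodiversity policy in March 2023.",
    "Scope 2 market-based emissions were 0.4 MtCO2e in 2023.",
]
question = "What were the company's Scope 1 emissions in 2023?"
gold_passage_idx = 0  # annotated answer-bearing chunk

# Embed passages and the question, then rank passages by cosine similarity.
passage_emb = model.encode(passages, convert_to_tensor=True)
query_emb = model.encode(question, convert_to_tensor=True)
hits = util.semantic_search(query_emb, passage_emb, top_k=2)[0]

# A "hit" means the gold chunk was retrieved; averaging this over all QA pairs
# gives the retrieval hit rate that bounds downstream answer accuracy.
retrieved_ids = [h["corpus_id"] for h in hits]
print("retrieved:", retrieved_ids, "hit:", gold_passage_idx in retrieved_ids)
```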
📝 Abstract
Climate Finance Bench introduces an open benchmark that targets question answering over corporate climate disclosures using large language models. We curate 33 recent sustainability reports in English, drawn from companies across all 11 GICS sectors, and annotate 330 expert-validated question-answer pairs that span pure extraction, numerical reasoning, and logical reasoning. Building on this dataset, we present a comparison of retrieval-augmented generation (RAG) approaches. We show that the retriever's ability to locate passages that actually contain the answer is the chief performance bottleneck. We further argue for transparent carbon reporting in AI-for-climate applications, highlighting the advantages of techniques such as weight quantization.
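To make the low-carbon angle concrete, the following is a minimal sketch of loading a generator with 4-bit weight quantization via Hugging Face transformers and bitsandbytes; the checkpoint and configuration values are assumptions for illustration, not the models evaluated in the paper.

```python
# Sketch: 4-bit weight quantization to reduce memory and inference energy.
# The checkpoint below is an assumed example, not the paper's choice of model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                        # store weights in 4-bit NF4 format
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,    # compute in bf16 for stability
)

model_id = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)

# Answer a question given a retrieved report chunk as context.
prompt = (
    "Context: Scope 1 emissions in 2023 totalled 1.2 MtCO2e.\n"
    "Question: What were the company's Scope 1 emissions in 2023?\nAnswer:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```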