Listen to the Context: Towards Faithful Large Language Models for Retrieval Augmented Generation on Climate Questions

📅 2025-05-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) used for retrieval-augmented generation (RAG) on climate questions frequently produce factual hallucinations and outputs unfaithful to the retrieved documents. Method: the authors propose a faithfulness-oriented evaluation and fine-tuning pipeline: (1) an automatic metric that scores how many atomic claims in a model's output are supported by the retrieved passages; (2) a data-filtering step that removes unfaithful samples from the instruction fine-tuning data; and (3) context-aware instruction fine-tuning built on ClimateGPT. Contribution/Results: the resulting ClimateGPT Faithful+ model raises the rate of supported atomic claims on climate QA from 30% to 57%, a 27-percentage-point gain, underscoring the impact of training-data faithfulness on RAG fidelity.
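The atomic-claim support rate described above can be sketched as follows. The paper's actual metric presumably relies on an NLI or LLM-based entailment judge; here a simple lexical-overlap check stands in for that judge, and all names and the threshold are illustrative assumptions.

```python
# Hypothetical sketch of an atomic-claim support-rate metric.
# is_supported is a lexical proxy for the entailment check the paper
# would perform with a learned model; the 0.7 threshold is illustrative.

def is_supported(claim: str, passages: list[str], threshold: float = 0.7) -> bool:
    """Proxy entailment: fraction of claim tokens found in some passage."""
    claim_tokens = set(claim.lower().split())
    for passage in passages:
        passage_tokens = set(passage.lower().split())
        if claim_tokens and len(claim_tokens & passage_tokens) / len(claim_tokens) >= threshold:
            return True
    return False

def support_rate(atomic_claims: list[str], passages: list[str]) -> float:
    """Share of atomic claims supported by the retrieved passages."""
    if not atomic_claims:
        return 0.0
    supported = sum(is_supported(c, passages) for c in atomic_claims)
    return supported / len(atomic_claims)
```

A model output is first decomposed into atomic claims (a step omitted here), and the reported faithfulness score is the fraction of those claims the judge marks as supported.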

📝 Abstract
Large language models that use retrieval augmented generation have the potential to unlock valuable knowledge for researchers, policymakers, and the public by making long and technical climate-related documents more accessible. While this approach can help alleviate factual hallucinations by relying on retrieved passages as additional context, its effectiveness depends on whether the model's output remains faithful to these passages. To address this, we explore the automatic assessment of faithfulness of different models in this setting. We then focus on ClimateGPT, a large language model specialised in climate science, to examine which factors in its instruction fine-tuning impact the model's faithfulness. By excluding unfaithful subsets of the model's training data, we develop ClimateGPT Faithful+, which achieves an improvement in faithfulness from 30% to 57% in supported atomic claims according to our automatic metric.
Problem

Research questions and friction points this paper is trying to address.

Assessing faithfulness of retrieval-augmented LLMs in climate science
Improving model output fidelity to retrieved climate passages
Enhancing ClimateGPT's faithfulness via data filtering and fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrieval augmented generation for climate knowledge
Automatic faithfulness assessment of model outputs
Training data exclusion to improve model faithfulness
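The training-data exclusion idea can be sketched as a filtering pass over the fine-tuning samples: score each reference answer against its retrieved context with the faithfulness metric, and drop samples below a threshold. The field names, the scoring function, and the threshold below are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of filtering unfaithful fine-tuning samples.
# faithfulness_fn scores an answer against its retrieved context in [0, 1];
# samples scoring below min_support are excluded from training.

from typing import Callable

def filter_unfaithful(
    samples: list[dict],
    faithfulness_fn: Callable[[str, str], float],
    min_support: float = 0.5,
) -> list[dict]:
    """Keep only samples whose answer is sufficiently supported by its context."""
    return [
        s for s in samples
        if faithfulness_fn(s["answer"], s["context"]) >= min_support
    ]
```

Fine-tuning ClimateGPT on the filtered subset is what yields the Faithful+ variant reported above.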
David Thulke
RWTH Aachen University | AppTek
large language models, retrieval augmented generation
Jakob Kemmler
Machine Learning and Human Language Technology, RWTH Aachen University, Germany
Christian Dugast
AppTek GmbH, Aachen, Germany
Hermann Ney
RWTH Aachen University
Machine Learning, Speech Recognition, Machine Translation, Computer Vision