CF-RAG: A Dataset and Method for Carbon Footprint QA Using Retrieval-Augmented Generation

📅 2025-08-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Extracting carbon footprint information from non-standardized PDF sustainability reports is difficult because the reports mix text and tables, vary widely in formatting, and lack a uniform structure. Method: We propose CF-RAG, the first framework for carbon footprint question answering over unstructured text derived from PDF parsing. It jointly models tabular and textual content and fine-tunes Llama 3 with retrieval-augmented generation (RAG) to improve understanding of PDF documents that mix tables and text. Contribution/Results: We introduce CarbonPDF-QA, the first fine-grained, human-annotated dataset for this task, and show that CF-RAG outperforms GPT-4o and state-of-the-art table–text joint reasoning models on carbon footprint QA. The work points toward automated, structured analysis of sustainability reporting.

📝 Abstract

Product sustainability reports provide valuable insights into the environmental impacts of a product and are often distributed in PDF format. These reports typically combine tables and text, which complicates their analysis. The lack of standardization and the variability in reporting formats make it even harder to extract and interpret relevant information from large volumes of documents. In this paper, we tackle the challenge of answering questions related to carbon footprints within sustainability reports available in PDF format. Unlike previous approaches, our focus is on addressing the difficulties posed by the unstructured and inconsistent nature of text extracted from PDF parsing. To facilitate this analysis, we introduce CarbonPDF-QA, an open-source dataset containing question-answer pairs for 1,735 product report documents, along with human-annotated answers. Our analysis shows that GPT-4o struggles to answer questions with data inconsistencies. To address this limitation, we propose CarbonPDF, an LLM-based technique specifically designed to answer carbon footprint questions on such datasets. We develop CarbonPDF by fine-tuning Llama 3 with our training data. Our results show that our technique outperforms current state-of-the-art techniques, including question-answering (QA) systems fine-tuned on table and text data.

Problem

Research questions and friction points this paper is trying to address.

Extracting carbon footprint data from unstructured PDF reports

Handling inconsistent text and tables in sustainability reports

Improving QA accuracy for carbon footprint questions

Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrieval-Augmented Generation for carbon footprint QA

Llama 3 fine-tuned on the CarbonPDF-QA dataset

Handling of unstructured text and table data extracted from PDFs
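
To make the retrieval-augmented setup listed above more concrete, here is a minimal Python sketch of the general pattern: split parsed PDF text into chunks, retrieve the chunks most relevant to a question, and assemble a prompt for a language model. It is not the authors' implementation. The function names, the token-overlap scoring (a stand-in for a real retriever), and the sample report text are all assumptions made for illustration, and the call to a fine-tuned Llama 3 model is deliberately left out.

```python
import re
from collections import Counter

# Invented sample of text as it might come out of a PDF parser (not from the dataset).
report_text = """
Product X laptop environmental report
Total carbon footprint 300 kgCO2e
Manufacturing 70 %
Use 20 %
Transport 8 %
End of life 2 %
"""

question = "What is the total carbon footprint of the laptop?"


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_into_chunks(text, max_lines=2):
    """Group the non-empty lines of parsed text into small chunks."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return [" ".join(lines[i:i + max_lines]) for i in range(0, len(lines), max_lines)]


def retrieve(question, chunks, top_k=1):
    """Return the top_k chunks ranked by token overlap with the question.

    Token overlap is a hypothetical stand-in for a learned or embedding-based retriever.
    """
    q_counts = Counter(tokenize(question))

    def score(chunk):
        c_counts = Counter(tokenize(chunk))
        return sum(min(q_counts[t], c_counts[t]) for t in q_counts)

    return sorted(chunks, key=score, reverse=True)[:top_k]


def build_prompt(question, retrieved):
    """Assemble the retrieved context and the question into one prompt string."""
    context = "\n".join(f"- {c}" for c in retrieved)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


if __name__ == "__main__":
    chunks = split_into_chunks(report_text)
    print(build_prompt(question, retrieve(question, chunks)))
```

In a full pipeline, the resulting prompt would be passed to the generator model. Keeping retrieval and prompt assembly as separate functions makes it easy to swap the overlap scorer for a stronger retriever without touching the rest.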

🔎 Similar Papers

WeQA: A Benchmark for Retrieval Augmented Generation in Wind Energy Domain