An Expert-Grounded Benchmark of General-Purpose LLMs in LCA

📅 2025-10-22
🤖 AI Summary
A standardized evaluation framework for large language models (LLMs) in life cycle assessment (LCA) contexts is lacking, hindering systematic assessment of scientific accuracy, explanation quality, robustness, verifiability, and instruction adherence. Method: We introduce LCA-Bench, the first expert-anchored LLM benchmark for LCA, comprising 22 domain-specific tasks evaluated by 17 LCA experts via double-blind review across 11 state-of-the-art models. Contribution/Results: Our evaluation reveals that 37% of model responses contain scientific errors or misleading content, with hallucinated citations reaching 40% for some models; yet most models perform well on explanation quality and formatting compliance. Notably, we find no clear-cut divide between open- and closed-weight models: open-weight models match or exceed closed-weight models on criteria such as accuracy and explanation quality. LCA-Bench establishes a reproducible methodology and empirically grounded benchmark to advance domain-specific LLM evaluation.

📝 Abstract
Purpose: Artificial intelligence (AI), and in particular large language models (LLMs), is increasingly being explored as a tool to support life cycle assessment (LCA). While demonstrations exist across environmental and social domains, systematic evidence on reliability, robustness, and usability remains limited. This study provides the first expert-grounded benchmark of LLMs in LCA, addressing the absence of standardized evaluation frameworks in a field where no clear ground truth or consensus protocols exist. Methods: We evaluated eleven general-purpose LLMs, spanning both commercial and open-source families, across 22 LCA-related tasks. Seventeen experienced practitioners reviewed model outputs against criteria directly relevant to LCA practice, including scientific accuracy, explanation quality, robustness, verifiability, and adherence to instructions. In total, we collected 168 expert reviews. Results: Experts judged 37% of responses to contain inaccurate or misleading information. Accuracy and quality of explanation were generally rated average or good for many models, including smaller ones, and format adherence was generally rated favourably. Hallucination rates varied significantly, with some models producing hallucinated citations at rates of up to 40%. There was no clear-cut distinction between open-weight and closed-weight LLMs, with open-weight models performing on par with or outperforming closed-weight models on criteria such as accuracy and quality of explanation. Conclusion: These findings highlight the risks of applying LLMs naïvely in LCA, such as treating them as free-form oracles, while also showing benefits, especially in quality of explanation and in reducing the labour intensity of simple tasks. The use of general-purpose LLMs without grounding mechanisms presents ...
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM reliability and robustness in life cycle assessment tasks
Establishing expert-grounded benchmark for LLM performance in LCA
Assessing scientific accuracy and hallucination risks in LCA applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Expert-grounded benchmark evaluates LLMs for LCA
Seventeen practitioners review model outputs systematically
Assess accuracy, explanation quality, and hallucination rates
Artur Donaldson
PRé Sustainability B.V., Stationsplein 121, Amersfoort, 3818 LE, The Netherlands
Bharathan Balaji
Senior Applied Scientist, Amazon
Sustainability · Machine Learning · Smart Buildings · Internet of Things · Distributed Systems
Cajetan Oriekezie
School of Engineering, London South Bank University, 103 Borough Road, London, SE1 0AA, United Kingdom
Manish Kumar
Helmholtz Institute Ulm for Electrochemical Energy Storage (HIU), Karlsruhe Institute of Technology, Helmholtzstraße 11, 89081 Ulm, Germany
Laure Patouillard
CIRAIG, Department of Chemical Engineering, Polytechnique Montréal, 3333 Queen Mary Rd Suite 310, Montreal, H3V 1A2 QC, Canada