🤖 AI Summary
A standardized evaluation framework for large language models (LLMs) in life cycle assessment (LCA) contexts is lacking, hindering systematic assessment of scientific accuracy, explanation quality, robustness, verifiability, and instruction adherence. Method: We introduce LCA-Bench, the first expert-anchored LLM benchmark for LCA, comprising 22 domain-specific tasks evaluated by 17 LCA experts via double-blind review across 11 state-of-the-art models. Contribution/Results: Our evaluation reveals that 37% of model responses contain scientific errors or misleading content, with hallucinated citations reaching 40% for some models; yet most models perform well on explanation quality and formatting compliance. Critically, we find that open-weight models match or outperform closed-weight models on several LCA criteria, a capability pattern not previously reported. LCA-Bench establishes a reproducible methodology and an empirically grounded benchmark to advance domain-specific LLM evaluation.
📝 Abstract
Purpose: Artificial intelligence (AI), and in particular large language models (LLMs), is increasingly being explored as a tool to support life cycle assessment (LCA). While demonstrations exist across environmental and social domains, systematic evidence on reliability, robustness, and usability remains limited. This study provides the first expert-grounded benchmark of LLMs in LCA, addressing the absence of standardized evaluation frameworks in a field with no clear ground truth or consensus protocols.
Methods: We evaluated eleven general-purpose LLMs, spanning both commercial and open-source families, across 22 LCA-related tasks. Seventeen experienced practitioners reviewed model outputs against criteria directly relevant to LCA practice, including scientific accuracy, explanation quality, robustness, verifiability, and adherence to instructions. We collected 168 expert reviews.
Results: Experts judged 37% of responses to contain inaccurate or misleading information. Nonetheless, accuracy and quality of explanation were rated average or good for many models, including smaller ones, and format adherence was generally rated favourably. Hallucination rates varied widely, with some models producing hallucinated citations in up to 40% of responses. There was no clear-cut divide between open-weight and closed-weight LLMs: open-weight models matched or outperformed closed-weight models on criteria such as accuracy and quality of explanation.
Conclusion: These findings highlight the risks of applying LLMs naïvely in LCA, for example when they are treated as free-form oracles, while also demonstrating benefits, particularly in quality of explanation and in reducing the labour intensity of simple tasks. The use of general-purpose LLMs without grounding mechanisms presents ...