MolErr2Fix: Benchmarking LLM Trustworthiness in Chemistry via Modular Error Detection, Localization, Explanation, and Revision

📅 2025-08-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) frequently generate chemically inaccurate descriptions in molecular science and lack capabilities for error detection and interpretable correction, undermining their scientific reliability. Method: We introduce MolErr2Fix, the first fine-grained benchmark for diagnosing and correcting chemical reasoning errors, comprising 1,193 human-annotated instances covering structural and semantic errors. It features a novel quadruple annotation schema (error type, location, explanation, and correction), integrating a structured error taxonomy with domain-specific chemical knowledge validation. The benchmark includes modular task designs and an open-source evaluation API. Contribution/Results: Comprehensive evaluation of mainstream LLMs reveals significant deficiencies in error localization and correction. MolErr2Fix effectively exposes model weaknesses, serving as a critical evaluation tool and data foundation for advancing trustworthy chemical reasoning in LLMs.

📝 Abstract
Large Language Models (LLMs) have shown growing potential in molecular sciences, but they often produce chemically inaccurate descriptions and struggle to recognize or justify potential errors. This raises important concerns about their robustness and reliability in scientific applications. To support more rigorous evaluation of LLMs in chemical reasoning, we present the MolErr2Fix benchmark, designed to assess LLMs on error detection and correction in molecular descriptions. Unlike existing benchmarks focused on molecule-to-text generation or property prediction, MolErr2Fix emphasizes fine-grained chemical understanding. It tasks LLMs with identifying, localizing, explaining, and revising potential structural and semantic errors in molecular descriptions. Specifically, MolErr2Fix consists of 1,193 fine-grained annotated error instances. Each instance carries a quadruple annotation, i.e., (error type, span location, explanation, and correction). These tasks are intended to reflect the types of reasoning and verification required in real-world chemical communication. Evaluations of current state-of-the-art LLMs reveal notable performance gaps, underscoring the need for more robust chemical reasoning capabilities. MolErr2Fix provides a focused benchmark for evaluating such capabilities and aims to support progress toward more reliable and chemically informed language models. All annotations and an accompanying evaluation API will be publicly released to facilitate future research.
Problem

Research questions and friction points this paper is trying to address.

Assessing LLM trustworthiness in chemical error detection
Evaluating error localization and explanation in molecular descriptions
Benchmarking LLM capabilities for chemical error correction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark for LLM error detection and correction in chemistry
Quadruple annotations for error type, location, explanation, correction
Fine-grained chemical understanding evaluation with 1,193 instances
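The quadruple annotation schema described above can be sketched as a simple data structure. The field names, error taxonomy, and example below are illustrative assumptions, not the benchmark's actual released format:

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical two-way taxonomy; the paper's actual error categories may be finer-grained.
class ErrorType(Enum):
    STRUCTURAL = "structural"
    SEMANTIC = "semantic"

@dataclass
class ErrorAnnotation:
    """One MolErr2Fix-style instance: the quadruple
    (error type, span location, explanation, correction)."""
    error_type: ErrorType
    span: tuple          # (start, end) character offsets of the erroneous text
    explanation: str     # why the span is chemically wrong
    correction: str      # revised text to replace the span

def apply_correction(description: str, ann: ErrorAnnotation) -> str:
    """Splice the annotated correction into the description."""
    start, end = ann.span
    return description[:start] + ann.correction + description[end:]

# Invented example for this sketch (not drawn from the benchmark data).
desc = "Ethanol contains a carbonyl group."
ann = ErrorAnnotation(
    error_type=ErrorType.SEMANTIC,
    span=(19, 27),       # the span "carbonyl"
    explanation="Ethanol (CH3CH2OH) bears a hydroxyl group, not a carbonyl.",
    correction="hydroxyl",
)
print(apply_correction(desc, ann))  # Ethanol contains a hydroxyl group.
```

A schema like this makes the four subtasks (detection, localization, explanation, revision) separately scorable, since each field can be compared against a model's output independently.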