Can Agents Price a Reaction? Evaluating LLMs on Chemical Cost Reasoning

📅 2026-05-08
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the lack of systematic evaluation of large language models’ (LLMs’) ability to perform precise, judge-free cost reasoning in chemistry. It introduces ChemCost, a benchmark that formulates chemical cost estimation as a task grounded in real-world procurement data: an agent must identify compounds from a reaction description, retrieve supplier quotes, select packaging options, normalize quantities, and compute the final cost. The benchmark integrates a chemical knowledge graph, snapshots of supplier pricing, and structured tool invocation, enabling stage-wise diagnostic analysis and robustness evaluation under noisy conditions. Experiments show that even the strongest agents reach only 50.6% accuracy (within 25% relative error) on clean inputs, and that performance degrades substantially under perturbations, pointing to bottlenecks in parsing robustness and evidence integration.
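The "accuracy within 25% relative error" figure can be read as a simple tolerance check on each predicted cost. The sketch below is only an illustration of that reading, not the benchmark's scoring code: the function names are hypothetical, and it assumes that the threshold is inclusive and that a missing answer counts as a miss.

```python
from typing import Optional, Sequence


def relative_error(predicted: float, reference: float) -> float:
    """Absolute difference from the reference cost, as a fraction of the reference."""
    if reference <= 0:
        raise ValueError("reference cost must be positive")
    return abs(predicted - reference) / reference


def accuracy_within(
    predictions: Sequence[Optional[float]],
    references: Sequence[float],
    tolerance: float = 0.25,
) -> float:
    """Fraction of reactions whose predicted cost is within `tolerance` relative error.

    Assumptions (not stated in the source): the bound is inclusive, and a
    prediction of None (the agent produced no answer) is scored as a miss.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one reaction is required")
    hits = sum(
        1
        for predicted, reference in zip(predictions, references)
        if predicted is not None and relative_error(predicted, reference) <= tolerance
    )
    return hits / len(references)


# Made-up costs for four reactions: exact, 30% high, unanswered, 20% low.
score = accuracy_within([100.0, 130.0, None, 80.0], [100.0, 100.0, 50.0, 100.0])
print(score)
```

Because the reference cost is a single number per reaction, a check of this kind needs no human or LLM judge, which is the property the summary calls "judge-free".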
📝 Abstract
Large Language Models (LLMs) have become increasingly capable as tool-using agents, with benchmarks spanning diverse general agentic tasks. Yet rigorous evaluation of scientific tool use remains limited. In chemistry, recent agents can plan syntheses and invoke domain-specific tools, but evaluations often rely on curated demonstrations, expert assessment, or LLM-as-judge scoring rather than exact, judge-free ground truth. We address this gap with chemical procurement cost estimation, a practical task in which an agent must ground chemical identities, retrieve supplier quotes, select valid purchasable packs, normalize quantities, and compute cost from a reaction description. We introduce ChemCost, a benchmark of 1,427 evaluable reactions grounded to a frozen pricing snapshot covering 2,261 chemicals and 230,775 supplier quotes, supporting scalar scoring and stage-level diagnosis of grounding, retrieval, procurement, and arithmetic failures. To evaluate robustness, we further construct controlled noise-injected views that perturb chemical aliases, quantity expressions, missing fields, and input formatting. Experiments with frontier, open-weight, and chemistry-specialized LLM agents show that tool access is necessary but insufficient for solving the task. The strongest agents reach only 50.6% accuracy within 25% relative error on clean inputs and degrade substantially with realistic noise. Stage-level analysis further shows that failures arise from brittle parsing, ineffective evidence integration, invalid pack selection, and non-convergent tool use.
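The procurement steps named in the abstract (normalize quantities, select a purchasable pack, compute cost) can be sketched on toy data. This is a hedged illustration, not ChemCost's actual procedure: the `PackQuote` type, the function names, the mass-only unit table, the supplier names, and the prices are all hypothetical, and the pack rule assumed here (buy whole packs of a single size, choosing the size with the lowest total) is an assumption, since the source does not state the benchmark's rule.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Assumed conversion table; only mass units are handled in this sketch.
UNIT_TO_GRAMS = {"mg": 1e-3, "g": 1.0, "kg": 1e3}


@dataclass(frozen=True)
class PackQuote:
    """One purchasable pack from a supplier (hypothetical schema)."""
    supplier: str
    size: float
    unit: str
    price: float


def to_grams(amount: float, unit: str) -> float:
    """Normalize a mass quantity to grams."""
    if unit not in UNIT_TO_GRAMS:
        raise ValueError(f"unsupported unit: {unit!r}")
    return amount * UNIT_TO_GRAMS[unit]


def cheapest_single_pack_cost(required_g: float, quotes: List[PackQuote]) -> Optional[float]:
    """Lowest cost of covering `required_g` with whole packs of one size.

    Returns None when no usable quote exists.
    """
    best: Optional[float] = None
    for quote in quotes:
        size_g = to_grams(quote.size, quote.unit)
        if size_g <= 0:
            continue  # skip malformed quotes
        packs_needed = math.ceil(required_g / size_g)
        cost = packs_needed * quote.price
        if best is None or cost < best:
            best = cost
    return best


def reaction_cost(
    requirements: Dict[str, Tuple[float, str]],
    catalog: Dict[str, List[PackQuote]],
) -> float:
    """Sum the cheapest pack cost over every chemical the reaction needs."""
    total = 0.0
    for chemical, (amount, unit) in requirements.items():
        cost = cheapest_single_pack_cost(to_grams(amount, unit), catalog.get(chemical, []))
        if cost is None:
            raise LookupError(f"no quote available for {chemical}")
        total += cost
    return total


# Toy catalog with invented suppliers and prices.
catalog = {
    "reagent_a": [PackQuote("S1", 1, "g", 10.0), PackQuote("S2", 5, "g", 25.0)],
    "solvent_b": [PackQuote("S1", 100, "g", 40.0), PackQuote("S3", 500, "g", 150.0)],
}
requirements = {"reagent_a": (2500, "mg"), "solvent_b": (0.25, "kg")}
print(reaction_cost(requirements, catalog))
```

Even in this simplified form, a wrong unit conversion or an invalid pack choice changes the final number, which is why an error at any stage can push a prediction outside the 25% tolerance.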
Problem

Research questions and friction points this paper is trying to address.

chemical cost estimation
large language models
tool-using agents
benchmark evaluation
reaction procurement
Innovation

Methods, ideas, or system contributions that make the work stand out.

ChemCost
chemical cost reasoning
tool-using agents
ground truth evaluation
robustness to noise