AI Summary
This work addresses the lack of systematic evaluation of large language models' (LLMs') ability to perform precise, judge-free cost reasoning in chemistry. We introduce ChemCost, a novel benchmark that formulates chemical cost estimation as a task grounded in real-world procurement data, requiring agents to identify compounds from reaction descriptions, retrieve supplier quotes, select packaging options, normalize quantities, and compute final costs. The benchmark integrates a chemical knowledge graph, snapshots of supplier pricing, and structured tool invocation, enabling stage-wise diagnostic analysis and robustness evaluation under noisy conditions. Experiments reveal that even state-of-the-art agents achieve only 50.6% accuracy (within 25% relative error) on clean inputs, with performance degrading significantly under perturbations, highlighting critical bottlenecks in parsing robustness and evidence integration.
Abstract
Large Language Models (LLMs) have become increasingly capable as tool-using agents, with benchmarks spanning diverse general agentic tasks. Yet rigorous evaluation of scientific tool use remains limited. In chemistry, recent agents can plan syntheses and invoke domain-specific tools, but evaluations often rely on curated demonstrations, expert assessment, or LLM-as-judge scoring rather than exact, judge-free ground truth. We address this gap with chemical procurement cost estimation, a practical task in which an agent must ground chemical identities, retrieve supplier quotes, select valid purchasable packs, normalize quantities, and compute cost from a reaction description. We introduce ChemCost, a benchmark of 1,427 evaluable reactions grounded to a frozen pricing snapshot covering 2,261 chemicals and 230,775 supplier quotes, supporting scalar scoring and stage-level diagnosis of grounding, retrieval, procurement, and arithmetic failures. To evaluate robustness, we further construct controlled noise-injected views that perturb chemical aliases, quantity expressions, missing fields, and input formatting. Experiments with frontier, open-weight, and chemistry-specialized LLM agents show that tool access is necessary but insufficient for solving the task. The strongest agents reach only 50.6% accuracy within 25% relative error on clean inputs and degrade substantially with realistic noise. Stage-level analysis further shows that failures arise from brittle parsing, ineffective evidence integration, invalid pack selection, and non-convergent tool use.
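To make the final two stages and the scoring rule concrete, the following is a minimal sketch of pack-based cost computation and the "within 25% relative error" accuracy criterion. It is not the benchmark's implementation: the `Pack` type, the function names, the gram-normalized quantities, and the rule of buying multiples of a single pack size are all assumptions introduced here for illustration. The abstract does not specify how ChemCost defines a valid pack selection.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class Pack:
    """One purchasable pack (hypothetical type, not from the benchmark)."""
    size_g: float  # pack size, assumed already normalized to grams
    price: float   # price per pack in a single currency


def cheapest_single_pack_cost(required_g: float, packs: list[Pack]) -> float:
    """Lowest cost of covering required_g using multiples of one pack size.

    Assumption: mixing pack sizes is not allowed. The real benchmark may
    use a different selection rule.
    """
    if required_g <= 0:
        raise ValueError("required_g must be positive")
    if not packs:
        raise ValueError("no purchasable packs available")
    # Whole packs only: round the pack count up for each candidate size.
    return min(ceil(required_g / p.size_g) * p.price for p in packs)


def within_relative_error(predicted: float, truth: float, tol: float = 0.25) -> bool:
    """True if |predicted - truth| / truth is at most tol."""
    if truth <= 0:
        raise ValueError("ground-truth cost must be positive")
    return abs(predicted - truth) / truth <= tol


def accuracy(predictions: list[float], truths: list[float], tol: float = 0.25) -> float:
    """Fraction of predictions that fall within tol relative error."""
    if len(predictions) != len(truths) or not truths:
        raise ValueError("need equal-length, non-empty inputs")
    hits = sum(within_relative_error(p, t, tol) for p, t in zip(predictions, truths))
    return hits / len(truths)


# Usage with made-up quotes: 12 g needed, offered as 5 g or 25 g packs.
quotes = [Pack(size_g=5, price=20.0), Pack(size_g=25, price=70.0)]
reference_cost = cheapest_single_pack_cost(12, quotes)  # three 5 g packs
score = accuracy([70.0, 80.0], [reference_cost, reference_cost])
```

Because the criterion is a plain numeric tolerance against a reference cost, a score of this kind needs no human or LLM judge, which is the sense in which the abstract calls the ground truth exact and judge-free.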