AI Summary
This work addresses a critical limitation in existing scientific design benchmarks, which evaluate large language models (LLMs) solely on final performance after a fixed number of iterations and ignore the optimization trajectory. To remedy this, the authors propose LEAPBench, an evaluation framework that makes trajectory efficiency a core assessment dimension. LEAPBench uses the area under the curve (AUC) of the best-so-far optimization trajectory as its metric, establishes a classical Bayesian optimization baseline, adds a literature-anchored audit, and uses the trajectory score as a reward signal for offline reinforcement learning. Experiments across 55 tasks and eight LLMs show that switching from final-outcome to trajectory scoring changes which model is judged best on 53% of tasks at matched horizons; offline reinforcement learning improves performance on 14 of 21 held-out tasks; and LLMs do not outperform the classical Bayesian optimization baseline. The study argues that endpoint-only evaluation is inadequate and proposes trajectory-aware evaluation as a more reliable paradigm for iterative scientific design.
Abstract
LLMs are increasingly deployed in autonomous laboratories, under the assumption that their domain priors and reasoning over iterative feedback let them converge on good designs in fewer iterations than feedback-only baselines. Current iterative scientific design benchmarks, however, score only outcome snapshots at fixed horizons. This leaves the learning trajectory unmeasured, even though the trajectory is what captures learning efficiency: each iteration saved is a real saving in cost and time. Motivated by this, we examine three evaluation choices that change the conclusions one draws about LLM learning efficiency in iterative scientific design: what to measure, what baseline to compare against, and what to ground against. We introduce LEAPBench (Learning Efficiency in Adaptive Processes), a 55-task framework that pairs a best-so-far area under the curve (AUC) trajectory metric with a classical Bayesian-optimization reference and an audit grounded in published literature. Applied to eight contemporary LLMs, switching from final-outcome to trajectory scoring changes the best-model decision on 53% of tasks at matched horizons and exposes efficiency gains overlooked by outcome-based scoring. LLMs do not outperform a classical Bayesian baseline. On 16 biology tasks where the oracle's reward signal is aligned with configurations from the published-best design, domain-aware prompting leads to LLM choices that match the published-best design approximately 10 percentage points less often than domain-agnostic prompting at iteration 30. The pattern is sharpest on the 6 tasks where the literature-typical and published-best configurations diverge: domain-agnostic prompting matches the published-best design more often on all 6. The trajectory metric also doubles as a tractable training target: offline reinforcement learning with the metric as a reward improves performance on 14 of 21 held-out tasks.
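To make the distinction between final-outcome and trajectory scoring concrete, the following is a minimal sketch of a best-so-far AUC computation. It is an illustration under stated assumptions, not LEAPBench's implementation: the abstract does not specify the normalization, so the sketch assumes a maximization objective and takes the mean of the best-so-far curve over the horizon as the AUC. The function names and the two example trajectories are hypothetical.

```python
from typing import List, Sequence


def best_so_far(values: Sequence[float]) -> List[float]:
    """Running maximum: the best objective value seen up to each iteration.

    Assumes higher is better (an assumption; not stated in the abstract).
    """
    if not values:
        raise ValueError("trajectory must contain at least one iteration")
    curve: List[float] = []
    best = float("-inf")
    for v in values:
        best = max(best, v)
        curve.append(best)
    return curve


def final_outcome(values: Sequence[float]) -> float:
    """Endpoint score: the best value found by the last iteration."""
    return best_so_far(values)[-1]


def best_so_far_auc(values: Sequence[float]) -> float:
    """Trajectory score: mean of the best-so-far curve over the horizon.

    This is a rectangle-rule AUC divided by the number of iterations,
    an assumed normalization chosen so horizons are comparable.
    """
    curve = best_so_far(values)
    return sum(curve) / len(curve)


# Hypothetical trajectories at a matched horizon of 5 iterations.
fast_learner = [0.2, 0.8, 0.8, 0.8, 0.8]  # finds a good design early
late_bloomer = [0.1, 0.1, 0.2, 0.3, 0.9]  # slightly better, but only at the end

endpoint_winner = max(
    ("fast_learner", fast_learner), ("late_bloomer", late_bloomer),
    key=lambda pair: final_outcome(pair[1]),
)[0]
trajectory_winner = max(
    ("fast_learner", fast_learner), ("late_bloomer", late_bloomer),
    key=lambda pair: best_so_far_auc(pair[1]),
)[0]

print("endpoint winner:", endpoint_winner)
print("trajectory winner:", trajectory_winner)
```

In this toy case the two scoring rules disagree: the late bloomer wins on the endpoint, while the fast learner wins on the trajectory because it reaches a good design several iterations sooner. This is the kind of reversal the abstract reports on 53% of tasks.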