🤖 AI Summary
Large language models (LLMs) often show inflated performance on scientific equation discovery because common benchmark equations can be memorized from training data, so high scores need not reflect genuine discovery. To address this, we introduce LLM-SRBench, a benchmark designed specifically for evaluating LLM-based scientific equation discovery, comprising 239 challenging tasks across four scientific domains. The benchmark has two complementary task categories: LSR-Transform, which recasts well-known physical models into less common algebraic forms to test reasoning beyond memorized representations, and LSR-Synth, which poses synthetic, discovery-driven problems that combine known terms with novel synthetic ones and require data-driven reasoning. Evaluating several state-of-the-art discovery methods with both open and closed LLM backbones, we find that the best system reaches only 31.5% symbolic accuracy, underscoring how far current models are from reliable equation discovery beyond recall and positioning LLM-SRBench as a rigorous testbed for future work.
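To make the LSR-Transform recipe concrete, below is a minimal sympy sketch: start from a well-known physical model and solve it for a different variable, so the target equation takes a mathematically equivalent but less familiar form. The pendulum equation and variable names are illustrative choices, not problems taken from the benchmark.

```python
# Illustrative sketch of the LSR-Transform recipe (hypothetical example,
# not an actual benchmark task): re-express a memorization-prone equation
# in a less common form by solving for a different variable.
import sympy as sp

T, L, g = sp.symbols("T L g", positive=True)

# Familiar form: period of a simple pendulum, T = 2*pi*sqrt(L/g).
pendulum = sp.Eq(T, 2 * sp.pi * sp.sqrt(L / g))

# Transformed target: solve for g. The result, g = 4*pi**2*L/T**2, is
# equivalent but appears far less often verbatim in text, so recovering
# it from data tests reasoning rather than recall.
g_expr = sp.solve(pendulum, g)[0]
print(g_expr)  # 4*pi**2*L/T**2
```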
📝 Abstract
Scientific equation discovery is a fundamental task in the history of scientific progress, enabling the derivation of laws governing natural phenomena. Recently, Large Language Models (LLMs) have gained interest for this task due to their potential to leverage embedded scientific knowledge for hypothesis generation. However, evaluating the true discovery capabilities of these methods remains challenging, as existing benchmarks often rely on common equations that are susceptible to memorization by LLMs, leading to inflated performance metrics that do not reflect discovery. In this paper, we introduce LLM-SRBench, a comprehensive benchmark with 239 challenging problems across four scientific domains specifically designed to evaluate LLM-based scientific equation discovery methods while preventing trivial memorization. Our benchmark comprises two main categories: LSR-Transform, which transforms common physical models into less common mathematical representations to test reasoning beyond memorized forms, and LSR-Synth, which introduces synthetic, discovery-driven problems requiring data-driven reasoning. Through extensive evaluation of several state-of-the-art methods, using both open and closed LLMs, we find that the best-performing system so far achieves only 31.5% symbolic accuracy. These findings highlight the challenges of scientific equation discovery, positioning LLM-SRBench as a valuable resource for future research.
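Since the headline metric is symbolic accuracy rather than numeric fit, the sketch below illustrates the kind of check that distinguishes the two; the simplify-based equivalence test is an assumed stand-in for illustration, and the benchmark's actual judging procedure may differ.

```python
# Minimal sketch of a symbolic-accuracy check: a candidate equation counts
# as correct only if it is symbolically equivalent to the ground truth,
# not merely a close numeric fit. The simplify-based test is an assumed
# stand-in for the benchmark's actual equivalence-judging procedure.
import sympy as sp

x = sp.symbols("x")

ground_truth = sp.sin(x) ** 2         # reference equation
candidate = (1 - sp.cos(2 * x)) / 2   # equivalent rewrite: should pass
near_miss = sp.sin(x) ** 2 + 0.01     # close numeric fit: should fail

def symbolically_equivalent(a, b) -> bool:
    """True when the difference of the two expressions simplifies to zero."""
    return sp.simplify(a - b) == 0

print(symbolically_equivalent(ground_truth, candidate))  # True
print(symbolically_equivalent(ground_truth, near_miss))  # False
```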