Finetuning Large Language Model as an Effective Symbolic Regressor

📅 2025-08-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the weak generalization, slow convergence, and difficulty in discovering complex governing equations that large language models (LLMs) exhibit in symbolic regression (SR), this paper introduces SymbArena, a large-scale, high-quality benchmark engineered for SR-oriented fine-tuning, comprising 148,102 equations formulated as corpora of 1.83 billion tokens, together with a novel form-level consistency metric. Building on SymbArena, the authors propose SymbolicChat, a specialized fine-tuning framework that integrates generative modeling with heuristic search. Experimental results show that SymbolicChat achieves a twofold improvement in R² score and an 8.37% gain in form-level consistency over the second-best LLM baseline. Notably, it is the first LLM-based approach to outperform traditional numerical methods in both numerical precision and symbolic correctness, establishing a new state of the art for LLMs in symbolic regression.

📝 Abstract
Deriving governing equations from observational data, known as Symbolic Regression (SR), is a cornerstone of scientific discovery. Large Language Models (LLMs) have shown promise in this task by leveraging their vast cross-disciplinary scientific knowledge. However, existing LLM-based methods primarily rely on direct inference or prompt engineering, often requiring excessive inference iterations to converge on correct formulas or failing to handle complex equation targets. These limitations in effectiveness and generalization stem from an inherent tension between pre-trained LLMs' proficiency in approximate reasoning and the high-precision demands of SR tasks. To bridge this gap, we propose to fine-tune LLMs for enhanced SR capability. Yet, the absence of dedicated datasets for SR-oriented fine-tuning remains a critical barrier. We thus introduce SymbArena, specifically engineered to optimize LLMs for SR. This benchmark comprises 148,102 diverse equations formulated as corpora of 1.83 billion tokens for LLM utilization, enabling effective training and inference. Further, SymbArena proposes a heuristic metric to precisely quantify form-level consistency, going beyond existing numerically oriented SR evaluation strategies. With this benchmark, we explore mainstream LLM fine-tuning techniques for SR tasks and establish SymbolicChat, a simple yet effective LLM-based SR strong baseline. Experimental results validate SymbolicChat as the first LLM to exceed traditional numerical methods in both numerical precision and symbolic form accuracy, outperforming the second-best LLM baseline with a 2-fold gain in R² score and an 8.37% gain in form-level consistency score.
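
The abstract distinguishes two evaluation axes: numerical precision (R²) and form-level consistency. The paper's heuristic consistency metric is not detailed here, so the sketch below is a minimal illustration only, using sympy symbolic equivalence as a stand-in for it; the function names are illustrative, not the paper's API.

```python
# Minimal sketch of the two SR evaluation axes: numerical fit (R^2) and a
# form-level check. sympy.simplify-based equivalence is an assumed stand-in
# for the paper's (unspecified) heuristic consistency metric.
import numpy as np
import sympy as sp

def r2_score(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Coefficient of determination between observations and a candidate's outputs."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def forms_match(candidate: str, ground_truth: str) -> bool:
    """Crude form-level check: do the two expressions simplify to the same thing?"""
    diff = sp.simplify(sp.sympify(candidate) - sp.sympify(ground_truth))
    return diff == 0

# Example: a candidate equation recovered by a model vs. the governing equation.
x_obs = np.linspace(0.1, 2.0, 50)
y_obs = 3 * x_obs**2 + 1          # observations from the true equation
y_hat = 3 * x_obs**2 + 1          # candidate evaluated on the same inputs
print(r2_score(y_obs, y_hat))                   # -> 1.0 (perfect numerical fit)
print(forms_match("3*x**2 + 1", "1 + 3*x**2"))  # -> True (same symbolic form)
```

A candidate can score well on one axis and poorly on the other (e.g., a high-R² polynomial fit with the wrong functional form), which is why the paper evaluates both.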
Problem

Research questions and friction points this paper is trying to address.

Bridging gap between LLMs' approximate reasoning and SR precision demands
Addressing lack of dedicated datasets for symbolic regression fine-tuning
Overcoming limitations of inference-based methods for complex equation discovery
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuned LLMs for symbolic regression tasks
SymbArena benchmark dataset for equation training (a data-formatting sketch follows this list)
Heuristic metric for form-level consistency evaluation
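
Since SymbArena is built as token corpora for supervised fine-tuning, one natural serialization is an observations-to-equation prompt/completion pair. The sketch below is a hypothetical illustration; the field names, prompt template, and sampling scheme are assumptions, not the paper's actual corpus format.

```python
# Hypothetical sketch of serializing one SR training pair for supervised
# fine-tuning. The actual SymbArena format is not shown here; the template
# and field names are illustrative assumptions.
import json
import numpy as np

def make_sr_sample(expr: str, n_points: int = 16) -> dict:
    """Sample (x, y) observations from a ground-truth equation and wrap them
    as an instruction-tuning prompt/completion pair."""
    rng = np.random.default_rng(0)
    x = rng.uniform(-2, 2, n_points)
    y = eval(expr, {"x": x, "np": np})  # evaluate the equation on the sample
    points = "\n".join(f"x={xi:.4f}, y={yi:.4f}" for xi, yi in zip(x, y))
    return {
        "prompt": "Given the observations below, propose the governing "
                  f"equation y = f(x).\n{points}\nEquation:",
        "completion": f" y = {expr}",
    }

print(json.dumps(make_sr_sample("3*x**2 + np.sin(x)"), indent=2))
```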
👥 Authors
Yingfan Hua
School of Cyber Science and Technology, University of Science and Technology of China; Shanghai AI Laboratory
Ruikun Li
M.Eng. Student, Tsinghua University (Scientific Machine Learning, Dynamical Systems)
Jun Yao
School of Cyber Science and Technology, University of Science and Technology of China; Shanghai AI Laboratory
Guohang Zhuang
Shanghai AI Laboratory
Shixiang Tang
Shanghai AI Laboratory
Bin Liu
School of Cyber Science and Technology, University of Science and Technology of China
Wanli Ouyang
Shanghai AI Laboratory; The Department of Electronic Engineering, The Chinese University of Hong Kong
Yan Lu
Shanghai AI Laboratory; The Department of Electronic Engineering, The Chinese University of Hong Kong