Knowledge Integration for Physics-informed Symbolic Regression Using Pre-trained Large Language Models

📅 2025-09-03
📈 Citations: 0 (Influential: 0)
🤖 AI Summary
Physics-informed symbolic regression (PiSR) relies heavily on manual injection of domain knowledge and generalizes poorly across problems. Method: We propose an LLM-driven framework for automatic knowledge integration that embeds large language models (Falcon, Mistral, and Llama 2) directly into the symbolic regression loss function. Through prompt engineering, the LLMs encode physical priors and constrain the evolutionary search as it runs. The framework is algorithm-agnostic and integrates with DEAP, gplearn, and PySR. Contribution/Results: Experiments on several physical dynamical systems show improved equation-discovery accuracy and noise robustness. Performance remains stable across LLM and algorithm combinations, which reduces dependence on domain-expert guidance. The approach moves PiSR toward automated, general-purpose scientific discovery without sacrificing interpretability or physical consistency.

📝 Abstract
Symbolic regression (SR) has emerged as a powerful tool for automated scientific discovery, enabling the derivation of governing equations from experimental data. A growing body of work shows that integrating domain knowledge into SR can improve the generality and usefulness of the discovered equations. Physics-informed SR (PiSR) incorporates such knowledge, but current methods often require specialized formulations and manual feature engineering, which limits their use to domain experts. In this study, we leverage pre-trained Large Language Models (LLMs) to facilitate knowledge integration in PiSR. By harnessing the contextual understanding of LLMs trained on vast scientific literature, we aim to automate the incorporation of domain knowledge, reducing the need for manual intervention and making the process applicable to a broader range of scientific problems. Specifically, the LLM is integrated into the SR loss function by adding a term that reflects the LLM's evaluation of the equation produced by SR. We extensively evaluate our method using three SR algorithms (DEAP, gplearn, and PySR) and three pre-trained LLMs (Falcon, Mistral, and Llama 2) across three physical dynamics (dropping ball, simple harmonic motion, and electromagnetic wave). The results demonstrate that LLM integration consistently improves the reconstruction of physical dynamics from data, enhancing the robustness of SR models to noise and complexity. We further explore the impact of prompt engineering, finding that more informative prompts significantly improve performance.
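The loss-function idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the additive form (data error plus a weighted penalty), the `weight` parameter, the assumption that the LLM returns a plausibility score in [0, 1], and the `stub_llm_score` function (a keyword check standing in for a real prompted LLM) are all hypothetical choices made for illustration. The example uses noise-free dropping-ball data.

```python
import numpy as np
from typing import Callable


def data_loss(predict: Callable[[np.ndarray], np.ndarray],
              t: np.ndarray, y: np.ndarray) -> float:
    """Mean squared error between a candidate equation's output and the data."""
    return float(np.mean((predict(t) - y) ** 2))


def stub_llm_score(expr: str) -> float:
    """Hypothetical stand-in for an LLM evaluation, returning a score in [0, 1].

    A real setup would prompt a pre-trained LLM with the candidate equation
    and a description of the physics. Here we only check for a quadratic
    time term, which is what free fall under constant gravity implies.
    """
    return 1.0 if "t**2" in expr else 0.0


def combined_loss(expr: str,
                  predict: Callable[[np.ndarray], np.ndarray],
                  t: np.ndarray, y: np.ndarray,
                  llm_score: Callable[[str], float],
                  weight: float = 1.0) -> float:
    """Data error plus a penalty that grows as the LLM score drops (assumed form)."""
    score = min(max(llm_score(expr), 0.0), 1.0)  # clamp to [0, 1]
    return data_loss(predict, t, y) + weight * (1.0 - score)


# Noise-free dropping-ball data: height = h0 - 0.5 * g * t^2, with g = 9.81.
t = np.linspace(0.0, 2.0, 50)
y = 10.0 - 4.905 * t**2

# Two candidate equations an SR algorithm might propose.
quad_loss = combined_loss("10.0 - 4.905*t**2",
                          lambda t: 10.0 - 4.905 * t**2,
                          t, y, stub_llm_score)
lin_loss = combined_loss("10.0 - 9.81*t",
                         lambda t: 10.0 - 9.81 * t,
                         t, y, stub_llm_score)

print(f"quadratic candidate: {quad_loss:.4f}")
print(f"linear candidate:    {lin_loss:.4f}")
```

Because the penalty only adds a scalar to each candidate's fitness, a term of this shape could in principle be attached to any SR algorithm that accepts a custom fitness or loss function, which matches the algorithm-agnostic framing above.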
Problem

Research questions and friction points this paper is trying to address.

Automating physics-informed symbolic regression with LLMs
Reducing manual feature engineering in equation discovery
Enhancing robustness to noise and complexity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates LLMs into symbolic regression loss functions
Uses pre-trained LLMs for automated domain knowledge incorporation
Combines multiple SR algorithms with LLM physics evaluation
Bilge Taskin
Department of Computing, Jonkoping University, Jonkoping, Sweden
Wenxiong Xie
Department of Computing, Jonkoping University, Jonkoping, Sweden
Teddy Lazebnik
Assistant Professor
Computational Mathematics, Scientometrics, Biomathematics, Socio-economic simulations