AI Summary
This work addresses the inefficiency of traditional large language model-driven symbolic regression, which relies on coarse-grained error feedback and struggles to identify effective or erroneous components within candidate equations. To overcome this limitation, the authors propose an influence-guided symbolic regression framework that formulates equation discovery as an iterative generate-and-select process. The approach leverages a large language model to propose candidate basis functions and employs fine-grained influence scores to quantify the marginal contribution of each component to generalization performance. By integrating Monte Carlo tree search, the method efficiently explores high-influence structural configurations. Notably, this is the first work to incorporate influence function analysis into symbolic regression, enabling component-level feedback for targeted pruning and optimization. The framework demonstrates superior performance on benchmarks from pharmacokinetics, epidemiological modeling, and genomics, and uncovers a novel association between DNA methylation and RNA polymerase II pausing in high-dimensional biological data, a hypothesis subsequently supported by wet-lab experiments.
Abstract
Large Language Models (LLMs) offer a promising avenue for scientific discovery, yet their application to symbolic regression is often constrained by inefficient search strategies and coarse feedback signals. Current methods typically guide LLMs using scalar metrics (e.g., global Mean Squared Error), which fail to identify which components of a proposed equation are driving performance or causing error. We introduce \textit{Influence-Guided Symbolic Regression} (IGSR), a method that frames equation discovery as an iterative two-step process combining diverse term generation with rigorous selection: an LLM generates candidate basis functions $\phi_j(\mathbf{x})$ for a linear model, which are then evaluated using granular influence scores $\Delta_j$. These scores quantify each term's marginal contribution to generalization accuracy, enabling an influence-guided pruning process that systematically refines the model structure. Integrating this mechanism into Monte Carlo Tree Search (MCTS) allows the method to navigate the combinatorial search space while balancing exploration of novel functional forms with exploitation of high-influence components. We demonstrate IGSR's effectiveness on a diverse suite of benchmarks, including LLM-SRBench, pharmacological PKPD models, an epidemiological simulation, and real-world genomic data. Notably, we validate the framework's capacity for genuine discovery in a case study using a high-dimensional biological dataset, in which IGSR identified a novel relationship between DNA methylation and RNA Polymerase II pausing, a hypothesis that was subsequently supported via wet-lab experimentation.
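To make the generate-and-select idea concrete, the following is a minimal sketch of per-term scoring and pruning for a linear model over candidate basis functions. The abstract does not give the formula for the influence scores, so this sketch substitutes a simple leave-one-term-out ablation on held-out data as a stand-in. The candidate terms, the synthetic data, the pruning threshold, and all function names are assumptions made for illustration, and the LLM proposal step and the MCTS search are not modeled.

```python
import numpy as np

def fit_linear(Phi, y):
    """Least-squares coefficients w for y ~ Phi @ w."""
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w

def mse(Phi, y, w):
    """Mean squared error of the linear model on (Phi, y)."""
    return float(np.mean((Phi @ w - y) ** 2))

def ablation_scores(Phi_tr, y_tr, Phi_val, y_val):
    """Score term j as (validation MSE without term j) - (validation MSE with all terms).

    This is an ablation proxy, NOT the paper's influence score.
    A large positive score means the term helps held-out accuracy.
    """
    n_terms = Phi_tr.shape[1]
    full = mse(Phi_val, y_val, fit_linear(Phi_tr, y_tr))
    scores = []
    for j in range(n_terms):
        keep = [k for k in range(n_terms) if k != j]
        w = fit_linear(Phi_tr[:, keep], y_tr)  # refit without term j
        scores.append(mse(Phi_val[:, keep], y_val, w) - full)
    return np.array(scores)

# Hypothetical candidate terms standing in for LLM proposals.
candidates = {
    "x": lambda x: x,
    "x^2": lambda x: x ** 2,
    "sin(x)": np.sin,
    "exp(-x)": lambda x: np.exp(-x),
}

# Synthetic data (assumed): y = 1.5 x^2 + 2 sin(x) + small Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, size=400)
y = 1.5 * x ** 2 + 2.0 * np.sin(x) + rng.normal(0.0, 0.05, size=400)

# Design matrix: one column per candidate term; 300 train rows, 100 validation rows.
Phi = np.column_stack([f(x) for f in candidates.values()])
Phi_tr, Phi_val = Phi[:300], Phi[300:]
y_tr, y_val = y[:300], y[300:]

names = list(candidates)
scores = ablation_scores(Phi_tr, y_tr, Phi_val, y_val)

# Prune terms whose removal barely changes validation error (threshold is arbitrary).
kept = [name for name, s in zip(names, scores) if s > 1e-2]
print(dict(zip(names, np.round(scores, 4))))
print("kept:", kept)
```

On this synthetic data the two generating terms, `x^2` and `sin(x)`, should receive clearly positive scores and be the only ones kept, while `x` and `exp(-x)` score near zero. The point of the sketch is the contrast with a single global MSE: each term gets its own signal, which is what makes targeted pruning possible.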