Hierarchical Bayesian Operator-induced Symbolic Regression Trees for Structural Learning of Scientific Expressions

📅 2025-09-23
🤖 AI Summary
Existing symbolic regression methods often rely on heuristic search or data-intensive black-box models, struggling to deliver interpretable results and principled uncertainty quantification under noisy conditions. To address this, we propose the first Bayesian symbolic regression framework with theoretical guarantees: a hierarchical Bayesian model incorporating a regularized tree prior and an operator-induction mechanism; a convergence theory grounded in marginal posterior model selection and expression-distance metrics; and an MCMC inference algorithm enabling joint structural learning and uncertainty quantification. Evaluated on synthetic benchmarks, the Feynman equations dataset, and single-atom catalysis data, our method consistently outperforms state-of-the-art approaches, achieving superior predictive accuracy, robustness to noise, and model parsimony while providing rigorous uncertainty estimates.

📝 Abstract
The advent of Scientific Machine Learning has heralded a transformative era in scientific discovery, driving progress across diverse domains. Central to this progress is uncovering scientific laws from experimental data through symbolic regression. However, existing approaches are dominated by heuristic algorithms or data-hungry black-box methods, which often demand low-noise settings and lack principled uncertainty quantification. Motivated by interpretable Statistical Artificial Intelligence, we develop a hierarchical Bayesian framework for symbolic regression that represents scientific laws as ensembles of tree-structured symbolic expressions endowed with a regularized tree prior. This coherent probabilistic formulation enables full posterior inference via an efficient Markov chain Monte Carlo algorithm, yielding a balance between predictive accuracy and structural parsimony. To guide symbolic model selection, we develop a marginal posterior-based criterion adhering to the Occam's window principle and further quantify structural fidelity to ground truth through a tailored expression-distance metric. On the theoretical front, we establish a near-minimax rate of Bayesian posterior concentration, providing the first rigorous guarantee in the context of symbolic regression. Empirical evaluation demonstrates robust performance of our proposed methodology against state-of-the-art competing methods on a simulated example, a suite of canonical Feynman equations, and a single-atom catalysis dataset.
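The paper itself provides no code, but the core idea of the abstract (tree-structured symbolic expressions weighted by a regularized tree prior that penalizes depth) can be sketched roughly as follows. All names, operator choices, and hyperparameters below are hypothetical illustrations; the split-probability form `alpha * (1 + depth)^(-beta)` is borrowed from the well-known BART-style tree prior and is only an assumption about what "regularized tree prior" might look like here.

```python
import math
from dataclasses import dataclass

# Hypothetical operator library: name -> arity.
OPERATORS = {"+": 2, "*": 2, "sin": 1}

@dataclass
class Node:
    """A node in a symbolic expression tree: an operator, the variable 'x',
    or a numeric constant stored as a string label."""
    label: str
    children: tuple = ()

    def eval(self, x: float) -> float:
        if self.label == "x":
            return x
        if self.label in OPERATORS:
            args = [c.eval(x) for c in self.children]
            if self.label == "+":
                return args[0] + args[1]
            if self.label == "*":
                return args[0] * args[1]
            return math.sin(args[0])  # "sin"
        return float(self.label)      # constant leaf

def log_prior(node: Node, depth: int = 0,
              alpha: float = 0.9, beta: float = 1.5) -> float:
    """BART-style depth-regularized tree prior (an assumption, not the
    paper's exact prior): a node is internal with probability
    alpha * (1 + depth) ** (-beta), so deeper expressions are penalized."""
    p_split = alpha * (1 + depth) ** (-beta)
    if node.children:
        return math.log(p_split) + sum(
            log_prior(c, depth + 1, alpha, beta) for c in node.children)
    return math.log(1.0 - p_split)

# Example expression: f(x) = sin(x) + x
expr = Node("+", (Node("sin", (Node("x"),)), Node("x")))
```

Under such a prior, the log-prior of a tree is a sum of per-node terms, so growing a subtree always adds a penalty that shrinks with depth, which is one simple way to encode the accuracy-versus-parsimony trade-off the abstract describes.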
Problem

Research questions and friction points this paper is trying to address.

Symbolic regression lacks principled uncertainty quantification
Existing methods require low-noise settings and large data
Heuristic algorithms dominate without rigorous theoretical guarantees
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Bayesian framework for symbolic regression
Markov chain Monte Carlo for posterior inference
Marginal posterior-based model selection criterion
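The two inference-side contributions above (MCMC over symbolic models and a marginal posterior-based selection criterion with Occam's window) can be illustrated with a toy Metropolis-Hastings sampler. Everything below is a hypothetical sketch: the candidate models, their log marginal posteriors, and the window factor of 20 are invented for illustration, and the paper's actual sampler proposes structural edits to expression trees rather than jumps over a fixed list.

```python
import math
import random

random.seed(0)

# Assumed log marginal posteriors for four toy candidate expressions.
log_post = {"x": -12.0, "sin(x)": -3.0, "x+sin(x)": -2.5, "x*x": -9.0}
models = list(log_post)

def mh_sample(n_iter: int = 5000) -> dict:
    """Metropolis-Hastings over a discrete model space with a symmetric
    uniform proposal; visit counts approximate posterior model probabilities."""
    current = random.choice(models)
    counts = {m: 0 for m in models}
    for _ in range(n_iter):
        proposal = random.choice(models)
        log_accept = log_post[proposal] - log_post[current]
        if math.log(random.random()) < log_accept:
            current = proposal
        counts[current] += 1
    return counts

counts = mh_sample()

# Occam's window: retain models whose marginal posterior is within a
# factor of 20 (an arbitrary illustrative choice) of the best model's.
best = max(log_post.values())
window = [m for m in models if best - log_post[m] < math.log(20)]
```

With these toy numbers, the chain concentrates on the highest-posterior model while the window keeps every model that is not decisively dominated, which mirrors how a marginal posterior criterion prunes the symbolic model space without committing to a single expression.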
Somjit Roy
Department of Statistics, Texas A&M University, College Station, TX 77843
Pritam Dey
Department of Statistics, Texas A&M University, College Station, TX 77843
Debdeep Pati
Professor, Department of Statistics, University of Wisconsin - Madison
Bayesian nonparametrics; high-dimensional data analysis
Bani K. Mallick
Department of Statistics, Texas A&M University, College Station, TX 77843