SymbolFit: Automatic Parametric Modeling with Symbolic Regression

📅 2024-11-15
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Conventional parametric modeling relies on a manually specified functional form, which is hard to obtain for binned data when no closed-form expression can be derived from first principles. Method: SymbolFit is an automated parametric modeling framework that uses symbolic regression, driven by genetic programming, to search a space of analytic functions for closed-form expressions that fit the data. The same run also returns the fitted parameters with uncertainty estimates. Contribution/Results: The method removes the need for a pre-specified functional form. It is demonstrated on five real proton-proton collision datasets from LHC new-physics searches (background modeling in high-mass dijet, trijet, paired-dijet, diphoton, and dimuon resonance searches) and validated on several toy datasets with one or more variables.

📝 Abstract
We introduce SymbolFit, a framework that automates parametric modeling by using symbolic regression to perform a machine-search for functions that fit the data, while simultaneously providing uncertainty estimates in a single run. Traditionally, constructing a parametric model to accurately describe binned data has been a manual and iterative process, requiring an adequate functional form to be determined before the fit can be performed. The main challenge arises when the appropriate functional forms cannot be derived from first principles, especially when there is no underlying true closed-form function for the distribution. In this work, we address this problem by utilizing symbolic regression, a machine learning technique that explores a vast space of candidate functions without needing a predefined functional form, treating the functional form itself as a trainable parameter. Our approach is demonstrated in data analysis applications in high-energy physics experiments at the CERN Large Hadron Collider (LHC). We demonstrate its effectiveness and efficiency using five real proton-proton collision datasets from new physics searches at the LHC, namely the background modeling in resonance searches for high-mass dijet, trijet, paired-dijet, diphoton, and dimuon events. We also validate the framework using several toy datasets with one and more variables.
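As a rough illustration of the idea in the abstract (searching over functional forms instead of fixing one in advance, and getting parameter uncertainties from the same pass), here is a minimal toy sketch. It is not the SymbolFit API and not a genetic-programming search: it simply enumerates pairs from a small, hypothetical library of basis functions, fits `y = a*f(x) + b*g(x)` to binned data by weighted least squares, and keeps the form with the lowest chi-square. The names `BASIS`, `fit_pair`, and `search`, and the example data, are assumptions made for this sketch.

```python
import itertools
import math

# Hypothetical library of basis functions (an assumption for illustration only).
BASIS = {
    "1": lambda x: 1.0,
    "x": lambda x: x,
    "x^2": lambda x: x * x,
    "exp(-x)": lambda x: math.exp(-x),
    "1/x": lambda x: 1.0 / x,
    "log(x)": lambda x: math.log(x),
}


def fit_pair(xs, ys, sigmas, f, g):
    """Weighted least-squares fit of y = a*f(x) + b*g(x).

    Returns (a, b, err_a, err_b, chi2), or None if the two basis
    functions are (numerically) collinear on the given bins.
    """
    sff = sfg = sgg = sfy = sgy = 0.0
    for x, y, s in zip(xs, ys, sigmas):
        w = 1.0 / (s * s)  # inverse-variance weight of this bin
        fx, gx = f(x), g(x)
        sff += w * fx * fx
        sfg += w * fx * gx
        sgg += w * gx * gx
        sfy += w * fx * y
        sgy += w * gx * y
    det = sff * sgg - sfg * sfg  # determinant of the 2x2 normal matrix
    if det <= 1e-12 * sff * sgg:
        return None
    a = (sgg * sfy - sfg * sgy) / det
    b = (sff * sgy - sfg * sfy) / det
    # Parameter covariance is the inverse of the normal matrix;
    # its diagonal gives the squared 1-sigma uncertainties.
    err_a = math.sqrt(sgg / det)
    err_b = math.sqrt(sff / det)
    chi2 = sum(((y - a * f(x) - b * g(x)) / s) ** 2
               for x, y, s in zip(xs, ys, sigmas))
    return a, b, err_a, err_b, chi2


def search(xs, ys, sigmas):
    """Try every pair of basis functions and return the lowest-chi2 form."""
    best = None
    for (name_f, f), (name_g, g) in itertools.combinations(BASIS.items(), 2):
        result = fit_pair(xs, ys, sigmas, f, g)
        if result is None:
            continue
        if best is None or result[-1] < best[1][-1]:
            best = (f"a*{name_f} + b*{name_g}", result)
    return best


# Made-up binned data: bin centers, contents, and Poisson-like uncertainties.
xs = [0.5 * i for i in range(1, 11)]
ys = [100.0 * math.exp(-x) + 5.0 for x in xs]
sigmas = [math.sqrt(y) for y in ys]

form, (a, b, err_a, err_b, chi2) = search(xs, ys, sigmas)
print(form, a, err_a, b, err_b, chi2)
```

A real symbolic regression search explores a far larger space, including nested and nonlinear expressions, so exhaustive enumeration like this does not scale. The sketch only shows how the functional form can be treated as something to be searched for rather than supplied by hand.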

Problem

Research questions and friction points this paper is trying to address.

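Building a parametric model for binned data is a manual, iterative process
An adequate functional form must be chosen before the fit, yet it often cannot be derived from first principles
Candidate functions and their uncertainty estimates are not obtained together in a single run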

Innovation

Methods, ideas, or system contributions that make the work stand out.

Automates parametric modeling via symbolic regression
Provides uncertainty estimates in a single run
Demonstrated in high-energy physics at CERN LHC
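
To make "uncertainty estimates" concrete, the sketch below shows one standard way parameter uncertainties can be turned into an uncertainty band on a fitted function: linear error propagation, where the variance of f(x) is J^T C J for the gradient J of f with respect to the parameters and the parameter covariance matrix C. Whether SymbolFit builds its uncertainties exactly this way is an assumption here. The model `f(x) = a + b*exp(-x)`, the helper names, and the covariance values are all made up for illustration.

```python
import math


def band_halfwidth(jacobian, covariance):
    """Return the 1-sigma uncertainty sqrt(J^T C J) of f at one point."""
    n = len(jacobian)
    variance = 0.0
    for i in range(n):
        for j in range(n):
            variance += jacobian[i] * covariance[i][j] * jacobian[j]
    # Guard against tiny negative values caused by rounding.
    return math.sqrt(max(variance, 0.0))


def jacobian_at(x):
    """Gradient of the hypothetical model f(x) = a + b*exp(-x) w.r.t. (a, b)."""
    return [1.0, math.exp(-x)]


# Made-up parameter covariance matrix for (a, b); not a real fit result.
cov = [[0.25, -0.10],
       [-0.10, 4.00]]

for x in (0.5, 2.0, 5.0):
    print(x, band_halfwidth(jacobian_at(x), cov))
```

The off-diagonal terms matter: correlated parameters can make the band narrower or wider than the individual parameter errors alone would suggest.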

Authors

Ho Fung Tsoi
University of Pennsylvania, USA

Dylan S. Rankin
University of Pennsylvania, USA

C. Caillol
European Organization for Nuclear Research (CERN), Switzerland

Miles Cranmer
University of Cambridge
Machine Learning, Astrophysics, Fluid Dynamics

S. Dasu
University of Wisconsin-Madison, USA

Javier Duarte
University of California San Diego, USA

Philip Harris
MIT
Machine Learning, Dark Matter, Higgs boson, Gravitational Waves, FPGAs

E. Lipeles
University of Pennsylvania, USA

Vladimir Loncar
CERN