Data-Informed Model Complexity Metric for Optimizing Symbolic Regression Models

📅 2025-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Model complexity selection in symbolic regression is highly subjective and often leads to poor generalization. Method: This paper proposes a data-adaptive complexity measure that jointly leverages the Hessian rank and intrinsic dimensionality (ID) for model selection. It introduces the first efficient approximation of the average Hessian rank using only three sampling points, and, uniquely, integrates this with twelve ID estimators from the scikit-dimension library within symbolic regression. This approach eliminates reliance on hand-tuned parameters (e.g., parsimony pressure). Contribution/Results: Embedded in the StackGP framework with a post-hoc selection mechanism, the method achieves significantly improved generalization on the PMLB benchmark. It automatically identifies the optimal complexity window balancing accuracy and expressivity, thereby effectively mitigating overfitting without human intervention.
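The summary's core quantity, the model-output Hessian rank averaged over a few sample points, can be illustrated with finite differences in plain numpy. This is a generic sketch under assumed tolerances and an invented example model, not the StackGP implementation:

```python
import numpy as np

def numerical_hessian(f, x, eps=1e-4):
    """Central finite-difference Hessian of a scalar function f at point x."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = eps
            ej = np.zeros(n); ej[j] = eps
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps ** 2)
    return H

def avg_hessian_rank(f, points, tol=1e-6):
    """Average numerical rank of the Hessian over sample points (N=3 in the paper)."""
    ranks = [np.linalg.matrix_rank(numerical_hessian(f, p), tol=tol)
             for p in points]
    return float(np.mean(ranks))

# Hypothetical model: curvature involves only x0 and x1, so rank is 2
# regardless of the 4-dimensional input space.
model = lambda x: x[0] * x[1] + x[0] ** 2
rng = np.random.default_rng(0)
pts = rng.uniform(-1, 1, size=(3, 4))
print(avg_hessian_rank(model, pts))  # -> 2.0
```

The rank of the Hessian counts independent curvature directions, which is why it can serve as a complexity proxy that ignores inputs the model uses only linearly or not at all.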

📝 Abstract
Choosing models from a well-fitted evolved population that generalize beyond training data is difficult. We introduce a pragmatic method to estimate model complexity using Hessian rank for post-processing selection. Complexity is approximated by averaging the model output Hessian rank across a few points (N=3), offering efficient and accurate rank estimates. This method aligns model selection with input data complexity, calculated using intrinsic dimensionality (ID) estimators. Using the StackGP system, we develop symbolic regression models for the Penn Machine Learning Benchmark and employ twelve scikit-dimension library methods to estimate ID, aligning model expressiveness with dataset ID. Our data-informed complexity metric finds the ideal complexity window, balancing model expressiveness and accuracy, enhancing generalizability without the bias common in methods reliant on user-defined parameters, such as parsimony pressure in weight selection.
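The data side of the metric is the dataset's intrinsic dimensionality. The paper uses twelve estimators from the scikit-dimension library behind a common fit-style interface; as a self-contained illustration (not the paper's code), here is a minimal numpy sketch of one classical estimator in that family, the Levina-Bickel maximum-likelihood ID estimator:

```python
import numpy as np

def mle_id(X, k=10):
    """Levina-Bickel maximum-likelihood intrinsic dimension estimate.

    For each point, the estimate inverts the mean log-ratio of the
    distance to the k-th nearest neighbor over the distances to the
    nearer neighbors; the global ID averages the per-point estimates.
    """
    # Pairwise Euclidean distances; exclude self-distances.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    d.sort(axis=1)                 # row-wise: nearest neighbors first
    Tk = d[:, k - 1:k]             # distance to the k-th neighbor
    inv = np.mean(np.log(Tk / d[:, :k - 1]), axis=1)
    return float(np.mean(1.0 / inv))

# Example: 2-D data linearly embedded in a 5-D ambient space;
# the estimate should land near 2 despite the 5 observed features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 5))
print(mle_id(X))                   # expected near 2
```

Matching the model's average Hessian rank against an ID estimate of this kind is what lets the selection step favor models whose expressiveness tracks the data rather than a user-chosen parsimony weight.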
Problem

Research questions and friction points this paper is trying to address.

Symbolic Regression
Model Complexity
Generalization Ability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hessian Rank
Model Complexity Estimation
Data Intrinsic Dimension Matching