🤖 AI Summary
This work addresses the lack of theoretical grounding for the generalization capability of genetic programming-based symbolic regression (GP-SR). Drawing on statistical learning theory, it decomposes the generalization error into a structure-selection component and a constant-fitting component, and derives, for the first time, a generalization bound explicitly constrained by expression-tree size, depth, and the number of learnable constants. By modeling symbolic expressions as trees and applying combinatorial complexity analysis, parameter-perturbation sensitivity analysis, and interval arithmetic, the study explains why common practical strategies such as parsimony pressure and depth limits work: structural constraints curb the growth of hypothesis-class complexity, while stability mechanisms reduce the sensitivity of predictions to parameter perturbations. The paper thus establishes the first interpretable theoretical framework for generalization in GP-SR.
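As a rough illustration of the interval-arithmetic idea mentioned in the summary, the sketch below propagates input bounds through a small expression and rejects a division whose denominator interval may contain zero. The `Interval` class and `safe_divide` helper are hypothetical names introduced here for illustration, not the paper's implementation, and the sketch ignores outward rounding of floating-point endpoints.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Closed interval [lo, hi]; a hypothetical helper for illustration."""
    lo: float
    hi: float

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __mul__(self, other):
        # The product range is spanned by the four endpoint products.
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(min(products), max(products))

    def contains_zero(self):
        return self.lo <= 0.0 <= self.hi


def safe_divide(num, den):
    """Return bounds on num / den, or None if den may be zero."""
    if den.contains_zero():
        return None  # output is unbounded on this input range
    reciprocal = Interval(1.0 / den.hi, 1.0 / den.lo)
    return num * reciprocal


# Assumed input range for the variable x (illustrative only).
x = Interval(-1.0, 1.0)
one = Interval(1.0, 1.0)

# 1 / (x + 0.5): the denominator spans [-0.5, 1.5], so it is rejected.
unstable = safe_divide(one, x + Interval(0.5, 0.5))

# 1 / (x + 2): the denominator spans [1, 3], so the output is bounded.
stable = safe_divide(one, x + Interval(2.0, 2.0))
```

A GP system could use a check of this kind to discard candidate expressions whose output cannot be bounded on the training input range, which is one way a stability mechanism limits how strongly predictions react to changes in a constant.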
📝 Abstract
Symbolic regression (SR) with genetic programming (GP) aims to discover interpretable mathematical expressions directly from data. Despite its strong empirical success, the theoretical understanding of why GP-based SR generalizes beyond the training data remains limited. In this work, we provide a learning-theoretic analysis of SR models represented as expression trees. We derive a generalization bound for GP-style SR under constraints on tree size, depth, and learnable constants. Our result decomposes the generalization gap into two interpretable components: a structure-selection term, reflecting the combinatorial complexity of choosing an expression-tree structure, and a constant-fitting term, capturing the complexity of optimizing numerical constants within a fixed structure. This decomposition provides a theoretical perspective on several widely used practices in GP, including parsimony pressure, depth limits, numerically stable operators, and interval arithmetic. In particular, our analysis shows how structural restrictions reduce hypothesis-class growth while stability mechanisms control the sensitivity of predictions to parameter perturbations. By linking these practical design choices to explicit complexity terms in the generalization bound, our work offers a principled explanation for commonly observed empirical behaviors in GP-based SR and contributes towards a more rigorous understanding of its generalization properties.
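To give a feel for the structure-selection term, the following sketch counts how many labeled expression trees exist under a size and depth limit, then plugs that count into the textbook finite-class deviation term sqrt((ln|H| + ln(1/δ)) / (2m)) for losses bounded in [0, 1]. This is not the paper's bound: the primitive-set sizes, the function names, and the use of the finite-class term are assumptions made for illustration, and constants are treated as a single placeholder leaf, so the constant-fitting term is not modeled at all.

```python
import math
from functools import lru_cache

# Hypothetical primitive-set sizes (assumptions for illustration only).
N_LEAVES = 2   # e.g. one input variable and one constant placeholder
N_UNARY = 1    # e.g. sin
N_BINARY = 2   # e.g. + and *


@lru_cache(maxsize=None)
def count_trees(n_nodes, max_depth):
    """Count labeled trees with exactly n_nodes nodes and depth <= max_depth.

    A single leaf has depth 0.
    """
    if n_nodes < 1 or max_depth < 0:
        return 0
    if n_nodes == 1:
        return N_LEAVES
    if max_depth == 0:
        return 0
    # Root is a unary operator with one subtree of n_nodes - 1 nodes.
    total = N_UNARY * count_trees(n_nodes - 1, max_depth - 1)
    # Root is a binary operator; split the remaining nodes left/right.
    for left in range(1, n_nodes - 1):
        right = n_nodes - 1 - left
        total += (N_BINARY
                  * count_trees(left, max_depth - 1)
                  * count_trees(right, max_depth - 1))
    return total


def count_up_to(max_nodes, max_depth):
    """Count trees with at most max_nodes nodes and depth <= max_depth."""
    return sum(count_trees(n, max_depth) for n in range(1, max_nodes + 1))


def finite_class_term(n_hypotheses, n_samples, delta=0.05):
    """Textbook finite-class term sqrt((ln|H| + ln(1/delta)) / (2m))."""
    return math.sqrt(
        (math.log(n_hypotheses) + math.log(1.0 / delta)) / (2.0 * n_samples)
    )


if __name__ == "__main__":
    m = 1000  # assumed number of training samples
    for max_nodes, max_depth in [(7, 3), (15, 6), (31, 10)]:
        size = count_up_to(max_nodes, max_depth)
        term = finite_class_term(size, m)
        print(f"nodes<={max_nodes}, depth<={max_depth}: "
              f"ln|H|={math.log(size):.1f}, term={term:.3f}")
```

Under these assumptions, loosening the size or depth limit increases the number of admissible structures and therefore the complexity term, which is the qualitative effect that parsimony pressure and depth limits are meant to counteract.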