Guiding Multi-Objective Genetic Programming with Description Length Improves Symbolic Regression Solutions

📅 2026-05-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a known weakness of symbolic regression with genetic programming: under noise, the search is prone to overfitting and expression bloat, and struggles to balance accuracy against generalization. The authors evaluate a Fisher-information-based description length (DL) and the fractional Bayes factor (FBF) as model-selection criteria, comparing three strategies: multi-objective search over accuracy and program length followed by DL/FBF post-selection, multi-objective search with DL as a direct objective, and single-objective search with DL/FBF as the fitness. On noisy synthetic benchmarks and real-world regression datasets, DL/FBF post-selection improves test performance over AIC/BIC baselines, and BIC combined with the same function-complexity penalty used by DL/FBF gives similar results. Using DL/FBF directly as a single-objective fitness, by contrast, frequently causes premature convergence to overly simple models.

📝 Abstract

Symbolic regression with genetic programming (GPSR) may suffer from overfitting and structural bloat, especially when noise is present. In this paper, we evaluate description length (DL) and fractional Bayes factor (FBF) criteria as principled, data-efficient alternatives to heuristics for selecting compact expressions that generalise well. We implement DL using a Fisher-information-based parameter encoding and compare it to AIC and BIC across multiple datasets, including noisy synthetic benchmarks and real-world regression problems. We study three search/selection strategies: (i) multi-objective search for accuracy and program length followed by DL/FBF selection; (ii) multi-objective search using DL directly as an objective; and (iii) single-objective optimisation with DL/FBF as the fitness. Across datasets, we find that DL/FBF post-selection improves test performance compared to the AIC/BIC baselines, and that BIC combined with the same function complexity penalty used in DL/FBF produces similar results. In contrast, using DL/FBF directly as a fitness function in single-objective GPSR frequently induces premature convergence to overly simple models. We conclude with practical guidance for using DL/FBF as robust model-selection tools in genetic programming workflows.
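
To make the post-selection idea in strategy (i) concrete, here is a minimal Python sketch of picking one model from a Pareto front with an information criterion. It shows only the AIC/BIC baselines under an assumed Gaussian noise model; the paper's Fisher-information-based DL and the FBF are not reproduced here, since their exact form is not given in this summary. The `Candidate` class, the `post_select` helper, and the example front with its numbers are all hypothetical and purely illustrative.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    """One expression from a (hypothetical) Pareto front."""
    expr: str       # human-readable form of the expression
    rss: float      # residual sum of squares on the training data
    n_params: int   # number of fitted numeric constants


def gaussian_max_loglik(rss: float, n: int) -> float:
    # Maximised log-likelihood under i.i.d. Gaussian noise (an assumption),
    # with the noise variance set to its estimate rss / n.
    sigma2 = rss / n
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)


def aic(c: Candidate, n: int) -> float:
    # AIC = 2k - 2 ln L. The noise variance is not counted in k here;
    # counting it would shift every score by the same constant.
    return 2 * c.n_params - 2 * gaussian_max_loglik(c.rss, n)


def bic(c: Candidate, n: int) -> float:
    # BIC = k ln n - 2 ln L
    return c.n_params * math.log(n) - 2 * gaussian_max_loglik(c.rss, n)


def post_select(front, n, criterion):
    # Post-selection: the search has already produced the front;
    # the criterion only chooses among its members (lower is better).
    return min(front, key=lambda c: criterion(c, n))


# Illustrative front: accuracy improves as the expressions grow.
front = [
    Candidate("a*x", rss=40.0, n_params=1),
    Candidate("a*x + b*sin(x)", rss=10.0, n_params=2),
    Candidate("a*x + b*sin(c*x) + d*x**3 + e", rss=9.5, n_params=5),
]
n = 100  # assumed number of training points

best_bic = post_select(front, n, bic)
best_aic = post_select(front, n, aic)
print(best_bic.expr, "|", best_aic.expr)
```

In this toy front, the largest expression lowers the residual only slightly, so both criteria prefer the two-parameter model. A DL or FBF score could be passed to `post_select` in place of `bic` as another callable with the same signature, which is one way to read the claim that the criterion is applied after the search rather than inside it.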

Problem

Research questions and friction points this paper is trying to address.

symbolic regression

overfitting

structural bloat

genetic programming

model selection

Innovation

Methods, ideas, or system contributions that make the work stand out.

Description Length

Genetic Programming

Symbolic Regression