AI Summary
In symbolic regression, genetic programming (GP) suffers from severe search inefficiency due to neutrality: up to 60% of expression evaluations are redundant. To address this, we propose SymRegg, the first symbolic regression algorithm systematically integrating equality graphs (e-graphs): it compactly represents and incrementally maintains sets of equivalent expressions to eliminate redundant evaluation, and employs lightweight perturbation and selection strategies for efficient exploration of the expression space. SymRegg achieves GP-level modeling accuracy while reducing expression redundancy by up to 60%, significantly improving search efficiency. It exhibits strong expression preservation, minimal hyperparameters (only population size and number of iterations), and robust generalization across diverse datasets. Extensive experiments on multiple benchmarks demonstrate that SymRegg simultaneously attains high predictive accuracy and low computational overhead.
Abstract
In Symbolic Regression (SR), Genetic Programming (GP) is a popular search algorithm that delivers state-of-the-art results in terms of accuracy. Its success relies on the concept of neutrality, which induces large plateaus that the search can safely navigate toward more promising regions. Navigating these plateaus, while necessary, requires computing redundant expressions, which account for up to 60% of the total number of evaluations, as noted in a recent study. The equality graph (e-graph) structure compactly stores and groups equivalent expressions, which lets us check whether a given expression and its variations were already visited by the search and thus avoid unnecessary computation. We propose a new search algorithm for symbolic regression, called SymRegg, that revolves around the e-graph structure and follows simple steps: perturb solutions sampled from a selection of expressions stored in the e-graph and, if the perturbation generates an unvisited expression, insert it into the e-graph and generate its equivalent forms. We show that SymRegg improves the efficiency of the search, maintaining consistently accurate results across different datasets while requiring only a minimal set of hyperparameters.
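The loop described in the abstract (sample, perturb, check whether visited, insert) can be illustrated with a toy sketch. This is not the SymRegg implementation. It assumes a plain dictionary keyed by a canonical form as a stand-in for the e-graph, and the only equivalence it recognizes is operand order in commutative operators, whereas a real e-graph groups expressions under a full set of rewrite rules. All names (`canonical`, `perturb`, `search`, `pool_size`) are hypothetical and chosen for illustration only.

```python
import random

# Expressions are nested tuples: ("x",), ("const", c), or (op, left, right).
COMMUTATIVE = {"+", "*"}
OPS = ["+", "*", "-"]

def canonical(expr):
    """Toy stand-in for e-graph equivalence: order operands of commutative ops."""
    if len(expr) != 3:
        return expr
    op, left, right = expr
    left, right = canonical(left), canonical(right)
    if op in COMMUTATIVE and repr(right) < repr(left):
        left, right = right, left
    return (op, left, right)

def evaluate(expr, x):
    """Evaluate an expression tree at a single input value x."""
    if expr[0] == "x":
        return x
    if expr[0] == "const":
        return expr[1]
    op, left, right = expr
    a, b = evaluate(left, x), evaluate(right, x)
    if op == "+":
        return a + b
    if op == "*":
        return a * b
    return a - b

def mse(expr, data):
    """Mean squared error of expr over (x, y) pairs."""
    return sum((evaluate(expr, x) - y) ** 2 for x, y in data) / len(data)

def random_leaf(rng):
    return ("x",) if rng.random() < 0.5 else ("const", rng.randint(1, 3))

def perturb(expr, rng):
    """Descend into a random child, replace a node with a leaf, or wrap it in a new op."""
    if len(expr) == 3 and rng.random() < 0.6:
        op, left, right = expr
        if rng.random() < 0.5:
            return (op, perturb(left, rng), right)
        return (op, left, perturb(right, rng))
    if rng.random() < 0.5:
        return random_leaf(rng)
    return (rng.choice(OPS), expr, random_leaf(rng))

def search(data, iterations=200, pool_size=5, seed=0):
    """Perturb expressions sampled from the best stored ones; evaluate only unvisited forms."""
    rng = random.Random(seed)
    start = canonical(("x",))
    visited = {start: mse(start, data)}  # canonical form -> fitness
    skipped = 0                          # evaluations avoided by the visited check
    for _ in range(iterations):
        pool = sorted(visited, key=visited.get)[:pool_size]
        candidate = canonical(perturb(rng.choice(pool), rng))
        if candidate in visited:
            skipped += 1
            continue
        visited[candidate] = mse(candidate, data)
    best = min(visited, key=visited.get)
    return best, visited[best], skipped, len(visited)

if __name__ == "__main__":
    data = [(x, x * x + 2) for x in range(-3, 4)]
    best, error, skipped, stored = search(data)
    print(best, error, skipped, stored)
```

The `skipped` counter shows the purpose of the structure: each hit on an already stored form is an evaluation that a search without such a check would repeat. Sampling from the best `pool_size` stored expressions is one simple selection choice assumed here; the abstract does not specify the selection used.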