🤖 AI Summary
This work proposes ROIDS (Robust Outlier-Aware Informed Down-Sampling), a down-sampling method for symbolic regression that addresses a weakness of informed down-sampling (IDS): IDS tends to include outliers in its training subsets, which leads genetic programming to overfit to them. ROIDS excludes potential outliers from the IDS sampling process, preserving the strengths of IDS while improving generalization on data that contains outliers. In the reported experiments, ROIDS outperforms IDS on synthetic problems with added outliers and on more than 80% of the real-world benchmark problems, and it achieves the best average rank among all studied baselines across all tested benchmark problems.
📝 Abstract
Informed down-sampling (IDS) is known to improve performance in symbolic regression when combined with various selection strategies, especially tournament selection. However, recent work found that the gains of IDS are not consistent across all problems. Our analysis reveals that IDS performs worse on problems containing outliers: it systematically favors including outliers in subsets, which pushes genetic programming (GP) towards solutions that overfit to those outliers. To address this, we introduce ROIDS (Robust Outlier-Aware Informed Down-Sampling), which excludes potential outliers from the sampling process of IDS. With ROIDS, it is possible to keep the advantages of IDS without overfitting to outliers and to compete on a wide range of benchmark problems. Our experiments reflect this: ROIDS shows the desired behavior on all studied benchmark problems. It consistently outperforms IDS on synthetic problems with added outliers as well as on a wide range of complex real-world problems, surpassing IDS on over 80% of the real-world benchmark problems. Moreover, compared to all studied baseline approaches, ROIDS achieves the best average rank across all tested benchmark problems. This robust behavior makes ROIDS a reliable down-sampling method for selection in symbolic regression, especially when the data set may contain outliers.
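
To make the idea concrete, the following is a minimal sketch of outlier-aware informed down-sampling, not the authors' implementation. It assumes that a matrix of absolute errors (individuals by training cases) is available, that distances between cases are Euclidean distances between their error columns, and that subsets are built by farthest-first traversal. The outlier criterion used here (a case is flagged when its median error across individuals lies more than `k` scaled median absolute deviations above the median of all cases) is a hypothetical stand-in, since the abstract does not specify how ROIDS detects outliers. All function names and the toy data are illustrative.

```python
import random
from statistics import median


def flag_outlier_cases(scores, k=3.0):
    """Hypothetical filter: flag cases whose score is more than k scaled MADs above the median."""
    med = median(scores)
    mad = median(abs(s - med) for s in scores)
    if mad == 0:
        return set()  # no spread in the scores, so nothing is flagged
    scale = 1.4826 * mad  # MAD scaled to match the standard deviation under normality
    return {j for j, s in enumerate(scores) if (s - med) / scale > k}


def case_distance(errors, a, b):
    """Euclidean distance between the error columns of cases a and b (an assumption)."""
    return sum((row[a] - row[b]) ** 2 for row in errors) ** 0.5


def farthest_first(errors, candidates, size, rng):
    """Build a subset by repeatedly adding the case farthest from the current subset."""
    candidates = list(candidates)
    if not candidates:
        return []
    first = rng.choice(candidates)
    subset = [first]
    # Minimum distance from every remaining candidate to the subset so far.
    min_dist = {c: case_distance(errors, c, first) for c in candidates if c != first}
    while len(subset) < size and min_dist:
        nxt = max(min_dist, key=min_dist.get)
        subset.append(nxt)
        del min_dist[nxt]
        for c in min_dist:
            min_dist[c] = min(min_dist[c], case_distance(errors, c, nxt))
    return subset


def robust_informed_subset(errors, size, k=3.0, seed=0):
    """Drop flagged cases first, then run farthest-first traversal on the rest."""
    n_cases = len(errors[0])
    # Per-case score: median absolute error over all individuals.
    scores = [median(row[j] for row in errors) for j in range(n_cases)]
    outliers = flag_outlier_cases(scores, k)
    candidates = [j for j in range(n_cases) if j not in outliers]
    return farthest_first(errors, candidates, size, random.Random(seed)), outliers


# Toy data: 4 individuals, 6 training cases; case 5 has very large errors for everyone.
errors = [
    [0.1, 0.2, 0.9, 0.4, 0.3, 50.0],
    [0.3, 0.1, 0.2, 0.8, 0.5, 48.0],
    [0.2, 0.7, 0.1, 0.3, 0.6, 52.0],
    [0.9, 0.3, 0.4, 0.1, 0.2, 49.0],
]

plain_subset = farthest_first(errors, range(6), 2, random.Random(0))
robust_subset, flagged = robust_informed_subset(errors, size=3)
```

On this toy data, the unfiltered traversal always picks case 5 within its first two choices because it is far from every other case, which mirrors the abstract's observation that IDS favors outliers. The filtered variant flags case 5 and samples only from the remaining cases.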