🤖 AI Summary
Existing symbolic regression methods commonly use formula length as a proxy for interpretability and neglect the soundness of a formula's mathematical structure, so they yield compact yet analytically opaque expressions. This work introduces the Effective Information Criterion (EIC), the first metric to model symbolic formulas as numerical information-processing systems, quantifying intrinsic structural stability via significant-digit loss and rounding-noise amplification. EIC exposes a gap in numerical robustness between the formulas found by current methods and physically grounded laws, and it can be combined with both search-based and generative algorithms to guide model selection. Experiments show that EIC-driven approaches improve Pareto-frontier performance: reducing structurally unsound expressions by 42%, cutting pretraining sample requirements by a factor of 2 to 4, and achieving 70.2% agreement with domain-expert preferences.
📝 Abstract
Symbolic regression discovers accurate and interpretable formulas to describe given data, thereby providing scientific insights for domain experts and promoting scientific discovery. However, existing symbolic regression methods often use complexity metrics as a proxy for interpretability; these metrics consider only the size of a formula and ignore its internal mathematical structure. As a result, although such methods can discover compact formulas, the discovered formulas often have structures that are difficult to analyze or interpret mathematically. In this work, inspired by the observation that physical formulas are typically numerically stable under limited calculation precision, we propose the Effective Information Criterion (EIC). It treats a formula as an information-processing system with a specific internal structure and identifies unreasonable structure in it through the loss of significant digits or the amplification of rounding noise as data flows through the system. We find that this criterion reveals a gap in structural rationality between the models discovered by existing symbolic regression algorithms and real-world physical formulas. Combining EIC with various search-based symbolic regression algorithms improves their performance on the Pareto frontier and reduces irrational structure in the results. Combining EIC with generative algorithms reduces the number of samples required for pre-training, improving sample efficiency by a factor of 2 to 4. Finally, for formulas with similar accuracy and complexity, EIC shows 70.2% agreement with the interpretability preferences of 108 human experts, indicating that EIC, by measuring unreasonable structure in formulas, reflects their interpretability.
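The idea of detecting unreasonable structure through lost significant digits can be illustrated with a small sketch. This is not the paper's definition of EIC; it is a simplified stand-in that assumes binary32 rounding after every intermediate operation and compares the result against a binary64 reference. The helper names (`r32`, `naive`, `stable`, `lost_digits`) and the example formula are hypothetical, chosen only for illustration.

```python
import math
import struct

EPS32 = 2.0 ** -23  # machine epsilon of IEEE 754 binary32


def r32(v: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", v))[0]


def identity(v: float) -> float:
    """No extra rounding: keep full binary64 precision."""
    return v


def naive(x: float, r=identity) -> float:
    """(1 - cos x) / x^2, with r applied after each intermediate step."""
    c = r(math.cos(x))
    num = r(1.0 - c)  # subtraction of nearly equal values for small x
    den = r(x * x)
    return r(num / den)


def stable(x: float, r=identity) -> float:
    """Algebraically identical form: 2 sin^2(x/2) / x^2."""
    s = r(math.sin(r(x / 2.0)))
    num = r(2.0 * r(s * s))
    den = r(x * x)
    return r(num / den)


def lost_digits(f, x: float) -> float:
    """Decimal digits lost beyond binary32 epsilon (illustrative proxy only)."""
    x = r32(x)                 # both evaluations see the same input
    ref = f(x)                 # binary64 reference
    approx = f(x, r32)         # binary32 rounding at every step
    rel = abs(approx - ref) / abs(ref)
    return math.log10(max(rel, EPS32) / EPS32)


xs = [1e-3 * k for k in range(1, 11)]
naive_loss = sum(lost_digits(naive, x) for x in xs) / len(xs)
stable_loss = sum(lost_digits(stable, x) for x in xs) / len(xs)
print(f"naive:  {naive_loss:.2f} digits lost on average")
print(f"stable: {stable_loss:.2f} digits lost on average")
```

Both functions compute the same mathematical quantity with a similar number of operations, so a size-based complexity metric would rank them about equally. The naive form loses several digits to cancellation for small `x`, while the rewritten form does not, which is the kind of structural difference a stability-based criterion is meant to expose.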