Data-Efficient Symbolic Regression via Foundation Model Distillation

📅 2025-08-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address negative transfer and poor generalization of foundation models in few-shot symbolic regression, this paper proposes EQUATE: a framework that reformulates discrete equation search as a continuous optimization problem in a shared embedding space, enabling lightweight adaptation of foundation models via knowledge distillation. It introduces a symbolic-numeric alignment mechanism to ensure semantic consistency, and designs an evaluator-guided embedding optimization with parsimony regularization that jointly optimizes equation accuracy and simplicity. Evaluated on the Feynman, Strogatz, and black-box datasets, EQUATE significantly outperforms state-of-the-art methods across four key dimensions (accuracy, robustness, model simplicity, and inference speed), improving all four jointly rather than trading them off. Notably, it is the first method to realize high-quality, low-overhead, and interpretable end-to-end equation discovery.

📝 Abstract
Discovering interpretable mathematical equations from observed data (a.k.a. equation discovery or symbolic regression) is a cornerstone of scientific discovery, enabling transparent modeling of physical, biological, and economic systems. While foundation models pre-trained on large-scale equation datasets offer a promising starting point, they often suffer from negative transfer and poor generalization when applied to small, domain-specific datasets. In this paper, we introduce EQUATE (Equation Generation via QUality-Aligned Transfer Embeddings), a data-efficient fine-tuning framework that adapts foundation models for symbolic equation discovery in low-data regimes via distillation. EQUATE combines symbolic-numeric alignment with evaluator-guided embedding optimization, enabling a principled embedding-search-generation paradigm. Our approach reformulates discrete equation search as a continuous optimization task in a shared embedding space, guided by data-equation fitness and simplicity. Experiments across three standard public benchmarks (Feynman, Strogatz, and black-box datasets) demonstrate that EQUATE consistently outperforms state-of-the-art baselines in both accuracy and robustness, while preserving low complexity and fast inference. These results highlight EQUATE as a practical and generalizable solution for data-efficient symbolic regression in foundation model distillation settings.
Problem

Research questions and friction points this paper is trying to address.

Improving symbolic regression foundation model generalization on small datasets
Adapting foundation models for data-efficient equation discovery via distillation
Reformulating discrete equation search as continuous optimization task
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distillation framework for symbolic equation discovery
Symbolic-numeric alignment with evaluator-guided optimization
Continuous optimization in shared embedding space
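To make the embedding-search-generation idea above concrete, here is a minimal toy sketch (not the paper's actual implementation; all names, the template vocabulary, and the search procedure are illustrative assumptions): a continuous vector z is decoded into a candidate equation, an evaluator scores data fit plus a parsimony penalty, and the search runs over z instead of over discrete expression trees.

```python
import math
import random

# Toy decoder vocabulary: (expression name, callable, structural complexity).
# In EQUATE the decoder is a foundation model; this list is a stand-in.
TEMPLATES = [
    ("a*x",      lambda a, x: a * x,           1),
    ("a*x**2",   lambda a, x: a * x * x,       2),
    ("a*sin(x)", lambda a, x: a * math.sin(x), 2),
]

def decode(z):
    """Map a continuous embedding z to a (template, coefficient) pair."""
    idx = int(abs(z[0]) * len(TEMPLATES)) % len(TEMPLATES)
    return TEMPLATES[idx], z[1]

def evaluator(z, xs, ys, lam=0.01):
    """Score = MSE on the data + lambda * complexity (parsimony penalty)."""
    (_, fn, complexity), a = decode(z)
    mse = sum((fn(a, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
    return mse + lam * complexity

def search(xs, ys, samples=2000, refine=1000, seed=0):
    """Random sampling in embedding space, then local hill-climb refinement.
    Both phases are evaluator-guided; no gradient through the decoder."""
    rng = random.Random(seed)
    z = [rng.uniform(0, 1), rng.uniform(-4, 4)]
    best = evaluator(z, xs, ys)
    for _ in range(samples):                      # global exploration
        cand = [rng.uniform(0, 1), rng.uniform(-4, 4)]
        score = evaluator(cand, xs, ys)
        if score < best:
            z, best = cand, score
    for _ in range(refine):                       # local refinement of z
        cand = [zi + rng.gauss(0, 0.05) for zi in z]
        score = evaluator(cand, xs, ys)
        if score < best:
            z, best = cand, score
    (name, _, _), a = decode(z)
    return name, a, best

# Low-data regime: recover y = 3*x^2 from eight points.
xs = [0.5 * i for i in range(1, 9)]
ys = [3.0 * x * x for x in xs]
print(search(xs, ys))
```

The key design point mirrored here is that equation structure and coefficients live in one continuous vector, so a single evaluator signal (fit plus simplicity) guides the whole search.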