🤖 AI Summary
Symbolic regression aims to discover accurate, concise, and interpretable mathematical equations from observational data, yet it faces challenges stemming from an exponentially large search space and the inherent trade-off between accuracy and model complexity. This paper proposes DiffuSR, the first framework to introduce continuous-state diffusion language models to symbolic regression: it maps discrete symbols into a continuous latent space via a trainable embedding layer and employs numerical data to guide iterative denoising for equation generation. We design an inference strategy incorporating logit-based priors and introduce a cross-attention mechanism to explicitly model symbol–numerical correlations. Compared to dominant autoregressive approaches, DiffuSR achieves comparable performance on standard benchmarks while significantly improving the interpretability, semantic plausibility, and structural diversity of generated equations.
📝 Abstract
Discovering valid and meaningful mathematical equations from observed data plays a crucial role in scientific discovery. While this task, symbolic regression, remains challenging due to the vast search space and the trade-off between accuracy and complexity. In this paper, we introduce DiffuSR, a pre-training framework for symbolic regression built upon a continuous-state diffusion language model. DiffuSR employs a trainable embedding layer within the diffusion process to map discrete mathematical symbols into a continuous latent space, modeling equation distributions effectively. Through iterative denoising, DiffuSR converts an initial noisy sequence into a symbolic equation, guided by numerical data injected via a cross-attention mechanism. We also design an effective inference strategy to enhance the accuracy of the diffusion-based equation generator, which injects logit priors into genetic programming. Experimental results on standard symbolic regression benchmarks demonstrate that DiffuSR achieves competitive performance with state-of-the-art autoregressive methods and generates more interpretable and diverse mathematical expressions.