Discovering Mathematical Equations with Diffusion Language Model

📅 2025-09-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Symbolic regression aims to discover accurate, concise, and interpretable mathematical equations from observational data, yet it faces challenges stemming from an exponentially large search space and the inherent trade-off between accuracy and model complexity. This paper proposes DiffuSR, the first framework to introduce continuous-state diffusion language models to symbolic regression: it maps discrete symbols into a continuous latent space via a trainable embedding layer and employs numerical data to guide iterative denoising for equation generation. We design an inference strategy incorporating logit-based priors and introduce a cross-attention mechanism to explicitly model symbol–numerical correlations. Compared to dominant autoregressive approaches, DiffuSR achieves comparable performance on standard benchmarks while significantly improving the interpretability, semantic plausibility, and structural diversity of generated equations.

📝 Abstract
Discovering valid and meaningful mathematical equations from observed data plays a crucial role in scientific discovery. However, this task, known as symbolic regression, remains challenging due to the vast search space and the trade-off between accuracy and complexity. In this paper, we introduce DiffuSR, a pre-training framework for symbolic regression built upon a continuous-state diffusion language model. DiffuSR employs a trainable embedding layer within the diffusion process to map discrete mathematical symbols into a continuous latent space, modeling equation distributions effectively. Through iterative denoising, DiffuSR converts an initial noisy sequence into a symbolic equation, guided by numerical data injected via a cross-attention mechanism. We also design an inference strategy that injects logit priors into genetic programming to enhance the accuracy of the diffusion-based equation generator. Experimental results on standard symbolic regression benchmarks demonstrate that DiffuSR achieves performance competitive with state-of-the-art autoregressive methods while generating more interpretable and diverse mathematical expressions.
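
The embedding and noising idea in the abstract can be illustrated with a small sketch. This is not DiffuSR's implementation: the vocabulary, the embedding size, the fixed random embedding table (standing in for the trainable layer), the standard Gaussian forward process, and the nearest-neighbour rounding step are all assumptions chosen for illustration, and the helper names `embed`, `forward_noise`, and `round_to_tokens` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary of equation symbols (prefix notation).
VOCAB = ["add", "mul", "sin", "x", "c"]
EMBED_DIM = 8

# Stand-in for a trainable embedding layer: a fixed random lookup table.
embedding = rng.normal(size=(len(VOCAB), EMBED_DIM))


def embed(tokens):
    """Map discrete symbols to continuous latent vectors."""
    ids = [VOCAB.index(t) for t in tokens]
    return embedding[ids]


def forward_noise(z0, alpha_bar, rng):
    """Gaussian forward process (assumed form):
    z_t = sqrt(alpha_bar) * z_0 + sqrt(1 - alpha_bar) * eps."""
    eps = rng.normal(size=z0.shape)
    return np.sqrt(alpha_bar) * z0 + np.sqrt(1.0 - alpha_bar) * eps


def round_to_tokens(z):
    """Map each latent vector back to the nearest symbol embedding."""
    dists = ((z[:, None, :] - embedding[None, :, :]) ** 2).sum(axis=-1)
    return [VOCAB[i] for i in dists.argmin(axis=1)]


tokens = ["add", "sin", "x", "c"]          # prefix form of sin(x) + c
z0 = embed(tokens)                          # clean latent sequence
z_light = forward_noise(z0, alpha_bar=0.9999, rng=rng)  # almost clean
z_heavy = forward_noise(z0, alpha_bar=0.01, rng=rng)    # almost pure noise

print(round_to_tokens(z_light))
print(round_to_tokens(z_heavy))
```

A lightly noised sequence still rounds back to the original symbols, while a heavily noised one generally does not. A trained denoiser would be responsible for walking from the heavily noised end back toward a valid equation, step by step.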
Problem

Research questions and friction points this paper is trying to address.

Discovering valid mathematical equations from observed data
Addressing symbolic regression challenges with a diffusion language model
Generating interpretable and diverse mathematical expressions efficiently
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion language model for symbolic regression
Continuous latent space mapping via embedding layer
Cross-attention mechanism with numerical data injection
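
As a rough illustration of the cross-attention item above, the sketch below lets a sequence of symbol latents attend over encoded data points. It assumes a generic single-head scaled dot-product cross-attention; the dimensions, the random projection matrices, the linear point encoder `W_in`, and the function name `cross_attention` are hypothetical and are not taken from the paper.

```python
import numpy as np


def softmax(a, axis=-1):
    """Numerically stable softmax."""
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(latents, data_feats, Wq, Wk, Wv):
    """Single-head scaled dot-product cross-attention.

    Queries come from the symbol latents; keys and values come from
    the encoded numerical data points.
    """
    Q = latents @ Wq
    K = data_feats @ Wk
    V = data_feats @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = softmax(scores, axis=-1)   # one distribution per symbol position
    return weights @ V, weights


rng = np.random.default_rng(0)
seq_len, n_points, d_model = 4, 16, 8

# Noisy symbol latents for a 4-token equation (random placeholders).
latents = rng.normal(size=(seq_len, d_model))

# Observed (x, y) pairs sampled from y = sin(x) + 1, encoded by a
# hypothetical linear layer into d_model features per point.
x = np.linspace(-2.0, 2.0, n_points)
y = np.sin(x) + 1.0
W_in = rng.normal(size=(2, d_model))
data_feats = np.stack([x, y], axis=1) @ W_in

Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out, weights = cross_attention(latents, data_feats, Wq, Wk, Wv)

print(out.shape, weights.shape)
```

Each row of `weights` is a distribution over the data points, so every symbol position receives a data-dependent summary that a denoiser could condition on.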
Xiaoxu Han
South China University of Technology
Chengzhen Ning
South China University of Technology
Jinghui Zhong
Professor, South China University of Technology
Artificial Intelligence, Genetic programming, AI for Science
Fubiao Yang
South China University of Technology
Yu Wang
Pengcheng Laboratory
Xin Mu
Pengcheng Laboratory
Data mining, Machine learning