🤖 AI Summary
This work addresses the amortization gap in neural symbolic regression: a single-pass encoder prediction can fall short of the true posterior over expressions. The authors propose the Latent Equation Embedding (LEE) framework, which combines iterative amortized inference with a differentiable evaluation decoder. LEE learns a shared latent space grounded in functional behavior, into which symbolic expressions and observational data are jointly embedded. Inference alternates discrete re-encoding of decoded expressions with continuous gradient descent in the latent space, giving a hybrid discrete–continuous optimization. The encoder itself acts as a learned optimizer, refining the latent estimate at each re-encoding step. On SRBench across three noise levels, LEE produces expressions of complexity 8–11, versus 20–90 for the strongest accuracy-oriented baselines (2–10× simpler), and degrades gracefully as noise increases.
📝 Abstract
Symbolic regression (SR) seeks closed-form mathematical expressions that fit observed data. Neural SR methods amortize the search by training an encoder to map observations directly to expressions in a single pass, but this amortized inference leaves a residual amortization gap between its one-shot prediction and the true posterior. We propose Latent Equation Embedding (LEE), a framework that closes this gap through iterative amortized inference in a functionally grounded latent space. LEE learns a shared latent space Z equipped with three components: an encoder f_theta that jointly embeds symbolic tokens and numerical observations into a single latent vector z; an expression decoder g_expr that reconstructs formulas from z; and an evaluation decoder g_eval that predicts function values from z, explicitly grounding the latent space in functional behavior. At inference, LEE performs iterative refinement by re-encoding decoded expressions jointly with observations, progressively improving the latent estimate. LEE uses the encoder itself as a learned inference optimizer: each re-encoding step implicitly computes the mismatch between the candidate and the data. Because g_eval is differentiable in z, we additionally interleave continuous gradient descent with discrete re-encoding, yielding a hybrid iterative and gradient refinement procedure. On SRBench across three noise levels, against 19 baselines spanning genetic programming, symbolic-neural hybrids, and pre-trained Transformers, LEE produces expressions 2–10× simpler than the strongest accuracy-oriented baselines, including Operon, GP-GOMEA, TPSR, RAG-SR, and GenSR, with complexity 8–11 versus 20–90. These results advance the low-complexity region of the accuracy-complexity Pareto frontier and show graceful degradation as noise increases.
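The alternation between continuous gradient descent through g_eval and discrete re-encoding can be illustrated with a toy sketch. This is not the authors' implementation: the latent vector is assumed to hold just two coefficients of a linear expression, the "decoders" are hand-written functions rather than trained networks, and the re-encoder ignores the observations, whereas LEE's encoder f_theta re-encodes the expression jointly with the data. The names `g_eval`, `g_expr`, and `encode` only mirror the abstract's notation; `grad_mse` and `refine` are hypothetical helpers.

```python
import numpy as np

def g_eval(z, x):
    """Toy evaluation decoder: predicts function values from z = (a, b)."""
    return z[0] * x + z[1]

def g_expr(z):
    """Toy expression decoder: snaps z to a discrete expression,
    here a pair of integer coefficients (a, b) meaning a*x + b."""
    return (int(round(z[0])), int(round(z[1])))

def encode(expr):
    """Toy re-encoder (assumption): maps an expression back to a latent
    vector. The real encoder would also condition on the observations."""
    return np.array(expr, dtype=float)

def grad_mse(z, x, y):
    """Gradient of the mean squared error of g_eval(z, x) against y,
    taken with respect to z."""
    r = g_eval(z, x) - y
    return np.array([2.0 * np.mean(r * x), 2.0 * np.mean(r)])

def refine(z0, x, y, rounds=5, grad_steps=50, lr=0.1):
    """Alternate continuous gradient steps with discrete re-encoding."""
    z = np.array(z0, dtype=float)
    for _ in range(rounds):
        for _ in range(grad_steps):        # continuous phase
            z = z - lr * grad_mse(z, x, y)
        z = encode(g_expr(z))              # discrete phase
    return g_expr(z)

x = np.linspace(-1.0, 1.0, 21)
y = 3.0 * x - 2.0                          # noiseless observations
expr = refine([0.0, 0.0], x, y)
print(expr)
```

In this toy the discrete phase is a simple rounding projection, so the loop reduces to projected gradient descent. The abstract describes something stronger, where the re-encoding step is itself a learned update that implicitly measures the mismatch between the candidate and the data.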