🤖 AI Summary
This work addresses the inefficiency and unreliability of large language models (LLMs) on program synthesis tasks that require extensive combinatorial search. The authors use coding agents to compile a small set of LLM reasoning traces into reusable symbolic program synthesizers that operate over constrained domain-specific languages (DSLs) and need no LLM calls at test time. The resulting solvers work as standalone systems, can be combined with LLM search in a neuro-symbolic hybrid, and mostly transfer zero-shot to a new domain. On PBEBench-Hard, symbolic solver ensembles reach 84.7% accuracy, 16.3 percentage points above test-time-scaled LLMs; combining them with an LLM raises accuracy to 85.8% while reducing reported token usage by 78%. On a real historical linguistics task (predicting sound changes), the solvers reach 80.1% zero-shot accuracy under ensembling.
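To make "a symbolic program synthesizer over a constrained DSL" concrete, here is a minimal sketch of an enumerative programming-by-example search. It is an illustration only, not the authors' induced solver: the toy DSL (an ordered list of single-character rewrite rules), the names `run` and `synthesize`, and the example data are all assumptions introduced here.

```python
from itertools import product

# Hypothetical toy DSL (an assumption for illustration): a program is an
# ordered list of (old, new) rewrite rules applied one after another.
def run(program, text):
    for old, new in program:
        text = text.replace(old, new)
    return text

def synthesize(examples, alphabet, max_rules=2):
    """Return the shortest rule list consistent with every example, or None.

    Plain enumeration: try all programs with 1 rule, then 2 rules, and so on.
    No LLM is involved; the search is purely symbolic.
    """
    rules = [(a, b) for a in alphabet for b in alphabet if a != b]
    for k in range(1, max_rules + 1):
        for program in product(rules, repeat=k):
            if all(run(program, x) == y for x, y in examples):
                return list(program)
    return None

# Two input-output examples that look like a voicing change (p -> b, t -> d).
examples = [("pata", "bada"), ("tap", "dab")]
program = synthesize(examples, alphabet="abdpt")
print(program)
```

The search space grows exponentially with the number of rules, which is the combinatorial cost the summary refers to; a practical solver would need pruning or stronger heuristics than this brute-force loop.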
📝 Abstract
LLMs can solve program synthesis tasks but remain inefficient and unreliable on hard instances requiring large combinatorial search. Given a small set of reasoning traces, we use coding agents to compile them into reusable symbolic program synthesizers over constrained DSLs. The resulting solvers require no LLM calls at test time and are strong standalone systems: symbolic solver ensembles reach 91.3% accuracy on PBEBench-Lite and 84.7% on PBEBench-Hard, outperforming test-time-scaled LLMs on the latter by 16.3 percentage points at zero LLM inference cost. They also complement LLM search, improving PBEBench-Hard accuracy from 68.4% to 85.8% while reducing reported token usage by 78%, and raising SLR-Bench hard-tier accuracy from 34.4% to 58.0% in a neuro-symbolic hybrid setting. Compared to directly using coding agents as per-instance solvers, induced solvers are substantially more Pareto-efficient, amortizing a small one-time construction cost over many zero-token executions. Finally, most solvers transfer zero-shot to a real historical linguistics task (predicting sound changes in natural language data), reaching 80.1% accuracy under ensembling and recovering some plausible linguistic rules. Together, these results show that reasoning traces can be compiled into reusable symbolic solvers that solve many tasks directly, complement LLM inference on hard cases, and provide a scalable route to domain-general solver induction. We release code and data for reproducibility.
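The abstract does not say how the hybrid setting combines symbolic solvers with LLM search. One plausible arrangement, sketched below purely as an assumption, is to try each zero-token solver first, verify its candidate against the examples, and call the LLM only when every solver fails. The names `solve_with_fallback`, `make_constant_solver`, and `stub_llm`, as well as the 1000-token cost, are hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple

Example = Tuple[str, str]
Program = Callable[[str], str]
Solver = Callable[[Sequence[Example]], Optional[Program]]

def consistent(program: Program, examples: Sequence[Example]) -> bool:
    """A candidate is accepted only if it reproduces every example."""
    return all(program(x) == y for x, y in examples)

def solve_with_fallback(examples, solvers, llm_fallback):
    """Return (program, tokens_used); symbolic solvers cost zero tokens."""
    for solver in solvers:
        program = solver(examples)
        if program is not None and consistent(program, examples):
            return program, 0
    return llm_fallback(examples)

def make_constant_solver(candidate: Program) -> Solver:
    # Toy stand-in for an induced solver: it proposes one fixed program.
    return lambda examples: candidate if consistent(candidate, examples) else None

def stub_llm(examples):
    # Placeholder for real LLM search; the token count is made up.
    return None, 1000

solvers = [make_constant_solver(str.upper), make_constant_solver(lambda s: s[::-1])]
tasks = [[("ab", "AB")], [("ab", "ba")], [("ab", "zz")]]
total_tokens = sum(solve_with_fallback(t, solvers, stub_llm)[1] for t in tasks)
print(total_tokens)
```

In this toy run only the third task reaches the LLM stub, so tokens are spent on one task out of three. The same accounting illustrates the amortization argument: the solvers are built once, and every task they solve afterwards costs no tokens.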