Beyond Inference-Time Search: Reinforcement Learning Synthesizes Reusable Solvers

📅 2026-05-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the difficulty large language models have in generating reusable, constraint-aware solvers for combinatorial optimization. The authors propose a reinforcement learning approach that shifts part of the inference-time reasoning cost into the weights of a code LLM, so that it synthesizes a reusable heuristic solver for the whole Synergistic Dependency Selection (SDS) problem family rather than solving each instance separately. Starting from Qwen2.5-Coder-14B-Instruct, the method fine-tunes with Group Relative Policy Optimization (GRPO), a feasibility-gated reward, and light structural scaffolding. The resulting solvers reach a 5.0% gap to the global Virtual Best Solver (VBS), compared with roughly 28.7% for Best-of-64 base-model sampling, and 99.8% of feasible SDS outputs follow a constraint-aware Simulated Annealing template. Post-generation execution/search cost is 91× lower than cumulative Best-of-64 evaluation, and a single frozen solver per seed remains highly competitive when reused unchanged across the SDS test set.
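
The feasibility-gated reward and the group-relative step in GRPO can be illustrated with a minimal sketch. This is not the authors' implementation: the hard gate (zero reward for infeasible output), the normalization by a reference value, and the names `gated_reward` and `group_relative_advantages` are all assumptions made for illustration. The paper reports that results are sensitive to reward normalization, so the scaling here should be read as one plausible choice only.

```python
from statistics import mean, pstdev


def gated_reward(feasible: bool, objective: float, reference: float) -> float:
    """Hypothetical hard feasibility gate: infeasible solver output earns 0.

    `reference` is an assumed positive normalizer (for example a best-known
    objective value); the feasible reward is clipped to [0, 1].
    """
    if not feasible or reference <= 0:
        return 0.0
    return max(0.0, min(1.0, objective / reference))


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standardize rewards within one sampled group, as in GRPO.

    Each completion is scored relative to the other completions sampled for
    the same prompt, so no learned value function is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled solvers for one prompt: two infeasible, two feasible.
rewards = [
    gated_reward(False, 90.0, 100.0),
    gated_reward(False, 120.0, 100.0),
    gated_reward(True, 50.0, 100.0),
    gated_reward(True, 100.0, 100.0),
]
advantages = group_relative_advantages(rewards)
```

Under a hard gate, an infeasible solver with a high raw objective still receives a below-average advantage whenever the group contains feasible solvers, which is the behavior a soft penalty would not guarantee.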

📝 Abstract

Large language models (LLMs) typically approach combinatorial optimization as an inference-time procedure, solving each instance separately through sampling, search, or repeated prompting. We ask whether reinforcement learning can instead shift part of this reasoning cost into the weights of a code LLM, so that the model synthesizes a reusable solver for an entire problem family. We study this question on Synergistic Dependency Selection (SDS), a controlled variant of constrained Quadratic Knapsack designed to expose a specific failure mode: local signals and strict feasibility constraints make greedy heuristics attractive but unreliable. Under identical scaffolding, Best-of-64 base-model sampling saturates at an approximately 28.7% gap to the global Virtual Best Solver (VBS); code audits show that the base model often retrieves Simulated Annealing templates but misimplements the Metropolis acceptance rule. We fine-tune Qwen2.5-Coder-14B-Instruct with Group Relative Policy Optimization (GRPO) using a feasibility-gated reward and light structural scaffolding. The resulting policy converges to a constraint-aware Simulated Annealing template in 99.8% of feasible SDS outputs, achieves a 5.0% gap to that VBS, and is 91 times cheaper in post-generation execution/search cost than cumulative Best-of-64 evaluation. A compile-once check shows that one best frozen solver per seed remains highly competitive when reused unchanged across the SDS test set, while an additional-domain evaluation on Job Shop Scheduling provides narrower but positive evidence that the scaffold transfers beyond SDS. Negative ablations reveal the limits of this recipe: standard stabilizers degrade performance, a soft feasibility gate fails, and results remain sensitive to reward normalization and domain-specific design choices.
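
For reference, the Metropolis acceptance rule that the audited base-model code often gets wrong can be sketched as follows, together with the relative gap used to compare against the VBS. This is the textbook rule for a maximization objective, not code taken from the paper; the function names and the gap formula `(vbs - value) / vbs` are assumptions for illustration.

```python
import math
import random


def metropolis_accept(delta: float, temperature: float, rng=random) -> bool:
    """Textbook Metropolis rule for maximization.

    delta = candidate_value - current_value. Improving (or equal) moves are
    always accepted; worsening moves are accepted with probability
    exp(delta / temperature), which shrinks as the temperature cools.
    """
    if delta >= 0:
        return True
    if temperature <= 0:
        return False  # fully cooled: behave greedily
    return rng.random() < math.exp(delta / temperature)


def gap_to_vbs(value: float, vbs_value: float) -> float:
    """Assumed relative gap to the Virtual Best Solver (lower is better)."""
    return (vbs_value - value) / vbs_value


# Empirical check: a move that worsens the objective by 1 at temperature 1
# should be accepted roughly exp(-1), about 37% of the time.
rng = random.Random(0)
trials = 10_000
accepted = sum(metropolis_accept(-1.0, 1.0, rng) for _ in range(trials))
acceptance_rate = accepted / trials
```

In a constraint-aware annealer, a candidate move would typically be checked for feasibility before this rule is applied, so that infeasible neighbors are rejected regardless of `delta`.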

Problem

Research questions and friction points this paper is trying to address.

combinatorial optimization

reusable solver

large language models

reinforcement learning

inference-time search

Innovation

Methods, ideas, or system contributions that make the work stand out.

reinforcement learning

reusable solver

combinatorial optimization