AI Summary
This work addresses the difficulty large language models have in generating reusable, constraint-aware solvers for combinatorial optimization. The authors propose a reinforcement learning approach that shifts part of the inference-time reasoning cost into the parameters of a code large language model, so that it produces structurally correct, instance-agnostic heuristic solvers for the Synergistic Dependency Selection (SDS) problem family. Starting from Qwen2.5-Coder-14B-Instruct, the method fine-tunes with Group Relative Policy Optimization (GRPO), a feasibility-gated reward, and light structural scaffolding. In experiments, the generated solvers reach a 5.0% gap to the global Virtual Best Solver (VBS), and 99.8% of feasible SDS outputs follow a constraint-aware Simulated Annealing template. Their post-generation execution/search cost is 91× lower than cumulative Best-of-64 evaluation, and one best frozen solver per seed remains highly competitive when reused unchanged across the SDS test set.
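As a rough illustration of the two training ingredients named above, the sketch below pairs a hard feasibility gate with the group-relative advantage normalization commonly used in GRPO. It is not the authors' implementation: the function names, the `objective / reference` scoring, and the `eps` constant are assumptions made for illustration only.

```python
import statistics


def gated_reward(feasible, objective, reference):
    """Hypothetical hard feasibility gate (the scoring rule is an assumption).

    Infeasible solver outputs earn exactly 0; feasible ones earn the achieved
    objective as a fraction of a reference value.
    """
    if not feasible or reference <= 0:
        return 0.0
    return objective / reference


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group.

    Each sample's advantage is its reward minus the group mean, divided by
    the group's population standard deviation (plus eps to avoid dividing
    by zero when all rewards are equal).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# One group of three sampled solvers for the same prompt:
# one infeasible, two feasible with different objective values.
rewards = [
    gated_reward(False, 80.0, 100.0),
    gated_reward(True, 50.0, 100.0),
    gated_reward(True, 100.0, 100.0),
]
advantages = group_relative_advantages(rewards)
```

With a hard gate, an infeasible sample can never outscore a feasible one inside its group, so its advantage is pushed below the group mean regardless of how good its raw objective looks.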
Abstract
Large language models (LLMs) typically approach combinatorial optimization as an inference-time procedure, solving each instance separately through sampling, search, or repeated prompting. We ask whether reinforcement learning can instead shift part of this reasoning cost into the weights of a code LLM, so that the model synthesizes a reusable solver for an entire problem family. We study this question on Synergistic Dependency Selection (SDS), a controlled variant of constrained Quadratic Knapsack designed to expose a specific failure mode: local signals and strict feasibility constraints make greedy heuristics attractive but unreliable. Under identical scaffolding, Best-of-64 base-model sampling saturates at an approximately 28.7% gap to the global Virtual Best Solver (VBS); code audits show that the base model often retrieves Simulated Annealing templates but misimplements the Metropolis acceptance rule. We fine-tune Qwen2.5-Coder-14B-Instruct with Group Relative Policy Optimization (GRPO) using a feasibility-gated reward and light structural scaffolding. The resulting policy converges to a constraint-aware Simulated Annealing template in 99.8% of feasible SDS outputs, achieves a 5.0% gap to that VBS, and is 91 times cheaper in post-generation execution/search cost than cumulative Best-of-64 evaluation. A compile-once check shows that one best frozen solver per seed remains highly competitive when reused unchanged across the SDS test set, while an additional-domain evaluation on Job Shop Scheduling provides narrower but positive evidence that the scaffold transfers beyond SDS. Negative ablations reveal the limits of this recipe: standard stabilizers degrade performance, a soft feasibility gate fails, and results remain sensitive to reward normalization and domain-specific design choices.
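The abstract notes that the base model often retrieves Simulated Annealing templates but misimplements the Metropolis acceptance rule. For reference, a minimal sketch of the textbook rule for a maximization objective follows. The function name, the `feasible` flag, and the choice to reject infeasible candidates outright are illustrative assumptions, not the solver code generated in the paper.

```python
import math
import random


def accept_move(delta, temperature, feasible=True, rng=random):
    """Metropolis acceptance for a maximization problem (illustrative sketch).

    delta: candidate objective minus current objective.
    temperature: current annealing temperature.
    feasible: whether the candidate satisfies all constraints (assumed to be
        checked by the caller; rejecting infeasible moves is an assumption).
    rng: any object with a random() method returning a float in [0, 1).
    """
    if not feasible:
        return False  # never step into an infeasible state
    if delta >= 0:
        return True  # improving or equal moves are always accepted
    if temperature <= 0:
        return False  # at zero temperature the rule is purely greedy
    # Worsening moves are accepted with probability exp(delta / T), delta < 0.
    return rng.random() < math.exp(delta / temperature)
```

Typical mistakes in this rule include flipping the sign of `delta` for a maximization objective or comparing the random draw in the wrong direction, either of which makes worsening moves more likely to be accepted as they get worse.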