🤖 AI Summary
This work addresses a gap in how large language models are trained and evaluated on NP-hard optimization problems: existing benchmarks check only whether a solution is correct and overlook its quality, that is, how close it comes to the optimum under constraints. The authors propose OPT-BENCH, which they describe as the first comprehensive framework for training and evaluating LLMs on NP-hard problems through quality-aware Reinforcement Learning with Verifiable Rewards (RLVR). It combines continuous quality-based reward signals, scalable instance generators, quality verifiers, and optimal baselines across 10 tasks. Training Qwen2.5-7B-Instruct-1M with 15K examples yields a 93.1% Success Rate and a 46.6% Quality Ratio on OPT-BENCH, compared with 29.6% and 14.6% for GPT-4o, and transfers to mathematics (+2.2%), logic (+1.2%), knowledge (+4.1%), and instruction following (+6.1%). The analysis further finds that quality-aware rewards improve solutions by 28.8% over binary correctness rewards, and that task diversity contributes more to generalization than data volume.
📝 Abstract
Large Language Models (LLMs) have achieved remarkable success on reasoning benchmarks through Reinforcement Learning with Verifiable Rewards (RLVR), excelling at tasks such as math, coding, logic, and puzzles. However, existing benchmarks evaluate only correctness and overlook optimality, namely the ability to find the best solutions under constraints. We propose OPT-BENCH, the first comprehensive framework for training and evaluating LLMs on NP-hard optimization problems through quality-aware RLVR. OPT-BENCH provides three key components: a scalable training infrastructure with instance generators, quality verifiers, and optimal baselines across 10 tasks; a rigorous benchmark with 1,000 instances evaluating both feasibility, measured by Success Rate (SR), and quality, measured by Quality Ratio (QR); and quality-aware rewards that enable continuous improvement beyond binary correctness. Training Qwen2.5-7B-Instruct-1M with 15K examples achieves 93.1% SR and 46.6% QR, significantly outperforming GPT-4o, which achieves 29.6% SR and 14.6% QR. Beyond optimization, training on OPT-BENCH transfers to diverse tasks, including mathematics (+2.2%), logic (+1.2%), knowledge (+4.1%), and instruction following (+6.1%). Our analysis reveals that quality-aware rewards improve solutions by 28.8% over binary rewards, and that task diversity drives generalization more than data quantity, offering insights into RLVR scaling for complex reasoning.
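To make the difference between a binary correctness reward and a continuous quality-based reward concrete, here is a minimal Python sketch. The abstract does not give the exact formulas, so everything here is an assumption for illustration: the `Result` type, the `quality` function (ratio of the reference optimum to the achieved objective, clamped to [0, 1], with positive objective values assumed), and the two aggregate functions are hypothetical and may differ from the paper's actual definitions of Success Rate and Quality Ratio.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Result:
    """One model answer to one optimization instance (hypothetical type)."""
    feasible: bool           # True if the answer satisfies all constraints
    value: Optional[float]   # objective value of the answer (None if infeasible)
    optimal: float           # objective value of the reference optimum
    minimize: bool = True    # True for minimization tasks, False for maximization


def binary_reward(r: Result) -> float:
    """Correctness-only reward: 1 for any feasible answer, otherwise 0."""
    return 1.0 if r.feasible else 0.0


def quality(r: Result) -> float:
    """Assumed quality score in [0, 1]; infeasible answers score 0."""
    if not r.feasible or r.value is None:
        return 0.0
    if r.value <= 0 or r.optimal <= 0:
        raise ValueError("this sketch assumes strictly positive objective values")
    ratio = r.optimal / r.value if r.minimize else r.value / r.optimal
    return max(0.0, min(1.0, ratio))


def success_rate(results: List[Result]) -> float:
    """Fraction of instances with a feasible answer."""
    return sum(binary_reward(r) for r in results) / len(results)


def mean_quality(results: List[Result]) -> float:
    """Average assumed quality score over all instances."""
    return sum(quality(r) for r in results) / len(results)


results = [
    Result(True, 10.0, 10.0),                 # feasible and optimal
    Result(True, 20.0, 10.0),                 # feasible, twice the optimal cost
    Result(False, None, 10.0),                # violates a constraint
    Result(True, 5.0, 10.0, minimize=False),  # maximization, half the optimum
]
print(success_rate(results), mean_quality(results))
```

Under these assumptions, the second and fourth answers earn the same binary reward as the optimal first answer but only half its quality score, which is the kind of signal a continuous reward can provide beyond feasibility alone.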