AI Summary
Large language models (LLMs) exhibit weak reasoning capabilities on NP-hard optimization problems. Method: This paper introduces NP-ENGINE, the first end-to-end training and evaluation framework specifically designed for NP-hard problems, accompanied by NP-BENCH, a benchmark comprising 10 tasks across 5 domains. The approach combines reinforcement learning with verifiable rewards (RLVR), curriculum learning, rule-based verifiers, and heuristic solvers to construct a controllable and verifiable synthetic data pipeline. Results: The Qwen2.5-7B-NP model trained on NP-ENGINE significantly outperforms GPT-4o on NP-BENCH and achieves state-of-the-art performance among models of the same size. The paper also reports a scaling trend in which greater task diversity improves out-of-domain generalization, with gains on logic, puzzles, math, knowledge, and instruction-following tasks.
Abstract
Large Language Models (LLMs) have shown strong reasoning capabilities, with models like OpenAI's o-series and DeepSeek R1 excelling at tasks such as mathematics, coding, logic, and puzzles through Reinforcement Learning with Verifiable Rewards (RLVR). However, their ability to solve more complex optimization problems, particularly NP-hard tasks, remains underexplored. To bridge this gap, we propose NP-ENGINE, the first comprehensive framework for training and evaluating LLMs on NP-hard problems. NP-ENGINE covers 10 tasks across five domains, each equipped with (i) a controllable instance generator, (ii) a rule-based verifier, and (iii) a heuristic solver that provides approximately optimal solutions as ground truth. This generator-verifier-heuristic pipeline enables scalable and verifiable RLVR training under hierarchical difficulties. We also introduce NP-BENCH, a benchmark derived from NP-ENGINE-DATA, specifically designed to evaluate LLMs' ability to tackle NP-hard-level reasoning problems, focusing not only on feasibility but also on solution quality. Additionally, we present QWEN2.5-7B-NP, a model trained via zero-RLVR with curriculum learning on Qwen2.5-7B-Instruct, which significantly outperforms GPT-4o on NP-BENCH and achieves SOTA performance among models of the same size. Beyond in-domain tasks, we demonstrate that RLVR training on NP-ENGINE-DATA enables strong out-of-domain (OOD) generalization to reasoning tasks (logic, puzzles, math, and knowledge), as well as non-reasoning tasks such as instruction following. We also observe a scaling trend: increasing task diversity improves OOD generalization. These findings suggest that task-rich RLVR training is a promising direction for advancing LLMs' reasoning ability, revealing new insights into the scaling laws of RLVR.
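The generator-verifier-heuristic pipeline described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the 0/1 knapsack task, the function names (`generate_instance`, `verify`, `greedy_solve`, `reward`), the greedy heuristic, and the reward formula are hypothetical and are not taken from NP-ENGINE. The sketch only shows how a rule-based feasibility check and a heuristic reference value could be combined into a reward that reflects both feasibility and solution quality.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class KnapsackInstance:
    """Hypothetical toy task: 0/1 knapsack (not necessarily an NP-ENGINE task)."""
    weights: tuple[int, ...]
    values: tuple[int, ...]
    capacity: int


def generate_instance(num_items: int, seed: int) -> KnapsackInstance:
    """Controllable generator: num_items is the (assumed) difficulty knob."""
    rng = random.Random(seed)
    weights = tuple(rng.randint(1, 20) for _ in range(num_items))
    values = tuple(rng.randint(1, 50) for _ in range(num_items))
    # Every single item fits, so a non-empty feasible solution always exists.
    capacity = max(max(weights), sum(weights) // 2)
    return KnapsackInstance(weights, values, capacity)


def verify(instance: KnapsackInstance, chosen: list[int]) -> bool:
    """Rule-based verifier: valid, unique indices whose weight fits the capacity."""
    n = len(instance.weights)
    if any(not isinstance(i, int) or i < 0 or i >= n for i in chosen):
        return False
    if len(set(chosen)) != len(chosen):
        return False
    return sum(instance.weights[i] for i in chosen) <= instance.capacity


def greedy_solve(instance: KnapsackInstance) -> list[int]:
    """Heuristic solver: take items by descending value/weight ratio."""
    order = sorted(
        range(len(instance.weights)),
        key=lambda i: instance.values[i] / instance.weights[i],
        reverse=True,
    )
    chosen, remaining = [], instance.capacity
    for i in order:
        if instance.weights[i] <= remaining:
            chosen.append(i)
            remaining -= instance.weights[i]
    return chosen


def total_value(instance: KnapsackInstance, chosen: list[int]) -> int:
    return sum(instance.values[i] for i in chosen)


def reward(instance: KnapsackInstance, chosen: list[int]) -> float:
    """Assumed reward: 0 if infeasible, else quality relative to the heuristic, capped at 1."""
    if not verify(instance, chosen):
        return 0.0
    reference = total_value(instance, greedy_solve(instance))
    return min(1.0, total_value(instance, chosen) / reference)


if __name__ == "__main__":
    inst = generate_instance(num_items=8, seed=0)
    print(reward(inst, greedy_solve(inst)))  # the heuristic's own answer scores 1.0
```

Under these assumptions, a curriculum could be approximated by calling `generate_instance` with a gradually increasing `num_items`, and a model's parsed answer would be scored with `reward`. How NP-ENGINE actually defines difficulty levels and rewards is not specified in the text above.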