🤖 AI Summary
This study systematically evaluates the end-to-end reasoning capabilities of large language models (LLMs) on combinatorial optimization problems described in natural language, focusing on performance in high-dimensional solution spaces under hard constraints. To this end, the authors introduce NLCO, a benchmark of 43 problems that require models to output discrete solutions directly, without code generation or external solvers. A four-layer taxonomy, based on variable types, constraint families, global patterns, and objective classes, enables fine-grained assessment of feasibility, optimality, and reasoning efficiency. Experiments show that state-of-the-art models perform well on small-scale instances but degrade significantly as problem size increases; set-based tasks are comparatively easy, whereas problems with graph structures or bottleneck objectives pose greater challenges.
📝 Abstract
While large language models (LLMs) have shown strong performance in math and logic reasoning, their ability to handle combinatorial optimization (CO) -- searching high-dimensional solution spaces under hard constraints -- remains underexplored. To bridge this gap, we introduce NLCO, a \textbf{N}atural \textbf{L}anguage \textbf{C}ombinatorial \textbf{O}ptimization benchmark that evaluates LLMs on end-to-end CO reasoning: given a language-described decision-making scenario, the model must output a discrete solution without writing code or calling external solvers. NLCO covers 43 CO problems and is organized using a four-layer taxonomy of variable types, constraint families, global patterns, and objective classes, enabling fine-grained evaluation. We provide solver-annotated solutions and comprehensively evaluate LLMs by feasibility, solution optimality, and reasoning efficiency. Experiments across a wide range of modern LLMs show that high-performing models achieve strong feasibility and solution quality on small instances, but both degrade as instance size grows, even when more tokens are spent on reasoning. We also observe systematic effects across the taxonomy: set-based tasks are relatively easy, whereas graph-structured problems and bottleneck objectives lead to more frequent failures.
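To make the evaluation axes concrete, here is a minimal sketch of the two solution-quality metrics the abstract implies: feasibility rate (fraction of instances whose output satisfies all hard constraints) and mean relative optimality gap against solver-annotated optima. The function names and data layout are illustrative assumptions, not the benchmark's actual API.

```python
# Hypothetical sketch of NLCO-style evaluation metrics; names and the
# `results` schema below are illustrative assumptions, not the benchmark's API.

def feasibility_rate(results):
    """Fraction of instances whose model output satisfies all hard constraints."""
    return sum(r["feasible"] for r in results) / len(results)

def mean_optimality_gap(results):
    """Mean relative gap |model - opt| / |opt| over feasible instances
    (assuming a minimization objective with nonzero optimum)."""
    feasible = [r for r in results if r["feasible"]]
    if not feasible:
        return float("nan")
    gaps = [abs(r["model_obj"] - r["opt_obj"]) / abs(r["opt_obj"]) for r in feasible]
    return sum(gaps) / len(gaps)

# Toy example: three instances, one infeasible output.
results = [
    {"feasible": True, "model_obj": 12.0, "opt_obj": 10.0},
    {"feasible": True, "model_obj": 10.0, "opt_obj": 10.0},
    {"feasible": False, "model_obj": None, "opt_obj": 10.0},
]
print(feasibility_rate(results))     # 2 of 3 outputs are feasible
print(mean_optimality_gap(results))  # gaps 0.2 and 0.0 average to 0.1
```

Separating feasibility from optimality matters here: a model can emit constraint-violating "solutions" whose objective value looks good, so the gap is only meaningful over the feasible subset.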