🤖 AI Summary
Current large language models exhibit weak reasoning capabilities on graph combinatorial optimization (GCO) tasks. To address this, we propose the Optimal Thoughts Design (OTD) framework, the first to formally define the state space of “thoughts” and the action space for GCO. Building on OTD, we develop GraphThought, a system that integrates graph-structure awareness, chain-of-thought generation, and reinforcement-guided heuristic data synthesis to automatically produce high-quality reasoning traces. We fine-tune Llama-3-8B-Instruct on these synthetic thought data; despite its compact 8B-parameter scale, the resulting model surpasses same-scale open-source models and most closed-source LLMs on the GraphArena benchmark, matching or exceeding specialized reasoning models such as o1-mini. This challenges the “scale-is-all” paradigm. Our core contribution is a learnable, structured representation of reasoning for GCO, together with empirical evidence that small models trained on high-fidelity thought data achieve substantial gains in reasoning capability.
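To make the idea of heuristic-guided thought synthesis concrete, here is a minimal, hypothetical sketch (not the paper's actual pipeline, and the function and graph below are invented for illustration): the step-by-step decisions of a classical greedy heuristic for Maximum Independent Set are serialized into a textual reasoning trace, the kind of structured "thought" data a model could then be fine-tuned on.

```python
# Hypothetical illustration: turn a greedy min-degree Maximum Independent Set
# heuristic into a natural-language reasoning trace for fine-tuning data.

def greedy_mis_trace(adj):
    """Run the greedy heuristic on an adjacency dict {v: set(neighbors)}
    and log each selection as a textual reasoning step."""
    remaining = set(adj)
    chosen, trace = [], []
    while remaining:
        # Greedy rule: pick the remaining vertex with the fewest remaining
        # neighbors (ties broken by vertex id).
        v = min(remaining, key=lambda u: (len(adj[u] & remaining), u))
        trace.append(
            f"Select vertex {v} (degree {len(adj[v] & remaining)} among "
            f"remaining); remove it and its neighbors "
            f"{sorted(adj[v] & remaining)}."
        )
        chosen.append(v)
        remaining -= {v} | adj[v]
    trace.append(f"Final independent set: {sorted(chosen)}.")
    return chosen, trace

# Toy 5-vertex graph: path 0-1-2-3 with a pendant vertex 4 attached to 2.
graph = {0: {1}, 1: {0, 2}, 2: {1, 3, 4}, 3: {2}, 4: {2}}
ind_set, thoughts = greedy_mis_trace(graph)
print("\n".join(thoughts))  # one reasoning step per greedy decision
```

On this toy graph the heuristic picks vertices 0, 3, and 4, and the trace records why each was chosen; a real pipeline would also need to score and filter such traces for optimality, which is where the reinforcement-guided component would come in.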
📝 Abstract
Large language models (LLMs) have demonstrated remarkable capabilities across various domains, especially in text processing and generative tasks. Recent advances in the reasoning capabilities of state-of-the-art LLMs, such as OpenAI-o1, have significantly broadened their applicability, particularly in complex problem-solving and logical inference. However, most existing LLMs show notable limitations when handling graph combinatorial optimization (GCO) problems. To bridge this gap, we formally define the Optimal Thoughts Design (OTD) problem, including its thought state space and action space. We then introduce GraphThought, a novel framework for generating high-quality thought datasets for GCO problems. Leveraging these datasets, we fine-tune the Llama-3-8B-Instruct model to obtain Llama-GT. Notably, despite its compact 8B-parameter architecture, Llama-GT matches the performance of state-of-the-art LLMs on the GraphArena benchmark. Experimental results show that our approach outperforms both proprietary and open-source models, even rivaling specialized reasoning models such as o1-mini. This work establishes a new state of the art while challenging the prevailing notion that model scale is the primary driver of reasoning capability.