A Large Language Model-Enhanced Q-learning for Capacitated Vehicle Routing Problem with Time Windows

📅 2025-05-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the NP-hard Capacitated Vehicle Routing Problem with Time Windows (CVRPTW). We propose an LLM-driven, constraint-aware reinforcement learning framework. Methodologically, we design a two-stage adaptive Q-learning training scheme that integrates Chain-of-Thought (CoT) reasoning to enable three-level self-correction—syntactic, semantic, and physical—ensuring strict adherence to problem constraints. Furthermore, we introduce an LLM-generated experience prioritized replay strategy to improve guidance efficiency. Compared to conventional Q-learning, our approach reduces average delivery cost by 7.3% and significantly accelerates convergence. The core contribution is the first LLM-guided combinatorial optimization paradigm, which deeply couples large language models’ logical reasoning capabilities with reinforcement learning’s sequential decision-making capacity. Empirical results demonstrate that LLMs provide reliable, interpretable, and constraint-compliant guidance in complex constrained combinatorial optimization settings.

📝 Abstract
The Capacitated Vehicle Routing Problem with Time Windows (CVRPTW) is a classic NP-hard combinatorial optimization problem widely applied in logistics distribution and transportation management. Its complexity stems from the vehicle-capacity and time-window constraints, which pose significant challenges to traditional approaches. Advances in Large Language Models (LLMs) provide new possibilities for finding approximate solutions to CVRPTW. This paper proposes a novel LLM-enhanced Q-learning framework to address the CVRPTW with real-time emergency constraints. Our solution introduces an adaptive two-phase training mechanism that transitions from an LLM-guided exploration phase to an autonomous optimization phase of the Q-network. To ensure reliability, we design a three-tier self-correction mechanism based on Chain-of-Thought (CoT) reasoning for LLMs: syntactic validation, semantic verification, and physical constraint enforcement. In addition, we prioritize replay of the experiences generated by the LLM to amplify its guiding role in the architecture. Experimental results demonstrate that our framework achieves a 7.3% average reduction in cost compared to traditional Q-learning, with fewer training steps required for convergence.
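The two-phase mechanism and the prioritized replay of LLM-generated experience described above can be sketched in a minimal tabular Q-learning loop. This is an illustrative reconstruction, not the paper's implementation: the class name `TwoPhaseQLearner`, the `llm_advisor` callback, and the `llm_priority` weight are all hypothetical stand-ins for the components the abstract describes.

```python
import random
from collections import defaultdict


class TwoPhaseQLearner:
    """Sketch of an LLM-guided, two-phase Q-learning loop.

    Phase 1: actions come from a (hypothetical) LLM advisor callback and
    are stored with a higher replay priority. Phase 2: standard
    epsilon-greedy exploration over the learned Q-table.
    """

    def __init__(self, actions, phase1_steps=100, alpha=0.1, gamma=0.9,
                 epsilon=0.1, llm_priority=2.0):
        self.q = defaultdict(float)       # Q[(state, action)]
        self.actions = actions
        self.phase1_steps = phase1_steps  # length of the LLM-guided phase
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.llm_priority = llm_priority  # sampling weight for LLM experience
        self.replay = []                  # (priority, s, a, r, s_next)
        self.step = 0

    def select_action(self, state, llm_advisor=None):
        """Return (action, from_llm); the phase switch is step-count based."""
        self.step += 1
        if self.step <= self.phase1_steps and llm_advisor is not None:
            return llm_advisor(state), True   # phase 1: LLM-guided exploration
        if random.random() < self.epsilon:
            return random.choice(self.actions), False
        return max(self.actions, key=lambda a: self.q[(state, a)]), False

    def store(self, s, a, r, s_next, from_llm):
        # LLM-generated transitions get a larger sampling weight.
        w = self.llm_priority if from_llm else 1.0
        self.replay.append((w, s, a, r, s_next))

    def replay_update(self, batch_size=8):
        """Priority-weighted sampling followed by standard TD(0) updates."""
        if not self.replay:
            return
        weights = [t[0] for t in self.replay]
        batch = random.choices(self.replay, weights=weights, k=batch_size)
        for _, s, a, r, s_next in batch:
            best_next = max(self.q[(s_next, b)] for b in self.actions)
            td_error = r + self.gamma * best_next - self.q[(s, a)]
            self.q[(s, a)] += self.alpha * td_error
```

The design choice to weight rather than segregate LLM experience keeps a single replay buffer while still letting LLM-guided transitions dominate early updates, which matches the abstract's claim that prioritized replay "amplifies" the LLM's guiding role.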
Problem

Research questions and friction points this paper is trying to address.

Solving CVRPTW with LLM-enhanced Q-learning for logistics optimization
Addressing real-time emergency constraints in vehicle routing problems
Reducing cost and training steps via adaptive two-phase training
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-enhanced Q-learning for CVRPTW optimization
Adaptive two-phase LLM-guided training mechanism
Three-tier self-correction with Chain-of-Thought
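The three-tier self-correction the paper names (syntactic, semantic, physical) can be illustrated as a validation gate applied to an LLM-proposed route before it is accepted. The function below is a sketch under assumed conventions, not the paper's code: the comma-separated route format, the `demand`/`time_windows`/`travel_time` structures, and node 0 as the depot are all assumptions for illustration.

```python
def validate_route(route_text, demand, capacity, time_windows, travel_time):
    """Three-tier check on an LLM-proposed route; returns (ok, reason).

    Tier 1 (syntactic): the reply parses as a comma-separated node list.
    Tier 2 (semantic): every node is known and visited at most once.
    Tier 3 (physical): total demand fits capacity and each arrival
    respects its time window.
    """
    # Tier 1: syntactic validation
    try:
        route = [int(tok) for tok in route_text.split(",")]
    except ValueError:
        return False, "syntactic: not a comma-separated integer list"

    # Tier 2: semantic verification
    if len(set(route)) != len(route):
        return False, "semantic: duplicate customer visit"
    if any(n not in demand for n in route):
        return False, "semantic: unknown customer id"

    # Tier 3: physical constraint enforcement
    if sum(demand[n] for n in route) > capacity:
        return False, "physical: vehicle capacity exceeded"
    t, prev = 0.0, 0  # assume depot is node 0, departure at t = 0
    for n in route:
        t += travel_time[(prev, n)]
        earliest, latest = time_windows[n]
        if t > latest:
            return False, f"physical: node {n} misses its time window"
        t = max(t, earliest)  # wait if the vehicle arrives early
        prev = n
    return True, "ok"
```

In a CoT-style loop, a rejection reason like these would be fed back to the LLM as the correction prompt for the next attempt.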
Linjiang Cao
Key Laboratory of Road and Traffic Engineering, Ministry of Education, Tongji University, Shanghai, China
Maonan Wang
Unknown affiliation
Xi Xiong
Key Laboratory of Road and Traffic Engineering, Ministry of Education, Tongji University, Shanghai, China