🤖 AI Summary
This work addresses the bottlenecks in training efficiency, memory consumption, and throughput that hinder neural combinatorial optimization solvers, where existing online learning paradigms struggle to balance solution quality against resource overhead. To overcome these limitations, we propose ECO, a novel offline self-play two-stage training framework tailored to neural combinatorial optimization. The approach first initializes the policy via supervised pre-warming and then iteratively refines it with Direct Preference Optimization (DPO). We further integrate the Mamba sequence-modeling architecture to reduce memory usage and introduce a heuristic progressive guidance mechanism to stabilize training. On Traveling Salesman Problem (TSP) and Capacitated Vehicle Routing Problem (CVRP) benchmarks, ECO achieves state-of-the-art solution quality while substantially improving training throughput and reducing memory footprint.
📝 Abstract
We propose ECO, a versatile learning paradigm that enables efficient offline self-play for Neural Combinatorial Optimization (NCO). ECO addresses key limitations in the field through: 1) Paradigm Shift: Moving beyond inefficient online paradigms, we introduce a two-phase offline paradigm consisting of supervised warm-up and iterative Direct Preference Optimization (DPO); 2) Architecture Shift: We deliberately design a Mamba-based architecture to further improve efficiency within the offline paradigm; and 3) Progressive Bootstrapping: To stabilize training, we employ a heuristic-based bootstrapping mechanism that ensures continuous policy improvement. Comparative results on TSP and CVRP show that ECO performs competitively with state-of-the-art baselines while offering significant efficiency advantages in memory utilization and training throughput. We further provide an in-depth analysis of ECO's efficiency, throughput, and memory usage, and ablation studies validate the rationale behind our design choices.
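The iterative DPO stage in the offline paradigm can be illustrated with a minimal sketch of the standard DPO objective applied to one preference pair of solutions (e.g., two sampled tours ranked by length, the shorter being preferred). The function name, the β value, and the pairing scheme are illustrative assumptions; the paper's exact formulation may differ.

```python
import math

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair (illustrative sketch).

    policy_logp_w / policy_logp_l: log-probabilities of the preferred
    (e.g., shorter tour) and dispreferred solution under the current policy.
    ref_logp_w / ref_logp_l: the same quantities under the frozen
    reference policy (here, the supervised warm-up checkpoint).
    """
    # Implicit reward margin between preferred and dispreferred solutions.
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # Numerically stable -log(sigmoid(margin)), i.e. softplus(-margin).
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy matches the reference the margin is zero and the loss equals log 2; increasing the preferred solution's log-probability relative to the reference drives the loss down, which is what the iterative refinement exploits.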