Preference Optimization for Combinatorial Optimization Problems

📅 2025-05-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Reinforcement learning (RL) for combinatorial optimization suffers from diminishing reward signals and inefficient exploration in vast action spaces. This paper proposes a preference-based RL framework that converts scalar rewards into pairwise preference signals through statistical comparison modeling, so the policy learns relative solution quality directly. The method combines statistical preference modeling, a reparameterization of the reward in terms of the policy, and local search embedded in fine-tuning. Key contributions include: (i) applying preference modeling to RL for combinatorial optimization; (ii) an entropy-regularized objective, reformulated to avoid intractable computations; and (iii) using local search to generate preference pairs during fine-tuning rather than as a post-hoc refinement step. On TSP, CVRP, and FFSP benchmarks, the method is reported to significantly outperform existing RL algorithms in both convergence efficiency and solution quality.

📝 Abstract

Reinforcement Learning (RL) has emerged as a powerful tool for neural combinatorial optimization, enabling models to learn heuristics that solve complex problems without requiring expert knowledge. Despite significant progress, existing RL approaches face challenges such as diminishing reward signals and inefficient exploration in vast combinatorial action spaces, leading to inefficiency. In this paper, we propose Preference Optimization, a novel method that transforms quantitative reward signals into qualitative preference signals via statistical comparison modeling, emphasizing the superiority among sampled solutions. Methodologically, by reparameterizing the reward function in terms of policy and utilizing preference models, we formulate an entropy-regularized RL objective that aligns the policy directly with preferences while avoiding intractable computations. Furthermore, we integrate local search techniques into the fine-tuning rather than post-processing to generate high-quality preference pairs, helping the policy escape local optima. Empirical results on various benchmarks, such as the Traveling Salesman Problem (TSP), the Capacitated Vehicle Routing Problem (CVRP) and the Flexible Flow Shop Problem (FFSP), demonstrate that our method significantly outperforms existing RL algorithms, achieving superior convergence efficiency and solution quality.
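The idea of turning quantitative rewards into qualitative preferences can be illustrated with a minimal sketch. It assumes a Bradley-Terry comparison model and an implicit reward of the form `alpha * log_prob`; the abstract does not name the specific preference model, and the function names and the `alpha` parameter here are hypothetical, not the authors' implementation.

```python
import math
from itertools import combinations


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pairwise_preference_loss(costs, log_probs, alpha=1.0):
    """Average negative log-likelihood of the observed preferences.

    costs[i]     -- objective value of sampled solution i (lower is better)
    log_probs[i] -- log-probability the policy assigns to solution i
    alpha        -- hypothetical scale of the implicit reward alpha * log_prob

    Assumption: preferences follow a Bradley-Terry model, so the probability
    that solution a beats solution b is sigmoid(alpha * (log_p_a - log_p_b)).
    """
    total, n_pairs = 0.0, 0
    for i, j in combinations(range(len(costs)), 2):
        if costs[i] == costs[j]:
            continue  # a tie carries no preference signal
        win, lose = (i, j) if costs[i] < costs[j] else (j, i)
        total -= log_sigmoid(alpha * (log_probs[win] - log_probs[lose]))
        n_pairs += 1
    return total / n_pairs if n_pairs else 0.0


# Three sampled tours: the policy already favors the shortest one.
loss = pairwise_preference_loss(costs=[10.0, 12.0, 15.0],
                                log_probs=[-1.0, -2.0, -3.0])
```

Only the ordering of the costs enters the loss, so rescaling all costs leaves it unchanged. This is one way to read the abstract's point about diminishing reward signals: the learning signal does not shrink as cost differences between sampled solutions become small.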

Problem

Research questions and friction points this paper is trying to address.

Addresses inefficiency in RL for combinatorial optimization

Scalar reward signals that do not emphasize which sampled solutions are better

Policies that get stuck in local optima

Innovation

Methods, ideas, or system contributions that make the work stand out.

Transforms rewards into qualitative preference signals

Reparameterizes the reward function in terms of the policy, aligning the policy directly with preferences

Integrates local search into fine-tuning, rather than post-processing, to generate high-quality preference pairs
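As a rough illustration of using local search to build preference pairs, the sketch below improves a sampled TSP tour with a single 2-opt move and pairs the improved tour (preferred) with the original (dispreferred). The choice of 2-opt and all function names are assumptions made for this example; the summary does not specify which local search operators the paper uses.

```python
import math


def tour_length(tour, coords):
    """Length of the closed tour visiting coords in the given order."""
    n = len(tour)
    return sum(math.dist(coords[tour[k]], coords[tour[(k + 1) % n]])
               for k in range(n))


def two_opt_once(tour, coords):
    """Return the first strictly improving 2-opt neighbor, else a copy."""
    n = len(tour)
    base = tour_length(tour, coords)
    for i in range(n - 1):
        for j in range(i + 2, n):
            # Reverse the segment between positions i+1 and j (inclusive).
            candidate = tour[:i + 1] + tour[i + 1:j + 1][::-1] + tour[j + 1:]
            if tour_length(candidate, coords) < base - 1e-12:
                return candidate
    return list(tour)


def make_preference_pair(sampled_tour, coords):
    """Return (preferred, dispreferred), or None if no improvement exists."""
    improved = two_opt_once(sampled_tour, coords)
    if tour_length(improved, coords) < tour_length(sampled_tour, coords):
        return improved, sampled_tour
    return None


# Unit square: the sampled tour crosses itself, so 2-opt can uncross it.
coords = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
pair = make_preference_pair([0, 2, 1, 3], coords)
```

A pair built this way could then feed a pairwise preference loss during fine-tuning, which is how the local search influences the policy itself instead of only polishing its final outputs.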

🔎 Similar Papers

RainbowPO: A Unified Framework for Combining Improvements in Preference Optimization