CORL: Reinforcement Learning of MILP Policies Solved via Branch and Bound

📅 2025-12-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Mixed-integer linear programming (MILP) solvers based on branch and bound (B&B) often underperform on stochastic real-world combinatorial sequential decision-making tasks, because the MILP model rarely captures the real problem exactly. Method: The paper proposes CORL, a proof-of-concept reinforcement learning framework that casts an MILP solved via B&B as a differentiable stochastic policy, enabling end-to-end fine-tuning of the MILP scheme on real-world data without requiring optimal-solution labels, external supervision, or gradient surrogates. Contribution/Results: The method is validated on a simple illustrative combinatorial sequential decision-making example, demonstrating a policy-driven paradigm for tuning MILP models for decision quality rather than for modeling fidelity.

📝 Abstract
Combinatorial sequential decision-making problems are typically modeled as mixed-integer linear programs (MILPs) and solved via branch-and-bound (B&B) algorithms. The inherent difficulty of modeling MILPs that accurately represent stochastic real-world problems leads to suboptimal performance in the real world. Recently, machine learning methods have been applied to build MILP models for decision quality rather than for fidelity to the real-world problem. However, these approaches typically rely on supervised learning, assume access to true optimal decisions, and use surrogates for the MILP gradients. In this work, we introduce a proof-of-concept CORL framework that end-to-end fine-tunes an MILP scheme using reinforcement learning (RL) on real-world data to maximize its operational performance. We enable this by casting an MILP solved by B&B as a differentiable stochastic policy compatible with RL. We validate the CORL method in a simple illustrative combinatorial sequential decision-making example.
Problem

Research questions and friction points this paper is trying to address.

MILP models of stochastic real-world problems are inaccurate, leading to suboptimal operational performance of B&B solvers
Existing learned approaches rely on supervised learning and assume access to true optimal decisions
Existing approaches use surrogates for the MILP gradients rather than optimizing the solver end-to-end
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning fine-tunes MILP policies end-to-end
Casts an MILP solved by branch and bound as a differentiable stochastic policy
Maximizes operational performance using real-world data
Akhil S Anand
NTNU
Reinforcement learning, Optimal decision making, Controlled env agriculture, Energy flexibility
Elias Aarekol
Norwegian University of Science and Technology (NTNU), Trondheim, Norway
Martin Mziray Dalseg
Norwegian University of Science and Technology (NTNU), Trondheim, Norway
Magnus Stålhane
Norwegian University of Science and Technology (NTNU), Trondheim, Norway
Sebastien Gros
Professor, Eng. Cybernetics, NTNU
Optimal Control, NMPC, Reinforcement Learning