OR-R1: Automating Modeling and Solving of Operations Research Optimization Problem via Test-Time Reinforcement Learning

📅 2025-11-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of automatically modeling and solving operations research (OR) optimization problems from natural language descriptions. The authors propose OR-R1, a data-efficient two-stage framework: Stage I employs supervised fine-tuning (SFT) on a small set of annotated examples to establish foundational generation capability; Stage II introduces Test-Time Group Relative Policy Optimization (TGRPO), an unsupervised strategy that enhances output consistency and feasibility without labeled data. OR-R1 significantly reduces reliance on large-scale annotated or synthetic datasets while improving cross-problem generalization. Evaluated on multiple real-world benchmarks, it achieves an average solving accuracy of 67.7%, outperforming prior state-of-the-art methods by up to 4.2% while using only 1/10 of their training data. TGRPO alone contributes up to 6.4% additional gain. The framework establishes a scalable, low-barrier paradigm for automated OR modeling.

📝 Abstract
Optimization modeling and solving are fundamental to the application of Operations Research (OR) in real-world decision making, yet the process of translating natural language problem descriptions into formal models and solver code remains highly expertise-intensive. While recent advances in large language models (LLMs) have opened new opportunities for automation, the generalization ability and data efficiency of existing LLM-based methods are still limited, as most require vast amounts of annotated or synthetic data, resulting in high costs and scalability barriers. In this work, we present OR-R1, a data-efficient training framework for automated optimization modeling and solving. OR-R1 first employs supervised fine-tuning (SFT) to help the model acquire the essential reasoning patterns for problem formulation and code generation from limited labeled data. It then improves capability and consistency through Test-Time Group Relative Policy Optimization (TGRPO). This two-stage design enables OR-R1 to leverage both scarce labeled and abundant unlabeled data for effective learning. Experiments show that OR-R1 achieves state-of-the-art performance with an average solving accuracy of 67.7%, using only 1/10 the synthetic data required by prior methods such as ORLM, and exceeds ORLM's solving accuracy by up to 4.2%. Remarkably, OR-R1 outperforms ORLM by over 2.4% with just 100 synthetic samples. Furthermore, TGRPO contributes an additional 3.1%–6.4% improvement in accuracy, significantly narrowing the gap between single-attempt (Pass@1) and multi-attempt (Pass@8) performance from 13% to 7%. Extensive evaluations across diverse real-world benchmarks demonstrate that OR-R1 provides a robust, scalable, and cost-effective solution for automated OR optimization problem modeling and solving, lowering the expertise and data barriers for industrial OR applications.
Problem

Research questions and friction points this paper is trying to address.

Automating translation of natural language to optimization models and solver code
Reducing data dependency and expertise requirements in operations research
Improving generalization and consistency in automated optimization problem solving
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses supervised fine-tuning (SFT) on limited labeled data to instill reasoning patterns for model formulation and code generation
Applies test-time reinforcement learning (TGRPO) to improve output consistency and feasibility without labels
Leverages both scarce labeled and abundant unlabeled data efficiently
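The page gives no implementation details for TGRPO, but the abstract describes it as a group-relative policy optimization applied at test time with unsupervised rewards (e.g., output consistency and solver feasibility). A minimal sketch of the group-relative advantage computation that GRPO-style methods build on might look as follows; the function name and the example reward values are illustrative assumptions, not from the paper:

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sampled output's reward against its group's mean and
    standard deviation (GRPO-style), so no learned value critic is needed.
    In a TGRPO-like setting the rewards could be unsupervised signals such
    as solver feasibility or agreement among sampled solutions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical example: binary feasibility rewards for a group of 4
# sampled model outputs on the same problem instance.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
# Feasible outputs get positive advantages, infeasible ones negative,
# and the advantages sum to (approximately) zero across the group.
```

Outputs scored above the group average are reinforced and those below are suppressed, which is how such a scheme can push the model toward consistent, feasible generations without ground-truth labels.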