Learning to Staff: Offline Reinforcement Learning and Fine-Tuned LLMs for Warehouse Staffing Optimization

📅 2026-03-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes a decision-making framework that integrates offline reinforcement learning with fine-tuned large language models (LLMs) to address inefficient real-time human scheduling in semi-automated warehouses. The approach combines, in parallel, a Transformer-based offline reinforcement learning policy with an LLM that interprets abstract state descriptions, enabling multi-granular inputs that range from raw system states to human-readable summaries. It further introduces a simulation-based Direct Preference Optimization (DPO) mechanism to automatically refine prompting strategies. Experimental results show that the offline reinforcement learning policy improves system throughput by 2.4% in simulation, while the fine-tuned LLM matches or slightly exceeds historical baselines, confirming the feasibility and complementary potential of both approaches for real-world warehouse scheduling.

📝 Abstract
We investigate machine learning approaches for optimizing real-time staffing decisions in semi-automated warehouse sortation systems. Operational decision-making can be supported at different levels of abstraction, with different trade-offs. We evaluate two approaches, each in a matching simulation environment. First, we train custom Transformer-based policies using offline reinforcement learning on detailed historical state representations, achieving a 2.4% throughput improvement over historical baselines in learned simulators. In high-volume warehouse operations, improvements of this size translate to significant savings. Second, we explore LLMs operating on abstracted, human-readable state descriptions. These are a natural fit for decisions that warehouse managers make using high-level operational summaries. We systematically compare prompting techniques, automatic prompt optimization, and fine-tuning strategies. While prompting alone proves insufficient, supervised fine-tuning combined with Direct Preference Optimization on simulator-generated preferences achieves performance that matches or slightly exceeds historical baselines in a hand-crafted simulator. Our findings demonstrate that both approaches offer viable paths toward AI-assisted operational decision-making. Offline RL excels with task-specific architectures. LLMs support human-readable inputs and can be combined with an iterative feedback loop that can incorporate manager preferences.
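The abstract describes fine-tuning the LLM with Direct Preference Optimization on simulator-generated preference pairs. As an illustration only (not the authors' code), the per-pair DPO objective can be sketched as a small function; the argument names and the `beta=0.1` default are assumptions for the sketch, not values from the paper:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected,
             beta=0.1):
    """DPO loss for one simulator-generated preference pair.

    logp_*     : log-probability of the chosen/rejected staffing
                 recommendation under the policy being fine-tuned
    ref_logp_* : log-probabilities under the frozen reference model
    beta       : strength of the implicit KL constraint (assumed value)
    """
    # Implicit rewards are log-ratios against the reference model.
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    # Loss is -log(sigmoid(margin)): small when the policy prefers
    # the simulator-preferred response by a wide margin.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A pair where the policy already prefers the chosen response
# relative to the reference yields a loss below log(2):
loss = dpo_loss(-1.0, -2.0, -1.5, -1.5, beta=1.0)
```

In the paper's setup, the preference labels would come from simulator rollouts of candidate staffing decisions rather than from human annotators.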
Problem

Research questions and friction points this paper is trying to address.

warehouse staffing
offline reinforcement learning
large language models
operational decision-making
sortation systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Offline Reinforcement Learning
Fine-tuned LLMs
Warehouse Staffing Optimization
Direct Preference Optimization
Transformer-based Policy