🤖 AI Summary
This work proposes a decision-making framework that integrates offline reinforcement learning with fine-tuned large language models (LLMs) to address the inefficiency of real-time human scheduling in semi-automated warehouses. The approach combines, in parallel, a Transformer-based offline reinforcement learning policy with an LLM's capacity to interpret abstract state descriptions, enabling multi-granular inputs ranging from raw system states to human-readable summaries. It further uses Direct Preference Optimization (DPO) on simulator-generated preferences to fine-tune the LLM. Experimental results show that the offline reinforcement learning policy improves system throughput by 2.4% in simulation, while the fine-tuned LLM matches or slightly exceeds historical baselines, confirming the feasibility and complementary strengths of both approaches for real-world warehouse scheduling.
📝 Abstract
We investigate machine learning approaches for optimizing real-time staffing decisions in semi-automated warehouse sortation systems. Operational decision-making can be supported at different levels of abstraction, each with different trade-offs. We evaluate two approaches, each in a matching simulation environment. First, we train custom Transformer-based policies using offline reinforcement learning on detailed historical state representations, achieving a 2.4% throughput improvement over historical baselines in learned simulators. In high-volume warehouse operations, improvements of this size translate into substantial cost savings. Second, we explore LLMs operating on abstracted, human-readable state descriptions, a natural fit for the decisions warehouse managers make from high-level operational summaries. We systematically compare prompting techniques, automatic prompt optimization, and fine-tuning strategies. While prompting alone proves insufficient, supervised fine-tuning combined with Direct Preference Optimization on simulator-generated preferences matches or slightly exceeds historical baselines in a hand-crafted simulator. Our findings demonstrate that both approaches offer viable paths toward AI-assisted operational decision-making: offline RL excels with task-specific architectures, while LLMs accept human-readable inputs and support an iterative feedback loop that incorporates manager preferences.
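To make the "simulator-generated preferences" idea concrete, here is a minimal, hypothetical sketch of how candidate staffing decisions could be scored by a simulator and turned into (prompt, chosen, rejected) pairs for DPO fine-tuning. The simulator, prompt format, and scoring function below are illustrative stand-ins, not the paper's actual implementation.

```python
# Hypothetical sketch: turning simulator-scored staffing decisions into
# DPO preference pairs. All names and the scoring rule are assumptions.

def simulate_throughput(decision, state):
    # Stand-in simulator: throughput is best when the staffing decision
    # matches demand (a real simulator would model the sortation system).
    return -abs(decision - state["demand"])

def build_preference_pairs(states, candidates_per_state):
    """For each state, rank candidate decisions by simulated throughput
    and emit one (prompt, chosen, rejected) pair per state."""
    pairs = []
    for state, candidates in zip(states, candidates_per_state):
        # Best-scoring candidate becomes "chosen", worst becomes "rejected".
        ranked = sorted(
            candidates,
            key=lambda d: simulate_throughput(d, state),
            reverse=True,
        )
        prompt = f"Demand: {state['demand']} workers needed. Staffing decision?"
        pairs.append(
            {"prompt": prompt, "chosen": ranked[0], "rejected": ranked[-1]}
        )
    return pairs
```

Pairs in this shape can then be fed to a standard DPO training loop, so the LLM is nudged toward decisions the simulator rewards without requiring hand-labeled preferences.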