🤖 AI Summary
This work addresses the performance degradation of existing deep reinforcement learning methods in online 3D bin packing under the distribution shifts commonly encountered in real-world logistics. The authors formulate the packing problem with short-horizon lookahead information as a model predictive control (MPC) task and propose a Monte Carlo Tree Search (MCTS)-based optimization framework. Their approach incorporates a dynamic exploration prior that adaptively balances a learned policy against robust random exploration, alongside an auxiliary reward that penalizes long-term spatial waste. Experiments on real-world datasets demonstrate significant improvements over state-of-the-art methods: over 10% performance gain under distribution shift, an average 4% improvement in online deployment, and up to 8% in the best case.
📝 Abstract
Online 3D Bin Packing (3D-BP) with robotic arms is crucial for reducing transportation and labor costs in modern logistics. While Deep Reinforcement Learning (DRL) has shown strong performance, it often fails to adapt to real-world short-term distribution shifts, which arise as different batches of goods arrive sequentially, causing performance drops. We argue that the short-term lookahead information available in modern logistics systems is key to mitigating this issue, especially during distribution shifts. We formulate online 3D-BP with lookahead parcels as a Model Predictive Control (MPC) problem and adapt the Monte Carlo Tree Search (MCTS) framework to solve it. Our framework employs a dynamic exploration prior that automatically balances a learned RL policy and a robust random policy based on the lookahead characteristics. Additionally, we design an auxiliary reward to penalize long-term spatial waste from individual placements. Extensive experiments on real-world datasets show that our method consistently outperforms state-of-the-art baselines, achieving over 10% gains under distribution shift, a 4% average improvement in online deployment, and more than 8% in the best case, demonstrating the effectiveness of our framework.
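The core idea of the dynamic exploration prior can be illustrated with a small sketch: blend the action probabilities of a learned policy with a uniform (random-policy) prior via a weight `lam`, then feed the blended prior into a standard PUCT-style MCTS selection rule. This is a minimal illustration under assumptions, not the paper's implementation; the names `mixed_prior`, `puct_select`, and the schedule for `lam` (high when the lookahead resembles training data, low under shift) are hypothetical.

```python
import math

def mixed_prior(policy_probs, lam):
    """Blend a learned policy prior with a uniform prior.

    lam in [0, 1] is the dynamic weight: closer to 1 when the
    lookahead parcels resemble the training distribution, closer
    to 0 under distribution shift (the exact schedule here is an
    assumption, not the paper's formula).
    """
    n = len(policy_probs)
    return [lam * p + (1.0 - lam) / n for p in policy_probs]

def puct_select(priors, visit_counts, values, c_puct=1.5):
    """Standard PUCT selection: pick the child maximizing Q + U,
    where U grows with the prior and shrinks with visit count."""
    total = sum(visit_counts)
    best, best_score = 0, -float("inf")
    for a, (p, n, q) in enumerate(zip(priors, visit_counts, values)):
        u = c_puct * p * math.sqrt(total + 1) / (1 + n)
        if q + u > best_score:
            best, best_score = a, q + u
    return best

# Example: the learned policy strongly prefers action 0, but a low
# lam (shift detected) pulls the prior toward uniform exploration.
priors = mixed_prior([0.9, 0.05, 0.05], lam=0.3)
action = puct_select(priors, [0, 0, 0], [0.0, 0.0, 0.0])
```

With a low `lam`, less-favored placements keep enough prior mass to be visited, which is what makes the search robust when the learned policy's preferences are unreliable.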