🤖 AI Summary
This paper addresses core challenges in human-robot collaborative manipulation of deformable objects, including complex dynamic modeling, imprecise force control, and poor task adaptability, using gift wrapping as a representative scenario. We propose START, a subtask-aware unified policy model that enhances temporal modeling via sub-task ID embeddings, integrates large language model-driven task planning with Transformer-based imitation learning and reinforcement learning, and incorporates a residual force control mechanism to achieve end-to-end closed-loop execution from high-level planning to fine-grained force regulation. Evaluated in real-world settings, START achieves a 97% wrapping success rate, substantially reduces reliance on task-specific models, and enables flexible human-robot collaboration with fine-grained adaptive manipulation. The framework offers a scalable, generalizable solution for force-controlled manipulation of deformable objects.
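The residual force control idea described above can be sketched as a learned force command plus a small feedback correction. The function name, gain, and clipping bound below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def residual_force_command(f_policy, f_target, f_measured,
                           gain=0.5, max_residual=2.0):
    """Combine a learned policy force with a feedback residual.

    f_policy:   force predicted by the learned policy (N)
    f_target:   desired contact force for the current subtask (N)
    f_measured: force read from the force/torque sensor (N)

    The residual term nudges the policy output toward the target;
    clipping keeps the correction small so learned behavior dominates.
    """
    residual = np.clip(gain * (f_target - f_measured),
                       -max_residual, max_residual)
    return f_policy + residual

# Policy commands 3.0 N, target is 4.0 N, sensor reads 2.5 N:
# residual = 0.5 * (4.0 - 2.5) = 0.75, so the command becomes 3.75 N.
print(residual_force_command(3.0, 4.0, 2.5))  # 3.75
```

The clipping bound is what makes the scheme "residual": the feedback loop can only make bounded corrections around the learned policy output, rather than overriding it.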
📝 Abstract
Human-robot cooperation is essential in environments such as warehouses and retail stores, where workers frequently handle deformable objects like paper, bags, and fabrics. Coordinating robotic actions with human assistance remains difficult due to the unpredictable dynamics of deformable materials and the need for adaptive force control. To address this challenge, we focus on the task of gift wrapping, which exemplifies a long-horizon manipulation problem involving precise folding, controlled creasing, and secure fixation of paper. Success is achieved when the robot completes the sequence to produce a neatly wrapped package with clean folds and no tears. We propose a learning-based framework that integrates a high-level task planner powered by a large language model (LLM) with a low-level hybrid imitation learning (IL) and reinforcement learning (RL) policy. At its core is a Sub-task Aware Robotic Transformer (START) that learns a unified policy from human demonstrations. The key novelty lies in capturing long-range temporal dependencies across the full wrapping sequence within a single model. Unlike vanilla Action Chunking with Transformers (ACT), which is typically applied to short-horizon tasks, our method introduces sub-task IDs that provide explicit temporal grounding. This enables robust performance across the entire wrapping process and supports flexible execution, as the policy learns sub-goals rather than merely replicating motion sequences. Our framework achieves a 97% success rate on real-world wrapping tasks. We show that the unified transformer-based policy reduces the need for specialized models, allows controlled human supervision, and effectively bridges high-level intent with the fine-grained force control required for deformable object manipulation.
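The sub-task ID conditioning described above can be sketched as an embedding lookup appended to the policy's observation features. All dimensions, names, and the random (rather than learned) embedding table below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_SUBTASKS = 5  # e.g. fold, crease, tape, ... (illustrative count)
EMBED_DIM = 8     # embedding width (illustrative)
OBS_DIM = 16      # observation feature width (illustrative)

# In training this table would be learned jointly with the policy;
# here it is random purely for illustration.
subtask_embedding = rng.normal(size=(NUM_SUBTASKS, EMBED_DIM))

def condition_on_subtask(obs_features, subtask_id):
    """Append the sub-task ID embedding to the observation features,
    giving the transformer explicit temporal grounding about which
    phase of the long-horizon wrapping sequence it is in."""
    emb = subtask_embedding[subtask_id]
    return np.concatenate([obs_features, emb])

obs = rng.normal(size=OBS_DIM)
token = condition_on_subtask(obs, subtask_id=2)
print(token.shape)  # (24,)
```

The point of the extra embedding is that two visually similar observations from different wrapping phases map to distinct inputs, so a single policy can behave differently per sub-goal instead of requiring one model per sub-task.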