🤖 AI Summary
In retail settings, robotic manipulation of beverage bottles still falls back on costly human teleoperation because complex contact dynamics and insufficient visual cues defeat purely visual policies. To address this, we propose a multimodal imitation learning framework that integrates six-axis force/torque sensing. Building on the Action Chunking Transformer (ACT), we introduce force/torque signals as a dedicated input modality, jointly processed with RGB images and joint-state observations for end-to-end policy learning. This augmentation significantly enhances state awareness during contact-intensive phases (e.g., pressing, placing) and improves fine-grained interaction modeling under visual occlusion. We deploy the policy in closed loop on Ghost, a single-arm robot by Telexistence Inc., achieving substantial gains in grasp success and bottle-reorientation accuracy. Ablation studies confirm the critical role of force/torque feedback in vision-limited scenarios. Our work establishes a scalable multimodal learning paradigm for dexterous, robust manipulation in real-world retail environments.
📝 Abstract
Manipulator robots are increasingly being deployed in retail environments, yet contact-rich edge cases still trigger costly human teleoperation. A prominent example is reorienting lying beverage bottles upright, where purely visual cues are often insufficient to resolve the subtle contact events required for precise manipulation. We present a multimodal imitation learning policy that augments the Action Chunking Transformer with force and torque sensing, enabling end-to-end learning over images, joint states, and forces and torques. Deployed on Ghost, a single-arm platform by Telexistence Inc., our approach improves the Pick-and-Reorient bottle task by detecting and exploiting contact transitions during pressing and placement. Hardware experiments demonstrate higher task success than an ablation baseline restricted to the original ACT observation space, and they indicate that force and torque signals are most beneficial in the press and place phases, where visual observability is limited. This supports the use of interaction forces as a complementary modality for contact-rich skills. The results suggest a practical path to scaling retail manipulation by combining modern imitation learning architectures with lightweight force and torque sensing.
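To make the observation fusion concrete, the sketch below shows one minimal way images, joint states, and six-axis force/torque readings could be projected into a shared token space, mixed with single-head self-attention, and decoded into an ACT-style action chunk. Everything here is an illustrative assumption (the 512-dim image feature, the 7-DoF arm, the chunk length, the random projection matrices), not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64      # shared token width (assumed)
CHUNK = 8   # ACT-style action-chunk length (assumed)
DOF = 7     # joint count for a single arm (assumed)

# Per-modality linear projections into the shared token space.
W_img   = rng.standard_normal((512, D)) / np.sqrt(512)  # image feature -> token
W_joint = rng.standard_normal((DOF, D)) / np.sqrt(DOF)  # joint state  -> token
W_ft    = rng.standard_normal((6, D))   / np.sqrt(6)    # 6-axis F/T   -> token

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse(img_feat, joints, wrench):
    """One token per modality, mixed by single-head self-attention."""
    tokens = np.stack([img_feat @ W_img, joints @ W_joint, wrench @ W_ft])  # (3, D)
    attn = softmax(tokens @ tokens.T / np.sqrt(D))  # (3, 3) attention weights
    return attn @ tokens                            # (3, D) fused tokens

# Readout from the fused tokens to a chunk of future joint targets.
W_out = rng.standard_normal((3 * D, CHUNK * DOF)) / np.sqrt(3 * D)

def policy(img_feat, joints, wrench):
    fused = fuse(img_feat, joints, wrench).reshape(-1)  # flatten fused tokens
    return (fused @ W_out).reshape(CHUNK, DOF)          # (CHUNK, DOF) actions

actions = policy(rng.standard_normal(512), rng.standard_normal(DOF),
                 rng.standard_normal(6))
print(actions.shape)  # (8, 7)
```

The point of the sketch is only that the wrench enters as a first-class token, so attention can weight it heavily during pressing and placement when the image token carries little contact information; a real ACT-style model would use a learned CNN backbone, positional embeddings, and a multi-layer transformer instead of these fixed random projections.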