🤖 AI Summary
This work addresses the slow onboarding of new objects for humanoid robot grasping, which traditionally takes one to two days of data collection, annotation, 3D model acquisition, and training. The authors propose an end-to-end rapid deployment pipeline that reduces this to approximately 30 minutes by combining Roboflow-based automatic annotation to help train a YOLOv8 object detector, Meta's SAM 3D for mesh reconstruction without a dedicated laser scanner, and FoundationPose for zero-shot six-degree-of-freedom pose tracking that uses the SAM 3D mesh as its template. The estimated pose drives a Unity-based inverse kinematics planner, and the resulting joint commands are streamed over UDP to the robot. Evaluated on the Unitree G1 humanoid, the approach achieves an mAP@0.5 of 0.995 and a pose tracking precision of σ < 1.05 mm, grasps successfully at five positions within the workspace, and carries over to an automobile-window glue-application task.
📝 Abstract
Deploying a humanoid robot to manipulate a new object has traditionally required one to two days of effort: data collection, manual annotation, 3D model acquisition, and model training. This paper presents an end-to-end rapid deployment pipeline that integrates three foundation-model components to shorten the onboarding cycle for a new object to approximately 30 minutes: (i) Roboflow-based automatic annotation to assist in training a YOLOv8 object detector; (ii) 3D reconstruction based on Meta SAM 3D, which eliminates the need for a dedicated laser scanner; and (iii) zero-shot 6-DoF pose tracking based on FoundationPose, using the SAM 3D-generated mesh directly as the template. The estimated pose drives a Unity-based inverse kinematics planner, whose joint commands are streamed via UDP to a Unitree G1 humanoid and executed through the Unitree SDK. We demonstrate detection accuracy of mAP@0.5 = 0.995, pose tracking precision of $\sigma < 1.05$ mm, and successful grasping on a real robot at five positions within the workspace. We further verify the generality of the pipeline on an automobile-window glue-application task. The results show that combining foundation models for perception with everyday imaging devices (e.g., smartphones) can substantially lower the deployment barrier for humanoid manipulation tasks.
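The abstract does not describe the wire format used to stream joint commands over UDP, so the following is only a minimal sketch of what such a link could look like, written in Python for illustration. The joint count, the packet layout (a sequence number followed by float32 joint angles in radians), and the function names `pack_joint_command`, `unpack_joint_command`, and `send_joint_command` are all assumptions and are not taken from the paper, Unity, or the Unitree SDK.

```python
import socket
import struct

# Hypothetical values: the paper does not state the joint count or packet layout.
NUM_JOINTS = 7
# Little-endian: one uint32 sequence number, then NUM_JOINTS float32 angles (radians).
PACKET_FORMAT = f"<I{NUM_JOINTS}f"


def pack_joint_command(seq, joint_angles):
    """Serialize one joint command into a fixed-size datagram payload."""
    if len(joint_angles) != NUM_JOINTS:
        raise ValueError(f"expected {NUM_JOINTS} joint angles, got {len(joint_angles)}")
    return struct.pack(PACKET_FORMAT, seq, *joint_angles)


def unpack_joint_command(payload):
    """Inverse of pack_joint_command: return (seq, list of joint angles)."""
    seq, *angles = struct.unpack(PACKET_FORMAT, payload)
    return seq, angles


def send_joint_command(sock, address, seq, joint_angles):
    """Send one command as a single UDP datagram (no delivery guarantee)."""
    sock.sendto(pack_joint_command(seq, joint_angles), address)
```

A sender would create `socket.socket(socket.AF_INET, socket.SOCK_DGRAM)` once and call `send_joint_command` at each planner step. Because UDP can drop or reorder datagrams, the sequence number in this sketch lets a receiver discard stale commands rather than replay them.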