🤖 AI Summary
This work addresses the challenge of enabling robots to autonomously perform contact-rich, force-sensitive fine manipulation tasks, such as peeling, whose success criteria are inherently subjective. To this end, the authors propose a two-stage learning framework: first, a robust initial policy is acquired through force-aware imitation learning; then, a reward model built from quantitative task metrics and human preference feedback is used to fine-tune the policy via preference optimization, aligning its behavior with human judgments of task quality. The study presents this as the first application of human preference learning to fine manipulation. Using only 50-200 demonstration trajectories, the approach achieves over 90% average success rates across diverse produce, including cucumbers, apples, and potatoes, with preference fine-tuning improving performance by up to 40% and strong zero-shot generalization across object types.
📝 Abstract
Many essential manipulation tasks - such as food preparation, surgery, and craftsmanship - remain intractable for autonomous robots. These tasks are characterized not only by contact-rich, force-sensitive dynamics, but also by their "implicit" success criteria: unlike pick-and-place, task quality in these domains is continuous and subjective (e.g., how well a potato is peeled), making quantitative evaluation and reward engineering difficult. We present a learning framework for such tasks, using peeling with a knife as a representative example. Our approach follows a two-stage pipeline: first, we learn a robust initial policy via force-aware data collection and imitation learning, enabling generalization across object variations; second, we refine the policy through preference-based finetuning using a learned reward model that combines quantitative task metrics with qualitative human feedback, aligning policy behavior with human notions of task quality. Using only 50-200 peeling trajectories, our system achieves over 90% average success rates on challenging produce including cucumbers, apples, and potatoes, with performance improving by up to 40% through preference-based finetuning. Remarkably, policies trained on a single produce category exhibit strong zero-shot generalization to unseen in-category instances and to out-of-distribution produce from different categories while maintaining over 90% success rates.
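The abstract describes the reward model only at a high level: it combines quantitative task metrics with qualitative human feedback. As a purely illustrative sketch, not the paper's implementation, the PyTorch snippet below shows one standard way such a model can be trained: a small network scores per-trajectory features (e.g., peel coverage or force smoothness), fit with a Bradley-Terry preference loss on human pairwise comparisons. All names, feature dimensions, and the synthetic data here are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical reward model: maps per-trajectory feature vectors
# (quantitative task metrics) to a scalar quality score.
class RewardModel(nn.Module):
    def __init__(self, feature_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # Returns one scalar reward per trajectory in the batch.
        return self.net(features).squeeze(-1)

def preference_loss(reward_a, reward_b, prefer_a):
    # Bradley-Terry model: P(a preferred over b) = sigmoid(r_a - r_b).
    # prefer_a is 1.0 where the human preferred trajectory a, else 0.0.
    logits = reward_a - reward_b
    return -(prefer_a * F.logsigmoid(logits)
             + (1 - prefer_a) * F.logsigmoid(-logits)).mean()

# Toy training loop on synthetic pairwise comparisons.
model = RewardModel(feature_dim=8)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
feats_a, feats_b = torch.randn(32, 8), torch.randn(32, 8)
prefer_a = torch.randint(0, 2, (32,)).float()
for _ in range(100):
    loss = preference_loss(model(feats_a), model(feats_b), prefer_a)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In a pipeline like the one described above, the scores from such a model would then serve as the reward signal for preference-based policy finetuning; the Bradley-Terry formulation is simply the common choice for learning scalar rewards from pairwise human comparisons.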