UAD: Unsupervised Affordance Distillation for Generalization in Robotic Manipulation

📅 2025-06-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing approaches to manipulating unknown objects in unstructured environments under open-ended instructions rely on manual annotations or are restricted to predefined task sets, which limits generalization. Method: UAD (Unsupervised Affordance Distillation) distills affordance knowledge from foundation models without any human annotation. It combines the complementary strengths of large vision models and vision-language models to automatically label a large-scale dataset of <instruction, visual affordance> pairs, then trains only a lightweight task-conditioned decoder on top of frozen features. Contribution/Results: Although trained only on rendered objects in simulation, the affordance model generalizes to in-the-wild robotic scenes and to various human activities. An imitation learning policy that uses the predicted affordances as its observation space generalizes to unseen object instances, object categories, and variations in task instructions after training on as few as 10 demonstrations.

📝 Abstract

Understanding fine-grained object affordances is imperative for robots to manipulate objects in unstructured environments given open-ended task instructions. However, existing methods for visual affordance prediction often rely on manually annotated data or condition only on a predefined set of tasks. We introduce UAD (Unsupervised Affordance Distillation), a method for distilling affordance knowledge from foundation models into a task-conditioned affordance model without any manual annotations. By leveraging the complementary strengths of large vision models and vision-language models, UAD automatically annotates a large-scale dataset with detailed <instruction, visual affordance> pairs. Training only a lightweight task-conditioned decoder atop frozen features, UAD exhibits notable generalization to in-the-wild robotic scenes and to various human activities, despite being trained only on rendered objects in simulation. Using the affordance provided by UAD as the observation space, we show an imitation learning policy that demonstrates promising generalization to unseen object instances, object categories, and even variations in task instructions after training on as few as 10 demonstrations. Project website: https://unsup-affordance.github.io/
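
One way to picture a task-conditioned decoder on top of frozen features is a small head that modulates per-pixel features with an instruction embedding and outputs a heatmap. The sketch below is only an illustration under assumptions that the abstract does not state: FiLM-style scale-and-shift conditioning, a single linear output layer, and the hypothetical names decode_affordance, w_gamma, w_beta, and w_out. Random numbers stand in for the frozen vision features and the instruction embedding, and the weights are untrained. It is not the authors' implementation.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a (rows x cols) nested-list matrix by a length-cols vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def decode_affordance(features, instruction, w_gamma, w_beta, w_out):
    """Map frozen per-pixel features plus an instruction embedding to a heatmap.

    features:    H x W x C nested lists (stand-in for frozen backbone features)
    instruction: length-D instruction embedding
    w_gamma:     C x D weights producing a per-channel scale (assumed FiLM form)
    w_beta:      C x D weights producing a per-channel shift (assumed FiLM form)
    w_out:       length-C weights of the final linear layer
    Returns an H x W nested list of affordance scores in [0, 1].
    """
    gamma = matvec(w_gamma, instruction)
    beta = matvec(w_beta, instruction)
    heatmap = []
    for row in features:
        out_row = []
        for pixel in row:
            # Scale and shift each channel by the instruction, then apply ReLU.
            modulated = [
                max(0.0, (1.0 + g) * f + b)
                for f, g, b in zip(pixel, gamma, beta)
            ]
            logit = sum(w * m for w, m in zip(w_out, modulated))
            out_row.append(sigmoid(logit))
        heatmap.append(out_row)
    return heatmap


# Toy sizes and random placeholders; every value here is hypothetical.
H, W, C, D = 4, 5, 8, 6
rng = random.Random(0)
features = [[[rng.gauss(0, 1) for _ in range(C)] for _ in range(W)] for _ in range(H)]
instruction = [rng.gauss(0, 1) for _ in range(D)]
w_gamma = [[0.1 * rng.gauss(0, 1) for _ in range(D)] for _ in range(C)]
w_beta = [[0.1 * rng.gauss(0, 1) for _ in range(D)] for _ in range(C)]
w_out = [0.1 * rng.gauss(0, 1) for _ in range(C)]

heatmap = decode_affordance(features, instruction, w_gamma, w_beta, w_out)
print(len(heatmap), len(heatmap[0]))  # heatmap height and width
```

In this sketch only w_gamma, w_beta, and w_out would be trainable, which mirrors the idea of keeping the backbone frozen and training a lightweight head. Changing the instruction embedding changes the per-channel scale and shift, so the same image features can yield a different heatmap for a different task.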

Problem

Research questions and friction points this paper is trying to address.

Unsupervised learning of object affordances for robotic manipulation

Generalization to diverse tasks without manual annotations

Leveraging foundation models for scalable affordance distillation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unsupervised distillation from foundation models

Combines large vision models and vision-language models for automatic annotation

Lightweight decoder for task-conditioned affordance

🔎 Similar Papers

What Foundation Models can Bring for Robot Learning in Manipulation : A Survey