UAD: Unsupervised Affordance Distillation for Generalization in Robotic Manipulation

📅 2025-06-10
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
Existing approaches to visual affordance prediction for robotic manipulation in unstructured environments, given open-ended instructions, rely on manual annotations or are limited to a predefined set of tasks, which restricts generalization. Method: UAD is an unsupervised affordance distillation framework that requires no human annotation. It combines large vision models and vision-language models to automatically label a large-scale dataset of <instruction, visual affordance> pairs, then trains only a lightweight task-conditioned decoder on top of frozen features. Contribution/Results: Although trained only on rendered objects in simulation, the affordance model generalizes to in-the-wild robotic scenes and to various human activities. An imitation learning policy that uses UAD's affordance predictions as its observation space generalizes to unseen object instances, object categories, and variations in task instructions after training on as few as 10 demonstrations.

📝 Abstract
Understanding fine-grained object affordances is imperative for robots to manipulate objects in unstructured environments given open-ended task instructions. However, existing methods of visual affordance prediction often rely on manually annotated data or are conditioned only on a predefined set of tasks. We introduce UAD (Unsupervised Affordance Distillation), a method for distilling affordance knowledge from foundation models into a task-conditioned affordance model without any manual annotations. By leveraging the complementary strengths of large vision models and vision-language models, UAD automatically annotates a large-scale dataset with detailed <instruction, visual affordance> pairs. Training only a lightweight task-conditioned decoder atop frozen features, UAD exhibits notable generalization to in-the-wild robotic scenes and to various human activities, despite only being trained on rendered objects in simulation. Using affordance provided by UAD as the observation space, we show an imitation learning policy that demonstrates promising generalization to unseen object instances, object categories, and even variations in task instructions after training on as few as 10 demonstrations. Project website: https://unsup-affordance.github.io/
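
The idea of training only a lightweight task-conditioned decoder on top of frozen features can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the UAD architecture: the bilinear decoder `W`, the one-hot "pixel features" and "instruction embeddings", and the pseudo-labels (which stand in for the automatically generated <instruction, visual affordance> pairs) are all hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(W, feat, instr):
    """Affordance probability for one pixel: sigmoid(feat^T W instr)."""
    logit = sum(feat[i] * W[i][j] * instr[j]
                for i in range(len(feat)) for j in range(len(instr)))
    return sigmoid(logit)

def bce(p, y, eps=1e-9):
    """Binary cross-entropy between prediction p and pseudo-label y."""
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

def train_decoder(samples, feat_dim, instr_dim, lr=0.5, epochs=200):
    """Fit only the decoder W with SGD; features and embeddings stay frozen."""
    W = [[0.0] * instr_dim for _ in range(feat_dim)]
    for _ in range(epochs):
        for feat, instr, label in samples:
            err = predict(W, feat, instr) - label  # gradient of BCE w.r.t. the logit
            for i in range(feat_dim):
                for j in range(instr_dim):
                    W[i][j] -= lr * err * feat[i] * instr[j]
    return W

def mean_loss(W, samples):
    return sum(bce(predict(W, f, e), y) for f, e, y in samples) / len(samples)

# Hypothetical frozen per-pixel features (stand-ins for a vision backbone's output).
HANDLE, BLADE = [1.0, 0.0], [0.0, 1.0]
# Hypothetical instruction embeddings (stand-ins for a language encoder's output).
GRASP, CUT = [1.0, 0.0], [0.0, 1.0]

# Hypothetical pseudo-labels: (pixel feature, instruction, affordance label).
SAMPLES = [
    (HANDLE, GRASP, 1.0), (BLADE, GRASP, 0.0),
    (HANDLE, CUT, 0.0), (BLADE, CUT, 1.0),
]

W = train_decoder(SAMPLES, feat_dim=2, instr_dim=2)
```

After training, `predict(W, HANDLE, GRASP)` is close to 1 while `predict(W, BLADE, GRASP)` is close to 0, so the same pixel features yield different affordance scores depending on the instruction. Only `W` is updated, which mirrors the frozen-features setup described in the abstract in a much simplified form.
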

Problem

Research questions and friction points this paper is trying to address.

Unsupervised learning of object affordances for robotic manipulation
Generalization to diverse tasks without manual annotations
Leveraging foundation models for scalable affordance distillation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unsupervised distillation from foundation models
Leverages vision and vision-language models
Lightweight decoder for task-conditioned affordance