Learning to Grasp Anything by Playing with Random Toys

📅 2025-10-14

📈 Citations: 0

✨ Influential: 0

career value

194K/year

🤖 AI Summary

Robot generalization to novel objects is fundamentally limited by zero-shot adaptability. Inspired by how children develop manipulation skills through playful interaction, this paper proposes a lightweight, object-centric visual representation framework for generalized robotic grasping. Our method employs a synthetic training set comprising only four basic geometric primitives—randomly arranged as “toys”—to induce compositional object understanding. We design a detection-pooling–driven object-centric representation mechanism and integrate it with a reinforcement learning policy trained exclusively on this minimal synthetic data. Crucially, no real-world target-domain annotations or large-scale domain-specific simulation data are required. The framework enables zero-shot cross-object-category transfer. Evaluated on the YCB-Video dataset with a physical robot, our approach achieves a 67% grasp success rate—outperforming state-of-the-art methods reliant on extensive in-domain data. This validates the efficacy and scalability of the “simple toys → generalizable skill” paradigm.

Technology Category

Application Category

📝 Abstract

Robotic manipulation policies often struggle to generalize to novel objects, limiting their real-world utility. In contrast, cognitive science suggests that children develop generalizable dexterous manipulation skills by mastering a small set of simple toys and then applying that knowledge to more complex items. Inspired by this, we study if similar generalization capabilities can also be achieved by robots. Our results indicate robots can learn generalizable grasping using randomly assembled objects that are composed from just four shape primitives: spheres, cuboids, cylinders, and rings. We show that training on these "toys" enables robust generalization to real-world objects, yielding strong zero-shot performance. Crucially, we find the key to this generalization is an object-centric visual representation induced by our proposed detection pooling mechanism. Evaluated in both simulation and on physical robots, our model achieves a 67% real-world grasping success rate on the YCB dataset, outperforming state-of-the-art approaches that rely on substantially more in-domain data. We further study how zero-shot generalization performance scales by varying the number and diversity of training toys and the demonstrations per toy. We believe this work offers a promising path to scalable and generalizable learning in robotic manipulation. Demonstration videos, code, checkpoints and our dataset are available on our project page: https://lego-grasp.github.io/ .

Problem

Research questions and friction points this paper is trying to address.

Robots struggle to generalize grasping to novel objects

Training with simple shape primitives enables robust generalization

Object-centric visual representation is key for zero-shot performance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Training robots with randomly assembled shape primitives

Using object-centric visual representation for generalization

Achieving zero-shot grasping on real-world objects

🔎 Similar Papers

What Foundation Models can Bring for Robot Learning in Manipulation : A Survey