Learning to Grasp Anything by Playing with Random Toys

📅 2025-10-14
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Robot generalization to novel objects is fundamentally limited by zero-shot adaptability. Inspired by how children develop manipulation skills through playful interaction, this paper proposes a lightweight, object-centric visual representation framework for generalized robotic grasping. Our method employs a synthetic training set comprising only four basic geometric primitives—randomly arranged as “toys”—to induce compositional object understanding. We design a detection-pooling–driven object-centric representation mechanism and integrate it with a reinforcement learning policy trained exclusively on this minimal synthetic data. Crucially, no real-world target-domain annotations or large-scale domain-specific simulation data are required. The framework enables zero-shot cross-object-category transfer. Evaluated on the YCB-Video dataset with a physical robot, our approach achieves a 67% grasp success rate—outperforming state-of-the-art methods reliant on extensive in-domain data. This validates the efficacy and scalability of the “simple toys → generalizable skill” paradigm.

Technology Category

Application Category

📝 Abstract
Robotic manipulation policies often struggle to generalize to novel objects, limiting their real-world utility. In contrast, cognitive science suggests that children develop generalizable dexterous manipulation skills by mastering a small set of simple toys and then applying that knowledge to more complex items. Inspired by this, we study if similar generalization capabilities can also be achieved by robots. Our results indicate robots can learn generalizable grasping using randomly assembled objects that are composed from just four shape primitives: spheres, cuboids, cylinders, and rings. We show that training on these "toys" enables robust generalization to real-world objects, yielding strong zero-shot performance. Crucially, we find the key to this generalization is an object-centric visual representation induced by our proposed detection pooling mechanism. Evaluated in both simulation and on physical robots, our model achieves a 67% real-world grasping success rate on the YCB dataset, outperforming state-of-the-art approaches that rely on substantially more in-domain data. We further study how zero-shot generalization performance scales by varying the number and diversity of training toys and the demonstrations per toy. We believe this work offers a promising path to scalable and generalizable learning in robotic manipulation. Demonstration videos, code, checkpoints and our dataset are available on our project page: https://lego-grasp.github.io/ .
Problem

Research questions and friction points this paper is trying to address.

Robots struggle to generalize grasping to novel objects
Training with simple shape primitives enables robust generalization
Object-centric visual representation is key for zero-shot performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training robots with randomly assembled shape primitives
Using object-centric visual representation for generalization
Achieving zero-shot grasping on real-world objects