Learning to Manipulate Anything: Revealing Data Scaling Laws in Bounding-Box Guided Policies

📅 2026-02-12
🤖 AI Summary
This work addresses the limited generalization of existing diffusion-based policies in semantic manipulation tasks, which struggle to accurately localize target objects in complex environments based on textual instructions. The authors propose a bounding box–guided diffusion policy that integrates object detection with a diffusion model through a semantics-to-motion decoupling framework, leveraging the Label-UMI automated annotation system to efficiently construct a semantically labeled demonstration dataset. They uncover, for the first time, a power-law relationship between generalization performance and the number of objects annotated with bounding boxes, enabling a data-efficient collection strategy. In large-scale real-world experiments, the method achieves an 85% success rate across four task categories on both seen and unseen objects, significantly enhancing generalization and scalability.
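To make the power-law claim concrete, here is a minimal sketch of how such a relationship could be checked: fit y = a * x^b by least squares in log-log space. The summary does not state which metric was fitted or what the coefficients were, so the data below is synthetic and the names (`fit_power_law`, `num_objects`, `error`) are illustrative assumptions, not the authors' code.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**b by ordinary least squares on (log x, log y)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    b = sxy / sxx                      # slope in log-log space = exponent
    a = math.exp(mean_y - b * mean_x)  # intercept in log-log space = log(a)
    return a, b


# Synthetic, made-up data (NOT results from the paper): an error-like score
# that shrinks as more distinct objects are annotated with bounding boxes.
num_objects = [4, 8, 16, 32, 64]
error = [0.8 * n ** -0.5 for n in num_objects]

a, b = fit_power_law(num_objects, error)
print(f"fitted: error ~ {a:.3f} * n^{b:.3f}")
```

A straight line on a log-log plot is the usual visual signature of a power law. With real measurements the fit would be noisy, and the exponent indicates how quickly additional objects stop paying off.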

πŸ“ Abstract
Diffusion-based policies show limited generalization in semantic manipulation, posing a key obstacle to the deployment of real-world robots. This limitation arises because text instructions alone are inadequate to direct the policy's attention toward the target object in complex and dynamic environments. To solve this problem, we propose using bounding-box instructions to directly specify the target object, and we further investigate whether data scaling laws exist in semantic manipulation tasks. Specifically, we design a handheld segmentation device with an automated annotation pipeline, Label-UMI, which enables the efficient collection of demonstration data with semantic labels. We further propose a semantic-motion-decoupled framework that integrates object detection with a bounding-box guided diffusion policy to improve generalization and adaptability in semantic manipulation. Through extensive real-world experiments on large-scale datasets, we validate the effectiveness of the approach and reveal a power-law relationship between generalization performance and the number of bounding-box objects. Finally, we summarize an effective data collection strategy for semantic manipulation, which achieves an 85% success rate across four tasks on both seen and unseen objects. All datasets and code will be released to the community.
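The semantic-motion decoupling described above can be pictured roughly as follows: a detector resolves the instruction to a bounding box, and the policy is then conditioned on that box rather than on the text. This is only a sketch under stated assumptions. The abstract does not specify the detector, the box encoding, or how the box is fed to the diffusion policy, so every name here (`BoundingBox`, `select_target`, `normalize_box`, `build_condition`) is hypothetical, and the label-matching rule is a crude stand-in for a real open-vocabulary detector.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class BoundingBox:
    """Axis-aligned box in pixel coordinates (hypothetical encoding)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


# (label, confidence, box) as a stand-in for a detector's output format.
Detection = Tuple[str, float, BoundingBox]


def select_target(detections: Sequence[Detection], instruction: str) -> BoundingBox:
    """Semantic stage: pick the most confident detection named in the instruction."""
    text = instruction.lower()
    matches = [d for d in detections if d[0].lower() in text]
    if not matches:
        raise LookupError("no detected object matches the instruction")
    return max(matches, key=lambda d: d[1])[2]


def normalize_box(box: BoundingBox, width: int, height: int) -> List[float]:
    """Scale pixel coordinates to [0, 1] so the encoding is resolution independent."""
    if width <= 0 or height <= 0:
        raise ValueError("image size must be positive")

    def clamp(v: float) -> float:
        return min(1.0, max(0.0, v))

    return [
        clamp(box.x_min / width),
        clamp(box.y_min / height),
        clamp(box.x_max / width),
        clamp(box.y_max / height),
    ]


def build_condition(obs_features: Sequence[float], box: BoundingBox,
                    width: int, height: int) -> List[float]:
    """Motion stage input: observation features concatenated with the box (assumed scheme)."""
    return list(obs_features) + normalize_box(box, width, height)


detections = [
    ("mug", 0.91, BoundingBox(64, 48, 320, 240)),
    ("bottle", 0.88, BoundingBox(400, 100, 500, 300)),
]
target = select_target(detections, "Pick up the mug")
condition = build_condition([0.2, -0.1], target, width=640, height=480)
print(condition)
```

The point of the split is that the motion policy never has to interpret language: swapping in a better detector, or an unseen object category, changes only the box that is passed in.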
Problem

Research questions and friction points this paper is trying to address.

semantic manipulation
diffusion policy
generalization
bounding-box instruction
data scaling laws
Innovation

Methods, ideas, or system contributions that make the work stand out.

bounding-box guided policy
data scaling laws
semantic manipulation
diffusion policy
Label-UMI
🔎 Similar Papers
Yihao Wu
Center for Intelligent Control and Telescience, Tsinghua Shenzhen International Graduate School, Shenzhen, China.
Jinming Ma
University of Science and Technology of China
reinforcement learning
Junbo Tan
Center for Intelligent Control and Telescience, Tsinghua Shenzhen International Graduate School, Shenzhen, China.
Yanzhao Yu
Center for Intelligent Control and Telescience, Tsinghua Shenzhen International Graduate School, Shenzhen, China.
Shoujie Li
Tsinghua University
Robot Sensing, Grasping, Embodied AI
Mingliang Zhou
Beijing Xiaomi Robot Technology Co., Ltd 602, 6th Floor, Building 5, No. 15 10th Kechuang Street, Beijing Economic-Technological Development Area, Beijing, China, 100176
Diyun Xiang
Beijing Xiaomi Robot Technology Co., Ltd 602, 6th Floor, Building 5, No. 15 10th Kechuang Street, Beijing Economic-Technological Development Area, Beijing, China, 100176
Xueqian Wang
Tsinghua University
Information Fusion, Target Detection, Radar Imaging, Image Processing