DataPlatter: Boosting Robotic Manipulation Generalization with Minimal Costly Data

📅 2025-03-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited generalization of Vision-Language-Action (VLA) models in robotic manipulation. Failures concentrate in the spatial reasoning phase (SRP) over large workspaces, because costly demonstrations cannot cover every scenario. We propose DataPlatter, a training paradigm that decouples manipulation trajectories into distinct task stages: the SRP, whose data can be collected at low cost, and the physical interaction phase (PIP), whose data is expensive. Sub-task-specific training then combines abundant SRP trajectories with a limited set of PIP trajectories in a suitable proportion, so that the inexpensive data maximizes what the model learns from the scarce PIP data. The method improves success rate in zero-shot scenes by up to 41% and transfers manipulation skills to novel targets.

📝 Abstract
The growing adoption of Vision-Language-Action (VLA) models in embodied AI intensifies the demand for diverse manipulation demonstrations. However, the high cost of data collection often results in insufficient data coverage across scenarios, which limits model performance. We observe that the spatial reasoning phase (SRP) in large workspaces dominates the failure cases. Fortunately, SRP data can be collected at low cost, underscoring the potential of leveraging inexpensive data to improve model performance. In this paper, we introduce DataPlatter, a framework that decouples training trajectories into distinct task stages and leverages abundant, easily collectible SRP data to enhance the generalization of VLA models. Through analysis, we demonstrate that sub-task-specific training with additional SRP data in a proper proportion can act as a performance catalyst for robot manipulation, maximizing the utilization of costly physical interaction phase (PIP) data. Experiments show that introducing a large proportion of cost-effective SRP trajectories into a limited set of PIP data yields a maximum improvement of 41% in success rate in zero-shot scenes, along with the ability to transfer manipulation skills to novel targets.
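
The idea of adding SRP trajectories to a limited PIP set in a chosen proportion can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `mix_phase_data`, the `srp_per_pip` parameter, and the uniform sampling strategy are all assumptions made for illustration, and the abstract does not state which proportion works best.

```python
import random


def mix_phase_data(pip_trajs, srp_trajs, srp_per_pip, seed=0):
    """Combine every PIP trajectory with a sampled subset of SRP trajectories.

    srp_per_pip is a hypothetical knob: the number of SRP trajectories
    to include per PIP trajectory (capped by what is available).
    """
    if srp_per_pip < 0:
        raise ValueError("srp_per_pip must be non-negative")
    rng = random.Random(seed)
    # Number of SRP trajectories requested by the ratio, capped by availability.
    n_srp = min(len(srp_trajs), int(round(srp_per_pip * len(pip_trajs))))
    sampled_srp = rng.sample(srp_trajs, n_srp)
    # Tag each trajectory with its phase so training can treat phases separately.
    mixed = [("PIP", t) for t in pip_trajs] + [("SRP", t) for t in sampled_srp]
    rng.shuffle(mixed)
    return mixed


pip = [f"pip_{i}" for i in range(10)]
srp = [f"srp_{i}" for i in range(100)]
mixed = mix_phase_data(pip, srp, srp_per_pip=4)
print(len(mixed))
```

Keeping the phase tag on each item is a design choice of this sketch; it leaves room for phase-specific handling during training without committing to any particular loss or schedule.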

Problem

Research questions and friction points this paper is trying to address.

Enhancing robotic manipulation generalization with low-cost data
Addressing insufficient data coverage in Vision-Language-Action models
Leveraging spatial reasoning phase data to boost model performance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Decouples training trajectories into task stages
Leverages low-cost SRP data for model enhancement
Maximizes utilization of costly PIP data
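
Decoupling a trajectory into task stages can be sketched as follows. The splitting rule here is purely an assumption for illustration: the summary does not say how the boundary between SRP and PIP is determined, so this example treats the first step at which the end effector comes within a hypothetical `contact_radius` of the target as the start of the PIP.

```python
import math


def split_trajectory(ee_positions, target, contact_radius):
    """Split a trajectory into (srp_steps, pip_steps).

    Assumed rule (not from the paper): the PIP begins at the first step
    where the end effector is within contact_radius of the target.
    """
    for i, pos in enumerate(ee_positions):
        if math.dist(pos, target) <= contact_radius:
            return ee_positions[:i], ee_positions[i:]
    # The end effector never reached the target: the whole trajectory is SRP.
    return ee_positions, []


positions = [(0.5, 0.0, 0.0), (0.3, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.0, 0.0)]
srp_steps, pip_steps = split_trajectory(positions, (0.0, 0.0, 0.0), 0.15)
print(len(srp_steps), len(pip_steps))
```

A trajectory that never approaches the target yields an empty PIP segment, which matches the intuition that reaching-only data contributes to the SRP alone.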
Liming Zheng
Meituan Inc.
Feng Yan
Meituan Inc.
Fanfan Liu
Researcher, Meituan
Computer vision, Multi-modal, Embodied AI
Chengjian Feng
Meituan
Computer Vision, Object Detection
Yufeng Zhong
Meituan
Multimodal LLM, Computer Vision
Yiyang Huang
Institute of Computing Technology, Chinese Academy of Sciences
Lin Ma
Meituan Inc.