AI Summary
This work addresses key limitations of general-purpose robotic vision-language-action models, namely data scarcity, architectural inefficiency, and poor cross-platform generalization. Building on a 7B-parameter vision-language foundation model, the authors collect over 10,000 hours of open-source, multi-platform manipulation data using an enhanced, embodiment-agnostic Universal Manipulation Interface (UMI), and employ a three-stage training pipeline that aligns language instructions with continuous control signals via residual vector quantization (RVQ), flow matching, and knowledge distillation. The resulting model achieves, for the first time, simultaneous zero-shot generalization across unseen objects, environments, instructions, and robot platforms, while supporting real-time inference. It significantly outperforms state-of-the-art methods on dexterous manipulation, long-horizon tasks, and dynamic scenarios such as table tennis.
Abstract
Vision-Language-Action (VLA) models hold promise for generalist robotics but currently struggle with data scarcity, architectural inefficiency, and an inability to generalize across different hardware platforms. We introduce RDT2, a robotic foundation model built upon a 7B-parameter VLM and designed to enable zero-shot deployment on novel embodiments for open-vocabulary tasks. To achieve this, we collected one of the largest open-source robotic datasets, over 10,000 hours of demonstrations across diverse task families, using an enhanced, embodiment-agnostic Universal Manipulation Interface (UMI). Our approach employs a novel three-stage training recipe that aligns discrete linguistic knowledge with continuous control via Residual Vector Quantization (RVQ), flow matching, and distillation for real-time inference. Consequently, RDT2 is among the first models to simultaneously generalize zero-shot to unseen objects, scenes, instructions, and even robot platforms. Moreover, it outperforms state-of-the-art baselines on dexterous, long-horizon, and dynamic downstream tasks such as playing table tennis. See https://rdt-robotics.github.io/rdt2/ for more information.
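To make the RVQ component of the training recipe concrete, the sketch below shows the basic mechanics of residual vector quantization: a stack of codebooks where each stage quantizes the residual left by the previous one, so a continuous vector (e.g. an action) maps to a short sequence of discrete token indices. The codebook sizes, dimensions, and random initialization here are purely illustrative assumptions, not RDT2's actual configuration.

```python
# Minimal sketch of residual vector quantization (RVQ).
# Codebooks here are random for illustration; in practice they are learned.
import numpy as np

def rvq_encode(x, codebooks):
    """Quantize x with a stack of codebooks, each stage coding the residual
    left by the previous stage. Returns code indices and the reconstruction."""
    residual = x.copy()
    indices = []
    recon = np.zeros_like(x)
    for cb in codebooks:
        # pick the codeword nearest to the current residual
        dists = np.linalg.norm(cb - residual, axis=1)
        k = int(np.argmin(dists))
        indices.append(k)
        recon = recon + cb[k]
        residual = residual - cb[k]
    return indices, recon

rng = np.random.default_rng(0)
# 3 stages with 16 codewords each over a 4-dim vector (illustrative sizes)
codebooks = [rng.normal(size=(16, 4)) for _ in range(3)]
x = rng.normal(size=4)
idx, x_hat = rvq_encode(x, codebooks)
```

With trained codebooks, each extra stage refines the reconstruction, which is what lets a language-model-style discrete vocabulary represent continuous control signals at useful precision.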