UniCoD: Enhancing Robot Policy via Unified Continuous and Discrete Representation Learning

📅 2025-10-12
🤖 AI Summary
To address the limited generalization of robotic policies in open-world environments, this paper proposes UniCoD, the first unified robotic policy framework that jointly models visual-language understanding, task planning, and continuous future representation learning. UniCoD leverages large-scale instructional video pretraining to integrate a vision-language model (VLM) and a vision-generation model (VGM), co-learning discrete task instruction embeddings and continuous action/state representations. This enables cross-task and out-of-distribution action token mapping, as well as end-to-end fine-tuning. Evaluated on both simulated and real-world out-of-distribution tasks, UniCoD consistently outperforms existing baselines by 9% and 12%, respectively, demonstrating substantial improvements in robots' ability to comprehend, reason about, and execute diverse, open-ended tasks.

๐Ÿ“ Abstract
Building generalist robot policies that can handle diverse tasks in open-ended environments is a central challenge in robotics. To leverage knowledge from large-scale pretraining, prior work has typically built generalist policies either on top of vision-language understanding models (VLMs) or on generative models. However, both semantic understanding from vision-language pretraining and visual dynamics modeling from visual-generation pretraining are crucial for embodied robots. Recent unified models of generation and understanding have demonstrated strong capabilities in both comprehension and generation through large-scale pretraining. We posit that robotic policy learning can likewise benefit from the combined strengths of understanding, planning, and continuous future representation learning. Building on this insight, we introduce UniCoD, which acquires the ability to dynamically model high-dimensional visual features through pretraining on over 1M internet-scale instructional manipulation videos. Subsequently, UniCoD is fine-tuned on data collected from the robot embodiment, enabling the learning of mappings from predictive representations to action tokens. Extensive experiments show our approach consistently outperforms baseline methods by 9% and 12% in simulation environments and on real-world out-of-distribution tasks, respectively.
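The two-stage recipe the abstract describes (pretrain a predictor of continuous future visual features on video, then fine-tune a mapping from the predicted representations to discrete action tokens) can be illustrated with a toy sketch. This is a hypothetical, stdlib-only illustration, not the paper's code: the feature dimension, the linear "dynamics" model, and the nearest-prototype action head are all assumptions chosen for clarity.

```python
import random

random.seed(0)

DIM = 3  # illustrative feature dimension, not from the paper

# Synthetic stand-in for video dynamics: the future feature is the
# current feature shifted by a fixed, unknown delta.
TRUE_DELTA = [0.5, -0.2, 0.1]

def next_feature(f):
    return [fi + d for fi, d in zip(f, TRUE_DELTA)]

# Stage 1 (pretraining): learn to predict the continuous future feature
# from (current, future) pairs, here via SGD on a squared-error loss.
delta = [0.0] * DIM
lr = 0.1
for _ in range(500):
    f = [random.uniform(-1, 1) for _ in range(DIM)]
    target = next_feature(f)
    pred = [fi + di for fi, di in zip(f, delta)]
    # gradient of 0.5 * ||pred - target||^2 w.r.t. delta is (pred - target)
    delta = [di - lr * (p - t) for di, p, t in zip(delta, pred, target)]

# Stage 2 (fine-tuning): map the *predicted* future feature to a discrete
# action token; a nearest-prototype lookup stands in for the action head.
ACTION_PROTOTYPES = {0: [1.0, 0.0, 0.0], 1: [0.0, 1.0, 0.0], 2: [0.0, 0.0, 1.0]}

def action_token(f):
    pred = [fi + di for fi, di in zip(f, delta)]
    return min(ACTION_PROTOTYPES,
               key=lambda a: sum((p - q) ** 2
                                 for p, q in zip(pred, ACTION_PROTOTYPES[a])))

print([round(d, 2) for d in delta])  # learned delta approaches TRUE_DELTA
```

The point of the sketch is the division of labor: stage 1 only ever sees observation pairs (as instructional videos provide), while stage 2 attaches a discrete action interface on top of the learned predictive representation, mirroring the fine-tuning on robot-embodiment data.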
Problem

Research questions and friction points this paper is trying to address.

Building generalist robot policies for diverse open-ended environments
Combining semantic understanding and visual dynamics modeling for robots
Learning mappings from predictive representations to action tokens
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified continuous and discrete representation learning
Pretraining on million-scale instructional videos
Mapping predictive representations to action tokens
Jianke Zhang
Tsinghua University, IIIS
Embodied AI. VLM. Multimodal Learning
Yucheng Hu
Institute for Interdisciplinary Information Sciences, Tsinghua University
Yanjiang Guo
Tsinghua University
Embodied AI. Generative Model
Xiaoyu Chen
Institute for Interdisciplinary Information Sciences, Tsinghua University
Yichen Liu
Institute for Interdisciplinary Information Sciences, Tsinghua University
Wenna Chen
Peking University
Chaochao Lu
Shanghai AI Laboratory
Causal AI
Jianyu Chen
Assistant Professor, Tsinghua University
AI. Robotics