Learning and Transferring Better with Depth Information in Visual Reinforcement Learning

📅 2025-07-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address weak multimodal representation generalization and low sim-to-real transfer efficiency in vision-based reinforcement learning, this paper proposes a visual Transformer framework integrating RGB and depth modalities. Methodologically, it employs dual CNN stems to separately encode RGB and depth inputs, introduces contrastive unsupervised learning using masked and unmasked tokens to enhance representation quality, and designs a domain-randomization-driven dynamic curriculum learning strategy to optimize cross-domain adaptation. The key contributions are a novel multimodal contrastive representation learning mechanism and an adaptive, sim-to-real-oriented curriculum design. Experiments demonstrate substantial improvements in sample efficiency and policy generalization capability; the approach achieves superior robustness and performance on sim-to-real transfer tasks across multiple robotic manipulation benchmarks.

📝 Abstract
Depth information is robust to scene appearance variations and inherently carries 3D spatial details. In this paper, a visual backbone based on the vision transformer is proposed to fuse RGB and depth modalities for enhanced generalization. The two modalities are first processed by separate CNN stems, and the combined convolutional features are delivered to a scalable vision transformer to obtain visual representations. Moreover, a contrastive unsupervised learning scheme with masked and unmasked tokens is designed to improve sample efficiency during the reinforcement learning process. For sim2real transfer, a flexible curriculum learning schedule is developed to deploy domain randomization over the training process.
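The masked/unmasked contrastive scheme described in the abstract can be sketched as an InfoNCE-style objective that pairs each masked-view embedding with its unmasked counterpart. This is a minimal illustration, not the paper's implementation: the embedding dimension, batch size, temperature, and pairing-on-the-diagonal convention are all assumptions.

```python
import numpy as np

def info_nce(masked_emb, unmasked_emb, temperature=0.1):
    """Contrastive loss: masked view i is the positive for unmasked view i;
    every other pair in the batch serves as a negative."""
    # L2-normalize so dot products become cosine similarities.
    m = masked_emb / np.linalg.norm(masked_emb, axis=1, keepdims=True)
    u = unmasked_emb / np.linalg.norm(unmasked_emb, axis=1, keepdims=True)
    logits = m @ u.T / temperature               # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives lie on the diagonal of the similarity matrix.
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 32))
# Perfectly aligned views give a near-zero loss; unrelated views do not.
loss_aligned = info_nce(emb, emb)
loss_random = info_nce(emb, rng.normal(size=(8, 32)))
```

In the paper's setting the two views would come from the same fused RGB-D token sequence, one pass with tokens masked and one without, so the encoder is pushed to produce representations that survive masking.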
Problem

Research questions and friction points this paper is trying to address.

Enhancing generalization in visual reinforcement learning using depth information
Improving sample efficiency via contrastive unsupervised learning with masked tokens
Facilitating sim2real transfer through flexible curriculum learning and domain randomization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fuses RGB and depth via vision transformer
Uses contrastive learning with masked tokens
Implements curriculum learning for sim2real
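The curriculum idea in the last bullet, ramping domain randomization up as training progresses, can be sketched as a simple schedule. The warm-up fraction, ramp shape, and ±30% perturbation band below are illustrative assumptions, not values taken from the paper.

```python
import random

def randomization_strength(step, total_steps, warmup_frac=0.2, power=1.0):
    """Ramp domain-randomization intensity from 0 to 1 over training.
    warmup_frac and power are illustrative knobs, not values from the paper."""
    warmup = warmup_frac * total_steps
    if step < warmup:
        return 0.0  # start on the nominal, unrandomized scene
    progress = (step - warmup) / (total_steps - warmup)
    return min(1.0, progress ** power)

def randomize_scene(nominal_params, strength, rng):
    """Perturb each scene parameter within a band scaled by `strength`."""
    return {name: value * (1.0 + strength * rng.uniform(-0.3, 0.3))
            for name, value in nominal_params.items()}

rng = random.Random(0)
nominal = {"light_intensity": 1.0, "texture_scale": 1.0, "camera_fov_deg": 60.0}
strengths = [randomization_strength(s, total_steps=10_000)
             for s in (0, 2_000, 6_000, 10_000)]
scene = randomize_scene(nominal, strengths[-1], rng)
```

Starting on the nominal scene lets the policy learn the task first; widening the randomization band afterwards trades some training stability for robustness to the appearance gap at deployment.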
Zichun Xu
State Key Laboratory of Robotics and Systems, Harbin Institute of Technology, Harbin 150001, Heilongjiang Province, China
Yuntao Li
Peking University
Zhaomin Wang
State Key Laboratory of Robotics and Systems, Harbin Institute of Technology, Harbin 150001, Heilongjiang Province, China
Lei Zhuang
State Key Laboratory of Robotics and Systems, Harbin Institute of Technology, Harbin 150001, Heilongjiang Province, China
Guocai Yang
State Key Laboratory of Robotics and Systems, Harbin Institute of Technology, Harbin 150001, Heilongjiang Province, China
Jingdong Zhao
State Key Laboratory of Robotics and Systems, Harbin Institute of Technology, Harbin 150001, Heilongjiang Province, China