🤖 AI Summary
To address weak multimodal representation generalization and low sim-to-real transfer efficiency in vision-based reinforcement learning, this paper proposes a visual Transformer framework integrating RGB and depth modalities. Methodologically, it employs dual CNN stems to separately encode RGB and depth inputs, introduces contrastive unsupervised learning using masked and unmasked tokens to enhance representation quality, and designs a domain-randomization-driven dynamic curriculum learning strategy to optimize cross-domain adaptation. The key contributions are a novel multimodal contrastive representation learning mechanism and an adaptive, sim-to-real-oriented curriculum design. Experiments demonstrate substantial improvements in sample efficiency and policy generalization capability; the approach achieves superior robustness and performance on sim-to-real transfer tasks across multiple robotic manipulation benchmarks.
📝 Abstract
Depth information is robust to variations in scene appearance and inherently carries 3D spatial details. In this paper, a visual backbone based on the vision transformer is proposed to fuse the RGB and depth modalities for enhanced generalization. The two modalities are first processed by separate CNN stems, and the combined convolutional features are delivered to a scalable vision transformer to obtain visual representations. Moreover, a contrastive unsupervised learning scheme over masked and unmasked tokens is designed to improve sample efficiency during reinforcement learning. For sim-to-real transfer, a flexible curriculum learning schedule is developed that deploys domain randomization over the course of training.
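The dual-stem fusion idea can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: simple linear patch projections stand in for the CNN stems, a single self-attention layer stands in for the full vision transformer, and all weights are random placeholders. The key point it shows is that RGB and depth are tokenized by separate stems and then concatenated into one token sequence before attention.

```python
import numpy as np

def patchify(img, patch):
    # (C, H, W) -> (num_patches, C * patch * patch)
    C, H, W = img.shape
    ph, pw = H // patch, W // patch
    p = img.reshape(C, ph, patch, pw, patch).transpose(1, 3, 0, 2, 4)
    return p.reshape(ph * pw, C * patch * patch)

def stem(img, W_proj, patch=8):
    # linear projection standing in for the paper's CNN stem (assumption)
    return patchify(img, patch) @ W_proj

def self_attention(tokens, Wq, Wk, Wv):
    # one attention layer standing in for the scalable ViT (assumption)
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
d = 32                                        # token embedding dimension
rgb = rng.standard_normal((3, 32, 32))        # 3-channel RGB observation
depth = rng.standard_normal((1, 32, 32))      # 1-channel depth observation
W_rgb = rng.standard_normal((3 * 8 * 8, d)) * 0.02
W_depth = rng.standard_normal((1 * 8 * 8, d)) * 0.02
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.02 for _ in range(3))

# separate stems per modality, then one fused token sequence
tokens = np.concatenate([stem(rgb, W_rgb), stem(depth, W_depth)], axis=0)
out = self_attention(tokens, Wq, Wk, Wv)
print(tokens.shape, out.shape)  # 16 RGB + 16 depth patches -> (32, 32) (32, 32)
```

Because attention mixes the concatenated sequence, every output token can attend across both modalities, which is the fusion mechanism the abstract describes; the contrastive masked/unmasked-token objective and the domain-randomization curriculum would be layered on top of this backbone.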