🤖 AI Summary
To address weak multimodal representation generalization and low sim-to-real transfer efficiency in vision-based reinforcement learning, this paper proposes a visual Transformer framework integrating RGB and depth modalities. Methodologically, it employs dual CNN stems to separately encode RGB and depth inputs, introduces contrastive unsupervised learning using masked and unmasked tokens to enhance representation quality, and designs a domain-randomization-driven dynamic curriculum learning strategy to optimize cross-domain adaptation. The key contributions are a novel multimodal contrastive representation learning mechanism and an adaptive, sim-to-real-oriented curriculum design. Experiments demonstrate substantial improvements in sample efficiency and policy generalization capability; the approach achieves superior robustness and performance on sim-to-real transfer tasks across multiple robotic manipulation benchmarks.
📝 Abstract
Depth information is robust to variations in scene appearance and inherently carries 3D spatial details. In this paper, a visual backbone based on the vision transformer is proposed to fuse the RGB and depth modalities for enhanced generalization. The two modalities are first processed by separate CNN stems, and the combined convolutional features are delivered to a scalable vision transformer to obtain visual representations. Moreover, a contrastive unsupervised learning scheme over masked and unmasked tokens is designed to improve sample efficiency during reinforcement learning. For sim-to-real transfer, a flexible curriculum learning schedule is developed that deploys domain randomization over the course of training.
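The dual-stem fusion idea can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: simple linear patch projections stand in for the CNN stems, a single self-attention layer stands in for the full vision transformer, and all weights are random placeholders. The key point it shows is that RGB and depth are tokenized by separate stems and then concatenated into one token sequence before attention.

```python
import numpy as np

def patchify(img, patch):
    # (C, H, W) -> (num_patches, C * patch * patch)
    C, H, W = img.shape
    ph, pw = H // patch, W // patch
    p = img.reshape(C, ph, patch, pw, patch).transpose(1, 3, 0, 2, 4)
    return p.reshape(ph * pw, C * patch * patch)

def stem(img, W_proj, patch=8):
    # linear projection standing in for the paper's CNN stem (assumption)
    return patchify(img, patch) @ W_proj

def self_attention(tokens, Wq, Wk, Wv):
    # one attention layer standing in for the scalable ViT (assumption)
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
d = 32                                        # token embedding dimension
rgb = rng.standard_normal((3, 32, 32))        # 3-channel RGB observation
depth = rng.standard_normal((1, 32, 32))      # 1-channel depth observation
W_rgb = rng.standard_normal((3 * 8 * 8, d)) * 0.02
W_depth = rng.standard_normal((1 * 8 * 8, d)) * 0.02
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.02 for _ in range(3))

# separate stems per modality, then one fused token sequence
tokens = np.concatenate([stem(rgb, W_rgb), stem(depth, W_depth)], axis=0)
out = self_attention(tokens, Wq, Wk, Wv)
print(tokens.shape, out.shape)  # 16 RGB + 16 depth patches -> (32, 32) (32, 32)
```

Because attention mixes the concatenated sequence, every output token can attend across both modalities, which is the fusion mechanism the abstract describes; the contrastive masked/unmasked-token objective and the domain-randomization curriculum would be layered on top of this backbone.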