AI Summary
This work addresses the low sample efficiency and sim-to-real transfer challenges of end-to-end learning for monocular vision-based autonomous flight. The authors propose a "real → simulation → real" paradigm that decouples representation learning from policy optimization. They reconstruct high-fidelity simulation environments with geometry-constrained 3D Gaussian Splatting and use contrastive learning to extract robust, low-dimensional visual features for efficient visuomotor policy training. The study presents the first integration of 3D Gaussian Splatting with contrastive reinforcement learning, enabling zero-shot cross-domain transfer without fine-tuning. Experimental results show that the proposed method significantly outperforms baseline approaches in both simulated and real-world settings, narrowing the domain performance gap and generalizing to unseen environments with complex textures.
Abstract
Learning visuomotor policies for Autonomous Aerial Vehicles (AAVs) relying solely on monocular vision is an attractive yet highly challenging paradigm. Existing end-to-end learning approaches directly map high-dimensional RGB observations to action commands, which frequently suffer from low sample efficiency and severe sim-to-real gaps due to the visual discrepancy between simulation and physical domains. To address these long-standing challenges, we propose GaussFly, a novel framework that explicitly decouples representation learning from policy optimization through a cohesive real-to-sim-to-real paradigm. First, to achieve a high-fidelity real-to-sim transition, we reconstruct training scenes using 3D Gaussian Splatting (3DGS) augmented with explicit geometric constraints. Second, to ensure robust sim-to-real transfer, we leverage these photorealistic simulated environments and employ contrastive representation learning to extract compact, noise-resilient latent features from the rendered RGB images. By utilizing this pre-trained encoder to provide low-dimensional feature inputs, the computational burden on the visuomotor policy is significantly reduced while its resistance against visual noise is inherently enhanced. Extensive experiments in simulated and real-world environments demonstrate that GaussFly achieves superior sample efficiency and asymptotic performance compared to baselines. Crucially, it enables robust and zero-shot policy transfer to unseen real-world environments with complex textures, effectively bridging the sim-to-real gap.
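The abstract does not state which contrastive objective GaussFly uses. As an illustration only, the sketch below assumes an InfoNCE-style loss, a common choice for contrastive representation learning: embeddings of two augmented views of the same rendered image are pulled together, while the other images in the batch serve as negatives. The function names, the temperature value, and the plain-list embedding format are hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def info_nce_loss(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch (assumed objective, not the paper's).

    anchors[i] and positives[i] are embeddings of two views of the same
    image; positives[j] for j != i act as negatives for anchors[i].
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Similarity of anchor i to every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching view as the correct class.
        total += log_denom - logits[i]
    return total / len(a)


# Matching views give a small loss; swapped views give a large one.
aligned = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In a decoupled pipeline of the kind the abstract describes, an encoder trained with such an objective would be frozen afterwards, and only its low-dimensional output would be passed to the policy.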