🤖 AI Summary
This work addresses closed-loop robustness for end-to-end visual navigation under unseen natural-language instructions and out-of-distribution visual inputs. Methodologically, we propose Flex, a lightweight framework that freezes a pre-trained multimodal foundation model (e.g., CLIP) and repurposes it as a spatially aware, word-level joint semantic-visual feature extractor. We introduce patch-wise feature alignment and spatial attention embedding, trained via behavior cloning. Our key contributions are: (1) the first realization of a spatial-semantic disentangled representation derived from vision-language models (VLMs) for closed-loop navigation; and (2) strong generalization to complex real-world environments using only limited simulation data, enabling reliable closed-loop flight across diverse novel targets and natural-language commands. Flex significantly improves robustness to both instruction distribution shifts and visual domain shifts without fine-tuning the VLM backbone.
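The patch-wise alignment described above can be illustrated schematically. The sketch below is not the authors' code: the array dimensions, the random stand-ins for the frozen CLIP encoders, and the softmax-pooling step are all illustrative assumptions. It only shows the general shape of the idea: per-patch visual features are aligned with per-word instruction features to produce a spatially indexed joint embedding for a downstream policy head.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not the paper's values):
# a 7x7 ViT patch grid, a 6-token instruction, 512-d features.
num_patches, num_words, dim = 49, 6, 512

# Stand-ins for the FROZEN VLM outputs -- in the real system these
# would come from a pre-trained CLIP backbone that is never fine-tuned.
patch_feats = rng.standard_normal((num_patches, dim))  # image patches
word_feats = rng.standard_normal((num_words, dim))     # instruction words

def normalize(x, axis=-1):
    """L2-normalize feature vectors for cosine similarity."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Patch-wise feature alignment: cosine similarity between every patch
# and every word gives a (num_patches, num_words) spatial-semantic map.
sim = normalize(patch_feats) @ normalize(word_feats).T

# Spatial attention embedding (hypothetical pooling choice): softmax
# over words, then pool word features into each patch so each spatial
# location carries both visual and language information.
attn = np.exp(sim) / np.exp(sim).sum(axis=1, keepdims=True)
joint = np.concatenate([patch_feats, attn @ word_feats], axis=1)

print(joint.shape)  # (49, 1024): one joint embedding per patch
```

A behavior-cloned control head would then consume `joint` (or a pooled version of it) to predict actions, with gradients flowing only through the lightweight head, never into the frozen backbone.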
📝 Abstract
End-to-end learning directly maps sensory inputs to actions, creating highly integrated and efficient policies for complex robotics tasks. However, such models often struggle to generalize beyond their training scenarios, limiting adaptability to new environments, tasks, and concepts. In this work, we investigate the minimal data requirements and architectural adaptations necessary to achieve robust closed-loop performance with vision-based control policies under unseen text instructions and visual distribution shifts. Our findings are synthesized in Flex (Fly lexically), a framework that uses pre-trained Vision Language Models (VLMs) as frozen patch-wise feature extractors, generating spatially aware embeddings that integrate semantic and visual information. We demonstrate the effectiveness of this approach on a quadrotor fly-to-target task, where agents trained via behavior cloning on a small simulated dataset successfully generalize to real-world scenes with diverse novel goals and command formulations.