Flex: End-to-End Text-Instructed Visual Navigation from Foundation Model Features

📅 2024-10-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of closed-loop robustness for end-to-end visual navigation under unseen natural-language instructions and out-of-distribution visual inputs. Methodologically, we propose Flex, a lightweight framework that freezes a pre-trained multimodal foundation model (e.g., CLIP) and repurposes it as a patch-wise extractor of spatially grounded, word-level joint semantic-visual features. We introduce patch-wise feature alignment and spatial attention embedding, trained via behavioral cloning. Our key contributions are: (1) the first realization of spatially disentangled semantic representations from vision-language models (VLMs) for closed-loop navigation; and (2) strong generalization to complex real-world environments using only limited simulation data, enabling reliable closed-loop flight across diverse novel targets and natural-language commands. Flex significantly improves robustness to both instruction distribution shifts and visual domain shifts, without requiring fine-tuning of the VLM backbone.
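The patch-wise alignment described above can be sketched as a cosine-similarity attention between frozen patch embeddings and instruction-token embeddings, yielding one spatial map per word. This is a minimal illustrative sketch, not the paper's implementation: the function name, the 7x7 patch grid, the 512-dim features, and the temperature value are all assumptions, and random vectors stand in for real VLM outputs.

```python
import numpy as np

def spatial_attention(patch_feats, word_feats, temperature=0.07):
    """Cosine-similarity attention between image patches and instruction words.

    patch_feats: (num_patches, dim) features from a frozen VLM vision encoder.
    word_feats:  (num_words, dim) features from the frozen VLM text encoder.
    Returns a (num_patches, num_words) map; each word's column is a softmax
    over patches, i.e. a per-word spatial attention map.
    """
    # L2-normalize so the dot product is cosine similarity.
    p = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    w = word_feats / np.linalg.norm(word_feats, axis=1, keepdims=True)
    logits = (p @ w.T) / temperature
    logits -= logits.max(axis=0, keepdims=True)  # numerical stability
    attn = np.exp(logits)
    return attn / attn.sum(axis=0, keepdims=True)

# Illustrative stand-ins for real frozen-VLM features.
rng = np.random.default_rng(0)
patches = rng.normal(size=(49, 512))  # hypothetical 7x7 patch grid
words = rng.normal(size=(5, 512))     # e.g. tokens of "fly to the red ball"
attn = spatial_attention(patches, words)
print(attn.shape)  # (49, 5)
```

Keeping the backbone frozen means only the lightweight head consuming such maps is trained, which is what makes the small-data regime in the summary plausible.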

📝 Abstract
End-to-end learning directly maps sensory inputs to actions, creating highly integrated and efficient policies for complex robotics tasks. However, such models often struggle to generalize beyond their training scenarios, limiting adaptability to new environments, tasks, and concepts. In this work, we investigate the minimal data requirements and architectural adaptations necessary to achieve robust closed-loop performance with vision-based control policies under unseen text instructions and visual distribution shifts. Our findings are synthesized in Flex (Fly lexically), a framework that uses pre-trained Vision Language Models (VLMs) as frozen patch-wise feature extractors, generating spatially aware embeddings that integrate semantic and visual information. We demonstrate the effectiveness of this approach on a quadrotor fly-to-target task, where agents trained via behavior cloning on a small simulated dataset successfully generalize to real-world scenes with diverse novel goals and command formulations.
Problem

Research questions and friction points this paper is trying to address.

Generalizing vision-based control policies to unseen text instructions
Adapting to visual distribution shifts in new environments
Achieving robust closed-loop performance with minimal data requirements
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses frozen Vision Language Models for feature extraction
Integrates semantic and visual spatial embeddings
Trains agents via behavior cloning on small datasets
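The behavior-cloning step above reduces to supervised regression from fused features to expert actions. The toy sketch below fits a linear policy by gradient descent on mean-squared error; the shapes, learning rate, and the linear "expert" are all illustrative assumptions, not details from the paper.

```python
import numpy as np

# Toy behavior cloning: fit a linear policy a = f @ W to expert
# (feature, action) pairs by minimizing mean-squared error.
rng = np.random.default_rng(1)
feats = rng.normal(size=(256, 32))   # stand-ins for fused semantic-visual features
W_expert = rng.normal(size=(32, 4))  # hidden "expert" mapping (illustrative)
actions = feats @ W_expert           # expert actions, e.g. 4-DoF velocity commands

W = np.zeros((32, 4))
lr = 0.1
for _ in range(500):
    pred = feats @ W
    grad = feats.T @ (pred - actions) / len(feats)  # gradient of 0.5 * MSE
    W -= lr * grad

mse = float(np.mean((feats @ W - actions) ** 2))
print(mse < 1e-3)  # the cloned policy matches the expert on the training set
```

In the actual framework the policy head is trained the same way in spirit, but on simulated demonstrations with the frozen VLM providing the features.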