OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction

📅 2025-03-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing Vision-Language-Action (VLA) models feed visual and linguistic features independently into downstream policies, disrupting the cross-modal semantic alignment established during pretraining and thereby limiting zero-shot generalization. To address this, the paper proposes OTTER, a lightweight architecture that keeps pretrained vision-language encoders (e.g., CLIP) frozen and performs text-aware visual feature extraction: only the visual features semantically aligned with the language instruction are passed to a compact policy transformer. Crucially, OTTER preserves the cross-modal alignment learned during pretraining end-to-end, without fine-tuning the vision-language backbone. Experiments show that OTTER consistently outperforms state-of-the-art methods on both simulated and real-world robotic manipulation tasks, achieving strong zero-shot generalization to novel objects and environments while remaining computationally efficient.

📝 Abstract
Vision-Language-Action (VLA) models aim to predict robotic actions based on visual observations and language instructions. Existing approaches require fine-tuning pre-trained vision-language models (VLMs) as visual and language features are independently fed into downstream policies, degrading the pre-trained semantic alignments. We propose OTTER, a novel VLA architecture that leverages these existing alignments through explicit, text-aware visual feature extraction. Instead of processing all visual features, OTTER selectively extracts and passes only task-relevant visual features that are semantically aligned with the language instruction to the policy transformer. This allows OTTER to keep the pre-trained vision-language encoders frozen. Thereby, OTTER preserves and utilizes the rich semantic understanding learned from large-scale pre-training, enabling strong zero-shot generalization capabilities. In simulation and real-world experiments, OTTER significantly outperforms existing VLA models, demonstrating strong zero-shot generalization to novel objects and environments. Video, code, checkpoints, and dataset: https://ottervla.github.io/.
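The abstract's core idea — scoring frozen visual features against the instruction embedding and forwarding only the instruction-aligned ones to the policy — can be sketched as below. This is an illustrative assumption, not the paper's released code: the cosine-similarity scoring, softmax weighting, and top-k selection are placeholders for whatever extraction mechanism OTTER actually implements, and the function name is hypothetical.

```python
import numpy as np

def text_aware_visual_extraction(patch_feats, text_feat, top_k=4):
    """Hypothetical sketch of text-aware visual feature extraction.

    patch_feats: (N, D) frozen visual patch embeddings (e.g., from CLIP)
    text_feat:   (D,)   frozen embedding of the language instruction
    Returns the indices of the top_k instruction-aligned patches and their
    attention-weighted features; only these would reach the policy transformer.
    """
    # Cosine similarity between the instruction and each visual patch
    pf = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    tf = text_feat / np.linalg.norm(text_feat)
    scores = pf @ tf                                  # (N,)
    # Softmax attention weights over patches (shift for numerical stability)
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()
    # Keep only the task-relevant patches; the rest are discarded,
    # so the frozen encoders never need fine-tuning
    idx = np.argsort(weights)[-top_k:]
    return idx, patch_feats[idx] * weights[idx, None]
```

In a full policy, the selected features would be concatenated with proprioceptive state and fed to the downstream transformer; the encoders stay frozen throughout.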
Problem

Research questions and friction points this paper is trying to address.

How to predict robotic actions from visual observations and language instructions.
How to preserve pre-trained cross-modal semantic alignments without fine-tuning the VLM.
How to achieve zero-shot generalization to novel objects and environments.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Text-aware visual feature extraction
Preserves pre-trained semantic alignments
Strong zero-shot generalization capabilities