MetaVLA: Unified Meta Co-training For Efficient Embodied Adaption

📅 2025-10-07

📈 Citations: 0

✨ Influential: 0

career value

214K/year

🤖 AI Summary

Weak cross-task generalization of Vision-Language-Action (VLA) models in embodied reasoning necessitates task-specific fine-tuning, hindering adaptation to unseen tasks. To address this, we propose a backbone-agnostic, context-aware meta-co-training post-training framework. Our approach leverages an attention-based Neural Process to enable lightweight meta-learning, unifying diverse auxiliary tasks into a single-stage multi-task supervised fine-tuning procedure. Crucially, it achieves substantial gains in zero-shot task generalization with negligible inference overhead. On the LIBERO-LONG benchmark—the most challenging long-horizon embodied reasoning benchmark—we outperform OpenVLA by 8.0% absolute accuracy, using only 75K training steps and reducing GPU-hours by 76%. This demonstrates efficient, low-resource, and highly generalizable adaptation for universal embodied agents.

Technology Category

Application Category

📝 Abstract

Vision-Language-Action (VLA) models show promise in embodied reasoning, yet remain far from true generalists-they often require task-specific fine-tuning, and generalize poorly to unseen tasks. We propose MetaVLA, a unified, backbone-agnostic post-training framework for efficient and scalable alignment. MetaVLA introduces Context-Aware Meta Co-Training, which consolidates diverse target tasks into a single fine-tuning stage while leveraging structurally diverse auxiliary tasks to improve in-domain generalization. Unlike naive multi-task SFT, MetaVLA integrates a lightweight meta-learning mechanism-derived from Attentive Neural Processes-to enable rapid adaptation from diverse contexts with minimal architectural change or inference overhead. On the LIBERO benchmark, MetaVLA with six auxiliary tasks outperforms OpenVLA by up to 8.0% on long-horizon tasks, reduces training steps from 240K to 75K, and cuts GPU time by ~76%. These results show that scalable, low-resource post-training is achievable-paving the way toward general-purpose embodied agents. Code will be available.

Problem

Research questions and friction points this paper is trying to address.

Improves generalization of Vision-Language-Action models to unseen tasks

Reduces task-specific fine-tuning requirements for embodied agents

Enables efficient adaptation with minimal architectural changes and overhead

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified meta co-training for efficient embodied adaptation

Lightweight meta-learning mechanism for rapid context adaptation

Consolidates diverse tasks into single fine-tuning stage

🔎 Similar Papers

Law of Vision Representation in MLLMs