$M^2$-VLA: Boosting Vision-Language Models for Generalizable Manipulation via Layer Mixture and Meta-Skills

📅 2026-04-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses limitations of existing vision-language-action (VLA) models, which rely on end-to-end fine-tuning and, as a result, lose much of the generalization ability of the underlying vision-language model (VLM) and suffer catastrophic forgetting. The authors propose $M^2$-VLA, a framework showing that a general-purpose VLM can serve directly as the backbone for robotic manipulation. Because high-level semantic features do not map cleanly onto precise motor control, the framework adds two components: a Mixture of Layers (MoL) mechanism that selectively extracts task-critical information from the VLM's dense semantic features, and a Meta Skill Module (MSM) that injects strong inductive biases so that trajectories can be learned efficiently under constrained model capacity. Experiments in simulation and in real-world environments demonstrate the effectiveness of the approach; generalization studies support its zero-shot capabilities, and ablation studies confirm the contribution of each component.

📝 Abstract

Current Vision-Language-Action (VLA) models predominantly rely on end-to-end fine-tuning. While effective, this paradigm compromises the inherent generalization capabilities of Vision-Language Models (VLMs) and incurs catastrophic forgetting. To address these limitations, we propose $M^2$-VLA, which demonstrates that a general-purpose VLM can serve directly as a powerful backbone for robotic manipulation. A key challenge remains, however: bridging the gap between the high-level semantic understanding of VLMs and the precise requirements of robotic control. To overcome this, we introduce the Mixture of Layers (MoL) strategy, which selectively extracts task-critical information from dense semantic features. In addition, to facilitate efficient trajectory learning under constrained model capacity, we propose a Meta Skill Module (MSM) that integrates strong inductive biases. Extensive experiments in both simulated and real-world environments demonstrate the effectiveness of our approach, and generalization and ablation studies validate the architecture's zero-shot capabilities and confirm the contribution of each key component. Our code and pre-trained models will be made publicly available.
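
The abstract does not say how MoL selects information across layers. As a rough illustration only, one common way to mix features from several layers of a frozen backbone is a softmax-weighted sum with learnable gate logits. The sketch below assumes that form; `mix_layers`, `gate_logits`, and the toy feature vectors are hypothetical names and values, not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_layers(layer_features, gate_logits):
    """Blend per-layer feature vectors using softmax gate weights.

    layer_features: one feature vector per backbone layer (hypothetical).
    gate_logits: one score per layer; in a real model these would be
        learned, here they are supplied by hand (an assumption).
    Returns the mixed feature vector and the gate weights.
    """
    if len(layer_features) != len(gate_logits):
        raise ValueError("need exactly one gate logit per layer")
    weights = softmax(gate_logits)
    mixed = [0.0] * len(layer_features[0])
    for weight, features in zip(weights, layer_features):
        for i, value in enumerate(features):
            mixed[i] += weight * value
    return mixed, weights


# Toy example: three "layers" with 2-dimensional features.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed, weights = mix_layers(features, [0.0, 0.0, 2.0])
print(weights)  # the third layer receives the largest weight
print(mixed)
```

With equal logits the mix reduces to a plain average of the layers; raising one logit shifts the output toward that layer's features. The backbone features are only read, never updated, which is in keeping with the abstract's aim of avoiding end-to-end fine-tuning.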

Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action

generalization

catastrophic forgetting

robotic manipulation

semantic understanding

Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture of Layers

Meta Skill Module

Vision-Language-Action