$M^2$-VLA: Boosting Vision-Language Models for Generalizable Manipulation via Layer Mixture and Meta-Skills

📅 2026-04-27
📈 Citations: 0
Influential: 0
📄 PDF

career value

217K/year
🤖 AI Summary
This work addresses the limitations of existing vision-language-action (VLA) models, which rely on end-to-end fine-tuning and often suffer from degraded generalization, catastrophic forgetting in vision-language models (VLMs), and ineffective mapping from high-level semantics to precise motor control. To overcome these issues, the authors propose the $M^2$-VLA framework, which directly reuses a general-purpose VLM as its backbone without fine-tuning. The approach introduces a novel Mixture-of-Layers (MoL) mechanism to extract task-relevant semantic features and incorporates a Meta-Skill Module (MSM) that injects strong inductive bias, enabling efficient trajectory learning under limited model capacity. This design effectively bridges the gap between semantic understanding and robotic control, achieving state-of-the-art performance and robust zero-shot transfer in both simulation and real-world environments, with ablation studies confirming the contribution of each component.

Technology Category

Application Category

📝 Abstract
Current Vision-Language-Action (VLA) models predominantly rely on end-to-end fine-tuning. While effective, this paradigm compromises the inherent generalization capabilities of Vision-Language Models (VLMs) and incurs catastrophic forgetting. To address these limitations, we propose $M^2$-VLA, which demonstrates that a generalized VLM is able to serve as a powerful backbone for robotic manipulation directly. However, it remains a key challenge to bridge the gap between the high-level semantic understanding of VLMs and the precise requirements of robotic control. To overcome this, we introduce the Mixture of Layers (MoL) strategy that selectively extracts task-critical information from dense semantic features. Furthermore, to facilitate efficient trajectory learning under constrained model capacity, we propose a Meta Skill Module (MSM) that integrates strong inductive biases. Extensive experiments in both simulated and real-world environments demonstrate the effectiveness of our approach. Furthermore, generalization and ablation studies validate the architecture's zero-shot capabilities and confirm the contribution of each key component. Our code and pre-trained models will be made publicly available.
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action
generalization
catastrophic forgetting
robotic manipulation
semantic understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture of Layers
Meta Skill Module
Vision-Language-Action
Generalizable Manipulation
Zero-shot Transfer
S
Siyao Xiao
Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China.
Yuhong Zhang
Yuhong Zhang
清华大学
Autonomous Driving
Zhifang Liu
Zhifang Liu
School of Mathematical Sciences, Tianjin Normal University
image processing
Z
Zihan Gao
Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China.
J
Jingye Zhang
Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China.
S
Sinwai Choo
Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China.
D
Dake Zhong
Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China.
M
Mengzhe Wang
College of Intelligent Robotics and Advanced Manufacturing, Fudan University, Shanghai, China.
X
Xiao Lin
Synapath.
X
Xianfeng Zhou
Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China.
J
Jia Jia
Peng Cheng Laboratory, Shenzhen, China.
H
Haoqian Wang
Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China.