UAM: A Dual-Stream Perspective on Forgetting in VLA Training

📅 2026-05-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the “embodiment tax” in vision-language-action (VLA) models: fine-tuning a pretrained vision-language model (VLM) on action data erodes its multimodal semantic capabilities. Inspired by the dual-stream organization of biological vision, the authors propose the Unified Action Model (UAM), which adds a parallel Dorsal Expert, an analog of the brain’s dorsal pathway, to take on the control-learning burden. The Dorsal Expert is initialized from a pretrained generative model and trained with a mid-level objective that predicts visual dynamics. The whole model is trained end-to-end on action data alone, with no parameter freezing, no gradient stopping, and no auxiliary vision-language co-training or data replay. UAM retains over 95% of the underlying VLM’s multimodal capability while achieving the highest average success rate among the compared baselines on manipulation tasks that probe out-of-distribution generalization.
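The "over 95%" figure is a retention ratio: how much of the base VLM's benchmark performance survives action fine-tuning. The summary does not say how the paper aggregates scores, so the sketch below assumes a simple mean of per-benchmark ratios. The function name, benchmark names, and scores are hypothetical placeholders, not results from the paper.

```python
def capability_retention(base_scores, finetuned_scores):
    """Mean ratio of fine-tuned score to base score across benchmarks.

    Assumption: retention is averaged per benchmark. The paper's exact
    aggregation is not stated in this summary.
    """
    ratios = [finetuned_scores[name] / base_scores[name] for name in base_scores]
    return sum(ratios) / len(ratios)


# Placeholder numbers for illustration only (not from the paper).
base = {"benchmark_a": 50.0, "benchmark_b": 80.0}
after_action_training = {"benchmark_a": 49.0, "benchmark_b": 76.0}

retention = capability_retention(base, after_action_training)
print(f"retention: {retention:.1%}")
```

Under this definition, a value above 0.95 would correspond to the retention level the summary reports.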

📝 Abstract

Vision-language-action (VLA) models are typically built by fine-tuning a pretrained vision-language model (VLM) on action data. However, we show that this standard recipe systematically erodes the VLM's multimodal competence, a side effect we call the embodiment tax. But do VLAs have to forget? Inspired by the two-stream organization of biological vision, we trace this degradation to a structural bottleneck: current VLAs ask a single encoder to support both language-grounded semantics and control-relevant visual features, whereas biological vision separates recognition and visuomotor control into distinct pathways. Building on this view, we propose the Unified Action Model (UAM), which adds a parallel Dorsal Expert, an analog of the brain's dorsal pathway. To make the Dorsal Expert an effective second pathway and reduce the control-learning burden on the VLM, we initialize it from a pretrained generative model and train it with a mid-level reasoning objective that predicts visual dynamics. This design allows us to train the whole VLA end-to-end on action data alone: with no parameter freezing, no gradient stopping, and no auxiliary VL co-training, UAM retains over 95% of the underlying VLM's multimodal capability and at the same time achieves the highest average success rate among baselines on a variety of manipulation tasks that probe out-of-distribution generalization, including unseen objects, novel object-target compositions, and instruction variation. Together, these results suggest that semantic preservation in VLAs can emerge from architectural separation itself, rather than being enforced by frozen weights or auxiliary data replay, and that this preserved semantic capability can naturally transfer from VLMs to semantic generalization in actions.
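The two-pathway idea can be illustrated with a toy forward pass. This is a minimal sketch under loud assumptions, not the authors' architecture: tiny linear maps stand in for the VLM encoder and the Dorsal Expert, the action head reads the concatenation of both streams, and the dynamics objective is modeled as predicting the next observation from the dorsal features alone. Every class, function, and parameter name here is hypothetical, and no training step is shown.

```python
import random


def linear(weights, x):
    """Apply a dense layer without bias: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def init(rows, cols, rng):
    """Small random weight matrix of shape (rows, cols)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


class DualStreamPolicy:
    """Toy model with a semantic stream and a parallel dorsal stream."""

    def __init__(self, obs_dim, sem_dim, dorsal_dim, action_dim, seed=0):
        rng = random.Random(seed)
        self.semantic = init(sem_dim, obs_dim, rng)   # stand-in for the VLM encoder
        self.dorsal = init(dorsal_dim, obs_dim, rng)  # stand-in for the Dorsal Expert
        self.action_head = init(action_dim, sem_dim + dorsal_dim, rng)
        self.dynamics_head = init(obs_dim, dorsal_dim, rng)

    def forward(self, obs):
        sem = linear(self.semantic, obs)
        dor = linear(self.dorsal, obs)
        action = linear(self.action_head, sem + dor)     # both streams feed control
        next_obs_pred = linear(self.dynamics_head, dor)  # dynamics from dorsal only
        return action, next_obs_pred

    def loss(self, obs, next_obs, target_action, dynamics_weight=1.0):
        """Action imitation loss plus a weighted dynamics-prediction loss."""
        action, next_obs_pred = self.forward(obs)
        return mse(action, target_action) + dynamics_weight * mse(next_obs_pred, next_obs)


model = DualStreamPolicy(obs_dim=4, sem_dim=3, dorsal_dim=2, action_dim=2)
total = model.loss(obs=[1.0, 0.0, 0.5, -1.0],
                   next_obs=[0.9, 0.1, 0.5, -1.0],
                   target_action=[0.2, -0.3])
print(f"joint loss: {total:.4f}")
```

In this sketch the dynamics loss shapes only the dorsal stream and its prediction head, while the action loss reaches both streams. That split is one way to picture how a second pathway could absorb control-specific learning that would otherwise fall on the semantic encoder.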

Problem

Research questions and friction points this paper is trying to address.

forgetting

vision-language-action

embodiment tax

multimodal competence

semantic preservation

Innovation

Methods, ideas, or system contributions that make the work stand out.

dual-stream architecture

vision-language-action models

embodiment tax