Seeing to Act, Prompting to Specify: A Bayesian Factorization of Vision Language Action Policy

📅 2025-12-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
VLA models face dual challenges in out-of-distribution generalization: catastrophic forgetting during VLM backbone fine-tuning and modality imbalance in VLA data—where linguistic diversity lags behind visual and action modalities—inducing visual shortcuts and language-conditioned forgetting. To address this, we propose a Bayesian decomposition framework that, for the first time, probabilistically decouples the visual-action prior from the language-conditioned likelihood, enabling “see-to-act” perception and “prompt-to-specify” control. Our method requires no external data and achieves endogenous disentanglement via Bayesian factor-driven decomposition, joint policy factorization, and mutual information minimization constraints—operating synergistically across pre- and post-contact phases. This preserves the VLM’s inherent generalization capacity while enhancing instruction following. Experiments demonstrate significant improvements in cross-task instruction-following accuracy and robustness on unseen instructions, objects, and environments, effectively suppressing visual shortcut learning.
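As a rough sketch of the stated decomposition (the symbols a, o, and \ell for action, observation, and language instruction are illustrative and not necessarily the paper's notation), the policy factorizes by Bayes' rule as

\pi(a \mid o, \ell) \;\propto\; p(a \mid o) \cdot p(\ell \mid a, o)

where p(a | o) is the visual-action prior that proposes actions from perception alone ("see to act"), and p(\ell | a, o) is the language-conditioned likelihood that re-weights those proposals toward the instruction ("prompt to specify").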

📝 Abstract
The pursuit of out-of-distribution generalization in Vision-Language-Action (VLA) models is often hindered by catastrophic forgetting of the Vision-Language Model (VLM) backbone during fine-tuning. While co-training with external reasoning data helps, it requires careful tuning and incurs data-related overhead. Beyond such external dependencies, we identify an intrinsic cause within VLA datasets: modality imbalance, where language diversity is much lower than visual and action diversity. This imbalance biases the model toward visual shortcuts and language forgetting. To address this, we introduce BayesVLA, a Bayesian factorization that decomposes the policy into a visual-action prior, supporting seeing-to-act, and a language-conditioned likelihood, enabling prompt-to-specify. This inherently preserves generalization and promotes instruction following. We further incorporate pre- and post-contact phases to better leverage pre-trained foundation models. Information-theoretic analysis formally validates the effectiveness of our approach in mitigating shortcut learning. Extensive experiments show superior generalization to unseen instructions, objects, and environments compared to existing methods. Project page is available at: https://xukechun.github.io/papers/BayesVLA.
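As a purely illustrative sketch of how such a factorized policy could score candidate actions at inference time (every name and toy scoring function below is hypothetical; this is not the authors' implementation, which learns both terms inside a VLA model):

import numpy as np

# Toy stand-ins for the two factors of the Bayesian decomposition.
def log_prior(action, obs_feat):
    # "See to act": prefer actions close to what the visual features alone suggest
    # (a toy Gaussian log-density around a slice of the observation features).
    proposed = obs_feat[: action.shape[0]]
    return -0.5 * float(np.sum((action - proposed) ** 2))

def log_likelihood(action, lang_feat):
    # "Prompt to specify": re-weight actions by a toy alignment score with the
    # instruction embedding.
    return float(np.dot(action, lang_feat[: action.shape[0]]))

def select_action(candidates, obs_feat, lang_feat):
    # Posterior-style scoring: log p(a|o) + log p(l|a,o), maximized over candidates.
    scores = [log_prior(a, obs_feat) + log_likelihood(a, lang_feat) for a in candidates]
    return candidates[int(np.argmax(scores))]

rng = np.random.default_rng(0)
obs = rng.normal(size=16)                             # stand-in for visual features
lang = rng.normal(size=16)                            # stand-in for instruction features
candidates = [rng.normal(size=4) for _ in range(8)]   # candidate action vectors
print(select_action(candidates, obs, lang))

The point of the split is that the prior term is untouched by the instruction, so perception-driven competence can be preserved even when language supervision is sparse.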
Problem

Research questions and friction points this paper is trying to address.

Addresses catastrophic forgetting in Vision-Language-Action models during fine-tuning
Mitigates modality imbalance where language diversity lags visual and action diversity
Enhances generalization to unseen instructions, objects, and environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bayesian factorization separates visual-action prior from language likelihood
Pre- and post-contact phases integrate pre-trained foundation models
Information-theoretic analysis validates mitigation of shortcut learning (see the sketch below)
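One plausible shape for the mutual-information constraint mentioned in the summary, shown only as an illustration (z_v for the prior branch's representation, \theta for model parameters, and the weight \lambda are assumptions rather than the paper's notation):

\min_{\theta} \; \mathcal{L}_{\text{policy}}(\theta) \;+\; \lambda \, I(z_v; \ell)

A penalty of this form discourages the visual-action prior from absorbing instruction-specific cues, routing language information through the likelihood branch, which is one way a constraint like this could suppress visual shortcut learning.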
🔎 Similar Papers
No similar papers found.
Kechun Xu
Zhejiang University
Zhenjie Zhu
Zhejiang University
Anzhe Chen
Zhejiang University
Shuqi Zhao
UC Berkeley
Qing Huang
Chinese Academy of Sciences
Material Editing
Yifei Yang
Shanghai Jiao Tong University
Natural Language Processing
Haojian Lu
Zhejiang University
Rong Xiong
Zhejiang University
Robotics
Masayoshi Tomizuka
Mechanical Engineering, University of California
mechanical engineering, dynamic systems, control, mechatronics
Yue Wang
Zhejiang University