Seeing to Act, Prompting to Specify: A Bayesian Factorization of Vision Language Action Policy

📅 2025-12-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
VLA models face dual challenges in out-of-distribution generalization: catastrophic forgetting during VLM backbone fine-tuning and modality imbalance in VLA data—where linguistic diversity lags behind visual and action modalities—inducing visual shortcuts and language-conditioned forgetting. To address this, we propose a Bayesian decomposition framework that, for the first time, probabilistically decouples the visual-action prior from the language-conditioned likelihood, enabling “see-to-act” perception and “prompt-to-specify” control. Our method requires no external data and achieves endogenous disentanglement via Bayesian factor-driven decomposition, joint policy factorization, and mutual information minimization constraints—operating synergistically across pre- and post-contact phases. This preserves the VLM’s inherent generalization capacity while enhancing instruction following. Experiments demonstrate significant improvements in cross-task instruction-following accuracy and robustness on unseen instructions, objects, and environments, effectively suppressing visual shortcut learning.
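As a rough sketch of the stated decomposition (the symbols a, o, and \ell for action, observation, and language instruction are illustrative and not necessarily the paper's notation), the policy factorizes by Bayes' rule as

\pi(a \mid o, \ell) \;\propto\; p(a \mid o) \cdot p(\ell \mid a, o)

where p(a | o) is the visual-action prior that proposes actions from perception alone ("see to act"), and p(\ell | a, o) is the language-conditioned likelihood that re-weights those proposals toward the instruction ("prompt to specify").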

📝 Abstract
The pursuit of out-of-distribution generalization in Vision-Language-Action (VLA) models is often hindered by catastrophic forgetting of the Vision-Language Model (VLM) backbone during fine-tuning. While co-training with external reasoning data helps, it requires careful tuning and incurs data-related overhead. Beyond such external dependencies, we identify an intrinsic cause within VLA datasets: modality imbalance, where language diversity is much lower than visual and action diversity. This imbalance biases the model toward visual shortcuts and language forgetting. To address this, we introduce BayesVLA, a Bayesian factorization that decomposes the policy into a visual-action prior, supporting seeing-to-act, and a language-conditioned likelihood, enabling prompt-to-specify. This inherently preserves generalization and promotes instruction following. We further incorporate pre- and post-contact phases to better leverage pre-trained foundation models. Information-theoretic analysis formally validates the effectiveness of our approach in mitigating shortcut learning. Extensive experiments show superior generalization to unseen instructions, objects, and environments compared to existing methods. Project page is available at: https://xukechun.github.io/papers/BayesVLA.
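As a purely illustrative sketch of how such a factorized policy could score candidate actions at inference time (every name and toy scoring function below is hypothetical; this is not the authors' implementation, which learns both terms inside a VLA model):

import numpy as np

# Toy stand-ins for the two factors of the Bayesian decomposition.
def log_prior(action, obs_feat):
    # "See to act": prefer actions close to what the visual features alone suggest
    # (a toy Gaussian log-density around a slice of the observation features).
    proposed = obs_feat[: action.shape[0]]
    return -0.5 * float(np.sum((action - proposed) ** 2))

def log_likelihood(action, lang_feat):
    # "Prompt to specify": re-weight actions by a toy alignment score with the
    # instruction embedding.
    return float(np.dot(action, lang_feat[: action.shape[0]]))

def select_action(candidates, obs_feat, lang_feat):
    # Posterior-style scoring: log p(a|o) + log p(l|a,o), maximized over candidates.
    scores = [log_prior(a, obs_feat) + log_likelihood(a, lang_feat) for a in candidates]
    return candidates[int(np.argmax(scores))]

rng = np.random.default_rng(0)
obs = rng.normal(size=16)                             # stand-in for visual features
lang = rng.normal(size=16)                            # stand-in for instruction features
candidates = [rng.normal(size=4) for _ in range(8)]   # candidate action vectors
print(select_action(candidates, obs, lang))

The point of the split is that the prior term is untouched by the instruction, so perception-driven competence can be preserved even when language supervision is sparse.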
Problem

Research questions and friction points this paper is trying to address.

Addresses catastrophic forgetting in Vision-Language-Action models during fine-tuning
Mitigates modality imbalance where language diversity lags visual and action diversity
Enhances generalization to unseen instructions, objects, and environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bayesian factorization separates visual-action prior from language likelihood
Pre- and post-contact phases integrate pre-trained foundation models
Information-theoretic analysis validates mitigation of shortcut learning (see the sketch below)
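One plausible shape for the mutual-information constraint mentioned in the summary, shown only as an illustration (z_v for the prior branch's representation, \theta for model parameters, and the weight \lambda are assumptions rather than the paper's notation):

\min_{\theta} \; \mathcal{L}_{\text{policy}}(\theta) \;+\; \lambda \, I(z_v; \ell)

A penalty of this form discourages the visual-action prior from absorbing instruction-specific cues, routing language information through the likelihood branch, which is one way a constraint like this could suppress visual shortcut learning.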
🔎 Similar Papers
No similar papers found.
Kechun Xu
Zhejiang University
Zhenjie Zhu
Zhejiang University
Anzhe Chen
Zhejiang University
Shuqi Zhao
UC Berkeley
Qing Huang
Chinese Academy of Sciences
Material Editing
Yifei Yang
Shanghai Jiao Tong University
Natural Language Processing
Haojian Lu
Zhejiang University
Rong Xiong
Zhejiang University
Robotics
Masayoshi Tomizuka
Mechanical Engineering, University of California
mechanical engineering, dynamic systems, control, mechatronics
Yue Wang
Zhejiang University