🤖 AI Summary
This work challenges the prevailing assumption in vision-language-action (VLA) frameworks that pretrained vision-language models (VLMs) carry too little action-related information. It introduces a quotient-space perspective, arguing that the VLM latent space is action-sufficient and that its apparent redundancy arises primarily from instruction variations that are irrelevant to action execution. Building on this insight, the authors propose QuoVLA, a framework that compresses the VLM latent space into an action-equivariant quotient representation using a quantization module, a dual-branch architecture, and a relative temporal-complexity regularizer, preserving action-relevant structure while removing prompt-level variability. Experiments across multiple benchmarks show strong performance, with particularly notable gains in generalization under visual, linguistic, and environmental distribution shifts.
📝 Abstract
Vision-Language-Action (VLA) models commonly adapt pretrained Vision-Language Models (VLMs) to robot control by mapping visual observations and language instructions to continuous actions. Existing approaches typically take an action-insufficiency view, assuming that pretrained VLM latents either lack directly usable action information or should be shielded from action-learning signals. Against this view, our Quotient Theory for VLA shows that pretrained VLM latents are action-sufficient: they already contain the information needed for control, yet remain overcomplete because they distinguish prompt-level variations that induce the same optimal action behavior. To operationalize this theory, we propose QuoVLA, a quotient-space framework for VLA that compresses pretrained VLM latents into action-sufficient representations. Specifically, QuoVLA instantiates this principle with a quantization module and a dual-branch design with relative temporal-complexity regularization, preserving action-relevant information while removing prompt-level redundancy. Extensive experiments across multiple benchmarks demonstrate that QuoVLA achieves strong performance, with particularly notable improvements in generalization under visual, linguistic, and environmental distribution shifts. Our code will be made publicly available.
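The quotient idea in the abstract can be illustrated with a small toy sketch. This is not the authors' construction: the latents, the `optimal_action` policy, and the `quotient` helper below are all hypothetical names chosen for illustration, and real VLM latents are continuous vectors rather than labeled pairs. The sketch only shows what it means for a representation to be overcomplete yet action-sufficient: several latents that differ in phrasing collapse into one class per optimal action.

```python
from collections import defaultdict

# Hypothetical toy "latents": (scene, phrasing) pairs, standing in for
# VLM latents that encode both the scene and the wording of the prompt.
LATENTS = [
    ("cup_left", "pick up the cup"),
    ("cup_left", "grab the cup please"),
    ("cup_right", "pick up the cup"),
    ("cup_right", "grab the cup please"),
]


def optimal_action(latent):
    """Assumed toy optimal policy: the action depends on the scene only."""
    scene, _phrasing = latent
    return {"cup_left": "reach_left", "cup_right": "reach_right"}[scene]


def quotient(latents, policy):
    """Group latents into equivalence classes sharing one optimal action."""
    classes = defaultdict(list)
    for z in latents:
        classes[policy(z)].append(z)
    return dict(classes)


classes = quotient(LATENTS, optimal_action)
for action, members in classes.items():
    print(action, "<-", members)
```

In this toy setting the four latents reduce to two classes. The class label alone determines the action, so nothing needed for control is lost, while the phrasing differences are discarded.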