SQAP-VLA: A Synergistic Quantization-Aware Pruning Framework for High-Performance Vision-Language-Action Models

📅 2025-09-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-Language-Action (VLA) models suffer from prohibitive computational and memory overhead, hindering practical deployment. Existing compression techniques—such as quantization and token pruning—are typically applied in isolation and exhibit intrinsic incompatibility, preventing synergistic acceleration. Method: We propose the first training-free, structured inference acceleration framework that jointly optimizes quantization and dynamic token pruning. Our approach introduces a quantization-aware pruning criterion and modifies the quantizer to preserve accuracy consistency across pruning and quantization operations. Contribution/Results: The method preserves full model architecture integrity while integrating high-precision weight quantization with input-adaptive visual and language token pruning. Evaluated on standard VLA models, it achieves a 1.93× inference speedup and improves average task success rate by up to 4.5%, significantly outperforming standalone quantization or pruning baselines.

📝 Abstract
Vision-Language-Action (VLA) models exhibit unprecedented capabilities for embodied intelligence. However, their extensive computational and memory costs hinder their practical deployment. Existing VLA compression and acceleration approaches conduct quantization or token pruning in an ad-hoc manner but fail to enable both for a holistic efficiency improvement due to an observed incompatibility. This work introduces SQAP-VLA, the first structured, training-free VLA inference acceleration framework that simultaneously enables state-of-the-art quantization and token pruning. We overcome the incompatibility by co-designing the quantization and token pruning pipeline, where we propose new quantization-aware token pruning criteria that work on an aggressively quantized model while improving the quantizer design to enhance pruning effectiveness. When applied to standard VLA models, SQAP-VLA yields significant gains in computational efficiency and inference speed while successfully preserving core model performance, achieving a 1.93× speedup and up to a 4.5% average success rate enhancement compared to the original model.
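The core idea of quantization-aware token pruning can be sketched in a few lines: score tokens on their *quantized* representations, so that pruning decisions remain consistent with what the quantized model actually sees. The snippet below is a minimal illustration assuming a uniform symmetric quantizer and a hypothetical L2-norm saliency score; the paper's actual quantizer design and pruning criteria differ.

```python
import numpy as np

def quantize(x, num_bits=4):
    """Uniform symmetric fake-quantization (illustrative, not the paper's quantizer)."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = np.abs(x).max()
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return np.round(x / scale).clip(-qmax, qmax) * scale

def prune_tokens(tokens, keep_ratio=0.5, num_bits=4):
    """Quantization-aware pruning sketch: compute saliency on quantized
    embeddings (hypothetical L2-norm criterion) and keep the top-k tokens."""
    q = quantize(tokens, num_bits)           # score on the quantized view
    scores = np.linalg.norm(q, axis=-1)      # per-token saliency
    k = max(1, int(len(tokens) * keep_ratio))
    keep = np.argsort(scores)[-k:]           # indices of top-k tokens
    return tokens[np.sort(keep)]             # preserve original token order

tokens = np.random.randn(196, 64)            # e.g. 196 visual tokens, dim 64
kept = prune_tokens(tokens, keep_ratio=0.25)
print(kept.shape)                            # (49, 64)
```

Because the saliency is computed after quantization, tokens whose importance would be distorted by aggressive quantization are ranked consistently with the model that actually runs, which is the incompatibility the paper's co-design addresses.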
Problem

Research questions and friction points this paper is trying to address.

Reducing computational and memory costs of Vision-Language-Action models
Overcoming incompatibility between quantization and token pruning
Enabling efficient deployment without sacrificing model performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free quantization-aware pruning framework
Co-designed quantization and pruning pipeline
Achieves speedup while preserving performance
Hengyu Fang
School of Electronic Science and Engineering, Nanjing University
Yijiang Liu
PhD
Machine Learning Efficiency
Yuan Du
School of Electronic Science and Engineering, Nanjing University
Huanrui Yang
Assistant Professor, ECE, University of Arizona
Efficient deep learning, Trustworthy deep learning
Li Du
School of Electronic Science and Engineering, Nanjing University