ActQuant: Sub-4-bit Action-Guided Quantization for Vision-Language-Action Models

📅 2026-05-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of deploying Vision-Language-Action (VLA) models on edge devices, where heavy compute requirements make deployment impractical and existing post-training quantization (PTQ) methods degrade severely below 4 bits. It proposes ActQuant, an action-guided mixed-precision PTQ framework that assigns each weight matrix a bit-width according to its action contribution and tunes per-block quantization scales using action-aware curvature, so that control-critical weights are prioritized. On the LIBERO benchmark, ActQuant is reported as the only method operating at or below 3 bits per weight (bpw), retaining 95.0% on OpenVLA-OFT and 94.8% on $\pi_{0.5}$ at 3 bpw, and 90.1% on OpenVLA-OFT at 2.5 bpw, where the backbone shrinks from 14.3 GB to 2.7 GB (5.3×). On a physical 6-DoF UR3 arm, $\pi_{0.5}$ quantized with ActQuant retains the baseline's success rate while reducing the memory footprint by 2.5×. All models are deployed through OmniModel.cpp, a conversion pipeline that targets a native C/C++ runtime with low-bit kernels.
📝 Abstract
Vision-Language-Action (VLA) models exhibit remarkable action generation for embodied intelligence, but their heavy compute requirements make deployment on edge platforms impractical. Aggressive, sub-4-bit weight quantization is the natural solution, yet existing post-training quantization (PTQ) methods suffer severe performance degradation in this regime. To address this, we introduce ActQuant, an action-guided mixed-precision PTQ framework that operates in two stages: (1) an inter-tensor bit allocator that assigns each weight matrix a single bit-width based on how much it contributes to predicting the agent's actions; (2) an intra-tensor scale optimizer that tunes per-block quantization scales using action-aware curvature, so that dynamic range is concentrated on the weights most influential for control. To deliver the on-device benefits of our aggressive quantization, we further introduce OmniModel.cpp, an agentic conversion pipeline that ports architectures into a native C/C++ runtime with efficient low-bit kernels. We evaluate ActQuant both in simulation and on a real-world 6-DoF UR3 arm, with all models deployed through OmniModel.cpp. On the LIBERO benchmark, ActQuant is the only method that operates at or below 3 bits per weight (bpw), retaining 95.0% on OpenVLA-OFT and 94.8% on $\pi_{0.5}$. Pushed further, ActQuant reaches 2.5 bpw at 90.1% on OpenVLA-OFT, compressing the backbone from 14.3 GB to 2.7 GB (5.3$\times$). On the physical UR3 arm, $\pi_{0.5}$ quantized with ActQuant retains the baseline's success rate while reducing the memory footprint by 2.5$\times$.
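The two stages described in the abstract can be illustrated with a small pure-Python sketch. This is not the authors' implementation: the abstract does not say how action contribution scores or action-aware curvature are computed, so `scores` and `h` are hypothetical inputs supplied by the caller, and the greedy budget rule, the symmetric uniform quantizer, and the grid search over scales are simple stand-ins chosen only to make the idea concrete.

```python
def allocate_bits(sizes, scores, target_bpw, choices=(2, 3, 4)):
    """Stage 1 sketch: one bit-width per tensor under an average-bpw budget.

    sizes  : {tensor name: parameter count}
    scores : {tensor name: action contribution score} (hypothetical input)
    """
    bits = {name: choices[0] for name in sizes}  # start at the lowest width
    budget = sum((target_bpw - choices[0]) * n for n in sizes.values())
    # Visit tensors by contribution per parameter, most important first.
    order = sorted(sizes, key=lambda name: scores[name] / sizes[name], reverse=True)
    for level in choices[1:]:
        for name in order:
            cost = (level - bits[name]) * sizes[name]  # extra bits this upgrade needs
            if 0 < cost <= budget:
                bits[name] = level
                budget -= cost
    return bits


def quantize(w, scale, bits):
    """Symmetric uniform quantize-dequantize of one block (assumes bits >= 2)."""
    qmax = 2 ** (bits - 1) - 1
    qmin = -qmax - 1
    return [scale * max(qmin, min(qmax, round(x / scale))) for x in w]


def weighted_error(w, w_hat, h):
    """Squared error weighted per weight by h, a stand-in for curvature."""
    return sum(hi * (a - b) ** 2 for a, b, hi in zip(w, w_hat, h))


def max_abs_scale(w, bits):
    """Plain baseline scale: map the largest magnitude to the top level."""
    qmax = 2 ** (bits - 1) - 1
    return (max(abs(x) for x in w) / qmax) or 1.0  # avoid a zero scale


def fit_scale(w, h, bits, grid=64):
    """Stage 2 sketch: pick the block scale that minimizes the weighted error."""
    base = max_abs_scale(w, bits)
    best_scale = base
    best_err = weighted_error(w, quantize(w, base, bits), h)
    for k in range(1, grid + 1):
        s = base * k / grid  # candidate scales no larger than the baseline
        err = weighted_error(w, quantize(w, s, bits), h)
        if err < best_err:
            best_scale, best_err = s, err
    return best_scale, best_err


if __name__ == "__main__":
    # Toy tensors; names, sizes, and scores are made up for illustration.
    sizes = {"vision": 100, "llm": 800, "action_head": 100}
    scores = {"vision": 1.0, "llm": 4.0, "action_head": 5.0}
    print(allocate_bits(sizes, scores, target_bpw=2.5))

    # One block with three small high-importance weights and one large outlier.
    w = [0.02, -0.03, 0.05, 1.0]
    h = [10.0, 10.0, 10.0, 0.01]
    print(fit_scale(w, h, bits=3))
```

In the toy block, the max-abs scale rounds the three small weights to zero in order to represent the outlier exactly. Because `h` marks those small weights as far more important, the weighted search settles on a smaller scale that clips the outlier instead, which is the sense in which dynamic range is "concentrated" on the influential weights.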
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action models
sub-4-bit quantization
post-training quantization
edge deployment
performance degradation
Innovation

Methods, ideas, or system contributions that make the work stand out.

action-guided quantization
sub-4-bit quantization
mixed-precision PTQ
low-bit deployment
vision-language-action models