🤖 AI Summary
This work addresses the challenge of deploying large vision-language-action (VLA) models on edge devices. Deployment is hindered by massive parameter counts, the high computational cost of diffusion-based action heads, and the absence of an effective unified low-bit quantization method. We propose the first training-free post-training quantization framework that applies uniform W4A4 quantization (4-bit weights and 4-bit activations) to all VLA components, including both the language backbone and the diffusion action head. It introduces an SVD-Hadamard composite rotation to balance weight energy and suppress activation outliers, coupled with a per-step DiT activation scaling strategy that mitigates dynamic-range drift during denoising. On the LIBERO benchmark, the method achieves task success rates of 98.0% for Pi 0.5 and 87.8% for GR00T N1.5, surpassing the FP16 baselines (97.1% and 87.0%) while reducing the static memory footprint by 71.3%. Real-world robot experiments show stable, precise manipulation.
📝 Abstract
Vision-Language-Action (VLA) models unify perception, reasoning, and control within a single policy, yet their multi-billion-parameter backbones and diffusion-based action heads make on-device deployment prohibitively expensive. Prior quantization efforts offer only partial solutions: they compress the LLM backbone while leaving the diffusion transformer (DiT) action head at full precision, or they resort to mixed-precision schemes, driven by the belief that uniformly quantizing the action head is inherently unstable. We challenge this assumption with Omega-QVLA, the first training-free post-training quantization framework that compresses both the language backbone and the entire diffusion action head of a VLA model to a uniform W4A4 precision, eliminating the need for mixed-precision allocation. Omega-QVLA combines two components: a composite SVD-Hadamard rotation, which equalizes per-channel weight energy while diffusing residual activation outliers, and per-step DiT activation scaling, which absorbs dynamic-range drift across denoising steps. On LIBERO, Omega-QVLA compresses Pi 0.5 and GR00T N1.5 to W4A4 with 98.0% and 87.8% task success rates, matching or exceeding their FP16 references of 97.1% and 87.0%, while reducing the static memory footprint by 71.3%. Real-world experiments further confirm smooth, accurate manipulation where prior methods fail. Code is available at https://github.com/UCMP13753/Omega-QVLA.
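To make the two components named in the abstract concrete, the following is a minimal NumPy sketch of the general idea. It is not the Omega-QVLA implementation. The abstract does not specify how the SVD and Hadamard factors are composed, so the choice `R = V @ H` (right singular vectors of the weight followed by a normalized Hadamard matrix) is an assumption. The toy shapes, the synthetic "drift" across denoising steps, the max-based scale calibration, and all function names (`hadamard`, `composite_rotation`, `fake_quant`, `calibrate_step_scales`) are likewise hypothetical and chosen only for illustration.

```python
import numpy as np

def hadamard(n):
    """Normalized Hadamard matrix via Sylvester's construction (n must be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def composite_rotation(W):
    """ASSUMED composition: R = V @ H, with V the right singular vectors of W (out x in)."""
    _, _, Vt = np.linalg.svd(W, full_matrices=True)
    return Vt.T @ hadamard(W.shape[1])

def fake_quant(x, scale, bits=4):
    """Symmetric uniform quantization to signed integers, followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

def calibrate_step_scales(acts_per_step, bits=4):
    """One activation scale per denoising step, from the max magnitude seen at that step."""
    qmax = 2 ** (bits - 1) - 1
    return [np.abs(a).max() / qmax for a in acts_per_step]

rng = np.random.default_rng(0)
W = rng.normal(size=(32, 16))              # toy weight, shape (out_features, in_features)
R = composite_rotation(W)                  # orthogonal, so (x @ R) @ (W @ R).T == x @ W.T
W_rot = W @ R                              # rotation folded into the weight offline

base = rng.normal(size=(8, 16))
acts_per_step = [base * (1.0 + t) for t in range(4)]   # toy drift: range grows with step
step_scales = calibrate_step_scales([a @ R for a in acts_per_step])

w_scale = np.abs(W_rot).max(axis=1, keepdims=True) / 7  # per-output-channel 4-bit scale
W_q = fake_quant(W_rot, w_scale)
for t, a in enumerate(acts_per_step):
    y_ref = a @ W.T                                       # full-precision reference
    y_q = fake_quant(a @ R, step_scales[t]) @ W_q.T       # simulated W4A4 output
    print(t, float(np.abs(y_q - y_ref).mean()))
```

Under this assumed composition, `W @ R` equals `U @ S @ H`, so every input-channel column of the rotated weight carries the same energy (the sum of squared singular values divided by the channel count). That is one way to read "equalizes per-channel weight energy". Keeping a separate scale per denoising step lets the 4-bit grid follow the activation range as it changes, rather than sizing a single scale for the worst-case step.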