Ω-QVLA: Robust Quantization for Vision-Language-Action Models via Composite Rotation and Per-step Scaling

📅 2026-05-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Deploying large vision-language-action (VLA) models on edge devices is hindered by their massive parameter counts, the high computational cost of diffusion-based action heads, and the absence of an effective unified low-bit quantization method. This work proposes the first training-free post-training quantization framework that applies uniform W4A4 quantization to all VLA components, including both the language backbone and the diffusion action head. It combines an SVD-Hadamard composite rotation, which balances weight energy and suppresses activation outliers, with a per-step DiT activation scaling strategy that mitigates dynamic-range drift during denoising. On the LIBERO benchmark, the method reaches task success rates of 98.0% for Pi 0.5 and 87.8% for GR00T N1.5, matching or exceeding the FP16 baselines (97.1% and 87.0%), while reducing the static memory footprint by 71.3%. Real-world robotic experiments show stable, precise manipulation.
📝 Abstract
Vision-Language-Action (VLA) models unify perception, reasoning, and control within a single policy, yet their multi-billion-parameter backbones and diffusion-based action heads make on-device deployment prohibitively expensive. Prior quantization efforts offer only partial solutions, compressing the LLM backbone while leaving the DiT action head at full precision, or resorting to mixed-precision schemes, driven by the belief that uniformly quantizing the action head is inherently unstable. We challenge this assumption with Omega-QVLA, the first training-free post-training quantization framework that compresses both the language backbone and the entire diffusion action head of a VLA model to a uniform W4A4 precision, eliminating the need for mixed-precision allocation. Omega-QVLA combines a composite SVD-Hadamard rotation that equalizes per-channel weight energy while diffusing residual activation outliers with per-step DiT activation scaling quantization that absorbs dynamic-range drift across denoising steps. On LIBERO, Omega-QVLA compresses Pi 0.5 and GR00T N1.5 to W4A4 with 98.0% and 87.8% task success rates, matching or exceeding their FP16 references of 97.1% and 87.0%, while reducing the static memory footprint by 71.3%. Real-world manipulation experiments further confirm smooth, accurate manipulation where prior methods fail. Code is available at https://github.com/UCMP13753/Omega-QVLA.
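The two ingredients named in the abstract can be illustrated with a small NumPy sketch. It is not the Omega-QVLA implementation: the construction Q = V·H (right singular vectors of the weight followed by a Hadamard matrix) is only one plausible reading of "composite SVD-Hadamard rotation", and the per-step calibration shown here simply stores one abs-max scale per denoising step. All function names, tensor shapes, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal Hadamard matrix via the Sylvester construction (n a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def fake_quant(x: np.ndarray, scale: float, bits: int = 4) -> np.ndarray:
    """Symmetric uniform quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

def absmax_scale(x: np.ndarray, bits: int = 4) -> float:
    """Per-tensor scale mapping the largest magnitude onto the top integer level."""
    return float(np.abs(x).max()) / (2 ** (bits - 1) - 1)

def composite_rotation(W: np.ndarray) -> np.ndarray:
    """Hypothetical composite rotation Q = V @ H (assumption, not the paper's code)."""
    _, _, Vh = np.linalg.svd(W)  # W = U @ diag(s) @ Vh
    return Vh.T @ hadamard(W.shape[1])

def rel_error(x: np.ndarray, x_hat: np.ndarray) -> float:
    """Relative L2 error between a tensor and its quantized version."""
    return float(np.linalg.norm(x - x_hat) / np.linalg.norm(x))

rng = np.random.default_rng(0)
n = 64

# Weights whose input channels carry very uneven energy.
W = rng.normal(size=(n, n)) * np.logspace(0, 2, n)
Q = composite_rotation(W)
col_energy = np.linalg.norm(W @ Q, axis=0)  # per-channel energy after rotation

# Activations with a single outlier channel.
x = rng.normal(size=(16, n))
x[:, 0] = 50.0

# Rotating weights and activations by the same orthogonal Q leaves the output unchanged.
y = x @ W.T
y_rot = (x @ Q) @ (W @ Q).T

# A plain Hadamard rotation spreads the outlier over all channels.
x_rot = x @ hadamard(n)
err_plain = rel_error(x, fake_quant(x, absmax_scale(x)))
err_rot = rel_error(x_rot, fake_quant(x_rot, absmax_scale(x_rot)))

# Per-step scaling: activation range drifts across T denoising steps (synthetic data).
T = 8
sigmas = np.geomspace(4.0, 0.25, T)
acts = [rng.normal(scale=s, size=256) for s in sigmas]
global_scale = absmax_scale(np.concatenate(acts))
err_global = float(np.mean([rel_error(a, fake_quant(a, global_scale)) for a in acts]))
err_per_step = float(np.mean([rel_error(a, fake_quant(a, absmax_scale(a))) for a in acts]))

print("output preserved by rotation:", np.allclose(y, y_rot))
print("channel energy spread after rotation:", col_energy.max() / col_energy.min())
print("4-bit activation error, plain vs rotated:", err_plain, err_rot)
print("4-bit activation error, global vs per-step scale:", err_global, err_per_step)
```

In this toy setting the rotation is free at inference time in the sense that it can be folded into the weights, and a single global activation scale wastes most of the 4-bit grid on the early, high-variance steps, which is the drift that a per-step scale is meant to absorb.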
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action models
quantization
diffusion action head
on-device deployment
uniform precision
Innovation

Methods, ideas, or system contributions that make the work stand out.

post-training quantization
vision-language-action models
uniform low-bit quantization
composite rotation
per-step scaling