🤖 AI Summary
Deploying vision-language models on edge devices is often hindered by resource constraints and distribution shifts, while existing test-time adaptation (TTA) methods are impractical due to their high computational overhead. This work proposes LQA, a novel framework that integrates modality-aware quantization with a gradient-free TTA mechanism and introduces a Selective Hybrid Quantization (SHQ) strategy. LQA significantly reduces memory and computational costs while preserving model robustness. Experiments across seven open-source datasets demonstrate that LQA improves average adaptation performance by 4.5%, substantially lowers memory usage compared to full-precision models, and achieves up to a 19.9× reduction in memory consumption relative to gradient-based TTA approaches.
📝 Abstract
Deploying Vision-Language Models (VLMs) on edge devices is challenged by resource constraints and performance degradation under distribution shifts. While test-time adaptation (TTA) can counteract such shifts, existing methods are too resource-intensive for on-device deployment. To address this challenge, we propose LQA, a lightweight, quantized-adaptive framework for VLMs that combines a modality-aware quantization strategy with gradient-free test-time adaptation. We introduce Selective Hybrid Quantization (SHQ) and a quantized, gradient-free adaptation mechanism to enable robust and efficient VLM deployment on resource-constrained hardware. Experiments across both synthetic and real-world distribution shifts show that LQA improves overall adaptation performance by 4.5%, uses less memory than full-precision models, and significantly outperforms gradient-based TTA methods, achieving up to 19.9× lower memory usage across seven open-source datasets. These results demonstrate that LQA offers a practical pathway for robust, privacy-preserving, and efficient VLM deployment on edge devices.
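The abstract does not specify LQA's quantization or adaptation rules, but the general combination it describes — low-bit weight quantization paired with a gradient-free, inference-only adaptation step — can be sketched generically. The snippet below is a minimal illustration, not the paper's method: it applies symmetric per-tensor int8 quantization to a classifier's weights, then adapts predictions at test time by averaging the softmax outputs of the most confident (lowest-entropy) augmented views, a common gradient-free TTA heuristic. All names (`quantize_int8`, `gradient_free_tta`, `keep_frac`) are illustrative assumptions.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: w ~= scale * q."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def entropy(p):
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def gradient_free_tta(logits_per_view, keep_frac=0.5):
    """Gradient-free adaptation: keep the lowest-entropy (most
    confident) augmented views and average their class probabilities.
    No backward pass, so no optimizer state or gradient memory."""
    probs = softmax(logits_per_view)
    ent = entropy(probs)
    k = max(1, int(len(probs) * keep_frac))
    keep = np.argsort(ent)[:k]
    return probs[keep].mean(axis=0)

# Demo: a quantized linear classifier scored on 8 augmented views.
rng = np.random.default_rng(0)
W = rng.normal(size=(16, 4)).astype(np.float32)     # 16 features -> 4 classes
q, s = quantize_int8(W)
views = rng.normal(size=(8, 16)).astype(np.float32) # augmented test views
logits = views @ dequantize(q, s)
pred = gradient_free_tta(logits)                    # adapted class distribution
```

Because adaptation here is pure filtering and averaging of forward-pass outputs, memory stays at inference level — which is the kind of saving over gradient-based TTA the abstract quantifies.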