🤖 AI Summary
Problem: LLaMA3 and its multimodal extensions (MLLMs) suffer significant performance degradation under low-bit quantization such as INT4 and FP8, which hinders efficient deployment. Method: We conduct a unified empirical evaluation across quantization methods (AWQ, GPTQ, FP8), tasks, and hardware platforms, and propose a quantization-robustness evaluation paradigm tailored to the full LLaMA3 family that combines per-tensor/per-channel weight compression with activation calibration; we validate its generalizability on vision-language models (VLMs) such as LLaVA-NeXT. Contribution/Results: Our approach incurs only a 1.2% average accuracy drop across mainstream benchmarks, while accelerating inference by 2.1× and reducing the GPU memory footprint by 65%. These results substantially improve deployability in resource-constrained environments and establish a best-practice pathway for production-ready quantization of LLaMA3 and related MLLMs.
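To make the weight-compression step concrete, here is a minimal NumPy sketch of symmetric per-channel INT4 weight quantization (one scale per output channel). The function names and the toy weight matrix are illustrative assumptions; the summary's actual calibration pipeline and kernels are not shown here.

```python
import numpy as np

def quantize_per_channel(w: np.ndarray, n_bits: int = 4):
    """Symmetric per-channel quantization: one scale per output channel (row).

    Returns the integer codes and the per-channel scales needed to dequantize.
    """
    qmax = 2 ** (n_bits - 1) - 1                  # 7 for INT4 (range [-8, 7])
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)      # guard all-zero channels
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover an approximate float weight matrix from codes and scales."""
    return q.astype(np.float32) * scale

# Example: quantize a small random weight matrix and check reconstruction error.
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)
q, s = quantize_per_channel(w, n_bits=4)
w_hat = dequantize(q, s)
max_err = np.abs(w - w_hat).max()                 # bounded by half a scale step
```

Per-channel scales typically preserve accuracy better than a single per-tensor scale because channels with small weights are not crushed by one outlier channel's range, which is the usual motivation for the per-channel option mentioned above.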