🤖 AI Summary
This study addresses the unclear impact of quantization on the self-explanatory capabilities of large language models, particularly its potential to undermine the credibility and consistency of explanations in high-stakes scenarios. We present the first systematic evaluation of three mainstream quantization methods across varying bit widths, assessing their effects on the quality and faithfulness of model-generated natural language explanations and counterfactual examples. Combining automated metrics with user studies, we find that quantization can degrade explanation quality by up to 4.4% and faithfulness by 2.38%, with user-perceived coherence and trustworthiness dropping by as much as 8.5%. Larger models exhibit greater robustness in faithfulness, yet no single quantization strategy consistently outperforms others across all dimensions. Our analysis reveals complex interactions among model scale, quantization approach, and explanation type, and we offer practical validation recommendations for real-world deployment.
📝 Abstract
Quantization is widely used to accelerate inference and streamline the deployment of large language models (LLMs), yet its effects on self-explanations (SEs) remain unexplored. SEs, generated by LLMs to justify their own outputs, require reasoning about the model's own decision-making process, a capability that may be particularly sensitive to quantization. As SEs are increasingly relied upon for transparency in high-stakes applications, understanding whether and to what extent quantization degrades SE quality and faithfulness is critical. To address this gap, we examine two types of SEs, natural language explanations (NLEs) and counterfactual examples, generated by LLMs quantized with three common techniques at different bit widths. Our findings indicate that quantization typically leads to moderate declines in both SE quality (up to 4.4%) and faithfulness (up to 2.38%). A user study further demonstrates that quantization diminishes both the coherence and trustworthiness of SEs (by up to 8.5%). Compared to smaller models, larger models show limited resilience to quantization in SE quality but better maintain faithfulness. Moreover, no quantization technique consistently excels across task accuracy, SE quality, and faithfulness. Because quantization's impact varies by context, we recommend validating SE quality for specific use cases, especially for NLEs, which are more sensitive. Nonetheless, the relatively minor deterioration in SE quality and faithfulness does not undermine quantization's effectiveness as a model compression technique.
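To make the bit-width trade-off concrete: the abstract does not name the three quantization techniques evaluated, but the basic mechanism they share can be sketched with simple symmetric round-to-nearest (RTN) weight quantization, an illustrative stand-in rather than the paper's actual methods. Lower bit widths leave fewer representable levels, so rounding error on the weights grows:

```python
# Illustrative sketch only: symmetric per-tensor round-to-nearest (RTN)
# quantization at a chosen bit width. The paper's three evaluated methods
# are unnamed in this excerpt; this is a generic stand-in.
import numpy as np

def quantize_rtn(weights: np.ndarray, bits: int) -> np.ndarray:
    """Quantize to signed integers at `bits` precision, then dequantize."""
    qmax = 2 ** (bits - 1) - 1              # e.g. 127 for 8-bit, 7 for 4-bit
    scale = np.max(np.abs(weights)) / qmax  # map largest weight to qmax
    q = np.clip(np.round(weights / scale), -qmax, qmax)
    return q * scale                        # dequantized weights used at inference

w = np.array([0.81, -0.35, 0.02, -1.27])
err8 = np.abs(w - quantize_rtn(w, bits=8)).max()  # small rounding error
err4 = np.abs(w - quantize_rtn(w, bits=4)).max()  # larger error at 4-bit
print(err8 <= err4)  # → True
```

The growing weight-reconstruction error at lower bit widths is the mechanism behind the quality and faithfulness declines the study measures downstream.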