Exploring the Limits of Model Compression in LLMs: A Knowledge Distillation Study on QA Tasks

📅 2025-07-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational cost that hinders large language model (LLM) deployment in resource-constrained settings, this work systematically investigates the performance limits of knowledge distillation (KD) for compressing LLMs on question answering (QA). Using the Pythia and Qwen2.5 model families as teachers, we evaluate distilled student models on SQuAD and MLQA under zero-shot and one-shot settings. Our key finding is that minimal prompting combined with KD enables students to retain over 90% of teacher QA performance while reducing parameter count by up to 57.1%; notably, one-shot adaptation further boosts accuracy. Results demonstrate that lightweight distilled models maintain strong generalization despite significant parameter reduction, offering a reproducible, balanced trade-off between efficiency and capability and a practical pathway for low-overhead, high-fidelity LLM deployment.

📝 Abstract
Large Language Models (LLMs) have demonstrated outstanding performance across a range of NLP tasks; however, their computational demands hinder their deployment in real-world, resource-constrained environments. This work investigates the extent to which LLMs can be compressed using Knowledge Distillation (KD) while maintaining strong performance on Question Answering (QA) tasks. We evaluate student models distilled from the Pythia and Qwen2.5 families on two QA benchmarks, SQuAD and MLQA, under zero-shot and one-shot prompting conditions. Results show that student models retain over 90% of their teacher models' performance while reducing parameter counts by up to 57.1%. Furthermore, one-shot prompting yields additional performance gains over zero-shot setups for both model families. These findings underscore the trade-off between model efficiency and task performance, demonstrating that KD, combined with minimal prompting, can yield compact yet capable QA systems suitable for resource-constrained applications.
Problem

Research questions and friction points this paper is trying to address.

Compress LLMs using Knowledge Distillation for efficiency
Maintain QA task performance with fewer parameters
Evaluate distilled models on SQuAD and MLQA benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Knowledge Distillation for LLM compression
One-shot prompting boosts QA performance
Compact student models retain over 90% of teacher performance
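The summary above names knowledge distillation as the compression mechanism but does not reproduce the paper's training objective. As a point of reference, the classic soft-target KD loss (Hinton-style) mixes hard-label cross-entropy with a temperature-scaled KL divergence toward the teacher's output distribution. The sketch below is a minimal NumPy illustration of that standard objective, not the paper's verified implementation; the temperature `T` and mixing weight `alpha` are hypothetical values.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Standard KD objective: alpha * CE(student, labels)
    + (1 - alpha) * T^2 * KL(teacher_soft || student_soft).
    T and alpha are illustrative hyperparameters, not from the paper."""
    eps = 1e-12
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    # KL divergence between temperature-softened distributions
    kl = np.sum(p_teacher * (np.log(p_teacher + eps) - np.log(p_student + eps)),
                axis=-1).mean()
    # Hard-label cross-entropy at T = 1
    probs = softmax(student_logits)
    ce = -np.log(probs[np.arange(len(labels)), labels] + eps).mean()
    return alpha * ce + (1 - alpha) * (T ** 2) * kl
```

When the student exactly matches the teacher, the KL term vanishes and only the hard-label cross-entropy remains; during training, the `T**2` factor keeps the soft-target gradient magnitude comparable across temperatures.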