🤖 AI Summary
To address hallucination in large language models (LLMs), this paper proposes a layer-pruning-based contrastive decoding method. Instead of relying on conventional early-exit mechanisms, it constructs a lightweight, domain-agnostic "contrastive model" (an amateur model) by pruning the top layers of the Transformer; this pruned model runs in parallel with the full-parameter model during inference, and their logits are dynamically weighted and fused. Because the pruned model produces more informative, better-aligned logits than early-exit outputs, the approach yields stronger discriminative contrastive signals. Experiments demonstrate substantial improvements in factual accuracy (e.g., +3.2–7.8 points on FactScore and FEVER) while incurring only ~12% additional latency, keeping inference overhead practical. The core contribution is the first use of structured layer pruning to instantiate a contrastive model, enabling efficient, plug-and-play factual consistency enhancement.
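The contrastive fusion step can be sketched with the standard contrastive-decoding rule: score each token by the gap between the expert (full) model's and the amateur (pruned) model's log-probabilities, restricted to tokens the expert deems plausible. This is a minimal sketch under assumed details; the plausibility cutoff `beta` and the exact weighting follow DoLa-style contrastive decoding and may differ from PruneCD's actual fusion scheme.

```python
import numpy as np

def log_softmax(x):
    # Numerically stable log-softmax over a 1-D logit vector.
    x = x - np.max(x)
    return x - np.log(np.sum(np.exp(x)))

def contrastive_next_token(expert_logits, amateur_logits, beta=0.1):
    """Pick the next token by contrasting the full ('expert') model
    against the layer-pruned ('amateur') model.

    beta is an assumed hyperparameter: only tokens whose expert
    probability is >= beta * (max expert probability) compete, which
    keeps the contrast from promoting implausible tokens.
    """
    log_p_expert = log_softmax(np.asarray(expert_logits, dtype=float))
    log_p_amateur = log_softmax(np.asarray(amateur_logits, dtype=float))
    p_expert = np.exp(log_p_expert)

    # Plausibility mask: prune tokens the expert itself finds unlikely.
    valid = p_expert >= beta * p_expert.max()

    # Contrastive score: tokens the amateur over-predicts are penalized.
    score = np.where(valid, log_p_expert - log_p_amateur, -np.inf)
    return int(np.argmax(score))

# Toy example: both models favor token 0, but the amateur is far more
# confident about it, so the contrast promotes token 1 instead.
expert = np.array([5.0, 4.5, 0.0])
amateur = np.array([5.0, 1.0, 0.0])
print(contrastive_next_token(expert, amateur))  # → 1
```

The key design point illustrated here is that the contrast rewards tokens where the full model's extra layers add probability mass the shallow amateur lacks, which is exactly the signal the paper associates with factual knowledge.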
📝 Abstract
To mitigate the hallucination problem in large language models, DoLa exploits early exit logits from the same model as a contrastive prior. However, we found that these early exit logits tend to be flat, low in magnitude, and fail to reflect meaningful contrasts. To address this, we propose PruneCD, a novel contrastive decoding method that constructs the amateur model via layer pruning rather than early exit. This design leads to more informative and well-aligned logits, enabling more effective contrastive decoding. Through qualitative and quantitative analyses, we demonstrate that PruneCD consistently improves factuality with minimal inference overhead, offering a robust and practical approach to mitigating hallucinations in LLMs.