🤖 AI Summary
This paper investigates how post-training mechanistically reshapes the internal representational structure of large language models (LLMs), focusing on four dimensions: knowledge storage, truthfulness judgment, refusal behavior, and confidence generation. Using interpretability techniques—including linear probing, representation intervention, hidden-layer geometric analysis, and cross-model comparison—the authors find: (1) truthfulness directions are highly transferable between pre- and post-trained models, whereas refusal directions are not; (2) the location of factual knowledge storage remains largely unchanged, but knowledge representations are adapted and augmented with new features; (3) the hypothesis that entropy neurons account for confidence differences is refuted, pointing instead to a fundamental reconfiguration of the confidence-generation pathway. Collectively, post-training selectively preserves knowledge localization and truthfulness directions while systematically reconfiguring refusal mechanisms and confidence computation routes—providing an interpretable foundation for controllable model editing and alignment.
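To make "linear probing" concrete: a probe is a small linear classifier trained on a model's hidden activations to test whether a concept (e.g. truthfulness) is linearly decodable. The sketch below is an illustrative, numpy-only logistic-regression probe on synthetic activations, not the paper's actual implementation; all function names and hyperparameters are assumptions.

```python
import numpy as np

def train_linear_probe(acts, labels, lr=0.1, steps=500):
    """Fit a logistic-regression probe on hidden activations
    (rows = examples, cols = hidden dims) with plain gradient descent."""
    w = np.zeros(acts.shape[1])
    b = 0.0
    for _ in range(steps):
        z = np.clip(acts @ w + b, -30.0, 30.0)  # clip for numerical safety
        p = 1.0 / (1.0 + np.exp(-z))            # sigmoid probabilities
        grad_w = acts.T @ (p - labels) / len(labels)
        grad_b = float(np.mean(p - labels))
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def probe_accuracy(w, b, acts, labels):
    """Fraction of examples the probe classifies correctly."""
    preds = (acts @ w + b) > 0
    return float(np.mean(preds == labels))
```

In practice the activation matrices would come from hooking a chosen layer of the base and post-trained models on the same inputs; high probe accuracy in both models is evidence that the concept is linearly represented in each.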
📝 Abstract
Post-training is essential for the success of large language models (LLMs), transforming pre-trained base models into more useful and aligned post-trained models. While many works have studied post-training algorithms and evaluated post-trained models by their outputs, how post-training reshapes LLMs internally remains understudied. In this paper, we compare base and post-trained LLMs mechanistically from four perspectives to better understand post-training effects. Our findings across model families and datasets reveal that: (1) Post-training does not change the factual knowledge storage locations; it adapts knowledge representations from the base model while developing new ones; (2) Both truthfulness and refusal can be represented by linear vectors in the hidden representation space. The truthfulness direction is highly similar between the base and post-trained model, and it transfers effectively for interventions; (3) The refusal direction differs between the base and post-trained models and shows limited forward transferability; (4) Differences in confidence between the base and post-trained models cannot be attributed to entropy neurons. Our study provides insights into the fundamental mechanisms preserved and altered during post-training, facilitates downstream tasks like model steering, and could benefit future research in interpretability and LLM post-training.
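The direction-transfer findings above can be sketched with a minimal difference-of-means recipe: extract a concept direction from each model's activations, compare the two directions by cosine similarity, and steer one model along the other's direction. This is an illustrative sketch under assumed names and a numpy-only setup, not the paper's method.

```python
import numpy as np

def mean_difference_direction(pos_acts, neg_acts):
    """Estimate a linear concept direction (e.g. truthfulness or refusal)
    as the normalized difference between class-mean hidden activations."""
    d = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def direction_similarity(d_base, d_post):
    """Cosine similarity between unit-norm directions extracted from
    the base and post-trained models; ~1 means the direction is preserved."""
    return float(np.dot(d_base, d_post))

def intervene(hidden, direction, alpha=1.0):
    """Steer a hidden state by adding a scaled concept direction;
    a transferable direction steers the other model too."""
    return hidden + alpha * direction
```

Comparing `direction_similarity` for truthfulness versus refusal directions across model pairs would mirror the paper's contrast: high similarity and successful cross-model `intervene` for truthfulness, low similarity and limited transfer for refusal.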