🤖 AI Summary
This study investigates the “belief depth” of large language models (LLMs) after knowledge editing—i.e., whether edited facts are genuinely internalized, exhibiting semantic consistency and robustness. To this end, we propose the first quantifiable, three-dimensional evaluation framework: generalization, resistance to adversarial questioning, and representational similarity. We combine linear probing, self-consistency testing, adversarial interrogation, and synthetic document fine-tuning (SDF) for multi-faceted validation. Results show that SDF effectively implants plausible facts and induces deep, stable beliefs. However, when edited facts contradict common sense, models exhibit only superficial acceptance: their internal representations deviate significantly from natural knowledge distributions and remain vulnerable to refutation by targeted queries. This work exposes a fundamental limitation of current knowledge editing methods—their inability to ensure deep, coherent belief integration—and establishes both a theoretical benchmark and an empirical methodology for trustworthy, semantically grounded knowledge updating.
📝 Abstract
Knowledge editing techniques promise to implant new factual knowledge into large language models (LLMs). But do LLMs really believe these facts? We develop a framework for measuring belief depth and use it to evaluate the success of knowledge editing techniques. We operationalize belief depth as the extent to which implanted knowledge (1) generalizes to related contexts (e.g., Fermi estimates several logical steps removed), (2) is robust to self-scrutiny and direct challenge, and (3) is represented similarly to genuine knowledge (as measured by linear probes). Our evaluations show that simple prompting and mechanistic editing techniques fail to implant knowledge deeply. In contrast, Synthetic Document Finetuning (SDF), in which models are trained on LLM-generated documents consistent with a fact, often succeeds at implanting beliefs that behave like genuine knowledge. However, SDF's success is not universal: implanted beliefs that contradict basic world knowledge remain brittle and representationally distinct from genuine knowledge. Overall, our work introduces measurable criteria for belief depth and enables the rigorous evaluation necessary for deploying knowledge editing in real-world applications.
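To make the linear-probe criterion concrete, here is a minimal, self-contained sketch of one common probing recipe (a "mass-mean" probe). Everything below is illustrative, not the paper's actual code: the activations are synthetic stand-ins, and the dimension, shift magnitude, and sample counts are arbitrary assumptions. A real evaluation would train the probe on hidden states extracted from the edited LLM.

```python
# Illustrative sketch only: a mass-mean linear probe over activations,
# one way to quantify whether an implanted fact is represented like
# genuine knowledge. All data here is synthetic (assumption), not
# hidden states from a real model.
import numpy as np

rng = np.random.default_rng(0)
d = 64  # assumed hidden-state dimension

# Simulate a latent "truth direction": activations for true statements
# shift along w_true, false statements shift the opposite way.
w_true = rng.normal(size=d)
true_acts = rng.normal(size=(200, d)) + 1.5 * w_true
false_acts = rng.normal(size=(200, d)) - 1.5 * w_true

# Mass-mean probe: direction = mean(true) - mean(false);
# the decision threshold sits at the midpoint of the class means.
direction = true_acts.mean(axis=0) - false_acts.mean(axis=0)
threshold = 0.5 * ((true_acts @ direction).mean()
                   + (false_acts @ direction).mean())

def probe_score(act: np.ndarray) -> float:
    """Signed projection onto the probe direction; > 0 reads as 'true-like'."""
    return float(act @ direction - threshold)

# Held-out accuracy: how cleanly the probe separates true from false.
test_true = rng.normal(size=(100, d)) + 1.5 * w_true
test_false = rng.normal(size=(100, d)) - 1.5 * w_true
acc = 0.5 * ((test_true @ direction > threshold).mean()
             + (test_false @ direction < threshold).mean())

# A deeply implanted fact should land on the probe's 'true' side, just
# as genuine knowledge does; a shallow edit might leave it elsewhere.
implanted = rng.normal(size=d) + 1.5 * w_true
print(f"probe accuracy: {acc:.2f}, "
      f"implanted fact true-like: {probe_score(implanted) > 0}")
```

Under this criterion, "representationally distinct from genuine knowledge" means the edited fact's activation scores poorly under a probe trained on the model's natural true/false representations.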