🤖 AI Summary
Analyses of large language model (LLM) failures in factual question answering often conflate knowledge absence with erroneous outputs, failing to distinguish hallucination (generating false content because the relevant knowledge is genuinely missing) from deception (intentionally misrepresenting facts the model does know). This work proposes a knowledge-behavior decoupling framework that isolates these fundamentally distinct failure modes by constructing an entity-centric controlled environment and systematically analyzing four behavioral scenarios. Integrating representational disentanglement analysis, sparse interpretability methods, and inference-time activation steering, the study provides the first mechanistic distinction between hallucination and deception within LLMs. Crucially, it demonstrates that behavioral outputs can be modulated without erasing the underlying knowledge, validating the separability of these failure modes and offering a new pathway for understanding and intervening in LLM errors.
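To make the representation-separability idea concrete, here is a minimal sketch of a linear-probe check, assuming a small HuggingFace causal LM; the model name, probed layer, and the tiny prompt sets contrasting faithful answers with instructed misreporting are illustrative assumptions, not the paper's actual setup or data.

```python
# Sketch: test whether hidden states from two behavioral conditions are
# linearly separable at one layer. Model, layer, and prompts are placeholders.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model, not the paper's
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()
LAYER = 6  # hypothetical layer to probe

def last_token_state(prompt: str) -> torch.Tensor:
    """Hidden state of the final prompt token at the probed layer."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1]

# Hypothetical contrast: answer faithfully vs. be instructed to misreport
# a fact the model presumably knows.
truthful = [
    "Q: What is the capital of France? A:",
    "Q: Who wrote Hamlet? A:",
]
deceptive = [
    "Give a wrong answer. Q: What is the capital of France? A:",
    "Give a wrong answer. Q: Who wrote Hamlet? A:",
]

X = torch.stack([last_token_state(p) for p in truthful + deceptive]).numpy()
y = [0] * len(truthful) + [1] * len(deceptive)

probe = LogisticRegression(max_iter=1000).fit(X, y)
print("train separability:", probe.score(X, y))
```

A real analysis would use far larger prompt sets and held-out evaluation; perfect training accuracy on a handful of high-dimensional examples is expected and is not evidence on its own.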
📝 Abstract
Failures in large language models (LLMs) are often analyzed from a behavioral perspective, where incorrect outputs in factual question answering are commonly attributed to missing knowledge. In this work, focusing on entity-based factual queries, we suggest that such a view may conflate different failure mechanisms and propose an internal, mechanism-oriented perspective that separates Knowledge Existence from Behavior Expression. Under this formulation, hallucination and deception correspond to two qualitatively different failure modes that may appear similar at the output level but differ in their underlying mechanisms. To study this distinction, we construct a controlled environment for entity-centric factual questions in which knowledge is preserved while behavioral expression is selectively altered, enabling systematic analysis of four behavioral cases. We analyze these failure modes through representation separability, sparse interpretability, and inference-time activation steering.
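As a rough illustration of the inference-time activation steering mentioned above, the sketch below adds a fixed direction to one layer's hidden states via a forward hook during generation, assuming a HuggingFace GPT-2-style model; the layer index, steering strength, and the random placeholder direction are hypothetical stand-ins (in practice the direction would be estimated from contrastive activations, not sampled at random).

```python
# Sketch: inference-time activation steering with a forward hook.
# Model, layer, strength, and direction are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model, not the paper's
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

LAYER = 6    # hypothetical layer assumed to carry behavior-related directions
ALPHA = 4.0  # hypothetical steering strength; sign/scale would be tuned

# Placeholder direction; a real one might be the difference of mean
# activations between truthful and deceptive prompt sets.
direction = torch.randn(model.config.hidden_size)
direction = direction / direction.norm()

def steering_hook(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states;
    # add the steering direction at every position, pass the rest through.
    hidden = output[0] + ALPHA * direction.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(steering_hook)

ids = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    steered = model.generate(**ids, max_new_tokens=10, do_sample=False)
print(tok.decode(steered[0], skip_special_tokens=True))

handle.remove()  # detaching the hook restores the unsteered model
```

Because the intervention is applied only at inference time and removed by detaching the hook, the underlying weights, and hence any stored knowledge, are untouched, which is what makes the claim that behavior can be modulated while knowledge is preserved empirically testable.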