🤖 AI Summary
This work addresses the problem of assessing the veracity of the internal probabilistic knowledge of large language models (LLMs)—specifically, classifying generated statements as *true*, *false*, or *neither* (e.g., subjective, undefined, or context-dependent). To this end, we propose sAwMIL (Sparse Aware Multiple-Instance Learning), a probing method that combines multiple-instance learning with conformal prediction over a model's internal activations. Evaluated on 16 open-source LLMs (both default and chat variants) under 5 validity criteria and on 3 newly constructed datasets, sAwMIL shows that the veracity signal is often concentrated in the third quarter of a model's depth, that truth and falsehood signals are not always symmetric, and that LLMs carry a distinct third signal that is neither true nor false. Together, these findings provide a reliable, fine-grained method for assessing what LLMs "know" and how certain they are of that internal knowledge.
📝 Abstract
We often attribute human characteristics to large language models (LLMs) and claim that they "know" certain things. LLMs have internal probabilistic knowledge that represents information retained during training. How can we assess the veracity of this knowledge? We examine two common methods for probing the veracity of LLMs and find that several of their assumptions are flawed. To address these flawed assumptions, we introduce sAwMIL (short for Sparse Aware Multiple-Instance Learning), a probing method that uses the internal activations of LLMs to separate statements into true, false, and neither. sAwMIL is based on multiple-instance learning and conformal prediction. We evaluate sAwMIL on 5 validity criteria across 16 open-source LLMs, including both default and chat-based variants, as well as on 3 new datasets. Among the insights we provide are: (1) the veracity signal is often concentrated in the third quarter of an LLM's depth; (2) truth and falsehood signals are not always symmetric; (3) linear probes perform better on chat models than on default models; (4) nonlinear probes may be required to capture veracity signals for some LLMs trained with reinforcement learning from human feedback or knowledge distillation; and (5) LLMs capture a third type of signal that is distinct from true and false, and is neither true nor false. These findings provide a reliable method for verifying what LLMs "know" and how certain they are of their probabilistic internal knowledge.
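To make the abstract's ingredients concrete, here is a minimal toy sketch of the general idea of multiple-instance probing with conformal calibration. None of this code comes from the paper: the bag construction, the max-pooled linear score, the planted "veracity direction" `w_true`, and the split-conformal threshold are all illustrative assumptions standing in for sAwMIL's actual formulation.

```python
# Toy sketch (NOT the paper's sAwMIL): treat a statement's per-token hidden
# states as a "bag" (multiple-instance learning), score the bag with a linear
# probe via max-pooling, and use split-conformal calibration to decide when
# to label a statement "true" versus abstaining ("neither").
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy hidden-state dimension

def bag_score(bag, w):
    """Max-pooled linear score over a bag of token activations."""
    return float(np.max(bag @ w))

# Hypothetical "veracity direction" in activation space.
w_true = rng.normal(size=d)
w_true /= np.linalg.norm(w_true)

def make_bag(is_true):
    """Synthetic bag of 5 token activations; true bags get one token
    shifted along the veracity direction (the MIL assumption: at least
    one instance in a positive bag carries the signal)."""
    bag = rng.normal(size=(5, d))
    if is_true:
        bag[rng.integers(5)] += 3.0 * w_true
    return bag

# Split-conformal calibration: take the lower alpha-quantile of scores on
# held-out true statements as the acceptance threshold, so roughly
# (1 - alpha) of genuinely true statements score above it.
alpha = 0.1
cal_scores = sorted(bag_score(make_bag(True), w_true) for _ in range(200))
threshold = cal_scores[int(np.floor(alpha * len(cal_scores)))]

def classify(bag):
    """Label 'true' above the calibrated threshold, else abstain."""
    return "true" if bag_score(bag, w_true) >= threshold else "neither"
```

In this sketch the abstention branch plays the role of the "neither" class; the real method additionally separates false statements, which would need a second calibrated probe in the same spirit.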