🤖 AI Summary
This work addresses the problem of assessing the veracity of the internal probabilistic knowledge of large language models (LLMs)—specifically, classifying generated statements as *true*, *false*, or *neither* (e.g., subjective, undefined, or context-dependent). To this end, we propose sAwMIL (Sparse Aware Multiple-Instance Learning), a probing method that combines multiple-instance learning with conformal prediction over a model's internal activations. Evaluated on 16 open-source LLMs (both default and chat variants) under 5 validity criteria and on 3 newly constructed datasets, sAwMIL shows that the veracity signal is often concentrated in the third quarter of a model's depth, that truth and falsehood signals are not always symmetric, and that LLMs carry a distinct third signal that is neither true nor false. Together, these findings provide a reliable, fine-grained method for assessing what LLMs "know" and how certain they are of that internal knowledge.
📝 Abstract
We often attribute human characteristics to large language models (LLMs) and claim that they "know" certain things. LLMs have internal probabilistic knowledge that represents information retained during training. How can we assess the veracity of this knowledge? We examine two common methods for probing the veracity of LLMs and find that several of their assumptions are flawed. To address these flawed assumptions, we introduce sAwMIL (short for Sparse Aware Multiple-Instance Learning), a probing method that uses the internal activations of LLMs to separate statements into true, false, and neither. sAwMIL is based on multiple-instance learning and conformal prediction. We evaluate sAwMIL on 5 validity criteria across 16 open-source LLMs, including both default and chat-based variants, as well as on 3 new datasets. Among the insights we provide are: (1) the veracity signal is often concentrated in the third quarter of an LLM's depth; (2) truth and falsehood signals are not always symmetric; (3) linear probes perform better on chat models than on default models; (4) nonlinear probes may be required to capture veracity signals for some LLMs trained with reinforcement learning from human feedback or knowledge distillation; and (5) LLMs capture a third type of signal that is distinct from true and false, and is neither true nor false. These findings provide a reliable method for verifying what LLMs "know" and how certain they are of their probabilistic internal knowledge.
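To make the abstract's ingredients concrete, here is a minimal toy sketch of the general idea of multiple-instance probing with conformal calibration. None of this code comes from the paper: the bag construction, the max-pooled linear score, the planted "veracity direction" `w_true`, and the split-conformal threshold are all illustrative assumptions standing in for sAwMIL's actual formulation.

```python
# Toy sketch (NOT the paper's sAwMIL): treat a statement's per-token hidden
# states as a "bag" (multiple-instance learning), score the bag with a linear
# probe via max-pooling, and use split-conformal calibration to decide when
# to label a statement "true" versus abstaining ("neither").
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy hidden-state dimension

def bag_score(bag, w):
    """Max-pooled linear score over a bag of token activations."""
    return float(np.max(bag @ w))

# Hypothetical "veracity direction" in activation space.
w_true = rng.normal(size=d)
w_true /= np.linalg.norm(w_true)

def make_bag(is_true):
    """Synthetic bag of 5 token activations; true bags get one token
    shifted along the veracity direction (the MIL assumption: at least
    one instance in a positive bag carries the signal)."""
    bag = rng.normal(size=(5, d))
    if is_true:
        bag[rng.integers(5)] += 3.0 * w_true
    return bag

# Split-conformal calibration: take the lower alpha-quantile of scores on
# held-out true statements as the acceptance threshold, so roughly
# (1 - alpha) of genuinely true statements score above it.
alpha = 0.1
cal_scores = sorted(bag_score(make_bag(True), w_true) for _ in range(200))
threshold = cal_scores[int(np.floor(alpha * len(cal_scores)))]

def classify(bag):
    """Label 'true' above the calibrated threshold, else abstain."""
    return "true" if bag_score(bag, w_true) >= threshold else "neither"
```

In this sketch the abstention branch plays the role of the "neither" class; the real method additionally separates false statements, which would need a second calibrated probe in the same spirit.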