🤖 AI Summary
To address the propensity of large language models (LLMs) to generate non-factual responses in factual question answering, this paper introduces the task of *non-factuality prediction* (NFP): predicting whether an answer will be factually correct *before* generation, enabling proactive risk identification rather than post-hoc detection. The authors observe consistent cross-model patterns in hidden question representations that correlate with non-factuality across diverse LLMs. Leveraging this insight, they propose FacLens, a lightweight, transferable probe that trains a supervised binary classifier on hidden-layer representations and uses cross-model representation alignment to generalize across LLMs. Evaluated on Llama, Qwen, and ChatGLM models, FacLens improves F1 by an average of 3.2% over baselines while using under 1M parameters and under 5 ms of inference latency, providing an efficient, model-agnostic, and proactive safeguard for LLM content safety.
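The core idea of probing hidden question representations for NFP can be sketched as a small supervised classifier. Below is a minimal, illustrative stand-in (not the paper's exact architecture): a logistic-regression probe trained on simulated "hidden representations" labeled by whether the LLM later answered factually. The dimensions, synthetic data, and training loop are all assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: hidden-layer question representations from an LLM,
# here simulated as random vectors, labeled 1 if the model would answer
# non-factually and 0 otherwise. A real probe would use actual hidden
# states and generation-based labels.
d = 64                       # hidden size (illustrative)
n = 500                      # number of labeled questions
w_true = rng.normal(size=d)  # synthetic "non-factuality direction"
X = rng.normal(size=(n, d))
y = (X @ w_true + 0.1 * rng.normal(size=n) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Train a binary classifier over the representations with gradient descent.
w = np.zeros(d)
for _ in range(300):
    p = sigmoid(X @ w)
    w -= 0.1 * X.T @ (p - y) / n

acc = ((sigmoid(X @ w) > 0.5) == y).mean()
print(f"train accuracy: {acc:.2f}")
```

Because the probe only reads a fixed-size hidden vector, it stays tiny (here `d` weights) and runs in microseconds, which is consistent with the sub-1M-parameter, sub-5 ms profile claimed for FacLens.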
📝 Abstract
Despite advancements in large language models (LLMs), non-factual responses remain prevalent. In contrast to the extensive work on post-hoc detection of such responses, this work studies non-factuality prediction (NFP), which aims to predict whether an LLM will generate a non-factual response to a question before the generation process. Previous efforts on NFP have demonstrated LLMs' awareness of their internal knowledge, but they still face challenges in efficiency and transferability. In this work, we propose a lightweight NFP model named Factuality Lens (FacLens), which effectively probes hidden representations of questions for the NFP task. Moreover, we find that hidden question representations sourced from different LLMs exhibit similar NFP patterns, which enables the transfer of FacLens across LLMs and reduces development costs. Extensive experiments highlight FacLens's superiority in both effectiveness and efficiency.