🤖 AI Summary
Large language models (LLMs) exhibit a latent awareness of truthfulness yet frequently generate false statements, producing hallucinations of diverse kinds. To address this, we propose Adaptive Activation Steering (ACT), a tuning-free, inference-time method that treats truthfulness as a linearly encoded concept in the model's activation space. ACT shifts activations in the "truthful" direction during generation, using multiple truthfulness-related steering vectors with adaptively adjusted steering intensity to handle diverse categories of hallucinations, and it requires no parameter modification, enabling plug-and-play deployment. Evaluated on six open-source models from LLaMA to LLaMA3 and verified at larger scales (13B, 33B, 65B), ACT improves truthfulness by over 30% on average, with gains of up to 142%, offering a scalable, architecture-agnostic approach to more truthful generation.
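For intuition, here is a minimal sketch of how a "truthful" direction might be estimated from hidden activations as the mean difference between truthful and untruthful statements. This is an illustrative approximation under simplifying assumptions, not the paper's implementation; how the activations are collected is model-specific and omitted.

```python
import numpy as np

def truthfulness_direction(truthful_acts: np.ndarray,
                           untruthful_acts: np.ndarray) -> np.ndarray:
    """Estimate a unit-norm 'truthful' direction from hidden activations.

    Both inputs are (num_statements, hidden_dim) arrays of activations taken
    at a chosen layer for truthful vs. untruthful statements.
    """
    direction = truthful_acts.mean(axis=0) - untruthful_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)
```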
📝 Abstract
Recent studies have indicated that Large Language Models (LLMs) harbor an inherent understanding of truthfulness, yet often fail to consistently express it and instead generate false statements. This gap between "knowing" and "telling" poses a challenge for ensuring the truthfulness of generated content. Inspired by recent work showing that human-interpretable concepts are encoded linearly within large language models, we treat truthfulness as a linearly encoded concept within LLMs and introduce Adaptive Activation Steering (ACT), a tuning-free method that adaptively shifts an LLM's activations in the "truthful" direction during inference. ACT addresses diverse categories of hallucinations by utilizing diverse truthfulness-related steering vectors and adjusting the steering intensity adaptively. Applied as an add-on across various models, ACT significantly improves truthfulness in LLaMA ($\uparrow$ 142%), LLaMA2 ($\uparrow$ 24%), Alpaca ($\uparrow$ 36%), Vicuna ($\uparrow$ 28%), LLaMA2-Chat ($\uparrow$ 19%), and LLaMA3 ($\uparrow$ 34%). Furthermore, we verify ACT's scalability on larger models (13B, 33B, 65B), underscoring its adaptability to large-scale language models. Our code is available at https://github.com/tianlwang/ACT.
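As a rough illustration of what "shifting activations in the truthful direction during inference" can look like in code, the PyTorch sketch below adds a steering vector to a decoder layer's output through a forward hook, scaling it by a simple alignment-based rule. The layer index, the scaling rule, and the single-vector setup are assumptions made for clarity; ACT's actual multi-vector, adaptive procedure is described in the paper and the linked repository.

```python
# Illustrative sketch of inference-time activation steering, not ACT itself.
import torch

def make_steering_hook(direction: torch.Tensor, base_alpha: float = 5.0):
    """Return a forward hook that nudges hidden states toward `direction`."""
    direction = direction / direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        d = direction.to(dtype=hidden.dtype, device=hidden.device)
        # Steer more strongly when the activation is less aligned with the
        # truthful direction (one plausible adaptive rule, not necessarily ACT's).
        sim = (hidden * d).sum(dim=-1) / (hidden.norm(dim=-1) + 1e-8)
        alpha = base_alpha * (1.0 - sim).clamp(min=0.0).unsqueeze(-1)
        steered = hidden + alpha * d
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered

    return hook

# Hypothetical usage with a Hugging Face LLaMA-style model:
# handle = model.model.layers[20].register_forward_hook(make_steering_hook(direction))
# outputs = model.generate(**inputs)
# handle.remove()
```

Because the intervention lives entirely in a forward hook, the underlying weights are untouched, which is what makes this style of steering a plug-and-play add-on at inference time.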