Depth-Wise Activation Steering for Honest Language Models

📅 2025-12-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) often generate factually incorrect statements despite possessing correct knowledge—a failure of honesty rather than factual accuracy—undermining auditability and safety. To address this, we propose a training-free activation steering method that introduces a depth-aware Gaussian scheduling mechanism to dynamically allocate intervention strength across network layers, enabling weighted modulation of hidden-layer representations. Unlike uniform, random, or single-layer interventions, our approach more effectively elicits intrinsic honesty without fine-tuning and maintains cross-architectural compatibility. On the MASK benchmark, it significantly improves honesty in six out of seven mainstream LLMs. Ablation studies confirm that Gaussian scheduling uniquely disentangles honesty from knowledge retention, outperforming alternative scheduling strategies. This work advances controllable honesty in LLMs through interpretable, parameter-free intervention.

📝 Abstract
Large language models sometimes assert falsehoods despite internally representing the correct answer; these are failures of honesty rather than accuracy, and they undermine auditability and safety. Existing approaches largely optimize factual correctness or depend on retraining and brittle single-layer edits, offering limited leverage over truthful reporting. We present a training-free activation steering method that weights steering strength across network depth using a Gaussian schedule. On the MASK benchmark, which separates honesty from knowledge, we evaluate seven models spanning the LLaMA, Qwen, and Mistral families and find that Gaussian scheduling improves honesty over no-steering and single-layer baselines in six of seven models. Equal-budget ablations on LLaMA-3.1-8B-Instruct and Qwen-2.5-7B-Instruct show that the Gaussian schedule outperforms random, uniform, and box-filter depth allocations, indicating that how intervention is distributed across depth materially affects outcomes beyond total strength. The method is simple, model-agnostic, requires no fine-tuning, and provides a low-cost control knob for eliciting truthful reporting from models' existing capabilities.
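The core mechanism, a Gaussian schedule over layer depth, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the center/width hyperparameters, and the normalization to a fixed total budget (used here so the schedule can be compared against uniform or box-filter allocations at equal strength) are all assumptions for the sake of the example.

```python
import numpy as np

def gaussian_depth_schedule(num_layers: int, mu: float, sigma: float,
                            budget: float) -> np.ndarray:
    """Per-layer steering strengths following a Gaussian over depth.

    Layers near `mu` receive the largest coefficients; weights are
    normalized to sum to `budget` so different schedules (uniform,
    box-filter, Gaussian) can be compared at equal total strength.
    Hypothetical parameterization -- the paper's exact form may differ.
    """
    depths = np.arange(num_layers)
    weights = np.exp(-0.5 * ((depths - mu) / sigma) ** 2)
    return budget * weights / weights.sum()

def apply_steering(hidden: np.ndarray, direction: np.ndarray,
                   alpha: float) -> np.ndarray:
    """Shift one layer's hidden state along a unit-norm steering direction."""
    return hidden + alpha * direction

# Example: a 32-layer model with the schedule centered at mid-depth.
alphas = gaussian_depth_schedule(num_layers=32, mu=16.0, sigma=4.0, budget=8.0)
```

At inference time, `alphas[i]` would scale the steering vector added to layer `i`'s residual stream; a uniform baseline would instead use `budget / num_layers` at every layer.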
Problem

Research questions and friction points this paper is trying to address.

LLMs assert falsehoods despite internally representing the correct answer
Existing methods rely on retraining or brittle single-layer edits
Optimizing factual correctness gives limited leverage over truthful reporting
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free activation steering with Gaussian depth weighting
Equal-budget ablations show depth allocation matters beyond total strength
Model-agnostic method improving honesty in six of seven models on MASK