Manifold-Guided Attention Steering

📅 2026-05-20

🤖 AI Summary

This work addresses reasoning errors in large language models that arise when attention-head activations drift away from correct trajectories. Existing static steering methods apply the same correction to correct and incorrect steps alike, and so perturb steps that were already right. The paper proposes a trajectory-aware, inference-time intervention that models correctness as a low-dimensional manifold in attention activation space. It learns a per-head subspace from contrastive pairs of correct and incorrect activations, monitors each head's distance to the manifold during generation, and applies a projection-based correction only when that distance exceeds a learned threshold. The method is reported to outperform both unsteered baselines and static steering across mathematical reasoning (MATH-500, GSM8K), code generation (HumanEval, MBPP), and molecular generation (SMILES).

📝 Abstract

Large language models frequently produce errors in reasoning tasks despite possessing the underlying knowledge required for correct reasoning. One approach to improving reasoning consistency is activation steering. However, existing activation steering approaches apply fixed, pre-computed correction vectors, ignoring where the model currently sits along its generation trajectory; the result is indiscriminate perturbation that disrupts already-correct steps as freely as erroneous ones. We propose Manifold-Guided Attention Steering (MAGS), a trajectory-aware inference-time intervention grounded in a geometric observation: the output activations of specific attention heads diverge from a low-dimensional correctness manifold at the point of error, and this deviation compounds through subsequent steps. For each identified attention head, we learn a low-dimensional subspace from contrastive pairs of correct and incorrect traces; this subspace captures the directions along which error behavior deviates from correct behavior. During inference, we monitor each head's proximity to the manifold and apply a targeted projection correction when deviation exceeds a learned threshold, steering the attention output back toward the correct subspace before the error propagates. MAGS consistently outperforms both unsteered baselines and static steering approaches across benchmarks spanning mathematical reasoning (MATH-500, GSM8K), code generation (HumanEval, MBPP), and molecular generation (SMILES), suggesting that correctness manifolds are a general feature of LLM attention geometry.
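
The abstract does not say how the per-head subspace is fitted or how the projection is computed, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes the error directions can be estimated as the top right singular vectors of the paired differences (incorrect minus correct), measures deviation as the norm of an activation's component in that subspace relative to the mean correct activation, and removes that component only when the deviation exceeds a threshold. The function names, the SVD-based fit, and the fixed `threshold` argument (the paper describes the threshold as learned) are all assumptions made for illustration.

```python
import numpy as np


def fit_error_subspace(correct, incorrect, rank):
    """Estimate an orthonormal basis for directions separating paired activations.

    correct, incorrect: arrays of shape (n_pairs, d); row i of each forms one pair.
    Returns (mu, basis): mu is the mean correct activation, shape (d,), and
    basis has shape (d, rank) with orthonormal columns.
    Assumption: an SVD of the paired differences stands in for the paper's
    unspecified contrastive fitting procedure.
    """
    correct = np.asarray(correct, dtype=float)
    incorrect = np.asarray(incorrect, dtype=float)
    diffs = incorrect - correct
    # Right singular vectors give the dominant deviation directions.
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return correct.mean(axis=0), vt[:rank].T


def deviation(h, mu, basis):
    """Norm of the component of (h - mu) lying in the error subspace."""
    return float(np.linalg.norm(basis.T @ (h - mu)))


def steer(h, mu, basis, threshold):
    """Remove the error-subspace component only when deviation exceeds threshold.

    Returns (activation, fired), where fired reports whether a correction was applied.
    """
    h = np.asarray(h, dtype=float)
    if deviation(h, mu, basis) <= threshold:
        return h, False  # close enough to the manifold: leave the step untouched
    return h - basis @ (basis.T @ (h - mu)), True
```

In a real model, `steer` would be called on a selected head's output at each generation step, for example from a forward hook. The conditional is what separates this from static steering: activations that are already near the manifold pass through unchanged.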

Problem

Research questions and friction points this paper is trying to address.

reasoning errors

activation steering

correctness manifold

attention heads

trajectory-aware intervention

Innovation

Methods, ideas, or system contributions that make the work stand out.

Manifold-Guided Attention Steering

activation steering

correctness manifold