Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders

πŸ“… 2026-01-06
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work proposes a contrastive feature retrieval framework based on sparse autoencoders (SAEs) that reliably links the internal features of large language models (LLMs) to high-level semantic behaviors, such as personality traits, and controls them precisely. By combining statistical activation analysis with generation-based validation, the method isolates monosemantic functional features from the sparse activation space and enables bidirectional intervention on model behavior. The study identifies a phenomenon the authors term "Functional Faithfulness," whereby manipulating a single internal feature induces coordinated shifts across multiple dimensions of linguistic behavior. Evaluated on the Big Five personality traits, the approach significantly outperforms existing techniques such as Contrastive Activation Addition (CAA), achieving stable and precise control of semantic behavior and demonstrating that LLMs encode highly integrated representations of high-order concepts.
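
A minimal sketch of the contrastive retrieval idea described above, assuming SAE feature activations have already been collected on two prompt sets expressing opposite poles of a trait (e.g., high vs. low extraversion). The function name, tensor shapes, and the simple mean-difference scoring rule are illustrative stand-ins, not the paper's actual pipeline, which additionally applies generation-based validation:

```python
import torch

def rank_contrastive_features(acts_pos: torch.Tensor,
                              acts_neg: torch.Tensor,
                              top_k: int = 10) -> torch.Tensor:
    """Rank SAE features by how strongly they separate two prompt sets.

    acts_pos / acts_neg: (n_prompts, n_features) SAE activations on the
    positive and negative pole of a semantic contrast.
    """
    # Mean activation of each feature on each pole of the contrast.
    mu_pos = acts_pos.mean(dim=0)
    mu_neg = acts_neg.mean(dim=0)
    # Features whose mean activation differs most across poles are the
    # candidate "functional" features to validate downstream.
    score = (mu_pos - mu_neg).abs()
    return torch.topk(score, top_k).indices

# Toy usage with random stand-in activations over 4096 SAE features.
pos = torch.rand(32, 4096)
neg = torch.rand(32, 4096)
print(rank_contrastive_features(pos, neg, top_k=5))
```

Candidates surfaced this way would still need the paper's generative validation step before being treated as monosemantic controls.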

πŸ“ Abstract
Recent work in Mechanistic Interpretability (MI) has enabled the identification of, and intervention on, internal features in Large Language Models (LLMs). However, a persistent challenge lies in linking such internal features to the reliable control of complex, behavior-level semantic attributes in language generation. In this paper, we propose a Sparse Autoencoder-based framework for retrieving and steering semantically interpretable internal features associated with high-level linguistic behaviors. Our method employs a contrastive feature retrieval pipeline based on controlled semantic oppositions, combining statistical activation analysis and generation-based validation to distill monosemantic functional features from sparse activation spaces. Using the Big Five personality traits as a case study, we demonstrate that our method enables precise, bidirectional steering of model behavior while maintaining superior stability and performance compared to existing activation steering methods like Contrastive Activation Addition (CAA). We further identify an empirical effect, which we term Functional Faithfulness, whereby intervening on a specific internal feature induces coherent and predictable shifts across multiple linguistic dimensions aligned with the target semantic attribute. Our findings suggest that LLMs internalize deeply integrated representations of high-order concepts, and provide a novel, robust mechanistic path for the regulation of complex AI behaviors.
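
To make the steering step concrete: a minimal sketch, assuming a selected SAE feature's decoder direction is added to a layer's hidden state with a signed coefficient, positive to push toward the trait and negative to push away (the "bidirectional" intervention). The hook mechanics, the stand-in layer, and all names here are hypothetical illustrations, not the authors' implementation:

```python
import torch
import torch.nn as nn

def make_steering_hook(direction: torch.Tensor, alpha: float):
    """Forward hook that adds alpha * direction to a layer's output."""
    def hook(module, inputs, output):
        # Some transformer blocks return a tuple whose first element is
        # the hidden state; handle both cases.
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * direction.to(hidden.dtype)
        return (steered,) + output[1:] if isinstance(output, tuple) else steered
    return hook

# Toy usage: steer a stand-in "layer" along a unit-norm feature direction.
layer = nn.Linear(512, 512)
d_i = torch.randn(512)
d_i = d_i / d_i.norm()  # unit-norm decoder direction for feature i (synthetic)
handle = layer.register_forward_hook(make_steering_hook(d_i, alpha=4.0))
out = layer(torch.randn(1, 8, 512))  # steered forward pass
handle.remove()                      # detach to restore baseline behavior
```

Flipping the sign of `alpha` reverses the direction of the behavioral shift, which is what bidirectional steering amounts to under this assumption.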
Problem

Research questions and friction points this paper is trying to address.

Mechanistic Interpretability
Large Language Models
Semantic Features
Behavioral Control
High-Order Concepts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Autoencoder
Mechanistic Interpretability
Feature Steering
Functional Faithfulness
Semantic Control
Ruikang Zhang
Peking University, Beijing, China

Shuo Wang
Peking University, Beijing, China

Qi Su
Peking University, Beijing, China
Computational Linguistics
Corpus Linguistics
Digital Humanities