🤖 AI Summary
This study addresses a gap in our understanding of how neural features evolve across the layers of large language models (LLMs), a gap that hinders fine-grained interpretability and precise intervention. We propose the first causally grounded framework for modeling cross-layer feature flow: semantic features are extracted layer-wise with sparse autoencoders, and features in adjacent layers are aligned via data-free cosine similarity matching, enabling construction of a hierarchical feature flow graph. We further introduce feature amplification and suppression mechanisms to support topic-level control of generation. Unlike prior single-layer interpretability methods, our approach enables dynamic, feature-level tracing and direct steering of internal representations. Extensive evaluation across multiple LLMs demonstrates substantial improvements in the depth of mechanistic understanding and in the accuracy of controllable text generation.
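For concreteness, here is a minimal sketch of the data-free matching step, assuming each layer's sparse autoencoder exposes a decoder weight matrix whose rows are feature directions in the residual stream; the function and variable names are illustrative, not taken from the paper:

```python
import numpy as np

def match_features(dec_l, dec_next, threshold=0.7):
    """Match SAE features between adjacent layers by cosine similarity
    of their decoder directions (data-free: no activations required).

    dec_l, dec_next: (n_features, d_model) decoder weight matrices.
    Returns a list of (feature_in_layer_l, feature_in_layer_l+1, cosine) edges.
    """
    # Normalize decoder rows to unit length so dot products are cosines.
    a = dec_l / np.linalg.norm(dec_l, axis=1, keepdims=True)
    b = dec_next / np.linalg.norm(dec_next, axis=1, keepdims=True)
    sims = a @ b.T  # (n_features_l, n_features_next) cosine matrix

    edges = []
    for i in range(sims.shape[0]):
        j = int(sims[i].argmax())       # best-matching feature in the next layer
        if sims[i, j] >= threshold:     # feature persists or transforms
            edges.append((i, j, float(sims[i, j])))
        # Features with no match above threshold can be read as dying out;
        # next-layer features that are never matched can be read as newly born.
    return edges
```

Chaining these per-layer edge lists over all adjacent layer pairs would yield the hierarchical flow graph described above, with unmatched features marking where a feature disappears or first emerges.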
📝 Abstract
We introduce a new approach to systematically map features discovered by sparse autoencoders across consecutive layers of large language models, extending earlier work on inter-layer feature links. Using a data-free cosine similarity technique, we trace how specific features persist, transform, or first appear at each stage. This method yields granular flow graphs of feature evolution, enabling fine-grained interpretability and mechanistic insight into model computations. Crucially, we demonstrate how these cross-layer feature maps facilitate direct steering of model behavior: amplifying or suppressing chosen features achieves targeted thematic control in text generation. Together, our findings highlight the utility of a causal, cross-layer interpretability framework that not only clarifies how features develop through forward passes but also provides new means for transparent manipulation of large language models.
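To illustrate what amplification and suppression could look like in practice, below is a minimal sketch assuming a PyTorch transformer whose residual stream can be modified with a forward hook; `sae_decoder`, the layer index, and the scale `alpha` are assumptions for illustration, not the paper's exact procedure:

```python
import torch

def steering_hook(decoder_direction, alpha):
    """Return a forward hook that adds alpha * direction to the residual
    stream: alpha > 0 amplifies the feature, alpha < 0 suppresses it.

    decoder_direction: (d_model,) SAE decoder row for the chosen feature.
    """
    direction = decoder_direction / decoder_direction.norm()

    def hook(module, inputs, output):
        # Transformer blocks often return a tuple (hidden_states, ...).
        hidden = output[0] if isinstance(output, tuple) else output
        # .to(hidden) matches the hidden states' dtype and device.
        hidden = hidden + alpha * direction.to(hidden)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    return hook

# Hypothetical usage with a HuggingFace-style causal LM:
# handle = model.model.layers[12].register_forward_hook(
#     steering_hook(sae_decoder[feature_id], alpha=4.0))
# ... run generation with the hook active ...
# handle.remove()
```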