🤖 AI Summary
This study addresses a gap in our understanding of how neural features evolve across the layers of large language models (LLMs), a gap that hinders fine-grained interpretability and precise intervention. We propose the first causally grounded framework for modeling cross-layer feature flow: semantic features are extracted layer-wise with sparse autoencoders, and features in adjacent layers are aligned via data-free cosine similarity matching, enabling construction of a hierarchical feature flow graph. We further introduce feature amplification and suppression mechanisms to support topic-level control of generation. Unlike prior single-layer interpretability methods, our approach enables dynamic, feature-level tracing and direct steering of internal representations. Extensive evaluation across multiple LLMs demonstrates substantial improvements in the depth of mechanistic understanding and in the accuracy of controllable text generation.
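For concreteness, here is a minimal sketch of the data-free matching step, assuming each layer's sparse autoencoder exposes a decoder weight matrix whose rows are feature directions in the residual stream; the function and variable names are illustrative, not taken from the paper:

```python
import numpy as np

def match_features(dec_l, dec_next, threshold=0.7):
    """Match SAE features between adjacent layers by cosine similarity
    of their decoder directions (data-free: no activations required).

    dec_l, dec_next: (n_features, d_model) decoder weight matrices.
    Returns a list of (feature_in_layer_l, feature_in_layer_l+1, cosine) edges.
    """
    # Normalize decoder rows to unit length so dot products are cosines.
    a = dec_l / np.linalg.norm(dec_l, axis=1, keepdims=True)
    b = dec_next / np.linalg.norm(dec_next, axis=1, keepdims=True)
    sims = a @ b.T  # (n_features_l, n_features_next) cosine matrix

    edges = []
    for i in range(sims.shape[0]):
        j = int(sims[i].argmax())       # best-matching feature in the next layer
        if sims[i, j] >= threshold:     # feature persists or transforms
            edges.append((i, j, float(sims[i, j])))
        # Features with no match above threshold can be read as dying out;
        # next-layer features that are never matched can be read as newly born.
    return edges
```

Chaining these per-layer edge lists over all adjacent layer pairs would yield the hierarchical flow graph described above, with unmatched features marking where a feature disappears or first emerges.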
📝 Abstract
We introduce a new approach to systematically map features discovered by sparse autoencoders across consecutive layers of large language models, extending earlier work on inter-layer feature links. Using a data-free cosine similarity technique, we trace how specific features persist, transform, or first appear at each stage. This method yields granular flow graphs of feature evolution, enabling fine-grained interpretability and mechanistic insight into model computations. Crucially, we demonstrate how these cross-layer feature maps facilitate direct steering of model behavior: amplifying or suppressing chosen features achieves targeted thematic control in text generation. Together, our findings highlight the utility of a causal, cross-layer interpretability framework that not only clarifies how features develop through forward passes but also provides new means for transparent manipulation of large language models.
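To illustrate what amplification and suppression could look like in practice, below is a minimal sketch assuming a PyTorch transformer whose residual stream can be modified with a forward hook; `sae_decoder`, the layer index, and the scale `alpha` are assumptions for illustration, not the paper's exact procedure:

```python
import torch

def steering_hook(decoder_direction, alpha):
    """Return a forward hook that adds alpha * direction to the residual
    stream: alpha > 0 amplifies the feature, alpha < 0 suppresses it.

    decoder_direction: (d_model,) SAE decoder row for the chosen feature.
    """
    direction = decoder_direction / decoder_direction.norm()

    def hook(module, inputs, output):
        # Transformer blocks often return a tuple (hidden_states, ...).
        hidden = output[0] if isinstance(output, tuple) else output
        # .to(hidden) matches the hidden states' dtype and device.
        hidden = hidden + alpha * direction.to(hidden)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    return hook

# Hypothetical usage with a HuggingFace-style causal LM:
# handle = model.model.layers[12].register_forward_hook(
#     steering_hook(sae_decoder[feature_id], alpha=4.0))
# ... run generation with the hook active ...
# handle.remove()
```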