🤖 AI Summary
Problem: Neural networks’ high-dimensional internal representations remain notoriously difficult to interpret. Method: Inspired by human cognitive “chunking” (the process of grouping information into meaningful units), we propose a causal mapping framework linking embedding states, semantic concepts, and behavioral responses. The approach integrates state clustering, perturbation-based attribution, and dictionary learning to extract reproducible, semantically coherent “semantic chunks” from both RNNs trained on synthetic sequences and real-world LLaMA language model embeddings. Results: In RNNs, the method recovers the ground-truth sequence patterns; in LLaMA, it precisely localizes semantic chunks and controllably activates or suppresses them, markedly improving interpretability and enabling targeted intervention. Contribution: This work establishes the first cognition-inspired, structured interpretability paradigm grounded in chunking theory, providing both a novel theoretical foundation and practical tools for explaining large language models.
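To make the dictionary-extraction idea concrete, here is a minimal toy sketch of the clustering step: recurring hidden states are recovered as cluster centroids, which serve as the “dictionary of chunks.” Everything here is a synthetic stand-in (random prototypes, a hand-rolled k-means), not the paper’s actual pipeline or any LLaMA internals.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three ground-truth "chunk" states in an 8-dimensional hidden space.
prototypes = rng.normal(size=(3, 8))

# A simulated trajectory: each time step visits one prototype plus noise.
true_labels = rng.integers(0, 3, size=300)
states = prototypes[true_labels] + 0.05 * rng.normal(size=(300, 8))

def extract_chunks(X, k, iters=20):
    """Minimal k-means with farthest-point init; returns (centroids, labels)."""
    centroids = [X[0]]
    for _ in range(k - 1):
        # Distance of every state to its nearest chosen centroid so far.
        dists = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        centroids.append(X[dists.argmax()])  # pick the farthest state
    centroids = np.array(centroids)
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centroids[None], axis=-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

dictionary, labels = extract_chunks(states, k=3)

# Each recovered centroid lies close to one ground-truth prototype.
residuals = np.linalg.norm(dictionary[:, None] - prototypes[None], axis=-1).min(axis=1)
print(residuals)
```

Because the synthetic clusters are well separated relative to the noise, the recovered dictionary matches the planted prototypes up to a small residual; the paper’s actual method additionally uses perturbation-based attribution and dictionary learning on real embeddings.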
📝 Abstract
Understanding neural networks is challenging due to their high-dimensional, interacting components. Inspired by human cognition, which processes complex sensory data by chunking it into recurring entities, we propose leveraging this principle to interpret the population activity of artificial neural networks. Biological and artificial intelligence share the challenge of learning from structured, naturalistic data, and we hypothesize that the cognitive mechanism of chunking can provide insight into artificial systems. We first demonstrate this concept in recurrent neural networks (RNNs) trained on artificial sequences with imposed regularities, observing that their hidden states reflect these patterns, which can be extracted as a dictionary of chunks that influence network responses. Extending this to large language models (LLMs) such as LLaMA, we identify similar recurring embedding states corresponding to concepts in the input; perturbing these states activates or inhibits the associated concepts. By developing methods that extract dictionaries of identifiable chunks across neural embeddings of varying complexity, this work introduces a new framework for interpreting neural networks, framing their population activity as structured reflections of the data they process.
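The perturbation claim can be sketched in miniature: nudging an embedding along a chunk direction should raise a linear read-out for the associated concept, while a negative step should lower it. The chunk vector and probe below are synthetic stand-ins assumed for illustration, not actual LLaMA components.

```python
import numpy as np

rng = np.random.default_rng(42)
dim = 16

# Unit "chunk" direction standing in for a recurring embedding state.
chunk = rng.normal(size=dim)
chunk /= np.linalg.norm(chunk)

embedding = rng.normal(size=dim)

def concept_score(e, direction):
    """Linear probe: projection of the embedding onto the chunk direction."""
    return float(e @ direction)

base = concept_score(embedding, chunk)
activated = concept_score(embedding + 2.0 * chunk, chunk)   # amplify the chunk
suppressed = concept_score(embedding - 2.0 * chunk, chunk)  # suppress the chunk

# Along a unit direction, a +/-2.0 step shifts the score by +/-2.0
# (up to floating-point error), so activated > base > suppressed.
print(activated - base, base - suppressed)
```

In the actual experiments the read-out is the model’s behavior rather than a dot product, but the mechanism (adding or subtracting a chunk state from the embedding) is the same in spirit.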