🤖 AI Summary
The internal organization of functional modules in large language models (LLMs) remains poorly understood, and effective methods for disentangling neurons and linking them to semantic concepts are lacking. To address this gap, this work proposes ULCMOD, an unsupervised cross-layer module discovery framework that introduces a novel objective function and an Iterative Decoupling (IterD) algorithm. ULCMOD enables, for the first time, a comprehensive functional partitioning of neurons across the entire LLM architecture and aligns these partitions with the thematic semantics of input samples. Experimental results demonstrate that the discovered modules exhibit clear semantic coherence, a hierarchical spatial structure, and task specialization, leading to strong performance on downstream tasks. This approach significantly enhances model interpretability and fills a critical void in the study of functional disentanglement in LLMs.
📝 Abstract
Understanding the internal functional organization of Large Language Models (LLMs) is crucial for improving their trustworthiness and performance. However, how LLMs organize different functions into modules remains largely unexplored. To bridge this gap, we formulate a functional module discovery problem and propose an Unsupervised LLM Cross-layer MOdule Discovery (ULCMOD) framework that simultaneously disentangles the large set of neurons in the entire LLM into modules while discovering the topics of input samples related to these modules. Our framework introduces a novel objective function and an efficient Iterative Decoupling (IterD) algorithm. Extensive experiments show that our method discovers high-quality, disentangled modules that capture more meaningful semantic information and achieve superior performance on various downstream tasks. Moreover, our qualitative analysis reveals that the discovered modules exhibit semantic coherence, interpretable specializations, and a clear spatial and hierarchical organization within the LLM. Our work provides a novel tool for interpreting the functional modules of LLMs, filling a critical gap in LLM interpretability research.
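To make the problem setup concrete, the sketch below illustrates the general flavor of the task the abstract describes: jointly grouping neurons into modules and input samples into topics from an activation matrix. This is a minimal toy demonstration, assuming a planted block structure and a simple alternating hard-assignment scheme; it is NOT the ULCMOD objective or the IterD algorithm, and all names and data here are hypothetical.

```python
import numpy as np

# Illustrative sketch only -- not the paper's method. Given neuron
# activations over input samples, alternately assign samples to "topics"
# and neurons to "modules" so that each converges with the other.

rng = np.random.default_rng(0)

# Toy activation matrix (40 samples x 30 neurons) with two planted blocks:
# topic-0 samples drive module-0 neurons, topic-1 samples drive module-1.
A = np.zeros((40, 30))
A[:20, :15] = 1.0
A[20:, 15:] = 1.0
A += 0.05 * rng.standard_normal(A.shape)

# Seed neuron-module labels from one sample's activation pattern
# (a convenience for this demo; random init with more iterations also works).
modules = (A[0] > A[0].mean()).astype(int)          # neuron -> module id

for _ in range(10):
    # Describe each sample by its per-module mean activation,
    # then assign the sample to the module/topic it excites most.
    S = np.stack([A[:, modules == m].mean(axis=1) for m in (0, 1)], axis=1)
    topics = S.argmax(axis=1)                       # sample -> topic id
    # Describe each neuron by its per-topic mean activation,
    # then assign the neuron to the topic/module that excites it most.
    N = np.stack([A[topics == t].mean(axis=0) for t in (0, 1)], axis=1)
    modules = N.argmax(axis=1)                      # neuron -> module id

# The planted structure is recovered: samples 0-19 share one topic label and
# samples 20-39 the other; likewise neurons 0-14 vs. neurons 15-29.
```

The alternating structure is the key point: sample grouping and neuron grouping inform each other, which mirrors (in a very simplified form) the abstract's claim of discovering modules and input topics simultaneously.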