Towards Interpretability Without Sacrifice: Faithful Dense Layer Decomposition with Mixture of Decoders

📅 2025-05-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the density, opacity, and limited editability of MLP layers in large language models (LLMs), this work introduces a layer-level sparse decomposition paradigm, departing from conventional neuron-level sparsity, and proposes Mixture of Decoders (MxDs). MxDs expand a single MLP layer into tens of thousands of specialized, sparsely activated sublayers, each of which retains a full-rank linear transformation even under heavy sparsity. Built on a flexible tensor factorization, MxDs are compatible with both standard MLPs and GLU architectures and support sparse probing and feature-level steering. On language models with up to 3B parameters, MxDs achieve a superior sparsity-accuracy trade-off compared to state-of-the-art methods such as Transcoders. Empirical results show that MxDs learn specialized natural-language features while enabling faithful, interpretable, and editable reconstructions of MLP layers.

📝 Abstract
Multilayer perceptrons (MLPs) are an integral part of large language models, yet their dense representations render them difficult to understand, edit, and steer. Recent methods learn interpretable approximations via neuron-level sparsity, yet fail to faithfully reconstruct the original mapping--significantly increasing the model's next-token cross-entropy loss. In this paper, we advocate for moving to layer-level sparsity to overcome the accuracy trade-off in sparse layer approximation. Under this paradigm, we introduce Mixture of Decoders (MxDs). MxDs generalize MLPs and Gated Linear Units, expanding pre-trained dense layers into tens of thousands of specialized sublayers. Through a flexible form of tensor factorization, each sparsely activating MxD sublayer implements a linear transformation with full-rank weights--preserving the original decoders' expressive capacity even under heavy sparsity. Experimentally, we show that MxDs significantly outperform state-of-the-art methods (e.g., Transcoders) on the sparsity-accuracy frontier in language models with up to 3B parameters. Further evaluations on sparse probing and feature steering demonstrate that MxDs learn similarly specialized features of natural language--opening up a promising new avenue for designing interpretable yet faithful decompositions. Our code is included at: https://github.com/james-oldfield/MxD/.
Problem

Research questions and friction points this paper is trying to address.

Improving interpretability of MLPs without losing accuracy
Overcoming accuracy trade-offs in sparse layer approximation
Preserving expressive capacity in decomposed dense layers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Layer-level sparsity for accurate approximation
Mixture of Decoders expands dense layers
Full-rank weights preserve expressive capacity
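To make the core idea concrete, here is a minimal NumPy sketch of a sparsely activated mixture-of-decoders layer: a gate scores every sublayer, only the top-k fire, and each active sublayer applies its own full-rank linear decoder. All names, the top-k gating rule, and the ReLU on gate scores are illustrative assumptions, not the authors' implementation (which uses a tensor factorization rather than explicit per-expert weight matrices).

```python
import numpy as np

def mxd_forward(x, gate_W, decoder_Ws, k=4):
    """Sketch of a sparsely activated mixture-of-decoders layer.

    x          : (d_in,) input activation vector
    gate_W     : (n_experts, d_in) gating weights (hypothetical gate)
    decoder_Ws : (n_experts, d_out, d_in) one full-rank linear decoder per sublayer
    k          : number of sublayers kept active (layer-level sparsity)
    """
    scores = gate_W @ x                        # one score per sublayer
    topk = np.argsort(scores)[-k:]             # indices of the k highest-scoring sublayers
    gates = np.zeros_like(scores)
    gates[topk] = np.maximum(scores[topk], 0)  # zero out all gates except the top-k
    # Output: gate-weighted sum of full-rank linear transformations.
    # Only k of n_experts sublayers contribute, yet each applies full-rank weights.
    return sum(gates[i] * (decoder_Ws[i] @ x) for i in topk)

rng = np.random.default_rng(0)
d_in, d_out, n_experts = 8, 8, 16
x = rng.normal(size=d_in)
y = mxd_forward(x,
                rng.normal(size=(n_experts, d_in)),
                rng.normal(size=(n_experts, d_out, d_in)),
                k=4)
```

This contrasts with neuron-level sparsity (e.g., Transcoders), where sparsity is applied inside a single wide layer; here sparsity selects among many sublayers, so the transformation applied to any given input remains full-rank.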