Binary Autoencoder for Mechanistic Interpretability of Large Language Models

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM interpretability methods rely on autoencoders with implicit regularization to disentangle numerical features, but lack explicit inter-sample global sparsity guarantees—resulting in numerous low-activity dense features that undermine feature atomicity and interpretability. To address this, we propose the Binary Autoencoder (BAE), which enforces explicit, cross-sample global sparsity via minibatch-level minimum-entropy constraints, 1-bit feature discretization, and gradient estimation for backpropagation. BAE enables reliable, dynamic computation of functional entropy—a key metric for feature utility. Empirically, BAE significantly improves feature atomicity and sparsity, extracts the largest number of interpretable features in analyses of inference dynamics and in-context learning mechanisms, and substantially suppresses non-active dense features. By providing a robust, quantifiable foundation for interpreting LLM internals, BAE advances the rigor and fidelity of mechanistic interpretability research.

📝 Abstract
Existing works aim to untangle atomized numerical components (features) from the hidden states of Large Language Models (LLMs) in order to interpret their mechanisms. However, they typically rely on autoencoders constrained by implicit training-time regularization applied to single training instances (e.g., $L_1$ penalties, top-$k$ functions), with no explicit guarantee of global sparsity across instances. This yields a large number of dense (simultaneously inactive) features, harming feature sparsity and atomization. In this paper, we propose a novel autoencoder variant that enforces minimal entropy on minibatches of hidden activations, thereby promoting feature independence and sparsity across instances. For efficient entropy calculation, we discretize the hidden activations to 1-bit via a step function and apply gradient estimation to enable backpropagation; hence we term it the Binary Autoencoder (BAE). We empirically demonstrate two major applications: (1) Feature set entropy calculation. Entropy can be reliably estimated on binary hidden activations, which we empirically evaluate and leverage to characterize the inference dynamics of LLMs and in-context learning. (2) Feature untangling. Like typical methods, BAE can extract atomized features from LLMs' hidden states. To robustly evaluate this feature-extraction capability, we refine traditional feature-interpretation methods to avoid unreliable handling of numerical tokens, and show that BAE avoids dense features while producing the largest number of interpretable ones among baselines, confirming its effectiveness as a feature extractor.
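The 1-bit discretization described in the abstract can be sketched as a hard step function in the forward pass paired with a gradient estimator in the backward pass. The paper's exact estimator is not reproduced here; the clipped straight-through variant below is an illustrative assumption, not BAE's implementation:

```python
import numpy as np

def binarize(z):
    # Forward pass: hard step function mapping activations to 1-bit codes.
    return (z > 0).astype(np.float64)

def binarize_grad(z, grad_out, clip=1.0):
    # Backward pass (assumed straight-through estimator): the step function's
    # true gradient is zero almost everywhere, so gradients are passed through
    # unchanged where |z| <= clip and zeroed elsewhere.
    return grad_out * (np.abs(z) <= clip)

z = np.array([-2.0, -0.3, 0.5, 1.7])
codes = binarize(z)                      # 1-bit feature activations
grads = binarize_grad(z, np.ones_like(z))  # surrogate gradients for training
```

In a full training setup this surrogate gradient is what lets the encoder weights receive a learning signal despite the non-differentiable step.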
Problem

Research questions and friction points this paper is trying to address.

Improving feature sparsity in LLM interpretation methods
Enforcing feature independence across multiple instances
Developing binary autoencoder for efficient entropy estimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Binary Autoencoder enforces minimal entropy on minibatches
Discretizes hidden activations to 1-bit using step function
Promotes feature independence and sparsity across instances
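With 1-bit codes, the entropy the paper minimizes over minibatches becomes cheap to estimate: each feature's firing rate across the batch yields a binary entropy. The sketch below is illustrative of that idea under an independence assumption, not the paper's implementation:

```python
import numpy as np

def feature_entropy(codes, eps=1e-12):
    # codes: (batch, n_features) matrix of 1-bit activations.
    # Treating features independently, each feature's entropy is the
    # binary entropy of its empirical firing rate over the minibatch.
    p = codes.mean(axis=0)  # per-feature firing rate in [0, 1]
    return -(p * np.log2(p + eps) + (1 - p) * np.log2(1 - p + eps))

codes = np.array([[1, 0, 1],
                  [1, 0, 0],
                  [1, 0, 1],
                  [1, 0, 0]], dtype=float)
H = feature_entropy(codes)
# Always-on and always-off features carry ~0 bits; a feature firing
# half the time carries ~1 bit.
```

Minimizing the sum of such per-feature entropies pushes firing rates toward 0 or 1 batch-wide, which is one way to read the "explicit cross-sample global sparsity" constraint described above.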