🤖 AI Summary
Existing sparse autoencoder (SAE) methods suffer from high training costs, limited availability (open-source SAEs cover models only up to 27B parameters), and features that do not transfer between models. This paper proposes ITDA (Inference-Time Decomposition of Activations), a lightweight sparse dictionary learning method that decomposes large-model activations and enables cross-model representational comparison. Its core contributions are: (1) a greedy dictionary construction mechanism that adds the activations worst approximated by matching pursuit on the existing dictionary; and (2) a simple Jaccard similarity index on ITDA dictionaries that outperforms conventional representational similarity metrics such as CKA and SVCCA for cross-model feature alignment. ITDAs train in just 1% of the time and data required by SAEs, allowing dictionaries for Llama-3.1 70B and 405B to be learned on a single consumer-grade GPU; they match SAE reconstruction performance on some target LLMs (though generally with a performance penalty) while achieving substantially better cross-model feature alignment.
📝 Abstract
Sparse autoencoders (SAEs) are a popular method for decomposing large language model (LLM) activations into interpretable latents. However, due to their substantial training cost, most academic research uses open-source SAEs, which are only available for a restricted set of models of up to 27B parameters. SAE latents are also learned from a dataset of activations, which means they do not transfer between models. Motivated by relative representation similarity measures, we introduce Inference-Time Decomposition of Activations (ITDA) models, an alternative method for decomposing language model activations. To train an ITDA, we greedily construct a dictionary of language model activations on a dataset of prompts, selecting those activations which were worst approximated by matching pursuit on the existing dictionary. ITDAs can be trained in just 1% of the time required for SAEs, using 1% of the data. This allowed us to train ITDAs on Llama-3.1 70B and 405B on a single consumer GPU. ITDAs can achieve similar reconstruction performance to SAEs on some target LLMs, but generally incur a performance penalty. However, ITDA dictionaries enable cross-model comparisons, and a simple Jaccard similarity index on ITDA dictionaries outperforms existing methods like CKA, SVCCA, and relative representation similarity metrics. ITDAs provide a cheap alternative to SAEs where computational resources are limited, or when cross-model comparisons are necessary. Code available at https://github.com/pleask/itda.
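The two core mechanisms described above — greedily growing a dictionary from the activations worst approximated by matching pursuit, and comparing dictionaries with a Jaccard index — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the pursuit depth `k`, the residual-norm `threshold`, and the use of unit-normalised atoms are all assumptions for the sake of a runnable example (see the linked repository for the actual code).

```python
import numpy as np

def matching_pursuit_residual(x, D, k):
    """Residual of x after up to k matching-pursuit steps on dictionary D
    (atoms as unit-norm rows). With an empty dictionary, x is returned as-is."""
    residual = x.copy()
    for _ in range(min(k, D.shape[0])):
        coeffs = D @ residual                # inner products with each atom
        j = int(np.argmax(np.abs(coeffs)))   # best-matching atom
        residual = residual - coeffs[j] * D[j]
    return residual

def build_itda_dictionary(activations, k=8, threshold=0.5):
    """Greedy ITDA-style construction: stream over activations and add each
    one (normalised) as a new atom if its matching-pursuit residual on the
    current dictionary is still large, i.e. it is poorly approximated."""
    atoms = []
    for x in activations:
        x = x / np.linalg.norm(x)
        D = np.vstack(atoms) if atoms else np.zeros((0, x.size))
        if np.linalg.norm(matching_pursuit_residual(x, D, k)) > threshold:
            atoms.append(x)
    return np.vstack(atoms)

def jaccard_similarity(atom_ids_a, atom_ids_b):
    """Jaccard index over two dictionaries, viewed as sets of the (prompt,
    token) positions whose activations were selected as atoms."""
    a, b = set(atom_ids_a), set(atom_ids_b)
    return len(a & b) / len(a | b)
```

Because both models' dictionaries are built from the same prompt dataset, their atoms can be identified by the dataset positions they came from, which is what makes a set-based similarity like Jaccard applicable across models with different activation spaces.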