CLT-Forge: A Scalable Library for Cross-Layer Transcoders and Attribution Graphs

📅 2026-03-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the problem that attribution graphs produced by existing dictionary learning–based mechanistic interpretability methods are often large and redundant, hindering efficient training and analysis. Cross-Layer Transcoders (CLTs) yield more compact graphs by sharing features across layers, but they remain difficult to train and analyze at scale. To close this gap, the authors introduce an open-source CLT framework that integrates distributed training with model sharding, compressed activation caching, Circuit-Tracer–based attribution computation, an automated interpretability pipeline, and interactive visualization. The resulting system substantially improves the scalability and practicality of CLT-based interpretability, offering an efficient, cohesive, and scalable foundation for mechanistic interpretability research on large language models.

📝 Abstract
Mechanistic interpretability seeks to understand how Large Language Models (LLMs) represent and process information. Recent approaches based on dictionary learning and transcoders enable representing model computation in terms of sparse, interpretable features and their interactions, giving rise to feature attribution graphs. However, these graphs are often large and redundant, limiting their interpretability in practice. Cross-Layer Transcoders (CLTs) address this issue by sharing features across layers while preserving layer-specific decoding, yielding more compact representations, but remain difficult to train and analyze at scale. We introduce an open-source library for end-to-end training and interpretability of CLTs. Our framework integrates scalable distributed training with model sharding and compressed activation caching, a unified automated interpretability pipeline for feature analysis and explanation, attribution graph computation using Circuit-Tracer, and a flexible visualization interface. This provides a practical and unified solution for scaling CLT-based mechanistic interpretability. Our code is available at: https://github.com/LLM-Interp/CLT-Forge.
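To make the architecture in the abstract concrete, here is a minimal toy sketch of a Cross-Layer Transcoder along the lines the abstract describes: a single shared encoder maps a residual-stream activation to a sparse feature vector, and each target layer has its own decoder for those shared features. All names, shapes, and the ReLU sparsity mechanism are illustrative assumptions, not the CLT-Forge API.

```python
import numpy as np

rng = np.random.default_rng(0)

class ToyCrossLayerTranscoder:
    """Toy CLT: one shared sparse feature dictionary, per-layer decoders."""

    def __init__(self, d_model: int, n_features: int, n_layers: int):
        # Shared encoder: residual-stream activation -> feature activations.
        self.W_enc = rng.normal(0, 0.1, (n_features, d_model))
        self.b_enc = np.zeros(n_features)
        # Layer-specific decoders: the same feature vector is read out
        # differently at each target layer.
        self.W_dec = rng.normal(0, 0.1, (n_layers, d_model, n_features))

    def encode(self, x: np.ndarray) -> np.ndarray:
        # ReLU keeps feature activations non-negative and sparse.
        return np.maximum(self.W_enc @ x + self.b_enc, 0.0)

    def decode(self, f: np.ndarray, layer: int) -> np.ndarray:
        # Reconstruct the features' contribution at a given layer.
        return self.W_dec[layer] @ f

clt = ToyCrossLayerTranscoder(d_model=16, n_features=64, n_layers=4)
x = rng.normal(size=16)                       # a residual-stream activation
f = clt.encode(x)                             # shared sparse feature vector
outs = [clt.decode(f, l) for l in range(4)]   # one layer-specific readout each
```

Because the feature vector `f` is computed once and decoded per layer, a feature appears as a single node in an attribution graph rather than once per layer, which is the compactness the abstract attributes to CLTs.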
Problem

Research questions and friction points this paper is trying to address.

mechanistic interpretability
Cross-Layer Transcoders
attribution graphs
Large Language Models
sparse features
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-Layer Transcoders
Mechanistic Interpretability
Attribution Graphs
Distributed Training
Feature Attribution