🤖 AI Summary
How linguistic knowledge is acquired over the course of language model training remains poorly understood; existing interpretability tools are largely post hoc, rely on scalar metrics, or require complex integration, hindering deployment and reproducibility. This paper introduces a modular interpretability framework for fine-grained tracking of linguistic and representational signals throughout both training and inference in Transformer models. Methodologically, it integrates (1) ABSynth, a controllable synthetic corpus generator that exposes developmental patterns, such as early syntactic emergence, delayed semantic acquisition, and representational compression, which conventional metrics overlook; and (2) a multi-faceted diagnostic suite combining feature probing, intrinsic dimension estimation, Hessian curvature analysis, and output diagnostics to support layer-wise interpretation, convergence-driven early stopping, and structural error detection. Lightweight and fully reproducible, the framework makes interpretability research substantially more practical and systematic.
📝 Abstract
Understanding when and how linguistic knowledge emerges during language model training remains a central challenge for interpretability. Most existing tools are post hoc, rely on scalar metrics, or require nontrivial integration effort, making comprehensive interpretability analysis difficult to deploy and maintain. We introduce TRACE, a modular toolkit for training- and inference-time interpretability analysis of transformer models. It enables lightweight, in-training analysis of linguistic and representational signals, including feature probing, intrinsic dimensionality, Hessian curvature, and output diagnostics. It integrates with ABSynth, a controllable synthetic corpus generator that provides structured annotations for precise evaluation of linguistic feature acquisition. Experiments with autoregressive transformers demonstrate that TRACE reveals developmental phenomena such as early syntactic emergence, delayed semantic acquisition, and representational compression; these signals are overlooked by traditional scalar metrics such as loss or accuracy. With minimal integration effort, TRACE enables layer-wise diagnostics, convergence-based early stopping, and detection of structural errors, making transformer analysis interpretable, actionable, and reproducible.