Step-Level Sparse Autoencoder for Reasoning Process Interpretation

📅 2026-03-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes the Step-Level Sparse Autoencoder (SSAE), which shifts sparse autoencoding from the token level to the granularity of reasoning steps in large language models. Whereas existing methods operate at the token level and struggle to capture step-wise semantic and logical structure, SSAE uses context-conditioned sparsity to build an information bottleneck that separates each step's incremental reasoning content from background knowledge, yielding interpretable step-level features. Linear probing across multiple base models and reasoning tasks shows that these features predict both the correctness and the logical coherence of individual reasoning steps, pointing to an intrinsic self-verification mechanism within the models and making LLM reasoning processes substantially easier to interpret.
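The bottleneck described above can be sketched as a toy forward pass. Everything below is a hypothetical illustration, not the paper's implementation: the parameter names (`W_enc`, `W_dec`, `b_ctx`), the random weights, the dimensions, and the plain top-K selection standing in for the paper's context-conditioned sparsity are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

D, F, K = 64, 256, 8  # hidden dim, dictionary size, active features per step

# Hypothetical parameters (random here; in the paper they would be learned).
W_enc = rng.normal(0, 0.1, (D, F))
W_dec = rng.normal(0, 0.1, (F, D))
b_ctx = rng.normal(0, 0.1, (D, D))  # maps context to a background baseline

def ssae_forward(step_vec, ctx_vec):
    """Encode a reasoning-step vector conditioned on its context.

    The context predicts a 'background' reconstruction; only the residual
    (the step's incremental content) passes through the sparse bottleneck.
    """
    background = ctx_vec @ b_ctx           # background knowledge from context
    residual = step_vec - background       # incremental step content
    pre_acts = np.maximum(residual @ W_enc, 0.0)  # ReLU encoder
    # Keep only the K largest activations (a simple stand-in for the
    # context-conditioned sparsity control described in the summary).
    feats = np.zeros_like(pre_acts)
    top = np.argsort(pre_acts)[-K:]
    feats[top] = pre_acts[top]
    recon = background + feats @ W_dec     # background + sparse increment
    return feats, recon

step = rng.normal(size=D)
ctx = rng.normal(size=D)
feats, recon = ssae_forward(step, ctx)
print("active features:", int((feats > 0).sum()))
```

The design choice illustrated here is that the decoder reconstructs the step as `background + sparse increment`, so the sparse dimensions are forced to carry only the information the step adds beyond its context.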

📝 Abstract
Large Language Models (LLMs) have achieved strong complex reasoning capabilities through Chain-of-Thought (CoT) reasoning. However, their reasoning patterns remain too complicated to analyze. While Sparse Autoencoders (SAEs) have emerged as a powerful tool for interpretability, existing approaches predominantly operate at the token level, creating a granularity mismatch when it comes to capturing more critical step-level information, such as reasoning direction and semantic transitions. In this work, we propose the step-level sparse autoencoder (SSAE), an analytical tool that disentangles different aspects of LLMs' reasoning steps into sparse features. Specifically, by precisely controlling the sparsity of a step feature conditioned on its context, we form an information bottleneck in step reconstruction, which splits incremental information from background information and disentangles it into several sparsely activated dimensions. Experiments on multiple base models and reasoning tasks show the effectiveness of the extracted features. By linear probing, we can easily predict surface-level information, such as generation length and first-token distribution, as well as more complicated properties, such as the correctness and logicality of a step. These observations indicate that LLMs already at least partly know about these properties during generation, which provides a foundation for their self-verification ability. The code is available at https://github.com/Miaow-Lab/SSAE.
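The linear-probing analysis the abstract mentions can be illustrated with a minimal logistic-regression probe. This is a toy sketch on synthetic data: the dimensions, labels, and training loop are invented for demonstration, whereas the paper probes SSAE features extracted from real reasoning traces to predict properties like step correctness.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: 200 reasoning steps, each with an F-dim feature vector;
# binary labels mark whether the step was correct. Synthetic and linearly
# separable here, purely to show the probing mechanics.
N, F = 200, 32
true_w = rng.normal(size=F)
X = rng.normal(size=(N, F))
y = (X @ true_w > 0).astype(float)

# Logistic-regression probe trained with plain gradient descent.
w = np.zeros(F)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))   # predicted probability of "correct"
    w -= 0.5 * X.T @ (p - y) / N         # gradient step on the logistic loss

pred = 1.0 / (1.0 + np.exp(-(X @ w))) > 0.5
acc = (pred == (y == 1)).mean()
print(f"probe train accuracy: {acc:.2f}")
```

High probe accuracy is the point of the analysis: if a simple linear readout of the features predicts correctness, that information must already be linearly encoded in the model's step representation during generation.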
Problem

Research questions and friction points this paper is trying to address.

Chain-of-Thought
interpretability
Sparse Autoencoder
reasoning process
step-level
Innovation

Methods, ideas, or system contributions that make the work stand out.

Step-Level Sparse Autoencoder
Chain-of-Thought Reasoning
Interpretability
Information Bottleneck
Self-Verification