🤖 AI Summary
Sparse autoencoders (SAEs) reveal interpretable features in large language models (LLMs), yet their high-dimensional feature spaces are prohibitively large for systematic analysis; existing dimensionality reduction techniques (e.g., UMAP) often introduce compression artifacts, neighborhood distortions, and visual occlusion. To address this, we propose a concept-centric, interactive, topology-aware visualization framework that integrates locality-preserving dimensionality reduction with persistent homology–based topological encoding. The framework prioritizes user-specified key concepts and their semantically associated features, preserving global structural integrity while enabling focused exploration. It supports fine-grained inspection of conceptual organization and semantic hierarchies within the latent space, balancing local fidelity with global interpretability. Experiments demonstrate substantial improvements in both explanatory power and analytical efficiency for SAE feature exploration, establishing a scalable, interactive paradigm for understanding LLM internal representations.
📝 Abstract
Sparse autoencoders (SAEs) have emerged as a powerful tool for uncovering interpretable features in large language models (LLMs) through the sparse directions they learn. However, the sheer number of extracted directions makes comprehensive exploration intractable. While conventional embedding techniques such as UMAP can reveal global structure, they suffer from limitations including high-dimensional compression artifacts, overplotting, and misleading neighborhood distortions. In this work, we propose a focused exploration framework that prioritizes curated concepts and their corresponding SAE features rather than attempting to visualize all available features simultaneously. We present an interactive visualization system that combines topology-based visual encoding with dimensionality reduction to faithfully represent both local and global relationships among selected features. This hybrid approach enables users to investigate SAE behavior through targeted, interpretable subsets, facilitating deeper and more nuanced analysis of concept representation in latent space.
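To make the persistent-homology component of the abstract concrete, here is a minimal, hypothetical sketch (not the paper's implementation): 0-dimensional persistent homology computed over a toy set of "SAE feature" vectors via a Kruskal-style merge process. Each merge at distance `d` produces a persistence bar `(0, d)`; long-lived bars signal well-separated concept clusters, which is the kind of topological signal a visualization could encode. All names and the toy data are illustrative assumptions.

```python
# Hypothetical sketch: H0 persistent homology over toy "SAE feature" vectors.
# Connected components are born at scale 0 and die when two clusters merge;
# the merge distance is the bar's death time. Long bars = persistent clusters.
import math
from itertools import combinations

def h0_persistence(points):
    """Return the sorted death times of H0 classes (all births are 0)."""
    n = len(points)
    parent = list(range(n))

    def find(x):
        # Union-find with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Process all pairwise edges in order of increasing distance (Kruskal).
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(d)  # one H0 class dies at this merge scale
    return deaths

# Two tight clusters far apart: four short bars, one long bar.
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
       (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
bars = h0_persistence(pts)
```

In a full system, each point would be a (possibly reduced) SAE feature direction, and the bars could drive visual encodings such as cluster outlines or persistence-ranked highlighting; libraries like GUDHI or Ripser would replace this toy routine and also provide higher-dimensional homology.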