🤖 AI Summary
Sparse autoencoders (SAEs) reveal interpretable features in large language models (LLMs), yet their high-dimensional feature spaces are prohibitively large for systematic analysis; existing dimensionality reduction techniques (e.g., UMAP) often introduce compression artifacts, neighborhood distortions, and visual occlusion. To address this, we propose a concept-centric, interactive, topology-aware visualization framework that integrates locality-preserving dimensionality reduction with persistent homology–based topological encoding. The framework prioritizes user-specified key concepts and their semantically associated features, preserving global structural integrity while enabling focused exploration. It supports fine-grained inspection of conceptual organization and semantic hierarchies within the latent space, balancing local fidelity with global interpretability. Experiments demonstrate substantial improvements in both explanatory power and analytical efficiency for SAE feature exploration, establishing a scalable, interactive paradigm for understanding LLM internal representations.
📝 Abstract
Sparse autoencoders (SAEs) have emerged as a powerful tool for uncovering interpretable features in large language models (LLMs) through the sparse directions they learn. However, the sheer number of extracted directions makes comprehensive exploration intractable. While conventional embedding techniques such as UMAP can reveal global structure, they suffer from limitations including high-dimensional compression artifacts, overplotting, and misleading neighborhood distortions. In this work, we propose a focused exploration framework that prioritizes curated concepts and their corresponding SAE features rather than attempting to visualize all available features simultaneously. We present an interactive visualization system that combines topology-based visual encoding with dimensionality reduction to faithfully represent both local and global relationships among selected features. This hybrid approach enables users to investigate SAE behavior through targeted, interpretable subsets, facilitating deeper and more nuanced analysis of concept representation in latent space.
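To make the persistent-homology component of the abstract concrete, here is a minimal, hypothetical sketch (not the paper's implementation): 0-dimensional persistent homology computed over a toy set of "SAE feature" vectors via a Kruskal-style merge process. Each merge at distance `d` produces a persistence bar `(0, d)`; long-lived bars signal well-separated concept clusters, which is the kind of topological signal a visualization could encode. All names and the toy data are illustrative assumptions.

```python
# Hypothetical sketch: H0 persistent homology over toy "SAE feature" vectors.
# Connected components are born at scale 0 and die when two clusters merge;
# the merge distance is the bar's death time. Long bars = persistent clusters.
import math
from itertools import combinations

def h0_persistence(points):
    """Return the sorted death times of H0 classes (all births are 0)."""
    n = len(points)
    parent = list(range(n))

    def find(x):
        # Union-find with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Process all pairwise edges in order of increasing distance (Kruskal).
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(d)  # one H0 class dies at this merge scale
    return deaths

# Two tight clusters far apart: four short bars, one long bar.
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
       (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
bars = h0_persistence(pts)
```

In a full system, each point would be a (possibly reduced) SAE feature direction, and the bars could drive visual encodings such as cluster outlines or persistence-ranked highlighting; libraries like GUDHI or Ripser would replace this toy routine and also provide higher-dimensional homology.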