Semantic Regexes: Auto-Interpreting LLM Features with a Structured Language

📅 2025-10-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current manual descriptions of LLM features suffer from ambiguity, inconsistency, and poor automatability, hindering interpretability research. To address this, we propose Semantic Regexes, a structured, linguistically grounded framework for feature characterization that replaces natural language with compositional semantic patterns built from linguistic primitives and semantic modifiers (contextualization, composition, quantification). Our method enables rigorous complexity quantification and model-level pattern analysis. Evaluation shows substantial improvements in description consistency (+42%) and conciseness (58% reduction in average length), while preserving accuracy comparable to human-written descriptions. A user study confirms that Semantic Regexes effectively support the construction of mental models of LLM activation behavior. This work integrates formal grammar principles into LLM interpretability, offering an approach to automated, scalable, and principled feature analysis.

📝 Abstract

Automated interpretability aims to translate large language model (LLM) features into human understandable descriptions. However, these natural language feature descriptions are often vague, inconsistent, and require manual relabeling. In response, we introduce semantic regexes, structured language descriptions of LLM features. By combining primitives that capture linguistic and semantic feature patterns with modifiers for contextualization, composition, and quantification, semantic regexes produce precise and expressive feature descriptions. Across quantitative benchmarks and qualitative analyses, we find that semantic regexes match the accuracy of natural language while yielding more concise and consistent feature descriptions. Moreover, their inherent structure affords new types of analyses, including quantifying feature complexity across layers, scaling automated interpretability from insights into individual features to model-wide patterns. Finally, in user studies, we find that semantic regex descriptions help people build accurate mental models of LLM feature activations.

Problem

Research questions and friction points this paper is trying to address.

Translating vague LLM features into precise descriptions

Creating structured language for consistent feature interpretation

Enabling automated analysis of model-wide feature patterns

Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic regexes use structured language for feature descriptions

They combine linguistic and semantic primitives with modifiers for contextualization, composition, and quantification

This enables precise and consistent automated interpretability of LLMs
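
As a rough illustration of why a structured description is easier to analyze than free text, the sketch below models a description as a small tree of primitives and modifiers and counts its nodes as a crude complexity score. Everything here is an assumption made for illustration: the class names (`Primitive`, `Context`, `Sequence`, `Quantified`), the primitive kinds, the example feature, and the node-count metric are hypothetical and are not the paper's actual syntax or complexity measure.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Primitive:
    """A leaf pattern. The kinds used below are hypothetical labels."""
    kind: str   # e.g. "symbol" for literal text, "field" for a semantic category
    value: str


@dataclass(frozen=True)
class Context:
    """Contextualization modifier: the body only applies within a context."""
    context: str
    body: "Node"


@dataclass(frozen=True)
class Sequence:
    """Composition modifier: the parts occur one after another."""
    parts: tuple


@dataclass(frozen=True)
class Quantified:
    """Quantification modifier: the body may repeat or be optional."""
    body: "Node"
    quantifier: str  # e.g. "?" for optional, "+" for one or more


Node = Union[Primitive, Context, Sequence, Quantified]


def complexity(node: Node) -> int:
    """Count the nodes in a description tree (an assumed, simplistic metric)."""
    if isinstance(node, Primitive):
        return 1
    if isinstance(node, Sequence):
        return 1 + sum(complexity(part) for part in node.parts)
    # Context and Quantified each wrap exactly one child node.
    return 1 + complexity(node.body)


# Hypothetical feature: a person name, optionally followed by a comma,
# occurring in legal documents.
example = Context(
    "legal documents",
    Sequence((
        Primitive("field", "person name"),
        Quantified(Primitive("symbol", ","), "?"),
    )),
)

print(complexity(example))
```

Because each description is a tree rather than a sentence, a score like this can be computed for every feature and aggregated per layer, which is the kind of model-wide analysis that free-text descriptions make difficult.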

🔎 Similar Papers

Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models