🤖 AI Summary
Current manual descriptions of LLM features suffer from ambiguity, inconsistency, and poor automatability, hindering interpretability research. To address this, we propose Semantic Regexes, a structured, linguistically grounded framework for feature characterization that replaces natural language with compositional semantic patterns built from linguistic primitives and semantic modifiers (contextualization, compositionality, quantification). Our method enables rigorous complexity quantification and model-level pattern analysis. Evaluation shows substantial improvements in description consistency (+42%) and conciseness (58% reduction in average length), while preserving accuracy comparable to human-written descriptions. A user study confirms that Semantic Regexes effectively support the construction of mental models of LLM activation behavior. This work pioneers the integration of formal grammar principles into LLM interpretability, establishing a novel paradigm for automated, scalable, and principled feature analysis.
📝 Abstract
Automated interpretability aims to translate large language model (LLM) features into human-understandable descriptions. However, these natural language feature descriptions are often vague and inconsistent, and they require manual relabeling. In response, we introduce semantic regexes, structured language descriptions of LLM features. By combining primitives that capture linguistic and semantic feature patterns with modifiers for contextualization, composition, and quantification, semantic regexes produce precise and expressive feature descriptions. Across quantitative benchmarks and qualitative analyses, we find that semantic regexes match the accuracy of natural language while yielding more concise and consistent feature descriptions. Moreover, their inherent structure affords new types of analyses, including quantifying feature complexity across layers, scaling automated interpretability from insights into individual features to model-wide patterns. Finally, in user studies, we find that semantic regex descriptions help people build accurate mental models of LLM feature activations.
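To make the idea of primitives plus modifiers concrete, the following is a minimal sketch of how such a structured description could be represented and measured. It is not the paper's actual syntax or implementation: the class names (`Primitive`, `InContext`, `Sequence`, `Optional_`), the `kind` labels, the rendered notation, and the node-count complexity measure are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Primitive:
    """A basic pattern, e.g. an exact token or a semantic category (hypothetical)."""
    kind: str   # assumed labels such as "token" or "category"
    value: str


@dataclass(frozen=True)
class InContext:
    """Contextualization: the inner pattern applies only within a given context."""
    context: str
    inner: "Node"


@dataclass(frozen=True)
class Sequence:
    """Composition: patterns that occur one after another."""
    parts: Tuple["Node", ...]


@dataclass(frozen=True)
class Optional_:
    """Quantification: the inner pattern may appear zero or one time."""
    inner: "Node"


Node = Union[Primitive, InContext, Sequence, Optional_]


def complexity(node: Node) -> int:
    """Count the nodes in a description as a crude, assumed complexity proxy."""
    if isinstance(node, Primitive):
        return 1
    if isinstance(node, Sequence):
        return 1 + sum(complexity(part) for part in node.parts)
    return 1 + complexity(node.inner)  # InContext or Optional_


def render(node: Node) -> str:
    """Render a description in an illustrative notation (not the paper's)."""
    if isinstance(node, Primitive):
        return f"<{node.kind}:{node.value}>"
    if isinstance(node, InContext):
        return f"in[{node.context}]({render(node.inner)})"
    if isinstance(node, Sequence):
        return " ".join(render(part) for part in node.parts)
    return f"({render(node.inner)})?"


# Hypothetical feature: a month name, optionally followed by a comma, in legal text.
feature = InContext(
    "legal text",
    Sequence((Primitive("category", "month"), Optional_(Primitive("token", ",")))),
)

print(render(feature))
print(complexity(feature))
```

Because the description is a tree rather than free text, two descriptions can be compared structurally and a size measure such as `complexity` can be aggregated across many features, which is the kind of model-wide analysis the abstract attributes to the structured format.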