Sparse Autoencoders for Hypothesis Generation

📅 2025-02-05
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses the automated discovery of interpretable relationships between textual data (e.g., news headlines) and target variables (e.g., click-through rates). It proposes the first framework to leverage sparse autoencoders for hypothesis generation: (1) learn human-interpretable text features via sparse autoencoding; (2) select the most predictive features using feature-importance scoring; and (3) generate natural-language hypotheses for those features using large language models (LLMs). The method jointly optimizes interpretability, predictive accuracy, and computational efficiency. On real-world datasets, it yields roughly twice as many statistically significant hypotheses as baseline methods, uncovering novel insights such as partisan differences in congressional speeches and drivers of headline engagement. On synthetic benchmarks, it improves F1 score by at least 0.06. Computationally, it incurs one to two orders of magnitude less overhead than state-of-the-art LLM-based approaches.

๐Ÿ“ Abstract
We describe HypotheSAEs, a general method to hypothesize interpretable relationships between text data (e.g., headlines) and a target variable (e.g., clicks). HypotheSAEs has three steps: (1) train a sparse autoencoder on text embeddings to produce interpretable features describing the data distribution, (2) select features that predict the target variable, and (3) generate a natural language interpretation of each feature (e.g., "mentions being surprised or shocked") using an LLM. Each interpretation serves as a hypothesis about what predicts the target variable. Compared to baselines, our method better identifies reference hypotheses on synthetic datasets (at least +0.06 in F1) and produces more predictive hypotheses on real datasets (~twice as many significant findings), despite requiring 1-2 orders of magnitude less compute than recent LLM-based methods. HypotheSAEs also produces novel discoveries on two well-studied tasks: explaining partisan differences in Congressional speeches and identifying drivers of engagement with online headlines.
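The first two steps of the pipeline described above can be sketched in a few dozen lines. This is a minimal illustration, not the paper's implementation: the embeddings and target are random stand-ins, the dimensions and top-k sparsity rule are illustrative choices, and simple correlation stands in for whatever feature-importance scoring the authors use.

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, d_embed, n_feats, k = 512, 64, 128, 8

X = rng.normal(size=(n_docs, d_embed))  # stand-in text embeddings
y = rng.normal(size=n_docs)             # stand-in target variable

# Step 1: a top-k sparse autoencoder trained by plain SGD on
# reconstruction loss (a simplification of the paper's training setup).
W_enc = rng.normal(scale=0.1, size=(d_embed, n_feats))
W_dec = rng.normal(scale=0.1, size=(n_feats, d_embed))

def encode(X):
    """Linear encode + ReLU, then keep only the k largest activations per row."""
    A = np.maximum(X @ W_enc, 0.0)
    thresh = np.sort(A, axis=1)[:, -k][:, None]  # k-th largest per row
    return np.where(A >= thresh, A, 0.0)

lr = 1e-3
for _ in range(200):
    Z = encode(X)
    err = Z @ W_dec - X                 # reconstruction error
    # Gradients of mean squared reconstruction error w.r.t. both matrices.
    W_dec -= lr * (Z.T @ err) / n_docs
    grad_Z = err @ W_dec.T
    grad_Z[Z == 0] = 0.0                # gradient blocked by the sparsity mask
    W_enc -= lr * (X.T @ grad_Z) / n_docs

# Step 2: rank features by absolute correlation with the target and keep
# the strongest ones (one simple instance of feature-importance scoring).
Z = encode(X)
corr = np.array([
    np.corrcoef(Z[:, j], y)[0, 1] if Z[:, j].std() > 0 else 0.0
    for j in range(n_feats)
])
top_features = np.argsort(-np.abs(corr))[:5]
print("selected feature indices:", top_features)
# Step 3 (not shown): an LLM labels each selected feature by summarizing
# the documents that activate it most strongly.
```

The top-k constraint is what makes each learned feature fire on only a small, coherent slice of documents, which is what lets an LLM later describe it in natural language.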
Problem

Research questions and friction points this paper is trying to address.

How can interpretable relationships between text features and a target variable be discovered automatically?
How well do sparse autoencoder features predict target variables?
Can the approach surface novel hypotheses in political and online-engagement contexts?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses sparse autoencoders for interpretable feature extraction
Generates hypotheses via natural-language interpretation of selected features
Requires one to two orders of magnitude less compute than LLM-based baselines