AI Summary
This work addresses a key limitation of traditional sparse autoencoders (SAEs): their reliance on linear reconstruction, which conflates compositional structure with mere feature co-occurrence and often collapses compound concepts into opaque monolithic features. To overcome this, the authors propose PolySAE, which augments the SAE decoder with explicit higher-order polynomial terms that model pairwise and triple feature interactions, while retaining a linear encoder to preserve interpretability. The implementation stays efficient through low-rank tensor factorization on a shared projection subspace. This approach is the first to transcend the linear reconstruction constraint without sacrificing SAE interpretability; notably, the learned interaction weights show negligible correlation with co-occurrence frequency (r = 0.06, versus r = 0.82 for SAE feature covariance). Evaluated across four language models and three SAE variants, the method yields an average improvement of approximately 8% in probing F1 at comparable reconstruction error, produces 2-10× larger Wasserstein distances between class-conditional feature distributions, and incurs only a small parameter overhead (3% on GPT2).
Abstract
Sparse autoencoders (SAEs) have emerged as a promising method for interpreting neural network representations by decomposing activations into sparse combinations of dictionary atoms. However, SAEs assume that features combine additively through linear reconstruction, an assumption that cannot capture compositional structure: linear models cannot distinguish whether "Starbucks" arises from the composition of "star" and "coffee" features or merely their co-occurrence. This forces SAEs to allocate monolithic features for compound concepts rather than decomposing them into interpretable constituents. We introduce PolySAE, which extends the SAE decoder with higher-order terms to model feature interactions while preserving the linear encoder essential for interpretability. Through low-rank tensor factorization on a shared projection subspace, PolySAE captures pairwise and triple feature interactions with small parameter overhead (3% on GPT2). Across four language models and three SAE variants, PolySAE achieves an average improvement of approximately 8% in probing F1 while maintaining comparable reconstruction error, and produces 2-10$\times$ larger Wasserstein distances between class-conditional feature distributions. Critically, learned interaction weights exhibit negligible correlation with co-occurrence frequency ($r = 0.06$ vs. $r = 0.82$ for SAE feature covariance), suggesting that polynomial terms capture compositional structure, such as morphological binding and phrasal composition, largely independent of surface statistics.
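To make the decoder idea concrete, here is a minimal NumPy sketch of a linear decoder extended with pairwise and triple terms computed through a shared low-rank projection. The abstract does not give the exact parameterization, so everything here is an assumption for illustration: the function `poly_decode`, the matrices `W`, `U`, `C2`, `C3`, the rank `r`, and the symmetric CP-style form in which both higher-order terms reuse the same projection `z @ U`. It may differ from how PolySAE is actually implemented.

```python
import numpy as np


def poly_decode(z, W, U, C2, C3, b):
    """Hypothetical polynomial decoder: linear + pairwise + triple terms.

    z  : (n, m) sparse codes from a linear encoder
    W  : (m, d) ordinary SAE decoder dictionary
    U  : (m, r) shared projection into a rank-r subspace (assumed form)
    C2 : (r, d) output weights for the second-order term
    C3 : (r, d) output weights for the third-order term
    b  : (d,)   decoder bias
    """
    p = z @ U                  # (n, r) projection shared by both higher-order terms
    linear = z @ W             # standard additive reconstruction
    pairwise = (p ** 2) @ C2   # low-rank form of a pairwise interaction tensor
    triple = (p ** 3) @ C3     # low-rank form of a triple interaction tensor
    return linear + pairwise + triple + b


rng = np.random.default_rng(0)
n, m, d, r = 4, 32, 16, 3

# Sparse, non-negative codes standing in for encoder outputs.
z = rng.normal(size=(n, m))
z[z < 1.0] = 0.0

W = rng.normal(size=(m, d))
U = rng.normal(size=(m, r))
C2 = rng.normal(size=(r, d))
C3 = rng.normal(size=(r, d))
b = np.zeros(d)

x_hat = poly_decode(z, W, U, C2, C3, b)

# The low-rank pairwise term equals contracting z with an explicit
# interaction tensor T2[i, j, :] = sum_k U[i, k] * U[j, k] * C2[k, :].
T2 = np.einsum("ik,jk,kd->ijd", U, U, C2)
pairwise_dense = np.einsum("ni,nj,ijd->nd", z, z, T2)
pairwise_lowrank = ((z @ U) ** 2) @ C2
```

In this sketch the higher-order terms add `r * (m + 2 * d)` parameters, whereas a dense pairwise tensor alone would need `m * m * d`, which is why a small rank keeps the overhead modest. The encoder is untouched, so each code in `z` is still a linear readout of the input activation.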