🤖 AI Summary
This work addresses limitations of existing text representation methods, which often conflate features with labels, lack auditable interpretability, and yield feature definitions that are hard to reproduce. The authors propose an operational criterion for interpretable discriminative features with two parts: conceptual clarity, measured by chance-adjusted inter-annotator agreement, and label disentanglement, meaning a feature should not merely paraphrase the prediction target. Their method, LLM-assisted Feature Discovery (LFD), generates candidate features from contrastive, outcome-opposed text pairs, screens them with cross-LLM Cohen’s κ to retain high-agreement features, and selects the final representation by residual held-out predictive gain. Evaluated on ten text-classification tasks spanning seven corpora, the approach matches the predictive performance of a strong text bottleneck baseline. Human audits with 232 raters indicate that its features are clearer, are judged less label-leaking, and achieve higher human–human and human–LLM agreement than baseline concepts.
📝 Abstract
Interpretable text representations should expose coordinates that are not only predictive, but also meaningful enough for independent auditors to apply. Existing discriminative representations often use anonymous embedding directions, while concept-bottleneck and LLM-assisted methods attach natural-language names to features without ensuring that those definitions are reproducible or distinct from the target label. We propose an operational criterion for interpretable discriminative text representations: each coordinate should satisfy conceptual clarity, measured by chance-adjusted agreement between independent annotators applying the feature definition, and label disentanglement, meaning the feature should not merely paraphrase the prediction target. We instantiate this criterion in LLM-assisted Feature Discovery (LFD), an iterative method that proposes lexical and semantic features from contrastive outcome-opposed text pairs, screens candidates using cross-LLM Cohen's $\kappa$, and selects features by residual held-out predictive gain. A stylized analysis connects the $\kappa$ screen to a per-feature annotation-noise bound, formalizing agreement as a reliability check. Across ten text-classification tasks spanning seven corpora, LFD matches the predictive performance of a strong text bottleneck baseline while producing substantially clearer and less label-entangled features. Human audits with 232 raters show that LFD features achieve higher human–human and human–LLM agreement than baseline concepts, and raters consistently judge them as less label-leaking. These results suggest that agreement-tested, label-disentangled coordinates provide a practical auditability standard for interpretable text classification.
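
To make the agreement screen concrete, the following is a minimal sketch of how a Cohen's κ filter over candidate features might look. It is not the authors' implementation: the function names, the feature names, and the 0.6 threshold are assumptions introduced for illustration, and the abstract does not state which threshold LFD uses. Each candidate feature is assumed to have been applied to the same texts by two independent annotators (for example, two different LLMs), giving two aligned label sequences.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def cohen_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Chance-adjusted agreement between two annotators on the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("annotations must be non-empty and aligned")
    n = len(a)
    # Observed agreement: fraction of items where both annotators match.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement under independence, from each annotator's marginals.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        # Both annotators gave one identical constant label; kappa is
        # undefined here. Returning 0.0 is a convention chosen for this
        # sketch so that such degenerate features fail the screen.
        return 0.0
    return (p_o - p_e) / (1.0 - p_e)


def screen_features(
    annotations: Dict[str, Tuple[Sequence[int], Sequence[int]]],
    threshold: float = 0.6,  # hypothetical cutoff, not taken from the paper
) -> List[str]:
    """Keep only features whose two annotators agree at or above threshold."""
    return [
        name
        for name, (a, b) in annotations.items()
        if cohen_kappa(a, b) >= threshold
    ]


# Hypothetical candidate features, each labeled by two annotators.
candidates = {
    "mentions_refund": ([1, 0, 1, 0], [1, 0, 1, 0]),
    "sounds_negative": ([1, 1, 0, 0, 1, 0, 1, 0], [1, 1, 0, 0, 1, 0, 0, 1]),
}
kept = screen_features(candidates)
```

In this toy input the first feature has perfect agreement and the second has 75% raw agreement against 50% expected by chance, so only the first passes the assumed cutoff. Correcting for chance is what separates this screen from a raw agreement rate: a feature that fires on almost every text can show high raw agreement while carrying little shared meaning.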