Automated Feature Labeling with Token-Space Gradient Descent

📅 2025-04-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study targets the poor readability, accuracy, and naturalness of automatically generated labels in neural-network feature interpretation. The authors propose a novel paradigm that optimizes interpretable feature labels directly in token space: a pretrained language model serves as a differentiable discriminator that guides gradient descent over token representations, and prediction accuracy, label entropy minimization, and linguistic naturalness are combined into a joint multi-objective differentiable loss. Proof-of-concept experiments on diverse features (animal detection, mammal classification, Chinese text recognition, and digit recognition) show convergence to readable, predictive single-token labels, though the current implementation is limited to single-token labels and relatively simple features.

📝 Abstract
We present a novel approach to feature labeling using gradient descent in token-space. While existing methods typically use language models to generate hypotheses about feature meanings, our method directly optimizes label representations by using a language model as a discriminator to predict feature activations. We formulate this as a multi-objective optimization problem in token-space, balancing prediction accuracy, entropy minimization, and linguistic naturalness. Our proof-of-concept experiments demonstrate successful convergence to interpretable single-token labels across diverse domains, including features for detecting animals, mammals, Chinese text, and numbers. Although our current implementation is constrained to single-token labels and relatively simple features, the results suggest that token-space gradient descent could become a valuable addition to the interpretability researcher's toolkit.
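The core loop described in the abstract can be illustrated with a minimal numpy sketch. Everything here is an assumption for illustration: a toy linear "discriminator" stands in for the pretrained language model, the vocabulary has 5 tokens, the naturalness term is omitted, and gradients are taken by finite differences to keep the sketch dependency-free.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (all shapes and hyperparameters are illustrative, not the
# authors' code): a 5-token vocabulary with 4-d embeddings, and 6 inputs
# whose feature activation is, by construction, the dot product with
# token 2's embedding.
vocab_emb = rng.normal(size=(5, 4))
inputs = rng.normal(size=(6, 4))
true_act = inputs @ vocab_emb[2]  # ground-truth feature activations

def loss(logits, ent_weight=0.1):
    """Prediction error plus an entropy penalty that pushes the label
    distribution toward a single token (naturalness term omitted)."""
    p = np.exp(logits - logits.max())
    p /= p.sum()                   # softmax over the vocabulary
    label_emb = p @ vocab_emb      # expected ("soft") token embedding
    pred = inputs @ label_emb      # discriminator's predicted activations
    entropy = -(p * np.log(p + 1e-12)).sum()
    return ((pred - true_act) ** 2).mean() + ent_weight * entropy

# Plain gradient descent on the token logits, using finite-difference
# gradients in place of backpropagation.
logits = np.zeros(5)
for _ in range(300):
    grad = np.zeros_like(logits)
    for i in range(5):
        e = np.zeros(5)
        e[i] = 1e-5
        grad[i] = (loss(logits + e) - loss(logits - e)) / 2e-5
    logits -= 0.5 * grad

print("recovered token:", logits.argmax())
```

The entropy term is what drives the soft distribution toward a single vocabulary entry, mirroring the paper's goal of converging to interpretable single-token labels.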
Problem

Research questions and friction points this paper is trying to address.

Optimize feature label representations using token-space gradient descent
Balance prediction accuracy, entropy minimization, and linguistic naturalness
Generate interpretable single-token labels for diverse feature domains
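The balancing act in the second bullet amounts to a single scalar objective over the label's token logits. A schematic form, with the weights and term names chosen here for illustration rather than taken from the paper:

```latex
\mathcal{L}(\mathbf{z}) =
  \lambda_{\text{pred}} \, \mathcal{L}_{\text{pred}}\!\big(\operatorname{softmax}(\mathbf{z})\big)
+ \lambda_{\text{ent}} \, H\!\big(\operatorname{softmax}(\mathbf{z})\big)
+ \lambda_{\text{nat}} \, \mathcal{L}_{\text{LM}}\!\big(\operatorname{softmax}(\mathbf{z})\big)
```

where $\mathbf{z}$ are the label's token logits, $\mathcal{L}_{\text{pred}}$ measures how well the label predicts feature activations, $H$ is the entropy of the token distribution (minimized to favor a single token), and $\mathcal{L}_{\text{LM}}$ is a naturalness score from the language-model discriminator.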
Innovation

Methods, ideas, or system contributions that make the work stand out.

Gradient descent in token-space for labeling
Language model as discriminator for optimization
Multi-objective balance in token-space optimization
Julian Schulz
University of Göttingen
Seamus Fallows
Independent