🤖 AI Summary
Existing methods for neuron interpretation are constrained by predefined vocabularies or produce overly specific descriptions, limiting their ability to capture high-level, global concepts encoded in visual models. This work proposes a training-free, black-box iterative framework that leverages closed-loop interactions between a large language model and a text-to-image generator to dynamically generate and refine open-vocabulary concept labels based on neuron activation histories. The approach enables, for the first time, automatic discovery of neuron-associated concepts in an open-vocabulary setting, overcoming the limitations of fixed lexicons while supporting polysemy analysis and visual interpretability. Evaluated on ImageNet and Places365, the method achieves AUC improvements of up to 0.18 and 0.05, respectively, and uncovers, on average, 29% novel concepts missed by existing vocabularies, while matching the explanatory quality of gradient-based techniques.
📝 Abstract
Interpreting the concepts encoded by individual neurons in deep neural networks is a crucial step towards understanding their complex decision-making processes and ensuring AI safety. Despite recent progress in neuron labeling, existing methods often limit the search space to predefined concept vocabularies or produce overly specific descriptions that fail to capture higher-order, global concepts. We introduce LINE, a novel, training-free iterative approach tailored for open-vocabulary concept labeling in vision models. Operating in a strictly black-box setting, LINE leverages a large language model and a text-to-image generator to iteratively propose and refine concepts in a closed loop, guided by activation history. We demonstrate that LINE achieves state-of-the-art performance across multiple model architectures, yielding AUC improvements of up to 0.18 on ImageNet and 0.05 on Places365, while discovering, on average, 29% of new concepts missed by massive predefined vocabularies. Beyond identifying the top concept, LINE provides a complete generation history, which enables polysemanticity evaluation and produces supporting visual explanations that rival gradient-dependent activation maximization methods.
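The closed loop described in the abstract (an LLM proposes a concept label from the activation history, a text-to-image generator renders it, and the probed neuron's responses score it) can be sketched as below. This is a minimal illustration, not the paper's implementation: the function names (`llm_propose_concept`, `generate_images`, `neuron_activation`) and the toy scoring logic are hypothetical stand-ins, chosen only to show the control flow; a real run would plug an actual LLM, generator, and vision model behind the same black-box interfaces.

```python
# Hypothetical stand-ins for the three black-box components.
# LINE never needs gradients or internals, so any implementation
# with these signatures would fit the loop.

def llm_propose_concept(history):
    """Propose a concept label given the (concept, score) history."""
    if not history:
        return "object"  # initial broad guess
    best, _ = max(history, key=lambda h: h[1])
    # A real LLM would reason over the full history; here we just
    # mark the best concept so far as refined.
    return best + "-refined"

def generate_images(concept, n=4):
    """Stand-in for a text-to-image generator."""
    return [f"image_of_{concept}_{i}" for i in range(n)]

def neuron_activation(image):
    """Stand-in for the probed neuron's activation on one image."""
    return image.count("refined")  # toy proxy for activation strength

def label_neuron(num_iters=3):
    """Closed-loop search: propose -> generate -> score -> refine."""
    history = []
    for _ in range(num_iters):
        concept = llm_propose_concept(history)
        images = generate_images(concept)
        score = sum(neuron_activation(img) for img in images) / len(images)
        history.append((concept, score))
    # Return the top concept plus the full generation history,
    # which supports polysemanticity analysis.
    best = max(history, key=lambda h: h[1])
    return best, history
```

Keeping the whole history, rather than only the final label, is what allows the later polysemanticity evaluation: multiple high-scoring but semantically distinct concepts in the history signal a polysemantic neuron.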