Data Descriptions from Large Language Models with Influence Estimation

📅 2025-11-11
🏛️ Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the “data interpretability” challenge in deep learning—specifically, generating human-understandable natural language descriptions that reveal the semantic essence of image data. We propose a novel framework integrating large language models (LLMs), influence estimation, and cross-modal transfer learning: (1) candidate textual descriptions are generated by an LLM; (2) influence estimation identifies high-information-content descriptions based on their impact on model behavior; and (3) the selected descriptions are evaluated on zero-shot cross-modal classification tasks to assess generalization. CLIP-based scoring refines description quality, while GPT-4o is employed for human-aligned evaluation. Evaluated across nine image classification benchmarks, our generated descriptions significantly enhance pure vision model performance—outperforming multiple baselines—and achieve, for the first time, systematic, interpretability-driven semantic enhancement at the data level.

Technology Category

Application Category

📝 Abstract
Deep learning models have been successful in many areas but understanding their behaviors still remains a black-box. Most prior explainable AI (XAI) approaches have focused on interpreting and explaining how models make predictions. In contrast, we would like to understand how data can be explained with deep learning model training and propose a novel approach to understand the data via one of the most common media - language - so that humans can easily understand. Our approach proposes a pipeline to generate textual descriptions that can explain the data with large language models by incorporating external knowledge bases. However, generated data descriptions may still include irrelevant information, so we introduce to exploit influence estimation to choose the most informative textual descriptions, along with the CLIP score. Furthermore, based on the phenomenon of cross-modal transferability, we propose a novel benchmark task named cross-modal transfer classification to examine the effectiveness of our textual descriptions. In the experiment of zero-shot setting, we show that our textual descriptions are more effective than other baseline descriptions, and furthermore, we successfully boost the performance of the model trained only on images across all nine image classification datasets. These results are further supported by evaluation using GPT-4o. Through our approach, we may gain insights into the inherent interpretability of the decision-making process of the model.
Problem

Research questions and friction points this paper is trying to address.

Generating interpretable data descriptions using large language models
Filtering irrelevant text with influence estimation and CLIP scores
Evaluating descriptions via cross-modal transfer classification benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates textual data descriptions using large language models
Selects informative descriptions via influence estimation and CLIP
Proposes cross-modal transfer classification for evaluation
🔎 Similar Papers
No similar papers found.
C
Chaeri Kim
Ulsan National Institute of Science and Technology(UNIST)
J
Jaeyeon Bae
Ulsan National Institute of Science and Technology(UNIST)
Taehwan Kim
Taehwan Kim
UNIST
Machine LearningComputer VisionLanguage Processing