🤖 AI Summary
Current research on the explainability of vision recognition models lacks a systematic classification framework, hindering reliable deployment in high-stakes domains such as autonomous driving and medical diagnosis. Method: We propose the first human-centered, four-dimensional taxonomy, comprising explanation intent, target object, presentation modality, and methodological foundation, grounded in human-computer interaction (HCI) principles and eXplainable AI (XAI) theory. Through a comprehensive literature review, we formalize evaluation criteria for each dimension and conduct the first systematic analysis of the opportunities introduced by multimodal large language models (MLLMs). Contribution/Results: Our framework enables the structured organization of explainability methods, facilitates failure diagnosis in vision models, guides the principled design of explanation techniques, and establishes a rigorous theoretical foundation and an actionable roadmap for deploying interpretable models in safety-critical applications.
📝 Abstract
In recent years, visual recognition methods have advanced significantly and found applications across diverse fields. As researchers seek to understand the mechanisms behind these models' success, their deployment in critical areas such as autonomous driving and medical diagnostics creates a growing need to diagnose failures, which in turn drives interpretability research. This paper systematically reviews existing research on the interpretability of visual recognition models and proposes a taxonomy of methods from a human-centered perspective. The proposed taxonomy categorizes interpretable recognition methods along four dimensions, Intent, Object, Presentation, and Methodology, thereby establishing a systematic and coherent set of grouping criteria for these XAI methods. We further summarize the requirements for evaluation metrics and explore new opportunities enabled by recent technologies such as multimodal large language models (MLLMs). We aim to organize existing research in this domain and to inspire future investigations into the interpretability of visual recognition models.
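To make the four-dimensional classification concrete, a minimal sketch of how one might encode an XAI method's position in such a taxonomy is shown below. The specific category members (e.g. `POST_HOC`, `SALIENCY_MAP`) and the Grad-CAM classification are illustrative assumptions, not the paper's actual subcategories:

```python
from dataclasses import dataclass
from enum import Enum, auto

# Illustrative members for each taxonomy axis; placeholders, not the
# paper's actual subcategories.
class Intent(Enum):
    DEBUG_MODEL = auto()        # diagnose model failures
    BUILD_TRUST = auto()        # support deployment decisions

class Object(Enum):
    WHOLE_MODEL = auto()        # global explanation of the model
    SINGLE_PREDICTION = auto()  # local explanation of one output

class Presentation(Enum):
    SALIENCY_MAP = auto()       # visual heatmap over the input
    TEXTUAL_RATIONALE = auto()  # natural-language explanation

class Methodology(Enum):
    POST_HOC = auto()           # explains a trained black-box model
    INTRINSIC = auto()          # model is interpretable by design

@dataclass(frozen=True)
class XAIMethodProfile:
    """Position of one XAI method within the four-dimensional taxonomy."""
    name: str
    intent: Intent
    object: Object
    presentation: Presentation
    methodology: Methodology

# Example: a Grad-CAM-style saliency method, classified along the four axes
# (classification choices here are the author's illustration).
grad_cam = XAIMethodProfile(
    name="Grad-CAM",
    intent=Intent.DEBUG_MODEL,
    object=Object.SINGLE_PREDICTION,
    presentation=Presentation.SALIENCY_MAP,
    methodology=Methodology.POST_HOC,
)
```

Organizing surveyed methods as such records would let one query, for instance, all post-hoc methods whose presentation is a saliency map.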