🤖 AI Summary
To address the unreliability and lack of real-time interpretability of outputs from large vision-language models (LVLMs), this paper introduces FastRM, a lightweight, plug-and-play interpretability framework that requires no architectural modification or retraining of the original LVLM. FastRM distills the model's internal features into a lightweight surrogate network that predicts vision-language relevance maps in a single forward pass, enabling both quantitative confidence estimation and qualitative attribution visualization. Compared with gradient-based backpropagation methods, FastRM reduces computation time by 99.8% and memory footprint by 44.4%. By decoupling interpretability from model-specific training and heavy inference overhead, FastRM makes multimodal explainable AI far more practical to deploy, supporting trustworthy real-world LVLM applications.
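The core idea (a small trained head that maps the LVLM's internal features directly to a relevance map, replacing an expensive backward pass) can be sketched roughly as follows. This is a minimal illustration under assumed shapes: the single linear head, the function name `predict_relevance_map`, and all variables are hypothetical, not the paper's actual architecture.

```python
import numpy as np

def predict_relevance_map(hidden_states, W, b, num_image_tokens):
    """Hypothetical single-pass relevance prediction.

    hidden_states: (seq_len, d) final-layer features from the LVLM
    W, b: parameters of a tiny trained surrogate head (assumed linear here)
    Returns one normalized relevance score per image token, computed with
    a single forward pass through the head -- no gradients through the LVLM.
    """
    image_feats = hidden_states[:num_image_tokens]   # (num_image_tokens, d)
    logits = image_feats @ W + b                     # (num_image_tokens,)
    exp = np.exp(logits - logits.max())              # stable softmax -> normalized map
    return exp / exp.sum()

# Toy usage with random features and an untrained head.
rng = np.random.default_rng(0)
d, seq_len, n_img = 16, 32, 24
h = rng.normal(size=(seq_len, d))
W = rng.normal(size=(d,))
rmap = predict_relevance_map(h, W, 0.0, n_img)
print(rmap.shape)  # (24,)
```

The contrast with gradient-based relevancy maps is that those require a backward pass through the full LVLM per generated token, whereas a surrogate of this kind costs only one small matrix product, which is where the reported latency and memory savings come from.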
📝 Abstract
Large Vision Language Models (LVLMs) have demonstrated remarkable reasoning capabilities over textual and visual inputs. However, these models remain prone to generating misinformation, so identifying and mitigating ungrounded responses is crucial for developing trustworthy AI. Traditional explainability methods, such as gradient-based relevancy maps, offer insight into the decision process of models but are often computationally expensive and unsuitable for real-time output validation. In this work, we introduce FastRM, an efficient method for predicting explainable Relevancy Maps of LVLMs. Furthermore, FastRM provides both quantitative and qualitative assessment of model confidence. Experimental results demonstrate that FastRM achieves a 99.8% reduction in computation time and a 44.4% reduction in memory footprint compared to traditional relevancy map generation. FastRM makes explainable AI more practical and scalable, thereby promoting its deployment in real-world applications and enabling users to more effectively evaluate the reliability of model outputs.