ReFrame: Rectification Framework for Image Explaining Architectures

📅 2025-06-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing image explanation methods suffer from object hallucination and missed detection, resulting in inconsistent and incomplete explanations. This paper proposes a general, plug-and-play explanation rectification framework that integrates interpretability design with object detection techniques, compatible with diverse image explanation architectures. The framework optimizes for object-level consistency and completeness, evaluated using detection-based precision metrics, and is adaptable to image captioning, visual question answering (VQA), and LLM prompting systems. Experiments demonstrate substantial improvements: on image captioning, completeness increases by 81.81% and inconsistency decreases by 37.10%; on VQA, average completeness improves by 9.6% and inconsistency drops by 37.10%, significantly outperforming state-of-the-art methods. The core contribution is the first general-purpose rectification paradigm explicitly designed for object-level consistency.
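The core idea described above, comparing objects mentioned in a generated explanation against detector output to flag hallucinated mentions and surface missed objects, can be sketched with simple set operations. This is an illustrative sketch only; the function and variable names are hypothetical, not the paper's actual API.

```python
# Illustrative sketch of object-level rectification: split the explanation's
# mentioned objects into grounded vs. hallucinated, and report detected
# objects the explanation missed. Names here are hypothetical.

def rectify_explanation(explanation_objects, detected_objects):
    """Compare mentioned objects against detector output."""
    mentioned = set(explanation_objects)
    detected = set(detected_objects)
    consistent = mentioned & detected    # mentioned and detected: keep
    hallucinated = mentioned - detected  # mentioned but not detected: remove
    missed = detected - mentioned        # detected but never mentioned: add
    return consistent, hallucinated, missed

# Example: the caption mentions a dog and a frisbee,
# while the detector finds a dog and a bench.
consistent, hallucinated, missed = rectify_explanation(
    ["dog", "frisbee"], ["dog", "bench"]
)
```

A downstream rectifier would then rewrite the explanation to drop the hallucinated mentions and incorporate the missed objects, which is where the framework's integration with captioning, VQA, or LLM-prompting backends comes in.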

📝 Abstract
Image explanation has been one of the key research interests in the Deep Learning field. Over the years, several approaches have been adopted to explain an image provided by the user: from detecting an object in a given image, to describing it in a human-understandable sentence, to holding a conversation about the image, this problem has changed immensely. However, existing works have often been found to (a) hallucinate objects that do not exist in the image and/or (b) fail to identify the complete set of objects present in the image. In this paper, we propose a novel approach to mitigate these drawbacks of inconsistency and incompleteness in the objects recognized during image explanation. To this end, we propose an interpretable framework that can be plugged atop diverse image explanation frameworks, including Image Captioning, Visual Question Answering (VQA), and Prompt-based AI using LLMs, thereby enhancing their explanation capabilities by rectifying incorrect or missing objects. We further measure the efficacy of the rectified explanations generated through our proposed approach using object-based precision metrics, and showcase the improvements in the inconsistency and completeness of image explanations. Quantitatively, the proposed framework improves explanations over the baseline architectures of Image Captioning (improving completeness by 81.81% and inconsistency by 37.10%), Visual Question Answering (an average of 9.6% and 37.10% in completeness and inconsistency, respectively), and a Prompt-based AI model (0.01% and 5.2% for completeness and inconsistency, respectively), surpassing the current state-of-the-art by a substantial margin.
Problem

Research questions and friction points this paper is trying to address.

Reduces hallucination of non-existent objects in image explanations
Improves completeness of identified objects in image descriptions
Enhances consistency across diverse image explaining frameworks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Interpretable framework for image explanation enhancement
Plug-and-play solution for diverse image explaining frameworks
Object-based precision metrics for rectification evaluation
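The object-based precision metrics above can be read as set-based ratios between mentioned and detected objects. The definitions below are a minimal sketch under that assumption; the paper's exact formulas may differ.

```python
# Hypothetical set-based reading of the evaluation metrics:
# completeness = fraction of detected objects the explanation covers;
# inconsistency = fraction of mentioned objects unsupported by detection.

def completeness(mentioned, detected):
    """Share of detector-found objects that appear in the explanation."""
    detected = set(detected)
    if not detected:
        return 1.0  # nothing to cover
    return len(set(mentioned) & detected) / len(detected)

def inconsistency(mentioned, detected):
    """Share of mentioned objects the detector does not support."""
    mentioned = set(mentioned)
    if not mentioned:
        return 0.0  # nothing hallucinated
    return len(mentioned - set(detected)) / len(mentioned)
```

Under these definitions, a rectified explanation that drops hallucinated mentions lowers inconsistency, while adding missed objects raises completeness, which matches the two directions of improvement the framework reports.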