Interpreting Object-level Foundation Models via Visual Precision Search

📅 2024-11-25
🏛️ arXiv.org
📈 Citations: 1
✨ Influential: 0
🤖 AI Summary
Existing object-level foundation models (e.g., Grounding DINO, Florence-2) suffer from limited interpretability: gradient-based methods struggle with precise localization due to intricate multimodal fusion, while perturbation-based approaches yield noisy, coarse-grained saliency maps. This paper proposes Visual Precision Search (VPS), a gradient-free, black-box attribution method that requires no access to model parameters or the internal multimodal architecture. VPS partitions the input into sparse sub-regions and leverages consistency and collaboration scoring to accurately localize critical decision regions. Its key contributions include: (i) a black-box attribution paradigm tailored to object-level models; (ii) theoretical error bounds on attribution fidelity; and (iii) support for failure-case diagnosis. Evaluated on RefCOCO, MS COCO, and LVIS, VPS outperforms state-of-the-art methods, improving attribution faithfulness by 20.1–31.6% for Grounding DINO and 66.9–102.9% for Florence-2. Code is publicly available.

๐Ÿ“ Abstract
Advances in multimodal pre-training have propelled object-level foundation models, such as Grounding DINO and Florence-2, in tasks like visual grounding and object detection. However, interpreting these models' decisions has grown increasingly challenging. Existing interpretable attribution methods for object-level task interpretation have notable limitations: (1) gradient-based methods lack precise localization due to visual-textual fusion in foundation models, and (2) perturbation-based methods produce noisy saliency maps, limiting fine-grained interpretability. To address these, we propose a Visual Precision Search method that generates accurate attribution maps with fewer regions. Our method bypasses internal model parameters to overcome attribution issues from multimodal fusion, dividing inputs into sparse sub-regions and using consistency and collaboration scores to accurately identify critical decision-making regions. We also conducted a theoretical analysis of the boundary guarantees and scope of applicability of our method. Experiments on RefCOCO, MS COCO, and LVIS show our approach enhances object-level task interpretability over SOTA for Grounding DINO and Florence-2 across various evaluation metrics, with faithfulness gains of 23.7%, 31.6%, and 20.1% on MS COCO, LVIS, and RefCOCO for Grounding DINO, and 102.9% and 66.9% on MS COCO and RefCOCO for Florence-2. Additionally, our method can interpret failures in visual grounding and object detection tasks, surpassing existing methods across multiple evaluation metrics. The code will be released at https://github.com/RuoyuChen10/VPS.
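The core idea described above, masking sparse sub-regions of the input and scoring each region by how much the model's output changes, can be sketched generically. The following is a minimal illustration of perturbation-style, black-box sub-region attribution, not the authors' exact Visual Precision Search algorithm (their consistency and collaboration scores and sparse search are more involved); `score_fn` stands in for any black-box model confidence function.

```python
import numpy as np

def subregion_attribution(image, score_fn, grid=8, fill=0.0):
    """Generic black-box attribution sketch: split the image into a
    grid of sub-regions and rate each region's importance as the drop
    in the model's score when that region is masked out.

    NOTE: illustrative only; `score_fn` is a hypothetical stand-in for
    a model's confidence on a detection or grounding prediction.
    """
    h, w = image.shape[:2]
    rh, rw = h // grid, w // grid
    base = score_fn(image)              # confidence on the intact input
    saliency = np.zeros((grid, grid))
    for i in range(grid):
        for j in range(grid):
            masked = image.copy()
            # occlude one sub-region with a constant fill value
            masked[i * rh:(i + 1) * rh, j * rw:(j + 1) * rw] = fill
            saliency[i, j] = base - score_fn(masked)  # importance = score drop
    return saliency

# Toy usage: a "model" whose score is the mean intensity of the
# top-left quadrant, so only regions there should matter.
def toy_score(img):
    return float(img[:img.shape[0] // 2, :img.shape[1] // 2].mean())

img = np.ones((16, 16))
sal = subregion_attribution(img, toy_score, grid=4)
```

In this toy setup, masking any of the four sub-regions inside the top-left quadrant lowers the score, while masking the others leaves it unchanged, so the saliency map concentrates on the quadrant the "model" actually uses. Gradient-free approaches like this need only forward passes, which is what lets them sidestep the multimodal-fusion issues the paper highlights.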
Problem

Research questions and friction points this paper is trying to address.

Interpreting decisions of object-level foundation models
Overcoming limitations of gradient- and perturbation-based methods
Enhancing interpretability for visual grounding and object detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual Precision Search for accurate attribution maps
Bypasses internal parameters to avoid fusion issues
Uses consistency scores to identify critical regions