On Explaining Visual Captioning with Hybrid Markov Logic Networks

๐Ÿ“… 2025-07-28
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
This paper addresses the limited interpretability of image captioning models with an explanation framework based on Hybrid Markov Logic Networks (HMLNs), a language that combines symbolic rules with real-valued functions. The method learns an HMLN distribution over training instances and infers how that distribution shifts when conditioned on a generated caption, quantifying which training examples may have been a richer source of information for generating it. Because explanations are expressed over training data rather than by matching generated captions to references, the framework also supports comparing captioning models along the dimension of explainability. Experiments on captions produced by several state-of-the-art captioning models, evaluated with Amazon Mechanical Turk, illustrate the interpretability of the explanations.

๐Ÿ“ Abstract
Deep Neural Networks (DNNs) have made tremendous progress in multimodal tasks such as image captioning. However, explaining/interpreting how these models integrate visual information, language information and knowledge representation to generate meaningful captions remains a challenging problem. Standard metrics to measure performance typically rely on comparing generated captions with human-written ones, which may not provide a user with deep insights into this integration. In this work, we develop a novel explanation framework that is easily interpretable based on Hybrid Markov Logic Networks (HMLNs) - a language that can combine symbolic rules with real-valued functions - where we hypothesize how relevant examples from the training data could have influenced the generation of the observed caption. To do this, we learn a HMLN distribution over the training instances and infer the shift in distributions over these instances when we condition on the generated sample, which allows us to quantify which examples may have been a source of richer information to generate the observed caption. Our experiments on captions generated for several state-of-the-art captioning models using Amazon Mechanical Turk illustrate the interpretability of our explanations, and allow us to compare these models along the dimension of explainability.
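
The core idea in the abstract - comparing the distribution over training instances before and after conditioning on a generated caption - can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: the log-potential values are hypothetical stand-ins for what the learned HMLN would assign to each training instance.

```python
import math

def softmax(scores):
    """Normalize log-potentials into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical log-potentials for three training instances under the
# learned HMLN, before and after conditioning on a generated caption.
prior_scores = [1.0, 0.5, 0.2]
conditioned_scores = [2.5, 0.4, 0.1]

p = softmax(prior_scores)        # distribution over training instances
q = softmax(conditioned_scores)  # distribution after conditioning

# Per-instance shift: instances whose probability rises most under
# conditioning are hypothesized to have influenced the caption.
shift = [qi - pi for pi, qi in zip(p, q)]
most_influential = max(range(len(shift)), key=lambda i: shift[i])
```

In this toy example the first instance gains the most probability mass after conditioning, so it would be surfaced as the most likely source of information for the caption.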
Problem

Research questions and friction points this paper is trying to address.

Explain how DNNs integrate visual and language information for captioning
Develop interpretable framework using Hybrid Markov Logic Networks
Quantify training examples' influence on generated captions for explainability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid Markov Logic Networks for captioning
Combines symbolic rules with real-valued functions
Quantifies training data influence on captions
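
The second innovation bullet - combining symbolic rules with real-valued functions - is the defining feature of an HMLN: every weighted formula contributes weight times value, where the value is either a boolean indicator (a symbolic rule) or a continuous function. A minimal sketch under assumed, illustrative names and weights (none taken from the paper):

```python
# HMLN-style hybrid log-potential: a weighted sum over formulas,
# mixing 0/1 symbolic indicators with real-valued terms.
# All names, weights, and values here are illustrative assumptions.

def hybrid_log_potential(world):
    terms = [
        # symbolic rule: the caption mentions an object detected in the image
        (1.5, 1.0 if world["object_in_caption"] else 0.0),
        # real-valued function: the detector's confidence for that object
        (0.8, world["detector_confidence"]),
    ]
    return sum(w * v for w, v in terms)

world = {"object_in_caption": True, "detector_confidence": 0.9}
log_phi = hybrid_log_potential(world)  # 1.5*1.0 + 0.8*0.9 = 2.22
```

The unnormalized probability of a world is then exp of this log-potential, so worlds satisfying the symbolic rules with high-confidence continuous evidence score highest.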
๐Ÿ”Ž Similar Papers
No similar papers found.