Disentangling Fine-Tuning from Pre-Training in Visual Captioning with Hybrid Markov Logic

📅 2024-12-15
🏛️ BigData Congress [Services Society]
📈 Citations: 0
Influential: 0
🤖 AI Summary
In vision-language description tasks, disentangling the relative contributions of pretraining and fine-tuning knowledge remains challenging. To address this, we propose a framework based on Hybrid Markov Logic Networks (HMLNs) that integrates symbolic logic with probabilistic inference to model vision-language pairs, enabling fine-grained attribution: quantifying the influence of individual training samples on generated descriptions and explicitly decoupling the knowledge contributed by pretraining from that contributed by fine-tuning. The method unifies symbolic knowledge extraction, visual feature embedding, and probabilistic reasoning, and is validated on the MSCOCO benchmark. Experiments show that an LLM-based model (BLIP-2) exhibits significantly smaller knowledge gains from fine-tuning than non-LLM baselines, suggesting stronger generalization inherited from pretraining. This work establishes an interpretable and computationally tractable paradigm for knowledge attribution in large vision-language models.

📝 Abstract
Multimodal systems have highly complex processing pipelines and are pretrained over large datasets before being fine-tuned for specific tasks such as visual captioning. However, it becomes hard to disentangle what the model learns during fine-tuning from what it already knows due to its pretraining. In this work, we learn a probabilistic model using Hybrid Markov Logic Networks (HMLNs) over the training examples by relating symbolic knowledge (extracted from the caption) with visual features (extracted from the image). For a generated caption, we quantify the influence of training examples based on the HMLN distribution using probabilistic inference. We evaluate two types of inference procedures on the MSCOCO dataset for different types of captioning models. Our results show that for BLIP2 (a model that uses an LLM), fine-tuning may have a smaller influence on the knowledge the model acquires, since it may already have more general knowledge for visual captioning than models that do not use an LLM.
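As a rough illustration of the idea (not the paper's implementation), an HMLN assigns each possible world an unnormalized log-probability equal to a weighted sum of feature values, where boolean features come from symbolic predicates (here, facts parsed from a caption) and continuous features come from real-valued terms (here, visual similarities). The influence of a particular formula can then be read off as the change in this score when the formula is dropped. All predicate names, feature values, and weights below are invented for the sketch:

```python
def log_potential(world, weights):
    """Unnormalized log-probability of a world under an HMLN:
    a weighted sum where boolean features contribute weight * {0,1}
    and continuous features contribute weight * real value
    (the 'hybrid' part of a Hybrid Markov Logic Network)."""
    return sum(weights[f] * float(v) for f, v in world.items())

# A toy (image, caption) world: symbolic facts extracted from the
# caption (0/1) plus a continuous visual-feature term from the image.
world = {
    "HasObject(dog)": 1.0,
    "HasObject(frisbee)": 1.0,
    "Relation(dog, catches, frisbee)": 1.0,
    "VisualSim(dog_region)": 0.87,   # continuous visual feature
}
weights = {
    "HasObject(dog)": 1.2,
    "HasObject(frisbee)": 0.8,
    "Relation(dog, catches, frisbee)": 2.0,
    "VisualSim(dog_region)": 1.5,
}

# Influence of one formula on the generated caption's score:
# the drop in log-potential when that formula is removed.
full = log_potential(world, weights)
ablated = log_potential(
    {k: v for k, v in world.items()
     if k != "Relation(dog, catches, frisbee)"},
    weights,
)
influence = full - ablated
print(round(influence, 2))  # -> 2.0
```

In the actual paper this scoring is tied to probabilistic inference over the learned HMLN distribution on MSCOCO training examples; the sketch only conveys how symbolic and continuous features share one weighted log-linear score.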
Problem

Research questions and friction points this paper is trying to address.

Disentangle fine-tuning effects from pre-training in visual captioning.
Use Hybrid Markov Logic Networks to link symbolic and visual features.
Evaluate fine-tuning influence on models with and without large language models.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Hybrid Markov Logic Networks for disentanglement.
Relates symbolic knowledge with visual features.
Quantifies influence using probabilistic inference.