Towards Multimodal Understanding via Stable Diffusion as a Task-Aware Feature Extractor

📅 2025-07-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing multimodal large language models (MLLMs) predominantly adopt CLIP as the visual encoder, which struggles to capture question-relevant fine-grained visual details. To address this, we propose repurposing pretrained text-to-image diffusion models (e.g., Stable Diffusion) as instruction-aware visual encoders: leveraging text-conditioned attention for task-driven spatial focus and introducing a denoising feature alignment mechanism to bridge diffusion representations with large language models. Furthermore, we identify and mitigate information leakage during cross-modal feature fusion, proposing a lightweight complementary fusion strategy that synergistically integrates CLIP and diffusion features. Experiments demonstrate consistent and significant performance gains across both general and domain-specific multimodal benchmarks. Notably, our approach substantially outperforms CLIP-based baselines on vision-centric tasks requiring spatial reasoning and compositional understanding.

📝 Abstract
Recent advances in multimodal large language models (MLLMs) have enabled image-based question-answering capabilities. However, a key limitation is the use of CLIP as the visual encoder; while it can capture coarse global information, it often misses fine-grained details that are relevant to the input query. To address these shortcomings, this work studies whether pre-trained text-to-image diffusion models can serve as instruction-aware visual encoders. Through an analysis of their internal representations, we find diffusion features are both semantically rich and encode strong image-text alignment. Moreover, we find that text conditioning can be leveraged to focus the model on regions relevant to the input question. We then investigate how to align these features with large language models and uncover a leakage phenomenon, in which the LLM inadvertently recovers information from the original diffusion prompt. We analyze the causes of this leakage and propose a mitigation strategy. Based on these insights, we explore a simple fusion strategy that utilizes both CLIP and conditional diffusion features. We evaluate our approach on both general VQA and specialized MLLM benchmarks, demonstrating the promise of diffusion models for visual understanding, particularly in vision-centric tasks that require spatial and compositional reasoning. Our project page can be found at https://vatsalag99.github.io/mustafar/.
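The complementary fusion idea described above can be sketched in a minimal form: project CLIP's global tokens and the diffusion model's task-conditioned spatial features into a common LLM embedding space and concatenate them along the token axis. All dimensions, the random projections, and the function name `fuse_features` are illustrative assumptions; the paper's actual fusion uses learned weights and its own architecture.

```python
import numpy as np

def fuse_features(clip_feats, diff_feats, d_llm, rng):
    """Hypothetical sketch of complementary CLIP + diffusion feature fusion.

    clip_feats: (N_clip, d_clip) global patch tokens from a CLIP encoder.
    diff_feats: (N_diff, d_diff) flattened spatial features from a diffusion UNet,
                assumed to already be conditioned on the input question.
    """
    # Random stand-ins for learned linear projections into the LLM's space.
    W_clip = rng.standard_normal((clip_feats.shape[-1], d_llm)) * 0.02
    W_diff = rng.standard_normal((diff_feats.shape[-1], d_llm)) * 0.02
    clip_tokens = clip_feats @ W_clip   # (N_clip, d_llm)
    diff_tokens = diff_feats @ W_diff   # (N_diff, d_llm)
    # Complementary fusion: the LLM sees both global (CLIP) and
    # task-conditioned spatial (diffusion) tokens side by side.
    return np.concatenate([clip_tokens, diff_tokens], axis=0)

rng = np.random.default_rng(0)
clip_feats = rng.standard_normal((257, 1024))  # e.g. ViT-L/14 token grid (assumed)
diff_feats = rng.standard_normal((256, 1280))  # e.g. 16x16 UNet feature map (assumed)
fused = fuse_features(clip_feats, diff_feats, d_llm=4096, rng=rng)
```

Concatenating along the token axis (rather than adding feature maps) keeps the two sources separable, which matches the summary's claim that the strategy integrates CLIP and diffusion features rather than replacing one with the other.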
Problem

Research questions and friction points this paper is trying to address.

Enhancing fine-grained visual details in multimodal models
Leveraging diffusion models for instruction-aware visual encoding
Mitigating information leakage in diffusion-LLM feature alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Stable Diffusion for task-aware feature extraction
Fuses CLIP and diffusion features via a lightweight complementary strategy
Mitigates leakage in diffusion-LLM alignment