🤖 AI Summary
The opaque internal representations of vision-language models (VLMs) limit their interpretability and trustworthiness.
Method: We propose MIMIC (Multimodal Inversion for Model Interpretation and Conceptualization), the first multimodal inversion framework targeting VLM concepts. It visualizes a VLM's internal encodings by synthesizing the corresponding visual concepts, combining a joint VLM-based inversion with a feature-alignment objective that accounts for the model's autoregressive processing, while operating on a frozen VLM without any fine-tuning. A triplet of regularizers enforces spatial alignment, natural-image smoothness, and semantic realism to keep reconstructions semantically faithful.
Contribution/Results: The method achieves high-fidelity inversion of visual concepts from varying-length, free-form VLM output texts. Quantitative evaluations on standard visual-quality and semantic text-based metrics, along with qualitative analysis, show improvements in both visual quality and semantic consistency. By enabling interpretable, visually grounded analysis of VLM internal representations, the framework offers a new tool for probing multimodal foundation models.
📝 Abstract
Vision Language Models (VLMs) encode multimodal inputs over large, complex, and difficult-to-interpret architectures, which limits transparency and trust. We propose the Multimodal Inversion for Model Interpretation and Conceptualization (MIMIC) framework to visualize the internal representations of VLMs by synthesizing visual concepts corresponding to internal encodings. MIMIC uses a joint VLM-based inversion and a feature alignment objective to account for the VLM's autoregressive processing. It additionally includes a triplet of regularizers for spatial alignment, natural image smoothness, and semantic realism. We quantitatively and qualitatively evaluate MIMIC by inverting visual concepts over a range of varying-length, free-form VLM output texts. Reported results include both standard visual quality metrics and semantic text-based metrics. To the best of our knowledge, this is the first model inversion approach addressing visual interpretations of VLM concepts.
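To make the idea of feature-matching inversion with a smoothness regularizer concrete, here is a deliberately simplified, self-contained sketch. It is **not** the paper's algorithm: the frozen VLM encoder is replaced by a hypothetical fixed random linear map, only one regularizer (a total-variation smoothness penalty, standing in for the natural-image-smoothness term) is used, and the names (`encode`, `tv_grad`, `lam`) are illustrative. It shows the core loop: optimize image pixels so the frozen model's features match a target encoding, without touching the model's weights.

```python
import numpy as np

rng = np.random.default_rng(0)

H, Wd = 8, 8    # toy image size
D = 16          # feature dimension

# Frozen "encoder": a fixed random linear map standing in for a VLM's
# visual feature extractor (hypothetical stand-in, never updated).
W = rng.standard_normal((D, H * Wd)) / np.sqrt(H * Wd)

def encode(x):
    # Forward pass of the frozen model.
    return W @ x.ravel()

# Target features of a hidden "concept" image we try to reconstruct.
x_true = rng.standard_normal((H, Wd))
target = encode(x_true)

def tv_grad(x):
    """Gradient of an anisotropic total-variation (smoothness) penalty."""
    g = np.zeros_like(x)
    dx = np.diff(x, axis=1)   # horizontal differences
    dy = np.diff(x, axis=0)   # vertical differences
    g[:, :-1] -= 2 * dx; g[:, 1:] += 2 * dx
    g[:-1, :] -= 2 * dy; g[1:, :] += 2 * dy
    return g

def loss(x, lam):
    r = encode(x) - target
    dx = np.diff(x, axis=1); dy = np.diff(x, axis=0)
    return r @ r + lam * ((dx ** 2).sum() + (dy ** 2).sum())

# Invert: gradient descent on the pixels only; the encoder stays frozen.
x = np.zeros((H, Wd))
lam, lr = 1e-3, 0.05
losses = [loss(x, lam)]
for _ in range(500):
    r = encode(x) - target
    grad = 2 * (W.T @ r).reshape(H, Wd) + lam * tv_grad(x)
    x -= lr * grad
    losses.append(loss(x, lam))
```

After optimization, `x` is an image whose features under the frozen encoder match the target concept encoding, with the smoothness term discouraging high-frequency noise; the real method applies the same principle through a full autoregressive VLM with the additional alignment and realism terms.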