🤖 AI Summary
The opaque internal representations of vision-language models (VLMs) limit their interpretability and trustworthiness.
Method: We propose MIMIC (Multimodal Inversion for Model Interpretation and Conceptualization), the first multimodal inversion framework targeting VLM concepts. It visualizes a VLM's internal encodings by synthesizing the corresponding visual concepts, combining a joint VLM-based inversion with a feature-alignment objective that accounts for the model's autoregressive processing, while operating on a frozen VLM without any fine-tuning. A triplet of regularizers enforces spatial alignment, natural-image smoothness, and semantic realism to keep reconstructions semantically faithful.
Contribution/Results: The method achieves high-fidelity inversion of visual concepts from varying-length, free-form VLM output texts. Quantitative evaluations on standard visual-quality and semantic text-based metrics, along with qualitative analysis, show improvements in both visual quality and semantic consistency. By enabling interpretable, visually grounded analysis of VLM internal representations, the framework offers a new tool for probing multimodal foundation models.
📝 Abstract
Vision Language Models (VLMs) encode multimodal inputs over large, complex, and difficult-to-interpret architectures, which limits transparency and trust. We propose the Multimodal Inversion for Model Interpretation and Conceptualization (MIMIC) framework to visualize the internal representations of VLMs by synthesizing visual concepts corresponding to internal encodings. MIMIC uses a joint VLM-based inversion and a feature alignment objective to account for the VLM's autoregressive processing. It additionally includes a triplet of regularizers for spatial alignment, natural image smoothness, and semantic realism. We quantitatively and qualitatively evaluate MIMIC by inverting visual concepts over a range of varying-length, free-form VLM output texts. Reported results include both standard visual quality metrics and semantic text-based metrics. To the best of our knowledge, this is the first model inversion approach addressing visual interpretations of VLM concepts.
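To make the idea of feature-matching inversion with a smoothness regularizer concrete, here is a deliberately simplified, self-contained sketch. It is **not** the paper's algorithm: the frozen VLM encoder is replaced by a hypothetical fixed random linear map, only one regularizer (a total-variation smoothness penalty, standing in for the natural-image-smoothness term) is used, and the names (`encode`, `tv_grad`, `lam`) are illustrative. It shows the core loop: optimize image pixels so the frozen model's features match a target encoding, without touching the model's weights.

```python
import numpy as np

rng = np.random.default_rng(0)

H, Wd = 8, 8    # toy image size
D = 16          # feature dimension

# Frozen "encoder": a fixed random linear map standing in for a VLM's
# visual feature extractor (hypothetical stand-in, never updated).
W = rng.standard_normal((D, H * Wd)) / np.sqrt(H * Wd)

def encode(x):
    # Forward pass of the frozen model.
    return W @ x.ravel()

# Target features of a hidden "concept" image we try to reconstruct.
x_true = rng.standard_normal((H, Wd))
target = encode(x_true)

def tv_grad(x):
    """Gradient of an anisotropic total-variation (smoothness) penalty."""
    g = np.zeros_like(x)
    dx = np.diff(x, axis=1)   # horizontal differences
    dy = np.diff(x, axis=0)   # vertical differences
    g[:, :-1] -= 2 * dx; g[:, 1:] += 2 * dx
    g[:-1, :] -= 2 * dy; g[1:, :] += 2 * dy
    return g

def loss(x, lam):
    r = encode(x) - target
    dx = np.diff(x, axis=1); dy = np.diff(x, axis=0)
    return r @ r + lam * ((dx ** 2).sum() + (dy ** 2).sum())

# Invert: gradient descent on the pixels only; the encoder stays frozen.
x = np.zeros((H, Wd))
lam, lr = 1e-3, 0.05
losses = [loss(x, lam)]
for _ in range(500):
    r = encode(x) - target
    grad = 2 * (W.T @ r).reshape(H, Wd) + lam * tv_grad(x)
    x -= lr * grad
    losses.append(loss(x, lam))
```

After optimization, `x` is an image whose features under the frozen encoder match the target concept encoding, with the smoothness term discouraging high-frequency noise; the real method applies the same principle through a full autoregressive VLM with the additional alignment and realism terms.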