🤖 AI Summary
Existing vision-language models (VLMs) exhibit strong reasoning capabilities over complex scenes, yet their internal decision-making remains opaque and poorly interpretable. To address this, we introduce the logits-to-video (L2V) task, the first to directly map VLM-generated semantic logits into high-fidelity videos, enabling visual attribution of model reasoning. Our model-independent method, TRANSPORTER, integrates optimal transport theory with semantic-direction-guided conditional generation to bridge a VLM's semantic space and text-to-video diffusion models. Evaluated across multiple state-of-the-art VLMs, TRANSPORTER generates videos that faithfully reflect fine-grained variations in object attributes, action adverbs, and scene context. Results demonstrate substantial improvements in both interpretability, by visually grounding abstract logits, and controllability, by enabling precise, semantics-driven video generation. This establishes a new paradigm for introspecting and steering VLM behavior through interpretable, multimodal output.
📝 Abstract
How do video understanding models arrive at their answers? Although current Vision Language Models (VLMs) reason over complex scenes with diverse objects, action performances, and scene dynamics, understanding and controlling their internal processes remains an open challenge. Motivated by recent advances in text-to-video (T2V) generative models, this paper introduces a logits-to-video (L2V) task alongside a model-independent approach, TRANSPORTER, to generate videos that capture the underlying rules behind VLMs' predictions. Given the high visual fidelity produced by T2V models, TRANSPORTER learns an optimal transport coupling to a VLM's high-level semantic embedding space. In turn, logit scores define embedding directions for conditional video generation. TRANSPORTER generates videos that reflect caption changes over diverse object attributes, action adverbs, and scene contexts. Quantitative and qualitative evaluations across VLMs demonstrate that L2V offers a fidelity-rich, previously unexplored direction for model interpretability.
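For concreteness, the sketch below illustrates the coupling-plus-direction idea in the simplest possible terms: an entropic (Sinkhorn) optimal transport plan is computed between toy VLM and T2V embedding clouds, a barycentric projection maps points across spaces, and a logit contrast between two caption variants defines the steering direction. Everything here is an assumption for illustration (the toy data, dimensions, Sinkhorn solver, barycentric map, and variable names); the paper's actual coupling, training objective, and conditioning mechanism may differ.

```python
# Illustrative sketch only: entropic OT coupling between two embedding
# spaces plus a logit-defined semantic direction. Not the paper's code;
# all data, dimensions, and names below are assumptions.
import numpy as np

def sinkhorn(a, b, cost, reg=0.05, n_iters=500):
    """Entropic-regularized OT; returns a coupling P with marginals a, b."""
    K = np.exp(-cost / reg)
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u + 1e-12)
        u = a / (K @ v + 1e-12)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
n, d_vlm, d_t2v = 128, 512, 768

# Toy stand-ins for paired caption embeddings in both spaces.
vlm_emb = rng.normal(size=(n, d_vlm))   # VLM semantic embeddings
t2v_emb = rng.normal(size=(n, d_t2v))   # T2V conditioning embeddings

# Ground cost between the clouds via an (assumed) linear projection.
W = rng.normal(size=(d_vlm, d_t2v)) / np.sqrt(d_vlm)
X = vlm_emb @ W
sq = (X**2).sum(1)[:, None] + (t2v_emb**2).sum(1)[None, :] - 2.0 * X @ t2v_emb.T
cost = np.maximum(sq, 0.0)
cost /= cost.max()

marg = np.full(n, 1.0 / n)
P = sinkhorn(marg, marg, cost)          # coupling between the two spaces

def to_t2v(i):
    """Barycentric projection of VLM sample i into T2V conditioning space."""
    w = P[i] / P[i].sum()
    return w @ t2v_emb

# A logit contrast between two caption variants (e.g. "slowly" vs
# "quickly") defines a direction; transport it and steer the condition.
i_slow, i_quick = 3, 7                  # indices of the two caption variants
alpha = 1.0                             # strength of the semantic edit
direction = to_t2v(i_quick) - to_t2v(i_slow)
steered_cond = to_t2v(i_slow) + alpha * direction
# `steered_cond` would then condition the T2V diffusion model's sampling.
```

Entropic regularization is used here only because it keeps the coupling cheap to compute; any OT solver that yields a usable transport map would fill the same role in this sketch.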