See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding

📅 2026-05-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses fine-grained video object understanding, where existing methods rely on explicit visual cues such as masks or points and struggle to localize targets from text instructions alone. To overcome this limitation, we propose SWIM, a training strategy that uses mask supervision during training to guide cross-modal attention, so that the model can focus on the described object at inference time without any visual prompt. A cross-attention analysis of pretrained multimodal large language models reveals a systematic difference in visual activation between object nouns and attribute words. To address this misalignment, we construct the NL-Refer dataset, which pairs each object mask with a natural language referring expression, and we enforce spatial consistency between multi-layer cross-attention maps of object nouns and the ground-truth masks. Experiments show that this approach outperforms methods that depend on visual prompts on fine-grained object understanding benchmarks.

📝 Abstract

We present SWIM (See What I Mean), a novel training strategy that aligns vision and language representations to enable fine-grained object understanding solely from textual prompts. Unlike existing approaches that require explicit visual prompts, such as masks or points, SWIM uses mask supervision only during training to guide cross-modal attention, allowing the model to automatically attend to the user-specified object at inference. Our cross-attention analysis of pretrained multimodal large language models (MLLMs) reveals a systematic discrepancy: attribute words produce sharp, localized activations in the visual modality, whereas object nouns yield diffuse and scattered patterns due to semantic reference bias and distributed high-level representations. To address this misalignment, we construct NL-Refer, an enriched dataset in which each object mask is paired with a precise natural language referring expression. SWIM extracts multi-layer cross-attention maps from object nouns and enforces spatial consistency with ground-truth masks. Experimental results demonstrate that SWIM substantially improves text-visual alignment and achieves superior performance over visual-prompt-based methods on fine-grained object understanding benchmarks. The code and data are available at https://github.com/HumanMLLM/SWIM.
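The abstract does not state how the spatial consistency term is computed, so the following is only a minimal sketch of the general idea under stated assumptions: cross-attention weights are assumed to be available as an array of shape (layers, text tokens, height, width), the noun-token maps are averaged over layers, and a soft Dice loss (an illustrative choice, not necessarily the one used by SWIM) compares the result with the ground-truth mask. The function names `noun_attention_map` and `spatial_consistency_loss` are hypothetical.

```python
import numpy as np

def noun_attention_map(attn, noun_token_ids):
    """Average cross-attention over layers and noun tokens.

    attn: array of shape (layers, text_tokens, H, W) with non-negative weights
          (assumed layout, not taken from the paper).
    noun_token_ids: indices of the text tokens that form the object noun.
    Returns an (H, W) map normalized to sum to 1 (or all zeros if empty).
    """
    selected = attn[:, noun_token_ids]      # (layers, n_nouns, H, W)
    avg = selected.mean(axis=(0, 1))        # (H, W)
    total = avg.sum()
    return avg / total if total > 0 else avg

def spatial_consistency_loss(attn_map, mask, eps=1e-8):
    """Soft Dice loss between a max-normalized attention map and a binary mask.

    This loss form is an assumption made for illustration only.
    """
    pred = attn_map / (attn_map.max() + eps)   # rescale so the peak is about 1
    inter = (pred * mask).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + mask.sum() + eps)

# Toy example: a 4x4 grid with the object in the top-left 2x2 block.
mask = np.zeros((4, 4))
mask[:2, :2] = 1.0

# Attention for token 1 matches the mask exactly in both layers.
focused = np.zeros((2, 3, 4, 4))
focused[:, 1] = mask

# Attention spread uniformly over the whole grid.
diffuse = np.ones((2, 3, 4, 4))

focused_loss = spatial_consistency_loss(noun_attention_map(focused, [1]), mask)
diffuse_loss = spatial_consistency_loss(noun_attention_map(diffuse, [1]), mask)
print(focused_loss, diffuse_loss)  # the focused map yields the smaller loss
```

In this toy setting, the diffuse pattern that the abstract attributes to object nouns is penalized, while a map concentrated on the masked region is not. In a training setup the same quantity would be computed with a differentiable framework rather than NumPy.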

Problem

Research questions and friction points this paper is trying to address.

fine-grained object understanding

vision-language alignment

cross-modal attention

referring expressions

multimodal large language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-modal alignment

text-only prompting

fine-grained object understanding