LoMo: Local Modality Substitution for Deeper Vision-Language Fusion

📅 2026-05-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the inconsistency of cross-modal representations in existing vision-language models: because training data assigns text and images asymmetric roles, performance degrades sharply when a textual input is replaced by its rendered-image counterpart. To mitigate this, the authors propose Local Modality Substitution (LoMo), a lightweight, architecture-agnostic data construction paradigm that dynamically replaces text spans with semantically equivalent rendered images, forming interleaved image-text sequences. This supervises cross-modal representation invariance using standard supervised fine-tuning alone and fosters deeper multimodal fusion. Evaluated across 13 benchmarks, LoMo improves multimodal reasoning, with average gains over standard SFT of 2.67 points on LLaVA-OneVision-1.5-8B and 2.82 points on Qwen3.5-9B.
📝 Abstract
Vision-Language Models (VLMs) have achieved substantial progress across a wide range of understanding and reasoning tasks, driven by large-scale image-text training aimed at multimodal fusion. Ideally, replacing a textual question with its rendered-image counterpart should leave model performance essentially unaffected. In practice, however, such modality substitution induces dramatic performance degradation. We attribute this "carrier sensitivity" issue to an inherent bias in current training corpora. Across prevalent datasets such as image captioning, VQA, OCR, and web-sourced interleaved data, text and images are typically organized into distinct and asymmetric roles, with text serving as linguistic queries and images as visual references. Such data bias leads VLMs to exhibit distinct preferences for information acquisition across different modalities. Consequently, VLMs fail to align representations of semantically equivalent content across textual and visual carriers, making model reasoning fragile under modality substitution. To address this, we propose Local Modality Substitution (LoMo), a lightweight, architecture-agnostic data curation paradigm designed to provide supervision for cross-modal representational invariance between semantically equivalent text and image carriers. LoMo achieves this by reformulating single-modality prompts into seamlessly interleaved multimodal sequences. It dynamically selects target text spans and recasts them as rendered images, thereby preserving the same semantics across "text, visual, text" carriers. Extensive experiments across 13 diverse multimodal benchmarks demonstrate that LoMo significantly improves overall multimodal reasoning and yields deeper cross-modal fusion. Specifically, it delivers consistent gains across foundational models, improving over standard SFT by 2.67 points on LLaVA-OneVision-1.5-8B and 2.82 points on Qwen3.5-9B.
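The span-substitution step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `TextSegment`, `ImageSegment`, `local_modality_substitution`, and `recover_text` are hypothetical, the span is chosen uniformly at random over word boundaries (the abstract says only that spans are selected "dynamically"), and actual rendering is delegated to an optional `render` callable so the example needs only the standard library.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional, Union


@dataclass
class TextSegment:
    text: str


@dataclass
class ImageSegment:
    # Hypothetical stand-in for a rendered image. The source text is kept so
    # the semantic content can be checked; a real pipeline would hold pixels.
    source_text: str
    image: object = None


Segment = Union[TextSegment, ImageSegment]


def local_modality_substitution(
    prompt: str,
    rng: random.Random,
    render: Optional[Callable[[str], object]] = None,
) -> List[Segment]:
    """Replace one randomly chosen contiguous word span with an image segment.

    Assumption: uniform random span selection over whitespace-separated words.
    """
    words = prompt.split()
    if not words:
        return []
    start = rng.randrange(len(words))
    end = rng.randrange(start + 1, len(words) + 1)  # exclusive end, span is non-empty
    span = " ".join(words[start:end])

    segments: List[Segment] = []
    if start > 0:
        segments.append(TextSegment(" ".join(words[:start])))
    segments.append(ImageSegment(span, render(span) if render else None))
    if end < len(words):
        segments.append(TextSegment(" ".join(words[end:])))
    return segments


def recover_text(segments: List[Segment]) -> str:
    """Rebuild the prompt from the segments to confirm the semantics are preserved."""
    return " ".join(
        s.text if isinstance(s, TextSegment) else s.source_text for s in segments
    )
```

The result is a "text, visual, text" sequence (or a shorter one when the span touches either end of the prompt) whose concatenated content matches the original prompt. Passing a real rasterizer as `render` would turn the placeholder into an actual image.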
Problem

Research questions and friction points this paper is trying to address.

modality substitution
vision-language models
cross-modal representation
carrier sensitivity
multimodal fusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Local Modality Substitution
Cross-modal Invariance
Vision-Language Models
Modality Substitution
Multimodal Fusion