Stitch and Tell: A Structured Multimodal Data Augmentation Method for Spatial Understanding

📅 2025-12-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-language models (VLMs) commonly suffer from spatial hallucination, i.e., inaccurate descriptions of the relative positions of objects in an image. To address this, the paper proposes an unsupervised, plug-and-play structured multimodal data augmentation method: images are automatically concatenated along a spatial axis (horizontal or vertical) without any annotations, and spatially consistent, layout-aware captions or question-answer pairs are self-generated to explicitly encode inter-object spatial relations. The approach requires no auxiliary models, human annotations, or architectural modifications, and is compatible with diverse VLM training paradigms. It achieves +5.50% and +4.19% improvements on the MME_Position and Spatial-MM benchmarks, respectively, while preserving performance on general vision-language tasks. The core contribution is the first unsupervised, image-stitching-driven injection of spatial structure into VLM training, offering a lightweight, efficient, data-level solution for enhancing spatial reasoning in VLMs.

📝 Abstract
Existing vision-language models often suffer from spatial hallucinations, i.e., generating incorrect descriptions about the relative positions of objects in an image. We argue that this problem mainly stems from the asymmetric properties between images and text. To enrich the spatial understanding ability of vision-language models, we propose a simple, annotation-free, plug-and-play method named $\text{Stitch and Tell}$ (abbreviated as SiTe), which injects structured spatial supervision into data. It constructs stitched image-text pairs by stitching images along a spatial axis and generating spatially-aware captions or question-answer pairs based on the layout of the stitched image, without relying on costly advanced models or human involvement. We evaluate SiTe across three architectures including LLaVA-v1.5-7B, LLaVA-Qwen2-1.5B and HALVA-7B, two training datasets, and eight benchmarks. Experiments show that SiTe improves spatial understanding tasks such as $\text{MME}_{\text{Position}}$ (+5.50%) and Spatial-MM (+4.19%), while maintaining or improving performance on general vision-language benchmarks including COCO-QA (+1.02%) and MMBench (+4.76%). Our findings suggest that explicitly injecting spatially-aware structure into training data offers an effective way to mitigate spatial hallucinations and improve spatial understanding, while preserving general vision-language capabilities.
Problem

Research questions and friction points this paper is trying to address.

Mitigate spatial hallucinations in vision-language models
Enhance spatial understanding of object positions in images
Inject structured spatial supervision without costly annotations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Stitching images along spatial axis for augmentation
Generating spatially-aware captions without human annotation
Injecting structured spatial supervision into training data
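The augmentation idea above can be sketched in a few lines. The snippet below is an illustrative reconstruction, not the authors' code: the caption and question-answer templates (`stitch_and_tell`, the wording of the generated text) are assumptions for the sake of example; the paper's actual templates may differ.

```python
import numpy as np

def stitch_and_tell(img_a, img_b, label_a, label_b, axis="horizontal"):
    """Stitch two images along a spatial axis and self-generate a
    spatially-aware caption plus a question-answer pair whose answer
    is determined by the stitching layout (no annotations needed)."""
    if axis == "horizontal":
        stitched = np.concatenate([img_a, img_b], axis=1)  # side by side
        rel_a, rel_b = "left", "right"
    else:
        stitched = np.concatenate([img_a, img_b], axis=0)  # stacked vertically
        rel_a, rel_b = "top", "bottom"
    caption = (f"The image containing the {label_a} is on the {rel_a}, "
               f"and the image containing the {label_b} is on the {rel_b}.")
    qa = {"question": f"Is the {label_a} on the {rel_a} or the {rel_b}?",
          "answer": f"The {label_a} is on the {rel_a}."}
    return stitched, caption, qa

# Toy 2x2 RGB "images" standing in for real training images.
a = np.zeros((2, 2, 3), dtype=np.uint8)
b = np.full((2, 2, 3), 255, dtype=np.uint8)
stitched, caption, qa = stitch_and_tell(a, b, "cat", "dog")
print(stitched.shape)  # (2, 4, 3)
print(caption)
```

Because the layout is constructed rather than detected, the spatial relation in the generated text is correct by construction, which is what lets the method skip auxiliary models and human annotation.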
Hang Yin
School of Computer Science, Beijing Institute of Technology
Xiaomin He
Columbia University
PeiWen Yuan
School of Computer Science, Beijing Institute of Technology
Yiwei Li
School of Computer Science, Beijing Institute of Technology
Jiayi Shi
School of Computer Science, Beijing Institute of Technology
Wenxiao Fan
School of Computer Science, Beijing Institute of Technology
Shaoxiong Feng
Beijing Institute of Technology; RedNote
Kan Li
Huazhong University of Science and Technology