Enhancing Multi-Image Understanding through Delimiter Token Scaling

📅 2026-02-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies a critical issue in current large vision-language models: when processing multiple images, delimiter tokens fail to effectively prevent cross-image information leakage, leading the model to conflate content from distinct images. To address this, the authors propose a lightweight hidden-state scaling strategy that enhances intra-image interactions while suppressing inter-image interference—without introducing additional training or inference overhead. This approach significantly improves the model’s awareness of image boundaries. The method demonstrates consistent performance gains across multiple multi-image benchmarks, including Mantis, MuirBench, MIRB, and QBench2, and further achieves superior results on multi-document and multi-table understanding tasks such as TQABench, MultiNews, and WCEP-10.

📝 Abstract
Large Vision-Language Models (LVLMs) achieve strong performance on single-image tasks, but their performance declines when multiple images are provided as input. One major cause is cross-image information leakage, in which the model struggles to distinguish information belonging to different images. Existing LVLMs already employ delimiter tokens to mark the start and end of each image, yet our analysis reveals that these tokens fail to effectively block cross-image information leakage. To enhance their effectiveness, we propose a method that scales the hidden states of delimiter tokens. This strengthens the model's ability to preserve image-specific information by reinforcing intra-image interaction and limiting undesired cross-image interactions. Consequently, the model can better distinguish between images and reason over them more accurately. Experiments show performance gains on multi-image benchmarks such as Mantis, MuirBench, MIRB, and QBench2. We further evaluate our method on text-only tasks that require clearly distinguishing between input segments, where it improves performance on multi-document and multi-table understanding benchmarks, including TQABench, MultiNews, and WCEP-10. Notably, our method requires no additional training or inference cost.
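The abstract does not spell out the implementation. As a rough illustration only, assuming the method multiplies the hidden states at delimiter-token positions by a constant factor before they feed into attention (the function name `scale_delimiter_states` and the factor `alpha` are hypothetical, not from the paper), a minimal sketch could look like:

```python
import numpy as np

def scale_delimiter_states(hidden, delimiter_mask, alpha=1.5):
    """Scale hidden states at delimiter-token positions by `alpha`.

    Amplifying the delimiter states raises their dot-product attention
    scores, so later tokens attend more strongly to image boundaries,
    which suppresses unwanted cross-image interactions.

    hidden:         (seq_len, d_model) array of hidden states
    delimiter_mask: (seq_len,) boolean array, True at the positions of
                    image start/end delimiter tokens
    """
    scaled = hidden.copy()          # leave the original states intact
    scaled[delimiter_mask] *= alpha  # boost only the delimiter rows
    return scaled

# Toy sequence of 6 tokens; positions 0 and 3 are image delimiters.
hidden = np.ones((6, 4))
mask = np.array([True, False, False, True, False, False])
out = scale_delimiter_states(hidden, mask, alpha=2.0)
```

In practice such a scaling step would sit inside the model's forward pass (e.g. applied layer-wise), which is why no extra training or inference cost is incurred: it is a single elementwise multiply on a handful of positions.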
Problem

Research questions and friction points this paper is trying to address.

multi-image understanding
cross-image information leakage
delimiter tokens
vision-language models
image-specific information
Innovation

Methods, ideas, or system contributions that make the work stand out.

delimiter token scaling
multi-image understanding
cross-image information leakage
vision-language models
zero-cost enhancement