Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark

📅 2024-07-18
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
Current large multimodal models (LMMs) exhibit three critical limitations in multi-image question answering (MIQA): weak cross-image reasoning, susceptibility to irrelevant images, and high sensitivity to the spatial position of key visual information within the context window. Method: The authors introduce Visual Haystacks, a vision-centric, long-context, multi-image benchmark designed to systematically expose these deficiencies, and design MIRAGE, a lightweight, open-source vision-augmented RAG framework that processes up to 10,000 images on a single GPU, far beyond the roughly 1,000-image limit of contemporary models. Its core techniques include vision-guided retrieval-augmented generation (V-RAG), multi-image embedding alignment, hierarchical token compression, and cross-image attention masking optimization. Results: Experiments show MIRAGE achieves up to a 13% improvement over existing open-source LMMs on Visual Haystacks, establishes a new state of the art on the RetVQA multi-image QA benchmark, and remains competitive with top models on single-image QA.

📝 Abstract
Large Multimodal Models (LMMs) have made significant strides in visual question-answering for single images. Recent advancements like long-context LMMs have allowed them to ingest larger, or even multiple, images. However, the ability to process a large number of visual tokens does not guarantee effective retrieval and reasoning for multi-image question answering (MIQA), especially in real-world applications like photo album searches or satellite imagery analysis. In this work, we first assess the limitations of current benchmarks for long-context LMMs. We address these limitations by introducing a new vision-centric, long-context benchmark, "Visual Haystacks (VHs)". We comprehensively evaluate both open-source and proprietary models on VHs, and demonstrate that these models struggle when reasoning across potentially unrelated images, perform poorly on cross-image reasoning, and exhibit biases based on the placement of key information within the context window. Towards a solution, we introduce MIRAGE (Multi-Image Retrieval Augmented Generation), an open-source, lightweight visual-RAG framework that processes up to 10k images on a single 40G A100 GPU -- far surpassing the 1k-image limit of contemporary models. MIRAGE demonstrates up to 13% performance improvement over existing open-source LMMs on VHs, sets a new state-of-the-art on the RetVQA multi-image QA benchmark, and achieves competitive performance on single-image QA with state-of-the-art LMMs. Our dataset, model, and code are available at: https://visual-haystacks.github.io.
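To make the positional-bias evaluation described in the abstract concrete, here is a minimal sketch of how a single visual-haystack trial might be assembled: one "needle" image placed at a controlled index among unrelated distractors, swept across every position so accuracy can be compared against placement. The helper names (`build_haystack_trial`, `positional_sweep`) are illustrative assumptions, not the benchmark's actual code.

```python
import random

def build_haystack_trial(needle, distractors, haystack_size, needle_pos):
    """Assemble one trial: a list of `haystack_size` images with the
    needle placed at index `needle_pos`.

    `needle` and the `distractors` entries are opaque image handles
    (paths, arrays, ...); only the list layout matters here.
    """
    if not 0 <= needle_pos < haystack_size:
        raise ValueError("needle_pos must fall inside the haystack")
    pool = random.sample(distractors, haystack_size - 1)
    return pool[:needle_pos] + [needle] + pool[needle_pos:]

def positional_sweep(needle, distractors, haystack_size):
    """Yield (position, trial) pairs, one trial per needle position,
    so model accuracy can be plotted against where the key image
    sits in the context window."""
    for pos in range(haystack_size):
        yield pos, build_haystack_trial(needle, distractors,
                                        haystack_size, pos)
```

Each trial would then be paired with a question whose answer depends only on the needle image, so any accuracy drop at particular positions isolates the placement bias the paper reports.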
Problem

Research questions and friction points this paper is trying to address.

Do existing long-context LMM benchmarks meaningfully test retrieval and reasoning over many images, rather than just raw token capacity?
How well do LMMs perform when key visual information is buried among large numbers of unrelated images, and at varying positions in the context window?
Can multi-image QA scale past the ~1k-image limit of contemporary models without proprietary-scale resources?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces Visual Haystacks benchmark
Develops MIRAGE, a lightweight visual-RAG framework that processes up to 10k images on a single 40G A100 GPU
Achieves up to 13% improvement over existing open-source LMMs on VHs
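The retrieve-then-answer idea behind a visual-RAG framework such as MIRAGE can be sketched generically: score every candidate image against the question, keep only the most relevant few, and hand that small subset to the LMM instead of the full haystack. The cosine scoring and function names below are illustrative assumptions, not MIRAGE's actual retriever.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve_then_answer(question_emb, image_embs, images, answer_fn, top_k=3):
    """Generic visual-RAG loop (a sketch, not MIRAGE itself):
    rank all images by similarity to the question embedding,
    keep the top_k, and let the LMM (`answer_fn`) reason over
    only that filtered subset."""
    ranked = sorted(range(len(images)),
                    key=lambda i: cosine(question_emb, image_embs[i]),
                    reverse=True)
    kept = [images[i] for i in ranked[:top_k]]
    return answer_fn(kept)
```

Because the LMM only ever sees `top_k` images, the context cost stays constant as the haystack grows, which is what lets a retrieval-augmented pipeline scale to thousands of images on one GPU.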