SMIR: Efficient Synthetic Data Pipeline To Improve Multi-Image Reasoning

📅 2025-01-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Addressing two key bottlenecks in multi-image reasoning (the scarcity of high-quality training data and the absence of standardized evaluation benchmarks), this paper proposes an efficient synthetic data construction framework. First, it introduces a multimodal-embedding-driven image association method that automatically assembles semantically cohesive image groups and pairs them with complex, multi-step reasoning instructions. Second, it establishes SMIR-BENCH, the first benchmark supporting multi-turn interactive evaluation with automatic assessment by vision-language models (VLMs). Third, it combines multimodal retrieval, open-source LLM-based instruction generation, and VLM-based automated evaluation to synthesize 160K high-quality training samples. Fine-tuning open-source VLMs on this data improves performance on SMIR-BENCH by up to 8% over baselines, while significantly reducing data curation costs. The core contributions are: (1) the first multi-image associative synthesis paradigm, and (2) the first automated, multi-turn, multi-image reasoning benchmark.
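The paper does not spell out its association algorithm in this summary; as a minimal sketch, assuming each image already has a combined (visual + caption) embedding that is L2-normalized, a greedy cosine-similarity grouping of the kind described might look like the following. The function name `group_correlated_images` and the threshold/group-size values are illustrative, not from the paper.

```python
import numpy as np

def group_correlated_images(embeddings, threshold=0.8, group_size=3):
    """Greedily group images whose multimodal embeddings are highly similar.

    embeddings: (N, D) array of combined visual+caption embeddings,
    assumed L2-normalized so the dot product equals cosine similarity.
    Singleton seeds that attract no neighbors are discarded.
    """
    sims = embeddings @ embeddings.T  # pairwise cosine similarities
    unused = set(range(len(embeddings)))
    groups = []
    while unused:
        seed = unused.pop()
        # rank the remaining images by similarity to the seed image
        candidates = sorted(unused, key=lambda j: sims[seed, j], reverse=True)
        group = [seed]
        for j in candidates:
            if sims[seed, j] >= threshold and len(group) < group_size:
                group.append(j)
                unused.discard(j)
        if len(group) > 1:
            groups.append(group)
    return groups

# Toy example: four 2-D embeddings forming two tight pairs.
emb = np.array([[1.0, 0.0], [0.99, 0.14], [0.0, 1.0], [0.14, 0.99]])
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
print(group_correlated_images(emb, threshold=0.9))
```

Each resulting group of indices would then be handed to an open-source LLM, together with the image descriptions, to generate a multi-step reasoning instruction over the group.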

📝 Abstract
Vision-Language Models (VLMs) have shown strong performance in understanding single images, aided by numerous high-quality instruction datasets. However, multi-image reasoning tasks are still under-explored in the open-source community due to two main challenges: (1) scaling datasets with multiple correlated images and complex reasoning instructions is resource-intensive and maintaining quality is difficult, and (2) there is a lack of robust evaluation benchmarks for multi-image tasks. To address these issues, we introduce SMIR, an efficient synthetic data-generation pipeline for multi-image reasoning, and a high-quality dataset generated using this pipeline. Our pipeline efficiently extracts highly correlated images using multimodal embeddings, combining visual and descriptive information, and leverages open-source LLMs to generate high-quality instructions. Using this pipeline, we generated 160K synthetic training samples, offering a cost-effective alternative to expensive closed-source solutions. Additionally, we present SMIR-BENCH, a novel multi-image reasoning evaluation benchmark comprising 200 diverse examples across 7 complex multi-image reasoning tasks. SMIR-BENCH is multi-turn and utilizes a VLM judge to evaluate free-form responses, providing a comprehensive assessment of model expressiveness and reasoning capability across modalities. We demonstrate the effectiveness of the SMIR dataset by fine-tuning several open-source VLMs and evaluating their performance on SMIR-BENCH. Our results show that models trained on our dataset outperform baseline models in multi-image reasoning tasks by up to 8%, with a much more scalable data pipeline.
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models
Training Dataset Quality
Multi-image Task Evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

SMIR
Multi-Image Reasoning
Data Generation