MM-GEN: Enhancing Task Performance Through Targeted Multimodal Data Curation

📅 2025-01-07
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Vision-language models (VLMs) often underperform on fine-grained, domain-specific tasks such as chart understanding and spatial reasoning, largely because high-quality, task-aligned training data for these tasks is scarce. To address this, the paper introduces MM-Gen, a scalable three-stage pipeline for generating targeted synthetic text for candidate images: (1) partitioning the data into subgroups, (2) generating task-specific text guided by task descriptions using a stronger model, and (3) filtering out redundant and outlier samples. Fine-tuning LLaVA-1.5 (7B) on MM-Gen data yields gains of 29% on spatial reasoning and 15% on diagram understanding, and delivers improvements up to 1.6x those obtained from human-curated caption data. Code is publicly available.

📝 Abstract
Vision-language models (VLMs) are highly effective but often underperform on specialized tasks; for example, LLaVA-1.5 struggles with chart and diagram understanding due to scarce task-specific training data. Existing training data, sourced from general-purpose datasets, fails to capture the nuanced details needed for these tasks. We introduce MM-Gen, a scalable method that generates task-specific, high-quality synthetic text for candidate images by leveraging stronger models. MM-Gen employs a three-stage targeted process: partitioning data into subgroups, generating targeted text based on task descriptions, and filtering out redundant and outlier data. Fine-tuning VLMs with data generated by MM-Gen leads to significant performance gains, including 29% on spatial reasoning and 15% on diagram understanding for LLaVA-1.5 (7B). Compared to human-curated caption data, MM-Gen achieves up to 1.6x better improvements for the original models, proving its effectiveness in enhancing task-specific VLM performance and bridging the gap between general-purpose datasets and specialized requirements. Code available at https://github.com/sjoshi804/MM-Gen.
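The three-stage process described in the abstract (subgroup partitioning, task-guided text generation, redundancy/outlier filtering) can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the clustering method, thresholds, and the `strong_model` callable are all assumptions standing in for the actual components (e.g. a real embedding model and a stronger VLM).

```python
import math
import random

def cosine(a, b):
    """Cosine similarity between two vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def partition(embeddings, k, iters=20, seed=0):
    """Stage 1 (sketch): k-means-style clustering of image embeddings
    into k subgroups, using cosine similarity for assignment."""
    rng = random.Random(seed)
    centers = rng.sample(embeddings, k)
    assign = [0] * len(embeddings)
    for _ in range(iters):
        # Assign each embedding to its most similar center.
        assign = [max(range(k), key=lambda c: cosine(e, centers[c]))
                  for e in embeddings]
        # Recompute each center as the mean of its members.
        for c in range(k):
            members = [e for e, a in zip(embeddings, assign) if a == c]
            if members:
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assign

def generate_text(image_id, task_description, strong_model):
    """Stage 2 (sketch): prompt a stronger model with a task description.
    `strong_model` is a stand-in callable for the actual VLM API."""
    return strong_model(f"Task: {task_description}\nImage: {image_id}")

def filter_texts(texts, text_embeddings, dup_thresh=0.95, out_thresh=0.2):
    """Stage 3 (sketch): drop near-duplicates and outliers.
    Thresholds are illustrative, not from the paper."""
    centroid = [sum(dim) / len(text_embeddings) for dim in zip(*text_embeddings)]
    kept, kept_embs = [], []
    for t, e in zip(texts, text_embeddings):
        if cosine(e, centroid) < out_thresh:
            continue  # outlier: too far from the subgroup centroid
        if any(cosine(e, prev) > dup_thresh for prev in kept_embs):
            continue  # near-duplicate of an already kept sample
        kept.append(t)
        kept_embs.append(e)
    return kept
```

In this sketch, the filtering stage keeps the first of any near-duplicate pair and discards samples far from the subgroup centroid, which mirrors the abstract's "redundant and outlier data" criterion at a very coarse level.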
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models
Complex Diagram Understanding
Lack of Specialized Training Data
Innovation

Methods, ideas, or system contributions that make the work stand out.

MM-Gen
Vision-Language Models
Targeted Synthetic Data Generation