Composition-Grounded Instruction Synthesis for Visual Reasoning

📅 2025-10-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Multimodal large language models (MLLMs) reason poorly over artificial images such as charts, rendered documents, and webpages, largely because annotated reasoning data for these domains is scarce. This paper proposes COGS (COmposition-Grounded instruction Synthesis), a data-efficient framework that starts from a small set of seed questions. First, each seed question is decomposed into primitive perception and reasoning factors, which are recomposed with new images to generate large collections of synthetic question-answer pairs, each paired with subquestions and intermediate answers. Second, those intermediate answers supply factor-level process rewards for reinforcement learning. On chart reasoning, COGS substantially improves performance on unseen questions, with the largest gains on reasoning-heavy and compositional questions. Training on a factor-level mixture of different seed data transfers better across datasets, which suggests generalizable capabilities rather than dataset-specific overfitting, and the framework also extends beyond charts to domains such as webpages.

📝 Abstract

Pretrained multimodal large language models (MLLMs) demonstrate strong performance on diverse multimodal tasks, but remain limited in reasoning capabilities for domains where annotations are difficult to collect. In this work, we focus on artificial image domains such as charts, rendered documents, and webpages, which are abundant in practice yet lack large-scale human-annotated reasoning datasets. We introduce COGS (COmposition-Grounded instruction Synthesis), a data-efficient framework for equipping MLLMs with advanced reasoning abilities from a small set of seed questions. The key idea is to decompose each seed question into primitive perception and reasoning factors, which can then be systematically recomposed with new images to generate large collections of synthetic question-answer pairs. Each generated question is paired with subquestions and intermediate answers, enabling reinforcement learning with factor-level process rewards. Experiments on chart reasoning show that COGS substantially improves performance on unseen questions, with the largest gains on reasoning-heavy and compositional questions. Moreover, training with a factor-level mixture of different seed data yields better transfer across multiple datasets, suggesting that COGS induces generalizable capabilities rather than dataset-specific overfitting. We further demonstrate that the framework extends beyond charts to other domains such as webpages.
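
The decompose-and-recompose idea in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the chart is a plain dictionary standing in for an image, the factor pools `PERCEPTION` and `REASONING` are hand-written, and the names `Factor`, `SyntheticQA`, and `synthesize` are hypothetical. The paper's actual pipeline operates on real images and is not described at this level of detail here.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import Callable, Dict, List, Tuple

# Toy stand-in for a chart image (assumption): series label -> value.
Chart = Dict[str, float]


@dataclass(frozen=True)
class Factor:
    """One primitive factor obtained by decomposing a seed question."""
    kind: str                  # "perception" or "reasoning"
    text: str                  # the subquestion this factor asks
    fn: Callable[..., float]   # computes the intermediate answer


@dataclass
class SyntheticQA:
    """A recomposed question with its subquestions and intermediate answers."""
    question: str
    steps: List[Tuple[str, float]]
    answer: float


# Hypothetical factor pools; a real system would extract these from seed questions.
PERCEPTION = [
    Factor("perception", "What is the highest value?", lambda c: max(c.values())),
    Factor("perception", "What is the lowest value?", lambda c: min(c.values())),
]
REASONING = [
    Factor("reasoning", "What is the difference between them?", lambda a, b: a - b),
    Factor("reasoning", "What is their average?", lambda a, b: (a + b) / 2),
]


def synthesize(chart: Chart) -> List[SyntheticQA]:
    """Recompose every ordered pair of perception factors with every reasoning factor."""
    samples = []
    for p1, p2 in permutations(PERCEPTION, 2):
        a, b = p1.fn(chart), p2.fn(chart)
        for r in REASONING:
            final = r.fn(a, b)
            samples.append(
                SyntheticQA(
                    question=" ".join([p1.text, p2.text, r.text]),
                    steps=[(p1.text, a), (p2.text, b), (r.text, final)],
                    answer=final,
                )
            )
    return samples
```

With two perception factors and two reasoning factors, a single new chart yields four composed questions, each carrying its own chain of intermediate answers. Growing either pool or adding charts multiplies the number of synthetic pairs, which is the sense in which a small seed set can cover a large question space.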

Problem

Research questions and friction points this paper is trying to address.

Enhancing reasoning in MLLMs for artificial images lacking annotations

Generating synthetic QA pairs from limited seed questions via decomposition

Improving performance on compositional and reasoning-heavy visual questions

Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthesizes instructions via compositional factor decomposition

Generates synthetic QA pairs with subquestions and intermediate answers

Uses factor-level process rewards for reinforcement learning training
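
One way to picture a factor-level process reward is as a score that credits each correct intermediate answer in addition to the final answer. The sketch below is a minimal illustration under stated assumptions, not the paper's reward: the function name `factor_level_reward`, the linear mix controlled by `step_weight`, the numeric tolerance `tol`, and the restriction to numeric answers are all hypothetical choices.

```python
from typing import Sequence


def factor_level_reward(
    pred_steps: Sequence[float],
    ref_steps: Sequence[float],
    pred_final: float,
    ref_final: float,
    step_weight: float = 0.5,   # assumed mixing weight, not from the paper
    tol: float = 1e-6,          # assumed numeric tolerance for a match
) -> float:
    """Mix per-factor correctness with final-answer correctness into one scalar."""
    if ref_steps:
        # zip stops at the shorter sequence, so missing predictions count as wrong.
        matched = sum(
            1 for p, r in zip(pred_steps, ref_steps) if abs(p - r) <= tol
        )
        step_score = matched / len(ref_steps)
    else:
        step_score = 0.0
    final_score = 1.0 if abs(pred_final - ref_final) <= tol else 0.0
    return step_weight * step_score + (1.0 - step_weight) * final_score
```

For example, `factor_level_reward([10, 5], [10, 4], 5, 6)` returns 0.25: one of two intermediate answers matches and the final answer is wrong. An outcome-only reward would give 0 in that case, so the per-factor term is what lets a partially correct reasoning chain still receive a learning signal.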
