Vision-G1: Towards General Vision Language Reasoning with Multi-Domain Data Curation

📅 2025-08-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language models (VLMs) suffer from limited generalization in visual reasoning due to narrow training data (biased toward mathematical/logical reasoning), absence of cross-domain reward signals, and poor compatibility with heterogeneous multimodal sources. To address these challenges, this work introduces a large-scale, multi-domain visual reasoning dataset spanning eight dimensions and 46 diverse data sources. We propose an influence-function-based sample valuation mechanism coupled with a hierarchical difficulty filtering strategy for efficient data curation. Furthermore, we design a synergistic framework integrating multi-stage reinforcement learning with curriculum learning to facilitate cross-domain knowledge transfer and fusion. This is the first systematic approach enabling verifiable integration and adaptive training of multi-source visual reasoning data. Evaluated on multiple benchmarks, Vision-G1 achieves state-of-the-art performance, significantly outperforming comparably sized open-source VLMs as well as closed-source systems including GPT-4o and Gemini-1.5 Flash.

📝 Abstract
Despite their success, current training pipelines for reasoning VLMs focus on a limited range of tasks, such as mathematical and logical reasoning. As a result, these models struggle to generalize their reasoning capabilities to a wide range of domains, primarily due to the scarcity of readily available, verifiable reward data beyond these narrowly defined areas. Moreover, integrating data from multiple domains is challenging, as the compatibility between domain-specific datasets remains uncertain. To address these limitations, we build a comprehensive RL-ready visual reasoning dataset from 46 data sources across 8 dimensions, covering a wide range of tasks including infographic, mathematical, spatial, cross-image, graphical user interface, medical, commonsense, and general-science reasoning. We propose an influence-function-based data selection and difficulty-based filtering strategy to identify high-quality training samples from this dataset. We then train the VLM, referred to as Vision-G1, using multi-round RL with a data curriculum to iteratively improve its visual reasoning capabilities. Our model achieves state-of-the-art performance across various visual reasoning benchmarks, outperforming similar-sized VLMs and even proprietary models such as GPT-4o and Gemini-1.5 Flash. The model, code, and dataset are publicly available at https://github.com/yuh-zha/Vision-G1.
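The data-curation step described in the abstract — scoring samples with influence functions, then filtering by difficulty — can be sketched as below. This is an illustrative first-order approximation (TracIn-style gradient dot products standing in for the full inverse-Hessian influence computation), not the paper's exact implementation; all function names and the pass-rate band are assumptions.

```python
import numpy as np

def influence_scores(train_grads: np.ndarray, val_grad: np.ndarray) -> np.ndarray:
    # First-order influence approximation: a training sample whose loss
    # gradient aligns with the validation-set gradient is scored as helpful.
    return train_grads @ val_grad

def difficulty_mask(pass_rates: np.ndarray, low: float = 0.1, high: float = 0.9) -> np.ndarray:
    # Keep samples the current model solves only sometimes: trivially easy
    # or currently impossible samples yield no useful RL reward signal.
    return (pass_rates > low) & (pass_rates < high)

def curate(train_grads: np.ndarray, val_grad: np.ndarray,
           pass_rates: np.ndarray, k: int) -> np.ndarray:
    scores = influence_scores(train_grads, val_grad)
    # Exclude out-of-band samples before ranking by influence.
    scores = np.where(difficulty_mask(pass_rates), scores, -np.inf)
    return np.argsort(scores)[::-1][:k]  # indices of the top-k kept samples
```

In practice the gradients would come from the VLM's loss on each candidate sample and a small held-out validation set; here they are plain arrays so the ranking logic is easy to inspect.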
Problem

Research questions and friction points this paper is trying to address.

Limited generalization of VLMs across diverse reasoning domains
Scarcity of verifiable reward data beyond narrow domains
Uncertain compatibility when integrating datasets from multiple domains
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-domain visual reasoning dataset from 46 sources across 8 dimensions
Influence-function-based data selection with difficulty-based filtering
Multi-round RL with a data curriculum
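The multi-round RL with a data curriculum listed above can be sketched as a loop that re-estimates sample difficulty against the current policy each round and trains only on the "learnable" band. This is a minimal sketch under assumptions: `rl_update` stands in for whatever policy-gradient step the paper uses, and the band thresholds and function names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# A policy is represented here only by its pass-rate estimator:
# it maps a sample id to the probability the current model solves it.
Policy = Callable[[str], float]

def multi_round_rl(
    policy: Policy,
    rl_update: Callable[[List[str]], Policy],  # hypothetical RL step; returns the updated policy
    domain_buckets: Dict[str, List[str]],      # samples grouped by domain (math, GUI, medical, ...)
    rounds: int = 3,
    band: Tuple[float, float] = (0.1, 0.9),
) -> Policy:
    lo, hi = band
    for _ in range(rounds):
        # Curriculum: re-check difficulty against the *current* policy each
        # round, pooling the learnable samples across all domains.
        batch = [s for samples in domain_buckets.values()
                 for s in samples if lo < policy(s) < hi]
        policy = rl_update(batch)
    return policy
```

The key design point the sketch illustrates is that the curriculum is adaptive: as the policy improves, previously too-hard samples enter the training band while mastered ones drop out, which is what lets the multi-round schedule fuse heterogeneous domains gradually.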