🤖 AI Summary
This work addresses the poor performance and limited generalization of large-scale vision-language models (VLMs) on abstract visual reasoning (AVR) tasks. We propose a staged data synthesis and targeted post-training paradigm built upon the LLaVA-NeXT-7B architecture. Our method employs controllable AVR dataset generation, progressive curriculum learning, and multi-objective alignment fine-tuning to systematically elicit intrinsic reasoning capabilities. To our knowledge, this is the first approach enabling a lightweight VLM to achieve strong, robust generalization across mainstream AVR benchmarks—including RAVEN, I-RAVEN, and PGM—outperforming significantly larger models such as Qwen2-VL-72B and GPT-4o. Crucially, the model's original multimodal understanding is fully preserved, with no degradation on standard vision-language comprehension tasks. The core innovation lies in decoupling abstract reasoning into trainable, modular subprocesses, establishing a novel paradigm for advancing VLMs toward general-purpose visual reasoning.
📝 Abstract
This paper presents a pioneering effort to address abstract visual reasoning (AVR) problems with large vision-language models (VLMs). We enable a common LLaVA-NeXT-7B model to perceive and reason about specific AVR problems, surpassing both powerful open-source VLMs (e.g., Qwen2-VL-72B) and closed-source ones (e.g., GPT-4o) by a significant margin. This is a notable breakthrough, since almost all previous VLMs fail or perform near random chance on representative AVR benchmarks. The key to our success is an innovative data synthesis and post-training process, designed to gradually reduce task difficulty and guide the model to learn step by step. Our 7B model is also shown to perform well on AVR without sacrificing general multimodal comprehension abilities. We hope this paper serves as an early effort in this area and inspires further research in abstract visual reasoning.