On Data Synthesis and Post-training for Visual Abstract Reasoning

📅 2025-04-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the poor performance and limited generalization of large-scale vision-language models (VLMs) on abstract visual reasoning (AVR) tasks. We propose a staged data synthesis and targeted post-training paradigm built upon the LLaVA-NeXT-7B architecture. Our method employs controllable AVR dataset generation, progressive curriculum learning, and multi-objective alignment fine-tuning to systematically elicit intrinsic reasoning capabilities. To our knowledge, this is the first approach enabling a lightweight VLM to achieve robust generalization across mainstream AVR benchmarks—including RAVEN, I-RAVEN, and PGM—outperforming significantly larger models such as Qwen2-VL-72B and GPT-4o. Crucially, the model's original multimodal understanding is fully preserved, with no degradation in standard vision-language comprehension. The core innovation lies in decoupling abstract reasoning into trainable, modular subprocesses, establishing a novel paradigm for advancing VLMs toward general-purpose visual reasoning.

📝 Abstract
This paper is a pioneering work addressing abstract visual reasoning (AVR) problems for large vision-language models (VLMs). We make a common LLaVA-NeXT 7B model capable of perceiving and reasoning about specific AVR problems, surpassing both powerful open-sourced VLMs (e.g., Qwen2-VL-72B) and closed-sourced ones (e.g., GPT-4o) by a significant margin. This is a notable breakthrough, since almost all previous VLMs fail or show near-random performance on representative AVR benchmarks. The key to our success is an innovative data synthesis and post-training process, designed to fully ease the task difficulty and guide the model to learn step by step. Our 7B model is also shown to perform well on AVR without sacrificing common multimodal comprehension abilities. We hope our paper can serve as an early effort in this area and inspire further research in abstract visual reasoning.
Problem

Research questions and friction points this paper is trying to address.

Enhancing visual abstract reasoning in large vision-language models
Improving performance on AVR benchmarks with innovative data synthesis
Maintaining multimodal comprehension while excelling in abstract reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Data synthesis for abstract visual reasoning
Post-training process to ease task difficulty
Enhances AVR without losing multimodal abilities
Ke Zhu
Nanjing University, Baidu VIS
Yu Wang
Baidu VIS
Jiangjiang Liu
Baidu VIS
Qunyi Xie
Baidu VIS
OCR, MLLM
Shanshan Liu
Baidu VIS
Gang Zhang
Tsinghua University
computer vision