🤖 AI Summary
This work addresses the poor performance and limited generalization of large-scale vision-language models (VLMs) on abstract visual reasoning (AVR) tasks. We propose a staged data synthesis and targeted post-training paradigm built upon the LLaVA-NeXT-7B architecture. Our method employs controllable AVR dataset generation, progressive curriculum learning, and multi-objective alignment fine-tuning to systematically elicit intrinsic reasoning capabilities. To our knowledge, this is the first approach enabling a lightweight VLM to achieve strong, robust generalization across mainstream AVR benchmarks—including RAVEN, I-RAVEN, and PGM—outperforming significantly larger models such as Qwen2-VL-72B and GPT-4o. Crucially, the model's original multimodal understanding is fully preserved, with no degradation on standard vision-language comprehension tasks. The core innovation lies in decoupling abstract reasoning into trainable, modular subprocesses, establishing a novel paradigm for advancing VLMs toward general-purpose visual reasoning.
📝 Abstract
This paper presents a pioneering effort to address abstract visual reasoning (AVR) problems with large vision-language models (VLMs). We enable a common LLaVA-NeXT-7B model to perceive and reason about specific AVR problems, surpassing both powerful open-source VLMs (e.g., Qwen2-VL-72B) and closed-source ones (e.g., GPT-4o) by a significant margin. This is a notable breakthrough, since almost all previous VLMs fail or perform near random chance on representative AVR benchmarks. The key to our success is an innovative data synthesis and post-training process, designed to gradually reduce task difficulty and guide the model to learn step by step. Our 7B model is also shown to perform well on AVR without sacrificing general multimodal comprehension abilities. We hope this paper serves as an early effort in this area and inspires further research in abstract visual reasoning.