VGRP-Bench: Visual Grid Reasoning Puzzle Benchmark for Large Vision-Language Models

📅 2025-03-29

🤖 AI Summary

Problem: Current large vision-language models (LVLMs) perform poorly on visual grid puzzles, which require precise perception, rule comprehension, and logical reasoning, and existing benchmarks offer no standardized, systematic evaluation of this kind of structured reasoning. Method: We introduce VGRP-Bench, a benchmark dedicated to visual grid reasoning that comprises 20 diverse puzzles across multiple difficulty levels under a unified evaluation framework. We systematically analyze how the number of clues, grid size, and rule complexity affect LVLM reasoning performance, and explore two supervised fine-tuning strategies: SFT on solutions (S-SFT) and SFT on synthetic reasoning processes (R-SFT). Results: State-of-the-art models, including GPT-4o and Gemini-Thinking, struggle on VGRP-Bench. Both SFT variants significantly improve performance on the puzzles they were trained on but generalize poorly to unseen puzzles. The benchmark will be publicly released to advance research on complex reasoning capabilities in LVLMs.

📝 Abstract

Large Vision-Language Models (LVLMs) struggle with puzzles, which require precise perception, rule comprehension, and logical reasoning. Assessing and enhancing their performance in this domain is crucial, as it reflects their ability to engage in structured reasoning - an essential skill for real-world problem-solving. However, existing benchmarks primarily evaluate pre-trained models without additional training or fine-tuning, often lack a dedicated focus on reasoning, and fail to establish a systematic evaluation framework. To address these limitations, we introduce VGRP-Bench, a Visual Grid Reasoning Puzzle Benchmark featuring 20 diverse puzzles. VGRP-Bench spans multiple difficulty levels, and includes extensive experiments not only on existing chat LVLMs (e.g., GPT-4o), but also on reasoning LVLMs (e.g., Gemini-Thinking). Our results reveal that even the state-of-the-art LVLMs struggle with these puzzles, highlighting fundamental limitations in their puzzle-solving capabilities. Most importantly, through systematic experiments, we identify and analyze key factors influencing LVLMs' puzzle-solving performance, including the number of clues, grid size, and rule complexity. Furthermore, we explore two Supervised Fine-Tuning (SFT) strategies that can be used in post-training: SFT on solutions (S-SFT) and SFT on synthetic reasoning processes (R-SFT). While both methods significantly improve performance on trained puzzles, they exhibit limited generalization to unseen ones. We will release VGRP-Bench to facilitate further research on LVLMs for complex, real-world problem-solving.
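
The factor analysis described in the abstract (number of clues, grid size, rule complexity) amounts to grouping per-puzzle outcomes by one factor at a time and comparing accuracy across groups. The following is a minimal sketch of that bookkeeping, not the benchmark's actual evaluation code: the record fields `grid_size`, `num_clues`, and `solved` are hypothetical names, and the records are made-up placeholders rather than results from the paper.

```python
from collections import defaultdict


def accuracy_by_factor(records, factor):
    """Group per-puzzle results by one factor and return accuracy per group."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for rec in records:
        key = rec[factor]
        totals[key] += 1
        correct[key] += int(rec["solved"])
    # Sort keys so that groups appear in increasing order of the factor.
    return {key: correct[key] / totals[key] for key in sorted(totals)}


# Made-up records for illustration only; these are NOT results from the paper.
records = [
    {"grid_size": 4, "num_clues": 10, "solved": True},
    {"grid_size": 4, "num_clues": 6, "solved": False},
    {"grid_size": 9, "num_clues": 30, "solved": False},
    {"grid_size": 9, "num_clues": 40, "solved": False},
]

print(accuracy_by_factor(records, "grid_size"))
```

Calling the same function with `"num_clues"` groups the records by clue count instead, so one helper covers each factor the abstract lists.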

Problem

Research questions and friction points this paper is trying to address.

Assessing LVLMs' ability to solve puzzles that demand precise perception, rule comprehension, and logical reasoning

Addressing the lack of systematic benchmarks for visual grid reasoning puzzles

Improving LVLMs' puzzle-solving performance through fine-tuning, which so far generalizes poorly to unseen puzzles

Innovation

Methods, ideas, or system contributions that make the work stand out.

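Introduces VGRP-Bench, a benchmark of 20 visual grid reasoning puzzles spanning multiple difficulty levels

Analyzes how the number of clues, grid size, and rule complexity affect puzzle-solving performance

Explores two SFT strategies for post-training: SFT on solutions (S-SFT) and SFT on synthetic reasoning processes (R-SFT)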
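
The difference between the two fine-tuning strategies lies in the supervision target: S-SFT trains on the final solution alone, while R-SFT trains on a synthetic reasoning process that leads to it. The sketch below illustrates that distinction on a toy 2x2 grid. Everything in it is an assumption for illustration: the function names, the text serialization, the use of `0` for an empty cell, and the trace format are hypothetical and do not reproduce the paper's data format. A realistic reasoning trace would also justify each step by the puzzle rules, which this sketch omits.

```python
def grid_to_text(grid):
    """Serialize a grid row by row, writing empty cells (0) as underscores."""
    return "\n".join(" ".join(str(v) if v else "_" for v in row) for row in grid)


def build_s_sft_target(solution):
    """S-SFT-style target (hypothetical format): the solved grid only."""
    return "Solution:\n" + grid_to_text(solution)


def build_r_sft_target(puzzle, solution):
    """R-SFT-style target (hypothetical format): one step per filled cell,
    followed by the solved grid."""
    steps = []
    for r, row in enumerate(puzzle):
        for c, value in enumerate(row):
            if value == 0:  # empty cell that the trace must fill
                steps.append(f"Step {len(steps) + 1}: cell ({r}, {c}) = {solution[r][c]}")
    return "\n".join(steps) + "\n" + build_s_sft_target(solution)


# Toy puzzle: each row and column must contain 1 and 2 exactly once.
puzzle = [[1, 0], [0, 1]]
solution = [[1, 2], [2, 1]]

print(build_s_sft_target(solution))
print(build_r_sft_target(puzzle, solution))
```

Both targets end with the same solved grid, so the two strategies can be scored with the same answer check; only the intermediate supervision differs.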

🔎 Similar Papers

A Cognitive Evaluation Benchmark of Image Reasoning and Description for Large Vision Language Models