🤖 AI Summary
Multimodal large language models (MLLMs) hallucinate frequently and reason inaccurately on vision-intensive tasks such as geometric reasoning. We attribute this to insufficient visual perception, a limitation we term the "perception bottleneck," which caps the benefits of reasoning-oriented training. To quantify it, we introduce GeoPQA, a fine-grained Geo-Perception Question-Answering benchmark targeting basic geometric concepts and spatial relationships. We further propose a two-stage reinforcement learning framework: Stage 1 strengthens visual perception of geometric structures; Stage 2 builds logical reasoning on that perceptual foundation. Applied to Qwen2.5-VL-3B-Instruct, our method improves geometric reasoning by 9.7% and geometric problem solving by 9.1% over direct reasoning training, and it generalizes to figure understanding. Our core contribution is relieving the perception bottleneck so that verifiable RL rewards can effectively train reasoning in MLLMs.
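As a rough, unofficial illustration of what a perception-focused benchmark item could look like, the sketch below pairs a diagram question with a short verifiable answer and scores a response by normalized exact match. The field names, the example question, and the metric are assumptions made for illustration, not the paper's actual data format.

```python
# A minimal sketch of a GeoPQA-style perception item and its scoring.
# All names and the exact-match metric are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class PerceptionItem:
    image_path: str  # diagram containing the geometric figure
    question: str    # probes a basic concept or spatial relationship
    answer: str      # short, verifiable ground-truth answer

def exact_match(prediction: str, item: PerceptionItem) -> bool:
    """Score a model response by normalized exact match against the gold answer."""
    return prediction.strip().lower() == item.answer.strip().lower()

# Example: a question about structure rather than multi-step reasoning.
item = PerceptionItem(
    image_path="triangle_with_cevian.png",
    question="How many triangles are in the figure?",
    answer="3",
)
print(exact_match("3", item))     # True
print(exact_match("Four", item))  # False
```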
📝 Abstract
Recent advancements in reinforcement learning (RL) have enhanced the reasoning abilities of large language models (LLMs), yet the impact on multimodal LLMs (MLLMs) remains limited. In vision-intensive tasks like geometric reasoning in particular, MLLMs hallucinate frequently, leading to inaccurate reasoning. We attribute this to a perception bottleneck in MLLMs, which caps the benefits of reasoning training. To quantify this bottleneck, we design a Geo-Perception Question-Answering (GeoPQA) benchmark targeting basic geometric concepts and spatial relationships. Experiments on GeoPQA reveal significant shortcomings of MLLMs in visual perception, which constrain the RL reward signals needed for effective training. To address this bottleneck, we propose a two-stage RL training framework that first enhances the visual perception of geometric structures and then fosters reasoning capabilities. Applied to Qwen2.5-VL-3B-Instruct, our two-stage training improves geometric reasoning by 9.7% and geometric problem solving by 9.1% compared to the direct reasoning training approach. Our method also generalizes to other vision-intensive domains like figure understanding, highlighting the importance of perceptual grounding in effective MLLM reasoning.
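As a minimal sketch of how such a staged, verifiable reward scheme might be wired up (assuming an RLVR-style setup with exact-match rewards; the stage split, the "Answer:" format, and the reward values below are illustrative assumptions, not the paper's recipe):

```python
# A minimal sketch of a two-stage verifiable reward, assuming an RLVR-style
# setup. Stage 1 rewards perception of the geometric structure; Stage 2
# rewards only the final answer of the reasoning chain. Illustrative only.

def perception_reward(response: str, gold_structure: str) -> float:
    """Stage 1: reward a checkable description of the perceived structure."""
    return 1.0 if response.strip().lower() == gold_structure.strip().lower() else 0.0

def reasoning_reward(response: str, gold_answer: str) -> float:
    """Stage 2: reward only the verifiable final answer (assumed format)."""
    final = response.rsplit("Answer:", 1)[-1].strip()
    return 1.0 if final == gold_answer.strip() else 0.0

def staged_reward(stage: int, response: str, gold: str) -> float:
    """Dispatch: train perception first, then switch the objective to reasoning."""
    return perception_reward(response, gold) if stage == 1 else reasoning_reward(response, gold)

# Stage 1 sample: the model must state the perceived spatial relation.
print(staged_reward(1, "AB is parallel to CD", "AB is parallel to CD"))  # 1.0
# Stage 2 sample: only the final answer is checked.
print(staged_reward(2, "... so the angle is Answer: 40", "40"))          # 1.0
```

The design intuition, per the abstract, is that when perception is weak, final-answer rewards are too sparse and noisy to train reasoning; rewarding a verifiable perception step first supplies a denser signal before reasoning training begins.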