🤖 AI Summary
Multimodal large language models (MLLMs) hallucinate frequently and reason inaccurately on vision-intensive tasks such as geometric reasoning. We attribute this to insufficient visual perception, a limitation we term the "perception bottleneck," which caps the benefits of reasoning-oriented training. To quantify it, we introduce GeoPQA, a fine-grained Geo-Perception Question-Answering benchmark targeting basic geometric concepts and spatial relationships. We further propose a two-stage reinforcement learning framework: Stage 1 strengthens visual perception of geometric structures; Stage 2 builds logical reasoning on that perceptual foundation. Applied to Qwen2.5-VL-3B-Instruct, our method improves geometric reasoning by 9.7% and geometric problem solving by 9.1% over direct reasoning training, and it generalizes to figure understanding. Our core contribution is relieving the perception bottleneck so that verifiable RL rewards can effectively train reasoning in MLLMs.
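As a rough, unofficial illustration of what a perception-focused benchmark item could look like, the sketch below pairs a diagram question with a short verifiable answer and scores a response by normalized exact match. The field names, the example question, and the metric are assumptions made for illustration, not the paper's actual data format.

```python
# A minimal sketch of a GeoPQA-style perception item and its scoring.
# All names and the exact-match metric are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class PerceptionItem:
    image_path: str  # diagram containing the geometric figure
    question: str    # probes a basic concept or spatial relationship
    answer: str      # short, verifiable ground-truth answer

def exact_match(prediction: str, item: PerceptionItem) -> bool:
    """Score a model response by normalized exact match against the gold answer."""
    return prediction.strip().lower() == item.answer.strip().lower()

# Example: a question about structure rather than multi-step reasoning.
item = PerceptionItem(
    image_path="triangle_with_cevian.png",
    question="How many triangles are in the figure?",
    answer="3",
)
print(exact_match("3", item))     # True
print(exact_match("Four", item))  # False
```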
📝 Abstract
Recent advancements in reinforcement learning (RL) have enhanced the reasoning abilities of large language models (LLMs), yet the impact on multimodal LLMs (MLLMs) remains limited. In vision-intensive tasks like geometric reasoning in particular, MLLMs hallucinate frequently, leading to inaccurate reasoning. We attribute this to a perception bottleneck in MLLMs, which caps the benefits of reasoning training. To quantify this bottleneck, we design a Geo-Perception Question-Answering (GeoPQA) benchmark targeting basic geometric concepts and spatial relationships. Experiments on GeoPQA reveal significant shortcomings of MLLMs in visual perception, which constrain the RL reward signals needed for effective training. To address this bottleneck, we propose a two-stage RL training framework that first enhances the visual perception of geometric structures and then fosters reasoning capabilities. Applied to Qwen2.5-VL-3B-Instruct, our two-stage training improves geometric reasoning by 9.7% and geometric problem solving by 9.1% compared to the direct reasoning training approach. Our method also generalizes to other vision-intensive domains like figure understanding, highlighting the importance of perceptual grounding in effective MLLM reasoning.
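As a minimal sketch of how such a staged, verifiable reward scheme might be wired up (assuming an RLVR-style setup with exact-match rewards; the stage split, the "Answer:" format, and the reward values below are illustrative assumptions, not the paper's recipe):

```python
# A minimal sketch of a two-stage verifiable reward, assuming an RLVR-style
# setup. Stage 1 rewards perception of the geometric structure; Stage 2
# rewards only the final answer of the reasoning chain. Illustrative only.

def perception_reward(response: str, gold_structure: str) -> float:
    """Stage 1: reward a checkable description of the perceived structure."""
    return 1.0 if response.strip().lower() == gold_structure.strip().lower() else 0.0

def reasoning_reward(response: str, gold_answer: str) -> float:
    """Stage 2: reward only the verifiable final answer (assumed format)."""
    final = response.rsplit("Answer:", 1)[-1].strip()
    return 1.0 if final == gold_answer.strip() else 0.0

def staged_reward(stage: int, response: str, gold: str) -> float:
    """Dispatch: train perception first, then switch the objective to reasoning."""
    return perception_reward(response, gold) if stage == 1 else reasoning_reward(response, gold)

# Stage 1 sample: the model must state the perceived spatial relation.
print(staged_reward(1, "AB is parallel to CD", "AB is parallel to CD"))  # 1.0
# Stage 2 sample: only the final answer is checked.
print(staged_reward(2, "... so the angle is Answer: 40", "40"))          # 1.0
```

The design intuition, per the abstract, is that when perception is weak, final-answer rewards are too sparse and noisy to train reasoning; rewarding a verifiable perception step first supplies a denser signal before reasoning training begins.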