🤖 AI Summary
Fine-grained visual difference perception in current multimodal large language models lacks both systematic evaluation and effective training methods. To address this gap, this work introduces OddGridBench, a controlled benchmark comprising over 1,400 carefully designed grid images that assess model sensitivity to subtle visual variations in color, size, rotation, and position. The paper further proposes the OddGrid-GRPO framework, which combines curriculum learning with a spatial distance-aware reward mechanism and leverages group relative policy optimization (GRPO) to strengthen discriminative capability through reinforcement learning. Experimental results show that the proposed approach significantly outperforms baseline models, substantially narrowing the gap with human observers on OddGridBench, and the paper provides the first systematic analysis of the limitations of prevailing models on this task.
📝 Abstract
Multimodal large language models (MLLMs) have achieved remarkable performance across a wide range of vision-language tasks. However, their low-level visual perception, particularly the detection of fine-grained visual discrepancies, remains underexplored and lacks systematic analysis. In this work, we introduce OddGridBench, a controllable benchmark for evaluating the visual discrepancy sensitivity of MLLMs. OddGridBench comprises over 1,400 grid-based images, in each of which a single element differs from all others in one or more visual attributes such as color, size, rotation, or position. Experiments reveal that all evaluated MLLMs, including open-source families such as Qwen3-VL and InternVL3.5 and proprietary systems like Gemini-2.5-Pro and GPT-5, perform far below human level in visual discrepancy detection. We further propose OddGrid-GRPO, a reinforcement learning framework that integrates curriculum learning with a distance-aware reward. By progressively controlling the difficulty of training samples and incorporating spatial proximity constraints into the reward design, OddGrid-GRPO significantly enhances the model's fine-grained visual discrimination ability. We hope OddGridBench and OddGrid-GRPO will lay the groundwork for advancing perceptual grounding and visual discrepancy sensitivity in multimodal intelligence. Code and dataset are available at https://wwwtttjjj.github.io/OddGridBench/.
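The abstract does not give the reward formula, but the "spatial proximity constraints" idea can be illustrated with a minimal sketch: a reward that is 1.0 for an exact grid-cell prediction and decays with the normalized Euclidean distance to the ground-truth cell. The function name `distance_aware_reward`, the `alpha` decay parameter, and the (row, column) cell encoding are assumptions for illustration, not the paper's actual implementation.

```python
import math

def distance_aware_reward(pred_cell, true_cell, grid_size, alpha=1.0):
    """Hypothetical distance-aware reward in [0, 1].

    pred_cell, true_cell: (row, col) indices of the predicted and
    ground-truth odd element; grid_size: (rows, cols) of the grid.
    Exact matches score 1.0; nearby guesses earn partial credit that
    falls off linearly with normalized Euclidean distance.
    """
    if pred_cell == true_cell:
        return 1.0
    dr = pred_cell[0] - true_cell[0]
    dc = pred_cell[1] - true_cell[1]
    dist = math.hypot(dr, dc)
    # Normalize by the largest possible distance in this grid.
    max_dist = math.hypot(grid_size[0] - 1, grid_size[1] - 1)
    return max(0.0, 1.0 - alpha * dist / max_dist)
```

Such a shaped reward gives GRPO a denser learning signal than exact-match accuracy alone, since near-miss rollouts in a sampled group are ranked above distant ones rather than all receiving zero.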