OddGridBench: Exposing the Lack of Fine-Grained Visual Discrepancy Sensitivity in Multimodal Large Language Models

📅 2026-03-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal large language models (MLLMs) lack both systematic evaluation of, and effective methods for, fine-grained visual difference perception. To address this gap, this work introduces OddGridBench, a controlled benchmark of over 1,400 carefully designed grid images that assess model sensitivity to subtle variations in color, size, rotation, and position. The paper further proposes OddGrid-GRPO, a framework that combines curriculum learning with a spatial distance-aware reward and applies group relative policy optimization (GRPO) to strengthen discriminative capability through reinforcement learning. Experiments show that the proposed approach significantly outperforms baseline models, substantially narrowing the gap to human observers on OddGridBench, and the paper provides the first systematic analysis of the limitations of prevailing models on this task.

📝 Abstract
Multimodal large language models (MLLMs) have achieved remarkable performance across a wide range of vision language tasks. However, their ability in low-level visual perception, particularly in detecting fine-grained visual discrepancies, remains underexplored and lacks systematic analysis. In this work, we introduce OddGridBench, a controllable benchmark for evaluating the visual discrepancy sensitivity of MLLMs. OddGridBench comprises over 1,400 grid-based images, where a single element differs from all others by one or multiple visual attributes such as color, size, rotation, or position. Experiments reveal that all evaluated MLLMs, including open-source families such as Qwen3-VL and InternVL3.5, and proprietary systems like Gemini-2.5-Pro and GPT-5, perform far below human levels in visual discrepancy detection. We further propose OddGrid-GRPO, a reinforcement learning framework that integrates curriculum learning and distance-aware reward. By progressively controlling the difficulty of training samples and incorporating spatial proximity constraints into the reward design, OddGrid-GRPO significantly enhances the model's fine-grained visual discrimination ability. We hope OddGridBench and OddGrid-GRPO will lay the groundwork for advancing perceptual grounding and visual discrepancy sensitivity in multimodal intelligence. Code and dataset are available at https://wwwtttjjj.github.io/OddGridBench/.
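The abstract describes a reward that incorporates spatial proximity constraints: partial credit when the model's predicted odd cell is near, but not exactly at, the true location. The exact formulation is not given on this page; below is a minimal sketch assuming full credit for an exact hit and a Gaussian decay with Euclidean grid distance otherwise (the `sigma` bandwidth and the 0.5 partial-credit cap are illustrative choices, not the paper's values).

```python
import math

def distance_aware_reward(pred, target, sigma=1.0):
    """Hypothetical distance-aware reward for odd-cell localization.

    pred, target: (row, col) grid coordinates.
    Returns 1.0 for an exact hit; otherwise partial credit that
    decays with Euclidean distance, capped below 0.5 so a near miss
    is never rewarded like a correct answer.
    """
    if pred == target:
        return 1.0
    d = math.dist(pred, target)  # Euclidean distance in grid units
    return 0.5 * math.exp(-d ** 2 / (2 * sigma ** 2))
```

Under this shaping, a prediction one cell away still receives a nonzero learning signal, which is the stated motivation for distance awareness: a plain exact-match reward would treat an adjacent-cell miss and a far-corner miss identically.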
Problem

Research questions and friction points this paper is trying to address.

multimodal large language models
visual discrepancy sensitivity
fine-grained visual perception
low-level visual perception
Innovation

Methods, ideas, or system contributions that make the work stand out.

visual discrepancy sensitivity
multimodal large language models
controllable benchmark
reinforcement learning with curriculum
distance-aware reward
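The innovation list pairs the distance-aware reward with a curriculum that progressively controls training-sample difficulty. The actual staging is not specified on this page; a minimal sketch, assuming single-attribute grids (color, size, rotation, position) are easier than mixed-attribute ones and that stages are evenly spaced over training, could look like:

```python
def curriculum_pool(step, total_steps,
                    levels=("color", "size", "rotation", "position", "mixed")):
    """Hypothetical curriculum schedule: as training advances, harder
    discrepancy types are added to the sampling pool. Level ordering
    and even stage spacing are assumptions, not the paper's schedule.
    """
    frac = min(step / max(total_steps, 1), 1.0)
    idx = min(int(frac * len(levels)), len(levels) - 1)
    # Earlier (easier) levels stay in the pool so the policy keeps
    # seeing them alongside newly unlocked difficulty levels.
    return levels[: idx + 1]
```

For example, early steps would sample only single-attribute color grids, while the final stage draws from all attribute types including mixed-attribute discrepancies.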
Tengjin Weng
College of Computer Science and Software Engineering, Shenzhen University
Wenhao Jiang
GML, Tencent, PolyU
Computer Vision · Machine Learning · Foundation Models
Jingyi Wang
Tsinghua University
Ming Li
Senior Research Scientist, Guangming Lab
AIGC · MLLMs · Embodied AI
Lin Ma
Meituan
Multimodal LLM · Computer Vision
Zhong Ming
College of Computer Science and Software Engineering, Shenzhen University