🤖 AI Summary
This work addresses a key limitation of current vision-language models in multi-turn visual reasoning: they typically lack explicit references to image regions and iterative refinement, and thus fail to maintain spatial grounding and semantic consistency across dialogue turns. To overcome this, we propose RegionReasoner, a novel framework that introduces an explicit region-referencing mechanism and incorporates global-local semantic consistency rewards, guiding the model via reinforcement learning to accurately associate bounding boxes during reasoning. Additionally, we construct RegionDial-Bench, a new multi-turn visual reasoning benchmark supporting both detection and segmentation tasks. Experimental results demonstrate that RegionReasoner-7B significantly improves multi-turn reasoning accuracy, spatial grounding precision, and semantic consistency on this benchmark, establishing a strong baseline for future research in this direction.
📝 Abstract
Large vision-language models have achieved remarkable progress in visual reasoning, yet most existing systems rely on single-step or text-only reasoning, limiting their ability to iteratively refine understanding across multiple visual contexts. To address this limitation, we introduce RegionDial-Bench, a new multi-round visual reasoning benchmark with training and test sets spanning both detection and segmentation tasks, enabling systematic evaluation under iterative reasoning scenarios. We further propose RegionReasoner, a reinforcement learning framework that enforces grounded reasoning by requiring each reasoning trace to explicitly cite the corresponding reference bounding boxes, while maintaining semantic coherence via a global-local consistency reward. This reward extracts key objects and nouns from both global scene captions and region-level captions, aligning them with the reasoning trace to ensure consistency across reasoning steps. RegionReasoner is optimized with structured rewards combining grounding fidelity and global-local semantic alignment. Experiments on RegionDial-Bench show that RegionReasoner-7B considerably improves multi-round reasoning accuracy, spatial grounding precision, and global-local consistency, establishing a strong baseline for this emerging research direction.
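The abstract describes a structured reward that combines grounding fidelity (how well cited boxes match reference boxes) with global-local semantic alignment (whether keywords from global and region captions appear in the reasoning trace). The paper does not give the exact formulation; the sketch below is a minimal, hypothetical version using mean IoU for grounding and simple keyword recall for the consistency terms, with weights and the token-overlap heuristic chosen purely for illustration.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def keyword_recall(keywords, trace):
    """Fraction of caption keywords mentioned in the reasoning trace
    (naive whitespace tokenization; a real system would use a parser)."""
    trace_tokens = set(trace.lower().split())
    keys = {w.lower() for w in keywords}
    return len(keys & trace_tokens) / len(keys) if keys else 1.0

def structured_reward(trace, cited_boxes, ref_boxes,
                      global_keywords, region_keywords,
                      w_ground=0.5, w_global=0.25, w_local=0.25):
    """Hypothetical combination of grounding fidelity and
    global-local semantic alignment; weights are illustrative."""
    # Grounding fidelity: mean IoU between cited and reference boxes.
    ground = sum(iou(p, g) for p, g in zip(cited_boxes, ref_boxes))
    ground /= max(len(ref_boxes), 1)
    # Global and region-level consistency via keyword recall.
    r_global = keyword_recall(global_keywords, trace)
    r_local = keyword_recall(region_keywords, trace)
    return w_ground * ground + w_global * r_global + w_local * r_local
```

A trace that cites accurate boxes and mentions both scene-level and region-level key objects would score near 1.0 under this scheme, while an ungrounded or inconsistent trace is penalized on the corresponding terms.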