🤖 AI Summary
This work addresses a key limitation of current vision-language models in multi-turn visual reasoning: they typically lack explicit references to image regions and iterative refinement, and thus fail to maintain spatial grounding and semantic consistency across dialogue turns. To overcome this, we propose RegionReasoner, a novel framework that introduces an explicit region-referencing mechanism and incorporates global-local semantic consistency rewards, guiding the model via reinforcement learning to accurately associate bounding boxes during reasoning. Additionally, we construct RegionDial-Bench, a new multi-turn visual reasoning benchmark supporting both detection and segmentation tasks. Experimental results demonstrate that RegionReasoner-7B significantly improves multi-turn reasoning accuracy, spatial grounding precision, and semantic consistency on this benchmark, establishing a strong baseline for future research in this direction.
📝 Abstract
Large vision-language models have achieved remarkable progress in visual reasoning, yet most existing systems rely on single-step or text-only reasoning, limiting their ability to iteratively refine understanding across multiple visual contexts. To address this limitation, we introduce RegionDial-Bench, a new multi-round visual reasoning benchmark with training and test sets spanning both detection and segmentation tasks, enabling systematic evaluation under iterative reasoning scenarios. We further propose RegionReasoner, a reinforcement learning framework that enforces grounded reasoning by requiring each reasoning trace to explicitly cite the corresponding reference bounding boxes, while maintaining semantic coherence via a global-local consistency reward. This reward extracts key objects and nouns from both global scene captions and region-level captions, aligning them with the reasoning trace to ensure consistency across reasoning steps. RegionReasoner is optimized with structured rewards combining grounding fidelity and global-local semantic alignment. Experiments on RegionDial-Bench show that RegionReasoner-7B considerably improves multi-round reasoning accuracy, spatial grounding precision, and global-local consistency, establishing a strong baseline for this emerging research direction.
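The abstract describes a structured reward that combines grounding fidelity (how well cited boxes match reference boxes) with global-local semantic alignment (whether keywords from global and region captions appear in the reasoning trace). The paper does not give the exact formulation; the sketch below is a minimal, hypothetical version using mean IoU for grounding and simple keyword recall for the consistency terms, with weights and the token-overlap heuristic chosen purely for illustration.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def keyword_recall(keywords, trace):
    """Fraction of caption keywords mentioned in the reasoning trace
    (naive whitespace tokenization; a real system would use a parser)."""
    trace_tokens = set(trace.lower().split())
    keys = {w.lower() for w in keywords}
    return len(keys & trace_tokens) / len(keys) if keys else 1.0

def structured_reward(trace, cited_boxes, ref_boxes,
                      global_keywords, region_keywords,
                      w_ground=0.5, w_global=0.25, w_local=0.25):
    """Hypothetical combination of grounding fidelity and
    global-local semantic alignment; weights are illustrative."""
    # Grounding fidelity: mean IoU between cited and reference boxes.
    ground = sum(iou(p, g) for p, g in zip(cited_boxes, ref_boxes))
    ground /= max(len(ref_boxes), 1)
    # Global and region-level consistency via keyword recall.
    r_global = keyword_recall(global_keywords, trace)
    r_local = keyword_recall(region_keywords, trace)
    return w_ground * ground + w_global * r_global + w_local * r_local
```

A trace that cites accurate boxes and mentions both scene-level and region-level key objects would score near 1.0 under this scheme, while an ungrounded or inconsistent trace is penalized on the corresponding terms.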