See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection

📅 2026-04-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current vision-language models (VLMs) often neglect low-level visual details and lack an effective visual feedback mechanism, which limits fine-grained understanding and accuracy. This work proposes ForeSight, a unified multimodal interleaved reasoning framework that integrates low-level vision tools and a mask-based visual feedback mechanism into the VLM inference process. The result is a closed loop in which the model can reflect on and revise its answers, with reinforcement learning governing when tools are invoked and when answers are verified. The authors also introduce CG-SalBench, a new benchmark dataset built on SalBench for evaluating the framework. Experiments show that ForeSight-7B significantly outperforms models of comparable parameter scale and surpasses state-of-the-art closed-source models on certain metrics.

📝 Abstract

Recent advances in Vision-Language Models (VLMs) have benefited from Reinforcement Learning (RL) for enhanced reasoning. However, existing methods still face critical limitations, including the lack of low-level visual information and of effective visual feedback. To address these problems, this paper proposes ForeSight, a unified multimodal interleaved reasoning framework that enables VLMs to See Further with low-level visual cues and Think Deeper with effective visual feedback. First, it introduces a set of low-level visual tools that integrate essential visual information into the reasoning chain, mitigating the neglect of fine-grained visual features. Second, it designs a mask-based visual feedback mechanism that incorporates visual reflection into the thinking process, enabling the model to dynamically re-examine and update its answers. Driven by RL, ForeSight learns to decide autonomously when to invoke tools and when to verify answers, with final answer accuracy as the reward signal. To evaluate the proposed framework, the authors construct a new dataset, Character and Grounding SalBench (CG-SalBench), based on the SalBench dataset. Experimental results demonstrate that the ForeSight-7B model significantly outperforms other models of the same parameter scale and even surpasses current SOTA closed-source models on certain metrics.
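To make the closed loop described in the abstract more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a grayscale image stored as nested lists, a binary mask that marks the region the model attended to, and an exact-match reward on the final answer. The names apply_mask, accuracy_reward, answer_with_reflection, propose, and verify are hypothetical and do not come from the paper; the abstract does not specify the mask format, the stopping rule, or the exact reward definition.

```python
from typing import Callable, List, Tuple

Image = List[List[int]]   # grayscale image as rows of pixel intensities
Mask = List[List[bool]]   # True marks pixels the model attended to


def apply_mask(image: Image, mask: Mask, fill: int = 0) -> Image:
    """Keep pixels where the mask is True; replace all others with `fill`."""
    return [
        [px if keep else fill for px, keep in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]


def accuracy_reward(predicted: str, reference: str) -> float:
    """Assumed reward: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if predicted.strip().lower() == reference.strip().lower() else 0.0


def answer_with_reflection(
    image: Image,
    propose: Callable[[Image], Tuple[str, Mask]],
    verify: Callable[[Image, str], Tuple[str, Mask]],
    max_rounds: int = 2,
) -> str:
    """Propose an answer, then re-examine it against the masked image.

    `propose` and `verify` stand in for model calls (hypothetical here).
    The loop stops early once verification leaves the answer unchanged.
    """
    answer, mask = propose(image)
    for _ in range(max_rounds):
        masked = apply_mask(image, mask)
        revised, mask = verify(masked, answer)
        if revised == answer:
            break
        answer = revised
    return answer
```

In an RL setup of the kind the abstract outlines, the value returned by accuracy_reward for the final answer would serve as the training signal, while the choice of whether to call a tool or run another verification round would be left to the learned policy rather than fixed by max_rounds as it is in this sketch.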

Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models

low-level visual cues

visual feedback

reasoning ability

multimodal reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

low-level visual cues

visual feedback

multimodal reasoning