Chain-of-Focus: Adaptive Visual Search and Zooming for Multimodal Reasoning via RL

📅 2025-05-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language models (VLMs) cannot dynamically focus on salient image regions guided by both the textual query and accumulated visual cues, which limits their multimodal reasoning. To address this, the paper proposes Chain-of-Focus (CoF), a mechanism for adaptive, iterative focusing and zooming on key image regions. CoF is trained in two stages: supervised fine-tuning (SFT) on the newly constructed MM-CoF dataset as a cold start, followed by reinforcement learning (RL) that uses outcome accuracy and format rewards to refine the search-and-reasoning policy without human priors. On the V* benchmark, evaluated across eight image resolutions from 224×224 up to 4K, the fine-tuned Qwen2.5-VL model outperforms existing VLMs by 5%, demonstrating stronger complex visual reasoning and more efficient practical deployment.

📝 Abstract
Vision-language models (VLMs) have achieved impressive performance across a variety of computer vision tasks. However, their multimodal reasoning capability has not been fully explored. In this paper, we propose a Chain-of-Focus (CoF) method that allows VLMs to perform adaptive focusing and zooming in on key image regions based on obtained visual cues and the given questions, achieving efficient multimodal reasoning. To enable this CoF capability, we present a two-stage training pipeline consisting of supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we construct the MM-CoF dataset, comprising 3K samples derived from a visual agent designed to adaptively identify key regions for solving visual tasks with different image resolutions and questions, and use it to fine-tune the Qwen2.5-VL model as a cold start. In the RL stage, we use outcome accuracy and format as rewards to update the Qwen2.5-VL model, further refining its search and reasoning strategies without human priors. Our model achieves significant improvements on multiple benchmarks. On the V* benchmark, which requires strong visual reasoning capability, our model outperforms existing VLMs by 5% across 8 image resolutions ranging from 224×224 to 4K, demonstrating the effectiveness of the proposed CoF method and facilitating more efficient deployment of VLMs in practical applications.
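The RL stage described above rewards both answer correctness and output format. A minimal sketch of such an outcome-plus-format reward; the function names, the `<think>…</think><answer>…</answer>` template, and the format weight are illustrative assumptions, not the authors' implementation:

```python
import re

def format_reward(response: str) -> float:
    """1.0 if the response follows an assumed <think>...</think><answer>...</answer>
    template, else 0.0."""
    pattern = r"<think>.*</think>\s*<answer>.*</answer>"
    return 1.0 if re.fullmatch(pattern, response, flags=re.DOTALL) else 0.0

def accuracy_reward(response: str, ground_truth: str) -> float:
    """1.0 if the answer extracted from the template matches the ground truth."""
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    answer = match.group(1).strip() if match else ""
    return 1.0 if answer == ground_truth.strip() else 0.0

def outcome_reward(response: str, ground_truth: str, w_fmt: float = 0.5) -> float:
    """Total reward: accuracy plus a weighted format bonus."""
    return accuracy_reward(response, ground_truth) + w_fmt * format_reward(response)
```

For example, a well-formatted correct response would receive the full reward, a well-formatted wrong answer only the format bonus, and an unformatted response nothing.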
Problem

Research questions and friction points this paper is trying to address.

Enhancing VLM multimodal reasoning via adaptive visual focus
Training VLMs to zoom on key image regions efficiently
Improving visual reasoning accuracy across multiple resolutions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive visual search via reinforcement learning
Two-stage training with supervised fine-tuning
Key region zooming for efficient reasoning
👥 Authors
Xintong Zhang — Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science & Technology, Beijing Institute of Technology
Zhi Gao — State Key Laboratory of General Artificial Intelligence, BIGAI
Bofei Zhang — BIGAI
Pengxiang Li — Beijing Institute of Technology (Multimodal Agent · Vision and Language · 3DV · Hyperbolic Learning)
Xiaowen Zhang — State Key Laboratory of General Artificial Intelligence, BIGAI
Yang Liu — State Key Laboratory of General Artificial Intelligence, BIGAI
Tao Yuan — University of California, Los Angeles (Computer Vision · Artificial Intelligence)
Yuwei Wu — Ph.D. candidate, GRASP Lab, University of Pennsylvania (Robotics · Trajectory Optimization · Task and Motion Planning)
Yunde Jia — Guangdong Laboratory of Machine Perception and Intelligent Computing, Shenzhen MSU-BIT University
Song-Chun Zhu — School of Intelligence Science and Technology, Peking University; Department of Automation, Tsinghua University
Qing Li — State Key Laboratory of General Artificial Intelligence, BIGAI