FineRS: Fine-grained Reasoning and Segmentation of Small Objects with Reinforcement Learning

📅 2025-10-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) struggle to precisely localize ultra-fine-grained objects, such as pixel-scale entities, in high-resolution images. To address this, FineRS is a coarse-to-fine two-stage reinforcement learning framework: Stage I performs global semantic exploration guided by text instructions, and Stage II applies a retrospective reward mechanism grounded in pixel-level localization feedback to refine local perception. Crucially, FineRS couples text-guided reasoning with pixel-level supervision, enabling joint semantic-spatial optimization. Evaluated on the newly constructed FineRS-4k benchmark and multiple public datasets, FineRS achieves state-of-the-art performance on instruction-driven segmentation and visual reasoning tasks, with notable gains in localization accuracy and robustness for small objects across diverse evaluation metrics.

📝 Abstract
Multi-modal Large Language Models (MLLMs) have shown remarkable capabilities across a wide range of vision-language tasks. However, due to restricted input resolutions, MLLMs face significant challenges in precisely understanding and localizing visual details in high-resolution images, particularly when dealing with extra-small objects embedded in cluttered contexts. To address this issue, we propose FineRS, a two-stage MLLM-based reinforcement learning framework for jointly reasoning about and segmenting extremely small objects within high-resolution scenes. FineRS adopts a coarse-to-fine pipeline comprising Global Semantic Exploration (GSE) and Localized Perceptual Refinement (LPR). Specifically, GSE performs instruction-guided reasoning to generate a textual response and a coarse target region, while LPR refines this region to produce an accurate bounding box and segmentation mask. To couple the two stages, we introduce a locate-informed retrospective reward, where LPR's outputs are used to optimize GSE for more robust coarse region exploration. Additionally, we present FineRS-4k, a new dataset for evaluating MLLMs on attribute-level reasoning and pixel-level segmentation of subtle, small-scale targets in complex high-resolution scenes. Experimental results on FineRS-4k and public datasets demonstrate that our method consistently outperforms state-of-the-art MLLM-based approaches on both instruction-guided segmentation and visual reasoning tasks.
Problem

Research questions and friction points this paper is trying to address.

Addresses precise localization of extra-small objects in high-resolution images
Enhances multimodal reasoning and segmentation in cluttered visual contexts
Improves coarse-to-fine object understanding through reinforcement learning framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage reinforcement learning framework for small objects
Coarse-to-fine pipeline with global exploration and local refinement
Locate-informed reward couples reasoning and segmentation stages
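The coupling described above can be sketched in code. This is a minimal, hypothetical illustration, not the authors' implementation: the two stage functions are stubs standing in for the MLLM policies, and the reward formula (a product of IoU terms) is an assumption about how refined localization quality might retrospectively grade the coarse stage.

```python
# Hypothetical sketch of FineRS's coarse-to-fine loop. All function names
# and the reward formula are illustrative assumptions, not the paper's API.
# Regions are (x0, y0, x1, y1) boxes in pixel coordinates.

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x1 - x0) * max(0, y1 - y0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def global_semantic_exploration(image_size, instruction):
    # Stage I stub: in FineRS this is an MLLM that reasons over the full
    # high-resolution image and emits a textual answer plus a coarse region.
    w, h = image_size
    return "answer", (w // 4, h // 4, 3 * w // 4, 3 * h // 4)

def localized_perceptual_refinement(coarse_region):
    # Stage II stub: shrink the coarse region toward a tight bounding box
    # (the real LPR stage also produces a segmentation mask).
    x0, y0, x1, y1 = coarse_region
    pad = (x1 - x0) // 8
    return (x0 + pad, y0 + pad, x1 - pad, y1 - pad)

def locate_informed_retrospective_reward(coarse_region, refined_box, gt_box):
    # Assumed reward shape: GSE is credited when its coarse region both
    # overlaps the ground truth and leads LPR to an accurate final box.
    return iou(refined_box, gt_box) * iou(coarse_region, gt_box)

answer, coarse = global_semantic_exploration((4096, 4096), "find the tiny sign")
refined = localized_perceptual_refinement(coarse)
gt = (1200, 1200, 2900, 2900)  # illustrative ground-truth box
reward = locate_informed_retrospective_reward(coarse, refined, gt)
```

In this sketch the reward lands strictly between 0 and 1, and gradients (in a real RL setup) would flow back to the coarse-stage policy, which is the role the paper assigns to the locate-informed retrospective reward.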
Lu Zhang
Dalian University of Technology, Dalian, China
Jiazuo Yu
Dalian University of Technology
Continual learning, Multi-modal large language models
Haomiao Xiong
Dalian University of Technology, Dalian, China
Ping Hu
UESTC
Computer Vision, Deep Learning, Image/Video Processing
Yunzhi Zhuge
Dalian University of Technology
Computer Vision
Huchuan Lu
Dalian University of Technology, Dalian, China
You He
Tsinghua Shenzhen International Graduate School, Shenzhen, China