ViPER: Empowering the Self-Evolution of Visual Perception Abilities in Vision-Language Model

📅 2025-10-28
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
To address the limited fine-grained visual perception capability of vision-language models (VLMs), this paper proposes ViPER, a two-stage progressive framework. Its core innovation is a self-critique–self-prediction closed loop: Stage I initializes perception via image-level and instance-level self-supervised reconstruction; Stage II employs a two-phase reinforcement learning strategy—integrating internal data synthesis and joint vision-language modeling—to jointly optimize coarse-grained localization and fine-grained recognition. ViPER is the first framework enabling fully self-guided perceptual evolution without external annotations, uncovering a bidirectional enhancement mechanism between generative and discriminative capabilities. Evaluated on seven benchmarks, it achieves an average improvement of 1.7%, with up to 6.0% gain on fine-grained tasks. The method significantly enhances multi-scenario performance while preserving strong generalization.
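Below is a minimal, illustrative Python sketch of the closed-loop paradigm described in this summary: a Stage I reconstruction warm-up followed by Stage II reinforcement learning on internally synthesized tasks, where the model's own critique supplies the reward. The `VLM` protocol, the `PerceptionTask` dataclass, and every method name are assumptions made for exposition; they are not ViPER's actual code or API, and the final `update(-reward)` call merely stands in for a policy-gradient update.

```python
"""Illustrative sketch only; all interfaces below are assumptions, not ViPER's API."""

from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class PerceptionTask:
    image: object      # raw image (tensor / PIL image)
    question: str      # self-synthesized perception query
    reference: str     # self-predicted reference answer


class VLM(Protocol):
    def describe(self, image, region=None) -> str: ...
    def propose_regions(self, image) -> List[object]: ...
    def reconstruction_loss(self, image, description, region=None) -> float: ...
    def synthesize_task(self, image) -> PerceptionTask: ...
    def answer(self, image, question: str) -> str: ...
    def critique(self, task: PerceptionTask, rollout: str) -> float: ...
    def update(self, objective: float) -> None: ...


def stage_one_reconstruction(model: VLM, images) -> None:
    """Stage I: self-supervised warm-up via image- and instance-level reconstruction."""
    for img in images:
        # Image-level: describe the whole image and score the reconstruction.
        loss = model.reconstruction_loss(img, model.describe(img))
        # Instance-level: describe regions the model itself proposes.
        for region in model.propose_regions(img):
            loss += model.reconstruction_loss(img, model.describe(img, region), region)
        model.update(loss)


def stage_two_self_evolution(model: VLM, images, rounds: int = 2) -> None:
    """Stage II: RL on internally synthesized tasks; the model's own critique
    is the reward signal, so no external annotations are required."""
    for _ in range(rounds):
        tasks = [model.synthesize_task(img) for img in images]   # self-prediction
        for task in tasks:
            rollout = model.answer(task.image, task.question)
            reward = model.critique(task, rollout)               # self-critique
            model.update(-reward)  # stand-in for the actual policy-gradient step
```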

📝 Abstract
The limited capacity for fine-grained visual perception presents a critical bottleneck for Vision-Language Models (VLMs) in real-world applications. Addressing this is challenging due to the scarcity of high-quality data and the limitations of existing methods: supervised fine-tuning (SFT) often compromises general capabilities, while reinforcement fine-tuning (RFT) prioritizes textual reasoning over visual perception. To bridge this gap, we propose a novel two-stage task that structures visual perception learning as a coarse-to-fine progressive process. Based on this task formulation, we develop ViPER, a self-bootstrapping framework specifically designed to enable iterative evolution through self-critiquing and self-prediction. By synergistically integrating image-level and instance-level reconstruction with a two-stage reinforcement learning strategy, ViPER establishes a closed-loop training paradigm, where internally synthesized data directly fuel the enhancement of perceptual ability. Applied to the Qwen2.5-VL family, ViPER produces the Qwen-Viper series. With an average gain of 1.7% on seven comprehensive benchmarks spanning various tasks and up to 6.0% on fine-grained perception, Qwen-Viper consistently demonstrates superior performance across different vision-language scenarios while maintaining generalizability. Beyond enabling self-improvement in perceptual capabilities, ViPER provides concrete evidence for the reciprocal relationship between generation and understanding, a breakthrough to developing more autonomous and capable VLMs.
Problem

Research questions and friction points this paper is trying to address.

Addressing fine-grained visual perception limitations in Vision-Language Models
Overcoming data scarcity and method limitations in visual perception training
Enabling self-evolution of perceptual abilities through closed-loop learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage task structures coarse-to-fine visual learning (see the sketch after this list)
Self-bootstrapping framework enables iterative evolution via self-critiquing
Closed-loop training synergizes reconstruction with reinforcement learning
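As a purely hypothetical illustration of the coarse-to-fine task formulation, the fragment below shows how such a two-stage query might be posed: a coarse localization step followed by a fine-grained recognition step conditioned on the localized region. The prompts and field names are assumptions, not the paper's actual task format.

```python
# Hypothetical example of a coarse-to-fine perception task; not the paper's format.
two_stage_task = [
    {
        "stage": "coarse_localization",
        "prompt": "Locate the bird in the image and return its bounding box "
                  "as [x1, y1, x2, y2].",
    },
    {
        "stage": "fine_grained_recognition",
        # Conditions on the region produced by the first stage.
        "prompt": "Within the predicted box, identify the species and describe "
                  "fine-grained attributes such as beak shape and plumage pattern.",
        "depends_on": "coarse_localization",
    },
]
```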