🤖 AI Summary
Existing vision-language contrastive models (e.g., CLIP, SigLIP) rely on global image representations, limiting their applicability to dense prediction tasks such as object localization, OCR, and segmentation. To address this, we propose RICE, a region-aware cluster-discrimination framework that, for the first time, unifies object recognition and OCR within a single classification architecture. RICE constructs a billion-scale candidate region dataset, employs a Region Transformer layer to extract fine-grained regional semantics, and introduces a region cluster-discrimination loss that integrates with large-scale vision-language contrastive learning. The framework supports efficient distributed training and significantly enhances region-level semantic understanding, achieving state-of-the-art performance across segmentation, dense detection, and visual perception tasks in multimodal large language models. The pre-trained models are publicly released and widely adopted in downstream applications.
📝 Abstract
Learning visual representations is foundational for a broad spectrum of downstream tasks. Although recent vision-language contrastive models, such as CLIP and SigLIP, have achieved impressive zero-shot performance via large-scale vision-language alignment, their reliance on global representations constrains their effectiveness for dense prediction tasks, such as grounding, OCR, and segmentation. To address this gap, we introduce Region-Aware Cluster Discrimination (RICE), a novel method that enhances region-level visual and OCR capabilities. We first construct a billion-scale candidate region dataset and propose a Region Transformer layer to extract rich regional semantics. We further design a unified region cluster discrimination loss that jointly supports object and OCR learning within a single classification framework, enabling efficient and scalable distributed training on large-scale data. Extensive experiments show that RICE consistently outperforms previous methods on a range of tasks, including segmentation, dense detection, and visual perception for Multimodal Large Language Models (MLLMs). The pre-trained models have been released at https://github.com/deepglint/MVT.
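To make the "cluster discrimination" idea concrete: the abstract describes casting region learning as classification against a large set of semantic clusters. The sketch below is a minimal, hypothetical illustration of that pattern (it is not the paper's actual loss, whose exact form, margins, and distributed-softmax details are not given here): each region embedding is compared against a bank of cluster centers by cosine similarity, and a softmax cross-entropy loss pulls it toward its assigned cluster. The function name, shapes, and temperature value are all assumptions for illustration.

```python
import numpy as np

def cluster_discrimination_loss(region_embs, centers, labels, temperature=0.07):
    """Hypothetical sketch of a cluster-discrimination loss.

    region_embs: (N, D) region embeddings
    centers:     (K, D) cluster centers (K can be very large in practice)
    labels:      (N,)   assigned cluster index per region
    Returns mean softmax cross-entropy over cosine-similarity logits.
    """
    # L2-normalize so the dot product is cosine similarity.
    region_embs = region_embs / np.linalg.norm(region_embs, axis=1, keepdims=True)
    centers = centers / np.linalg.norm(centers, axis=1, keepdims=True)

    logits = region_embs @ centers.T / temperature        # (N, K)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()
```

In a real large-scale setting the K centers would be sharded across GPUs with a distributed softmax, which is what makes the single-classification-framework formulation scale; this sketch keeps everything on one device for clarity.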