GenView++: Unifying Adaptive View Generation and Quality-Driven Supervision for Contrastive Representation Learning

📅 2025-09-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Contrastive learning suffers from insufficient diversity and semantic fragility in positive pair construction, as well as a lack of quality-aware supervision. To address these limitations, we propose GenView++, the first adaptive multi-condition (image-conditioned, text-conditioned, and image-text-conditioned) view generation framework. It incorporates a dual-criterion quality assessment module that evaluates both semantic alignment and view diversity, enabling dynamic selection and weighted learning of positive pairs. Furthermore, we introduce a quality-aware contrastive loss that significantly improves the reliability of the supervisory signal. In visual representation learning, GenView++ achieves a +2.5% improvement in ImageNet linear classification accuracy over MoCo v2. In vision-language tasks, it attains +12.31% higher zero-shot classification accuracy than CLIP and improves text retrieval Recall@5 on Flickr30k by +3.2%. These results demonstrate synergistic gains in view diversity, semantic consistency, and cross-task generalization.

📝 Abstract
The success of contrastive learning depends on the construction and utilization of high-quality positive pairs. However, current methods face critical limitations on two fronts: on the construction side, both handcrafted and generative augmentations often suffer from limited diversity and risk semantic corruption; on the learning side, the absence of a quality assessment mechanism leads to suboptimal supervision where all pairs are treated equally. To tackle these challenges, we propose GenView++, a unified framework that addresses both fronts through two synergistic innovations. First, to improve pair construction, GenView++ introduces a multi-source adaptive view generation mechanism that synthesizes diverse yet semantically coherent views by dynamically modulating generative parameters across image-conditioned, text-conditioned, and image-text-conditioned strategies. Second, a quality-driven contrastive learning mechanism assesses each pair's semantic alignment and diversity to dynamically reweight its training contribution, prioritizing high-quality pairs while suppressing redundant or misaligned ones. Extensive experiments demonstrate the effectiveness of GenView++ across both vision and vision-language tasks. For vision representation learning, it improves MoCo v2 by +2.5% on ImageNet linear classification. For vision-language learning, it raises average zero-shot classification accuracy by +12.31% over CLIP and +5.31% over SLIP across ten datasets, and further improves Flickr30k text retrieval R@5 by +3.2%. The code is available at https://github.com/xiaojieli0903/GenViewPlusPlus.
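The quality-aware reweighting described in the abstract can be illustrated with a minimal sketch. This is not the paper's actual loss; it is a hypothetical weighted InfoNCE in which a per-pair quality weight (assumed to come from the quality assessment module) scales each positive pair's contribution, so misaligned pairs are suppressed and high-quality pairs dominate the gradient:

```python
import torch
import torch.nn.functional as F

def quality_weighted_info_nce(z1, z2, weights, temperature=0.2):
    """Hypothetical sketch of a quality-aware contrastive (InfoNCE) loss.

    z1, z2:  embeddings of the two views of each sample, shape (N, D).
    weights: per-pair quality weights in [0, 1], shape (N,); assumed to be
             produced by a separate quality assessment step.
    """
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature           # (N, N) similarity matrix
    labels = torch.arange(z1.size(0))            # positives on the diagonal
    per_pair = F.cross_entropy(logits, labels, reduction="none")
    # Weighted mean: high-quality pairs contribute more to the loss.
    return (weights * per_pair).sum() / weights.sum().clamp(min=1e-8)
```

Down-weighting a semantically corrupted pair (one with a large per-pair loss) lowers the aggregate loss, which is exactly the behavior a quality-driven supervision signal is meant to provide.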
Problem

Research questions and friction points this paper is trying to address.

Enhancing contrastive learning through adaptive view generation
Addressing semantic corruption in generative data augmentations
Implementing quality-driven supervision for optimal pair utilization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-source adaptive view generation for diverse views
Quality-driven contrastive learning with dynamic reweighting
Unified framework combining generative and discriminative mechanisms
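The dual-criterion assessment listed above (semantic alignment plus view diversity) can be sketched as follows. This is an illustrative approximation, not the paper's implementation: it assumes alignment is measured as cosine similarity between semantic embeddings of the two views, diversity as one minus the cosine similarity of lower-level features, and that misaligned pairs are gated out by a threshold `tau_align` (a hypothetical parameter):

```python
import torch
import torch.nn.functional as F

def pair_quality(sem_a, sem_b, feat_a, feat_b, tau_align=0.5):
    """Hypothetical dual-criterion quality score for positive pairs.

    sem_*:  semantic embeddings of the two views, shape (N, D).
    feat_*: lower-level features used to measure view diversity, shape (N, D').
    Returns per-pair weights in [0, 1]: a pair should be semantically
    aligned (cosine similarity above tau_align) yet visually diverse.
    """
    align = F.cosine_similarity(sem_a, sem_b, dim=1)
    diversity = 1.0 - F.cosine_similarity(feat_a, feat_b, dim=1)
    gate = (align > tau_align).float()  # drop semantically corrupted pairs
    return gate * align.clamp(min=0) * diversity.clamp(min=0, max=1)
```

Under this scoring, a pair that is aligned but nearly identical (redundant) scores near zero, as does a pair that is diverse but semantically misaligned; only pairs that satisfy both criteria receive substantial weight.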
Xiaojie Li
School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), 518055, China
Bei Wang
School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), 518055, China
Jianlong Wu
Professor, Harbin Institute of Technology (Shenzhen)
Computer Vision · Multimodal Learning
Yue Yu
Pengcheng Laboratory, Shenzhen, China
Liqiang Nie
School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), 518055, China
Min Zhang
School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), 518055, China