PairUni: Pairwise Training for Unified Multimodal Language Models

📅 2025-10-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Unified vision-language models (UVLMs) suffer from data heterogeneity between understanding and generation tasks, leading to task imbalance during reinforcement learning (RL). Method: PairUni first constructs understanding-generation (U-G) paired data via a dual pairing mechanism: GPT-o3 augments single-task data (captions for understanding samples, question-answer pairs for generation samples) to form aligned pairs from the same instance, and semantic retrieval links each generation sample to a related understanding example, exposing cross-task semantic correspondences. Second, Pair-GPRO, a pair-aware variant of Group Relative Policy Optimization, modulates the advantage with a per-pair similarity score to strengthen learning from well-aligned examples and suppress task interference. RL fine-tuning is performed on a self-constructed, high-quality 16K-pair dataset (PairUG) atop the Janus-Pro architecture. Contribution/Results: PairUni achieves balanced improvements in both understanding and generation across multiple UVLMs, outperforming strong UVLM RL baselines.

📝 Abstract
Unified vision-language models (UVLMs) must perform both understanding and generation within a single architecture, but these tasks rely on heterogeneous data and supervision, making it difficult to balance them during reinforcement learning (RL). We propose PairUni, a unified framework that reorganizes data into understanding-generation (UG) pairs and aligns optimization accordingly. We first use GPT-o3 to augment single-task data, generating captions for understanding samples and question-answer (QA) pairs for generation samples, forming aligned pairs from the same instance. Additionally, for each generation sample, we retrieve a semantically related understanding example to form a retrieved pair, linking different but related data points. These paired structures expose cross-task semantic correspondences and support consistent policy learning. To leverage this structure, we present Pair-GPRO, a pair-aware variant based on Group Relative Policy Optimization. It assigns a similarity score to each pair to modulate the advantage, strengthening learning from well-aligned examples and reducing task interference. We curate a high-quality dataset of 16K UG pairs named PairUG for RL fine-tuning and evaluate PairUni on the powerful Janus-Pro UVLMs. Our approach achieves balanced improvements on various UVLMs, outperforming strong UVLM RL baselines. Code: github.com/Haochen-Wang409/PairUni
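The abstract describes Pair-GPRO as modulating the GRPO advantage with a per-pair similarity score. A minimal sketch of that idea, assuming the standard GRPO group-normalized advantage and a scalar similarity weight per UG pair (the function names and the exact weighting form are illustrative, not the paper's implementation):

```python
import statistics

def grpo_advantages(rewards):
    """Standard GRPO step: normalize rewards within a rollout group
    so each advantage is the reward's z-score inside the group."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mu) / sigma for r in rewards]

def pair_gpro_advantages(rewards, pair_similarity):
    """Pair-GPRO sketch (hypothetical form): scale each group-relative
    advantage by the similarity score of the UG pair the rollout came
    from, so well-aligned pairs contribute more to the policy update."""
    return [pair_similarity * a for a in grpo_advantages(rewards)]
```

Under this reading, a pair with similarity near 1 keeps its full advantage signal, while a weakly aligned pair is down-weighted, which matches the abstract's claim of "strengthening learning from well-aligned examples and reducing task interference."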
Problem

Research questions and friction points this paper is trying to address.

Balancing understanding and generation tasks in unified vision-language models
Addressing heterogeneous data supervision challenges in reinforcement learning
Reducing task interference through aligned pairwise training structure
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reorganizes data into understanding-generation pairs
Retrieves semantically related examples to form pairs
Uses pair-aware Group Relative Policy Optimization
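The retrieved-pair construction above can be sketched as nearest-neighbor matching over embeddings: for each generation sample, pick the understanding sample with the highest cosine similarity. The embedding inputs and function names here are illustrative; the paper does not specify the retrieval model in this summary.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def retrieve_pairs(gen_embeds, und_embeds):
    """For each generation-sample embedding, retrieve the most similar
    understanding-sample embedding (hypothetical pairing sketch).
    Returns (gen_index, und_index, similarity) triples."""
    pairs = []
    for gi, g in enumerate(gen_embeds):
        best = max(range(len(und_embeds)),
                   key=lambda ui: cosine(g, und_embeds[ui]))
        pairs.append((gi, best, cosine(g, und_embeds[best])))
    return pairs
```

The resulting similarity score per pair is exactly the quantity a pair-aware optimizer like Pair-GPRO could reuse to weight advantages.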
👥 Authors

Jiani Zheng (ByteDance)
Zhiyang Teng (Bytedance SG; Natural Language Processing)
Xiangtai Li (Research Scientist, Tiktok, SG; MMLab@NTU; Generative AI, Computer Vision)
Anran Wang (ByteDance)
Yu Tian (ByteDance)
Kunpeng Qiu (ByteDance)
Ye Tian (ByteDance)
Haochen Wang (ByteDance)
Zhuochen Wang (ByteDance)