PairUni: Pairwise Training for Unified Multimodal Language Models

📅 2025-10-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Unified vision-language models (UVLMs) suffer from data heterogeneity between understanding and generation tasks, leading to task imbalance during reinforcement learning (RL). Method: PairUni first constructs understanding-generation (U-G) paired data via a dual pairing mechanism: GPT-o3 augments single-task data (captions for understanding samples, question-answer pairs for generation samples) to form aligned pairs from the same instance, and semantic retrieval links each generation sample to a related understanding example, exposing cross-task semantic correspondences. Second, Pair-GPRO, a pair-aware variant of Group Relative Policy Optimization, modulates the advantage with a per-pair similarity score to strengthen learning from well-aligned examples and suppress task interference. RL fine-tuning is performed on a self-constructed, high-quality 16K-pair dataset (PairUG) atop the Janus-Pro architecture. Contribution/Results: PairUni achieves balanced improvements in both understanding and generation across multiple UVLMs, outperforming strong UVLM RL baselines.

📝 Abstract
Unified vision-language models (UVLMs) must perform both understanding and generation within a single architecture, but these tasks rely on heterogeneous data and supervision, making it difficult to balance them during reinforcement learning (RL). We propose PairUni, a unified framework that reorganizes data into understanding-generation (UG) pairs and aligns optimization accordingly. We first use GPT-o3 to augment single-task data, generating captions for understanding samples and question-answer (QA) pairs for generation samples, forming aligned pairs from the same instance. Additionally, for each generation sample, we retrieve a semantically related understanding example to form a retrieved pair, linking different but related data points. These paired structures expose cross-task semantic correspondences and support consistent policy learning. To leverage this structure, we present Pair-GPRO, a pair-aware variant based on Group Relative Policy Optimization. It assigns a similarity score to each pair to modulate the advantage, strengthening learning from well-aligned examples and reducing task interference. We curate a high-quality dataset of 16K UG pairs named PairUG for RL fine-tuning and evaluate PairUni on the powerful Janus-Pro UVLMs. Our approach achieves balanced improvements on various UVLMs, outperforming strong UVLM RL baselines. Code: github.com/Haochen-Wang409/PairUni
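The abstract describes Pair-GPRO as modulating the GRPO advantage with a per-pair similarity score. A minimal sketch of that idea, assuming the standard GRPO group-normalized advantage and a scalar similarity weight per UG pair (the function names and the exact weighting form are illustrative, not the paper's implementation):

```python
import statistics

def grpo_advantages(rewards):
    """Standard GRPO step: normalize rewards within a rollout group
    so each advantage is the reward's z-score inside the group."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mu) / sigma for r in rewards]

def pair_gpro_advantages(rewards, pair_similarity):
    """Pair-GPRO sketch (hypothetical form): scale each group-relative
    advantage by the similarity score of the UG pair the rollout came
    from, so well-aligned pairs contribute more to the policy update."""
    return [pair_similarity * a for a in grpo_advantages(rewards)]
```

Under this reading, a pair with similarity near 1 keeps its full advantage signal, while a weakly aligned pair is down-weighted, which matches the abstract's claim of "strengthening learning from well-aligned examples and reducing task interference."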
Problem

Research questions and friction points this paper is trying to address.

Balancing understanding and generation tasks in unified vision-language models
Addressing heterogeneous data supervision challenges in reinforcement learning
Reducing task interference through aligned pairwise training structure
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reorganizes data into understanding-generation pairs
Retrieves semantically related examples to form pairs
Uses pair-aware Group Relative Policy Optimization
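The retrieved-pair construction above can be sketched as nearest-neighbor matching over embeddings: for each generation sample, pick the understanding sample with the highest cosine similarity. The embedding inputs and function names here are illustrative; the paper does not specify the retrieval model in this summary.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def retrieve_pairs(gen_embeds, und_embeds):
    """For each generation-sample embedding, retrieve the most similar
    understanding-sample embedding (hypothetical pairing sketch).
    Returns (gen_index, und_index, similarity) triples."""
    pairs = []
    for gi, g in enumerate(gen_embeds):
        best = max(range(len(und_embeds)),
                   key=lambda ui: cosine(g, und_embeds[ui]))
        pairs.append((gi, best, cosine(g, und_embeds[best])))
    return pairs
```

The resulting similarity score per pair is exactly the quantity a pair-aware optimizer like Pair-GPRO could reuse to weight advantages.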
👥 Authors

Jiani Zheng (ByteDance)
Zhiyang Teng (Bytedance SG; Natural Language Processing)
Xiangtai Li (Research Scientist, Tiktok, SG; MMLab@NTU; Generative AI, Computer Vision)
Anran Wang (ByteDance)
Yu Tian (ByteDance)
Kunpeng Qiu (ByteDance)
Ye Tian (ByteDance)
Haochen Wang (ByteDance)
Zhuochen Wang (ByteDance)