🤖 AI Summary
Unified vision-language models (UVLMs) must handle both understanding and generation within a single architecture, but these tasks rely on heterogeneous data and supervision, leading to task imbalance during reinforcement learning (RL). Method: We propose PairUni, a framework that reorganizes training data into understanding-generation (UG) pairs via a dual pairing mechanism: GPT-o3 augments each single-task sample (generating captions for understanding data and question-answer pairs for generation data) to form aligned pairs, and semantic retrieval links each generation sample to a related understanding example, exposing cross-task semantic correspondences. On top of this paired structure, we design Pair-GPRO, a pair-aware RL algorithm based on Group Relative Policy Optimization that weights each pair's advantage by a similarity score, strengthening learning from well-aligned examples and suppressing task interference. RL fine-tuning is performed on a self-constructed, high-quality 16K-pair dataset (PairUG) atop the Janus-Pro architecture. Contribution/Results: PairUni achieves balanced improvements in both understanding and generation across multiple UVLMs, consistently outperforming strong UVLM RL baselines.
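The retrieval step of the pairing mechanism can be sketched as nearest-neighbor matching over sample embeddings. This is a minimal illustration, not the paper's implementation: the embedding model and any similarity threshold are unspecified here, and the function name is hypothetical.

```python
import numpy as np

def build_retrieved_pairs(gen_embs: np.ndarray, und_embs: np.ndarray):
    """For each generation sample, retrieve the most semantically similar
    understanding sample by cosine similarity (hypothetical sketch).

    gen_embs: (G, d) embeddings of generation samples
    und_embs: (U, d) embeddings of understanding samples
    Returns a list of (gen_idx, und_idx, similarity) triples.
    """
    # Normalize rows so that a dot product equals cosine similarity.
    g = gen_embs / np.linalg.norm(gen_embs, axis=1, keepdims=True)
    u = und_embs / np.linalg.norm(und_embs, axis=1, keepdims=True)
    sims = g @ u.T                      # (G, U) cosine-similarity matrix
    best = sims.argmax(axis=1)          # nearest understanding sample per row
    return [(i, int(j), float(sims[i, j])) for i, j in enumerate(best)]
```

The returned similarity score is what a pair-aware RL step could later reuse to weight how strongly each retrieved pair contributes to training.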
📝 Abstract
Unified vision-language models (UVLMs) must perform both understanding and generation within a single architecture, but these tasks rely on heterogeneous data and supervision, making it difficult to balance them during reinforcement learning (RL). We propose PairUni, a unified framework that reorganizes data into understanding-generation (UG) pairs and aligns optimization accordingly. We first use GPT-o3 to augment single-task data, generating captions for understanding samples and question-answer (QA) pairs for generation samples, forming aligned pairs from the same instance. Additionally, for each generation sample, we retrieve a semantically related understanding example to form a retrieved pair, linking different but related data points. These paired structures expose cross-task semantic correspondences and support consistent policy learning. To leverage this structure, we present Pair-GPRO, a pair-aware variant based on Group Relative Policy Optimization. It assigns a similarity score to each pair to modulate the advantage, strengthening learning from well-aligned examples and reducing task interference. We curate a high-quality dataset of 16K UG pairs named PairUG for RL fine-tuning and evaluate PairUni on the powerful Janus-Pro UVLMs. Our approach achieves balanced improvements on various UVLMs, outperforming strong UVLM RL baselines. Code: [github.com/Haochen-Wang409/PairUni](https://github.com/Haochen-Wang409/PairUni)
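The similarity-modulated advantage described above can be sketched in a few lines. This is a hedged illustration under assumptions: it uses the standard group-relative normalization from GRPO and a simple multiplicative scaling by the pair's similarity score; the paper's exact modulation scheme may differ, and the function name is hypothetical.

```python
import numpy as np

def pair_gpro_advantages(rewards, pair_similarity: float, eps: float = 1e-8):
    """Group-relative advantages for one prompt's rollout group, scaled by
    the similarity score of the UG pair the prompt came from (sketch).

    rewards: rewards of the N rollouts sampled for one prompt
    pair_similarity: scalar in [0, 1] measuring UG-pair alignment
    """
    r = np.asarray(rewards, dtype=float)
    # Standard GRPO normalization: center by group mean, scale by group std.
    adv = (r - r.mean()) / (r.std() + eps)
    # Multiplicative modulation: well-aligned pairs contribute more strongly.
    return pair_similarity * adv
```

With this scaling, a poorly aligned pair (similarity near zero) contributes almost no gradient signal, which is one simple way to realize the stated goal of suppressing interference between the two tasks.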