LVRPO: Language-Visual Alignment with GRPO for Multimodal Understanding and Generation

📅 2026-03-29
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the limitations of existing unified multimodal pretraining approaches, which rely on implicit alignment and struggle to simultaneously achieve fine-grained language–vision understanding and controllable generation. The authors propose LVRPO, a novel framework that introduces Group Relative Policy Optimization (GRPO) into multimodal learning for the first time, leveraging preference-driven reinforcement learning to explicitly align linguistic and visual representations. Notably, LVRPO requires neither additional encoders nor handcrafted cross-modal objectives, instead optimizing multimodal interaction behaviors in an end-to-end manner to jointly support understanding, generation, and reasoning tasks within a unified architecture. Experimental results demonstrate that LVRPO significantly outperforms current unified pretraining baselines across multiple benchmarks.
๐Ÿ“ Abstract
Unified multimodal pretraining has emerged as a promising paradigm for jointly modeling language and vision within a single foundation model. However, existing approaches largely rely on implicit or indirect alignment signals and remain suboptimal for simultaneously supporting multimodal understanding and generation, particularly in settings that require fine-grained language-visual reasoning and controllable generation. In this work, we propose LVRPO, a language-visual reinforcement-based preference optimization framework that explicitly aligns language and visual representations using Group Relative Policy Optimization (GRPO). Instead of introducing additional alignment losses at the representation level, LVRPO directly optimizes multimodal model behaviors through preference-driven reinforcement signals, encouraging consistent and semantically grounded interactions between language and vision across both understanding and generation tasks. This formulation enables effective alignment without requiring auxiliary encoders or handcrafted cross-modal objectives, and naturally extends to diverse multimodal capabilities. Empirically, LVRPO consistently outperforms strong unified-pretraining baselines on a broad suite of benchmarks spanning multimodal understanding, generation, and reasoning.
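The paper does not spell out implementation details, but the core of GRPO (which LVRPO builds on) is a group-relative advantage: for each prompt, the policy samples a group of responses, scores them with a preference-driven reward, and normalizes each reward by the group's mean and standard deviation instead of using a learned value network. A minimal sketch of that normalization step, with hypothetical function and variable names not taken from the paper:

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages as in GRPO: each sampled response's
    reward is normalized by the mean and std of its own group, so no
    separate value/critic network is needed."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mean) / std for r in rewards]

# Example: preference-style rewards for 4 candidate responses to one prompt.
# The best-scored response gets a positive advantage, the worst a negative one,
# and the advantages sum to zero within the group.
rewards = [1.0, 0.5, 0.0, 0.5]
print(grpo_advantages(rewards))
```

These advantages would then weight a clipped policy-gradient objective over the sampled responses; how LVRPO constructs the preference rewards for understanding versus generation tasks is described only at a high level in the abstract.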
Problem

Research questions and friction points this paper is trying to address.

multimodal understanding
multimodal generation
language-visual alignment
fine-grained reasoning
controllable generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Language-Visual Alignment
Reinforcement Learning
Preference Optimization
Multimodal Foundation Model
GRPO
Shentong Mo
Department of Machine Learning, CMU, USA
Sukmin Yun
Assistant Professor, Hanyang University - ERICA