Towards Unified Multimodal Interleaved Generation via Group Relative Policy Optimization

📅 2026-03-10
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing unified vision-language models struggle to generate interleaved multimodal content, limiting their applicability in tasks such as visual storytelling and step-by-step reasoning. This work proposes a reinforcement learning-based post-training approach that endows models with high-quality multimodal interleaved generation capabilities without requiring large-scale interleaved data. The key innovation lies in the first extension of Group Relative Policy Optimization (GRPO) to multimodal settings, complemented by a hybrid reward mechanism that integrates text relevance, image-text alignment, and structural fidelity, along with process-level rewards to jointly optimize text and image generation. Evaluated on the MMIE and InterleavedBench benchmarks, the method significantly improves generation quality, coherence, and structural faithfulness, demonstrating its effectiveness and generalization capability.

๐Ÿ“ Abstract
Unified vision-language models have made significant progress in multimodal understanding and generation, yet they largely fall short in producing multimodal interleaved outputs, which is a crucial capability for tasks like visual storytelling and step-by-step visual reasoning. In this work, we propose a reinforcement learning-based post-training strategy to unlock this capability in existing unified models, without relying on large-scale multimodal interleaved datasets. We begin with a warm-up stage using a hybrid dataset comprising curated interleaved sequences and limited data for multimodal understanding and text-to-image generation, which exposes the model to interleaved generation patterns while preserving its pretrained capabilities. To further refine interleaved generation, we propose a unified policy optimization framework that extends Group Relative Policy Optimization (GRPO) to the multimodal setting. Our approach jointly models text and image generation within a single decoding trajectory and optimizes it with our novel hybrid rewards covering textual relevance, visual-text alignment, and structural fidelity. Additionally, we incorporate process-level rewards to provide step-wise guidance, enhancing training efficiency in complex multimodal tasks. Experiments on MMIE and InterleavedBench demonstrate that our approach significantly enhances the quality and coherence of multimodal interleaved generation.
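The core optimization idea in the abstract, group-relative advantages computed over a hybrid reward, can be sketched in a few lines. The linear weighting of the three reward terms and the specific weights below are assumptions for illustration; the paper does not state its exact aggregation, and a full implementation would also include the clipped policy-gradient update over the interleaved decoding trajectory.

```python
from statistics import mean, pstdev

def hybrid_reward(text_relevance, image_text_alignment, structural_fidelity,
                  weights=(0.4, 0.4, 0.2)):
    """Combine the three reward signals named in the abstract.

    The linear combination and the default weights are illustrative
    assumptions, not the paper's published formula.
    """
    w_t, w_a, w_s = weights
    return w_t * text_relevance + w_a * image_text_alignment + w_s * structural_fidelity

def grpo_advantages(rewards):
    """Group-relative advantages, as in standard GRPO: each sampled
    rollout's reward is normalized by the mean and standard deviation
    of its group, so no learned value baseline is needed."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]

# One group of four sampled interleaved rollouts, each scored on
# (text relevance, image-text alignment, structural fidelity) in [0, 1].
scores = [(0.9, 0.8, 1.0), (0.5, 0.6, 0.5), (0.7, 0.9, 0.8), (0.2, 0.3, 0.4)]
rewards = [hybrid_reward(*s) for s in scores]
advantages = grpo_advantages(rewards)
```

Rollouts scoring above the group mean receive positive advantages and are reinforced; those below are suppressed, which is what lets GRPO optimize text and image steps of a single trajectory against relative rather than absolute reward scales.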
Problem

Research questions and friction points this paper is trying to address.

multimodal interleaved generation
vision-language models
visual storytelling
step-by-step visual reasoning
unified multimodal generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal interleaved generation
Group Relative Policy Optimization
reinforcement learning
unified vision-language models
hybrid reward