HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation

📅 2025-02-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Multimodal large language models (MLLMs) show a consistent imbalance: their visual understanding capabilities are stronger than their generative capabilities. This work introduces HermesFlow, a unified alignment framework that identifies this asymmetry and addresses it directly. HermesFlow combines Pairwise Direct Preference Optimization (Pair-DPO) over homologous preference data with self-play iterative optimization to jointly align understanding and generation. The method has three parts: constructing homologous multimodal preference data, modeling pairwise preferences for both understanding and generation, and running closed-loop alignment training. On multiple standard benchmarks, HermesFlow narrows the gap between understanding and generation performance and outperforms prior models, including Show-o, Transfusion, and Emu3, which supports its use as a general-purpose multimodal alignment framework.

📝 Abstract

The success of the autoregressive paradigm has driven significant advances in Multimodal Large Language Models (MLLMs), with powerful models like Show-o, Transfusion and Emu3 achieving notable progress in unified image understanding and generation. For the first time, we uncover a common phenomenon: the understanding capabilities of MLLMs are typically stronger than their generative capabilities, with a significant gap between the two. Building on this insight, we propose HermesFlow, a simple yet general framework designed to seamlessly bridge the gap between understanding and generation in MLLMs. Specifically, we take homologous data as input to curate homologous preference data for both understanding and generation. Through Pair-DPO and self-play iterative optimization, HermesFlow effectively aligns multimodal understanding and generation using this homologous preference data. Extensive experiments demonstrate the significant superiority of our approach over prior methods, particularly in narrowing the gap between multimodal understanding and generation. These findings highlight the potential of HermesFlow as a general alignment framework for next-generation multimodal foundation models. Code: https://github.com/Gen-Verse/HermesFlow
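The abstract names Pair-DPO but does not give its objective. As a rough illustration only, the sketch below starts from the standard DPO loss, -log sigmoid(beta * margin), and assumes that Pair-DPO adds the understanding margin and the generation margin for one homologous sample inside a single sigmoid. That combination rule, the function names, the tuple layout, and the default `beta` are assumptions made for this example, not details taken from the source.

```python
import math

def log_sigmoid(z: float) -> float:
    # Numerically stable log(sigmoid(z)).
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_margin(policy_win: float, policy_lose: float,
               ref_win: float, ref_lose: float) -> float:
    # Standard DPO margin: how much more the policy prefers the winning
    # output over the losing one, relative to the frozen reference model.
    # All four arguments are sequence log-probabilities.
    return (policy_win - ref_win) - (policy_lose - ref_lose)

def pair_dpo_loss(understanding, generation, beta: float = 0.1) -> float:
    # ASSUMPTION: the two task margins are summed before the sigmoid.
    # Each argument is a tuple (policy_win, policy_lose, ref_win, ref_lose).
    margin = dpo_margin(*understanding) + dpo_margin(*generation)
    return -log_sigmoid(beta * margin)

# Hypothetical log-probabilities for one homologous sample.
und = (-1.0, -3.0, -2.0, -2.0)  # caption preference pair
gen = (-1.0, -3.0, -2.0, -2.0)  # image preference pair
loss = pair_dpo_loss(und, gen)
```

When the policy equals the reference, both margins are zero and the loss is log 2. The loss falls below that value once the policy prefers the winning caption and the winning image more strongly than the reference does.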

Problem

Research questions and friction points this paper is trying to address.

Bridge the gap between multimodal understanding and generation.

Align understanding and generation using homologous preference data.

Improve unified multimodal models through the HermesFlow alignment framework.

Innovation

Methods, ideas, or system contributions that make the work stand out.

HermesFlow bridges the gap between multimodal understanding and generation.

Uses homologous preference data for alignment optimization.

Implements Pair-DPO and self-play iterative optimization.
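To make "homologous preference data" more concrete, here is a minimal sketch of how a win/lose pair might be selected from several candidate outputs for the same input. It covers only the understanding side (captions), and it uses a toy word-overlap score as a stand-in. The scoring function, the helper names, and the example strings are hypothetical and are not taken from the source. In a self-play setting, the candidates would be sampled from the model itself and re-sampled after each training round.

```python
from typing import Callable, List, Tuple

def pick_preference_pair(candidates: List[str],
                         score: Callable[[str], float]) -> Tuple[str, str]:
    # Rank candidates by score and return (winner, loser):
    # the highest-scoring and the lowest-scoring candidate.
    ranked = sorted(candidates, key=score)
    return ranked[-1], ranked[0]

def caption_overlap(reference: str) -> Callable[[str], float]:
    # Toy scorer (ASSUMPTION): Jaccard word overlap with the reference
    # caption. A real pipeline would use a learned similarity measure.
    ref_words = set(reference.lower().split())

    def score(caption: str) -> float:
        words = set(caption.lower().split())
        return len(words & ref_words) / max(len(words | ref_words), 1)

    return score

# Hypothetical candidates that a model might produce for one image.
prompt = "a red cube on a blue sphere"
captions = ["a red cube on a blue sphere", "a cube", "a green dog"]
win, lose = pick_preference_pair(captions, caption_overlap(prompt))
```

The same selection step could be applied to candidate images generated from the prompt, using an image-quality or prompt-fidelity score in place of `caption_overlap`, so that both preference pairs come from the same underlying image and caption.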
