Forge-and-Quench: Enhancing Image Generation for Higher Fidelity in Unified Multimodal Models

📅 2026-01-08
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses how the image understanding capabilities of multimodal large language models (MLLMs) can be leveraged to enhance the fidelity and detail of text-to-image (T2I) generation. The authors propose Forge-and-Quench, a framework that, for the first time, translates the contextual reasoning ability of MLLMs into visual guidance signals. Specifically, the MLLM generates an enriched text instruction, which a lightweight Bridge Adapter maps into Bridge Features; these are injected into the T2I backbone so that comprehension and generation are integrated seamlessly. Extensive experiments show that the approach significantly improves image quality across multiple state-of-the-art T2I models while preserving strong instruction-following and world-knowledge capabilities, and that it enables efficient cross-model transfer with minimal overhead.
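To make the adapter step concrete, here is a minimal sketch of what a "map text features to Bridge Features" module could look like. This is an assumption-laden illustration, not the paper's actual design: the class name `BridgeAdapter`, the learnable-query cross-attention layout, and all dimensions are hypothetical stand-ins for whatever the authors implement.

```python
# Hypothetical sketch of a Bridge Adapter: a lightweight module that maps
# MLLM hidden states (text features) into a fixed set of "Bridge Feature"
# tokens, a virtual visual representation for the T2I backbone.
# The learnable-query cross-attention design below is an assumption.
import torch
import torch.nn as nn


class BridgeAdapter(nn.Module):
    def __init__(self, text_dim: int, visual_dim: int, num_tokens: int = 64):
        super().__init__()
        # Learnable query tokens that pool information from the instruction.
        self.queries = nn.Parameter(torch.randn(num_tokens, visual_dim))
        self.attn = nn.MultiheadAttention(visual_dim, num_heads=8, batch_first=True)
        self.proj_in = nn.Linear(text_dim, visual_dim)
        self.proj_out = nn.Linear(visual_dim, visual_dim)

    def forward(self, mllm_hidden: torch.Tensor) -> torch.Tensor:
        # mllm_hidden: (B, L, text_dim) hidden states of the enhanced instruction.
        kv = self.proj_in(mllm_hidden)                       # (B, L, visual_dim)
        q = self.queries.unsqueeze(0).expand(kv.size(0), -1, -1)
        bridge, _ = self.attn(q, kv, kv)                     # (B, num_tokens, visual_dim)
        return self.proj_out(bridge)                         # Bridge Features


# Example shapes (dimensions chosen arbitrarily for illustration):
adapter = BridgeAdapter(text_dim=4096, visual_dim=1024)
hidden = torch.randn(2, 77, 4096)   # (batch, seq_len, text_dim)
bridge = adapter(hidden)            # -> (2, 64, 1024)
```

A fixed-length token output of this kind would let the same Bridge Features be injected into different T2I backbones, which is consistent with the cross-model transfer the summary describes, though the actual mechanism is not specified here.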

📝 Abstract
Integrating image generation and understanding into a single framework has become a pivotal goal in the multimodal domain. However, how understanding can effectively assist generation has not been fully explored. Unlike previous works that focus on leveraging reasoning abilities and world knowledge from understanding models, this paper introduces a novel perspective: leveraging understanding to enhance the fidelity and detail richness of generated images. To this end, we propose Forge-and-Quench, a new unified framework that puts this principle into practice. In the generation process of our framework, an MLLM first reasons over the entire conversational context, including text instructions, to produce an enhanced text instruction. This refined instruction is then mapped to a virtual visual representation, termed the Bridge Feature, via a novel Bridge Adapter. This feature acts as a crucial link, forging insights from the understanding model to quench and refine the generation process. It is subsequently injected into the T2I backbone as a visual guidance signal, alongside the enhanced text instruction that replaces the original input. To validate this paradigm, we conduct comprehensive studies on the design of the Bridge Feature and Bridge Adapter. Our framework demonstrates exceptional extensibility and flexibility, enabling efficient migration across different MLLM and T2I models with significant savings in training overhead, all without compromising the MLLM's inherent multimodal understanding capabilities. Experiments show that Forge-and-Quench significantly improves image fidelity and detail across multiple models, while also maintaining instruction-following accuracy and enhancing world knowledge application. Models and code are available at https://github.com/YanbingZeng/Forge-and-Quench.
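The abstract's three-stage flow (forge an enhanced instruction, bridge it to a visual representation, quench the generator with it) can be summarized as pseudocode. Everything here is an illustrative assumption: `mllm`, `adapter`, `t2i`, and methods such as `generate_with_hidden` and `sample` are invented placeholders, not interfaces confirmed by the paper or its repository.

```python
# Hedged sketch of the Forge-and-Quench generation flow described in the
# abstract. All object and method names are hypothetical placeholders.
import torch


@torch.no_grad()
def forge_and_quench(mllm, adapter, t2i, conversation):
    # 1. Forge: the MLLM reasons over the full conversational context
    #    and produces an enhanced text instruction (plus hidden states).
    enhanced_text, hidden = mllm.generate_with_hidden(conversation)

    # 2. Bridge: a lightweight adapter maps the instruction's hidden
    #    states to a virtual visual representation (Bridge Features).
    bridge_feats = adapter(hidden)

    # 3. Quench: condition the T2I backbone on the enhanced instruction
    #    (which replaces the original prompt) together with the Bridge
    #    Features injected as a visual guidance signal.
    return t2i.sample(prompt=enhanced_text, visual_guidance=bridge_feats)
```

Note how step 3 matches the abstract's claim that the enhanced instruction replaces the original input while the Bridge Feature is injected alongside it as guidance.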
Problem

Research questions and friction points this paper is trying to address.

image generation
multimodal models
image fidelity
understanding-assisted generation
unified framework

Innovation

Methods, ideas, or system contributions that make the work stand out.

Forge-and-Quench
Bridge Feature
multimodal unified model
image generation fidelity
Bridge Adapter