Harmonizing Visual Representations for Unified Multimodal Understanding and Generation

📅 2025-03-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Visual understanding and generation tasks suffer from misaligned representation granularities, which hinders joint optimization within unified multimodal frameworks; existing approaches prioritize low-level visual features at the expense of semantic comprehension. To address this, the authors propose Harmon, a framework built around a shared Masked Autoregressive (MAR) encoder that simultaneously provides strong semantic representations and high-fidelity generation. Harmon introduces a three-stage progressive co-training paradigm that unifies understanding and generation. Evaluated on multiple benchmarks, including GenEval, MJHQ30K, and WISE, Harmon achieves state-of-the-art performance in image generation while matching the visual understanding accuracy of methods with dedicated semantic encoders (e.g., Janus). Crucially, it attains this trade-off between the two tasks with a single, unified encoder.

📝 Abstract
Unifying visual understanding and generation within a single multimodal framework remains a significant challenge, as the two inherently heterogeneous tasks require representations at different levels of granularity. Current approaches that utilize vector quantization (VQ) or variational autoencoders (VAE) for unified visual representation prioritize intrinsic imagery features over semantics, compromising understanding performance. In this work, we take inspiration from masked image modelling (MIM) that learns rich semantics via a mask-and-reconstruct pre-training and its successful extension to masked autoregressive (MAR) image generation. A preliminary study on the MAR encoder's representation reveals exceptional linear probing accuracy and precise feature response to visual concepts, which indicates MAR's potential for visual understanding tasks beyond its original generation role. Based on these insights, we present *Harmon*, a unified autoregressive framework that harmonizes understanding and generation tasks with a shared MAR encoder. Through a three-stage training procedure that progressively optimizes understanding and generation capabilities, Harmon achieves state-of-the-art image generation results on the GenEval, MJHQ30K and WISE benchmarks while matching the performance of methods with dedicated semantic encoders (e.g., Janus) on image understanding benchmarks. Our code and models will be available at https://github.com/wusize/Harmon.
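The abstract's preliminary study assesses the MAR encoder's semantics via linear probing: the encoder is frozen and only a linear classifier is trained on its features. A minimal sketch of that protocol, with a toy frozen "encoder" and synthetic data standing in for the real model and ImageNet-style labels (all names and shapes here are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen pretrained encoder: a fixed random projection.
# In linear probing the encoder's weights are never updated; only the
# linear head below is trained on top of its features.
D_IN, D_FEAT, N_CLS = 32, 64, 4
W_enc = rng.normal(size=(D_IN, D_FEAT))

def encode(x):
    return np.tanh(x @ W_enc)  # frozen features, no gradient updates

# Synthetic, roughly linearly separable data (illustrative only).
n = 400
y = rng.integers(0, N_CLS, size=n)
centers = rng.normal(scale=3.0, size=(N_CLS, D_IN))
X = centers[y] + rng.normal(size=(n, D_IN))

feats = encode(X)

# Train only a linear softmax head on the frozen features.
W = np.zeros((D_FEAT, N_CLS))
onehot = np.eye(N_CLS)[y]
for _ in range(200):
    logits = feats @ W
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    W -= 0.1 * feats.T @ (p - onehot) / n

acc = ((feats @ W).argmax(axis=1) == y).mean()
print(f"linear probing accuracy: {acc:.2f}")
```

High accuracy under this protocol indicates the frozen features are linearly separable by class, which is the evidence the paper cites for the MAR encoder's semantic quality.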
Problem

Research questions and friction points this paper is trying to address.

Unifying visual understanding and generation within a single framework
Mitigating the semantic compromise of current VQ/VAE-based visual representations
Harmonizing both tasks through a shared masked autoregressive encoder
Innovation

Methods, ideas, or system contributions that make the work stand out.

Shares a masked autoregressive (MAR) encoder across understanding and generation
Uses a three-stage training procedure that progressively optimizes both capabilities
Achieves state-of-the-art image generation on GenEval, MJHQ30K and WISE