🤖 AI Summary
Current unified multimodal models (UMMs) suffer from a misalignment between their visual understanding and generation capabilities because training relies on sparse image-text pairs, which limits image generation and editing quality. To address this, we propose Reconstruction Alignment (RecA), a self-supervised method that uses dense embeddings from a pretrained visual-understanding encoder as conditioning signals for the generative process, enabling end-to-end alignment between the understanding and generation modules. RecA requires no additional textual annotations, is architecture-agnostic, and trains in just 27 GPU-hours. Evaluated on the GenEval, DPGBench, ImgEdit, and GEdit benchmarks, RecA significantly improves both generation and editing performance, outperforming larger open-source models while demonstrating high efficiency, broad compatibility, and strong generalization across diverse tasks and domains.
📝 Abstract
Unified multimodal models (UMMs) unify visual understanding and generation within a single architecture. However, conventional training relies on image-text pairs (or sequences) whose captions are typically sparse and miss fine-grained visual details, even when they use hundreds of words to describe a simple image. We introduce Reconstruction Alignment (RecA), a resource-efficient post-training method that leverages visual understanding encoder embeddings as dense "text prompts," providing rich supervision without captions. Concretely, RecA conditions a UMM on its own visual understanding embeddings and optimizes it to reconstruct the input image with a self-supervised reconstruction loss, thereby realigning understanding and generation. Despite its simplicity, RecA is broadly applicable: across autoregressive, masked-autoregressive, and diffusion-based UMMs, it consistently improves generation and editing fidelity. With only 27 GPU-hours, post-training with RecA substantially improves image generation performance on GenEval (0.73$\rightarrow$0.90) and DPGBench (80.93$\rightarrow$88.15), while also boosting editing benchmarks (ImgEdit 3.38$\rightarrow$3.75, GEdit 6.94$\rightarrow$7.25). Notably, RecA surpasses much larger open-source models and applies broadly across diverse UMM architectures, establishing it as an efficient and general post-training alignment strategy for UMMs.
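To make the training recipe concrete, here is a minimal toy sketch of the reconstruction-alignment idea described above: condition the generator on the image's own understanding embedding (in place of a caption) and minimize a reconstruction loss. All names (`understand`, `generate`, `reca_step`) and the linear toy models are illustrative assumptions, not the authors' actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

D_IMG, D_EMB = 16, 8

# Frozen visual-understanding encoder and trainable generator,
# both reduced to single linear maps for illustration only.
W_enc = rng.normal(size=(D_IMG, D_EMB)) / np.sqrt(D_IMG)
W_gen = rng.normal(size=(D_EMB, D_IMG)) / np.sqrt(D_EMB)

def understand(x):
    """Dense 'text prompt': the image's own understanding embedding."""
    return x @ W_enc

def generate(z, W):
    """Toy generation pathway conditioned on the embedding z."""
    return z @ W

def reca_step(x, W, lr=0.1):
    """One self-supervised reconstruction-alignment update (plain MSE)."""
    z = understand(x)                      # no captions needed
    x_hat = generate(z, W)
    loss = np.mean((x_hat - x) ** 2)       # reconstruction loss
    grad = 2 * z.T @ (x_hat - x) / x.size  # dL/dW for the generator only
    return W - lr * grad, loss

x = rng.normal(size=(4, D_IMG))            # a toy batch of "images"
W, loss0 = reca_step(x, W_gen)
for _ in range(200):
    W, loss = reca_step(x, W)
print(loss0 > loss)  # reconstruction loss decreases as the generator realigns
```

In the paper's setting the same loop runs over a real UMM: the understanding encoder stays the supervision source, only the generation pathway is updated, and no textual annotations enter the loss.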