Reconstruction Alignment Improves Unified Multimodal Models

📅 2025-09-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current unified multimodal models (UMMs) suffer from a misalignment between visual understanding and generation capabilities due to reliance on sparse image-text pairs for training, thereby limiting image generation and editing quality. To address this, we propose Reconstruction Alignment (RecA), a self-supervised method that leverages dense embeddings from a pre-trained visual understanding encoder as conditioning signals for the generative process, enabling end-to-end alignment between understanding and generation modules. RecA requires no additional textual annotations, is architecture-agnostic, and trains efficiently in just 27 GPU hours. Evaluated on GenEval, DPGBench, ImgEdit, and GEdit benchmarks, RecA significantly improves both generation and editing performance—outperforming larger open-source models—while demonstrating exceptional efficiency, broad compatibility, and strong generalization across diverse tasks and domains.

📝 Abstract
Unified multimodal models (UMMs) unify visual understanding and generation within a single architecture. However, conventional training relies on image-text pairs (or sequences) whose captions are typically sparse and miss fine-grained visual details, even when they use hundreds of words to describe a simple image. We introduce Reconstruction Alignment (RecA), a resource-efficient post-training method that leverages visual understanding encoder embeddings as dense "text prompts," providing rich supervision without captions. Concretely, RecA conditions a UMM on its own visual understanding embeddings and optimizes it to reconstruct the input image with a self-supervised reconstruction loss, thereby realigning understanding and generation. Despite its simplicity, RecA is broadly applicable: across autoregressive, masked-autoregressive, and diffusion-based UMMs, it consistently improves generation and editing fidelity. With only 27 GPU-hours, post-training with RecA substantially improves image generation performance on GenEval (0.73→0.90) and DPGBench (80.93→88.15), while also boosting editing benchmarks (ImgEdit 3.38→3.75, GEdit 6.94→7.25). Notably, RecA surpasses much larger open-source models and applies broadly across diverse UMM architectures, establishing it as an efficient and general post-training alignment strategy for UMMs.

Problem

Research questions and friction points this paper is trying to address.

Improving multimodal model alignment between understanding and generation
Addressing sparse caption supervision in unified multimodal models
Enhancing image generation and editing fidelity across architectures

Innovation

Methods, ideas, or system contributions that make the work stand out.

Reconstruction Alignment method leverages visual embeddings
Optimizes self-supervised reconstruction loss for realignment
Broadly applicable across diverse UMM architectures
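The idea above can be sketched as a toy training loop. This is a hedged illustration under simplifying assumptions, not the paper's implementation: the `Encoder` and `Generator` classes below are tiny linear stand-ins for a real UMM's visual understanding encoder and generation module, and plain MSE stands in for whatever reconstruction objective a given UMM family uses.

```python
# Toy sketch of a RecA-style self-supervised reconstruction step.
# Assumptions (not from the paper): linear Encoder/Generator stand-ins,
# 16x16 RGB images, MSE as the reconstruction loss.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Stand-in understanding encoder: image -> dense embedding."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 16 * 16, dim))

    def forward(self, x):
        return self.net(x)

class Generator(nn.Module):
    """Stand-in generation module: dense embedding ("prompt") -> image."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Linear(dim, 3 * 16 * 16)

    def forward(self, z):
        return self.net(z).view(-1, 3, 16, 16)

def reca_step(encoder, generator, images, optimizer):
    """One reconstruction-alignment update: condition generation on the
    model's own understanding embeddings and reconstruct the input."""
    z = encoder(images)                            # dense embedding as "text prompt"
    recon = generator(z)                           # generation conditioned on it
    loss = nn.functional.mse_loss(recon, images)   # self-supervised reconstruction loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

enc, gen = Encoder(), Generator()
opt = torch.optim.Adam(list(enc.parameters()) + list(gen.parameters()), lr=1e-3)
batch = torch.rand(4, 3, 16, 16)
losses = [reca_step(enc, gen, batch, opt) for _ in range(20)]
```

Because the supervision signal is the input image itself, no captions are needed, which is why the approach is architecture-agnostic and cheap to apply as post-training.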