🤖 AI Summary
In multimodal learning, image-to-text (I2T) understanding and text-to-image (T2I) generation have long remained isolated tasks, coexisting without true synergy. Method: We propose the Unified Autoencoder (UAE), a self-supervised framework that unifies I2T and T2I under a shared image-reconstruction objective, enabling bidirectional information flow. UAE introduces a three-stage Unified-GRPO reinforcement learning strategy that jointly optimizes understanding and generation via “generation-enhanced understanding” and “understanding-guided generation,” augmented by large-scale long-text pretraining and a semantic reconstruction loss. Contribution/Results: To our knowledge, this is the first autoencoder-based unified paradigm for I2T/T2I. We also release Unified-Bench, a new benchmark for evaluating the degree of unification of multimodal models. Experiments show that UAE significantly improves fine-grained captioning fidelity in the understanding model and instruction-following accuracy in the generation model, validating the effectiveness and generalization of unified modeling on Unified-Bench.
📝 Abstract
In this paper, we introduce an insightful paradigm through the Auto-Encoder lens: understanding as the encoder (I2T) that compresses images into text, and generation as the decoder (T2I) that reconstructs images from that text. Using reconstruction fidelity as the unified training objective, we enforce coherent bidirectional information flow between the understanding and generation processes, bringing mutual gains. To implement this, we propose UAE, a novel framework for unified multimodal learning. We begin by pre-training the decoder with large-scale long-context image captions to capture fine-grained semantics and complex spatial relationships. We then propose Unified-GRPO via reinforcement learning (RL), which covers three stages: (1) a cold-start phase that gently initializes both encoder and decoder with a semantic reconstruction loss; (2) Generation for Understanding, where the encoder is trained to produce informative captions that maximize the decoder's reconstruction quality, enhancing its visual understanding; (3) Understanding for Generation, where the decoder is refined to reconstruct from these captions, forcing it to leverage every detail and improving its long-context instruction following and generation fidelity. For evaluation, we introduce Unified-Bench, the first benchmark tailored to assess the degree of unification of unified multimodal models (UMMs). A surprising "aha moment" arises within the multimodal learning domain: as RL progresses, the encoder autonomously produces more descriptive captions, while the decoder simultaneously demonstrates a profound ability to understand these intricate descriptions, resulting in reconstructions of striking fidelity.
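The auto-encoder view above can be sketched in a few lines: the I2T model acts as the encoder, the T2I model as the decoder, and the shared training signal is the semantic similarity between the original image and its reconstruction. The sketch below uses hypothetical stand-ins (`encoder`, `decoder`, and `embed` are placeholders, not the paper's actual models) and a plain cosine similarity as the semantic reconstruction reward.

```python
import math

def embed(image):
    """Hypothetical image embedder; here an 'image' is already a feature vector."""
    return image

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def reconstruction_reward(image, encoder, decoder):
    """Shared objective: how faithfully the caption round-trip preserves the image.
    encoder (I2T) compresses the image into text; decoder (T2I) reconstructs
    an image from that text; the reward is their semantic similarity."""
    caption = encoder(image)           # I2T: image -> text
    reconstruction = decoder(caption)  # T2I: text -> image
    return cosine_similarity(embed(image), embed(reconstruction))
```

A perfect round-trip yields a reward of 1.0, while a decoder that ignores details of the caption scores lower, which is the pressure that drives both models toward richer captions and more faithful generation.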