🤖 AI Summary
This work addresses the inefficiency, information loss, and cognitive misalignment inherent in existing text-to-image generation methods that rely on explicit textual intermediate reasoning. To overcome these limitations, we propose LatentMorph, a novel framework that, for the first time, enables end-to-end implicit dynamic reasoning and self-optimization entirely within a continuous latent space, thereby eliminating frequent image encoding and decoding. LatentMorph introduces a learnable reasoning trigger mechanism and integrates lightweight components—including a state compressor, a latent thought translator, a prediction-guided module, and a reinforcement learning-driven reasoning invoker—all operating exclusively in the latent domain. Experiments demonstrate that LatentMorph achieves performance gains of 16% and 25% on GenEval and T2I-CompBench, respectively, while reducing reasoning time by 44%, token consumption by 51%, and attaining a human cognitive alignment rate of 71%.
📝 Abstract
Text-to-image (T2I) generation has achieved remarkable progress, yet existing methods often lack the ability to dynamically reason and refine during generation, a hallmark of human creativity. Current reasoning-augmented paradigms mostly rely on explicit thought processes, where intermediate reasoning is decoded into discrete text at fixed steps with frequent image decoding and re-encoding, leading to inefficiency, information loss, and cognitive mismatch. To bridge this gap, we introduce LatentMorph, a novel framework that seamlessly integrates implicit latent reasoning into the T2I generation process. At its core, LatentMorph introduces four lightweight components: (i) a condenser for summarizing intermediate generation states into compact visual memory, (ii) a translator for converting latent thoughts into actionable guidance, (iii) a shaper for dynamically steering next image token predictions, and (iv) an RL-trained invoker for adaptively determining when to invoke reasoning. By performing reasoning entirely in continuous latent space, LatentMorph avoids the bottlenecks of explicit reasoning and enables more adaptive self-refinement. Extensive experiments demonstrate that LatentMorph (I) enhances the base model Janus-Pro by $16\%$ on GenEval and $25\%$ on T2I-CompBench; (II) outperforms explicit paradigms (e.g., TwiG) by $15\%$ on abstract reasoning tasks like WISE and $11\%$ on IPV-Txt; (III) reduces inference time by $44\%$ and token consumption by $51\%$; and (IV) exhibits $71\%$ cognitive alignment with human intuition on reasoning invocation.
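To make the dataflow among the four components concrete, here is a minimal, purely illustrative sketch of one generation step. All function names, vector shapes, and the heuristic gate are our own assumptions for exposition; the paper's actual modules are learned networks, and the invoker is trained with reinforcement learning rather than thresholded.

```python
# Hypothetical sketch of LatentMorph's latent reasoning loop.
# Names, dimensions, and logic are illustrative assumptions, not the paper's API.

LATENT_DIM = 4

def condenser(states):
    """Summarize intermediate generation states into a compact visual
    memory (toy stand-in: element-wise mean of the state vectors)."""
    n = len(states)
    return [sum(s[i] for s in states) / n for i in range(LATENT_DIM)]

def invoker(memory, threshold=0.5):
    """RL-trained gate in the paper; here a toy heuristic that fires
    when the memory's mean absolute value exceeds a threshold."""
    return sum(abs(x) for x in memory) / len(memory) > threshold

def translator(memory):
    """Convert the latent thought into actionable guidance
    (toy stand-in: a scaled copy of the memory)."""
    return [0.1 * x for x in memory]

def shaper(logits, guidance):
    """Steer the next image-token prediction (toy stand-in: additive
    bias on the logits)."""
    return [l + g for l, g in zip(logits, guidance)]

def generation_step(states, logits):
    """One step: compress states, reason only if the invoker fires,
    then (optionally) reshape the token logits. No image decode/re-encode
    happens anywhere in this loop."""
    memory = condenser(states)
    if invoker(memory):
        guidance = translator(memory)
        logits = shaper(logits, guidance)
    return logits
```

The point of the sketch is structural: every intermediate quantity stays a continuous vector, and reasoning is invoked conditionally per step rather than at fixed intervals, which is where the claimed time and token savings come from.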