How Long Can Unified Multimodal Models Generate Images Reliably? Taming Long-Horizon Interleaved Image Generation via Context Curation

📅 2026-03-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the rapid degradation of image quality in existing unified multimodal models when generating long interleaved sequences of text and images, a problem primarily caused by the cumulative contamination of visual history. To mitigate this, we propose UniLongGen—a training-free inference optimization strategy that dynamically identifies and prunes visual tokens irrelevant to the current generation step by analyzing the model’s internal attention mechanisms. This approach actively discards distracting contextual information while preserving critical visual content, thereby breaking from conventional long-context handling paradigms. UniLongGen significantly enhances the fidelity and consistency of long-sequence multimodal generation while simultaneously reducing memory consumption and inference latency.

📝 Abstract
Unified multimodal models hold the promise of generating extensive, interleaved narratives, weaving text and imagery into coherent long-form stories. However, current systems suffer from a critical reliability gap: as sequences grow, generation quality rapidly collapses. In this work, we investigate the mechanism behind this failure and argue that it is distinct from standard long-context challenges. We reveal that in generation, accumulated visual history acts as a source of active pollution, a decay governed specifically by the number of image events rather than raw token count. We identify a structural vulnerability where dense visual tokens overwhelm the attention mechanism, creating noise that distorts future synthesis. Guided by these mechanistic insights, we propose UniLongGen, a training-free inference strategy that prioritizes safe conditioning over total recall. Instead of retaining all history, UniLongGen dynamically curates the model's memory, identifying and discarding interfering visual signals based on the model's own internal relevance rankings. Extensive experiments demonstrate that this active forgetting approach is essential for stability: UniLongGen significantly outperforms baselines in long-horizon fidelity and consistency, while simultaneously reducing memory footprint and inference time.
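The context-curation idea in the abstract—ranking cached visual tokens by the model's own attention and discarding low-relevance ones before the next generation step—can be sketched as a simple top-k pruning pass. This is an illustrative sketch only: the function name `curate_visual_context`, the `keep_ratio` parameter, and the use of mean attention scores as the relevance signal are assumptions, not the paper's actual implementation.

```python
import numpy as np

def curate_visual_context(attn_scores, visual_token_ids, keep_ratio=0.3):
    """Keep only the most relevant cached visual tokens (hypothetical helper).

    attn_scores: per-token relevance, e.g. mean attention weight each cached
                 visual token receives from the current generation step.
    visual_token_ids: identifiers of the cached visual tokens, in order.
    keep_ratio: fraction of visual tokens to retain in the context.
    """
    scores = np.asarray(attn_scores, dtype=float)
    n_keep = max(1, int(len(scores) * keep_ratio))
    # Rank tokens by relevance, take the top fraction...
    keep = np.argsort(scores)[::-1][:n_keep]
    # ...then restore temporal order so the retained history stays coherent.
    keep = np.sort(keep)
    return [visual_token_ids[i] for i in keep]
```

In a real system this pruning would operate on the KV cache between image-generation events, so memory and latency shrink along with the context, consistent with the savings the abstract reports.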
Problem

Research questions and friction points this paper is trying to address.

unified multimodal models
long-horizon generation
interleaved image generation
reliability gap
visual history pollution
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal generation
long-horizon reliability
context curation
visual pollution
attention mechanism