ChatUMM: Robust Context Tracking for Conversational Interleaved Generation

📅 2026-02-06
🤖 AI Summary
This work proposes a conversational unified model capable of multi-turn interleaved multimodal generation, addressing a key limitation of existing unified multimodal models: they are largely confined to single-turn interactions and struggle to model dialogue context or generate alternating text and images. Through an interleaved multi-turn training strategy that serializes text-image streams as a continuous conversational flow, plus a three-stage dialogue data synthesis pipeline (basic stateful dialogue construction; distractor turns with history-dependent query rewriting; multimodal response synthesis), the approach transforms single-turn data into natural multimodal conversations with long-range dependencies. The resulting model achieves state-of-the-art performance among open-source unified models on visual understanding and instruction-based image editing benchmarks, maintains high-quality text-to-image generation, and shows strong contextual robustness in complex multi-turn scenarios.

📝 Abstract
Unified multimodal models (UMMs) have achieved remarkable progress yet remain constrained by a single-turn interaction paradigm, effectively functioning as solvers for independent requests rather than assistants in continuous dialogue. To bridge this gap, we present ChatUMM. As a conversational unified model, it excels at robust context tracking to sustain interleaved multimodal generation. ChatUMM derives its capabilities from two key innovations: an interleaved multi-turn training strategy that models serialized text-image streams as a continuous conversational flow, and a systematic conversational data synthesis pipeline. This pipeline transforms a diverse set of standard single-turn datasets into fluid dialogues through three progressive stages: constructing basic stateful dialogues, enforcing long-range dependency resolution via "distractor" turns with history-dependent query rewriting, and synthesizing naturally interleaved multimodal responses. Extensive evaluations demonstrate that ChatUMM achieves state-of-the-art performance among open-source unified models on visual understanding and instruction-guided editing benchmarks, while maintaining competitive fidelity in text-to-image generation. Notably, ChatUMM exhibits superior robustness in complex multi-turn scenarios, ensuring fluid, context-aware dialogues.
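The paper's synthesis pipeline is only described at a high level here. As an illustrative sketch only (the function and data layout below are hypothetical, not taken from the paper), the core idea of chaining single-turn samples into a dialogue and inserting distractor turns to force long-range dependency resolution might look like this:

```python
def synthesize_dialogue(samples, distractors):
    """Chain single-turn (query, response) samples into one multi-turn
    dialogue, interleaving a distractor exchange after each original
    turn so that later queries must be resolved against history that
    is several turns away, not just the immediately preceding turn."""
    dialogue = []
    for i, (query, response) in enumerate(samples):
        dialogue.append(("user", query))
        dialogue.append(("assistant", response))
        if i < len(distractors):
            d_query, d_response = distractors[i]
            dialogue.append(("user", d_query))
            dialogue.append(("assistant", d_response))
    return dialogue

turns = synthesize_dialogue(
    [("draw a cat", "[image: cat]"), ("make it blue", "[image: blue cat]")],
    [("unrelated question", "unrelated answer")],
)
# turns[4] is ("user", "make it blue"): the edit request now sits two
# turns away from the image it refers to.
```

In the actual system the distractor turns are constructed systematically and follow-up queries are rewritten to depend on history; this toy version only shows the interleaving structure.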
Problem

Research questions and friction points this paper is trying to address.

unified multimodal models
conversational interaction
context tracking
interleaved generation
multi-turn dialogue
Innovation

Methods, ideas, or system contributions that make the work stand out.

interleaved multi-turn training
conversational data synthesis
context tracking
unified multimodal model
long-range dependency resolution
Authors

Wenxun Dai (Tsinghua University)
Zhiyuan Zhao (Tencent Hunyuan)
Yule Zhong (Tencent Hunyuan)
Yiji Cheng (Tsinghua University)
Jianwei Zhang (Tencent Hunyuan)
Linqing Wang (Tencent Hunyuan)
Shiyi Zhang (Tsinghua University)
Yunlong Lin (Tencent Hunyuan)
Runze He (Institute of Information Engineering, Chinese Academy of Sciences)
Fellix Song (Tencent Hunyuan)
Wayne Zhuang (Tencent Hunyuan)
Yong Liu (Tsinghua University)
Haoji Zhang (Tsinghua University)
Yansong Tang (Tsinghua University)
Qinglin Lu (Tencent Hunyuan)
Chunyu Wang (Tencent Hunyuan)