🤖 AI Summary
Existing open-source multimodal large language models exhibit insufficient contextual modeling capability in long-horizon, multi-turn dialogues, leading to unstable cross-turn visual–linguistic consistency. To address this, we propose ContextQFormer, a novel architecture that introduces a learnable memory block to explicitly enhance multimodal contextual representation across dialogue turns. We further construct TMDialog, a new open-source multimodal dialogue dataset designed for long-context pre-training, instruction tuning, and evaluation. Our methodology integrates context-aware multimodal fusion, instruction tuning, and scalable data curation strategies. On TMDialog, ContextQFormer achieves a 2%–4% improvement in available rate over three strong baseline models, demonstrating improved stability and coherence in extended contextual interactions. This work advances robust multimodal dialogue systems by jointly improving architectural design, training paradigms, and benchmark resources for long-horizon settings.
📝 Abstract
Multi-modal large language models have demonstrated remarkable zero-shot abilities and powerful image-understanding capabilities. However, existing open-source multi-modal models suffer from weak multi-turn interaction capability, especially over long contexts. To address this issue, we first introduce a context modeling module, termed ContextQFormer, which utilizes a memory block to enhance the representation of contextual information. Furthermore, to facilitate further research, we carefully build a new multi-turn multi-modal dialogue dataset (TMDialog) for pre-training, instruction tuning, and evaluation, which will be open-sourced soon. Compared with other multi-modal dialogue datasets, TMDialog contains longer conversations, which supports research on multi-turn multi-modal dialogue. In addition, ContextQFormer is compared with three baselines on TMDialog, and experimental results illustrate that ContextQFormer achieves an improvement of 2%–4% in available rate over the baselines.
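The core idea of the memory block can be pictured as a set of memory slots that the Q-Former's query tokens read from via cross-attention, carrying context forward across turns. The sketch below is a minimal, single-head NumPy illustration under that assumption; the function name `memory_cross_attention`, the slot count, and the omission of learned projections are all simplifications for clarity, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def memory_cross_attention(queries, memory):
    """queries: (T, d) query tokens for the current turn.
    memory:  (M, d) memory slots accumulated over the dialogue.
    Returns the queries enriched by a residual read from memory
    (single head, no learned projections, for illustration only)."""
    d = queries.shape[-1]
    scores = queries @ memory.T / np.sqrt(d)   # (T, M) scaled dot-product logits
    attn = softmax(scores, axis=-1)            # each query attends over all slots
    return queries + attn @ memory             # residual keeps the original query

rng = np.random.default_rng(0)
q = rng.standard_normal((16, 64))   # e.g. 16 query tokens, hidden size 64
m = rng.standard_normal((32, 64))   # e.g. 32 memory slots
out = memory_cross_attention(q, m)
print(out.shape)  # (16, 64)
```

In a trained model the memory slots would be learnable parameters (or updated from previous turns), and the attention would use separate query/key/value projections; the point here is only that contextual information is injected into the query tokens before they condition the language model.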