ContextQFormer: A New Context Modeling Method for Multi-Turn Multi-Modal Conversations

📅 2025-05-29

📈 Citations: 0

✨ Influential: 0

career value

179K/year

🤖 AI Summary

Existing open-source multimodal large language models exhibit insufficient contextual modeling capability in long-horizon, multi-turn dialogues, leading to unstable cross-turn visual–linguistic consistency. To address this, we propose ContextQFormer—a novel architecture introducing learnable memory blocks to explicitly enhance multimodal contextual representation across dialogue turns. We further construct TMDialog, the first open-source multimodal dialogue dataset specifically designed for long-context pretraining and evaluation. Our methodology integrates context-aware multimodal fusion, instruction tuning, and scalable data curation strategies. On TMDialog, ContextQFormer achieves a 2–4% improvement in usability rate over three strong baseline models, demonstrating significantly enhanced stability and coherence in extended contextual interactions. This work advances robust multimodal dialogue systems by jointly improving architectural design, training paradigms, and benchmark resources for long-horizon settings.

Technology Category

Application Category

📝 Abstract

Multi-modal large language models have demonstrated remarkable zero-shot abilities and powerful image-understanding capabilities. However, the existing open-source multi-modal models suffer from the weak capability of multi-turn interaction, especially for long contexts. To address the issue, we first introduce a context modeling module, termed ContextQFormer, which utilizes a memory block to enhance the presentation of contextual information. Furthermore, to facilitate further research, we carefully build a new multi-turn multi-modal dialogue dataset (TMDialog) for pre-training, instruction-tuning, and evaluation, which will be open-sourced lately. Compared with other multi-modal dialogue datasets, TMDialog contains longer conversations, which supports the research of multi-turn multi-modal dialogue. In addition, ContextQFormer is compared with three baselines on TMDialog and experimental results illustrate that ContextQFormer achieves an improvement of 2%-4% in available rate over baselines.

Problem

Research questions and friction points this paper is trying to address.

Enhances multi-turn interaction in multi-modal models

Introduces ContextQFormer for better contextual information handling

Builds TMDialog dataset for long multi-modal conversations

Innovation

Methods, ideas, or system contributions that make the work stand out.

ContextQFormer enhances multi-turn context modeling

TMDialog dataset supports long multi-modal conversations

Memory block improves contextual information presentation

🔎 Similar Papers

Chrono: A Simple Blueprint for Representing Time in MLLMs