ContextQFormer: A New Context Modeling Method for Multi-Turn Multi-Modal Conversations

📅 2025-05-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing open-source multimodal large language models exhibit insufficient contextual modeling capability in long-horizon, multi-turn dialogues, leading to unstable cross-turn visual–linguistic consistency. To address this, we propose ContextQFormer—a novel architecture introducing learnable memory blocks to explicitly enhance multimodal contextual representation across dialogue turns. We further construct TMDialog, the first open-source multimodal dialogue dataset specifically designed for long-context pretraining and evaluation. Our methodology integrates context-aware multimodal fusion, instruction tuning, and scalable data curation strategies. On TMDialog, ContextQFormer achieves a 2–4% improvement in usability rate over three strong baseline models, demonstrating significantly enhanced stability and coherence in extended contextual interactions. This work advances robust multimodal dialogue systems by jointly improving architectural design, training paradigms, and benchmark resources for long-horizon settings.

Technology Category

Application Category

📝 Abstract
Multi-modal large language models have demonstrated remarkable zero-shot abilities and powerful image-understanding capabilities. However, the existing open-source multi-modal models suffer from the weak capability of multi-turn interaction, especially for long contexts. To address the issue, we first introduce a context modeling module, termed ContextQFormer, which utilizes a memory block to enhance the presentation of contextual information. Furthermore, to facilitate further research, we carefully build a new multi-turn multi-modal dialogue dataset (TMDialog) for pre-training, instruction-tuning, and evaluation, which will be open-sourced lately. Compared with other multi-modal dialogue datasets, TMDialog contains longer conversations, which supports the research of multi-turn multi-modal dialogue. In addition, ContextQFormer is compared with three baselines on TMDialog and experimental results illustrate that ContextQFormer achieves an improvement of 2%-4% in available rate over baselines.
Problem

Research questions and friction points this paper is trying to address.

Enhances multi-turn interaction in multi-modal models
Introduces ContextQFormer for better contextual information handling
Builds TMDialog dataset for long multi-modal conversations
Innovation

Methods, ideas, or system contributions that make the work stand out.

ContextQFormer enhances multi-turn context modeling
TMDialog dataset supports long multi-modal conversations
Memory block improves contextual information presentation
🔎 Similar Papers
No similar papers found.
Y
Yiming Lei
Beihang University, China; Hangzhou Innovation Institute, Beihang University, China
Z
Zhizheng Yang
Nanjing University, China
Z
Zeming Liu
Beihang University, China
H
Haitao Leng
Kuaishou Technology, China
Shaoguo Liu
Shaoguo Liu
Alibaba Corporation
Maching LearningComputer Vision
T
Tingting Gao
Kuaishou Technology, China
Qingjie Liu
Qingjie Liu
Professor, School of Computer Science and Engineering, Beihang University
Computer Vision and Pattern Recognition
Yunhong Wang
Yunhong Wang
Professor, School of Computer Science and Engineering, Beihang University
BiometricsPattern RecognitionImage ProcessingComputer Vision