Team of One: Cracking Complex Video QA with Model Synergy

📅 2025-07-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing video large multimodal models (Video-LMMs) suffer from shallow contextual understanding, weak temporal modeling, and poor generalization to ambiguous or compositional queries in complex video question answering. To address these limitations, we propose a fine-tuning-free, multi-model collaborative reasoning framework: heterogeneous video-language models (VLMs) concurrently generate structured chain-of-thought (CoT) reasoning paths, which are dynamically evaluated, selected, and fused by an external large language model (LLM). We introduce the first prompting-and-response integration mechanism, enabling a lightweight and scalable multimodal collaboration architecture. Evaluated on the CVRR-ES benchmark, our approach comprehensively outperforms state-of-the-art methods, achieving significant gains in accuracy, robustness, and generalization. This work establishes an efficient, scalable new paradigm for complex video understanding.
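The collaboration loop described in the summary — several heterogeneous VLMs each produce a structured CoT response, and an external model evaluates and fuses them — can be sketched as below. All names are illustrative assumptions, and the majority-vote fuser is a toy stand-in for the paper's LLM evaluator-integrator, not its actual implementation.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class CoTResponse:
    """One VLM's structured chain-of-thought output for a query."""
    model_name: str
    reasoning: str
    answer: str

def collaborate(question: str,
                vlms: List[Callable[[str], CoTResponse]],
                evaluator: Callable[[str, List[CoTResponse]], str]) -> str:
    # Each VLM independently generates a CoT reasoning path;
    # the external evaluator selects/fuses them into one answer.
    candidates = [vlm(question) for vlm in vlms]
    return evaluator(question, candidates)

# Hypothetical stub VLMs standing in for heterogeneous video-language models.
def vlm_a(q: str) -> CoTResponse:
    return CoTResponse("vlm_a", "frame-level reasoning...", "a cat")

def vlm_b(q: str) -> CoTResponse:
    return CoTResponse("vlm_b", "temporal reasoning...", "a cat")

def majority_evaluator(q: str, cands: List[CoTResponse]) -> str:
    # Toy fusion rule: return the most common candidate answer.
    return Counter(c.answer for c in cands).most_common(1)[0][0]

print(collaborate("What animal appears?", [vlm_a, vlm_b], majority_evaluator))
# prints "a cat"
```

Because the framework is fine-tuning-free, swapping in a different VLM or evaluator only means passing a different callable — no retraining is involved.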

📝 Abstract
We propose a novel framework for open-ended video question answering that enhances reasoning depth and robustness in complex real-world scenarios, as benchmarked on the CVRR-ES dataset. Existing Video-Large Multimodal Models (Video-LMMs) often exhibit limited contextual understanding, weak temporal modeling, and poor generalization to ambiguous or compositional queries. To address these challenges, we introduce a prompting-and-response integration mechanism that coordinates multiple heterogeneous Video-Language Models (VLMs) via structured chains of thought, each tailored to distinct reasoning pathways. An external Large Language Model (LLM) serves as an evaluator and integrator, selecting and fusing the most reliable responses. Extensive experiments demonstrate that our method significantly outperforms existing baselines across all evaluation metrics, showcasing superior generalization and robustness. Our approach offers a lightweight, extensible strategy for advancing multimodal reasoning without requiring model retraining, setting a strong foundation for future Video-LMM development.
Problem

Research questions and friction points this paper is trying to address.

Enhancing reasoning depth in video question answering
Addressing limited contextual understanding in Video-LMMs
Improving generalization to ambiguous or compositional queries
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prompting-and-response integration mechanism for VLMs
LLM as evaluator and integrator for reliable responses
Lightweight, extensible strategy without model retraining
Jun Xie
Lenovo Research
Zhaoran Zhao
Lenovo Research
Xiongjun Guan
Tsinghua University
Yingjian Zhu
School of Artificial Intelligence, University of Chinese Academy of Sciences (UCAS)
Hongzhu Yi
School of Artificial Intelligence, University of Chinese Academy of Sciences (UCAS)
Xinming Wang
Institute of Automation, Chinese Academy of Sciences (CAS)
Feng Chen
Lenovo Research
Zhepeng Wang
Applied Scientist at Amazon Stores Foundational AI