🤖 AI Summary
This work addresses the fine-grained understanding challenge of document-centric instructional videos—characterized by dense textual graphics and synchronized speech—by introducing DocVideoQA, a novel task and the first large-scale benchmark of its kind, comprising 1,454 videos and 154k question-answer pairs across 23 remote-work and online-course scenarios. To enhance joint text–vision–audio comprehension, the authors propose DV-LLaMA: a LLaMA-based multimodal large language model that integrates video frame and audio encoders with a cross-modal alignment module and is trained via a paradigm combining multi-source instruction tuning with cross-modal contrastive learning. On DocVideoQA, DV-LLaMA significantly outperforms existing open-source multimodal LLMs in semantic understanding, temporal reasoning, and cross-modal fusion. Both the code and dataset will be publicly released.
📝 Abstract
Remote work and online courses have become important channels of knowledge dissemination, producing a large number of document-based instructional videos. Unlike traditional video datasets, these videos mainly feature text-rich images and audio densely packed with information closely tied to the visual content, demanding advanced multimodal understanding capabilities. However, this domain remains underexplored due to limited dataset availability and its inherent complexity. In this paper, we introduce the DocVideoQA task and dataset for the first time, comprising 1,454 videos across 23 categories with a total duration of about 828 hours. The dataset is annotated with 154k question-answer pairs, generated both manually and via GPT, that assess models' comprehension, temporal awareness, and modality-integration capabilities. We first establish a baseline using open-source MLLMs. Recognizing the challenges of modality comprehension in document-centric videos, we present DV-LLaMA, a robust video MLLM baseline. Our method enhances unimodal feature extraction with diverse instruction-tuning data and employs contrastive learning to strengthen modality integration. Through fine-tuning, the LLM is equipped with audio-visual capabilities, leading to significant improvements in document-centric video understanding. Extensive testing on the DocVideoQA dataset shows that DV-LLaMA significantly outperforms existing models. We will release the code and dataset to facilitate future research.