🤖 AI Summary
Current Video Large Language Models (Video LLMs) face significant bottlenecks in fine-grained temporal reasoning (particularly precise moment localization) and struggle to align cross-modal semantics with timestamps under label-scarce conditions. To address this, we propose DaMO, a temporal-aware multimodal video understanding framework. Our method introduces a hierarchical dual-stream Temporal-aware Fuseformer architecture with a global residual mechanism to suppress spatial redundancy; devises a four-stage progressive supervision training paradigm; and contributes datasets augmented with GPT-generated temporally grounded QA pairs for tasks requiring temporal supervision. Evaluated on video question answering and moment localization benchmarks, DaMO consistently outperforms state-of-the-art methods, with notable gains in moment-level reasoning accuracy and cross-dataset generalization under low-supervision settings. This work establishes a promising direction for data-efficient, interpretable video-language modeling grounded in principled temporal awareness and structured multimodal alignment.
📝 Abstract
Large Language Models (LLMs) have recently been extended to the video domain, enabling sophisticated video-language understanding. However, existing Video LLMs often exhibit limitations in fine-grained temporal reasoning, restricting their ability to precisely attribute responses to specific video moments, especially under constrained supervision. We introduce DaMO, a data-efficient Video LLM explicitly designed for accurate temporal reasoning and multimodal understanding. At its core, the proposed Temporal-aware Fuseformer employs a hierarchical dual-stream architecture that progressively captures temporal dynamics within each modality and effectively fuses complementary visual and audio information. To further enhance computational efficiency, DaMO integrates a global residual that reduces spatial redundancy while preserving essential semantic details. We train DaMO via a structured four-stage progressive training paradigm, incrementally equipping the model with multimodal alignment, semantic grounding, and temporal reasoning capabilities. This work also contributes multiple datasets augmented from existing ones with GPT-generated temporally grounded QA pairs for tasks requiring temporal supervision. Comprehensive experiments on temporal grounding and video QA benchmarks demonstrate that DaMO consistently surpasses prior methods, particularly in tasks demanding precise temporal alignment and reasoning. Our work establishes a promising direction for data-efficient video-language modeling.
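To make the abstract's architectural idea concrete, below is a minimal, purely illustrative sketch of the dual-stream pattern it describes: per-modality temporal encoding, cross-modal fusion, and a global residual that carries a pooled semantic summary past the fusion stage. All function names (`temporal_encode`, `fuse`, `global_residual`) and the scalar "features" are hypothetical simplifications for exposition, not DaMO's actual implementation.

```python
# Illustrative sketch only: scalar stand-ins for feature vectors,
# pairwise differences as a stand-in for temporal attention.

def temporal_encode(frames):
    """Toy per-modality temporal encoding: pairwise differences
    capture local dynamics over time."""
    return [frames[i] - frames[i - 1] if i else frames[0]
            for i in range(len(frames))]

def fuse(visual, audio):
    """Toy cross-modal fusion: average temporally aligned steps."""
    return [(v + a) / 2 for v, a in zip(visual, audio)]

def global_residual(features):
    """Add a global pooled summary back to every timestep, so
    fusion can compress detail without losing overall semantics."""
    g = sum(features) / len(features)
    return [f + g for f in features]

visual = [1.0, 2.0, 4.0, 7.0]   # per-frame visual features (scalars for clarity)
audio  = [0.0, 1.0, 3.0, 5.0]   # per-frame audio features

fused = global_residual(fuse(temporal_encode(visual),
                             temporal_encode(audio)))
print(fused)  # → [2.0, 2.5, 3.5, 4.0]
```

The key design point mirrored here is that each modality is encoded along the time axis before fusion, rather than fusing raw frames first, and the residual path preserves the global signal that aggressive temporal/spatial compression would otherwise discard.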