🤖 AI Summary
This work investigates the independent roles of large language models' (LLMs) world knowledge versus their reasoning capabilities in long-video understanding. We find that LLMs alone—without explicit video input—achieve high accuracy, indicating that their internal knowledge suffices for many long-video tasks. Building on this insight, we propose a novel multimodal understanding paradigm that unifies object-level visual information—detection, tracking, and attribute estimation—with natural language as the sole interface. Our method leverages open-source vision tools to extract structured visual representations and employs prompt engineering for cross-modal alignment, eliminating the need for end-to-end video encoders. This is the first study to empirically reveal the dominant contribution of LLM priors to long-video understanding, significantly improving generalization and interpretability. Our approach achieves state-of-the-art performance across multiple long-video benchmarks and demonstrates strong transferability to cross-domain tasks such as robotic manipulation. Code is publicly available.
📝 Abstract
Large Language Models (LLMs) have allowed recent LLM-based approaches to achieve excellent performance on long-video understanding benchmarks. We investigate how the extensive world knowledge and strong reasoning skills of the underlying LLMs influence this strong performance. Surprisingly, we discover that LLM-based approaches can yield good accuracy on long-video tasks with limited video information, sometimes even with no video-specific information. Building on this, we explore injecting video-specific information into an LLM-based framework. We utilize off-the-shelf vision tools to extract three object-centric information modalities from videos and then leverage natural language as a medium for fusing this information. Our resulting Multimodal Video Understanding (MVU) framework demonstrates state-of-the-art performance across multiple video understanding benchmarks. Strong performance on robotics-domain tasks also establishes its generality. Our code will be released publicly.
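The core idea—serializing object-centric vision-tool outputs (detections, tracks, attributes) into natural language that an LLM can reason over—can be sketched as follows. All data structures, field names, and the prompt template here are illustrative assumptions, not the paper's actual interface:

```python
# Hypothetical sketch of MVU-style fusion: object-level information from
# off-the-shelf vision tools is rendered as sentences, then prepended to a
# question to form an LLM prompt. The structures below are assumptions for
# illustration, not the authors' real schema.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ObjectTrack:
    label: str                           # category from a detector
    attributes: List[str]                # e.g. color from an attribute model
    trajectory: List[Tuple[int, int, int]]  # (frame, x, y) centers from a tracker


def describe(track: ObjectTrack) -> str:
    """Render one object's detection, attributes, and motion as a sentence."""
    attrs = " ".join(track.attributes)
    start, end = track.trajectory[0], track.trajectory[-1]
    if (start[1], start[2]) == (end[1], end[2]):
        motion = "stays in place"
    else:
        motion = f"moves from ({start[1]}, {start[2]}) to ({end[1]}, {end[2]})"
    return f"A {attrs} {track.label} {motion} between frames {start[0]} and {end[0]}."


def build_prompt(tracks: List[ObjectTrack], question: str) -> str:
    """Fuse object-level information with the question via natural language."""
    scene = "\n".join(describe(t) for t in tracks)
    return f"Video description:\n{scene}\n\nQuestion: {question}\nAnswer:"


tracks = [
    ObjectTrack("cup", ["red"], [(0, 10, 20), (30, 50, 20)]),
    ObjectTrack("table", ["wooden"], [(0, 40, 60), (30, 40, 60)]),
]
prompt = build_prompt(tracks, "What happens to the cup?")
print(prompt)
```

The resulting prompt is plain text, so any LLM can consume it without a video encoder—the "natural language as a medium" design choice the abstract describes.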