🤖 AI Summary
To balance computational efficiency with fine-grained spatio-temporal modeling in long-video understanding, this paper proposes Mavors, a multi-granularity video representation framework that unifies image and video understanding: images are treated as single-frame videos and handled via sub-image decomposition. The framework pairs a high-resolution Intra-chunk Vision Encoder (combining 3D convolutions and Vision Transformers) that preserves spatial fidelity with an Inter-chunk Feature Aggregator (a Transformer augmented with chunk-level rotary position encodings) that captures long-range temporal dynamics. By avoiding sparse sampling and low-resolution compression, Mavors mitigates spatio-temporal information loss. Experiments across diverse benchmarks show substantial gains on tasks requiring fine-grained spatio-temporal reasoning, maintaining both spatial fidelity and temporal continuity and outperforming state-of-the-art methods.
📝 Abstract
Long-context video understanding in multimodal large language models (MLLMs) faces a critical challenge: balancing computational efficiency with the retention of fine-grained spatio-temporal patterns. Existing approaches (e.g., sparse sampling, dense sampling with low resolution, and token compression) suffer from significant information loss in temporal dynamics, spatial details, or subtle interactions, particularly in videos with complex motion or varying resolutions. To address this, we propose $\mathbf{Mavors}$, a novel framework that introduces $\mathbf{M}$ulti-gr$\mathbf{a}$nularity $\mathbf{v}$ide$\mathbf{o}$ $\mathbf{r}$epre$\mathbf{s}$entation for holistic long-video modeling. Specifically, Mavors directly encodes raw video content into latent representations through two core components: 1) an Intra-chunk Vision Encoder (IVE) that preserves high-resolution spatial features via 3D convolutions and Vision Transformers, and 2) an Inter-chunk Feature Aggregator (IFA) that establishes temporal coherence across chunks using transformer-based dependency modeling with chunk-level rotary position encodings. Moreover, the framework unifies image and video understanding by treating images as single-frame videos via sub-image decomposition. Experiments across diverse benchmarks demonstrate Mavors' superiority in maintaining both spatial fidelity and temporal continuity, significantly outperforming existing methods in tasks requiring fine-grained spatio-temporal reasoning.
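The chunk-level rotary position encoding used by the IFA can be illustrated with a minimal numpy sketch. This is not the paper's implementation; the function names (`rope_angles`, `apply_chunk_rope`), the feature dimensions, and the frequency base are illustrative assumptions. The key idea shown is that every token in a chunk shares the same rotary position, namely the chunk index, so relative chunk order is encoded while intra-chunk tokens stay positionally equivalent:

```python
import numpy as np

def rope_angles(dim, positions, base=10000.0):
    # Standard RoPE frequency schedule: one frequency per rotated feature pair.
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    return np.outer(positions, inv_freq)  # shape (n_tokens, dim/2)

def apply_chunk_rope(x, chunk_ids):
    """Rotate each (even, odd) feature pair by an angle set by the token's
    chunk index, so all tokens of one chunk share a single rotary position."""
    ang = rope_angles(x.shape[1], chunk_ids)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin  # 2D rotation of each pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Toy example: 3 chunks of 2 tokens each, feature dim 8.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((6, 8))
chunk_ids = np.array([0, 0, 1, 1, 2, 2])
rotated = apply_chunk_rope(tokens, chunk_ids)
```

Because the transform is a pure rotation, it preserves token norms, and chunk 0 (angle zero) is left unchanged; only the relative angle between chunks carries positional information into the aggregator's attention.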