🤖 AI Summary
Existing video large language models (Video LLMs) are largely confined to single-granularity, task-specific understanding, and cannot jointly model global semantics, pixel-level details, and temporal dynamics. To address this, we propose UFVideo, the first Video LLM with unified global-, temporal-, and pixel-level understanding. Our method introduces a unified vision-language guided alignment mechanism that dynamically encodes the visual and text inputs of different tasks and jointly generates textual responses, temporal localizations, and grounded masks within a single model. We further construct UFVideo-Bench, a benchmark of three collaborative tasks for multi-grained video understanding, on which UFVideo demonstrates its flexibility and clear advantages over GPT-4o. Experiments on nine public benchmarks covering common video understanding tasks further validate the effectiveness and generalizability of our unified multi-grained understanding paradigm.
📝 Abstract
With the advancement of multi-modal Large Language Models (LLMs), Video LLMs have been further developed to perform holistic and specialized video understanding. However, existing works are limited to specialized video understanding tasks and fail to achieve comprehensive, multi-grained video perception. To bridge this gap, we introduce UFVideo, the first Video LLM with unified multi-grained cooperative understanding capabilities. Specifically, we design a unified visual-language guided alignment that flexibly handles video understanding across global, pixel, and temporal scales within a single model. UFVideo dynamically encodes the visual and text inputs of different tasks and generates the corresponding textual response, temporal localization, or grounded mask. Additionally, to evaluate challenging multi-grained video understanding tasks, we construct UFVideo-Bench, which consists of three distinct collaborative tasks spanning these scales and demonstrates UFVideo's flexibility and advantages over GPT-4o. Furthermore, we validate the effectiveness of our model across nine public benchmarks covering common video understanding tasks, providing valuable insights for future Video LLMs.
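To make the three output granularities concrete, below is a minimal, hypothetical sketch of what a unified response interface for such a model could look like. All names (`UnifiedVideoResponse`, `answer_query`, the field names) and the placeholder logic are illustrative assumptions for this summary, not UFVideo's actual API or architecture.

```python
# Hypothetical sketch of a unified multi-grained video-understanding interface.
# Names and logic are illustrative assumptions, not UFVideo's actual API.
from dataclasses import dataclass
from typing import List, Optional, Tuple

import numpy as np


@dataclass
class UnifiedVideoResponse:
    """One response object covering all three granularities."""
    text: str                                             # global-scale textual answer
    temporal_span: Optional[Tuple[float, float]] = None   # (start_s, end_s) localization
    masks: Optional[List[np.ndarray]] = None               # per-frame binary grounding masks


def answer_query(video: np.ndarray, query: str) -> UnifiedVideoResponse:
    """Toy stand-in for a unified Video LLM forward pass.

    A real model would encode the frames and the query jointly and decode
    whichever output granularities the task requires; here we only return
    placeholder values to illustrate the output structure.
    """
    num_frames, height, width, _ = video.shape
    return UnifiedVideoResponse(
        text="A person picks up a red cup.",
        temporal_span=(1.5, 4.0),
        masks=[np.zeros((height, width), dtype=bool) for _ in range(num_frames)],
    )


if __name__ == "__main__":
    dummy_video = np.zeros((8, 224, 224, 3), dtype=np.uint8)  # 8 RGB frames
    out = answer_query(dummy_video, "When and where does the person grab the cup?")
    print(out.text, out.temporal_span, len(out.masks))
```

The point of a single response type like this is that one model call can serve a global QA task (text only), a temporal grounding task (text plus span), or a pixel-level grounding task (text plus masks), which is the kind of cooperative, multi-grained behavior the abstract describes.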