UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist

📅 2025-11-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Real-world applications demand video intelligence systems capable of multi-step, collaborative, and iterative interaction, yet existing models are largely confined to single tasks (e.g., generation or understanding). Method: We propose UniVA, the first open-source general-purpose video agent framework, featuring a dual-agent Plan-and-Act architecture and a hierarchical memory mechanism that together unify video understanding, segmentation, editing, and generation. It enables long-horizon, context-coherent, and fully traceable multi-condition workflows. Technically, UniVA integrates Model Context Protocol (MCP) tool servers, coordinated multi-agent scheduling, and video-language model orchestration. Contribution/Results: Experiments demonstrate that UniVA significantly advances automation for cross-step video tasks. We further introduce UniVA-Bench, the first dedicated benchmark for evaluating general-purpose video agents, establishing a foundational resource for future research on video agents.

📝 Abstract
While specialized AI models excel at isolated video tasks like generation or understanding, real-world applications demand complex, iterative workflows that combine these capabilities. To bridge this gap, we introduce UniVA, an open-source, omni-capable multi-agent framework for next-generation video generalists that unifies video understanding, segmentation, editing, and generation into cohesive workflows. UniVA employs a Plan-and-Act dual-agent architecture that drives a highly automated and proactive workflow: a planner agent interprets user intentions and decomposes them into structured video-processing steps, while executor agents execute these through modular, MCP-based tool servers (for analysis, generation, editing, tracking, etc.). Through a hierarchical multi-level memory (global knowledge, task context, and user-specific preferences), UniVA sustains long-horizon reasoning, contextual continuity, and inter-agent communication, enabling interactive and self-reflective video creation with full traceability. This design enables iterative and any-conditioned video workflows (e.g., text/image/video-conditioned generation → multi-round editing → object segmentation → compositional synthesis) that were previously cumbersome to achieve with single-purpose models or monolithic video-language models. We also introduce UniVA-Bench, a benchmark suite of multi-step video tasks spanning understanding, editing, segmentation, and generation, to rigorously evaluate such agentic video systems. Both UniVA and UniVA-Bench are fully open-sourced, aiming to catalyze research on interactive, agentic, and general-purpose video intelligence for the next generation of multimodal AI systems. (https://univa.online/)
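
To make the Plan-and-Act division of labor concrete, here is a minimal Python sketch of the loop the abstract describes: a planner decomposes the user request into structured steps, and an executor dispatches each step to a tool server while accumulating shared context. All names (PlannerAgent, ExecutorAgent, Step, run_workflow) and the fixed toy plan are illustrative assumptions, not UniVA's actual API.

```python
# Minimal Plan-and-Act sketch; all class and tool names are hypothetical.
from dataclasses import dataclass


@dataclass
class Step:
    """One structured video-processing step emitted by the planner."""
    tool: str    # e.g. "generation", "segmentation", "editing"
    params: dict


class PlannerAgent:
    """Interprets user intent and decomposes it into structured steps."""

    def plan(self, user_request: str) -> list[Step]:
        # In UniVA this is an LLM-driven decision; a fixed toy plan here.
        return [
            Step(tool="generation", params={"prompt": user_request}),
            Step(tool="segmentation", params={"target": "main subject"}),
            Step(tool="editing", params={"instruction": "replace background"}),
        ]


class ExecutorAgent:
    """Routes each step to the matching tool server and records the result."""

    def __init__(self, tool_servers: dict):
        # tool_servers maps a tool name to a callable stub standing in
        # for an MCP tool server (analysis, generation, editing, ...).
        self.tool_servers = tool_servers

    def execute(self, step: Step, context: dict):
        result = self.tool_servers[step.tool](step.params, context)
        context[step.tool] = result  # task context carried to later steps
        return result


def run_workflow(user_request: str, planner: PlannerAgent, executor: ExecutorAgent) -> dict:
    context: dict = {}  # stands in for UniVA's task-level memory
    for step in planner.plan(user_request):
        executor.execute(step, context)
    return context
```

In the real system the planner is an LLM agent and the tool servers are MCP endpoints; the shared context dict above stands in for the task-level tier of the hierarchical memory, which is what lets later steps (e.g., compositional synthesis) see the outputs of earlier ones.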
Problem

Research questions and friction points this paper is trying to address.

Bridging isolated video tasks into cohesive complex workflows
Unifying video understanding, segmentation, editing, and generation capabilities
Enabling interactive and self-reflective video creation with traceability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Plan-and-Act dual-agent architecture for video workflows
Hierarchical multi-level memory for contextual continuity (see the sketch after this list)
Modular MCP-based tool servers for video processing
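
A hedged sketch of how the three memory levels named in the paper (global knowledge, task context, user-specific preferences) could be layered, with the most specific level consulted first on reads; the HierarchicalMemory class and its read/write interface are assumptions for illustration, not the framework's real implementation.

```python
# Illustrative three-level memory lookup; names are hypothetical.
class HierarchicalMemory:
    def __init__(self):
        self.global_knowledge = {}   # long-lived facts shared across tasks
        self.task_context = {}       # intermediate results of the current workflow
        self.user_preferences = {}   # per-user style and formatting choices

    def read(self, key: str):
        # Most specific level wins: user prefs, then task, then global.
        for level in (self.user_preferences, self.task_context,
                      self.global_knowledge):
            if key in level:
                return level[key]
        return None

    def write(self, level: str, key: str, value) -> None:
        # level is one of: "global_knowledge", "task_context",
        # "user_preferences".
        getattr(self, level)[key] = value


# e.g. memory.write("user_preferences", "aspect_ratio", "16:9")
```

Under this layering, reads fall through from user preferences to task context to global knowledge, so a per-user choice (such as a preferred aspect ratio) overrides workflow defaults without erasing them.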