Long-CODE: Isolating Pure Long-Context as an Orthogonal Dimension in Video Evaluation

📅 2026-04-19
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing video evaluation metrics are primarily designed for short videos and struggle to effectively capture long-range characteristics such as narrative richness and global causal consistency in long-form content. This work proposes decoupling long-video evaluation from short-video assessment by explicitly modeling long-range context as an orthogonal dimension to short-term visual quality. To this end, the authors introduce a dedicated evaluation framework for structural consistency in long videos, integrating shot-level dynamics modeling, perturbation-based stress tests—including shot shuffling and narrative disruption—and a human-annotated dataset of long-range features, Long-CODE. The proposed metric achieves state-of-the-art correlation with human judgments and complements existing short-video evaluation benchmarks, together forming a comprehensive assessment suite for video generation.
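The shot-shuffling stress test mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the shot representation (lists of frame indices), the two toy metrics, and every function name below are assumptions made only to show why an order-blind metric cannot react to a shuffled narrative, while an order-aware one can.

```python
import random
from typing import Callable, List

Shot = List[int]  # hypothetical representation: a shot is a list of frame indices


def shuffle_shots(shots: List[Shot], seed: int = 0) -> List[Shot]:
    """Return the shots in a permuted order that differs from the original."""
    order = list(range(len(shots)))
    if len(order) < 2:
        return list(shots)  # nothing to shuffle
    rng = random.Random(seed)
    permuted = order[:]
    while permuted == order:  # retry until the order actually changes
        rng.shuffle(permuted)
    return [shots[i] for i in permuted]


def order_blind_metric(shots: List[Shot]) -> float:
    """Toy stand-in for a short-video metric: averages a per-shot score."""
    return sum(len(shot) for shot in shots) / len(shots)


def order_aware_metric(shots: List[Shot]) -> float:
    """Toy stand-in for a long-range metric: share of shot transitions in order."""
    pairs = list(zip(shots, shots[1:]))
    in_order = sum(1 for prev, nxt in pairs if nxt[0] > prev[-1])
    return in_order / len(pairs)


def sensitivity(metric: Callable[[List[Shot]], float],
                shots: List[Shot], seed: int = 0) -> float:
    """Score drop caused by shuffling; zero means the metric ignores shot order."""
    return metric(shots) - metric(shuffle_shots(shots, seed))


# Hypothetical four-shot video with ten frames in total.
video = [[0, 1, 2], [3, 4], [5, 6, 7], [8, 9]]
print("order-blind drop:", sensitivity(order_blind_metric, video))
print("order-aware drop:", sensitivity(order_aware_metric, video))
```

The order-blind score is unchanged by the shuffle because it aggregates per-shot values without regard to position, which is the kind of insensitivity the stress tests are meant to expose.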

📝 Abstract
As video generation models achieve unprecedented capabilities, the demand for robust video evaluation metrics becomes increasingly critical. Traditional metrics are intrinsically tailored for short-video evaluation, predominantly assessing frame-level visual quality and localized temporal smoothness. However, as state-of-the-art video generation models scale to generate longer videos, these metrics fail to capture essential long-range characteristics, such as narrative richness and global causal consistency. Recognizing that short-term visual perception and long-context attributes are fundamentally orthogonal dimensions, we argue that long-video metrics should be disentangled from short-video assessments. In this paper, we focus on the rigorous justification and design of a dedicated framework for long-video evaluation. We first introduce a suite of long-video attribute corruption tests, which expose the critical limitations of existing short-video metrics through their insensitivity to structural inconsistencies, such as shot-level perturbations and narrative shuffling. To bridge this gap, we design a novel long-video metric based on shot dynamics, which is highly sensitive to the corruptions in this long-range testing framework. Furthermore, we introduce Long-CODE (Long-Context as an Orthogonal Dimension for video Evaluation), a specialized dataset designed to benchmark long-video evaluation, with human annotations isolated specifically to genuine long-range characteristics. Extensive experiments show that our proposed metrics achieve state-of-the-art correlation with human judgments. Ultimately, our metric and benchmark seamlessly complement existing short-video standards, establishing a holistic and unbiased evaluation paradigm for video generation models.
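Agreement between a metric and human judgments is commonly summarized with a rank correlation. The abstract does not say which coefficient the authors report, so the sketch below assumes Spearman's rank correlation, and the scores in the usage lines are made-up placeholders rather than results from the paper.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Assign 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Placeholder data: metric scores and human ratings for five hypothetical videos.
metric_scores = [0.62, 0.35, 0.80, 0.41, 0.77]
human_ratings = [4.0, 2.0, 5.0, 3.0, 4.5]
print("Spearman correlation:", spearman(metric_scores, human_ratings))
```

A rank-based coefficient is a natural choice here because it only asks whether the metric orders videos the way annotators do, without assuming the two scales are linearly related.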
Problem

Research questions and friction points this paper is trying to address.

long-video evaluation
narrative consistency
temporal coherence
video generation metrics
long-context understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

long-context evaluation
orthogonal dimension
shot dynamics
narrative consistency
video generation benchmark