ScaleLong: A Multi-Timescale Benchmark for Long Video Understanding

📅 2025-05-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing long-video understanding benchmarks lack unified multi-temporal-scale evaluation designs, hindering cross-scale performance comparison. To address this, we introduce the first benchmark enabling concurrent evaluation at four temporal granularities—second-level (clip), ten-second-level (shot), minute-level (event), and hour-level (story)—within a single video. We propose a temporal-scale-decoupled annotation framework and a hierarchical semantic modeling approach. Built upon 269 long videos (average duration: 86 minutes) spanning five high-level categories and 36 fine-grained subcategories, our benchmark systematically evaluates 23 state-of-the-art multimodal large language models (MLLMs). We uncover, for the first time, a U-shaped temporal-scale performance pattern: models achieve higher accuracy on both short- and long-scale tasks but exhibit significant degradation at intermediate scales. Furthermore, we empirically validate that visual token expansion consistently enhances inference performance across all temporal scales.

📝 Abstract
Although long-video understanding demands that models capture hierarchical temporal information -- from clip (seconds) and shot (tens of seconds) to event (minutes) and story (hours) -- existing benchmarks either neglect this multi-scale design or scatter scale-specific questions across different videos, preventing direct comparison of model performance across timescales on the same content. To address this, we introduce ScaleLong, the first benchmark to disentangle these factors by embedding questions targeting four hierarchical timescales -- clip (seconds), shot (tens of seconds), event (minutes), and story (hours) -- all within the same video content. This within-content multi-timescale questioning design enables direct comparison of model performance across timescales on identical videos. ScaleLong features 269 long videos (avg. 86 min) from 5 main categories and 36 sub-categories, each with 4--8 carefully designed questions, including at least one question per timescale. Evaluating 23 MLLMs reveals a U-shaped performance curve, with higher accuracy at the shortest and longest timescales and a dip at intermediate levels. Furthermore, ablation studies show that increased visual token capacity consistently enhances reasoning across all timescales. ScaleLong offers a fine-grained, multi-timescale benchmark for advancing MLLM capabilities in long-video understanding. The code and dataset are available at https://github.com/multimodal-art-projection/ScaleLong.
Problem

Research questions and friction points this paper is trying to address.

Lack of multi-scale benchmarks for hierarchical video understanding
Inability to compare model performance across timescales directly
Need for fine-grained evaluation of long-video comprehension models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-timescale benchmark for long-video understanding
Hierarchical questions at four timescales embedded in the same video content
Evidence that increased visual token capacity enhances reasoning across all timescales
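The U-shaped finding above comes from grouping question-level results by timescale and comparing accuracies. A minimal sketch of that aggregation step is below; the record fields ("timescale", "correct") are illustrative assumptions, not ScaleLong's actual data schema.

```python
# Hypothetical sketch: per-timescale accuracy aggregation, the kind of
# grouping that would expose a U-shaped curve across the four scales.
# Field names are assumptions for illustration, not ScaleLong's schema.
from collections import defaultdict

TIMESCALES = ["clip", "shot", "event", "story"]  # seconds -> hours

def per_timescale_accuracy(results):
    """results: iterable of dicts like {"timescale": "shot", "correct": True}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in results:
        totals[r["timescale"]] += 1
        hits[r["timescale"]] += int(r["correct"])
    # Report accuracy only for timescales that actually appear.
    return {t: hits[t] / totals[t] for t in TIMESCALES if totals[t]}

# Toy input mimicking the reported pattern: strong at the extremes,
# weak at intermediate scales.
results = [
    {"timescale": "clip", "correct": True},
    {"timescale": "shot", "correct": False},
    {"timescale": "event", "correct": False},
    {"timescale": "story", "correct": True},
]
print(per_timescale_accuracy(results))
```

In practice each video contributes at least one question per timescale, so all four buckets are populated and directly comparable on identical content.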
👥 Authors
David Ma (ByteDance Inc.)
Huaqing Yuan (ByteDance Inc.)
Xingjian Wang (ByteDance Inc.)
Qianbo Zang (ByteDance Inc.)
Tianci Liu (ByteDance Inc.)
Xinyang He (ByteDance Inc.)
Yanbin Wei (ByteDance Inc.)
Jiawei Guo (BUPT & M-A-P)
Jiahui Ni (ByteDance Inc.)
Zhenzhu Yang (ByteDance Inc.)
Meng Cao (Carnegie Mellon University)
Shanghaoran Quan (Peking University)
Yizhi Li (University of Manchester, M-A-P)
Wangchunshu Zhou (OPPO & M-A-P)
Jiaheng Liu (ByteDance Inc.)
Wenhao Huang (ByteDance Inc.)
Ge Zhang (ByteDance Inc.)
Shiwen Ni (ByteDance Inc.)
Xiaojie Jin (ByteDance Inc.)