TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models

📅 2026-05-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current evaluations of video large language models rarely test temporal object consistency (identity preservation, state coherence, and cross-frame continuity), so they may overestimate the models' temporal reasoning capabilities. This work proposes TOC-Bench, an object-trajectory-anchored benchmark for assessing temporal consistency, which uses structured event timelines to focus on challenging scenarios involving occlusion, disappearance, and reappearance. A three-layer temporal-necessity filtering protocol ensures that questions require temporally ordered visual evidence, and high-quality question-answer pairs are constructed through object tracking and human verification. The benchmark comprises 2,323 QA pairs over 1,951 videos and reveals significant deficiencies in mainstream models' event counting, event ordering, and identity-sensitive reasoning.

📝 Abstract

Video large language models (Video-LLMs) have achieved remarkable progress in general video understanding, yet their ability to maintain temporal object consistency remains insufficiently explored. Existing benchmarks primarily focus on event recognition, action understanding, or coarse temporal reasoning, but rarely evaluate whether a model can consistently preserve the identity, state, and temporal continuity of the same object across occlusion, disappearance, reappearance, state transitions, and cross-object interactions. As a result, current evaluations may overestimate temporal reasoning ability while overlooking failures in object-centric temporal coherence. To address this issue, we introduce TOC-Bench, a diagnostic benchmark specifically designed to evaluate temporal object consistency in Video-LLMs. TOC-Bench is explicitly grounded in object tracks: each queried subject is associated with a per-frame object trajectory and a structured temporal event timeline. To ensure that benchmark items depend on temporally ordered visual evidence rather than language priors, single-frame shortcuts, or unordered frame cues, we propose a three-layer temporal-necessity filtering protocol that removes 60.7% of candidate QA pairs and retains 17,900 temporally dependent items spanning 10 diagnostic dimensions. From this filtered pool, we further construct a human-verified benchmark containing 2,323 high-quality QA pairs over 1,951 videos. Experiments on representative Video-LLMs show that temporal object consistency remains a major unsolved challenge. Current models exhibit substantial weaknesses in event counting, event ordering, identity-sensitive reasoning, and hallucination-aware verification, despite strong performance on general video understanding benchmarks.
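
The three-layer temporal-necessity filter can be illustrated with a minimal sketch. The abstract does not say how each layer is implemented, so everything here is an assumption made for illustration: the `Answerer` callable, the `QAItem` fields, the exact-match check, and the choice of a reversed sequence plus random shuffles as the "unordered" probes. An item is kept only if it cannot be answered from the question text alone, from any single frame, or from the frames in a scrambled order.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical answerer: maps (question, frames) to a predicted answer string.
# It stands in for whatever probe model a real pipeline would use.
Answerer = Callable[[str, Sequence[str]], str]


@dataclass
class QAItem:
    question: str
    answer: str
    frames: List[str]  # frame identifiers in temporal order (assumed layout)


def is_temporally_necessary(item: QAItem, answerer: Answerer,
                            shuffle_trials: int = 3, seed: int = 0) -> bool:
    """Return True if the item survives all three (assumed) filter layers."""
    # Layer 1: language prior. Answerable with no visual input at all?
    if answerer(item.question, []) == item.answer:
        return False

    # Layer 2: single-frame shortcut. Answerable from any one frame?
    for frame in item.frames:
        if answerer(item.question, [frame]) == item.answer:
            return False

    # Layer 3: unordered frame cues. Answerable when order is destroyed?
    rng = random.Random(seed)
    probes = [list(reversed(item.frames))]  # deterministic reordering
    for _ in range(shuffle_trials):
        shuffled = list(item.frames)
        rng.shuffle(shuffled)
        probes.append(shuffled)
    for probe in probes:
        if probe == item.frames:
            continue  # reordering left the sequence unchanged; not a probe
        if answerer(item.question, probe) == item.answer:
            return False

    return True


def filter_items(items: Sequence[QAItem], answerer: Answerer) -> List[QAItem]:
    """Keep only items that need temporally ordered visual evidence."""
    return [it for it in items if is_temporally_necessary(it, answerer)]
```

In this sketch a single correct answer under any degraded input is enough to discard an item, which is a deliberately strict rule; a real protocol might instead use several probe models or an accuracy threshold per layer.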

Problem

Research questions and friction points this paper is trying to address.

temporal object consistency

video large language models

object identity

temporal reasoning

object continuity

Innovation

Methods, ideas, or system contributions that make the work stand out.

Temporal Object Consistency

Video Large Language Models

Object Tracking