Can You Count to Nine? A Human Evaluation Benchmark for Counting Limits in Modern Text-to-Video Models

📅 2025-04-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current text-to-video (T2V) models exhibit systematic failures in adhering to basic numerical constraints—e.g., generating exactly 1–9 objects—with accuracy consistently below 12%. Method: We introduce T2VCountBench, the first dedicated benchmark for evaluating counting capability in T2V generation, covering multilingual, multi-style, and long-duration scenarios; we further design controllable and task-decomposition prompting strategies, and propose a multidimensional human evaluation framework for fine-grained quantification of numerical adherence. Contribution/Results: Ablation studies reveal that existing prompting optimizations—including decomposition, style control, and temporal adjustment—fail to meaningfully improve counting performance, pointing to a fundamental architectural limitation in current T2V models. This work establishes a new benchmark, evaluation methodology, and conceptual understanding for advancing controllable T2V generation.

📝 Abstract
Generative models have driven significant progress in a variety of AI tasks, including text-to-video generation, where models like Video LDM and Stable Video Diffusion can produce realistic, movie-level videos from textual instructions. Despite these advances, current text-to-video models still face fundamental challenges in reliably following human commands, particularly in adhering to simple numerical constraints. In this work, we present T2VCountBench, a specialized benchmark aimed at evaluating the counting capability of SOTA text-to-video models as of 2025. Our benchmark employs rigorous human evaluations to measure the number of generated objects and spans a diverse range of generators, including both open-source and commercial models. Extensive experiments reveal that all existing models struggle with basic numerical tasks, almost always failing to generate videos with an object count of 9 or fewer. Furthermore, our comprehensive ablation studies explore how factors like video style, temporal dynamics, and multilingual inputs may influence counting performance. We also explore prompt refinement techniques and demonstrate that decomposing the task into smaller subtasks does not easily alleviate these limitations. Our findings highlight important challenges in current text-to-video generation and provide insights for future research aimed at improving adherence to basic numerical constraints.
Problem

Research questions and friction points this paper is trying to address.

Evaluating counting capability in text-to-video models
Assessing adherence to numerical constraints in video generation
Exploring limitations in generating videos with object counts
Innovation

Methods, ideas, or system contributions that make the work stand out.

T2VCountBench evaluates counting in text-to-video models
Human assessments measure object count accuracy
Prompt refinement fails to improve numerical constraints
Xuyang Guo
Guilin University of Electronic Technology
Machine Learning
Zekai Huang
The Ohio State University
Jiayan Huo
University of Arizona
Yingyu Liang
The University of Hong Kong
Machine Learning
Zhenmei Shi
Senior Research Scientist at MongoDB + Voyage AI; PhD from University of Wisconsin–Madison
Deep Learning · Machine Learning · Artificial Intelligence
Zhao Song
Simons Institute for the Theory of Computing, UC Berkeley
Jiahao Zhang
Independent Researcher