🤖 AI Summary
Current text-to-video (T2V) models exhibit severe deficiencies in rendering on-screen text, such as subtitles and mathematical formulas, failing to ensure readability, spatial stability, and cross-frame consistency.
Method: We introduce T2VTextBench, the first human-evaluation benchmark dedicated to on-screen text fidelity, featuring a multidimensional scoring protocol covering font rendering, semantic accuracy, spatial positioning, and temporal consistency. We design a prompt set that integrates complex textual instructions with dynamic scenes to systematically evaluate ten mainstream open-source and commercial T2V models.
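To make the protocol concrete, here is a minimal sketch of how per-video human ratings along these four dimensions could be recorded and aggregated. The field names, the 1-5 rating scale, and the equal-weight average are illustrative assumptions, not the paper's actual instrumentation.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical rating record: one rater's judgment of one generated video.
# The four dimensions mirror the protocol above; the 1-5 scale and field
# names are assumptions for illustration, not the paper's actual schema.
@dataclass
class ScreenTextRating:
    font_rendering: int        # 1-5: are glyphs legible and well-formed?
    semantic_accuracy: int     # 1-5: does the rendered text match the prompt?
    spatial_positioning: int   # 1-5: is the text placed where instructed?
    temporal_consistency: int  # 1-5: does the text stay stable across frames?

    def overall(self) -> float:
        # Assumed equal-weight average; the real protocol may weight
        # dimensions differently or report them separately.
        return mean([self.font_rendering, self.semantic_accuracy,
                     self.spatial_positioning, self.temporal_consistency])

def model_score(ratings: list[ScreenTextRating]) -> float:
    """Aggregate many rater judgments into one score for a model."""
    return mean(r.overall() for r in ratings)

# Example: two raters scoring the same generated video.
ratings = [ScreenTextRating(4, 3, 4, 2), ScreenTextRating(5, 3, 3, 2)]
print(f"aggregate score: {model_score(ratings):.2f}")
```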
Contribution/Results: T2VTextBench fills a critical gap in evaluating precise on-screen text rendering. Experiments reveal consistent, significant failures across all evaluated models in mathematical formula generation, multilingual text, and long-text scenarios, exposing fundamental limitations in text-controllable video synthesis. The benchmark establishes a rigorous evaluation framework and identifies concrete directions for advancing controllable, text-accurate T2V generation.
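As a rough illustration of how such a prompt suite might be assembled, the sketch below crosses target text strings from the failure categories named above (formulas, multilingual text, long text) with dynamic-scene templates that stress temporal consistency. All strings, category names, and templates here are hypothetical stand-ins; the benchmark's actual prompts are not reproduced.

```python
from itertools import product

# Hypothetical target strings by category; categories follow the failure
# modes noted above. These are placeholders, not the benchmark's prompts.
TEXT_TARGETS = {
    "formula": [r"E = mc^2", r"\int_0^1 x^2 dx = 1/3"],
    "multilingual": ["こんにちは世界", "Bonjour le monde"],
    "long_text": ["The quick brown fox jumps over the lazy dog, "
                  "then pauses to read the subtitle below."],
}

# Hypothetical dynamic-scene templates: the scene or camera changes while
# the on-screen text must remain fixed, legible, and correctly placed.
SCENE_TEMPLATES = [
    'A city street at night; a neon sign displays "{text}" as the camera pans.',
    'A classroom whiteboard showing "{text}" while a teacher walks past it.',
]

def build_prompts() -> list[dict]:
    """Cross every target string with every scene template."""
    prompts = []
    for (category, targets), template in product(TEXT_TARGETS.items(),
                                                 SCENE_TEMPLATES):
        for target in targets:
            prompts.append({
                "category": category,
                "target_text": target,
                "prompt": template.format(text=target),
            })
    return prompts

for p in build_prompts()[:2]:
    print(p["category"], "->", p["prompt"])
```

Pairing every string with every scene template keeps the design factorial, so a model's failures can be attributed to the text category, the scene dynamics, or their interaction.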
📝 Abstract
Thanks to recent advancements in scalable deep architectures and large-scale pretraining, text-to-video generation has achieved unprecedented capabilities in producing high-fidelity, instruction-following content across a wide range of styles, enabling applications in advertising, entertainment, and education. However, these models' ability to render precise on-screen text, such as captions or mathematical formulas, remains largely untested, posing significant challenges for applications requiring exact textual accuracy. In this work, we introduce T2VTextBench, the first human-evaluation benchmark dedicated to evaluating on-screen text fidelity and temporal consistency in text-to-video models. Our suite of prompts integrates complex text strings with dynamic scene changes, testing each model's ability to maintain detailed instructions across frames. We evaluate ten state-of-the-art systems, ranging from open-source solutions to commercial offerings, and find that most struggle to generate legible, consistent text. These results highlight a critical gap in current video generators and provide a clear direction for future research aimed at enhancing textual manipulation in video synthesis.