OSCBench: Benchmarking Object State Change in Text-to-Video Generation

πŸ“… 2026-03-12
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses a critical gap in the evaluation of text-to-video (T2V) generation models: their inability to capture the object state changes (OSC) described in the input text, a blind spot in action understanding. To this end, we introduce OSCBench, the first structured benchmark specifically designed to assess OSC capabilities. Built from instructional cooking data, it organizes action-object interactions into regular, novel, and compositional scenarios. We systematically evaluate six representative open-source and proprietary T2V models using a hybrid approach that combines a human user study with automatic assessment via multimodal large language models (MLLMs). Our findings reveal that while current models perform well on semantic and scene alignment, they exhibit significant deficiencies in generating accurate object state changes, particularly in novel and compositional settings. This highlights OSC as a core bottleneck in T2V generation and establishes a crucial evaluation dimension for future model development.
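To make the three scenario types concrete, here is a minimal sketch of how action-object prompts might be partitioned into regular, novel, and compositional splits. The verb and noun lists and the pairing rules are illustrative assumptions, not the paper's actual taxonomy or data.

```python
# Hypothetical sketch of OSCBench-style prompt construction.
# ACTIONS, OBJECTS, and REGULAR_PAIRS are illustrative assumptions;
# the paper's actual action-object taxonomy is not reproduced here.
from itertools import product

ACTIONS = ["peel", "slice", "grate"]
OBJECTS = ["potato", "lemon", "carrot"]

# Regular: pairings commonly seen in instructional cooking data.
REGULAR_PAIRS = {("peel", "potato"), ("slice", "lemon"), ("grate", "carrot")}

def build_prompts():
    regular, novel = [], []
    for action, obj in product(ACTIONS, OBJECTS):
        prompt = f"A person {action}s a {obj} on a cutting board."
        # Novel: a plausible but unseen action-object pairing,
        # probing generalization beyond the training distribution.
        (regular if (action, obj) in REGULAR_PAIRS else novel).append(prompt)

    # Compositional: chain two state-changing actions on one object,
    # so the video must show both transformations in order.
    compositional = [
        f"A person peels a {obj}, then slices it into pieces."
        for obj in OBJECTS
    ]
    return {"regular": regular, "novel": novel, "compositional": compositional}

if __name__ == "__main__":
    for split, prompts in build_prompts().items():
        print(split, len(prompts), prompts[0])
```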

πŸ“ Abstract
Text-to-video (T2V) generation models have made rapid progress in producing visually high-quality and temporally coherent videos. However, existing benchmarks primarily focus on perceptual quality, text-video alignment, or physical plausibility, leaving a critical aspect of action understanding largely unexplored: the object state change (OSC) explicitly specified in the text prompt. OSC refers to the transformation of an object's state induced by an action, such as peeling a potato or slicing a lemon. In this paper, we introduce OSCBench, a benchmark specifically designed to assess OSC performance in T2V models. OSCBench is constructed from instructional cooking data and systematically organizes action-object interactions into regular, novel, and compositional scenarios to probe both in-distribution performance and generalization. We evaluate six representative open-source and proprietary T2V models using both a human user study and multimodal large language model (MLLM)-based automatic evaluation. Our results show that, despite strong performance on semantic and scene alignment, current T2V models consistently struggle with accurate and temporally consistent object state changes, especially in novel and compositional settings. These findings position OSC as a key bottleneck in text-to-video generation and establish OSCBench as a diagnostic benchmark for advancing state-aware video generation models.
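The MLLM-based automatic evaluation can be pictured as a simple rubric-scoring loop over sampled frames. The sketch below assumes a generic `query_mllm` client, a 1-5 scoring rubric, and JPEG frame dumps; none of these specifics come from the paper.

```python
# Hypothetical sketch of MLLM-based OSC scoring. `query_mllm` is a
# placeholder for whatever multimodal model API is actually used;
# the paper does not specify this interface or rubric.
import base64
from pathlib import Path

RUBRIC = (
    "You will see frames sampled from a generated video for the prompt: "
    "'{prompt}'. Rate from 1-5 how accurately and consistently the video "
    "shows the object's state change (e.g., peeled, sliced). "
    "Answer with a single integer."
)

def encode_frames(frame_dir: str, stride: int = 8) -> list[str]:
    """Sample every `stride`-th frame and base64-encode it for the MLLM."""
    frames = sorted(Path(frame_dir).glob("*.jpg"))[::stride]
    return [base64.b64encode(f.read_bytes()).decode() for f in frames]

def query_mllm(text: str, images: list[str]) -> str:
    """Placeholder: send text plus images to an MLLM and return its reply."""
    raise NotImplementedError("wire this to your MLLM of choice")

def score_osc(prompt: str, frame_dir: str) -> int:
    reply = query_mllm(RUBRIC.format(prompt=prompt), encode_frames(frame_dir))
    return int(reply.strip())  # the rubric forces a bare integer reply
```

Constraining the MLLM to a single integer keeps the automatic scores easy to aggregate and to correlate with the human user study.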
Problem

Research questions and friction points this paper is trying to address.

Object State Change
Text-to-Video Generation
Benchmarking
Action Understanding
Video Generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Object State Change
Text-to-Video Generation
Benchmark
Compositional Generalization
Multimodal Evaluation