OSCBench: Benchmarking Object State Change in Text-to-Video Generation
AI Summary
Proposes OSCBench, a benchmark for evaluating how well text-to-video generation models understand object state changes.
Main Contributions
- Constructed the OSCBench benchmark dataset of action-object interactions, organized into regular, novel, and compositional scenarios (see the sketch after this list)
- Proposed an MLLM-based automatic evaluation method
- Evaluated the OSC performance of multiple T2V models
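As a rough illustration of that scenario organization, the snippet below sketches how action-object pairs might be grouped. The specific pairs and split criteria are illustrative assumptions, not the paper's actual assignments.

```python
# Hypothetical illustration of OSCBench's scenario taxonomy; the
# concrete action-object pairs and split criteria are assumptions.
SCENARIOS = {
    # pairings common in instructional cooking data (in-distribution)
    "regular": [("peel", "potato"), ("slice", "lemon")],
    # familiar actions applied to objects rarely seen with them
    "novel": [("peel", "dragon fruit"), ("slice", "tofu")],
    # recombinations of actions and objects that each appear in the
    # regular split, but not together
    "compositional": [("slice", "potato"), ("peel", "lemon")],
}
```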
Methodology
Constructed an instructional-cooking dataset covering regular, novel, and compositional scenarios, and evaluated models with a combination of human evaluation and MLLM-based automatic evaluation; a minimal sketch of the automatic evaluation follows.
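One way such an MLLM-based check could work is sketched below, assuming frames are sampled uniformly from each generated video and an MLLM is asked targeted yes/no questions about the initial state, the final state, and the transition. `query_mllm` is a hypothetical wrapper (the paper does not specify this interface), and the three-question rubric is an assumption, not the paper's protocol.

```python
from typing import List

import cv2          # pip install opencv-python
import numpy as np

def sample_frames(video_path: str, num_frames: int = 8) -> List[np.ndarray]:
    """Uniformly sample frames from a generated video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    step = max(total - 1, 0) / max(num_frames - 1, 1)
    frames = []
    for i in range(num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i * step))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames

def query_mllm(frames: List[np.ndarray], question: str) -> str:
    """Hypothetical MLLM wrapper: send frames plus a question, get back
    a 'yes'/'no' answer. Plug in any multimodal LLM here."""
    raise NotImplementedError

def osc_score(video_path: str, obj: str, initial: str, final: str) -> float:
    """Fraction of OSC checks the video passes (0.0 to 1.0)."""
    frames = sample_frames(video_path)
    checks = [
        f"Is the {obj} {initial} at the start of these frames?",
        f"Is the {obj} {final} by the end of these frames?",
        f"Does the {obj} change from {initial} to {final} gradually "
        f"and consistently across the frames?",
    ]
    answers = [query_mllm(frames, q) for q in checks]
    return sum(a.strip().lower().startswith("yes") for a in answers) / len(checks)
```

Usage would look like `osc_score("out.mp4", "potato", "unpeeled", "peeled")`; separating the initial-state, final-state, and transition questions makes it possible to tell whether a model fails to depict the end state at all or merely renders the change abruptly.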
Original Abstract
Text-to-video (T2V) generation models have made rapid progress in producing visually high-quality and temporally coherent videos. However, existing benchmarks primarily focus on perceptual quality, text-video alignment, or physical plausibility, leaving a critical aspect of action understanding largely unexplored: object state change (OSC) explicitly specified in the text prompt. OSC refers to the transformation of an object's state induced by an action, such as peeling a potato or slicing a lemon. In this paper, we introduce OSCBench, a benchmark specifically designed to assess OSC performance in T2V models. OSCBench is constructed from instructional cooking data and systematically organizes action-object interactions into regular, novel, and compositional scenarios to probe both in-distribution performance and generalization. We evaluate six representative open-source and proprietary T2V models using both a human user study and multimodal large language model (MLLM)-based automatic evaluation. Our results show that, despite strong performance on semantic and scene alignment, current T2V models consistently struggle with accurate and temporally consistent object state changes, especially in novel and compositional settings. These findings position OSC as a key bottleneck in text-to-video generation and establish OSCBench as a diagnostic benchmark for advancing state-aware video generation models.