Multimodal Learning Relevance: 9/10

RISE-Video: Can Video Generators Decode Implicit World Rules?

Mingxin Liu, Shuran Ma, Shibei Meng, Xiangyu Zhao, Zicheng Zhang, Shaofeng Zhang, Zhihang Zhong, Peixian Chen, Haoyu Cao, Xing Sun, Haodong Duan, Xue Yang
arXiv: 2602.05986v1 Published: 2026-02-05 Updated: 2026-02-05

AI Summary

Proposes the RISE-Video benchmark to evaluate the reasoning ability of video generation models in understanding implicit world rules.

Key Contributions

  • Proposes the RISE-Video benchmark
  • Designs a multi-dimensional evaluation protocol
  • Proposes an automated LMM-based evaluation pipeline

Methodology

Constructs a benchmark of 467 samples covering commonsense, spatial dynamics, and other domains, and uses LMMs for automated evaluation.
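The LMM-as-judge pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Sample` fields, the 0-10 score scale, and the `judge` callable are all assumptions; a real setup would wrap an actual multimodal model behind `judge`.

```python
# Hypothetical sketch of an LMM-based automated evaluation pipeline,
# scoring each generated video on the paper's four metrics.
# The Sample fields, score scale, and judge interface are assumptions.
from dataclasses import dataclass
from typing import Callable

METRICS = ["Reasoning Alignment", "Temporal Consistency",
           "Physical Rationality", "Visual Quality"]


@dataclass
class Sample:
    prompt: str       # text condition describing the implicit world rule
    image_path: str   # first-frame image condition (TI2V input)
    video_path: str   # model-generated video to be judged


def evaluate(sample: Sample,
             judge: Callable[[str, Sample], float]) -> dict[str, float]:
    """Ask the LMM judge for a 0-10 score on each metric for one sample."""
    return {metric: judge(metric, sample) for metric in METRICS}


def aggregate(per_sample: list[dict[str, float]]) -> dict[str, float]:
    """Average each metric across all benchmark samples."""
    n = len(per_sample)
    return {m: sum(scores[m] for scores in per_sample) / n for m in METRICS}
```

In practice `judge` would render the video frames and prompt into an LMM query ("Does the outcome follow the implied rule?") and parse a numeric score from its reply; the stub interface here only fixes the shape of that contract.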

Original Abstract

While generative video models have achieved remarkable visual fidelity, their capacity to internalize and reason over implicit world rules remains a critical yet under-explored frontier. To bridge this gap, we present RISE-Video, a pioneering reasoning-oriented benchmark for Text-Image-to-Video (TI2V) synthesis that shifts the evaluative focus from surface-level aesthetics to deep cognitive reasoning. RISE-Video comprises 467 meticulously human-annotated samples spanning eight rigorous categories, providing a structured testbed for probing model intelligence across diverse dimensions, ranging from commonsense and spatial dynamics to specialized subject domains. Our framework introduces a multi-dimensional evaluation protocol consisting of four metrics: Reasoning Alignment, Temporal Consistency, Physical Rationality, and Visual Quality. To further support scalable evaluation, we propose an automated pipeline leveraging Large Multimodal Models (LMMs) to emulate human-centric assessment. Extensive experiments on 11 state-of-the-art TI2V models reveal pervasive deficiencies in simulating complex scenarios under implicit constraints, offering critical insights for the advancement of future world-simulating generative models.

Tags

Video Generation, Reasoning, Benchmark, Multimodal

arXiv Categories

cs.CV cs.AI