Multimodal Learning  Relevance: 9/10

GenVideoLens: Where LVLMs Fall Short in AI-Generated Video Detection?

Yueying Zou, Pei Pei Li, Zekun Li, Xinyu Guo, Xing Cui, Huaibo Huang, Ran He
arXiv: 2603.18625v1  Published: 2026-03-19  Updated: 2026-03-19

AI Summary

The GenVideoLens benchmark reveals shortcomings of LVLMs in optical, physical, and temporal reasoning for AI-generated video detection.

Key Contributions

  • Proposed GenVideoLens, a fine-grained benchmark for AI-generated video detection.
  • Constructed a dataset of real and AI-generated videos with multi-dimensional annotations.
  • Evaluated multiple LVLMs and analyzed their performance differences across dimensions.

Methodology

A dataset of 400 AI-generated videos and 100 real videos was constructed, annotated by experts across 15 dimensions, and used to evaluate the performance of 11 LVLM models.
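The dimension-wise evaluation described above can be sketched as a simple aggregation of per-dimension judgments. Note that the record layout and dimension names below are illustrative assumptions, not the paper's actual annotation schema:

```python
from collections import defaultdict

def dimension_wise_accuracy(records):
    """Aggregate per-dimension accuracy from model judgments.

    Each record is (dimension, model_prediction, ground_truth);
    this triple layout is a hypothetical schema for illustration.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for dim, pred, truth in records:
        total[dim] += 1
        correct[dim] += int(pred == truth)
    return {dim: correct[dim] / total[dim] for dim in total}

# Toy judgments: the model handles perceptual cues well but
# misses temporal ones, mirroring the imbalance the paper reports.
records = [
    ("perceptual", "fake", "fake"),
    ("perceptual", "real", "real"),
    ("temporal", "real", "fake"),
    ("temporal", "fake", "fake"),
]
acc = dimension_wise_accuracy(records)
```

Reporting accuracy per dimension rather than one overall number is what lets the benchmark localize failures to specific cue types.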

Original Abstract

In recent years, AI-generated videos have become increasingly realistic and sophisticated. Meanwhile, Large Vision-Language Models (LVLMs) have shown strong potential for detecting such content. However, existing evaluation protocols largely treat the task as a binary classification problem and rely on coarse-grained metrics such as overall accuracy, providing limited insight into where LVLMs succeed or fail. To address this limitation, we introduce GenVideoLens, a fine-grained benchmark that enables dimension-wise evaluation of LVLM capabilities in AI-generated video detection. The benchmark contains 400 highly deceptive AI-generated videos and 100 real videos, annotated by experts across 15 authenticity dimensions covering perceptual, optical, physical, and temporal cues. We evaluate eleven representative LVLMs on this benchmark. Our analysis reveals a pronounced dimensional imbalance. While LVLMs perform relatively well on perceptual cues, they struggle with optical consistency, physical interactions, and temporal-causal reasoning. Model performance also varies substantially across dimensions, with smaller open-source models sometimes outperforming stronger proprietary models on specific authenticity cues. Temporal perturbation experiments further show that current LVLMs make limited use of temporal information. Overall, GenVideoLens provides diagnostic insights into LVLM behavior, revealing key capability gaps and offering guidance for improving future AI-generated video detection systems.
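The temporal perturbation experiments mentioned in the abstract can be illustrated with a minimal probe: perturb a video's frame order and check whether a detector's verdict changes. The perturbation modes and frame representation here are assumptions for illustration, not the paper's exact protocol:

```python
import random

def temporal_perturbation(frames, mode="shuffle", seed=0):
    """Return a temporally perturbed copy of a frame sequence.

    `frames` is any list of frame identifiers. "shuffle" and
    "reverse" are illustrative probes of temporal sensitivity.
    """
    perturbed = list(frames)
    if mode == "shuffle":
        random.Random(seed).shuffle(perturbed)
    elif mode == "reverse":
        perturbed.reverse()
    return perturbed

frames = ["f0", "f1", "f2", "f3"]
shuffled = temporal_perturbation(frames, "shuffle")
reversed_frames = temporal_perturbation(frames, "reverse")
```

If a detector gives the same verdict on the original and the perturbed sequences, it is likely ignoring temporal cues, which is the behavior the paper attributes to current LVLMs.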

Tags

LVLM  AI-Generated Video Detection  Multimodal Learning  Benchmark  Evaluation

arXiv Categories

cs.CV