Multimodal Learning Relevance: 9/10

Beyond Static Artifacts: A Forensic Benchmark for Video Deepfake Reasoning in Vision Language Models

Zheyuan Gu, Qingsong Zhao, Yusong Wang, Zhaohong Huang, Xinqi Li, Cheng Yuan, Jiaowei Shao, Chi Zhang, Xuelong Li
arXiv: 2602.21779v1 Published: 2026-02-25 Updated: 2026-02-25

AI Summary

Proposes the FAQ benchmark to improve VLMs' ability to reason about temporal consistency in video deepfakes.

Key Contributions

  • Proposes the FAQ benchmark for evaluating VLMs' temporal reasoning on video deepfakes.
  • FAQ comprises three levels: Facial Perception, Temporal Deepfake Grounding, and Forensic Reasoning.
  • Fine-tuning VLMs on FAQ-IT improves their deepfake detection performance.

Methodology

Constructs FAQ, a large-scale multiple-choice dataset that evaluates VLMs' temporal deepfake reasoning in a layered hierarchy, and optimizes models via instruction tuning on the FAQ-IT dataset.

Original Abstract

Current Vision-Language Models (VLMs) for deepfake detection excel at identifying spatial artifacts but overlook a critical dimension: temporal inconsistencies in video forgeries. Adapting VLMs to reason about these dynamic cues remains a distinct challenge. To bridge this gap, we propose Forensic Answer-Questioning (FAQ), a large-scale benchmark that formulates temporal deepfake analysis as a multiple-choice task. FAQ introduces a three-level hierarchy to progressively evaluate and equip VLMs with forensic capabilities: (1) Facial Perception, testing the ability to identify static visual artifacts; (2) Temporal Deepfake Grounding, requiring the localization of dynamic forgery artifacts across frames; and (3) Forensic Reasoning, challenging models to synthesize evidence for final authenticity verdicts. We evaluate a range of VLMs on FAQ and generate a corresponding instruction-tuning set, FAQ-IT. Extensive experiments show that models fine-tuned on FAQ-IT achieve advanced performance on both in-domain and cross-dataset detection benchmarks. Ablation studies further validate the impact of our key design choices, confirming that FAQ is the driving force behind the temporal reasoning capabilities of these VLMs.
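The abstract frames evaluation as a three-level multiple-choice task, which implies per-level accuracy scoring. A minimal sketch of such a scorer (all names here are hypothetical illustrations, not the paper's actual data format or API):

```python
from dataclasses import dataclass

@dataclass
class MCQItem:
    """One hypothetical multiple-choice item in the benchmark hierarchy."""
    level: str            # e.g. "perception", "grounding", or "reasoning"
    question: str
    choices: list[str]    # answer options
    answer: str           # correct choice label, e.g. "B"

def accuracy_by_level(items: list[MCQItem], predictions: dict[int, str]) -> dict[str, float]:
    """Compute per-level accuracy; predictions maps item index -> predicted label."""
    totals: dict[str, int] = {}
    correct: dict[str, int] = {}
    for i, item in enumerate(items):
        totals[item.level] = totals.get(item.level, 0) + 1
        if predictions.get(i) == item.answer:
            correct[item.level] = correct.get(item.level, 0) + 1
    # Report accuracy separately for each hierarchy level
    return {lvl: correct.get(lvl, 0) / n for lvl, n in totals.items()}
```

Scoring each level separately, rather than pooling all questions, matches the benchmark's stated goal of *progressively* diagnosing where a VLM's forensic capability breaks down.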

Tags

Video Deepfake Vision-Language Model Temporal Reasoning Benchmark

arXiv Categories

cs.CV cs.AI