Beyond Static Artifacts: A Forensic Benchmark for Video Deepfake Reasoning in Vision Language Models
AI Summary
Proposes FAQ, a benchmark that improves VLMs' ability to reason about temporal consistency in video deepfakes.
Main Contributions
- Proposes the FAQ benchmark for evaluating VLMs' temporal reasoning over video deepfakes.
- FAQ comprises three levels: Facial Perception, Temporal Deepfake Grounding, and Forensic Reasoning.
- Fine-tuning VLMs on FAQ-IT improves their deepfake detection performance.
Methodology
Constructs FAQ, a large-scale multiple-choice dataset that evaluates VLMs' temporal deepfake reasoning in a layered fashion, and improves models via instruction tuning on the derived FAQ-IT dataset.
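The hierarchical multiple-choice setup described above can be sketched as a minimal data model. This is a sketch under assumptions: the field names, level labels, `FAQItem` class, and `accuracy` helper are illustrative inventions, not the paper's released data format.

```python
from dataclasses import dataclass

# Hypothetical level labels mirroring the benchmark's three-tier hierarchy;
# the actual identifiers used in the released dataset may differ.
LEVELS = ("facial_perception", "temporal_grounding", "forensic_reasoning")

@dataclass
class FAQItem:
    level: str        # one of LEVELS
    video_id: str     # identifier of the (possibly forged) video clip
    question: str     # forensic question posed to the VLM
    options: list     # candidate answers (multiple choice)
    answer_idx: int   # index of the correct option

    def __post_init__(self):
        if self.level not in LEVELS:
            raise ValueError(f"unknown level: {self.level}")
        if not 0 <= self.answer_idx < len(self.options):
            raise ValueError("answer index out of range")

def accuracy(items, predictions):
    """Per-item accuracy: fraction of items where the predicted
    option index matches the ground-truth answer index."""
    correct = sum(p == it.answer_idx for it, p in zip(items, predictions))
    return correct / len(items)

# Example: a temporal-grounding style question (content is invented).
item = FAQItem(
    level="temporal_grounding",
    video_id="clip_0001",
    question="In which frame range does the facial identity flicker?",
    options=["frames 0-10", "frames 30-45", "no flicker present"],
    answer_idx=1,
)
print(accuracy([item], [1]))  # → 1.0
```

Framing temporal grounding as multiple choice keeps scoring a simple exact-match over option indices, which is what makes large-scale automatic evaluation of VLM outputs tractable.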
Original Abstract
Current Vision-Language Models (VLMs) for deepfake detection excel at identifying spatial artifacts but overlook a critical dimension: temporal inconsistencies in video forgeries. Adapting VLMs to reason about these dynamic cues remains a distinct challenge. To bridge this gap, we propose Forensic Answer-Questioning (FAQ), a large-scale benchmark that formulates temporal deepfake analysis as a multiple-choice task. FAQ introduces a three-level hierarchy to progressively evaluate and equip VLMs with forensic capabilities: (1) Facial Perception, testing the ability to identify static visual artifacts; (2) Temporal Deepfake Grounding, requiring the localization of dynamic forgery artifacts across frames; and (3) Forensic Reasoning, challenging models to synthesize evidence for final authenticity verdicts. We evaluate a range of VLMs on FAQ and generate a corresponding instruction-tuning set, FAQ-IT. Extensive experiments show that models fine-tuned on FAQ-IT achieve advanced performance on both in-domain and cross-dataset detection benchmarks. Ablation studies further validate the impact of our key design choices, confirming that FAQ is the driving force behind the temporal reasoning capabilities of these VLMs.