Multimodal Learning · Relevance: 9/10

Evidence Packing for Cross-Domain Image Deepfake Detection with LVLMs

Yuxin Liu, Fei Wang, Kun Li, Yiqi Nie, Junjie Chen, Zhangling Duan, Zhaohong Jia
arXiv: 2603.17761v1 · Published: 2026-03-18 · Updated: 2026-03-18

AI Summary

Proposes SCEP, an image deepfake detection framework that requires no LVLM fine-tuning and improves detection generalization through evidence-driven reasoning.

Key Contributions

  • Proposes the Semantic Consistent Evidence Pack (SCEP) framework
  • Replaces whole-image inference with evidence-driven reasoning
  • Achieves strong results without fine-tuning the LVLM

Methodology

SCEP extracts key patches via clustering and scoring, assembles them into an evidence pack, and feeds the pack to a frozen LVLM for prediction.
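The mining step described above (score patches against the CLS token, cluster them, take top candidates per cluster) can be sketched roughly as follows. This is a minimal illustration with numpy, not the paper's implementation: the k-means clustering, the fusion weight `alpha`, and the per-cluster sample count are all assumptions for the sake of the example.

```python
import numpy as np

def score_patches(patch_feats, cls_feat, anomaly, alpha=0.5):
    """Fused suspicion score: CLS-guided semantic mismatch blended with a
    precomputed frequency/noise anomaly score (alpha is an assumption)."""
    # Semantic mismatch = 1 - cosine similarity to the global CLS token.
    p = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    c = cls_feat / np.linalg.norm(cls_feat)
    mismatch = 1.0 - p @ c
    return alpha * mismatch + (1.0 - alpha) * anomaly

def evidence_pack(patch_feats, cls_feat, anomaly,
                  n_clusters=4, per_cluster=2, seed=0):
    """Cluster patch features, then keep the top-scoring patches per cluster."""
    rng = np.random.default_rng(seed)
    scores = score_patches(patch_feats, cls_feat, anomaly)
    # Minimal k-means as a stand-in for whatever clustering SCEP uses.
    centers = patch_feats[rng.choice(len(patch_feats), n_clusters, replace=False)]
    for _ in range(10):
        dists = np.linalg.norm(patch_feats[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        for k in range(n_clusters):
            if (labels == k).any():
                centers[k] = patch_feats[labels == k].mean(axis=0)
    pack = []
    for k in range(n_clusters):
        idx = np.flatnonzero(labels == k)
        top = idx[np.argsort(scores[idx])[::-1][:per_cluster]]
        pack.extend(top.tolist())
    return sorted(pack)  # flat indices of evidence patches
```

In the full pipeline, the selected patch indices would be mapped back to image regions and packed into the prompt that conditions the frozen LVLM.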

Original Abstract

Image Deepfake Detection (IDD) separates manipulated images from authentic ones by spotting artifacts of synthesis or tampering. Although large vision-language models (LVLMs) offer strong image understanding, adapting them to IDD often demands costly fine-tuning and generalizes poorly to diverse, evolving manipulations. We propose the Semantic Consistent Evidence Pack (SCEP), a training-free LVLM framework that replaces whole-image inference with evidence-driven reasoning. SCEP mines a compact set of suspicious patch tokens that best reveal manipulation cues. It uses the vision encoder's CLS token as a global reference, clusters patch features into coherent groups, and scores patches with a fused metric combining CLS-guided semantic mismatch with frequency- and noise-based anomalies. To cover dispersed traces and avoid redundancy, SCEP samples a few high-confidence patches per cluster and applies grid-based NMS, producing an evidence pack that conditions a frozen LVLM for prediction. Experiments on diverse benchmarks show SCEP outperforms strong baselines without LVLM fine-tuning.
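The grid-based NMS mentioned in the abstract can be sketched as keeping at most one candidate patch per coarse grid cell, which suppresses redundant neighbors while preserving dispersed traces. A minimal sketch, assuming flat patch indices on a square token grid; the cell size and tie-breaking rule are illustrative assumptions.

```python
import numpy as np

def grid_nms(indices, scores, grid_w, cell=2):
    """Keep at most one patch per `cell` x `cell` block of the patch grid.

    `indices` are flat patch indices on a grid of width `grid_w`;
    within each block, only the highest-scoring patch survives.
    """
    best = {}  # block key -> (patch index, score)
    for i, s in zip(indices, scores):
        key = ((i // grid_w) // cell, (i % grid_w) // cell)
        if key not in best or s > best[key][1]:
            best[key] = (i, s)
    return sorted(i for i, _ in best.values())
```

For example, on a 4-wide grid with 2x2 cells, the patches at flat indices 0, 1, 4, and 5 all share one cell, so only the highest-scoring of the four is kept.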

Tags

Deepfake Detection · Vision-Language Models · Multimodal Learning · Image Manipulation

arXiv Category

cs.CV