Multimodal Learning Relevance: 9/10

SAVe: Self-Supervised Audio-visual Deepfake Detection Exploiting Visual Artifacts and Audio-visual Misalignment

Sahibzada Adil Shahzad, Ammarah Hashmi, Junichi Yamagishi, Yusuke Yasuda, Yu Tsao, Chia-Wen Lin, Yan-Tsung Peng, Hsin-Min Wang
arXiv: 2603.25140v1 Published: 2026-03-26 Updated: 2026-03-26

AI Summary

SAVe is a self-supervised audio-visual deepfake detection framework that exploits visual artifacts and audio-visual misalignment.

Key Contributions

  • Proposes a self-supervised audio-visual deepfake detection framework
  • Uses identity-preserving, region-aware self-blended pseudo-forgeries to emulate tampering artifacts
  • Detects lip-sync errors through an audio-visual alignment component

Methodology

Trained with self-supervision on authentic videos only: pseudo-forgeries are generated on the fly to teach the model visual tampering artifacts, while an audio-visual synchronization model detects cross-modal misalignment.
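The self-blending idea can be illustrated with a minimal sketch: an authentic frame is blended with a slightly transformed copy of itself inside a facial region, so tampering artifacts appear locally while identity is preserved (source and target are the same person). The function name, the choice of color jitter as the transform, and the mask convention are all illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np

def self_blend(frame, region_mask, rng=None):
    """Create a pseudo-fake frame by blending the frame with a
    color-jittered copy of itself inside a facial region mask.

    frame: float array (H, W, 3) with values in [0, 1].
    region_mask: float array (H, W) with values in [0, 1];
        1 marks the "manipulated" facial region.
    Hypothetical sketch: the real method may use other transforms
    (warps, blur, compression) and multiple facial granularities.
    """
    rng = rng or np.random.default_rng()
    # Mild per-channel color jitter stands in for the distribution
    # shift a face generator would introduce (assumed transform).
    jittered = np.clip(frame * rng.uniform(0.9, 1.1, size=3)
                       + rng.uniform(-0.05, 0.05, size=3), 0.0, 1.0)
    mask = region_mask[..., None]  # broadcast mask over channels
    # Blend: jittered pixels inside the region, authentic outside.
    return mask * jittered + (1.0 - mask) * frame
```

Because the blend only touches masked pixels, a detector trained to separate such frames from their originals must attend to local blending boundaries rather than generator-specific fingerprints.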

Original Abstract

Multimodal deepfakes can exhibit subtle visual artifacts and cross-modal inconsistencies, which remain challenging to detect, especially when detectors are trained primarily on curated synthetic forgeries. Such synthetic dependence can introduce dataset and generator bias, limiting scalability and robustness to unseen manipulations. We propose SAVe, a self-supervised audio-visual deepfake detection framework that learns entirely on authentic videos. SAVe generates on-the-fly, identity-preserving, region-aware self-blended pseudo-manipulations to emulate tampering artifacts, enabling the model to learn complementary visual cues across multiple facial granularities. To capture cross-modal evidence, SAVe also models lip-speech synchronization via an audio-visual alignment component that detects temporal misalignment patterns characteristic of audio-visual forgeries. Experiments on FakeAVCeleb and AV-LipSync-TIMIT demonstrate competitive in-domain performance and strong cross-dataset generalization, highlighting self-supervised learning as a scalable paradigm for multimodal deepfake detection.
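The lip-speech synchronization component described above can be approximated by a SyncNet-style check: score cosine similarity between per-frame audio and lip embeddings at several temporal offsets, where genuine video should peak near offset zero and forged audio-visual pairs show flat or shifted responses. The function below is a simplified sketch under that assumption, not SAVe's actual alignment model.

```python
import numpy as np

def sync_score(audio_emb, visual_emb, max_offset=5):
    """Mean cosine similarity between L2-normalized audio and lip
    embeddings at temporal offsets in [-max_offset, max_offset].

    audio_emb, visual_emb: float arrays of shape (T, D).
    Returns dict: offset -> mean cosine similarity. A peak far from
    offset 0 (or no clear peak) suggests audio-visual misalignment.
    """
    def normalize(x):
        return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)
    a, v = normalize(audio_emb), normalize(visual_emb)
    T = a.shape[0]
    scores = {}
    for off in range(-max_offset, max_offset + 1):
        if off >= 0:
            # audio frame t+off compared against visual frame t
            sim = (a[off:] * v[:T - off]).sum(axis=1)
        else:
            # audio frame t compared against visual frame t-off
            sim = (a[:T + off] * v[-off:]).sum(axis=1)
        scores[off] = float(sim.mean())
    return scores
```

In practice the offset profile (not just the peak) can feed a classifier, since forgeries characteristically flatten or displace the similarity curve.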

Tags

deepfake detection, self-supervised learning, audio-visual, multimodal learning

arXiv Categories

cs.CV cs.AI cs.LG cs.MM cs.SD