ForensicZip: More Tokens are Better but Not Necessary in Forensic Vision-Language Models
AI Summary
ForensicZip accelerates multimodal forensic models through forgery-driven token compression while preserving detection performance.
Main Contributions
- Proposes ForensicZip, a training-free framework for token compression in forensic vision-language models.
- Models temporal token evolution as a Birth-Death Optimal Transport problem to identify forgery traces.
- Integrates high-frequency priors to separate forensic evidence from semantic content, preserving performance under large compression ratios.
Methodology
Temporal token evolution is modeled as a Birth-Death Optimal Transport problem with a slack dummy node, and the resulting transport scores are combined with high-frequency priors to perform forgery-driven token compression.
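To illustrate the birth-death matching step, the sketch below scores each frame's tokens by how cheaply they can be transported from the previous frame, routing unmatched tokens to dummy (slack) columns. This is a minimal approximation rather than the paper's implementation: it assumes cosine distance on normalised embeddings and substitutes a Hungarian assignment for a full optimal-transport solve; the names `bdot_novelty` and `slack_cost` are illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def bdot_novelty(prev_tokens, curr_tokens, slack_cost=0.5):
    """Per-token transport novelty between consecutive frames.

    prev_tokens: (M, D) frame-t token embeddings, L2-normalised.
    curr_tokens: (N, D) frame-(t+1) token embeddings, L2-normalised.
    Returns an (N,) array of transport costs in [0, slack_cost]; a token
    with no cheap predecessor is routed to a slack column (a "birth")
    and receives the maximal score.
    """
    # Cosine distance between every current token and every previous token.
    cost = 1.0 - curr_tokens @ prev_tokens.T            # (N, M)
    n = curr_tokens.shape[0]
    # One dummy (slack) column per current token, so any number of
    # tokens may be "born" rather than matched to a predecessor.
    augmented = np.concatenate([cost, np.full((n, n), slack_cost)], axis=1)
    rows, cols = linear_sum_assignment(augmented)       # Hungarian matching
    return augmented[rows, cols]
```

Tokens that land on a slack column carry the maximal score, flagging them as transient discontinuities of the kind the paper associates with generative artifacts.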
Original Abstract
Multimodal Large Language Models (MLLMs) enable interpretable multimedia forensics by generating textual rationales for forgery detection. However, processing dense visual sequences incurs high computational costs, particularly for high-resolution images and videos. Visual token pruning is a practical acceleration strategy, yet existing methods are largely semantic-driven, retaining salient objects while discarding background regions where manipulation traces such as high-frequency anomalies and temporal jitters often reside. To address this issue, we introduce ForensicZip, a training-free framework that reformulates token compression from a forgery-driven perspective. ForensicZip models temporal token evolution as a Birth-Death Optimal Transport problem with a slack dummy node, quantifying physical discontinuities indicating transient generative artifacts. The forensic scoring further integrates transport-based novelty with high-frequency priors to separate forensic evidence from semantic content under large-ratio compression. Experiments on deepfake and AIGC benchmarks show that at 10% token retention, ForensicZip achieves 2.97× speedup and over 90% FLOPs reduction while maintaining state-of-the-art detection performance.
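To make the scoring and pruning steps concrete, the sketch below pairs the transport novelty above with a simple FFT-based high-frequency energy prior and keeps the top 10% of tokens. The radial `cutoff`, fusion weight `alpha`, and function names are assumptions for illustration; the paper's exact scoring function may differ.

```python
import numpy as np

def high_freq_prior(patches, cutoff=0.25):
    """High-frequency energy per image patch (one patch per visual token).

    patches: (N, P, P) grayscale patches. Returns (N,) energies in [0, 1].
    """
    spec = np.fft.fftshift(np.fft.fft2(patches), axes=(-2, -1))
    freqs = np.fft.fftshift(np.fft.fftfreq(patches.shape[-1]))
    fy, fx = np.meshgrid(freqs, freqs, indexing="ij")
    # Keep only radial frequencies above `cutoff` of Nyquist (0.5 cyc/sample).
    mask = np.sqrt(fx ** 2 + fy ** 2) > cutoff * 0.5
    energy = (np.abs(spec) ** 2 * mask).sum(axis=(-2, -1))
    return energy / (energy.max() + 1e-8)

def forensic_prune(novelty, hf_prior, keep_ratio=0.10, alpha=0.5):
    """Fuse the two cues and return indices of the tokens to keep."""
    score = alpha * novelty + (1.0 - alpha) * hf_prior
    k = max(1, int(keep_ratio * len(score)))
    return np.argsort(score)[-k:]

# Usage: keep = forensic_prune(bdot_novelty(prev, curr), high_freq_prior(patches))
```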