VALD: Multi-Stage Vision Attack Detection for Efficient LVLM Defense
AI Summary
Proposes an efficient, training-free defense that detects adversarial attacks on LVLMs by combining image transformations with data consolidation.
Main Contributions
- Proposes a multi-stage adversarial attack detection mechanism
- Combines image transformations with agentic data consolidation to recover correct model behavior
- Improves efficiency while preserving accuracy
Methodology
Assesses image consistency under content-preserving transformations, then checks for discrepancies in a text-embedding space, and finally invokes an LLM to consolidate the multiple responses.
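The staged detection logic can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, thresholds, and the toy bag-of-words "embedding" are all assumptions standing in for a real LVLM and a real sentence-embedding model.

```python
# Hypothetical sketch of the multi-stage detection pipeline.
# All names and thresholds are illustrative, not from the paper.
import math
from collections import Counter


def embed(text):
    # Toy stand-in for a sentence embedding: bag-of-words counts.
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    num = sum(a[w] * b[w] for w in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0


def detect(responses, embed_threshold=0.7):
    """Classify an input from the LVLM's responses to the original
    image and its content-preserving transforms (flips, small crops).

    Returns "clean-fast", "clean-embed", or "needs-llm".
    """
    base, others = responses[0], responses[1:]
    # Stage 1: near-free consistency check -- identical responses
    # under content-preserving transforms suggest a clean input.
    if all(r == base for r in others):
        return "clean-fast"
    # Stage 2: compare the responses in a text-embedding space;
    # semantically close paraphrases still count as consistent.
    sims = [cosine(embed(base), embed(r)) for r in others]
    if min(sims) >= embed_threshold:
        return "clean-embed"
    # Stage 3: genuinely divergent responses are escalated to the
    # (costly) LLM consolidation step.
    return "needs-llm"
```

In this sketch, most clean inputs exit at stage 1 or 2, matching the efficiency claim: only divergent cases pay for the LLM call.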
Original Abstract
Large Vision-Language Models (LVLMs) can be vulnerable to adversarial images that subtly bias their outputs toward plausible yet incorrect responses. We introduce a general, efficient, and training-free defense that combines image transformations with agentic data consolidation to recover correct model behavior. A key component of our approach is a two-stage detection mechanism that quickly filters out the majority of clean inputs. We first assess image consistency under content-preserving transformations at negligible computational cost. For more challenging cases, we examine discrepancies in a text-embedding space. Only when necessary do we invoke a powerful LLM to resolve attack-induced divergences. A key idea is to consolidate multiple responses, leveraging both their similarities and their differences. We show that our method achieves state-of-the-art accuracy while maintaining notable efficiency: most clean images skip costly processing, and even in the presence of numerous adversarial examples, the overhead remains minimal.
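The abstract's consolidation idea, leveraging both the similarities and the differences among multiple responses, can be illustrated with a simple prompt-construction helper. This is a hypothetical sketch; the function name and prompt wording are assumptions, not the paper's agentic procedure.

```python
# Illustrative sketch of the consolidation step: format divergent
# responses so an LLM can weigh agreements and disagreements.
# The helper name and prompt text are hypothetical.
def build_consolidation_prompt(question, responses):
    """Build an LLM prompt asking for a single consolidated answer
    from multiple responses to transformed versions of one image."""
    lines = [
        f"Question: {question}",
        "The following answers were produced from content-preserving "
        "transforms of the same image and disagree:",
    ]
    for i, resp in enumerate(responses, 1):
        lines.append(f"Answer {i}: {resp}")
    lines.append(
        "Keep the content the answers agree on, discount details that "
        "appear in only one answer, and give one consolidated answer."
    )
    return "\n".join(lines)
```

The design intuition is that an attack tends to perturb responses inconsistently across transforms, so agreement signals trustworthy content while isolated details are suspect.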