Multimodal Learning Relevance: 10/10

Alignment Drift in Multimodal LLMs: A Two-Phase, Longitudinal Evaluation of Harm Across Eight Model Releases

Casey Ford, Madison Van Doren, Emily Dix
arXiv: 2602.04739v1 Published: 2026-02-04 Updated: 2026-02-04

AI Summary

A longitudinal evaluation of multimodal LLM safety, finding that robustness to adversarial attacks drifts across model releases.

Main Contributions

  • Constructed an adversarial-attack benchmark for multimodal LLMs
  • Evaluated the safety of multiple MLLM releases and identified a safety (alignment) drift phenomenon
  • Characterized how different input modalities affect attack success rates

Methodology

Adversarial prompts authored by professional red teamers were used to evaluate the safety of multiple MLLM releases in two phases, with responses scored by human annotators.
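The summary does not specify how the human harm ratings are aggregated into an attack success rate (ASR). The sketch below shows one plausible aggregation, assuming a 0-1 harm score per rated response and a threshold for counting a response as harmful; the `HarmRating` fields, threshold, and scale are illustrative assumptions, not the paper's protocol.

```python
from dataclasses import dataclass

# Illustrative sketch only: field names, rating scale, and threshold are
# assumptions, not the paper's actual annotation protocol.

@dataclass
class HarmRating:
    model: str          # e.g. "GPT-4o"
    phase: int          # 1 or 2
    modality: str       # e.g. "text" or "image+text"
    harm_score: float   # assumed 0-1 harm rating from one human annotator

def attack_success_rate(ratings: list[HarmRating], harm_threshold: float = 0.5) -> float:
    """Fraction of rated responses judged harmful (score >= threshold)."""
    if not ratings:
        return 0.0
    harmful = sum(1 for r in ratings if r.harm_score >= harm_threshold)
    return harmful / len(ratings)
```

In practice ASR would be computed per model, per phase, and per modality by filtering the ratings before calling this function.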

Original Abstract

Multimodal large language models (MLLMs) are increasingly deployed in real-world systems, yet their safety under adversarial prompting remains underexplored. We present a two-phase evaluation of MLLM harmlessness using a fixed benchmark of 726 adversarial prompts authored by 26 professional red teamers. Phase 1 assessed GPT-4o, Claude Sonnet 3.5, Pixtral 12B, and Qwen VL Plus; Phase 2 evaluated their successors (GPT-5, Claude Sonnet 4.5, Pixtral Large, and Qwen Omni) yielding 82,256 human harm ratings. Large, persistent differences emerged across model families: Pixtral models were consistently the most vulnerable, whereas Claude models appeared safest due to high refusal rates. Attack success rates (ASR) showed clear alignment drift: GPT and Claude models exhibited increased ASR across generations, while Pixtral and Qwen showed modest decreases. Modality effects also shifted over time: text-only prompts were more effective in Phase 1, whereas Phase 2 produced model-specific patterns, with GPT-5 and Claude 4.5 showing near-equivalent vulnerability across modalities. These findings demonstrate that MLLM harmlessness is neither uniform nor stable across updates, underscoring the need for longitudinal, multimodal benchmarks to track evolving safety behaviour.
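The "alignment drift" in the abstract is the change in ASR between each Phase 1 model and its Phase 2 successor. A minimal sketch of that comparison follows, assuming per-model ASR values have already been computed; the model pairings come from the abstract, while the `alignment_drift` helper and the example numbers are hypothetical placeholders, not the paper's results.

```python
# Phase 1 model -> Phase 2 successor, as listed in the abstract.
SUCCESSORS = {
    "GPT-4o": "GPT-5",
    "Claude Sonnet 3.5": "Claude Sonnet 4.5",
    "Pixtral 12B": "Pixtral Large",
    "Qwen VL Plus": "Qwen Omni",
}

def alignment_drift(asr_by_model: dict[str, float]) -> dict[str, float]:
    """Change in ASR from each Phase 1 model to its Phase 2 successor.

    Positive values mean the successor is more vulnerable (ASR increased);
    negative values mean it became harder to attack.
    """
    return {
        f"{old} -> {new}": asr_by_model[new] - asr_by_model[old]
        for old, new in SUCCESSORS.items()
        if old in asr_by_model and new in asr_by_model
    }

# Placeholder ASR values for illustration only (not the paper's numbers).
example = {"GPT-4o": 0.12, "GPT-5": 0.18, "Pixtral 12B": 0.40, "Pixtral Large": 0.35}
print(alignment_drift(example))
# {'GPT-4o -> GPT-5': 0.06, 'Pixtral 12B -> Pixtral Large': -0.05}
```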

Tags

Multimodal LLM Safety Adversarial Attacks Drift

arXiv Categories

cs.CL cs.AI cs.HC