Adversarial Prompt Injection Attack on Multimodal Large Language Models
AI Summary
This work studies imperceptible visual prompt injection attacks against multimodal large language models (MLLMs), improving both the effectiveness and the stealthiness of the attack.
Main Contributions
- Proposes a visual prompt injection attack method based on adversarial prompts.
- Designs an adaptive scheme for embedding the malicious prompt into the input image via a bounded text overlay, which provides semantic guidance.
- Iteratively optimizes an imperceptible visual perturbation to align the feature representation of the attacked image with those of the malicious visual and textual targets.
Methodology
The malicious prompt is embedded into the image via a bounded text overlay, and an imperceptible visual perturbation is iteratively optimized to align the feature representation of the attacked image with the malicious visual and textual targets; the text-rendered target image is progressively refined during optimization to better capture the desired semantics and improve transferability.
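The core optimization step above — bounding an imperceptible perturbation while pulling the attacked image's features toward a target representation — can be illustrated with a minimal PGD-style sketch. This is not the paper's implementation: it assumes a white-box linear surrogate feature extractor `w` (the paper attacks closed-source MLLMs via transferable features), and all names (`pgd_feature_align`, `f_target`) are hypothetical.

```python
import numpy as np

def pgd_feature_align(x, w, f_target, eps=8 / 255, alpha=1 / 255, steps=100):
    """Optimize an L-infinity-bounded perturbation delta so that the
    surrogate features w @ (x + delta) move toward f_target."""
    delta = np.zeros_like(x)
    for _ in range(steps):
        residual = w @ (x + delta) - f_target      # feature misalignment
        grad = 2.0 * w.T @ residual                # analytic gradient of squared error
        delta = delta - alpha * np.sign(grad)      # signed gradient descent step
        delta = np.clip(delta, -eps, eps)          # keep perturbation imperceptible
        delta = np.clip(x + delta, 0.0, 1.0) - x   # keep pixels in valid range
    return delta

# toy example: a 64-dim "image", a 16-dim surrogate feature space
rng = np.random.default_rng(42)
x = rng.uniform(0.3, 0.7, size=64)
w = rng.standard_normal((16, 64)) / 8.0
f_target = w @ rng.uniform(0.3, 0.7, size=64)      # features of a hypothetical target image

loss_before = float(np.sum((w @ x - f_target) ** 2))
delta = pgd_feature_align(x, w, f_target)
loss_after = float(np.sum((w @ (x + delta) - f_target) ** 2))
```

In the paper's actual method the target features come from both a text-rendered malicious image and the malicious text, matched at coarse and fine granularity, and the bounded text overlay supplies additional semantic guidance; the sketch only shows the bounded-perturbation feature-alignment loop.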
Original Abstract
Although multimodal large language models (MLLMs) are increasingly deployed in real-world applications, their instruction-following behavior leaves them vulnerable to prompt injection attacks. Existing prompt injection methods predominantly rely on textual prompts or perceptible visual prompts that are observable by human users. In this work, we study imperceptible visual prompt injection against powerful closed-source MLLMs, where adversarial instructions are embedded in the visual modality. Our method adaptively embeds the malicious prompt into the input image via a bounded text overlay to provide semantic guidance. Meanwhile, the imperceptible visual perturbation is iteratively optimized to align the feature representation of the attacked image with those of the malicious visual and textual targets at both coarse- and fine-grained levels. Specifically, the visual target is instantiated as a text-rendered image and progressively refined during optimization to more faithfully represent the desired semantics and improve transferability. Extensive experiments on two multimodal understanding tasks across multiple closed-source MLLMs demonstrate the superior performance of our approach compared to existing methods.