AI Agents · Relevance: 9/10

Dual-Modality Multi-Stage Adversarial Safety Training: Robustifying Multimodal Web Agents Against Cross-Modal Attacks

Haoyu Liu, Dingcheng Li, Lukas Rutishauser, Zeyu Zheng
arXiv: 2603.04364v1 · Published: 2026-03-04 · Updated: 2026-03-04

AI Summary

Targeting the safety vulnerabilities of multimodal web agents, the paper proposes DMAST, a dual-modality multi-stage adversarial safety training framework.

Key Contributions

  • Reveals the safety vulnerabilities of multimodal web agents under cross-modal attacks.
  • Proposes DMAST, a dual-modality multi-stage adversarial safety training framework.
  • DMAST outperforms existing methods in both generalization and safety.

Methodology

DMAST trains robust multimodal web agents through a three-stage pipeline: imitation learning, oracle-guided supervised fine-tuning, and adversarial reinforcement learning.

Original Abstract

Multimodal web agents that process both screenshots and accessibility trees are increasingly deployed to interact with web interfaces, yet their dual-stream architecture opens an underexplored attack surface: an adversary who injects content into the webpage DOM simultaneously corrupts both observation channels with a consistent deceptive narrative. Our vulnerability analysis on MiniWob++ reveals that attacks including a visual component far outperform text-only injections, exposing critical gaps in text-centric VLM safety training. Motivated by this finding, we propose Dual-Modality Multi-Stage Adversarial Safety Training (DMAST), a framework that formalizes the agent-attacker interaction as a two-player zero-sum Markov game and co-trains both players through a three-stage pipeline: (1) imitation learning from a strong teacher model, (2) oracle-guided supervised fine-tuning that uses a novel zero-acknowledgment strategy to instill task-focused reasoning under adversarial noise, and (3) adversarial reinforcement learning via Group Relative Policy Optimization (GRPO) self-play. On out-of-distribution tasks, DMAST substantially mitigates adversarial risks while simultaneously doubling task completion efficiency. Our approach significantly outperforms established training-based and prompt-based defenses, demonstrating genuine co-evolutionary progress and robust generalization to complex, unseen environments.
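The GRPO self-play stage in the abstract can be made concrete with a minimal sketch of the group-relative advantage computation that GRPO uses, applied to the paper's zero-sum agent-attacker framing. All function names, reward values, and shapes here are illustrative assumptions, not the paper's implementation:

```python
# Minimal sketch of GRPO's group-relative advantage (stage 3 of DMAST).
# Assumption: each rollout group shares a prompt; rewards are normalized
# within the group instead of against a learned value baseline.
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-8):
    """A_i = (r_i - mean(group)) / (std(group) + eps)."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Zero-sum self-play framing: the attacker's reward is the negative of
# the agent's, so the same normalization serves both players.
agent_rewards = [1.0, 0.0, 1.0, 1.0]        # e.g. task success under attack
attacker_rewards = [-r for r in agent_rewards]

adv_agent = group_relative_advantages(agent_rewards)
adv_attacker = group_relative_advantages(attacker_rewards)
```

Because the game is zero-sum, each attacker advantage is exactly the negation of the corresponding agent advantage, which is what lets the two policies co-evolve against the same rollout groups.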

Tags

multimodal agent · adversarial training · web agent

arXiv Categories

cs.LG cs.AI cs.CL