AI Agents · Relevance: 9/10

Dual-Modality Multi-Stage Adversarial Safety Training: Robustifying Multimodal Web Agents Against Cross-Modal Attacks

Haoyu Liu, Dingcheng Li, Lukas Rutishauser, Zeyu Zheng
arXiv: 2603.04364v1 · Published: 2026-03-04 · Updated: 2026-03-04

AI Summary

Targeting the safety vulnerabilities of multimodal web agents, the paper proposes DMAST, a dual-modality multi-stage adversarial safety training framework.

Key Contributions

  • Reveals the safety vulnerabilities of multimodal web agents under cross-modal attacks.
  • Proposes DMAST, a dual-modality multi-stage adversarial safety training framework.
  • DMAST outperforms existing methods in both generalization and safety.

Methodology

DMAST trains robust multimodal web agents through a three-stage pipeline: imitation learning, oracle-guided supervised fine-tuning, and adversarial reinforcement learning.

Original Abstract

Multimodal web agents that process both screenshots and accessibility trees are increasingly deployed to interact with web interfaces, yet their dual-stream architecture opens an underexplored attack surface: an adversary who injects content into the webpage DOM simultaneously corrupts both observation channels with a consistent deceptive narrative. Our vulnerability analysis on MiniWob++ reveals that attacks including a visual component far outperform text-only injections, exposing critical gaps in text-centric VLM safety training. Motivated by this finding, we propose Dual-Modality Multi-Stage Adversarial Safety Training (DMAST), a framework that formalizes the agent-attacker interaction as a two-player zero-sum Markov game and co-trains both players through a three-stage pipeline: (1) imitation learning from a strong teacher model, (2) oracle-guided supervised fine-tuning that uses a novel zero-acknowledgment strategy to instill task-focused reasoning under adversarial noise, and (3) adversarial reinforcement learning via Group Relative Policy Optimization (GRPO) self-play. On out-of-distribution tasks, DMAST substantially mitigates adversarial risks while simultaneously doubling task completion efficiency. Our approach significantly outperforms established training-based and prompt-based defenses, demonstrating genuine co-evolutionary progress and robust generalization to complex, unseen environments.
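The GRPO self-play stage in the abstract can be made concrete with a minimal sketch of the group-relative advantage computation that GRPO uses, applied to the paper's zero-sum agent-attacker framing. All function names, reward values, and shapes here are illustrative assumptions, not the paper's implementation:

```python
# Minimal sketch of GRPO's group-relative advantage (stage 3 of DMAST).
# Assumption: each rollout group shares a prompt; rewards are normalized
# within the group instead of against a learned value baseline.
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-8):
    """A_i = (r_i - mean(group)) / (std(group) + eps)."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Zero-sum self-play framing: the attacker's reward is the negative of
# the agent's, so the same normalization serves both players.
agent_rewards = [1.0, 0.0, 1.0, 1.0]        # e.g. task success under attack
attacker_rewards = [-r for r in agent_rewards]

adv_agent = group_relative_advantages(agent_rewards)
adv_attacker = group_relative_advantages(attacker_rewards)
```

Because the game is zero-sum, each attacker advantage is exactly the negation of the corresponding agent advantage, which is what lets the two policies co-evolve against the same rollout groups.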

Tags

multimodal agent · adversarial training · web agent

arXiv Categories

cs.LG cs.AI cs.CL