WebSentinel: Detecting and Localizing Prompt Injection Attacks for Web Agents
AI Summary
WebSentinel uses a two-step approach to detect and localize prompt injection attacks in webpages, outperforming existing methods.
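Main Contributions
- Proposes the WebSentinel detection framework
- Designs a detection method based on consistency checking
- Constructs datasets containing both contaminated and clean webpages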
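Methodology
Extract potentially contaminated segments, then check each segment for consistency against the webpage context to identify injection attacks.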
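
The two steps above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the summary does not say how either step is realized, so the keyword cues used for Step I, the token-overlap score used for Step II, the threshold, and all function names here are hypothetical stand-ins chosen only to show the extract-then-check structure.

```python
import re
from dataclasses import dataclass

# Hypothetical cue phrases for Step I (an assumption, not taken from the paper).
INSTRUCTION_CUES = ("ignore", "instead", "you must", "disregard", "click", "send")
# Tiny stopword list so common words do not inflate the overlap score.
STOPWORDS = {"the", "a", "an", "and", "to", "of", "with", "s"}


@dataclass
class Finding:
    index: int     # position of the flagged segment within the page
    segment: str   # text of the flagged segment
    score: float   # consistency with the rest of the page, in [0, 1]


def split_segments(page_text: str) -> list[str]:
    """Split page text into sentence-like segments."""
    parts = re.split(r"(?<=[.!?])\s+|\n+", page_text)
    return [p.strip() for p in parts if p.strip()]


def content_tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens with stopwords removed."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def extract_segments_of_interest(segments: list[str]) -> list[int]:
    """Step I (toy version): keep indices of segments that look like instructions."""
    return [
        i for i, seg in enumerate(segments)
        if any(cue in seg.lower() for cue in INSTRUCTION_CUES)
    ]


def consistency(segment: str, context: str) -> float:
    """Step II (toy version): share of the segment's tokens also found in the context."""
    seg_tokens = content_tokens(segment)
    if not seg_tokens:
        return 1.0
    return len(seg_tokens & content_tokens(context)) / len(seg_tokens)


def detect_and_localize(page_text: str, threshold: float = 0.25) -> list[Finding]:
    """Flag candidate segments whose consistency with the rest of the page is low."""
    segments = split_segments(page_text)
    findings = []
    for i in extract_segments_of_interest(segments):
        # The context is every other segment on the page.
        context = " ".join(s for j, s in enumerate(segments) if j != i)
        score = consistency(segments[i], context)
        if score < threshold:
            findings.append(Finding(i, segments[i], score))
    return findings
```

An empty result corresponds to a page judged clean (detection), while each returned `Finding` gives the position and text of a suspected injected segment (localization). A realistic system would presumably replace both toy heuristics with far stronger components; the overlap score is used here only because it is easy to inspect.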
Original Abstract
Prompt injection attacks manipulate webpage content to cause web agents to execute attacker-specified tasks instead of the user's intended ones. Existing methods for detecting and localizing such attacks achieve limited effectiveness, as their underlying assumptions often do not hold in the web-agent setting. In this work, we propose WebSentinel, a two-step approach for detecting and localizing prompt injection attacks in webpages. Given a webpage, Step I extracts segments of interest that may be contaminated, and Step II evaluates each segment by checking its consistency with the webpage content as context. We show that WebSentinel is highly effective, substantially outperforming baseline methods across multiple datasets of both contaminated and clean webpages that we collected. Our code is available at: https://github.com/wxl-lxw/WebSentinel.