SAVeS: Steering Safety Judgments in Vision-Language Models via Semantic Cues
AI Summary
Studies how semantic cues influence the safety judgments of vision-language models (VLMs), revealing their fragility.
Key Contributions
- Proposes a semantic steering framework for controlling the safety behavior of VLMs
- Constructs SAVeS, a benchmark for evaluating situational safety
- Reveals the sensitivity of VLM safety decisions to semantic cues
Methodology
Studies how semantic cues affect VLM safety judgments by applying textual, visual, and cognitive interventions without changing the underlying scene content.
Original Abstract
Vision-language models (VLMs) are increasingly deployed in real-world and embodied settings where safety decisions depend on visual context. However, it remains unclear which visual evidence drives these judgments. We study whether multimodal safety behavior in VLMs can be steered by simple semantic cues. We introduce a semantic steering framework that applies controlled textual, visual, and cognitive interventions without changing the underlying scene content. To evaluate these effects, we propose SAVeS, a benchmark for situational safety under semantic cues, together with an evaluation protocol that separates behavioral refusal, grounded safety reasoning, and false refusals. Experiments across multiple VLMs and an additional state-of-the-art benchmark show that safety decisions are highly sensitive to semantic cues, indicating reliance on learned visual-linguistic associations rather than grounded visual understanding. We further demonstrate that automated steering pipelines can exploit these mechanisms, highlighting a potential vulnerability in multimodal safety systems.