Differential Harm Propensity in Personalized LLM Agents: The Curious Case of Mental Health Disclosure
AI 摘要
研究表明,用户心理健康信息披露对LLM智能体的安全性有微弱的保护作用,但易受攻击。
主要贡献
- 评估了用户心理健康披露对LLM智能体有害行为的影响。
- 发现个性化信息可以作为智能体滥用场景中的弱保护因素。
- 揭示了现有安全机制在对抗性压力下的脆弱性,强调了个性化安全评估的需求。
方法论
使用AgentHarm基准测试,评估了不同LLM在不同用户背景下执行恶意任务的能力,并加入了轻量级越狱提示。
原文摘要
Large language models (LLMs) are increasingly deployed as tool-using agents, shifting safety concerns from harmful text generation to harmful task completion. Deployed systems often condition on user profiles or persistent memory, yet agent safety evaluations typically ignore personalization signals. To address this gap, we investigated how mental health disclosure, a sensitive and realistic user-context cue, affects harmful behavior in agentic settings. Building on the AgentHarm benchmark, we evaluated frontier and open-source LLMs on multi-step malicious tasks (and their benign counterparts) under controlled prompt conditions that vary user-context personalization (no bio, bio-only, bio+mental health disclosure) and include a lightweight jailbreak injection. Our results reveal that harmful task completion is non-trivial across models: frontier lab models (e.g., GPT 5.2, Claude Sonnet 4.5, Gemini 3-Pro) still complete a measurable fraction of harmful tasks, while an open model (DeepSeek 3.2) exhibits substantially higher harmful completion. Adding a bio-only context generally reduces harm scores and increases refusals. Adding an explicit mental health disclosure often shifts outcomes further in the same direction, though effects are modest and not uniformly reliable after multiple-testing correction. Importantly, the refusal increase also appears on benign tasks, indicating a safety--utility trade-off via over-refusal. Finally, jailbreak prompting sharply elevates harm relative to benign conditions and can weaken or override the protective shift induced by personalization. Taken together, our results indicate that personalization can act as a weak protective factor in agentic misuse settings, but it is fragile under minimal adversarial pressure, highlighting the need for personalization-aware evaluations and safeguards that remain robust across user-context conditions.