Automated Framework to Evaluate and Harden LLM System Instructions against Encoding Attacks
AI 摘要
该论文提出了一种评估和强化LLM系统指令,以抵抗编码攻击的自动化框架。
主要贡献
- 提出了评估系统指令泄露风险的自动化框架
- 发现编码攻击对系统指令的高成功率
- 提出了一种基于思维链的系统指令重塑的缓解策略
方法论
自动化测试框架,通过编码和结构化输出任务重新构建提取请求,评估系统指令的保密性,并用思维链重塑指令。
原文摘要
System Instructions in Large Language Models (LLMs) are commonly used to enforce safety policies, define agent behavior, and protect sensitive operational context in agentic AI applications. These instructions may contain sensitive information such as API credentials, internal policies, and privileged workflow definitions, making system instruction leakage a critical security risk highlighted in the OWASP Top 10 for LLM Applications. Without incurring the overhead costs of reasoning models, many LLM applications rely on refusal-based instructions that block direct requests for system instructions, implicitly assuming that prohibited information can only be extracted through explicit queries. We introduce an automated evaluation framework that tests whether system instructions remain confidential when extraction requests are re-framed as encoding or structured output tasks. Across four common models and 46 verified system instructions, we observe high attack success rates (> 0.7) for structured serialization where models refuse direct extraction requests but disclose protected content in the requested serialization formats. We further demonstrate a mitigation strategy based on one-shot instruction reshaping using a Chain-of-Thought reasoning model, indicating that even subtle changes in wording and structure of system instructions can significantly reduce attack success rate without requiring model retraining.