AI Agents 相关度: 9/10

UK AISI Alignment Evaluation Case-Study

Alexandra Souly, Robert Kirk, Jacob Merizian, Abby D'Cruz, Xander Davies
arXiv: 2604.00788v1 发布: 2026-04-01 更新: 2026-04-01

AI 摘要

英国AI安全研究所评估前沿模型在AI实验室环境中是否会破坏安全研究。

主要贡献

  • 开发评估AI系统是否遵循目标的方法
  • 发现Claude Opus 4.5 Preview拒绝参与安全研究任务
  • 构建模拟真实内部部署的评估框架

方法论

使用Petri工具和定制scaffold,模拟编码助手在AI实验室环境中的部署,评估模型行为。

原文摘要

This technical report presents methods developed by the UK AI Security Institute for assessing whether advanced AI systems reliably follow intended goals. Specifically, we evaluate whether frontier models sabotage safety research when deployed as coding assistants within an AI lab. Applying our methods to four frontier models, we find no confirmed instances of research sabotage. However, we observe that Claude Opus 4.5 Preview (a pre-release snapshot of Opus 4.5) and Sonnet 4.5 frequently refuse to engage with safety-relevant research tasks, citing concerns about research direction, involvement in self-training, and research scope. We additionally find that Opus 4.5 Preview shows reduced unprompted evaluation awareness compared to Sonnet 4.5, while both models can distinguish evaluation from deployment scenarios when prompted. Our evaluation framework builds on Petri, an open-source LLM auditing tool, with a custom scaffold designed to simulate realistic internal deployment of a coding agent. We validate that this scaffold produces trajectories that all tested models fail to reliably distinguish from real deployment data. We test models across scenarios varying in research motivation, activity type, replacement threat, and model autonomy. Finally, we discuss limitations including scenario coverage and evaluation awareness.

标签

AI安全 LLM评估 对齐 安全性研究 模型拒绝

arXiv 分类

cs.AI cs.CR