AI Agents relevance: 9/10

Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs

Alexander Panfilov, Peter Romov, Igor Shilov, Yves-Alexandre de Montjoye, Jonas Geiping, Maksym Andriushchenko
arXiv: 2603.24511v1 Published: 2026-03-25 Updated: 2026-03-25

AI Summary

Using Claude Code for automated research, the authors discover LLM adversarial attack algorithms that outperform existing methods.

Key Contributions

  • Discovers new white-box adversarial attack algorithms for LLMs
  • Significantly raises attack success rates for jailbreaking and prompt injection against LLMs
  • Demonstrates the feasibility of automating safety and security research with LLM agents

Methodology

An automated research pipeline driven by Claude Code iteratively improves existing attack algorithms; attacks are optimized on surrogate models and then transferred to other (held-out) models.
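The iterative loop described above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the function names (`propose_variant`, `evaluate_asr`) and the greedy accept-if-better strategy are assumptions, standing in for the agent proposing a code change to an attack and a surrogate-model evaluation returning an attack success rate (ASR).

```python
# Hypothetical sketch of an autoresearch-style improvement loop.
# `propose_variant` stands in for the LLM agent editing an attack
# implementation; `evaluate_asr` stands in for running the candidate
# attack against a surrogate model and measuring attack success rate.
def autoresearch_loop(baseline_attack, evaluate_asr, propose_variant, n_iters=10):
    best_attack = baseline_attack
    best_asr = evaluate_asr(baseline_attack)
    history = [(best_attack, best_asr)]  # dense quantitative feedback for the agent
    for _ in range(n_iters):
        candidate = propose_variant(best_attack, history)
        asr = evaluate_asr(candidate)
        history.append((candidate, asr))
        if asr > best_asr:  # keep only strictly better variants
            best_attack, best_asr = candidate, asr
    return best_attack, best_asr
```

The key property the abstract highlights is that the ASR objective gives dense, quantitative feedback at every iteration, which is what makes white-box red-teaming a good fit for this kind of agent-driven search.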

Original Abstract

LLM agents like Claude Code can not only write code but also be used for autonomous AI research and engineering (Rank et al., 2026; Novikov et al., 2025). We show that an autoresearch-style pipeline (Karpathy, 2026) powered by Claude Code discovers novel white-box adversarial attack algorithms that significantly outperform all existing (30+) methods in jailbreaking and prompt injection evaluations. Starting from existing attack implementations, such as GCG (Zou et al., 2023), the agent iterates to produce new algorithms achieving up to 40% attack success rate on CBRN queries against GPT-OSS-Safeguard-20B, compared to ≤10% for existing algorithms (teaser figure, left). The discovered algorithms generalize: attacks optimized on surrogate models transfer directly to held-out models, achieving 100% ASR against Meta-SecAlign-70B (Chen et al., 2025) versus 56% for the best baseline (teaser figure, middle). Extending the findings of Carlini et al. (2025), our results are an early demonstration that incremental safety and security research can be automated using LLM agents. White-box adversarial red-teaming is particularly well-suited for this: existing methods provide strong starting points, and the optimization objective yields dense, quantitative feedback. We release all discovered attacks alongside baseline implementations and evaluation code at https://github.com/romovpa/claudini.

Tags

LLM  Adversarial Attacks  Security  Automated Research  Agents

arXiv Categories

cs.LG cs.AI cs.CR