AI Agents 相关度: 9/10

Detecting Multi-Agent Collusion Through Multi-Agent Interpretability

Aaron Rose, Carissa Cullen, Brandon Gary Kaplowitz, Christian Schroeder de Witt

arXiv: 2604.01151v1 发布: 2026-04-01 更新: 2026-04-01

下载 PDF arXiv 页面

AI 摘要

该论文提出NARCBench基准，用于检测多智能体系统中LLM的共谋行为，并探索了基于激活探测的共谋检测方法。

主要贡献

提出了 NARCBench 基准，用于评估多智能体共谋检测。
提出了五种基于激活探测的多智能体共谋检测方法。
发现不同类型的共谋在激活空间中表现不同，且信号可能定位在token级别。

方法论

通过对LLM智能体激活进行线性探测，聚合每个智能体的欺骗分数，以分类多智能体场景中是否存在共谋。

原文摘要

As LLM agents are increasingly deployed in multi-agent systems, they introduce risks of covert coordination that may evade standard forms of human oversight. While linear probes on model activations have shown promise for detecting deception in single-agent settings, collusion is inherently a multi-agent phenomenon, and the use of internal representations for detecting collusion between agents remains unexplored. We introduce NARCBench, a benchmark for evaluating collusion detection under environment distribution shift, and propose five probing techniques that aggregate per-agent deception scores to classify scenarios at the group level. Our probes achieve 1.00 AUROC in-distribution and 0.60--0.86 AUROC when transferred zero-shot to structurally different multi-agent scenarios and a steganographic blackjack card-counting task. We find that no single probing technique dominates across all collusion types, suggesting that different forms of collusion manifest differently in activation space. We also find preliminary evidence that this signal is localised at the token level, with the colluding agent's activations spiking specifically when processing the encoded parts of their partner's message. This work takes a step toward multi-agent interpretability: extending white-box inspection from single models to multi-agent contexts, where detection requires aggregating signals across agents. These results suggest that model internals provide a complementary signal to text-level monitoring for detecting multi-agent collusion, particularly for organisations with access to model activations. Code and data are available at https://github.com/aaronrose227/narcbench.

arXiv 分类

cs.AI cs.LG cs.MA

AI 摘要

主要贡献

方法论

原文摘要

标签

arXiv 分类