AI Agents · Relevance: 9/10

Behavioral Steering in a 35B MoE Language Model via SAE-Decoded Probe Vectors: One Agency Axis, Not Five Traits

Jia Qing Yap
arXiv: 2603.16335v1 · Published: 2026-03-17 · Updated: 2026-03-17

AI Summary

Uses probe vectors decoded through sparse autoencoders to steer the behavior of a 35B MoE language model, finding that the steered traits are governed primarily by a single agency axis.

Key Contributions

  • Proposes a behavioral steering method based on probe vectors decoded through a sparse autoencoder (SAE)
  • Finds that all five behavioral traits are primarily controlled by a single agency axis
  • Shows that in the GatedDeltaNet architecture, behavioral commitments are computed during the prefill phase

Methodology

SAEs are trained on the model's residual stream; linear probes are then trained on the SAE latent activations, and the probe weights are projected back through the SAE decoder into the original activation space, enabling fine-grained behavioral intervention with no fine-tuning.
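A minimal NumPy sketch of the probe-to-steering-vector projection described above. The dimensions, `W_dec`, and `probe_w` here are illustrative placeholders standing in for trained weights, not the paper's actual parameters:

```python
import numpy as np

# Hypothetical sizes for illustration; the real model's residual stream
# and SAE dictionary are far wider.
d_model, d_sae = 64, 512
rng = np.random.default_rng(0)

# Stand-ins for trained weights: the SAE decoder maps latents back to the
# residual stream, and the linear probe scores SAE latent activations.
W_dec = rng.standard_normal((d_sae, d_model)).astype(np.float32)
probe_w = rng.standard_normal(d_sae).astype(np.float32)

def steering_vector(probe_w: np.ndarray, W_dec: np.ndarray) -> np.ndarray:
    """Project probe weights through the SAE decoder into the model's
    native activation space (bypassing top-k discretization), then
    normalize to unit length."""
    v = probe_w @ W_dec  # (d_sae,) @ (d_sae, d_model) -> (d_model,)
    return v / np.linalg.norm(v)

v = steering_vector(probe_w, W_dec)

# At inference time, steering adds a scaled copy of v to the residual
# stream; the paper's "multiplier 2" condition corresponds to 2.0 here.
resid = rng.standard_normal(d_model).astype(np.float32)
steered = resid + 2.0 * v
```

Because the projection is a plain matrix-vector product, the resulting steering vector is continuous in activation space even though the SAE's forward pass is top-k sparse.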

Original Abstract

We train nine sparse autoencoders (SAEs) on the residual stream of Qwen 3.5-35B-A3B, a 35-billion-parameter Mixture-of-Experts model with a hybrid GatedDeltaNet/attention architecture, and use them to identify and steer five agentic behavioral traits. Our method trains linear probes on SAE latent activations, then projects the probe weights back through the SAE decoder to obtain continuous steering vectors in the model's native activation space. This bypasses the SAE's top-k discretization, enabling fine-grained behavioral intervention at inference time with no retraining. Across 1,800 agent rollouts (50 scenarios × 36 conditions), we find that autonomy steering at multiplier 2 achieves Cohen's d = 1.01 (p < 0.0001), shifting the model from asking the user for help 78% of the time to proactively executing code and searching the web. Cross-trait analysis, however, reveals that all five steering vectors primarily modulate a single dominant agency axis (the disposition to act independently versus defer to the user), with trait-specific effects appearing only as secondary modulations in tool-type composition and dose-response shape. The tool-use vector steers behavior (d = 0.39); the risk-calibration vector produces only suppression. We additionally show that steering only during autoregressive decoding has zero effect (p > 0.35), providing causal evidence that behavioral commitments are computed during prefill in GatedDeltaNet architectures.
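The prefill-versus-decode contrast at the end of the abstract can be illustrated with a toy intervention hook. `apply_steering`, the phase flag, and all sizes below are hypothetical, sketching how one might gate the intervention on which forward-pass phase is running:

```python
import numpy as np

def apply_steering(resid: np.ndarray, v: np.ndarray, multiplier: float,
                   is_prefill: bool, steer_phase: str) -> np.ndarray:
    """Add the steering vector only during the chosen phase.

    resid: (seq_len, d_model) residual-stream activations for this pass.
    steer_phase: "prefill" steers the prompt pass (effective per the
    paper); "decode" steers only single-token decode steps (which the
    paper finds has zero effect, p > 0.35).
    """
    if (steer_phase == "prefill") == is_prefill:
        return resid + multiplier * v  # broadcast over the sequence axis
    return resid

# Toy demonstration with a unit-norm steering vector.
d_model = 8
v = np.ones(d_model, dtype=np.float32) / np.sqrt(d_model)
prompt_acts = np.zeros((4, d_model), dtype=np.float32)

# Steering configured for prefill fires on the prompt pass...
steered_prefill = apply_steering(prompt_acts, v, 2.0,
                                 is_prefill=True, steer_phase="prefill")
# ...while decode-only steering leaves the prompt pass untouched.
unsteered = apply_steering(prompt_acts, v, 2.0,
                           is_prefill=True, steer_phase="decode")
```

In a real deployment this gating would live inside a forward hook at the chosen layer; the point of the toy is only that the same vector and multiplier produce an intervention or a no-op depending on the phase it is applied in.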

Tags

Behavioral Steering · Sparse Autoencoders · Large Language Models · MoE · AI Agent

arXiv Categories

cs.LG cs.CL