LLM Reasoning Relevance: 8/10

C-ΔΘ: Circuit-Restricted Weight Arithmetic for Selective Refusal

Aditya Kasliwal, Pratinav Seth, Vinay Kumar Sankarapu

arXiv: 2602.04521v1 Published: 2026-02-04 Updated: 2026-02-04


AI Summary

Proposes C-Δθ, an offline weight-update method for selective refusal that requires no inference-time intervention.

Main Contributions

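Proposes the Circuit Restricted Weight Arithmetic (C-Δθ) method
Localizes refusal-related computation as a sparse circuit
Shifts the cost of refusal from per-request intervention to a one-time offline update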

Methodology

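Uses EAP-IG to localize the circuit responsible for refusal computation, then computes a weight update constrained to that circuit, producing an edited model checkpoint.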
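
The two steps can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and every name in it is hypothetical. It assumes the circuit has already been reduced to a per-parameter score array (EAP-IG itself scores edges of the computational graph, and the mapping from edges to parameters is not described in this summary), and it treats the unconstrained update `delta` as given, since the summary does not say how it is computed. The sketch only shows the restriction: the update is applied where the mask is set, and all other weights are left untouched.

```python
import numpy as np

def circuit_mask(scores, keep_fraction=0.05):
    """Boolean mask over the top `keep_fraction` of entries by absolute score.

    `scores` is a stand-in for per-parameter attribution scores (assumption).
    """
    flat = np.abs(scores).ravel()
    k = max(1, int(round(keep_fraction * flat.size)))
    # Indices of the k largest absolute scores (order within the top k is irrelevant).
    top = np.argpartition(flat, -k)[-k:]
    mask = np.zeros(flat.size, dtype=bool)
    mask[top] = True
    return mask.reshape(scores.shape)

def apply_restricted_update(theta, delta, mask, alpha=1.0):
    """Return theta + alpha * delta on masked entries; other entries are unchanged."""
    return theta + alpha * np.where(mask, delta, 0.0)

# Toy data: random arrays stand in for real weights, scores, and updates.
rng = np.random.default_rng(0)
theta = rng.normal(size=(20, 10))
scores = rng.normal(size=theta.shape)
delta = rng.normal(size=theta.shape)

mask = circuit_mask(scores, keep_fraction=0.05)
theta_edited = apply_restricted_update(theta, delta, mask)
```

Because the result is an ordinary weight array of the same shape, it can be saved as a regular checkpoint, which is what makes the edit "drop-in" with no runtime hooks.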

Original Abstract

Modern deployments require LLMs to enforce safety policies at scale, yet many controls rely on inference-time interventions that add recurring compute cost and serving complexity. Activation steering is widely used, but it requires runtime hooks and scales cost with the number of generations; conditional variants improve selectivity by gating when steering is applied but still retain an inference-time control path. We ask whether selective refusal can be moved entirely offline: can a mechanistic understanding of category-specific refusal be distilled into a circuit-restricted weight update that deploys as a standard checkpoint? We propose C-Δθ: Circuit Restricted Weight Arithmetic, which (i) localizes refusal-causal computation as a sparse circuit using EAP-IG and (ii) computes a constrained weight update Δθ_C supported only on that circuit (typically <5% of parameters). Applying Δθ_C yields a drop-in edited checkpoint with no inference-time hooks, shifting cost from per-request intervention to a one-time offline update. We evaluate category-targeted selectivity and capability retention on refusal and utility benchmarks.

arXiv Categories

cs.CL cs.ET
