Learning to Stay Safe: Adaptive Regularization Against Safety Degradation during Fine-Tuning
AI Summary
This paper proposes an adaptive regularization framework that prevents the safety of language models from degrading during fine-tuning.
Main Contributions
- Proposes an adaptive regularization framework that preserves model safety during fine-tuning
- Explores two approaches to estimating safety risk: a judge-based Safety Critic and an activation-based risk predictor
- Verifies that the method lowers attack success rate while preserving model utility
Methodology
Safety risk is estimated with either the Safety Critic or the activation-based risk predictor, and the regularization strength is adapted accordingly, constraining high-risk updates while letting low-risk updates proceed with standard training.
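The core idea can be sketched as a risk-weighted penalty on the training loss. The schedule below is a minimal illustration, not the paper's actual formulation: the function names, the linear ramp, and the threshold value `0.5` are all assumptions; the paper only specifies that higher-risk updates are constrained to stay close to a safe reference policy (e.g., via a divergence penalty) while lower-risk updates train normally.

```python
def adaptive_reg_strength(risk_score, lam_max=1.0, threshold=0.5):
    # Hypothetical schedule: no regularization below the risk threshold,
    # linearly increasing strength above it. The threshold and the linear
    # ramp are illustrative choices, not taken from the paper.
    if risk_score <= threshold:
        return 0.0
    return lam_max * (risk_score - threshold) / (1.0 - threshold)


def regularized_loss(task_loss, div_to_safe_ref, risk_score):
    # High-risk batches are pulled toward the safe reference policy via a
    # divergence penalty (e.g., a KL term); low-risk batches keep the
    # plain task loss, so standard fine-tuning is unaffected.
    lam = adaptive_reg_strength(risk_score)
    return task_loss + lam * div_to_safe_ref


# Low-risk batch: the update proceeds as standard fine-tuning.
print(regularized_loss(2.0, div_to_safe_ref=0.8, risk_score=0.2))  # 2.0
# High-risk batch: the divergence penalty constrains the update.
print(regularized_loss(2.0, div_to_safe_ref=0.8, risk_score=1.0))  # 2.8
```

Either risk signal (judge scores or the activation classifier's output) can supply `risk_score`, which is what makes the regularization adaptive rather than a fixed penalty applied to every batch.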
Original Abstract
Instruction-following language models are trained to be helpful and safe, yet their safety behavior can deteriorate under benign fine-tuning and worsen under adversarial updates. Existing defenses often offer limited protection or force a trade-off between safety and utility. We introduce a training framework that adapts regularization in response to safety risk, enabling models to remain aligned throughout fine-tuning. To estimate safety risk at training time, we explore two distinct approaches: a judge-based Safety Critic that assigns high-level harm scores to training batches, and an activation-based risk predictor built with a lightweight classifier trained on intermediate model activations to estimate harmful intent. Each approach provides a risk signal that is used to constrain updates deemed higher risk to remain close to a safe reference policy, while lower-risk updates proceed with standard training. We empirically verify that harmful intent signals are predictable from pre-generation activations and that judge scores provide effective high-recall safety guidance. Across multiple model families and attack scenarios, adaptive regularization with either risk estimation approach consistently lowers attack success rate compared to standard fine-tuning, preserves downstream performance, and adds no inference-time cost. This work demonstrates a principled mechanism for maintaining safety without sacrificing utility.