ODESteer: A Unified ODE-Based Steering Framework for LLM Alignment
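AI Summary
Proposes ODESteer, a new framework for LLM alignment based on ordinary differential equations (ODEs), which improves alignment performance.
Main Contributions
- Establishes an ODE-based theoretical framework for activation steering in LLM alignment.
- Shows that identifying an activation steering direction is equivalent to designing a barrier function from control theory.
- Proposes ODESteer, an ODE steering method guided by barrier functions, and validates its effectiveness on multiple benchmarks.
Methodology
Treats activation steering as the solution of an ODE and uses barrier function design to guide the steering direction, which enables multi-step, adaptive steering.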
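The idea of multi-step steering along an ODE can be illustrated with a small toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the positive and negative activations are synthetic, each class is modeled as a diagonal Gaussian, the ODE vector field is taken to be the gradient of the log-density ratio, and the helper names (`fit_diag_gaussian`, `barrier`, `barrier_grad`, `ode_steer`) are hypothetical. The integrator is plain explicit Euler.

```python
import numpy as np

def fit_diag_gaussian(x):
    """Per-dimension mean and variance of activations x with shape (n, d)."""
    return x.mean(axis=0), x.var(axis=0) + 1e-6

def barrier(a, pos_stats, neg_stats):
    """h(a) = log p_pos(a) - log p_neg(a) under the diagonal-Gaussian assumption."""
    (mu_p, var_p), (mu_n, var_n) = pos_stats, neg_stats
    log_p = -0.5 * np.sum((a - mu_p) ** 2 / var_p + np.log(var_p))
    log_n = -0.5 * np.sum((a - mu_n) ** 2 / var_n + np.log(var_n))
    return log_p - log_n  # the shared 2*pi normalization terms cancel

def barrier_grad(a, pos_stats, neg_stats):
    """Gradient of h with respect to the activation a."""
    (mu_p, var_p), (mu_n, var_n) = pos_stats, neg_stats
    return -(a - mu_p) / var_p + (a - mu_n) / var_n

def ode_steer(a0, pos_stats, neg_stats, total_time=1.0, num_steps=10):
    """Integrate da/dt = grad h(a) with explicit Euler steps.

    num_steps=1 reduces to a single additive update a0 + T * grad h(a0).
    """
    a = a0.copy()
    dt = total_time / num_steps
    for _ in range(num_steps):
        # The direction is re-evaluated at the current activation each step.
        a = a + dt * barrier_grad(a, pos_stats, neg_stats)
    return a

# Synthetic stand-ins for positive and negative activations (assumption).
rng = np.random.default_rng(0)
pos = rng.normal(1.0, 0.5, size=(2000, 4))
neg = rng.normal(-1.0, 1.0, size=(2000, 4))
pos_stats, neg_stats = fit_diag_gaussian(pos), fit_diag_gaussian(neg)

a0 = neg.mean(axis=0)  # start from a typical "negative" activation
one_step = ode_steer(a0, pos_stats, neg_stats, num_steps=1)
multi_step = ode_steer(a0, pos_stats, neg_stats, num_steps=10)

print("h(a0)         =", barrier(a0, pos_stats, neg_stats))
print("h(one step)   =", barrier(one_step, pos_stats, neg_stats))
print("h(multi step) =", barrier(multi_step, pos_stats, neg_stats))
```

With these particular toy parameters, the single large step overshoots the high-density region of the positive class, while the ten smaller steps keep re-evaluating the direction and end at a higher log-density ratio. This only illustrates the general intuition; the paper's actual barrier function estimator and ODE solver may differ.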
Original Abstract
Activation steering, or representation engineering, offers a lightweight approach to align large language models (LLMs) by manipulating their internal activations at inference time. However, current methods suffer from two key limitations: (i) the lack of a unified theoretical framework for guiding the design of steering directions, and (ii) an over-reliance on one-step steering that fails to capture complex patterns of activation distributions. In this work, we propose a unified ordinary differential equations (ODEs)-based theoretical framework for activation steering in LLM alignment. We show that conventional activation addition can be interpreted as a first-order approximation to the solution of an ODE. Based on this ODE perspective, identifying a steering direction becomes equivalent to designing a barrier function from control theory. Derived from this framework, we introduce ODESteer, a kind of ODE-based steering guided by barrier functions, which shows empirical advancement in LLM alignment. ODESteer identifies steering directions by defining the barrier function as the log-density ratio between positive and negative activations, and employs it to construct an ODE for multi-step and adaptive steering. Compared to state-of-the-art activation steering methods, ODESteer achieves consistent empirical improvements on diverse LLM alignment benchmarks, including a notable 5.7% improvement on TruthfulQA, 2.5% on UltraFeedback, and 2.4% on RealToxicityPrompts. Our work establishes a principled new view of activation steering in LLM alignment by unifying its theoretical foundations via ODEs, and validating it empirically through the proposed ODESteer method.
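The claim that activation addition is a first-order approximation to an ODE solution can be written out as a short sketch. The notation is assumed for illustration and is not taken from the paper: a_0 is the original activation, v is a vector field, T is the integration time, alpha is the steering strength, and p_+ and p_- are the densities of positive and negative activations. Taking the vector field to be the gradient of the barrier function is one natural choice, assumed here.

```latex
% ODE view: the activation a(t) evolves along a state-dependent vector field v.
\frac{\mathrm{d}a(t)}{\mathrm{d}t} = v\bigl(a(t)\bigr), \qquad a(0) = a_0 .

% First-order Taylor expansion in T (a single explicit Euler step):
a(T) = a_0 + T\, v(a_0) + O(T^2) .

% Activation addition with a fixed direction v and strength \alpha has the same form:
a' = a_0 + \alpha\, v .

% Barrier function from the abstract (log-density ratio), with an assumed vector field:
h(a) = \log \frac{p_{+}(a)}{p_{-}(a)}, \qquad v(a) = \nabla_a h(a) .
```

Dropping the O(T^2) term and freezing v at its initial value recovers one-step activation addition with alpha = T. Integrating the ODE over several smaller steps instead lets the direction change as the activation moves.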