LLM Reasoning Relevance: 8/10

ODESteer: A Unified ODE-Based Steering Framework for LLM Alignment

Hongjue Zhao, Haosen Sun, Jiangtao Kong, Xiaochang Li, Qineng Wang, Liwei Jiang, Qi Zhu, Tarek Abdelzaher, Yejin Choi, Manling Li, Huajie Shao
arXiv: 2602.17560v1 Published: 2026-02-19 Updated: 2026-02-19

AI Summary

Proposes ODESteer, a new framework for LLM alignment based on ordinary differential equations (ODEs), which improves alignment performance.

Key Contributions

  • Establishes an ODE-based theoretical framework for activation steering in LLM alignment.
  • Shows that identifying an activation-steering direction is equivalent to designing a barrier function from control theory.
  • Proposes ODESteer, a barrier-function-guided ODE steering method, and validates its effectiveness on multiple benchmarks.

Methodology

Treats activation steering as the solution of an ODE and uses barrier-function design to guide the steering direction, enabling multi-step, adaptive steering.
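The idea can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: it assumes (my simplification) that positive and negative activations are Gaussian with diagonal covariance, so the barrier function — the log-density ratio between the two — has a closed-form gradient, and multi-step steering becomes Euler integration of the resulting ODE:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

# Hypothetical "positive" and "negative" activation samples.
pos = rng.normal(loc=1.0, scale=1.0, size=(500, d))
neg = rng.normal(loc=-1.0, scale=1.0, size=(500, d))

# Fit diagonal Gaussians to each set (a simplifying assumption).
mu_p, mu_n = pos.mean(axis=0), neg.mean(axis=0)
var_p = pos.var(axis=0) + 1e-6
var_n = neg.var(axis=0) + 1e-6

def grad_barrier(h):
    # Gradient of B(h) = log p_pos(h) - log p_neg(h) under the
    # diagonal-Gaussian assumption.
    return (mu_p - h) / var_p - (mu_n - h) / var_n

def steer(h, n_steps=10, dt=0.1):
    # Multi-step Euler integration of dh/dt = grad B(h).
    # The direction is re-evaluated at every step, so it adapts
    # to the current activation, unlike one-step addition.
    for _ in range(n_steps):
        h = h + dt * grad_barrier(h)
    return h

h0 = rng.normal(size=d)  # an activation to steer
h1 = steer(h0)
```

Setting `n_steps=1` with a fixed, precomputed direction would recover conventional one-step activation addition, which is the contrast the paper draws.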

Original Abstract

Activation steering, or representation engineering, offers a lightweight approach to align large language models (LLMs) by manipulating their internal activations at inference time. However, current methods suffer from two key limitations: (i) the lack of a unified theoretical framework for guiding the design of steering directions, and (ii) an over-reliance on one-step steering that fails to capture complex patterns of activation distributions. In this work, we propose a unified ordinary differential equations (ODEs)-based theoretical framework for activation steering in LLM alignment. We show that conventional activation addition can be interpreted as a first-order approximation to the solution of an ODE. Based on this ODE perspective, identifying a steering direction becomes equivalent to designing a barrier function from control theory. Derived from this framework, we introduce ODESteer, an ODE-based steering method guided by barrier functions, which shows empirical advancement in LLM alignment. ODESteer identifies steering directions by defining the barrier function as the log-density ratio between positive and negative activations, and employs it to construct an ODE for multi-step and adaptive steering. Compared to state-of-the-art activation steering methods, ODESteer achieves consistent empirical improvements on diverse LLM alignment benchmarks, including a notable 5.7% improvement on TruthfulQA, 2.5% on UltraFeedback, and 2.4% on RealToxicityPrompts. Our work establishes a principled new view of activation steering in LLM alignment by unifying its theoretical foundations via ODEs, and validating it empirically through the proposed ODESteer method.
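The abstract's claim that activation addition is a first-order approximation to an ODE solution is the standard Euler-step reading; a minimal derivation (notation mine, not taken from the paper):

```latex
\frac{dh}{dt} = f\big(h(t)\big), \quad h(0) = h_0
\;\Longrightarrow\;
h(\alpha) = h_0 + \alpha\, f(h_0) + O(\alpha^2).
```

With a constant drift f(h) ≡ v (a fixed steering vector), one Euler step of size α recovers activation addition h' = h₀ + αv. Multi-step steering instead iterates h_{k+1} = h_k + Δt · f(h_k), re-evaluating the drift at each step so the direction adapts to the current activation.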

Tags

LLM Alignment, Activation Steering, ODE, Control Theory

arXiv Categories

cs.AI