Interactionless Inverse Reinforcement Learning: A Data-Centric Framework for Durable Alignment
AI Summary
Proposes an interactionless inverse reinforcement learning framework that decouples alignment learning from policy optimization, producing an inspectable reward model.
Main Contributions
- Decouples alignment from policy optimization
- Introduces interactionless inverse reinforcement learning
- Proposes the Alignment Flywheel framework
Methodology
A reward model is learned via interactionless inverse reinforcement learning, then iteratively refined through the Alignment Flywheel.
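One turn of this lifecycle can be sketched as follows: fit a reward model from a static preference dataset (no agent interaction, hence "interactionless"), then run an automated audit against probe examples before the next refinement round. This is a minimal illustration, not the paper's implementation; the linear reward model, Bradley-Terry-style preference loss, feature layout, and all function names are assumptions introduced here.

```python
import math

def train_reward_model(preferences, dim, epochs=200, lr=0.1):
    """Fit linear reward weights from fixed (chosen, rejected) feature
    pairs using a Bradley-Terry preference loss -- no environment or
    policy rollout is ever queried."""
    w = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in preferences:
            margin = sum(wi * (c - r) for wi, c, r in zip(w, chosen, rejected))
            g = 1.0 - 1.0 / (1.0 + math.exp(-margin))  # grad of -log sigmoid
            for i in range(dim):
                w[i] += lr * g * (chosen[i] - rejected[i])
    return w

def reward(w, feats):
    return sum(wi * f for wi, f in zip(w, feats))

def audit(w, probes):
    """Automated audit: return descriptions of probes whose reward
    disagrees with the expected sign, flagging them for human review."""
    return [desc for feats, expected_sign, desc in probes
            if reward(w, feats) * expected_sign <= 0]

# Toy features: [helpfulness, harmfulness]; chosen beats rejected.
prefs = [([1, 0], [0, 1]), ([1, 0], [1, 1]), ([0, 0], [0, 1])]
w = train_reward_model(prefs, dim=2)
violations = audit(w, [([1, 0], +1, "helpful probe"),
                       ([0, 1], -1, "harmful probe")])
print(violations)  # expect no flagged probes: []
```

Because the learned weights are just an inspectable artifact, a failed audit can be fixed by editing the preference data and retraining, rather than by re-running policy optimization.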
Original Abstract
AI alignment is growing in importance, yet current approaches suffer from a critical structural flaw: they entangle safety objectives with the agent's policy. Methods such as Reinforcement Learning from Human Feedback and Direct Preference Optimization create opaque, single-use alignment artifacts, which we term Alignment Waste. We propose Interactionless Inverse Reinforcement Learning to decouple alignment artifact learning from policy optimization, producing an inspectable, editable, and model-agnostic reward model. Additionally, we introduce the Alignment Flywheel, a human-in-the-loop lifecycle that iteratively hardens the reward model through automated audits and refinement. This architecture transforms safety from a disposable expense into a durable, verifiable engineering asset.