Agent Tuning & Optimization | Relevance: 8/10

Writer-R1: Enhancing Generative Writing in LLMs via Memory-augmented Replay Policy Optimization

Jihao Zhao, Shuaishuai Zu, Zhiyuan Ji, Chunlai Zhou, Biao Qin
arXiv: 2603.15061v1 | Published: 2026-03-16 | Updated: 2026-03-16

AI Summary

Proposes the MRPO algorithm, which uses automatically constructed fine-grained criteria and memory augmentation to iteratively optimize creative writing in LLMs.

Key Contributions

  • Designs a multi-agent collaborative writing workflow based on Grounded Theory
  • Proposes the Memory-augmented Replay Policy Optimization (MRPO) algorithm
  • Shows that automatically constructed evaluation criteria match the effectiveness of human annotation

Methodology

A multi-agent collaborative workflow dynamically generates evaluation criteria; these criteria are combined with supervised fine-tuning and reinforcement learning, and the MRPO algorithm converts them into reward signals for end-to-end optimization.
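
As a rough illustration of the criteria-to-reward step, the sketch below scores a draft against each fine-grained criterion with an LLM judge and aggregates the scores into one scalar usable by an RL objective. The names `Criterion`, `judge`, and `criteria_reward` are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch: turning fine-grained criteria into a scalar reward.
# Criterion, judge, and criteria_reward are hypothetical names, not the
# paper's API; the judge is any callable returning a score in [0, 1].
from dataclasses import dataclass
from typing import Callable

@dataclass
class Criterion:
    name: str         # e.g. "narrative coherence"
    description: str  # interpretable rubric text produced by the workflow
    weight: float     # relative importance within the rubric

def criteria_reward(response: str, criteria: list[Criterion],
                    judge: Callable[[str], float]) -> float:
    """Score a draft against each criterion, then aggregate the weighted
    scores into one scalar reward for the RL training stage."""
    total = sum(c.weight for c in criteria)
    score = 0.0
    for c in criteria:
        prompt = (f"Criterion: {c.name}\n{c.description}\n\n"
                  f"Text:\n{response}\n\nScore from 0 to 1:")
        score += c.weight * judge(prompt)
    return score / total
```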

Original Abstract

As a typical open-ended generation task, creative writing lacks verifiable reference answers, which has long constrained reward modeling and automatic evaluation due to high human annotation costs, evaluative bias, and coarse feedback signals. To address these challenges, this paper first designs a multi-agent collaborative workflow based on Grounded Theory, performing dimensional decomposition and hierarchical induction of the problem to dynamically produce interpretable and reusable fine-grained criteria. Furthermore, we propose the Memory-augmented Replay Policy Optimization (MRPO) algorithm: on the one hand, without additional training, MRPO guides models to engage in self-reflection based on dynamic criteria, enabling controlled iterative improvement; on the other hand, we adopt a training paradigm that combines supervised fine-tuning with reinforcement learning to convert evaluation criteria into reward signals, achieving end-to-end optimization. Experimental results demonstrate that the automatically constructed criteria achieve performance gains comparable to human annotations. Writer-R1-4B models trained with this approach outperform baselines across multiple creative writing tasks and surpass some 100B+ parameter open-source models.
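
The training-free side of MRPO described above (self-reflection on dynamic criteria with controlled iteration) might look roughly like the loop below. The `generate` and `critique` callables and the memory format are assumptions for illustration; the paper's actual replay mechanism may differ.

```python
# Hypothetical sketch of training-free iterative improvement: reflect on
# each draft against the criteria and replay earlier (draft, critique)
# pairs so the next attempt can avoid past failures.
from typing import Callable

def reflective_writing(task: str, criteria: list[str],
                       generate: Callable[[str], str],
                       critique: Callable[[str, list[str]], str],
                       max_rounds: int = 3) -> str:
    memory: list[tuple[str, str]] = []  # (draft, critique) replay buffer
    draft = generate(task)
    for _ in range(max_rounds):
        feedback = critique(draft, criteria)
        memory.append((draft, feedback))
        history = "\n\n".join(f"Draft:\n{d}\nCritique:\n{c}"
                              for d, c in memory)
        draft = generate(f"{task}\n\nCriteria:\n" + "\n".join(criteria) +
                         f"\n\nPrevious attempts and critiques:\n{history}"
                         "\n\nWrite an improved draft that addresses the critiques.")
    return draft
```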

Tags

Reinforcement Learning · Creative Writing · LLM · Evaluation Criteria · Multi-Agent

arXiv Category

cs.CL