AI Agents relevance: 8/10

SE-Bench: Benchmarking Self-Evolution with Knowledge Internalization

Jiarui Yuan, Tailin Jin, Weize Chen, Zeyuan Liu, Zhiyuan Liu, Maosong Sun
arXiv: 2602.04811v1 Published: 2026-02-04 Updated: 2026-02-04

AI Summary

SE-Bench provides a diagnostic benchmark for evaluating whether models can internalize new knowledge, a core capability for self-evolution.

Key Contributions

  • Proposes the SE-Bench diagnostic environment for evaluating knowledge internalization.
  • Reveals the Open-Book Paradox, the RL Gap, and the role of Self-Play in knowledge internalization.
  • Provides a rigorous platform for evaluating self-evolution and knowledge internalization.

Methodology

Constructs an obfuscated version of the NumPy library, trains agents to internalize that library, and evaluates their coding ability without access to its documentation.
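The obfuscation step can be sketched as a whole-word rename of the library's public identifiers under a seeded random mapping. This is an illustrative sketch only, not the authors' pipeline: the function names, alias length, and regex-based rewrite below are assumptions, while the benchmark itself presumably rewrites the full package and its API doc consistently.

```python
import random
import re
import string

def make_alias(rng: random.Random, length: int = 8) -> str:
    """Generate a random lowercase identifier, e.g. 'qzvkmtra'."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))

def obfuscate(source: str, names: list[str], seed: int = 0) -> tuple[str, dict]:
    """Replace each identifier in `names` with a random alias.

    Whole-word regex replacement (\\b anchors) ensures that substrings
    of other identifiers are left untouched. Returns the rewritten text
    and the name -> alias mapping.
    """
    rng = random.Random(seed)
    mapping = {name: make_alias(rng) for name in names}
    for name, alias in mapping.items():
        source = re.sub(rf"\b{re.escape(name)}\b", alias, source)
    return source, mapping

# Hypothetical doc snippet, rewritten into the pseudo-novel vocabulary.
doc = "numpy.matmul(a, b) computes the matrix product; see also numpy.dot."
new_doc, mapping = obfuscate(doc, ["numpy", "matmul", "dot"], seed=42)
```

Because the mapping is seeded, the same pseudo-novel vocabulary can be applied consistently to both the package source and its documentation.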

Original Abstract

True self-evolution requires agents to act as lifelong learners that internalize novel experiences to solve future problems. However, rigorously measuring this foundational capability is hindered by two obstacles: the entanglement of prior knowledge, where "new" knowledge may appear in pre-training data, and the entanglement of reasoning complexity, where failures may stem from problem difficulty rather than an inability to recall learned knowledge. We introduce SE-Bench, a diagnostic environment that obfuscates the NumPy library and its API doc into a pseudo-novel package with randomized identifiers. Agents are trained to internalize this package and evaluated on simple coding tasks without access to documentation, yielding a clean setting where tasks are trivial with the new API doc but impossible for base models without it. Our investigation reveals three insights: (1) the Open-Book Paradox, where training with reference documentation inhibits retention, requiring "Closed-Book Training" to force knowledge compression into weights; (2) the RL Gap, where standard RL fails to internalize new knowledge completely due to PPO clipping and negative gradients; and (3) the viability of Self-Play for internalization, proving models can learn from self-generated, noisy tasks when coupled with SFT, but not RL. Overall, SE-Bench establishes a rigorous diagnostic platform for self-evolution with knowledge internalization. Our code and dataset can be found at https://github.com/thunlp/SE-Bench.
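The "RL Gap" insight rests on standard PPO machinery: the clipped surrogate objective caps how far a single update can push a token's probability. A minimal numeric sketch of that textbook objective (not the paper's training code; `eps = 0.2` is the conventional default, assumed here) shows why a base model that assigns near-zero probability to a novel API name receives no further gradient once the importance ratio exceeds the clip range:

```python
def ppo_clipped_objective(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Standard PPO clipped surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A).

    `ratio` is pi_new(token) / pi_old(token) for the sampled token;
    `advantage` is its estimated advantage.
    """
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# A novel identifier sampled against the odds has a huge importance ratio,
# but the objective is flat beyond 1 + eps, so the gradient with respect
# to the policy is zero there: the update toward the new token is capped.
flat_region = [ppo_clipped_objective(r, advantage=1.0) for r in (2.0, 5.0, 50.0)]
```

All three values equal (1 + eps) * A, so pushing the ratio higher yields no additional objective and hence no gradient signal; this is the generic PPO behavior the abstract points to, not a reproduction of the paper's analysis.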

Tags

self-evolution, knowledge internalization, benchmarking, reinforcement learning

arXiv Categories

cs.CL cs.AI cs.LG