LLM Reasoning relevance: 8/10

When Single Answer Is Not Enough: Rethinking Single-Step Retrosynthesis Benchmarks for LLMs

Bogdan Zagribelnyy, Ivan Ilin, Maksim Kuznetsov, Nikita Bondarev, Roman Schutski, Thomas MacDougall, Rim Shayakhmetov, Zulfat Miftakhutdinov, Mikolaj Mizera, Vladimir Aladinskiy, Alex Aliper, Alex Zhavoronkov
arXiv: 2602.03554v1 Published: 2026-02-03 Updated: 2026-02-03

AI Summary

The paper proposes a new benchmarking framework for single-step retrosynthesis and evaluates LLM performance with ChemCensor, a chemical-plausibility metric.

Key Contributions

  • Proposed a new retrosynthesis benchmarking framework
  • Introduced ChemCensor, a chemical-plausibility metric
  • Built CREED, a large-scale dataset for LLM training

Methodology

ChemCensor is used to assess the chemical plausibility of retrosynthesis reactions generated by LLMs, and the CREED dataset is used to train an LLM.
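The core shift the paper describes is from single-ground-truth Top-K accuracy to plausibility-based scoring. The sketch below contrasts the two. It is illustrative only: ChemCensor is not a public API, so `plausibility_score` here is a hypothetical stand-in for a chemical-plausibility scorer, and the tuple-of-SMILES representation of precursor sets is an assumption.

```python
def top_k_exact_match(predictions, ground_truth, k):
    """Conventional metric: a hit only if the single published precursor set
    appears among the top-k predictions (canonical-SMILES equality assumed)."""
    return ground_truth in predictions[:k]

def plausibility_score(precursors):
    # Hypothetical stand-in for a ChemCensor-style scorer in [0, 1].
    # Here it trivially rewards non-empty precursor sets; a real scorer
    # would assess reaction feasibility.
    return 1.0 if precursors else 0.0

def top_k_plausible(predictions, k, threshold=0.5):
    """Plausibility-based metric: a hit if ANY of the top-k predictions is
    chemically plausible, not just the one published route."""
    return any(plausibility_score(p) >= threshold for p in predictions[:k])

preds = [("CCO", "CC(=O)O"), ("CCBr",), ()]
print(top_k_exact_match(preds, ("CCCl",), k=3))  # False: published route absent
print(top_k_plausible(preds, k=3))               # True: a plausible alternative counts
```

Under exact match, a valid alternative route scores zero; under a plausibility threshold it counts as a hit, which is the paper's argument for ChemCensor-style evaluation aligning better with how chemists actually plan syntheses.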

Original Abstract

Recent progress has expanded the use of large language models (LLMs) in drug discovery, including synthesis planning. However, objective evaluation of retrosynthesis performance remains limited. Existing benchmarks and metrics typically rely on published synthetic procedures and Top-K accuracy based on single ground-truth, which does not capture the open-ended nature of real-world synthesis planning. We propose a new benchmarking framework for single-step retrosynthesis that evaluates both general-purpose and chemistry-specialized LLMs using ChemCensor, a novel metric for chemical plausibility. By emphasizing plausibility over exact match, this approach better aligns with human synthesis planning practices. We also introduce CREED, a novel dataset comprising millions of ChemCensor-validated reaction records for LLM training, and use it to train a model that improves over the LLM baselines under this benchmark.

Tags

LLM Retrosynthesis Drug Discovery Benchmarking Cheminformatics

arXiv Categories

cs.LG cs.AI cs.CE cs.CL