Omanic: Towards Step-wise Evaluation of Multi-hop Reasoning in Large Language Models
AI Summary
Omanic: a multi-hop QA dataset for evaluating the intermediate steps of LLM reasoning, comprising both synthetic and human-annotated data.
Main Contributions
- Proposes the Omanic dataset, containing multi-hop QA data with structured annotations
- Systematically evaluates SOTA LLMs on OmanicBench
- Validates the effectiveness of OmanicSynth as supervision for reasoning-capability transfer
Methodology
Constructs a structured multi-hop QA dataset with decomposed sub-questions and intermediate answers, applies human verification, and evaluates LLMs experimentally on it.
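To make the annotation format concrete, here is a minimal sketch of what a decomposed multi-hop example might look like. The field names and the sample question are invented for illustration and do not reflect the released dataset's actual schema:

```python
# Hypothetical illustration of the structural annotation described above:
# a multi-hop question decomposed into sub-questions, each paired with an
# intermediate answer. Field names are assumptions, not the real schema.
example = {
    "question": "In which country was the director of Inception born?",
    "final_answer": "United Kingdom",
    "hops": [
        {"sub_question": "Who directed Inception?",
         "intermediate_answer": "Christopher Nolan"},
        {"sub_question": "In which country was Christopher Nolan born?",
         "intermediate_answer": "United Kingdom"},
    ],
}

# Step-wise evaluation can then score each hop independently,
# rather than checking only the final answer.
for i, hop in enumerate(example["hops"], start=1):
    print(f"Hop {i}: {hop['sub_question']} -> {hop['intermediate_answer']}")
```

Annotations like this let an evaluator localize failures to a specific hop, which is exactly the diagnostic capability the paper argues final-answer-only benchmarks lack.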
Original Abstract
Reasoning-focused large language models (LLMs) have advanced in many NLP tasks, yet their evaluation remains challenging: final answers alone do not expose the intermediate reasoning steps, making it difficult to determine whether a model truly reasons correctly and where failures occur, while existing multi-hop QA benchmarks lack step-level annotations for diagnosing reasoning failures. To address this gap, we propose Omanic, an open-domain multi-hop QA resource that provides decomposed sub-questions and intermediate answers as structural annotations for analyzing reasoning processes. It contains 10,296 machine-generated training examples (OmanicSynth) and 967 expert-reviewed human-annotated evaluation examples (OmanicBench). Systematic evaluations show that state-of-the-art LLMs achieve only 73.11% multiple-choice accuracy on OmanicBench, confirming its high difficulty. Stepwise analysis reveals that CoT's performance hinges on factual completeness, with its gains diminishing under knowledge gaps and errors amplifying in later hops. Additionally, supervised fine-tuning on OmanicSynth brings substantial transfer gains (7.41 average points) across six reasoning and math benchmarks, validating the dataset's quality and further supporting the effectiveness of OmanicSynth as supervision for reasoning-capability transfer. We release the data at https://huggingface.co/datasets/li-lab/Omanic and the code at https://github.com/XiaojieGu/Omanic.