LLM Reasoning relevance: 9/10

Evaluating Chain-of-Thought Reasoning through Reusability and Verifiability

Shashank Aggarwal, Ram Vikas Mishra, Amit Awekar
arXiv: 2602.17544v1 Published: 2026-02-19 Updated: 2026-02-19

AI Summary

The paper proposes two metrics, reusability and verifiability, for evaluating the quality of CoT reasoning, revealing blind spots in existing evaluation methods.

Key Contributions

  • Propose the reusability and verifiability metrics
  • Build a Thinker-Executor framework for CoT evaluation
  • Show that conventional accuracy fails to effectively assess CoT quality

Methodology

A Thinker-Executor framework decouples CoT generation from execution; CoT quality is then assessed with the reusability and verifiability metrics.

Original Abstract

In multi-agent IR pipelines for tasks such as search and ranking, LLM-based agents exchange intermediate reasoning in terms of Chain-of-Thought (CoT) with each other. Current CoT evaluation narrowly focuses on target task accuracy. However, this metric fails to assess the quality or utility of the reasoning process itself. To address this limitation, we introduce two novel measures: reusability and verifiability. We decouple CoT generation from execution using a Thinker-Executor framework. Reusability measures how easily an Executor can reuse the Thinker's CoT. Verifiability measures how frequently an Executor can match the Thinker's answer using the CoT. We evaluated four Thinker models against a committee of ten Executor models across five benchmarks. Our results reveal that reusability and verifiability do not correlate with standard accuracy, exposing a blind spot in current accuracy-based leaderboards for reasoning capability. Surprisingly, we find that CoTs from specialized reasoning models are not consistently more reusable or verifiable than those from general-purpose LLMs like Llama and Gemma.
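Based on the abstract's definition, verifiability is the rate at which a committee of Executors reproduces the Thinker's answer when given its CoT. A minimal sketch of that match-rate computation (function names and the example answers are illustrative, not from the paper):

```python
def verifiability(thinker_answer, executor_answers):
    """Fraction of Executors whose answer, conditioned on the Thinker's
    CoT, matches the Thinker's own answer (per the abstract's definition)."""
    if not executor_answers:
        return 0.0
    matches = sum(1 for a in executor_answers if a == thinker_answer)
    return matches / len(executor_answers)

# Hypothetical example: a committee of ten Executors, seven of which
# reproduce the Thinker's answer from its CoT.
committee = ["42"] * 7 + ["41", "43", "40"]
print(verifiability("42", committee))  # 0.7
```

Reusability would be scored analogously over the same committee, but measures how easily each Executor can reuse the CoT rather than whether the final answers match.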

Tags

Chain-of-Thought, Reasoning Evaluation, Reusability, Verifiability

arXiv Categories

cs.AI cs.CL cs.IR