C2-Faith: Benchmarking LLM Judges for Causal and Coverage Faithfulness in Chain-of-Thought Reasoning
AI Summary
The C2-Faith benchmark evaluates LLM judges of chain-of-thought (CoT) reasoning on causal and coverage faithfulness.
Key Contributions
- Proposes the C2-Faith benchmark for evaluating LLMs as judges of CoT reasoning.
- Reveals that LLM judge performance, and even model rankings, differ across task framings.
- Identifies limitations of LLM judges in localizing errors and in assessing coverage.
Methodology
Controlled perturbations of PRM800K solutions create examples with known causal error positions and known coverage deletions; LLM judges are then evaluated on three tasks: causal error detection, error localization, and coverage scoring.
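The perturbation construction can be illustrated with a minimal sketch. The function names, the list-of-strings representation of a solution, and the `acausal_variants` mapping are assumptions for illustration; the paper's actual construction details may differ.

```python
import random

def make_causal_perturbation(steps, acausal_variants, rng=random.Random(0)):
    """Replace a single step with an acausal variant (one that does not
    follow from the prior context), yielding a known error position.
    `acausal_variants` maps step index -> replacement text (hypothetical)."""
    idx = rng.choice(list(acausal_variants))
    perturbed = list(steps)
    perturbed[idx] = acausal_variants[idx]
    return perturbed, idx  # perturbed chain and gold error position

def make_coverage_deletion(steps, deletion_rate, rng=random.Random(0)):
    """Delete a controlled fraction of intermediate steps, yielding an
    incomplete chain with known deleted positions as reference labels."""
    n_delete = max(1, round(deletion_rate * len(steps)))
    deleted = set(rng.sample(range(len(steps)), n_delete))
    kept = [s for i, s in enumerate(steps) if i not in deleted]
    return kept, sorted(deleted)
```

A judge is then asked to assess the perturbed chain, and its output is scored against the known error position or deletion labels.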
Original Abstract
Large language models (LLMs) are increasingly used as judges of chain-of-thought (CoT) reasoning, but it remains unclear whether they can reliably assess process faithfulness rather than just answer plausibility. We introduce C2-Faith, a benchmark built from PRM800K that targets two complementary dimensions of faithfulness: causality (does each step logically follow from prior context?) and coverage (are essential intermediate inferences present?). Using controlled perturbations, we create examples with known causal error positions by replacing a single step with an acausal variant, and with controlled coverage deletions at varying deletion rates (scored against reference labels). We evaluate three frontier judges under three tasks: binary causal detection, causal step localization, and coverage scoring. The results show that model rankings depend strongly on task framing, with no single judge dominating all settings; all judges exhibit a substantial gap between detecting an error and localizing it; and coverage judgments are systematically inflated for incomplete reasoning. These findings clarify when LLM judges are dependable and where they fail, and provide practical guidance for selecting judges in process-level evaluation.
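The detection-versus-localization gap described in the abstract can be quantified with a simple scorer. This is a minimal sketch under assumed field names (`pred_has_error`, `pred_pos`, `gold_pos`); the paper's exact metrics may differ.

```python
def score_judgments(examples):
    """Score a judge on perturbed examples that each contain one known error.

    Each example is a dict: {'pred_has_error': bool, 'pred_pos': int | None,
    'gold_pos': int}. Returns (detection accuracy, localization accuracy);
    their difference is the detection-localization gap."""
    n = len(examples)
    detected = sum(e["pred_has_error"] for e in examples)
    localized = sum(e["pred_has_error"] and e["pred_pos"] == e["gold_pos"]
                    for e in examples)
    return detected / n, localized / n
```

Since localization requires detection plus the correct position, localization accuracy is bounded above by detection accuracy, which is why the gap is always non-negative.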