C2-Faith: Benchmarking LLM Judges for Causal and Coverage Faithfulness in Chain-of-Thought Reasoning
AI Summary
The C2-Faith benchmark evaluates LLM judges of chain-of-thought (CoT) reasoning on causal and coverage faithfulness.
Key Contributions
- Proposes the C2-Faith benchmark for evaluating LLMs as judges of CoT reasoning.
- Reveals that LLM judge performance, and even model rankings, differ across task framings.
- Identifies limitations of LLM judges in localizing errors and in assessing coverage.
Methodology
Controlled perturbations of PRM800K solutions create examples with known causal error positions and known coverage deletions; LLM judges are then evaluated on three tasks: causal error detection, error localization, and coverage scoring.
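The perturbation construction can be illustrated with a minimal sketch. The function names, the list-of-strings representation of a solution, and the `acausal_variants` mapping are assumptions for illustration; the paper's actual construction details may differ.

```python
import random

def make_causal_perturbation(steps, acausal_variants, rng=random.Random(0)):
    """Replace a single step with an acausal variant (one that does not
    follow from the prior context), yielding a known error position.
    `acausal_variants` maps step index -> replacement text (hypothetical)."""
    idx = rng.choice(list(acausal_variants))
    perturbed = list(steps)
    perturbed[idx] = acausal_variants[idx]
    return perturbed, idx  # perturbed chain and gold error position

def make_coverage_deletion(steps, deletion_rate, rng=random.Random(0)):
    """Delete a controlled fraction of intermediate steps, yielding an
    incomplete chain with known deleted positions as reference labels."""
    n_delete = max(1, round(deletion_rate * len(steps)))
    deleted = set(rng.sample(range(len(steps)), n_delete))
    kept = [s for i, s in enumerate(steps) if i not in deleted]
    return kept, sorted(deleted)
```

A judge is then asked to assess the perturbed chain, and its output is scored against the known error position or deletion labels.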
Original Abstract
Large language models (LLMs) are increasingly used as judges of chain-of-thought (CoT) reasoning, but it remains unclear whether they can reliably assess process faithfulness rather than just answer plausibility. We introduce C2-Faith, a benchmark built from PRM800K that targets two complementary dimensions of faithfulness: causality (does each step logically follow from prior context?) and coverage (are essential intermediate inferences present?). Using controlled perturbations, we create examples with known causal error positions by replacing a single step with an acausal variant, and with controlled coverage deletions at varying deletion rates (scored against reference labels). We evaluate three frontier judges under three tasks: binary causal detection, causal step localization, and coverage scoring. The results show that model rankings depend strongly on task framing, with no single judge dominating all settings; all judges exhibit a substantial gap between detecting an error and localizing it; and coverage judgments are systematically inflated for incomplete reasoning. These findings clarify when LLM judges are dependable and where they fail, and provide practical guidance for selecting judges in process-level evaluation.
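The detection-versus-localization gap described in the abstract can be quantified with a simple scorer. This is a minimal sketch under assumed field names (`pred_has_error`, `pred_pos`, `gold_pos`); the paper's exact metrics may differ.

```python
def score_judgments(examples):
    """Score a judge on perturbed examples that each contain one known error.

    Each example is a dict: {'pred_has_error': bool, 'pred_pos': int | None,
    'gold_pos': int}. Returns (detection accuracy, localization accuracy);
    their difference is the detection-localization gap."""
    n = len(examples)
    detected = sum(e["pred_has_error"] for e in examples)
    localized = sum(e["pred_has_error"] and e["pred_pos"] == e["gold_pos"]
                    for e in examples)
    return detected / n, localized / n
```

Since localization requires detection plus the correct position, localization accuracy is bounded above by detection accuracy, which is why the gap is always non-negative.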