LLM Reasoning relevance: 9/10

How Uncertainty Estimation Scales with Sampling in Reasoning Models

Maksym Del, Markus Kängsepp, Marharyta Domnich, Ardi Tampuu, Lisa Yankovskaya, Meelis Kull, Mark Fishel
arXiv: 2603.19118v1 Published: 2026-03-19 Updated: 2026-03-19

AI Summary

Studies how parallel sampling affects uncertainty estimation in reasoning language models, finding that a hybrid combination of signals performs best.

Main Contributions

  • Analyzed how self-consistency and verbalized confidence perform as uncertainty-estimation signals in reasoning models
  • Showed that combining the two signals into a hybrid estimator improves uncertainty-estimation quality
  • Found that uncertainty-estimation quality is domain-dependent, with mathematics performing best

Methodology

Uses parallel sampling to obtain two fully black-box signals, self-consistency and verbalized confidence, and evaluates uncertainty-estimation quality across multiple reasoning models and tasks.
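To make the two signals concrete, here is a minimal sketch of how each can be computed from a set of parallel samples. This is an illustration under stated assumptions, not the paper's implementation: the toy answers and confidence values are made up, and how confidences are prompted for and parsed from model output is left unspecified.

```python
from collections import Counter

def self_consistency(answers: list[str]) -> float:
    """Fraction of parallel samples agreeing with the majority answer;
    higher means the model is more certain of that answer."""
    _, count = Counter(answers).most_common(1)[0]
    return count / len(answers)

def verbalized_confidence(confidences: list[float]) -> float:
    """Mean of the model's self-reported confidences (parsed to [0, 1]),
    one value per parallel sample."""
    return sum(confidences) / len(confidences)

# Toy usage with made-up samples from 4 parallel chains of thought:
answers = ["42", "42", "41", "42"]
confidences = [0.9, 0.85, 0.6, 0.95]
print(self_consistency(answers))           # 0.75
print(verbalized_confidence(confidences))  # 0.825
```

Both estimators are fully black-box: they need only the sampled answer strings and the model's own stated confidences, not logits or internal states.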

Original Abstract

Uncertainty estimation is critical for deploying reasoning language models, yet remains poorly understood under extended chain-of-thought reasoning. We study parallel sampling as a fully black-box approach using verbalized confidence and self-consistency. Across three reasoning models and 17 tasks spanning mathematics, STEM, and humanities, we characterize how these signals scale. Both self-consistency and verbalized confidence scale in reasoning models, but self-consistency exhibits lower initial discrimination and lags behind verbalized confidence under moderate sampling. Most uncertainty gains, however, arise from signal combination: with just two samples, a hybrid estimator improves AUROC by up to $+12$ on average and already outperforms either signal alone even when scaled to much larger budgets, after which returns diminish. These effects are domain-dependent: in mathematics, the native domain of RLVR-style post-training, reasoning models achieve higher uncertainty quality and exhibit both stronger complementarity and faster scaling than in STEM or humanities.
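The abstract attributes most of the gains to combining the two signals, reported as AUROC improvement. The exact hybrid estimator is not given here, so the sketch below uses a simple convex combination as one plausible instantiation; the weight alpha and the toy data are assumptions, and the AUROC evaluation uses scikit-learn.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def hybrid_confidence(sc: np.ndarray, vc: np.ndarray,
                      alpha: float = 0.5) -> np.ndarray:
    """Convex combination of self-consistency (sc) and verbalized
    confidence (vc), both in [0, 1]. The combination rule and the
    weight alpha are assumptions; the paper's estimator may differ."""
    return alpha * sc + (1.0 - alpha) * vc

# Toy per-question data: 1 if the majority answer was correct, else 0,
# alongside the two uncertainty signals for each question.
correct = np.array([1, 0, 1, 1, 0, 1])
sc = np.array([1.0, 0.5, 1.0, 0.75, 0.5, 1.0])   # self-consistency
vc = np.array([0.9, 0.6, 0.8, 0.85, 0.4, 0.95])  # verbalized confidence

# AUROC measures how well a score separates correct from incorrect
# answers; this is the metric the abstract reports gains on.
for name, score in [("self-consistency", sc),
                    ("verbalized", vc),
                    ("hybrid", hybrid_confidence(sc, vc))]:
    print(f"{name}: AUROC = {roc_auc_score(correct, score):.3f}")
```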

Tags

Uncertainty Estimation · Reasoning Models · Chain-of-Thought · Self-Consistency · Verbalized Confidence

arXiv Categories

cs.AI cs.CL cs.LG