LLM Reasoning relevance: 9/10

How Uncertainty Estimation Scales with Sampling in Reasoning Models

Maksym Del, Markus Kängsepp, Marharyta Domnich, Ardi Tampuu, Lisa Yankovskaya, Meelis Kull, Mark Fishel
arXiv: 2603.19118v1 Published: 2026-03-19 Updated: 2026-03-19

AI Summary

Studies how parallel sampling affects uncertainty estimation in reasoning language models, finding that a hybrid combination of signals performs best.

Main Contributions

  • Analyzed how self-consistency and verbalized confidence perform as uncertainty-estimation signals in reasoning models
  • Showed that combining the two signals into a hybrid estimator improves uncertainty-estimation quality
  • Found that uncertainty-estimation quality is domain-dependent, with mathematics performing best

Methodology

Uses parallel sampling to obtain two fully black-box signals, self-consistency and verbalized confidence, and evaluates uncertainty-estimation quality across multiple reasoning models and tasks.
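To make the two signals concrete, here is a minimal sketch of how each can be computed from a set of parallel samples. This is an illustration under stated assumptions, not the paper's implementation: the toy answers and confidence values are made up, and how confidences are prompted for and parsed from model output is left unspecified.

```python
from collections import Counter

def self_consistency(answers: list[str]) -> float:
    """Fraction of parallel samples agreeing with the majority answer;
    higher means the model is more certain of that answer."""
    _, count = Counter(answers).most_common(1)[0]
    return count / len(answers)

def verbalized_confidence(confidences: list[float]) -> float:
    """Mean of the model's self-reported confidences (parsed to [0, 1]),
    one value per parallel sample."""
    return sum(confidences) / len(confidences)

# Toy usage with made-up samples from 4 parallel chains of thought:
answers = ["42", "42", "41", "42"]
confidences = [0.9, 0.85, 0.6, 0.95]
print(self_consistency(answers))           # 0.75
print(verbalized_confidence(confidences))  # 0.825
```

Both estimators are fully black-box: they need only the sampled answer strings and the model's own stated confidences, not logits or internal states.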

Original Abstract

Uncertainty estimation is critical for deploying reasoning language models, yet remains poorly understood under extended chain-of-thought reasoning. We study parallel sampling as a fully black-box approach using verbalized confidence and self-consistency. Across three reasoning models and 17 tasks spanning mathematics, STEM, and humanities, we characterize how these signals scale. Both self-consistency and verbalized confidence scale in reasoning models, but self-consistency exhibits lower initial discrimination and lags behind verbalized confidence under moderate sampling. Most uncertainty gains, however, arise from signal combination: with just two samples, a hybrid estimator improves AUROC by up to $+12$ on average and already outperforms either signal alone even when scaled to much larger budgets, after which returns diminish. These effects are domain-dependent: in mathematics, the native domain of RLVR-style post-training, reasoning models achieve higher uncertainty quality and exhibit both stronger complementarity and faster scaling than in STEM or humanities.
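The abstract attributes most of the gains to combining the two signals, reported as AUROC improvement. The exact hybrid estimator is not given here, so the sketch below uses a simple convex combination as one plausible instantiation; the weight alpha and the toy data are assumptions, and the AUROC evaluation uses scikit-learn.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def hybrid_confidence(sc: np.ndarray, vc: np.ndarray,
                      alpha: float = 0.5) -> np.ndarray:
    """Convex combination of self-consistency (sc) and verbalized
    confidence (vc), both in [0, 1]. The combination rule and the
    weight alpha are assumptions; the paper's estimator may differ."""
    return alpha * sc + (1.0 - alpha) * vc

# Toy per-question data: 1 if the majority answer was correct, else 0,
# alongside the two uncertainty signals for each question.
correct = np.array([1, 0, 1, 1, 0, 1])
sc = np.array([1.0, 0.5, 1.0, 0.75, 0.5, 1.0])   # self-consistency
vc = np.array([0.9, 0.6, 0.8, 0.85, 0.4, 0.95])  # verbalized confidence

# AUROC measures how well a score separates correct from incorrect
# answers; this is the metric the abstract reports gains on.
for name, score in [("self-consistency", sc),
                    ("verbalized", vc),
                    ("hybrid", hybrid_confidence(sc, vc))]:
    print(f"{name}: AUROC = {roc_auc_score(correct, score):.3f}")
```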

Tags

Uncertainty Estimation · Reasoning Models · Chain-of-Thought · Self-Consistency · Verbalized Confidence

arXiv Categories

cs.AI cs.CL cs.LG