LLM Reasoning relevance: 9/10

How do LLMs Compute Verbal Confidence

Dharshan Kumaran, Arthur Conmy, Federico Barbero, Simon Osindero, Viorica Patraucean, Petar Velickovic
arXiv: 2603.17839v1 Published: 2026-03-18 Updated: 2026-03-18

AI Summary

This paper investigates how LLMs compute verbal confidence, revealing an automatic, sophisticated self-evaluation mechanism.

Key Contributions

  • Shows that verbal confidence in LLMs is cached retrieval rather than just-in-time computation
  • Finds that confidence representations emerge at answer-adjacent positions and are cached there
  • Demonstrates that verbal confidence is not merely token log-probabilities but a richer evaluation of answer quality

Methodology

Uses activation steering, patching, noising, swap, and attention-blocking experiments, combined with linear probing and variance partitioning.
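The linear-probing / variance-partitioning step can be illustrated with a toy sketch: fit a probe that predicts verbal confidence from a cached hidden state, then check how much variance it explains beyond a token log-probability baseline. This is not the paper's code; all shapes, variable names, and the synthetic data are illustrative assumptions.

```python
# Toy sketch of linear probing + variance partitioning (illustrative only):
# does a cached activation explain confidence beyond token log-probabilities?
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 64                       # examples, hidden size (toy values)
H = rng.normal(size=(n, d))          # stand-in for cached post-answer activations
logp = rng.normal(size=(n, 1))       # stand-in for mean answer-token log-probability
w = rng.normal(size=(d, 1))
# synthetic "confidence": depends on both the activation and the fluency signal
conf = H @ w + 0.5 * logp + 0.1 * rng.normal(size=(n, 1))

def r2(X, y):
    """In-sample R^2 of an ordinary-least-squares fit of y on X (with intercept)."""
    X1 = np.hstack([X, np.ones((len(X), 1))])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1 - resid.var() / y.var()

r2_logp = r2(logp, conf)                      # log-probability baseline
r2_both = r2(np.hstack([logp, H]), conf)      # baseline + cached activation
print(f"extra variance explained by activations: {r2_both - r2_logp:.2f}")
```

On real data, a large positive gap between the two R^2 values is what would support the paper's claim that the cached representation carries more than a fluency readout.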

Original Abstract

Verbal confidence -- prompting LLMs to state their confidence as a number or category -- is widely used to extract uncertainty estimates from black-box models. However, how LLMs internally generate such scores remains unknown. We address two questions: first, when confidence is computed - just-in-time when requested, or automatically during answer generation and cached for later retrieval; and second, what verbal confidence represents - token log-probabilities, or a richer evaluation of answer quality? Focusing on Gemma 3 27B and Qwen 2.5 7B, we provide convergent evidence for cached retrieval. Activation steering, patching, noising, and swap experiments reveal that confidence representations emerge at answer-adjacent positions before appearing at the verbalization site. Attention blocking pinpoints the information flow: confidence is gathered from answer tokens, cached at the first post-answer position, then retrieved for output. Critically, linear probing and variance partitioning reveal that these cached representations explain substantial variance in verbal confidence beyond token log-probabilities, suggesting a richer answer-quality evaluation rather than a simple fluency readout. These findings demonstrate that verbal confidence reflects automatic, sophisticated self-evaluation -- not post-hoc reconstruction -- with implications for understanding metacognition in LLMs and improving calibration.
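The attention-blocking logic described in the abstract can be sketched with a toy single-head attention layer: cut the edges from the verbalization position back to the answer tokens and see whether that position's output changes. The positions, sizes, and masking scheme below are assumptions for illustration, not the paper's implementation.

```python
# Toy attention-blocking sketch (illustrative only): sever specific
# query->key edges and observe which positions' outputs change.
import numpy as np

def attention(Q, K, V, block=None):
    """Single-head scaled dot-product attention; block lists (query, key) edges to cut."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    if block:
        for q, k in block:
            scores[q, k] = -1e9        # remove this information path before softmax
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ V

rng = np.random.default_rng(1)
T, d = 6, 8                            # toy layout: [.., answer, answer, cache, .., verbalize]
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))
answer_pos, verbal_pos = [1, 2], 5

base = attention(Q, K, V)
# Block the verbalization position from reading the answer tokens directly;
# in the paper's setup, confidence surviving such a cut is evidence it flows
# through the cached post-answer position instead.
cut = attention(Q, K, V, block=[(verbal_pos, p) for p in answer_pos])
print(np.abs(base[verbal_pos] - cut[verbal_pos]).max())
```

Only the blocked query position is affected, which is what makes this kind of intervention useful for pinpointing information flow.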

Tags

LLM, verbal confidence, metacognition, calibration

arXiv Categories

cs.CL cs.AI cs.LG