LLM Reasoning Relevance: 9/10

Ranking Reasoning LLMs under Test-Time Scaling

Mohsen Hariri, Michael Hinczewski, Jing Ma, Vipin Chaudhary
arXiv: 2603.10960v1 Published: 2026-03-11 Updated: 2026-03-11

AI Summary

This paper studies the problem of ranking reasoning LLMs under test-time scaling and introduces the Scorio library.

Key Contributions

  • Formalizes dense benchmark ranking under test-time scaling
  • Introduces Scorio, a library implementing a variety of statistical ranking methods
  • Evaluates the ranking performance of 20 reasoning models on Olympiad-style math benchmarks

Methodology

LLMs are ranked with a range of statistical methods (e.g., paired-comparison models and IRT models), and the resulting orderings are compared against a Bayesian gold standard.
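Agreement between a method's ranking and the gold standard is measured with Kendall's $τ_b$ (as reported in the abstract). The sketch below is a minimal, self-contained illustration of that metric; the model names and rank values are hypothetical, and the paper's Scorio library presumably implements this differently.

```python
import math
from itertools import combinations

def kendall_tau_b(x, y):
    """Kendall's tau-b rank correlation between two rank lists, with tie correction."""
    n = len(x)
    n0 = n * (n - 1) // 2          # total number of pairs
    nc = nd = tx = ty = 0          # concordant, discordant, ties in x, ties in y
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        dx, dy = xi - xj, yi - yj
        if dx == 0:
            tx += 1
        if dy == 0:
            ty += 1
        if dx != 0 and dy != 0:
            if dx * dy > 0:
                nc += 1
            else:
                nd += 1
    return (nc - nd) / math.sqrt((n0 - tx) * (n0 - ty))

# Hypothetical rankings over five models (rank 1 = best):
gold   = [1, 2, 3, 4, 5]
method = [1, 2, 3, 5, 4]  # one adjacent swap near the bottom
print(round(kendall_tau_b(gold, method), 2))  # → 0.8
```

A single swap of the two lowest-ranked models already drops $τ_b$ to 0.8 on five items, which is why mean values of 0.93-0.95 across 20 models indicate near-identical orderings.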

Original Abstract

Test-time scaling evaluates reasoning LLMs by sampling multiple outputs per prompt, but ranking models in this regime remains underexplored. We formalize dense benchmark ranking under test-time scaling and introduce Scorio, a library that implements statistical ranking methods such as paired-comparison models, item response theory (IRT) models, voting rules, and graph- and spectral-based methods. Across $20$ reasoning models on four Olympiad-style math benchmarks (AIME'24, AIME'25, HMMT'25, and BrUMO'25; up to $N=80$ trials), most full-trial rankings agree closely with the Bayesian gold standard $\mathrm{Bayes}_{\mathcal{U}}@80$ (mean Kendall's $τ_b = 0.93$--$0.95$), and $19$--$34$ methods recover exactly the same ordering. In the single-trial regime, the best methods reach $τ_b \approx 0.86$. Using greedy decoding as an empirical prior ($\mathrm{Bayes}_{\mathbf{R}_0}@N$) reduces variance at $N=1$ by $16$--$52\%$, but can bias rankings when greedy and stochastic sampling disagree. These results identify reliable ranking methods for both high- and low-budget test-time scaling. We release Scorio as an open-source library at https://github.com/mohsenhariri/scorio.
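The gold standard $\mathrm{Bayes}_{\mathcal{U}}@80$ aggregates up to 80 stochastic trials per prompt under a uniform prior. As an illustration only (the paper's exact estimator is not specified here), a Beta-Binomial posterior mean with a uniform Beta(1, 1) prior gives one simple way to turn per-model success counts into a ranking; the model names and counts below are invented:

```python
def bayes_mean_accuracy(successes, trials, a=1.0, b=1.0):
    """Posterior mean accuracy under a Beta(a, b) prior (uniform by default)."""
    return (successes + a) / (trials + a + b)

# Hypothetical success counts over N = 80 trials on one benchmark
counts = {"model_a": 64, "model_b": 58, "model_c": 71}
ranking = sorted(counts, key=lambda m: bayes_mean_accuracy(counts[m], 80), reverse=True)
print(ranking)  # → ['model_c', 'model_a', 'model_b']
```

With many trials the prior's pull is negligible; at $N = 1$ it dominates, which is why the paper swaps in greedy decoding as an empirical prior ($\mathrm{Bayes}_{\mathbf{R}_0}@N$) to cut variance in the low-budget regime.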

Tags

LLM Reasoning · Ranking · Test-time Scaling

arXiv Categories

cs.LG math.ST